High depth, whole-genome sequencing of cholera isolates from Haiti and the Dominican Republic.

Sealfon R, Gire S, Ellis C, Calderwood S, Qadri F, Hensley L, Kellis M, Ryan ET, LaRocque RC, Harris JB, Sabeti PC - BMC Genomics (2012)

Bottom Line: Using these sequence data, we examined the effect of depth of coverage and sequencing platform on genome assembly and identification of sequence variants. We found that 50x coverage is sufficient to construct a whole-genome assembly and to accurately call most variants from 100 base pair paired-end sequencing reads. Sequence variant analyses of V. cholerae isolates, including multiple isolates from the Haitian outbreak, identify coverage-specific and technology-specific effects on variant detection, and provide insight into genomic change and functional evolution during an epidemic.

Affiliation: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA. rsealfon@mit.edu

ABSTRACT

Background: Whole-genome sequencing is an important tool for understanding microbial evolution and identifying the emergence of functionally important variants over the course of epidemics. In October 2010, a severe cholera epidemic began in Haiti, with additional cases identified in the neighboring Dominican Republic. We used whole-genome approaches to sequence four Vibrio cholerae isolates from Haiti and the Dominican Republic and three additional V. cholerae isolates to a high depth of coverage (>2000x); four of the seven isolates were previously sequenced.

Results: Using these sequence data, we examined the effect of depth of coverage and sequencing platform on genome assembly and identification of sequence variants. We found that 50x coverage is sufficient to construct a whole-genome assembly and to accurately call most variants from 100 base pair paired-end sequencing reads. Phylogenetic analysis between the newly sequenced and thirty-three previously sequenced V. cholerae isolates indicates that the Haitian and Dominican Republic isolates are closest to strains from South Asia. The Haitian and Dominican Republic isolates form a tight cluster, with only four variants unique to individual isolates. These variants are located in the CTX region, the SXT region, and the core genome. Of the 126 mutations identified that separate the Haiti-Dominican Republic cluster from the V. cholerae reference strain (N16961), 73 are non-synonymous changes, and a number of these changes cluster in specific genes and pathways.

Conclusions: Sequence variant analyses of V. cholerae isolates, including multiple isolates from the Haitian outbreak, identify coverage-specific and technology-specific effects on variant detection, and provide insight into genomic change and functional evolution during an epidemic.

Figure 1: Fiftyfold coverage suffices for whole-genome assembly and detection of most sequence variants. (A) The N50 of the assembly, shown over a range of coverage depths (5x-250x), rapidly increases up to 50x coverage, and then plateaus. The median N50 of assemblies of five disjoint sets of reads at each depth of coverage is shown. (B) The number of SNPs detected increases rapidly up to 50x coverage, and gradually thereafter. (C) The number of insertions and deletions detected increases rapidly up to 20x coverage, and plateaus after 50x coverage. SNPs, insertions, and deletions in all isolates except for O395* are called relative to the N16961 genome [GenBank:AE003852, GenBank:AE003853]. For the O395* sample, due to the large number of differences (>20,000 SNPs) from the N16961 reference, SNPs, insertions, and deletions were identified instead against the Sanger-sequenced O395 reference [GenBank:CP000626, GenBank:CP000627].

The high depth of coverage of our sequencing enabled comparison of the efficacy of de novo assembly and variant detection at multiple depths of coverage. To assess the assembly quality, we used the N50 statistic. N50, a common metric of assembly quality, is the number of base pairs in the shortest contig C such that fewer than half of the base pairs in the genome lie in contigs that are longer than C. We selected a random sample of the total reads for each isolate and compared the median N50 value for assemblies produced by Velvet at a range of coverage depths (5x to 250x), with three random read samples at each depth of coverage. For most isolates, N50 is stable across the range of depths from 50x to 250x, suggesting that 50x coverage is sufficient to construct a de novo assembly for these samples (Figure 1A). However, N50 continues to increase up to 100x coverage in sample H1*. The average read quality in H1* is the lowest of all the samples (Additional file 2: Figure S2), suggesting that while 50x is sufficient depth of coverage for de novo genome assembly on most samples, greater coverage is needed when average base quality is low.
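The two calculations in this paragraph, N50 and downsampling reads to a target depth, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline. The function names `n50`, `reads_for_coverage` and `subsample` are hypothetical, N50 is computed in the usual way against the total assembled length, and the genome size passed in is an assumed input (about 4 Mb for V. cholerae, used here only as a round figure).

```python
import random


def n50(contig_lengths):
    """Return the N50 of an assembly, given its contig lengths in base pairs.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all
    assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig is required")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length


def reads_for_coverage(depth, genome_size, read_length):
    """Number of reads giving roughly `depth`-fold coverage.

    Uses depth = reads * read_length / genome_size, solved for reads.
    """
    return round(depth * genome_size / read_length)


def subsample(reads, depth, genome_size, read_length, seed=0):
    """Draw a random sample of reads sized for the target depth.

    `reads` is any list of read records (hypothetical in-memory input);
    if fewer reads are available than requested, all of them are used.
    """
    wanted = reads_for_coverage(depth, genome_size, read_length)
    n = min(len(reads), wanted)
    return random.Random(seed).sample(reads, n)


if __name__ == "__main__":
    # Toy contig set: total 28 bp, half is 14 bp, reached at the 8 bp contig.
    print(n50([10, 8, 5, 3, 2]))
    # Assumed 4 Mb genome and 100 bp reads at 50x depth.
    print(reads_for_coverage(50, 4_000_000, 100))
```

Under these assumptions, 50x coverage of a 4 Mb genome with 100 bp reads corresponds to about two million reads (one million pairs). Repeating `subsample` with different seeds at each depth and taking the median `n50` of the resulting assemblies would mirror the comparison described above.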

