Gene-boosted assembly of a novel bacterial genome from very short reads.

Salzberg SL, Sommer DD, Puiu D, Lee VT - PLoS Comput. Biol. (2008)

Bottom Line: The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths. From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides. Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly.


Affiliation: Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America. salzberg@umd.edu

ABSTRACT
Recent improvements in technology have made DNA sequencing dramatically faster and more efficient than ever before. The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths. Short-read sequencing has been applied successfully to resequence the human genome and those of other species but not to whole-genome sequencing of novel organisms. Here we describe the sequencing and assembly of a novel clinical isolate of Pseudomonas aeruginosa, strain PAb1, using very short read technology. From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides. Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly. This study demonstrates the feasibility of very short read sequencing for the sequencing of bacterial genomes, particularly those for which a related species has been sequenced previously, and expands the potential application of this new technology to most known prokaryotic species.
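The figures in the abstract allow a quick back-of-the-envelope check of assembly size and read depth. The sketch below is only illustrative: it assumes that the scaffold and the unordered contigs do not overlap, that their combined length approximates the genome size, and that every read contributes to coverage. None of these assumptions is stated in the abstract, and the function names are hypothetical.

```python
# Figures quoted in the abstract.
READS = 8_627_900          # number of reads
READ_LEN = 33              # nucleotides per read
SCAFFOLD_BP = 6_290_005    # 76 ordered contigs in one scaffold
UNORDERED_BP = 416_897     # 436 unordered contigs


def total_assembled_bp(scaffold_bp: int, unordered_bp: int) -> int:
    """Sum of scaffold and unordered contig lengths (assumes no overlap)."""
    return scaffold_bp + unordered_bp


def mean_coverage(n_reads: int, read_len: int, genome_bp: int) -> float:
    """Average read depth: total sequenced bases divided by genome length."""
    return n_reads * read_len / genome_bp


total_bp = total_assembled_bp(SCAFFOLD_BP, UNORDERED_BP)
raw_bases = READS * READ_LEN
depth = mean_coverage(READS, READ_LEN, total_bp)

print(f"assembled: {total_bp:,} bp")
print(f"raw bases: {raw_bases:,}")
print(f"approximate mean depth: {depth:.1f}x")
```

Under these assumptions the reads amount to roughly 42-fold coverage of the assembled sequence. The true depth depends on the actual genome size and on how many reads were usable.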


Figure 1 (pcbi-1000186-g001): Comparative assembly using multiple genomes. The target genome is shown in the center, aligned to two related genomes, A and B. The DNA sequence of the target diverges from the reference genomes in distinct loci, labeled X, Y, and Z. The comparative assembly based on genome A contains a gap corresponding to region Y, while the assembly based on genome B contains two gaps, corresponding to X and Z. The merged assembly will cover all of the target genome with no gaps.

Mentions: Our second step was a novel enhancement to the comparative assembly strategy, in which we used multiple reference genomes (Figure 1). We used the complete genomes of both PAO1 [19] and PA14 [20] separately to build multiple comparative assemblies, and found that PA14 produced the better assembly, comprising 2,053 contigs containing 6,206,284 bp. (We also used the PA7 strain, but its greater evolutionary distance made it less useful.) The bulk of the sequence was contained in 157 contigs longer than 10 Kbp, which collectively covered 5,568,616 bp. There were 331,364 bp in the PA14 genome that were not covered by the initial assembly, due to divergence between the two strains. However, the gaps in the comparative assembly based on PAO1 occurred in different locations due to differences between the strains. The best assembly based on PAO1 comprised 2,797 contigs covering 6,043,652 bp.
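The idea that gaps from different references fall in different places, so that a merged assembly can close them, can be illustrated with a toy interval model. This is a minimal sketch, not the authors' pipeline: it assumes each comparative assembly is reduced to a list of covered half-open intervals in target coordinates (which are not known in practice), and the coordinates for regions X, Y, and Z are invented to mirror the figure caption.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in hypothetical target coordinates


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Union of intervals, returned sorted with overlapping or touching ones joined."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gaps(covered: List[Interval], length: int) -> List[Interval]:
    """Regions of [0, length) not covered by any interval."""
    result: List[Interval] = []
    pos = 0
    for start, end in merge_intervals(covered):
        if start > pos:
            result.append((pos, start))
        pos = max(pos, end)
    if pos < length:
        result.append((pos, length))
    return result


# Toy target of length 100 with invented divergent loci:
# X = [10, 20), Y = [40, 55), Z = [70, 85).
TARGET_LEN = 100
cov_a = [(0, 40), (55, 100)]            # reference A misses region Y
cov_b = [(0, 10), (20, 70), (85, 100)]  # reference B misses regions X and Z

gaps_a = gaps(cov_a, TARGET_LEN)
gaps_b = gaps(cov_b, TARGET_LEN)
gaps_merged = gaps(cov_a + cov_b, TARGET_LEN)

print("gaps with A only:", gaps_a)
print("gaps with B only:", gaps_b)
print("gaps after merging:", gaps_merged)
```

In this toy case each reference leaves its own gaps, and their union covers the whole target. With real data, regions that diverge from every reference would remain as gaps after merging.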

