Gene-boosted assembly of a novel bacterial genome from very short reads.

Salzberg SL, Sommer DD, Puiu D, Lee VT - PLoS Comput. Biol. (2008)

Bottom Line: The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths. From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides. Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly.


Affiliation: Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America. salzberg@umd.edu

ABSTRACT
Recent improvements in technology have made DNA sequencing dramatically faster and more efficient than ever before. The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths. Short-read sequencing has been applied successfully to resequence the human genome and those of other species but not to whole-genome sequencing of novel organisms. Here we describe the sequencing and assembly of a novel clinical isolate of Pseudomonas aeruginosa, strain PAb1, using very short read technology. From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides. Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly. This study demonstrates the feasibility of very short read sequencing for the sequencing of bacterial genomes, particularly those for which a related species has been sequenced previously, and expands the potential application of this new technology to most known prokaryotic species.
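The read and assembly figures quoted in the abstract allow a rough back-of-the-envelope estimate of sequencing depth. The sketch below is only illustrative: it assumes that every read contributes to the assembly and that the assembled length (scaffold plus unordered contigs) approximates the genome size, neither of which is stated in this record, so the result should be read as an upper bound on raw coverage rather than a reported value. All variable names are hypothetical.

```python
# Figures taken from the abstract.
num_reads = 8_627_900        # number of reads
read_length = 33             # nucleotides per read
scaffold_nt = 6_290_005      # nucleotides in the 76 ordered contigs
unordered_nt = 416_897       # nucleotides in the 436 unordered contigs

# Total sequenced bases and total assembled bases.
total_read_nt = num_reads * read_length
assembled_nt = scaffold_nt + unordered_nt

# Assumption: assembled length stands in for genome size, and all reads count.
raw_coverage = total_read_nt / assembled_nt

print(f"sequenced bases: {total_read_nt:,}")
print(f"assembled bases: {assembled_nt:,}")
print(f"approximate raw coverage: {raw_coverage:.1f}x")
```

Under these assumptions the raw depth comes out at a little over 40-fold.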


Figure 2 (pcbi-1000186-g002): Gene-boosted assembly. All contigs are aligned with predicted gene sequences to identify genes that span two or more contigs. The DNA sequences of these spanning genes are cut out with a small buffer on each end. The amino acid translation of each gene fragment is then searched against a translated database of all singleton reads that have not yet been placed in the assembly. Finally, the reads identified by this process are assembled together with the two contigs to fill in the gap.

Mentions: From the initial annotation, we identified those genes that extended beyond the ends of contigs or that spanned the gaps between contigs. We extracted the amino acid sequences corresponding to these gap positions, with a small buffer sequence included on each side of each gap. Next we used tblastn [25] to align each protein sequence to all the unused reads translated in all 6 frames (Figure 2). This step identified, for each gap, a small set of reads that would fill in the missing protein sequence, and the tblastn results provided initial locations for a multiple alignment. We then used a new program, ABBA (Assembly Boosted By Amino acids), to assemble the reads together with the flanking contigs and close the gaps. This gene-boosted assembly protocol extended many contigs and closed 185 gaps, ranging in length from 14–1095 bp, reducing the number of long contigs to 120.
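The read-recruitment step described above (searching a gap's protein sequence against unused reads translated in all 6 frames) can be illustrated with a small sketch. This is not tblastn and not ABBA: the function `recruit_reads` and its exact-substring matching are hypothetical simplifications standing in for a real protein-to-translated-DNA alignment, and the example peptide and reads are invented for illustration. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(read):
    """Translate a read in all six frames: +1..+3 forward, -1..-3 reverse."""
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def recruit_reads(gap_peptide, reads):
    """Hypothetical stand-in for the tblastn step: report (read index, frame,
    peptide position) wherever a translated frame occurs exactly in the
    gap peptide. A real search would tolerate mismatches and partial overlap."""
    hits = []
    for index, read in enumerate(reads):
        for frame, protein in six_frame(read).items():
            position = gap_peptide.find(protein)
            if position != -1:
                hits.append((index, frame, position))
    return hits


# Invented example: one 33-nt read, its reverse complement, and an unrelated read.
read = "ATGAAAACCGCGTATATTGCGAAACAGCGTCAG"
reads = [read, reverse_complement(read), "A" * 33]
gap_peptide = "GSMKTAYIAKQRQISFVK"

for hit in recruit_reads(gap_peptide, reads):
    print(hit)
```

The peptide positions returned for each recruited read play the role that the excerpt assigns to the tblastn results, namely initial locations for placing the reads in a multiple alignment across the gap.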

