Improving Phrap-based assembly of the rat using "reliable" overlaps.

Roberts M, Zimin AV, Hayes W, Hunt BR, Ustun C, White JR, Havlak P, Yorke J - PLoS ONE (2008)

Bottom Line: The assembly methods used for whole-genome shotgun (WGS) data have a major impact on the quality of resulting draft genomes. To demonstrate the benefits of using reliable overlaps, we have created a version of the Phrap assembly program that uses only overlaps from a specific list. If one views the overall quality of an assembly as proportional to the inverse of the product of the error rate and sequence missed, then the assembly presented here is seven times better.


Affiliation: Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America.

ABSTRACT
The assembly methods used for whole-genome shotgun (WGS) data have a major impact on the quality of resulting draft genomes. We present a novel algorithm to generate a set of "reliable" overlaps based on identifying repeat k-mers. To demonstrate the benefits of using reliable overlaps, we have created a version of the Phrap assembly program that uses only overlaps from a specific list. We call this version PhrapUMD. Integrating PhrapUMD and our "reliable-overlap" algorithm with the Baylor College of Medicine assembler, Atlas, we assemble the BACs from the Rattus norvegicus genome project. Starting with the same data as the Nov. 2002 Atlas assembly, we compare our results and the Atlas assembly to the 4.3 Mb of rat sequence in the 21 BACs that have been finished. Our version of the draft assembly of the 21 BACs increases the coverage of finished sequence from 93.4% to 96.3%, while simultaneously reducing the base error rate from 4.5 to 1.1 errors per 10,000 bases. There are a number of ways of assessing the relative merits of assemblies when the finished sequence is available. If one views the overall quality of an assembly as proportional to the inverse of the product of the error rate and sequence missed, then the assembly presented here is seven times better. The UMD Overlapper with options for reliable overlaps is available from the authors at http://www.genome.umd.edu. We also provide the changes to the Phrap source code enabling it to use only the reliable overlaps.
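The "seven times better" figure can be checked from the numbers quoted in the abstract. The sketch below is only an illustration: it assumes that "sequence missed" means 100% minus the reported coverage of finished sequence, and the function name `relative_quality` is hypothetical, not taken from the paper.

```python
def relative_quality(errors_per_10kb, coverage_pct):
    """Score proportional to 1 / (error rate * fraction of sequence missed).

    Assumption: "sequence missed" is 100% minus the coverage of finished sequence.
    """
    missed_fraction = (100.0 - coverage_pct) / 100.0
    error_rate = errors_per_10kb / 10_000.0
    return 1.0 / (error_rate * missed_fraction)


# Figures quoted in the abstract for the 21 finished BACs.
atlas = relative_quality(errors_per_10kb=4.5, coverage_pct=93.4)
umd_atlas = relative_quality(errors_per_10kb=1.1, coverage_pct=96.3)

# Ratio of the two scores: (4.5 * 6.6) / (1.1 * 3.7)
improvement = umd_atlas / atlas
print(f"UMD+Atlas vs. Atlas: {improvement:.1f}x")
```

Under this reading the ratio is (4.5 × 6.6) / (1.1 × 3.7), or roughly 7.3, which is consistent with the abstract's rounded claim.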


Figure 2: Two alignments of assemblies to the finished sequence of BAC GQQD. The original Atlas assembly created two scaffolds only covering 73.2% of the finished sequence. Note the misplaced 20 Kb segment in the Atlas assembly. The UMD+Atlas assembly of GQQD correctly places the 20 Kb section originally misplaced and creates a single scaffold of the BAC covering 93.3% of the finished sequence. This UMD+Atlas assembly used reliable overlaps. This was the BAC that gave Atlas the most trouble.

Mentions: Figure 2 shows one of our most dramatically improved BACs. We used NUCmer, a variant of the MUMmer program [15], to align the assemblies of the BAC GQQD to the finished sequence. This particular BAC was initially assembled by Atlas into two scaffolds, and one scaffold contained a 20 Kb section that was reversed and misplaced. Using PhrapUMD with reliable overlaps, UMD+Atlas assembled the entire BAC into one scaffold and fixed the major misassembly. Our assembly of this BAC matched 20.0% more finished sequence than the Atlas assembly and reduced the interior error rate from 4.3 errors per 10 Kb in the Atlas assembly to 1.7 errors per 10 Kb. Figure 3 demonstrates the worst UMD+Atlas assembly. This was the only BAC that was assembled into two separate scaffolds; the rest were each assembled into a single scaffold. In this BAC a 26 Kb section in the middle was assembled into a separate scaffold, Scaffold 1, whereas the rest of the BAC was assembled into Scaffold 2. The gap in the middle of Scaffold 2, matching the size and position of Scaffold 1, was estimated correctly. We do not view this scenario as a misassembly.
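One way to turn alignments like these into a "percent of finished sequence covered" figure is to take the union of the aligned intervals on the finished sequence. The sketch below is an assumption about how such a number could be computed, not the authors' documented procedure; the function name `covered_fraction`, the 1-based inclusive coordinates, and the example intervals are all hypothetical.

```python
def covered_fraction(alignments, reference_length):
    """Fraction of the reference covered by the union of aligned intervals.

    alignments: iterable of (start, end) reference coordinates, assumed
    1-based and inclusive; start may exceed end for reversed matches.
    """
    covered = 0
    current_start = current_end = None
    # Normalise orientation, then sweep the intervals in order of start.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in alignments):
        if current_end is None or start > current_end + 1:
            # Disjoint from the running interval: close it and start a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / reference_length


# Hypothetical alignments on a 1,000-base reference (not data from the paper).
example = [(1, 400), (301, 700), (1000, 900)]
fraction = covered_fraction(example, 1000)
print(f"{100 * fraction:.1f}% of the reference is covered")
```

Merging overlapping intervals before summing avoids counting a stretch of finished sequence twice when two contigs align to the same region.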

