Improving Phrap-based assembly of the rat using "reliable" overlaps.

Roberts M, Zimin AV, Hayes W, Hunt BR, Ustun C, White JR, Havlak P, Yorke J - PLoS ONE (2008)

Bottom Line: The assembly methods used for whole-genome shotgun (WGS) data have a major impact on the quality of resulting draft genomes. To demonstrate the benefits of using reliable overlaps, we have created a version of the Phrap assembly program that uses only overlaps from a specific list. If one views the overall quality of an assembly as proportional to the inverse of the product of the error rate and sequence missed, then the assembly presented here is seven times better.


Affiliation: Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America.

ABSTRACT
The assembly methods used for whole-genome shotgun (WGS) data have a major impact on the quality of resulting draft genomes. We present a novel algorithm to generate a set of "reliable" overlaps based on identifying repeat k-mers. To demonstrate the benefits of using reliable overlaps, we have created a version of the Phrap assembly program that uses only overlaps from a specific list. We call this version PhrapUMD. Integrating PhrapUMD and our "reliable-overlap" algorithm with the Baylor College of Medicine assembler, Atlas, we assemble the BACs from the Rattus norvegicus genome project. Starting with the same data as the Nov. 2002 Atlas assembly, we compare our results and the Atlas assembly to the 4.3 Mb of rat sequence in the 21 BACs that have been finished. Our version of the draft assembly of the 21 BACs increases the coverage of finished sequence from 93.4% to 96.3%, while simultaneously reducing the base error rate from 4.5 to 1.1 errors per 10,000 bases. There are a number of ways of assessing the relative merits of assemblies when the finished sequence is available. If one views the overall quality of an assembly as proportional to the inverse of the product of the error rate and sequence missed, then the assembly presented here is seven times better. The UMD Overlapper with options for reliable overlaps is available from the authors at http://www.genome.umd.edu. We also provide the changes to the Phrap source code enabling it to use only the reliable overlaps.
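The "seven times better" figure can be checked from the numbers quoted in the abstract. The following is a minimal sketch, assuming "sequence missed" means one minus the coverage of finished sequence; the function name `relative_quality` is hypothetical and introduced only for illustration.

```python
def relative_quality(err_old, cov_old, err_new, cov_new):
    """Ratio quality(new) / quality(old), where quality is taken as
    1 / (error rate * fraction of finished sequence missed).
    Assumption: fraction missed = 1 - coverage."""
    missed_old = 1.0 - cov_old
    missed_new = 1.0 - cov_new
    return (err_old * missed_old) / (err_new * missed_new)

# Values from the abstract: errors per 10,000 bases and coverage of finished sequence.
ratio = relative_quality(
    err_old=4.5 / 10_000, cov_old=0.934,   # Nov. 2002 Atlas assembly
    err_new=1.1 / 10_000, cov_new=0.963,   # assembly using reliable overlaps
)
print(round(ratio, 1))  # 7.3
```

Under that reading, the error rate improves by a factor of about 4.1 and the missed sequence by a factor of about 1.8, and their product is roughly 7.3.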


Figure 1: Illustration of the technique that identifies reliable overlaps: (a) a scenario where a genome contains two copies of a repeat region R. The correct positions of reads A, B, C and D are shown. (b) A “fork” in the overlaps. (c) A scenario where reads A and D have the same sequencing error at the same base.

Mentions: Figure 1a shows a scenario where a genome contains two copies of a repeat region R. The correct positions of reads A, B, C and D are shown. The repeat region causes a “fork” in the overlaps, as shown in Figure 1b. The fork is created because read A has a plausible overlap with reads B, C and D, but D does not overlap B and C. We call the overlaps of A with B and D “fork overlaps”. Our goal is to design a method that eliminates the fork overlaps from the list of plausible overlaps, thereby producing a list of overlaps we call “reliable”. In Figure 1b, the only overlap that we would like to call reliable is between reads A and C, because part of the overlap region is outside the repeat region.
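The idea that an overlap is reliable when part of it lies outside a repeat can be illustrated with a toy k-mer filter. This is a sketch under stated assumptions, not the authors' algorithm: the reads, the k-mer size `k=4`, the threshold `max_count=2`, and the function names are all hypothetical, reverse complements and sequencing errors are ignored, and an overlap is called reliable here if it contains at least one k-mer that is not flagged as a repeat.

```python
from collections import Counter

def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def find_repeat_kmers(reads, k, max_count):
    """Flag k-mers seen more than max_count times across all reads.
    Assumption: max_count stands in for a coverage-dependent cutoff."""
    counts = Counter(km for read in reads.values() for km in kmers(read, k))
    return {km for km, n in counts.items() if n > max_count}

def is_reliable(overlap_seq, repeat_kmers, k):
    """Reliable if some k-mer of the overlap lies outside the repeat set."""
    return any(km not in repeat_kmers for km in kmers(overlap_seq, k))

# Toy reads: "GGCGCC" plays the role of repeat region R.
reads = {
    "A": "TAGTGGCGC",  # unique sequence running into R
    "B": "GGCGCCTTT",  # first copy of R and what follows it
    "C": "ACTAGTGGC",  # unique sequence shared with A, plus the start of R
    "D": "GGCGCCAAA",  # second copy of R and what follows it
}

# Plausible overlaps of A, given as the shared sequence.
overlaps = {
    ("A", "B"): "GGCGC",    # lies entirely inside R
    ("A", "C"): "TAGTGGC",  # extends outside R
    ("A", "D"): "GGCGC",    # lies entirely inside R
}

repeat_kmers = find_repeat_kmers(reads, k=4, max_count=2)
reliable = sorted(
    pair for pair, seq in overlaps.items() if is_reliable(seq, repeat_kmers, 4)
)
print(reliable)  # [('A', 'C')]
```

In this toy case the two fork overlaps (A with B, A with D) consist only of repeat k-mers and are dropped, while the overlap of A with C survives because it includes k-mers from the unique flanking sequence.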

