Misassembly detection using paired-end sequence reads and optical mapping data.
Bottom Line: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called misSEQuel that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open source computational methods for analyzing optical mapping data.
Affiliation: Department of Computer Science, Colorado State University, Fort Collins, CO 80526, USA, Department of Computer Science, University of Helsinki, Finland and Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA.
Mentions: misSEQuel first aligns reads to contigs to identify regions that contain abnormal read alignments. Collapsed or expanded repeats present as read coverage that is greater or lower than the expected genome coverage in the misassembled region. Similarly, inversion and rearrangement errors present as mate-pairs whose alignments are rearranged. Figure 1 illustrates these concordant and discordant read alignments. More specifically, this step consists of aligning all the (paired-end) reads to all the contigs and then calculating three thresholds, ΔL, ΔU and Γ. The range [ΔL, ΔU] defines the acceptable read depth, and Γ defines the maximum allowable number of reads whose mate-pair aligns in an inverted orientation. To calculate these thresholds, we consider all alignments of each read, as opposed to just the best alignment, since misassembly errors frequently occur within repetitive regions where reads align to multiple locations. misSEQuel performs this step using BWA (version 0.5.9) in paired-end mode with default parameters (Li and Durbin 2009). After alignment, each contig is treated as a series of consecutive 200-bp regions. These are sampled uniformly at random times, and the mean (µd) and the standard deviation (σd) of the read depth and the mean (µi) and the standard deviation (σi) of the number of alignments where a discordant mate-pair orientation is witnessed are calculated from these sampled regions. ΔL is set to the maximum of , ΔU is set to and Γ is set to . The default for is th of the contig length, and this parameter can be changed in the input to misSEQuel.
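The threshold step can be illustrated with a minimal Python sketch. It is not misSEQuel's implementation, and the exact expressions for ΔL, ΔU and Γ are not recoverable from the text above. The sketch therefore assumes each threshold lies a hypothetical k standard deviations from its sampled mean and that ΔL is floored at zero. The function names, the parameter k, the floor and the choice to sample with replacement are all assumptions made for illustration. The inputs are taken to be per-region read depths and per-region counts of discordantly oriented mate-pairs, already computed for consecutive 200-bp regions.

```python
import random
import statistics


def estimate_thresholds(depths, discordant, n_samples, k=3.0, seed=0):
    """Estimate (delta_l, delta_u, gamma) from randomly sampled regions.

    depths[i]     -- read depth of the i-th 200-bp region
    discordant[i] -- discordant mate-pair alignments in the i-th region
    k             -- ASSUMPTION: number of standard deviations from the mean;
                     the source's actual expressions are not given here.
    """
    if len(depths) != len(discordant):
        raise ValueError("depths and discordant must have the same length")
    if not depths or n_samples < 1:
        raise ValueError("need at least one region and one sample")

    rng = random.Random(seed)
    # ASSUMPTION: regions are sampled uniformly at random, with replacement.
    picks = [rng.randrange(len(depths)) for _ in range(n_samples)]
    d = [depths[i] for i in picks]
    c = [discordant[i] for i in picks]

    mu_d, sigma_d = statistics.fmean(d), statistics.pstdev(d)
    mu_i, sigma_i = statistics.fmean(c), statistics.pstdev(c)

    delta_l = max(0.0, mu_d - k * sigma_d)  # ASSUMPTION: floor at zero
    delta_u = mu_d + k * sigma_d
    gamma = mu_i + k * sigma_i
    return delta_l, delta_u, gamma


def flag_regions(depths, discordant, delta_l, delta_u, gamma):
    """Return indices of regions whose depth falls outside [delta_l, delta_u]
    or whose discordant count exceeds gamma."""
    return [
        i
        for i, (d, c) in enumerate(zip(depths, discordant))
        if d < delta_l or d > delta_u or c > gamma
    ]
```

Under these assumptions, a contig's regions would first be passed to `estimate_thresholds`, and the resulting triple would then be passed to `flag_regions` to list candidate misassembled regions for further checking.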