Misassembly detection using paired-end sequence reads and optical mapping data.
Bottom Line: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We generated and used simulated optical mapping data for loblolly pine and F. tularensis and used real optical mapping data for rice and budgerigar. Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in assemblies of F. tularensis, and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in assemblies of loblolly pine.
Affiliation: Department of Computer Science, Colorado State University, Fort Collins, CO 80526, USA, Department of Computer Science, University of Helsinki, Finland and Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA.
Mentions: misSEQuel first aligns reads to contigs to identify regions that contain abnormal read alignments. Collapsed or expanded repeats present as read coverage that is greater or lower than the expected genome coverage in the misassembled region. Similarly, inversion and rearrangement errors present as mate-pairs whose alignments are rearranged. Figure 1 illustrates these concordant and discordant read alignments. More specifically, this step consists of aligning all the (paired-end) reads to all the contigs and then calculating three thresholds, ΔL, ΔU and Γ. The range [ΔL, ΔU] defines the acceptable read depth, and Γ defines the maximum allowable number of reads whose mate-pair aligns in an inverted orientation. To calculate these thresholds, we consider all alignments of each read, as opposed to just the best alignment, since misassembly errors frequently occur within repetitive regions where reads align to multiple locations. misSEQuel performs this step using BWA (version 0.5.9) in paired-end mode with default parameters (Li and Durbin 2009). After alignment, each contig is treated as a series of consecutive 200-bp regions. These are sampled uniformly at random times, and two pairs of statistics are calculated from the sampled regions: the mean (µd) and standard deviation (σd) of the read depth, and the mean (µi) and standard deviation (σi) of the number of alignments in which a discordant mate-pair orientation is witnessed. ΔL is set to the maximum of , ΔU is set to and Γ is set to . The default for is th of the contig length, and this parameter can be changed in the input to misSEQuel.
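The windowing and sampling step described above can be illustrated with a minimal Python sketch. It is not the misSEQuel implementation: the function names (`window_means`, `sample_stats`, `flag_windows`), the choice to sample with replacement, the use of the population standard deviation and the dropping of a trailing partial window are all assumptions made for illustration. The thresholds ΔL, ΔU and Γ are passed in as plain arguments because the expressions that define them are not reproduced in this excerpt.

```python
import random
import statistics

WINDOW = 200  # region size in bp, as stated in the text


def window_means(per_base, window=WINDOW):
    """Mean per consecutive full window (assumption: a trailing partial window is dropped)."""
    n = len(per_base) // window
    return [sum(per_base[i * window:(i + 1) * window]) / window for i in range(n)]


def sample_stats(values, n_samples, seed=0):
    """Mean and standard deviation over windows sampled uniformly at random."""
    rng = random.Random(seed)
    picked = rng.choices(values, k=n_samples)  # assumption: sampling with replacement
    return statistics.mean(picked), statistics.pstdev(picked)


def flag_windows(depths, discordant, delta_l, delta_u, gamma):
    """Indices of windows whose depth lies outside [delta_l, delta_u]
    or whose discordant-orientation count exceeds gamma."""
    return [
        i
        for i, (d, c) in enumerate(zip(depths, discordant))
        if d < delta_l or d > delta_u or c > gamma
    ]
```

In use, `sample_stats` would be called once on the per-window read depths to obtain (µd, σd) and once on the per-window discordant counts to obtain (µi, σi); the thresholds derived from those statistics are then given to `flag_windows` to list candidate misassembled regions.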