Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data.

Hong C, Clement NL, Clement S, Hammoud SS, Carrell DT, Cairns BR, Snell Q, Clement MJ, Johnson WE - BMC Bioinformatics (2013)

Bottom Line: However, sequencing reads from bisulfite-converted DNA can vary significantly from the reference genome because of incomplete bisulfite conversion, genome variation, sequencing errors, and poor quality bases. We tested GNUMAP-bs and other commonly used bisulfite alignment methods using both simulated and real bisulfite reads and found that GNUMAP-bs and other dynamic programming methods were more accurate than the more heuristic methods. The GNUMAP-bs aligner is a highly accurate alignment approach for processing the data from bisulfite sequencing experiments.


Affiliation: Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA. wej@bu.edu.

ABSTRACT

Background: DNA methylation has been linked to many important biological phenomena. Researchers have recently begun to sequence bisulfite-treated DNA to determine its pattern of methylation. However, sequencing reads from bisulfite-converted DNA can vary significantly from the reference genome because of incomplete bisulfite conversion, genome variation, sequencing errors, and poor quality bases. It is therefore often difficult to align reads to the correct locations in the reference genome. Furthermore, bisulfite sequencing experiments carry the additional complexity of estimating the DNA methylation levels within the sample.

Results: Here, we present a highly accurate probabilistic algorithm, which is an extension of the Genomic Next-generation Universal MAPper to accommodate bisulfite sequencing data (GNUMAP-bs), that addresses the computational problems associated with aligning bisulfite sequencing data to a reference genome. GNUMAP-bs integrates uncertainty from read and mapping qualities to help resolve the difference between poor quality bases and the ambiguity inherent in bisulfite conversion. We tested GNUMAP-bs and other commonly used bisulfite alignment methods using both simulated and real bisulfite reads and found that GNUMAP-bs and other dynamic programming methods were more accurate than the more heuristic methods.

Conclusions: The GNUMAP-bs aligner is a highly accurate alignment approach for processing the data from bisulfite sequencing experiments. The GNUMAP-bs algorithm is freely available for download at: http://dna.cs.byu.edu/gnumap. The software runs on multiple threads and multiple processors to increase the alignment speed.


Figure 1: GNUMAP-bs workflow for mapping high-throughput bisulfite reads. A flow-chart of the GNUMAP-bs algorithm is shown. For the reference genome, all Cs are converted to Ts and then each k-mer in the genome is hashed, producing a list of positions in the genome to which the k-mer sequence is mapped. Given a completely BS-converted read (e.g. TGGTGTT) where the corresponding original bisulfite read (BSR) is known (e.g. CGGTGTC), a query k-mer (e.g. TGGTG) is searched against the BS-converted hash table. Locations with a k-mer match to both the genome and the read are aligned using the original read and original genome and the GNUMAP-bs probabilistic Needleman–Wunsch (N-W) algorithm. If the alignment score passes a quality threshold, the location is considered a match and recorded on the genome for future output. The posterior probability is computed for each mapped location using Equation (1). Finally, all the mapped BSRs are used to quantify the fraction of methylation at each CG location across the reference genome.
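The seeding step described in the caption can be illustrated with a minimal sketch. This is not the GNUMAP-bs implementation: the function names, the toy genome, and the forward-strand-only handling are assumptions made for illustration, and the probabilistic Needleman–Wunsch scoring and the posterior of Equation (1) are not reproduced here. The read and k-mer are the examples given in the caption.

```python
from collections import defaultdict


def bs_convert(seq):
    """In-silico bisulfite conversion: replace every C with T."""
    return seq.upper().replace("C", "T")


def build_kmer_index(genome, k):
    """Hash every k-mer of the C->T converted genome to its start positions."""
    converted = bs_convert(genome)
    index = defaultdict(list)
    for pos in range(len(converted) - k + 1):
        index[converted[pos:pos + k]].append(pos)
    return index


def candidate_locations(read, index, k):
    """Genome start offsets implied by k-mer hits of the converted read.

    Each candidate would then be re-aligned using the ORIGINAL read and
    ORIGINAL genome; that scoring step is outside this sketch.
    """
    converted = bs_convert(read)
    candidates = set()
    for offset in range(len(converted) - k + 1):
        kmer = converted[offset:offset + k]
        for pos in index.get(kmer, ()):
            candidates.add(pos - offset)
    return sorted(candidates)


# Hypothetical toy genome; the read CGGTGTC is the caption's example BSR.
genome = "AACGGTGTCAA"
read = "CGGTGTC"
index = build_kmer_index(genome, k=5)
print(bs_convert(read))                       # TGGTGTT
print(candidate_locations(read, index, k=5))  # [2]
```

Converting both genome and read before the lookup means a read still seeds correctly whether a given C was methylated (left as C) or unmethylated (read as T); the distinction is only needed later, when the original sequences are aligned.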

Mentions: To determine whether increased coverage resulted in decreased alignment quality, we computed the correlation coefficient between the estimated HEP methylation levels and the methylation levels estimated from each alignment method. The correlation coefficient was 0.887 for GNUMAP-bs and ranged from 0.895 for Bismark-Bowtie2 to 0.882 for LAST among the other aligners (Table 3). We also computed a concordance statistic for each of the methods, as defined previously [27]. Briefly, the concordance is the fraction of sites for which the methylation levels (aligner vs. HEP) differ by less than a predefined cutoff (we used 0.25). Based on this statistic, the performances of most of the aligners were similar, with a concordance of 0.869 for GNUMAP-bs and values ranging from 0.872 for Bismark-Bowtie2 and BRAT-BW to 0.858 for LAST (Table 3). These results showed that the increased coverage levels produced by LAST resulted in lower consistency with the HEP data in terms of both correlation and concordance. Pairwise comparisons of the concordance for the CGs covered/not covered between GNUMAP-bs and the other aligners are shown in Figure 1. For example, Figure 1 shows 577 CGs that were covered by GNUMAP-bs but not by Bismark; the concordance of the GNUMAP-bs methylation estimates for these sites was 0.78. In contrast, only six CGs covered by Bismark were not covered by GNUMAP-bs, and the concordance for these sites was extremely low (0.33). These results clearly showed that the increased mapping coverage (in both the number of CGs covered and the reads per CG) obtained using GNUMAP-bs did not reduce alignment quality or the quality of the methylation estimates.
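The two agreement measures used above can be sketched as follows. The concordance follows the definition given in the text (fraction of sites whose aligner and HEP methylation levels differ by less than the cutoff, 0.25); the correlation is assumed here to be the Pearson coefficient, which the text does not state explicitly. The function names and the methylation values are made up for illustration and are not data from the study.

```python
import math


def concordance(aligner_levels, hep_levels, cutoff=0.25):
    """Fraction of sites where the two methylation levels differ by < cutoff."""
    if len(aligner_levels) != len(hep_levels) or not aligner_levels:
        raise ValueError("inputs must be non-empty and of equal length")
    agree = sum(
        1 for a, h in zip(aligner_levels, hep_levels) if abs(a - h) < cutoff
    )
    return agree / len(aligner_levels)


def pearson(x, y):
    """Pearson correlation coefficient (assumed; the text says only 'correlation')."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-CG methylation fractions at four shared sites.
aligner = [0.9, 0.1, 0.5, 0.8]
hep = [1.0, 0.0, 0.9, 0.7]
print(concordance(aligner, hep))  # 0.75 (three of four sites within 0.25)
print(round(pearson(aligner, hep), 3))
```

Both statistics are computed only over sites covered by the aligner and by HEP, which is why the pairwise covered/not-covered comparison is reported separately.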

