Bis-class: a new classification tool of methylation status using bayes classifier and local methylation information.

Huh I, Yang X, Park T, Yi SV - BMC Genomics (2014)

Bottom Line: We show that our approach has higher sensitivity and better classification performance than the binomial method via multiple analyses, including computational simulations, Area Under Curve (AUC) analyses, and improved consistencies across biological replicates. This method is especially advantageous in the analyses of sparsely methylated genomes with low coverage. This method should be widely applicable to the analyses of methylC-seq data from diverse sparsely methylated genomes.


Affiliation: Department of Statistics, Bioinformatics and Biostatistics Laboratory, Seoul National University, 151-742 Seoul, Korea. tspark@stats.snu.ac.kr.

ABSTRACT

Background: Whole-genome sequencing of bisulfite-converted DNA (the 'methylC-seq' method) provides comprehensive information on DNA methylation. An important application of these whole-genome methylation maps is classifying each position as a methylated versus non-methylated nucleotide. A widely used current method for this purpose, the so-called binomial method, is intuitive and straightforward, but lacks power when the sequence coverage and the genome-wide methylation level are low. These problems present a particular challenge when analyzing sparsely methylated genomes, such as those of many invertebrates and plants.

Results: We demonstrate that the number of sequence reads per position from methylC-seq data displays a large variance and can be modeled as a shifted negative binomial distribution. We also show that DNA methylation levels of adjacent CpG sites are correlated, and this similarity in local DNA methylation levels extends several kilobases. Taking these observations into account, we propose a new method based on Bayesian classification to infer DNA methylation status while considering the neighborhood DNA methylation levels of a specific site. We show that our approach has higher sensitivity and better classification performance than the binomial method via multiple analyses, including computational simulations, Area Under Curve (AUC) analyses, and improved consistencies across biological replicates. This method is especially advantageous in the analyses of sparsely methylated genomes with low coverage.

Conclusions: Our method improves the existing binomial method for binary methylation calls by utilizing a posterior odds framework and incorporating local methylation information. This method should be widely applicable to the analyses of methylC-seq data from diverse sparsely methylated genomes. Bis-Class and example data are provided at a dedicated website (http://bibs.snu.ac.kr/software/Bisclass).

Figure 5: The GB 16479 locus exhibits qualitatively identical information yet opposite methylation calls under the binomial method. Data are from unpublished methylC-seq experiments on two honey bee individuals from the Yi lab, and are available upon request. All reads mapped to the four CpGs are ‘C’ reads (indicating 100% methylation). However, the binomial method provides different methylation calls for these two samples. Specifically, the binomial method calls all CpGs in sample A as non-methylated (white dots), and all CpGs in sample B as methylated (black dots). Bis-Class correctly identifies identical methylation features in the two replicates.

Mentions: First, we found that many of the mCpGs detected by Bis-Class are sites covered by a single C read that occur in highly methylated regions (Additional file 4). This improved detection arises because the binomial method cannot recognize any mCpG covered by only a single read (e.g., Additional file 4), whereas Bis-Class can call such a position methylated if it occurs near other methylated CpGs. We demonstrate this property using two examples recovered from the data. The first example is the gene (GB-16479) from two honey bee methylC-seq data sets (Figure 5). In these data, four cytosines cluster in a region with high overall methylation (the fractional methylation level of the 1000 bp region encompassing these four sites is ~0.9 in both samples). In sample A, each of the four cytosines was covered by a single read, all ‘C’. In sample B, each of the same four cytosines was covered by two ‘C’ reads. The binomial method calls all four cytosines in sample A ‘unmethylated’, while calling all four cytosines in sample B ‘methylated’. This example clearly demonstrates the pitfall of the binomial method: two samples with exactly the same qualitative information (100% ‘C’ reads in both cases) are classified in opposite directions because of the small number of reads. Bis-Class, on the other hand, classified all four cytosines as ‘methylated’ in both cases. In the second example (Figure 6), we show the distribution of reads mapped to the locus GB 13135 in Herb et al. [7]. There are twelve samples in this data set (six forager bees and six nurse bees). In the Forager 1 sample, the third and fifth positions are each covered by a single C read. The binomial method calls these sites non-methylated (Figure 6B). However, since these sites occur in a heavily methylated region, Bis-Class calls both of them methylated (Figure 6B). In other samples, these sites are covered by more than one read. For example, in the Forager 6 sample, both the third and fifth positions are covered by seven C reads and are consequently called methylated CpGs. The consistency between biological replicates indicates that using local information improves methylation-calling accuracy. FDR-corrected q-values and posterior odds for each position of this locus are provided in Additional file 5.
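
The contrast between the two samples at the first locus can be illustrated with a small numerical sketch. This is not the Bis-Class implementation: the bisulfite non-conversion error rate (0.01), the p-value cutoff (0.001, standing in for an FDR-corrected threshold), the per-read 'C' probability at a methylated site (0.9), and the function names are all assumptions chosen for illustration. Only the local methylation level (~0.9) and the read counts come from the text. The toy posterior odds below simply multiply a prior odds derived from the local methylation level by a likelihood ratio; the published method is more elaborate (for example, it models coverage with a shifted negative binomial distribution).

```python
from math import comb

ERROR_RATE = 0.01   # assumed non-conversion error rate (not from the paper)
P_CUTOFF = 0.001    # assumed stand-in for an FDR-corrected threshold
LOCAL_LEVEL = 0.9   # fractional methylation of the surrounding region (from the text)


def binomial_pvalue(c_reads, coverage, error_rate=ERROR_RATE):
    """P(X >= c_reads) for X ~ Binomial(coverage, error_rate).

    This is the chance of seeing at least c_reads 'C' reads at an
    unmethylated site purely through non-conversion errors.
    """
    return sum(
        comb(coverage, k) * error_rate**k * (1 - error_rate) ** (coverage - k)
        for k in range(c_reads, coverage + 1)
    )


def toy_posterior_odds(c_reads, coverage, local_level=LOCAL_LEVEL,
                       error_rate=ERROR_RATE, meth_read_rate=0.9):
    """Illustrative posterior odds (methylated : unmethylated).

    Prior odds come from the local methylation level (must be in (0, 1));
    the likelihood ratio compares 'C' read probabilities under the two
    states. The binomial coefficient cancels in the ratio.
    """
    t_reads = coverage - c_reads
    lik_meth = meth_read_rate**c_reads * (1 - meth_read_rate) ** t_reads
    lik_unmeth = error_rate**c_reads * (1 - error_rate) ** t_reads
    prior_odds = local_level / (1 - local_level)
    return prior_odds * lik_meth / lik_unmeth


# Sample A: one 'C' read per site; sample B: two 'C' reads per site.
for name, coverage in (("A", 1), ("B", 2)):
    p = binomial_pvalue(coverage, coverage)
    odds = toy_posterior_odds(coverage, coverage)
    binomial_call = "methylated" if p < P_CUTOFF else "unmethylated"
    odds_call = "methylated" if odds > 1 else "unmethylated"
    print(f"sample {name}: p={p:.2e} -> {binomial_call}; "
          f"odds={odds:.0f} -> {odds_call}")
```

With a single read, the smallest p-value the binomial test can produce equals the error rate itself, so under any cutoff stricter than that rate the site can never be called methylated, no matter how methylated its neighborhood is. Under the assumptions above, the odds-based call is the same for both samples because the prior from the surrounding region already favors methylation.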

