Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements.

Zhang W, Spector TD, Deloukas P, Bell JT, Engelhardt BE - Genome Biol. (2015)

Bottom Line: The accuracy increases to 98% when restricted to CpG sites within CGIs and is robust across platform and cell-type heterogeneity. Our classifier outperforms other types of classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation, CGIs, co-localized DNase I hypersensitive sites, transcription factor binding sites, and histone modifications were found to be most predictive of methylation levels. Furthermore, our method identified genomic features that interact with DNA methylation, suggesting mechanisms involved in DNA methylation modification and regulation, and linking diverse epigenetic processes.


Affiliation: Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA. wz31@duke.edu.

ABSTRACT

Background: Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is critical to enable genome-wide analyses, but current approaches tackle average methylation within a locus and are often limited to specific genomic regions.

Results: We characterize genome-wide DNA methylation patterns, and show that correlation among CpG sites decays rapidly, making predictions solely based on neighboring sites challenging. We built a random forest classifier to predict methylation levels at CpG site resolution using features including neighboring CpG site methylation levels and genomic distance, co-localization with coding regions, CpG islands (CGIs), and regulatory elements from the ENCODE project. Our approach achieves 92% prediction accuracy of genome-wide methylation levels at single-CpG-site precision. The accuracy increases to 98% when restricted to CpG sites within CGIs and is robust across platform and cell-type heterogeneity. Our classifier outperforms other types of classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation, CGIs, co-localized DNase I hypersensitive sites, transcription factor binding sites, and histone modifications were found to be most predictive of methylation levels.

Conclusions: Our observations of DNA methylation patterns led us to develop a classifier to predict DNA methylation levels at CpG site resolution with high accuracy. Furthermore, our method identified genomic features that interact with DNA methylation, suggesting mechanisms involved in DNA methylation modification and regulation, and linking diverse epigenetic processes.


Figure 5: Prediction performance of methylation status and level. (A) ROC curves of cross-genome validation of methylation status prediction. Colors represent classifiers trained using the feature combinations specified in the legend. Each ROC curve represents the average false positive rate and true positive rate for prediction on the held-out sets for each of the ten repeated random subsamples. (B) ROC curves for different classifiers. Colors represent prediction for the classifier denoted in the legend. Each ROC curve represents the average false positive rate and true positive rate for prediction on the held-out sets for each of the ten repeated random subsamples. (C) Precision–recall curves for region-specific methylation status prediction. Colors represent prediction on CpG sites within specific genomic regions as denoted in the legend. Each precision–recall curve represents the average precision–recall for prediction on the held-out sets for each of the ten repeated random subsamples. (D) Two-dimensional histogram of predicted methylation levels versus experimental methylation levels. The x- and y-axes represent assayed and predicted β values, respectively. Colors represent the density of each matrix unit, averaged over all predictions for 100 individuals. CGI, CpG island; Gene_pos, genomic position; k-NN, k-nearest neighbors classifier; ROC, receiver operating characteristic; seq_property, sequence properties; SVM, support vector machine; TFBS, transcription factor binding site; HM, histone modification marks; ChromHMM, chromatin states, as defined by ChromHMM software [107].
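The caption does not say how the ten per-subsample ROC curves were combined into one averaged curve. One common choice, assumed here, is vertical averaging: interpolate each curve's true positive rate on a shared grid of false positive rates and take the mean. The function names and the toy labels and scores below are illustrative and are not taken from the paper.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR) for binary labels, starting at (0, 0).

    Assumes labels are 0/1 and that both classes are present.
    Samples with tied scores are added to the curve as a single step.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        is_last = idx == len(ranked) - 1
        if is_last or ranked[idx + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def tpr_at(points, fpr):
    """Interpolate the TPR of one ROC curve at a given FPR.

    Where the curve is vertical, the highest TPR at that FPR is used.
    """
    best = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= fpr <= x1:
            if x1 == x0:
                best = max(best, y1)
            else:
                best = max(best, y0 + (y1 - y0) * (fpr - x0) / (x1 - x0))
    return best


def mean_roc(repeats, fpr_grid):
    """Average several ROC curves vertically on a shared FPR grid.

    `repeats` is a list of (labels, scores) pairs, one per held-out set.
    """
    curves = [roc_points(labels, scores) for labels, scores in repeats]
    return [
        (fpr, sum(tpr_at(curve, fpr) for curve in curves) / len(curves))
        for fpr in fpr_grid
    ]


# Toy stand-in for repeated subsamples (hypothetical values, two repeats only).
toy_repeats = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),
    ([1, 0, 1, 0], [0.9, 0.7, 0.6, 0.2]),
]
for fpr, tpr in mean_roc(toy_repeats, [0.0, 0.25, 0.5, 0.75, 1.0]):
    print(f"FPR={fpr:.2f}  mean TPR={tpr:.3f}")
```

Averaging over thresholds instead of over a fixed FPR grid is an equally plausible reading of the caption; the sketch only illustrates one of the two.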

Mentions: Using 122 features (excluding β for one upstream and one downstream neighboring CpG site but including status τ) and considering all CpG sites with two neighboring CpG sites in our data, we achieved an accuracy of 91.9% and an AUC of 0.96 (Figure 5A). We considered the role of each subset of features (Table 1). For example, if we only included genomic position features, the classifier had an accuracy of 78.6% and an AUC of 0.85. Including DNA sequence properties and TFBS features increased the accuracy to 85.7% and the AUC to 0.92. When we included all classes of features except for neighbors, the classifier achieved an accuracy of 89.0% and an AUC of 0.94, a significant improvement in prediction over considering only genomic position features (t-test; P = 7.75 × 10^-23). These results suggest that TFBSs, histone modifications, and chromatin state are predictive of DNA methylation. However, we also found that the genomic context features improved prediction significantly over using only the neighbor features, which had an accuracy of 90.7% and an AUC of 0.94 (t-test; P = 3.45 × 10^-18).
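As a reminder of what the two reported metrics measure, the minimal sketch below computes accuracy and AUC from held-out labels and classifier scores. The 0.5 decision threshold, the function names, and the toy scores for two hypothetical feature sets are assumptions for illustration; they are not the paper's data or code.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of sites whose thresholded score matches the true status.

    The 0.5 threshold is an assumption made for this example.
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via its rank interpretation.

    Equals the probability that a randomly chosen methylated site (label 1)
    scores higher than a randomly chosen unmethylated site (label 0),
    counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out statuses and scores from two made-up feature sets.
labels = [1, 1, 1, 0, 0, 0]
toy_scores = {
    "position_only": [0.9, 0.7, 0.4, 0.6, 0.3, 0.1],
    "all_features": [0.9, 0.8, 0.6, 0.4, 0.3, 0.1],
}
for name, scores in toy_scores.items():
    print(name, round(accuracy(labels, scores), 3), round(auc(labels, scores), 3))
```

Accuracy depends on the chosen threshold while AUC does not, which is why two feature sets can have different accuracies but the same AUC, as with the 89.0% and 90.7% classifiers above that both reach an AUC of 0.94.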


