Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements.

Zhang W, Spector TD, Deloukas P, Bell JT, Engelhardt BE - Genome Biol. (2015)

Bottom Line: The accuracy increases to 98% when restricted to CpG sites within CGIs and is robust across platform and cell-type heterogeneity. Our classifier outperforms other types of classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation, CGIs, co-localized DNase I hypersensitive sites, transcription factor binding sites, and histone modifications were found to be most predictive of methylation levels. Furthermore, our method identified genomic features that interact with DNA methylation, suggesting mechanisms involved in DNA methylation modification and regulation, and linking diverse epigenetic processes.


Affiliation: Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA. wz31@duke.edu.

ABSTRACT

Background: Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is critical to enable genome-wide analyses, but current approaches tackle average methylation within a locus and are often limited to specific genomic regions.

Results: We characterize genome-wide DNA methylation patterns, and show that correlation among CpG sites decays rapidly, making predictions solely based on neighboring sites challenging. We built a random forest classifier to predict methylation levels at CpG site resolution using features including neighboring CpG site methylation levels and genomic distance, co-localization with coding regions, CpG islands (CGIs), and regulatory elements from the ENCODE project. Our approach achieves 92% prediction accuracy of genome-wide methylation levels at single-CpG-site precision. The accuracy increases to 98% when restricted to CpG sites within CGIs and is robust across platform and cell-type heterogeneity. Our classifier outperforms other types of classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation, CGIs, co-localized DNase I hypersensitive sites, transcription factor binding sites, and histone modifications were found to be most predictive of methylation levels.

Conclusions: Our observations of DNA methylation patterns led us to develop a classifier to predict DNA methylation levels at CpG site resolution with high accuracy. Furthermore, our method identified genomic features that interact with DNA methylation, suggesting mechanisms involved in DNA methylation modification and regulation, and linking diverse epigenetic processes.
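The Results paragraph lists neighboring CpG site methylation levels and genomic distance among the classifier's features. The following is a minimal sketch of how such features might be assembled for one site, not the authors' implementation. The names `NeighborFeatures` and `neighbor_features` are hypothetical, and the sketch assumes the sites of a single chromosome are sorted by position and that methylation is given as a per-site value between 0 and 1.

```python
from typing import NamedTuple, Optional, Sequence


class NeighborFeatures(NamedTuple):
    """Neighbor-derived features for one CpG site (hypothetical layout)."""
    upstream_beta: Optional[float]
    upstream_distance: Optional[int]
    downstream_beta: Optional[float]
    downstream_distance: Optional[int]


def neighbor_features(
    positions: Sequence[int], betas: Sequence[float], i: int
) -> NeighborFeatures:
    """Return methylation level and distance of the nearest assayed neighbors.

    Assumes `positions` is sorted ascending and covers one chromosome only.
    A missing neighbor (first or last site) is reported as None.
    """
    if len(positions) != len(betas):
        raise ValueError("positions and betas must have the same length")
    if not 0 <= i < len(positions):
        raise IndexError("site index out of range")

    up_beta = up_dist = down_beta = down_dist = None
    if i > 0:
        # Nearest assayed site with a smaller coordinate.
        up_beta = betas[i - 1]
        up_dist = positions[i] - positions[i - 1]
    if i < len(positions) - 1:
        # Nearest assayed site with a larger coordinate.
        down_beta = betas[i + 1]
        down_dist = positions[i + 1] - positions[i]
    return NeighborFeatures(up_beta, up_dist, down_beta, down_dist)
```

Feature rows built this way could then be combined with annotation features (CGI membership, DHS sites, TFBSs, histone marks) before training a classifier; how the paper encodes those annotations is not stated in this excerpt.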



Fig4: Correlation matrix of prediction features with first ten PCs of methylation levels. The x-axis corresponds to one of the 122 features; the y-axis represents PCs 1 through 10. Colors correspond to Pearson’s correlation, as shown in the legend. PC, principal component.

Mentions: To quantify the amount of variation in DNA methylation explained by genomic context, we considered the correlation between genomic context and principal components (PCs) of methylation levels across all 100 samples (Figure 4). We found that many of the features derived from a CpG site’s genomic context appear to be correlated with the first principal component (PC1). The methylation status of upstream and downstream neighboring CpG sites and a co-localized DNase I hypersensitive (DHS) site are the most highly correlated features, with Pearson’s correlation r = [0.58, 0.59] (P < 2.2 × 10^−16). Ten genomic features have correlation r > 0.5 (P < 2.2 × 10^−16) with PC1, including co-localized active TFBSs ELF1 (ETS-related transcription factor 1), MAZ (Myc-associated zinc finger protein), MXI1 (MAX-interacting protein 1) and RUNX3 (Runt-related transcription factor 3), and co-localized histone modification trimethylation of histone H3 at lysine 4 (H3K4me3), suggesting that they may be useful in predicting DNA methylation status (Additional file 1: Figure S3). That said, the features themselves are well correlated; for example, active TFBS ELF1 is highly enriched within DHS sites (r = 0.67, P < 2.2 × 10^−16) [53,54].
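The analysis above correlates per-site genomic features with principal components of the methylation matrix. A rough sketch of one way to compute such a correlation matrix follows; it is an illustration under stated assumptions, not the authors' code. It assumes a sites × samples methylation matrix, a sites × features matrix, and PCs taken as per-site scores from an SVD of the sample-centered matrix. The function name `feature_pc_correlations` is hypothetical.

```python
import numpy as np


def feature_pc_correlations(
    methylation: np.ndarray, features: np.ndarray, n_pcs: int = 10
) -> np.ndarray:
    """Pearson correlation between each PC (rows) and each feature (columns).

    methylation: array of shape (n_sites, n_samples)
    features:    array of shape (n_sites, n_features)
    Assumes every feature and every retained PC has non-zero variance,
    and that n_pcs <= min(n_sites, n_samples).
    """
    # Center each sample (column) so the SVD yields principal components.
    centered = methylation - methylation.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)

    # Per-site scores on the leading PCs.
    scores = u[:, :n_pcs] * s[:n_pcs]

    # Standardize scores and features, then average the products:
    # this is Pearson's r for every (PC, feature) pair at once.
    z_scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    z_feats = (features - features.mean(axis=0)) / features.std(axis=0)
    return z_scores.T @ z_feats / methylation.shape[0]
```

The sign of each PC is arbitrary, so the sign of a row in the result carries no meaning on its own; the magnitude is the quantity comparable to the r values quoted above.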

