Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements.

Zhang W, Spector TD, Deloukas P, Bell JT, Engelhardt BE - Genome Biol. (2015)

Bottom Line: The accuracy increases to 98% when restricted to CpG sites within CGIs and is robust across platform and cell-type heterogeneity. Our classifier outperforms other types of classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation, CGIs, co-localized DNase I hypersensitive sites, transcription factor binding sites, and histone modifications were found to be most predictive of methylation levels. Furthermore, our method identified genomic features that interact with DNA methylation, suggesting mechanisms involved in DNA methylation modification and regulation, and linking diverse epigenetic processes.


Affiliation: Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA. wz31@duke.edu.

ABSTRACT

Background: Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is critical to enable genome-wide analyses, but current approaches tackle average methylation within a locus and are often limited to specific genomic regions.

Results: We characterize genome-wide DNA methylation patterns, and show that correlation among CpG sites decays rapidly, making predictions solely based on neighboring sites challenging. We built a random forest classifier to predict methylation levels at CpG site resolution using features including neighboring CpG site methylation levels and genomic distance, co-localization with coding regions, CpG islands (CGIs), and regulatory elements from the ENCODE project. Our approach achieves 92% prediction accuracy of genome-wide methylation levels at single-CpG-site precision. The accuracy increases to 98% when restricted to CpG sites within CGIs and is robust across platform and cell-type heterogeneity. Our classifier outperforms other types of classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation, CGIs, co-localized DNase I hypersensitive sites, transcription factor binding sites, and histone modifications were found to be most predictive of methylation levels.

Conclusions: Our observations of DNA methylation patterns led us to develop a classifier to predict DNA methylation levels at CpG site resolution with high accuracy. Furthermore, our method identified genomic features that interact with DNA methylation, suggesting mechanisms involved in DNA methylation modification and regulation, and linking diverse epigenetic processes.
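To make the notion of "prediction accuracy at single-CpG-site precision" concrete, the following minimal sketch scores a naive nearest-neighbor baseline on toy data. It is not the authors' random forest classifier. The 0.5 threshold used to turn a β value into a binary methylation status, the function names, and the toy positions and β values are all assumptions made for illustration.

```python
def methylation_status(beta, threshold=0.5):
    """Binarize a beta value: 1 = methylated, 0 = unmethylated.

    The 0.5 threshold is an assumption for this sketch.
    """
    return 1 if beta >= threshold else 0


def predict_from_nearest_neighbor(positions, betas, index):
    """Predict the status of one CpG site from its closest other site.

    Ties in genomic distance are resolved in favor of the site that
    appears first in the input.
    """
    candidates = [i for i in range(len(positions)) if i != index]
    nearest = min(candidates, key=lambda i: abs(positions[i] - positions[index]))
    return methylation_status(betas[nearest])


def baseline_accuracy(positions, betas):
    """Fraction of sites whose status is correctly predicted by the baseline."""
    correct = 0
    for i in range(len(positions)):
        predicted = predict_from_nearest_neighbor(positions, betas, i)
        if predicted == methylation_status(betas[i]):
            correct += 1
    return correct / len(positions)


# Hypothetical toy data: genomic positions (bp) and beta values for one individual.
toy_positions = [100, 150, 900, 950, 1000]
toy_betas = [0.1, 0.2, 0.9, 0.8, 0.3]
toy_accuracy = baseline_accuracy(toy_positions, toy_betas)
print(f"baseline accuracy: {toy_accuracy:.2f}")
```

In the toy data the last site differs in status from its nearest neighbor, which is the kind of case where a predictor relying only on neighboring sites fails and additional genomic features could help.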


© Copyright Policy - open-access

Figure 3: Methylation levels, correlation within CGI. Since each CGI is a different length, each CGI was split into 40 equal-sized windows, and methylation levels and correlation were averaged within each window. (A) The x-axis is the mean β value within a window in the CGI, CGI shore, or CGI shelf regions across all sites in all individuals with a window size of 100 bp. (B) Methylation values of each CpG site in a CGI, CGI shore, or CGI shelf were compared with all other sites in the same CGI using MED. The x-axis and y-axis represent the genomic position of each CGI with a scale of 1:100, i.e., one unit in the matrix represents 100 bp distance. The MED of each unit cell was calculated for all pair-wise CpG sites corresponding to that matrix position and averaged over the 100 individuals. bp, base pair; CGI, CpG island; MED, mean squared Euclidean distance.
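The windowing step described in the caption can be sketched as follows: a region of arbitrary length is divided into a fixed number of equal-sized windows, and the β values of the CpG sites falling in each window are averaged. This is a minimal illustration under stated assumptions, not the authors' code. The function name, the half-open coordinate convention, and the toy values are hypothetical.

```python
def window_means(start, end, sites, n_windows=40):
    """Average beta values within equal-sized windows of one region.

    start, end : region boundaries in bp (half-open interval [start, end))
    sites      : iterable of (position, beta) pairs
    Returns a list of length n_windows; a window with no sites is None.
    """
    if end <= start or n_windows <= 0:
        raise ValueError("region must be non-empty and n_windows positive")
    width = (end - start) / n_windows
    sums = [0.0] * n_windows
    counts = [0] * n_windows
    for position, beta in sites:
        if not (start <= position < end):
            continue  # ignore sites outside the region
        # Clamp guards against floating-point rounding at the right edge.
        index = min(int((position - start) / width), n_windows - 1)
        sums[index] += beta
        counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical 400 bp region split into 4 windows to keep the output short.
toy_sites = [(10, 0.25), (50, 0.75), (150, 0.5), (399, 1.0)]
toy_profile = window_means(0, 400, toy_sites, n_windows=4)
print(toy_profile)
```

Because every region is mapped to the same number of windows, profiles from CGIs of different lengths can be averaged window by window.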

Mentions: As we observed that methylation patterns at neighboring CpG sites depend on genomic content, we further investigated methylation patterns within CGIs, CGI shores, and CGI shelves. Methylation levels at CGIs and CGI shelves were fairly constant genome-wide and across individuals (CGIs are hypomethylated and CGI shelves are hypermethylated), but CGI shores exhibit a reproducible but drastic pattern of change (Figure 3A). CpG sites in CGI shores have a monotonically increasing pattern of methylation status from CGIs towards CGI shelves, and this pattern is symmetric in the CGI shores upstream and downstream of CGIs. We examined the MED between methylation status for pairs of CpG sites in these regions, and we found that MED within the CGI and within the CGI shelves is low, consistent with the variance we observed within DNA methylation profiles in these regions (Figure 3B). Additionally, we found that the MED between CpG sites in the shelves appears to increase as the sites are further away from the CGI on the shelf, suggesting a circular dependency in CpG site methylation across the ends of the shelf sequences. It is interesting that the CpG sites in the shore regions are substantially more predictive of CpG sites in the shelf regions than those in the CGI regions, although this may indicate a less precise delineation of the shore and shelf regions relative to the CGI and CGI shore delineation.
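As a rough illustration of the MED comparison, the sketch below computes the mean squared Euclidean distance between two CpG sites, taken here as the squared difference of their methylation values averaged over individuals, and builds the pairwise matrix for a small set of sites. The function names and toy values are hypothetical, and the exact normalization used by the authors is an assumption.

```python
def med(site_a, site_b):
    """Mean squared Euclidean distance between two CpG sites.

    site_a, site_b : methylation values for the same individuals, in the
    same order. Returns the mean over individuals of the squared difference.
    """
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("sites must have the same non-zero number of individuals")
    return sum((a - b) ** 2 for a, b in zip(site_a, site_b)) / len(site_a)


def pairwise_med(sites):
    """Symmetric matrix of MED values for every pair of sites."""
    n = len(sites)
    return [[med(sites[i], sites[j]) for j in range(n)] for i in range(n)]


# Hypothetical values: three CpG sites measured in two individuals.
toy_sites = [
    [0.0, 1.0],  # site 1
    [1.0, 1.0],  # site 2
    [0.0, 1.0],  # site 3, identical to site 1
]
toy_matrix = pairwise_med(toy_sites)
for row in toy_matrix:
    print(row)
```

A low MED between two sites means their methylation values track each other closely across individuals, which is the sense in which sites within a CGI or within a shelf are reported as similar.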

