A statistical model for the analysis of beta values in DNA methylation studies

ABSTRACT

Background: The analysis of DNA methylation is a key component in the development of personalized treatment approaches. A common way to measure DNA methylation is the calculation of beta values, which are bounded variables of the form M/(M+U) that are generated by Illumina’s 450k BeadChip array. The statistical analysis of beta values is considered to be challenging, as traditional methods for the analysis of bounded variables, such as M-value regression and beta regression, are based on regularity assumptions that are often too strong to adequately describe the distribution of beta values.
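To make the definition concrete, the following minimal Python sketch computes a beta value from a pair of signal intensities, together with the corresponding M-value. The function names and the intensities are illustrative assumptions. The M-value is taken here as the base-2 logit of the beta value, and any offset that array software may add to the denominator is omitted.

```python
import math

def beta_value(m, u):
    """Beta value M / (M + U) from methylated (m) and unmethylated (u) intensities."""
    return m / (m + u)

def m_value(beta):
    """M-value, assumed here to be the base-2 logit of the beta value."""
    return math.log2(beta / (1.0 - beta))

# Hypothetical intensities, chosen only for illustration.
b = beta_value(3000.0, 1000.0)
mv = m_value(b)
print(b, mv)
```

The beta value is confined to the interval (0, 1), whereas the M-value is unbounded, which is why M-value regression can rely on Gaussian-type models.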

Results: We develop a statistical model for the analysis of beta values that is derived from a bivariate gamma distribution for the signal intensities M and U. By allowing for possible correlations between M and U, the proposed model explicitly takes into account the data-generating process underlying the calculation of beta values. Using simulated data and a real sample of DNA methylation data from the Heinz Nixdorf Recall cohort study, we demonstrate that the proposed model fits our data significantly better than beta regression and M-value regression.
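The data-generating view can be illustrated with a small simulation. The sketch below covers only the special case of independent gamma-distributed intensities with a common scale, for which M/(M+U) follows a beta distribution. It does not implement the correlated bivariate gamma model proposed in the article, and the shape parameters, sample size and function name are assumptions made for illustration.

```python
import random
import statistics

def simulate_beta_values(shape_m, shape_u, n, seed=0):
    """Draw n beta values M / (M + U) from independent gamma intensities.

    Both intensities share the scale 1.0, so the ratio is Beta(shape_m, shape_u)
    distributed. Correlation between M and U is NOT modelled here.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        m = rng.gammavariate(shape_m, 1.0)  # methylated signal intensity
        u = rng.gammavariate(shape_u, 1.0)  # unmethylated signal intensity
        values.append(m / (m + u))
    return values

# Hypothetical shape parameters; the theoretical mean is 2 / (2 + 6) = 0.25.
sim = simulate_beta_values(2.0, 6.0, 20000)
sim_mean = statistics.fmean(sim)
print(round(sim_mean, 3))
```

Once M and U are allowed to be correlated, the ratio is in general no longer beta distributed, which is the situation the proposed model is designed to handle.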

Conclusion: The proposed model contributes to an improved identification of associations between beta values and covariates such as clinical variables and lifestyle factors in epigenome-wide association studies. It is as easy to apply to a sample of beta values as beta regression and M-value regression.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1347-4) contains supplementary material, which is available to authorized users.

Fig. 5: First part of the analysis of the HNR Study data. The boxplots show the average log-score differences obtained from beta regression, M-value regression and the RCG model. The left panel refers to the full models with five covariates, whereas the right panel refers to the covariate-free models. At each of the 429,750 CpG sites, models were fitted to ten randomly sampled learning data sets of size n=750 each. Log-scores were calculated by evaluating the model fits on the respective independent test data sets (n=368). The boxplots refer to the 429,750 averages of the ten log-score differences.

The average log-score differences obtained from beta regression, M-value regression and the RCG model are shown in Fig. 5. The RCG model fitted the HNR Study data systematically better than beta and M-value regression (P-values of Wilcoxon signed rank tests <0.001). This result was obtained for both the full model containing all five covariates (left panel of Fig. 5) and the covariate-free model (right panel of Fig. 5).
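The evaluation scheme behind Fig. 5 can be sketched for a single CpG site as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the data are simulated, only covariate-free models are compared, the beta distribution is fitted by the method of moments rather than maximum likelihood, the log-score is taken as the mean predictive log-density on the test set, and the RCG model itself is not included. All function names are hypothetical.

```python
import math
import random
import statistics

def beta_logpdf(b, a, c):
    """Log-density of a Beta(a, c) distribution at b in (0, 1)."""
    log_norm = math.lgamma(a + c) - math.lgamma(a) - math.lgamma(c)
    return log_norm + (a - 1.0) * math.log(b) + (c - 1.0) * math.log(1.0 - b)

def fit_beta_moments(train):
    """Method-of-moments beta fit (a stand-in for maximum likelihood)."""
    mean = statistics.fmean(train)
    var = statistics.variance(train)
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def m_transform(b):
    """Base-2 logit of a beta value."""
    return math.log2(b / (1.0 - b))

def mvalue_logpdf(b, mu, sigma):
    """Log-density on the beta-value scale implied by a normal model for M-values."""
    z = (m_transform(b) - mu) / sigma
    log_normal = -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    # Jacobian of the base-2 logit: dm/db = 1 / (ln(2) * b * (1 - b)).
    log_jacobian = -math.log(math.log(2.0)) - math.log(b) - math.log(1.0 - b)
    return log_normal + log_jacobian

def log_score_differences(values, n_train=750, n_splits=10, seed=1):
    """Beta-model minus M-value-model log-score for each random split."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_splits):
        shuffled = list(values)
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        a, c = fit_beta_moments(train)
        m_train = [m_transform(b) for b in train]
        mu, sigma = statistics.fmean(m_train), statistics.stdev(m_train)
        score_beta = statistics.fmean([beta_logpdf(b, a, c) for b in test])
        score_mval = statistics.fmean([mvalue_logpdf(b, mu, sigma) for b in test])
        diffs.append(score_beta - score_mval)
    return diffs

# Simulated beta values for one site: 750 learning + 368 test observations.
data_rng = random.Random(0)
sample = [data_rng.betavariate(2.0, 5.0) for _ in range(750 + 368)]
diffs = log_score_differences(sample)
print(statistics.fmean(diffs))  # average of the ten log-score differences
```

Evaluating both models as densities on the beta-value scale makes their log-scores directly comparable. In the study, one such average was obtained for each of the 429,750 CpG sites, and the resulting averages were summarized by boxplots and compared with Wilcoxon signed rank tests.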

