Reference-free cell mixture adjustments in analysis of DNA methylation data.

Houseman EA, Molitor J, Marsit CJ - Bioinformatics (2014)

Affiliation: School of Biological and Population Health Sciences, College of Public Health and Human Sciences, Oregon State University, Corvallis, OR 97331, USA and Section of Biostatistics and Epidemiology, Department of Community and Family Medicine, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA.

ABSTRACT

Motivation: Recently there has been increasing interest in the effects of cell mixture on the measurement of DNA methylation, specifically the extent to which small perturbations in cell mixture proportions can register as changes in DNA methylation. A recently published set of statistical methods exploits this association to infer changes in cell mixture proportions, and these methods are presently being applied to adjust for cell mixture effect in the context of epigenome-wide association studies. However, these adjustments require the existence of reference datasets, which may be laborious or expensive to collect. For some tissues such as placenta, saliva, adipose or tumor tissue, the relevant underlying cell types may not be known.

Results: We propose a method for conducting epigenome-wide association study analyses when a reference dataset is unavailable, including a bootstrap method for estimating standard errors. We demonstrate via simulation study and several real data analyses that our proposed method can perform as well as or better than methods that make explicit use of reference datasets. In particular, it may adjust for detailed cell type differences that may be unavailable even in existing reference datasets.

Availability and implementation: Software is available in the R package RefFreeEWAS. Data for three of four examples were obtained from Gene Expression Omnibus (GEO), accession numbers GSE37008, GSE42861 and GSE30601, while reference data were obtained from GEO accession number GSE39981.

Contact: andres.houseman@oregonstate.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Figure 4 (btu029-F4). Arthritis dataset: reference-based versus reference-free. Comparison of reference-free coefficient estimates with the corresponding reference-based estimates.

We demonstrate our proposed methodology on four datasets. The first consists of data from the Illumina Infinium HumanMethylation450 BeadChip array, to which bisulfite-converted DNA from whole blood was hybridized; the data were obtained from Gene Expression Omnibus (GEO), accession number GSE42861, and comprised n = 689 subjects: 354 rheumatoid arthritis patients (cases) and 335 normal controls, originally published by Liu et al. (2013). Using the data available on GEO, we obtained four different estimates of the case–control difference in DNA methylation, measured on the beta scale, at 384 410 autosomal CpGs whose Infinium probes contained no single nucleotide polymorphism (SNP) and had no SNP at a flanking G site (‘non-SNP CpG sites’). In the unadjusted analysis, we simply applied the limma procedure (Smyth, 2004), with a design matrix consisting of an intercept and an indicator variable for case status. In the reference-based analysis, we first applied the method of Houseman et al. (2012) to infer leukocyte proportions from the 387 CpG sites shared between the non-SNP CpG sites on the HumanMethylation450 array and the 500 publicly available leukocyte differentially methylated regions (the full Illumina Infinium HumanMethylation27 dataset is available on GEO, accession number GSE39981); we then used limma to estimate case–control differences adjusted for leukocyte type using five of the six available types: B-cell, CD4+ T, CD8+ T, granulocyte and NK (monocyte proportions were dropped to avoid an ill-conditioned design matrix). In the SVA-adjusted approach, we used the R package sva (version 3.6.0) both to estimate the number of surrogate variables and to determine the surrogate variables themselves. Using the method of Buja and Eyuboglu (1992) as implemented in sva, we found d = 53 surrogate variables; after estimating them, we adjusted for them using limma in a manner similar to the previous analysis.
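The contrast between the unadjusted and reference-based design matrices can be sketched on simulated data. This is a minimal illustration, not the authors' pipeline: it uses ordinary least squares per CpG in NumPy rather than limma's moderated statistics, and the sample sizes, Dirichlet parameters, cell-type profiles and noise level are all invented for the example. Methylation here depends only on cell proportions, so any case coefficient in the unadjusted fit is pure confounding by cell mixture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_cpgs = 200, 100
case = np.repeat([0.0, 1.0], n_samples // 2)  # 100 controls, then 100 cases

# Hypothetical leukocyte proportions; cases are shifted toward granulocytes.
cell_types = ["B-cell", "CD4+ T", "CD8+ T", "granulocyte", "NK", "monocyte"]
alpha_control = 5 * np.array([2.0, 4.0, 2.0, 12.0, 1.0, 2.0])
alpha_case = 5 * np.array([2.0, 3.0, 2.0, 14.0, 1.0, 2.0])
props = np.vstack([rng.dirichlet(alpha_case if c else alpha_control) for c in case])

# Simulated beta values: a mixture of cell-type profiles plus noise,
# with NO direct effect of case status.
profiles = rng.uniform(0, 1, size=(n_cpgs, len(cell_types)))
beta = profiles @ props.T + rng.normal(0, 0.02, size=(n_cpgs, n_samples))

def case_coefficients(beta, design):
    """Per-CpG OLS fit; returns the coefficient of design column 1 (case)."""
    coef, *_ = np.linalg.lstsq(design, beta.T, rcond=None)
    return coef[1]

intercept = np.ones(n_samples)
unadjusted_design = np.column_stack([intercept, case])
# All six proportions sum to one, duplicating the intercept, so one type
# (monocytes, the last column) is dropped from the adjusted design.
full_design = np.column_stack([intercept, case, props])
reduced_design = np.column_stack([intercept, case, props[:, :5]])

unadjusted = case_coefficients(beta, unadjusted_design)
adjusted = case_coefficients(beta, reduced_design)

print("rank of design with six types:", np.linalg.matrix_rank(full_design),
      "of", full_design.shape[1], "columns")
print("mean |case coefficient|, unadjusted:", np.abs(unadjusted).mean())
print("mean |case coefficient|, adjusted:  ", np.abs(adjusted).mean())
```

In this simulation the six-type design is exactly rank deficient because the simulated proportions sum to one; with estimated proportions the sum is only approximately one, which would make the design ill-conditioned rather than singular.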
Finally, we applied our proposed reference-free analysis, using the same design matrix as in the unadjusted analysis. In this analysis, the latent variable dimension was estimated to be 37 by the RMT method of Teschendorff et al. (2011); because simulations suggest that the RMT method estimates dimension accurately, we used d = 37. We used 500 bootstrap samples for inference. The DNA methylation dataset available on GEO contains no missing values. Figure 3 shows volcano plots of the arthritis case coefficient for the three different analyses, demonstrating diminished significance for both adjusted analyses shown in the figure (reference-based and reference-free). Interestingly, significance for the reference-free analysis is diminished relative to the reference-based analysis, suggesting that the six leukocyte types profiled by Houseman et al. (2012) and available in GEO accession GSE39981 may be insufficient for analysis of blood data. This interpretation is further reinforced by Figure 4, which plots reference-free coefficient estimates against their corresponding reference-based estimates. There is general agreement between methods for coefficients with larger magnitude, except for a single CpG whose magnitude appears larger by the reference-based method than by the reference-free method; in general, for the relatively null CpGs, the reference-based method produces estimates of larger magnitude than the reference-free method. Supplementary Material (Section VI) provides volcano and scatter plots similar to those shown in Figures 3 and 4, but showing the SVA results. In addition, Supplementary Figure S9 compares RMSE between our reference-free approach and SVA, with the 500-DMR reference-based analysis used as a presumed gold standard. In general, the SVA results were dissimilar from both of the other adjusted analyses, with greater significance relative to both, and more similarity with the unadjusted analysis.
This suggests inadequate adjustment by SVA. SVA results with d = 37 were similar to those obtained using d = 53 (data not shown in detail, but a summary appears in Supplementary Fig. S9). For each of the four analysis types, Table 3 summarizes overall significance as measured by q-value methodology (Storey et al., 2004) implemented in the R package qvalue, including an estimate of π0, the proportion of nulls, as well as the number of CpG sites whose q-value fell below the significance threshold. Interestingly, the unadjusted and reference-based analyses produced about the same estimated proportion of null CpGs as well as a relatively large number of CpGs below the q-value threshold. Despite the apparently increased significance shown in the volcano plot appearing in the Supplementary Material, the SVA-adjusted approach produced a higher value of π0 and fewer CpGs (though still a substantial number) below the threshold. The reference-free analysis produced a much higher estimated value of π0 and no CpGs below the threshold. Although there were no significant q-values after reference-free adjustment, an omnibus test of significance proposed in Section V of the Supplementary Material indicates overall significance even after reference-free adjustment. As mentioned in Leek and Storey (2007), q-values can be misleading when applied to multiple correlated tests.
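For readers unfamiliar with the q-value summary, the following sketch shows how an estimate of the null proportion (π0) and the q-values can be computed from a vector of P-values. It is a simplified stand-in for the R package qvalue, not a port of it: it assumes a single fixed tuning parameter `lam = 0.5` for the π0 estimate, whereas qvalue by default smooths over a grid of values. The simulated P-values and the function name `storey_qvalues` are hypothetical.

```python
import numpy as np

def storey_qvalues(p, lam=0.5):
    """Return (pi0, q) for P-values p, using a fixed-lambda pi0 estimate."""
    p = np.asarray(p, dtype=float)
    m = p.size
    # Null P-values are uniform, so the fraction above lam estimates
    # pi0 * (1 - lam); cap the estimate at 1.
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))
    order = np.argsort(p)
    # Estimated FDR when rejecting the i smallest P-values: pi0 * m * p_(i) / i.
    ranked = pi0 * m * p[order] / np.arange(1, m + 1)
    # Enforce monotonicity: q_(i) is the minimum over all ranks j >= i.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return pi0, q

rng = np.random.default_rng(1)
m = 10_000
p_null = rng.uniform(size=m)                       # every test is null
p_mixed = np.concatenate([rng.uniform(size=8_000),            # 80% null
                          rng.beta(0.1, 1.0, size=2_000)])    # 20% with signal

pi0_null, q_null = storey_qvalues(p_null)
pi0_mixed, q_mixed = storey_qvalues(p_mixed)
print("pi0 (all null):", pi0_null)
print("pi0 (80% null):", pi0_mixed)
```

A π0 estimate near one, as in the all-null simulation, corresponds to the situation the text describes for the reference-free analysis; as the text notes, q-values computed this way assume the tests are not strongly correlated.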