Comparison of different cell type correction methods for genome-scale epigenetics studies


ABSTRACT

Background: Whole blood is frequently used in genome-wide association studies of DNA methylation patterns in relation to environmental exposures or clinical outcomes. These associations can be confounded by cellular heterogeneity. Algorithms have been developed to measure or adjust for this heterogeneity, and some have been compared in the literature. However, with new methods available, it is unknown whether their findings are consistent and, if not, which method(s) perform better.

Methods: We compared eight cell-type correction methods: the method in the minfi R package, the method by Houseman et al., the removing unwanted variation (RUV) approach, the methods in the FaST-LMM-EWASher, ReFACTor, RefFreeEWAS, and RefFreeCellMix R programs, and one approach utilizing surrogate variables (SVA). We first evaluated the association of DNA methylation at each CpG across the whole genome with prenatal arsenic exposure levels and with cancer status, adjusted for estimated cell-type information obtained from the different methods. We then compared the CpGs showing statistical significance under the different approaches. For the methods implemented in minfi and proposed by Houseman et al., we utilized homogeneous data for which the composition of some blood cells was available and compared the measured compositions with the estimated cell compositions. Finally, for methods not explicitly estimating cell compositions, we evaluated their performance using simulated DNA methylation data with a set of latent variables representing “cell types”.

Results: Results from the SVA-based method overall showed the highest agreement with all other methods except for FaST-LMM-EWASher. Using homogeneous data, minfi provided better estimates of cell types than the originally proposed method by Houseman et al. Further simulation studies on methods free of reference data revealed that SVA provided good sensitivities and specificities, RefFreeCellMix in general produced high sensitivities but its specificities tended to be low when confounding was present, and FaST-LMM-EWASher gave the lowest sensitivity but the highest specificity.

Conclusions: Results from real data and simulations indicated that SVA is recommended when the focus is on the identification of informative CpGs. When appropriate reference data are available, the method implemented in the minfi package is recommended. However, if no such reference data are available or if the focus is not on estimating cell proportions, the SVA method is suggested.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1611-2) contains supplementary material, which is available to authorized users.



Fig. 1: Venn diagram illustrating the overlap of identified CpG sites that were associated with prenatal arsenic exposure at an FDR level of 0.05 after incorporating cell type compositions estimated by different methods, for the association study of prenatal arsenic exposure with DNA methylation. Results from Houseman et al., minfi, RefFreeEWAS, and SVA as well as the analyses without adjusting for cell types are displayed (results from other methods are in the text). “UN”: results from an analysis without adjusting for cell type compositions.

Via linear regressions, we assessed the association of DNA methylation at each CpG site across the whole genome with prenatal urinary arsenic exposure levels (a continuous measure), adjusting for cell-type effects with cell-type information inferred from one of the eight methods. For each method, we recorded the number of CpGs showing statistically significant associations with prenatal urinary arsenic exposure after adjusting for multiple testing by controlling the false discovery rate (FDR) at 0.05. ReFACTor identified the largest number of CpGs (~60,000), and no CpGs were detected by FaST-LMM-EWASher (Table 1). RefFreeCellMix also identified a large number of CpGs (~3000). SVA and RefFreeEWAS detected more CpGs than the remaining methods (Table 1). Next, we assessed the number of identified CpGs that overlapped between different methods. The diagram in Fig. 1 shows the overlap of CpG sites from four approaches (Houseman et al., minfi, RefFreeEWAS, and SVA) as well as the analyses without adjusting for cell types (for clarity, we did not include all eight methods in this Venn diagram). Results from SVA showed the best agreement with findings from the other four analyses (Fig. 1). Two CpG sites, cg06434480 and cg10662395, were common to all five analyses shown in Fig. 1. Further comparisons indicated that CpG site cg10662395 was also identified by RefFreeCellMix and RUV, and was the only CpG site to overlap among all seven analyses (Houseman et al., minfi, RefFreeEWAS, SVA, RefFreeCellMix and RUV, as well as the analyses without adjusting for cell types). Although ReFACTor identified the largest number of CpGs, these did not overlap with the joint findings from the aforementioned seven analyses. Overall, CpGs identified via SVA overlapped with those from the Houseman et al. method, minfi and RefFreeEWAS (p-value < 0.0001; Table 1, Fig. 1). The definition of percentage overlap is given in the Methods section.
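The multiple-testing step described above (controlling the FDR at 0.05 across per-CpG p-values) can be sketched as follows. This is a minimal illustration that assumes the Benjamini-Hochberg step-up procedure; the excerpt does not name the FDR procedure the authors used. The function name `benjamini_hochberg`, the CpG labels, and the p-values are hypothetical and are not results from the study.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the IDs declared significant under Benjamini-Hochberg FDR control.

    pvalues: dict mapping a CpG ID to its regression p-value.
    alpha:   target false discovery rate (0.05 in the text above).
    """
    # Sort CpG IDs by ascending p-value.
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    m = len(ranked)
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * alpha:
            cutoff_rank = rank
    # Every CpG ranked at or before that cutoff is declared significant,
    # even if its own p-value missed its own threshold (see cpg_c below).
    return {cpg for cpg, _ in ranked[:cutoff_rank]}


# Made-up p-values for illustration only (assumption, not study data).
toy_pvalues = {
    "cpg_a": 0.001,
    "cpg_b": 0.008,
    "cpg_c": 0.035,
    "cpg_d": 0.039,
    "cpg_e": 0.60,
}
significant = benjamini_hochberg(toy_pvalues, alpha=0.05)
print(sorted(significant))
```

In this toy input, `cpg_c` exceeds its own rank threshold (0.035 > 0.03) but is still retained because `cpg_d` passes at rank 4, which is the step-up behaviour of the procedure. In practice the p-values would come from the per-CpG regressions adjusted for the cell-type covariates of one method.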
One of the two CpGs (cg06434480 and cg10662395), cg06434480, is located within 200 base pairs of the transcription start site of the gene HMGCR (3-hydroxy-3-methylglutaryl-CoA reductase), which is known to be associated with inorganic arsenic exposure [15]. In a study conducted in humans, mono-methylated arsenic (MMA) was found to downregulate the expression of HMGCR, a gene involved in cholesterol biosynthesis [16]. The other CpG, cg10662395, is located in the body region of the gene HCN2 (hyperpolarization activated cyclic nucleotide gated potassium channel 2). This gene was not found to be directly associated with arsenic exposure in the literature, but HCN2 is known to regulate pacemaker activity in the heart and the brain of mice and humans [17, 18]. Arsenic has been found to induce prolongation of the QT interval (i.e., the time from the initial deflection of the QRS complex to the end of the T wave), probably by altering potassium ion channels [19].
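The between-method comparison described above (which CpGs are shared by all analyses, and how many each pair of analyses has in common) can be sketched with plain set operations. This is an illustrative sketch only: the function names, method labels, and CpG IDs are hypothetical, and it reports raw shared counts rather than the percentage overlap, whose definition is in the Methods section and is not reproduced here.

```python
from itertools import combinations


def shared_by_all(hits_by_method):
    """Return the CpG IDs identified by every method."""
    sets = list(hits_by_method.values())
    return set.intersection(*sets) if sets else set()


def pairwise_shared_counts(hits_by_method):
    """Return {(method_a, method_b): number of CpG IDs both identified}."""
    return {
        (a, b): len(hits_by_method[a] & hits_by_method[b])
        for a, b in combinations(sorted(hits_by_method), 2)
    }


# Invented significant-CpG sets for illustration (assumption, not study data).
toy_hits = {
    "method_A": {"cpg_1", "cpg_2", "cpg_3"},
    "method_B": {"cpg_2", "cpg_3", "cpg_4"},
    "method_C": {"cpg_3", "cpg_5"},
}

common = shared_by_all(toy_hits)
counts = pairwise_shared_counts(toy_hits)
print(sorted(common))
print(counts)
```

Each set would hold the CpGs passing the FDR threshold under one cell-type adjustment; the intersection corresponds to the central region of a Venn diagram such as Fig. 1.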

