Limits...
Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions.

Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, Nelson HH, Wiemels J, Zheng S, Wiencke JK, Kelsey KT - BMC Bioinformatics (2008)

Bottom Line: Arrays are now being used to study DNA methylation at a large number of loci; for example, the Illumina GoldenGate platform assesses DNA methylation at 1505 loci associated with over 800 cancer-related genes.We demonstrate our method on the normal tissue samples and show that the clusters are associated with tissue type as well as age.Our proposed recursively-partitioned mixture model is an effective and computationally efficient method for clustering DNA methylation data.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, 02115, USA. ahousema@hsph.harvard.edu

ABSTRACT

Background: Epigenetics is the study of heritable changes in gene function that cannot be explained by changes in DNA sequence. One of the most commonly studied epigenetic alterations is cytosine methylation, which is a well recognized mechanism of epigenetic gene silencing and often occurs at tumor suppressor gene loci in human cancer. Arrays are now being used to study DNA methylation at a large number of loci; for example, the Illumina GoldenGate platform assesses DNA methylation at 1505 loci associated with over 800 cancer-related genes. Model-based cluster analysis is often used to identify DNA methylation subgroups in data, but it is unclear how to cluster DNA methylation data from arrays in a scalable and reliable manner.

Results: We propose a novel model-based recursive-partitioning algorithm to navigate clusters in a beta mixture model. We present simulations that show that the method is more reliable than competing nonparametric clustering approaches, and is at least as reliable as conventional mixture model methods. We also show that our proposed method is more computationally efficient than conventional mixture model approaches. We demonstrate our method on the normal tissue samples and show that the clusters are associated with tissue type as well as age.

Conclusion: Our proposed recursively-partitioned mixture model is an effective and computationally efficient method for clustering DNA methylation data.

Show MeSH

Related in: MedlinePlus

Examples of simulated data. Yellow = 1.0, black = 0.5, blue = 0.0. True classes indicated and separated by yellow dividing line. Height of region indicates the relative number of subjects in each class.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2553421&req=5

Figure 3: Examples of simulated data. Yellow = 1.0, black = 0.5, blue = 0.0. True classes indicated and separated by yellow dividing line. Height of region indicates the relative number of subjects in each class.

Mentions: We conducted simulations to compare the properties of our proposed method with similar competing methods. For our first case (Case I), each simulated data set consisted of n = 100 subjects, each having 1413 continuous responses lying in the unit interval. Each subject was a member of one of 5 classes, each class occurring with 0.2 probability. The classes were defined by beta-distribution parameters for each of L = 1413 methylation loci that were autosomal and passed quality-assurance, obtained by fitting a beta model on each locus to one of five data sets from our normal data: adult blood, newborn blood, placenta, lung/pleura, and everything else. Figure 3(A) illustrates a typical data set generated from these parameters. For each data set, we conducted 7 analyses, each using the J most variable loci, J ∈ {25, 50, 500, 1000}. The first analysis used hierarchical clustering, implemented using hclust in the R cluster package, with Euclidean metric and average linkage, and assigned 5 classes by cutting the resulting dendrogram at the appropriate height using the cutree function in the same package. The second analysis used the same clustering procedure, but determined the number of classes using the Dynamic Tree Cutting algorithm [28], implemented via the R package dynamicTreeCut with default settings. The third analysis used HOPACH (R hopach package) to select the "best" classes as defined in the function settings. The fourth analysis used HOPACH with classes obtained by the "greedy" version of the algorithm. The fifth analysis fit 6 sequential mixture models (1 ≤ K ≤ 6), each initialized two different ways (hierarchical clustering and the fanny algorithm), selecting the value of K producing the lowest BIC. The fifth and sixth analyses were applications of our proposed method, using ICL-BIC and BIC respectively. In the last three types of analysis, we recorded the value of J that produced the best augmented BIC. Note that each mixture model analysis tended to produce perfect classification, so that an ICL-BIC version of the time-consuming fourth analysis was deemed unnecessary.


Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions.

Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, Nelson HH, Wiemels J, Zheng S, Wiencke JK, Kelsey KT - BMC Bioinformatics (2008)

Examples of simulated data. Yellow = 1.0, black = 0.5, blue = 0.0. True classes indicated and separated by yellow dividing line. Height of region indicates the relative number of subjects in each class.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2553421&req=5

Figure 3: Examples of simulated data. Yellow = 1.0, black = 0.5, blue = 0.0. True classes indicated and separated by yellow dividing line. Height of region indicates the relative number of subjects in each class.
Mentions: We conducted simulations to compare the properties of our proposed method with similar competing methods. For our first case (Case I), each simulated data set consisted of n = 100 subjects, each having 1413 continuous responses lying in the unit interval. Each subject was a member of one of 5 classes, each class occurring with 0.2 probability. The classes were defined by beta-distribution parameters for each of L = 1413 methylation loci that were autosomal and passed quality-assurance, obtained by fitting a beta model on each locus to one of five data sets from our normal data: adult blood, newborn blood, placenta, lung/pleura, and everything else. Figure 3(A) illustrates a typical data set generated from these parameters. For each data set, we conducted 7 analyses, each using the J most variable loci, J ∈ {25, 50, 500, 1000}. The first analysis used hierarchical clustering, implemented using hclust in the R cluster package, with Euclidean metric and average linkage, and assigned 5 classes by cutting the resulting dendrogram at the appropriate height using the cutree function in the same package. The second analysis used the same clustering procedure, but determined the number of classes using the Dynamic Tree Cutting algorithm [28], implemented via the R package dynamicTreeCut with default settings. The third analysis used HOPACH (R hopach package) to select the "best" classes as defined in the function settings. The fourth analysis used HOPACH with classes obtained by the "greedy" version of the algorithm. The fifth analysis fit 6 sequential mixture models (1 ≤ K ≤ 6), each initialized two different ways (hierarchical clustering and the fanny algorithm), selecting the value of K producing the lowest BIC. The fifth and sixth analyses were applications of our proposed method, using ICL-BIC and BIC respectively. In the last three types of analysis, we recorded the value of J that produced the best augmented BIC. Note that each mixture model analysis tended to produce perfect classification, so that an ICL-BIC version of the time-consuming fourth analysis was deemed unnecessary.

Bottom Line: Arrays are now being used to study DNA methylation at a large number of loci; for example, the Illumina GoldenGate platform assesses DNA methylation at 1505 loci associated with over 800 cancer-related genes.We demonstrate our method on the normal tissue samples and show that the clusters are associated with tissue type as well as age.Our proposed recursively-partitioned mixture model is an effective and computationally efficient method for clustering DNA methylation data.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, 02115, USA. ahousema@hsph.harvard.edu

ABSTRACT

Background: Epigenetics is the study of heritable changes in gene function that cannot be explained by changes in DNA sequence. One of the most commonly studied epigenetic alterations is cytosine methylation, which is a well recognized mechanism of epigenetic gene silencing and often occurs at tumor suppressor gene loci in human cancer. Arrays are now being used to study DNA methylation at a large number of loci; for example, the Illumina GoldenGate platform assesses DNA methylation at 1505 loci associated with over 800 cancer-related genes. Model-based cluster analysis is often used to identify DNA methylation subgroups in data, but it is unclear how to cluster DNA methylation data from arrays in a scalable and reliable manner.

Results: We propose a novel model-based recursive-partitioning algorithm to navigate clusters in a beta mixture model. We present simulations that show that the method is more reliable than competing nonparametric clustering approaches, and is at least as reliable as conventional mixture model methods. We also show that our proposed method is more computationally efficient than conventional mixture model approaches. We demonstrate our method on the normal tissue samples and show that the clusters are associated with tissue type as well as age.

Conclusion: Our proposed recursively-partitioned mixture model is an effective and computationally efficient method for clustering DNA methylation data.

Show MeSH
Related in: MedlinePlus