Non-specific filtering of beta-distributed data.

Wang X, Laird PW, Hinoue T, Groshen S, Siegmund KD - BMC Bioinformatics (2014)

Bottom Line: Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found two different filter statistics that tended to prioritize features with different characteristics; each performed well for identifying clusters of cancer and non-cancer tissue, and for identifying a cancer CpG island hypermethylation phenotype.


Affiliation: Department of Preventive Medicine, USC Keck School of Medicine, University of Southern California, 2001 N Soto Street, Suite 202W, Los Angeles 90089-9239 California, USA. kims@usc.edu.

ABSTRACT

Background: Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most varying are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation biases the selection of probes to those with mean values near 0.5. We explore the effect this has on clustering, and develop alternate filter methods that utilize a variance stabilizing transformation for Beta distributed data and do not share this bias.
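
The bias described above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It assumes a Beta(a, b) model parameterized by a mean and a precision a + b, and it uses the arcsine square-root transformation as a stand-in variance-stabilizing transformation, because the text here does not say which transformation the authors used. The function names and the precision value of 50 are hypothetical.

```python
import math
import random
import statistics


def beta_sd(mean, precision):
    """Exact SD of Beta(a, b) with a = mean * precision, b = (1 - mean) * precision."""
    return math.sqrt(mean * (1.0 - mean) / (precision + 1.0))


def arcsine_sqrt(x):
    """Arcsine square-root transform (an assumed choice, not taken from the source)."""
    return math.asin(math.sqrt(x))


def simulated_sd(mean, precision, transform, n=20000, seed=0):
    """Sample SD of transform(X) for X drawn from the Beta model above."""
    rng = random.Random(seed)
    a, b = mean * precision, (1.0 - mean) * precision
    draws = [transform(rng.betavariate(a, b)) for _ in range(n)]
    return statistics.stdev(draws)


# With the precision held fixed, the raw SD peaks at mean 0.5, while the SD of
# the transformed values stays roughly constant across means.
for m in (0.05, 0.5, 0.95):
    raw = beta_sd(m, 50)
    stabilized = simulated_sd(m, 50, arcsine_sqrt)
    print(f"mean={m:.2f}  SD(beta)={raw:.3f}  SD(transformed)={stabilized:.3f}")
```

Under this model a filter that ranks features by raw SD would tend to pass over features with means near 0 or 1 even when their precision is the same, which is the bias the Background describes.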

Results: We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found that for data sets having a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer, a novel filter statistic that utilized a variance-stabilizing transformation for Beta distributed data outperformed the common filter of using standard deviation of the DNA methylation proportion, or its log-transformed M-value, in its ability to detect the cancer subtype in a cluster analysis. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and standard deviation filter tended to favour features in different genome contexts; for the same data set, the novel filter always selected more features from CpG island promoters and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find sample subsets that overlapped for some real data sets.

Conclusions: We found two different filter statistics that tended to prioritize features with different characteristics; each performed well for identifying clusters of cancer and non-cancer tissue, and for identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we would suggest trying both filters on any new data set and evaluating the overlap of the features selected and the clusters discovered.
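
Evaluating the overlap of selected features, as suggested above, could be sketched as follows. This is an illustration only: the function names are hypothetical, the scores are made-up numbers, and it assumes each filter assigns one score per feature with larger scores ranked first.

```python
def top_k(scores, k):
    """Indices of the k largest scores; ties are broken by the lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def feature_overlap(scores_a, scores_b, k):
    """Count and fraction of features shared by the two top-k selections."""
    shared = top_k(scores_a, k) & top_k(scores_b, k)
    return len(shared), len(shared) / k


# Toy scores for five features under two hypothetical filters.
filter_a = [0.9, 0.1, 0.8, 0.3, 0.7]
filter_b = [0.2, 0.95, 0.85, 0.1, 0.6]
count, fraction = feature_overlap(filter_a, filter_b, k=3)
print(f"{count} of 3 top features shared ({fraction:.0%})")
```

On real data the same call with k = 1000 would give the kind of overlap count that can be compared between filters.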

Figure 5: Heatmaps of RPMM cluster analysis using the top 1000 features filtered by the TM-GOF (A, C) or SD-b (B, D) method, for 95 kidney cancer and non-cancer samples (data set #4). Rows represent features and columns represent samples; yellow represents high DNA methylation and blue represents low. The color bars at the top of the columns indicate sample tissue types (row 1) and clusters (row 2). In row 1, dark and light green indicate cancer and non-cancer samples, respectively. In row 2, red, yellow, blue and green bars indicate the sample clusters found after two divisions of clustering using RPMM. Figures A and B show heatmaps of all 95 kidney samples using the top 1000 features filtered by the TM-GOF or SD-b method, respectively. Figures C and D show heatmaps of the 50 kidney tumors using the top 1000 features filtered by the TM-GOF or SD-b method, respectively. In C, the blue and green bar clusters are found at the second separation.

Mentions: To visualize the different performance of our top filters, TM-GOF and SD-b, we created heatmaps of the top 1000 filtered features following RPMM clustering. Figures 4 and 5 show the clusters identified for the colon cancer data set (data set #1, tumor-only) and the TCGA kidney data set (data set #4, tumor and non-tumor kidney), respectively. Using TM-GOF in the colon cancer data, our subcluster identified 4 out of 6 CIMP samples, leaving 2 CIMP samples misclassified in our first split of the data (Figure 4A). (We note that we use the word “misclassified” loosely, as our definition of CIMP is likely not a gold standard; see Additional file 2, Data set #1.) Using SD-b as the filter, the first split identified two clusters more equal in sample size (11 and 15), misclassifying 5 non-CIMP samples (Figure 4B). Interestingly, in the next split of the data all 6 CIMP samples separated themselves from the others, appearing together in one of the four subclusters (Figure 4B). Thus the CIMP subtype was found by SD-b filtering, but only after further sub-clustering. Using the adjusted Rand index, which measures the co-clustering of sample pairs by cluster category and tissue label, the TM-GOF filter showed superiority over the SD-b filter because of the smaller number of clusters estimated by the clustering method (Additional file 6: Table S2). In Figures 4A and B, 47 out of 1000 features are shared by the two filter methods. For data set #4 (Figure 5), the SD-b filter resulted in the successful identification of tumor and normal kidney samples at the first division (Figure 5B). Using TM-GOF, the first cluster division identified a subtype of 4 cancer samples (Figure 5A, light green) with high DNA methylation in a subset of features. However, a second division of the data resulted in the separation of non-cancer tissues from the cancer samples (red bar). Thus, again we found the substructure sought in the cluster analysis, but not until the second division of the clusters.
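
For readers unfamiliar with the adjusted Rand index, a minimal sketch of its standard pair-counting definition follows. It is not the code used in the study; the function name is hypothetical and the labels in the example are invented for illustration, not taken from any of the data sets.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of samples placed together under both labelings, under A, and under B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = 0.5 * (together_a + together_b)
    if maximum == expected:
        return 1.0  # both labelings are trivial (one group, or all singletons)
    return (together_both - expected) / (maximum - expected)


# Invented example: tissue labels versus cluster assignments for eight samples.
tissue = ["cancer"] * 4 + ["normal"] * 4
cluster = [1, 1, 1, 2, 2, 2, 2, 2]
print(round(adjusted_rand_index(tissue, cluster), 3))
```

A value of 1 means the two labelings group the samples identically, and values near 0 mean agreement no better than chance. Because the index penalizes splitting one tissue group across several clusters, a filter that leads to fewer estimated clusters can score higher, as noted above.
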
In Figures 5A and B, only 4 of the 1000 features overlapped. Interestingly, using different feature selection methods the cluster substructure became similar for both data sets, even when the first split found different subgroups. This common substructure in the data set was captured by the adjusted Rand index (Additional file 6: Table S2). However, this same cluster substructure was not reproduced on the HM450 platform.

We also asked whether omitting features with outlier values might discover larger clusters than the small disease subset discovered when filtering using TM-GOF (Figure 5A). We omitted all features with values more than 3 times the interquartile range (IQR) above or below the median Beta value, excluding a relatively large number of features from the kidney data set (data set #4). We then selected the top 1000 features using TM-GOF, performed cluster analysis, and found perfect discrimination of cancer and non-cancer kidney (figure not shown). This confirmed that the TM-GOF filter favoured features that identified disease subtypes prior to its selection of features identifying tissue disease state.
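
The outlier-omission step can be sketched as below. This is one reading of the rule, not the authors' code: it assumes the median and IQR are computed per feature across samples, it uses the "inclusive" quartile convention of Python's `statistics.quantiles`, and the function names and toy values are hypothetical.

```python
import statistics


def has_outlier(values, k=3.0):
    """True if any value lies more than k * IQR above or below the median."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    median = statistics.median(values)
    spread = k * (q3 - q1)
    return any(v < median - spread or v > median + spread for v in values)


def keep_features_without_outliers(matrix, k=3.0):
    """Indices of features (rows of Beta values across samples) with no outlier."""
    return [i for i, row in enumerate(matrix) if not has_outlier(row, k)]


# Toy Beta values: feature 0 has one aberrant sample, feature 1 varies smoothly.
beta_values = [
    [0.10, 0.12, 0.11, 0.13, 0.12, 0.11, 0.90],
    [0.20, 0.40, 0.60, 0.80, 0.30, 0.50, 0.70],
]
print(keep_features_without_outliers(beta_values))
```

A rule of this kind drops features whose variation is driven by a few samples, which is consistent with the observation above that the small disease subset no longer dominated the first split once such features were omitted.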

