Non-specific filtering of beta-distributed data.

Wang X, Laird PW, Hinoue T, Groshen S, Siegmund KD - BMC Bioinformatics (2014)

Bottom Line: Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found two different filter statistics that tended to prioritize features with different characteristics; each performed well for identifying clusters of cancer and non-cancer tissue, and for identifying a cancer CpG island hypermethylation phenotype.


Affiliation: Department of Preventive Medicine, USC Keck School of Medicine, University of Southern California, 2001 N Soto Street, Suite 202W, Los Angeles 90089-9239 California, USA. kims@usc.edu.

ABSTRACT

Background: Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most varying are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation biases the selection of probes to those with mean values near 0.5. We explore the effect this has on clustering, and develop alternate filter methods that utilize a variance stabilizing transformation for Beta distributed data and do not share this bias.
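The mean-variance relationship described in the Background can be illustrated numerically. The sketch below is not the authors' filter. It assumes a Beta distribution parameterized by a mean and a common precision (the names `precision`, `beta_sd` and `arcsine_sqrt` are illustrative), and it assumes the arcsine square-root transform as one standard variance-stabilizing transformation for proportions; the transformation used in the paper may differ.

```python
import numpy as np


def beta_sd(mean, precision):
    """Theoretical SD of Beta(a, b) with a = mean * precision, b = (1 - mean) * precision."""
    return np.sqrt(mean * (1.0 - mean) / (1.0 + precision))


def arcsine_sqrt(x):
    """Arcsine square-root transform (assumed here as the variance-stabilizing step)."""
    return np.arcsin(np.sqrt(x))


rng = np.random.default_rng(0)
precision = 200.0                      # assumed common precision for all features
means = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# One column of simulated methylation proportions per mean value.
samples = rng.beta(means * precision, (1.0 - means) * precision,
                   size=(20000, means.size))

raw_sd = samples.std(axis=0)                # depends on the mean, largest near 0.5
vst_sd = arcsine_sqrt(samples).std(axis=0)  # roughly constant across means

for m, r, v in zip(means, raw_sd, vst_sd):
    print(f"mean={m:.1f}  SD(beta)={r:.4f}  SD(transformed)={v:.4f}")
```

With equal precision, a filter ranking on the raw SD would favour the column with mean 0.5, whereas on the transformed scale the columns are approximately indistinguishable, which is the bias the Background describes.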

Results: We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found that for data sets having a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer, a novel filter statistic that utilized a variance-stabilizing transformation for Beta distributed data outperformed the common filter of using standard deviation of the DNA methylation proportion, or its log-transformed M-value, in its ability to detect the cancer subtype in a cluster analysis. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and standard deviation filter tended to favour features in different genome contexts; for the same data set, the novel filter always selected more features from CpG island promoters and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find sample subsets that overlapped for some real data sets.
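As a rough illustration of the two common filters mentioned above, the sketch below ranks features by the standard deviation of the methylation proportion or of its M-value, taken here as the base-2 logit of the proportion. The function names, the clipping constant `eps` and the toy matrix are assumptions made for illustration, not the authors' code or data.

```python
import numpy as np


def m_value(beta, eps=1e-6):
    """Base-2 logit of the methylation proportion; eps guards against 0 and 1."""
    b = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(b / (1.0 - b))


def top_k_by_sd(beta, k, scale="beta"):
    """Return indices of the k features (rows) with the largest sample SD.

    scale="beta" ranks on the proportion itself, scale="m" on the M-value.
    """
    values = m_value(beta) if scale == "m" else np.asarray(beta, dtype=float)
    sd = values.std(axis=1, ddof=1)
    order = np.argsort(-sd, kind="stable")   # descending SD, ties keep row order
    return order[:k]


# Toy matrix: rows are features (CpGs), columns are samples.
demo = np.array([
    [0.45, 0.50, 0.55, 0.50],   # varies around 0.5
    [0.01, 0.02, 0.04, 0.01],   # normally unmethylated, small absolute changes
    [0.50, 0.50, 0.50, 0.50],   # constant
])

print("top feature by SD of proportion:", top_k_by_sd(demo, 1, "beta"))
print("top feature by SD of M-value:  ", top_k_by_sd(demo, 1, "m"))
```

In this toy example the two scales pick different features, which mirrors the observation that different filter statistics can select largely non-overlapping feature sets.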

Conclusions: We found two different filter statistics that tended to prioritize features with different characteristics; each performed well for identifying clusters of cancer and non-cancer tissue, and for identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we would suggest trying both filters on any new data sets, evaluating the overlap of features selected and clusters discovered.


Figure 3: Misclassification rates of RPMM cluster analysis using the top filtered features for 7 filtering methods (2 groups, 200 informative features out of 2000, 200 samples, 100 simulated data sets). Misclassification rates are averaged over 100 simulations of a cluster analysis performed using RPMM, for the top 100, 200, or 400 features of seven different filtering methods under different sample size ratios. A*: sample size ratio 9:1 (non-CIMP/CIMP); B*: sample size ratio 1:9; C* and D**: sample size ratio 1:1. For A*-C*, informative features have effect sizes smaller than 1; for D**, informative features have effect sizes smaller than 0.5.

Mentions: Applying cluster analysis to the data simulated in Figure 2, all methods performed nearly perfectly (results not shown). Presumably this is due to the selection of a few features with very large signal between the two cancer subtypes. To introduce variation in behaviour, we reduced the effect sizes for the informative features in the simulation (see Methods). Figure 3 shows the misclassification error rates from a cluster analysis of data simulated under these reduced effect sizes. For all scenarios, TM-GOF and TQ-GOF performed best among all single statistic methods. We see that the Precision filter, SD-b and SD-m performed worst. Filters MAD and DIP also performed poorly (data not shown). Regardless of the number of features retained, the Precision filter was unable to find the correct subgroups in the cluster analysis. For other methods, the misclassification rates increased as we increased the number of features in the analysis. Among the three summary methods, BR and AR performed similarly to the best single filter methods (AR data not shown); the WAR filter did not perform as well (data not shown). These results are consistent with previous ROC curves (Figure 2). We note that the maximum error rate for each panel in Figure 3 depended on the sample sizes in the two subgroups. When there are no clear clusters in the data, RPMM tends to find one big cluster. This resulted in a maximum error rate of 10% (=20/200) for sample size ratios of 1:9 and 9:1 (Figure 3A and B) and an error rate of 50% (=100/200) when the sample sizes were equal (Figure 3C and D).
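The arithmetic behind the maximum error rates can be checked with a short sketch. It assumes a two-group misclassification rate that is minimized over the two possible matchings of cluster labels to true groups; the function name `misclassification_rate` and this label-matching convention are illustrative assumptions, not a description of how the authors scored RPMM output.

```python
import numpy as np


def misclassification_rate(true_labels, cluster_labels):
    """Two-group error rate, taking the better of the two label matchings."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    errors = np.sum(true_labels != cluster_labels)
    flipped_errors = np.sum(true_labels != 1 - cluster_labels)
    return min(errors, flipped_errors) / true_labels.size


# 200 samples split 9:1, with every sample assigned to one big cluster.
truth_unbalanced = np.array([0] * 180 + [1] * 20)
one_cluster = np.zeros(200, dtype=int)
print(misclassification_rate(truth_unbalanced, one_cluster))

# 200 samples split 1:1, again with a single cluster.
truth_balanced = np.array([0] * 100 + [1] * 100)
print(misclassification_rate(truth_balanced, one_cluster))
```

When all samples fall into a single cluster, the error equals the size of the smaller group divided by the total, giving 20/200 for the 9:1 and 1:9 ratios and 100/200 for equal group sizes.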

