Non-specific filtering of beta-distributed data.

Wang X, Laird PW, Hinoue T, Groshen S, Siegmund KD - BMC Bioinformatics (2014)

Bottom Line: Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found two different filter statistics that tended to prioritize features with different characteristics; each performed well for identifying clusters of cancer and non-cancer tissue, and for identifying a cancer CpG island hypermethylation phenotype.


Affiliation: Department of Preventive Medicine, USC Keck School of Medicine, University of Southern California, 2001 N Soto Street, Suite 202W, Los Angeles 90089-9239 California, USA. kims@usc.edu.

ABSTRACT

Background: Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most varying are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation biases the selection of probes to those with mean values near 0.5. We explore the effect this has on clustering, and develop alternate filter methods that utilize a variance stabilizing transformation for Beta distributed data and do not share this bias.
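The mean-variance dependence described in the Background can be illustrated with a small simulation. The sketch below is an illustration only: it uses the arcsine square-root transformation, a standard variance-stabilizing transformation for proportions, which is an assumption here because this excerpt does not state which transformation the authors used. The function names, the precision value of 50, and the sample size are hypothetical choices.

```python
import math
import random
import statistics


def beta_sample(mean, precision, n, rng):
    """Draw n Beta values with the given mean and precision (alpha + beta)."""
    alpha = mean * precision
    beta = (1.0 - mean) * precision
    return [rng.betavariate(alpha, beta) for _ in range(n)]


def arcsine_sqrt(x):
    """Arcsine square-root transform (assumed here, not taken from the paper)."""
    return math.asin(math.sqrt(x))


def sd_raw_and_transformed(mean, precision=50.0, n=5000, seed=0):
    """Return (SD of raw proportions, SD after the arcsine square-root transform)."""
    rng = random.Random(seed)
    xs = beta_sample(mean, precision, n, rng)
    return statistics.stdev(xs), statistics.stdev([arcsine_sqrt(x) for x in xs])


if __name__ == "__main__":
    for mean in (0.1, 0.5, 0.9):
        raw, transformed = sd_raw_and_transformed(mean)
        print(f"mean={mean:.1f}  SD raw={raw:.4f}  SD transformed={transformed:.4f}")
```

With a common precision, the raw SD is largest near a mean of 0.5, so an SD filter would rank such features first even though no feature is more informative than another. After the transformation the SDs should be roughly equal across means.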

Results: We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found that for data sets having a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer, a novel filter statistic that utilized a variance-stabilizing transformation for Beta distributed data outperformed the common filter of using standard deviation of the DNA methylation proportion, or its log-transformed M-value, in its ability to detect the cancer subtype in a cluster analysis. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and standard deviation filter tended to favour features in different genome contexts; for the same data set, the novel filter always selected more features from CpG island promoters and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find sample subsets that overlapped for some real data sets.
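As a rough sketch of how a standard deviation filter on the methylation proportion and on its M-value might be applied, the code below ranks features by SD and keeps the top k. It assumes the commonly used definition M = log2(beta / (1 - beta)); the function names, the clipping constant `eps`, and the toy data are hypothetical and are not taken from the paper.

```python
import math
import statistics


def m_value(beta, eps=1e-6):
    """Logit (base 2) of a methylation proportion, clipped away from 0 and 1."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def top_k_by_sd(features, k, transform=None):
    """Rank features by SD across samples and return the names of the top k.

    features: dict mapping a feature name to its list of proportions.
    transform: optional function applied to each value before computing the SD.
    """
    scores = {}
    for name, values in features.items():
        vals = [transform(v) for v in values] if transform else list(values)
        scores[name] = statistics.stdev(vals)
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    toy = {
        "mid": [0.40, 0.60, 0.40, 0.60],  # varies around 0.5
        "low": [0.01, 0.05, 0.01, 0.05],  # varies near 0
    }
    print(top_k_by_sd(toy, 1))           # SD on the proportion scale
    print(top_k_by_sd(toy, 1, m_value))  # SD on the M-value scale
```

In this toy example the two scales rank the features differently: the proportion scale favours the feature near 0.5, while the M-value scale favours the feature near 0, which illustrates why the choice of filter statistic changes which features are selected.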

Conclusions: We found two different filter statistics that tended to prioritize features with different characteristics; each performed well for identifying clusters of cancer and non-cancer tissue, and for identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we would suggest trying both filters on any new data set and evaluating the overlap of the features selected and the clusters discovered.


Figure 2: ROC curves for 7 filtering methods (2 groups, 200 informative features out of 2000, 200 samples, 100 simulated data sets). For each data set, the sensitivity and specificity of selecting informative features using the top ranked list (1–2000 features) are averaged over 100 replications. Panels A-C show ROC curves for the 7 listed filtering methods (SD-b, Precision, SD-m, BQ-GOF, TM-GOF, TQ-GOF, and BR (best rank)) under different sample ratio scenarios: A. Sample size ratio 9:1 (non-CIMP/CIMP); B. Sample size ratio 1:1; C. Sample size ratio 1:9. The bottom three panels, D-F, are partial ROC curves obtained from panels A-C by restricting the axis ranges to the region around the diagonal line. The solid black diagonal line in panels D-F indicates the estimated sensitivity and specificity levels for a list of the top 100 features.

Mentions: The ROC curves from the analysis of simulated data show the ability of the filters to enrich the top-ranking features with those truly informative of subgroup (Figure 2). A filter that enriches the top of the list will show increasing sensitivity at low false-positive rates, giving a curve that falls above the diagonal line and an area under the curve (AUC) greater than 0.5. Figure 2 shows that, for all scenarios considered, filters that combine a Beta variance-stabilizing transformation with a goodness-of-fit statistic (TQ-GOF, TM-GOF, BQ-GOF) appear to enrich the most highly ranked features with ones informative for cluster subgroup. The filters SD-b, SD-m, and Precision appear non-informative, with 95% confidence intervals for the AUC containing 0.5 (Additional file 5: Table S1). The figures also show that, for the informative filters, the greatest enrichment occurs when the subgroups have equal sample size. This is to be expected, as equal sample sizes will give the greatest power to detect differential DNA methylation in supervised analyses. Between the two analyses with unequal sample sizes, the better discrimination occurs when the group with larger variance has the larger sample size (Figure 2C, F).

In ROC analyses, another quantity of interest is the partial AUC, the AUC up to a given false-positive rate. In this setting, fixing the error rate will give a variable number of features for different filters. Instead, we select a fixed number of features (p) and discuss the sensitivity and specificity at that list size. This approach reflects how non-specific filtering is performed in practice. The solid black diagonal line in Figure 2D-F indicates the estimated sensitivity and specificity levels for the top 100 features. The diagonal line connects two boundary points: the maximum true-positive fraction (y-axis) for 0 false-positives (x-axis), and the maximum false-positive fraction for 0 true-positives. In the simulation, the true number of informative features is always 200 out of 2000. Thus, the maximum possible true-positive fraction is 0.5, corresponding to all 100 selected features being true positives (100/200 true-positives and 0/1800 false-positives); the maximum possible false-positive fraction is 0.056 (100/1800 false-positives and 0/200 true-positives). The diagonal lines in Figure 2D-F connect the coordinates for these boundary points, (0, 0.5) and (0.056, 0), respectively.

We estimate the sensitivity and specificity for the top ranked 100 features from each filter by the coordinates where the ROC curve crosses the diagonal line. These intersection points provide a clearer comparison of enrichment for the top-ranking features than is evident when viewing the entire ROC curve. For subgroups of equal size, the TQ-GOF statistic shows the greatest sensitivity and specificity in selecting informative features for a fixed number of features. For unequal-sized subgroups, the methods TM-GOF and BQ-GOF performed competitively.
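The boundary-point arithmetic above can be checked with a short sketch. It computes the two endpoints of the diagonal line for a list of p features and the ROC operating point of a ranked list truncated at p. The function names and the example ranking (60 informative features among the top 100) are hypothetical and are not results from the paper.

```python
def diagonal_endpoints(n_informative, n_null, p):
    """Endpoints (FPF, TPF) of the line for a top-p list, assuming p <= both counts."""
    all_true_positives = (0.0, p / n_informative)  # p true-positives, 0 false-positives
    all_false_positives = (p / n_null, 0.0)        # 0 true-positives, p false-positives
    return all_true_positives, all_false_positives


def top_p_operating_point(ranked_is_informative, p):
    """(FPF, TPF) obtained by selecting the first p features of a ranked list.

    ranked_is_informative: list of 0/1 flags, best-ranked feature first.
    """
    n_informative = sum(ranked_is_informative)
    n_null = len(ranked_is_informative) - n_informative
    true_pos = sum(ranked_is_informative[:p])
    false_pos = p - true_pos
    return false_pos / n_null, true_pos / n_informative


if __name__ == "__main__":
    # Simulation settings from the text: 200 informative features out of 2000, p = 100.
    print(diagonal_endpoints(200, 1800, 100))

    # Hypothetical ranking: 60 of the top 100 features are informative.
    ranking = [1] * 60 + [0] * 40 + [1] * 140 + [0] * 1760
    fpf, tpf = top_p_operating_point(ranking, 100)
    # Any top-100 list satisfies TPF/0.5 + FPF/(100/1800) = 1, so its
    # operating point lies on the diagonal line.
    print(fpf, tpf, tpf / 0.5 + fpf / (100 / 1800))
```

Because every list of exactly p features has true-positives plus false-positives equal to p, its operating point always lies on this line, which is why the crossing point of each ROC curve with the diagonal gives the sensitivity and specificity at p = 100.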

