Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data.

Aliferis CF, Statnikov A, Tsamardinos I, Schildcrout JS, Shepherd BE, Harrell FE - PLoS ONE (2009)

Bottom Line: We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. Much less sample than previously thought may be sufficient for exploratory studies, as long as these factors are taken into consideration when designing and executing the analysis. In addition, previous highly cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.

Affiliation: Center of Health Informatics and Bioinformatics, New York University, New York, New York, United States of America. constantin.aliferis@nyumc.org

ABSTRACT

Background: Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development.

Methodology/principal findings: We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data.

Conclusions/significance: The findings of the present study have two important practical implications. First, high-throughput studies, by avoiding under-powered data analysis protocols, can achieve substantial economies in the sample required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Much less sample than previously thought may be sufficient for exploratory studies, as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.

Figure 1: Comparison of Protocols I and II in simulated data. Left: Example where Protocol I [23], applied to simulated data with true moderate-strength signal, fails to detect statistical significance at all training set sizes. Right: A more powerful protocol (Protocol II, based on event-balanced repeated 10-fold cross-validation with SVM classifiers and the AUC metric) detects statistically significant predictive signal according to an outcome-value permutation test. Specifically, the p-value of the hypothesis of no signal is 0.0025. The blue bars depict the distribution of repeated 10-fold cross-validation AUC estimates over 400 random datasets produced via outcome-value permutation. The red line depicts the value of the repeated 10-fold cross-validation AUC on the original data (i.e., without perturbing the outcome values).
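The outcome-value permutation test described in the caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `auc` and `permutation_p_value` are hypothetical, the statistic is the plain AUC of a single score vector (a stand-in for the repeated 10-fold cross-validation AUC of an SVM, which would be recomputed on every permuted dataset), and the "+1" correction in the p-value is a common convention that the source does not confirm.

```python
import random


def auc(scores, labels):
    """AUC via pairwise comparison of positives and negatives; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_p_value(statistic, data, labels, n_permutations=400, seed=0):
    """Outcome-value permutation test for the hypothesis of no signal.

    `statistic(data, labels)` is any performance estimate where larger means
    better (assumption: here it stands in for cross-validated AUC).
    Returns (observed value, list of null values, p-value).
    """
    rng = random.Random(seed)
    observed = statistic(data, labels)       # the "red line" in the figure
    shuffled = list(labels)
    null_values = []                         # the "blue bars" in the figure
    for _ in range(n_permutations):
        rng.shuffle(shuffled)                # permute outcome values only
        null_values.append(statistic(data, shuffled))
    exceed = sum(1 for v in null_values if v >= observed)
    # "+1" correction (an assumption of this sketch) keeps the p-value above 0.
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, null_values, p_value


# Toy usage with made-up scores; not data from the paper.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
observed, null_values, p_value = permutation_p_value(auc, toy_scores, toy_labels)
print(observed, p_value)
```

With 400 permutations the smallest p-value this sketch can return is 1/401, which is of the same order as the 0.0025 reported in the caption.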

Mentions: The left part of Figure 1 demonstrates the inability of Protocol I [23] to detect signal that is detectable by Protocol II. The right part of the figure shows the results of applying Protocol II and assessing its statistical significance by permutation testing (details of statistical significance testing are provided in the Materials and Methods section). Overall, Protocol I has remarkably small power, ranging from less than 0.002 to 0.3 depending on the criterion used for rejecting the hypothesis (see Supporting Information File S3). In contrast, Protocol II has power 0.93. Replacing the proportion of misclassifications with AUC in Protocol I increases its power to 0.6, and additionally using SVMs increases it further to 0.75. Conversely, starting from Protocol II and replacing AUC with the proportion of misclassifications and SVMs with the classifier from [23] reduces the power from 0.93 to 0.46. These empirical power estimates do not give the exact power in real datasets, since the true nature of the corresponding distributions is not known and varies among datasets. However, the simulation strengthens our hypothesis that the choice of error metric, classifier, event balancing and error estimator has a large impact on study results, and it sheds light on the limitations of the analyses described in prior work [23]. In the next sub-section we test Protocol II on real data (where Protocol I was previously independently applied).
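Empirical power estimates of the kind quoted above are obtained by simulating many datasets with a known signal and counting how often the test rejects the hypothesis of no signal. The sketch below illustrates that procedure for two error metrics on a toy problem. Everything in it is an assumption made for illustration: the one-feature Gaussian data, the fixed-cut classifier behind `accuracy`, the sample sizes, and the function names are hypothetical and do not reproduce the paper's simulation or its classifiers.

```python
import random


def auc(scores, labels):
    """AUC via pairwise comparison of positives and negatives; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels):
    """1 - proportion of misclassifications, for a fixed cut at the overall mean."""
    cut = sum(scores) / len(scores)
    hits = sum(1 for s, y in zip(scores, labels) if (s > cut) == (y == 1))
    return hits / len(scores)


def permutation_p(metric, scores, labels, n_perm, rng):
    """Permutation p-value for `metric` (larger is better), with +1 correction."""
    observed = metric(scores, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if metric(scores, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


def empirical_power(metric, effect, n_per_class=20, n_sim=30,
                    n_perm=99, alpha=0.05, seed=0):
    """Fraction of simulated datasets in which the hypothesis of no signal is rejected.

    Each dataset has one feature: N(0, 1) for class 0 and N(effect, 1) for
    class 1 (a toy signal model, assumed for this sketch).
    """
    rng = random.Random(seed)
    labels = [0] * n_per_class + [1] * n_per_class
    rejections = 0
    for _ in range(n_sim):
        scores = [rng.gauss(effect * y, 1.0) for y in labels]
        if permutation_p(metric, scores, labels, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_sim


# Compare the two metrics at a moderate, made-up effect size.
for name, metric in (("AUC", auc), ("accuracy", accuracy)):
    print(name, empirical_power(metric, effect=0.6))
```

Setting `effect=0.0` gives the rejection rate under no signal, which should stay near `alpha`. Larger `n_sim` and `n_perm` give more stable estimates at higher computational cost.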

