Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data.

Aliferis CF, Statnikov A, Tsamardinos I, Schildcrout JS, Shepherd BE, Harrell FE - PLoS ONE (2009)

Bottom Line: We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. Much less sample than previously thought may be sufficient for exploratory studies as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly-cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.


Affiliation: Center of Health Informatics and Bioinformatics, New York University, New York, New York, United States of America. constantin.aliferis@nyumc.org

ABSTRACT

Background: Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development.

Methodology/principal findings: We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data.

Conclusions/significance: The findings of the present study have two important practical implications. First, high-throughput studies, by avoiding under-powered data analysis protocols, can achieve substantial economies in the sample required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Much less sample than previously thought may be sufficient for exploratory studies, as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly-cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.



pone-0004922-g002: Application of Protocol II to human microarray data. Each histogram is the distribution of the repeated 10-fold cross-validation AUC estimates for each dataset under the null hypothesis “there is no signal present in the data” (as computed by 400 random outcome value permutations). The red line in each graph is the observed value of AUC estimated by the repeated 10-fold cross-validation on the original data. AUC and p-values are shown for each dataset in the embedded table. Bold p-values indicate that the null hypothesis is rejected at the 0.05 level in these datasets.

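The permutation scheme described in the caption can be sketched in plain Python. The data, the single-feature scoring rule, and the sample sizes below are illustrative stand-ins, not the authors' actual Protocol II (which permutes outcomes around a full repeated 10-fold cross-validation of trained classifiers); the skeleton — observed AUC versus a 400-permutation null distribution, with an empirical p-value — is the same.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic:
    the fraction of (event, non-event) pairs the scores order correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
# Toy data: one informative "gene"; events are shifted upward.
labels = [1] * 40 + [0] * 60
scores = [random.gauss(1.5 if y else 0.0, 1.0) for y in labels]

observed = auc(scores, labels)

# Null distribution: recompute AUC after permuting the outcome labels,
# which destroys any real signal while preserving the marginals.
B = 400
null_aucs = []
for _ in range(B):
    perm = labels[:]
    random.shuffle(perm)
    null_aucs.append(auc(scores, perm))

# One-sided empirical p-value with the standard +1 correction.
p_value = (1 + sum(a >= observed for a in null_aucs)) / (1 + B)
print(f"observed AUC = {observed:.3f}, permutation p = {p_value:.4f}")
```

With a genuinely informative feature the permuted AUCs cluster around 0.5, the observed AUC sits far in the right tail, and the null hypothesis is rejected — mirroring the datasets with bold p-values in the embedded table.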

Mentions: When, in the context of error estimation, one enforces that both the training and testing sets have the same proportion of events and non-events as the original full data, we call such error estimation “event balanced”. An important and subtle shortcoming of some data analysis protocols is failing to balance the training and testing data, which seriously inflates variance, reduces statistical power, and can bias error estimates. For example, in [23] the models were trained on samples with a 50% event rate but tested on samples whose event prevalence was far below 50%, yielding estimates less efficient than the standard holdout estimator in which the data are split at random. The consequence is evident in Figure 2 of [23]: as the sampling moves to larger training sets, the testing sets are forced not only to be smaller but also to have a very low event rate, and thus a large variance of error estimates. Notice that most classifiers, including the one used in [23], are designed to work under the assumption that the training and testing sets are identically distributed [30]. It is therefore unrealistic to expect, in general, that a classifier trained on data from a distribution where events and non-events are equally likely will perform well, without adjustments [37], on a different distribution where this ratio is heavily distorted. This is especially so when using an error metric that is sensitive to event priors, such as the proportion of misclassifications. Supporting Information File S2 shows via an example that this shift in distributions can degrade the performance of even an optimal classifier, i.e., one that has learned the training distribution perfectly, to the point of appearing no better than flipping a coin.
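The distribution-shift argument can be made concrete with a toy simulation. The Gaussian class-conditional scores, the sample sizes, and the 5% test prevalence below are illustrative assumptions, not the setup of [23] or of File S2: a decision threshold tuned on a balanced (50% event) training set yields a misclassification rate worse than the trivial “always predict non-event” rule once the test prevalence drops, while AUC, being insensitive to event priors, is unaffected.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sample(n_events, n_nonevents):
    """Illustrative scores: events ~ N(1, 1), non-events ~ N(0, 1)."""
    ys = [1] * n_events + [0] * n_nonevents
    xs = [random.gauss(float(y), 1.0) for y in ys]
    return xs, ys

random.seed(1)

# Balanced training set (50% events), as in the criticized protocol.
x_tr, y_tr = sample(500, 500)
# For equal-variance Gaussians with balanced priors, the error-minimizing
# threshold is the midpoint of the two class means; estimate it from data.
mu1 = sum(x for x, y in zip(x_tr, y_tr) if y == 1) / 500
mu0 = sum(x for x, y in zip(x_tr, y_tr) if y == 0) / 500
threshold = (mu0 + mu1) / 2

# Unbalanced test set: only 5% events.
x_te, y_te = sample(200, 3800)
err_clf = sum((x > threshold) != bool(y)
              for x, y in zip(x_te, y_te)) / len(y_te)
err_trivial = sum(y_te) / len(y_te)   # "always predict non-event"
auc_te = auc(x_te, y_te)

print(f"classifier error = {err_clf:.3f}, "
      f"trivial error = {err_trivial:.3f}, AUC = {auc_te:.3f}")
```

Judged by proportion of misclassifications, the threshold classifier looks worse than the trivial rule on the shifted test distribution, even though its ranking of cases (the AUC) is untouched — which is why the choice of error metric interacts with event balancing.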

