Expanding the understanding of biases in development of clinical-grade molecular signatures: a case study in acute respiratory viral infections.

Lytkin NI, McVoy L, Weitkamp JH, Aliferis CF, Statnikov A - PLoS ONE (2011)

Bottom Line: The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease, on an individual patient basis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.


Affiliation: Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America.

ABSTRACT

Background: The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease on an individual patient basis. These functions are predominantly enabled by molecular signatures, which are computational models for predicting phenotypes and other responses of interest from high-throughput assay data. Data analytics is a central component of molecular signature development and can jeopardize the entire process if conducted incorrectly. While exploratory data analysis may tolerate suboptimal protocols, clinical-grade molecular signatures are subject to vastly stricter requirements. Closing the gap between standards for exploratory versus clinically successful molecular signatures entails a thorough understanding of possible biases in the data analysis phase and developing strategies to avoid them.

Methodology and principal findings: Using a recently introduced data-analytic protocol as a case study, we provide an in-depth examination of the poorly studied biases of the data-analytic protocols related to signature multiplicity, biomarker redundancy, data preprocessing, and validation of signature reproducibility. The methodology and results presented in this work are aimed at expanding the understanding of these data-analytic biases that affect development of clinically robust molecular signatures.

Conclusions and significance: Several recommendations follow from the current study. First, all molecular signatures of a phenotype should be extracted to the extent possible, in order to provide comprehensive and accurate grounds for understanding disease pathogenesis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Third, data preprocessing procedures should be designed so as not to bias biomarker selection. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.


Figure 1. Effects of preprocessing by the supplementary software of Zaas et al. [9] on real gene expression data. Gene expression profiles of the uninfected subjects are shown in blue, staggered on top of the profiles of the infected subjects, which are highlighted in red. The blue and red vertical line segments denote locations of the mean expression in the uninfected and infected groups, respectively. Likewise, blue and red horizontal line segments emanating in both directions from the means denote one standard deviation within the uninfected and infected groups, respectively. P-values produced by a two-sample t-test with unequal variances are shown in parentheses.


Mentions: A specific example illustrating the above effects of preprocessing in the original (non-permuted) data is shown in Figure 1. As can be seen in that figure, within-class variances decreased roughly five-fold for gene RIBC2 as a result of batch correction using the supplementary software of Zaas et al. Consequently, the p-value produced by a two-sample t-test for differential expression decreased from roughly 0.5 to below 10^-3, creating the appearance of a statistically significant association between gene RIBC2 and the panviral phenotype. Although the two classes of gene expression profiles could not be separated without errors using only gene RIBC2 in the preprocessed data, in general, such preprocessing may force classes to become perfectly separable, as shown using simulated data in Figure S4.
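The mechanism described above can be illustrated with a minimal sketch. It is not the software of Zaas et al. and does not use the study's data: the simulated expression values, the group sizes, and the helper `shrink_within_class` are all assumptions made for illustration. The sketch computes the statistic of a two-sample t-test with unequal variances (Welch's test) and shows that reducing within-class variance five-fold, while leaving the class means untouched, inflates the statistic by a factor of sqrt(5) without changing the degrees of freedom. A larger statistic at the same degrees of freedom corresponds to a smaller p-value.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Return (t, df) for a two-sample t-test with unequal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def shrink_within_class(values, factor):
    """Hypothetical stand-in for a class-aware correction: pull each value
    toward its own class mean so the class variance drops by `factor`."""
    m = statistics.mean(values)
    s = math.sqrt(factor)
    return [m + (v - m) / s for v in values]


# Simulated (not real) expression values for one gene in two groups.
rng = random.Random(0)
uninfected = [rng.gauss(0.0, 1.0) for _ in range(20)]
infected = [rng.gauss(0.2, 1.0) for _ in range(20)]

t_before, df_before = welch_t(uninfected, infected)
t_after, df_after = welch_t(
    shrink_within_class(uninfected, 5.0),
    shrink_within_class(infected, 5.0),
)

# The means are unchanged, so only the shrunken variance drives the change.
ratio = t_after / t_before
print(f"t before: {t_before:.3f}, t after: {t_after:.3f}, ratio: {ratio:.3f}")
print(f"df before: {df_before:.2f}, df after: {df_after:.2f}")
```

Because the class means are identical before and after the shrinkage, any gain in apparent significance in this sketch comes purely from the reduced within-class spread, which is the kind of bias the paragraph warns about. Converting the statistic to a p-value would additionally require the t-distribution, which is left out here to keep the example dependency-free.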

