Expanding the understanding of biases in development of clinical-grade molecular signatures: a case study in acute respiratory viral infections.

Lytkin NI, McVoy L, Weitkamp JH, Aliferis CF, Statnikov A - PLoS ONE (2011)

Bottom Line: The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease, on an individual patient basis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.


Affiliation: Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America.

ABSTRACT

Background: The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease, on an individual patient basis. These functions are predominantly enabled by molecular signatures, which are computational models for predicting phenotypes and other responses of interest from high-throughput assay data. Data-analytics is a central component of molecular signature development and can jeopardize the entire process if conducted incorrectly. While exploratory data analysis may tolerate suboptimal protocols, clinical-grade molecular signatures are subject to vastly stricter requirements. Closing the gap between standards for exploratory versus clinically successful molecular signatures entails a thorough understanding of possible biases in the data analysis phase and developing strategies to avoid them.

Methodology and principal findings: Using a recently introduced data-analytic protocol as a case study, we provide an in-depth examination of the poorly studied biases of the data-analytic protocols related to signature multiplicity, biomarker redundancy, data preprocessing, and validation of signature reproducibility. The methodology and results presented in this work are aimed at expanding the understanding of these data-analytic biases that affect development of clinically robust molecular signatures.

Conclusions and significance: Several recommendations follow from the current study. First, all molecular signatures of a phenotype should be extracted to the extent possible, in order to provide comprehensive and accurate grounds for understanding disease pathogenesis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Third, data preprocessing procedures should be designed so as not to bias biomarker selection. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.


Figure 2: Visualization of subjects in the dataset from [12] in the space of the first two principal components of the panviral signature of Zaas et al. The solid line is an approximation of the molecular signature (classifier) of Zaas et al.; subjects to the left of this line are classified as uninfected (healthy) and subjects to the right are classified as virally infected (Influenza A). Blue and red gradient highlighting corresponds to the regions where the majority of bacterial and viral profiles belong, respectively. Green highlighting shows the area with uninfected (healthy) profiles.

Mentions: Figure 2 graphically depicts subjects from the dataset of Ramilo et al. in the space of the first two principal components obtained from genes that constituted the panviral signature of Zaas et al. The solid line is an approximation of the molecular signature (classifier) of Zaas et al. This signature would classify subjects to the left of the line as uninfected (healthy), whereas subjects to the right of the line would be classified as virally infected. Figure 2 also demonstrates that the same molecular signature can incidentally be used to accurately differentiate between subjects with bacterial and viral infections from the dataset of Ramilo et al., thus confirming the finding of Zaas et al. However, this result is due to a lucky choice of genes in the molecular signature of Zaas et al., a choice that was helped by redundant genes for the viral phenotype (recall that only 20 gene probes were non-redundant) and/or could have been informed by other criteria and procedures not reported in the original publication. When we substituted factor analysis-based gene selection in the protocol of Zaas et al. with GLL, which by design yields only non-redundant genes for the viral phenotype, predictive accuracy for the bacterial vs. viral classification task was reduced to 0.60 AUC. This indicates that the finding of Zaas et al. is method-dependent. Moreover, the following subsection shows that the methodology employed by Zaas et al. for evaluating the specificity of their molecular signature to viral infections does not generalize to other datasets.
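
To put the 0.60 AUC figure in context, the following is a minimal sketch of how the area under the ROC curve can be computed from classifier scores. It is not the authors' evaluation code: the function name `auc`, the pairwise (Mann-Whitney) formulation, and all scores and labels below are assumptions made up for illustration. An AUC of 1.0 means every positive (e.g. viral) subject scores above every negative (e.g. bacterial) subject, while 0.5 corresponds to chance-level ranking.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 for the positive class (e.g. viral), 0 for the negative class
    (e.g. bacterial). A tie between a positive and a negative counts as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    # Fraction of (positive, negative) pairs that are ranked correctly.
    return wins / (len(pos) * len(neg))


# Hypothetical scores for illustration only; these are not data from the study.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
well_separated = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
weakly_separated = [0.9, 0.45, 0.6, 0.2, 0.7, 0.8, 0.1, 0.5, 0.4, 0.65]

print(auc(well_separated, labels))    # every positive outranks every negative
print(auc(weakly_separated, labels))  # 15 of 25 pairs ranked correctly
```

In the second toy example only 15 of the 25 positive/negative pairs are ordered correctly, which gives an AUC of 0.60, only modestly better than the 0.5 expected from random ranking.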

