Diagnostic biases in translational bioinformatics.

Han H - BMC Med Genomics (2015)

Bottom Line: With the surge of translational medicine and computational omics research, complex disease diagnosis increasingly relies on molecular signatures detected from massive omics data. Our work identifies and solves an important but less addressed problem in translational research. It also contributes new results on kernel-based learning for omics data to machine learning.


Affiliation: Department of Computer and Information Science, Fordham University, New York, NY 10023, USA. xhan9@fordham.edu.

ABSTRACT

Background: With the surge of translational medicine and computational omics research, complex disease diagnosis increasingly relies on molecular signatures detected from massive omics data. However, how to detect and prevent diagnostic biases in translational bioinformatics remains an unsolved problem, despite its importance in the coming era of personalized medicine.

Methods: We investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq, and miRNA-Seq data with support vector machines (SVMs) under different model selection methods. We further categorize the diagnostic biases into types through rigorous kernel matrix analysis and provide machine learning methods to overcome them.
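As a rough illustration of what a kernel matrix analysis can reveal (a sketch, not the paper's actual analysis), the snippet below builds a Gaussian (RBF) kernel matrix in plain Python for a few made-up expression-like profiles. The function names, the toy intensity values, and the bandwidth `sigma` are all assumptions chosen for illustration. When raw features are large relative to `sigma`, the off-diagonal entries collapse toward zero and the matrix approaches the identity, so each sample looks similar only to itself; that is one setting in which an SVM can end up memorizing its training data.

```python
import math

def rbf_kernel_matrix(samples, sigma):
    """Gaussian kernel: K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(samples)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            K[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return K

def max_off_diagonal(K):
    """Largest similarity between two different samples."""
    n = len(K)
    return max(K[i][j] for i in range(n) for j in range(n) if i != j)

# Hypothetical raw intensities for three samples (not taken from the paper).
raw = [
    [1200.0, 850.0, 4300.0],
    [1500.0, 900.0, 3900.0],
    [300.0, 2100.0, 5200.0],
]
# The same profiles after a log2 transform, a common rescaling step.
logged = [[math.log2(v) for v in row] for row in raw]

K_raw = rbf_kernel_matrix(raw, sigma=1.0)
K_log = rbf_kernel_matrix(logged, sigma=1.0)

print("max off-diagonal, raw data:   ", max_off_diagonal(K_raw))
print("max off-diagonal, log2 data:  ", max_off_diagonal(K_log))
```

With the raw values the kernel matrix is numerically the identity, while the rescaled profiles keep informative off-diagonal similarities. The specific transform and bandwidth are illustrative choices only.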

Results: We find that diagnostic biases occur for data with different distributions and for SVMs with different kernels. We identify three types of diagnostic bias in SVM diagnostics: overfitting bias, label skewness bias, and underfitting bias, and we present the reasons for each through rigorous analysis. Compared with the overfitting and underfitting biases, the label skewness bias is harder to detect and overcome because its deceptively high accuracy makes it easy to mistake for a normal diagnostic result. To tackle this problem, we propose a derivative component analysis based support vector machine (DCA-SVM), which overcomes the label skewness bias and achieves diagnostic results that rival clinical diagnosis.
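The deceptive accuracy behind the label skewness bias can be seen in a small, self-contained sketch. The labels below are invented, and `confusion_counts` and `diagnostic_rates` are hypothetical helper names. The "classifier" simply predicts the majority label for every sample, which is an assumed simplification of how a skewed SVM might behave rather than a model of the paper's experiments.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary diagnosis."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def diagnostic_rates(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity

# Invented, skewed label distribution: 45 disease samples, 5 controls.
y_true = [1] * 45 + [0] * 5
# A degenerate diagnosis that assigns every sample to the majority class.
y_pred = [1] * 50

accuracy, sensitivity, specificity = diagnostic_rates(y_true, y_pred)
print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Accuracy alone looks respectable here even though no control sample is recognized, which is why sensitivity and specificity (or a ROC analysis) are worth reporting alongside it.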

Conclusions: Our studies demonstrate that the diagnostic biases are mainly caused by three factors: kernel selection, the signal amplification mechanism in high-throughput profiling, and the label distribution of the training data. Moreover, the proposed DCA-SVM diagnosis provides a generic way to overcome the label skewness bias, owing to the feature extraction capability of derivative component analysis. Our work identifies and solves an important but less addressed problem in translational research. It also contributes new results on kernel-based learning for omics data to machine learning.



Fig. 4: ROC plots of the DCA-SVM, SVM, PCA-SVM, and ICA-SVM diagnoses under 5-fold cross-validation for the BreastIBC and Kidney data.

Mentions: Figure 4 compares the ROC plots of the DCA-SVM, SVM, PCA-SVM, and ICA-SVM diagnoses under 5-fold cross-validation for the BreastIBC and Kidney data [16, 33]. The proposed DCA-SVM diagnosis overcomes the label skewness bias and achieves the best performance of the four, which makes it a good candidate for personalized diagnostics given its unbiased performance across different omics data. This near clinical-level diagnosis arises mainly because the true-signal extraction in DCA forces the SVM hyperplane construction to rely on both subtle and global characteristics of the whole profile in a de-noised feature space, which seems to contribute greatly to a robust and consistently high-accuracy diagnosis. Because this consistent performance holds across different data sets rather than on a single data set, overfitting is a very unlikely explanation. The following two subsections further argue against overfitting by showing that the proposed algorithm works consistently well for different data sets under different training and test data selection methods. In particular, the phenotype separation results in Fig. 5 support its effectiveness from a biomarker discovery and visualization standpoint.
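For readers who want to build this kind of comparison themselves, the sketch below computes ROC points and the area under the curve (AUC) in plain Python. It is not the paper's code: `roc_points`, `auc`, and the toy labels and scores are assumptions, and the scores stand in for out-of-fold decision values that any of the compared classifiers would supply after cross-validation.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs from sweeping a threshold over the scores.

    Assumes binary labels (1 = positive) with both classes present.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    # Visit samples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Samples with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == current:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy labels and pooled decision scores (invented for illustration).
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

curve = roc_points(y_true, scores)
print("AUC:", auc(curve))
```

Running the same two functions on each method's pooled scores gives curves that can be drawn on one plot, in the spirit of the comparison in Fig. 4.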

