Diagnostic biases in translational bioinformatics.

Han H - BMC Med Genomics (2015)

Bottom Line: With the surge of translational medicine and computational omics research, complex disease diagnosis increasingly relies on massive omics data-driven molecular signature detection. Our work identifies and solves an important but less addressed problem in translational research. It also has a positive impact on machine learning by adding new results to kernel-based learning for omics data.


Affiliation: Department of Computer and Information Science, Fordham University, New York, 10023, NY, USA. xhan9@fordham.edu.

ABSTRACT

Background: With the surge of translational medicine and computational omics research, complex disease diagnosis increasingly relies on massive omics data-driven molecular signature detection. However, how to detect and prevent possible diagnostic biases in translational bioinformatics remains an unsolved problem despite its importance in the coming era of personalized medicine.

Methods: In this study, we comprehensively investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq and miRNA-Seq data under the framework of support vector machines with different model selection methods. We further categorize the diagnostic biases into different types through rigorous kernel matrix analysis and provide effective machine learning methods to overcome them.

Results: In this study, we comprehensively investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq and miRNA-Seq data under the framework of support vector machines. We found that diagnostic biases occur for data with different distributions and for SVMs with different kernels. Moreover, we identify three types of diagnostic bias in SVM diagnostics: overfitting bias, label skewness bias, and underfitting bias, and present the corresponding causes through rigorous analysis. Compared with the overfitting and underfitting biases, the label skewness bias is more challenging to detect and overcome because its deceptive accuracy makes it easy to mistake for a normal diagnostic case. To tackle this problem, we propose derivative component analysis based support vector machines (DCA-SVM), which overcome the label skewness bias while achieving rivaling clinical diagnostic results.

Conclusions: Our studies demonstrate that diagnostic biases are mainly caused by three major factors: kernel selection, the signal amplification mechanism in high-throughput profiling, and the training data label distribution. Moreover, the proposed DCA-SVM diagnosis provides a generic solution for overcoming the label skewness bias, owing to the powerful feature extraction capability of derivative component analysis. Our work identifies and solves an important but less addressed problem in translational research. It also has a positive impact on machine learning by adding new results to kernel-based learning for omics data.
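The "deceptive accuracy" of the label skewness bias can be illustrated with a minimal Python sketch. The cohort sizes (90 majority, 10 minority samples) and the helper name `diagnostic_metrics` are hypothetical and are not taken from the paper; the sketch only shows why accuracy alone can hide a classifier that always predicts the majority type.

```python
def diagnostic_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Hypothetical cohort: 90 positive (majority) and 10 negative (minority) samples.
y_true = [1] * 90 + [-1] * 10
# A biased classifier that assigns every sample to the majority type.
y_pred = [1] * 100

acc, sens, spec = diagnostic_metrics(y_true, y_pred)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Under these assumed numbers the accuracy looks acceptable while the specificity is zero, so reporting sensitivity and specificity alongside accuracy is one simple way to flag this kind of bias.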


Fig. 2: The distributions of α values. The distributions of α values in each diagnostic trial of the 5-fold cross validation for three data sets. The skewness of the sample label distribution leads to skewed distributions of α values in the diagnoses of the BreastIBC and Kidney data sets. The signs of the α values indicate the group property of the corresponding support vectors. As such, more support vectors are found for the majority-count type, which increases the likelihood that an unknown sample is detected as the majority-count type in diagnosis.
© Copyright Policy - open-access


Mentions: The label skewness bias arises from skewed label distributions: more support vectors come from the majority-count type samples, so the class type of an unknown sample is more likely to be determined as the majority-count type. Figure 2 shows the distributions of the α values, i.e., the values of the Lagrange multipliers α1, α2, ⋯, αm in the dual problem, in each diagnostic trial of the 5-fold cross validation. As the weights of the corresponding support vectors, these values are always positive or zero, as we pointed out before. However, our SVM implementation assigns a sign to each weight for the convenience of indicating its class property, i.e., a positive (negative) sign means that the weight (e.g., α1) is for a support vector belonging to the positive (negative) target group. The distributions of α values are nearly balanced for the hepatocellular carcinoma (HCC) data, which has a relatively balanced sample label distribution: the number of positive signs is almost equal to the number of negative signs. However, the distributions of α values for the BreastIBC and Kidney data are obviously skewed toward the positive targets, which are the majority-count samples in each data set. In other words, more support vectors are found for the majority-count type, which increases the likelihood that an unknown sample is detected as the majority-count type in the subsequent decision making. For example, since 256 and 178 α values carry positive and negative signs, respectively, in the 5th trial of diagnosis for the Kidney data, a test sample is more likely to be detected as a positive target.
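The sign-counting argument above can be sketched in a few lines of Python. Only the counts 256 and 178 come from the text; the α magnitudes (0.5), the ten zero-weight samples, and the helper names `sign_counts` and `majority_share` are placeholders chosen for illustration, not the author's implementation.

```python
def sign_counts(signed_alphas, tol=1e-12):
    """Count support vectors per class from signed alpha values.

    A positive (negative) sign marks a support vector of the positive
    (negative) group; values within tol of zero are not support vectors.
    """
    pos = sum(1 for a in signed_alphas if a > tol)
    neg = sum(1 for a in signed_alphas if a < -tol)
    return pos, neg


def majority_share(pos, neg):
    """Fraction of support vectors that belong to the larger group."""
    total = pos + neg
    return max(pos, neg) / total if total else float("nan")


# Counts reported for the 5th trial on the Kidney data: 256 positive and
# 178 negative signs. Magnitudes and the zero entries are placeholders.
signed_alphas = [0.5] * 256 + [-0.5] * 178 + [0.0] * 10

pos, neg = sign_counts(signed_alphas)
share = majority_share(pos, neg)
print(pos, neg, round(share, 2))
```

With the reported counts, roughly 59% of the support vectors (256 of 434) belong to the positive group. This count is only a rough indicator: the actual decision also depends on the α magnitudes and the kernel values, which this sketch does not model.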

