Overcome support vector machine diagnosis overfitting.

Han H, Jiang X - Cancer Inform (2014)

Bottom Line: However, no previous research has analyzed SVM overfitting in disease diagnosis based on high-dimensional omics data, an analysis that is essential to avoid deceptive diagnostic results and enhance clinical decision making. We found that disease diagnosis with an SVM classifier would inevitably encounter overfitting under a Gaussian kernel because of the large data variations generated by high-throughput profiling technologies. Finally, we propose a novel biomarker discovery algorithm, Gene-Switch-Marker (GSM), to capture meaningful biomarkers by taking advantage of SVM overfitting on single genes.

Affiliation: Department of Computer and Information Science, Fordham University, New York, NY, USA; Quantitative Proteomics Center, Columbia University, New York, NY, USA.

ABSTRACT
Support vector machines (SVMs) are widely employed in molecular diagnosis of disease for their efficiency and robustness. However, no previous research has analyzed their overfitting in disease diagnosis based on high-dimensional omics data, an analysis that is essential to avoid deceptive diagnostic results and enhance clinical decision making. In this work, we comprehensively investigate this problem from both theoretical and practical standpoints to unveil the special characteristics of SVM overfitting. We found that disease diagnosis with an SVM classifier would inevitably encounter overfitting under a Gaussian kernel because of the large data variations generated by high-throughput profiling technologies. Furthermore, we propose a novel sparse-coding kernel approach to overcome SVM overfitting in disease diagnosis. Unlike traditional ad hoc parametric tuning approaches, it not only robustly overcomes the overfitting problem but also achieves good diagnostic accuracy. To our knowledge, it is the first rigorous method proposed to overcome SVM overfitting. Finally, we propose a novel biomarker discovery algorithm, Gene-Switch-Marker (GSM), to capture meaningful biomarkers by taking advantage of SVM overfitting on single genes.
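
The Gaussian-kernel claim in the abstract can be illustrated with a small sketch. It is not the authors' analysis: it only shows one commonly cited mechanism, namely that when samples have many features with large variation, squared distances between samples become huge and a Gaussian kernel with a small sigma assigns near-zero similarity to every pair of distinct samples. The kernel matrix then looks like the identity, so a classifier built on it can memorize training samples without generalizing. The data are synthetic, and the sample count, feature count, spread, and both sigma values are assumptions chosen for illustration.

```python
import math
import random


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matrix(samples, sigma):
    """Pairwise kernel values for a list of samples."""
    return [[gaussian_kernel(a, b, sigma) for b in samples] for a in samples]


def off_diagonal(K):
    """All kernel values between distinct samples."""
    n = len(K)
    return [K[i][j] for i in range(n) for j in range(n) if i != j]


# Synthetic stand-in for an omics dataset (assumed sizes, not from the paper):
# few samples, many features, large spread per feature.
random.seed(0)
n_samples, n_features = 8, 2000
samples = [
    [random.gauss(0.0, 10.0) for _ in range(n_features)]
    for _ in range(n_samples)
]

# Small sigma: distinct samples look completely dissimilar (matrix ~ identity).
K_narrow = kernel_matrix(samples, sigma=1.0)
# Sigma scaled up to the data spread: similarities become informative again.
K_wide = kernel_matrix(samples, sigma=1000.0)

print("largest off-diagonal value, sigma=1:   ", max(off_diagonal(K_narrow)))
print("smallest off-diagonal value, sigma=1000:", min(off_diagonal(K_wide)))
```

Enlarging sigma restores non-trivial similarities in this toy setting, which corresponds to the kind of parametric tuning the abstract contrasts with the sparse-coding kernel.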


Comparison of SVM diagnosis with the “sparse-kernel”, “linear”, “quadratic”, “polynomial”, multilayer perceptron (“mlp”), and “rbf” (with adjusted sigma value) kernels on six omics datasets in terms of average accuracy, sensitivity, specificity, and positive prediction ratio. The sparse kernel overcomes overfitting and gives the best diagnostic performance among the kernels compared. Each dataset is represented by its first letter except the Colorectal dataset, which is represented by “L.”
© Copyright Policy - open-access

Mentions: Figure 4 illustrates the average SVM diagnosis accuracy, sensitivity, specificity, and positive prediction ratio for the “sparse-kernel” and the other five kernels. SVM diagnosis with a sparse kernel not only overcomes overfitting but also consistently achieves nearly the best performance among all kernels, although the linear kernel reaches the same level of performance on the Cirrhosis and Medulloblastoma data. Moreover, where the two perform at the same level, the “sparse-kernel” SVM appears to have a lower standard deviation than the linear SVM. For example, both achieve 94.02% diagnostic accuracy, but the sparse kernel has a standard deviation of only 1.43% compared with 2.69% for the linear kernel.
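
The four measures reported in Figure 4 can be computed from a confusion matrix, as in the minimal sketch below. It assumes the usual definitions and, in particular, assumes that "positive prediction ratio" means TP / (TP + FP); the excerpt does not define it. The function name and the toy labels are hypothetical and are not data from the paper.

```python
def diagnostic_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity, and positive prediction ratio.

    Assumes binary labels; `positive` marks the disease class.
    A ratio with a zero denominator is reported as float('nan').
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),          # true positive rate
        "specificity": ratio(tn, tn + fp),          # true negative rate
        "positive_prediction_ratio": ratio(tp, tp + fp),  # assumed definition
    }


# Toy labels for illustration only (1 = disease, 0 = control).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = diagnostic_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Averaging these values over repeated train/test splits, and taking their standard deviation, would give summary figures of the kind quoted above, although the excerpt does not describe the exact resampling scheme used.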

