Overcome support vector machine diagnosis overfitting.

Han H, Jiang X - Cancer Inform (2014)

Bottom Line: We found that disease diagnosis under an SVM classifier would inevitably encounter overfitting under a Gaussian kernel because of the large data variations generated from high-throughput profiling technologies. To our knowledge, it is the first rigorous method proposed to overcome SVM overfitting. Finally, we propose a novel biomarker discovery algorithm: Gene-Switch-Marker (GSM) to capture meaningful biomarkers by taking advantage of SVM overfitting on single genes.


Affiliation: Department of Computer and Information Science, Fordham University, New York, NY, USA; Quantitative Proteomics Center, Columbia University, New York, NY, USA.

ABSTRACT
Support vector machines (SVMs) are widely employed in molecular diagnosis of disease for their efficiency and robustness. However, there is no previous research to analyze their overfitting in high-dimensional omics data based disease diagnosis, which is essential to avoid deceptive diagnostic results and enhance clinical decision making. In this work, we comprehensively investigate this problem from both theoretical and practical standpoints to unveil the special characteristics of SVM overfitting. We found that disease diagnosis under an SVM classifier would inevitably encounter overfitting under a Gaussian kernel because of the large data variations generated from high-throughput profiling technologies. Furthermore, we propose a novel sparse-coding kernel approach to overcome SVM overfitting in disease diagnosis. Unlike traditional ad-hoc parametric tuning approaches, it not only robustly conquers the overfitting problem, but also achieves good diagnostic accuracy. To our knowledge, it is the first rigorous method proposed to overcome SVM overfitting. Finally, we propose a novel biomarker discovery algorithm: Gene-Switch-Marker (GSM) to capture meaningful biomarkers by taking advantage of SVM overfitting on single genes.




f6-cin-suppl.1-2014-145: The minimum, first percentile, median, and maximum of the squared pairwise distances under the sparse kernel, and the eigenvalues of the “sparse-kernel” matrices, across six omics datasets.

Mentions: The sparse kernel conquers SVM overfitting because our sparse coding decreases the pairwise distances in each kernel matrix and, through the data localization mechanism of the sparse-coding kernels, turns the matrix into a more meaningful representative structure. Figure 6 illustrates the minimum, first percentile, median, and maximum values of the squared pairwise distances in the kernel matrices of the “sparse-kernel” SVM classifier for all samples in each omics dataset. Whereas the minimum squared pairwise distances are on the order of 10^2 under the original ‘rbf’ kernel, they fall in a much smaller interval under the sparse kernel for all datasets. This means the corresponding minimum non-diagonal entries (i ≠ j) lie between exp(−10^(−0.157466)/2) = 0.7061 and exp(−10^(−3.079699)/2) = 0.9996. In other words, the kernel matrices under the sparse kernel are representative and meaningful, unlike the original identity or isometric identity matrices.
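The kernel-entry arithmetic quoted above can be checked with a short sketch. It assumes a Gaussian kernel entry of the form exp(−d²/(2σ²)) with σ = 1, which is the form the quoted expressions imply; the function name `rbf_entry` and the `sigma` parameter are hypothetical and not taken from the paper.

```python
import math


def rbf_entry(dist_sq, sigma=1.0):
    """Gaussian kernel entry exp(-d^2 / (2 * sigma^2)) for a squared distance d^2.

    sigma = 1 is an assumption that matches the expressions quoted in the text.
    """
    return math.exp(-dist_sq / (2.0 * sigma ** 2))


# Squared distances quoted in the text, written as powers of ten.
sparse_low = rbf_entry(10 ** -0.157466)   # larger of the two sparse-kernel distances
sparse_high = rbf_entry(10 ** -3.079699)  # smaller of the two sparse-kernel distances
raw_rbf = rbf_entry(10 ** 2)              # order of magnitude under the original 'rbf' kernel

print(round(sparse_low, 4), round(sparse_high, 4), raw_rbf)
```

Under these assumptions the two sparse-kernel entries round to 0.7061 and 0.9996, while a squared distance of order 10^2 gives an off-diagonal entry that is numerically indistinguishable from zero, which is consistent with the text's remark that the original kernel matrices behave like identity matrices.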

