Overcome support vector machine diagnosis overfitting.

Han H, Jiang X - Cancer Inform (2014)

Bottom Line: However, there is no previous research to analyze their overfitting in high-dimensional omics data based disease diagnosis, which is essential to avoid deceptive diagnostic results and enhance clinical decision making. We found that disease diagnosis under an SVM classifier would inevitably encounter overfitting under a Gaussian kernel because of the large data variations generated from high-throughput profiling technologies. Finally, we propose a novel biomarker discovery algorithm: Gene-Switch-Marker (GSM) to capture meaningful biomarkers by taking advantage of SVM overfitting on single genes.

Affiliation: Department of Computer and Information Science, Fordham University, New York, NY, USA; Quantitative Proteomics Center, Columbia University, New York, NY, USA.

ABSTRACT
Support vector machines (SVMs) are widely employed in molecular diagnosis of disease for their efficiency and robustness. However, there is no previous research to analyze their overfitting in high-dimensional omics data based disease diagnosis, which is essential to avoid deceptive diagnostic results and enhance clinical decision making. In this work, we comprehensively investigate this problem from both theoretical and practical standpoints to unveil the special characteristics of SVM overfitting. We found that disease diagnosis under an SVM classifier would inevitably encounter overfitting under a Gaussian kernel because of the large data variations generated from high-throughput profiling technologies. Furthermore, we propose a novel sparse-coding kernel approach to overcome SVM overfitting in disease diagnosis. Unlike traditional ad-hoc parametric tuning approaches, it not only robustly conquers the overfitting problem, but also achieves good diagnostic accuracy. To our knowledge, it is the first rigorous method proposed to overcome SVM overfitting. Finally, we propose a novel biomarker discovery algorithm: Gene-Switch-Marker (GSM) to capture meaningful biomarkers by taking advantage of SVM overfitting on single genes.
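The Gaussian-kernel claim in the abstract can be illustrated with the standard kernel definition. This is a sketch only: the parameterization with $2\sigma^2$ in the denominator is an assumption, since the abstract does not state which convention the authors use.

```latex
k(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right),
\qquad k(x_i, x_i) = 1,
\qquad k(x_i, x_j) \to 0 \ \text{ as } \ \frac{\lVert x_i - x_j \rVert^2}{\sigma^2} \to \infty \quad (i \neq j).
```

When the squared distances between distinct samples are large relative to $\sigma^2$, which is plausible for high-dimensional profiles with large variation, the kernel matrix approaches the identity. Each training sample then resembles only itself, so the classifier can fit the training data while having little basis for classifying unseen samples.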

Figure 5: The contour plots of the kernel matrices of all six omics datasets under the sparse kernel. Most of the kernel matrices have entry values spanning more layers, which contributes to enhancing the SVM classifier’s diagnostic power because of the optimized kernel structures.


Mentions: Figure 5 illustrates the contour plots of the kernel matrices under the sparse kernel for the six omics datasets, where, for convenience of analysis, all samples in each dataset are treated as training data in the SVM diagnosis. The sparse coding avoids the identity or isometric identity kernel matrices associated with the ‘rbf’ kernel with bandwidth σ = 1 and yields a meaningful kernel matrix for each dataset. Moreover, most of the kernel matrices have entry values spanning more layers in the contour plot, which contributes to enhancing the SVM classifier’s diagnostic power. In contrast, kernel matrices whose entry values span a relatively small range may lead to low diagnostic performance. For example, the kernel matrix of the Breast data has most entries on or close to the surface z = 0.6, which corresponds to the lowest diagnostic accuracy among the six datasets.
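The identity kernel matrix mentioned for the ‘rbf’ kernel with σ = 1 can be reproduced numerically. The following is a minimal sketch on synthetic data, not the authors' sparse-coding kernel: the random matrix `X` is a hypothetical stand-in for an omics dataset, the helper names are illustrative, and the kernel convention exp(-||x_i - x_j||² / (2σ²)) is an assumption. The second bandwidth (σ = 1000) is included only to show what a non-identity kernel matrix looks like for the same data.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    The 2*sigma^2 convention is an assumption made for this sketch.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip small negatives from round-off
    np.fill_diagonal(sq_dists, 0.0)          # distance of a sample to itself is exactly 0
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def off_diagonal(K):
    """Return the off-diagonal entries of a square matrix as a flat array."""
    return K[~np.eye(K.shape[0], dtype=bool)]

rng = np.random.default_rng(0)
# Hypothetical data: 20 samples, 5000 features, large per-feature variation.
X = rng.normal(loc=0.0, scale=10.0, size=(20, 5000))

# Bandwidth sigma = 1: off-diagonal entries underflow to 0, so K is the identity.
K_unit = rbf_kernel_matrix(X, sigma=1.0)
print("sigma=1    max off-diagonal:", off_diagonal(K_unit).max())
print("sigma=1    K equals identity:", np.allclose(K_unit, np.eye(len(X))))

# Much larger bandwidth, for contrast only: entries move away from 0.
K_wide = rbf_kernel_matrix(X, sigma=1000.0)
wide = off_diagonal(K_wide)
print("sigma=1000 off-diagonal range:", wide.min(), "to", wide.max())
```

With σ = 1 every sample is similar only to itself, which is the degenerate structure the text describes. The minimum and maximum of the off-diagonal entries give a rough numeric counterpart to the "layers" spanned in a contour plot of the kernel matrix.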

