ccSVM: correcting Support Vector Machines for confounding factors in biological data classification.

Li L, Rakitsch B, Borgwardt K - Bioinformatics (2011)

Bottom Line: However, it is unclear how to correct for confounding factors such as population structure, age or gender or experimental conditions in Support Vector Machine classification. In this article, we present a Support Vector Machine classifier that can correct the prediction for observed confounding factors. In our experiments, our confounder correcting SVM (ccSVM) improves tumor diagnosis based on samples from different labs, tuberculosis diagnosis in patients of varying age, ethnicity and gender, and phenotype prediction in the presence of population structure and outperforms state-of-the-art methods in terms of prediction accuracy.


Affiliation: Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany. limin.li@tuebingen.mpg.de

ABSTRACT

Motivation: Classifying biological data into different groups is a central task of bioinformatics: for instance, to predict the function of a gene or protein, the disease state of a patient or the phenotype of an individual based on its genotype. Support Vector Machines are a widespread approach for classifying biological data, due to their high accuracy, their ability to deal with structured data such as strings, and the ease of integrating various types of data. However, it is unclear how to correct for confounding factors such as population structure, age or gender or experimental conditions in Support Vector Machine classification.

Results: In this article, we present a Support Vector Machine classifier that can correct the prediction for observed confounding factors. This is achieved by minimizing the statistical dependence between the classifier and the confounding factors. We prove that this formulation can be transformed into a standard Support Vector Machine with rescaled input data. In our experiments, our confounder correcting SVM (ccSVM) improves tumor diagnosis based on samples from different labs, tuberculosis diagnosis in patients of varying age, ethnicity and gender, and phenotype prediction in the presence of population structure and outperforms state-of-the-art methods in terms of prediction accuracy.

Availability: A ccSVM-implementation in MATLAB is available from http://webdav.tuebingen.mpg.de/u/karsten/Forschung/ISMB11_ccSVM/.

Contact: limin.li@tuebingen.mpg.de; karsten.borgwardt@tuebingen.mpg.de.



Figure 2: Gene expression levels are sorted according to the weight vector of ccSVM (blue dashed line) and according to the weight vector of standard SVM (green line). The correlation coefficient between each gene expression level and ethnic origin (African) is calculated. The averaged absolute correlation coefficient of the top i genes is plotted for gene i.

Mentions: Weight vector analysis: we examined the weight vector of the ccSVM to better understand its improved performance. Specifically, we trained the classifier on four ethnic groups and then used it to predict on a fifth. In Figure 2, we plot the averaged absolute correlation coefficients between membership in one ethnicity (African) and the expression levels of the 10 000 top-ranked genes. We can observe that the ccSVM assigns the largest weights to genes that do not correlate with the confounder, while the standard SVM is unaware of the confounder and puts large weight on the features that correlate with the confounding variable.
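
The curve described for Figure 2 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the correlation coefficient is Pearson's, that genes are ranked by decreasing absolute weight, and that the confounder is encoded as a 0/1 membership indicator. The function name `averaged_abs_correlation` and the toy data are hypothetical.

```python
import numpy as np

def averaged_abs_correlation(X, w, confounder):
    """Running mean of |corr(gene, confounder)| over genes ranked by |w|.

    X: (samples, genes) expression matrix; w: weight vector of a linear
    classifier; confounder: per-sample values (assumed here to be a 0/1
    membership indicator). Entry i-1 of the result is the average
    absolute Pearson correlation of the top i genes.
    """
    X = np.asarray(X, dtype=float)
    z = np.asarray(confounder, dtype=float)
    w = np.asarray(w, dtype=float)

    # Assumption: rank genes by decreasing absolute weight.
    order = np.argsort(-np.abs(w))

    # Pearson correlation of every gene with the confounder.
    Xc = X - X.mean(axis=0)
    zc = z - z.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (zc ** 2).sum())
    # Constant genes have an undefined correlation; treat it as 0.
    r = np.divide(Xc.T @ zc, denom, out=np.zeros_like(denom), where=denom > 0)

    abs_r = np.abs(r[order])
    return np.cumsum(abs_r) / np.arange(1, abs_r.size + 1)

# Toy data: gene 0 equals the confounder, gene 1 is uncorrelated with it.
z_toy = np.array([0, 1, 0, 1])
X_toy = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
curve_confounded = averaged_abs_correlation(X_toy, [2.0, 1.0], z_toy)
curve_corrected = averaged_abs_correlation(X_toy, [1.0, -2.0], z_toy)
```

A weight vector that ranks the confounded gene first yields a curve that starts high, whereas one that ranks the uncorrelated gene first starts low, which is the qualitative contrast the text reports between the standard SVM and the ccSVM.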

