Combining gene expression, demographic and clinical data in modeling disease: a case study of bipolar disorder and schizophrenia.

Struyf J, Dobrin S, Page D - BMC Genomics (2008)

Bottom Line: We also found that some variables in this data set, such as alcohol and drug use, are strongly associated with the diseases. These variables may affect gene expression and make it more difficult to identify genes that are directly associated with the diseases. Stratification can correct for such variables, but we show that this reduces the power of the statistical methods.


Affiliation: Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium. jan.struyf@cs.kuleuven.be

ABSTRACT

Background: This paper presents a retrospective statistical study on the newly released data set by the Stanley Neuropathology Consortium on gene expression in bipolar disorder and schizophrenia. This data set contains gene expression data as well as limited demographic and clinical data for each subject. Previous studies using statistical classification or machine learning algorithms have focused on gene expression data only. The present paper investigates whether such techniques can benefit from including demographic and clinical data.

Results: We compare six classification algorithms: support vector machines (SVMs), nearest shrunken centroids, decision trees, ensemble of voters, naïve Bayes, and nearest neighbor. SVMs outperform the other algorithms. Using expression data only, they yield an area under the ROC curve of 0.92 for bipolar disorder versus control, and 0.91 for schizophrenia versus control. By including demographic and clinical data, classification performance improves to 0.97 and 0.94, respectively.
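The area under the ROC curve (AUC) figures quoted above summarize how well a classifier's scores rank cases above controls. The paper does not state which software computed them; a minimal sketch using scikit-learn's `roc_auc_score` on illustrative data (the scores below are made up for demonstration, not taken from the study) looks like this:

```python
# Illustrative AUC computation; scores and labels are invented,
# not from the Stanley Neuropathology Consortium data set.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]                   # 0 = control, 1 = case
y_scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.9]    # classifier decision values

# AUC = probability that a randomly chosen case outranks a
# randomly chosen control; 7 of the 9 case/control pairs are
# ranked correctly here, giving 7/9.
auc = roc_auc_score(y_true, y_scores)
print(auc)  # ≈ 0.78 with these illustrative scores
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the reported 0.92–0.97 values indicate strong discrimination.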

Conclusion: This paper demonstrates that SVMs can distinguish bipolar disorder and schizophrenia from normal controls at a very high rate. Moreover, it shows that classification performance improves by including demographic and clinical data. We also found that some variables in this data set, such as alcohol and drug use, are strongly associated with the diseases. These variables may affect gene expression and make it more difficult to identify genes that are directly associated with the diseases. Stratification can correct for such variables, but we show that this reduces the power of the statistical methods.


Figure 1: Illustration of the (a) support vector machines, (b) nearest shrunken centroids, (c) decision trees, and (d) nearest neighbor methods.

Mentions: Support vector machines (SVMs, [4]) belong to the family of generalized linear models. We employ linear SVMs, which exhibit good classification performance on gene expression data [5]. A linear SVM is essentially an (n-1)-dimensional hyperplane that separates the instances of the two classes in the n-dimensional feature space. Figure 1a illustrates this for the two-dimensional case: the hyperplane reduces to a line, which separates the empty (class 1) and filled (class 2) circles. The hyperplane maximizes the margin to the closest training instances. These instances are called the "support vectors" because they fix the position and orientation of the hyperplane. Linear SVMs assume that the training data are linearly separable. If this is not the case, SVMs instead rely on the concept of a soft margin [6]. In the evaluation, we use a soft-margin SVM, which, in addition to maximizing the margin, minimizes the sum of the distances to the training instances that are misclassified by the hyperplane (the d_i in Figure 1a).
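The soft-margin linear SVM described above can be sketched in a few lines. The paper does not name an implementation; scikit-learn's `SVC` with a linear kernel is assumed here, and the toy 2-D data below merely stand in for two classes in the spirit of Figure 1a. The `C` parameter controls the soft-margin trade-off: smaller values tolerate larger misclassification distances d_i in exchange for a wider margin.

```python
# Minimal soft-margin linear SVM sketch (scikit-learn assumed;
# toy data, not the gene expression data from the study).
import numpy as np
from sklearn.svm import SVC

# Two roughly separable 2-D clusters standing in for the empty
# (class 0) and filled (class 1) circles of Figure 1a.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, size=(20, 2)),
               rng.normal(+1.0, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

# kernel="linear" fits a separating hyperplane (a line in 2-D);
# C trades margin width against the misclassification distances d_i.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The support vectors are the training points that fix the
# position and orientation of the hyperplane.
print("support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```

In practice the decision boundary is determined only by the support vectors; removing any non-support-vector training point leaves the fitted hyperplane unchanged.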

