Combining gene expression, demographic and clinical data in modeling disease: a case study of bipolar disorder and schizophrenia.

Struyf J, Dobrin S, Page D - BMC Genomics (2008)

Bottom Line: We also found that some variables in this data set, such as alcohol and drug use, are strongly associated with the diseases. These variables may affect gene expression and make it more difficult to identify genes that are directly associated with the diseases. Stratification can correct for such variables, but we show that this reduces the power of the statistical methods.

Affiliation: Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium. jan.struyf@cs.kuleuven.be

ABSTRACT

Background: This paper presents a retrospective statistical study of the data set newly released by the Stanley Neuropathology Consortium on gene expression in bipolar disorder and schizophrenia. This data set contains gene expression data as well as limited demographic and clinical data for each subject. Previous studies using statistical classification or machine learning algorithms have focused on gene expression data only. The present paper investigates whether such techniques can benefit from including demographic and clinical data.

Results: We compare six classification algorithms: support vector machines (SVMs), nearest shrunken centroids, decision trees, ensemble of voters, naïve Bayes, and nearest neighbor. SVMs outperform the other algorithms. Using expression data only, they yield an area under the ROC curve of 0.92 for bipolar disorder versus control, and 0.91 for schizophrenia versus control. When demographic and clinical data are included, these values improve to 0.97 and 0.94, respectively.
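
For readers unfamiliar with the metric reported above, the area under the ROC curve (AUC) can be read as the probability that a classifier scores a randomly chosen case higher than a randomly chosen control. The sketch below computes it that way in plain Python. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations, not the paper's data or the authors' code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for a case (e.g. a disease subject), 0 for a control.
    scores: classifier outputs; higher means "more likely a case".
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: three cases and three controls.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 case/control pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of cases from controls.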

Conclusion: This paper demonstrates that SVMs can distinguish bipolar disorder and schizophrenia from normal control with high accuracy (area under the ROC curve above 0.9). Moreover, it shows that classification performance improves when demographic and clinical data are included. We also found that some variables in this data set, such as alcohol and drug use, are strongly associated with the diseases. These variables may affect gene expression and make it more difficult to identify genes that are directly associated with the diseases. Stratification can correct for such variables, but we show that this reduces the power of the statistical methods.
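
One intuition for why stratification costs statistical power is that each stratum contains only a fraction of the subjects, and a variable that is strongly associated with the disease leaves the strata unbalanced. The sketch below counts diagnoses per stratum for a made-up cohort. The field names (`diagnosis`, `alcohol_use`), the helper `stratum_sizes`, and the eight subjects are assumptions for illustration only and do not come from the Stanley data set.

```python
from collections import Counter, defaultdict


def stratum_sizes(subjects, confounder):
    """Count diagnoses within each level of a confounding variable."""
    counts = defaultdict(Counter)
    for subject in subjects:
        counts[subject[confounder]][subject["diagnosis"]] += 1
    return {level: dict(c) for level, c in counts.items()}


# Hypothetical cohort: alcohol use is strongly associated with diagnosis.
subjects = [
    {"diagnosis": "schizophrenia", "alcohol_use": "heavy"},
    {"diagnosis": "schizophrenia", "alcohol_use": "heavy"},
    {"diagnosis": "schizophrenia", "alcohol_use": "heavy"},
    {"diagnosis": "schizophrenia", "alcohol_use": "low"},
    {"diagnosis": "control", "alcohol_use": "heavy"},
    {"diagnosis": "control", "alcohol_use": "low"},
    {"diagnosis": "control", "alcohol_use": "low"},
    {"diagnosis": "control", "alcohol_use": "low"},
]

sizes = stratum_sizes(subjects, "alcohol_use")
# Unstratified: 4 cases versus 4 controls.
# Within each stratum: only 4 subjects, split 3 to 1.
```

Any case-versus-control comparison made inside a single stratum in this toy cohort rests on four subjects rather than eight, which is the kind of loss of power the conclusion refers to.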

Figure 15: Biological network representing the schizophrenia p-value ranking. The network was generated using Ingenuity Systems Pathway analysis. The darker the red, the more significant the correlation with the disease.

Mentions: Interestingly, most of the remaining genes in the list are known to interact with genes that have a documented association with either bipolar disorder or schizophrenia. These interactions were determined using Ingenuity Systems software. Of the 20 genes in the schizophrenia sample, 14 are involved in the same biological pathway (Figure 15). Combining the two networks generated by the software package via 3 overlapping genes places 19 of the 20 genes in a single biological network. Similarly, 13 of the 20 genes are in a single pathway for bipolar disorder (Figure 16). Combining two of the 3 generated pathways through 3 overlapping genes yields a biological network that covers 16 of the 20 genes on the list.

