Limits...
Pathway activity inference for multiclass disease classification through a mathematical programming optimisation framework.

Yang L, Ainali C, Tsoka S, Papageorgiou LG - BMC Bioinformatics (2014)

Bottom Line: We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature.Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user.Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.

View Article: PubMed Central - PubMed

Affiliation: Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK. lingjian.yang.10@ucl.ac.uk.

ABSTRACT

Background: Applying machine learning methods on microarray gene expression profiles for disease classification problems is a popular method to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches where expression of genes were treated independently suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in literature are either unsupervised or applied on two-class datasets, there is good scope to address such limitations by proposing novel methodologies.

Results: A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of expression of its constituent genes. Gene weights are determined by the optimisation model, in a way that the resulting pathway activity has the optimal discriminative power with regards to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile.

Conclusions: The model was evaluated through a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.

Show MeSH

Related in: MedlinePlus

Classification accuracy comparison of 7 competing methods using 5-NN (A) and NN (B) classifiers. The proposed DIGS pathway activity inference method is compared against other pathway activity inference methods (Mean, Median, PCA and CORGs) and also genes-based methods (SG and per_pathway). Classification accuracy is summarised as average prediction rates over 50 runs of random partition of datasets into a 70% training set and a 30% testing set. With 5-NN classifier (A), it is evident that DIGS outperforms other methods by some distance as topping the chart on 6 datasets (Singh, Popovici, Desmedt, Swindell, Farmer and Pawitan) while being tied 1st on the other 2 datasets (Shipp and Yao). Prediction rates achieved by DIGS are generally high, over 80% in most datasets, which facilities its application in real world. With NN classifier (B), the same trend can be observed that prediction accuracies achieved by DIGS at least matches the state-of-the-arts methods in literature for binary disease classification problems, while consistently outperforms the competing methods for multi-phenotype problems.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4269079&req=5

Fig3: Classification accuracy comparison of 7 competing methods using 5-NN (A) and NN (B) classifiers. The proposed DIGS pathway activity inference method is compared against other pathway activity inference methods (Mean, Median, PCA and CORGs) and also genes-based methods (SG and per_pathway). Classification accuracy is summarised as average prediction rates over 50 runs of random partition of datasets into a 70% training set and a 30% testing set. With 5-NN classifier (A), it is evident that DIGS outperforms other methods by some distance as topping the chart on 6 datasets (Singh, Popovici, Desmedt, Swindell, Farmer and Pawitan) while being tied 1st on the other 2 datasets (Shipp and Yao). Prediction rates achieved by DIGS are generally high, over 80% in most datasets, which facilities its application in real world. With NN classifier (B), the same trend can be observed that prediction accuracies achieved by DIGS at least matches the state-of-the-arts methods in literature for binary disease classification problems, while consistently outperforms the competing methods for multi-phenotype problems.

Mentions: The performance of the proposed DIGS model against other competing methods in literature is compared and discussed here. As described in the Methods section, extensive comparisons were implemented across 8 datasets (collectively referred to as dataset) and 7 competing methods (method). To also account for the effect of classifier choice in the computational procedure, we tested the DIGS model across 5 classifiers (classifier). The results across all dataset, method and classifier combination are illustrated in FigureĀ 3A and B (for 5-NN and Neural Network classifiers) and in the Additional file 3 (for SMO, HB and logistic).Figure 3


Pathway activity inference for multiclass disease classification through a mathematical programming optimisation framework.

Yang L, Ainali C, Tsoka S, Papageorgiou LG - BMC Bioinformatics (2014)

Classification accuracy comparison of 7 competing methods using 5-NN (A) and NN (B) classifiers. The proposed DIGS pathway activity inference method is compared against other pathway activity inference methods (Mean, Median, PCA and CORGs) and also genes-based methods (SG and per_pathway). Classification accuracy is summarised as average prediction rates over 50 runs of random partition of datasets into a 70% training set and a 30% testing set. With 5-NN classifier (A), it is evident that DIGS outperforms other methods by some distance as topping the chart on 6 datasets (Singh, Popovici, Desmedt, Swindell, Farmer and Pawitan) while being tied 1st on the other 2 datasets (Shipp and Yao). Prediction rates achieved by DIGS are generally high, over 80% in most datasets, which facilities its application in real world. With NN classifier (B), the same trend can be observed that prediction accuracies achieved by DIGS at least matches the state-of-the-arts methods in literature for binary disease classification problems, while consistently outperforms the competing methods for multi-phenotype problems.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4269079&req=5

Fig3: Classification accuracy comparison of 7 competing methods using 5-NN (A) and NN (B) classifiers. The proposed DIGS pathway activity inference method is compared against other pathway activity inference methods (Mean, Median, PCA and CORGs) and also genes-based methods (SG and per_pathway). Classification accuracy is summarised as average prediction rates over 50 runs of random partition of datasets into a 70% training set and a 30% testing set. With 5-NN classifier (A), it is evident that DIGS outperforms other methods by some distance as topping the chart on 6 datasets (Singh, Popovici, Desmedt, Swindell, Farmer and Pawitan) while being tied 1st on the other 2 datasets (Shipp and Yao). Prediction rates achieved by DIGS are generally high, over 80% in most datasets, which facilities its application in real world. With NN classifier (B), the same trend can be observed that prediction accuracies achieved by DIGS at least matches the state-of-the-arts methods in literature for binary disease classification problems, while consistently outperforms the competing methods for multi-phenotype problems.
Mentions: The performance of the proposed DIGS model against other competing methods in literature is compared and discussed here. As described in the Methods section, extensive comparisons were implemented across 8 datasets (collectively referred to as dataset) and 7 competing methods (method). To also account for the effect of classifier choice in the computational procedure, we tested the DIGS model across 5 classifiers (classifier). The results across all dataset, method and classifier combination are illustrated in FigureĀ 3A and B (for 5-NN and Neural Network classifiers) and in the Additional file 3 (for SMO, HB and logistic).Figure 3

Bottom Line: We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature.Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user.Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.

View Article: PubMed Central - PubMed

Affiliation: Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK. lingjian.yang.10@ucl.ac.uk.

ABSTRACT

Background: Applying machine learning methods on microarray gene expression profiles for disease classification problems is a popular method to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches where expression of genes were treated independently suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in literature are either unsupervised or applied on two-class datasets, there is good scope to address such limitations by proposing novel methodologies.

Results: A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of expression of its constituent genes. Gene weights are determined by the optimisation model, in a way that the resulting pathway activity has the optimal discriminative power with regards to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile.

Conclusions: The model was evaluated through a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.

Show MeSH
Related in: MedlinePlus