Pathway activity inference for multiclass disease classification through a mathematical programming optimisation framework.

Yang L, Ainali C, Tsoka S, Papageorgiou LG - BMC Bioinformatics (2014)

Bottom Line: We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.


Affiliation: Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK. lingjian.yang.10@ucl.ac.uk.

ABSTRACT

Background: Applying machine learning methods to microarray gene expression profiles for disease classification problems is a popular way to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches, in which the expression of each gene is treated independently, suffer from low prediction accuracy and are difficult to interpret biologically. Current research efforts focus on integrating information on protein interactions, through biochemical pathway datasets, with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in the literature are either unsupervised or applied to two-class datasets, there is good scope to address such limitations by proposing novel methodologies.

Results: A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of the expression of its constituent genes. Gene weights are determined by the optimisation model so that the resulting pathway activity has optimal discriminative power with regard to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile.

Conclusions: The model was evaluated through a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.


Figure 1: Schematic flow chart of the DIGS-based approach for multiclass disease classification problems. Pathway-specific gene expression profiles are created by integrating the gene expression profile with pathway information. For each pathway, pathway activity is built as a weighted linear summation of the expression of member genes, where the weights are the model variables and the objective function maximises the number of samples whose pathway activity lies inside the range of their own class. The maximum number of member genes in a pathway allowed to have non-zero weights is explicitly constrained in the model by specifying the parameter NoG. The pathway activity profile is created by assembling all pathway activities; a classifier is then trained on this profile and predicts the class label of a new sample. Note that the training procedure, i.e., inferring pathway activity and training a classifier, is always blind to testing samples, so that classification performance is evaluated objectively.
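
The objective and the NoG constraint described in the caption can be illustrated with a small Python sketch. It only evaluates a candidate solution; it does not solve the mathematical programming model, in which the gene weights are decision variables. The function names, the `class_ranges` dictionary of (lower, upper) intervals, and the tolerance `tol` are hypothetical and chosen for illustration only.

```python
def count_inside_own_range(activities, labels, class_ranges):
    """Count samples whose pathway activity lies inside the interval of
    their own class (the quantity the caption says is maximised)."""
    return sum(
        1
        for pa, label in zip(activities, labels)
        if class_ranges[label][0] <= pa <= class_ranges[label][1]
    )


def respects_nog(weights, nog, tol=1e-9):
    """True if at most `nog` member genes have a non-zero weight."""
    return sum(1 for w in weights if abs(w) > tol) <= nog


# Toy data (assumed, not from the paper): five samples, three classes.
activities = [-1.2, -0.9, 0.1, 1.1, 0.2]
labels = ["A", "A", "B", "C", "C"]
class_ranges = {"A": (-2.0, -0.5), "B": (-0.5, 0.5), "C": (0.5, 2.0)}

n_inside = count_inside_own_range(activities, labels, class_ranges)
ok_two_genes = respects_nog([0.5, 0.0, -0.3], nog=2)
too_many_genes = respects_nog([0.5, 0.1, -0.3], nog=2)
```

In this toy case the last sample (class C, activity 0.2) falls outside its class interval, so four of the five samples are counted.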

An overview of the computational procedure developed for pathway activity-based disease classification is illustrated in Figure 1. A microarray gene expression profile and a set of pathways with their constituent genes form the input used to create pathway-specific gene expression matrices. For each pathway, m indexes member genes, s indexes samples, and A_sm is the expression value of gene m in sample s. The values A_sm are standardised to G_sm by subtracting the population mean from the raw value and then dividing by the standard deviation. The first stage of our computational procedure derives a new composite feature, pathway activity pa_s, from the standardised pathway-specific gene expression profile G_sm. In the second stage of our protocol, the inferred pathway activities for all pathways are assembled to form a pathway activity profile matrix, on which a classifier is trained to predict the phenotype of a new sample. In the next section, we present a novel mathematical model that infers pathway activity with optimal classification accuracy.
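
The two stages described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: standardisation is assumed to be per gene across samples using the population standard deviation, the gene weights are supplied by hand rather than inferred by the optimisation model, and all function and pathway names are hypothetical.

```python
from statistics import mean, pstdev


def standardise(A):
    """Per-gene z-score: G[s][m] = (A[s][m] - mean_m) / sd_m.
    Assumes rows are samples s and columns are member genes m."""
    n_genes = len(A[0])
    columns = [[row[m] for row in A] for m in range(n_genes)]
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        [(row[m] - means[m]) / sds[m] if sds[m] > 0 else 0.0
         for m in range(n_genes)]
        for row in A
    ]


def pathway_activity(G, weights):
    """pa_s = sum over m of w_m * G_sm, one value per sample s."""
    return [sum(w * g for w, g in zip(weights, row)) for row in G]


def activity_profile(expr_by_pathway, weights_by_pathway):
    """Assemble a samples-by-pathways matrix of pathway activities."""
    names = sorted(expr_by_pathway)
    columns = [
        pathway_activity(standardise(expr_by_pathway[n]), weights_by_pathway[n])
        for n in names
    ]
    return names, [list(row) for row in zip(*columns)]


# Toy input (assumed): three samples, two pathways with 2 and 1 member genes.
expr = {
    "pathway_X": [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
    "pathway_Y": [[5.0], [5.0], [8.0]],
}
weights = {"pathway_X": [0.5, 0.5], "pathway_Y": [1.0]}  # hand-picked, not optimised
names, profile = activity_profile(expr, weights)
```

Each row of `profile` is one sample and each column one pathway; a classifier of choice would then be trained on these rows. To keep evaluation blind to testing samples, the means and standard deviations would presumably be computed on training samples only, a detail this sketch leaves out.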


