Identification and optimization of classifier genes from multi-class earthworm microarray dataset.

Li Y, Wang N, Perkins EJ, Zhang C, Gong P - PLoS ONE (2010)

Bottom Line: A variety of toxicological effects have been associated with explosive compounds TNT and RDX. We have developed an earthworm microarray containing 15,208 unique oligo probes and have used it to profile gene expression in 248 earthworms exposed to TNT, RDX or neither. This study demonstrates that the machine learning approach can be used to identify and optimize a small subset of classifier/biomarker genes from high dimensional datasets and generate classification models of acceptable precision for multiple classes.


Affiliation: School of Computing, University of Southern Mississippi, Hattiesburg, Mississippi, United States of America.

ABSTRACT
Monitoring, assessment and prediction of environmental risks that chemicals pose demand rapid and accurate diagnostic assays. A variety of toxicological effects have been associated with explosive compounds TNT and RDX. One important goal of microarray experiments is to discover novel biomarkers for toxicity evaluation. We have developed an earthworm microarray containing 15,208 unique oligo probes and have used it to profile gene expression in 248 earthworms exposed to TNT, RDX or neither. We assembled a new machine learning pipeline consisting of several well-established feature filtering/selection and classification techniques to analyze the 248-array dataset in order to construct classifier models that can separate earthworm samples into three groups: control, TNT-treated, and RDX-treated. First, a total of 869 genes differentially expressed in response to TNT or RDX exposure were identified using a univariate statistical algorithm of class comparison. Then, decision tree-based algorithms were applied to select a subset of 354 classifier genes, which were ranked by their overall weight of significance. A multiclass support vector machine (MC-SVM) method and an unsupervised K-means clustering method were applied to independently refine the classifier, producing smaller subsets of 39 and 30 classifier genes, respectively; the 11 genes common to both subsets are potential biomarkers. The combined 58 genes were considered the refined subset and used to build MC-SVM and clustering models with classification accuracy of 83.5% and 56.9%, respectively. This study demonstrates that the machine learning approach can be used to identify and optimize a small subset of classifier/biomarker genes from high dimensional datasets and generate classification models of acceptable precision for multiple classes.
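The gene counts in the abstract are consistent with a simple set union: 39 MC-SVM genes plus 30 clustering genes, minus the 11 shared by both, gives the 58 combined genes. The sketch below only illustrates that arithmetic. The gene identifiers are placeholders invented for the example, not probe names from the study.

```python
# Placeholder identifiers (hypothetical); only the set sizes come from the abstract.
shared = {f"shared_{i}" for i in range(11)}            # 11 genes found by both methods
svm_genes = shared | {f"svm_only_{i}" for i in range(28)}        # 39 genes in total
kmeans_genes = shared | {f"kmeans_only_{i}" for i in range(19)}  # 30 genes in total

combined = svm_genes | kmeans_genes   # union of the two refined subsets
common = svm_genes & kmeans_genes     # candidate biomarkers found by both methods

print(len(svm_genes), len(kmeans_genes), len(common), len(combined))
```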


Figure 2. The number and overlap of significant genes statistically inferred from class comparisons for (a) TNT and (b) RDX treatments. TNT/RDX-Control: two-class comparison between pooled controls and pooled TNT/RDX treatments; TNT/RDX-D4orig: multiple-class comparison of 4-day TNT/RDX treatments including the control group; TNT/RDX-D4Rpt: multiple-class comparison of 4-day repeat TNT/RDX treatments including the control group; TNT/RDX-D14: multiple-class comparison of 14-day TNT/RDX treatments including the control group (also see Materials and Methods).

Mentions: Differentially expressed genes were inferred by univariate statistical analysis. At the same level of statistical stringency, the significant gene lists derived from four different comparisons for either TNT or RDX shared very few common genes (Fig. 2), suggesting that different genes may be significantly altered under different conditions. To validate these results, we used ANOVA in GeneSpring GX 10 to analyze the same dataset, applying the Benjamini-Hochberg method for multiple testing correction and a cut-off of 1.5-fold change. By allowing the cut-off p-value threshold to vary, the same number of top significant genes could be derived from the same comparisons as with BRB-ArrayTools. The two sets of significant gene lists share 85 to 95% of their genes (data not shown), indicating a high level of statistical reproducibility. The difference in the resulting gene lists may be primarily attributed to the use of a 1.5-fold change as the cut-off level by GeneSpring. A total of 869 unique genes were obtained after combining all significantly changed gene lists from TNT and RDX exposures. The expression information of these 869 transcripts in all 248 earthworm samples is provided in Table S1.
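As a rough illustration of the validation step described above, the following sketch applies a Benjamini-Hochberg adjustment and a 1.5-fold change filter, then measures how much two gene lists overlap. It is a minimal sketch under stated assumptions, not the GeneSpring or BRB-ArrayTools implementation: the function names, the `alpha` default, the treatment of fold change as a symmetric ratio, and the overlap definition (shared genes relative to the smaller list) are all choices made for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(genes, p_values, fold_changes, alpha=0.05, min_fold=1.5):
    """Keep genes passing both the adjusted p-value and fold-change cut-offs.

    fold_changes are assumed to be positive treated/control ratios, so a
    ratio of 0.5 counts as a 2-fold change (down-regulation).
    """
    adjusted = benjamini_hochberg(p_values)
    return {
        gene
        for gene, q, fc in zip(genes, adjusted, fold_changes)
        if q <= alpha and max(fc, 1.0 / fc) >= min_fold
    }


def percent_overlap(list_a, list_b):
    """Shared genes as a percentage of the smaller list (assumed definition)."""
    a, b = set(list_a), set(list_b)
    return 100.0 * len(a & b) / min(len(a), len(b))
```

With two equally sized lists, as in the comparison described above, `percent_overlap` reduces to the fraction of genes the lists have in common.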

