Predicting qualitative phenotypes from microarray data - the Eadgene pig data set.

Robert-Granié C, Lê Cao KA, Sancristobal M - BMC Proc (2009)

Bottom Line: The sparse PLS outperformed the PLS in terms of prediction performance and improved the interpretability of the results. We recommend the use of machine learning methods such as Random Forest and multivariate methods such as sparse PLS for prediction purposes. Both approaches are well adapted to transcriptomic data where the number of features is much greater than the number of individuals.


Affiliation: INRA, UR631 Station d'Amélioration Génétique des Animaux, F-31326 Castanet-Tolosan, France. christele.robert-granie@toulouse.inra.fr

ABSTRACT

Background: The aim of this work was to study the performances of 2 predictive statistical tools on a data set that was given to all participants of the Eadgene-SABRE Post Analyses Working Group, namely the Pig data set of Hazard et al. (2008). The data consisted of 3686 gene expressions measured on 24 animals partitioned in 2 genotypes and 2 treatments. The objective was to find biomarkers that characterized the genotypes and the treatments in the whole set of genes.

Methods: We first considered the Random Forest approach that enables the selection of predictive variables. We then compared the classical Partial Least Squares regression (PLS) with a novel approach called sparse PLS, a variant of PLS that adapts lasso penalization and allows for the selection of a subset of variables.

Results: All methods performed well on this data set. The sparse PLS outperformed the PLS in terms of prediction performance and improved the interpretability of the results.

Conclusion: We recommend the use of machine learning methods such as Random Forest and multivariate methods such as sparse PLS for prediction purposes. Both approaches are well adapted to transcriptomic data where the number of features is much greater than the number of individuals.
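The variable selection mentioned under Methods can be illustrated with soft-thresholding, a standard way of applying a lasso-type penalty to a vector of loading weights. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the penalty value and the five loading weights are hypothetical and chosen only for illustration.

```python
import math


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; anything within +/- penalty becomes 0."""
    magnitude = abs(value) - penalty
    if magnitude <= 0:
        return 0.0
    return math.copysign(magnitude, value)


def sparsify_loadings(loadings, penalty):
    """Soft-threshold every loading weight, then rescale to unit Euclidean norm."""
    shrunk = [soft_threshold(w, penalty) for w in loadings]
    norm = math.sqrt(sum(w * w for w in shrunk))
    if norm == 0:
        return shrunk  # penalty removed every variable
    return [w / norm for w in shrunk]


# Hypothetical loading weights for five genes (illustrative values only).
loadings = [0.70, -0.50, 0.10, -0.05, 0.40]
sparse = sparsify_loadings(loadings, penalty=0.2)

# Genes whose weight survived the penalty are the selected variables.
selected = [i for i, w in enumerate(sparse) if w != 0.0]
print(selected)
```

Weights smaller in absolute value than the penalty are set exactly to zero, which is what turns a dense loading vector into a short list of selected genes. A larger penalty selects fewer genes.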


Figure 1: Comparison of significance level (-log10 of the p-value in the differential analysis) with the importance measure of Random Forest. The genes above the horizontal line are differentially expressed genes (t test), whereas the genes on the right-hand side of the vertical line are declared as most important and highly predictive by Random Forest.

Mentions: Comparing the significance level of genes (-log10 of the p-value of the Fisher test in the differential analysis performed in [1]) with the importance given by RF (Mean Decrease Accuracy measure), we obtained a relatively high correlation between the two measures (Figure 1). However, some of the most significant genes were not the most important (and vice versa). Horizontal and vertical lines in Figure 1 were drawn to highlight the differences between the selections made with the Fisher test and with Random Forest. Generally, such a high correlation between the results of these two approaches was not encountered in other data sets [11], and may be explained here by the high proportion of differential genes with additive effects on genotype and treatment.
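The comparison described above can be sketched as follows: given a significance score and an importance score per gene, compute a rank correlation between the two and split the genes by two cutoffs, which play the role of the horizontal and vertical lines in Figure 1. This is only an illustrative sketch. The gene names, scores and cutoffs are hypothetical values, not data from the study, and Spearman correlation is an assumption, since the text does not say which correlation measure was used.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def compare_selections(genes, neg_log10_p, importance, p_cutoff, imp_cutoff):
    """Group genes by which of the two criteria selects them."""
    groups = {"both": [], "test_only": [], "forest_only": [], "neither": []}
    for gene, score, imp in zip(genes, neg_log10_p, importance):
        significant = score > p_cutoff   # above the horizontal line
        important = imp > imp_cutoff     # right of the vertical line
        if significant and important:
            groups["both"].append(gene)
        elif significant:
            groups["test_only"].append(gene)
        elif important:
            groups["forest_only"].append(gene)
        else:
            groups["neither"].append(gene)
    return groups


# Hypothetical scores for six genes (illustrative values only).
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
neg_log10_p = [6.1, 4.8, 3.9, 1.2, 0.4, 2.5]
importance = [0.9, 0.3, 0.7, 0.1, 0.05, 0.6]

rho = spearman(neg_log10_p, importance)
groups = compare_selections(genes, neg_log10_p, importance,
                            p_cutoff=3.0, imp_cutoff=0.5)
print(round(rho, 3), groups)
```

Genes falling in the "test_only" or "forest_only" groups correspond to the cases noted in the text, where a significant gene is not among the most important ones or the reverse.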

