A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data.

Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W, Hamprecht FA - BMC Bioinformatics (2009)

Bottom Line: The random forest classifier with its associated Gini feature importance allows for an explicit feature elimination, but may not be optimally adapted to spectral data due to the topology of its constituent classification trees, which are based on orthogonal splits in feature space. The Gini importance of the random forest provided a superior means for measuring feature relevance on spectral data, but - on an optimal subset of features - the regularized classifiers might be preferable over the random forest classifier, despite being limited to modeling linear dependencies. A feature selection based on Gini importance, however, may precede a regularized linear classification to identify this optimal subset of features, gaining the double benefit of dimensionality reduction and the elimination of noise from the classification task.


Affiliation: Interdisciplinary Center for Scientific Computing (IWR), University of Heidelberg, Heidelberg, Germany. bjoern.menze@iwr.uni-heidelberg.de

ABSTRACT

Background: Regularized regression methods such as principal component or partial least squares regression perform well in learning tasks on high-dimensional spectral data, but cannot explicitly eliminate irrelevant features. The random forest classifier with its associated Gini feature importance, on the other hand, allows for an explicit feature elimination, but may not be optimally adapted to spectral data due to the topology of its constituent classification trees, which are based on orthogonal splits in feature space.

Results: We propose combining the best of both approaches, and evaluated the joint use of a recursive feature elimination based on the Gini importance of random forests together with regularized classification methods on spectral data sets from medical diagnostics, chemotaxonomy, biomedical analytics, and food science, as well as on synthetically modified spectral data. Here, a feature selection using the Gini feature importance, followed by a regularized classification by discriminant partial least squares regression, performed as well as or better than a filtering according to different univariate statistical tests, or a backward feature elimination using regression coefficients. It outperformed both the direct application of the random forest classifier and the direct application of the regularized classifiers on the full set of features.

Conclusion: The Gini importance of the random forest provided a superior means for measuring feature relevance on spectral data, but - on an optimal subset of features - the regularized classifiers might be preferable over the random forest classifier, despite being limited to modeling linear dependencies. A feature selection based on Gini importance, however, may precede a regularized linear classification to identify this optimal subset of features, gaining the double benefit of dimensionality reduction and the elimination of noise from the classification task.
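The selection step described in the abstract can be sketched as a generic recursive feature elimination loop. This is only a minimal illustration under stated assumptions, not the authors' implementation: the names `recursive_feature_elimination`, `stump_gini_importance`, `drop_fraction` and `min_features` are hypothetical, and the importance function used here is a single-split Gini decrease standing in for the Gini importance of a full random forest, which the paper would recompute at each elimination step. The final regularized classifier (e.g., discriminant PLS) is not shown.

```python
def stump_gini_importance(X, y):
    """Stand-in importance (assumption): best single-split Gini decrease per column.

    This is NOT a random forest; a forest would accumulate such decreases
    over many nodes and trees.
    """
    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        cuts = sorted(set(col))
        best = 0.0
        # Candidate thresholds are midpoints between neighbouring distinct values.
        for a, b in zip(cuts, cuts[1:]):
            t = (a + b) / 2
            left = [lab for v, lab in zip(col, y) if v <= t]
            right = [lab for v, lab in zip(col, y) if v > t]
            gain = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            best = max(best, gain)
        scores.append(best)
    return scores


def recursive_feature_elimination(X, y, importance_fn, drop_fraction=0.2, min_features=1):
    """Repeatedly rescore the remaining features and drop the least important ones.

    Returns (retained feature indices, eliminated indices in order of removal).
    `drop_fraction` and `min_features` are illustrative parameters.
    """
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > min_features:
        # Importance is recomputed on the currently retained features only.
        sub = [[row[j] for j in active] for row in X]
        scores = importance_fn(sub, y)
        n_drop = max(1, int(drop_fraction * len(active)))
        n_drop = min(n_drop, len(active) - min_features)
        order = sorted(range(len(active)), key=lambda k: scores[k])
        dropped = [active[k] for k in order[:n_drop]]
        eliminated.extend(dropped)
        active = [j for j in active if j not in dropped]
    return active, eliminated
```

The order of elimination gives a feature ranking. In the setting of the abstract, a regularized linear classifier would then be trained on each retained subset to locate the best-performing one.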



Figure 8: The effect of different noise processes on the performance of the feature selection methods in the synthetic bivariate classification problem illustrated in Fig. 1. In the left column, feature vectors are extended by a random variable scaled by S; in the right column, a random offset of size S is added to the feature vectors. Top row: classification accuracy of the synthetic two-class problem (as in Fig. 7, for comparison); second row: multivariate Gini importance; bottom row: p-values of the univariate t-test. The black lines correspond to the values of the two features spanning the bivariate classification task (Fig. 1); the blue dotted line corresponds to the third feature in the synthetic data set, the random variable. The performance of the random forest remains nearly unchanged even in the presence of a strong source of "local" noise for high values of S.
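As a rough illustration of what the Gini importance in the second row measures, the sketch below computes the decrease in Gini impurity produced by a single orthogonal (axis-aligned) split on one feature. In a random forest, this decrease is accumulated over all nodes and trees that split on the feature; the sketch covers one split only, and the function names are hypothetical rather than taken from the paper.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease for the orthogonal split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


def best_decrease(values, labels):
    """Largest decrease over all midpoints between sorted distinct feature values."""
    distinct = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max((gini_decrease(values, labels, t) for t in thresholds), default=0.0)
```

For two balanced classes, a feature that separates them perfectly reaches the maximum decrease of 0.5, while a constant feature offers no split and scores 0.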

Mentions: In the presence of increasing additive noise, both the univariate and the multivariate (i.e., the Gini importance) feature importance measures eventually lost their power to discriminate between relevant and random variables (Fig. 8DF). The Gini importance retained discriminative power somewhat longer before finally converging to a similar value for all three variables, which correlates well with a random classification and an (equally) random assignment of feature importance (Fig. 8D). When a source of local random noise was introduced and the data were normalized accordingly, the univariate tests degraded to random output (Fig. 8E), while the Gini importance measure (Fig. 8CE) virtually ignored the presence and upscaling of the non-discriminatory variable (as did the random forest classifier in Fig. 7ACE).
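The loss of univariate discriminative power under additive noise can be illustrated with a small simulation. This is a sketch under explicit assumptions, not a reproduction of the paper's experiment: it uses a single discriminative feature with unit-variance Gaussian classes two units apart, adds independent Gaussian offsets with standard deviation `offset_scale` (standing in for S), and reports Welch's t statistic rather than a p-value. The function names, sample size and seed are hypothetical.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))


def simulate_t(offset_scale, n=200, seed=0):
    """t statistic of one discriminative feature after adding random offsets.

    Assumption: offsets are independent Gaussian draws with standard
    deviation `offset_scale`; the paper's exact noise model may differ.
    """
    rng = random.Random(seed)
    class0 = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, offset_scale) for _ in range(n)]
    class1 = [rng.gauss(2.0, 1.0) + rng.gauss(0.0, offset_scale) for _ in range(n)]
    return welch_t(class1, class0)


if __name__ == "__main__":
    for scale in (0.0, 1.0, 5.0, 50.0):
        print(f"offset scale {scale:5.1f}: t = {simulate_t(scale):.2f}")
```

Under these assumptions the within-class variance grows from 1 to 1 + S^2, so the expected t statistic shrinks roughly in proportion to 1/sqrt(1 + S^2), and for large S the feature becomes statistically indistinguishable from a random variable.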

