A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data.

Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W, Hamprecht FA - BMC Bioinformatics (2009)


Affiliation: Interdisciplinary Center for Scientific Computing (IWR), University of Heidelberg, Heidelberg, Germany. bjoern.menze@iwr.uni-heidelberg.de

ABSTRACT

Background: Regularized regression methods such as principal component or partial least squares regression perform well in learning tasks on high-dimensional spectral data, but cannot explicitly eliminate irrelevant features. The random forest classifier with its associated Gini feature importance, on the other hand, allows for an explicit feature elimination, but may not be optimally adapted to spectral data due to the topology of its constituent classification trees, which are based on orthogonal splits in feature space.
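As a point of reference for the Gini importance used throughout the paper, the following is a minimal sketch in Python, assuming scikit-learn (the paper does not specify this library) and a synthetic stand-in for a spectral data set. With the default criterion="gini", the fitted forest's feature_importances_ attribute is the normalized mean decrease in Gini impurity, i.e. the Gini importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a spectral data set: 100 spectra with 500
# channels each and binary class labels (sizes are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
y = rng.integers(0, 2, size=100)

# With the default criterion="gini", feature_importances_ is the
# normalized mean decrease in Gini impurity across all trees,
# i.e. the Gini importance discussed in the paper.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
gini_importance = rf.feature_importances_
print(gini_importance.argsort()[::-1][:10])  # ten most important channels
```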

Results: We propose to combine the best of both approaches, and we evaluated the joint use of a feature selection, based on a recursive feature elimination using the Gini importance of random forests, together with regularized classification methods on spectral data sets from medical diagnostics, chemotaxonomy, biomedical analytics, food science, and synthetically modified spectral data. Here, a feature selection using the Gini feature importance, combined with a regularized classification by discriminant partial least squares regression, performed as well as or better than filtering according to different univariate statistical tests or according to regression coefficients in a backward feature elimination. It also outperformed the direct application of the random forest classifier and the direct application of the regularized classifiers on the full set of features.
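The following is a minimal sketch of this two-stage pipeline, again assuming scikit-learn; the elimination step size, forest size, number of PLS components, and stopping criterion are illustrative choices, and the cross-validated search for the optimal subset reported in the paper is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_decomposition import PLSRegression

def gini_rfe(X, y, min_features=10, drop_fraction=0.2):
    """Recursive feature elimination driven by Gini importance.

    Repeatedly refits a random forest and drops the least important
    channels; step size and stopping point are illustrative choices.
    """
    active = np.arange(X.shape[1])
    while active.size > min_features:
        rf = RandomForestClassifier(n_estimators=250, random_state=0)
        rf.fit(X[:, active], y)
        order = np.argsort(rf.feature_importances_)  # ascending importance
        n_keep = max(min_features, int((1 - drop_fraction) * active.size))
        active = np.sort(active[order[-n_keep:]])
    return active

# Synthetic stand-in data; in practice X holds spectra, y class labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 300))
y = rng.integers(0, 2, size=80)

selected = gini_rfe(X, y)

# Discriminant PLS (D-PLS): regress the 0/1-coded labels on the
# selected channels and threshold the continuous prediction at 0.5.
pls = PLSRegression(n_components=5).fit(X[:, selected], y.astype(float))
y_hat = (pls.predict(X[:, selected]).ravel() > 0.5).astype(int)
```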

Conclusion: The Gini importance of the random forest provided superior means for measuring feature relevance on spectral data, but, on an optimal subset of features, the regularized classifiers may be preferable to the random forest classifier, in spite of their limitation to modeling linear dependencies only. A feature selection based on Gini importance may therefore precede a regularized linear classification to identify this optimal subset of features, earning the double benefit of both dimensionality reduction and the elimination of noise from the classification task.


Figure 6: Channel-wise variance of each feature (horizontal axis) and its correlation with the dependent variable (vertical axis). For the data sets of the left and the central column, a feature selection was not required for optimal performance, while the data sets shown in the right column benefited from a feature selection. Circle diameter indicates the magnitude of the coefficient in the PLS regression. In the right column, selected features are shown by red circles, while (the original values of) eliminated features are indicated by black dots. Relevant features show both a high variance and a high correlation with the class labels.

Mentions: The latent components of these regression methods are chosen to maximize the variance Var(x) of the data x in the case of PCR, and also their covariance with the response y in the case of PLS [2,3]. Thus, for a better understanding of D-PCR and D-PLS, both Corr(x, y) and Var(x) were plotted for individual channels and for individual learning tasks in Fig. 6 (with the absolute values of the regression coefficients c encoded by the size of the circles). On data sets which did not benefit greatly from the feature selection, we observed variance and correlation to be maximal in those variables which were finally assigned the largest coefficients in the regression (indicated by the size of the black circles in Fig. 6). Conversely, in data sets where a feature selection was required, features with high variance but only moderate relevance to the classification problem (as indicated by a low univariate correlation or multivariate Gini importance) were frequently present in the unselected data (Fig. 6, black dots). This may explain the poor performance of D-PCR and D-PLS when used without a preceding feature selection on the BSE and wine data: here, the selection process allowed us to identify those features where variance coincided with class-label correlation (Fig. 6, red circles), leading to a similar situation in the subsequent regression as for those data sets where a feature selection was not required (Fig. 6, compare the subselected features indicated in red in the right column with the features in the left and central columns).
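A sketch of how such a diagnostic plot can be produced for one learning task, under the same scikit-learn assumption as above, with matplotlib for plotting and synthetic stand-in data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression

# Synthetic stand-in for one learning task (spectra X, 0/1 labels y).
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 200))
y = rng.integers(0, 2, size=80).astype(float)

# Channel-wise variance and Pearson correlation with the class labels.
var_x = X.var(axis=0)
corr_xy = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# |c_j| from a D-PLS fit; ravel() copes with either coef_ orientation.
coef = np.abs(PLSRegression(n_components=5).fit(X, y).coef_.ravel())

# Circle size encodes the coefficient magnitude, as in Fig. 6.
plt.scatter(var_x, np.abs(corr_xy), s=200 * coef / coef.max(),
            facecolors="none", edgecolors="k")
plt.xlabel("Var(x) per channel")
plt.ylabel("|Corr(x, y)| per channel")
plt.show()
```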

