A jackknife and voting classifier approach to feature selection and classification.

Taylor SL, Kim K - Cancer Inform (2011)

Bottom Line: While common classification methods such as random forest and support vector machines are effective at separating groups, they do not directly translate into a clinically applicable classification rule based on a small number of features. We present a simple feature selection and classification method for biomarker detection that is intuitively understandable and can be directly extended for application to a clinical setting. We found our jackknife procedure and voting classifier to perform comparably to these two methods in terms of accuracy. Voting classifiers in combination with a robust feature selection method such as our jackknife procedure offer an effective, simple and intuitive approach to feature selection and classification with a clear extension to clinical applications.

View Article: PubMed Central - PubMed

Affiliation: Division of Biostatistics, Department of Public Health Sciences, University of California School of Medicine, Davis, CA, USA.

ABSTRACT
With technological advances now allowing measurement of thousands of genes, proteins and metabolites, researchers are using this information to develop diagnostic and prognostic tests and discern the biological pathways underlying diseases. Often, an investigator's objective is to develop a classification rule to predict group membership of unknown samples based on a small set of features, one that could ultimately be used in a clinical setting. While common classification methods such as random forest and support vector machines are effective at separating groups, they do not directly translate into a clinically applicable classification rule based on a small number of features. We present a simple feature selection and classification method for biomarker detection that is intuitively understandable and can be directly extended for application to a clinical setting. We first use a jackknife procedure to identify important features and then, for classification, we use voting classifiers, which are simple and easy to implement. We compared our method to random forest and support vector machines using three benchmark cancer 'omics datasets with different characteristics. We found our jackknife procedure and voting classifier to perform comparably to these two methods in terms of accuracy. Further, the jackknife procedure yielded stable feature sets. Voting classifiers in combination with a robust feature selection method such as our jackknife procedure offer an effective, simple and intuitive approach to feature selection and classification with a clear extension to clinical applications.
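The abstract's two steps, jackknife feature selection followed by a voting classifier, can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the nearest-class-mean vote rule and the parameter names (`top_frac`, `n_features`) are our assumptions.

```python
import numpy as np
from scipy import stats

def jackknife_select(X, y, n_features, top_frac=0.01):
    """Rank features by how often they fall among the top `top_frac`
    most significant features (two-sample t-statistic) across
    leave-one-out jackknife samples; keep the `n_features` most
    frequently occurring. A sketch of the procedure described above."""
    n, p = X.shape
    k = max(1, int(round(top_frac * p)))
    counts = np.zeros(p, dtype=int)
    for i in range(n):                        # leave one sample out
        keep = np.arange(n) != i
        t, _ = stats.ttest_ind(X[keep][y[keep] == 0],
                               X[keep][y[keep] == 1], axis=0)
        counts[np.argsort(-np.abs(t))[:k]] += 1   # tally top-k membership
    return np.argsort(-counts)[:n_features]

def vote_classify(X_train, y_train, X_test, features):
    """Voting classifier: each selected feature votes for the class whose
    training mean is nearer; the majority wins. The nearest-mean vote
    rule is an assumption, not necessarily the paper's exact rule."""
    m0 = X_train[y_train == 0][:, features].mean(axis=0)
    m1 = X_train[y_train == 1][:, features].mean(axis=0)
    votes = np.abs(X_test[:, features] - m1) < np.abs(X_test[:, features] - m0)
    return (votes.sum(axis=1) > len(features) / 2).astype(int)
```

With an odd number of features (such as the 51 used in the paper), the majority vote can never tie.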

No MeSH data available.



f6-cin-2011-133: Frequency of occurrence of features in voting classifiers. Frequency of occurrence of features used in voting classifiers containing 51 features across 1,000 random training:test set partitions of two gene expression data sets (leukemia, lung cancer) and a proteomics data set (prostate cancer). Features to include in the classifiers were identified through a jackknife procedure in which features were ranked according to their frequency of occurrence among the top 1% or 5% most significant features, based on t-statistics, across all jackknife samples.
© Copyright Policy - open-access




Mentions: The frequency distributions of the features in the voting classifiers generated under the MRV strategy were indicative of classifier performance for each data set. Considering the voting classifiers with 51 features, we tallied how often each feature occurred in the classifier across the 1,000 training:test set pairs. All classifiers performed well for the lung cancer data set; the frequency distribution for this data set showed a small number of features occurring in all random partitions (Fig. 6). The leukemia data set had the next best classification accuracy. With the top 1% of features, this data set showed a small number of features occurring in every partition, like the lung cancer data set, but using the top 5% of features, none of the features in the leukemia data set occurred in all partitions. In fact, the most frequently occurring feature appeared in only 600 of the training:test set pairs. Accordingly, classifier accuracy for the leukemia data set was lower using the top 5% of features than the top 1%. Finally, the prostate data set had the poorest classification accuracy, and its frequency distribution of features differed substantially from those of the lung cancer and leukemia data sets. None of the features occurred in all random partitions, and with a 5% threshold, the most frequently occurring features appeared in the classifier in only about 300 of the training:test set partitions. The voting classifiers performed best when a small number of features occurred in a large number of the classifiers constructed for the training:test set pairs.
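The tally described above, counting how often each feature enters the classifier across random training:test partitions, can be sketched as below. `select_features` is a hypothetical stand-in for the jackknife selection step, and the half/half split is our assumption; the paper's exact partitioning scheme may differ.

```python
import numpy as np

def classifier_feature_frequencies(select_features, X, y,
                                   n_splits=1000, n_features=51, seed=0):
    """Tally how often each feature enters the voting classifier across
    `n_splits` random training:test partitions.

    `select_features(X_tr, y_tr, n_features)` stands in for the jackknife
    selection step and returns the indices of the features chosen on one
    training set.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p, dtype=int)
    for _ in range(n_splits):
        # random half/half training:test partition (an assumption)
        train = rng.choice(n, size=n // 2, replace=False)
        chosen = select_features(X[train], y[train], n_features)
        counts[chosen] += 1
    # a feature with counts == n_splits occurred in every partition
    return counts
```

A concentrated distribution, with a few features at or near `n_splits`, corresponds to the stable, well-performing case described for the lung cancer data set; a flat distribution corresponds to the prostate case.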

