To aggregate or not to aggregate high-dimensional classifiers.

Xu CJ, Hoefsloot HC, Smilde AK - BMC Bioinformatics (2011)

Bottom Line: The multiple versions of PCDA models from repeated double cross-validation were aggregated, and the final classification was performed by majority voting. The aggregated PCDA learner can improve prediction performance, provide more stable results, and help to assess the variability of the models. The disadvantages and limitations of aggregating are also discussed.


Affiliation: Biosystems Data Analysis group, University of Amsterdam, Amsterdam, The Netherlands.

ABSTRACT

Background: High-throughput functional genomics technologies generate large amounts of data, with hundreds or thousands of measurements per sample. The number of samples is usually much smaller, on the order of tens or hundreds. This poses statistical challenges and calls for appropriate solutions for the analysis of this kind of data.

Results: Principal component discriminant analysis (PCDA), an adaptation of classical linear discriminant analysis (LDA) for high-dimensional data, was selected as an example of a base learner. The multiple versions of PCDA models from repeated double cross-validation were aggregated, and the final classification was performed by majority voting. The performance of this approach was evaluated on simulated, genomics, proteomics and metabolomics data sets.

Conclusions: The aggregated PCDA learner can improve prediction performance, provide more stable results, and help to assess the variability of the models. The disadvantages and limitations of aggregating are also discussed.
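As a minimal illustration of the aggregation scheme described in the Results (this is a hedged sketch, not the authors' code: a PCDA-style base learner built as PCA followed by LDA, trained on repeated cross-validation splits of toy data, and aggregated by majority voting over the test-set predictions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Toy high-dimensional data: 40 samples, 500 features, 2 classes,
# with a class signal planted in the first 5 features.
X = rng.normal(size=(40, 500))
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 2.0

X_test = rng.normal(size=(10, 500))
y_test = np.repeat([0, 1], 5)
X_test[y_test == 1, :5] += 2.0

# Train one PCDA model (PCA + LDA) per CV split, across repeats,
# and collect each model's vote for every test sample.
votes = np.zeros((len(X_test), 2))
for rep in range(10):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
    for train_idx, _ in cv.split(X, y):
        model = make_pipeline(PCA(n_components=5),
                              LinearDiscriminantAnalysis())
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X_test)
        votes[np.arange(len(X_test)), pred] += 1

# Majority voting over the 50 aggregated models.
final = votes.argmax(axis=1)
print("aggregated accuracy:", (final == y_test).mean())
```

The vote matrix also exposes the model-to-model variability that the paper highlights: test samples whose votes are nearly split are exactly those on which the individual PCDA models disagree.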



Figure 3: Boxplot of cross-validation errors for the three real data sets. Misclassification rates were obtained from 1000 repetitions of 10-fold double cross-validation. The sample-to-feature ratios in the training sets are 38/7129 (Leukemia), 39/500 (Gaucher) and 200/412 (Grape). The most stable case is the grape data, which has the lowest number of features per sample among the three data sets.

Mentions: We further applied PCDA and aggregated PCDA to three real data sets. Figures 3 and 4 illustrate the variation of the misclassification rates of the data sets in training and prediction.
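The double (nested) cross-validation used to produce these error estimates can be sketched as follows (an illustrative reconstruction on toy data, not the authors' implementation: the inner loop selects the number of principal components for the PCA + LDA pipeline, and the outer loop estimates the misclassification rate of that selection procedure):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)

# Toy data: 60 samples, 200 features, class signal in the first 4 features.
X = rng.normal(size=(60, 200))
y = np.repeat([0, 1], 30)
X[y == 1, :4] += 2.0

pipe = Pipeline([("pca", PCA()), ("lda", LinearDiscriminantAnalysis())])

# Inner loop: choose the number of principal components by 5-fold CV.
inner = GridSearchCV(pipe, {"pca__n_components": [2, 5, 10]},
                     cv=StratifiedKFold(5, shuffle=True, random_state=0))

# Outer loop: 10-fold CV estimates the error of the whole procedure.
outer_scores = cross_val_score(
    inner, X, y, cv=StratifiedKFold(10, shuffle=True, random_state=0))
print("outer-CV misclassification rate:", 1 - outer_scores.mean())
```

Repeating this whole procedure many times with reshuffled folds (1000 repetitions in the paper) yields a distribution of misclassification rates, which is what the boxplots in Figure 3 summarize.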

