To aggregate or not to aggregate high-dimensional classifiers.

Xu CJ, Hoefsloot HC, Smilde AK - BMC Bioinformatics (2011)

Bottom Line: The multiple versions of PCDA models from repeated double cross-validation were aggregated, and the final classification was performed by majority voting. The aggregating PCDA learner can improve prediction performance, provide more stable results, and help to assess the variability of the models. The disadvantages and limitations of aggregating are also discussed.


Affiliation: Biosystems Data Analysis group, University of Amsterdam, Amsterdam, The Netherlands.

ABSTRACT

Background: High-throughput functional genomics technologies generate large amounts of data with hundreds or thousands of measurements per sample. The number of samples is usually much smaller, on the order of tens or hundreds. This poses statistical challenges and calls for appropriate methods for the analysis of this kind of data.

Results: Principal component discriminant analysis (PCDA), an adaptation of classical linear discriminant analysis (LDA) to high-dimensional data, was selected as an example of a base learner. The multiple versions of the PCDA model obtained from repeated double cross-validation were aggregated, and the final classification was performed by majority voting. The performance of this approach was evaluated on simulated, genomics, proteomics and metabolomics data sets.

Conclusions: The aggregating PCDA learner can improve prediction performance, provide more stable results, and help to assess the variability of the models. The disadvantages and limitations of aggregating are also discussed.
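
The aggregation by majority voting described in the Results can be illustrated with a short sketch. The Python code below is only a minimal illustration under assumed settings, not the authors' implementation: PCDA is approximated as a PCA-plus-LDA pipeline from scikit-learn, and the simulated data, the three repeats of 5-fold resampling, and the fixed number of five components are all assumptions chosen for demonstration.

# Minimal sketch of majority-vote aggregation of PCDA-like models.
# PCDA is approximated as PCA followed by LDA; the data, repeat/fold counts
# and the fixed number of components are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=100, n_features=300, n_informative=15,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Train one model per fold of each repeated cross-validation round.
models = []
for repeat in range(3):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=repeat)
    for train_idx, _ in cv.split(X_train, y_train):
        m = make_pipeline(PCA(n_components=5), LinearDiscriminantAnalysis())
        m.fit(X_train[train_idx], y_train[train_idx])
        models.append(m)

# Majority voting: every model predicts the test labels; the most frequent
# label per sample is the aggregated classification.
votes = np.array([m.predict(X_test) for m in models])   # shape (n_models, n_test)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("aggregated test error:", np.mean(majority != y_test))

Because every resampled model votes on every test sample, the aggregated prediction is less sensitive to the particular training split than any single model, which is the stability argument made in the Conclusions.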



Figure 1: The partition of a data set for model selection and the estimation of the cross-validation error and the prediction error. In the inner-loop cross-validation, the inner training set and the inner validation set are used to determine the number of principal components (PCs), and the model is fit on the inner training set. In the outer-loop cross-validation, the model is built on the inner training set and the inner validation set together, and the outer validation set is used to estimate the cross-validation error. In prediction, the model is again built on the inner training set and the inner validation set, and the test set is used to obtain the prediction error.

Mentions: The optimal number of reduced dimensions of PCA is usually determined by cross-validation. The simplest form of cross-validation is to split the data randomly into K mutually exclusive parts, build a model on all but one part, and evaluate the model on the omitted part. This strategy allows the optimal model complexity to be estimated; however, the resulting prediction performance estimate is often too optimistic, since the same samples are also used to find the best number of PCs and are therefore not completely independent. It is therefore recommended to use a double cross-validation approach [13,17,19,20]. As shown in Figure 1, the original data set is first divided into two parts, a training set and a test set. The test set is not used in the double cross-validation scheme; it is employed afterwards to evaluate how good the built classifier really is. The training set is partitioned into K parts. Of the K parts, a single part is retained as the outer validation set, and the remaining K-1 parts are used as inner training data and inner validation sets: on these K-1 parts, a (K-1)-fold cross-validation is performed to find the best number of principal components. This is a nested validation scheme. The inner validation set is used to determine the optimal number of principal components, and the outer validation set is used to estimate the cross-validation error of the method. In summary, the double cross-validation with PCDA can be written as pseudo code, as sketched below.
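
The nested scheme above can be made concrete with a small sketch. The Python code below is a minimal illustration, not the paper's original pseudo code: PCDA is approximated as a PCA-plus-LDA pipeline from scikit-learn, and the fold count K, the grid of candidate numbers of PCs, and the simulated data are assumptions chosen only for demonstration.

# Minimal sketch of double (nested) cross-validation for a PCDA-like model.
# The inner (K-1)-fold loop selects the number of PCs; the outer K-fold loop
# estimates the cross-validation error; the held-out test set gives the
# prediction error. All settings here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=120, n_features=500, n_informative=20,
                           random_state=0)

# Hold out an independent test set; it plays no role in the double CV.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

K = 5                       # number of outer folds
pc_grid = range(1, 11)      # candidate numbers of principal components

outer = StratifiedKFold(n_splits=K, shuffle=True, random_state=1)
outer_errors, test_errors, chosen_pcs = [], [], []

for inner_idx, outer_val_idx in outer.split(X_train, y_train):
    X_inner, y_inner = X_train[inner_idx], y_train[inner_idx]
    X_outer, y_outer = X_train[outer_val_idx], y_train[outer_val_idx]

    # Inner loop: (K-1)-fold CV on the inner data to pick the number of PCs.
    inner = StratifiedKFold(n_splits=K - 1, shuffle=True, random_state=2)
    mean_err = {}
    for n_pc in pc_grid:
        errs = []
        for tr_idx, val_idx in inner.split(X_inner, y_inner):
            model = make_pipeline(PCA(n_components=n_pc),
                                  LinearDiscriminantAnalysis())
            model.fit(X_inner[tr_idx], y_inner[tr_idx])
            errs.append(1 - model.score(X_inner[val_idx], y_inner[val_idx]))
        mean_err[n_pc] = np.mean(errs)
    best_pc = min(mean_err, key=mean_err.get)
    chosen_pcs.append(best_pc)

    # Outer loop: refit on the inner training + inner validation data with the
    # chosen number of PCs and score on the outer validation set.
    model = make_pipeline(PCA(n_components=best_pc),
                          LinearDiscriminantAnalysis())
    model.fit(X_inner, y_inner)
    outer_errors.append(1 - model.score(X_outer, y_outer))

    # Prediction: the same model is applied to the untouched test set.
    test_errors.append(1 - model.score(X_test, y_test))

print("cross-validation error:", np.mean(outer_errors))
print("prediction error:", np.mean(test_errors))
print("selected number of PCs per outer fold:", chosen_pcs)

Repeating this procedure with different random partitions yields the multiple PCDA models that are later aggregated by majority voting.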

