Genomic data sampling and its effect on classification performance assessment.

Azuaje F - BMC Bioinformatics (2003)

Bottom Line: These methods are designed to reduce the bias and variance of small-sample estimations. Conservative and optimistic accuracy estimations can be obtained by applying different methods. Guidelines are suggested to select a sampling technique according to the complexity of the prediction problem under consideration.


Affiliation: School of Computing and Mathematics, University of Ulster, Jordanstown, Northern Ireland, UK. fj.azuaje@ulster.ac.uk

ABSTRACT

Background: Supervised classification is fundamental in bioinformatics. Machine learning models, such as neural networks, have been applied to discover genes and expression patterns. This process is achieved by implementing training and test phases. In the training phase, a set of cases and their respective labels are used to build a classifier. During testing, the classifier is used to predict new cases. One approach to assessing its predictive quality is to estimate its accuracy during the test phase. Key limitations appear when dealing with small data samples. This paper investigates the effect of data sampling techniques on the assessment of neural network classifiers.
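As a concrete illustration of this train/test protocol, the following minimal sketch estimates test-set accuracy from a single split. The synthetic data and the scikit-learn MLPClassifier stand-in for the paper's neural networks are illustrative assumptions, not the study's own models or datasets.

    # Minimal sketch of the train/test protocol described above.
    # Synthetic data and a small MLP are stand-ins (assumptions),
    # not the paper's actual models or datasets.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(72, 50))    # hypothetical expression profiles
    y = rng.integers(0, 2, size=72)  # hypothetical class labels

    # Training phase: build the classifier from labelled cases.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    clf.fit(X_tr, y_tr)

    # Test phase: predict unseen cases and estimate accuracy.
    print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))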

Results: Three data sampling techniques were studied: cross-validation, leave-one-out and bootstrap. These methods are designed to reduce the bias and variance of small-sample estimations. Two prediction problems based on small sample sets were considered: classification of microarray data originating from a leukaemia study and from small, round blue-cell tumours. A third problem, the prediction of splice junctions, was analysed for comparison. Different accuracy estimates were produced for each problem. The variations are accentuated in the small data samples. The quality of the estimates depends on the number of train-test experiments and the amount of data used for training the networks.
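A self-contained sketch of the three estimators might look as follows; scikit-learn provides k-fold and leave-one-out splitters, while the bootstrap loop is written out explicitly. Data, model, and repetition counts are illustrative assumptions.

    # Hedged sketch of k-fold cross-validation, leave-one-out and
    # bootstrap accuracy estimation on synthetic data (assumptions).
    import numpy as np
    from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.utils import resample

    rng = np.random.default_rng(0)
    X = rng.normal(size=(72, 50))    # hypothetical expression profiles
    y = rng.integers(0, 2, size=72)  # hypothetical class labels
    clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=0)

    # k-fold cross-validation: mean accuracy over k disjoint held-out folds.
    cv = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))

    # Leave-one-out: the k = n special case, a single test case per run.
    loo = cross_val_score(clf, X, y, cv=LeaveOneOut())

    # Bootstrap: train on n cases drawn with replacement, test on the
    # out-of-bag cases, and average over many train-test runs.
    boot = []
    for b in range(100):
        in_bag = resample(np.arange(len(y)), replace=True, random_state=b)
        oob = np.setdiff1d(np.arange(len(y)), in_bag)
        boot.append(accuracy_score(y[oob], clf.fit(X[in_bag], y[in_bag]).predict(X[oob])))

    print(f"10-fold CV: {cv.mean():.3f}  LOO: {loo.mean():.3f}  bootstrap: {np.mean(boot):.3f}")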

Conclusion: The predictive quality assessment of biomolecular data classifiers depends on the data size, sampling techniques and the number of train-test experiments. Conservative and optimistic accuracy estimations can be obtained by applying different methods. Guidelines are suggested to select a sampling technique according to the complexity of the prediction problem under consideration.



Figure 3: Accuracy estimation for the leukaemia data classifier (III). Cross-validation method based on a 95%–5% splitting. Prediction accuracy values and the confidence intervals for the means (95% confidence) are depicted for a number of train-test runs. A: 10 train-test runs, B: 25 train-test runs, C: 50 train-test runs, D: 100 train-test runs, E: 500 train-test runs, F: 1000 train-test runs, G: 2000 train-test runs, H: 3000 train-test runs, I: 4000 train-test runs, J: 5000 train-test runs.

Mentions: In this study, the cross-validation results were analysed for three different data-splitting methods: a) 50% of the available cases used for training the classifiers and the remaining 50% for testing, b) 75% for training and 25% for testing, and c) 95% for training and 5% for testing. Figures 1, 2 and 3 show the prediction accuracy values and their confidence intervals (95% confidence) obtained for each splitting method, respectively. Figure 1 indicates that more than 500 train-test runs are required to significantly reduce the variance of the 50%–50% estimates (confidence interval size for the mean equal to 0.01); the smallest numbers of train-test experiments produced the highest, most optimistic accuracy estimates, yet this splitting method yielded the lowest (most conservative) cross-validation accuracy estimates overall. Figure 2 shows that more than 1000 train-test runs are required to significantly reduce the variance of the 75%–25% cross-validation estimate; again, the smallest numbers of train-test experiments generated the most optimistic accuracy estimates. The 95%–5% cross-validation method (Figure 3) requires more than 5000 train-test runs to reduce the variance of its estimates, and it produced the most optimistic cross-validation accuracy estimates. The leave-one-out method estimated the highest prediction accuracy for this classification problem (0.81). The results generated by the bootstrap method are depicted in Figure 4, which indicates that more than 1000 train-test runs were required to achieve a confidence interval size equal to 0.01. The bootstrap estimate from 1000 train-test runs is not significantly different from that produced by the 50%–50% cross-validation technique based on 1000 train-test runs. These results suggest that higher accuracy estimates may be associated with an increase in the size of the training datasets.
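The protocol described here, random splitting repeated over many train-test runs with a confidence interval for the mean accuracy, can be sketched as below. The data, the MLP stand-in, and the run counts are illustrative assumptions, not the paper's originals.

    # Sketch of repeated random 95%-5% splitting with a 95% confidence
    # interval for the mean accuracy; data and settings are assumptions.
    import numpy as np
    from sklearn.model_selection import ShuffleSplit
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(72, 50))    # hypothetical expression profiles
    y = rng.integers(0, 2, size=72)  # hypothetical class labels

    def mean_ci(acc, z=1.96):
        # Normal-approximation 95% confidence interval for the mean.
        half = z * np.std(acc, ddof=1) / np.sqrt(len(acc))
        return np.mean(acc), half

    # More runs shrink the confidence interval of the mean estimate.
    for runs in (10, 100, 500):
        split = ShuffleSplit(n_splits=runs, test_size=0.05, random_state=0)
        acc = []
        for tr, te in split.split(X):
            clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=500, random_state=0)
            acc.append(accuracy_score(y[te], clf.fit(X[tr], y[tr]).predict(X[te])))
        m, half = mean_ci(acc)
        print(f"{runs} runs: mean accuracy {m:.3f} +/- {half:.3f}")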

