One-step extrapolation of the prediction performance of a gene signature derived from a small study.

Wang LY, Lee WC - BMJ Open (2015)

Bottom Line: Microarray-related studies often involve a very large number of genes and a small sample size. We propose to make a one-step extrapolation from the fitted learning curve to estimate the prediction/classification performance of the model trained on all the samples. Three microarray data sets are used for demonstration.


Affiliation: Research Center for Genes, Environment and Human Health, and Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan; Department of Medical Research, Tzu Chi General Hospital, Hualien, Taiwan.


Figure 3: Demonstration of the proposed extrapolation method on the three microarray data sets. In this double-inverse coordinate system, the ordinate is the inverse of the (transformed) AUC value, and the abscissa is the inverse of the training sample size. The red, blue and green lines represent the colon cancer data (example 1), the GSE2990 breast cancer data (example 2) and the GSE2034 breast cancer data (example 3), respectively. The five dots along each line are the estimates from the five CV schemes, and the * is the extrapolated AUC (AUC, area under the receiver operating characteristic curve; CV, cross-validation).
© Copyright Policy - open-access

Mentions: For naïve multiple regression, the AUC estimates (averaged over 1000 Monte Carlo partitions) are 0.936 (LOOCV), 0.929 (10-fold CV), 0.928 (5-fold CV), 0.925 (3-fold CV) and 0.921 (2-fold CV). The corresponding (x, y) points are (0.182, 0.432) for LOOCV, (0.200, 0.465) for 10-fold CV, (0.220, 0.466) for 5-fold CV, (0.250, 0.483) for 3-fold CV and (0.330, 0.503) for 2-fold CV. These points are plotted in figure 3A. We then fit a linear regression to the five (x, y) points (the red line in figure 3A). To predict the performance with a sample size of 24 (all samples in the model building and CV data set are used as the training set, ie, 12 tumour and 12 normal tissue samples), we enter the corresponding x value into the regression equation to obtain the extrapolated value (* in figure 3A), which gives the extrapolated performance. We next draw 100 bootstrap samples for this example; the bootstrapped SE of the extrapolated performance is 0.080. The results for this example when a support vector machine (SVM) is used to construct the prediction model are shown in figure 3B. The extrapolated AUC (± bootstrapped SE) is 0.940 (±0.079).
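
The line-fitting and extrapolation step described above can be illustrated with a minimal Python sketch. It fits an ordinary least-squares line to the five (x, y) points quoted in the text and evaluates that line at a new abscissa. The function names and the value `x_full` are hypothetical: the abscissa the authors use for the full sample of 24 is not given in this excerpt, so `x_full` is only a placeholder. The sketch also stops at the extrapolated y, because the transformation that maps y back to an AUC is not stated here.

```python
import numpy as np

# (x, y) points quoted in the text for naive multiple regression,
# in the order LOOCV, 10-fold, 5-fold, 3-fold, 2-fold CV.
x = np.array([0.182, 0.200, 0.220, 0.250, 0.330])
y = np.array([0.432, 0.465, 0.466, 0.483, 0.503])


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return intercept, slope


def extrapolate(x_new, intercept, slope):
    """Evaluate the fitted line at a new abscissa."""
    return intercept + slope * x_new


intercept, slope = fit_line(x, y)

# ASSUMPTION: x_full is a hypothetical placeholder for the abscissa that
# corresponds to training on all 24 samples; it is not a value from the paper.
x_full = 0.17
y_full = extrapolate(x_full, intercept, slope)

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"extrapolated y at x = {x_full}: {y_full:.3f}")
```

A bootstrapped SE such as the one reported in the text would come from repeating the whole procedure (cross-validation, fitting and extrapolation) on resampled data and taking the standard deviation of the extrapolated values. That step is not shown here because it requires the original expression data.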

