Classification of rheumatoid arthritis status with candidate gene and genome-wide single-nucleotide polymorphisms using random forests.

Sun YV, Cai Z, Desai K, Lawrance R, Leff R, Jawaid A, Kardia SL, Yang H - BMC Proc (2007)

Bottom Line: Several genes were consistently identified as weakly associated with RA without a significant interaction or combinatorial effect with other candidate genes. However, using the top 500 SNPs, ranked by the importance score, from the genome-wide linkage panel of 5742 SNPs, we were able to accurately predict RA patients and normal subjects with sensitivity of approximately 90% and specificity of approximately 80%, which was confirmed by five-fold cross-validation. However, in a complete training-testing framework, replication of genetic predictors was less satisfactory; thus, further evaluation of existing methodology and development of new methods are warranted.


Affiliation: Department of Epidemiology, School of Public Health, University of Michigan, 611 Church Street #244, Ann Arbor, Michigan 48104, USA. yansun@umich.edu

ABSTRACT
Using the North American Rheumatoid Arthritis Consortium (NARAC) candidate gene and genome-wide single-nucleotide polymorphism (SNP) data sets, we applied regression methods and tree-based random forests to identify genetic associations with rheumatoid arthritis (RA) and to predict RA disease status. Several genes were consistently identified as weakly associated with RA without a significant interaction or combinatorial effect with other candidate genes. Using random forests, the tested candidate gene SNPs were not sufficient to predict RA patients and normal subjects with high accuracy. However, using the top 500 SNPs, ranked by the importance score, from the genome-wide linkage panel of 5742 SNPs, we were able to accurately predict RA patients and normal subjects with sensitivity of approximately 90% and specificity of approximately 80%, which was confirmed by five-fold cross-validation. However, in a complete training-testing framework, replication of genetic predictors was less satisfactory; thus, further evaluation of existing methodology and development of new methods are warranted.

Figure 3: Reducing dimensionality can improve the predictive ability of RFs. The four curves use the most important predictors in data set 1, the most important predictors in data set 2, random predictors in data set 1, and random predictors in data set 2.

Mentions: We examined the predictive ability of RFs using the 5742 genome-wide linkage SNPs. Using the classic training-testing strategy with five-fold cross-validation (CV) and the area under the ROC curve (AUC), we found that RFs did not yield a reliable prediction model for RA status in either data set 1 or 2. However, when we used only SNPs that ranked highly on the importance score (e.g., the top 500 SNPs), high sensitivities and specificities were obtained consistently in all five CV folds for both data set 1 (Figure 2) and data set 2 (similar results, not shown). The sensitivities and specificities were approximately 90% and 80%, respectively. Figure 3 shows the impact of the number of selected SNP predictors. Using the AUC to summarize the predictive ability of each RF model, we found that overall predictive ability declined when more than the top 500 SNPs were used to build the model. Comparing these models with RF models built on randomly selected SNPs, we concluded that predictive ability can be significantly improved by selecting the most important SNPs and that the improvement was not due to random effects (Figure 3). When we estimated the most important SNPs using only the training data in each CV fold, the improvement in predictive ability was not as obvious as when importance was estimated on the full data set. This suggests that the data set may be too small for each training set to represent the whole data set. Therefore, the best set of predictors identified from each training set would not generalize well across the whole data set.
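The contrast between ranking SNPs on the full data set and ranking them inside each training fold can be sketched in a few lines of Python. This is not the authors' analysis and does not reproduce their results: the genotypes are simulated with no association to disease status, a simple case-control mean difference stands in for the RF importance score, and a nearest-centroid rule stands in for the forest. All function names and parameter values (`make_null_data`, `top_snps`, `cv_sens_spec`, 100 subjects, 2000 SNPs, top 20) are hypothetical choices made for illustration.

```python
import random


def make_null_data(n_samples=100, n_snps=2000, seed=0):
    """Simulate genotypes coded 0/1/2 with labels independent of every SNP."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 2) for _ in range(n_snps)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # alternating controls (0) and cases (1)
    return X, y


def class_means(X, y, rows, cols):
    """Mean genotype of each SNP in `cols`, per class, using only `rows`."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members) for j in cols]
    return means


def top_snps(X, y, rows, k):
    """Rank SNPs by absolute case-control mean difference and keep the top k.

    Assumption: this univariate score is a stand-in for RF importance.
    """
    cols = range(len(X[0]))
    m = class_means(X, y, rows, cols)
    score = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(cols, key=lambda j: score[j], reverse=True)[:k]


def predict(X, means, cols, i):
    """Nearest-centroid prediction for subject i (stand-in for the forest)."""
    dist = {
        label: sum((X[i][j] - mu) ** 2 for j, mu in zip(cols, means[label]))
        for label in (0, 1)
    }
    return 1 if dist[1] < dist[0] else 0


def cv_sens_spec(X, y, k_top=20, n_folds=5, select_inside_fold=True):
    """Five-fold CV returning (sensitivity, specificity).

    select_inside_fold=False ranks SNPs once on all subjects, so the test
    labels of every fold have already influenced the selection.
    """
    rows = list(range(len(X)))
    fixed_cols = None if select_inside_fold else top_snps(X, y, rows, k_top)
    tp = tn = fp = fn = 0
    for fold in range(n_folds):
        train = [i for i in rows if i % n_folds != fold]
        test = [i for i in rows if i % n_folds == fold]
        cols = top_snps(X, y, train, k_top) if select_inside_fold else fixed_cols
        means = class_means(X, y, train, cols)
        for i in test:
            pred = predict(X, means, cols, i)
            if y[i] == 1:
                tp += pred == 1
                fn += pred == 0
            else:
                tn += pred == 0
                fp += pred == 1
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    X, y = make_null_data()
    print("selection on full data :", cv_sens_spec(X, y, select_inside_fold=False))
    print("selection inside folds :", cv_sens_spec(X, y, select_inside_fold=True))
```

Because the simulated labels carry no signal, any apparent accuracy under full-data selection comes from the selection step having seen the held-out subjects. Selecting inside each fold removes that dependence, which is one reason the two strategies described above can give different pictures of predictive ability.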

