Picking single-nucleotide polymorphisms in forests.

Schwarz DF, Szymczak S, Ziegler A, König IR - BMC Proc (2007)

Bottom Line: In the first stage, RFs were grown to pre-select the 37 most important variables, and these were reduced to 32 variables by haplotype tagging. In the second stage, we estimated parameters using logReg. The highest effect estimates were obtained for five simulated loci. After correction for multiple testing, we identified two out of four genes simulated with a direct effect on rheumatoid arthritis risk and all covariates without any false positives. We showed that a two-stage approach with a screening of SNPs by RFs is suitable to detect candidate SNPs in genome-wide association studies for complex diseases.


Affiliation: Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23538 Lübeck, Germany. schwarz@imbs.uni-luebeck.de

ABSTRACT
With the development of high-throughput single-nucleotide polymorphism (SNP) technologies, the vast number of SNPs in smaller samples poses a challenge to the application of classical statistical procedures. A possible solution is to use a two-stage approach for case-control data in which, in the first stage, a screening test selects a small number of SNPs for further analysis. The second stage then estimates the effects of the selected variables using logistic regression (logReg). Here, we introduce a novel approach in which the selection of SNPs is based on the permutation importance estimated by random forests (RFs). For this, we used the simulated data provided for the Genetic Analysis Workshop 15 without knowledge of the true model. The data set was randomly split into a first and a second data set. In the first stage, RFs were grown to pre-select the 37 most important variables, and these were reduced to 32 variables by haplotype tagging. In the second stage, we estimated parameters using logReg. The highest effect estimates were obtained for five simulated loci. We detected smoking, gender, and the parental DR alleles as covariates. After correction for multiple testing, we identified two out of four genes simulated with a direct effect on rheumatoid arthritis risk and all covariates without any false positives. We showed that a two-stage approach with a screening of SNPs by RFs is suitable to detect candidate SNPs in genome-wide association studies for complex diseases.
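The permutation importance used for screening can be illustrated with a minimal sketch. It is not the authors' implementation: a random forest computes this measure per tree on out-of-bag samples, whereas the code below uses a fixed stand-in classifier (`toy_predict`) and invented genotype data purely to show the idea, namely that shuffling an informative variable raises the prediction error while shuffling an uninformative one does not. All names and data here are hypothetical.

```python
import random

def error_rate(predict, rows, labels):
    """Fraction of rows whose predicted label differs from the true label."""
    wrong = sum(1 for row, y in zip(rows, labels) if predict(row) != y)
    return wrong / len(rows)

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Increase in error rate after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = error_rate(predict, rows, labels)
    importances = []
    for j in range(n_features):
        # Shuffle column j only, breaking its link to the outcome.
        column = [row[j] for row in rows]
        rng.shuffle(column)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        importances.append(error_rate(predict, permuted, labels) - baseline)
    return importances

# Hypothetical toy data: genotype codes 0/1/2 for three SNPs.
# Only SNP 0 determines case status; SNPs 1 and 2 are noise.
data_rng = random.Random(1)
rows = [[data_rng.randint(0, 2) for _ in range(3)] for _ in range(200)]
labels = [1 if row[0] == 2 else 0 for row in rows]

def toy_predict(row):
    """Stand-in for a fitted classifier (assumption, not a random forest)."""
    return 1 if row[0] == 2 else 0

importances = permutation_importance(toy_predict, rows, labels, n_features=3)
# Rank variables from most to least important, as a screening step would.
ranking = sorted(range(3), key=lambda j: importances[j], reverse=True)
```

In a screening step of this kind, only the top-ranked variables would be passed on to the second stage.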


Figure 2: Prediction error in random forests based on different numbers of variables. Prediction error of random forests based on different numbers of variables, estimated in the out-of-bag (OOB) samples. Only error estimates of the first 100 sets are displayed. The first local minimum in prediction error is for the set including 37 variables, which was selected for further analyses.

Mentions: For further analyses, the OOB prediction errors were estimated in RFs with different numbers of variables (Figure 2). As the number of variables grows, the error estimate first rises sharply and then falls just as steeply. Beyond that point, it varies only between about 0.13 and 0.14. Within this plateau, the point was chosen where the error estimate reaches its first minimum, which is at 37 variables. Haplotype tagging on nine closely neighboring SNPs reduced this further to 32 variables for the second stage of analysis.
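The selection rule described here (take the first local minimum of the error estimate after the initial peak, rather than the global minimum) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the way the start of the "latter region" is located (the position of the maximum error), and the error values are all hypothetical and are not taken from the paper.

```python
def first_local_minimum_after_peak(sizes, errors):
    """Return the set size at the first local minimum following the error peak.

    Assumption: the search starts just after the largest error estimate.
    """
    peak = max(range(len(errors)), key=lambda i: errors[i])
    for i in range(peak + 1, len(errors) - 1):
        # Local minimum: lower than the previous point, not higher than the next.
        if errors[i] < errors[i - 1] and errors[i] <= errors[i + 1]:
            return sizes[i]
    # No interior minimum found (error still falling): use the largest set.
    return sizes[-1]

# Invented OOB error estimates for nested variable sets of growing size.
sizes = [1, 2, 3, 4, 5, 6, 7, 8]
errors = [0.20, 0.30, 0.25, 0.16, 0.135, 0.138, 0.132, 0.139]
chosen = first_local_minimum_after_peak(sizes, errors)
```

With these invented values the rule picks the set of size 5, even though size 7 has a slightly lower error, which reflects the preference for the smallest set at which the error first bottoms out.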

