Picking single-nucleotide polymorphisms in forests.

Schwarz DF, Szymczak S, Ziegler A, König IR - BMC Proc (2007)

Bottom Line: In the first stage, RFs were grown to pre-select the 37 most important variables, and these were reduced to 32 variables by haplotype tagging. In the second stage, we estimated parameters using logReg. The highest effect estimates were obtained for five simulated loci. After correction for multiple testing, we identified two of the four genes simulated with a direct effect on rheumatoid arthritis risk, and all covariates, without any false positives. We showed that a two-stage approach with screening of SNPs by RFs is suitable for detecting candidate SNPs in genome-wide association studies for complex diseases.

Affiliation: Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23538 Lübeck, Germany. schwarz@imbs.uni-luebeck.de

ABSTRACT
With the development of high-throughput single-nucleotide polymorphism (SNP) technologies, the vast number of SNPs relative to the comparatively small samples poses a challenge to the application of classical statistical procedures. A possible solution is a two-stage approach for case-control data in which, in the first stage, a screening test selects a small number of SNPs for further analysis. The second stage then estimates the effects of the selected variables using logistic regression (logReg). Here, we introduce a novel approach in which the selection of SNPs is based on the permutation importance estimated by random forests (RFs). For this, we used the simulated data provided for the Genetic Analysis Workshop 15 without knowledge of the true model. The data set was randomly split into a first and a second data set. In the first stage, RFs were grown to pre-select the 37 most important variables, and these were reduced to 32 variables by haplotype tagging. In the second stage, we estimated parameters using logReg. The highest effect estimates were obtained for five simulated loci. We detected smoking, gender, and the parental DR alleles as covariates. After correction for multiple testing, we identified two of the four genes simulated with a direct effect on rheumatoid arthritis risk, and all covariates, without any false positives. We showed that a two-stage approach with screening of SNPs by RFs is suitable for detecting candidate SNPs in genome-wide association studies for complex diseases.
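
The selection logic of the two stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `top_k` and `bonferroni_significant`, the variable names, and all numbers in the toy data are hypothetical, and the Bonferroni rule is an assumption, since the abstract does not say which multiple-testing correction was applied. Growing the forest and fitting the logistic regression are left out; the sketch only shows how importance scores from the first stage and p-values from the second stage would be turned into a final list of candidates.

```python
def top_k(importances, k):
    """Stage 1: names of the k variables with the highest importance score."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def bonferroni_significant(p_values, alpha=0.05):
    """Stage 2: variables whose p-value is below alpha / (number of tests).

    Bonferroni is an assumed choice here; the source does not name the
    correction that was used.
    """
    threshold = alpha / len(p_values)
    return [name for name, p in p_values.items() if p < threshold]


# Hypothetical importance scores from a first-stage screening.
importances = {"snp_a": 0.10, "snp_b": 0.50, "snp_c": 0.30, "snp_d": 0.01}
selected = top_k(importances, k=2)

# Hypothetical second-stage p-values for the selected variables only.
p_values = {"snp_b": 0.001, "snp_c": 0.04}
significant = bonferroni_significant(p_values, alpha=0.05)

# With 32 second-stage variables, as in the abstract, the per-test threshold
# under this assumed correction would be 0.05 / 32.
threshold_32 = 0.05 / 32
```

Because the first stage and the second stage use different halves of the data, the correction in the second stage only has to account for the variables that survived screening, not for every SNP in the genome-wide scan.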

Figure 1: Importance of SNPs. Global importance scores for the single SNPs in the genome-wide scan in chromosomal order. Vertical dotted lines show chromosomal boundaries.

Figure 1 shows the global importance scores from the RFs in our first-stage analysis across the genome. The highest importance is assigned to SNPs on chromosomes 6 and 11. In addition, high global importance was estimated for phenotypic covariates (not shown). The importance of the covariates might even be underestimated, because the estimated importance in an RF depends on the number of categories of the variable [9]. Specifically, higher importance may be assigned to variables with more categories, and here the covariates were binary, whereas the SNPs had three categories.
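
The idea behind a permutation importance score can be sketched in a few lines of Python. This is a simplified, model-agnostic version under stated assumptions, not the out-of-bag computation performed inside a random forest and not the authors' code: the functions `accuracy` and `permutation_importance`, the fixed rule `predict`, and the toy genotype data are all hypothetical. The score of a variable is the mean drop in prediction accuracy after the values of that variable are shuffled across subjects.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one column across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        permuted = [row[:column] + [value] + row[column + 1:]
                    for row, value in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


# Toy data: column 0 is a genotype coded 0, 1, or 2 that fully determines
# case status; column 1 is an unrelated genotype (noise).
data_rng = random.Random(1)
rows = [[data_rng.choice([0, 1, 2]), data_rng.choice([0, 1, 2])]
        for _ in range(200)]
labels = [1 if row[0] == 2 else 0 for row in rows]


def predict(row):
    """Stand-in for a fitted classifier: it uses only column 0."""
    return 1 if row[0] == 2 else 0


imp_causal = permutation_importance(predict, rows, labels, column=0)
imp_noise = permutation_importance(predict, rows, labels, column=1)
```

In this toy setting the noise column receives a score of exactly zero, because the stand-in classifier never reads it, while shuffling the causal column lowers the accuracy. The dependence on the number of categories noted above is a property of how forests choose splits and is not reproduced by this fixed-rule sketch.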

