Detecting epistatic effects in association studies at a genomic level based on an ensemble approach.

Li J, Horstman B, Chen Y - Bioinformatics (2011)

Bottom Line: Although genome-wide association studies (GWAS) have shown some success for identifying genetic variants underlying complex diseases, most existing studies are based on limited single-locus approaches, which detect single nucleotide polymorphisms (SNPs) essentially based on their marginal associations with phenotypes. We have performed extensive simulation studies using three interaction models to evaluate the efficacy of our approach at realistic GWAS sizes, and have compared it with existing epistatic detection algorithms. Our results indicate that our approach is valid, is efficient for GWAS and, on disease models with epistasis, has more power than existing programs.


Affiliation: Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA. jingli@case.edu

ABSTRACT

Motivation: Most complex diseases involve multiple genes and their interactions. Although genome-wide association studies (GWAS) have shown some success for identifying genetic variants underlying complex diseases, most existing studies are based on limited single-locus approaches, which detect single nucleotide polymorphisms (SNPs) essentially based on their marginal associations with phenotypes.

Results: In this article, we propose an ensemble approach based on boosting to study gene-gene interactions. We extend the basic AdaBoost algorithm by incorporating an intuitive importance score based on Gini impurity to select candidate SNPs. Permutation tests are used to control the statistical significance. We have performed extensive simulation studies using three interaction models to evaluate the efficacy of our approach at realistic GWAS sizes, and have compared it with existing epistatic detection algorithms. Our results indicate that our approach is valid, is efficient for GWAS and, on disease models with epistasis, has more power than existing programs.

Contact: jingli@case.edu.
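The abstract names two ingredients, a Gini-impurity importance score and permutation tests, without giving formulas here. The following is a minimal sketch of both ideas for a single SNP, not the authors' score, which is computed inside the boosted ensemble. The function names, the 0/1/2 genotype coding and the number of permutations are assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(genotypes, labels):
    """Impurity drop when individuals are split by genotype (assumed 0/1/2)."""
    n = len(labels)
    groups = {}
    for g, y in zip(genotypes, labels):
        groups.setdefault(g, []).append(y)
    # Impurity after the split, each group weighted by its share of individuals.
    weighted = sum(len(ys) / n * gini(ys) for ys in groups.values())
    return gini(labels) - weighted

def permutation_p_value(genotypes, labels, n_perm=200, seed=0):
    """Share of label shuffles that score at least as high as the real labels."""
    rng = random.Random(seed)
    observed = gini_decrease(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        if gini_decrease(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

For example, `gini_decrease([0, 0, 2, 2], [0, 0, 1, 1])` returns 0.5, because splitting by genotype leaves two pure groups, while `gini_decrease([0, 2, 0, 2], [0, 0, 1, 1])` returns 0.0, because the genotype carries no information about the label.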


Figure 2: The modified AdaBoost algorithm with variable importance score calculation.

Mentions: AdaBoost (Freund and Schapire, 1997; Schapire et al., 1998) is a popular algorithm among a class of supervised learners called ensemble systems, which also includes random forests (RFs) and Bagging. Boosting is a general technique developed by Schapire et al. (1998) that attempts to decrease the error of a weak learning algorithm by adaptively resampling the training data. AdaBoost is the most popular Boosting algorithm, and we use the classical algorithm extended with a variable importance score (Figure 2). The core idea of AdaBoost is to draw weighted bootstrap samples to increase the power of a weak learner: each individual carries a weight that sets its probability of being drawn. When a weak learner instance misclassifies an individual, the weight of that individual is increased (and increased more if the weak learner instance was otherwise accurate). Thus, hard-to-classify individuals are more likely to be included in future bootstrap samples. In the end, the ensemble votes for class labels, weighting each weak learner instance by its training set accuracy.

While AdaBoost was designed to decrease training set error, some have argued that it instead primarily reduces weak learner variance. This is disputed; the modern consensus is that Boosting and many other approaches can be reformulated in terms of margin theory. The goal of this approach is to maximize the margin, that is, the distance between the class decision boundary and the training set. This improves generalization power over other algorithms that have similar training set error. The algorithm is described in Figure 2.
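The resampling scheme described above can be sketched in a few lines. This is an illustrative version of classical AdaBoost by weighted bootstrap, not a reproduction of Figure 2: it omits the importance score, and the single-SNP stump used as the weak learner, the -1/+1 label coding and all function names are assumptions.

```python
import math
import random
from collections import defaultdict

def fit_stump(X, y, rows):
    """Pick the SNP whose genotype -> majority-label rule best fits the sample."""
    best = None
    for j in range(len(X[0])):
        votes = defaultdict(int)
        for i in rows:
            votes[X[i][j]] += y[i]  # labels are -1 / +1, so the sign is the majority
        rule = {g: (1 if v >= 0 else -1) for g, v in votes.items()}
        errors = sum(rule[X[i][j]] != y[i] for i in rows)
        if best is None or errors < best[0]:
            best = (errors, j, rule)
    return best[1], best[2]

def stump_predict(stump, x):
    j, rule = stump
    return rule.get(x[j], 1)  # genotype unseen in the sample: default to +1

def adaboost(X, y, rounds=20, seed=0):
    rng = random.Random(seed)
    n = len(y)
    w = [1.0 / n] * n  # start with equal weight on every individual
    ensemble = []
    for _ in range(rounds):
        # Weighted bootstrap: heavily weighted individuals are drawn more often.
        rows = rng.choices(range(n), weights=w, k=n)
        stump = fit_stump(X, y, rows)
        preds = [stump_predict(stump, x) for x in X]
        eps = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        if eps >= 0.5:
            continue  # no better than chance; draw a fresh sample
        eps = max(eps, 1e-10)  # avoid log(1/0) for a perfect weak learner
        alpha = 0.5 * math.log((1 - eps) / eps)  # larger for accurate learners
        ensemble.append((alpha, stump))
        # Misclassified individuals (yi * p == -1) gain weight, others lose it.
        w = [wi * math.exp(-alpha * yi * p) for wi, p, yi in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak learner instances."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

Because `alpha` grows as the weighted error shrinks, a misclassified individual gains more weight when the weak learner was otherwise accurate, which matches the parenthetical remark in the text. The same `alpha` serves as the learner's weight in the final vote.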

