SHARE: an adaptive algorithm to select the most informative set of SNPs for candidate genetic association.

Dai JY, Leblanc M, Smith NL, Psaty B, Kooperberg C - Biostatistics (2009)

Bottom Line: Depending on the evolutionary history of the disease mutations and the markers, this set may contain a single SNP or several SNPs that lay a foundation for haplotype analyses. Haplotype phase ambiguity is effectively accounted for by treating haplotype reconstruction as a part of the learning procedure. Simulations and a data application show that our method has improved power over existing methodologies and that the results are informative in the search for disease-causal loci.


Affiliation: Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, M2-C200, Seattle, WA 98109, USA. jdai@fhcrc.org

ABSTRACT
Association studies have been widely used to identify genetic liability variants for complex diseases. While scanning the chromosomal region one single nucleotide polymorphism (SNP) at a time may not fully explore linkage disequilibrium, haplotype analyses tend to require a fairly large number of parameters, thus potentially losing power. Clustering algorithms, such as the cladistic approach, have been proposed to reduce the dimensionality, yet they have important limitations. We propose a SNP-Haplotype Adaptive REgression (SHARE) algorithm that seeks the most informative set of SNPs for genetic association in a targeted candidate region by growing and shrinking haplotypes one SNP at a time in a stepwise fashion and comparing the prediction errors of different models via cross-validation. Depending on the evolutionary history of the disease mutations and the markers, this set may contain a single SNP or several SNPs that lay a foundation for haplotype analyses. Haplotype phase ambiguity is effectively accounted for by treating haplotype reconstruction as a part of the learning procedure. Simulations and a data application show that our method has improved power over existing methodologies and that the results are informative in the search for disease-causal loci.
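The grow-then-shrink search described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `grow_and_prune`, the `max_size` cap, the tie-breaking rule, and the toy error function are all assumptions made for illustration. In particular, `cv_error` stands in for the cross-validated prediction error of a haplotype model built on a given SNP set, which the sketch does not attempt to fit.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Subset = FrozenSet[str]


def grow_and_prune(
    snps: Sequence[str],
    cv_error: Callable[[Subset], float],
    max_size: int,
) -> Tuple[Subset, List[Tuple[Subset, float]]]:
    """Greedy stepwise search over SNP subsets (illustrative only).

    Growing adds, at each step, the single SNP that gives the lowest
    cv_error; pruning then removes one SNP at a time in the same way.
    The subset with the lowest error along the whole path is returned.
    """
    current: Subset = frozenset()
    path: List[Tuple[Subset, float]] = []

    # Growing phase: add one SNP per step up to max_size.
    while len(current) < min(max_size, len(snps)):
        candidates = [current | {s} for s in snps if s not in current]
        current = min(candidates, key=cv_error)
        path.append((current, cv_error(current)))

    # Pruning phase: drop one SNP per step down to a single SNP.
    while len(current) > 1:
        candidates = [current - {s} for s in sorted(current)]
        current = min(candidates, key=cv_error)
        path.append((current, cv_error(current)))

    best = min(path, key=lambda item: item[1])[0]
    return best, path


# Hypothetical error function: lowest for the pair {snp2, snp3}.
def toy_error(subset: Subset) -> float:
    return float(len(subset ^ {"snp2", "snp3"}))


best, path = grow_and_prune(["snp1", "snp2", "snp3", "snp4"], toy_error, max_size=3)
print(sorted(best), [len(subset) for subset, _ in path])
```

Passing the error as a callable keeps the search logic separate from the model fitting, so a real cross-validated deviance could be substituted for `toy_error` without changing the search.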



Figure 3: The prediction deviances of different models for the TFPI gene. The horizontal axis is the number of SNPs included in the sequence of best subsets during model growing and pruning. The horizontal dashed line at the top represents the deviance of a model that does not include a genetic effect. The vertical dashed line indicates the switch from model growing to pruning. Each deviance is calculated from a model with haplotypes constructed from the SNPs in the set. A lower deviance indicates better model prediction.

Mentions: We used SHARE to re-analyze the data from a published case–control genetic association study (Smith and others, 2007). This study aimed to investigate the association of common genetic variation in 24 coagulation, anti-coagulation, fibrinolysis, and antifibrinolysis candidate genes with the risk of incident nonfatal venous thrombosis in postmenopausal women. The participants were selected from a large integrated health care system in Washington State and consist of 349 cases and 1680 controls matched on age, hypertension status, and calendar year. In the original analysis, a single-locus scan and a full haplotype analysis were applied to each of the 24 genes, assuming an additive genetic effect and adjusting for race and the matching variables (e.g. age and hypertension status). We illustrate our algorithm using the data on the tissue factor pathway inhibitor (TFPI) gene. The LD among the five genotyped tagSNPs in the TFPI gene is quite strong, and the highest correlation occurs between 2 adjacent tagSNPs: rs2192824 and rs2300412. This region has been shown to have a significant global haplotype effect (Smith and others, 2007). Figure 3 shows the prediction deviances, estimated by 10-fold cross-validation, of various models with different sets of SNPs. Lower deviances suggest better prediction accuracy. It is clear that there is a genetic effect in this gene, as all models yield a smaller prediction deviance than the model without a genetic effect. We performed a permutation test within the strata defined by the other covariates. The p value for the global hypothesis is 0.023. It is interesting to observe that, although SNP rs2192824 alone predicts the disease status fairly well, a haplotype model based on SNPs rs2192824 and rs2300412 further improves the prediction accuracy; including additional tagSNPs no longer helps.
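A permutation test within strata, as mentioned above, can be sketched in Python as follows. This is a generic illustration under stated assumptions, not the procedure used in the study: the function names, the add-one p value estimate, the toy data, and the toy statistic (an absolute difference in mean genotype between cases and controls) are all hypothetical. The source does not state which statistic was permuted.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence


def permute_within_strata(
    labels: Sequence[int], strata: Sequence[Hashable], rng: random.Random
) -> List[int]:
    """Shuffle case/control labels only among subjects in the same stratum."""
    index_by_stratum = defaultdict(list)
    for i, stratum in enumerate(strata):
        index_by_stratum[stratum].append(i)
    permuted = list(labels)
    for indices in index_by_stratum.values():
        shuffled = [labels[i] for i in indices]
        rng.shuffle(shuffled)
        for i, value in zip(indices, shuffled):
            permuted[i] = value
    return permuted


def stratified_permutation_p(
    labels: Sequence[int],
    strata: Sequence[Hashable],
    statistic: Callable[[Sequence[int]], float],
    n_perm: int = 999,
    seed: int = 0,
) -> float:
    """Permutation p value; larger statistic values count as more extreme."""
    rng = random.Random(seed)
    observed = statistic(labels)
    hits = sum(
        statistic(permute_within_strata(labels, strata, rng)) >= observed
        for _ in range(n_perm)
    )
    # Add-one estimate so the p value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 1 = case, 0 = control, two strata, genotype counts.
labels = [1, 0, 0, 1, 0, 0, 1, 0]
strata = ["a", "a", "a", "a", "b", "b", "b", "b"]
genotype = [2, 0, 1, 2, 0, 0, 1, 0]


def mean_difference(lab: Sequence[int]) -> float:
    cases = [g for g, y in zip(genotype, lab) if y == 1]
    controls = [g for g, y in zip(genotype, lab) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


p_value = stratified_permutation_p(labels, strata, mean_difference)
print(p_value)
```

Shuffling only within strata keeps the number of cases in each stratum fixed, so the permuted data sets respect the covariate structure while breaking any link between genotype and disease status.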
Inspecting the haplotype analysis based on SNPs rs2192824 and rs2300412 (Table 4), we found that the increased disease risk is concentrated on haplotypes “01” and “10.” This pattern implies that there might be an underlying disease-causing allele that is tagged by these 2-SNP haplotypes. We searched the SeattleSNPs database for SNPs that were not genotyped in the study. Based on 23 individuals of European descent, there are 54 SNPs in total with a minor allele frequency (MAF) larger than 0.05, covering a distance of 3.9 kb in this region. We found 2 SNPs (rs8676500 and rs8176531) that are in perfect LD with each other and that display a pattern of correlation with rs2192824 and rs2300412 similar to the results of the SHARE analysis. The highest pairwise r² between scored and unscored SNPs is 0.17; however, if we define a multilocus r² as in Hao and others (2007), the highest r² rises to 0.72. Both rs8676500 and rs8176531 are located in an intronic region of the TFPI gene. Although it is too early to interpret this finding, the analysis gives useful hints for future studies, such as genotyping additional SNPs that are correlated with rs2192824 and rs2300412, for example rs8676500 and rs8176531.
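For reference, the pairwise r² between two biallelic SNPs is the standard quantity D² / (pA(1 − pA) pB(1 − pB)), where D = pAB − pA·pB. The short Python sketch below computes it from 2-SNP haplotype counts, using the same "00"/"01"/"10"/"11" labels as the text. The function name and the example counts are hypothetical, and the multilocus r² of Hao and others (2007) is not covered here.

```python
from typing import Mapping


def pairwise_r2(hap_counts: Mapping[str, float]) -> float:
    """Pairwise r^2 between two biallelic SNPs from 2-SNP haplotype counts.

    Keys are haplotype labels "00", "01", "10", "11"; the first character
    is the allele at SNP A and the second is the allele at SNP B.
    """
    total = sum(hap_counts.get(h, 0) for h in ("00", "01", "10", "11"))
    if total <= 0:
        raise ValueError("haplotype counts must sum to a positive number")
    p_11 = hap_counts.get("11", 0) / total
    p_a = (hap_counts.get("10", 0) + hap_counts.get("11", 0)) / total
    p_b = (hap_counts.get("01", 0) + hap_counts.get("11", 0)) / total
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    d = p_11 - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


# Hypothetical counts: perfect LD versus no LD.
print(pairwise_r2({"00": 5, "11": 5}))
print(pairwise_r2({"00": 25, "01": 25, "10": 25, "11": 25}))
```

A low pairwise r² with every single genotyped SNP does not rule out good tagging by a haplotype of several SNPs, which is the contrast the 0.17 versus 0.72 comparison in the text draws.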

