Integrating functional data to prioritize causal variants in statistical fine-mapping studies.

Kichaev G, Yang WY, Lindstrom S, Hormozdiari F, Eskin E, Price AL, Kraft P, Pasaniuc B - PLoS Genet. (2014)

Bottom Line: Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). We validate our findings using a large-scale meta-analysis of four blood lipid traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.


Affiliation: Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, California, United States of America.

ABSTRACT
Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulated data. We validate our findings using a large-scale meta-analysis of four blood lipid traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.
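
To make the notion of a "90% confidence set" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the set is built by ranking variants at a locus by posterior probability and selecting from the top until the selected variants account for 90% of the total posterior mass at that locus (with multiple causal variants the posteriors need not sum to 1). The function name `confidence_set` and its parameters are hypothetical.

```python
def confidence_set(posteriors, level=0.9):
    """Return indices of the top-ranked variants whose summed posterior
    probability reaches `level` of the total posterior mass at the locus.

    Assumption (not taken from the source): the set is defined relative to
    the total mass, since posteriors may sum to more than 1 when several
    causal variants are allowed.
    """
    if not posteriors:
        return []
    total = sum(posteriors)
    if total <= 0:
        raise ValueError("posterior probabilities must have positive total mass")

    # Rank variants from highest to lowest posterior probability.
    order = sorted(range(len(posteriors)),
                   key=lambda i: posteriors[i], reverse=True)

    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += posteriors[i]
        # Small tolerance guards against floating-point round-off.
        if mass >= level * total - 1e-12:
            break
    return selected


# Hypothetical posteriors for a four-variant locus.
print(confidence_set([0.6, 0.25, 0.1, 0.05]))  # [0, 1, 2]
```

The size of the returned list is the per-locus quantity that would be averaged across loci; sharper posteriors, for instance from informative functional priors, give smaller sets.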

Figure 2 (pgen-1004722-g002): PAINTOR outperforms existing methodologies for fine-mapping. We simulated datasets consisting of 10 K genotypes over one hundred 10 kb loci using three synthetic functional annotations randomly dispersed at fixed percentages (2.2%, 2.2%, 30.7%). SNPs falling within these annotations were enriched (9.5, 5.7, 3.65)-fold for causal variants relative to unannotated SNPs. We fixed the variance explained by these loci to  and repeated the simulation 500 times. The top panel corresponds to the overall performance at causal loci (64 loci), with PAINTOR clearly achieving the greatest overall accuracy. The bottom panels correspond to loci with a single causal variant (an average of 34 per simulation; left) or multiple causal variants (an average of 30 per simulation; right). At loci with one true causal variant, fgwas achieves greater accuracy than PAINTOR because fgwas assumes the correct number of causal variants. We note that the version of PAINTOR that assumes a single causal variant yields results very similar to fgwas at loci with a single true causal variant (both require 2.63 SNPs per locus to identify 90% of the causal variants). However, at loci with multiple causal variants, the power of methods that assume a single causal variant is greatly reduced, leading to PAINTOR's superior overall accuracy.


Mentions: We find that prioritizing variants using PAINTOR posterior probabilities achieves superior accuracy over existing methodologies (see Figure 2 and Table 1). Our approach identifies more causal variants at all selection thresholds, a consequence of PAINTOR's ability to model multiple causal variants while incorporating functional priors. For example, to find (50%, 90%) of all causal variants, one needs to select an average of (1.3, 10.4) SNPs per locus when using PAINTOR. In contrast, ranking SNPs using frameworks that assume a single causal variant, such as Maller et al. [5] and fgwas [10], requires (2.7, 25.4) and (2.0, 21.5) SNPs per locus, respectively. In general, we observe an increase in performance for methods that incorporate functional data and allow for multiple causal variants at a risk locus (see Tables 1 and 2). Despite having access to individual-level data, variable selection strategies [13], [14] were less accurate than PAINTOR in our simulations (see Figure 2 and Table 1). Ranking SNPs based on correlation-adjusted t-scores [12] was superior to the other existing methodologies but still failed to match the accuracy of PAINTOR, requiring an average of (2.0, 13.3) SNPs per locus to find (50%, 90%) of all causal variants. Across all methodologies, the relative performance holds irrespective of whether SNPs are prioritized across all fine-mapping loci or within each locus independently (the latter strategy is generally sub-optimal; see Table 1). Finally, we note that iterative conditioning, a method typically used to detect multiple independent signals, performs worse than the prioritization strategies described here (see Figure S1) [7]. Interestingly, as the number of SNPs selected for follow-up increases, the naive approach of selecting based on association p-value alone attains high accuracy, most likely because it makes far fewer assumptions than the other methods.
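
The "SNPs per locus needed to find a given fraction of causal variants" metric used in the comparison above can be illustrated with a short sketch. This is an assumption-laden illustration, not the evaluation code from the paper: it pools SNPs across all loci, ranks them by a score (for example a posterior probability), and counts how many must be selected before the requested fraction of causal variants is recovered. The function name `snps_per_locus_to_capture` and the input layout are hypothetical, and ties are broken by input order.

```python
import math


def snps_per_locus_to_capture(loci, fraction):
    """Average number of SNPs per locus that must be selected, ranking all
    SNPs jointly across loci by score, to recover `fraction` of the causal
    variants.

    `loci` is a list of loci; each locus is a list of (score, is_causal)
    pairs. This layout is an illustrative assumption.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    pooled = [pair for locus in loci for pair in locus]
    n_causal = sum(1 for _, is_causal in pooled if is_causal)
    if n_causal == 0:
        raise ValueError("no causal variants in the input")

    # Number of causal variants that must be found (rounded up).
    needed = math.ceil(fraction * n_causal - 1e-9)

    # Rank all SNPs from highest to lowest score.
    pooled.sort(key=lambda pair: pair[0], reverse=True)

    found = 0
    for n_selected, (_, is_causal) in enumerate(pooled, start=1):
        found += bool(is_causal)
        if found >= needed:
            return n_selected / len(loci)


# Toy example: two loci, three causal variants in total.
toy = [
    [(0.9, True), (0.5, False), (0.1, False)],
    [(0.8, False), (0.7, True), (0.2, True)],
]
print(snps_per_locus_to_capture(toy, 0.5))  # 1.5
print(snps_per_locus_to_capture(toy, 0.9))  # 2.5
```

Ranking within each locus independently, the alternative mentioned in the text, would apply the same counting to each locus separately and then average the results.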

