A Bayesian outlier criterion to detect SNPs under selection in large data sets.

Gautier M, Hocking TD, Foulley JL - PLoS ONE (2010)

Bottom Line: Robustness and power of the two resulting Bayesian model-based approaches to detect SNPs under selection are further evaluated through extensive simulations. An application to a cattle data set is also provided. The procedure described turns out to be much faster than former Bayesian approaches and also reasonably efficient, especially to detect loci under positive selection.


Affiliation: INRA, UMR1313 GABI, Jouy-en-Josas, France. mathieu.gautier@jouy.inra.fr

ABSTRACT

Background: The recent advent of high-throughput SNP genotyping technologies has opened new avenues of research for population genetics. In particular, a growing interest in the identification of footprints of selection, based on genome scans for adaptive differentiation, has emerged.

Methodology/principal findings: The purpose of this study is to develop an efficient model-based approach to perform Bayesian exploratory analyses for adaptive differentiation in very large SNP data sets. The basic idea is to start with a very simple model for neutral loci that is easy to implement under a Bayesian framework and to identify selected loci as outliers via Posterior Predictive P-values (PPP-values). Applications of this strategy are considered using two different statistical models. The first was initially interpreted in the context of populations evolving under pure genetic drift from a common ancestral population, while the second relies on populations at migration-drift equilibrium. Robustness and power of the two resulting Bayesian model-based approaches to detect SNPs under selection are further evaluated through extensive simulations. An application to a cattle data set is also provided.

Conclusions/significance: The procedure described turns out to be much faster than former Bayesian approaches and also reasonably efficient, especially to detect loci under positive selection.

Figure 4 (pone-0011913-g004): Plots of the PPP-values estimated with model 2 against the BF (in dB) computed with model 3. A) Results for the PDM data set with T = 100 generations and J = 8 populations simulated under a pure-drift demographic model. B) Results for the MDM data set with FST = 0.1 and J = 8 populations simulated under a migration-drift demographic model. In A) and B), neutral (simulated) SNPs are plotted in grey, while SNPs subjected to positive (respectively balancing) selection are plotted in red (respectively blue). Depending on their underlying simulated coefficient of selection, the latter are represented by a triangle (s = 0.02), a circle (s = 0.05) or a square (s = 0.10). C) Results from the analysis of the data set consisting of 36,320 SNPs genotyped on 9 West-African cattle populations [12]. SNPs located within the first (last) half of the SNP-specific FST distribution, as previously estimated [12], are plotted in blue (red), the color being darker for those within the first (last) quarter of this same distribution.

Mentions: We first assessed the power of the three classifiers (based on PPP-values estimated under models 1 and 2 and on the BF estimated under model 3) by generating receiver operating characteristic (ROC) curves [26], which plot power against (1 - specificity) for a binary classification system as the cutoff value is varied. Curves resulting from the analysis of nine different simulated data sets are reported in Figure 3 for each of the three classifiers, distinguishing SNPs subjected to positive (in red) and balancing (in blue) selection (irrespective of the intensity of selection). As expected from previous observations, ROC curves for SNPs subjected to positive selection were always better than those for SNPs under balancing selection. Similarly, power to detect SNPs subjected to selection under a pure-drift demographic model increased with differentiation but remained lower than under a migration-drift demographic model. Interestingly, ROC curves of both PPP-value classifiers were generally above those of the BF classifier, while the PPP-value classifier based on model 2 clearly outperformed the one based on model 1 for positive selection. Nevertheless, the definition of an optimal threshold value for the PPP-value classifier strongly depends on the level of differentiation (see above). Table 3 reports the comparisons of the power and robustness of the analyses performed with model 2 and model 3. As a matter of expedience, we chose a PPP-value threshold of 0.2 (respectively 0.8) to declare SNPs as subjected to positive (respectively balancing) selection, and a threshold of 15 on the BF when analyzing data with model 3. At such thresholds, the FDR was generally similar between the two analyses except for low levels of differentiation (FST ≤ 0.05). In these cases, FDR (and thus power) was substantially higher for model 2 than for model 3. Note that a large proportion of false positives originated from the upper tail of the PPP-value distribution (see for instance results for the MDM data sets simulated with FST = 0.05). At the thresholds considered and for data sets simulated under migration-drift equilibrium, the power to detect SNPs under positive selection varied from 42.9% to 56.8% and was similar between the two approaches. However, for data sets simulated under a pure-drift demographic model, this power was always lower (from 0.3% to 40.9%). In addition, model 3 outperformed model 2 as the level of differentiation increased. Yet, as illustrated in Figure 4A for the PDM data set with T = 100 and J = 8, the PPP-value threshold of 0.2 is clearly too stringent. Hence, for this latter data set, increasing the threshold to 0.3 improves the power from 24.3% to 41.2% (see Table 2), which is similar to the model 3 value (40.9%), without affecting robustness. As expected from previous studies [10]–[13], the power to detect SNPs under balancing selection was small (always lower than 25%). However, model 2 generally performed better than model 3, in particular for MDM data sets.
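The threshold-based evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that a SNP is called under positive selection when its PPP-value falls strictly below the lower threshold and under balancing selection when it falls strictly above the upper threshold (a direction inferred from the text, where raising the lower threshold from 0.2 to 0.3 increases power). The function names and the toy PPP-values and labels are hypothetical and serve only to show how power, FDR and ROC points follow from a set of calls.

```python
from typing import List, Sequence, Tuple


def classify_ppp(ppp: float, low: float = 0.2, high: float = 0.8) -> str:
    """Label one SNP from its PPP-value (assumed rule: strict inequalities)."""
    if ppp < low:
        return "positive"
    if ppp > high:
        return "balancing"
    return "neutral"


def power_and_fdr(calls: Sequence[str], truth: Sequence[str],
                  target: str) -> Tuple[float, float]:
    """Power = correctly called targets / true targets.
    FDR = wrongly called targets / all SNPs called as target."""
    called = [c == target for c in calls]
    actual = [t == target for t in truth]
    tp = sum(c and a for c, a in zip(called, actual))
    n_called = sum(called)
    n_actual = sum(actual)
    power = tp / n_actual if n_actual else 0.0
    fdr = (n_called - tp) / n_called if n_called else 0.0
    return power, fdr


def roc_points(scores: Sequence[float], is_selected: Sequence[bool],
               cutoffs: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (1 - specificity, power) for each cutoff.
    A SNP is called when its score is strictly below the cutoff.
    Requires at least one selected and one non-selected SNP."""
    n_pos = sum(is_selected)
    n_neg = len(is_selected) - n_pos
    points = []
    for cut in cutoffs:
        tp = sum(s < cut and y for s, y in zip(scores, is_selected))
        fp = sum(s < cut and not y for s, y in zip(scores, is_selected))
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data (hypothetical, not taken from the paper).
ppp_values = [0.05, 0.25, 0.50, 0.90, 0.15, 0.85, 0.60, 0.10]
truth = ["positive", "positive", "neutral", "balancing",
         "neutral", "neutral", "neutral", "positive"]

calls_02 = [classify_ppp(p, low=0.2) for p in ppp_values]
calls_03 = [classify_ppp(p, low=0.3) for p in ppp_values]
print(power_and_fdr(calls_02, truth, "positive"))
print(power_and_fdr(calls_03, truth, "positive"))

is_positive = [t == "positive" for t in truth]
print(roc_points(ppp_values, is_positive, [0.0, 0.2, 0.3, 1.01]))
```

In this toy example, relaxing the lower threshold from 0.2 to 0.3 recovers one additional truly selected SNP without adding a false call, which mirrors the kind of trade-off the text discusses for the PDM data set. With real data, the same functions would be applied to the PPP-values and the simulated selection status of each SNP.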

