A Bayesian outlier criterion to detect SNPs under selection in large data sets.

Gautier M, Hocking TD, Foulley JL - PLoS ONE (2010)

Bottom Line: Robustness and power of the two resulting Bayesian model-based approaches to detect SNPs under selection are further evaluated through extensive simulations. An application to a cattle data set is also provided. The procedure described turns out to be much faster than former Bayesian approaches and also reasonably efficient, especially to detect loci under positive selection.


Affiliation: INRA, UMR1313 GABI, Jouy-en-Josas, France. mathieu.gautier@jouy.inra.fr

ABSTRACT

Background: The recent advent of high-throughput SNP genotyping technologies has opened new avenues of research for population genetics. In particular, a growing interest in the identification of footprints of selection, based on genome scans for adaptive differentiation, has emerged.

Methodology/principal findings: The purpose of this study is to develop an efficient model-based approach to perform Bayesian exploratory analyses for adaptive differentiation in very large SNP data sets. The basic idea is to start with a very simple model for neutral loci that is easy to implement under a Bayesian framework and to identify selected loci as outliers via Posterior Predictive P-values (PPP-values). Applications of this strategy are considered using two different statistical models. The first one was initially interpreted in the context of populations each evolving under pure genetic drift from a common ancestral population, while the second one relies on populations under migration-drift equilibrium. Robustness and power of the two resulting Bayesian model-based approaches to detect SNPs under selection are further evaluated through extensive simulations. An application to a cattle data set is also provided.

Conclusions/significance: The procedure described turns out to be much faster than former Bayesian approaches and also reasonably efficient, especially to detect loci under positive selection.

Figure 2 (pone-0011913-g002): Distribution of the PPP-value estimates obtained with model 1 and model 2 for two data sets (FST = 0.05 and FST = 0.15) simulated under a migration-drift demographic scenario. Each data set consists of genotyping data for 10,000 SNPs: 8,500 neutral SNPs (N), 3×250 = 750 SNPs subjected to positive selection (P) of varying intensity (s = 0.02, s = 0.05 and s = 0.10) and 3×250 = 750 SNPs subjected to balancing selection (B) of varying intensity (s = 0.02, s = 0.05 and s = 0.10). Boxplots of the PPP-values as a function of the type and intensity of selection are shown for the data set simulated with T = 10 and analyzed with model 1 (A) and model 2 (B), and for the data set simulated with T = 100 and analyzed with model 1 (D) and model 2 (E). In (C) and (F), PPP-values estimated with model 1 are plotted against those estimated with model 2 for the data sets simulated with T = 10 and T = 100, respectively. Neutral SNPs are plotted in grey, while SNPs subjected to positive (respectively balancing) selection are plotted in red (respectively blue). In addition, the simulated coefficients of selection are represented by a triangle (s = 0.02), a circle (s = 0.05) or a square (s = 0.10).


Mentions: Although we expected lower (resp. upper) tails to be enriched in SNPs under positive (resp. balancing) selection, identifying outliers on the observed PPP-value distribution makes it impossible, in practice, to control for False Discovery Rate (FDR) or False Negative Rate. We thus further investigated the power and robustness of each model, based on the simulated data sets, by computing the FDR and recording the number of SNPs properly identified as subjected to selection for different PPP-value thresholds (Table 2). For a given threshold, both the FDR and the power decreased as the number of simulated generations increased. This was expected, since more generations also sharpened the overall PPP-value distribution, due in particular to an increase of the cj and thus of the allele frequency variance. Similarly, the power was always higher when considering SNPs subjected to stronger selection (see above). Model 2 appeared far more efficient than model 1, mainly because PPP-value estimates were more extreme for SNPs under selection (e.g. Figures 2C and 2F). For instance, when T = 75, a threshold of 0.2 to detect SNPs under positive selection resulted in an FDR equal to 0, while the power was 13.6% when using model 1 and 68.4% when using model 2. The associated FDR for such a threshold when T = 10 was close to 10% for both models (Table 1). Finally, performing similar analyses on simulated data sets with a lower number of populations (Tables 1 and 2) led to only a slight decrease in overall power.
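The FDR and power computation described above can be illustrated with a minimal sketch. It assumes that each simulated SNP carries a PPP-value and a known true status (selected or neutral), and that a SNP is flagged when its PPP-value falls at or below the threshold in the lower tail (or at or above it in the upper tail). The function name `fdr_and_power`, its arguments and the toy values are hypothetical and are not taken from the article.

```python
from typing import Sequence, Tuple


def fdr_and_power(
    ppp: Sequence[float],
    is_selected: Sequence[bool],
    threshold: float,
    tail: str = "lower",
) -> Tuple[float, float, int]:
    """Return (FDR, power, number flagged) for one PPP-value threshold.

    Assumption: a SNP is flagged when its PPP-value is <= threshold
    (lower tail) or >= threshold (upper tail).
    """
    if len(ppp) != len(is_selected):
        raise ValueError("ppp and is_selected must have the same length")
    if tail == "lower":
        flagged = [p <= threshold for p in ppp]
    elif tail == "upper":
        flagged = [p >= threshold for p in ppp]
    else:
        raise ValueError("tail must be 'lower' or 'upper'")

    n_flagged = sum(flagged)
    n_selected = sum(bool(s) for s in is_selected)
    true_pos = sum(1 for f, s in zip(flagged, is_selected) if f and s)

    # FDR: share of flagged SNPs that are actually neutral.
    fdr = (n_flagged - true_pos) / n_flagged if n_flagged else 0.0
    # Power: share of truly selected SNPs that are flagged.
    power = true_pos / n_selected if n_selected else 0.0
    return fdr, power, n_flagged


# Toy values, invented for illustration only.
toy_ppp = [0.05, 0.10, 0.15, 0.30, 0.50, 0.55, 0.60, 0.90]
toy_truth = [True, True, False, True, False, False, False, False]
fdr, power, n_flagged = fdr_and_power(toy_ppp, toy_truth, threshold=0.2)
print(f"flagged={n_flagged} FDR={fdr:.3f} power={power:.3f}")
```

In this toy case three SNPs are flagged, two of which are truly selected, so the FDR is 1/3 and the power is 2/3. Such a calculation is only possible on simulated data where the true status is known, which is why the threshold cannot directly control the FDR on observed data.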

