Estimating demographic parameters from large-scale population genomic data using Approximate Bayesian Computation.

Li S, Jakobsson M - BMC Genet. (2012)

Bottom Line: We compared the ability of different summary statistics to infer demographic parameters, including haplotype and LD-based statistics, and found that the accuracy of the parameter estimates can be improved by combining summary statistics that capture different parts of information in the data. Furthermore, our results suggest that poor choices of prior distributions can in some circumstances be detected using ABC. We conclude that the ABC approach can accommodate realistic genome-wide population genetic data, which may be difficult to analyze with full likelihood approaches, and that the ABC can provide accurate and precise inference of demographic parameters from these data, suggesting that the ABC approach will be a useful tool for analyzing large genome-wide datasets.


Affiliation: Department of Evolutionary Biology, EBC, Uppsala University, Norbyvägen 18D, Uppsala SE-75236, Sweden.

ABSTRACT

Background: The Approximate Bayesian Computation (ABC) approach has been used to infer demographic parameters for numerous species, including humans. However, most applications of ABC still use limited amounts of data, from a small number of loci, compared to the large amount of genome-wide population-genetic data which have become available in the last few years.

Results: We evaluated the performance of the ABC approach for three 'population divergence' models - similar to the 'isolation with migration' model - when the data consist of several hundred thousand SNPs typed for multiple individuals, by simulating data from known demographic models. The ABC approach was used to infer demographic parameters of interest, and we compared the inferred values to the true parameter values that were used to generate hypothetical "observed" data. For all three case models, the ABC approach inferred most demographic parameters quite well, with narrow credible intervals (for example, population divergence times and past population sizes), but some parameters were more difficult to infer, such as population sizes at present and migration rates. We compared the ability of different summary statistics to infer demographic parameters, including haplotype and LD-based statistics, and found that the accuracy of the parameter estimates can be improved by combining summary statistics that capture different parts of information in the data. Furthermore, our results suggest that poor choices of prior distributions can in some circumstances be detected using ABC. Finally, increasing the amount of data beyond some hundred loci will substantially improve the accuracy of many parameter estimates using ABC.

Conclusions: We conclude that the ABC approach can accommodate realistic genome-wide population genetic data, which may be difficult to analyze with full likelihood approaches, and that the ABC can provide accurate and precise inference of demographic parameters from these data, suggesting that the ABC approach will be a useful tool for analyzing large genome-wide datasets.


Figure 7: The mean (across 49 choices of true T) difference between the true and estimated divergence time T (red) and the mean width of the 95% credible interval of the posterior sample (blue) given by single summary statistics, pairs of summary statistics, and the combination of all eight summary statistics. The results are based on model 3.
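The two quantities plotted in Figure 7 can be computed from posterior samples along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, the point estimate is taken to be the mean of the posterior sample, the difference is taken as an absolute value, and the 95% credible interval is taken as the equal-tailed interval between the 2.5% and 97.5% quantiles. The source does not spell out these details.

```python
import numpy as np


def summarize_posterior(posterior_sample, true_value, level=0.95):
    """Return (|posterior mean - true value|, width of the central interval).

    Assumption: the interval is equal-tailed, i.e. bounded by the
    (1 - level) / 2 and 1 - (1 - level) / 2 quantiles of the sample.
    """
    sample = np.asarray(posterior_sample, dtype=float)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(sample, [tail, 1.0 - tail])
    return abs(sample.mean() - true_value), upper - lower


def mean_performance(posterior_samples, true_values, level=0.95):
    """Average both quantities over several simulated 'observed' datasets."""
    results = [
        summarize_posterior(sample, truth, level)
        for sample, truth in zip(posterior_samples, true_values)
    ]
    errors, widths = zip(*results)
    return float(np.mean(errors)), float(np.mean(widths))
```

Calling `mean_performance` with one posterior sample per simulated dataset (49 in the setting described here) and the matching true values of T would give one red and one blue value of the kind shown in the figure.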

Mentions: We investigated the performance of each summary statistic, and combinations of summary statistics, for estimating the divergence time T. We investigated the complex model 3 by simulating 49 "observed" datasets from a set of known parameter values using the same approach as described above. The ABC with the local linear regression adjustment was used to infer the population divergence time T. The mean difference between the true and the estimated T, and the mean width of the 95% credible interval of the posterior sample of T, are shown in Figure 7 and Additional file 1: Table S2 for each summary statistic, for pairs of summary statistics, and for the combination of all summary statistics. We first noted that an accurate mean of the posterior sample (small deviation from the true parameter value) also corresponded to a narrow credible interval (Pearson correlation: 0.95, p < 10^-18). Moreover, pairs of summary statistics generally improved the accuracy of the parameter estimation compared to single summary statistics; the mean difference between true and estimated T was greater than 0.070 for all single summary statistics (mean difference across the 8 summary statistics equaled 0.110), whereas the mean difference was less than 0.070 for 68% of the pairs of summary statistics (mean difference across the 26 pairs of summary statistics equaled 0.055). However, some pairs of summary statistics performed poorly, at the level of single summary statistics: for example, pairs that included FST generally performed poorly, as did pairs that included similar types of data, such as the number of distinct haplotype-alleles (NOA) and the number of private haplotype-alleles (NPA; Figure 7, Additional file 1: Table S2).
The combination of all eight summary statistics estimated the divergence time T accurately (the mean difference was 0.0203 and the mean width of the 95% credible interval was 0.1669), but several pairs of summary statistics performed at the same level. Although this comparison of summary statistics was by no means exhaustive, we noted that i) combining summary statistics generally provided more accurate inference, and ii) there was a large variation in performance across pairs of summary statistics. These two observations suggested that combining several summary statistics that capture different population-genetic phenomena may be a powerful approach for making accurate inferences while keeping the number of summary statistics low, both important features for any ABC investigation [4].
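To make the procedure more concrete, the sketch below shows rejection ABC followed by a local linear regression adjustment on a toy problem, comparing single statistics with their combination. It is not the authors' pipeline: the uniform prior, the two noisy "summary statistics", the 5% acceptance fraction, the Epanechnikov weights, and the function name `abc_regression` are all assumptions chosen for illustration, and no coalescent simulation is involved.

```python
import numpy as np


def abc_regression(theta, stats, observed, accept_fraction=0.05):
    """Rejection ABC with a weighted local linear regression adjustment.

    theta:    (n,) parameter values drawn from the prior
    stats:    (n, k) summary statistics simulated for each draw
    observed: (k,) summary statistics of the "observed" data
    Returns the adjusted parameter values and their kernel weights.
    """
    theta = np.asarray(theta, dtype=float)
    stats = np.asarray(stats, dtype=float)
    observed = np.asarray(observed, dtype=float)

    # Scale each statistic so that no single one dominates the distance.
    scale = stats.std(axis=0)
    scale[scale == 0] = 1.0
    diffs = (stats - observed) / scale
    dist = np.sqrt((diffs ** 2).sum(axis=1))

    # Rejection step: keep the simulations closest to the observed data.
    n_keep = max(int(round(accept_fraction * len(theta))), stats.shape[1] + 2)
    keep = np.argsort(dist)[:n_keep]

    # Epanechnikov weights (an assumed kernel): closer simulations count more.
    weights = 1.0 - (dist[keep] / dist[keep].max()) ** 2

    # Weighted least squares of theta on (s - s_obs), with an intercept.
    design = np.column_stack([np.ones(n_keep), diffs[keep]])
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(
        design * root_w[:, None], theta[keep] * root_w, rcond=None
    )

    # Shift each accepted draw to where it would lie at the observed statistics.
    adjusted = theta[keep] - diffs[keep] @ coef[1:]
    return adjusted, weights


# Toy setting (hypothetical): T has a uniform prior, and two noisy
# statistics respond to T in different ways.
rng = np.random.default_rng(1)
n = 20000
T = rng.uniform(0.0, 1.0, n)
stats = np.column_stack([
    T + rng.normal(0.0, 0.1, n),
    np.sqrt(T) + rng.normal(0.0, 0.1, n),
])
true_T = 0.4
observed = np.array([true_T, np.sqrt(true_T)])  # noise-free stand-in

for label, cols in [("statistic 1", [0]), ("statistic 2", [1]), ("both", [0, 1])]:
    post, w = abc_regression(T, stats[:, cols], observed[cols])
    estimate = np.average(post, weights=w)
    print(label, "estimate:", estimate, "abs. error:", abs(estimate - true_T))
```

Repeating the loop over many simulated "observed" datasets, and over every pair of statistics, would give a comparison of the same kind as the one reported here, although the numbers from this toy model say nothing about the real summary statistics.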

