Correcting the site frequency spectrum for divergence-based ascertainment.

Kern AD - PLoS ONE (2009)

Bottom Line: Comparative genomics based on sequenced reference genomes is essential to hypothesis generation and testing within population genetics. Here, a method to correct divergence-based ascertainment bias in the site frequency spectrum is developed that obtains maximum-likelihood estimates of the unascertained allele frequency distribution using numerical optimization. I show how divergence-based ascertainment may mimic the effects of natural selection and offer correction formulae for properly estimating the strength of selection in candidate regions in a maximum-likelihood setting.


Affiliation: Department of Biological Sciences, Dartmouth College, Hanover, NH, USA. andrew.d.kern@dartmouth.edu

ABSTRACT
Comparative genomics based on sequenced reference genomes is essential to hypothesis generation and testing within population genetics. However, selection of candidate regions for further study on the basis of elevated or depressed divergence between species leads to a divergence-based ascertainment bias in the site frequency spectrum within selected candidate loci. Here, a method to correct this problem is developed that obtains maximum-likelihood estimates of the unascertained allele frequency distribution using numerical optimization. I show how divergence-based ascertainment may mimic the effects of natural selection and offer correction formulae for properly estimating the strength of selection in candidate regions in a maximum-likelihood setting.


Figure 3 (pone-0005152-g003): ML estimates of the site frequency spectrum from divergence-ascertained data. Data used in the top and bottom panels correspond to the lower and upper 1% divergence-ascertained data simulated via a coalescent method (see Figure 1 caption for details). Shown are the unfolded site frequency spectra from the divergence-ascertained data (orange), ascertainment-corrected data (blue), and the expected standard neutral model spectrum (black).

Mentions: To correct for this divergence-based ascertainment, we wish to obtain an estimate of the unascertained site frequency spectrum. Following the general framework of Nielsen et al. [7], let $p_j$ be the frequency of segregating sites with derived allele frequency $j$ in a sample of $n$ chromosomes that has undergone no ascertainment bias ($1 \le j \le n-1$). With a set of observed counts of allele frequencies in hand, we aim to find the maximum-likelihood estimate of the site frequency spectrum, $\hat{p}$. We assume that the presence or absence of the derived version of each segregating site in the original divergence comparison is known. The likelihood function for the site frequency spectrum is then

$$L(p) = \prod_{i=1}^{S} \Pr(X_i = x_i \mid Asc_i) = \prod_{i=1}^{S} \frac{p_{x_i} \Pr(Asc_i \mid X_i = x_i)}{\sum_{j=1}^{n-1} p_j \Pr(Asc_i \mid X_i = j)} \qquad (1)$$

where $S$ is the number of segregating sites in the sample, $X_i$ is the derived allele frequency at the $i$th site, including those in the reference genome used to ascertain the site, and $Asc_i$ denotes the ascertainment of a segregating site in the original divergence comparison. It is clear that $\Pr(Asc_i \mid X_i = x_i, p) = \Pr(Asc_i \mid X_i = x_i)$, since ascertainment depends only on how many ancestral and derived alleles actually occur at site $i$ (Nielsen et al. [7]). Thus the crucial quantity for correcting divergence-based ascertainment is the probability of our ascertainment condition, which can be incorporated on a site-by-site basis to correct for arbitrary levels of divergence (see Figure 3). I assume that divergence ascertainment was performed in a subsample of size $d$ drawn from our final sample of size $n$; thus the probability of ascertainment is ...
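As a rough illustration of how a likelihood of this kind can be maximised, the Python sketch below treats the simplest case, in which every site shares the same ascertainment scheme. The ascertainment probability used here (the derived allele appears at least once in a random subsample of d of the n chromosomes) is an assumption chosen for illustration and is not the formula from the paper, which is cut off in this excerpt. The function names `asc_prob`, `log_likelihood` and `mle_sfs` are likewise hypothetical. Under this shared-scheme assumption the maximiser has a closed form, with each $p_j$ proportional to the observed count divided by the ascertainment probability; with site-specific ascertainment probabilities the likelihood would instead have to be maximised numerically.

```python
import math

def asc_prob(j, n, d):
    # ASSUMPTION (illustrative, not the paper's formula): a site is ascertained
    # when the derived allele is seen at least once in a random subsample of
    # d of the n chromosomes, given j derived copies in the full sample.
    return 1.0 - math.comb(n - j, d) / math.comb(n, d)

def log_likelihood(p, counts, asc):
    # log L(p) = sum_j counts[j] * log( p_j * a_j / sum_k p_k * a_k ),
    # i.e. the product over sites grouped by derived allele frequency j.
    norm = sum(p[j] * asc[j] for j in p)
    return sum(c * math.log(p[j] * asc[j] / norm)
               for j, c in counts.items() if c > 0)

def mle_sfs(counts, asc):
    # Closed-form maximiser when all sites share one ascertainment scheme:
    # p_j is proportional to counts[j] / a_j.
    weights = {j: counts[j] / asc[j] for j in counts}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

# Toy data (made up): n = 10 chromosomes, ascertainment in d = 1 of them,
# and a flat ascertained spectrum of 100 sites per frequency class.
n, d = 10, 1
asc = {j: asc_prob(j, n, d) for j in range(1, n)}
counts = {j: 100 for j in range(1, n)}
p_hat = mle_sfs(counts, asc)

# Standard neutral expectation, p_j proportional to 1/j, for comparison.
harmonic = sum(1.0 / j for j in range(1, n))
neutral = {j: (1.0 / j) / harmonic for j in range(1, n)}
```

With d = 1 the assumed ascertainment probability reduces to j/n, so a flat ascertained spectrum is exactly what a neutral 1/j spectrum would produce, and the corrected estimate `p_hat` coincides with `neutral`. This mirrors the point made in the text that ascertainment alone can distort the spectrum in a way that resembles selection.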

