Approximation to the distribution of fitness effects across functional categories in human segregating polymorphisms.

Racimo F, Schraiber JG - PLoS Genet. (2014)

Bottom Line: Here, we develop an approximation to the distribution of fitness effects (DFE) of segregating single-nucleotide mutations in humans. Unlike previous methods, we do not assume that synonymous mutations are neutral or not strongly selected, and we do not rely on fitting the DFE of all new nonsynonymous mutations to a single probability distribution, which is poorly motivated on a biological level. We find that transcriptional start sites, strong CTCF-enriched elements and enhancers are the regulatory categories with the largest proportion of deleterious polymorphisms.


Affiliation: Department of Integrative Biology, University of California, Berkeley, Berkeley, California, United States of America.

ABSTRACT
Quantifying the proportion of polymorphic mutations that are deleterious or neutral is of fundamental importance to our understanding of evolution, disease genetics and the maintenance of variation genome-wide. Here, we develop an approximation to the distribution of fitness effects (DFE) of segregating single-nucleotide mutations in humans. Unlike previous methods, we do not assume that synonymous mutations are neutral or not strongly selected, and we do not rely on fitting the DFE of all new nonsynonymous mutations to a single probability distribution, which is poorly motivated on a biological level. We rely on a previously developed method that utilizes a variety of published annotations (including conservation scores, protein deleteriousness estimates and regulatory data) to score all mutations in the human genome based on how likely they are to be affected by negative selection, controlling for mutation rate. We map this and other conservation scores to a scale of fitness coefficients via maximum likelihood using diffusion theory and a Poisson random field model on SNP data. Our method serves to approximate the deleterious DFE of mutations that are segregating, regardless of their genomic consequence. We can then compare the proportion of mutations that are negatively selected or neutral across various categories, including different types of regulatory sites. We observe that the distribution of intergenic polymorphisms is highly peaked at neutrality, while the distribution of nonsynonymous polymorphisms has a second peak at [Formula: see text]. Other types of polymorphisms have shapes that fall roughly in between these two. We find that transcriptional start sites, strong CTCF-enriched elements and enhancers are the regulatory categories with the largest proportion of deleterious polymorphisms.
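For orientation, the Poisson random field model mentioned in the abstract is often written in the following textbook form. This is a sketch for a constant-size population at equilibrium with genic selection, which is a simplifying assumption and not the fitted demography used in the paper. The symbols are introduced here for illustration: \gamma is the population-scaled selection coefficient, \theta is the scaled mutation rate of the site class (conventions differ by constant factors), n is the sample size, and X_i is the number of sites at which the derived allele is seen i times.

```latex
% Expected density of derived-allele frequencies x under selection (equilibrium)
f(x;\gamma) = \theta \,
  \frac{1 - e^{-2\gamma(1-x)}}{\left(1 - e^{-2\gamma}\right)\, x\,(1-x)},
  \qquad 0 < x < 1

% Site frequency spectrum: independent Poisson counts with means
\mathbb{E}[X_i] = \int_0^1 \binom{n}{i}\, x^{i} (1-x)^{n-i}\, f(x;\gamma)\, dx,
  \qquad i = 1, \dots, n-1

% Neutral limit
\lim_{\gamma \to 0} f(x;\gamma) = \frac{\theta}{x}
  \quad\Longrightarrow\quad \mathbb{E}[X_i] = \frac{\theta}{i}
```

Negative values of \gamma shift the expected spectrum toward rare alleles, which is the signal a maximum likelihood fit of this kind exploits.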

Figure 1 (pgen-1004697-g001): Mapping of C-scores to selection coefficients. A) We first fit a single coefficient to each bin using data from all autosomes in the genome. Red dots represent the maximum likelihood selection coefficient corresponding to each C-score bin. The blue line is a smooth spline fitted to the discrete mappings of C-scores to log-scaled selection coefficients after excluding the neutral bins (degrees of freedom = 20). The grey shade is a 95% confidence interval obtained from bootstrapping the data 100 times in each bin. B) To verify the shape of the mapping was not a result of the number of sites in each bin, we randomized the C-score labels across polymorphisms, and recomputed the mapping. C) To account for exonic patterns of conservation and background selection that may not have been captured by C-scores, we computed a mapping based solely on exonic sites. D) We fitted gamma distributions of selection coefficients to each bin, and computed the mean divided by the standard deviation of each distribution, which is indicative of its shape (see Results). E) To test whether the gamma distributions were a significantly better fit than the single-coefficient mapping, we computed log-likelihood ratio statistics of the two models at each bin. The black line denotes the Bonferroni-corrected significance cutoff. F) To test how well we were fitting the data at each bin, we computed chi-squared statistics of the fitted SFS to the observed SFS at each bin. The black line denotes the Bonferroni-corrected significance cutoff.
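The per-bin model comparison described for panel E can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that the gamma model has exactly one more free parameter than the single-coefficient model, so that the statistic is compared to a chi-squared distribution with one degree of freedom. Because a single coefficient sits on the boundary of the gamma family, that reference distribution is only approximate. The function names and the log-likelihood values are hypothetical.

```python
import math

def lrt_pvalue_df1(loglik_null, loglik_alt):
    """Likelihood ratio statistic and its chi-squared (1 d.f.) p-value.

    For one degree of freedom the survival function has the closed form
    P(X > s) = erfc(sqrt(s / 2)), so no statistics library is needed.
    """
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return stat, math.erfc(math.sqrt(stat / 2.0))

def bonferroni_significant(pvalues, alpha=0.05):
    """Flag p-values below the Bonferroni cutoff alpha / (number of tests)."""
    cutoff = alpha / len(pvalues)
    return [p < cutoff for p in pvalues]

# Hypothetical (single-coefficient, gamma) log-likelihoods for three bins.
bins = [(-1050.2, -1049.9), (-980.0, -960.0), (-1200.5, -1195.0)]
pvalues = [lrt_pvalue_df1(null, alt)[1] for null, alt in bins]
flags = bonferroni_significant(pvalues)
print(flags)
```

Dividing the significance level by the number of bins controls the family-wise error rate across all bins tested, which is what a single horizontal cutoff line in such a panel represents.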

Mentions: Using the best-fitting demography, we next fit a range of models with different selection coefficients to the EM-corrected site frequency spectrum (SFS) for each C-score bin, using maximum likelihood (Figure 1.A) (see Methods). We restricted the analysis to C ≤ 40, because very few sites have larger C-scores, and hence estimates of the selection coefficients for those C-scores are unreliable. We tested the robustness of the mappings to different levels of background selection by partitioning the data into deciles of B-scores [35] and re-computing the C-to-s mapping for each decile. We observe that the mapping is generally robust to background selection, with substantial differences only observed at the lowest two B-score deciles, which correspond to high background selection (Figure S4). For this reason, and so as to obtain reliable DFEs at exonic sites (where background selection is generally higher than in the rest of the genome), we also performed a neutral demographic fitting and a C-to-s mapping restricted to sites in the exome (Figure 1.C). This mapping has a steeper decline than the genomic mapping, reflecting patterns of background selection that are not fully controlled by C-scores but that affect the SFS. We therefore show estimated DFEs using both the genome-wide and the exome-wide fittings below. After removing the C-score bins that best fit the neutral model, we fit a smoothing spline with 20 degrees of freedom to the remaining C-scores, so as to produce a continuous mapping of C-scores to selection coefficients (Figure 1.A).
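The per-bin maximum likelihood step can be illustrated with a small sketch. It is not the authors' pipeline: it assumes a constant-size equilibrium population in place of the fitted demography, uses the standard equilibrium Poisson random field expectation with a population-scaled coefficient gamma, and searches a coarse grid. All function names, the grid, and the synthetic "observed" spectrum are hypothetical.

```python
import math

def selection_weight(x, gamma):
    """(1 - exp(-2*gamma*(1-x))) / (1 - exp(-2*gamma)), written to stay
    numerically stable for deleterious mutations (gamma <= 0)."""
    if gamma == 0.0:
        return 1.0 - x  # neutral limit
    g = -2.0 * gamma
    return math.exp(-g * x) * math.expm1(-g * (1.0 - x)) / math.expm1(-g)

def expected_sfs(n, gamma, theta=1.0, steps=2000):
    """Expected number of sites with derived-allele count i = 1..n-1."""
    h = 1.0 / steps
    sfs = []
    for i in range(1, n):
        total = 0.0
        for k in range(steps):
            x = (k + 0.5) * h  # midpoint rule, never evaluates x = 0 or 1
            total += (x ** (i - 1) * (1.0 - x) ** (n - i - 1)
                      * selection_weight(x, gamma))
        sfs.append(theta * math.comb(n, i) * total * h)
    return sfs

def profile_loglik(observed, n, gamma):
    """Poisson log-likelihood with theta set to its MLE for this gamma."""
    shape = expected_sfs(n, gamma)          # spectrum for theta = 1
    theta_hat = sum(observed) / sum(shape)  # closed-form Poisson MLE
    ll = 0.0
    for obs, e in zip(observed, shape):
        mu = theta_hat * e
        ll += obs * math.log(mu) - mu - math.lgamma(obs + 1.0)
    return ll

def fit_gamma(observed, n, grid):
    """Return the grid value of gamma with the highest profile likelihood."""
    return max(grid, key=lambda g: profile_loglik(observed, n, g))

# Synthetic check: use the exact expectation for gamma = -5 as the "data".
n = 10
grid = [0.0, -1.0, -2.0, -5.0, -10.0, -20.0]
observed = expected_sfs(n, -5.0, theta=1000.0)
best = fit_gamma(observed, n, grid)
print(best)
```

Profiling out theta leaves only the shape of the spectrum to inform gamma, so the fit is driven by the excess of rare alleles and not by the total number of segregating sites in a bin.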

