A uniform survey of allele-specific binding and expression over 1000-Genomes-Project individuals.

Chen J, Rozowsky J, Galeev TR, Harmanci A, Kitchen R, Bedford J, Abyzov A, Kong Y, Regan L, Gerstein M - Nat Commun (2016)

Bottom Line: Here, we provide insights into the functional effect of these variants using allele-specific behaviour. Since many allelic variants are rare, aggregation across multiple individuals is necessary to identify broadly applicable 'allelic elements'. Our results serve as an allele-specific annotation for the 1000 Genomes variant catalogue and are distributed as an online resource (alleledb.gersteinlab.org).

Affiliation: Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA.

ABSTRACT
Large-scale sequencing in the 1000 Genomes Project has revealed multitudes of single nucleotide variants (SNVs). Here, we provide insights into the functional effect of these variants using allele-specific behaviour. This can be assessed for an individual by mapping ChIP-seq and RNA-seq reads to a personal genome, and then measuring 'allelic imbalances' between the numbers of reads mapped to the paternal and maternal chromosomes. We annotate variants associated with allele-specific binding and expression in 382 individuals by uniformly processing 1,263 functional genomics data sets, developing approaches to reduce the heterogeneity between data sets due to overdispersion and mapping bias. Since many allelic variants are rare, aggregation across multiple individuals is necessary to identify broadly applicable 'allelic elements'. We also found SNVs for which we can anticipate allelic imbalance from the disruption of a binding motif. Our results serve as an allele-specific annotation for the 1000 Genomes variant catalogue and are distributed as an online resource (alleledb.gersteinlab.org).


Figure 2: Comparing the effects of the binomial and beta-binomial tests in data sets with low and intermediate levels of overdispersion. The grey bars in each plot represent the empirical allelic ratio distribution. For a and b, the red and blue lines on the left plots represent the expected allelic ratio distributions associated with the binomial and beta-binomial tests, respectively. The red and blue bars on the right plots represent the number of allele-specific (AS) SNVs detected by the binomial and beta-binomial tests, respectively. (a) The plots for one of the RNA-seq data sets for the individual HG00096. It has a low overdispersion parameter, ρ=0.0205. The empirical distribution does not have heavy tails, and the binomial and beta-binomial tests give very similar results. (b) The plots for one of the RNA-seq data sets for the individual NA11894. Overdispersion is higher at ρ=0.1234, and the beta-binomial distribution provides a better fit to the empirical allelic ratio distribution than the binomial distribution. The empirical distribution (grey bars) also shows heavier tails, signifying more SNVs with allelic imbalance.
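The contrast described in this caption can be illustrated with a small sketch. It is not the AlleleDB code: it assumes a beta-binomial parametrised by a mean allelic ratio of 0.5 and an overdispersion ρ, with shape parameters a = μ(1−ρ)/ρ and b = (1−μ)(1−ρ)/ρ, and it uses a simple "sum of outcomes no more likely than the observed one" two-sided p-value. The read counts (30 reference reads out of 40) and all function names are hypothetical; only the two ρ values come from the caption.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def binom_pmf(k, n, p=0.5):
    """Binomial probability of k reference reads out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def betabinom_pmf(k, n, rho, mu=0.5):
    """Beta-binomial probability under the assumed (mu, rho) parametrisation."""
    a = mu * (1 - rho) / rho
    b = (1 - mu) * (1 - rho) / rho
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + log_beta(k + a, n - k + b) - log_beta(a, b))


def two_sided_p(pmf, k, n):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    observed = pmf(k, n)
    total = sum(p for p in (pmf(i, n) for i in range(n + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical SNV: 30 of 40 reads carry the reference allele.
ref_reads, total_reads = 30, 40

p_binom = two_sided_p(binom_pmf, ref_reads, total_reads)
# Overdispersion values quoted in the caption for HG00096 and NA11894.
p_bb_low = two_sided_p(lambda k, n: betabinom_pmf(k, n, 0.0205),
                       ref_reads, total_reads)
p_bb_high = two_sided_p(lambda k, n: betabinom_pmf(k, n, 0.1234),
                        ref_reads, total_reads)

print(f"binomial p-value:                  {p_binom:.4g}")
print(f"beta-binomial p-value, rho=0.0205: {p_bb_low:.4g}")
print(f"beta-binomial p-value, rho=0.1234: {p_bb_high:.4g}")
```

Under these assumptions the same read counts become less significant as ρ grows, which is consistent with the caption's observation that a binomial test calls more allele-specific SNVs in overdispersed data sets.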

Mentions: In general, the AlleleDB workflow uniformly processes two pieces of information from each individual: the DNA variants, and reads from either a ChIP-seq or an RNA-seq experiment, to assess SNVs associated with allele-specific binding (ASB) or allele-specific expression (ASE), respectively (Fig. 1).

(1) It starts by constructing a diploid personal genome for each of the 382 individuals, using DNA variants from the 1000 Genomes Project.

(2) It then aligns the ChIP-seq or RNA-seq data set to each of the haploid genomes instead of the human reference genome, and chooses the better uniquely mapped alignment. This reduces reference bias that can potentially result in erroneous read mapping (ref. 13). Because each individual can have multiple ChIP-seq or RNA-seq data sets, the alignment to each personal genome is performed twice.

(2a) In the first round, the alignment is performed for each of the 276 ChIP-seq and 987 RNA-seq data sets to calculate a measure of overdispersion 'ρ' with respect to an expected binomial distribution (see the 'Methods' section). We observed that if there is greater overdispersion in the empirical allelic ratio distribution of a data set (the allelic ratio being defined as the proportion of reads that map to the reference allele), the binomial test tends to overestimate the number of allele-specific events (Fig. 2). There are varying degrees of overdispersion in our data sets, even between biological replicates. In general, RNA-seq data sets are more consistent in overdispersion than ChIP-seq data sets. Differing overdispersion in individual data sets poses a challenge later in Step 2b, when we pool and merge multiple data sets. To harmonize the data sets, we flag and filter out data sets that are deemed to be more overdispersed in their allelic ratio distributions, leaving us with 186 ChIP-seq and 955 RNA-seq data sets for allele-specific detection (Supplementary Table 2).

(2b) The second alignment is performed by 'pooling' the 186 ChIP-seq and 955 RNA-seq data sets that were not filtered out in Step 2a. The pooling is performed for each individual and, for ChIP-seq, each TF; for example, the CTCF (CCCTC-binding factor) ChIP-seq data sets for NA12878 that were not filtered out were pooled together. An overdispersion parameter is re-calculated for each pooled set.

(3) The third module filters reads that exhibit a bias that we term 'ambiguous mapping bias' (AMB). This bias occurs at a locus when reads containing one allele are preferred, not because of better alignments, but because the region overlapping the other allele shares sequence homology with another location in the genome. As a result, reads with the other allele align ambiguously to multiple locations and are consequently removed, resulting in an erroneous allelic imbalance at that locus (Fig. 1). This module detects reads that exhibit AMB via simulations. Briefly, for each original uniquely mapped read (which we call the 'O read') that overlaps at least one heterozygous SNV on one parental genome, we simulate reads ('S reads') that represent all possible haplotypes of that O read. We then align the S reads to the other parental genome. O reads with S reads that map to multiple locations (we call them 'AMB reads') are filtered from the aligned reads obtained in Step 2b (see Fig. 1 and the 'Methods' section).

(4) Finally, we obtain allelic counts from the filtered read pile and estimate a 'pooled' overdispersion parameter. Beta-binomial tests are performed using the 'pooled' overdispersion parameter to detect allele-specific SNVs. For ChIP-seq data, the SNVs are further pared down to those within peak regions. We also remove SNVs if they lie in regions predicted to be copy number variants. Please refer to the 'Methods' section for a more detailed description.
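The simulation idea in Step 3 can be sketched in a few lines. This is only an illustration under loud assumptions, not the AlleleDB implementation: reads and genomes are toy strings, heterozygous SNVs are given as offsets within the read, and `count_exact_hits` is a hypothetical stand-in for a real aligner that simply counts exact matches. All names here are invented for the example.

```python
from itertools import product


def simulate_s_reads(o_read, het_snvs):
    """Return every haplotype of an O read.

    het_snvs maps a 0-based offset within the read to its two alleles,
    so a read overlapping k heterozygous SNVs yields 2**k S reads.
    """
    offsets = sorted(het_snvs)
    s_reads = []
    for alleles in product(*(het_snvs[i] for i in offsets)):
        bases = list(o_read)
        for offset, allele in zip(offsets, alleles):
            bases[offset] = allele
        s_reads.append("".join(bases))
    return s_reads


def count_exact_hits(read, genome):
    """Toy stand-in for an aligner: count exact (possibly overlapping) matches."""
    hits, start = 0, genome.find(read)
    while start != -1:
        hits += 1
        start = genome.find(read, start + 1)
    return hits


def is_amb_read(o_read, het_snvs, other_genome):
    """Flag the O read if any of its S reads maps to more than one location."""
    return any(count_exact_hits(s, other_genome) > 1
               for s in simulate_s_reads(o_read, het_snvs))


# Toy O read with one heterozygous SNV (G/T) at offset 2.
o_read = "ACGTAC"
het_snvs = {2: ("G", "T")}

# In this toy genome the T-allele version of the read occurs twice.
ambiguous_genome = "TTACTTACGGACTTACAA"
# In this one, each version of the read occurs exactly once.
unique_genome = "GGACGTACGGTTACTTACCC"

print(simulate_s_reads(o_read, het_snvs))
print(is_amb_read(o_read, het_snvs, ambiguous_genome))
print(is_amb_read(o_read, het_snvs, unique_genome))
```

In a real pipeline the exact-match counter would be replaced by the aligner's own report of multi-mapping, and mismatches and read quality would matter; the sketch only shows how enumerating all haplotypes of a read exposes an allele whose version maps ambiguously.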

