A uniform survey of allele-specific binding and expression over 1000-Genomes-Project individuals.

Chen J, Rozowsky J, Galeev TR, Harmanci A, Kitchen R, Bedford J, Abyzov A, Kong Y, Regan L, Gerstein M - Nat Commun (2016)

Bottom Line: Here, we provide insights into the functional effect of these variants using allele-specific behaviour. Since many allelic variants are rare, aggregation across multiple individuals is necessary to identify broadly applicable 'allelic elements'. Our results serve as an allele-specific annotation for the 1000 Genomes variant catalogue and are distributed as an online resource (alleledb.gersteinlab.org).


Affiliation: Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA.

ABSTRACT
Large-scale sequencing in the 1000 Genomes Project has revealed multitudes of single nucleotide variants (SNVs). Here, we provide insights into the functional effect of these variants using allele-specific behaviour. This can be assessed for an individual by mapping ChIP-seq and RNA-seq reads to a personal genome, and then measuring 'allelic imbalances' between the numbers of reads mapped to the paternal and maternal chromosomes. We annotate variants associated with allele-specific binding and expression in 382 individuals by uniformly processing 1,263 functional genomics data sets, developing approaches to reduce the heterogeneity between data sets due to overdispersion and mapping bias. Since many allelic variants are rare, aggregation across multiple individuals is necessary to identify broadly applicable 'allelic elements'. We also found SNVs for which we can anticipate allelic imbalance from the disruption of a binding motif. Our results serve as an allele-specific annotation for the 1000 Genomes variant catalogue and are distributed as an online resource (alleledb.gersteinlab.org).


Figure 1. Workflow for uniform processing of data from 382 individuals and construction of AlleleDB. For each of the 382 individuals, (1) a diploid personal genome is first constructed using the variants from the 1000 Genomes Project. Next, reads from individual (2a) and pooled (2b) ChIP-seq or RNA-seq data sets are mapped onto each of the haplotypes of the diploid genome. In (2a), overdispersion (OD) is measured for each data set and used to flag and segregate highly overdispersed data sets ('OD data sets'). (2b) The resultant data sets are pooled. (3) From each pooled read pile for each individual and each transcription factor (TF; for ChIP-seq data sets), original reads that give rise to simulated ambiguous-mapping reads are removed ('AMB reads'). (4) For each filtered read pool, we estimate an OD parameter and then detect allele-specific SNVs. For detection, we compare the read counts between the two parental chromosomes and compute statistical significance (after multiple hypothesis test correction) with a beta-binomial test that uses the 'pooled' OD parameter to account for OD. All candidate allele-specific variants are then deposited in the AlleleDB database. Additional information, such as raw read counts of both accessible non-allele-specific and allele-specific variants, can be downloaded for further analyses.
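
The detection in step (4) can be illustrated with a minimal sketch of a beta-binomial test on the read counts from the two parental chromosomes. This is not the authors' implementation: the function names, the mapping from the overdispersion parameter `rho` to the beta shape parameters (alpha + beta = (1 - rho) / rho), and the two-sided p-value definition are all assumptions made for illustration, and multiple hypothesis test correction is left out.

```python
import math


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k, n, alpha, beta):
    """Probability of k successes in n trials under a beta-binomial model."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return math.exp(log_choose
                    + log_beta(k + alpha, n - k + beta)
                    - log_beta(alpha, beta))


def allelic_imbalance_pvalue(count_a, count_b, rho, p=0.5):
    """Two-sided p-value for the read counts of two alleles at one SNV.

    rho is the overdispersion parameter (0 < rho < 1); the assumed
    parameterization gives alpha + beta = (1 - rho) / rho, so small rho
    approaches the plain binomial test. p is the expected allelic ratio.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    n = count_a + count_b
    alpha = p * (1.0 - rho) / rho
    beta = (1.0 - p) * (1.0 - rho) / rho
    pmf = [betabinom_pmf(k, n, alpha, beta) for k in range(n + 1)]
    observed = pmf[count_a]
    # Sum the probabilities of all outcomes no more likely than the observed
    # one; the small tolerance guards against floating-point ties.
    total = sum(q for q in pmf if q <= observed * (1.0 + 1e-9))
    return min(1.0, total)
```

With the same counts, a larger `rho` yields a larger p-value, which matches the motivation for modelling overdispersion rather than relying on a plain binomial test.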

In general, the AlleleDB workflow uniformly processes two pieces of information from each individual: the DNA variants, and the reads from either a ChIP-seq or an RNA-seq experiment, to assess SNVs associated with allele-specific binding (ASB) or allele-specific expression (ASE), respectively (Fig. 1). (1) It starts by constructing a diploid personal genome for each of the 382 individuals, using DNA variants from the 1000 Genomes Project. (2) It then aligns each ChIP-seq or RNA-seq data set to each of the two haploid genomes instead of the human reference genome, and chooses the better uniquely mapped alignment. This reduces reference bias, which can result in erroneous read mapping (ref. 13). Because each individual can have multiple ChIP-seq or RNA-seq data sets, alignment to each personal genome is performed twice. (2a) In the first round, each of the 276 ChIP-seq and 987 RNA-seq data sets is aligned separately to calculate a measure of overdispersion, 'ρ', with respect to an expected binomial distribution (see the 'Methods' section). We observed that the greater the overdispersion in a data set's empirical allelic ratio distribution (the allelic ratio being the proportion of reads that map to the reference allele), the more the binomial test tends to overestimate the number of allele-specific events (Fig. 2). The degree of overdispersion varies across our data sets, even between biological replicates. RNA-seq data sets are generally more consistent in overdispersion than ChIP-seq data sets. Differing overdispersion in individual data sets poses a challenge later, in Step 2b, when we pool and merge multiple data sets. To harmonize the data sets, we flag and filter those deemed more overdispersed in their allelic ratio distributions, leaving 186 ChIP-seq and 955 RNA-seq data sets for allele-specific detection (Supplementary Table 2). (2b) The second alignment is performed by 'pooling' the 186 ChIP-seq and 955 RNA-seq data sets that were not filtered in Step 2a.
The pooling is performed for each individual and each TF (for ChIP-seq); for example, the CTCF (CCCTC-binding factor) ChIP-seq data sets for NA12878 that were not filtered were pooled together. An overdispersion parameter is recalculated for each pooled set. (3) The third module filters reads that exhibit a bias we term 'ambiguous mapping bias' (AMB). This bias occurs at a locus when reads containing one allele are preferred, not because they align better, but because the sequence overlapping the other allele is homologous to another location in the genome. As a result, reads with the other allele align ambiguously to multiple locations and are consequently removed, producing an erroneous allelic imbalance at that locus (Fig. 1). This module detects reads that exhibit AMB via simulations. Briefly, for each original uniquely mapped read (the 'O read') that overlaps at least one heterozygous SNV on one parental genome, we simulate reads ('S reads') that represent all possible haplotypes of that O read. We then align the S reads to the other parental genome. O reads whose S reads map to multiple locations (we call them 'AMB reads') are filtered from the aligned reads obtained in Step 2b (see Fig. 1 and the 'Methods' section). (4) Finally, we obtain allelic counts from the filtered read pile and estimate a 'pooled' overdispersion parameter. Beta-binomial tests are performed using the 'pooled' overdispersion parameter to detect allele-specific SNVs. For ChIP-seq data, the SNVs are further pared down to those within peak regions. We also remove SNVs that lie in regions predicted to be copy number variants. Please refer to the 'Methods' section for a more detailed description.
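
The simulation in step (3) can be sketched as follows, under stated assumptions. The helper names (`simulate_s_reads`, `is_amb_read`), the representation of heterozygous SNVs as read offsets mapped to allele pairs, and the `count_hits` callback standing in for an aligner are all hypothetical; the sketch also leaves the O read itself out of the simulated set, which is a choice made here and not something the text specifies.

```python
from itertools import product


def simulate_s_reads(o_read, het_sites):
    """Enumerate the alternative haplotypes ('S reads') of an original read.

    o_read    -- sequence of the uniquely mapped original read ('O read')
    het_sites -- dict mapping a 0-based offset within the read to the pair
                 of alleles observed at that heterozygous SNV
    Returns every allele combination except the O read itself.
    """
    offsets = sorted(het_sites)
    s_reads = []
    for combo in product(*(het_sites[offset] for offset in offsets)):
        bases = list(o_read)
        for offset, allele in zip(offsets, combo):
            bases[offset] = allele
        candidate = "".join(bases)
        if candidate != o_read:
            s_reads.append(candidate)
    return s_reads


def is_amb_read(o_read, het_sites, count_hits):
    """Flag an O read as an 'AMB read' if any S read maps to several places.

    count_hits -- hypothetical callback returning how many locations a read
                  aligns to on the other parental genome; in practice this
                  would wrap a real aligner.
    """
    return any(count_hits(s_read) > 1
               for s_read in simulate_s_reads(o_read, het_sites))
```

A read overlapping k heterozygous SNVs yields up to 2^k - 1 S reads, so the number of simulated reads grows quickly only for reads that span many variants.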

