Accounting for immunoprecipitation efficiencies in the statistical analysis of ChIP-seq data.

Bao Y, Vinciotti V, Wit E, 't Hoen PA - BMC Bioinformatics (2013)

Bottom Line: These differences have a large impact on the quality of ChIP-seq data: a more efficient experiment will necessarily lead to a higher signal to background ratio, and therefore to an apparently larger number of enriched regions, compared to a less efficient experiment. We propose a statistical model for the detection of enriched and differentially bound regions from multiple ChIP-seq data sets. The framework that we present accounts explicitly for IP efficiencies in ChIP-seq data, and allows replicates and experiments from different proteins to be modelled jointly, rather than individually, leading to more robust biological conclusions.

Affiliation: School of Information Systems, Computing and Mathematics, Brunel University, London, UK.

ABSTRACT

Background: ImmunoPrecipitation (IP) efficiencies may vary widely between different antibodies and between repeated experiments with the same antibody. These differences have a large impact on the quality of ChIP-seq data: a more efficient experiment will necessarily lead to a higher signal to background ratio, and therefore to an apparently larger number of enriched regions, compared to a less efficient experiment. In this paper, we show how IP efficiencies can be explicitly accounted for in the joint statistical modelling of ChIP-seq data.

Results: We fit a latent mixture model to eight experiments on two proteins, from two laboratories where different antibodies are used for the two proteins. We use the model parameters to estimate the efficiencies of individual experiments, and find that these clearly differ between the laboratories, and amongst technical replicates from the same lab. When we account for ChIP efficiency, we find more regions bound in the more efficient experiments than in the less efficient ones, at the same false discovery rate. A priori knowledge of the same number of binding sites across experiments can also be included in the model for a more robust detection of differentially bound regions between two different proteins.

Conclusions: We propose a statistical model for the detection of enriched and differentially bound regions from multiple ChIP-seq data sets. The framework that we present accounts explicitly for IP efficiencies in ChIP-seq data, and allows replicates and experiments from different proteins to be modelled jointly, rather than individually, leading to more robust biological conclusions.


Figure 4: Simulation study to assess the usefulness of BIC in deciding whether two proteins have equal probability pc. BIC values for the model that assumes different pc probabilities for each condition (grey) and the one that assumes the same probabilities (p1 = p2, black). The x-axis shows the true p2−p1 values for simulated ChIP-seq data on two experiments with different IP efficiencies.

Mentions: In the presence of two different proteins, a priori biological knowledge about the two proteins can be further included in the test. In particular, in the context of the model described in this paper, one can impose the assumption that the two proteins have the same number of binding sites, that is p1 = p2, where pc is the probability of a region being enriched by protein c. If realistic, this assumption is expected to lead to a more robust detection of the enriched regions, by providing a better estimate of the expected number of enriched regions in the different experiments. Indirectly, this makes it possible to better account for the different IP efficiencies of the different experiments. The constraint p1 = p2 can be imposed in the maximum likelihood procedure, in a similar way to parameter estimation in the presence of replicates. However, in this case, we do not assume equal binding profiles (i.e. Xm1 = Xm2 for all regions m), an assumption which is instead appropriate for technical replicates. The main difficulty in implementing this method lies in assessing whether the assumption of a common pc is appropriate in a particular biological context. When no definite knowledge is available, we suggest comparing the fit of a model which assumes p1 = p2 with that of a model which does not. As the two models have a different number of parameters, we suggest comparing them in terms of their BIC value. This is defined in the usual way by $\mathrm{BIC} = -2\log L(\hat{\theta}) + r\log n$, with $\hat{\theta}$ the estimated parameters in the model, $L(\hat{\theta})$ the maximum likelihood, r the number of parameters and n the number of observations. The estimated parameters differ depending on whether the constraint of equal pc is imposed or not, and the best model is chosen as the one with the lowest BIC.

Figure 4 shows the output of a simulation study where we have assessed whether this BIC measure leads to an informative choice in our context. We have simulated count data on 10000 regions for two different experiments (e.g. proteins), using the mixture distributions p1NB(14,2)+(1−p1)NB(0.5,2) and p2NB(5,1)+(1−p2)NB(1,1), respectively. We have chosen these distributions so as to have different IP efficiencies (namely, IPE1 = 0.9996 and IPE2 = 0.9732). The plot gives the average BIC value, over 100 iterations, for the model which does not assume equal probabilities (grey line) versus the model which does make this assumption (black line). The x-axis shows the true p2−p1 value for the different simulations, where we fix p1 = 0.05 and vary p2 between 0.05 and 0.06. Despite the different IP efficiencies, the BIC measure clearly distinguishes between the case when p1 = p2 and the case when this assumption is not satisfied. The simulation further shows that there is a small margin of error for values of p2 very close to, but not exactly equal to, p1.
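
The BIC comparison described above can be illustrated with a minimal Python sketch. It is a simplified stand-in, not the authors' implementation, and rests on several assumptions that the text does not confirm: NB(a, b) is read as mean a and size b, the component distributions are treated as known so that only the mixing probabilities are estimated (by a basic EM update), and n in the BIC penalty is taken to be the total number of simulated counts. All function names are hypothetical.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma, otypes=[float])

def nb_logpmf(k, mean, size):
    """Log pmf of a negative binomial, ASSUMING NB(mean, size) parameterization."""
    q = size / (size + mean)  # success probability in the (size, q) form
    return (_lgamma(k + size) - math.lgamma(size) - _lgamma(k + 1.0)
            + size * math.log(q) + k * math.log1p(-q))

def simulate_mixture(p, signal, background, n_regions, rng):
    """Draw counts from p*NB(signal) + (1-p)*NB(background)."""
    enriched = rng.random(n_regions) < p
    def draw(mean, size):
        return rng.negative_binomial(size, size / (size + mean), n_regions)
    return np.where(enriched, draw(*signal), draw(*background)), enriched

def fit_p(f_sig, f_bg, n_iter=200):
    """EM estimate of the mixing probability with known component densities."""
    p = 0.5
    for _ in range(n_iter):
        resp = p * f_sig / (p * f_sig + (1 - p) * f_bg)  # P(enriched | count)
        p = resp.mean()
    return p

def log_lik(p, f_sig, f_bg):
    return float(np.sum(np.log(p * f_sig + (1 - p) * f_bg)))

def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 log L + r log n."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def compare_models(p1, p2, n_regions=10000, seed=0):
    """Return (BIC with free p1, p2; BIC with the constraint p1 = p2)."""
    rng = np.random.default_rng(seed)
    # Component settings taken from the simulation described in the text.
    specs = [(p1, (14, 2), (0.5, 2)), (p2, (5, 1), (1, 1))]
    f_sig, f_bg = [], []
    for p, signal, background in specs:
        counts, _ = simulate_mixture(p, signal, background, n_regions, rng)
        f_sig.append(np.exp(nb_logpmf(counts, *signal)))
        f_bg.append(np.exp(nb_logpmf(counts, *background)))
    n_obs = 2 * n_regions  # ASSUMPTION: n counts every simulated region
    # Unconstrained model: one mixing probability per experiment (r = 2).
    ll_free = sum(log_lik(fit_p(s, b), s, b) for s, b in zip(f_sig, f_bg))
    # Constrained model: a single shared probability (r = 1).
    s_all, b_all = np.concatenate(f_sig), np.concatenate(f_bg)
    ll_equal = log_lik(fit_p(s_all, b_all), s_all, b_all)
    return bic(ll_free, 2, n_obs), bic(ll_equal, 1, n_obs)

if __name__ == "__main__":
    for p2 in (0.05, 0.055, 0.06):
        free, equal = compare_models(0.05, p2)
        print(f"p2 - p1 = {p2 - 0.05:.3f}: BIC free = {free:.1f}, BIC equal = {equal:.1f}")
```

Because the constrained model has one fewer parameter, it is preferred whenever the gain in log-likelihood from freeing p2 does not exceed the extra penalty of log n. Averaging the two BIC values over repeated seeds would mimic the 100 iterations mentioned in the text.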
