Parameter estimation for robust HMM analysis of ChIP-chip data.

Humburg P, Bulger D, Stone G - BMC Bioinformatics (2008)

Bottom Line: Although hidden Markov models have been used successfully to analyse tiling array data, parameter estimation for these models is typically ad hoc. We illustrate an efficient parameter estimation procedure that can be used for HMM based methods in general and leads to a clear increase in performance when compared to the use of ad hoc estimates. The resulting hidden Markov model outperforms established methods like TileMap in the context of histone modification studies.


Affiliation: Department of Statistics, Macquarie University, North Ryde, NSW 2109, Australia. peter.humburg@csiro.au

ABSTRACT

Background: Tiling arrays are an important tool for the study of transcriptional activity, protein-DNA interactions and chromatin structure on a genome-wide scale at high resolution. Although hidden Markov models have been used successfully to analyse tiling array data, parameter estimation for these models is typically ad hoc. Especially in the context of ChIP-chip experiments, no standard procedures exist to obtain parameter estimates from the data. Common methods for the calculation of maximum likelihood estimates such as the Baum-Welch algorithm or Viterbi training are rarely applied in the context of tiling array analysis.

Results: Here we develop a hidden Markov model for the analysis of chromatin structure ChIP-chip tiling array data, using t emission distributions to increase robustness towards outliers. Maximum likelihood estimates are used for all model parameters. Two different approaches to parameter estimation are investigated and combined into an efficient procedure.

Conclusion: We illustrate an efficient parameter estimation procedure that can be used for HMM based methods in general and leads to a clear increase in performance when compared to the use of ad hoc estimates. The resulting hidden Markov model outperforms established methods like TileMap in the context of histone modification studies.


Figure 8: Length distribution of enriched regions from dataset II. Quantile-quantile plots comparing length distributions of enriched regions found with TileMap (left) and with the model based on maximum likelihood estimates (right) to the true length distribution of enriched regions in dataset II. Figures on the bottom provide a close-up view of the plots above. Each dot represents a percentile of the length distributions.

Mentions: When studying histone modifications, one possible characteristic of interest is the length of enriched regions. To assess how accurately the different methods reflect the length distribution of enriched regions, we compare the length of regions predicted by TileMap and by the model (using Baum-Welch parameter estimates) to the length distribution of enriched regions in the simulated data (the "true length distribution"). Note that this length distribution may differ from the one found in real data. Nevertheless, this comparison highlights some of the differences between the two models. Quantile-quantile plots of the respective length distributions show that TileMap systematically underestimates the length of enriched regions (Figure 7 (bottom left) and Figure 8 (bottom left)). While this effect is relatively small on dataset I, there is some indication that it increases with region length and that long regions may not be characterised appropriately by TileMap (Figure 7 (top left)). This observation is further supported by the length distribution of enriched regions produced by TileMap on dataset II (Figure 8 (left)). Enriched regions in dataset II are generally longer than regions in dataset I. This difference is not captured by TileMap. Both TileMap and the Baum-Welch trained model produce several regions that are shorter than the shortest enriched region in the simulated data (Figure 7 (bottom)). There are two possible explanations for these short regions. They may be caused by underestimating the length of enriched regions, possibly splitting one enriched region into several predicted regions, or they may represent spurious enrichment calls produced by the model. In either case, the occurrence of extremely short regions may be caused by an intrinsic shortcoming of the model or by artifacts introduced during the simulation process.
Since the simulation relies on TileMap to identify enriched and non-enriched probes, it is inevitable that some probes will be misclassified. Subsequently, these probes may be included in the simulated data, causing short disruptions of enriched and non-enriched regions. A sufficiently sensitive model could detect these unintended changes between enriched and non-enriched states.
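
The length comparison described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the 0/1 encoding of probe states, and the toy numbers are assumptions made for illustration, and region length is counted in probes rather than base pairs. The sketch shows how a single misclassified probe splits one enriched region into two shorter ones, and how the percentiles of two length samples can be paired as in a quantile-quantile plot.

```python
from itertools import groupby
from statistics import quantiles


def enriched_region_lengths(states):
    """Lengths (in probes) of each maximal run of 1s in a 0/1 state sequence.

    Assumed encoding: 1 = enriched probe, 0 = non-enriched probe.
    """
    return [sum(1 for _ in run) for value, run in groupby(states) if value == 1]


def qq_points(true_lengths, predicted_lengths, n=100):
    """Pair the n-quantiles of two length samples (the dots of a Q-Q plot).

    Each sample needs at least two values; n=100 gives percentiles.
    """
    true_q = quantiles(true_lengths, n=n, method="inclusive")
    pred_q = quantiles(predicted_lengths, n=n, method="inclusive")
    return list(zip(true_q, pred_q))


# Toy example: one true enriched region spanning 8 probes ...
truth = [0, 0] + [1] * 8 + [0, 0]
# ... and a call in which one misclassified probe splits that region.
predicted = list(truth)
predicted[5] = 0

print(enriched_region_lengths(truth))      # [8]
print(enriched_region_lengths(predicted))  # [3, 4]

# Hypothetical length samples in which predictions run short of the truth.
toy_true = [10, 20, 30, 40]
toy_pred = [8, 15, 22, 30]
for true_q, pred_q in qq_points(toy_true, toy_pred, n=4):
    # Points with pred_q < true_q fall below the diagonal of the Q-Q plot.
    print(true_q, pred_q)
```

In a plot of predicted against true quantiles, a method that systematically underestimates region length yields points below the diagonal, while split regions add mass at the short end of the predicted distribution.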

