zipHMMlib: a highly optimised HMM library exploiting repetitions in the input to speed up the forward algorithm.
Bottom Line:
Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms.This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models.Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model.In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library.
Affiliation: Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark. asand@birc.au.dk.
ABSTRACT
Show MeSH
Background: Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library. Results: We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a pre-processing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models.Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library. Conclusions: We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/. |
Related In:
Results -
Collection
License getmorefigures.php?uid=PMC4222747&req=5
Mentions: While the first iterations of the preprocessing procedure compress the sequence very effectively, the last iterations do not decrease the sequence length by much, since most pairs are uncommon when more characters are introduced. This is illustrated in Figure 3, where we see that the number of occurrences of the most frequent pair of symbols decreases superexponentially as a function of the number of iterations performed on an alignment of the human and chimpanzee chromosome 1. This means that we potentially save a lot of time on the likelihood computation by performing the first iterations, but as the slope of the curve increases towards 0 we risk to spend a long time on the preprocessing and save very little time on the actual likelihood computation. To overcome this problem, we do not compress the input sequence all the way down to a single character. Assume we know that the preprocessing will not be reused for an HMM with less than Nmin states, and let tmv be the time required for an (Nmin×Nmin)×Nmin matrix-vector multiplication and tmm be the time required for an (Nmin×Nmin)×(Nmin×Nmin) matrix-matrix multiplication. In iteration i of the preprocessing we replace the most frequent pair of two symbols in the current sequence and find the most frequent pair of two symbols in the resulting sequence. |
View Article: PubMed Central - HTML - PubMed
Affiliation: Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark. asand@birc.au.dk.
Background: Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library.
Results: We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a pre-processing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models.Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library.
Conclusions: We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/.