Limits...
zipHMMlib: a highly optimised HMM library exploiting repetitions in the input to speed up the forward algorithm.

Sand A, Kristiansen M, Pedersen CN, Mailund T - BMC Bioinformatics (2013)

Bottom Line: Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms.This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models.Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model.In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library.

View Article: PubMed Central - HTML - PubMed

Affiliation: Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark. asand@birc.au.dk.

ABSTRACT

Background: Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library.

Results: We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a pre-processing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models.Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library.

Conclusions: We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/.

Show MeSH
Speedup vs. sequence complexity. The speedup factor decreases linearly with the sequence complexity.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4222747&req=5

Figure 6: Speedup vs. sequence complexity. The speedup factor decreases linearly with the sequence complexity.

Mentions: Our new implementation of the forward algorithm is expected to perform best on strings of low complexity because they are more compressible. To investigate this we measured the per-iteration running time of the forward algorithm for parredHMMlib, HMMlib and the simple implementation of equation (3) on random binary sequences (over the alphabet {0,1}) of length L=107 with the frequency of 1s varying from 0.0001 to 0.05, and divided it by the per-iteration running time for zipHMMlib (excluding the preprocessing time) to obtain the speedup factor. This experiment is summarised in Figure 6, where we note that the speedup factor decreases linearly with the complexity of the input sequence; however, speedup factors of more than two orders of magnitude are obtained for less complex sequences, and even for sequences of low complexity a (modest) speedup is obtained.


zipHMMlib: a highly optimised HMM library exploiting repetitions in the input to speed up the forward algorithm.

Sand A, Kristiansen M, Pedersen CN, Mailund T - BMC Bioinformatics (2013)

Speedup vs. sequence complexity. The speedup factor decreases linearly with the sequence complexity.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4222747&req=5

Figure 6: Speedup vs. sequence complexity. The speedup factor decreases linearly with the sequence complexity.
Mentions: Our new implementation of the forward algorithm is expected to perform best on strings of low complexity because they are more compressible. To investigate this we measured the per-iteration running time of the forward algorithm for parredHMMlib, HMMlib and the simple implementation of equation (3) on random binary sequences (over the alphabet {0,1}) of length L=107 with the frequency of 1s varying from 0.0001 to 0.05, and divided it by the per-iteration running time for zipHMMlib (excluding the preprocessing time) to obtain the speedup factor. This experiment is summarised in Figure 6, where we note that the speedup factor decreases linearly with the complexity of the input sequence; however, speedup factors of more than two orders of magnitude are obtained for less complex sequences, and even for sequences of low complexity a (modest) speedup is obtained.

Bottom Line: Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms.This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models.Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model.In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library.

View Article: PubMed Central - HTML - PubMed

Affiliation: Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark. asand@birc.au.dk.

ABSTRACT

Background: Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library.

Results: We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a pre-processing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models.Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library.

Conclusions: We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/.

Show MeSH