Limits...
Multithreaded comparative RNA secondary structure prediction using stochastic context-free grammars.

Sükösd Z, Knudsen B, Vaerum M, Kjems J, Andersen ES - BMC Bioinformatics (2011)

Bottom Line: We have also improved the user interface and portability: alongside standalone executable and Java source code of the program, PPfold is also available as a free plugin to the CLC Workbenches.We have evaluated the accuracy of PPfold using BRaliBase I tests, and demonstrated its practical use by predicting the secondary structure of an alignment of 24 complete HIV-1 genomes in 65 minutes on an 8-core machine and identifying several known structural elements in the prediction.Based on the pfold model, PPfold is capable of fast, high-quality predictions of large RNA secondary structures, such as the genomes of RNA viruses or long genomic transcripts.

View Article: PubMed Central - HTML - PubMed

Affiliation: Interdisciplinary Nanoscience Center, Aarhus University, Denmark. zs@mb.au.dk

ABSTRACT

Background: The prediction of the structure of large RNAs remains a particular challenge in bioinformatics, due to the computational complexity and low levels of accuracy of state-of-the-art algorithms. The pfold model couples a stochastic context-free grammar to phylogenetic analysis for a high accuracy in predictions, but the time complexity of the algorithm and underflow errors have prevented its use for long alignments. Here we present PPfold, a multithreaded version of pfold, which is capable of predicting the structure of large RNA alignments accurately on practical timescales.

Results: We have distributed both the phylogenetic calculations and the inside-outside algorithm in PPfold, resulting in a significant reduction of runtime on multicore machines. We have addressed the floating-point underflow problems of pfold by implementing an extended-exponent datatype, enabling PPfold to be used for large-scale RNA structure predictions. We have also improved the user interface and portability: alongside standalone executable and Java source code of the program, PPfold is also available as a free plugin to the CLC Workbenches. We have evaluated the accuracy of PPfold using BRaliBase I tests, and demonstrated its practical use by predicting the secondary structure of an alignment of 24 complete HIV-1 genomes in 65 minutes on an 8-core machine and identifying several known structural elements in the prediction.

Conclusions: PPfold is the first parallelized comparative RNA structure prediction algorithm to date. Based on the pfold model, PPfold is capable of fast, high-quality predictions of large RNA secondary structures, such as the genomes of RNA viruses or long genomic transcripts. The techniques used in the parallelization of this algorithm may be of general applicability to other bioinformatics algorithms.

Show MeSH

Related in: MedlinePlus

Execution time of SCFG part on multicore machines. The execution time of the SCFG part of the algorithm is reduced proportionally to the number of cores for a sufficiently high number of divisions. We used 40 divisions for the folding of 30 × 3000 nt, on Intel(R) Xeon(R) E5420 CPU, 8 cores, 2.50 GHz, 32 GB RAM, and enabled different number of cores to be used by PPfold by varying the size of the thread pool. Here we are plotting the mean and standard deviation of 4 measurement points, scaled as a fraction of one-core runtime.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC3102635&req=5

Figure 6: Execution time of SCFG part on multicore machines. The execution time of the SCFG part of the algorithm is reduced proportionally to the number of cores for a sufficiently high number of divisions. We used 40 divisions for the folding of 30 × 3000 nt, on Intel(R) Xeon(R) E5420 CPU, 8 cores, 2.50 GHz, 32 GB RAM, and enabled different number of cores to be used by PPfold by varying the size of the thread pool. Here we are plotting the mean and standard deviation of 4 measurement points, scaled as a fraction of one-core runtime.

Mentions: It is important to note that multithreading is not possible for all parts of the algorithm: for example, the job at the top of the triangular matrix has to be executed by one processing unit without any simultaneous calculations. Therefore it is ideal to choose N >>u, where u is the number of available processing units. In the limit N → ∞, the theoretical execution time on u processing units is reduced to of the execution time on one processing unit, and this is also what we observe in practice. (Figure 6) We note that this method of divisions is generally applicable to any bifurcating SCFG, and thus may be used for the parallelization of other algorithms also.


Multithreaded comparative RNA secondary structure prediction using stochastic context-free grammars.

Sükösd Z, Knudsen B, Vaerum M, Kjems J, Andersen ES - BMC Bioinformatics (2011)

Execution time of SCFG part on multicore machines. The execution time of the SCFG part of the algorithm is reduced proportionally to the number of cores for a sufficiently high number of divisions. We used 40 divisions for the folding of 30 × 3000 nt, on Intel(R) Xeon(R) E5420 CPU, 8 cores, 2.50 GHz, 32 GB RAM, and enabled different number of cores to be used by PPfold by varying the size of the thread pool. Here we are plotting the mean and standard deviation of 4 measurement points, scaled as a fraction of one-core runtime.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC3102635&req=5

Figure 6: Execution time of SCFG part on multicore machines. The execution time of the SCFG part of the algorithm is reduced proportionally to the number of cores for a sufficiently high number of divisions. We used 40 divisions for the folding of 30 × 3000 nt, on Intel(R) Xeon(R) E5420 CPU, 8 cores, 2.50 GHz, 32 GB RAM, and enabled different number of cores to be used by PPfold by varying the size of the thread pool. Here we are plotting the mean and standard deviation of 4 measurement points, scaled as a fraction of one-core runtime.
Mentions: It is important to note that multithreading is not possible for all parts of the algorithm: for example, the job at the top of the triangular matrix has to be executed by one processing unit without any simultaneous calculations. Therefore it is ideal to choose N >>u, where u is the number of available processing units. In the limit N → ∞, the theoretical execution time on u processing units is reduced to of the execution time on one processing unit, and this is also what we observe in practice. (Figure 6) We note that this method of divisions is generally applicable to any bifurcating SCFG, and thus may be used for the parallelization of other algorithms also.

Bottom Line: We have also improved the user interface and portability: alongside standalone executable and Java source code of the program, PPfold is also available as a free plugin to the CLC Workbenches.We have evaluated the accuracy of PPfold using BRaliBase I tests, and demonstrated its practical use by predicting the secondary structure of an alignment of 24 complete HIV-1 genomes in 65 minutes on an 8-core machine and identifying several known structural elements in the prediction.Based on the pfold model, PPfold is capable of fast, high-quality predictions of large RNA secondary structures, such as the genomes of RNA viruses or long genomic transcripts.

View Article: PubMed Central - HTML - PubMed

Affiliation: Interdisciplinary Nanoscience Center, Aarhus University, Denmark. zs@mb.au.dk

ABSTRACT

Background: The prediction of the structure of large RNAs remains a particular challenge in bioinformatics, due to the computational complexity and low levels of accuracy of state-of-the-art algorithms. The pfold model couples a stochastic context-free grammar to phylogenetic analysis for a high accuracy in predictions, but the time complexity of the algorithm and underflow errors have prevented its use for long alignments. Here we present PPfold, a multithreaded version of pfold, which is capable of predicting the structure of large RNA alignments accurately on practical timescales.

Results: We have distributed both the phylogenetic calculations and the inside-outside algorithm in PPfold, resulting in a significant reduction of runtime on multicore machines. We have addressed the floating-point underflow problems of pfold by implementing an extended-exponent datatype, enabling PPfold to be used for large-scale RNA structure predictions. We have also improved the user interface and portability: alongside standalone executable and Java source code of the program, PPfold is also available as a free plugin to the CLC Workbenches. We have evaluated the accuracy of PPfold using BRaliBase I tests, and demonstrated its practical use by predicting the secondary structure of an alignment of 24 complete HIV-1 genomes in 65 minutes on an 8-core machine and identifying several known structural elements in the prediction.

Conclusions: PPfold is the first parallelized comparative RNA structure prediction algorithm to date. Based on the pfold model, PPfold is capable of fast, high-quality predictions of large RNA secondary structures, such as the genomes of RNA viruses or long genomic transcripts. The techniques used in the parallelization of this algorithm may be of general applicability to other bioinformatics algorithms.

Show MeSH
Related in: MedlinePlus