Limits...
Lost in folding space? Comparing four variants of the thermodynamic model for RNA secondary structure prediction.

Janssen S, Schudoma C, Steger G, Giegerich R - BMC Bioinformatics (2011)

Bottom Line: Although all programs refer to the same physical model, they implement it with considerable variation for different tasks, and little is known about the effects of heuristic assumptions and model simplifications used by the programs on the outcome of the analysis.Based on our study, future work on thermodynamic RNA folding may make a choice of model based on our empirical data.It can take our implementations as a starting point for further program development.

View Article: PubMed Central - HTML - PubMed

Affiliation: Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany.

ABSTRACT

Background: Many bioinformatics tools for RNA secondary structure analysis are based on a thermodynamic model of RNA folding. They predict a single, "optimal" structure by free energy minimization, they enumerate near-optimal structures, they compute base pair probabilities and dot plots, representative structures of different abstract shapes, or Boltzmann probabilities of structures and shapes. Although all programs refer to the same physical model, they implement it with considerable variation for different tasks, and little is known about the effects of heuristic assumptions and model simplifications used by the programs on the outcome of the analysis.

Results: We extract four different models of the thermodynamic folding space which underlie the programs RNAFOLD, RNASHAPES, and RNASUBOPT. Their differences lie within the details of the energy model and the granularity of the folding space. We implement probabilistic shape analysis for all models, and introduce the shape probability shift as a robust measure of model similarity. Using four data sets derived from experimentally solved structures, we provide a quantitative evaluation of the model differences.

Conclusions: We find that search space granularity affects the computed shape probabilities less than the over- or underapproximation of free energy by a simplified energy model. Still, the approximations perform similar enough to implementations of the full model to justify their continued use in settings where computational constraints call for simpler algorithms. On the side, we observe that the rarely used level 2 shapes, which predict the complete arrangement of helices, multiloops, internal loops and bulges, include the "true" shape in a rather small number of predicted high probability shapes. This calls for an investigation of new strategies to extract high probability members from the (very large) level 2 shape space of an RNA sequence. We provide implementations of all four models, written in a declarative style that makes them easy to be modified. Based on our study, future work on thermodynamic RNA folding may make a choice of model based on our empirical data. It can take our implementations as a starting point for further program development.

Show MeSH

Related in: MedlinePlus

Comparison of different MFE prediction programs. Dataset: we use the 147 sequences from the DARTS set, except pdb1ajt1B, pdb1kod1A, pdb1koc1A, pdb1lpw1B and pdb1t4x1B, which crashed under UNAFOLD. Together, all according "PDB" structures contain 1,614 base pairs. All "gold" structures have 1,593 base pairs. Distance: One base pair set, i.e. secondary structure, is the reference (R: table columns), the other one is the prediction (P: table rows). Traditional base pair distance is defined as/R \P/ + /P \R/. Following [34], we decide to allow additional base pairs in the prediction, as long as they are compatible with the reference, i.e. both bases are unpaired and the additional base pair does not introduce a pseudoknot in the reference. The set of compatible base pairs is P-c = P\{(a, b)/(a, b) ∉ R Λ (a, b) compatible to R}. Then, our asymmetric base pair distance is: /R \P/ + /P-c \R/. Table values are the sums of base pair distances for all 142 sequences. In the case of co-optimal results, the one with the smallest distance to the reference is chosen. Our distance function is rather strict and does not allow base pair slippage. If a gold base pair (i, j) is mispredicted as (i + 1, j), this contributes a distance of 2. Programs: for each RNA sequence we called the programs with the following command line options: RNAFOLD (version 1.8.5): echo sequence / RNAfold -noPS -noLP -dX, where X is 0, 1 or 2. UNAFOLD (version 3.8): hybrid-ss-min --suffix = DAT --mfold --NA=RNA --tmin = 37 --tinc = 1 --tmax = 37 --sodium = 1 --magnesium = 0 --noisolate --nodangle tmpseqfile >/dev/ && ct2b.pl tmpseqfile.ct, with and without the --nodangle switch, where "tmpseqfile" is a fasta file containing the sequence and "ct2b.pl" is a small Perl script from the Vienna Package, which converts RNA structures from "connect" to "dot-bracket" format. CENTROIDFOLD (version v0.0.9): centroid_fold --engine=X tmpseqfile, where X is the source of base pair probabilities and is either computed by RNAFOLD (McCaskill) or by CONTRAFOLD. Our ADP implementation of the four grammars "NoDangle", "OverDangle", "MicroState" and "MacroState" get the sequence as their sole input. The binaries can be built with the source code from the additional file 3 and the Bellman's GAP compiler.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC3293930&req=5

Figure 6: Comparison of different MFE prediction programs. Dataset: we use the 147 sequences from the DARTS set, except pdb1ajt1B, pdb1kod1A, pdb1koc1A, pdb1lpw1B and pdb1t4x1B, which crashed under UNAFOLD. Together, all according "PDB" structures contain 1,614 base pairs. All "gold" structures have 1,593 base pairs. Distance: One base pair set, i.e. secondary structure, is the reference (R: table columns), the other one is the prediction (P: table rows). Traditional base pair distance is defined as/R \P/ + /P \R/. Following [34], we decide to allow additional base pairs in the prediction, as long as they are compatible with the reference, i.e. both bases are unpaired and the additional base pair does not introduce a pseudoknot in the reference. The set of compatible base pairs is P-c = P\{(a, b)/(a, b) ∉ R Λ (a, b) compatible to R}. Then, our asymmetric base pair distance is: /R \P/ + /P-c \R/. Table values are the sums of base pair distances for all 142 sequences. In the case of co-optimal results, the one with the smallest distance to the reference is chosen. Our distance function is rather strict and does not allow base pair slippage. If a gold base pair (i, j) is mispredicted as (i + 1, j), this contributes a distance of 2. Programs: for each RNA sequence we called the programs with the following command line options: RNAFOLD (version 1.8.5): echo sequence / RNAfold -noPS -noLP -dX, where X is 0, 1 or 2. UNAFOLD (version 3.8): hybrid-ss-min --suffix = DAT --mfold --NA=RNA --tmin = 37 --tinc = 1 --tmax = 37 --sodium = 1 --magnesium = 0 --noisolate --nodangle tmpseqfile >/dev/ && ct2b.pl tmpseqfile.ct, with and without the --nodangle switch, where "tmpseqfile" is a fasta file containing the sequence and "ct2b.pl" is a small Perl script from the Vienna Package, which converts RNA structures from "connect" to "dot-bracket" format. CENTROIDFOLD (version v0.0.9): centroid_fold --engine=X tmpseqfile, where X is the source of base pair probabilities and is either computed by RNAFOLD (McCaskill) or by CONTRAFOLD. Our ADP implementation of the four grammars "NoDangle", "OverDangle", "MicroState" and "MacroState" get the sequence as their sole input. The binaries can be built with the source code from the additional file 3 and the Bellman's GAP compiler.

Mentions: The data set in this evaluation is DARTS. Evaluation results are summarized in Figure 6. We use an asymmetric base pair distance for comparison, as explained with Figure 6, where one structure (row entry) is treated as the prediction, the other as the reference (column entry).


Lost in folding space? Comparing four variants of the thermodynamic model for RNA secondary structure prediction.

Janssen S, Schudoma C, Steger G, Giegerich R - BMC Bioinformatics (2011)

Comparison of different MFE prediction programs. Dataset: we use the 147 sequences from the DARTS set, except pdb1ajt1B, pdb1kod1A, pdb1koc1A, pdb1lpw1B and pdb1t4x1B, which crashed under UNAFOLD. Together, all according "PDB" structures contain 1,614 base pairs. All "gold" structures have 1,593 base pairs. Distance: One base pair set, i.e. secondary structure, is the reference (R: table columns), the other one is the prediction (P: table rows). Traditional base pair distance is defined as/R \P/ + /P \R/. Following [34], we decide to allow additional base pairs in the prediction, as long as they are compatible with the reference, i.e. both bases are unpaired and the additional base pair does not introduce a pseudoknot in the reference. The set of compatible base pairs is P-c = P\{(a, b)/(a, b) ∉ R Λ (a, b) compatible to R}. Then, our asymmetric base pair distance is: /R \P/ + /P-c \R/. Table values are the sums of base pair distances for all 142 sequences. In the case of co-optimal results, the one with the smallest distance to the reference is chosen. Our distance function is rather strict and does not allow base pair slippage. If a gold base pair (i, j) is mispredicted as (i + 1, j), this contributes a distance of 2. Programs: for each RNA sequence we called the programs with the following command line options: RNAFOLD (version 1.8.5): echo sequence / RNAfold -noPS -noLP -dX, where X is 0, 1 or 2. UNAFOLD (version 3.8): hybrid-ss-min --suffix = DAT --mfold --NA=RNA --tmin = 37 --tinc = 1 --tmax = 37 --sodium = 1 --magnesium = 0 --noisolate --nodangle tmpseqfile >/dev/ && ct2b.pl tmpseqfile.ct, with and without the --nodangle switch, where "tmpseqfile" is a fasta file containing the sequence and "ct2b.pl" is a small Perl script from the Vienna Package, which converts RNA structures from "connect" to "dot-bracket" format. CENTROIDFOLD (version v0.0.9): centroid_fold --engine=X tmpseqfile, where X is the source of base pair probabilities and is either computed by RNAFOLD (McCaskill) or by CONTRAFOLD. Our ADP implementation of the four grammars "NoDangle", "OverDangle", "MicroState" and "MacroState" get the sequence as their sole input. The binaries can be built with the source code from the additional file 3 and the Bellman's GAP compiler.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC3293930&req=5

Figure 6: Comparison of different MFE prediction programs. Dataset: we use the 147 sequences from the DARTS set, except pdb1ajt1B, pdb1kod1A, pdb1koc1A, pdb1lpw1B and pdb1t4x1B, which crashed under UNAFOLD. Together, all according "PDB" structures contain 1,614 base pairs. All "gold" structures have 1,593 base pairs. Distance: One base pair set, i.e. secondary structure, is the reference (R: table columns), the other one is the prediction (P: table rows). Traditional base pair distance is defined as/R \P/ + /P \R/. Following [34], we decide to allow additional base pairs in the prediction, as long as they are compatible with the reference, i.e. both bases are unpaired and the additional base pair does not introduce a pseudoknot in the reference. The set of compatible base pairs is P-c = P\{(a, b)/(a, b) ∉ R Λ (a, b) compatible to R}. Then, our asymmetric base pair distance is: /R \P/ + /P-c \R/. Table values are the sums of base pair distances for all 142 sequences. In the case of co-optimal results, the one with the smallest distance to the reference is chosen. Our distance function is rather strict and does not allow base pair slippage. If a gold base pair (i, j) is mispredicted as (i + 1, j), this contributes a distance of 2. Programs: for each RNA sequence we called the programs with the following command line options: RNAFOLD (version 1.8.5): echo sequence / RNAfold -noPS -noLP -dX, where X is 0, 1 or 2. UNAFOLD (version 3.8): hybrid-ss-min --suffix = DAT --mfold --NA=RNA --tmin = 37 --tinc = 1 --tmax = 37 --sodium = 1 --magnesium = 0 --noisolate --nodangle tmpseqfile >/dev/ && ct2b.pl tmpseqfile.ct, with and without the --nodangle switch, where "tmpseqfile" is a fasta file containing the sequence and "ct2b.pl" is a small Perl script from the Vienna Package, which converts RNA structures from "connect" to "dot-bracket" format. CENTROIDFOLD (version v0.0.9): centroid_fold --engine=X tmpseqfile, where X is the source of base pair probabilities and is either computed by RNAFOLD (McCaskill) or by CONTRAFOLD. Our ADP implementation of the four grammars "NoDangle", "OverDangle", "MicroState" and "MacroState" get the sequence as their sole input. The binaries can be built with the source code from the additional file 3 and the Bellman's GAP compiler.
Mentions: The data set in this evaluation is DARTS. Evaluation results are summarized in Figure 6. We use an asymmetric base pair distance for comparison, as explained with Figure 6, where one structure (row entry) is treated as the prediction, the other as the reference (column entry).

Bottom Line: Although all programs refer to the same physical model, they implement it with considerable variation for different tasks, and little is known about the effects of heuristic assumptions and model simplifications used by the programs on the outcome of the analysis.Based on our study, future work on thermodynamic RNA folding may make a choice of model based on our empirical data.It can take our implementations as a starting point for further program development.

View Article: PubMed Central - HTML - PubMed

Affiliation: Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany.

ABSTRACT

Background: Many bioinformatics tools for RNA secondary structure analysis are based on a thermodynamic model of RNA folding. They predict a single, "optimal" structure by free energy minimization, they enumerate near-optimal structures, they compute base pair probabilities and dot plots, representative structures of different abstract shapes, or Boltzmann probabilities of structures and shapes. Although all programs refer to the same physical model, they implement it with considerable variation for different tasks, and little is known about the effects of heuristic assumptions and model simplifications used by the programs on the outcome of the analysis.

Results: We extract four different models of the thermodynamic folding space which underlie the programs RNAFOLD, RNASHAPES, and RNASUBOPT. Their differences lie within the details of the energy model and the granularity of the folding space. We implement probabilistic shape analysis for all models, and introduce the shape probability shift as a robust measure of model similarity. Using four data sets derived from experimentally solved structures, we provide a quantitative evaluation of the model differences.

Conclusions: We find that search space granularity affects the computed shape probabilities less than the over- or underapproximation of free energy by a simplified energy model. Still, the approximations perform similar enough to implementations of the full model to justify their continued use in settings where computational constraints call for simpler algorithms. On the side, we observe that the rarely used level 2 shapes, which predict the complete arrangement of helices, multiloops, internal loops and bulges, include the "true" shape in a rather small number of predicted high probability shapes. This calls for an investigation of new strategies to extract high probability members from the (very large) level 2 shape space of an RNA sequence. We provide implementations of all four models, written in a declarative style that makes them easy to be modified. Based on our study, future work on thermodynamic RNA folding may make a choice of model based on our empirical data. It can take our implementations as a starting point for further program development.

Show MeSH
Related in: MedlinePlus