RSEARCH: finding homologs of single structured RNA sequences.

Klein RJ, Eddy SR - BMC Bioinformatics (2003)

Bottom Line: RSEARCH reports the statistical confidence for each hit as well as the structural alignment of the hit. The primary drawback of the program is that it is slow. The C code for RSEARCH is freely available from our lab's website.


Affiliation: Howard Hughes Medical Institute & Department of Genetics, Washington University School of Medicine, Saint Louis, Missouri 63110, USA. rjklein@linkage.rockefeller.edu

ABSTRACT

Background: For many RNA molecules, secondary structure rather than primary sequence is the evolutionarily conserved feature. No programs have yet been published that allow searching a sequence database for homologs of a single RNA molecule on the basis of secondary structure.

Results: We have developed a program, RSEARCH, that takes a single RNA sequence with its secondary structure and utilizes a local alignment algorithm to search a database for homologous RNAs. For this purpose, we have developed a series of base pair and single nucleotide substitution matrices for RNA sequences called RIBOSUM matrices. RSEARCH reports the statistical confidence for each hit as well as the structural alignment of the hit. We show several examples in which RSEARCH outperforms the primary sequence search programs BLAST and SSEARCH. The primary drawback of the program is that it is slow. The C code for RSEARCH is freely available from our lab's website.

Conclusion: RSEARCH outperforms primary sequence programs in finding homologs of structured RNA sequences.



Figure 1: An example SCFG architecture. The sequence at the top folds into the specified secondary structure. At the bottom, the nodal architecture of the model that would produce this sequence is shown. Shaded triangles represent base pair emitting nodes, and point to the base pair they emit. Open triangles represent single nucleotide emitting nodes, and point to the nucleotide they emit.

Mentions: For our purposes, there are two separable steps in the creation of a covariance model. First, the model architecture needs to be determined from the given secondary structure. It is easiest to think of the model as being composed of modular nodes, where each node contains a characteristic arrangement of states. The node architecture of the model follows from the secondary structure of the query. A sample RNA secondary structure and its corresponding "guide tree" of nodes are given in Figure 1. There are eight types of nodes. A ROOT node marks the start of the model. A BIF node marks a bifurcation in the tree, and is always followed by a BEGL node on the left branch and a BEGR node on the right branch. All branches end with an END node. The remaining nodes are match nodes; these nodes represent either a base pair (MATP) or a single nucleotide (MATL and MATR) in the secondary structure. (For a profile SCFG built from a multiple sequence alignment, only those positions thought to be conserved by some measure correspond to match nodes; for our single-sequence covariance models, all positions correspond to match nodes.) For consistency, MATL is always preferred over MATR when possible. Each secondary structure yields one and only one model architecture, and a model architecture implies a unique secondary structure. As each node is associated with a static arrangement of states, this architecture also gives the final arrangement of states in the model. A more detailed exposition of this algorithm for profile SCFGs is given in reference [18]. The C code used to construct a covariance model from a given secondary structure can be found in the modelmaker.c file of the Infernal package [42].
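The mapping from a secondary structure to a guide tree of nodes can be illustrated with a short sketch. The Python below is not the modelmaker.c code from Infernal; it is a simplified illustration that assumes the structure is supplied in dot-bracket notation (an input convention chosen here, not stated in the text), contains no pseudoknots, and that a bifurcation is split at the partner of the leftmost paired position. The function names `pair_table` and `build_nodes` are hypothetical.

```python
def pair_table(structure):
    """Map each paired position to its partner (unpaired positions are absent)."""
    stack, partner = [], {}
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            left = stack.pop()
            partner[left], partner[pos] = pos, left
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def build_nodes(structure):
    """Return the guide tree as a pre-order list of (type, left_pos, right_pos)."""
    partner = pair_table(structure)
    nodes = [("ROOT", None, None)]  # every model starts with a ROOT node

    def walk(i, j):
        # Consume the subsequence i..j from the outside in.
        while True:
            if i > j:
                nodes.append(("END", None, None))  # every branch ends with END
                return
            if i not in partner:
                # Unpaired on the left: MATL is tried first, so it is
                # preferred over MATR whenever both would be possible.
                nodes.append(("MATL", i, None))
                i += 1
            elif j not in partner:
                nodes.append(("MATR", None, j))
                j -= 1
            elif partner[i] == j:
                nodes.append(("MATP", i, j))  # i and j pair with each other
                i, j = i + 1, j - 1
            else:
                # Both ends are paired, but not to each other: bifurcate.
                k = partner[i]  # assumed split point (simplification)
                nodes.append(("BIF", None, None))
                nodes.append(("BEGL", None, None))
                walk(i, k)
                nodes.append(("BEGR", None, None))
                walk(k + 1, j)
                return

    walk(0, len(structure) - 1)
    return nodes


if __name__ == "__main__":
    for node in build_nodes("((..))(.)"):
        print(node)
```

Because the rules are applied in a fixed order at each step, a given structure always produces the same node list, which mirrors the statement that each secondary structure yields exactly one model architecture. The states inside each node, and the scoring parameters, are outside the scope of this sketch.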

