Shape based indexing for faster search of RNA family databases.

Janssen S, Reeder J, Giegerich R - BMC Bioinformatics (2008)

Bottom Line: Filtering approaches that use sequence conservation to reduce the number of CM searches are fast, but it is unknown at what cost. We present a new filtering approach, which exploits the family-specific secondary structure and significantly reduces the number of CM searches. First results also capture previously undetected non-coding RNAs in a recent human RNAz screen.

Affiliation: Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany. stefan.janssen@uni-bielefeld.de

ABSTRACT

Background: Most non-coding RNA families exert their function by means of a conserved, common secondary structure. The Rfam database contains more than five hundred structurally annotated RNA families. Unfortunately, searching for new family members using covariance models (CMs) is very time-consuming. Filtering approaches that use sequence conservation to reduce the number of CM searches are fast, but it is unknown at what cost.

Results: We present a new filtering approach, which exploits the family-specific secondary structure and significantly reduces the number of CM searches. The filter eliminates approximately 85% of the queries and discards only 2.6% of the true positives when evaluating Rfam against itself. First results also capture previously undetected non-coding RNAs in a recent human RNAz screen.

Conclusion: The RNA shape index filter (RNAsifter) is based on the following rationale: An RNA family is characterised by structure much more succinctly than by sequence content. Structures of individual family members, which naturally differ in length and sequence composition, may exhibit structural variation in detail, but overall they have a common shape in a more abstract sense. Given a fixed release of the Rfam database, we can compute these abstract shapes for all families. The result is called a shape index. If a query sequence belongs to a certain family, it must be able to fold into the family shape with reasonable free energy. Therefore, rather than matching the query against all families in the database, we can first (and quickly) compute its feasible shape(s), and use the shape index to access only those families where a good match is possible due to a common shape with the query.
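
The lookup described in the conclusion can be illustrated with a minimal Python sketch. It is not the RNAsifter implementation: the family names, the example shape strings, and the helper names `build_shape_index` and `candidate_families` are hypothetical, and the shapes are assumed to have been computed beforehand by a shape-prediction tool.

```python
from collections import defaultdict


def build_shape_index(family_shapes):
    """Map each abstract shape string to the set of families that exhibit it."""
    index = defaultdict(set)
    for family, shapes in family_shapes.items():
        for shape in shapes:
            index[shape].add(family)
    return index


def candidate_families(index, query_shapes):
    """Return every family that shares at least one shape with the query."""
    candidates = set()
    for shape in query_shapes:
        # .get() avoids creating empty entries for unseen shapes
        candidates |= index.get(shape, set())
    return candidates


# Hypothetical precomputed shapes per family (illustrative values only)
family_shapes = {
    "FAM_A": {"[]"},
    "FAM_B": {"[]", "[][]"},
    "FAM_C": {"[[][]]"},
}
index = build_shape_index(family_shapes)

# Only the families returned here would be passed on to the expensive CM search
hairpin_candidates = candidate_families(index, {"[]"})
```

In this sketch, a query whose feasible shapes match no indexed shape yields an empty candidate set, so no CM search would be run for it.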

Figure 4: Distribution of shrep energies, normalized to sequence length, for three selected Rfam families. The MFE structure for all sequences from the three chosen families has the single hairpin shape. While their abstract shape is the same, they differ in folding energy. To eliminate the influence of sequence length on the folding energy, the energy is divided by sequence length. The x-axis represents this normalized energy value and the y-axis shows the number of sequences that fold with this normalized energy value. Several distinct peaks can be seen, showing that the shrep energy does not depend only on sequence length. Therefore, it provides an additional attribute of a family that can be used in filtration.

Mentions: Members of sequence families often share a typical range of folding energies. A query that folds into a common shape with some family members, but with substantially different energy, is unlikely to be a family member. Together with the shapes, RNAshapes also delivers the energies of the corresponding shreps. Hence, the shapes in the index can be recorded together with their shrep energies, and the matches in M(x) can be restricted to those with a similar energy. Figure 4 shows a very clear example. While all three families share the same shape (a simple hairpin), their energies fall into quite distinct ranges, independent of sequence length and GC-content. To this end, we reduce the match set M(x) to those families f that share a shape between their fss(x) and the qss(x), where the shrep free energies of that shared shape differ by less than ε percent.
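
The energy restriction on M(x) might be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the excerpt does not say what the ε percent is measured against, so the sketch assumes a relative difference with respect to the family's shrep energy, and it assumes energies already normalized by sequence length as in Figure 4. All names and values are hypothetical.

```python
def within_tolerance(query_energy, family_energy, eps_percent):
    """True if the energies differ by less than eps_percent of the family energy.

    Assumption: the tolerance is relative to the family's shrep energy.
    """
    if family_energy == 0:
        return query_energy == 0
    diff_percent = abs(query_energy - family_energy) / abs(family_energy) * 100.0
    return diff_percent < eps_percent


def energy_filtered_matches(energy_index, query_shreps, eps_percent):
    """Restrict shape matches to families whose shrep energy is close to the query's.

    energy_index: shape -> list of (family, shrep_energy) pairs
    query_shreps: shape -> shrep energy of the query for that shape
    """
    matches = set()
    for shape, q_energy in query_shreps.items():
        for family, f_energy in energy_index.get(shape, []):
            if within_tolerance(q_energy, f_energy, eps_percent):
                matches.add(family)
    return matches


# Hypothetical length-normalized shrep energies for two hairpin families
energy_index = {"[]": [("FAM_A", -0.40), ("FAM_B", -0.20)]}
query_shreps = {"[]": -0.38}

strict_matches = energy_filtered_matches(energy_index, query_shreps, eps_percent=10)
loose_matches = energy_filtered_matches(energy_index, query_shreps, eps_percent=100)
```

With the tight tolerance only the family whose energy lies near the query's survives, although both families share the hairpin shape; widening ε reduces the filter to the plain shape match.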

