Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering.

Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R - PLoS Comput. Biol. (2007)

Bottom Line: We have successfully tested the LocARNA-based clustering approach on the sequences of the RFAM-seed alignments. Furthermore, we have applied it to a previously published set of 3,332 predicted structured elements in the Ciona intestinalis genome (Missal K, Rose D, Stadler PF (2005) Noncoding RNAs in Ciona intestinalis. Bioinformatics 21 (Supplement 2): i77-i78).


Affiliation: Bioinformatics Group, Institute of Computer Science, University of Freiburg, Freiburg, Germany.

ABSTRACT
The RFAM database defines families of ncRNAs by means of sequence similarities that are sufficient to establish homology. In some cases, such as microRNAs and box H/ACA snoRNAs, functional commonalities define classes of RNAs that are characterized by structural similarities, and typically consist of multiple RNA families. Recent advances in high-throughput transcriptomics and comparative genomics have produced very large sets of putative noncoding RNAs and regulatory RNA signals. For many of them, evidence for stabilizing selection acting on their secondary structures has been derived, and at least approximate models of their structures have been computed. The overwhelming majority of these hypothetical RNAs cannot be assigned to established families or classes. We present here a structure-based clustering approach that is capable of extracting putative RNA classes from genome-wide surveys for structured RNAs. The LocARNA (local alignment of RNA) tool implements a novel variant of the Sankoff algorithm that is sufficiently fast to deal with several thousand candidate sequences. The method is also robust against false positive predictions, i.e., a contamination of the input data with unstructured or nonconserved sequences. We have successfully tested the LocARNA-based clustering approach on the sequences of the RFAM-seed alignments. Furthermore, we have applied it to a previously published set of 3,332 predicted structured elements in the Ciona intestinalis genome (Missal K, Rose D, Stadler PF (2005) Noncoding RNAs in Ciona intestinalis. Bioinformatics 21 (Supplement 2): i77-i78). In addition to recovering, e.g., tRNAs as a structure-based class, the method identifies several RNA families, including microRNA and snoRNA candidates, and suggests several novel classes of ncRNAs for which to date no representative has been experimentally characterized.


ROC Curve of the Global Comparison of Clustering and RFAM Families. At a false positive rate of 12%, we achieve a sensitivity of 52% (correctly grouping together sequences of the same family), which is more than sufficient to detect families.

Mentions: Concerning the global view, the complete RFAM defines a partition of the set of all sequences into families (or clusters), and we can measure the degree of agreement between the partition defined by our clustering and the partition defined by RFAM. Since we have a hierarchical clustering, different sets of clusters can be defined by cutting the tree at different thresholds ϑ, and we have to compare all these thresholds to find the set of clusters with the best agreement. The problem of comparing the partition defined by a given set of clusters (generated by cutting the tree at some specific level) with the partition defined by RFAM is transformed into a classification problem as follows. We consider all possible pairs of sequences, and define the number of true positives (ss) as the number of sequence pairs from the same family that lie in the same cluster. Analogously, the numbers of false positives, false negatives, and true negatives are given by the number of pairs from different families but the same cluster (ds), the same family but different clusters (sd), and different families and different clusters (dd), respectively. Sensitivity and specificity are then defined as usual, namely spec = dd/(dd + ds) and sens = ss/(ss + sd). The receiver operating characteristic (ROC), obtained by plotting the sensitivity against the false positive rate (1 - specificity) for different values of the cutoff ϑ, is shown in Figure 1.
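
The pair-counting scheme described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `pair_counts` and `sens_spec`, the dictionary representation of the two partitions, and the toy labels are all assumptions made for illustration. It counts ss, ds, sd, and dd over all sequence pairs for one fixed cut of the tree and derives sensitivity and specificity from them.

```python
from itertools import combinations


def pair_counts(family, cluster):
    """Classify every unordered pair of sequences.

    family, cluster: dicts mapping a sequence id to its RFAM family label
    and to its cluster label (hypothetical representation).
    Returns (ss, ds, sd, dd) with the meaning used in the text:
      ss = same family, same cluster        (true positives)
      ds = different family, same cluster   (false positives)
      sd = same family, different cluster   (false negatives)
      dd = different family and cluster     (true negatives)
    """
    ss = ds = sd = dd = 0
    for a, b in combinations(sorted(family), 2):
        same_family = family[a] == family[b]
        same_cluster = cluster[a] == cluster[b]
        if same_family and same_cluster:
            ss += 1
        elif same_cluster:
            ds += 1
        elif same_family:
            sd += 1
        else:
            dd += 1
    return ss, ds, sd, dd


def sens_spec(ss, ds, sd, dd):
    """sens = ss/(ss + sd), spec = dd/(dd + ds); 0.0 if a denominator is 0."""
    sens = ss / (ss + sd) if ss + sd else 0.0
    spec = dd / (dd + ds) if dd + ds else 0.0
    return sens, spec


# Toy data (invented for illustration): five sequences, two families,
# and one candidate clustering obtained at a single threshold.
family = {"s1": "tRNA", "s2": "tRNA", "s3": "tRNA", "s4": "mir", "s5": "mir"}
cluster = {"s1": 0, "s2": 0, "s3": 1, "s4": 1, "s5": 1}

counts = pair_counts(family, cluster)
sens, spec = sens_spec(*counts)
print(counts, sens, 1 - spec)  # pair counts, sensitivity, false positive rate
```

Repeating this computation for the clustering obtained at each threshold ϑ yields one (false positive rate, sensitivity) point per threshold, and those points together trace the ROC curve. Enumerating all pairs is quadratic in the number of sequences, so for a data set as large as the full RFAM a count based on a family-by-cluster contingency table would likely be preferable.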

