Fast and accurate clustering of noncoding RNAs using ensembles of sequence alignments and secondary structures.

Saito Y, Sato K, Sakakibara Y - BMC Bioinformatics (2011)

Bottom Line: Such heuristics degrade the quality of clustering results, especially when the similarity among family members is not detectable at the primary sequence level. The idea is that the reliability of approximate algorithms can be improved by utilizing the information of suboptimal solutions in their dynamic programming frameworks. We demonstrate that this strategy can achieve the best balance between the computational cost and the quality of the clustering.


Affiliation: Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan. saito@dna.bio.keio.ac.jp

ABSTRACT

Background: Clustering of unannotated transcripts is an important step in identifying novel families of noncoding RNAs (ncRNAs). Several hierarchical clustering methods have been developed using similarity measures based on structural alignment scores. However, the high computational cost of exact structural alignment forces these methods to employ approximate algorithms. Such heuristics degrade the quality of clustering results, especially when the similarity among family members is not detectable at the primary sequence level.

Results: We describe a new similarity measure for the hierarchical clustering of ncRNAs. The idea is that the reliability of approximate algorithms can be improved by utilizing the information of suboptimal solutions in their dynamic programming frameworks. We approximate structural alignment in a simpler manner than the existing methods do. In exchange, our method uses all possible sequence alignments and all possible secondary structures, whereas the existing methods use only one optimal sequence alignment and one optimal secondary structure. We demonstrate that this strategy can achieve the best balance between the computational cost and the quality of the clustering. In particular, our method maintains its high performance even when the sequence identity of family members is less than 60%.
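The contrast between "one optimal alignment" and "all possible alignments" can be illustrated with a small dynamic program. The sketch below is not the authors' similarity measure: it assumes a toy scoring scheme (match, mismatch, linear gap) and a temperature parameter, and the function names are hypothetical. It fills the same table twice, once with a maximum (the single best alignment) and once with a log-sum over all paths (the whole ensemble, suboptimal alignments included).

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of finite values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def alignment_scores(a, b, match=1.0, mismatch=-1.0, gap=-1.0, temperature=1.0):
    """Return (optimal_score, ensemble_score) for global alignment of a and b.

    optimal_score  : score of the single best alignment (max over paths).
    ensemble_score : temperature * log of the Boltzmann-weighted sum over
                     all alignment paths (a partition function).
    The scoring parameters are illustrative assumptions, not values from the paper.
    """
    n, m = len(a), len(b)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    logz = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue  # empty prefix pair: score 0, log Z = 0
            cand_best, cand_logz = [], []
            if i > 0 and j > 0:  # align a[i-1] with b[j-1]
                s = match if a[i - 1] == b[j - 1] else mismatch
                cand_best.append(best[i - 1][j - 1] + s)
                cand_logz.append(logz[i - 1][j - 1] + s / temperature)
            if i > 0:  # a[i-1] against a gap
                cand_best.append(best[i - 1][j] + gap)
                cand_logz.append(logz[i - 1][j] + gap / temperature)
            if j > 0:  # b[j-1] against a gap
                cand_best.append(best[i][j - 1] + gap)
                cand_logz.append(logz[i][j - 1] + gap / temperature)
            best[i][j] = max(cand_best)
            logz[i][j] = logsumexp(cand_logz)
    return best[n][m], temperature * logz[n][m]


if __name__ == "__main__":
    opt, ens = alignment_scores("GGGAAACCC", "GGCAAAGCC")
    print("optimal:", opt, "ensemble:", ens)
```

The ensemble score is never smaller than the optimal score, because the sum contains the best path plus every suboptimal one. When many near-optimal alignments exist, the gap between the two values widens, which is the kind of information a single optimal alignment discards.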

Conclusions: Our method enables fast and accurate clustering of ncRNAs. The software is available for download at http://bpla-kernel.dna.bio.keio.ac.jp/clustering/.



Figure 4: Quality of the clustering for the “plus unrelated sequences” dataset. For each sequence identity range, the overall quality of the cluster tree is evaluated by the AUC.

Mentions: Next, we evaluated the quality of the clustering for the “plus flanking regions” dataset (Figure 3) and the “plus unrelated sequences” dataset (Figure 4). In both cases, we observed the same tendency as in the results for the “normal” dataset (Figure 1). Our method kept high accuracy across the whole range of sequence identity, and achieved the best AUC in the sequence identity range below 60%. These results further support the effectiveness of our method in practical situations that involve flanking regions and unrelated sequences.
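For readers unfamiliar with the AUC, the following is a minimal sketch of one common way to compute it. It assumes that each pair of sequences has a similarity score and a label saying whether the two sequences belong to the same family. This excerpt does not describe how the paper derives the AUC from the cluster tree, so the pairwise formulation and the function name `auc` are assumptions made for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores : similarity score for each sequence pair.
    labels : True if the pair comes from the same family, else False.
    Returns the fraction of (positive, negative) pair combinations in which
    the positive outscores the negative, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical scores: same-family pairs should score higher.
    print(auc([0.9, 0.8, 0.3, 0.1], [True, True, False, False]))
```

An AUC of 1.0 means every same-family pair scores above every cross-family pair, and 0.5 corresponds to random ordering. To report a value per sequence identity range, as in Figure 4, the pairs would first be grouped by their sequence identity and the function applied to each group separately.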

