A computational pipeline for high-throughput discovery of cis-regulatory noncoding RNA in prokaryotes.

Yao Z, Barrick J, Weinberg Z, Neph S, Breaker R, Tompa M, Ruzzo WL - PLoS Comput. Biol. (2007)

Bottom Line: The pipeline differs from previous methods in that it is structure-oriented, does not require a multiple-sequence alignment as input, and is capable of detecting RNA motifs with low sequence conservation. We also integrate RNA motif prediction with RNA homolog search, which improves the quality of the RNA motifs significantly. Our top-ranking motifs include most known Firmicute elements found in the RNA family database (Rfam).


Affiliation: Department of Computer Science and Engineering, University of Washington, Seattle, Washington, USA. yzizhen@cs.washington.edu

ABSTRACT
Noncoding RNAs (ncRNAs) are important functional RNAs that do not code for proteins. We present a highly efficient computational pipeline for discovering cis-regulatory ncRNA motifs de novo. The pipeline differs from previous methods in that it is structure-oriented, does not require a multiple-sequence alignment as input, and is capable of detecting RNA motifs with low sequence conservation. We also integrate RNA motif prediction with RNA homolog search, which improves the quality of the RNA motifs significantly. Here, we report the results of applying this pipeline to Firmicute bacteria. Our top-ranking motifs include most known Firmicute elements found in the RNA family database (Rfam). Comparing our motif models with Rfam's hand-curated motif models, we achieve high accuracy in both membership prediction and base-pair-level secondary structure prediction (at least 75% average sensitivity and specificity on both tasks). Of the ncRNA candidates not in Rfam, we find compelling evidence that some of them are functional, and analyze several potential ribosomal protein leaders in depth.


pcbi-0030126-g001: Pipeline Flowchart. The boxes with solid lines indicate steps involving intensive computation (approximate running time is specified next to each). Other intermediate steps are specified in the boxes with dashed lines.

Mentions: In outline, our pipeline consists of the following major steps. (See Figure 1, Materials and Methods, and the online supplement at http://bio.cs.washington.edu/supplements/yzizhen/pipeline for more details.) First, we used the National Center for Biotechnology Information's (NCBI's) Conserved Domain Database (CDD) [16] to identify homologous gene sets. For each gene, we collected its 5′ upstream sequence. We call the set of 5′ sequences associated with one CDD group a dataset. cis-Regulatory elements are often conserved within such groups. Second, we applied FootPrinter [17], a DNA phylogenetic footprinting tool, to select datasets that are likely to host ncRNAs. In our experience, functional RNAs such as riboswitches often show low overall sequence conservation, but contain interspersed patches where conservation is high. FootPrinter is very effective at highlighting the latter regions. Third, we used CMfinder to infer RNA motifs in each unaligned sequence dataset. CMfinder is a structure-oriented local alignment tool that is robust to varying sequence conservation and length of extraneous flanking regions. We postprocessed motifs to identify distinct motifs corresponding to different RNA elements by removing poor and redundant motifs and clustering the rest based on overlap. Fourth, we used RaveNnA [18–20] to find additional motif instances by scanning the prokaryotic genome database. Riboswitches, for example, often regulate multiple operons that contribute to a single pathway, but no single CDD domain will be common to all of these operons. Thus, the search step was a powerful adjunct to the motif discovery process. These newly discovered motif members were incorporated into a refined motif model, again using CMfinder, and in some cases the search and motif refinement steps were repeated. Motif postprocessing was also repeated after the search/refinement steps. 
Both CMfinder and RaveNnA rely on the Infernal covariance model software package [21] for RNA motif modeling and search. Finally, we performed gene context analysis and literature searches (manually) for the top-ranking motifs.
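
The motif postprocessing step described above (removing poor motifs and clustering the rest based on overlap) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Motif` record, the `score` field, the `min_score` threshold, and the rule that two motifs belong together when at least one pair of their instances overlaps on the same sequence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# One motif instance: (sequence id, start, end), half-open coordinates.
Instance = Tuple[str, int, int]


@dataclass
class Motif:
    name: str
    score: float               # hypothetical quality score; higher is better
    instances: List[Instance]


def instances_overlap(a: Instance, b: Instance) -> bool:
    """True if two instances lie on the same sequence and share a position."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def motifs_overlap(m1: Motif, m2: Motif, min_shared: int = 1) -> bool:
    """True if at least min_shared instances of m1 overlap an instance of m2."""
    shared = sum(
        any(instances_overlap(a, b) for b in m2.instances) for a in m1.instances
    )
    return shared >= min_shared


def cluster_motifs(motifs: List[Motif], min_score: float = 0.0) -> List[List[str]]:
    """Drop low-scoring motifs, then group the rest by overlap (single linkage)."""
    kept = [m for m in motifs if m.score >= min_score]
    parent = list(range(len(kept)))  # union-find forest over kept motifs

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if motifs_overlap(kept[i], kept[j]):
                parent[find(j)] = find(i)

    clusters: Dict[int, List[str]] = {}
    for i, m in enumerate(kept):
        clusters.setdefault(find(i), []).append(m.name)
    return list(clusters.values())


if __name__ == "__main__":
    # Toy data with made-up names and coordinates.
    motifs = [
        Motif("m1", 5.0, [("seqA", 10, 60)]),
        Motif("m2", 4.0, [("seqA", 50, 90)]),
        Motif("m3", 3.0, [("seqB", 0, 40)]),
        Motif("m4", -1.0, [("seqB", 10, 30)]),  # filtered out as a poor motif
    ]
    print(cluster_motifs(motifs))
```

Single-linkage grouping is used here only because it is the simplest way to turn pairwise overlap into clusters; a real pipeline might instead require a minimum fraction of shared instances or pick one representative motif per cluster.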

