A computational pipeline for high-throughput discovery of cis-regulatory noncoding RNA in prokaryotes.

Yao Z, Barrick J, Weinberg Z, Neph S, Breaker R, Tompa M, Ruzzo WL - PLoS Comput. Biol. (2007)

Bottom Line: The pipeline differs from previous methods in that it is structure-oriented, does not require a multiple-sequence alignment as input, and is capable of detecting RNA motifs with low sequence conservation. We also integrate RNA motif prediction with RNA homolog search, which improves the quality of the RNA motifs significantly. Our top-ranking motifs include most known Firmicute elements found in the RNA family database (Rfam).

Affiliation: Department of Computer Science and Engineering, University of Washington, Seattle, Washington, USA. yzizhen@cs.washington.edu

ABSTRACT
Noncoding RNAs (ncRNAs) are important functional RNAs that do not code for proteins. We present a highly efficient computational pipeline for discovering cis-regulatory ncRNA motifs de novo. The pipeline differs from previous methods in that it is structure-oriented, does not require a multiple-sequence alignment as input, and is capable of detecting RNA motifs with low sequence conservation. We also integrate RNA motif prediction with RNA homolog search, which improves the quality of the RNA motifs significantly. Here, we report the results of applying this pipeline to Firmicute bacteria. Our top-ranking motifs include most known Firmicute elements found in the RNA family database (Rfam). Comparing our motif models with Rfam's hand-curated motif models, we achieve high accuracy in both membership prediction and base-pair-level secondary structure prediction (at least 75% average sensitivity and specificity on both tasks). Of the ncRNA candidates not in Rfam, we find compelling evidence that some of them are functional, and analyze several potential ribosomal protein leaders in depth.

Figure 2 (pcbi-0030126-g002). The Empirical p-Value Distribution Based on the Permutation Test. The black curve shows the complementary cumulative distribution function for the composite scores on randomized datasets (i.e., for each score, the fraction of permuted alignments exceeding that score). The red pluses show the p-values for the composite scores of the motifs in the original (unpermuted) datasets. All p-values are greater than or equal to 2 × 10^-4 as there are only 5,000 samples in the background distribution.
Mentions: To evaluate how many of our top candidates could have arisen by chance, we performed a randomized control experiment. We first computed CLUSTALW alignments of the 100 sequence datasets having the highest motif scores (before the RaveNnA scan). We then randomly permuted each alignment 50 times, maintaining the approximate gap pattern (see the online supplement at http://bio.cs.washington.edu/supplements/yzizhen/pipeline). After degapping each permuted alignment (treating it as a set of unaligned sequences), we applied CMfinder, retaining the top-ranking motif from each randomized dataset. We used this collection of 5,000 motifs (100 datasets × 50 permutations) to estimate the background score distribution, and to infer p-values for predicted motifs in the original datasets. Results are shown in Figure 2. By this measure, all 100 top-scoring motifs have p-values less than 0.1, with the median at 0.016. In addition, 73 of the 100 candidates in the original dataset score higher than all motifs in the corresponding randomized datasets.
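
The control described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, whole-column shuffling is a simplified stand-in for the gap-preserving permutation detailed in the online supplement, and the motif-scoring step (CMfinder) is left out entirely. The p-value is taken as the fraction of background scores exceeding the observed score, floored at 1/N, which is one assumption consistent with the 2 × 10^-4 lower bound for 5,000 background samples.

```python
import random
from bisect import bisect_right


def permute_columns(alignment, rng):
    """Shuffle alignment columns (simplified stand-in for the paper's
    gap-pattern-preserving permutation; an assumption, not their method)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    order = list(range(length))
    rng.shuffle(order)
    return ["".join(row[i] for i in order) for row in alignment]


def degap(alignment):
    """Strip gap characters so the rows become unaligned sequences."""
    return [row.replace("-", "").replace(".", "") for row in alignment]


def empirical_p_values(observed, background):
    """Fraction of background scores strictly exceeding each observed score,
    floored at 1/N (assumed convention; N = len(background))."""
    bg = sorted(background)
    n = len(bg)
    pvals = []
    for score in observed:
        exceed = n - bisect_right(bg, score)  # background scores > score
        pvals.append(max(exceed, 1) / n)
    return pvals


if __name__ == "__main__":
    rng = random.Random(0)
    toy_alignment = ["ACG-UU", "AC--UA", "GCGAUU"]
    shuffled = degap(permute_columns(toy_alignment, rng))
    print(shuffled)
    # With 5,000 background scores, the smallest reportable p-value is 1/5000.
    print(empirical_p_values([1e9], [float(i) for i in range(5000)]))
```

In a full control, each degapped permuted dataset would be passed to the motif finder, and the top score from each run would be collected into `background`.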