Limits...
Considerations in the identification of functional RNA structural elements in genomic alignments.

Babak T, Blencowe BJ, Hughes TR - BMC Bioinformatics (2007)

Bottom Line: Furthermore, due to issues in modelling background signal, it has been difficult to gauge the precision of these algorithms on a genomic scale, in which even a seemingly small false-positive rate can result in a vast excess of false discoveries.On pairwise alignments, none of the covariance-based tools performed markedly better than thermodynamic scoring alone.Accurate prediction of novel RNA structural elements in genome sequence remains a difficult problem, and development of an appropriate negative-control strategy for multiple alignments is an important practical challenge.

View Article: PubMed Central - HTML - PubMed

Affiliation: Banting and Best Department of Medical Research, Donnelly Centre for Cellular and Biomolecular Research, 160 College St, Toronto, ON M5S 3E1 Canada. tomas.babak@utoronto.ca <tomas.babak@utoronto.ca>

ABSTRACT

Background: Accurate identification of novel, functional noncoding (nc) RNA features in genome sequence has proven more difficult than for exons. Current algorithms identify and score potential RNA secondary structures on the basis of thermodynamic stability, conservation, and/or covariance in sequence alignments. Neither the algorithms nor the information gained from the individual inputs have been independently assessed. Furthermore, due to issues in modelling background signal, it has been difficult to gauge the precision of these algorithms on a genomic scale, in which even a seemingly small false-positive rate can result in a vast excess of false discoveries.

Results: We developed a shuffling algorithm, shuffle-pair.pl, that simultaneously preserves dinucleotide frequency, gaps, and local conservation in pairwise sequence alignments. We used shuffle-pair.pl to assess precision and recall of six ncRNA search tools (MSARI, QRNA, ddbRNA, RNAz, Evofold, and several variants of simple thermodynamic stability on a test set of 3046 alignments of known ncRNAs. Relative to mononucleotide shuffling, preservation of dinucleotide content in shuffling the alignments resulted in a drastic increase in estimated false-positive detection rates for ncRNA elements, precluding evaluation of higher order alignments, which cannot not be adequately shuffled maintaining both dinucleotides and alignment structure. On pairwise alignments, none of the covariance-based tools performed markedly better than thermodynamic scoring alone. Although the high false-positive rates call into question the veracity of any individual predicted secondary structural element in our analysis, we nevertheless identified intriguing global trends in human genome alignments. The distribution of ncRNA prediction scores in 75-base windows overlapping UTRs, introns, and intergenic regions analyzed using both thermodynamic stability and EvoFold (which has no thermodynamic component) was significantly higher for real than shuffled sequence, while the distribution for coding sequences was lower than that of corresponding shuffles.

Conclusion: Accurate prediction of novel RNA structural elements in genome sequence remains a difficult problem, and development of an appropriate negative-control strategy for multiple alignments is an important practical challenge. Nonetheless, the general trends we observed for the distributions of predicted ncRNAs across genomic features are biologically meaningful, supporting the presence of secondary structural elements in many 3' UTRs, and providing evidence for evolutionary selection against secondary structures in coding regions.

Show MeSH

Related in: MedlinePlus

Enrichment for conserved secondary structure elements across genomic features. a) Differences between real and shuffled score distributions over genomic features. Left, Histogram showing distribution of all scores from real and shuffled tiling windows of genomic sequences (see Methods). Right, histogram showing the difference between the number of real sequences scoring in a given range and the number of shuffled sequences scoring in the same range. b) Histograms showing real-minus-shuffled distributions (as in the right part of panel a of this figure) for different categories of genomic features: left, zMFOLD (human vs. mouse); right, Evofold (human vs. chimp).
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC1803800&req=5

Figure 7: Enrichment for conserved secondary structure elements across genomic features. a) Differences between real and shuffled score distributions over genomic features. Left, Histogram showing distribution of all scores from real and shuffled tiling windows of genomic sequences (see Methods). Right, histogram showing the difference between the number of real sequences scoring in a given range and the number of shuffled sequences scoring in the same range. b) Histograms showing real-minus-shuffled distributions (as in the right part of panel a of this figure) for different categories of genomic features: left, zMFOLD (human vs. mouse); right, Evofold (human vs. chimp).

Mentions: We scanned pairwise genomic alignments [22] of human chromosome 19 and genomic regions corresponding to alignable portions of 24,576 Refseq [23] genes with either zMFOLD or Evofold, as practical representatives of the thermodynamic-stability-only regime and the covariance regime. We scanned real and shuffled counterparts (using shuffle-pair.pl) of each window. We found that the distributions of the real and shuffled scores largely overlap, with few or no real scores completely exceeding the shuffled distribution (Figure 7a, left). However, the real and shuffled score distributions were significantly different (Wilcoxon rank-sum test, p < e-100; data not shown). A histogram of the difference in number of real and shuffled sequences at different scores is shown at the right in Figure 7a. These results obviously bring into question the significance of any individual element, but also suggest a selection for some level of conserved secondary structure across the genome.


Considerations in the identification of functional RNA structural elements in genomic alignments.

Babak T, Blencowe BJ, Hughes TR - BMC Bioinformatics (2007)

Enrichment for conserved secondary structure elements across genomic features. a) Differences between real and shuffled score distributions over genomic features. Left, Histogram showing distribution of all scores from real and shuffled tiling windows of genomic sequences (see Methods). Right, histogram showing the difference between the number of real sequences scoring in a given range and the number of shuffled sequences scoring in the same range. b) Histograms showing real-minus-shuffled distributions (as in the right part of panel a of this figure) for different categories of genomic features: left, zMFOLD (human vs. mouse); right, Evofold (human vs. chimp).
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC1803800&req=5

Figure 7: Enrichment for conserved secondary structure elements across genomic features. a) Differences between real and shuffled score distributions over genomic features. Left, Histogram showing distribution of all scores from real and shuffled tiling windows of genomic sequences (see Methods). Right, histogram showing the difference between the number of real sequences scoring in a given range and the number of shuffled sequences scoring in the same range. b) Histograms showing real-minus-shuffled distributions (as in the right part of panel a of this figure) for different categories of genomic features: left, zMFOLD (human vs. mouse); right, Evofold (human vs. chimp).
Mentions: We scanned pairwise genomic alignments [22] of human chromosome 19 and genomic regions corresponding to alignable portions of 24,576 Refseq [23] genes with either zMFOLD or Evofold, as practical representatives of the thermodynamic-stability-only regime and the covariance regime. We scanned real and shuffled counterparts (using shuffle-pair.pl) of each window. We found that the distributions of the real and shuffled scores largely overlap, with few or no real scores completely exceeding the shuffled distribution (Figure 7a, left). However, the real and shuffled score distributions were significantly different (Wilcoxon rank-sum test, p < e-100; data not shown). A histogram of the difference in number of real and shuffled sequences at different scores is shown at the right in Figure 7a. These results obviously bring into question the significance of any individual element, but also suggest a selection for some level of conserved secondary structure across the genome.

Bottom Line: Furthermore, due to issues in modelling background signal, it has been difficult to gauge the precision of these algorithms on a genomic scale, in which even a seemingly small false-positive rate can result in a vast excess of false discoveries.On pairwise alignments, none of the covariance-based tools performed markedly better than thermodynamic scoring alone.Accurate prediction of novel RNA structural elements in genome sequence remains a difficult problem, and development of an appropriate negative-control strategy for multiple alignments is an important practical challenge.

View Article: PubMed Central - HTML - PubMed

Affiliation: Banting and Best Department of Medical Research, Donnelly Centre for Cellular and Biomolecular Research, 160 College St, Toronto, ON M5S 3E1 Canada. tomas.babak@utoronto.ca <tomas.babak@utoronto.ca>

ABSTRACT

Background: Accurate identification of novel, functional noncoding (nc) RNA features in genome sequence has proven more difficult than for exons. Current algorithms identify and score potential RNA secondary structures on the basis of thermodynamic stability, conservation, and/or covariance in sequence alignments. Neither the algorithms nor the information gained from the individual inputs have been independently assessed. Furthermore, due to issues in modelling background signal, it has been difficult to gauge the precision of these algorithms on a genomic scale, in which even a seemingly small false-positive rate can result in a vast excess of false discoveries.

Results: We developed a shuffling algorithm, shuffle-pair.pl, that simultaneously preserves dinucleotide frequency, gaps, and local conservation in pairwise sequence alignments. We used shuffle-pair.pl to assess precision and recall of six ncRNA search tools (MSARI, QRNA, ddbRNA, RNAz, Evofold, and several variants of simple thermodynamic stability on a test set of 3046 alignments of known ncRNAs. Relative to mononucleotide shuffling, preservation of dinucleotide content in shuffling the alignments resulted in a drastic increase in estimated false-positive detection rates for ncRNA elements, precluding evaluation of higher order alignments, which cannot not be adequately shuffled maintaining both dinucleotides and alignment structure. On pairwise alignments, none of the covariance-based tools performed markedly better than thermodynamic scoring alone. Although the high false-positive rates call into question the veracity of any individual predicted secondary structural element in our analysis, we nevertheless identified intriguing global trends in human genome alignments. The distribution of ncRNA prediction scores in 75-base windows overlapping UTRs, introns, and intergenic regions analyzed using both thermodynamic stability and EvoFold (which has no thermodynamic component) was significantly higher for real than shuffled sequence, while the distribution for coding sequences was lower than that of corresponding shuffles.

Conclusion: Accurate prediction of novel RNA structural elements in genome sequence remains a difficult problem, and development of an appropriate negative-control strategy for multiple alignments is an important practical challenge. Nonetheless, the general trends we observed for the distributions of predicted ncRNAs across genomic features are biologically meaningful, supporting the presence of secondary structural elements in many 3' UTRs, and providing evidence for evolutionary selection against secondary structures in coding regions.

Show MeSH
Related in: MedlinePlus