Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling.

Lindgreen S, Umu SU, Lai AS, Eldai H, Liu W, McGimpsey S, Wheeler NE, Biggs PJ, Thomson NR, Barquist L, Poole AM, Gardner PP - PLoS Comput. Biol. (2014)

Bottom Line: However, our analyses reveal that capacity to identify noncoding RNA outputs is strongly dependent on phylogenetic sampling. Surprisingly, and in stark contrast to protein-coding genes, the phylogenetic window for effective use of comparative methods is perversely narrow: aggregating public datasets only produced one phylogenetic cluster where these tools could be used to robustly separate unannotated noncoding RNAs from a hypothesis of transcriptional noise. Our results show that for the full potential of transcriptomics data to be realized, a change in experimental design is paramount: effective transcriptomics requires phylogeny-aware sampling.


Affiliation: Department of Biology, University of Copenhagen, Copenhagen, Denmark; School of Biological Sciences, University of Canterbury, Christchurch, New Zealand.

ABSTRACT
Noncoding RNAs are integral to a wide range of biological processes, including translation, gene regulation, host-pathogen interactions and environmental sensing. While genomics is now a mature field, our capacity to identify noncoding RNA elements in bacterial and archaeal genomes is hampered by the difficulty of de novo identification. The emergence of new technologies for characterizing transcriptome outputs, notably RNA-seq, is improving noncoding RNA identification and expression quantification. However, a major challenge is to robustly distinguish functional outputs from transcriptional noise. To establish whether annotation of existing transcriptome data has effectively captured all functional outputs, we analysed over 400 publicly available RNA-seq datasets spanning 37 different Archaea and Bacteria. Using comparative tools, we identify close to a thousand highly-expressed candidate noncoding RNAs. However, our analyses reveal that capacity to identify noncoding RNA outputs is strongly dependent on phylogenetic sampling. Surprisingly, and in stark contrast to protein-coding genes, the phylogenetic window for effective use of comparative methods is perversely narrow: aggregating public datasets produced only one phylogenetic cluster where these tools could be used to robustly separate unannotated noncoding RNAs from a hypothesis of transcriptional noise. Our results show that for the full potential of transcriptomics data to be realized, a change in experimental design is paramount: effective transcriptomics requires phylogeny-aware sampling.


Figure 2: Many ncRNAs and RUFs are highly expressed but show limited conservation across represented strains/species. A–C: We have defined the “family conservation” for Pfam, Rfam and RUFs based upon the maximum phylogenetic distance (using structural SSU rRNA alignments) between any two strains hosting the family. We have divided the highly expressed transcripts (ranks 1–204) into Low, Medium and High conservation groups based on the lower-quartile, inter-quartile range and the upper-quartile of the family conservation measure (see Methods for further details). Both the known Rfam families and the RUFs identified in this analysis are often highly expressed transcripts. In contrast to protein-coding transcripts (blue), where highly-expressed transcripts are well-conserved, the opposite is true of many non-coding RNA elements (Rfam, red; RUFs, black). Notably, the greatest proportion of highly expressed Rfam-annotated RNA elements show a narrow evolutionary distribution. This is also reflected in the RUFs identified in this study. D: Venn diagram of the 568 highly expressed RUFs. Each RUF was analysed to look for evidence of secondary structure formation, level of conservation, and evidence of expression in at least one other RNA-seq dataset. All RUFs showing expression in other strains/species are conserved in at least two strains/species, so the figure also shows that 219 highly expressed RUFs are conserved across a limited phylogenetic distance only.
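
The "family conservation" measure and the Low/Medium/High grouping described in the caption can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the distance lookup keyed by strain pairs, the choice to score single-strain families as 0.0, and the treatment of values falling exactly on a quartile boundary are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import quantiles


def family_conservation(hosts, dist):
    """Maximum pairwise phylogenetic distance between strains hosting a family.

    `hosts` is a set of strain names; `dist` maps frozenset({a, b}) to a
    distance (hypothetical layout). A family found in fewer than two strains
    is scored 0.0, which is an assumption of this sketch.
    """
    if len(hosts) < 2:
        return 0.0
    return max(dist[frozenset(pair)] for pair in combinations(sorted(hosts), 2))


def conservation_groups(scores):
    """Bin families into Low / Medium / High by quartiles of their scores.

    Below the lower quartile is Low, above the upper quartile is High, and
    everything else (the inter-quartile range) is Medium.
    """
    q1, _median, q3 = quantiles(list(scores.values()), n=4)
    groups = {}
    for family, score in scores.items():
        if score < q1:
            groups[family] = "Low"
        elif score > q3:
            groups[family] = "High"
        else:
            groups[family] = "Medium"
    return groups
```

In practice the distances would come from the structural SSU rRNA alignments mentioned in the caption; here they are simply looked up from a precomputed table.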

Mentions: To assess whether highly expressed RUFs possess features commonly associated with function, we employed three criteria: 1) evolutionary conservation, 2) conservation of secondary structure, 3) evidence of expression in more than one RNA-seq dataset. For this analysis, we compared and ranked transcriptional outputs across species/strains (see Methods for details). Based on the relative rank across RNA-seq datasets and the maximum phylogenetic distance observed across all genomes, each transcript was classified as high, medium or low expression, and high, medium or low conservation. This yielded a set of highly expressed transcripts consisting of 162 Rfam families, 568 RUFs and 1429 Pfam families. As expected [44]–[46], conserved, highly expressed outputs are dominated by protein-coding transcripts (Figure 2B&C). In contrast, transcripts that are highly expressed but poorly conserved are primarily RUFs (Figure 2A). Of the 568 RUFs identified, only 25 are supported by all three conservative criteria (conservation, secondary structure and expression), and a further 138 RUFs are supported by two criteria (Figure 2D). Consequently, on these criteria, the vast majority of RUFs appear indistinguishable from transcriptional noise. However, as these RUFs are among the most highly expressed transcripts in public RNA-seq data, we next considered whether our criteria were sufficiently discriminatory to identify functional RNAs. It is well established that not all functional RNAs exhibit conserved secondary structure: antisense base pairing with a target is common, and does not require intramolecular folding [47]. This indicates that criterion 2 will apply to some, but not all, functional RNA elements. Criteria 1 and 3 both derive from comparative analysis: criterion 1 requires an expressed RUF to be conserved in some other genome, while criterion 3 requires an expressed RUF to be expressed in another of the datasets in our study.
We therefore sought to examine how effective our comparative analyses are given that the available data represent a small sample (transcriptomes from 37 strains) and given that biases in genome sampling across bacterial and archaeal diversity impact comparative analysis of RNAs [36].
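
To make the three-criteria bookkeeping concrete, the short Python sketch below tallies how many criteria each RUF meets and reproduces the arithmetic behind "the vast majority". The `Evidence` record and both function names are hypothetical, and reading "supported by two criteria" as exactly two is an assumption; only the counts 568, 25 and 138 come from the text.

```python
from collections import Counter
from typing import NamedTuple


class Evidence(NamedTuple):
    """Hypothetical per-RUF record of the three criteria from the text."""
    conserved: bool            # criterion 1: evolutionary conservation
    structured: bool           # criterion 2: conserved secondary structure
    expressed_elsewhere: bool  # criterion 3: expressed in another dataset


def tally_support(evidence):
    """Count RUFs by the number of criteria (0 to 3) that each one meets."""
    return Counter(sum(flags) for flags in evidence.values())


def weakly_supported(total, all_three, exactly_two):
    """RUFs left with at most one supporting criterion."""
    return total - all_three - exactly_two


# Reported counts: 568 highly expressed RUFs, 25 meeting all three
# criteria, 138 meeting two (taken here to mean exactly two).
remaining = weakly_supported(568, 25, 138)
share = remaining / 568
```

On the reported numbers this leaves 405 of the 568 RUFs, roughly 71%, with at most one supporting criterion.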

