Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling

ABSTRACT

Accurate annotation of protein-coding regions is essential for understanding how genetic information is translated into function. We describe riboHMM, a new method that uses ribosome footprint data to accurately infer translated sequences. Applying riboHMM to human lymphoblastoid cell lines, we identified 7273 novel coding sequences, including 2442 translated upstream open reading frames. We observed an enrichment of footprints at inferred initiation sites after drug-induced arrest of translation initiation, validating many of the novel coding sequences. The novel proteins exhibit significant selective constraint in the inferred reading frames, suggesting that many are functional. Moreover, ~40% of bicistronic transcripts showed a negative correlation in the translation levels of their two coding sequences, suggesting a potential regulatory role for these novel regions. Despite the known limitations of mass spectrometry in detecting proteins expressed at low levels, we estimated a 14% validation rate. Our work significantly expands the set of known coding regions in humans.

DOI: http://dx.doi.org/10.7554/eLife.13328.001



Figure 3—figure supplement 1. Translated coding sequences identified in hundreds of pseudogenes. Comparing the inferred CDS in the pseudogene GAPDHP72 (orange) with the annotated CDS (purple) of its parent gene GAPDH (introns have been removed for better visualization). Gray shaded boxes show the alignment between the pseudogene and parent gene transcripts, and dark shaded boxes indicate that the inferred coding frame of the pseudogene matches that of the parent gene. Although the pseudogene and parent gene share coding frames, the underlying sequences are sufficiently different. DOI: http://dx.doi.org/10.7554/eLife.13328.013

Mentions: In addition, we identified 2550 novel mCDS in annotated non-coding transcripts and 1019 mCDS within novel transcripts assembled de novo by StringTie (FDR = 5.6%). Using simulations, we estimated that when a transcript has no translated sequence, riboHMM incorrectly identifies one at a low rate (Type I error rate = 4.5%). Over 60% of the mCDS in novel transcripts were identified in single-exon transcripts transcribed from regions containing no annotated genes, while about 8% were identified in novel transcripts transcribed from the strand opposite to an annotated transcript. Finally, we inferred mCDS in 448 expressed pseudogenes, among 14,065 pseudogenes annotated in humans (Pei et al., 2012); nearly 90% of these mCDS were identified in processed pseudogenes. An mCDS in pseudogene GAPDHP72 is shown in Figure 3—figure supplement 1, comparing the ribosome abundance and peptide matches to the pseudogene mCDS with those of its parent gene GAPDH.
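
To put the counts above in proportion, the following is a minimal sketch (not the authors' code) of two simple calculations: the share of annotated pseudogenes with an inferred mCDS, and how a Type I error rate could be estimated empirically from simulated untranslated transcripts. The function names and the toy list of simulated calls are assumptions made for illustration; only the counts 448 and 14,065 come from the text.

```python
def fraction(part, whole):
    """Return part / whole as a float, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


def empirical_type1_error(calls_on_null):
    """Estimate a Type I error rate from simulated null transcripts.

    calls_on_null: iterable of booleans, True when a translated sequence
    was (wrongly) called on a transcript simulated without one.
    This is a generic estimator, assumed here for illustration.
    """
    calls = list(calls_on_null)
    if not calls:
        raise ValueError("need at least one simulated transcript")
    return sum(calls) / len(calls)


# Counts taken from the text: 448 expressed pseudogenes with an inferred
# mCDS, out of 14,065 annotated human pseudogenes.
pseudogene_share = fraction(448, 14065)
print(f"Pseudogenes with an inferred mCDS: {pseudogene_share:.1%}")

# Toy input (hypothetical, not data from the paper): 1 false call in 20
# simulated untranslated transcripts.
toy_calls = [True] + [False] * 19
print(f"Toy Type I error rate: {empirical_type1_error(toy_calls):.1%}")
```

Under these counts, roughly 3% of annotated pseudogenes carry an inferred mCDS, which is worth keeping in mind when reading "hundreds of pseudogenes" in the figure title.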
