Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling


ABSTRACT

Accurate annotation of protein-coding regions is essential for understanding how genetic information is translated into function. We describe riboHMM, a new method that uses ribosome footprint data to accurately infer translated sequences. Applying riboHMM to human lymphoblastoid cell lines, we identified 7273 novel coding sequences, including 2442 translated upstream open reading frames. We observed an enrichment of footprints at inferred initiation sites after drug-induced arrest of translation initiation, validating many of the novel coding sequences. The novel proteins exhibit significant selective constraint in the inferred reading frames, suggesting that many are functional. Moreover, ~40% of bicistronic transcripts showed negative correlation in the translation levels of their two coding sequences, suggesting a potential regulatory role for these novel regions. Despite the known limitations of mass spectrometry in detecting proteins expressed at low levels, we estimated a 14% validation rate. Our work significantly expands the set of known coding regions in humans.

DOI: http://dx.doi.org/10.7554/eLife.13328.001



Model accuracy. Quantifying model accuracy in terms of the fraction of previously annotated CDSs accurately recovered as a function of total footprint sequencing depth. We performed the entire analysis on the same set of assembled transcripts (parameter estimation and inference) after subsampling the data. Starting with inferences using the complete footprint data set that exactly match annotated CDSs (top subpanel), we show the fraction of these CDSs that were accurately (blue) and inaccurately (brown) recovered with a high posterior for varying sequencing depths. Starting with inferences that only match the frame of annotated CDSs (bottom subpanel), we show the fraction of accurately and inaccurately recovered CDSs for varying sequencing depths. DOI: http://dx.doi.org/10.7554/eLife.13328.006

Mentions: We used simulations to estimate the accuracy of our inferences. For transcripts that do contain a translated sequence, we find that riboHMM inaccurately identifies an overlapping, translated sequence in a different frame at an extremely low rate (Type 1 error rate = 0.31%). In contrast, riboHMM has a relatively high error rate for identifying the precise translation initiation site (false discovery proportion = 38%; see Materials and methods for details). Among transcripts where riboHMM correctly identified the reading frame, the concordance between the inferred and annotated translation initiation sites does not correlate with the length of the CDS (Mann-Whitney test; p-value = 0.12). Among these, when riboHMM did not identify the annotated initiation site, the inferred initiation site was equally likely to be upstream or downstream of the annotated initiation site (Mann-Whitney test; p-value = 0.41). Our analysis is robust to sequencing depth; Figure 1—figure supplement 4 illustrates that nearly 60% of annotated coding sequences identified with the full data set (580 million footprints) could be accurately recovered even when the sequencing depth was reduced by almost two orders of magnitude.
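
The recovery fractions plotted against sequencing depth can be illustrated with a minimal Python sketch. This is not riboHMM code: the function name `recovery_fractions`, the representation of a CDS call as a `(start, stop)` pair keyed by transcript ID, and the toy coordinates are all assumptions made for illustration. It also assumes that each input already contains only calls passing the posterior threshold.

```python
from typing import Dict, Tuple

# Hypothetical representation: one CDS call per transcript, as (start, stop)
# in transcript coordinates. riboHMM's real output format may differ.
Cds = Tuple[int, int]


def recovery_fractions(full: Dict[str, Cds], sub: Dict[str, Cds]) -> Tuple[float, float]:
    """Compare calls from a subsampled run against calls from the full data set.

    Returns (accurate, inaccurate), both as fractions of the full-depth calls:
      accurate   - the subsampled run calls exactly the same CDS
      inaccurate - the subsampled run calls a different CDS on that transcript
    Transcripts with no call in the subsampled run count toward neither.
    """
    if not full:
        raise ValueError("full-depth call set is empty")
    accurate = sum(1 for tx, cds in full.items() if sub.get(tx) == cds)
    inaccurate = sum(1 for tx, cds in full.items() if tx in sub and sub[tx] != cds)
    return accurate / len(full), inaccurate / len(full)


# Toy data (invented coordinates, not results from the study).
full_calls = {"tx1": (10, 310), "tx2": (5, 95), "tx3": (40, 400), "tx4": (0, 150)}
sub_calls = {"tx1": (10, 310), "tx2": (20, 95), "tx4": (0, 150)}

acc, inacc = recovery_fractions(full_calls, sub_calls)
print(f"accurate={acc:.2f} inaccurate={inacc:.2f}")
```

In the toy data, two of the four full-depth calls are reproduced exactly, one transcript gets a different start site, and one is not called at all. Repeating the comparison at each subsampled depth would give one pair of points per depth, analogous to the blue and brown fractions in the figure.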


