Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling


ABSTRACT

Accurate annotation of protein coding regions is essential for understanding how genetic information is translated into function. We describe riboHMM, a new method that uses ribosome footprint data to accurately infer translated sequences. Applying riboHMM to human lymphoblastoid cell lines, we identified 7273 novel coding sequences, including 2442 translated upstream open reading frames. We observed an enrichment of footprints at inferred initiation sites after drug-induced arrest of translation initiation, validating many of the novel coding sequences. The novel proteins exhibit significant selective constraint in the inferred reading frames, suggesting that many are functional. Moreover, ~40% of bicistronic transcripts showed a negative correlation between the translation levels of their two coding sequences, suggesting a potential regulatory role for these novel regions. Despite the known limitations of mass spectrometry in detecting proteins expressed at low levels, we estimated a 14% validation rate. Our work significantly expands the set of known coding regions in humans.

DOI: http://dx.doi.org/10.7554/eLife.13328.001




fig8: Comparing the estimated values of parameter psi_c when using the 5000 most expressed genes (blue) and random sets of 5000 genes (gray). DOI: http://dx.doi.org/10.7554/eLife.13328.025

Mentions: Our model assumes that all cytoplasmic mRNA translation shares the same molecular mechanism. Our learning set is enriched for annotated coding genes, and we agree that this choice of learning set could bias our estimate of the frequency of non-AUG-initiated coding sequences. Given the general lack of available information on these novel CDS, we consider our estimate a reasonable starting point. The model parameters were learned using 5000 well-expressed genes; we used GENCODE annotations to define a gene but did not restrict this set to annotated coding genes. Most importantly, we did not use any information about the annotated translation initiation and termination sites when learning the model parameters. The psi_c values estimated with these 5000 well-expressed genes were very similar to those obtained when random sets of 5000 genes were used as the learning set (see Author response image 1). In addition, only the periodicity parameters are shared between uaCDS and mCDS; i.e., we re-estimate psi_c before inferring uaCDS. We chose well-expressed genes as our learning set to ensure that, when the footprint data do not provide very strong evidence regarding the initiation site, the novel coding sequences identified by our method are as similar as possible to annotated coding sequences in the sequence composition of their initiation sites. We have now elaborated on our choice of learning set and its consequences in the Discussion section of the revised manuscript (third paragraph).

Author response image 1 (DOI: http://dx.doi.org/10.7554/eLife.13328.025): Comparing the estimated values of parameter psi_c when using the 5000 most expressed genes (blue) and random sets of 5000 genes (gray).
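The learning-set comparison described above can be illustrated with a minimal sketch. It is not riboHMM code and does not estimate psi_c: it only shows how a top-N expressed learning set and several random learning sets of the same size might be drawn, and how two parameter vectors estimated from them could then be compared. The function names, the toy expression values, and the use of a maximum absolute difference as the comparison are all assumptions made for illustration.

```python
import random


def top_expressed_genes(expression, n):
    """Return the n gene IDs with the highest expression, highest first.

    `expression` maps a gene ID to an expression value (hypothetical input).
    Ties are broken by gene ID so the result is deterministic.
    """
    ranked = sorted(expression, key=lambda gene: (-expression[gene], gene))
    return ranked[:n]


def random_gene_sets(expression, n, num_sets, seed=0):
    """Draw `num_sets` random learning sets of `n` distinct genes each."""
    rng = random.Random(seed)
    genes = sorted(expression)  # fixed order so the seed is reproducible
    return [rng.sample(genes, n) for _ in range(num_sets)]


def max_abs_difference(params_a, params_b):
    """Largest absolute difference between two equal-length parameter vectors."""
    if len(params_a) != len(params_b):
        raise ValueError("parameter vectors must have the same length")
    return max(abs(a - b) for a, b in zip(params_a, params_b))


# Toy data: 20 hypothetical genes whose expression equals their index.
expression = {f"gene{i}": float(i) for i in range(1, 21)}

# Learning set of the most expressed genes (5 here, 5000 in the text).
top_set = top_expressed_genes(expression, 5)

# Random learning sets of the same size, used as the comparison.
random_sets = random_gene_sets(expression, 5, num_sets=3)
```

In a real analysis, the model parameters would be estimated separately on `top_set` and on each entry of `random_sets`, and the resulting psi_c vectors compared, for example with `max_abs_difference`.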

