Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling

ABSTRACT

Accurate annotation of protein coding regions is essential for understanding how genetic information is translated into function. We describe riboHMM, a new method that uses ribosome footprint data to accurately infer translated sequences. Applying riboHMM to human lymphoblastoid cell lines, we identified 7273 novel coding sequences, including 2442 translated upstream open reading frames. We observed an enrichment of footprints at inferred initiation sites after drug-induced arrest of translation initiation, validating many of the novel coding sequences. The novel proteins exhibit significant selective constraint in the inferred reading frames, suggesting that many are functional. Moreover, ~40% of bicistronic transcripts showed negative correlation in the translation levels of their two coding sequences, suggesting a potential regulatory role for these novel regions. Despite known limitations of mass spectrometry in detecting proteins expressed at low levels, we estimated a 14% validation rate. Our work significantly expands the set of known coding regions in humans.

DOI: http://dx.doi.org/10.7554/eLife.13328.001


fig1s1: Robustness of periodicity parameter estimates. Bar plots of the periodicity parameters for 29 bp long footprints, estimated using the top 5000 expressed genes. Each red circle indicates the parameter values estimated by using a random set of 1000 genes.
DOI: http://dx.doi.org/10.7554/eLife.13328.003

Mentions: We applied riboHMM to infer translated CDSs in human lymphoblastoid cell lines (LCLs) for which gene expression phenotypes were measured genome-wide: mRNA in 86 individuals, ribosome occupancy in 72 individuals and protein levels in 60 individuals (Lappalainen et al., 2013; Battle et al., 2015). We first assembled over 2.8 billion RNA sequencing reads into transcripts using StringTie (Pertea et al., 2015). This assembly gives us annotated transcripts that are expressed in LCLs, along with novel transcripts that do not overlap any GENCODE annotated gene. (We do not consider novel isoforms of annotated genes in our analyses.) Restricting to transcripts with at least five footprints mapped to each exon, we used riboHMM to identify high-confidence translated CDS. We learned the maximum likelihood estimates of the model parameters using the top five thousand highly expressed genes. The estimated parameters are robust to the choice of the learning set (Figure 1—figure supplements 1 and 2). Using these parameters, we then inferred the maximum a posteriori (MAP) frame and latent state sequence for each of the assembled transcripts. We retained transcripts whose MAP frame and state sequence corresponded to a pair of translation initiation and termination sites and had a joint posterior probability greater than 0.8. Using a set of simulated transcripts, we estimated that this posterior cutoff corresponded to a Type 1 error rate of 4.5% per transcript. The MAP frame and state sequence directly give us the nucleotide sequence with the strongest signal of translation; we refer to these as main coding sequences or mCDS.
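The two filters described in this paragraph (at least five footprints on every exon, and a joint posterior above 0.8 for a MAP path that contains both an initiation and a termination site) can be sketched as follows. This is a minimal illustration, not riboHMM's actual code or API: the `Transcript` record, its field names, and the helper functions are hypothetical, and the sketch assumes the inference output is already attached to each record, whereas in the paper the footprint filter is applied before inference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transcript:
    """Hypothetical record; field names are illustrative, not riboHMM's."""
    name: str
    exon_footprint_counts: List[int]  # footprints mapped to each exon
    map_posterior: float              # joint posterior of MAP frame and state sequence
    start: Optional[int] = None       # inferred initiation site, if any
    stop: Optional[int] = None        # inferred termination site, if any


# Thresholds taken from the text above.
MIN_FOOTPRINTS_PER_EXON = 5  # "at least five footprints mapped to each exon"
MIN_POSTERIOR = 0.8          # "joint posterior probability greater than 0.8"


def has_enough_footprints(t: Transcript) -> bool:
    """True if the transcript has exons and every exon meets the footprint minimum."""
    return bool(t.exon_footprint_counts) and all(
        count >= MIN_FOOTPRINTS_PER_EXON for count in t.exon_footprint_counts
    )


def is_high_confidence(t: Transcript) -> bool:
    """True if the MAP path has an initiation/termination pair and posterior > 0.8."""
    has_site_pair = t.start is not None and t.stop is not None
    return has_site_pair and t.map_posterior > MIN_POSTERIOR


def select_mcds(transcripts: List[Transcript]) -> List[Transcript]:
    """Keep transcripts that pass both filters, preserving input order."""
    return [
        t for t in transcripts
        if has_enough_footprints(t) and is_high_confidence(t)
    ]
```

Note that the posterior comparison is strict, matching "greater than 0.8" in the text, so a transcript with a posterior of exactly 0.8 is not retained.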
