Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling


ABSTRACT

Accurate annotation of protein-coding regions is essential for understanding how genetic information is translated into function. We describe riboHMM, a new method that uses ribosome footprint data to accurately infer translated sequences. Applying riboHMM to human lymphoblastoid cell lines, we identified 7273 novel coding sequences, including 2442 translated upstream open reading frames. We observed an enrichment of footprints at inferred initiation sites after drug-induced arrest of translation initiation, validating many of the novel coding sequences. The novel proteins exhibit significant selective constraint in the inferred reading frames, suggesting that many are functional. Moreover, ~40% of bicistronic transcripts showed negative correlation in the translation levels of their two coding sequences, suggesting a potential regulatory role for these novel regions. Despite the known limitations of mass spectrometry in detecting proteins expressed at low levels, we estimated a 14% validation rate. Our work significantly expands the set of known coding regions in humans.

DOI: http://dx.doi.org/10.7554/eLife.13328.001



Figure 1—figure supplement 5: Comparing footprint abundance and gene expression. Scatter plot of the RNA-seq density (reads per base per million sequenced reads) within annotated coding transcripts vs. the expected ribosome footprint count (normalized by sequencing depth) in triplets within the CDS of the transcripts. The top 50th percentile of expressed genes is shown in black and the bottom 50th percentile is shown in gray. For highly expressed genes, it is reasonable to assume that the expected footprint count in a triplet scales linearly with RNA-seq density. However, for lowly expressed genes and outlier genes (i.e., genes with high expected footprint count and low RNA-seq density, and genes with high RNA-seq density and low expected footprint count), this assumption may not be valid. DOI: http://dx.doi.org/10.7554/eLife.13328.007

Mentions: The Poisson distribution for […] captures the difference in RPF (ribosome-protected fragment) abundance between translated and untranslated regions (more precisely, the difference in abundance between triplets in different states). We corrected for differences in RPF abundance across transcripts due to differences in transcript expression levels by using […] as a transcript-specific normalization factor (see Figure 1—figure supplement 5). To account for additional variation in RPF counts across triplets in the same state (e.g., due to varying translation rates across transcripts and translational pausing), we allowed for triplet-specific parameters in the Poisson intensity and assumed that those parameters follow a gamma distribution. Under this model, […] and […].
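
The count model described above can be illustrated with a small sketch. It assumes the standard result that a Poisson count whose intensity is gamma-distributed has a negative binomial marginal. The names `scale` (standing in for the transcript-specific normalization factor), `alpha`, and `beta` (gamma shape and rate) are hypothetical, chosen for illustration; they are not the symbols or the exact parameterization used by riboHMM.

```python
import math


def marginal_log_pmf(x, scale, alpha, beta):
    """log P(X = x) when X | lam ~ Poisson(scale * lam) and
    lam ~ Gamma(shape=alpha, rate=beta).

    Integrating out lam gives a negative binomial distribution.
    `scale`, `alpha`, `beta` are illustrative names (assumptions).
    """
    return (
        math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
        + alpha * math.log(beta / (beta + scale))
        + x * math.log(scale / (beta + scale))
    )


def marginal_moments(scale, alpha, beta):
    """Closed-form mean and variance of the marginal count."""
    mean = scale * alpha / beta
    var = mean * (1.0 + scale / beta)  # Poisson variance plus gamma-induced excess
    return mean, var


def numeric_moments(scale, alpha, beta, x_max=500):
    """Mean and variance obtained by summing the pmf over 0..x_max-1."""
    probs = [math.exp(marginal_log_pmf(x, scale, alpha, beta)) for x in range(x_max)]
    total = sum(probs)
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return total, mean, var


if __name__ == "__main__":
    scale, alpha, beta = 2.0, 3.0, 1.5  # arbitrary illustrative values
    print(marginal_moments(scale, alpha, beta))
    print(numeric_moments(scale, alpha, beta))
```

The variance exceeds the mean by the factor `1 + scale / beta`, which is the overdispersion that the gamma-distributed triplet-specific parameters are meant to absorb; a plain Poisson model would force the variance to equal the mean.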

