Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling

ABSTRACT

Accurate annotation of protein coding regions is essential for understanding how genetic information is translated into function. We describe riboHMM, a new method that uses ribosome footprint data to accurately infer translated sequences. Applying riboHMM to human lymphoblastoid cell lines, we identified 7273 novel coding sequences, including 2442 translated upstream open reading frames. We observed an enrichment of footprints at inferred initiation sites after drug-induced arrest of translation initiation, validating many of the novel coding sequences. The novel proteins exhibit significant selective constraint in the inferred reading frames, suggesting that many are functional. Moreover, ~40% of bicistronic transcripts showed negative correlation in the translation levels of their two coding sequences, suggesting a potential regulatory role for these novel regions. Despite known limitations of mass spectrometry in detecting proteins expressed at low levels, we estimated a 14% validation rate. Our work significantly expands the set of known coding regions in humans.

DOI: http://dx.doi.org/10.7554/eLife.13328.001

Signature of coding function in translated sequences identified in pseudogenes. Nonsynonymous variants (orange) segregate at significantly lower frequencies than synonymous variants (blue) in those pseudogenes predicted to have a translated CDS. Pseudogenes with ribosome occupancy, but predicted to have no translated CDS, do not show any significant difference between the site frequency spectra of synonymous and nonsynonymous variants. DOI: http://dx.doi.org/10.7554/eLife.13328.020

Mentions: Starting with biallelic SNPs identified using whole genome sequences of 2504 individuals (Auton et al., 2015), we examined the set of SNPs falling within all novel mCDS (13,907 variants within 3096 novel mCDS). We labeled each SNP as synonymous or nonsynonymous with respect to the inferred CDS and show the cumulative distribution of minor allele frequencies (MAF) of each SNP class (Figure 5A). We observed that nonsynonymous SNPs have an excess of rare variants compared with synonymous SNPs (Mann-Whitney test; p-value = 1.08 × 10^-4), implying a difference in the intensity of purifying selection (Nielsen, 2005). This observed excess suggests that the novel mCDS are under significant constraint, consistent with functional peptides, albeit weaker than at annotated CDS. The mCDS identified within pseudogenes alone also showed a similar excess of rare variants among nonsynonymous SNPs (Mann-Whitney test; p-value = 5.6 × 10^-3). Such an excess was not observed for pseudogenes that had detectable ribosome occupancy but lacked a high-confidence inferred coding sequence (Figure 5—figure supplement 1); for these pseudogenes, the SNPs were labeled based on the reading frame of the parent gene. This highlights that ribosome occupancy alone is insufficient to identify translated sequences, and our method is able to leverage finer scale structure in ribosome footprint data to detect functional coding sequences.

Figure 5. Novel translated sequences show significant signatures of coding function. DOI: http://dx.doi.org/10.7554/eLife.13328.019

