Identification of evolutionarily conserved non-AUG-initiated N-terminal extensions in human coding sequences.

Ivanov IP, Firth AE, Michel AM, Atkins JF, Baranov PV - Nucleic Acids Res. (2011)

Bottom Line: We use evolutionary signatures of protein-coding sequences as an indicator of translation initiation upstream of annotated coding sequences. Our search identified novel conserved potential non-AUG-initiated N-terminal extensions in 42 human genes including VANGL2, FGFR1, KCNN4, TRPV6, HDGF, CITED2, EIF4G3 and NTF3, and also affirmed the conservation of known non-AUG-initiated extensions in 17 other genes. In several instances, we have been able to obtain independent experimental evidence of the expression of non-AUG-initiated products from the previously published literature and ribosome profiling data.


Affiliation: BioSciences Institute, University College Cork, Cork, Ireland. iivanov@genetics.utah.edu

ABSTRACT
In eukaryotes, it is generally assumed that translation initiation occurs at the AUG codon closest to the messenger RNA 5' cap. However, in certain cases, initiation can occur at codons differing from AUG by a single nucleotide, especially the codons CUG, UUG, GUG, ACG, AUA and AUU. While non-AUG initiation has been experimentally verified for a handful of human genes, the full extent to which this phenomenon is utilized--both for increased coding capacity and potentially also for novel regulatory mechanisms--remains unclear. To address this issue, and hence to improve the quality of existing coding sequence annotations, we developed a methodology based on phylogenetic analysis of predicted 5' untranslated regions from orthologous genes. We use evolutionary signatures of protein-coding sequences as an indicator of translation initiation upstream of annotated coding sequences. Our search identified novel conserved potential non-AUG-initiated N-terminal extensions in 42 human genes including VANGL2, FGFR1, KCNN4, TRPV6, HDGF, CITED2, EIF4G3 and NTF3, and also affirmed the conservation of known non-AUG-initiated extensions in 17 other genes. In several instances, we have been able to obtain independent experimental evidence of the expression of non-AUG-initiated products from the previously published literature and ribosome profiling data.

© Copyright Policy - creative-commons

Figure 2: Pipeline of RefSeq mRNA analysis for the identification of conserved 5′ CDS extensions (P5ECs). White boxes indicate annotated CDSs. Black boxes correspond to 5′ in-frame codon extensions up to the closest in-frame stop codon. Xs correspond to the deleted regions of human–mouse alignments prior to Ka/Ks analysis.

Mentions: A schematic outline of the pipeline for detecting potential 5′ extensions of CDS (P5ECs) is illustrated in Figure 2. A total of 46 500 human mRNA sequences were downloaded from the NCBI RefSeq database (35) on 12 February 2009 (release 33) using the RefSeq FTP site: ftp://ftp.ncbi.nih.gov/refseq/release/. We chose RefSeq annotations to minimize the effect of erroneous and inconsistent annotations as well as to avoid redundancy in our dataset. All sequences that have in-frame stop codons within 50 codons upstream of annotated AUG initiator codons were discarded. For the remaining mRNA sequences, the region surrounding the annotated AUG codon was translated from the nearest upstream in-frame stop codon to the 100th codon downstream of the AUG. The inclusion of the 5′-terminal 100 codons of the annotated CDS was necessary to ensure detection of the correct mouse orthologs in the following steps. This procedure resulted in a set of 3437 peptide sequences. This set of peptide sequences was then queried against the mouse RefSeq mRNA database using tblastn as implemented in the NCBI BLAST client netblast-2.2.19. The best hit for each peptide was considered a candidate ortholog. Those sequences that produced alignments spanning at least 50 codons upstream of the annotated CDS were extracted for further processing; 2194 human–mouse orthologous pairs were obtained. Sequences were conceptually translated, re-aligned using muscle (36) and back-translated using the t_coffee utility (37) to yield nucleotide alignments. A custom Perl script was designed to trim P5EC alignments prior to Ka/Ks calculation in the following way. Columns containing gaps in either the human or mouse sequence were removed. Because the sequences were aligned at the amino acid level, all gaps at the nucleotide level occur in multiples of 3 and therefore their removal does not affect the reading frame.
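The filtering and extraction step described above can be sketched in a few lines of Python. This is only an illustrative reconstruction, not the authors' script: the function names (`upstream_extension_codons`, `query_region`) are hypothetical, the mRNA is assumed to be a DNA-alphabet string with a known 0-based CDS start, and reaching the 5′ end of the mRNA without meeting a stop codon is treated as the extension boundary (an assumption, since the text does not say how such cases were handled).

```python
# Standard stop codons in the DNA alphabet used by RefSeq mRNA records.
STOPS = {"TAA", "TAG", "TGA"}


def upstream_extension_codons(mrna, cds_start):
    """Count in-frame sense codons upstream of the annotated ATG.

    Walks back from cds_start (0-based index of the A in ATG) one codon
    at a time until an in-frame stop codon or the 5' end is reached.
    """
    n = 0
    pos = cds_start - 3
    while pos >= 0 and mrna[pos:pos + 3].upper() not in STOPS:
        n += 1
        pos -= 3
    return n


def query_region(mrna, cds_start, min_upstream=50, downstream=100):
    """Return the nucleotide region to translate, or None if discarded.

    A sequence is discarded when an in-frame stop lies within
    min_upstream codons of the annotated ATG. Otherwise the region runs
    from the codon after the nearest upstream in-frame stop through the
    first `downstream` codons of the annotated CDS.
    """
    n = upstream_extension_codons(mrna, cds_start)
    if n < min_upstream:
        return None
    start = cds_start - 3 * n
    # Keep whole codons only if the mRNA is shorter than the window.
    cds_codons = min(downstream, (len(mrna) - cds_start) // 3)
    return mrna[start:cds_start + 3 * cds_codons]
```

The returned region would then be translated and used as the tblastn query; translation and the BLAST search itself are left out of this sketch.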
For alignments containing in-frame stop codons in the mouse sequence, columns corresponding to the stop codons and all columns upstream were removed. This was done based on the assumption that, in general, functionally important P5ECs and alternative translation initiation starts (ATISs) should be conserved between mouse and human sequences and that the presence of a stop codon in a mouse sequence indicates that the sequence is not translated in mouse. If, after trimming, the resulting alignment was shorter than 25 codons, then the corresponding mouse P5EC could not be longer than 25 codons. It would also be impractical to include such short alignments in further analysis. Therefore, only alignments with a length of at least 25 codons upstream of the annotated AUG (1193 alignments) were subjected to Ka/Ks analysis as implemented in the codeml program of the PAML package (38). The alignments were then scored based on Ka/Ks values, length of the pairwise alignments and identity at the protein level.
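The trimming rules described above (removal of gap columns, removal of everything at or upstream of a mouse in-frame stop, and the 25-codon minimum) can be illustrated with a minimal Python sketch. It is not the authors' Perl script: the function name `trim_extension` is hypothetical, and the inputs are assumed to be codon-aligned nucleotide strings of equal length that cover only the region upstream of the annotated AUG, with gaps written as `---`. Checking for mouse stops before dropping gap columns is also an assumption, made so that a stop aligned opposite a human gap is not missed.

```python
STOPS = {"TAA", "TAG", "TGA"}


def split_codons(seq):
    """Split a codon-aligned nucleotide string into 3-letter columns."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def trim_extension(human_aln, mouse_aln, min_codons=25):
    """Trim a human-mouse P5EC alignment prior to Ka/Ks calculation.

    Returns (human, mouse) trimmed nucleotide strings, or None when fewer
    than min_codons aligned codons remain.
    """
    if len(human_aln) != len(mouse_aln) or len(human_aln) % 3 != 0:
        raise ValueError("inputs must be codon-aligned and of equal length")

    columns = list(zip(split_codons(human_aln), split_codons(mouse_aln)))

    # Locate the 3'-most in-frame stop codon in the mouse sequence and
    # drop that column together with every column upstream of it.
    last_stop = -1
    for i, (_, mouse_codon) in enumerate(columns):
        if mouse_codon.upper() in STOPS:
            last_stop = i
    columns = columns[last_stop + 1:]

    # Remove columns with a gap in either sequence. Gaps come in whole
    # codons, so the reading frame is unaffected.
    columns = [(h, m) for h, m in columns if "-" not in h and "-" not in m]

    if len(columns) < min_codons:
        return None
    human = "".join(h for h, _ in columns)
    mouse = "".join(m for _, m in columns)
    return human, mouse
```

Alignments that survive this step would be passed to the Ka/Ks calculation; that call is not shown here.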

