Origins of De Novo Genes in Human and Chimpanzee.

Ruiz-Orera J, Hernandez-Rodriguez J, Chiva C, Sabidó E, Kondova I, Bontrop R, Marqués-Bonet T, Albà MM - PLoS Genet. (2015)

Bottom Line: Whereas many new genes arise by gene duplication, others originate at genomic regions that did not contain any genes or gene copies. This has resulted in the identification of over five thousand new multiexonic transcriptional events in human and/or chimpanzee that are not observed in the other species. Using comparative genomics, we show that the expression of these transcripts is associated with the gain of regulatory motifs upstream of the transcription start site (TSS) and of U1 snRNP sites downstream of the TSS.


Affiliation: Evolutionary Genomics Group, Hospital del Mar Research Institute (IMIM), Barcelona, Spain.

ABSTRACT
The birth of new genes is an important motor of evolutionary innovation. Whereas many new genes arise by gene duplication, others originate at genomic regions that did not contain any genes or gene copies. Some of these newly expressed genes may acquire coding or non-coding functions and be preserved by natural selection. However, it is still unclear how prevalent de novo gene emergence is and which mechanisms underlie it. In order to obtain a comprehensive view of this process, we have performed in-depth sequencing of the transcriptomes of four mammalian species (human, chimpanzee, macaque, and mouse) and subsequently compared the assembled transcripts and the corresponding syntenic genomic regions. This has resulted in the identification of over five thousand new multiexonic transcriptional events in human and/or chimpanzee that are not observed in the other species. Using comparative genomics, we show that the expression of these transcripts is associated with the gain of regulatory motifs upstream of the transcription start site (TSS) and of U1 snRNP sites downstream of the TSS. In general, these transcripts show little evidence of purifying selection, suggesting that many of them are not functional. However, we find signatures of selection in a subset of de novo genes that have evidence of protein translation. Taken together, the data support a model in which frequently occurring new transcriptional events in the genome provide the raw material for the evolution of new proteins.


pgen.1005721.g004: Coding potential of de novo genes. (a-d) ORF length and coding score for ORFs in different sequence types. De novo gene, longest ORF in de novo transcripts (n = 1,933). CodRNA (all), annotated coding sequences from Ensembl v.75 (n = 8,462). CodRNA (short), annotated coding sequences sampled so as to have the same transcript length distribution as de novo transcripts (n = 1,952). Intron, longest ORF in intronic sequences from annotated genes sampled so as to have the same transcript length distribution as de novo transcripts (n = 5,000). Proteogenomics, ORFs in de novo transcripts with peptide evidence by mass spectrometry. Ribosome profiling, ORFs in de novo transcripts with ribosome association evidence in brain. (e) Example of hominoid-specific de novo gene with evidence of protein expression from proteogenomics, with RNA-Seq read profiles in two human samples. (f) Example of hominoid-specific de novo gene with RNA-Seq and ribosome profiling read profiles. Predicted coding sequences are highlighted with red boxes and the putative encoded protein sequences displayed.
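
The caption describes control sets sampled to follow the transcript length distribution of the de novo transcripts. The source does not say how this sampling was done; the sketch below shows one common way to do it, by binning lengths and drawing the same number of sequences per bin. The function name, the bin size of 100 nt and the fixed seed are illustrative assumptions, not details from the paper.

```python
import random
from collections import Counter, defaultdict


def length_matched_sample(target_lengths, pool, bin_size=100, seed=0):
    """Sample IDs from `pool` so their binned lengths follow `target_lengths`.

    target_lengths: lengths of the reference set (e.g. de novo transcripts).
    pool: dict mapping sequence ID -> length (e.g. annotated transcripts).
    bin_size and seed are hypothetical defaults chosen for illustration.
    """
    rng = random.Random(seed)

    # Group pool sequences by length bin.
    pool_bins = defaultdict(list)
    for seq_id, length in sorted(pool.items()):
        pool_bins[length // bin_size].append(seq_id)

    # Count how many sequences the reference set has in each bin.
    needed = Counter(length // bin_size for length in target_lengths)

    sampled = []
    for bin_index, n in sorted(needed.items()):
        candidates = pool_bins.get(bin_index, [])
        # Sample without replacement; take fewer if the bin is too small.
        sampled.extend(rng.sample(candidates, min(n, len(candidates))))
    return sampled


if __name__ == "__main__":
    reference = [150, 160, 250]
    candidates = {"a": 120, "b": 180, "c": 199, "d": 230, "e": 900}
    print(length_matched_sample(reference, candidates))
```

When a bin in the pool holds fewer sequences than the reference set requires, this sketch simply returns fewer sequences, so the match is only approximate in sparse bins.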

Mentions: Most de novo genes were not annotated in the databases and their coding status was unclear. We analyzed two coding properties in de novo genes as well as in other sequences: ORF length and ORF coding score. The latter score was based on hexanucleotide frequencies in bona fide sets of coding and non-coding sequences (see Methods). The median length of the longest ORF of each de novo gene was 52 amino acids. De novo predicted proteins were shorter than proteins encoded by annotated coding RNAs (codRNA) with the same transcript length distribution as the set of de novo genes, and comparable to ORFs from similarly sampled intronic sequences (Fig 4a and 4b). In contrast, the coding score of the longest ORF was higher in de novo genes than in intronic ORFs (Wilcoxon test, p-value < 10^-10) and comparable to the score for proteins shorter than 100 amino acids in the set of annotated protein-coding genes.
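
The two properties discussed above, longest-ORF length and a hexanucleotide-based coding score, can be illustrated with a short sketch. The exact definitions are in the paper's Methods, which are not reproduced here, so the code makes its own assumptions: an ORF runs from ATG to the first in-frame stop codon on the forward strand, and the score is the mean log ratio of hexamer frequencies in a coding versus a non-coding training set, with a pseudocount. All function names are hypothetical.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Longest ATG-to-stop ORF over the three forward frames (stop excluded).

    Assumption: the ORF definition here is illustrative and may differ
    from the one used in the paper.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i - start > len(best):
                    best = seq[start:i]
                start = None
    return best


def hexamer_counts(seqs):
    """Count all overlapping hexamers in a collection of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    return counts


def hexamer_log_odds(coding_seqs, noncoding_seqs, pseudocount=1.0):
    """Return a function mapping a hexamer to log(P_coding / P_noncoding)."""
    cod = hexamer_counts(coding_seqs)
    non = hexamer_counts(noncoding_seqs)
    cod_total = sum(cod.values()) + pseudocount * 4 ** 6
    non_total = sum(non.values()) + pseudocount * 4 ** 6

    def score(hexamer):
        p_cod = (cod[hexamer] + pseudocount) / cod_total
        p_non = (non[hexamer] + pseudocount) / non_total
        return math.log(p_cod / p_non)

    return score


def coding_score(orf, score):
    """Mean hexamer log-odds over an ORF; 0.0 if shorter than six bases."""
    orf = orf.upper()
    hexamers = [orf[i:i + 6] for i in range(len(orf) - 5)]
    if not hexamers:
        return 0.0
    return sum(score(h) for h in hexamers) / len(hexamers)


if __name__ == "__main__":
    transcript = "ATGAAATAGCCATGCCCGGGTTTTGA"
    orf = longest_orf(transcript)
    # Toy training sets, for illustration only.
    score = hexamer_log_odds(["ATGGCCGCCGCC" * 3], ["ATATATATATAT" * 3])
    print(orf, len(orf) // 3, coding_score(orf, score))
```

With scores computed for two groups of ORFs, a rank-based comparison such as the Wilcoxon test mentioned in the text could then be applied, for example with `scipy.stats.mannwhitneyu` for unpaired samples.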

