Identification and characterization of pseudogenes in the rice gene complement.

Thibaud-Nissen F, Ouyang S, Buell CR - BMC Genomics (2009)

Bottom Line: The Osa1 Genome Annotation of rice (Oryza sativa L. ssp. japonica cv. Nipponbare) is the product of a semi-automated pipeline that does not explicitly predict pseudogenes. Among the 816 pseudogenes for which a probable origin could be determined, 75% originated from gene duplication events while 25% were the result of retrotransposition events. Finally, F-box proteins, BTB/POZ proteins, terpene synthases, chalcone synthases and cytochrome P450 protein families were found to harbor large numbers of pseudogenes.


Affiliation: The J. Craig Venter Institute, 9712 Medical Center Dr, Rockville, MD 20850, USA. thibaudf@ncbi.nlm.nih.gov

ABSTRACT

Background: The Osa1 Genome Annotation of rice (Oryza sativa L. ssp. japonica cv. Nipponbare) is the product of a semi-automated pipeline that does not explicitly predict pseudogenes. As such, it is likely to mis-annotate pseudogenes as functional genes. A total of 22,033 gene models within the Osa1 Release 5 were investigated as potential pseudogenes as these genes exhibit at least one feature potentially indicative of pseudogenes: lack of transcript support, short coding region, long untranslated region, or, for genes residing within a segmentally duplicated region, lack of a paralog or significantly shorter corresponding paralog.

Results: A total of 1,439 pseudogenes, identified among genes with pseudogene features, were characterized by similarity to fully-supported gene models and the presence of frameshifts or premature translational stop codons. Significant difference in the length of duplicated genes within segmentally-duplicated regions was the optimal indicator of pseudogenization. Among the 816 pseudogenes for which a probable origin could be determined, 75% originated from gene duplication events while 25% were the result of retrotransposition events. A total of 12% of the pseudogenes were expressed. Finally, F-box proteins, BTB/POZ proteins, terpene synthases, chalcone synthases and cytochrome P450 protein families were found to harbor large numbers of pseudogenes.

Conclusion: These pseudogenes still have a detectable open reading frame and are thus distinct from pseudogenes detected within intergenic regions which typically lack definable open reading frames. Families containing the highest number of pseudogenes are fast-evolving families involved in ubiquitination and secondary metabolism.


Figure 2: Distribution of log(Ka/Ks) ratios for the pseudogenes (solid line) and for a control set of functional paralogous genes (dotted line).

Mentions: Because they are non-functional, pseudogenes are not expected to be under evolutionary constraint and should instead evolve neutrally. Thus, pseudogenes should have a synonymous substitution rate (Ks) roughly equal to the non-synonymous substitution rate (Ka), giving a Ka/Ks close to 1, while functional genes should have a Ka/Ks much lower than 1, since non-synonymous mutations are generally disadvantageous and are selected against (purifying selection). Maximum likelihood estimates of Ka and Ks were calculated from the alignments of the pseudogenes to their corresponding parents. We found that the Ka/Ks distribution was log-normal with a geometric mean of 0.32. This is lower than the expected value of 1, but can be explained in part by the fact that each pseudogene is compared to its "sibling" rather than to its true ancestral parent. This approximation inflates the Ks value and therefore decreases the Ka/Ks. Furthermore, we estimated the Ka/Ks for paralogous functional genes within segmentally duplicated regions [21] in the same manner as for the pseudogenes and compared the Ka/Ks distribution of the pseudogenes to that of this control set (geometric mean of 0.14). The two distributions were found to be significantly different (p < 10^-15, Welch t-test) (Figure 2).
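
The comparison described above can be illustrated with a minimal sketch. It assumes that the Welch t-test is applied to log-transformed Ka/Ks values, which seems plausible given the log-normal distribution but is not stated in the text. The Ka/Ks values and the function names `geometric_mean` and `welch_t` are hypothetical and serve only as illustration; they are not the authors' data or code. Only the t statistic and the Welch-Satterthwaite degrees of freedom are computed, since a p-value would require a t-distribution routine from an external library.

```python
import math
from statistics import mean, variance


def geometric_mean(ratios):
    """Geometric mean: exponential of the mean of the natural logs."""
    return math.exp(mean(math.log(r) for r in ratios))


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, this does not assume equal variances.
    """
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical Ka/Ks values, invented for illustration only.
pseudogene_ratios = [0.20, 0.30, 0.35, 0.50, 0.40]
control_ratios = [0.08, 0.12, 0.15, 0.20, 0.18]

# Assumption: the test is run on log(Ka/Ks), as plotted in Figure 2.
log_pseudo = [math.log(r) for r in pseudogene_ratios]
log_control = [math.log(r) for r in control_ratios]

t_stat, dof = welch_t(log_pseudo, log_control)
print("geometric means:", geometric_mean(pseudogene_ratios),
      geometric_mean(control_ratios))
print("Welch t:", t_stat, "degrees of freedom:", dof)
```

A positive t statistic here indicates that the pseudogene set has the higher mean log(Ka/Ks), which is the direction reported in the text (geometric mean of 0.32 versus 0.14 for the control set).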

