The COG database: an updated version includes eukaryotes.

Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA - BMC Bioinformatics (2003)

Bottom Line: Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included in the KOGs; addition of new eukaryotic genomes is expected to result in a substantial increase in the coverage of eukaryotic genomes with KOGs. The conserved portion of the KOG set (approximately 20% of the KOGs) is much greater than the ubiquitous portion of the COG set (approximately 1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes.


Affiliation: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA. tatusov@ncbi.nlm.nih.gov

ABSTRACT

Background: The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies.

Results: We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes, and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant (Arabidopsis thaliana), two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or approximately 54% of the 110,655 analyzed eukaryotic gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included in the KOGs; addition of new eukaryotic genomes is expected to result in a substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of approximately 20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (approximately 1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes.

Conclusion: The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.
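
The coverage percentages quoted in the Results follow directly from the protein counts given there. The short Python sketch below recomputes them as a sanity check. The function and variable names are illustrative assumptions, not part of the COG software; only the four counts come from the abstract.

```python
# Protein counts quoted in the abstract (the only values taken from the source).
COG_PROTEINS_IN_CLUSTERS = 138_458   # proteins assigned to COGs
COG_PROTEINS_TOTAL = 185_505         # predicted proteins in the 66 genomes
KOG_PROTEINS_IN_CLUSTERS = 59_838    # proteins assigned to KOGs
KOG_PROTEINS_TOTAL = 110_655         # analyzed eukaryotic gene products


def coverage_percent(in_clusters: int, total: int) -> float:
    """Percentage of proteins that were assigned to some cluster."""
    return 100.0 * in_clusters / total


cog_coverage = coverage_percent(COG_PROTEINS_IN_CLUSTERS, COG_PROTEINS_TOTAL)
kog_coverage = coverage_percent(KOG_PROTEINS_IN_CLUSTERS, KOG_PROTEINS_TOTAL)

print(f"COG coverage: {cog_coverage:.1f}%")
print(f"KOG coverage: {kog_coverage:.1f}%")
```

Rounded to whole percentages, the two values agree with the 75% and approximately 54% stated above.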



Figure 5: Examples of phyletic pattern search. (A) COGs represented in Encephalitozoon cuniculi but missing in the two yeasts. (B) COGs represented in Yersinia pestis but not in other Proteobacteria or eukaryotes. The sets of species included in COGs are color-coded as follows (from left to right): yellow, archaea; purple, eukaryotes; green, miscellaneous bacteria, including hyperthermophiles, cyanobacteria, Fusobacterium, and Deinococcus; dark yellow, actinobacteria; turquoise, low-GC Gram-positive bacteria (except for mycoplasmas); light blue, Gamma-proteobacteria; dark blue, Beta- and Epsilon-proteobacteria; dark gray, Alpha-proteobacteria; green, chlamydia and spirochetes; dark green, mycoplasmas. The functional categories, designated as in Fig. 4, are also color-coded.

Mentions: Phyletic pattern search can be employed for preliminary assessment of specific functional and evolutionary hypotheses. With the increased number of included genomes and enhanced capabilities of the phyletic pattern search tool, this analysis becomes particularly informative. Below we discuss straightforward examples of its use. Figure 5a shows the results of querying the COG database for COGs that are represented in the microsporidian parasite Encephalitozoon cuniculi but not in the two yeast species. Given the dramatic genome reduction seen in the microsporidium [38], it is not unexpected that the query retrieves only a small set of 13 COGs. This phyletic pattern can be explained either by loss of ancestral eukaryotic genes in yeast or by acquisition of genes by E. cuniculi via horizontal gene transfer. At least in some cases, further examination of the phyletic patterns of the retrieved COGs suggests the most likely scenario. Thus, COGs 1078, 1258, 1690, and 2263 are represented in all archaea, but are either missing in bacteria or present in only a minority of species (Fig. 5a). Therefore, these COGs are most likely part of the ancestral archaeo-eukaryotic heritage [49] and might have been lost in yeasts; the respective proteins are known or predicted to be involved in translation or RNA modification, which is compatible with this evolutionary scenario. In contrast, COG3202 seems to be a likely case of horizontal gene transfer. Remarkably, the proteins in this COG are ADP/ATP translocases, which seem to be a hallmark of intracellular parasitism (or symbiosis), allowing the respective organisms to tap into the ATP supplies of the host cell [50,51]. Indeed, this COG is shared by the eukaryotic (E. cuniculi) and bacterial (Chlamydia and Rickettsia) intracellular parasites, the only exception being a diverged member of the COG found in the plant pathogenic bacterium Xylella fastidiosa (Fig. 5a).
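
The kind of query described above (clusters present in one species but absent from others) can be sketched as a simple set filter. The following Python example is only an illustration of the idea, not the actual COG search tool: the function name `phyletic_search`, the cluster IDs and the species memberships are all hypothetical toy values invented for this sketch.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical toy table: cluster ID -> species with at least one member.
# IDs and memberships are invented for illustration; they are NOT COG data.
TOY_PATTERNS: Dict[str, FrozenSet[str]] = {
    "toy1": frozenset({"E. cuniculi", "S. cerevisiae", "S. pombe"}),
    "toy2": frozenset({"E. cuniculi", "archaeon X"}),
    "toy3": frozenset({"S. cerevisiae", "bacterium Y"}),
    "toy4": frozenset({"E. cuniculi", "S. pombe"}),
}


def phyletic_search(
    patterns: Dict[str, FrozenSet[str]],
    present: Iterable[str],
    absent: Iterable[str],
) -> List[str]:
    """Return sorted IDs of clusters that contain every species in
    `present` and none of the species in `absent`."""
    required = set(present)
    forbidden = set(absent)
    return sorted(
        cluster_id
        for cluster_id, species in patterns.items()
        if required <= species and not (forbidden & species)
    )


# Analogue of the Figure 5a query: in E. cuniculi, missing in both yeasts.
hits = phyletic_search(
    TOY_PATTERNS,
    present=["E. cuniculi"],
    absent=["S. cerevisiae", "S. pombe"],
)
print(hits)
```

With the toy table, only `toy2` satisfies the query. Inspecting which other species occur in each retrieved cluster is then the step that, in the text above, distinguishes probable gene loss in yeasts from probable horizontal gene transfer.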

