DODO: an efficient orthologous genes assignment tool based on domain architectures. Domain based ortholog detection.

Chen TW, Wu TH, Ng WV, Lin WC - BMC Bioinformatics (2010)

Bottom Line: Starting from domain information, DODO first assigns proteins to groups according to their domain architectures and then identifies orthologs within those groups at much reduced complexity. DODO is shown to detect orthologs between two genomes in considerably less time than the traditional reciprocal best hits method, and the time saving becomes more significant when a large number of genomes is analyzed. The output of DODO is highly comparable with other known ortholog databases.


Affiliation: Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan.

ABSTRACT

Background: Orthologs are genes derived from the same ancestral gene locus through speciation events. Orthologous proteins usually have similar sequences and perform comparable biological functions, so ortholog identification is useful for annotating newly sequenced genomes. With the rapidly increasing number of sequenced genomes, constructing or updating ortholog relationships between all genomes requires considerable effort and computation time. In addition, elucidating ortholog relationships between distantly related genomes is challenging because of their lower sequence similarity. An efficient ortholog detection method that can handle a large number of distantly related genomes is therefore desired.

Results: In this study, an efficient ortholog detection pipeline, DODO (DOmain based Detection of Orthologs), was created on the basis of domain architectures. Because it relies on domain composition, which is usually directly related to protein function, DODO can facilitate ortholog detection across distantly related genomes. DODO works in two main steps. Starting from domain information, it first assigns proteins to groups according to their domain architectures, and then identifies orthologs within those groups at much reduced complexity. DODO is shown to detect orthologs between two genomes in considerably less time than the traditional reciprocal best hits method, and the time saving becomes more significant when a large number of genomes is analyzed. The output of DODO is highly comparable with other known ortholog databases.
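The two steps described above can be illustrated with a minimal Python sketch. It is not the DODO implementation: the function names, the toy protein identifiers, the similarity scores, and the choice to treat an architecture as an ordered tuple of domain names are all assumptions made for illustration. The source only states that proteins are grouped by domain architecture and that orthologs are then sought within each group.

```python
from collections import defaultdict

def group_by_architecture(domains_by_protein):
    """Step 1: map each domain architecture to the proteins that carry it.
    An architecture is modeled here as an ordered tuple of domain names (assumption)."""
    groups = defaultdict(list)
    for protein, domains in domains_by_protein.items():
        groups[tuple(domains)].append(protein)
    return dict(groups)

def reciprocal_best_hits(proteins_a, proteins_b, score):
    """Step 2: return (a, b) pairs that are each other's best-scoring hit."""
    best_for_a = {a: max(proteins_b, key=lambda b: score(a, b)) for a in proteins_a}
    best_for_b = {b: max(proteins_a, key=lambda a: score(a, b)) for b in proteins_b}
    return [(a, b) for a, b in best_for_a.items() if best_for_b[b] == a]

def detect_orthologs(domains_a, domains_b, score):
    """Run reciprocal best hits only inside groups that share an architecture."""
    groups_a = group_by_architecture(domains_a)
    groups_b = group_by_architecture(domains_b)
    pairs = []
    for arch in groups_a.keys() & groups_b.keys():
        pairs.extend(reciprocal_best_hits(groups_a[arch], groups_b[arch], score))
    return sorted(pairs)

# Toy data (hypothetical identifiers and scores, not taken from the paper).
human = {"h1": ["Nop16"], "h2": ["Pkinase"], "h3": ["Pkinase"]}
mouse = {"m1": ["Nop16"], "m2": ["Pkinase"], "m3": ["Pkinase"]}
toy_scores = {("h1", "m1"): 50, ("h2", "m2"): 90, ("h2", "m3"): 60,
              ("h3", "m2"): 70, ("h3", "m3"): 80}

def toy_score(a, b):
    """Look up a pairwise similarity score; unlisted pairs score 0."""
    return toy_scores.get((a, b), 0)

orthologs = detect_orthologs(human, mouse, toy_score)
print(orthologs)
```

In this sketch the comparisons are confined to proteins sharing an architecture, so the number of pairwise comparisons is the sum of the per-group products rather than the product of the two genome sizes, which is one way to read the "much reduced complexity" mentioned above.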

Conclusions: DODO provides a new, efficient pipeline for detecting orthologs in a large number of genomes. In addition, a database established with DODO is easier to maintain and can be updated with relatively little effort. The DODO pipeline can be downloaded from http://140.109.42.19:16080/dodo_web/home.htm.


Figure 2: Examples of putative ortholog groups found by DODO. Two examples of ortholog groups found with DODO that are not recorded in InParanoid. The alignments were generated by CLC free Workbench version 4.0.2. Consensus residues are shown in black and dissimilar residues are shown in blue. (A) These two sequences are clustered together by DODO and both are reported to have four different domains: Transketolase_N/E1_dh/Transket_pyr/Transketolase_C. (B) These two sequences are the only proteins containing the Nop16 domain in the human and mouse genomes.

Mentions: InParanoid [7] is a well-known database built on primary sequence comparison that includes in-paralogs in its ortholog clusters. Among the 21,673 human and 23,497 mouse protein sequences downloaded from the InParanoid website [7], DODO identified 14,128 ortholog groups, 95.8% of which have the same classification as in InParanoid. Approximately 16.6% of the orthologs recorded in InParanoid were not found in our results. Most of these (98%) were composed of proteins with different domain architectures as identified by RPS-BLAST. Such orthologs with apparently different domain architectures may have arisen through domain rearrangement events during protein evolution, or one or more of their domains may have fallen below the RPS-BLAST e-value cutoff. Our method identified 244 ortholog groups not reported in InParanoid. Most of them are members of large protein families or are short proteins (47% of them are shorter than 300 amino acids). Ortholog discovery within large protein families can introduce complications that obscure true orthology, since true orthologs may not be reciprocally most similar in their primary sequences. One such example is shown in Figure 2A. Here, we have two putative orthologs containing the same four-domain architecture. The BLASTP procedure used in InParanoid did not identify them as reciprocal best hits (RBH) when searching through the entire genomes, because their primary sequence similarity is relatively low compared with some other proteins. As a result, both proteins are omitted from the InParanoid data. However, given that both contain the same four domains, they are likely to be closely related functionally. When domain-architecture clustering is applied prior to the RBH procedure, as we did, the orthologous relationship between them can be recovered. In addition, other ortholog pairs we discovered are short sequences.
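The situation described for Figure 2A, where a pair is missed by a whole-genome RBH search but recovered once the search is restricted to one domain architecture, can be reproduced on toy data. The following Python sketch is illustrative only: the identifiers (`hA`, `hX`, `mA`, `mX`) and the scores are hypothetical and do not come from the paper or from InParanoid.

```python
def rbh(proteins_a, proteins_b, score):
    """Return sorted (a, b) pairs that are reciprocal best hits under score(a, b)."""
    pairs = []
    for a in proteins_a:
        # Best hit of a among the candidates in the other genome.
        b = max(proteins_b, key=lambda x: score(a, x))
        # Keep the pair only if a is also the best hit of b.
        if max(proteins_a, key=lambda x: score(x, b)) == a:
            pairs.append((a, b))
    return sorted(pairs)

# Hypothetical scores: hA and mA share an architecture but score lower with
# each other (40) than hA does with the unrelated, high-scoring protein mX (55).
toy_scores = {("hA", "mA"): 40, ("hA", "mX"): 55,
              ("hX", "mA"): 10, ("hX", "mX"): 90}

def toy_score(a, b):
    """Look up a pairwise similarity score; unlisted pairs score 0."""
    return toy_scores.get((a, b), 0)

# Whole-genome search: hA's best hit is mX, but mX's best hit is hX,
# so hA and mA are left without an ortholog.
whole_genome = rbh(["hA", "hX"], ["mA", "mX"], toy_score)

# Search restricted to the proteins sharing one architecture: the pair is recovered.
within_group = rbh(["hA"], ["mA"], toy_score)

print(whole_genome)
print(within_group)
```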
The two sequences shown in Figure 2B are putative orthologs that differ in protein length. Both sequences contain the Nop16 domain. A Nop16-containing protein is identified exactly once in each of the human and mouse genomes; therefore, the two sequences are very likely to be orthologs. We checked the BLASTP results from InParanoid and found that these two genes are RBH. However, InParanoid requires the matched region to be longer than 50% of the sequences, in order to avoid matching at the domain level instead of finding real ortholog pairs [7,20]. This might be why these orthologs were missed by InParanoid, whereas we were able to discover them here.
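A length-coverage rule of the kind described above can be sketched as follows. This is a simplified reading of the sentence "the matched region must be longer than 50% of the sequences", assuming the threshold is applied to both sequences; the exact rule used by InParanoid may differ. The function name and the lengths in the example are hypothetical and are not the lengths of the proteins in Figure 2B.

```python
def passes_coverage(aligned_len, len_a, len_b, min_fraction=0.5):
    """Return True if the aligned region is longer than min_fraction
    of each of the two sequences (assumed interpretation of the rule)."""
    return aligned_len > min_fraction * len_a and aligned_len > min_fraction * len_b

# Hypothetical pair with unequal lengths: the 120-residue match covers more
# than half of the short protein (178) but not of the long one (300).
short_match = passes_coverage(aligned_len=120, len_a=178, len_b=300)

# A longer match over the same pair clears the threshold on both sequences.
long_match = passes_coverage(aligned_len=200, len_a=178, len_b=300)

print(short_match, long_match)
```

Under this reading, a reciprocal best hit between proteins of very different lengths can be discarded by the coverage filter even when both proteins carry the same single domain, which is consistent with the explanation offered for Figure 2B.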

