Reassessing domain architecture evolution of metazoan proteins: major impact of errors caused by confusing paralogs and epaktologs.

Nagy A, Bányai L, Patthy L - Genes (Basel) (2011)

Bottom Line: Based on these findings, we suggest that earlier genome-scale studies based on comparison of predicted (frequently mispredicted) protein sequences may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. In this manuscript we examine the impact of confusing paralogous and epaktologous multidomain proteins (i.e., those that are related only through the independent acquisition of the same domain types) on conclusions drawn about DA evolution of multidomain proteins in Metazoa. Our findings caution that earlier studies based on analysis of datasets of protein families that were contaminated with epaktologs may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins.


Affiliation: Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest H-1113, Hungary. nagya@enzim.hu.

ABSTRACT
In the accompanying paper (Nagy, Szláma, Szarka, Trexler, Bányai, Patthy, Reassessing Domain Architecture Evolution of Metazoan Proteins: Major Impact of Gene Prediction Errors) we showed that, for UniProtKB/TrEMBL, RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences of metazoan species, the contribution of erroneous (incomplete, abnormal, mispredicted) sequences to domain architecture (DA) differences of orthologous proteins might be greater than that of true gene rearrangements. Based on these findings, we suggest that earlier genome-scale studies based on comparison of predicted (frequently mispredicted) protein sequences may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. In this manuscript we examine the impact of confusing paralogous and epaktologous multidomain proteins (i.e., those that are related only through the independent acquisition of the same domain types) on conclusions drawn about DA evolution of multidomain proteins in Metazoa. To estimate the contribution of this type of error we have used as reference UniProtKB/Swiss-Prot sequences from protein families with well-characterized evolutionary histories. We have used two types of paralogy-group construction procedures and monitored the impact of various parameters on the separation of true paralogs from epaktologs in correctly annotated Swiss-Prot entries of multidomain proteins. Our studies have shown that, although public protein family databases are contaminated with epaktologs, analysis of the structure of sequence similarity networks of multidomain proteins provides an efficient means for the separation of epaktologs and paralogs. We have also demonstrated that confusing paralogous and epaktologous multidomain proteins, that is, contaminating protein families with epaktologs, significantly increases the apparent rate of DA change in Metazoa and introduces a positional bias inasmuch as it increases the proportion of terminal over internal DA differences. Our findings caution that earlier studies based on analysis of datasets of protein families that were contaminated with epaktologs may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. A reassessment of the DA evolution of multidomain proteins is presented in an accompanying paper [1].



Figure 12 (f12-genes-02-00516): Analysis of the positional distribution of DA differences in clusters of human Swiss-Prot proteins defined through comparison with RefSeq proteomes of various species. The ordinate shows the proportion of pair-wise comparisons within clusters where the pairs differ in DA. N-terminal differences (blue rectangles), C-terminal differences (red rectangles), internal differences (green rectangles), tandem duplications (black rectangles). (a) Positional distribution of DA differences for type 1 transitions; (b) Positional distribution of DA differences for type 2 transitions; and (c) Positional distribution of DA differences for type 3 transitions. Note that in the case of type 3 transitions the proportion of internal DA differences is comparable to that of N-terminal or C-terminal changes only in the case of chordate species. On the abscissa the species are listed in order of decreasing evolutionary distance from Homo sapiens; the abscissa thus has a time dimension, but the distances between species are not drawn to scale. Abbreviations on the abscissa: Ta - Trichoplax adhaerens, Nv - Nematostella vectensis, Hm - Hydra magnipapillata, Ce - Caenorhabditis elegans, Cb - Caenorhabditis briggsae, Dm - Drosophila melanogaster, Dp - Drosophila pseudoobscura, Ds - Drosophila simulans, Sp - Strongylocentrotus purpuratus, Bf - Branchiostoma floridae, Ci - Ciona intestinalis, Dr - Danio rerio, Xt - Xenopus tropicalis, Gg - Gallus gallus, Mm - Mus musculus.

Mentions: When we analyzed the positional distribution of DA changes for type 1, type 2 and type 3 transitions within the paralogous clusters defined by the proteomes of the various species, we noted a slight preference for N-terminal over C-terminal DA differences, irrespective of the species used to cluster paralogs (Table S8; Figure 12a and Figure 12b).
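
The quantity plotted on the ordinate of Figure 12 (the proportion of pair-wise comparisons within a cluster that differ in DA, split by position) can be illustrated with a minimal Python sketch. This is not the authors' procedure: it assumes each DA is given as a tuple of domain names in N- to C-terminal order, the function names are hypothetical, and the "complex" class (no shared domain at either end) is an addition for completeness rather than a category from the figure. Shared domains are matched greedily from the N-terminus, so some ambiguous pairs could arguably be classified differently.

```python
from collections import Counter
from itertools import combinations, groupby


def collapse_repeats(da):
    """Collapse runs of the same domain, e.g. (A, B, B) -> (A, B)."""
    return tuple(name for name, _ in groupby(da))


def classify_da_difference(da1, da2):
    """Return the positional class of the difference between two DAs.

    DAs are sequences of domain names in N- to C-terminal order.
    Returns None when the two architectures are identical.
    """
    da1, da2 = tuple(da1), tuple(da2)
    if da1 == da2:
        return None
    # Same domain order, different copy number of a repeated domain.
    if collapse_repeats(da1) == collapse_repeats(da2):
        return "tandem duplication"

    # Length of the shared N-terminal run of domains.
    prefix = 0
    for x, y in zip(da1, da2):
        if x != y:
            break
        prefix += 1

    # Length of the shared C-terminal run in what remains.
    rest1, rest2 = da1[prefix:], da2[prefix:]
    suffix = 0
    for x, y in zip(reversed(rest1), reversed(rest2)):
        if x != y:
            break
        suffix += 1

    if prefix and suffix:
        return "internal"    # both ends shared, middle differs
    if prefix:
        return "C-terminal"  # N-terminal part shared
    if suffix:
        return "N-terminal"  # C-terminal part shared
    return "complex"         # assumption: not a category from the figure


def positional_distribution(cluster):
    """Proportion of pairwise comparisons in a cluster per difference class."""
    pairs = list(combinations(cluster, 2))
    counts = Counter(classify_da_difference(a, b) for a, b in pairs)
    counts.pop(None, None)  # identical pairs count in the denominator only
    return {kind: n / len(pairs) for kind, n in counts.items()}
```

Applied to one cluster of paralogs, `positional_distribution` returns a dictionary mapping each difference class to its share of all pair-wise comparisons in that cluster; pairs with identical DAs contribute only to the denominator.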

