A phylogeny-based benchmarking test for orthology inference reveals the limitations of function-based validation.

Trachana K, Forslund K, Larsson T, Powell S, Doerks T, von Mering C, Bork P - PLoS ONE (2014)

Bottom Line: Herein, we used this dataset to demonstrate 1) why a manually curated, phylogeny-based dataset is more appropriate for benchmarking orthology than other popular practices and 2) how it guides database design and parameterization through careful error quantification. We also examined how our dataset can instruct the selection of a "core" species repertoire to improve detection accuracy. We conclude that including more genomes at the proper evolutionary distances can influence the overall quality of orthology detection.


Affiliation: Institute for Systems Biology, Seattle, WA, United States of America.

ABSTRACT
Accurate orthology prediction is crucial for many applications in the post-genomic era. The lack of broadly accepted benchmark tests precludes a comprehensive analysis of orthology inference. So far, functional annotation between orthologs serves as a performance proxy. However, this violates the fundamental principle of orthology as an evolutionary definition, while it is often not applicable due to limited experimental evidence for most species. Therefore, we constructed high quality "gold standard" orthologous groups that can serve as a benchmark set for orthology inference in bacterial species. Herein, we used this dataset to demonstrate 1) why a manually curated, phylogeny-based dataset is more appropriate for benchmarking orthology than other popular practices and 2) how it guides database design and parameterization through careful error quantification. More specifically, we illustrate how function-based tests often fail to identify false assignments, misjudging the true performance of orthology inference methods. We also examined how our dataset can instruct the selection of a "core" species repertoire to improve detection accuracy. We conclude that including more genomes at the proper evolutionary distances can influence the overall quality of orthology detection. The curated gene families, called Reference Orthologous Groups, are publicly available at http://eggnog.embl.de/orthobench2.


Figure 1. Flowchart for generating reference orthologous groups. The initial material (INPUT) is the set of homologs in 385 bacterial species for 50 error-prone families (Table S1 in Data S1). The homologs were chosen from the Clusters of Orthologous Groups (COGs), as these are inferred in the eggNOG database [10]. Depending on the complexity of the gene family, we performed from two (ancient, well-conserved families) up to five (genus-specific, highly versatile families) rounds of curation. We followed four distinct steps in every round. The homologous (or, in later rounds, orthologous) sequences were aligned (Step 1) and the multiple sequence alignments (MSAs) were used to build phylogenetic trees (Step 2). Each family tree was compared to a well-accepted species tree [32]. In the first round, the species topology was used to define the boundaries of gene sub-families (sub-trees that include sequences from all three clades: α-, β- and γ-Proteobacteria), while in the following rounds, bona fide orthologous groups were defined relative to the LCA of γ-Proteobacteria. The protein sequences of the members of each orthologous group were re-aligned (Step 3). The new alignments were used to build Hidden Markov Models (HMMs), search the 385 bacterial genomes and define new homologs (Step 4). In the final round, we finalized the reference orthologous groups (OUTPUT) and, through manual inspection, identified and annotated HGT events and outgroups.
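The first-round boundary rule in the caption (sub-trees that include sequences from all three clades) can be illustrated with a small sketch. This is not the authors' curation procedure, which was manual; it only shows the clade-coverage test on a toy rooted tree. The nested-tuple tree encoding, the "clade|gene_id" leaf labels and the function names are assumptions made for illustration.

```python
# Toy illustration: find the smallest sub-trees whose leaves cover the
# alpha-, beta- and gamma-Proteobacteria clades.
# Assumed encoding: a leaf is a string "clade|gene_id"; an internal node
# is a tuple of child trees. The tree is assumed to be rooted.

REQUIRED_CLADES = {"alpha", "beta", "gamma"}


def clades_in(tree):
    """Return the set of clade labels found among the leaves of `tree`."""
    if isinstance(tree, str):
        return {tree.split("|")[0]}
    found = set()
    for child in tree:
        found |= clades_in(child)
    return found


def leaves_of(tree):
    """Return the leaf labels of `tree` in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves_of(child))
    return out


def candidate_subfamilies(tree):
    """Return leaf lists of the minimal sub-trees covering all required clades."""
    if isinstance(tree, str) or not REQUIRED_CLADES <= clades_in(tree):
        return []
    found = []
    for child in tree:
        found.extend(candidate_subfamilies(child))
    # If no child covers all clades on its own, this node is a minimal sub-tree.
    return found if found else [leaves_of(tree)]


# Hypothetical family tree with two sub-families.
example_tree = (
    (("gamma|g1", "gamma|g2"), ("beta|b1", "alpha|a1")),
    (("gamma|g3", "beta|b2"), "alpha|a2"),
)
groups = candidate_subfamilies(example_tree)
for group in groups:
    print(group)
```

In practice such a test would only propose candidate boundaries; HGT events and misplaced sequences still call for the manual inspection described in the caption.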

Mentions: As mentioned above, identifying genome-wide sets of orthologous groups for a large number of distantly related organisms is an arduous task because of the complexity of the genomic events that shape the evolution of gene families. We chose to delineate the evolutionary history of 49 bacterial gene families (19 universal families that are also found in eukaryotes [32] and 30 bacteria-specific families) that exemplify complex phylogenetic scenarios resulting from HGT, lineage-specific gene loss and other such events (Table S1 in Data S1). Each of the 49 families corresponds to a COG (Cluster of Orthologous Groups), which is high-quality material given that COG assignments are further checked and curated by hand to eliminate potential false positives. The eggNOG algorithm uses COGs as starting material and builds up-to-date versions that include newly sequenced genomes. We used the eggNOG version of COGs as “homology seeds” for this study. We had to focus on a defined taxonomic clade to avoid technical challenges arising from the large number of bacterial sequences in every COG (e.g. computationally expensive phylogenetic trees, or increased sensitivity to noise and biases, since sequences may evolve at very fast rates in some clades of the Tree of Life, or even within bacterial phyla, leading to misalignments [33]–[35]). We decided to generate phylogenetic trees that describe the orthology relationships of the 49 aforementioned families in the gamma-Proteobacteria clade (238 genomes) for several reasons: 1) they represent one third of all bacterial genomes sequenced [1], [32], 2) there is extensive research on their contribution to ecosystems and human health, and 3) there are well-established species relationships within the Proteobacteria clade, which facilitate the use of alpha- and beta-Proteobacteria species as outgroups [32], [48], [49]. This last point is crucial for our analysis because, as mentioned above, orthology is defined relative to the common ancestor of the compared species.
The phylogenetic information in the Proteobacteria clade allows us to define the boundaries of the orthologous groups and to place the last common ancestor of the gamma-Proteobacteria in the family trees (Figure 1). We performed a three-step curation protocol (Figure 1) for all families, ensuring that the curated orthologous groups (termed reference orthologous groups, RefOGs): 1) are not biased towards our initial collection of proteins (COG members based on eggNOG version 3) and 2) depict the evolutionary history of these gene families as accurately as possible. The final dataset consists of 4,698 reference orthologs in 49 RefOGs for hundreds of bacterial species.
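To make the idea of benchmarking against RefOGs concrete, the following is a minimal sketch of how one predicted orthologous group could be compared with one reference group. The excerpt above does not specify the scoring scheme used in the paper, so the plain set-overlap metrics, the function name and the gene identifiers here are all assumptions for illustration.

```python
# Generic set-overlap comparison of a predicted group against a RefOG.
# Not the paper's scoring scheme; gene identifiers are hypothetical.

def score_group(predicted, reference):
    """Count shared, spurious and missing members of a predicted group."""
    predicted, reference = set(predicted), set(reference)
    true_pos = predicted & reference    # members correctly assigned
    false_pos = predicted - reference   # members wrongly included
    false_neg = reference - predicted   # reference members that were missed
    precision = len(true_pos) / len(predicted) if predicted else 0.0
    recall = len(true_pos) / len(reference) if reference else 0.0
    return {
        "false_positives": sorted(false_pos),
        "false_negatives": sorted(false_neg),
        "precision": precision,
        "recall": recall,
    }


ref_og = {"g1", "g2", "g3", "g4"}        # hypothetical RefOG members
predicted_og = {"g1", "g2", "g3", "x9"}  # hypothetical method output
result = score_group(predicted_og, ref_og)
print(result)
```

Reporting the spurious and missing members separately, rather than a single score, keeps the two error types distinguishable when comparing methods.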


