Limits...
Gene family assignment-free comparative genomics.

Doerr D, Thévenin A, Stoye J - BMC Bioinformatics (2012)

Bottom Line: In evaluating our algorithms, we show that the exact algorithm is suitable for computations on small genomes.Moreover, the results of our heuristic are close to those of the exact algorithm.In general, we demonstrate that gene order studies can be improved by direct, gene family assignment-free comparisons.

View Article: PubMed Central - HTML - PubMed

Affiliation: Faculty of Technology, Center for Biotechnology, Bielefeld University, Germany. ddoerr@cebitec.uni-bielefeld.de

ABSTRACT

Background: The comparison of relative gene orders between two genomes offers deep insights into functional correlations of genes and the evolutionary relationships between the corresponding organisms. Methods for gene order analyses often require prior knowledge of homologies between all genes of the genomic dataset. Since such information is hard to obtain, it is common to predict homologous groups based on sequence similarity. These hypothetical groups of homologous genes are called gene families.

Results: This manuscript promotes a new branch of gene order studies in which prior assignment of gene families is not required. As a case study, we present a new similarity measure between pairs of genomes that is related to the breakpoint distance. We propose an exact and a heuristic algorithm for its computation. We evaluate our methods on a dataset comprising 12 γ-proteobacteria from the literature.

Conclusions: In evaluating our algorithms, we show that the exact algorithm is suitable for computations on small genomes. Moreover, the results of our heuristic are close to those of the exact algorithm. In general, we demonstrate that gene order studies can be improved by direct, gene family assignment-free comparisons.

Show MeSH
Phylogenies. (a) True phylogeny obtained from [14]. NeighborNet representations of phylogenies based on distances obtained by (b) Adjacencies-Intermediate-Matching [13], (c) running FFAdj-Int with α = 0.001, (d) α = 0.5, (e) α = 0.8, and (f) running FFAdj-MCS with α = 1.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC3526435&req=5

Figure 4: Phylogenies. (a) True phylogeny obtained from [14]. NeighborNet representations of phylogenies based on distances obtained by (b) Adjacencies-Intermediate-Matching [13], (c) running FFAdj-Int with α = 0.001, (d) α = 0.5, (e) α = 0.8, and (f) running FFAdj-MCS with α = 1.

Mentions: In our evaluation we applied the well-known Neighbor Joining Method (NJ) [18] for inferring phylogenetic trees. Subsequently we compared these to the tree proposed by Lerat et al. [14] that we assume to represent the true phylogeny. Thereby we used the Robinson Foulds topological distance (RF distance) [19] to evaluate our inferred phylogenetic trees. The results are shown on the right side of Table 2. For the majority of all cases we were able to reconstruct the tree correctly up to a single internal edge, causing an RF distance of 2 to the original tree. This internal edge connects the two organisms Buchnera aphidicola (BAPHI) and Salmonella typhimurium (SALTY) with the rest of the tree (Figure 4(a)). This branch is known to be particularly hard to reconstruct since the two organisms diverged far from each other, resulting in two long external edges in the tree. We also reconstructed the phylogeny based on gene families using the measures that were proposed in [13] and obtained an RF distance of 2 to the true tree under the intermediate model and an RF distance of 4 under the maximum matching model, featuring the same aberrancy. These results suggest that gene family information is not relevant in reconstructing the phylogeny of Lerat et al.'s tree. Yet, the increasing deviation from the true tree for the results of FFAdj-Int when α tends to 1 indicates that for genome comparison, the maximization of adjacencies is not enough. The fact that the heuristic outperforms the exact algorithm for α = 1 in terms of RF distance to the true tree confirms the importance of maximizing edg() as well. We recall that FFAdj-MCS iterates until complete saturation is obtained, which increases edg(), even when α = 1.


Gene family assignment-free comparative genomics.

Doerr D, Thévenin A, Stoye J - BMC Bioinformatics (2012)

Phylogenies. (a) True phylogeny obtained from [14]. NeighborNet representations of phylogenies based on distances obtained by (b) Adjacencies-Intermediate-Matching [13], (c) running FFAdj-Int with α = 0.001, (d) α = 0.5, (e) α = 0.8, and (f) running FFAdj-MCS with α = 1.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC3526435&req=5

Figure 4: Phylogenies. (a) True phylogeny obtained from [14]. NeighborNet representations of phylogenies based on distances obtained by (b) Adjacencies-Intermediate-Matching [13], (c) running FFAdj-Int with α = 0.001, (d) α = 0.5, (e) α = 0.8, and (f) running FFAdj-MCS with α = 1.
Mentions: In our evaluation we applied the well-known Neighbor Joining Method (NJ) [18] for inferring phylogenetic trees. Subsequently we compared these to the tree proposed by Lerat et al. [14] that we assume to represent the true phylogeny. Thereby we used the Robinson Foulds topological distance (RF distance) [19] to evaluate our inferred phylogenetic trees. The results are shown on the right side of Table 2. For the majority of all cases we were able to reconstruct the tree correctly up to a single internal edge, causing an RF distance of 2 to the original tree. This internal edge connects the two organisms Buchnera aphidicola (BAPHI) and Salmonella typhimurium (SALTY) with the rest of the tree (Figure 4(a)). This branch is known to be particularly hard to reconstruct since the two organisms diverged far from each other, resulting in two long external edges in the tree. We also reconstructed the phylogeny based on gene families using the measures that were proposed in [13] and obtained an RF distance of 2 to the true tree under the intermediate model and an RF distance of 4 under the maximum matching model, featuring the same aberrancy. These results suggest that gene family information is not relevant in reconstructing the phylogeny of Lerat et al.'s tree. Yet, the increasing deviation from the true tree for the results of FFAdj-Int when α tends to 1 indicates that for genome comparison, the maximization of adjacencies is not enough. The fact that the heuristic outperforms the exact algorithm for α = 1 in terms of RF distance to the true tree confirms the importance of maximizing edg() as well. We recall that FFAdj-MCS iterates until complete saturation is obtained, which increases edg(), even when α = 1.

Bottom Line: In evaluating our algorithms, we show that the exact algorithm is suitable for computations on small genomes.Moreover, the results of our heuristic are close to those of the exact algorithm.In general, we demonstrate that gene order studies can be improved by direct, gene family assignment-free comparisons.

View Article: PubMed Central - HTML - PubMed

Affiliation: Faculty of Technology, Center for Biotechnology, Bielefeld University, Germany. ddoerr@cebitec.uni-bielefeld.de

ABSTRACT

Background: The comparison of relative gene orders between two genomes offers deep insights into functional correlations of genes and the evolutionary relationships between the corresponding organisms. Methods for gene order analyses often require prior knowledge of homologies between all genes of the genomic dataset. Since such information is hard to obtain, it is common to predict homologous groups based on sequence similarity. These hypothetical groups of homologous genes are called gene families.

Results: This manuscript promotes a new branch of gene order studies in which prior assignment of gene families is not required. As a case study, we present a new similarity measure between pairs of genomes that is related to the breakpoint distance. We propose an exact and a heuristic algorithm for its computation. We evaluate our methods on a dataset comprising 12 γ-proteobacteria from the literature.

Conclusions: In evaluating our algorithms, we show that the exact algorithm is suitable for computations on small genomes. Moreover, the results of our heuristic are close to those of the exact algorithm. In general, we demonstrate that gene order studies can be improved by direct, gene family assignment-free comparisons.

Show MeSH