A flexible ancestral genome reconstruction method based on gapped adjacencies.

Gagnon Y, Blanchette M, El-Mabrouk N - BMC Bioinformatics (2012)

Bottom Line: The "small phylogeny" problem consists in inferring the ancestral genome associated with each internal node of a phylogenetic tree of a set of extant species. Ancestral relationships between markers are defined in terms of Gapped Adjacencies, i.e. pairs of markers separated by up to a given number of markers. Applying our algorithm to various simulated data sets reveals good performance: we usually end up with a completely assembled genome while keeping a low error rate.


Affiliation: Département d'Informatique, DIRO, Université de Montréal, Canada.

ABSTRACT

Background: The "small phylogeny" problem consists in inferring the ancestral genome associated with each internal node of a phylogenetic tree of a set of extant species. Existing methods can be grouped into two main categories: distance-based methods, which aim to minimize a total branch length, and synteny-based (or mapping) methods, which first predict a collection of relations between ancestral markers in terms of "synteny" and then assemble this collection into a set of Contiguous Ancestral Regions (CARs). The CARs predicted by synteny-based methods are likely to be more reliable, as they are more directly deduced from conservation observed in extant species. However, the challenge is to end up with a completely assembled genome.

Results: We develop a new synteny-based method that is flexible enough to handle a model of evolution involving whole genome duplication events, in addition to rearrangements, gene insertions, and losses. Ancestral relationships between markers are defined in terms of Gapped Adjacencies, i.e. pairs of markers separated by up to a given number of markers. The method improves on a previous approach restricted to direct adjacencies, which showed high accuracy for adjacency prediction but had the drawback of being overly conservative, i.e. of generating a large number of CARs. Applying our algorithm to various simulated data sets reveals good performance: we usually end up with a completely assembled genome while keeping a low error rate.
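
To make the notion of a gapped adjacency concrete, the following is a minimal Python sketch that enumerates, for a single chromosome given as an ordered list of markers, every pair of markers separated by at most max_gap other markers. The function name gapped_adjacencies, the (left, right, gap) tuple encoding, and the choice to ignore marker orientation are illustrative assumptions, not the representation used by GapAdj.

```python
def gapped_adjacencies(chromosome, max_gap):
    """Enumerate pairs of markers separated by at most max_gap markers.

    Hypothetical helper for illustration: markers are treated as unsigned
    labels, and each result is a (left, right, gap) tuple where gap is the
    number of markers lying strictly between the two.
    """
    pairs = set()
    n = len(chromosome)
    for i in range(n):
        for gap in range(max_gap + 1):
            j = i + gap + 1          # index of the right-hand marker
            if j >= n:
                break                # ran past the chromosome end
            pairs.add((chromosome[i], chromosome[j], gap))
    return pairs


# With max_gap = 0 only direct adjacencies are returned; larger values
# also return pairs that skip over intervening markers.
direct = gapped_adjacencies(["a", "b", "c", "d"], 0)
gapped = gapped_adjacencies(["a", "b", "c", "d"], 1)
```

With max_gap set to 0 the sketch reduces to direct adjacencies, which is the restricted setting the abstract says the method improves on.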

Availability: All source code is available at http://www.iro.umontreal.ca/~mabrouk.



Figure 6: From left to right: (1st) error rate and (2nd) number of CARs obtained by GapAdj on simulations following a model accounting for multichromosomal genomes evolving through gene losses and a maximum of rmax (x-axis) inversions and inter-chromosomal rearrangements per branch of the tree. (3rd) Error rate obtained by GapAdj on simulations performed according to the cereal tree (Figure 4(B)) and the subtree of yeast rooted at τ (Figure 4(B)). The model accounts for inversions, inter-chromosomal rearrangements, gene losses and one WGD. The two red (resp. blue) curves correspond to the results for cereal (resp. yeast) when performing 50 and 100 losses immediately after the WGD. (4th) Running time of GapAdj for one data set following the "cereal 50" model, with rmax=20.

Mentions: We then consider an extended model of evolution for multichromosomal genomes that evolve through inversions, inter-chromosomal rearrangements (translocations, fusions, fissions) and gene losses. Based on the same six-leaf species tree described above, we simulate data sets starting with a 2-chromosome, 200-gene genome at the root ρ of the tree. Each gene loss event involves a single gene chosen randomly in the genome. The number of gene losses on each branch is proportional to that observed in actual yeast genomes, while the proportion of each type of rearrangement operation is chosen to be similar to that reported for S. cerevisiae in [18]: (Inv : Trans : Fus+Fiss) = (5 : 4 : 1). The results given in Figure 6 (two leftmost diagrams) reflect the difference in gapped adjacencies and in number of chromosomes between the real and predicted genomes at node σ. Note that chromosomal fusions and fissions may occur on the branch from ρ to σ, so the true number of chromosomes depicted in the second diagram of Figure 6 is not always 2. Interestingly, the curve for inferred CARs roughly follows the curve for true CARs. In addition, the error rate remains lower than 12% in all cases.
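
The simulation ingredients described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names sample_operations, invert and lose_random_gene are hypothetical, genes are encoded as signed integers, a genome is a list of chromosomes, and the sketch shows how operation types could be drawn in the 5 : 4 : 1 ratio, how a single inversion acts, and how one randomly chosen gene is lost. It does not reproduce the authors' simulator, and translocations, fusions and fissions are left out.

```python
import random

# Operation types and their relative weights, (Inv : Trans : Fus+Fiss) = (5 : 4 : 1).
OPERATIONS = ["inversion", "translocation", "fusion_or_fission"]
WEIGHTS = [5, 4, 1]


def sample_operations(count, rng):
    """Draw `count` operation types with probabilities proportional to WEIGHTS."""
    return rng.choices(OPERATIONS, weights=WEIGHTS, k=count)


def invert(chromosome, start, end):
    """Reverse the segment chromosome[start:end] and flip the sign of each gene."""
    segment = [-gene for gene in reversed(chromosome[start:end])]
    return chromosome[:start] + segment + chromosome[end:]


def lose_random_gene(genome, rng):
    """Remove one gene chosen uniformly at random over the whole genome."""
    positions = [(c, i) for c, chrom in enumerate(genome) for i in range(len(chrom))]
    c, i = rng.choice(positions)
    return [
        chrom[:i] + chrom[i + 1:] if k == c else list(chrom)
        for k, chrom in enumerate(genome)
    ]


rng = random.Random(0)                               # fixed seed, for repeatability
root = [list(range(1, 101)), list(range(101, 201))]  # 2 chromosomes, 200 genes
branch_ops = sample_operations(20, rng)              # e.g. rmax = 20 on one branch
after_loss = lose_random_gene(root, rng)
```

Drawing operation types with weighted sampling keeps the long-run proportions close to 5 : 4 : 1 without fixing the exact count of each type on any one branch.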

