Limits...
An ILP solution for the gene duplication problem.

Chang WC, Burleigh GJ, Fernández-Baca DF, Eulenstein O - BMC Bioinformatics (2011)

Bottom Line: Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours.These are the largest instances that have been solved to optimally to date.Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Computer Science, Iowa State University, Ames 50011, USA. wcchang@iastate.edu

ABSTRACT

Background: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore, most analyses use heuristics that lack any performance guarantee.

Results: We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6,084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics.

Conclusions: Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours. These are the largest instances that have been solved to optimally to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.

Show MeSH
The optimal seed plant phylogeny. The unique optimal seed plant phylogeny based on 12 taxa and 6,084 genes under the GD model.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC3044268&req=5

Figure 1: The optimal seed plant phylogeny. The unique optimal seed plant phylogeny based on 12 taxa and 6,084 genes under the GD model.

Mentions: Our implementation of the ILP formulation finished running the data set in approximately two minutes. We identified a unique optimal solution with 47, 658 duplications (Figure 1).In the optimal species tree, the seed plants are split into angiosperm and gymnosperm clades (Figure 1). In the gymnosperm clade, Gnetales are sister to a conifer clade. With 6, 084 genes, this GTP analysis of seed plants includes by far the most genes ever used to infer the seed plant phylogeny. Our GTP analysis of this large, previously underutilized, data source provides a novel line of evidence that angiosperms and all extant gymnosperms are sister clades. Like most ML analyses of multi-locus data sets, our results show a close affinity between Gnetales and conifers (e.g., [35,37,39,50,51]); however, unlike many of these analyses, GTP does not place Gnetales sister to Pinaceae. Due to the necessarily limited taxon sampling, especially among non-Pinaceae conifers, our results regarding the placement of Gnetales are neither precise nor definitive. Still, the placement of Gnetales sister to the conifers, is an intriguing result that is consistent with some morphological characters, such as ovulate cone scales and resin canals, which appear to support conifer monophyly [36]. However, in contrast to our result, the deletion of the ndh genes in Gnetales and Pinaceae suggests that these clades are sister. Although the GTP results are intriguing, they should be interpreted with caution. For example, the results do not provide any measures of confidence or suggest the degree to which alternate phylogenetic hypotheses are sub-optimal. Furthermore, the gene trees were rooted using mid-point rooting, which may produce incorrect rootings when the sequences do not evolve at a constant rate of evolution [52]. Also, the taxon sampling in this analysis is limited, and the seed plant phylogeny problem can be sensitive to taxon sampling [45]. Thus, although our result provides a novel large-scale genomic perspective on the seed plant phylogeny, it is not a definitive.


An ILP solution for the gene duplication problem.

Chang WC, Burleigh GJ, Fernández-Baca DF, Eulenstein O - BMC Bioinformatics (2011)

The optimal seed plant phylogeny. The unique optimal seed plant phylogeny based on 12 taxa and 6,084 genes under the GD model.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC3044268&req=5

Figure 1: The optimal seed plant phylogeny. The unique optimal seed plant phylogeny based on 12 taxa and 6,084 genes under the GD model.
Mentions: Our implementation of the ILP formulation finished running the data set in approximately two minutes. We identified a unique optimal solution with 47, 658 duplications (Figure 1).In the optimal species tree, the seed plants are split into angiosperm and gymnosperm clades (Figure 1). In the gymnosperm clade, Gnetales are sister to a conifer clade. With 6, 084 genes, this GTP analysis of seed plants includes by far the most genes ever used to infer the seed plant phylogeny. Our GTP analysis of this large, previously underutilized, data source provides a novel line of evidence that angiosperms and all extant gymnosperms are sister clades. Like most ML analyses of multi-locus data sets, our results show a close affinity between Gnetales and conifers (e.g., [35,37,39,50,51]); however, unlike many of these analyses, GTP does not place Gnetales sister to Pinaceae. Due to the necessarily limited taxon sampling, especially among non-Pinaceae conifers, our results regarding the placement of Gnetales are neither precise nor definitive. Still, the placement of Gnetales sister to the conifers, is an intriguing result that is consistent with some morphological characters, such as ovulate cone scales and resin canals, which appear to support conifer monophyly [36]. However, in contrast to our result, the deletion of the ndh genes in Gnetales and Pinaceae suggests that these clades are sister. Although the GTP results are intriguing, they should be interpreted with caution. For example, the results do not provide any measures of confidence or suggest the degree to which alternate phylogenetic hypotheses are sub-optimal. Furthermore, the gene trees were rooted using mid-point rooting, which may produce incorrect rootings when the sequences do not evolve at a constant rate of evolution [52]. Also, the taxon sampling in this analysis is limited, and the seed plant phylogeny problem can be sensitive to taxon sampling [45]. Thus, although our result provides a novel large-scale genomic perspective on the seed plant phylogeny, it is not a definitive.

Bottom Line: Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours.These are the largest instances that have been solved to optimally to date.Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Computer Science, Iowa State University, Ames 50011, USA. wcchang@iastate.edu

ABSTRACT

Background: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore, most analyses use heuristics that lack any performance guarantee.

Results: We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6,084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics.

Conclusions: Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours. These are the largest instances that have been solved to optimally to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.

Show MeSH