Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants.

Smith SA, Moore MJ, Brown JW, Yang Y - BMC Evol. Biol. (2015)

Bottom Line: The large size and complexity of these datasets introduce significant phylogenetic noise and conflict into subsequent analyses. The results provided herein emphasize that analysis of individual homologous gene regions can greatly improve our understanding of the underlying conflict within these datasets. We found significant conflict throughout the phylogeny in both datasets and in particular along the backbone.


Affiliation: Department of Ecology and Evolutionary Biology, University of Michigan, S State St, Ann Arbor, 48109, MI, USA. eebsmith@umich.edu.

ABSTRACT

Background: The use of transcriptomic and genomic datasets for phylogenetic reconstruction has become increasingly common as researchers attempt to resolve recalcitrant nodes with increasing amounts of data. The large size and complexity of these datasets introduce significant phylogenetic noise and conflict into subsequent analyses. The sources of conflict may include hybridization, incomplete lineage sorting, or horizontal gene transfer, and may vary across the phylogeny. For phylogenetic analysis, this noise and conflict has been accommodated in one of several ways: by binning gene regions into subsets to isolate consistent phylogenetic signal; by using gene-tree methods for reconstruction, where conflict is presumed to be explained by incomplete lineage sorting (ILS); or through concatenation, where noise is presumed to be the dominant source of conflict. The results provided herein emphasize that analysis of individual homologous gene regions can greatly improve our understanding of the underlying conflict within these datasets.

Results: Here we examined two published transcriptomic datasets, the angiosperm group Caryophyllales and the aculeate Hymenoptera, for the presence of conflict, concordance, and gene duplications in individual homologs across the phylogeny. We found significant conflict throughout the phylogeny in both datasets and in particular along the backbone. While some nodes in each phylogeny showed patterns of conflict similar to what might be expected with ILS alone, the backbone nodes also exhibited low levels of phylogenetic signal. In addition, certain nodes, especially in the Caryophyllales, had highly elevated levels of strongly supported conflict that cannot be explained by ILS alone.

Conclusion: This study demonstrates that phylogenetic signal is highly variable in phylogenomic data sampled across related species and poses challenges when conducting species tree analyses on large genomic and transcriptomic datasets. Further insight into the conflict and processes underlying these complex datasets is necessary to improve and develop adequate models for sequence analysis and downstream applications. To aid this effort, we developed the open source software phyparts ( https://bitbucket.org/blackrim/phyparts ), which calculates unique, conflicting, and concordant bipartitions, maps gene duplications, and outputs summary statistics such as internode certainty (ICA) scores and node-specific counts of gene duplications.
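
To make the idea of concordant and conflicting bipartitions concrete, the following is a minimal Python sketch and not the phyparts implementation. It assumes rooted gene trees that have already been reduced to sets of clades, and it uses an entropy-style score of the kind usually given for internode certainty; the function names `classify` and `ica`, the toy trees, and the exact rules for "uninformative" are all assumptions made for illustration.

```python
from math import log


def classify(species_clade, gene_clades, gene_taxa):
    """Classify one gene tree against one species-tree clade (simplified rules)."""
    # Restrict the species-tree clade to the taxa sampled in this gene tree.
    target = frozenset(species_clade) & gene_taxa
    # Assumption: fewer than two ingroup taxa, or no taxon outside, says nothing.
    if len(target) < 2 or target == gene_taxa:
        return "uninformative"
    if target in gene_clades:
        return "concordant"
    # Conflict: a gene-tree clade overlaps the target but neither contains the other.
    for clade in gene_clades:
        if clade & target and clade - target and target - clade:
            return "conflict"
    return "uninformative"


def ica(counts):
    """Entropy-style certainty from counts of the focal and alternative resolutions."""
    counts = [c for c in counts if c > 0]
    n = len(counts)
    if n < 2:
        return 1.0  # no observed conflict
    total = sum(counts)
    return 1.0 + sum((c / total) * log(c / total, n) for c in counts)


# Hypothetical toy data: (sampled taxa, clades) for three gene trees.
species_clade = frozenset("AB")
gene_trees = {
    "g1": (frozenset("ABCD"), {frozenset("AB"), frozenset("ABC")}),
    "g2": (frozenset("ABCD"), {frozenset("AC"), frozenset("ABC")}),
    "g3": (frozenset("ACD"), {frozenset("AC")}),
}
calls = {
    name: classify(species_clade, clades, taxa)
    for name, (taxa, clades) in gene_trees.items()
}
print(calls)
print(ica([9, 1]), ica([1, 1]))
```

Under these assumptions, a score near 1 means almost all informative gene trees agree on one resolution, and a score near 0 means the alternatives are about equally frequent. Bootstrap filtering and gene duplication mapping, which the abstract also mentions, are left out of this sketch.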



Fig. 2: Combined ML (species tree) topology for Hymenoptera, with summary of conflicting and concordant homologs. For each branch, the top number indicates the number of homologs concordant with the species tree at that node, and the bottom number indicates the number of homologs in conflict with that clade in the species tree. The pie charts at each node present the proportion of homologs that support that clade (blue), the proportion that support the main alternative for that clade (green), the proportion that support the remaining alternatives (red), and the proportion that inform (conflict or support) this clade that have less than 50 % bootstrap support (grey). The histograms show, for three nodes, the proportion of the total homologs that support each conflicting alternative resolution for the clade in question, sorted from largest to smallest. Grey lines represent distributions of conflicting alternative resolutions based on coalescent simulations generated with three tree heights. The histograms for other nodes are presented in Additional file 2: Figure S5.

Mentions: For the Hymenoptera dataset, we recovered 5,863 homolog groups that were used for conflict and concordance analyses. For phylogenetic analyses, we then used a 1-to-1 orthologs approach to identify 1,116 ortholog groups that contained at least 16 of the 19 total taxa [26]. For the Caryophyllales, we used a phylogenetic tree-based approach to homolog identification and processed the homolog groups into ortholog groups using the ‘rooted ingroups’ orthology inference procedure described in [26]. We recovered 10,960 homolog groups that each contained at least eight ingroup taxa. From this set of homologs, we identified 1,122 ortholog groups that contained at least 65 taxa. These orthologs were concatenated and used to construct a phylogeny and had an ortholog occupancy of 92.1 %. Two samples were removed from the original analyses because of potential contamination. Of the original 10,960 homolog groups, 4,550 contained at least 60 taxa and these were used for conflict and concordance analyses. For both groups, we used RAxML (v. 8.0.2) with the PROTCATWAG substitution model to estimate ML topologies, with each data matrix partitioned by gene region. We will refer to these comprehensive phylogenetic hypotheses as ‘species trees’ below. The inferred species trees for Hymenoptera and Caryophyllales are presented in Figs. 2 and 4, respectively. We note here that while the inference of species trees is not the focus of the present study, they nevertheless are useful for mapping results of gene tree congruence and conflict. We also note that the concatenation-based species trees employed here are identical to coalescent-based species trees estimated for these groups [5, 42], with the exception of one highly mobile taxon in Caryophyllales, Sarcobatus.
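
The taxon-count thresholds and the occupancy figure described above can be illustrated with a short Python sketch. It is not the authors' pipeline: the helper names `filter_by_taxa` and `occupancy` and the toy groups are hypothetical, and occupancy is assumed here to mean the fraction of filled cells in the taxon-by-ortholog matrix.

```python
def filter_by_taxa(groups, min_taxa):
    """Keep only groups that contain at least min_taxa distinct taxa."""
    return {name: taxa for name, taxa in groups.items() if len(taxa) >= min_taxa}


def occupancy(groups, all_taxa):
    """Fraction of taxon-by-group cells that are filled (assumed definition)."""
    if not groups or not all_taxa:
        return 0.0
    filled = sum(len(taxa & all_taxa) for taxa in groups.values())
    return filled / (len(groups) * len(all_taxa))


# Hypothetical toy data: four taxa and three candidate ortholog groups.
all_taxa = {"t1", "t2", "t3", "t4"}
groups = {
    "og1": {"t1", "t2", "t3", "t4"},
    "og2": {"t1", "t2", "t3"},
    "og3": {"t1"},
}

kept = filter_by_taxa(groups, min_taxa=3)
print(sorted(kept))               # groups passing the threshold
print(occupancy(kept, all_taxa))  # 7 filled cells out of 8
```

With real data, the same filter would be applied with thresholds such as the 16-of-19 taxa cutoff quoted for Hymenoptera, with each group mapped to the set of taxa it contains.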

