Separating the wheat from the chaff: mitigating the effects of noise in a plastome phylogenomic data set from Pinus L. (Pinaceae).

Parks M, Cronn R, Liston A - BMC Evol. Biol. (2012)

Bottom Line: Through next-generation sequencing, the amount of sequence data potentially available for phylogenetic analyses has increased exponentially in recent years. Simultaneously, the risk of incorporating 'noisy' data with misleading phylogenetic signal has also increased, and may disproportionately influence the topology of weakly supported nodes and lineages featuring rapid radiations and/or elevated rates of evolution. Similar trends were observed using long-branch exclusion, but patterns were neither as strong nor as clear.


Affiliation: Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331-2902, USA. parksma@science.oregonstate.edu

ABSTRACT

Background: Through next-generation sequencing, the amount of sequence data potentially available for phylogenetic analyses has increased exponentially in recent years. Simultaneously, the risk of incorporating 'noisy' data with misleading phylogenetic signal has also increased, and may disproportionately influence the topology of weakly supported nodes and lineages featuring rapid radiations and/or elevated rates of evolution.

Results: We investigated the influence of phylogenetic noise in large data sets by applying two fundamental strategies, variable site removal and long-branch exclusion, to the phylogenetic analysis of a full plastome alignment of 107 species of Pinus and six Pinaceae outgroups. While high overall phylogenetic resolution resulted from inclusion of all data, three historically recalcitrant nodes remained conflicted with previous analyses. Close investigation of these nodes revealed dramatically different responses to data removal. Whereas topological resolution and bootstrap support for two clades peaked with removal of highly variable sites, the third clade resolved most strongly when all sites were included. Similar trends were observed using long-branch exclusion, but patterns were neither as strong nor as clear. When compared to previous phylogenetic analyses of nuclear loci and morphological data, the most highly supported topologies seen in Pinus plastome analysis are congruent for the two clades gaining support from variable site removal and long-branch exclusion, but in conflict for the clade with highest support from the full data set.

Conclusions: These results suggest that removal of misleading signal in phylogenomic datasets can not only increase resolution for poorly supported nodes, but may also serve as a tool for identifying erroneous yet highly supported topologies. For Pinus chloroplast genomes, removal of variable sites appears to be more effective than long-branch exclusion for clarifying phylogenetic hypotheses.
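The variable site removal strategy described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the per-site score used here (the fraction of mismatching pairwise comparisons, skipping gaps and masked bases) is an assumed stand-in for a variability measure, and the function names `site_variability` and `split_by_variability` and the toy alignment are hypothetical.

```python
from itertools import combinations

def site_variability(column, ignore="-N?"):
    """Fraction of pairwise comparisons in one alignment column that mismatch.

    Gaps and masked bases (characters in `ignore`) are skipped. This is an
    illustrative score, not necessarily the one used in the paper.
    """
    bases = [b for b in column.upper() if b not in ignore]
    pairs = list(combinations(bases, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def split_by_variability(alignment, n_remove):
    """Split columns into a retained set and a removed (most variable) set.

    `alignment` maps taxon name -> aligned sequence (all the same length).
    Returns (kept, removed), each preserving the original column order.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    scores = [site_variability("".join(s[i] for s in seqs)) for i in range(length)]
    # Rank columns from most to least variable; ties keep original order.
    ranked = sorted(range(length), key=lambda i: -scores[i])
    drop = set(ranked[:n_remove])
    kept = {t: "".join(c for i, c in enumerate(s) if i not in drop)
            for t, s in alignment.items()}
    removed = {t: "".join(c for i, c in enumerate(s) if i in drop)
               for t, s in alignment.items()}
    return kept, removed

# Hypothetical toy alignment: four taxa, six aligned positions.
toy = {
    "taxon1": "ACGTAC",
    "taxon2": "ACGAAC",
    "taxon3": "ACCTAC",
    "taxon4": "ATGCAC",
}
kept, removed = split_by_variability(toy, n_remove=2)
print(kept)
print(removed)
```

Calling `split_by_variability` with increasing `n_remove` yields a nested series of progressively more conserved alignments, each of which could then be analyzed separately.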


Figure 3: Trends in bootstrap support values and topologies for likelihood analyses of alignment partitions. For OV-based analyses, the following are shown: a) Distributions of bootstrap support values for all nodes. Circles represent median bootstrap support for each An partition size. b) Distribution of branch score metric (triangles) and partition metric (circles) values for tests of topological congruence between An and corresponding Bn data partitions. Filled data points correspond to An partition sizes falling between the final decrease of branch score metric values and the start of decreases in overall bootstrap support values for An partitions. Partition metric values shown are 0.1× actual value in order to fit on the same scale as branch score metric values.

Mentions: Bootstrap support values showed clear trends throughout An partitions as variable sites were removed, with overall values consistently high (average value > 85%, median value ≥ 98%) until the most variable ca. 8.3 kbp had been removed (A133065) (Figure 3). Support steadily decreased from this point before leveling off at low values with removal of ≥ 18 kbp of variable sites (mean/median bootstrap support < 17%/< 10%). Branch score metric values decreased relatively rapidly over removal of the first ca. 2 kbp of highly variable alignment positions (A141265 to A139265), but subsequently rose again and plateaued at about half their maximal level from ca. A138565 to A136665, until decreasing rapidly again and remaining at low levels (Figure 3). Partition metric values showed an initial rapid decline before leveling off after the removal of the most variable 2.2 kbp (ca. A139165) (Figure 3), at which point values remained constant and relatively low until increasing again beyond removal of the most variable 8.3 kbp (A133065). The most highly variable 2 kbp of alignment positions, corresponding to the first rapid decline in branch score and partition metric values, was dominated by positions wherein the majority of taxa contained gaps and masked bases (mean/median/standard deviation of gaps + masked bases per alignment position = 80.49/108/44.05). Visual inspection of An and Bn trees further revealed that the great majority of topological differences over the plateau of branch score metric values from ca. A138565 to A136665 involved changes in branch lengths associated with subgeneric and sectional level divisions. This is also indirectly evidenced by the consistently low values of partition metric scores over this range, which reflect the consistent branching orders between An and Bn partitions but do not reflect differences in branch lengths. It is likely that these branch length differences are largely responsible for the temporary increase in branch score metric values seen here. Such a result is not completely unexpected, as variable sites associated with internal divisions of large groups of taxa typically have relatively high OV scores [35].
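The contrast drawn above between the two metrics (the partition metric responds only to branching order, while the branch score metric also responds to branch lengths) can be shown with a small Python sketch. The assumptions are: each tree is reduced to a hypothetical dictionary mapping a bipartition of the taxa to a branch length, only internal branches are listed for brevity, and the branch score is taken as a plain sum of squared differences (some implementations report a square root instead). The helper names and toy trees are illustrative and do not come from the paper.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that excludes the first taxon."""
    side = frozenset(side)
    anchor = min(taxa)
    return frozenset(taxa) - side if anchor in side else side

def partition_metric(tree_a, tree_b):
    """Count bipartitions found in exactly one of the two trees."""
    return len(set(tree_a) ^ set(tree_b))

def branch_score(tree_a, tree_b):
    """Sum of squared branch-length differences over all bipartitions.

    A bipartition missing from a tree contributes length 0 for that tree.
    """
    splits = set(tree_a) | set(tree_b)
    return sum((tree_a.get(s, 0.0) - tree_b.get(s, 0.0)) ** 2 for s in splits)

taxa = ["A", "B", "C", "D", "E"]

def make_tree(internal):
    """Build {split: length} from {(taxa on one side): length}."""
    return {canonical_split(side, taxa): length for side, length in internal.items()}

# Same branching order, different internal branch lengths.
tree_1 = make_tree({("A", "B"): 0.10, ("D", "E"): 0.20})
tree_2 = make_tree({("A", "B"): 0.40, ("D", "E"): 0.20})
# Different branching order: the (A,B) grouping is replaced by (A,C).
tree_3 = make_tree({("A", "C"): 0.10, ("D", "E"): 0.20})

print(partition_metric(tree_1, tree_2), branch_score(tree_1, tree_2))
print(partition_metric(tree_1, tree_3), branch_score(tree_1, tree_3))
```

For `tree_1` and `tree_2` the partition metric is zero while the branch score is positive, which is the pattern one would expect when two partitions agree on branching order but differ in branch lengths.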

