Separating the wheat from the chaff: mitigating the effects of noise in a plastome phylogenomic data set from Pinus L. (Pinaceae).

Parks M, Cronn R, Liston A - BMC Evol. Biol. (2012)

Bottom Line: Through next-generation sequencing, the amount of sequence data potentially available for phylogenetic analyses has increased exponentially in recent years. Simultaneously, the risk of incorporating 'noisy' data with misleading phylogenetic signal has also increased, and may disproportionately influence the topology of weakly supported nodes and lineages featuring rapid radiations and/or elevated rates of evolution. Similar trends were observed using long-branch exclusion, but patterns were neither as strong nor as clear.


Affiliation: Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331-2902, USA. parksma@science.oregonstate.edu

ABSTRACT

Background: Through next-generation sequencing, the amount of sequence data potentially available for phylogenetic analyses has increased exponentially in recent years. Simultaneously, the risk of incorporating 'noisy' data with misleading phylogenetic signal has also increased, and may disproportionately influence the topology of weakly supported nodes and lineages featuring rapid radiations and/or elevated rates of evolution.

Results: We investigated the influence of phylogenetic noise in large data sets by applying two fundamental strategies, variable site removal and long-branch exclusion, to the phylogenetic analysis of a full plastome alignment of 107 species of Pinus and six Pinaceae outgroups. While high overall phylogenetic resolution resulted from inclusion of all data, three historically recalcitrant nodes remained conflicted with previous analyses. Close investigation of these nodes revealed dramatically different responses to data removal. Whereas topological resolution and bootstrap support for two clades peaked with removal of highly variable sites, the third clade resolved most strongly when all sites were included. Similar trends were observed using long-branch exclusion, but patterns were neither as strong nor as clear. When compared to previous phylogenetic analyses of nuclear loci and morphological data, the most highly supported topologies seen in Pinus plastome analysis are congruent for the two clades gaining support from variable site removal and long-branch exclusion, but in conflict for the clade with highest support from the full data set.

Conclusions: These results suggest that removal of misleading signal in phylogenomic datasets can result not only in increased resolution for poorly supported nodes, but may serve as a tool for identifying erroneous yet highly supported topologies. For Pinus chloroplast genomes, removal of variable sites appears to be more effective than long-branch exclusion for clarifying phylogenetic hypotheses.
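
The variable site removal strategy named in the Results can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes a per-site variability score has already been computed by some means, and the function name `remove_most_variable_sites`, the toy alignment and the toy scores are all hypothetical.

```python
def remove_most_variable_sites(alignment, scores, n_remove):
    """Return a copy of the alignment without the n_remove highest-scoring columns.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths)
    scores:    one variability score per alignment column (hypothetical input)
    """
    length = len(scores)
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("every sequence must have one score per column")
    if not 0 <= n_remove <= length:
        raise ValueError("n_remove must be between 0 and the alignment length")
    # Rank columns from most to least variable; ties keep left-to-right order.
    ranked = sorted(range(length), key=lambda i: -scores[i])
    dropped = set(ranked[:n_remove])
    kept = [i for i in range(length) if i not in dropped]
    return {name: "".join(seq[i] for i in kept) for name, seq in alignment.items()}


# Toy data, invented for illustration only.
toy = {"taxon_a": "ACGTAC", "taxon_b": "ACCTAA", "taxon_c": "ATGTAG"}
toy_scores = [0.0, 0.67, 0.67, 0.0, 0.0, 1.0]
trimmed = remove_most_variable_sites(toy, toy_scores, 2)
```

Calling the function with a series of increasing `n_remove` values, and re-estimating the tree and bootstrap support on each reduced alignment, would give one way to track how support for a node of interest responds to data removal.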



Figure 2: Distribution of OV for variable plastome alignment positions. Schematic of the Pinus chloroplast genome with annotated protein-coding exons (blue), rRNA loci (yellow), tRNA loci (orange) and noncoding regions (green). The coding loci ycf1 and ycf2 are highlighted in light blue. The distribution of OV values > 0 is indicated by the internal histogram, as follows: red – most variable 4.6 kbp (A142165 to A136665); yellow – most variable sites from 4.6 to 8.3 kbp (A136565 to A133065); green – remaining sites with OV > 0.

Mentions: Variable sites were identified in nearly all coding and noncoding regions of the plastome, although they were unequally distributed between and among exons, introns and noncoding regions (Table 1, Figure 2). The highest average per-site OV was found in noncoding regions, followed by protein-coding exons, introns, and finally RNA-coding exons (Table 1). The higher variability of exons relative to introns was unexpected; however, previous work [31,61] has shown that the loci ycf1 and ycf2 are extremely variable in Pinus compared to other protein-coding loci. Because of this, OV was also averaged for exons without ycf1 and ycf2. With the removal of either ycf1 alone or both ycf1 and ycf2 positions, average per-site OV for protein-coding exons fell below that of intronic regions (Table 1), although the difference between intronic regions and exonic regions with removal of only ycf1 was not significant. The distribution of rate values for alignment positions by AIR-Identifier was similar, although intron regions were significantly more variable than all three exon partitions, and variability of tRNA loci was significantly higher on average than that of rRNA and exon regions with the exclusion of ycf1 and ycf2 (Additional File 3).
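
The partition comparison described above (averaging a per-site score within each region class, with and without highly variable loci) can be sketched as follows. This is an illustrative sketch only: this excerpt does not define OV, so the code uses the fraction of mismatching sequence pairs per column as a stand-in score, which is an assumption. The function names, the toy sequences, the locus names `geneA` and `intron1`, and all coordinates (including those given for `ycf1`) are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_mismatch_fraction(column):
    """Fraction of sequence pairs that differ at one alignment column.

    Used here as a stand-in variability score (an assumption); gaps and
    ambiguity codes are treated as ordinary characters for simplicity.
    """
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def mean_score_by_partition(sequences, partitions, exclude=()):
    """Average the per-column score within each partition class.

    sequences:  list of aligned sequences of equal length
    partitions: list of (locus, class, start, end), 0-based, end-exclusive
    exclude:    locus names to leave out of the averages
    """
    per_class = {}
    for locus, cls, start, end in partitions:
        if locus in exclude:
            continue
        for i in range(start, end):
            column = [seq[i] for seq in sequences]
            per_class.setdefault(cls, []).append(pairwise_mismatch_fraction(column))
    return {cls: mean(vals) for cls, vals in per_class.items()}


# Toy alignment and annotation, invented for illustration only.
seqs = ["AAAACAAA", "AAAGTAAA", "AAATGACA"]
parts = [
    ("geneA", "exon", 0, 3),
    ("ycf1", "exon", 3, 5),
    ("intron1", "intron", 5, 8),
]
with_all = mean_score_by_partition(seqs, parts)
without_ycf1 = mean_score_by_partition(seqs, parts, exclude=("ycf1",))
```

In this toy case the exon average exceeds the intron average only while the hypothetical `ycf1` columns are included, which mirrors the qualitative pattern reported in the text; the numbers themselves carry no biological meaning.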


