Séance: reference-based phylogenetic analysis for 18S rRNA studies.

Medlar A, Aivelo T, Löytynoja A - BMC Evol. Biol. (2014)

Bottom Line: Séance combines the alignment extension and phylogenetic placement capabilities of the Pagan multiple sequence alignment program with a suite of tools to preprocess, cluster and visualise datasets composed of many samples. We demonstrate both improved OTU picking at higher levels of sequence similarity for 454 data and show the accuracy of phylogenetic placement to be comparable to maximum likelihood methods for lower numbers of taxa. Whilst in this article we focus on studying nematodes using the 18S marker gene, the concepts are generic and reference data for alternative marker genes can be easily created.


Affiliation: Institute of Biotechnology, University of Helsinki, Helsinki, P.O.Box 56, Finland. alan.j.medlar@helsinki.fi.

ABSTRACT

Background: Marker gene studies often use short amplicons spanning one or more hypervariable regions from an rRNA gene to interrogate the community structure of uncultured environmental samples. Target regions are chosen for their discriminatory power, but the limited phylogenetic signal of short high-throughput sequencing reads precludes accurate phylogenetic analysis. This is particularly unfortunate in the study of microscopic eukaryotes where horizontal gene flow is limited and the rRNA gene is expected to accurately reflect the species phylogeny. A promising alternative to full phylogenetic analysis is phylogenetic placement, where a reference phylogeny is inferred using the complete marker gene and iteratively extended with the short sequences from a metagenetic sample under study.

Results: Based on the phylogenetic placement approach we built Séance, a community analysis pipeline focused on the analysis of 18S marker gene data. Séance combines the alignment extension and phylogenetic placement capabilities of the Pagan multiple sequence alignment program with a suite of tools to preprocess, cluster and visualise datasets composed of many samples. We showcase Séance by analysing 454 data from a longitudinal study of intestinal parasite communities in wild rufous mouse lemurs (Microcebus rufus) as well as in simulation. We demonstrate both improved OTU picking at higher levels of sequence similarity for 454 data and show the accuracy of phylogenetic placement to be comparable to maximum likelihood methods for lower numbers of taxa.

Conclusions: Séance is an open source community analysis pipeline that provides reference-based phylogenetic analysis for rRNA marker gene studies. Whilst in this article we focus on studying nematodes using the 18S marker gene, the concepts are generic and reference data for alternative marker genes can be easily created. Séance can be downloaded from http://wasabiapp.org/software/seance/ .



Figure 2: Homopolymer modelling. The cluster centroid (1) is higher in abundance than the query sequence (2). When the query is aligned with the centroid sequence, the alignment score is maximised by the dynamic programming algorithm traversing the red path through the sequence graph defined by the query sequence and Pagan’s homopolymer heuristics.

Séance constructs OTUs by greedily clustering reads in descending order of abundance. Our approach is based on the assumption that the sequences found in highest abundance are error-free and those reads containing PCR or sequencing artefacts are not only lower in abundance but highly similar to the error-free sequences they originated from. Taking abundance into consideration during clustering is similar to many other approaches [9,17]. Séance differs from these approaches by using Pagan’s modelling of homopolymer length uncertainty on sequences from platforms known to produce such errors (e.g. Roche/454). This modelling reduces alignment errors around low-complexity regions separated by short spacers and affects the calculation of sequence similarity. Figure 2 illustrates this approach by showing how skipping the last base from the homopolymer AAAA maximises the alignment score and that this gap should not contribute to the similarity calculation.
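
The clustering idea described above can be illustrated with a minimal Python sketch. It is not Séance's or Pagan's implementation: Pagan models homopolymer length uncertainty inside the alignment itself (via a sequence graph), whereas this sketch uses a plain Needleman-Wunsch alignment and then simply leaves homopolymer gaps out of the identity calculation. The function names, the scoring constants, the `hp_min` run-length cutoff and the 0.97 threshold are all assumptions chosen for illustration.

```python
from collections import Counter

# Illustrative scores only (assumption); these are not Pagan's parameters.
MATCH, MISMATCH, GAP = 1, -1, -2


def align(a, b):
    """Global (Needleman-Wunsch) alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + GAP,
                              score[i][j - 1] + GAP)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = MATCH if i > 0 and j > 0 and a[i - 1] == b[j - 1] else MISMATCH
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def run_lengths(seq):
    """For each position, the length of the homopolymer run containing it."""
    out, i = [], 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        out.extend([j - i] * (j - i))
        i = j
    return out


def similarity(a, b, hp_min=3):
    """Fraction of identical columns, ignoring gaps opposite a base that
    sits in a homopolymer run of at least hp_min (hypothetical cutoff)."""
    ga, gb = align(a, b)
    ra, rb = run_lengths(a), run_lengths(b)
    ia = ib = matches = columns = 0
    for x, y in zip(ga, gb):
        skip = False
        if x == "-":
            skip = rb[ib] >= hp_min
            ib += 1
        elif y == "-":
            skip = ra[ia] >= hp_min
            ia += 1
        else:
            matches += x == y
            ia += 1
            ib += 1
        if not skip:
            columns += 1
    return matches / columns if columns else 1.0


def cluster(reads, threshold=0.97):
    """Greedy OTU picking: visit unique reads in descending abundance and
    join the first centroid that is similar enough, else found a new OTU."""
    otus = []
    for seq, count in Counter(reads).most_common():
        for otu in otus:
            if similarity(otu["centroid"], seq) >= threshold:
                otu["size"] += count
                break
        else:
            otus.append({"centroid": seq, "size": count})
    return otus


reads = ["ACGTAAAAGCT"] * 5 + ["ACGTAAAGCT"] * 2 + ["TTTTGGGGCCCC"] * 3
for otu in cluster(reads):
    print(otu["centroid"], otu["size"])
```

In the toy data, the less abundant read `ACGTAAAGCT` differs from the most abundant one only by the length of its A run, so the gap is excluded from the identity and the read is absorbed into that OTU; the unrelated third sequence founds its own OTU. An ordinary single-base deletion outside a long run would still count against the similarity.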

