Xander: employing a novel method for efficient gene-targeted metagenomic assembly.

Wang Q, Fish JA, Gilman M, Sun Y, Brown CT, Tiedje JM, Cole JR - Microbiome (2015)

Bottom Line: However, assembling metagenomic datasets has proven to be computationally challenging. We compared our method to a recently published bulk metagenome assembly method and a recently published gene-targeted assembler and found our method produced more, longer, and higher quality gene sequences. HMMs used for assembly can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines.


Affiliation: Center for Microbial Ecology, Michigan State University, East Lansing, MI USA.

ABSTRACT

Background: Metagenomics can provide important insight into microbial communities. However, assembling metagenomic datasets has proven to be computationally challenging. Current methods often assemble only fragmented partial genes.

Results: We present a novel method for targeting assembly of specific protein-coding genes. This method combines a de Bruijn graph, as used in standard assembly approaches, and a protein profile hidden Markov model (HMM) for the gene of interest, as used in standard annotation approaches. These are used to create a novel combined weighted assembly graph. Xander performs both assembly and annotation concomitantly using information incorporated in this graph. We demonstrate the utility of this approach by assembling contigs for one phylogenetic marker gene and for two functional marker genes, first on Human Microbiome Project (HMP)-defined community Illumina data and then on 21 rhizosphere soil metagenomic datasets from three different crops totaling over 800 Gbp of unassembled data. We compared our method to a recently published bulk metagenome assembly method and a recently published gene-targeted assembler and found our method produced more, longer, and higher quality gene sequences.
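As a rough illustration of the de Bruijn graph component mentioned above, the following minimal Python sketch builds a graph from a few toy reads and walks it from a starting node. It is not Xander's implementation: the function names `build_de_bruijn` and `extend_unambiguous`, the toy reads, and the tiny k are all assumptions made for illustration, and the walk here simply follows unambiguous edges, whereas Xander weights its path choices with a profile HMM for the targeted gene.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: its prefix points to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Extend a path from `start` while the current node has exactly one successor."""
    path = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles instead of looping forever
            break
        path += nxt[-1]
        seen.add(nxt)
        node = nxt
    return path


# Toy overlapping reads (hypothetical data, far shorter than real Illumina reads).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = extend_unambiguous(graph, "ATG")
print(contig)
```

In a gene-targeted setting, a branch in the graph would not end the walk; instead, each candidate edge would be scored against the HMM so that the path most consistent with the gene model is preferred.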

Conclusion: Xander combines gene assignment with the rapid assembly of full-length or near full-length functional genes from metagenomic data without requiring bulk assembly or post-processing to find genes of interest. HMMs used for assembly can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines. This method is implemented as open source software and is available at https://github.com/rdpstaff/Xander_assembler.




Fig. 2: Xander gene assembly workflow. Two types of input sequences are required: one or more metagenomic read files, used to build the de Bruijn graph, and one set of reference sequences for each targeted gene, used to build specialized profile HMMs with a modified version of HMMER 3.0 (see the “Implementation” section). During the search phase, Xander uses a combined weighted assembly graph to assemble genes (contigs). After assembly, several filters are applied at the quality filter step: chimeric genes and genes below the length cutoff or the HMM score cutoff are discarded, the remaining genes are clustered at 99 % aa identity, and the longest gene from each cluster is chosen as the representative. The quality-filtered genes are further processed to provide coverage and abundance information.

Mentions: For each of the three genes and for each sample (either individual or pooled), we used Xander to assemble one best contig from each starting kmer of length 45. To be comparable to the metagenomic assembly contigs, assembled contigs shorter than 300 nucleotides or with an HMM score less than 50 were discarded. A few post-assembly steps were included as part of the analysis (Fig. 2). We clustered the assembled contigs at 99 % aa identity and chose a set of representative nucleotide and protein contigs (the longest contig from each cluster). The 99 % aa identity cutoff was used for contig clustering throughout the analysis unless otherwise noted. Chimeras were identified using UCHIME against the nucleotide reference set. The closest matching reference sequences to these representative contigs were identified using FrameBot. Read mapping and kmer coverage estimates were also performed as described below.
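The length and score cutoffs and the choice of the longest contig per cluster can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the `Contig` record, its field names, and the example data are hypothetical, and the cluster assignments are assumed to come from a separate 99 % aa identity clustering step that is not reproduced here. Chimera detection is also omitted.

```python
from dataclasses import dataclass

MIN_LENGTH_NT = 300    # contigs shorter than 300 nucleotides are discarded
MIN_HMM_SCORE = 50.0   # contigs with an HMM score below 50 are discarded


@dataclass(frozen=True)
class Contig:
    name: str
    nucl_seq: str
    hmm_score: float
    cluster_id: int  # hypothetical: assigned upstream by 99 % aa identity clustering


def passes_cutoffs(contig):
    """Apply the length and HMM score cutoffs stated in the text."""
    return len(contig.nucl_seq) >= MIN_LENGTH_NT and contig.hmm_score >= MIN_HMM_SCORE


def pick_representatives(contigs):
    """Return the longest passing contig for each cluster, keyed by cluster id."""
    best = {}
    for contig in contigs:
        if not passes_cutoffs(contig):
            continue
        current = best.get(contig.cluster_id)
        if current is None or len(contig.nucl_seq) > len(current.nucl_seq):
            best[contig.cluster_id] = contig
    return best


# Made-up contigs; sequences are placeholders of the stated lengths.
contigs = [
    Contig("c1", "A" * 450, 120.0, 0),
    Contig("c2", "A" * 600, 95.5, 0),
    Contig("c3", "A" * 250, 80.0, 1),  # fails the length cutoff
    Contig("c4", "A" * 900, 42.0, 2),  # fails the HMM score cutoff
    Contig("c5", "A" * 330, 61.0, 1),
]
reps = pick_representatives(contigs)
for cluster_id, rep in sorted(reps.items()):
    print(cluster_id, rep.name, len(rep.nucl_seq))
```

With this toy input, cluster 2 has no representative because its only contig falls below the score cutoff.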

