Scalable metagenomic taxonomy classification using a reference genome database.

Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE - Bioinformatics (2013)

Bottom Line: Deep metagenomic sequencing of biological samples has the potential to recover otherwise difficult-to-detect microorganisms and accurately characterize biological samples with limited prior knowledge of sample contents. Existing metagenomic taxonomic classification algorithms, however, do not scale well to analyze large metagenomic datasets, and balancing classification accuracy with computational efficiency presents a fundamental challenge. Taxonomic classification of the previously published 150 giga-base Tyrolean Iceman dataset was found to take <20 h on a single node 40 core large memory machine and provide new insights on the metagenomic contents of the sample.


Affiliation: Center for Applied Scientific Computing, Lawrence Livermore National Laboratory and Global Security Directorate, P. O. Box 808, Livermore, CA 94551, USA.

ABSTRACT

Motivation: Deep metagenomic sequencing of biological samples has the potential to recover otherwise difficult-to-detect microorganisms and accurately characterize biological samples with limited prior knowledge of sample contents. Existing metagenomic taxonomic classification algorithms, however, do not scale well to analyze large metagenomic datasets, and balancing classification accuracy with computational efficiency presents a fundamental challenge.

Results: A method is presented to shift computational costs to an off-line computation by creating a taxonomy/genome index that supports scalable metagenomic classification. Scalable performance is demonstrated on real and simulated data to show accurate classification in the presence of novel organisms on samples that include viruses, prokaryotes, fungi and protists. Taxonomic classification of the previously published 150 giga-base Tyrolean Iceman dataset was found to take <20 h on a single node 40 core large memory machine and provide new insights on the metagenomic contents of the sample.

Availability: Software was implemented in C++ and is freely available at http://sourceforge.net/projects/lmat

Contact: allen99@llnl.gov

Supplementary information: Supplementary data are available at Bioinformatics online.
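
The Results paragraph above describes moving cost into an off-line taxonomy/genome index that is then queried during classification. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes a k-mer lookup table (the excerpt does not spell out the index layout), and the names `build_index`, `classify`, the toy value `K = 4` and the majority-vote rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict

K = 4  # toy k-mer length chosen for illustration only (an assumption)


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genomes, k=K):
    """Off-line step: map each k-mer to the set of taxa whose genome contains it."""
    index = defaultdict(set)
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            index[kmer].add(taxon)
    return index


def classify(read, index, k=K):
    """On-line step: return the taxon hit by the most k-mers of the read.

    Returns None when no k-mer of the read is present in the index.
    Ties are broken alphabetically so the result is deterministic.
    """
    votes = defaultdict(int)
    for kmer in kmers(read, k):
        for taxon in index.get(kmer, ()):
            votes[taxon] += 1
    if not votes:
        return None
    return max(sorted(votes), key=votes.get)


# Hypothetical toy reference sequences, not real genomes.
genomes = {"taxon_A": "ACGTACGTGG", "taxon_B": "TTGACCATTG"}
index = build_index(genomes)
print(classify("ACGTAC", index))
print(classify("GGGGGG", index))
```

The point of the split is that `build_index` runs once per reference database, while `classify` only performs dictionary lookups per read, which is what lets the per-read cost stay small as the number of reads grows.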


Fig. 6 (btt389-F6): Run time performance. Tests were run on three real metagenomic datasets: SRX, DRR and ERR. Run times are shown for the metagenomic classifiers (LMAT-kFull, LMAT-kML and MetaPhlAn using Bowtie2 for read mapping and its reference database) and for simple sequence searches with Bowtie2 and blastn (BLAST) using the same full reference genomes found in kFull. We report run time normalized to the percentage of mapped or labeled reads. Note the log scale on the y-axis; values are given within each bar.

Mentions: The run time of our classifier is compared with that of MetaPhlAn, an existing scalable metagenomic classifier that uses the marker library approach. Additional comparisons were made with the search tools Bowtie2 (Langmead et al., 2009) and blastn. Note that these mapping tools (Bowtie2 and BLAST in Fig. 6) do not perform classification; tools such as PhymmBL and Genometa would have to be run on the mapped reads, which could increase total run time depending on how run time and accuracy are balanced. Nonetheless, if our methods are comparable in speed with raw search times, they should present a distinct advantage. Measurements were conducted on a single-node large-memory machine, a quad-CPU Intel Westmere (10 cores per CPU) with 1 TB of DRAM, running Linux kernel 2.6.32.
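
The figure caption states that run time is normalized to the percentage of mapped or labeled reads, but the exact formula is not given in this excerpt. One plausible reading, shown here as a sketch under that assumption, divides the wall-clock run time by the fraction of reads a tool mapped or labeled, so that a tool labeling few reads is not credited for finishing quickly. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def normalized_run_time(run_time_s, reads_labeled, reads_total):
    """Return run time divided by the fraction of reads mapped or labeled.

    Assumed formula: run_time_s / (reads_labeled / reads_total).
    """
    if reads_total <= 0:
        raise ValueError("reads_total must be positive")
    if not 0 < reads_labeled <= reads_total:
        raise ValueError("reads_labeled must be in (0, reads_total]")
    fraction = reads_labeled / reads_total
    return run_time_s / fraction


# Hypothetical example: one hour of run time with half of the reads labeled.
print(normalized_run_time(3600.0, 50, 100))
```

Under this reading, two tools with equal raw run time compare differently once one of them labels a larger share of the reads.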


