Automated and Accurate Estimation of Gene Family Abundance from Shotgun Metagenomes.

Nayfach S, Bradley PH, Wyman SK, Laurent TJ, Williams A, Eisen JA, Pollard KS, Sharpton TJ - PLoS Comput. Biol. (2015)



Affiliation: Gladstone Institute of Cardiovascular Disease, San Francisco, California, United States of America.

ABSTRACT
Shotgun metagenomic DNA sequencing is a widely applicable tool for characterizing the functions that are encoded by microbial communities. Several bioinformatic tools can be used to functionally annotate metagenomes, allowing researchers to draw inferences about the functional potential of the community and to identify putative functional biomarkers. However, little is known about how decisions made during annotation affect the reliability of the results. Here, we use statistical simulations to rigorously assess how to optimize annotation accuracy and speed, given parameters of the input data like read length and library size. We identify best practices in metagenome annotation and use them to guide the development of the Shotgun Metagenome Annotation Pipeline (ShotMAP). ShotMAP is an analytically flexible, end-to-end annotation pipeline that can be implemented either on a local computer or a cloud compute cluster. We use ShotMAP to assess how different annotation databases impact the interpretation of how marine metagenome and metatranscriptome functional capacity changes across seasons. We also apply ShotMAP to data obtained from a clinical microbiome investigation of inflammatory bowel disease. This analysis finds that gut microbiota collected from Crohn's disease patients are functionally distinct from gut microbiota collected from either ulcerative colitis patients or healthy controls, with differential abundance of metabolic pathways related to host-microbiome interactions that may serve as putative biomarkers of disease.


pcbi.1004573.g004: Relationship between read length, bit-score threshold, and prediction accuracy. (A) Simulated metagenomes (50–500 bp; 1% error rate; mock community 160319967-stool1) were searched and classified into SFams at different bit-score thresholds. At each threshold, L1 relative abundance error was calculated. (B) Simulated 101-bp Illumina metagenomes from ten communities were searched and classified into SFams at different bit-score thresholds. Plotted is the optimal bit-score threshold for each community. Error bars indicate the range of bit-scores that result in L1 error within 1% of the optimal level. (C) Relative abundance error for simulated metagenomes of varying phylogenetic distance to reference genomes (50–500 bp; species to phylum taxonomic exclusion; 1% error rate; mock community 160319967-stool1). (D) Optimal bit-score thresholds for metagenomes in (C). (E) Relative abundance error for simulated metagenomes of varying length and sequencing error (50–500 bp; 0–10% error rate; mock community 160319967-stool1). (F) Optimal bit-score thresholds for metagenomes in (E).
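The threshold scan described in panels (A) and (B) can be illustrated with a minimal Python sketch. This is not ShotMAP code: the function names, the `(family, bit_score)` hit format with one top hit per read, and the reading of "within 1% of the optimal level" as a relative tolerance are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (protein family, bit-score of the read's top hit); assumed format


def relative_abundance(counts: Dict[str, float]) -> Dict[str, float]:
    """Scale raw per-family counts so they sum to 1 (empty input stays empty)."""
    total = sum(counts.values())
    if total == 0:
        return {family: 0.0 for family in counts}
    return {family: n / total for family, n in counts.items()}


def l1_error(expected: Dict[str, float], observed: Dict[str, float]) -> float:
    """Sum of absolute differences between two relative abundance profiles."""
    exp = relative_abundance(expected)
    obs = relative_abundance(observed)
    families = set(exp) | set(obs)
    return sum(abs(exp.get(f, 0.0) - obs.get(f, 0.0)) for f in families)


def classify(hits: Iterable[Hit], threshold: float) -> Dict[str, float]:
    """Count reads per family, keeping only hits at or above the threshold."""
    counts: Dict[str, float] = {}
    for family, score in hits:
        if score >= threshold:
            counts[family] = counts.get(family, 0) + 1
    return counts


def scan_thresholds(
    hits: List[Hit],
    expected: Dict[str, float],
    thresholds: List[float],
    tolerance: float = 0.01,  # assumption: 1% relative to the minimum error
):
    """Return the best threshold, the range of near-optimal thresholds, and all errors."""
    errors = {t: l1_error(expected, classify(hits, t)) for t in thresholds}
    best = min(errors, key=errors.get)
    near = [t for t in thresholds if errors[t] <= errors[best] * (1 + tolerance)]
    return best, (min(near), max(near)), errors


# Toy data, invented for illustration only.
toy_hits = [("A", 45), ("A", 38), ("B", 36), ("B", 32), ("C", 25), ("C", 22), ("A", 18)]
toy_expected = {"A": 2, "B": 2}
best, near_range, errors = scan_thresholds(toy_hits, toy_expected, [20, 30, 50])
```

In the toy data, a lenient threshold admits spurious hits to family C and a stringent one discards every read, so both extremes raise the L1 error relative to the middle threshold.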

Mentions: We began with an exploration of bit-score thresholds and found that (i) a precise bit-score threshold was critical to accurately estimate the relative abundance of protein families using short-read metagenomes (Fig 4A), (ii) optimal bit-score thresholds were read-length specific (Fig 4A), and (iii) the bit-score thresholds we identified tended to correspond to non-significant E-values (S3 Fig). Bit-score thresholds that were either too lenient or too stringent resulted in inaccurate estimates of protein family abundance, particularly for short-read metagenomes. For example, at 100 bp, we found that accuracy was maximized at a bit-score threshold of ~31 bits; decreasing the threshold to 20 bits or increasing it to 50 bits increased error by 29–44%, which agrees with a previous report [34]. For reads longer than 100 bp, a precise bit-score threshold was less important, and similar accuracy was achieved over a wider range of thresholds (Fig 4A), presumably because false positives separate more clearly from true positives at longer read lengths. When applying read-length-specific thresholds, it is important to recall that reads within a sample may vary in length, especially after trimming, so it may be desirable to use different thresholds for different reads (see below).
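One way to apply different thresholds to reads of different lengths is to look up each read's threshold from a calibration table. The sketch below assumes linear interpolation between calibrated lengths, which the text does not specify, and it is not ShotMAP's implementation. Only the 100 bp entry (~31 bits) comes from the text; the other calibration rows, the function names, and the hit tuple layout are hypothetical placeholders.

```python
from bisect import bisect_left
from typing import List, Tuple

# (read length in bp, bit-score threshold).
# Only the 100 bp row reflects the text; the 50 bp and 200 bp rows are
# hypothetical placeholders and must be replaced with calibrated values.
EXAMPLE_CALIBRATION = [(50, 25.0), (100, 31.0), (200, 40.0)]


def threshold_for_length(read_length: int, calibration: List[Tuple[int, float]]) -> float:
    """Interpolate a bit-score threshold for a read of the given length.

    Lengths outside the calibrated range are clamped to the nearest entry.
    """
    table = sorted(calibration)
    lengths = [length for length, _ in table]
    i = bisect_left(lengths, read_length)
    if i == 0:
        return table[0][1]
    if i == len(table):
        return table[-1][1]
    (len_lo, thr_lo), (len_hi, thr_hi) = table[i - 1], table[i]
    frac = (read_length - len_lo) / (len_hi - len_lo)
    return thr_lo + frac * (thr_hi - thr_lo)


def filter_hits(hits, calibration):
    """Keep hits whose bit-score meets the threshold for that read's length.

    Each hit is (read_id, read_length, family, bit_score); an assumed layout.
    """
    return [
        hit for hit in hits
        if hit[3] >= threshold_for_length(hit[1], calibration)
    ]


# Toy hits, invented for illustration only.
toy_hits = [
    ("r1", 100, "A", 33.0),  # passes: threshold at 100 bp is 31
    ("r2", 100, "B", 29.0),  # fails at 100 bp
    ("r3", 60, "B", 29.0),   # same score passes for a shorter, trimmed read
]
kept = filter_hits(toy_hits, EXAMPLE_CALIBRATION)
```

The same bit-score can pass for a trimmed read and fail for a full-length one, which is the motivation for per-read thresholds rather than a single sample-wide cutoff.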

