The impact of sequence database choice on metaproteomic results in gut microbiota studies

ABSTRACT

Background: Elucidating the role of gut microbiota in physiological and pathological processes has recently emerged as a key research aim in life sciences. In this respect, metaproteomics, the study of the whole protein complement of a microbial community, can provide a unique contribution by revealing which functions are actually being expressed by specific microbial taxa. However, its wide application to gut microbiota research has been hindered by challenges in data analysis, especially related to the choice of the proper sequence databases for protein identification.

Results: Here, we present a systematic investigation of variables concerning database construction and annotation and evaluate their impact on human and mouse gut metaproteomic results. We found that both publicly available and experimental metagenomic databases lead to the identification of unique peptide assortments, suggesting parallel database searches as a means to gain more complete information. In particular, the contribution of experimental metagenomic databases proved to be essential when dealing with mouse samples. Moreover, the use of a “merged” database, containing all metagenomic sequences from the population under study, was found to be generally preferable to the use of sample-matched databases. We also observed that taxonomic and functional results are strongly database-dependent, in particular when analyzing the mouse gut microbiota. As a striking example, the Firmicutes/Bacteroidetes ratio varied up to tenfold depending on the database used. Finally, assembling reads into longer contigs provided significant advantages in terms of functional annotation yields.

Conclusions: This study helps to identify host- and database-specific biases that need to be taken into account in a metaproteomic experiment, providing meaningful insights into how to design gut microbiota studies and perform metaproteomic data analysis. In particular, the use of multiple databases and annotation tools should be encouraged, even though this requires appropriate bioinformatic resources.

Electronic supplementary material: The online version of this article (doi:10.1186/s40168-016-0196-8) contains supplementary material, which is available to authorized users.


Fig. 6: Functional annotation of human and mouse gut metaproteomic data obtained using different databases. DBs were made up of reads (dark blue, R-A) or contigs (red, C-A) from gut metagenomes of all subjects analyzed for each host species, or all bacterial sequences deposited in UniProt (turquoise, UP-B). (a) Histograms showing the mean number (N = 3, with error bars indicating standard error of the mean) of non-redundant peptides identified (tot) and annotated at different levels (PF, protein family; KO, KEGG ortholog; EC, enzyme code; BP, Gene Ontology Biological Process; PW, pathway; *p < 0.05; **p < 0.01; ***p < 0.001; paired t test) in human (left) and mouse (right) samples after a blastp search against bacterial Swiss-Prot entries. (b) PCA plots of function abundances, with each dot indicating a different human (left) or mouse (right) subject.

Finally, we focused on the functional annotation of metaproteomic data. In a preliminary test, we obtained a generally better functional annotation performance (in terms of absolute and relative number of entries with complete annotation, at several levels) when blasting sequences against bacterial entries from Swiss-Prot, rather than blasting against the whole UniProt (i.e., Swiss-Prot+TrEMBL) DB, or retrieving KEGG or SEED functional information from MEGAN (data not shown). Functional information was therefore retrieved from Swiss-Prot. As shown in Fig. 6a, the contig-based DB significantly outperformed the read-based DB in terms of peptide annotation yield, while UP-B performed slightly worse than the contig-based DB in human, and dramatically worse than both MG-DBs in mouse. Consistent with taxonomic data, human function abundances clustered according to individuals, whereas mouse data clustered according to the DB used (PCA plots in Fig. 6b). Furthermore, LEfSe differential analysis revealed that up to 15% and 39% of the identified functions varied significantly when using different DB types in human and mouse, respectively (Additional file 8: Figure S8 and Additional file 9: Dataset S1). Interestingly, 60 protein families detected in all mice using the contig-based DB (16% of the total identifications achieved using that DB) were not detected at all when using one or both of the other DBs.
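The comparison in the last sentence (protein families found in every subject with one DB but absent with one or both of the others) can be illustrated with plain set operations. The following Python sketch is not the authors' pipeline: the function names, the `PF_*` identifiers, and the toy detections are hypothetical, and it assumes that "total identifications" means the families seen in at least one subject with the reference DB.

```python
from typing import Dict, List, Set

# Hypothetical layout: DB label -> one set of detected protein families per subject.
Detections = Dict[str, List[Set[str]]]


def families_in_all_subjects(per_subject: List[Set[str]]) -> Set[str]:
    """Families detected in every subject (empty set if there are no subjects)."""
    if not per_subject:
        return set()
    return set.intersection(*per_subject)


def families_in_any_subject(per_subject: List[Set[str]]) -> Set[str]:
    """Families detected in at least one subject."""
    return set().union(*per_subject)


def missed_by_other_dbs(detections: Detections, reference_db: str) -> Set[str]:
    """Families found in all subjects with the reference DB but found in
    no subject with at least one of the other DBs."""
    core = families_in_all_subjects(detections[reference_db])
    missed: Set[str] = set()
    for db, per_subject in detections.items():
        if db == reference_db:
            continue
        missed |= core - families_in_any_subject(per_subject)
    return missed


# Toy, made-up detections for three subjects; not data from the study.
toy: Detections = {
    "C-A": [{"PF_a", "PF_b", "PF_c"}, {"PF_a", "PF_b", "PF_c", "PF_d"}, {"PF_a", "PF_b", "PF_c"}],
    "R-A": [{"PF_a"}, {"PF_a", "PF_b"}, {"PF_a"}],
    "UP-B": [{"PF_a", "PF_b"}, {"PF_a"}, {"PF_d"}],
}

missed = missed_by_other_dbs(toy, "C-A")
# Assumed denominator: all families seen with the reference DB in any subject.
share = len(missed) / len(families_in_any_subject(toy["C-A"]))
print(sorted(missed), share)
```

Applied to real per-subject identification lists, the same two numbers (the count of missed families and their share of the reference DB total) correspond to the kind of figures reported above, provided the denominator is defined as the authors defined it.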

