The impact of sequence database choice on metaproteomic results in gut microbiota studies


ABSTRACT

Background: Elucidating the role of gut microbiota in physiological and pathological processes has recently emerged as a key research aim in life sciences. In this respect, metaproteomics, the study of the whole protein complement of a microbial community, can provide a unique contribution by revealing which functions are actually being expressed by specific microbial taxa. However, its wide application to gut microbiota research has been hindered by challenges in data analysis, especially related to the choice of the proper sequence databases for protein identification.

Results: Here, we present a systematic investigation of variables concerning database construction and annotation and evaluate their impact on human and mouse gut metaproteomic results. We found that both publicly available and experimental metagenomic databases lead to the identification of unique peptide assortments, suggesting parallel database searches as a means to gain more complete information. In particular, the contribution of experimental metagenomic databases proved to be essential when dealing with mouse samples. Moreover, the use of a “merged” database, containing all metagenomic sequences from the population under study, was found to be generally preferable to the use of sample-matched databases. We also observed that taxonomic and functional results are strongly database-dependent, in particular when analyzing the mouse gut microbiota. As a striking example, the Firmicutes/Bacteroidetes ratio varied up to tenfold depending on the database used. Finally, assembling reads into longer contigs provided significant advantages in terms of functional annotation yields.

Conclusions: This study helps to identify host- and database-specific biases which need to be taken into account in a metaproteomic experiment, providing meaningful insights into how to design gut microbiota studies and to perform metaproteomic data analysis. In particular, the use of multiple databases and annotation tools should be encouraged, even though this requires appropriate bioinformatic resources.

Electronic supplementary material: The online version of this article (doi:10.1186/s40168-016-0196-8) contains supplementary material, which is available to authorized users.



Fig. 1: Design of the human stool sample experiment. Database-related acronyms contained in colored cylinders correspond to those indicated in Fig. 2

Mentions: The main aim of this study was to perform a systematic comparison of the main DB types which can be used in a metaproteomic study, according to both sequencing-based and sequencing-independent approaches. To this end, we chose to conduct a preliminary investigation of the impact of several DB construction variables using a single human stool sample. Hence, as illustrated in Fig. 1, the fecal material collected from a healthy human volunteer was split into two equal portions, to be processed in parallel according to a metagenomic and a metaproteomic approach, and therefore to generate a wide gamut of sequence DBs. In total, we generated 19 different DBs, whose characteristics are presented in Table 1. Among them, 12 were experimental metagenomic DBs (i.e., obtained by metagenomic sequencing of the same individual and analyzed by metaproteomics; MG-DBs), differing based on common sequencing and sequence processing variables, namely sequencing depth, read assembly level, and ORF finding method; three were “pseudo-metagenomes”, containing UniProt sequences taxonomically filtered based on 16S-rDNA sequencing data at three different taxonomic levels (species, genus, and family); four were sequencing-independent DBs, with the first comprising the entire set of bacterial sequences deposited in UniProt (UP-B) at the time of the analysis, and the remaining three constructed according to a previously described taxonomy-based proteomic iterative (PI) strategy [15, 20] (see Methods for further details). MS data obtained from the human stool sample were then searched against all sequence DBs in parallel using three different bioinformatics platforms and two FDR thresholds (namely, 1 and 5 %), in order to rule out potential search engine- or statistics-based biases.
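The size of this search design can be made concrete with a small sketch. The group counts (12, 3, 4), the three platforms, and the two FDR thresholds come from the paragraph above. The individual database labels and platform names are placeholders invented for illustration, and the sketch assumes that every database was searched with every platform at both thresholds, which the text suggests but does not state explicitly.

```python
from itertools import product

# Database groups and counts as stated in the text (12 + 3 + 4 = 19).
# The per-database labels generated below are placeholders,
# not the acronyms used in the paper's Table 1.
db_groups = {
    "experimental metagenomic (MG-DB)": 12,
    "pseudo-metagenome (16S-filtered UniProt)": 3,
    "sequencing-independent (UP-B and PI)": 4,
}
databases = [
    f"{group} #{i}"
    for group, count in db_groups.items()
    for i in range(1, count + 1)
]

# Hypothetical platform names; the text only says that three were used.
platforms = ["platform_A", "platform_B", "platform_C"]
fdr_thresholds = [0.01, 0.05]  # 1 % and 5 %

# Full factorial grid: one search per (database, platform, FDR) combination.
search_grid = list(product(databases, platforms, fdr_thresholds))

print(f"{len(databases)} databases, {len(search_grid)} searches in total")
```

Under that full-factorial assumption, the single human stool sample alone would require 19 × 3 × 2 = 114 database searches, which illustrates why the authors note that such comparisons need appropriate bioinformatic resources.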

