The impact of sequence database choice on metaproteomic results in gut microbiota studies


ABSTRACT

Background: Elucidating the role of gut microbiota in physiological and pathological processes has recently emerged as a key research aim in life sciences. In this respect, metaproteomics, the study of the whole protein complement of a microbial community, can provide a unique contribution by revealing which functions are actually being expressed by specific microbial taxa. However, its wide application to gut microbiota research has been hindered by challenges in data analysis, especially related to the choice of the proper sequence databases for protein identification.

Results: Here, we present a systematic investigation of variables concerning database construction and annotation and evaluate their impact on human and mouse gut metaproteomic results. We found that both publicly available and experimental metagenomic databases lead to the identification of unique peptide assortments, suggesting parallel database searches as a means to gain more complete information. In particular, experimental metagenomic databases proved essential when dealing with mouse samples. Moreover, the use of a “merged” database, containing all metagenomic sequences from the population under study, was found to be generally preferable to the use of sample-matched databases. We also observed that taxonomic and functional results are strongly database-dependent, in particular when analyzing the mouse gut microbiota. As a striking example, the Firmicutes/Bacteroidetes ratio varied up to tenfold depending on the database used. Finally, assembling reads into longer contigs provided significant advantages in terms of functional annotation yields.

Conclusions: This study helps to identify host- and database-specific biases that need to be taken into account in a metaproteomic experiment, providing meaningful insights into how to design gut microbiota studies and perform metaproteomic data analysis. In particular, the use of multiple databases and annotation tools should be encouraged, even though this requires appropriate bioinformatic resources.

Electronic supplementary material: The online version of this article (doi:10.1186/s40168-016-0196-8) contains supplementary material, which is available to authorized users.



Fig. 2: Peptide identification metrics in human gut metaproteomic data obtained using different databases. (a) Non-redundant peptides identified by searching the MS spectra obtained from a human stool sample against 19 different types of sequence DBs. Each graph illustrates the results achieved using a different bioinformatic platform, namely (from left to right) MetaProteomeAnalyzer (MPA), MaxQuant (MQ), and Proteome Discoverer (PD). RF and R6 (in blue) represent sequence DBs based on metagenomic reads processed by FragGeneScan or six-frame translation, respectively, while CF and C6 (in red) represent sequence DBs based on assembled metagenomic contigs processed by FragGeneScan or six-frame translation, respectively; 18, 6, and 3 refer to the sequencing depth (in gigabases). Data from UniProt-based DBs are depicted in green (“pseudo-metagenomes” based on taxa found by 16S rDNA gene analysis), orange (sequences selected based on taxa found by a proteomic iterative approach, PI; see Methods for further details), and turquoise (all bacterial sequences); F, G, and S refer to the level of taxonomic filtering (family, genus, and species, respectively). Each column in the histograms contains a darker part (identifications with FDR <1 %) and a lighter part (additional identifications with FDR <5 %). (b) Total peptide identifications obtained using five representative DBs and related multiple DB searches (ampersand indicates merging results from different DBs). All searches were run in triplicate using the three abovementioned bioinformatic platforms.

Mentions: Design of the human stool sample experiment. Database-related acronyms contained in colored cylinders correspond to those indicated in Fig. 2.
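
The merging of results from multiple DB searches mentioned in the Fig. 2 caption (the ampersand notation) can be pictured as a set union over non-redundant peptide sequences. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the database labels, and the peptide sequences are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, Set


def merge_peptide_ids(searches: Dict[str, Set[str]]) -> Set[str]:
    """Return the union of non-redundant peptides across all database searches."""
    merged: Set[str] = set()
    for peptides in searches.values():
        merged |= peptides
    return merged


def unique_contributions(searches: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each database, return the peptides identified with that database only."""
    unique: Dict[str, Set[str]] = {}
    for name, peptides in searches.items():
        # Pool the peptides found with every other database.
        others = set().union(*(p for n, p in searches.items() if n != name))
        unique[name] = peptides - others
    return unique


# Hypothetical peptide lists (made-up sequences, illustrative labels only).
searches = {
    "metagenomic-contigs": {"AAGLK", "VTDEER", "LLSPGK"},
    "UniProt-bacteria": {"AAGLK", "GFTYEK"},
}

merged = merge_peptide_ids(searches)
unique = unique_contributions(searches)
print(len(merged))
print(sorted(unique["metagenomic-contigs"]))
```

In this toy input, each database contributes peptides the other misses, which is the situation the abstract describes as the motivation for parallel searches. A real workflow would also have to apply FDR filtering before merging, which this sketch leaves out.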

