The impact of sequence database choice on metaproteomic results in gut microbiota studies


ABSTRACT

Background: Elucidating the role of gut microbiota in physiological and pathological processes has recently emerged as a key research aim in life sciences. In this respect, metaproteomics, the study of the whole protein complement of a microbial community, can provide a unique contribution by revealing which functions are actually being expressed by specific microbial taxa. However, its wide application to gut microbiota research has been hindered by challenges in data analysis, especially related to the choice of the proper sequence databases for protein identification.

Results: Here, we present a systematic investigation of variables concerning database construction and annotation and evaluate their impact on human and mouse gut metaproteomic results. We found that both publicly available and experimental metagenomic databases lead to the identification of unique peptide assortments, suggesting parallel database searches as a means to gain more complete information. In particular, experimental metagenomic databases proved essential when dealing with mouse samples. Moreover, the use of a “merged” database, containing all metagenomic sequences from the population under study, was found to be generally preferable to the use of sample-matched databases. We also observed that taxonomic and functional results are strongly database-dependent, in particular when analyzing the mouse gut microbiota. As a striking example, the Firmicutes/Bacteroidetes ratio varied up to tenfold depending on the database used. Finally, assembling reads into longer contigs provided significant advantages in terms of functional annotation yields.

Conclusions: This study helps identify host- and database-specific biases that need to be taken into account in a metaproteomic experiment, providing meaningful insights into how to design gut microbiota studies and perform metaproteomic data analysis. In particular, the use of multiple databases and annotation tools should be encouraged, even though this requires appropriate bioinformatic resources.

Electronic supplementary material: The online version of this article (doi:10.1186/s40168-016-0196-8) contains supplementary material, which is available to authorized users.



Fig. 3: Peptide identification metrics in human and mouse gut metaproteomic data obtained using different databases. a Results obtained with human (N = 3, left) and mouse (N = 3, right) samples, with each dot representing a single sample. R-A and C-A refer to DBs constructed with metagenomic reads and contigs from all samples, respectively. UP-B indicates a UniProt-based DB containing all bacterial entries. An ampersand indicates merged results from parallel DB searches. Asterisks indicate significantly different peptide identification yields between DBs (*p < 0.05; **p < 0.01; ***p < 0.001; paired t test), with the asterisk color corresponding to the DB against which the comparison is made (e.g., the green asterisk over blue dots indicates that the difference between “R-A” and “R-A & UP-B” is significant). b Comparison of metagenomic DB construction strategies applied to human (N = 3, left) and mouse (N = 3, right) samples, with each dot representing a single sample. “R” and “C” refer to reads (blue) and contigs (red), respectively. Matched DBs (sequences from the gut metagenome of the same subject analyzed by metaproteomics) are indicated with “-M”; unmatched DBs (sequences from a gut metagenome of another subject of the same host species) are indicated with “-UM”; merged DBs (sequences from gut metagenomes of all subjects analyzed for that host species) are indicated with “-A”. *p < 0.05 (paired t test)
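The caption reports paired t tests on per-sample peptide identification yields. As a rough illustration of what such a comparison computes, the sketch below derives the paired t statistic with the Python standard library only. The peptide counts are hypothetical placeholders, not values from the study, and the function name `paired_t_statistic` is an assumption introduced here for illustration.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t test on matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized lists with at least 2 pairs")
    # The test works on per-sample differences, since each sample is
    # searched against both databases.
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical peptide counts for N = 3 samples (NOT data from the paper).
single_db = [1000, 1200, 1100]      # e.g. one DB searched alone
merged_search = [1300, 1450, 1400]  # e.g. merged results of two parallel searches

t, df = paired_t_statistic(merged_search, single_db)
print(f"t = {t:.2f}, df = {df}")
```

With N = 3 there are only 2 degrees of freedom, so the two-sided 5% critical value of t is about 4.30; differences must be large and consistent across samples to reach significance.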

Mentions: Strikingly, for all mouse samples, UP-B performed dramatically worse than the MG-DBs, while the number of identified peptides was similar between the two DB families in human samples (Fig. 3a). Again, each DB type provided a significant number of unique peptides in both species, confirming multiple DB searches as a means to considerably increase the number of peptide identifications.
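The idea of merging parallel DB searches and counting the peptides unique to each DB can be sketched with plain set operations. This is a minimal illustration, not the authors' pipeline: the function name `summarize_parallel_searches` and the peptide sequences are hypothetical, and real workflows would first filter identifications (for example by false discovery rate) before merging.

```python
def summarize_parallel_searches(ids_by_db):
    """Merge peptide identifications from parallel searches.

    ids_by_db maps a database label to the peptides identified with it.
    Returns (merged, unique), where merged is the union over all DBs and
    unique maps each label to the peptides found only with that DB.
    """
    sets = {db: set(peptides) for db, peptides in ids_by_db.items()}
    merged = set().union(*sets.values())
    unique = {}
    for db, peptides in sets.items():
        # Everything identified by any other database.
        others = set().union(*(p for name, p in sets.items() if name != db))
        unique[db] = peptides - others
    return merged, unique


# Hypothetical identifications (placeholder sequences, not real data).
ids = {
    "R-A": {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC"},
    "UP-B": {"PEPTIDEB", "PEPTIDED"},
}

merged, unique = summarize_parallel_searches(ids)
print(len(merged), {db: sorted(p) for db, p in unique.items()})
```

In this toy case the merged result ("R-A & UP-B" in the figure's notation) contains more peptides than either search alone, because each DB contributes identifications the other misses.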

