The impact of sequence database choice on metaproteomic results in gut microbiota studies


ABSTRACT

Background: Elucidating the role of gut microbiota in physiological and pathological processes has recently emerged as a key research aim in life sciences. In this respect, metaproteomics, the study of the whole protein complement of a microbial community, can provide a unique contribution by revealing which functions are actually being expressed by specific microbial taxa. However, its wide application to gut microbiota research has been hindered by challenges in data analysis, especially related to the choice of the proper sequence databases for protein identification.

Results: Here, we present a systematic investigation of variables concerning database construction and annotation and evaluate their impact on human and mouse gut metaproteomic results. We found that both publicly available and experimental metagenomic databases lead to the identification of unique peptide assortments, suggesting parallel database searches as a means to gain more complete information. In particular, the contribution of experimental metagenomic databases proved to be essential when dealing with mouse samples. Moreover, the use of a “merged” database, containing all metagenomic sequences from the population under study, was found to be generally preferable to the use of sample-matched databases. We also observed that taxonomic and functional results are strongly database-dependent, in particular when analyzing the mouse gut microbiota. As a striking example, the Firmicutes/Bacteroidetes ratio varied up to tenfold depending on the database used. Finally, assembling reads into longer contigs provided significant advantages in terms of functional annotation yields.

Conclusions: This study helps to identify host- and database-specific biases which need to be taken into account in a metaproteomic experiment, providing meaningful insights on how to design gut microbiota studies and to perform metaproteomic data analysis. In particular, the use of multiple databases and annotation tools should be encouraged, even though this requires appropriate bioinformatic resources.

Electronic supplementary material: The online version of this article (doi:10.1186/s40168-016-0196-8) contains supplementary material, which is available to authorized users.



Fig. 4: Taxonomic annotation of human and mouse gut metaproteomic data obtained using different databases. DBs were made up of reads (dark blue, R-A) or contigs (red, C-A) from gut metagenomes of all subjects analyzed for each host species, or all bacterial sequences deposited in UniProt (turquoise, UP-B). The annotation tools used were MEGAN (MEG) and Unipept (Up). (a) Histograms showing the mean number (N = 3, with error bar indicating standard error of the mean) of non-redundant peptides identified (tot) and annotated at different taxonomic levels (p phylum; c class; o order; f family; g genus; s species) in human (left) and mouse (right) samples. (b) PCA plots of taxa abundances, with each dot indicating a different human (left) or mouse (right) subject.

As a further step, we performed taxonomic annotation by means of two established tools which employ a lowest common ancestor (LCA) algorithm: first, MEGAN was used to carry out LCA classification of read, contig, and protein sequences (identified from each respective DB) based on results of sequence alignment versus NCBI-nr entries; second, Unipept was employed to perform LCA classification of tryptic peptide sequences (obtained from each DB search) based on full homology with UniProt entries. As shown in Fig. 4a, UP-B generally provided a poorer annotation performance compared to MG-DBs, with small differences in humans and greater disparity in mice (especially down to the order level). Moreover, when focusing on MG-DBs, a global decrease in peptide annotation yield was found in mouse when compared to human samples. Concerning the annotation tools, MEGAN reached higher taxonomic annotation yields compared to Unipept, for both host species and all DB types. A strong impact of sequence length could also be observed for MEGAN, with the contig-based DBs (average sequence length of 209 and 344 amino acids in humans and mice, respectively) significantly outperforming the read-based DBs (average sequence length of 48 and 45 amino acids in humans and mice, respectively). Alpha-diversity calculation also revealed DB- and annotation tool-dependent biases (Additional file 4: Figure S4). MEGAN classification considerably enhanced the divergence between human (higher) and mouse (lower) gut metaproteome alpha-diversity, and generally higher alpha-diversity values were retrieved from Unipept data compared to MEGAN results. Even more intriguingly, according to the principal component analysis (PCA) of taxa abundances, human and mouse data clustered in a completely different fashion, namely according to individuals and to DBs, respectively (Fig. 4b). This clearly indicates a strong DB-dependent bias in taxonomic annotation of mouse gut metaproteomic data.
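The lowest common ancestor idea that both annotation tools rely on can be illustrated with a minimal Python sketch. It is not the MEGAN or Unipept implementation, and it omits tool-specific settings such as score filters or minimum support. The function names, the peptide labels, and the hand-written lineages are all hypothetical, chosen only to show how a peptide matching several taxa ends up annotated at the deepest rank those taxa share.

```python
from typing import Dict, List, Optional, Sequence

# Ranks from most general to most specific, as listed in the Fig. 4a legend.
RANKS = ["phylum", "class", "order", "family", "genus", "species"]

# A lineage holds one taxon name per rank, in RANKS order.
Lineage = Sequence[str]


def lowest_common_ancestor(lineages: List[Lineage]) -> List[str]:
    """Return the longest lineage prefix shared by every matched lineage."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip(*lineages) walks the ranks in parallel, from phylum downwards.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break  # the hits disagree at this rank, so stop here
        shared.append(taxa_at_rank[0])
    return shared


def deepest_rank(peptide_hits: Dict[str, List[Lineage]]) -> Dict[str, Optional[str]]:
    """Map each peptide to the deepest rank at which its LCA is defined."""
    result: Dict[str, Optional[str]] = {}
    for peptide, lineages in peptide_hits.items():
        lca = lowest_common_ancestor(lineages)
        result[peptide] = RANKS[len(lca) - 1] if lca else None
    return result


# Illustrative lineages only (assumed for this example, not taken from the study).
B_FRAGILIS = ["Bacteroidetes", "Bacteroidia", "Bacteroidales",
              "Bacteroidaceae", "Bacteroides", "Bacteroides fragilis"]
B_THETA = ["Bacteroidetes", "Bacteroidia", "Bacteroidales",
           "Bacteroidaceae", "Bacteroides", "Bacteroides thetaiotaomicron"]
L_ACIDOPHILUS = ["Firmicutes", "Bacilli", "Lactobacillales",
                 "Lactobacillaceae", "Lactobacillus", "Lactobacillus acidophilus"]

# Hypothetical peptide labels mapped to the lineages of their database matches.
hits = {
    "peptide_1": [B_FRAGILIS],                 # unique to one species
    "peptide_2": [B_FRAGILIS, B_THETA],        # shared within a genus
    "peptide_3": [B_FRAGILIS, L_ACIDOPHILUS],  # shared across phyla
}

annotation = deepest_rank(hits)
for peptide, rank in annotation.items():
    print(peptide, rank)
```

In this toy setting, a peptide shared across phyla receives no annotation at any of the six ranks, which is one way a conserved peptide can lower the annotation yield at the deeper taxonomic levels.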

