PaxDb, a database of protein abundance averages across all three domains of life.

Wang M, Weiss M, Simonovic M, Haertinger G, Schrimpf SP, Hengartner MO, von Mering C - Mol. Cell Proteomics (2012)

Bottom Line: Although protein expression is regulated both temporally and spatially, most proteins have an intrinsic, "typical" range of functionally effective abundance levels. Publicly available experimental data are mapped onto a common namespace and, in the case of tandem mass spectrometry data, re-processed using a standardized spectral counting pipeline. We score and rank each contributing, individual data set by assessing its consistency against externally provided protein-network information, and demonstrate that our weighted integration exhibits more consistency than the data sets individually.


Affiliation: Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland.

ABSTRACT
Although protein expression is regulated both temporally and spatially, most proteins have an intrinsic, "typical" range of functionally effective abundance levels. These extend from a few molecules per cell for signaling proteins, to millions of molecules for structural proteins. When addressing fundamental questions related to protein evolution, translation and folding, but also in routine laboratory work, a simple rough estimate of the average wild type abundance of each detectable protein in an organism is often desirable. Here, we introduce a meta-resource dedicated to integrating information on absolute protein abundance levels; we place particular emphasis on deep coverage, consistent post-processing and comparability across different organisms. Publicly available experimental data are mapped onto a common namespace and, in the case of tandem mass spectrometry data, re-processed using a standardized spectral counting pipeline. By aggregating and averaging over the various samples, conditions and cell-types, the resulting integrated data set achieves increased coverage and a high dynamic range. We score and rank each contributing, individual data set by assessing its consistency against externally provided protein-network information, and demonstrate that our weighted integration exhibits more consistency than the data sets individually. The current PaxDb-release 2.1 (at http://pax-db.org/) presents whole-organism data as well as tissue-resolved data, and covers 85,000 proteins in 12 model organisms. All values can be seamlessly compared across organisms via pre-computed orthology relationships.


Figure 2: Data set quality. The quality scoring system in PaxDb is based on the assumption that interacting proteins should have roughly similar abundances. A, left: two examples of protein complexes in S. cerevisiae, both involved in replication but with different abundance levels. Right: abundance ratios of all interacting protein pairs, plotted as histograms. Shuffled data (in red) shows a shift toward higher ratios compared with actual data (in blue). B, For two different data sets, both the median of the actual ratios (in blue) and the distribution of medians obtained from 500 × shuffling of abundances (in red) are shown. The PaxDb score corresponds to the z-transformed distance. The same two data sets are plotted against RNAseq data below. C, The consolidated data set, with a higher PaxDb score, also correlates better with mRNA abundances, and covers more proteins. Yeast mRNA quantification data is from ref (62).
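The shuffling procedure described in panel B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PaxDb implementation: the function names, the max/min definition of the abundance ratio, the sign convention (higher score means more consistent) and the input shapes (a dict of abundances and a list of interacting pairs) are all hypothetical. Only the 500 shuffles and the z-transformed distance of the observed median from the shuffled medians come from the caption.

```python
import random
import statistics


def median_pair_ratio(abundance, pairs):
    """Median abundance ratio (larger / smaller) over interacting pairs.

    `abundance` maps protein id -> positive abundance value;
    `pairs` is an iterable of (protein_a, protein_b) tuples.
    Pairs with a missing protein are skipped.
    """
    ratios = []
    for a, b in pairs:
        if a in abundance and b in abundance:
            hi = max(abundance[a], abundance[b])
            lo = min(abundance[a], abundance[b])
            ratios.append(hi / lo)
    if not ratios:
        raise ValueError("no interacting pair has abundances for both partners")
    return statistics.median(ratios)


def interaction_consistency_score(abundance, pairs, n_shuffles=500, seed=0):
    """Z-transformed distance of the observed median ratio from shuffled data.

    Abundance values are randomly reassigned among proteins `n_shuffles`
    times; the score is (mean of shuffled medians - observed median)
    divided by the standard deviation of the shuffled medians.
    """
    rng = random.Random(seed)
    observed = median_pair_ratio(abundance, pairs)

    proteins = list(abundance)
    values = [abundance[p] for p in proteins]
    shuffled_medians = []
    for _ in range(n_shuffles):
        rng.shuffle(values)  # break the link between protein and abundance
        shuffled = dict(zip(proteins, values))
        shuffled_medians.append(median_pair_ratio(shuffled, pairs))

    mu = statistics.mean(shuffled_medians)
    sigma = statistics.stdev(shuffled_medians)
    if sigma == 0:
        raise ValueError("shuffled medians have zero variance")
    return (mu - observed) / sigma
```

With these conventions, a data set whose interacting proteins have more similar abundances than expected by chance receives a positive score, and the score grows as the observed median ratio moves further below the shuffled distribution.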

Mentions: Because gold-standard benchmark/reference information on protein abundance is often not available, gauging the quality of the individual data sets in PaxDb is far from trivial. Here, we employ two distinct, indirect strategies for obtaining a rough estimate of data quality: one is based on protein-protein interaction information, and the other is based on abundance measurements of mRNA molecules. The use of protein-protein interaction data is based on the assumption that interacting proteins should on average have roughly similar steady-state abundances (see also above, “Experimental Procedures”). As shown in Fig. 2, this assumption is an oversimplifying approximation at best: the median abundance ratio of two interacting proteins currently stands at about 3:1 in yeast. Indeed, interacting proteins do not necessarily have to be expressed at similar levels, especially in the case of transient or regulated interactions. However, the observed 3:1 ratio is much lower than the abundance ratio of 10:1 that can be expected by chance (assessed by repeatedly shuffling the abundance data); hence, it does provide an intrinsic quality estimator. We express this measure as a Z-score distance from the random distribution, and designate it as the “interaction consistency score.” Likewise, the use of mRNA abundances as a quality measure is also based on a simplifying assumption: that the average steady-state abundance of an mRNA species should be a rough predictor of the steady-state abundance of its encoded protein. Of course, this does not necessarily have to hold for any given transcript/protein pair, but the overall correlation has been shown to be highly significant, and to have improved recently as measurement accuracies have improved (60).
Similar to the interaction test described above, this mRNA correlation test has the advantage of including the majority of proteins in a sample, and of sharing very few systematic technical biases with protein quantifications in general. Using yeast as an example, Fig. 2 shows that our scoring indeed reveals strong differences between the various protein abundance data sets (see also supplemental Figs. S1 and S2). Importantly, the data consolidation performed at PaxDb (i.e. a weighted average over the contributing data sets) does score highest on both measures, achieving an interaction consistency score of about 23.8 and a Spearman rank correlation against mRNA of about 0.64 in yeast (Fig. 2, supplemental Fig. S3). Relative to each other, both of our consistency measures exhibit a reasonably good correlation (RS = 0.74, p value < 0.0005, supplemental Fig. S4A), and thus allow us to provide at least an initial, rough estimate of data quality/consistency.
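The mRNA-based measure is a Spearman rank correlation between protein and transcript abundances. The sketch below shows one way to compute it with the Python standard library only. The function names and the dict-based inputs (gene id mapped to abundance) are assumptions made for illustration, and the text does not describe how PaxDb handles ties or unmatched genes; here ties receive average ranks and only genes present in both data sets are compared.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_protein_vs_mrna(protein, mrna):
    """Spearman rank correlation over genes present in both dicts."""
    shared = sorted(set(protein) & set(mrna))
    if len(shared) < 2:
        raise ValueError("need at least two shared genes")

    rx = average_ranks([protein[g] for g in shared])
    ry = average_ranks([mrna[g] for g in shared])

    # Pearson correlation computed on the ranks.
    mx = sum(rx) / len(rx)
    my = sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("one of the inputs is constant")
    return cov / math.sqrt(vx * vy)
```

Because the correlation is computed on ranks, it is insensitive to the different units and dynamic ranges of protein and mRNA quantifications, which is what makes it usable as a rough cross-technology consistency check.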

