Comprehensive comparison of large-scale tissue expression datasets.

Santos A, Tsafou K, Stolte C, Pletscher-Frankild S, O'Donoghue SI, Jensen LJ - PeerJ (2015)

Bottom Line: Several high-throughput technologies have been used to map out which proteins are expressed in which tissues; however, the data have not previously been systematically compared and integrated. We present a comprehensive evaluation of tissue expression data from a variety of experimental techniques and show that these agree surprisingly well with each other and with results from literature curation and text mining. We further found that most datasets support the assumed but not demonstrated distinction between tissue-specific and ubiquitous expression.


Affiliation: Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.

ABSTRACT
For tissues to carry out their functions, they rely on the right proteins to be present. Several high-throughput technologies have been used to map out which proteins are expressed in which tissues; however, the data have not previously been systematically compared and integrated. We present a comprehensive evaluation of tissue expression data from a variety of experimental techniques and show that these agree surprisingly well with each other and with results from literature curation and text mining. We further found that most datasets support the assumed but not demonstrated distinction between tissue-specific and ubiquitous expression. By developing comparable confidence scores for all types of evidence, we show that it is possible to improve both quality and coverage by combining the datasets. To facilitate use and visualization of our work, we have developed the TISSUES resource (http://tissues.jensenlab.org), which makes all the scored and integrated data available through a single user-friendly web interface.




Figure 4: Quality of the transcriptome datasets. (A) To assess the correlation between expression level and confidence, we compared the transcriptome datasets to a gold standard, namely UniProtKB. We quantified the quality of each dataset in terms of its fold enrichment for correct gene–tissue associations compared to random chance. The comparison shows that higher expression values imply higher quality and that the three confidence cutoffs (vertical dotted lines) used correspond to equivalent quality in all datasets. (B) The distribution of expression breadth for UniProtKB is strongly skewed towards tissue-specific proteins, contrary to what was seen for transcriptome datasets. (C) We thus constructed a consensus mRNA reference set; its expression breadth distribution is in line with that of the individual mRNA datasets. (D) The mRNA reference set is highly complementary to the UniProtKB gold standard, providing 7,384 gene–tissue associations that are not in the latter.
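The expression breadth distributions in panels B and C can be illustrated with a minimal sketch. It assumes that expression breadth means the number of distinct tissues in which a gene is reported as expressed; that definition, the function names, and the toy gene and tissue labels are all illustrative assumptions, not taken from the paper.

```python
from collections import Counter, defaultdict


def expression_breadth(pairs):
    """Map each gene to the number of distinct tissues it is associated with.

    `pairs` is an iterable of (gene, tissue) tuples; duplicates are ignored.
    """
    tissues_per_gene = defaultdict(set)
    for gene, tissue in pairs:
        tissues_per_gene[gene].add(tissue)
    return {gene: len(tissues) for gene, tissues in tissues_per_gene.items()}


def breadth_distribution(pairs):
    """Count how many genes have each breadth value (1 tissue, 2 tissues, ...)."""
    return Counter(expression_breadth(pairs).values())


# Hypothetical toy associations, for illustration only.
toy_pairs = [
    ("g1", "liver"), ("g1", "brain"), ("g1", "heart"),
    ("g2", "liver"),
    ("g3", "brain"), ("g3", "brain"),  # duplicate pair counts once
]
toy_breadth = expression_breadth(toy_pairs)
toy_distribution = breadth_distribution(toy_pairs)
```

In this toy input two genes are seen in a single tissue and one gene in three. A distribution dominated by breadth 1 would correspond to the skew towards tissue-specific proteins described for UniProtKB.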

Mentions: The comparison showed that fold enrichment for gold-standard associations increased steadily with expression value in all datasets (Fig. 4A). This was expected because, in general, the more abundant a transcript, the more reliably it can be identified. Moreover, we found that the low-, medium-, and high-confidence cutoffs used in the preceding analyses correspond to the same quality in all datasets. However, a dataset of lower quality will give fewer associations at any given confidence cutoff.
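The relationship between expression cutoff and fold enrichment can be sketched as follows. This is one plausible formulation, not necessarily the exact calculation used in the paper: it assumes fold enrichment is the fraction of retained gene–tissue pairs found in the gold standard, divided by the fraction expected if pairs were drawn at random from all gene–tissue combinations under consideration. The function names and toy values are hypothetical.

```python
from itertools import product


def fold_enrichment(predicted, gold, genes, tissues):
    """Observed gold-standard hit rate among `predicted` pairs over the chance rate.

    Assumption: the chance rate is the share of gold-standard pairs among
    all possible (gene, tissue) combinations of `genes` and `tissues`.
    """
    universe = set(product(genes, tissues))
    predicted = set(predicted) & universe
    gold = set(gold) & universe
    if not predicted or not gold:
        return float("nan")
    observed = len(predicted & gold) / len(predicted)
    expected = len(gold) / len(universe)
    return observed / expected


def enrichment_by_cutoff(scored_pairs, gold, genes, tissues, cutoffs):
    """Fold enrichment of the pairs whose expression value is >= each cutoff.

    `scored_pairs` maps (gene, tissue) -> expression value.
    """
    result = {}
    for cutoff in cutoffs:
        kept = [pair for pair, value in scored_pairs.items() if value >= cutoff]
        result[cutoff] = fold_enrichment(kept, gold, genes, tissues)
    return result


# Hypothetical toy data, for illustration only.
toy_genes = ["g1", "g2", "g3", "g4"]
toy_tissues = ["liver", "brain"]
toy_gold = {("g1", "liver"), ("g2", "brain")}
toy_scores = {
    ("g1", "liver"): 10.0,
    ("g2", "brain"): 5.0,
    ("g3", "liver"): 5.0,
    ("g4", "brain"): 1.0,
}
toy_result = enrichment_by_cutoff(
    toy_scores, toy_gold, toy_genes, toy_tissues, cutoffs=[1.0, 5.0, 10.0]
)
```

With these toy values the enrichment rises as the cutoff becomes stricter while the number of retained pairs falls from four to one. That is the same trade-off between quality and number of associations described above.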

