Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae.

Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hon GC, Myers CL, Parsons A, Friesen H, Oughtred R, Tong A, Stark C, Ho Y, Botstein D, Andrews B, Boone C, Troyanskya OG, Ideker T, Dolinski K, Batada NN, Tyers M - J. Biol. (2006)

Bottom Line: Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. We show that the LC dataset considerably improves the predictive power of network-analysis approaches.


Affiliation: Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto ON M5G 1X5, Canada. teresa.reguly@utoronto.ca

ABSTRACT

Background: The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference.

Results: We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID (http://www.thebiogrid.org) and SGD (http://www.yeastgenome.org/) databases.

Conclusion: Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks.

Figure 9: The LC dataset augments functional predictions. (a) Evaluation of curated literature against GO biological process as a standard. Comparisons of enrichment for functional relationships in LC dataset versus a variety of HTP datasets as scored against GO biological process are shown as the individual data points. The effect of the LC dataset on the predictive power of a Bayesian heterogeneous integration scheme [28] is shown by the curves. FN, false negatives; FP, false positives; TP, true positives. (b) Comparison of functional diversity in LC versus a variety of HTP datasets. The number of distinct functional groups (GO biological process terms) spanned by the LC dataset at decreasing levels of precision and recall. One hundred and forty-six independent GO terms were tested, all with fewer than 300 total annotations. A minimum F-score threshold (harmonic mean of precision and recall) was plotted against the number of GO terms needed to achieve that threshold for each of the data types.

Mentions: Many different approaches have been devised to improve the power of large-scale datasets to predict gene or protein functions, including simple combinations of different datasets, Bayesian integration of multiple data sources, and inherent network properties of true versus false interactions [3,14,26-30]. To assess the capability of the LC dataset to assign new gene functions, we first evaluated the enrichment of known functional relationships in LC-PI pairs by comparing them with GO process annotations. We compared the LC pairs relative to a variety of HTP genomic data on the basis of both precision (that is, proportion of results known to be true positives) and recall (that is, proportion of known positives identified). The LC-PI dataset returned approximately 70% precision on about 14,000 pairs, as compared with 50% precision on 2,500 pairs for the HTP-PI dataset and 70% precision on only 800 true positive pairs for coexpression datasets (Figure 9a).
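
The precision and recall definitions above, together with the F-score named in the Figure 9 caption (the harmonic mean of precision and recall), can be illustrated with a minimal sketch. This is not the authors' evaluation pipeline: the function name `precision_recall_f`, the single-letter gene labels, and the toy pair lists are all hypothetical, and the sketch assumes that any predicted pair absent from the reference set counts as a false positive.

```python
def precision_recall_f(predicted_pairs, gold_pairs):
    """Score predicted gene pairs against a reference set of known pairs.

    Pairs are treated as unordered, so ("A", "B") equals ("B", "A").
    Assumption: a predicted pair missing from the reference set is a
    false positive (a simplification for illustration only).
    """
    predicted = {frozenset(p) for p in predicted_pairs}
    gold = {frozenset(p) for p in gold_pairs}

    tp = len(predicted & gold)   # true positives: predicted and known
    fp = len(predicted - gold)   # false positives: predicted, not known
    fn = len(gold - predicted)   # false negatives: known, not predicted

    # Precision: proportion of results known to be true positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: proportion of known positives identified.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score: harmonic mean of precision and recall.
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy data with hypothetical gene labels (not taken from the LC dataset).
toy_predicted = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
toy_gold = [("B", "A"), ("C", "B"), ("E", "F")]

p, r, f = precision_recall_f(toy_predicted, toy_gold)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In this toy case two of the four predicted pairs are in the reference set, and one reference pair is missed. Sweeping a confidence threshold over scored pairs and recomputing these values at each step would trace out a precision-recall curve of the kind the caption describes.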

