Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae.

Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hon GC, Myers CL, Parsons A, Friesen H, Oughtred R, Tong A, Stark C, Ho Y, Botstein D, Andrews B, Boone C, Troyanskya OG, Ideker T, Dolinski K, Batada NN, Tyers M - J. Biol. (2006)

Bottom Line: Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. We show that the LC dataset considerably improves the predictive power of network-analysis approaches.


Affiliation: Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto ON M5G 1X5, Canada. teresa.reguly@utoronto.ca

ABSTRACT

Background: The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference.

Results: We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID (http://www.thebiogrid.org) and SGD (http://www.yeastgenome.org/) databases.

Conclusion: Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks.
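
The coverage figure quoted in the Results (HTP protein-interaction datasets recovering roughly 14% of literature interactions) can be read as the fraction of reference interactions that also appear in a second dataset. The sketch below shows one way such a fraction might be computed. It assumes that interactions are undirected gene pairs and that duplicates are collapsed; the function names and the toy pairs are illustrative only and are not taken from the LC or HTP datasets.

```python
def normalize(pairs):
    """Collapse (A, B) and (B, A) into one undirected edge and drop duplicates."""
    return {frozenset(pair) for pair in pairs}


def coverage(reference_pairs, query_pairs):
    """Fraction of reference interactions that also occur in the query dataset."""
    reference = normalize(reference_pairs)
    query = normalize(query_pairs)
    if not reference:
        raise ValueError("reference dataset is empty")
    return len(reference & query) / len(reference)


# Toy data (assumption): four distinct literature pairs, one of them listed twice.
lc_pairs = [
    ("CDC28", "CLN2"),
    ("CLN2", "CDC28"),  # same interaction, reversed order
    ("CDC28", "CLB2"),
    ("SIC1", "CDC28"),
    ("ACT1", "PFY1"),
]
htp_pairs = [("CLN2", "CDC28"), ("ACT1", "MYO2")]

lc_coverage = coverage(lc_pairs, htp_pairs)
print(f"HTP coverage of LC interactions: {lc_coverage:.0%}")
```

Normalizing to unordered pairs before intersecting matters here, because the same interaction is often reported with its partners in either order.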


Figure 3: Distribution of GO terms for genes or proteins involved in genetic and physical interactions compared with genome-wide distribution. (a) Distribution of indicated GO cellular component, molecular function and biological process terms for nodes in each dataset. Sce refers to the distribution for all genes or proteins. (b) Fraction of interactions that share common GO terms in each of the three GO categories. High-level GO annotations (GO-Slim) were obtained from the SGD. The mean shared annotation is significantly higher for LC-PI than for HTP-PI for each of the three categories (Fisher's exact test, P < 1 × 10⁻¹⁰).

Mentions: To determine how closely protein and genetic interaction pairs match existing GO descriptors of gene or protein function, we assessed high-level GO terms represented within different interaction datasets. The distribution of GO component, GO function and GO process categories for each dataset was determined and compared with the total distribution for all yeast genes (Figure 3a). Given that the GO annotation for S. cerevisiae is derived from the primary literature [47], it was not surprising that the LC-PI and LC-GI datasets showed a similar distribution across GO categories and terms, including under-representation for the term 'unknown' in each of the three GO categories. In contrast, the HTP-PI and HTP-GI datasets contained more genes designated as 'unknown', and a corresponding depletion in known categories. Certain specific GO categories were favored in the LC datasets, accompanied by concordance in the rank order of GO function or process terms between the LC-PI and LC-GI datasets, probably because of inherent bias in the literature towards subfields of biology (see also Additional data file 3).
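
The comparison in Figure 3b (fraction of interactions whose two partners share a GO term, compared between datasets with Fisher's exact test) can be illustrated with a small sketch. The exact contingency table used by the authors is not given here, so the code assumes a 2×2 table of shared versus not-shared interactions for two datasets. The annotations and pairs are toy values, not real GO-Slim data, and the test is implemented directly from the hypergeometric distribution rather than with any particular statistics library.

```python
from math import comb


def shares_term(go, a, b):
    """True if genes a and b have at least one annotation term in common."""
    return bool(go.get(a, set()) & go.get(b, set()))


def shared_counts(pairs, go):
    """Return (number of pairs sharing a term, number of pairs sharing none)."""
    shared = sum(1 for a, b in pairs if shares_term(go, a, b))
    return shared, len(pairs) - shared


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    p_observed = prob(a)
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, total)


# Toy annotations (assumption): simplified stand-ins for GO-Slim process terms.
go = {
    "CDC28": {"cell cycle"},
    "CLN2": {"cell cycle"},
    "CLB2": {"cell cycle"},
    "ACT1": {"cytoskeleton"},
    "PFY1": {"cytoskeleton"},
    "MYO2": {"cytoskeleton", "transport"},
}
lc_pairs = [("CDC28", "CLN2"), ("CDC28", "CLB2"), ("ACT1", "PFY1"), ("ACT1", "CDC28")]
htp_pairs = [("CDC28", "ACT1"), ("CLN2", "PFY1"), ("CLB2", "MYO2"), ("ACT1", "MYO2")]

lc_shared, lc_not = shared_counts(lc_pairs, go)
htp_shared, htp_not = shared_counts(htp_pairs, go)
p_value = fisher_exact_two_sided(lc_shared, lc_not, htp_shared, htp_not)
print(f"LC shared fraction:  {lc_shared / len(lc_pairs):.2f}")
print(f"HTP shared fraction: {htp_shared / len(htp_pairs):.2f}")
print(f"Fisher's exact test P = {p_value:.4f}")
```

With only four pairs per dataset the difference is far from significant. The very small P value reported in the caption reflects the thousands of interactions in the real LC-PI and HTP-PI datasets.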

