Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae.

Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hon GC, Myers CL, Parsons A, Friesen H, Oughtred R, Tong A, Stark C, Ho Y, Botstein D, Andrews B, Boone C, Troyanskya OG, Ideker T, Dolinski K, Batada NN, Tyers M - J. Biol. (2006)

Bottom Line: Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. We show that the LC dataset considerably improves the predictive power of network-analysis approaches.


Affiliation: Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto ON M5G 1X5, Canada. teresa.reguly@utoronto.ca

ABSTRACT

Background: The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference.

Results: We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID (http://www.thebiogrid.org) and SGD (http://www.yeastgenome.org/) databases.

Conclusion: Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks.


Figure 2: Validation of interactions within interaction datasets. (a) The fraction of interactions in each dataset supported by multiple validations (that is, different publications or types of experimental evidence). (b) The fraction of interactions in each indicated dataset supported by more than one publication or type of experimental evidence. (c) Better studied proteins or genes, as defined by the number of supporting publications relative to node connectivity (designated bias, see Materials and methods), tend to be more highly connected within the physical or genetic networks. (d) The study bias towards essential genes in each dataset. (e) The distribution of conserved proteins in interaction datasets. Frequency refers to fraction of the dataset in each bin. Orthologous eukaryotic clusters for seven standard species (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Encephalitozoon cuniculi) were obtained from the COG database [96]. Sc refers to all budding yeast proteins as a reference dataset; non-LC refers to all HTP interactions except those that overlap with the LC datasets; X refers to yeast genes that were not assigned to any of the COG clusters and contains yeast-specific genes in addition to genes that have orthologs in only one of the other six species.

Mentions: As all types of experimental evidence for each interaction were culled from each publication, it was possible to estimate the extent to which interactions in each dataset were overtly validated, whether by more than one experimental method, by multiple publications, or both. Even in the LC-PI and LC-GI datasets, most interactions were directly documented only once, with 33% and 20% of interactions in each respective dataset being reproduced by at least two publications or experimental methods (Figure 2a,b). Only a small fraction of any dataset was validated more than once (Figure 2a). These estimates of re-coverage are inherently conservative because of the minimal spoke representation used for each complex. Of particular importance, interactions that are well established in an initial publication are unlikely to be directly repeated by subsequent publications that build on the same line of enquiry.
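
The validation estimate described above can be illustrated with a minimal sketch. It assumes each evidence record is a tuple of two gene identifiers, a publication identifier, and an experimental method. That record layout, the function name, and the toy records are hypothetical; they are not the authors' pipeline or the BioGRID file format. An interaction counts as validated when it is backed by at least two distinct publications or at least two distinct methods.

```python
from collections import defaultdict


def multiply_validated_fraction(evidence_records):
    """Fraction of distinct interactions backed by >= 2 publications or >= 2 methods.

    evidence_records: iterable of (gene_a, gene_b, publication_id, method) tuples.
    This layout is an assumption made for illustration only.
    """
    publications = defaultdict(set)
    methods = defaultdict(set)
    for gene_a, gene_b, publication_id, method in evidence_records:
        # Treat interactions as undirected: (A, B) and (B, A) are the same pair.
        pair = frozenset((gene_a, gene_b))
        publications[pair].add(publication_id)
        methods[pair].add(method)

    if not publications:
        return 0.0

    validated = sum(
        1
        for pair in publications
        if len(publications[pair]) >= 2 or len(methods[pair]) >= 2
    )
    return validated / len(publications)


# Toy records with placeholder identifiers (not real data).
records = [
    ("A", "B", 1, "two-hybrid"),
    ("B", "A", 2, "two-hybrid"),        # same pair, second publication
    ("A", "C", 3, "affinity capture"),  # documented once
    ("C", "D", 4, "two-hybrid"),
    ("C", "D", 4, "affinity capture"),  # same publication, second method
    ("D", "E", 5, "two-hybrid"),        # documented once
]

fraction = multiply_validated_fraction(records)
print(f"{fraction:.0%} of interactions are supported more than once")
```

In this toy set, two of the four distinct pairs meet the criterion. Counting distinct pairs rather than raw records keeps a heavily re-reported interaction from inflating the estimate.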

