Superfamily assignments for the yeast proteome through integration of structure prediction with the gene ontology.

Malmström L, Riffle M, Strauss CE, Chivian D, Davis TN, Bonneau R, Baker D - PLoS Biol. (2007)

Bottom Line: Saccharomyces cerevisiae is one of the best-studied model organisms, yet the three-dimensional structure and molecular function of many yeast proteins remain unknown. This structural data was integrated with process, component, and function annotations from the Saccharomyces Genome Database to assign yeast protein domains to SCOP superfamilies using a simple Bayesian approach. We have predicted the structure of 3,338 putative domains and assigned SCOP superfamily annotations to 581 of them.


Affiliation: Department of Biochemistry, University of Washington, Seattle, Washington, United States of America.

ABSTRACT
Saccharomyces cerevisiae is one of the best-studied model organisms, yet the three-dimensional structure and molecular function of many yeast proteins remain unknown. Yeast proteins were parsed into 14,934 domains, and those lacking sequence similarity to proteins of known structure were folded using the Rosetta de novo structure prediction method on the World Community Grid. This structural data was integrated with process, component, and function annotations from the Saccharomyces Genome Database to assign yeast protein domains to SCOP superfamilies using a simple Bayesian approach. We have predicted the structure of 3,338 putative domains and assigned SCOP superfamily annotations to 581 of them. We have also assigned structural annotations to 7,094 predicted domains based on fold recognition and homology modeling methods. The domain predictions and structural information are available in an online database at http://rd.plos.org/10.1371_journal.pbio.0050076_01.
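
The abstract describes combining structure-prediction evidence with GO annotations through a simple Bayesian approach. The following is a minimal sketch of that general idea only, not the formula used in the paper: it assumes a structure-based probability for each candidate superfamily and a likelihood of the observed annotation given each superfamily, treats the two sources as conditionally independent, and normalises their product. The function name, the dictionaries, and every number are hypothetical; the superfamily identifiers are borrowed from the Figure 4 caption purely as labels.

```python
def combine_with_go(structure_prob, go_likelihood):
    """Return a normalised posterior over candidate superfamilies.

    structure_prob: dict superfamily -> structure-based probability (hypothetical)
    go_likelihood:  dict superfamily -> likelihood of the observed GO
                    annotation given that superfamily (hypothetical)
    """
    # Bayes' rule up to a constant: posterior is proportional to prior * likelihood.
    unnormalised = {
        sf: p * go_likelihood.get(sf, 0.0) for sf, p in structure_prob.items()
    }
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("no superfamily is supported by both sources of evidence")
    return {sf: value / total for sf, value in unnormalised.items()}


# Made-up inputs for a single domain (illustration only).
structure_prob = {"d.110.4": 0.5, "a.118.1": 0.3, "d.58.48": 0.2}
go_likelihood = {"d.110.4": 0.6, "a.118.1": 0.1, "d.58.48": 0.1}

posterior = combine_with_go(structure_prob, go_likelihood)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

In this toy case the annotation evidence sharpens a moderately favoured structural candidate into a clearly preferred one, which is the kind of effect the integration is meant to achieve.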

Figure 4: Predictions of Domains with Confident SCOP Superfamily Assignment Scores That Were Subsequently Solved. The predicted structure is shown in panel 1 (left), and the native structure or the structure of the homolog is shown in panel 2 (middle). The section matching the domain in the solved structure is colored red in panel 3 (right).
(A) TRS20/YBR254C is involved in endoplasmic reticulum (ER) to Golgi transport and was predicted to belong to the SNARE-like (d.110.4) SCOP superfamily with P(SF/D,GO) = 0.98 (PMCM = 0.69). The Z-score between the predicted structure and the structure of a homolog with 33% sequence identity was 7.76.
(B) ECM15/YBL001C is involved in cell wall organization and biogenesis, and was predicted to belong to the MTH1187/YkoF-like (d.58.48) superfamily with P(SF/D,GO) = 0.87 (PMCM = 0.82; MAMMOTH Z-score to native structure was 8.18).
(C) KAP104/YBR017C is involved in nuclear localization sequence binding and was predicted to belong to the ARM repeat (a.118.1) superfamily with P(SF/D,GO) = 0.99 (PMCM = 0.8; MAMMOTH Z-score to homologous structure [31% sequence identity] was 4.41).
(D) FIS1/YIL065C was predicted to belong to the SCOP superfamily TPR-like (a.118.8) with a PMCM of 0.98. The MAMMOTH Z-score between the predicted structure and the experimental structure was 13.34.

Mentions: True performance of these technologies cannot be assessed on the benchmark dataset because the domain boundaries of this set are perfect (derived from known structures in the ASTRAL database). A subset of the proteins without links to known structure at the start of this project now have strong homology to a structure that has since been solved; see Figure 4 for examples. These recently solved structures give us an opportunity to assess the performance of our technology without bias in the selection of the proteins, with real domain prediction error incorporated, and without contamination of the results by weak homology to known structures. Twenty-seven domains from this project now have a homolog of known structure (see Materials and Methods for homology definition), and three of the 11 predictions with a PMCM of 0.8 or higher are correct. Five of the seven proteins from this set with a P(SF/D,GO) ≥ 0.8 are correct; five of the seven (three of them correct) also have a PMCM of 0.8 or higher. The small size of this dataset makes it difficult to put accurate upper and lower bounds on the estimated error. We have also generated a much larger dataset (see Materials and Methods) as part of the Human Proteome Folding Project (HPF). This dataset, containing proteins from over 150 organisms, was derived without the use of fold recognition (and thus the protocol is not identical to the one used for yeast), but it provides valuable information about the effect of domain prediction on our procedure. In this dataset, 44% of the 207 predictions made for recently solved structures with a PMCM above 0.8 were correct, and 84% of the 51 predictions with a P(SF/D,GO) above 0.8 were correct; 31 predictions had both a PMCM and a P(SF/D,GO) above 0.8, and 27 of these were correct.
Over all three validation sets, more than 40% of the predictions with a PMCM above 0.8, and more than 75% of the predictions with a P(SF/D,GO) above 0.8, are correct, illustrating the value of data integration in this work.
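
The accuracy figures quoted above can be checked with a little arithmetic. The sketch below recomputes the fraction of correct predictions for the two blind sets described in this paragraph. It rests on two assumptions: the HPF counts of correct predictions are reconstructed by rounding the reported percentages (44% of 207, 84% of 51), so they are approximate, and the third validation set (the benchmark) is not included because its counts are not given here. All variable and function names are illustrative.

```python
def approx_count(percent, total):
    """Reconstruct a count from a reported percentage (approximate by design)."""
    return round(percent * total / 100)


# (correct, total) pairs for predictions scoring at or above the 0.8 cutoff.
yeast = {"PMCM": (3, 11), "P(SF/D,GO)": (5, 7)}
hpf = {
    "PMCM": (approx_count(44, 207), 207),        # reconstructed from 44%
    "P(SF/D,GO)": (approx_count(84, 51), 51),    # reconstructed from 84%
}
hpf_both_scores = (27, 31)  # predictions passing both cutoffs in the HPF set


def fraction_correct(correct, total):
    return correct / total


# Pool the two blind sets for each score.
pooled = {}
for score in yeast:
    correct = yeast[score][0] + hpf[score][0]
    total = yeast[score][1] + hpf[score][1]
    pooled[score] = (correct, total)

for score, (correct, total) in pooled.items():
    print(f"{score}: {correct}/{total} = {fraction_correct(correct, total):.1%}")

both_correct, both_total = hpf_both_scores
print(f"both scores (HPF): {both_correct}/{both_total} = "
      f"{fraction_correct(both_correct, both_total):.1%}")
```

Under these assumptions the pooled fractions for the two blind sets come out at roughly 43% for PMCM and 83% for P(SF/D,GO), in line with the "more than 40%" and "more than 75%" stated for all three sets, and requiring both scores raises the HPF figure to about 87%.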

