Data growth and its impact on the SCOP database: new developments.

Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG - Nucleic Acids Res. (2007)

Bottom Line: A new update protocol supports batch classification of new protein structures by their detected relationships at Family and Superfamily levels, in contrast to our previous sequential handling of new structural data by release date. We introduce pre-SCOP, a preview of the SCOP developmental version that enables earlier access to the information on new relationships. We also discuss the impact of worldwide Structural Genomics initiatives, which are producing new protein structures at an increasing rate, on the rates of discovery and growth of protein families and superfamilies.


Affiliation: MRC Centre for Protein Engineering, Hills Road, Cambridge CB2 0QH, UK.

ABSTRACT
The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. The SCOP hierarchy comprises the following levels: Species, Protein, Family, Superfamily, Fold and Class. While keeping the original classification scheme intact, we have changed the production of SCOP in order to cope with a rapid growth of new structural data and to facilitate the discovery of new protein relationships. We describe ongoing developments and new features implemented in SCOP. A new update protocol supports batch classification of new protein structures by their detected relationships at Family and Superfamily levels in contrast to our previous sequential handling of new structural data by release date. We introduce pre-SCOP, a preview of the SCOP developmental version that enables earlier access to the information on new relationships. We also discuss the impact of worldwide Structural Genomics initiatives, which are producing new protein structures at an increasing rate, on the rates of discovery and growth of protein families and superfamilies. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.

Figure 1: Workflow of the SCOP update protocol. The update sequence set of new unclassified structures is derived from the PDB SEQRES records. Disordered regions at the termini are masked. The update sequences are clustered using thresholds of 100% identity and 95% coverage for the inclusion of a protein sequence in a cluster. The resulting clusters are used to select a representative sequence set, which is the primary input to the pre-classification pipeline. The representative cluster set is first compared using BLAST against itself and against a database of non-redundant representative ASTRAL sequences for SCOP domains. This step allows detection of close homologs, usually members of the same SCOP family. Representative sequences without a significant sequence match (E-value threshold of 0.001) are then used for two-step PSI-BLAST searches. In the first step, a position-specific scoring matrix (PSSM) is generated by searching the NCBI non-redundant protein database. The resulting PSSM is saved after ten PSI-BLAST iterations, or fewer if the program converges. In the second step, each saved PSSM is used to scan databases of representative ASTRAL and update sequences. In addition, the representative cluster set of unclassified proteins is submitted for an RPS-BLAST search against a database of Pfam profiles. The resulting matches are then compared with the matches from pre-computed large-scale comparisons of SCOP domains and Pfam families. A provisional SCOP classification assignment is made for those proteins with a matching region in Pfam that has given a hit to a SCOP domain. The results of both RPS-BLAST and PSI-BLAST are used to identify relationships between more distant homologs that are likely to be members of the same SCOP superfamily. Update proteins that are identical or nearly identical to domains classified in the current SCOP release or in the SCOP developmental version are classified automatically. The remaining proteins, with and without provisional classification, are curated manually.
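
The clustering and routing steps in the caption can be illustrated with a minimal Python sketch. It is not the SCOP pipeline itself. It assumes that "100% identity and 95% coverage" means a sequence joins a cluster when it is an exact substring of the cluster representative and spans at least 95% of the representative's length, and that the longest member is taken as the representative. The function names `cluster_sequences` and `needs_profile_search` are hypothetical.

```python
from typing import Dict, List, Optional


def cluster_sequences(seqs: Dict[str, str],
                      min_coverage: float = 0.95) -> Dict[str, List[str]]:
    """Greedy clustering keyed by representative ID (assumed scheme).

    A sequence joins an existing cluster only if it is an exact substring
    of the representative (100% identity over the aligned region) and
    covers at least `min_coverage` of the representative's length.
    """
    clusters: Dict[str, List[str]] = {}
    # Process longest sequences first so each representative is the
    # longest member of its cluster; ties are broken by ID.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        for rep_id in clusters:
            rep_seq = seqs[rep_id]
            if seq in rep_seq and len(seq) / len(rep_seq) >= min_coverage:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No representative matched, so this sequence starts a cluster.
            clusters[seq_id] = [seq_id]
    return clusters


def needs_profile_search(best_evalue: Optional[float],
                         threshold: float = 1e-3) -> bool:
    """True if a representative has no significant BLAST match and
    should go on to the two-step PSI-BLAST search."""
    return best_evalue is None or best_evalue > threshold


base = "ACDEFGHIKLMNPQRSTVWY" * 5          # toy 100-residue sequence
example = {
    "a": base,
    "b": base[2:98],                        # 96% coverage, identical
    "c": base[:50],                         # identical but only 50% coverage
    "d": base[:99] + "A",                   # one substitution
}
clusters = cluster_sequences(example)
```

In this toy input, only `b` is merged into the cluster represented by `a`; `c` fails the coverage threshold and `d` fails the identity requirement, so each becomes its own representative. A real implementation would rely on an alignment tool rather than substring matching.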

Mentions: The workflow of the update protocol is shown in detail in Figure 1. The BLAST search allows detection of close homologs, which usually belong to the same SCOP family, whereas the two-step PSI-BLAST (11) and RPS-BLAST searches are used to identify similarities indicative of a more distant relationship (at the Superfamily level). Where the results of the PSI-BLAST and RPS-BLAST methods overlap, they provide a consensus pre-classification that has proved to be reliable. In addition, each of these methods detects unique matches, which assist the final classification. For example, such a unique match suggested the relationship of an uncharacterized protein, AF0060 from Archaeoglobus fulgidus (12), to the recently discovered superfamily of all-alpha NTP pyrophosphohydrolases with a potential ‘house cleaning’ function (13). This relationship was detected only by RPS-BLAST, through Pfam families PF03819 and PF08761, both of which are linked to this SCOP superfamily. Manual inspection of the AF0060 structure confirmed the superfamily assignment but also revealed its distinct features, including the subunit fold, tetrameric biological unit and active site architecture. We therefore classify AF0060 in a new family of this superfamily. This example underlines intrinsic differences between sequence-based annotation and structural classification: Pfam assigns AF0060 to the MazG-like family (PF03819), with which it shares a local sequence similarity, including the conserved metal ion-binding motif. The SCOP classification does not support this family assignment but suggests that AF0060 has a more distant relationship to the MazG-like proteins at the Superfamily level.
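
How the two search methods might be combined into a consensus pre-classification can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function `preclassify`, the dictionary layout and the one-to-one Pfam-to-superfamily mapping are hypothetical, and the superfamily is referred to by its descriptive name because no SCOP identifier is given in the text.

```python
from typing import Dict, Iterable, Set


def preclassify(psi_blast_superfamilies: Iterable[str],
                pfam_hits: Iterable[str],
                pfam_to_scop: Dict[str, str]) -> Dict[str, Set[str]]:
    """Split superfamily suggestions into consensus and method-unique sets.

    `pfam_to_scop` stands in for the pre-computed comparison of SCOP
    domains and Pfam families (assumed here to map each Pfam accession
    to a single SCOP superfamily).
    """
    psi = set(psi_blast_superfamilies)
    # Translate RPS-BLAST hits on Pfam profiles into SCOP superfamilies,
    # ignoring Pfam families that have no known link to SCOP.
    rps = {pfam_to_scop[acc] for acc in pfam_hits if acc in pfam_to_scop}
    return {
        "consensus": psi & rps,        # suggested by both methods
        "psi_blast_only": psi - rps,   # unique PSI-BLAST matches
        "rps_blast_only": rps - psi,   # unique RPS-BLAST matches
    }


# The AF0060 case from the text: no PSI-BLAST suggestion, two Pfam
# families that are both linked to the same SCOP superfamily.
SUPERFAMILY = "all-alpha NTP pyrophosphohydrolases"
mapping = {"PF03819": SUPERFAMILY, "PF08761": SUPERFAMILY}
af0060 = preclassify([], ["PF03819", "PF08761"], mapping)
```

Here the superfamily appears only under `rps_blast_only`, which corresponds to the kind of unique match that is passed on to manual curation rather than accepted as a consensus assignment.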

