Computational protein design: validation and possible relevance as a tool for homology searching and fold recognition.

Schmidt Am Busch M, Sedano A, Simonson T - PLoS ONE (2010)

Bottom Line: The results confirm and generalize our earlier study of SH2 and SH3 domains. For some families, designed sequences can be a useful complement to experimental ones for homologue searching. However, improved tools are needed to extract more information from the designed profiles before the method can be of general use.


Affiliation: Laboratoire de Biochimie (CNRS UMR7654), Department of Biology, Ecole Polytechnique, Palaiseau, France.

ABSTRACT

Background: Protein fold recognition usually relies on a statistical model of each fold; each model is constructed from an ensemble of natural sequences belonging to that fold. A complementary strategy may be to employ sequence ensembles produced by computational protein design. Designed sequences can be more diverse than natural sequences, possibly avoiding some limitations of experimental databases.

Methodology/principal findings: We explore this strategy for four SCOP families: Small Kunitz-type inhibitors (SKIs), Interleukin-8 chemokines, PDZ domains, and large Caspase catalytic subunits, represented by 43 structures. An automated procedure is used to redesign the 43 proteins. We use the experimental backbones as fixed templates in the folded state and a molecular mechanics model to compute the interaction energies between sidechain and backbone groups. Calculations are done with the Proteins@Home volunteer computing platform. A heuristic algorithm is used to scan the sequence and conformational space, yielding 200,000-300,000 sequences per backbone template. The results confirm and generalize our earlier study of SH2 and SH3 domains. The designed sequences resemble moderately distant natural homologues of the initial templates; e.g., the SUPERFAMILY library of profile Hidden Markov Models recognizes 85% of the low-energy sequences as native-like. Conversely, Position Specific Scoring Matrices derived from the sequences can be used to detect natural homologues within the SwissProt database: 60% of known PDZ domains are detected and around 90% of known SKIs and chemokines. Energy components and inter-residue correlations are analyzed and ways to improve the method are discussed.

Conclusions/significance: For some families, designed sequences can be a useful complement to experimental ones for homologue searching. However, improved tools are needed to extract more information from the designed profiles before the method can be of general use.

Figure 8: Correlation analysis for the PDZ domain 1QAU. A) Covariance matrix; the amino acid sequence runs along the top and the side of the plot, with secondary structure elements indicated as arrows (strands) or rectangles (helices). Bright points in the matrix correspond to highly correlated amino acid pairs. Red dots along the top and side label the network shown in B). B) 3D structure with secondary structure elements labelled as in A). A correlated network of five amino acids is shown (yellow spheres, labelled with amino acid number; red dots in A). C) The most frequent sequence patterns for the five amino acids, with their frequency within the 10,000 low-energy sequences. D) Number of homologues retrieved by BLAST searching using subsets of sequences that obey one of the frequent patterns (E-value threshold of 1). Homologues retrieved using all the low-energy sequences are shown by the rightmost bar (labelled ‘None’). Thick lines represent true homologues; thin lines show false positives.

Mentions: To further characterize the correlations between amino acid positions, we consider a single example, the PDZ domain 1QAU. For this protein, we have analyzed the correlated mutations that occur within the set of 10,000 (not 8,000) low-energy designed sequences. A correlation coefficient was defined in Methods, which is effectively a mutual sequence entropy [78], [81]. Considering all pairs of positions in 1QAU, we obtain a covariance matrix, which is shown in Fig. 8A. By inspecting the matrix and the protein 3D structure, we identified a small network of five amino acids that are strongly correlated; Fig. 8B shows them in the context of the 3D structure. The different amino acid sequences that occur most frequently at these five positions, within the set of designed sequences, are shown in Fig. 8C. In fact, the sequences are described with the reduced alphabet of nine ‘flavors’ defined in the previous section. Therefore, we speak of ‘sequence patterns’, rather than sequences. The two most frequent sequence patterns are HDWWW (16.4% of the 10,000 low-energy sequences, or 1640 sequences) and DLWWL (9.6% of the sequences).
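The correlation analysis described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exact correlation coefficient is defined in the paper's Methods, which are not reproduced here, so the sketch assumes a plain mutual information between two alignment columns as a stand-in. The function names (`mutual_information`, `mi_matrix`, `pattern_frequencies`) and the toy sequences are hypothetical, and the symbols can be either amino acids or the reduced 'flavor' alphabet.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equal-length columns of symbols."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


def mi_matrix(seqs):
    """Symmetric matrix of pairwise mutual information over all position pairs.

    The diagonal is left at 0.0; only off-diagonal pairs are computed.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    columns = [[s[i] for s in seqs] for i in range(length)]
    matrix = [[0.0] * length for _ in range(length)]
    for i, j in combinations(range(length), 2):
        matrix[i][j] = matrix[j][i] = mutual_information(columns[i], columns[j])
    return matrix


def pattern_frequencies(seqs, positions):
    """Fraction of sequences showing each pattern at the chosen positions."""
    counts = Counter("".join(s[p] for p in positions) for s in seqs)
    return [(pattern, c / len(seqs)) for pattern, c in counts.most_common()]


# Toy data (hypothetical): positions 0 and 1 always change together.
toy_seqs = ["HDW", "HDW", "DLW", "DLL"]
toy_matrix = mi_matrix(toy_seqs)
toy_patterns = pattern_frequencies(toy_seqs, [0, 1])
```

In this toy set, positions 0 and 1 vary together, so their mutual information is high, while position 2 is only partly coupled to the others. Applied to a set of designed sequences, `pattern_frequencies` would give the kind of pattern tally shown in Fig. 8C, for example 1640 of 10,000 sequences corresponding to a frequency of 16.4%.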

