Protein structural similarity search by Ramachandran codes.

Lo WC, Huang PJ, Chang CH, Lyu PC - BMC Bioinformatics (2007)

Bottom Line: To improve the search speed, several methods have been designed to reduce three-dimensional protein structures to one-dimensional text strings that are then analyzed by traditional sequence alignment methods; however, accuracy is usually sacrificed and the speed still cannot match that of sequence similarity search tools. The proposed method, SARST, demonstrates that the easily accessible linear encoding methodology has the potential to serve as a foundation for efficient protein structural similarity search tools. These search tools are expected to be applicable to automated, high-throughput functional annotation or prediction for the ever-increasing number of published protein structures in this post-genomic era.


Affiliation: Institute of Bioinformatics and Structural Biology, National Tsing Hua University, 101, Section 2 Kuang Fu Road, Hsinchu 30013, Taiwan. b861636@life.nthu.edu.tw

ABSTRACT

Background: Protein structural data have increased exponentially, such that fast and accurate tools are needed for structural similarity searches. To improve the search speed, several methods have been designed to reduce three-dimensional protein structures to one-dimensional text strings that are then analyzed by traditional sequence alignment methods; however, accuracy is usually sacrificed and the speed still cannot match that of sequence similarity search tools. Here, we aimed to improve the linear encoding methodology and develop efficient search tools that can rapidly retrieve structural homologs from large protein databases.

Results: We propose a new linear encoding method, SARST (Structural similarity search Aided by Ramachandran Sequential Transformation). SARST transforms protein structures into text strings through a Ramachandran map organized by nearest-neighbor clustering and uses a regenerative approach to produce substitution matrices. Classical sequence similarity search methods can then be applied to the structural similarity search. Its accuracy is similar to that of Combinatorial Extension (CE), and it runs over 243,000 times faster, searching 34,000 proteins in 0.34 sec with a 3.2-GHz CPU. SARST provides statistically meaningful expectation values to assess the retrieved information. It has been implemented as a web service and as a stand-alone Java program that is able to run on many different platforms.
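The encoding step described above can be illustrated with a minimal sketch. It assigns each residue's backbone torsion pair (phi, psi) to the nearest of a few cluster centres on the Ramachandran map and emits one letter per residue. The centroid positions, the letters, the use of "X" for residues with undefined angles, and the function names are all assumptions made for illustration; they are not the code table or clustering used by SARST.

```python
import math

# Hypothetical cluster centres (phi, psi in degrees) and letters.
# These are illustrative only and are NOT the SARST code table.
CENTROIDS = {
    "H": (-60.0, -45.0),   # right-handed helical region
    "E": (-120.0, 130.0),  # beta-strand region
    "P": (-75.0, 150.0),   # polyproline II-like region
    "L": (60.0, 45.0),     # left-handed helical region
}


def angular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def encode_residue(phi, psi):
    """Return the letter of the nearest centroid.

    'X' is returned when either angle is undefined (assumed here to mark
    chain termini or missing residues).
    """
    if phi is None or psi is None:
        return "X"

    def distance(letter):
        c_phi, c_psi = CENTROIDS[letter]
        # Distance is computed with wrap-around, since the map is periodic.
        return math.hypot(angular_diff(phi, c_phi), angular_diff(psi, c_psi))

    return min(CENTROIDS, key=distance)


def encode_chain(torsions):
    """Turn a list of (phi, psi) pairs into a one-dimensional text string."""
    return "".join(encode_residue(phi, psi) for phi, psi in torsions)


if __name__ == "__main__":
    chain = [(-57.0, -47.0), (-119.0, 113.0), (None, 150.0), (57.0, 47.0)]
    print(encode_chain(chain))
```

Once structures are strings of this kind, any sequence search tool can compare them, provided it is given a substitution matrix suited to the structural alphabet.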

Conclusion: As a database search method, SARST can rapidly distinguish high from low similarities and efficiently retrieve homologous structures. It demonstrates that the easily accessible linear encoding methodology has the potential to serve as a foundation for efficient protein structural similarity search tools. These search tools are expected to be applicable to automated, high-throughput functional annotation or prediction for the ever-increasing number of published protein structures in this post-genomic era.

© Copyright Policy - open-access

Figure 6: Examples of distantly related proteins retrieved by SARST. (a) The superimposed structures (cross-eye stereo view) of the interleukin 8-like chemokines from human, [SCOP:d1b3aa_] (blue) and [SCOP:d1tvxa_] (red). These two proteins have a low sequence identity while their structures are highly similar. The minimum RMSD calculated from positive positions of the Ramachandran sequence alignment is 1.68 Å. (b) Three-dimensional structures and the superimposition of [SCOP:d1tpo__] (blue) (SCOP sccs id: b.47.1.2), the trypsin from Bos taurus, and [SCOP:d1p3ca_] (red) (SCOP sccs id: b.47.1.1), a Bacillus intermedius glutamyl endopeptidase. These two proteases belong to different families but have similar structures. Although the amino acid sequence alignment fails to detect their functional similarities, the catalytic triad residues (highlighted in green) are well aligned by SARST. Their minimum RMSD is 4.17 Å, whereas their amino acid sequence identity is 22%. The secondary structural cartoons were generated by PROCHECK [54] and then modified with colors and gaps.

Mentions: In the first example (Figure 6a), [SCOP:d1b3aa_] was the query protein and [SCOP:d1tvxa_] was one of its relevant retrievals. Both of these proteins are interleukin 8-like human chemokines. Their amino acid sequence identity was only 17.2% over a short alignment (29 residues), whereas they were structurally very similar (minimum RMSD: 1.68 Å) with a much longer RM string alignment (51 positions). This example indicates that SARST could successfully identify protein homologs sharing highly conserved 3D structures but low overall sequence homology (also seen in Figure 5). In the second example (Figure 6b), [SCOP:d1p3ca_], a Bacillus intermedius glutamyl endopeptidase, was a high-scoring irrelevant retrieval for the query protein [SCOP:d1tpo__], trypsin from cow (Bos taurus). These two proteases exhibited only 22% amino acid sequence identity. They had similar structures, and the catalytic triads were well aligned by SARST even though they belong to different families in the SCOP classification. There were several missing residues in the query protein, and there were major differences in length for some of the secondary structure elements (SSEs), which would normally cause previous linear encoding methods to fail [18]. SARST successfully identified the structural and functional similarities using suitable "X" scores and gap penalties. (Note that SARST is a database search tool that aims to rapidly distinguish high from low similarities, not to give optimum pairwise structural alignments. The RM sequence alignments shown in Figure 6 demonstrate how SARST works on protein homologs sharing low amino acid sequence identity, but they do not guarantee the best way to superimpose protein structures.)
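The identity figures quoted above are ratios of matching positions to aligned positions. The sketch below shows one common way to compute such a percentage from a pairwise alignment. The function name, the gap character, and the choice to count only columns where neither sequence has a gap are assumptions of this sketch; the text does not state how the authors defined the denominator. As a consistency check only, 5 identical positions in 29 aligned columns would give about 17.2%.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical characters over the gap-free columns of an alignment.

    Both inputs must be rows of the same alignment, hence of equal length.
    Columns where either row has a gap are excluded (an assumption of this
    sketch; other conventions divide by the full alignment length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Keep only the columns in which both rows carry a residue or code letter.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0

    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Toy rows, not taken from the proteins discussed in the text.
    print(percent_identity("ACD-F", "ACE-F"))
```

The same function works on amino acid alignments and on alignments of structure-derived strings, which is what makes a comparison such as 29 aligned residues versus 51 aligned positions possible.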

