Protein structural similarity search by Ramachandran codes.

Lo WC, Huang PJ, Chang CH, Lyu PC - BMC Bioinformatics (2007)

Bottom Line: To improve the search speed, several methods have been designed to reduce three-dimensional protein structures to one-dimensional text strings that are then analyzed by traditional sequence alignment methods; however, accuracy is usually sacrificed and the speed still cannot match that of sequence similarity search tools. SARST demonstrates that the easily accessible linear encoding methodology has the potential to serve as a foundation for efficient protein structural similarity search tools. Such search tools are expected to be applicable to automated, high-throughput functional annotation or prediction for the ever-increasing number of published protein structures in this post-genomic era.


Affiliation: Institute of Bioinformatics and Structural Biology, National Tsing Hua University, 101, Section 2 Kuang Fu Road, Hsinchu 30013, Taiwan. b861636@life.nthu.edu.tw

ABSTRACT

Background: Protein structural data have increased exponentially, so fast and accurate tools are needed for structure similarity searches. To improve the search speed, several methods have been designed to reduce three-dimensional protein structures to one-dimensional text strings that are then analyzed by traditional sequence alignment methods; however, accuracy is usually sacrificed and the speed still cannot match that of sequence similarity search tools. Here, we aimed to improve the linear encoding methodology and develop efficient search tools that can rapidly retrieve structural homologs from large protein databases.

Results: We propose a new linear encoding method, SARST (Structural similarity search Aided by Ramachandran Sequential Transformation). SARST transforms protein structures into text strings through a Ramachandran map organized by nearest-neighbor clustering and uses a regenerative approach to produce substitution matrices. Classical sequence similarity search methods can then be applied to the structural similarity search. Its accuracy is similar to that of Combinatorial Extension (CE), and it runs over 243,000 times faster, searching 34,000 proteins in 0.34 sec with a 3.2-GHz CPU. SARST provides statistically meaningful expectation values to assess the retrieved information. It has been implemented as a web service and as a stand-alone Java program that can run on many different platforms.
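The core idea of turning backbone torsion angles into a text string can be illustrated with a minimal sketch. This is not the SARST map itself: the abstract states that the real Ramachandran map is organized by nearest-neighbor clustering, whereas the sketch below assumes a uniform 5 x 5 grid over (phi, psi) as a simple stand-in. The alphabet, grid size, function names, and example angles are all hypothetical.

```python
# Hypothetical alphabet and grid size; SARST's actual map was built by
# nearest-neighbor clustering, which is NOT reproduced here.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXY"  # 25 letters for a 5 x 5 grid
BINS = 5
BIN_WIDTH = 360.0 / BINS


def angle_to_bin(angle_deg: float) -> int:
    """Map an angle in degrees to a bin index in [0, BINS)."""
    shifted = (angle_deg + 180.0) % 360.0  # wrap into [0, 360)
    return min(int(shifted // BIN_WIDTH), BINS - 1)


def encode_backbone(torsions):
    """Turn a list of (phi, psi) pairs into one letter per residue."""
    letters = []
    for phi, psi in torsions:
        # Each (phi, psi) grid cell owns one letter of the alphabet.
        idx = angle_to_bin(phi) * BINS + angle_to_bin(psi)
        letters.append(ALPHABET[idx])
    return "".join(letters)


# Illustrative angles only: a helix-like run followed by a strand-like run.
helix_like = [(-60.0, -45.0)] * 4
strand_like = [(-120.0, 130.0)] * 3
code = encode_backbone(helix_like + strand_like)
print(code)
```

Residues with helix-like and strand-like torsion angles fall into different grid cells and therefore receive different letters, so the resulting string can be handed to an ordinary sequence comparison routine. A clustered map, as described in the abstract, would replace the uniform grid lookup.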

Conclusion: As a database search method, SARST can rapidly distinguish high from low similarities and efficiently retrieve homologous structures. It demonstrates that the easily accessible linear encoding methodology has the potential to serve as a foundation for efficient protein structural similarity search tools. Such search tools are expected to be applicable to automated, high-throughput functional annotation or prediction for the ever-increasing number of published protein structures in this post-genomic era.


Figure 5: Effects of sequence identity on the precision of several search methods. The structure similarity search method, SARST, detected remote homology with higher precision than other linear encoding algorithms and the conventional amino acid sequence search method, BLAST. These data also show that there is still room for improvement in the linear encoding methodology. Possible solutions are proposed in the Discussion. The average precisions used in this figure were calculated at the representative 60% recall level.

Mentions: Using this new target database, information retrieval (IR) experiments were performed to examine the effects of low sequence identities. Various identity subsets of the target database were searched. As shown in Figure 5, the precision of SARST decreased as it encountered proteins with low sequence identities but was not as negatively affected as the precision of BLAST, which decreased substantially when the sequence identities fell below 30%. In comparison with recent linear encoding methods like YAKUSA and 3D-BLAST, the precision of SARST was generally improved. When tested with these non-redundant datasets, the accuracy of linear encoding methods was substantially lower than that of geometric algorithms like FAST and CE. We propose that this is because of the unavoidable loss of structural information in the process of 3D to 1D transformation, a phenomenon discussed in the latter part of this article.
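The precision values in Figure 5 are reported at a 60% recall level. As a rough illustration of that kind of measurement, the sketch below computes precision at the first rank where a ranked hit list reaches a requested recall. The exact evaluation protocol is not given in this excerpt, so the function name, the identifiers, and the toy ranking are assumptions made for illustration only.

```python
def precision_at_recall(ranked_hits, relevant, recall_level):
    """Precision at the first rank where recall reaches recall_level.

    ranked_hits:  identifiers ordered from best to worst score
    relevant:     set of identifiers that are true homologs of the query
    recall_level: fraction of relevant items to recover, e.g. 0.6
    """
    if not relevant:
        raise ValueError("relevant set must not be empty")
    found = 0
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit in relevant:
            found += 1
            if found / len(relevant) >= recall_level:
                # Precision = true hits so far / total hits inspected.
                return found / rank
    return 0.0  # the requested recall was never reached


# Toy ranking (hypothetical identifiers): five true homologs, two decoys.
ranked = ["a", "x", "b", "c", "y", "d", "e"]
relevant = {"a", "b", "c", "d", "e"}
print(precision_at_recall(ranked, relevant, 0.6))
```

In this toy case, 60% recall means recovering three of the five homologs, which happens at rank 4, so three of the four inspected hits are correct. Averaging this value over many queries would give an average precision at a fixed recall level of the kind plotted in Figure 5.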

