Similarity search for local protein structures at atomic resolution by exploiting a database management system


ABSTRACT

A method to search for local structural similarities in proteins at atomic resolution is presented. We demonstrate that a large amount of structural data can be handled within a reasonable CPU time by using a conventional relational database management system with appropriate indexing of geometric data. This method, which we call geometric indexing, can enumerate ligand binding sites that are structurally similar to sub-structures of a query protein among more than 160,000 possible candidates within a few hours of CPU time on an ordinary desktop computer. After a set of high-scoring ligand binding sites is detected by the geometric indexing search, structural alignments at atomic resolution are constructed by iteratively applying the Hungarian algorithm, and the statistical significance of the final score is estimated from an empirical model based on a gamma distribution. Applications of this method to several protein structures show that significant similarities can be detected between local structures of non-homologous as well as homologous proteins.



Figure 4: Distribution of IR scores of randomly selected templates. The red bars indicate the histogram of IR scores of randomly selected templates obtained for the query 101m. The green line is the probability density function (PDF) of the gamma distribution GAM(α, β) with the parameters α=1.32 and β=1.75 calculated from the mean and variance of the scores. The blue line is the PDF of the type 2 extreme value distribution with the parameters determined to best fit the histogram.

To estimate the statistical significance of the IR score, we examined its distribution. We first performed a GI search and then randomly selected 50,000 hits for iterative refinement. After the refinement, the histogram of the IR scores was plotted. Fig. 4 is an example obtained for the query 101m. The distribution is well approximated by a gamma distribution (Fig. 4, green line). We also fitted the type-2 (Fréchet) extreme value distribution (since the IR score is non-negative), but the fit was not as good as that of the gamma distribution (Fig. 4, blue line). The same trend was observed for other proteins. Thus, we use the gamma distribution for calculating the statistical significance of the IR score. Since the parameters of the gamma distribution may differ from query to query, they are calculated by random sampling each time a search is performed.
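
The significance estimate described here can be illustrated with a minimal sketch. It is not the authors' code. It assumes that the gamma parameters are obtained by the method of moments (α = mean²/variance, β = variance/mean) and that β is a scale parameter; the text only states that the parameters are calculated from the mean and variance, so the parameterization is an assumption. The function names `fit_gamma_moments` and `gamma_sf` are hypothetical, and the sample scores are synthetic, drawn from a gamma distribution rather than taken from a real search.

```python
import math
import random


def fit_gamma_moments(scores):
    """Method-of-moments estimate of (shape, scale) from sampled scores.

    Assumption: beta is treated as a scale parameter, so that
    mean = shape * scale and variance = shape * scale**2.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean * mean / var, var / mean


def gamma_sf(x, shape, scale):
    """Upper-tail probability P(X >= x) for a gamma-distributed score.

    Uses the standard series expansion of the regularized incomplete
    gamma function for small arguments and a continued fraction
    (modified Lentz evaluation) for large ones.
    """
    if x <= 0.0:
        return 1.0
    a, z = shape, x / scale
    prefactor = math.exp(-z + a * math.log(z) - math.lgamma(a))
    eps, tiny = 1e-15, 1e-300
    if z < a + 1.0:
        # Series for the lower tail P(a, z); return its complement.
        term = total = 1.0 / a
        ap = a
        for _ in range(10000):
            ap += 1.0
            term *= z / ap
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * prefactor
    # Continued fraction for the upper tail Q(a, z).
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return prefactor * h


# Synthetic stand-in for the 50,000 randomly sampled IR scores.
rng = random.Random(0)
sample = [rng.gammavariate(1.32, 1.75) for _ in range(50000)]
shape, scale = fit_gamma_moments(sample)

# P-value of a hypothetical observed IR score of 12.0 under the fitted model.
p_value = gamma_sf(12.0, shape, scale)
print(f"shape={shape:.3f} scale={scale:.3f} P={p_value:.3e}")
```

Because the parameters depend on the query, a real search would repeat the fitting step on a fresh random sample of refined hits for each query, as the text describes, and then evaluate the upper-tail probability for each reported hit.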

