Partial order optimum likelihood (POOL): maximum likelihood prediction of protein active site residues using 3D Structure and sequence properties.

Tong W, Wei Y, Murga LF, Ondrechen MJ, Williams RJ - PLoS Comput. Biol. (2009)

Bottom Line: This extension results in even better performance than THEMATICS alone and constitutes to date the best functional site predictor based on 3D structure only, achieving nearly the same level of performance as methods that use both 3D structure and sequence alignment data. Finally, the method also easily incorporates such sequence alignment data, and when this information is included, the resulting method is shown to outperform the best current methods using any combination of sequence alignments and 3D structures. Included is an analysis demonstrating that when THEMATICS features, cleft size rank, and alignment-based conservation scores are used individually or in combination, THEMATICS features represent the single most important component of such classifiers.


Affiliation: College of Computer and Information Science, Northeastern University, Boston, Massachusetts, United States of America.

ABSTRACT
A new monotonicity-constrained maximum likelihood approach, called Partial Order Optimum Likelihood (POOL), is presented and applied to the problem of functional site prediction in protein 3D structures, an important current challenge in genomics. The input consists of electrostatic and geometric properties derived from the 3D structure of the query protein alone. Sequence-based conservation information, where available, may also be incorporated. Electrostatics features from THEMATICS are combined with multidimensional isotonic regression to form maximum likelihood estimates of probabilities that specific residues belong to an active site. This allows likelihood ranking of all ionizable residues in a given protein based on THEMATICS features. The corresponding ROC curves and statistical significance tests demonstrate that this method outperforms prior THEMATICS-based methods, which in turn have been shown previously to outperform other 3D-structure-based methods for identifying active site residues. Then it is shown that the addition of one simple geometric property, the size rank of the cleft in which a given residue is contained, yields improved performance. Extension of the method to include predictions of non-ionizable residues is achieved through the introduction of environment variables. This extension results in even better performance than THEMATICS alone and constitutes to date the best functional site predictor based on 3D structure only, achieving nearly the same level of performance as methods that use both 3D structure and sequence alignment data. Finally, the method also easily incorporates such sequence alignment data, and when this information is included, the resulting method is shown to outperform the best current methods using any combination of sequence alignments and 3D structures. 
Included is an analysis demonstrating that when THEMATICS features, cleft size rank, and alignment-based conservation scores are used individually or in combination, THEMATICS features represent the single most important component of such classifiers.
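
To make the idea of a monotonicity-constrained maximum likelihood estimate concrete, the following is a minimal sketch of the one-dimensional case only, using the standard pool-adjacent-violators algorithm on 0/1 labels. It is not the authors' POOL implementation, which handles a partial order over several features; the function name `pava_increasing` and the toy labels are assumptions made for illustration. For binary labels sorted by a single feature, the least-squares isotonic fit computed here coincides with the maximum likelihood estimate of a non-decreasing probability.

```python
from typing import List, Sequence


def pava_increasing(labels: Sequence[int]) -> List[float]:
    """Fit non-decreasing probabilities to 0/1 labels.

    `labels` must already be sorted by increasing feature value
    (hypothetical setup: 1 = annotated active-site residue, 0 = not).
    """
    # Each block holds [sum of labels, number of items]; its fitted
    # value is the block mean sum / count.
    blocks: List[List[float]] = []
    for y in labels:
        blocks.append([float(y), 1.0])
        # Pool adjacent blocks while the earlier block's mean exceeds
        # the later block's mean (a monotonicity violation).
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    # Expand block means back to one estimate per input item.
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


# Toy labels (made up), ordered by an increasing feature value.
estimates = pava_increasing([0, 0, 1, 0, 1, 1])
```

The out-of-order pair in the middle of the toy input is pooled into a single block sharing one estimate, so the fitted probabilities never decrease as the feature increases.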


Figure 3 (pcbi-1000266-g003): Averaged ROC curves for two versions of the POOL method: one that predicts ionizable residues only, POOL(TION)xPOOL(G), and one that predicts all residues, POOL(TALL)xPOOL(G), through the incorporation of environment variables. The recall rate for all annotated active-site residues is plotted as a function of the false positive rate for all residues in the 64-protein test set.

Mentions: So far, only predictions for ionizable residues have been described. The THEMATICS environment variables are now used to incorporate predictions for non-ionizable residues in the active site. Figure 3 shows the ROC curve for a combined method that generates a single merged, rank-ordered list of all residues in a protein, both ionizable and non-ionizable. The method assigns probability estimates to ionizable residues using the best of the previous ionizables-only estimators, POOL(T4)xPOOL(G), which gave the best ROC curve in Figure 2. It assigns probability estimates to non-ionizable residues using POOL with the two THEMATICS environment features chained with POOL(G), and then rank-orders all residues by their probability estimates. We call the estimator obtained by merging the ionizable-only and non-ionizable-only probability estimates POOL(TALL)xPOOL(G). Also included in Figure 3, for comparison, is a ROC curve for POOL(TION)xPOOL(G), which uses the same estimates for the ionizable residues but assigns a probability estimate of zero to every non-ionizable residue. The data for this latter method are essentially the same as those of the POOL(T4)xPOOL(G) curve of Figure 2, except that the denominator for recall is now the total number of active-site residues in the protein, ionizable or not, and the denominator for the false positive rate is now the total number of non-active-site residues in the protein, ionizable or not. The improved ROC curve for the merged-estimate method POOL(TALL)xPOOL(G), compared with the curve for the ionizables-only method POOL(TION)xPOOL(G), indicates that taking into account both THEMATICS environment variables and cleft information does indeed help identify the non-ionizable active-site residues.
When the lists are merged, the rankings of some annotated positive ionizable residues may be lowered, but this effect is more than offset, on average, by the inclusion in the ranking of annotated positive non-ionizable residues that the ionizables-only method necessarily misses. If this were not the case, one would expect the merged curve to cross below (and to the right of) the comparison curve in the lower recall (and lower false positive) range.
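
The merging and scoring procedure described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the per-residue probability estimates are taken as given inputs, and the residue labels, probability values, and function names (`merge_and_rank`, `roc_points`) are all hypothetical. Both recall and false positive rate are computed over all residues in the protein, ionizable or not, and tied estimates are treated as a single threshold step.

```python
from itertools import groupby
from typing import Dict, List, Set, Tuple


def merge_and_rank(
    p_ionizable: Dict[str, float], p_non_ionizable: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Merge two sets of probability estimates into one ranked list."""
    merged = {**p_ionizable, **p_non_ionizable}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


def roc_points(
    ranked: List[Tuple[str, float]], active: Set[str]
) -> List[Tuple[float, float]]:
    """Return (false positive rate, recall) points down the ranked list.

    Denominators count all residues in the list, ionizable or not.
    """
    n_pos = sum(1 for residue, _ in ranked if residue in active)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative residue")

    tp = fp = 0
    points = [(0.0, 0.0)]
    # Residues with equal estimates cannot be ordered, so each group of
    # ties contributes a single point.
    for _, group in groupby(ranked, key=lambda kv: kv[1]):
        for residue, _ in group:
            if residue in active:
                tp += 1
            else:
                fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical estimates for one small protein (values are made up).
p_ion = {"H95": 0.8, "E165": 0.6, "K12": 0.1}
p_non = {"S211": 0.7, "G210": 0.05, "A30": 0.02}
active_site = {"H95", "E165", "S211"}

# All-residue variant: non-ionizable residues keep their own estimates.
curve_all = roc_points(merge_and_rank(p_ion, p_non), active_site)

# Ionizables-only variant: every non-ionizable residue is assigned zero.
p_zero = {residue: 0.0 for residue in p_non}
curve_ion = roc_points(merge_and_rank(p_ion, p_zero), active_site)
```

In the ionizables-only variant, the zero-probability residues form one tied group at the bottom of the list, so an active-site residue among them is recovered only at the final point of the curve, where the false positive rate is already 1.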

