Partial order optimum likelihood (POOL): maximum likelihood prediction of protein active site residues using 3D Structure and sequence properties.

Tong W, Wei Y, Murga LF, Ondrechen MJ, Williams RJ - PLoS Comput. Biol. (2009)

Bottom Line: This extension results in even better performance than THEMATICS alone and constitutes to date the best functional site predictor based on 3D structure only, achieving nearly the same level of performance as methods that use both 3D structure and sequence alignment data. Finally, the method also easily incorporates such sequence alignment data, and when this information is included, the resulting method is shown to outperform the best current methods using any combination of sequence alignments and 3D structures. Included is an analysis demonstrating that when THEMATICS features, cleft size rank, and alignment-based conservation scores are used individually or in combination, THEMATICS features represent the single most important component of such classifiers.


Affiliation: College of Computer and Information Science, Northeastern University, Boston, Massachusetts, United States of America.

ABSTRACT
A new monotonicity-constrained maximum likelihood approach, called Partial Order Optimum Likelihood (POOL), is presented and applied to the problem of functional site prediction in protein 3D structures, an important current challenge in genomics. The input consists of electrostatic and geometric properties derived from the 3D structure of the query protein alone. Sequence-based conservation information, where available, may also be incorporated. Electrostatics features from THEMATICS are combined with multidimensional isotonic regression to form maximum likelihood estimates of probabilities that specific residues belong to an active site. This allows likelihood ranking of all ionizable residues in a given protein based on THEMATICS features. The corresponding ROC curves and statistical significance tests demonstrate that this method outperforms prior THEMATICS-based methods, which in turn have been shown previously to outperform other 3D-structure-based methods for identifying active site residues. Then it is shown that the addition of one simple geometric property, the size rank of the cleft in which a given residue is contained, yields improved performance. Extension of the method to include predictions of non-ionizable residues is achieved through the introduction of environment variables. This extension results in even better performance than THEMATICS alone and constitutes to date the best functional site predictor based on 3D structure only, achieving nearly the same level of performance as methods that use both 3D structure and sequence alignment data. Finally, the method also easily incorporates such sequence alignment data, and when this information is included, the resulting method is shown to outperform the best current methods using any combination of sequence alignments and 3D structures. 
Included is an analysis demonstrating that when THEMATICS features, cleft size rank, and alignment-based conservation scores are used individually or in combination, THEMATICS features represent the single most important component of such classifiers.
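
To illustrate the idea of a monotonicity-constrained maximum likelihood estimate, the following is a minimal sketch of the one-dimensional case only, using the pool-adjacent-violators procedure. It is not the authors' implementation: POOL as described above works over a multidimensional partial order, whereas this sketch assumes residues have already been sorted by a single feature. The function name `pava` and the toy labels are hypothetical.

```python
def pava(labels):
    """Non-decreasing least-squares fit to `labels`.

    Assumption: `labels` are 0/1 active-site annotations already sorted
    by increasing value of one feature (a total order, not a partial one).
    For 0/1 labels the fitted values can be read as probability estimates.
    """
    # Each block stores [sum of labels, count]; its fitted value is sum / count.
    blocks = []
    for y in labels:
        blocks.append([float(y), 1])
        # Merge while the previous block's mean exceeds the last block's mean
        # (compared by cross-multiplication to avoid division).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted


# Hypothetical toy labels, sorted by an assumed feature value.
fitted = pava([0, 1, 0, 0, 1, 1])
print(fitted)
```

Residues that violate the assumed ordering are pooled into a block and share one estimate, so the fitted values never decrease along the sorted list.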


Figure 7: Histogram of the first annotated active site residue. Top: Percentage of all proteins with specified rank of the first annotated active site residue in the ordered list from POOL(T)xPOOL(G)xPOOL(C) on the 160 protein set. Bottom: Cumulative distribution of the first annotated active site residue in the ranked list from POOL(T)xPOOL(G)xPOOL(C) on the 160 protein set.

Mentions: Another interesting result of our approach is one that is obtainable only from methods that generate a ranked list: the rank of the first annotated true positive in the list. This metric is useful for users who want to find a few active site residue candidates and do not necessarily need to know all of the active site residues. For instance, users could use the ranked list from the POOL method to guide their site-directed mutagenesis experiments, working down the list one residue at a time. A histogram of the rank of the first active site residue found by POOL(T)xPOOL(G)xPOOL(C) on the 160 protein set is shown in Figure 7. The median rank of the first true positive active site residue in the 160 protein set with the POOL(T)xPOOL(G)xPOOL(C) method is two. For 46 of the 160 proteins, the first residue in the resulting ranked list is an annotated active site residue. The first annotated active site residue is located within the top 3, 5 and 10 residues of the ranked list for 65.0%, 81.3% and 90.0% of the 160 proteins, respectively. Such measurements are not easily made for binary classification methods.
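
The rank-of-first-hit metric described above can be sketched as follows. This is an illustrative example under stated assumptions, not code from the paper: the function names, protein identifiers, and residue labels are all hypothetical toy values, and each ranked list is assumed to be ordered from most to least likely.

```python
from statistics import median


def first_hit_rank(ranked_residues, annotated):
    """Return the 1-based rank of the first annotated residue, or None."""
    for rank, residue in enumerate(ranked_residues, start=1):
        if residue in annotated:
            return rank
    return None


def fraction_within_top_k(first_ranks, k):
    """Fraction of proteins whose first annotated residue is in the top k."""
    hits = sum(1 for r in first_ranks if r is not None and r <= k)
    return hits / len(first_ranks)


# Hypothetical toy data: protein -> (ranked residue list, annotated set).
proteins = {
    "protA": (["H57", "D102", "S195", "K60"], {"H57", "S195"}),
    "protB": (["E35", "K13", "D52"], {"D52"}),
    "protC": (["R10", "C25", "H159"], {"C25", "H159"}),
}

first_ranks = [first_hit_rank(ranked, annotated)
               for ranked, annotated in proteins.values()]
# Proteins with no annotated residue in the list are excluded from the median.
median_rank = median(r for r in first_ranks if r is not None)

print("first-hit ranks:", first_ranks)
print("median rank:", median_rank)
for k in (1, 3):
    print(f"within top {k}:", fraction_within_top_k(first_ranks, k))
```

Evaluating `fraction_within_top_k` over a range of `k` gives a cumulative distribution of the kind shown in the bottom panel of Figure 7.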

