An Algebro-topological description of protein domain structure.

Penner RC, Knudsen M, Wiuf C, Andersen JE - PLoS ONE (2011)

Bottom Line: Both invariants are global fatgraph features reflecting the interconnectivity of the domain by hydrogen bonds. We use local (secondary) and global (tertiary) fatgraph features to describe domain structures and illustrate that they are useful for classification of domains in CATH. In addition, we combine our method with two other methods, thereby using primary, secondary, and tertiary structure information, and show that we can identify a large percentage of new and unclassified structures in CATH.

Affiliation: Center for the Topology and Quantization of Moduli Spaces, Department of Mathematical Sciences, Aarhus University, Aarhus, Denmark.

ABSTRACT
The space of possible protein structures appears vast and continuous, and the relationship between primary, secondary and tertiary structure levels is complex. Protein structure comparison and classification are therefore difficult but important tasks, since structure is a determinant for molecular interaction and function. We introduce a novel mathematical abstraction based on geometric topology to describe protein domain structure. Using the locations of the backbone atoms and the hydrogen bonds, we build a combinatorial object, a so-called fatgraph. The description is discrete yet gives rise to a 2-dimensional mathematical surface. Thus, each protein domain corresponds to a particular mathematical surface with characteristic topological invariants, such as the genus (number of holes) and the number of boundary components. Both invariants are global fatgraph features reflecting the interconnectivity of the domain by hydrogen bonds. We introduce the notion of robust variables, that is, variables that are robust towards minor changes in the structure/fatgraph, and show that the genus and the number of boundary components are robust. Further, we investigate the distribution of different fatgraph variables and show how only four variables are capable of distinguishing different folds. We use local (secondary) and global (tertiary) fatgraph features to describe domain structures and illustrate that they are useful for classification of domains in CATH. In addition, we combine our method with two other methods, thereby using primary, secondary, and tertiary structure information, and show that we can identify a large percentage of new and unclassified structures in CATH.
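To make the two invariants concrete, the following is a minimal sketch of how the genus and the number of boundary components can be read off a fatgraph. It is not the authors' implementation: it assumes a connected, orientable fatgraph (no twisted edges) encoded in the standard way by two permutations on half-edges. The names `sigma` (cyclic order of half-edges around each vertex), `alpha` (pairing of half-edges into edges), `count_cycles` and `genus_and_boundaries` are hypothetical. Boundary components are counted as cycles of the composed permutation, and the genus follows from the Euler characteristic V - E + r = 2 - 2g.

```python
def count_cycles(perm):
    """Count the cycles of a permutation given as a dict {element: image}."""
    seen = set()
    cycles = 0
    for start in perm:
        if start in seen:
            continue
        cycles += 1
        h = start
        while h not in seen:
            seen.add(h)
            h = perm[h]
    return cycles


def genus_and_boundaries(sigma, alpha):
    """Return (genus, boundary components) of a connected, orientable fatgraph.

    sigma: next half-edge in the cyclic order around the same vertex.
    alpha: the other half-edge of the same edge (an involution).
    Both encodings are assumptions made for this illustration.
    """
    vertices = count_cycles(sigma)
    edges = len(alpha) // 2
    # Walking along an edge and then turning at the vertex traces a boundary.
    boundary_walk = {h: sigma[alpha[h]] for h in alpha}
    boundaries = count_cycles(boundary_walk)
    euler = vertices - edges + boundaries  # equals 2 - 2 * genus
    return (2 - euler) // 2, boundaries


# Toy example: one vertex with four half-edges in cyclic order 0, 1, 2, 3.
sigma = {0: 1, 1: 2, 2: 3, 3: 0}
alpha_nested = {0: 1, 1: 0, 2: 3, 3: 2}   # two side-by-side loops
alpha_crossed = {0: 2, 2: 0, 1: 3, 3: 1}  # two interleaved loops

print(genus_and_boundaries(sigma, alpha_nested))   # (0, 3)
print(genus_and_boundaries(sigma, alpha_crossed))  # (1, 1)
```

The toy example shows why these are global features: the two graphs have the same vertex and edge counts and differ only in how the edges are interconnected, yet one surface is a sphere with three boundary components and the other a torus with one.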


Figure 12 (pone-0019670-g012): The classifier consists of a collection (forest) of classification trees. For each domain in the test set, each classification tree votes, and a consensus is reached. We used  trees, and for each domain we calculated the percentages,  and , of the most frequent and the second most frequent votes occurring. The top plots show the distributions of  and  for the correctly and wrongly classified domains at the H-level in v3.2.0. If  is large, the majority of votes are cast for the top two candidates, and if furthermore  is large, many more votes have been cast for the winner compared to the runner-up. Therefore, if both  and  are large, this indicates that the classifier makes a confident prediction. The distributions of  and  for correctly predicted domains show that the confidence in correct predictions is generally high. On the other hand, confidence in wrong predictions is much more uniform. The bottom plots show the distributions corresponding to the  newly added domains in v3.3.0 with non-existent classification in v3.2.0 and the  domains for which CATH has not yet provided a classification. Both show the same kind of uncertainty as observed for the wrongly classified domains.
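The voting scheme in the caption can be sketched as follows. The symbols for the two confidence quantities are missing from this copy of the caption, so the reading below is an assumption: the sum of the top two vote percentages measures how concentrated the votes are on the top two candidates, and their difference measures the winner's margin over the runner-up. The function name `vote_confidence`, the names `p1` and `p2`, and the class labels in the example are hypothetical, and the forest itself is replaced by a plain list of votes.

```python
from collections import Counter


def vote_confidence(votes):
    """Summarise the votes that the trees of a forest cast for one domain.

    Returns (winner, p1, p2), where p1 and p2 are the percentages of votes
    for the most frequent and second most frequent class (hypothetical names).
    """
    top_two = Counter(votes).most_common(2)
    total = len(votes)
    winner, winner_count = top_two[0]
    p1 = 100.0 * winner_count / total
    # With a unanimous forest there is no runner-up.
    p2 = 100.0 * top_two[1][1] / total if len(top_two) > 1 else 0.0
    return winner, p1, p2


# Ten hypothetical trees voting among three made-up class labels.
votes = ["A"] * 7 + ["B"] * 2 + ["C"]
winner, p1, p2 = vote_confidence(votes)

concentration = p1 + p2  # assumed reading: share of votes on the top two
margin = p1 - p2         # assumed reading: lead of the winner over runner-up
print(winner, p1, p2, concentration, margin)  # A 70.0 20.0 90.0 50.0
```

Under this reading, a prediction counts as confident when both derived quantities are large, which matches the caption's description of correctly classified domains.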

Mentions: In v3.2.0 (SAll, see Materials and Methods, section 5), we selected the largest H-levels ( of all domains) and randomly sampled of the domains for training, while keeping for testing (Materials and Methods, sections 5 and 6). In addition, we tested the classifier on the new domains in v3.3.0 that are not already in v3.2.0. Fig. 11 shows the results. We assigned of the domains in v3.2.0 into their correct H-level, whereas , and are correctly assigned at the T-, A-, and C-level, respectively. For the new domains in v3.3.0, the percentages are smaller: (H), (T), (A), and (C). When the classifier makes a correct prediction, it does so with high confidence, whereas it is less certain when making a false prediction. A similar lack of confidence is observed when the classifier is applied to domains which have not been assigned a classification by CATH or to domains with H-levels that are new in v3.3.0 (Fig. 12).

