Protein interaction sentence detection using multiple semantic kernels.

Polajnar T, Damoulas T, Girolami M - J Biomed Semantics (2011)

Bottom Line: We show that combinations of semantic kernels lead to statistically significant improvements in recognition rates and receiver operating characteristic (ROC) scores over the plain Gaussian kernel, when applied to a well-known labelled collection of abstracts. The results from this paper indicate that using semantic information from unlabelled text, and combinations of such information, can be valuable for classification of short texts such as PPI sentences. The method described herein is modular, and can be applied with a variety of feature types, kernels, and semantic models, in order to facilitate full extraction of interacting proteins.

Affiliation: School of Computing Science, University of Glasgow, Glasgow, UK. tamara@dcs.gla.ac.uk.

ABSTRACT

Background: Detection of sentences that describe protein-protein interactions (PPIs) in biomedical publications is a challenging and unresolved pattern recognition problem. Many state-of-the-art approaches for this task employ kernel classification methods, in particular support vector machines (SVMs). In this work we propose a novel data integration approach that utilises semantic kernels and a kernel classification method that is a probabilistic analogue to SVMs. Semantic kernels are created from statistical information gathered from large amounts of unlabelled text using lexical semantic models. Several semantic kernels are then fused into an overall composite classification space. In this initial study, we use simple features in order to examine whether the use of combinations of kernels constructed using word-based semantic models can improve PPI sentence detection.
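The fusion of several kernels into one composite space can be pictured as a weighted sum of kernel (Gram) matrices. The following is a minimal sketch only: the weighted-sum form, the toy Gaussian and linear kernels, the hand-picked weights, and the names `gaussian_kernel`, `linear_kernel`, `gram_matrix`, and `combine_kernels` are illustrative assumptions, not the authors' implementation.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def linear_kernel(x, y):
    """Linear kernel: the dot product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def gram_matrix(kernel, vectors):
    """Evaluate a kernel on every pair of vectors."""
    return [[kernel(u, v) for v in vectors] for u in vectors]


def combine_kernels(gram_matrices, weights):
    """Fuse Gram matrices as a weighted sum with non-negative weights."""
    if len(gram_matrices) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    n = len(gram_matrices[0])
    return [
        [sum(w * g[i][j] for w, g in zip(weights, gram_matrices)) for j in range(n)]
        for i in range(n)
    ]


# Toy "sentence" vectors (hypothetical bag-of-words counts).
sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base = [gram_matrix(gaussian_kernel, sentences), gram_matrix(linear_kernel, sentences)]
composite = combine_kernels(base, weights=[0.5, 0.5])
```

A sum of positive semi-definite kernels with non-negative weights is itself positive semi-definite, so the composite matrix can be passed to any kernel classifier. In a multiple kernel learning setting the weights would be inferred from data rather than fixed by hand as they are here.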

Results: We show that combinations of semantic kernels lead to statistically significant improvements in recognition rates and receiver operating characteristic (ROC) scores over the plain Gaussian kernel, when applied to a well-known labelled collection of abstracts. The proposed kernel composition method also allows us to automatically infer the most discriminative kernels.

Conclusions: The results from this paper indicate that using semantic information from unlabelled text, and combinations of such information, can be valuable for classification of short texts such as PPI sentences. This study, however, is only a first step in evaluation of semantic kernels and probabilistic multiple kernel learning in the context of PPI detection. The method described herein is modular, and can be applied with a variety of feature types, kernels, and semantic models, in order to facilitate full extraction of interacting proteins.

Figure 5: Comparison of experimental settings. This graph represents the AUC (higher pair of lines) and the F-score (lower pair of lines) results for two different experimental setups. In the first setup (AUC1, F1), all of the sentences from an abstract are either in training or test data within a fold of the cross-validation experiment. In the second setup, the sentence vectors are randomised first and then cut into cross-validation folds. The experiment shows 10 runs of randomised 10-fold cross-validation experiments. The vertical bars demonstrate the separations between the runs. For the first experiment F1 = 0.7300 ± 0.0058 and AUC = 0.8902 ± 0.0033, while for the second experiment F1 = 0.6808 ± 0.0051 and AUC = 0.8886 ± 0.0023. The t-test shows there is no significant difference between these experimental setups (p = 0.9640 for the F-scores and p = 0.5467 for the AUC).
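The two experimental setups in the caption differ only in how sentences are assigned to cross-validation folds. The sketch below illustrates that difference under stated assumptions: the function names, the toy corpus of 20 abstracts with 5 sentences each, and the seeding scheme are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict


def abstract_level_folds(abstract_ids, n_folds, seed=0):
    """Setup 1: all sentences of an abstract land in the same fold."""
    rng = random.Random(seed)
    unique_abstracts = sorted(set(abstract_ids))
    rng.shuffle(unique_abstracts)
    fold_of_abstract = {a: i % n_folds for i, a in enumerate(unique_abstracts)}
    return [fold_of_abstract[a] for a in abstract_ids]


def sentence_level_folds(n_sentences, n_folds, seed=0):
    """Setup 2: shuffle the sentences first, then cut them into folds."""
    rng = random.Random(seed)
    order = list(range(n_sentences))
    rng.shuffle(order)
    folds = [0] * n_sentences
    for position, sentence_index in enumerate(order):
        # Contiguous slices of the shuffled order form the folds.
        folds[sentence_index] = position * n_folds // n_sentences
    return folds


def folds_per_abstract(abstract_ids, folds):
    """Map each abstract to the set of folds its sentences fall into."""
    seen = defaultdict(set)
    for abstract, fold in zip(abstract_ids, folds):
        seen[abstract].add(fold)
    return dict(seen)


# Hypothetical corpus: 20 abstracts with 5 sentences each.
abstract_ids = [f"abs{i:02d}" for i in range(20) for _ in range(5)]
grouped = abstract_level_folds(abstract_ids, n_folds=10, seed=1)
shuffled = sentence_level_folds(len(abstract_ids), n_folds=10, seed=1)
```

Calling `folds_per_abstract(abstract_ids, grouped)` returns exactly one fold per abstract, which is the property that defines the first setup. The second setup gives no such guarantee, so sentences from one abstract may appear in both training and test data. Repeating either assignment with different seeds gives the repeated randomised runs.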

Mentions: When AImed data is used with syntactic features, for example in [23,67], it is usually applied with the original 10-fold cross-validation (10 × 10 cv) data split provided by the dataset authors [20]. Figure 5, which shows 10 different runs of a 10 cv experiment, indicates that an algorithm's performance can vary considerably between different randomisations of the data.
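Because scores depend on the randomisation, results from repeated runs are usually reported as a mean together with a measure of spread. The sketch below shows one way to do this. It assumes the spread is the sample standard deviation across runs (the source does not say which measure its ± values use), and the scores are made-up placeholders, not values from the paper.

```python
import statistics


def summarise_runs(run_scores):
    """Return (mean, sample standard deviation) of per-run scores."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(run_scores), statistics.stdev(run_scores)


def format_summary(name, run_scores):
    """Render a score summary in the 'name = mean ± spread' style."""
    mean, spread = summarise_runs(run_scores)
    return f"{name} = {mean:.4f} ± {spread:.4f}"


# Hypothetical F-scores from 10 runs; these are NOT values from the paper.
f_scores = [0.70, 0.72, 0.71, 0.69, 0.73, 0.70, 0.71, 0.72, 0.68, 0.74]
summary = format_summary("F-score", f_scores)
```

Reporting a single fixed split hides this spread, which is why a comparison between methods is more convincing when it is based on several randomised runs.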

