Large scale application of neural network based semantic role labeling for automated relation extraction from biomedical texts.

Barnickel T, Weston J, Collobert R, Mewes HW, Stümpflen V - PLoS ONE (2009)

Bottom Line: To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences, resulting in precision/recall values of 0.71/0.43. The presented approach bridges the gap between fast, co-occurrence-based approaches lacking semantic relations and highly specialized and computationally demanding NLP approaches.


Affiliation: Helmholtz Zentrum München, Institute of Bioinformatics and Systems Biology (MIPS), Neuherberg, Germany. thorsten.barnickel@helmholtz-muenchen.de

ABSTRACT
To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence-based approaches can process large text corpora such as MEDLINE in an acceptable time but cannot extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive, and the generated trees are difficult to interpret. Several natural language processing (NLP) approaches for the biomedical domain focus specifically on detecting a limited set of relation types. Systems biology needs generic approaches that detect a multitude of relation types and can also process large text corpora, but very few systems meet both requirements. We introduce the use of SENNA ("Semantic Extraction using a Neural Network Architecture"), a fast and accurate neural-network-based Semantic Role Labeling (SRL) program, for the large-scale extraction of semantic relations from the biomedical literature. A comparison of the processing times of SENNA and other SRL systems or syntactic parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank) conforming SRL program currently available. With SENNA, 89 million biomedical sentences were tagged on a 100-node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences, resulting in precision/recall values of 0.71/0.43. We show that both the accuracy and the processing speed of the proposed semantic relation extraction approach are sufficient for its large-scale application to biomedical text. The proposed approach is highly generalizable with regard to the supported relation types and appears to be especially suited for general-purpose, broad-scale text mining systems.
The presented approach bridges the gap between fast, co-occurrence-based approaches lacking semantic relations and highly specialized and computationally demanding NLP approaches.
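
To put the reported figures in perspective, the short calculation below derives two numbers that the abstract does not state: the F1 score implied by the precision/recall pair, and a rough per-node throughput. Both are back-of-the-envelope values under stated assumptions (the full three days were used, and the work was spread evenly over all 100 nodes); they are not results reported by the authors.

```python
# Figures taken from the abstract.
precision = 0.71
recall = 0.43
sentences = 89_000_000
nodes = 100
days = 3  # assumption: "within three days" is treated as exactly three days

# F1 is the harmonic mean of precision and recall (derived, not reported).
f1 = 2 * precision * recall / (precision + recall)

# Rough lower bound on throughput, assuming an even load across all nodes.
seconds = days * 24 * 60 * 60
per_node_rate = sentences / (nodes * seconds)

print(f"F1 = {f1:.3f}")
print(f"sentences per second per node >= {per_node_rate:.2f}")
```

Under these assumptions the implied F1 is about 0.54, and each node handles at least about 3.4 sentences per second.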


Figure 1. Biomedical sentence and one corresponding PAS for the verb ‘blocked’. A sentence annotated by SENNA with semantic, PropBank-conforming roles. ARG0: the subject phrase of the verb (the blocker); ARG1: the argument role of the verb (the thing blocked); rel: the verb; ARGM-MNR: modifier argument for manner.

Mentions: A comparatively new approach belonging to the second application type is Semantic Role Labeling (SRL). SRL determines the semantic roles that syntactic constituents of a sentence play in relation to a certain predicate. A set consisting of a verb and its corresponding semantic arguments is called a “predicate-argument structure” (PAS) (figure 1). Typical semantic roles according to the annotation system of the PropBank corpus [3], a newspaper corpus on which many existing SRL systems are based, are the subject role (labeled “ARG0”) and the argument role (typically “ARG1”) of a verb. The semantic roles of the remaining core arguments ARG2-ARG5 are more diverse and depend on the verb of each PAS [3]. All verbs can also have so-called “modifier” arguments (ARGM), such as location (“ARGM-LOC”), time, cause and others.
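
The PAS described above can be pictured as a small data structure: one verb plus a mapping from role labels to phrases. The sketch below is only an illustration of that idea. The `PAS` class, its method names, and the example sentence are all hypothetical; it does not reproduce SENNA's output format, and reading an (ARG0, verb, ARG1) triple off a PAS is a simplifying assumption, not the authors' extraction procedure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class PAS:
    """One predicate-argument structure: a verb plus its labeled arguments."""
    rel: str                                            # the predicate (verb)
    args: Dict[str, str] = field(default_factory=dict)  # role label -> phrase

    def core(self) -> Dict[str, str]:
        """Core arguments (ARG0..ARG5), excluding ARGM-* modifiers."""
        return {r: p for r, p in self.args.items() if not r.startswith("ARGM")}

    def modifiers(self) -> Dict[str, str]:
        """Modifier arguments only (labels starting with ARGM)."""
        return {r: p for r, p in self.args.items() if r.startswith("ARGM")}

    def as_relation(self) -> Optional[Tuple[str, str, str]]:
        """Return (ARG0, verb, ARG1) if both roles are present, else None."""
        if "ARG0" in self.args and "ARG1" in self.args:
            return (self.args["ARG0"], self.rel, self.args["ARG1"])
        return None


# Hypothetical sentence, invented for illustration (not the one in figure 1):
# "The antibody completely blocked receptor activation."
pas = PAS(
    rel="blocked",
    args={
        "ARG0": "The antibody",
        "ARG1": "receptor activation",
        "ARGM-MNR": "completely",
    },
)

print(pas.as_relation())
print(pas.modifiers())
```

Keeping core and modifier arguments in one mapping and separating them by label prefix mirrors the PropBank naming scheme, in which modifier labels start with ARGM.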

