BioInfer: a corpus for information extraction in the biomedical domain.

Pyysalo S, Ginter F, Heimonen J, Björne J, Boberg J, Järvinen J, Salakoski T - BMC Bioinformatics (2007)

Bottom Line: Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers.


Affiliation: Turku Centre for Computer Science (TUCS), University of Turku, Lemminkäisenkatu 14a, 20520 Turku, Finland. sampo.pyysalo@it.utu.fi

ABSTRACT

Background: Lately, there has been great interest in the application of information extraction methods to the biomedical domain, in particular, to the extraction of relationships of genes, proteins, and RNA from scientific publications. The development and evaluation of such methods require annotated domain corpora.

Results: We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme capturing named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation.

Conclusion: We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. The corpus will be maintained and further developed with a current version being available at http://www.it.utu.fi/BioInfer.



Figure 10: NP bracketing. Unresolved (left) vs. resolved at the level of coordination (right) NP bracketing. Thick lines denote NP macro-dependencies.

Mentions: In a corpus used to develop tools for extracting bioentity relationships, it is not critical to resolve the internal structure of every noun phrase, a general principle followed by both BioInfer and, for example, the Genia treebank. In the BioInfer corpus, pre-modifier attachment in noun phrases without internal coordination is not resolved, as these phrases typically contain at most one named entity. In noun phrases that contain internal coordination, however, each of the coordinates can represent a different named entity, so resolving the coordination, and thus separating the named entities from each other, is vital to the IE task. The distinction is illustrated in Figure 10. From the perspective of an IE system extracting bioentity relationships, the difference between the correct bracketing [[actin filament] core bundles] and the unresolved bracketing [actin filament core bundles] is not critical, because actin is the only protein in the phrase and the inner structure of the phrase can be ignored. In contrast, the correct resolution of the coordination [[[myosin heavy chain] and [actin]] isoforms] is important because the coordinates are protein names and hence matter to the IE system. While the inner structure of the coordinated phrases is still not necessarily fully resolved, the two protein names are separated by the coordination. Further, while major wide-coverage dependency parsers such as Connexor Machinese Syntax, MiniPar, and Link Parser do not resolve the attachment of pre-modifiers in noun phrases, they do resolve coordination. The annotation in BioInfer thus coincides with the intended coverage of these parsers.
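
As a rough illustration of why coordination-level bracketing is sufficient for the IE task, the following Python sketch reads the bracket notation used in the examples above and lists the bracketed coordinates of any coordination. It is not part of the BioInfer software or its file format; the function names `parse_brackets` and `conjuncts` are hypothetical, and treating the token "and" as the only coordinator is a simplifying assumption.

```python
def parse_brackets(text):
    """Parse a bracketed phrase such as '[[a b] c]' into nested lists of tokens.

    Hypothetical helper for illustration only; not a BioInfer API.
    """
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [[]]  # stack[0] collects the top-level brackets
    for tok in tokens:
        if tok == "[":
            stack.append([])
        elif tok == "]":
            if len(stack) < 2:
                raise ValueError("unbalanced brackets")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced brackets")
    return stack[0]


def flatten(node):
    """Return the tokens under a node in surface order."""
    if isinstance(node, str):
        return [node]
    tokens = []
    for child in node:
        tokens.extend(flatten(child))
    return tokens


def conjuncts(node):
    """Collect bracketed coordinates of every bracket that directly contains 'and'.

    Assumption: 'and' is the only coordinator, and a coordinate counts as
    resolved only if it has its own bracket.
    """
    if isinstance(node, str):
        return []
    found = []
    if "and" in node:
        found.extend(" ".join(flatten(c)) for c in node if isinstance(c, list))
    for child in node:
        found.extend(conjuncts(child))
    return found


resolved = parse_brackets("[[[myosin heavy chain] and [actin]] isoforms]")
unresolved = parse_brackets("[actin filament core bundles]")

print(conjuncts(resolved))    # ['myosin heavy chain', 'actin']
print(conjuncts(unresolved))  # []
```

In this sketch the coordinated phrase yields the two protein names as separate units even though the internal structure of "myosin heavy chain" is left unresolved, while the phrase without coordination yields no coordinates, so its bracketing does not affect which entities are separated.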

