BioInfer: a corpus for information extraction in the biomedical domain.

Pyysalo S, Ginter F, Heimonen J, Björne J, Boberg J, Järvinen J, Salakoski T - BMC Bioinformatics (2007)

Bottom Line: Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers.


Affiliation: Turku Centre for Computer Science (TUCS), University of Turku, Lemminkäisenkatu 14a, 20520 Turku, Finland. sampo.pyysalo@it.utu.fi

ABSTRACT

Background: Lately, there has been great interest in the application of information extraction methods to the biomedical domain, in particular to the extraction of relationships among genes, proteins, and RNA from scientific publications. The development and evaluation of such methods require annotated domain corpora.

Results: We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme capturing named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation.

Conclusion: We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. The corpus will be maintained and further developed with a current version being available at http://www.it.utu.fi/BioInfer.


Figure 11: The BioInfer visualizer. The graphical user interface shows a sentence with its syntactic annotation as typed dependencies, with the text corresponding to the selected entities and relationships highlighted (top). A hierarchical view of relationships is shown at middle left, and entities with their types at middle right. The bottom of the screen contains a sentence selector.

Mentions: The supporting software is based on an extendable framework for parsing the corpus and representing it as data structures that can be accessed through a fully documented API (Application Programming Interface) providing methods for accessing different aspects of the annotation. In addition, we provide programs built on this API that visualize the annotations and extract them in a simplified, human-readable form (see Appendix I). The user interface of the BioInfer Visualizer is shown in Figure 11. With the BioInfer API, it is easy to create programs which, for example, analyze the corpus data or transform the annotation into formats required by other software. All supporting software is written in Python, thoroughly documented, and available under an open-source license on the corpus website.
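
As a rough illustration of the kind of program the paragraph above describes (reading annotation and transforming it into a simpler format), the following minimal sketch uses only the Python standard library. It does not use the BioInfer API, whose classes and methods are not described here. The XML markup, the element and attribute names (`sentence`, `entity`, `relation`, `arg1`, `arg2`), the example sentence, and the function `relations_as_tsv` are all assumptions made up for this example and should not be read as the corpus's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy markup for illustration only.
# This is NOT the actual BioInfer XML schema or data.
TOY_CORPUS = """\
<corpus>
  <sentence id="s1" text="Profilin binds actin.">
    <entity id="e1" type="Protein" text="Profilin"/>
    <entity id="e2" type="Protein" text="actin"/>
    <relation type="BIND" arg1="e1" arg2="e2"/>
  </sentence>
</corpus>
"""


def relations_as_tsv(xml_text):
    """Return one tab-separated line per annotated relationship.

    Each line has the form: sentence id, relation type, first
    argument text, second argument text.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for sentence in root.iter("sentence"):
        # Map entity ids to their surface text within this sentence.
        entities = {e.get("id"): e.get("text") for e in sentence.iter("entity")}
        for rel in sentence.iter("relation"):
            rows.append("\t".join([
                sentence.get("id"),
                rel.get("type"),
                entities[rel.get("arg1")],
                entities[rel.get("arg2")],
            ]))
    return rows


if __name__ == "__main__":
    for row in relations_as_tsv(TOY_CORPUS):
        print(row)
```

A real conversion would presumably go through the documented API data structures rather than raw XML, and would need to handle nested relationships, which this flat toy format cannot represent.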

