BioInfer: a corpus for information extraction in the biomedical domain.

Pyysalo S, Ginter F, Heimonen J, Björne J, Boberg J, Järvinen J, Salakoski T - BMC Bioinformatics (2007)

Bottom Line: Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers.

Affiliation: Turku Centre for Computer Science (TUCS), University of Turku, Lemminkäisenkatu 14a, 20520 Turku, Finland. sampo.pyysalo@it.utu.fi

ABSTRACT

Background: Recently, there has been great interest in applying information extraction methods to the biomedical domain, in particular to the extraction of relationships between genes, proteins, and RNA from scientific publications. The development and evaluation of such methods require annotated domain corpora.

Results: We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme capturing named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, and syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation.

Conclusion: We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. The corpus will be maintained and further developed; the current version is available at http://www.it.utu.fi/BioInfer.

Figure 3: The relationship type ontology. Non-leaf nodes, drawn as boxes, are relationship type classes. Leaf nodes, drawn as ovals, are the predicate names used in the relationship annotation.

Mentions: The relationship type ontology (Figure 3) specifies four broad classes of relationships. The observation class captures experimental observations, such as co-occurrence and coprecipitation. The classes part-of and is-a describe taxonomical, substructure/superstructure, physical/functional similarity, and equality relationships. The fourth class specifies causal relationships, where one entity causes a change in the state of another entity. While the class structure of the relationship ontology is fixed, predicates are introduced into the ontology as needed; the choice of predicates is thus empirical and depends on the coverage of the annotated corpus. For example, of all the possible post-translational modifications only a few, such as PHOSPHORYLATE, were sufficiently common in the corpus to merit a separate predicate in the ontology. To illustrate the coverage of the predicates of the annotated relationships and to estimate how often new predicates would need to be added when extending the annotation, we calculated the cumulative total number of different predicates that occur in the annotation when traversing the sentences in an arbitrary order (Figure 4). We find that, for example, 50% of all predicates are, on average, already seen after the first 74 sentences. Further, the total number of predicates seen increases by only one over approximately the last 220 sentences. Extrapolating from this rate of roughly one new predicate per 220 sentences, we estimate that annotating another 1100 sentences would require only approximately five new predicates to be introduced into the ontology.
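The cumulative predicate count described above can be sketched in a few lines of Python. This is only an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the representation of each sentence as a set of predicate names, the number of random orderings, and the toy data (including the predicate names BIND and INHIBIT) are all hypothetical. It averages, over many random sentence orderings, the number of distinct predicates seen after each sentence, and then reads off how many sentences are needed to reach a given fraction of all predicates.

```python
import random


def mean_cumulative_predicates(sentences, n_orderings=1000, seed=0):
    """Average cumulative count of distinct predicates over random orderings.

    sentences: list of sets, each holding the predicate names annotated
    in one sentence (a hypothetical representation of the corpus).
    Returns a list whose entry i is the mean number of distinct
    predicates seen after the first i + 1 sentences.
    """
    rng = random.Random(seed)
    totals = [0] * len(sentences)
    order = list(sentences)
    for _ in range(n_orderings):
        rng.shuffle(order)          # one arbitrary traversal order
        seen = set()
        for i, predicates in enumerate(order):
            seen |= predicates      # add predicates first met here
            totals[i] += len(seen)
    return [total / n_orderings for total in totals]


def sentences_to_reach(curve, fraction):
    """Smallest number of sentences at which the curve reaches the
    given fraction of its final value."""
    target = fraction * curve[-1]
    for count, value in enumerate(curve, start=1):
        if value >= target:
            return count
    return len(curve)


# Toy data only; BIND and INHIBIT are made-up predicate names.
toy_corpus = [
    {"BIND"},
    {"BIND", "PHOSPHORYLATE"},
    {"BIND"},
    {"INHIBIT"},
    set(),
]
curve = mean_cumulative_predicates(toy_corpus, n_orderings=200)
half_point = sentences_to_reach(curve, 0.5)
```

On the real corpus, the value returned by `sentences_to_reach(curve, 0.5)` would correspond to the figure of 74 sentences reported above, and the flat tail of `curve` to the observation that only one new predicate appears in roughly the last 220 sentences.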

