Using workflows to explore and optimise named entity recognition for chemistry.

Kolluru B, Hawizy L, Murray-Rust P, Tsujii J, Ananiadou S - PLoS ONE (2011)

Bottom Line: Results indicate that, for chemistry in particular, eliminating noise generated by tokenisation techniques leads to slightly better performance than others, in terms of named entity recognition (NER) accuracy. On the Sciborg corpus, the workflow-based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84% against 84.23% for OSCAR.


Affiliation: National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, United Kingdom. balakrishna.kolluru@manchester.ac.uk

ABSTRACT
Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored OSCAR, an established named entity recogniser in the chemistry domain, and studied the impact of each component on the net performance. We developed two reconfigurable workflows from OSCAR using an interoperable text mining framework, U-Compare. These workflows can be altered using the drag-and-drop mechanism of the U-Compare graphical user interface. They also provide a platform to study the relationship between text mining components such as tokenisation and named entity recognition (using maximum entropy Markov model (MEMM) and pattern recognition based classifiers). Results indicate that, for chemistry in particular, eliminating noise generated by tokenisation techniques leads to slightly better performance than others, in terms of named entity recognition (NER) accuracy. Poor tokenisation translates into poorer input to the classifier components, which in turn leads to an increase in Type I or Type II errors, thus lowering the overall performance. On the Sciborg corpus, the workflow-based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84% against 84.23% for OSCAR.

Figure 3 (pone-0020181-g003): Oscar3 refactored as a workflow of different components.

Mentions: Conceptually, Oscar3 [9] is a named entity recogniser which classifies tokens as chemical entities based on either likelihoods or a pattern match. Oscar3 was therefore divided into the components shown in Figure 3. This is just one of many possible configurations of the workflow; others, such as ones using different tokenisers or components implementing machine learning techniques, can easily be accommodated to make a new workflow.
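
The idea of swapping one component while keeping the rest of the workflow fixed can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen for illustration: the two toy tokenisers, the `CHEMICAL_PATTERN` regular expression, and the `run_workflow` helper are not part of Oscar3 or U-Compare, and the pattern is far cruder than any real chemical name recogniser. The sketch only shows how a tokeniser that fragments chemical names hands poorer input to an unchanged pattern-based recogniser.

```python
import re
from typing import Callable, List

# Hypothetical component types: a tokeniser maps text to tokens,
# a recogniser maps tokens to the tokens it labels as chemical names.
Tokeniser = Callable[[str], List[str]]
Recogniser = Callable[[List[str]], List[str]]


def whitespace_tokeniser(text: str) -> List[str]:
    """Split on whitespace and strip surrounding sentence punctuation."""
    stripped = (tok.strip(".,;") for tok in text.split())
    return [tok for tok in stripped if tok]


def naive_tokeniser(text: str) -> List[str]:
    """Split on every non-alphanumeric character.

    This breaks names such as '1,2-dichloroethane' into several tokens.
    """
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


# Toy pattern (an assumption for this sketch): an optional locant prefix
# such as '2-' or '1,2-', then letters ending in a common chemical suffix.
CHEMICAL_PATTERN = re.compile(
    r"^(?:\d+(?:,\d+)*-)?[a-z]+(?:ane|ene|ol|ide|ate)$", re.IGNORECASE
)


def pattern_recogniser(tokens: List[str]) -> List[str]:
    """Return the tokens that match the toy chemical-name pattern."""
    return [tok for tok in tokens if CHEMICAL_PATTERN.match(tok)]


def run_workflow(text: str, tokeniser: Tokeniser, recogniser: Recogniser) -> List[str]:
    """Chain a tokeniser and a recogniser; either one can be swapped."""
    return recogniser(tokeniser(text))


TEXT = "We dissolved 2-propanol and 1,2-dichloroethane in water."

full_names = run_workflow(TEXT, whitespace_tokeniser, pattern_recogniser)
fragments = run_workflow(TEXT, naive_tokeniser, pattern_recogniser)

print(full_names)  # ['2-propanol', '1,2-dichloroethane']
print(fragments)   # ['propanol', 'dichloroethane']
```

With the recogniser held fixed, the second tokeniser drops the locants, so the recognised spans no longer match the full chemical names; under exact-match scoring such truncated entities would count as errors.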

