Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows.

Fu X, Batista-Navarro R, Rak R, Ananiadou S - J Biomed Semantics (2015)

Bottom Line: However, as phenotypes are often "hidden" within free text in clinical records, clinicians could benefit from text mining systems that facilitate their prompt recognition. This was implemented in the Argo platform by means of a semi-automatic annotation workflow that integrates several text mining tools, including a graphical user interface for marking up documents. When evaluated using gold standard (i.e., manually validated) annotations, the semi-automatic workflow was shown to obtain a micro-averaged F-score of 45.70% (with relaxed matching).

Affiliation: National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester Institute of Biotechnology, 131 Princess Street, Manchester, UK.

ABSTRACT

Background: Chronic obstructive pulmonary disease (COPD) is a life-threatening lung disorder whose recent prevalence has led to an increasing burden on public healthcare. Phenotypic information in electronic clinical records is essential in providing suitable personalised treatment to patients with COPD. However, as phenotypes are often "hidden" within free text in clinical records, clinicians could benefit from text mining systems that facilitate their prompt recognition. This paper reports on a semi-automatic methodology for producing a corpus that can ultimately support the development of text mining tools that, in turn, will expedite the process of identifying groups of COPD patients.

Methods: A corpus of 30 full-text papers was formed based on selection criteria informed by the expertise of COPD specialists. We developed an annotation scheme that is aimed at producing fine-grained, expressive and computable COPD annotations without burdening our curators with a highly complicated task. This was implemented in the Argo platform by means of a semi-automatic annotation workflow that integrates several text mining tools, including a graphical user interface for marking up documents.

Results: When evaluated using gold standard (i.e., manually validated) annotations, the semi-automatic workflow was shown to obtain a micro-averaged F-score of 45.70% (with relaxed matching). Utilising the gold standard data to train new concept recognisers, we demonstrated that our corpus, although still a work in progress, can foster the development of significantly better performing COPD phenotype extractors.

Conclusions: We describe in this work the means by which we aim to eventually support the process of COPD phenotype curation, i.e., by the application of various text mining tools integrated into an annotation workflow. Although the corpus being described is still under development, our results thus far are encouraging and show great potential in stimulating the development of further automatic COPD phenotype extractors.
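The Results above report a micro-averaged F-score under relaxed matching. The following is a minimal sketch of how such a score could be computed, not the authors' evaluation code. It assumes that "relaxed" means a predicted span counts as correct when it overlaps a gold span of the same type, and that micro-averaging pools counts over all documents and concept types before computing precision and recall. The function names and the toy spans are hypothetical.

```python
from typing import List, Tuple

# A span is (start, end, label) with an exclusive end offset.
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """Assumed relaxed criterion: same label and at least one shared character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def micro_f1_relaxed(gold_docs: List[List[Span]],
                     pred_docs: List[List[Span]]) -> Tuple[float, float, float]:
    """Pool counts over all documents and labels, then compute P, R and F1."""
    matched_pred = matched_gold = n_pred = n_gold = 0
    for gold, pred in zip(gold_docs, pred_docs):
        n_gold += len(gold)
        n_pred += len(pred)
        # Predictions that overlap some gold span count towards precision.
        matched_pred += sum(any(overlaps(p, g) for g in gold) for p in pred)
        # Gold spans that overlap some prediction count towards recall.
        matched_gold += sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = matched_pred / n_pred if n_pred else 0.0
    recall = matched_gold / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (invented for illustration only).
gold_docs = [[(0, 5, "Problem"), (10, 20, "Treatment")], [(3, 8, "TestOrMeasure")]]
pred_docs = [[(2, 6, "Problem"), (30, 35, "Treatment")],
             [(3, 8, "TestOrMeasure"), (9, 12, "Problem")]]
precision, recall, f1 = micro_f1_relaxed(gold_docs, pred_docs)
```

Under strict matching the first prediction, (2, 6), would be rejected because its boundaries differ from the gold span (0, 5); the overlap criterion accepts it, which is why relaxed scores are typically higher than strict ones.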

Fig3: The user interface for linking mentions to ontologies.

Mentions: After running the syntactic tools, the workflow splits into four branches. The first branch performs joint annotation of concepts pertaining to Problem, Treatment and TestOrMeasure by means of the NERsuite [45] component, a named entity recogniser (NER) based on an implementation of conditional random fields [46]. Supplied with a model trained on the 2010 i2b2/VA challenge training set [47], this NER is employed to provide domain experts with automatically generated cues which could aid them in marking up full phrases describing COPD phenotypes. Meanwhile, the NERsuite component in the second branch is configured to recognise disease mentions using a model trained on the NCBI Disease corpus [48]. The third branch performs drug name recognition using the Chemical Entity Recogniser, an adaptation of NERsuite employing chemistry-specific features and heuristics [49] which was parameterised with a model trained on the Drug-Drug Interaction (DDI) corpus [50]. Finally, by means of the Truecase Asciifier, Brown, OBO Anatomy and UMLS Dictionary Feature Extractors, the last branch extracts various features required by the Anatomical Entity Tagger, which is capable of recognising anatomical concepts [51]. The Annotation Merger component collects annotations produced by the various concept recognisers, whilst the Manual Annotation Editor allows human annotators to manually correct, add or remove automatically generated annotations via its rich graphical user interface (Figure 3).
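The branch-and-merge layout described above can be sketched roughly as follows. This is an illustrative stand-in, not Argo's API or the NERsuite interface: each branch is replaced by a toy lexicon lookup (the real components are CRF-based recognisers with trained models), and the names `Annotation`, `lexicon_recogniser` and `merge_annotations`, as well as the labels `Disease`, `Drug` and `AnatomicalConcept`, are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Annotation:
    start: int
    end: int
    label: str
    source: str  # which branch produced the annotation


Recogniser = Callable[[str], List[Annotation]]


def lexicon_recogniser(source: str, lexicon: Dict[str, str]) -> Recogniser:
    """Toy stand-in for a trained recogniser: case-insensitive whole-word lookup."""
    def recognise(text: str) -> List[Annotation]:
        found = []
        for term, label in lexicon.items():
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append(Annotation(m.start(), m.end(), label, source))
        return found
    return recognise


def merge_annotations(text: str, branches: List[Recogniser]) -> List[Annotation]:
    """Run every branch on the same text and pool the results, dropping exact duplicates."""
    merged = set()
    for branch in branches:
        merged.update(branch(text))
    return sorted(merged, key=lambda a: (a.start, a.end, a.label, a.source))


# Four branches mirroring the description; the lexicons are invented examples.
branches = [
    lexicon_recogniser("clinical", {"emphysema": "Problem", "spirometry": "TestOrMeasure"}),
    lexicon_recogniser("disease", {"COPD": "Disease"}),
    lexicon_recogniser("drug", {"salbutamol": "Drug"}),
    lexicon_recogniser("anatomy", {"lung": "AnatomicalConcept"}),
]

text = "Spirometry confirmed COPD with emphysema in the left lung; salbutamol was prescribed."
merged = merge_annotations(text, branches)
for ann in merged:
    print(ann.label, repr(text[ann.start:ann.end]), "from", ann.source)
```

Keeping the `source` field on each annotation means a human annotator reviewing the merged output could still see which branch proposed a given span, which seems useful for the manual correction step, though the source does not say whether the Annotation Merger preserves this information.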

