Towards comprehensive syntactic and semantic annotations of the clinical narrative.

Albright D, Lanfranchi A, Fredriksen A, Styler WF, Warner C, Hwang JD, Choi JD, Dligach D, Nielsen RD, Martin J, Ward W, Palmer M, Savova GK - J Am Med Inform Assoc (2013)

Bottom Line: Inter-annotator agreement results: Treebank (0.926), PropBank (0.891-0.931), NE (0.697-0.750). A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations. This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain.


Affiliation: Department of Linguistics, University of Colorado, Boulder, Colorado, USA.

ABSTRACT

Objective: To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP). To develop NLP algorithms and open source components.

Methods: Manual annotation of a clinical narrative corpus of 127 606 tokens following the Treebank schema for syntactic information, PropBank schema for predicate-argument structures, and the Unified Medical Language System (UMLS) schema for semantic information. NLP components were developed.

Results: The final corpus consists of 13 091 sentences containing 1772 distinct predicate lemmas. Of the 766 newly created PropBank frames, 74 are verbs. There are 28 539 named entity (NE) annotations spread over 15 UMLS semantic groups, one UMLS semantic type, and the Person semantic category. The most frequent annotations belong to the UMLS semantic groups of Procedures (15.71%), Disorders (14.74%), Concepts and Ideas (15.10%), Anatomy (12.80%), Chemicals and Drugs (7.49%), and the UMLS semantic type of Sign or Symptom (12.46%). Inter-annotator agreement results: Treebank (0.926), PropBank (0.891-0.931), NE (0.697-0.750). The part-of-speech tagger, constituency parser, dependency parser, and semantic role labeler are built from the corpus and released open source. A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations.

Conclusions: This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain. The corpus creation and NLP components provide a resource for research and application development that would have been previously impossible.
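As a rough aid to reading the Results figures, the reported percentages can be turned into approximate annotation counts. The sketch below uses only the total and the shares quoted in the abstract; the variable names are illustrative, and the counts are rounded estimates, not figures reported by the authors.

```python
# Total NE annotations and category shares, as quoted in the abstract.
TOTAL_NE_ANNOTATIONS = 28539

SHARES_PERCENT = {
    "Procedures": 15.71,
    "Disorders": 14.74,
    "Concepts and Ideas": 15.10,
    "Anatomy": 12.80,
    "Chemicals and Drugs": 7.49,
    "Sign or Symptom": 12.46,
}

# Approximate count per category: share of the total, rounded to a whole annotation.
approx_counts = {
    name: round(TOTAL_NE_ANNOTATIONS * pct / 100)
    for name, pct in SHARES_PERCENT.items()
}

# Combined share of the six most frequent categories.
listed_share = sum(SHARES_PERCENT.values())

for name, count in approx_counts.items():
    print(f"{name}: ~{count} annotations")
print(f"Listed categories cover {listed_share:.2f}% of all NE annotations")
```

The remaining share is spread over the other semantic groups and the Person category, which the abstract does not break down.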


AMIAJNL2012001317F2: Example of clinical Treebank guideline changes.

Mentions: Treebank annotations consist of POS, phrasal and function tags, and empty categories (see below), which are organized in a tree-like structure (see figure 1). We adapted Penn's POS Tagging Guidelines, Bracketing Guidelines, and all associated addenda, as well as the biomedical guideline supplements [8]. We adjusted existing policies and implemented new guidelines to account for differences encountered in the clinical domain. The Penn Treebank (PTB) supplements include policies for spoken language and biomedical annotation; however, clinical narratives (CN) contain a number of previously unseen patterns that required refinements to the PTB annotation policies. For example, fragmentary sentences like ‘Coughing up purulent material.’, which are common in clinical data, are now annotated as S with a new function tag, -RED, to mark them as reduced, and are given full subject and argument structures. Under existing PTB policies these are annotated as top-level FRAG and lack full argument structure. Figure 2 presents examples of Treebank changes. Additionally, current PTB tokenization was fine-tuned to handle certain abbreviations more accurately. For example, under current PTB tokenization policy, the shorthand notation ‘d/c’ (for the verb ‘discontinue’) would be annotated as three tokens: d/AFX //HYPH c/VB; however, we decided to annotate the abbreviation as one token (d/c/VB) to better align with the full form of the verb.
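The tokenization change described above can be illustrated with a minimal sketch. This is not the authors' tokenizer: the function names and the KEEP_WHOLE set are hypothetical, only ‘d/c’ comes from the text, and the slash-splitting rule is a simplified stand-in for the PTB policy that ignores other punctuation.

```python
import re

# Hypothetical set of clinical shorthand to keep as one token.
# Only 'd/c' is taken from the text; a real list would be longer.
KEEP_WHOLE = frozenset({"d/c"})


def split_on_slash(word):
    """Split a word around each slash, keeping the slash as its own token."""
    return [piece for piece in re.split(r"(/)", word) if piece]


def tokenize(text, keep_whole=frozenset()):
    """Whitespace-tokenize, splitting on slashes unless the word is protected."""
    tokens = []
    for word in text.split():
        if word.lower() in keep_whole:
            tokens.append(word)  # clinical policy: the abbreviation stays whole
        else:
            tokens.extend(split_on_slash(word))  # simplified PTB-style split
    return tokens


sentence = "Will d/c aspirin"
print(tokenize(sentence))              # slash split into separate tokens
print(tokenize(sentence, KEEP_WHOLE))  # 'd/c' kept as a single token
```

Keeping the abbreviation whole lets it carry a single verb tag (VB), which matches how the full form ‘discontinue’ would be tagged.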

