Negation's not solved: generalizability versus optimizability in clinical natural language processing.

Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D, Clark C - PLoS ONE (2014)

Bottom Line: A review of published work in clinical natural language processing (NLP) may suggest that the negation detection task has been "solved." This work proposes that an optimizable solution does not equal a generalizable solution. Various factors (e.g., annotation guidelines, named entity characteristics, the amount of data, and lexical and syntactic context) play a role in making generalizability difficult, but none completely explains the phenomenon. The most reliable means to improve negation detection is to manually annotate in-domain training data (or, perhaps, manually modify rules); this is a strategy for optimizing performance, rather than generalizing it.

Affiliation: Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America; Oregon Health and Science University, Portland, Oregon, United States of America.

ABSTRACT
A review of published work in clinical natural language processing (NLP) may suggest that the negation detection task has been "solved." This work proposes that an optimizable solution does not equal a generalizable solution. We introduce a new machine learning-based Polarity Module for detecting negation in clinical text, and extensively compare its performance across domains. Using four manually annotated corpora of clinical text, we show that negation detection performance suffers when there is no in-domain development (for manual methods) or training data (for machine learning-based methods). Various factors (e.g., annotation guidelines, named entity characteristics, the amount of data, and lexical and syntactic context) play a role in making generalizability difficult, but none completely explains the phenomenon. Furthermore, generalizability remains challenging because it is unclear whether to use a single source for accurate data, combine all sources into a single model, or apply domain adaptation methods. The most reliable means to improve negation detection is to manually annotate in-domain training data (or, perhaps, manually modify rules); this is a strategy for optimizing performance, rather than generalizing it. These results suggest a direction for future work in domain-adaptive and task-adaptive methods for clinical NLP.
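
A minimal sketch of the cross-domain experimental design described above may help: performance is measured on a train-on-one-corpus, test-on-another grid, whose diagonal corresponds to in-domain training and whose off-diagonal cells correspond to out-of-domain training. This is an illustrative reconstruction, not the authors' code: a generic linear classifier stands in for the paper's Polarity Module, the load_corpus() helper and the "negated" label are hypothetical, and scikit-learn is assumed.

```python
# Hedged sketch: the train-on-one, test-on-another evaluation grid.
# Each corpus is split once so that diagonal (in-domain) scores are
# measured on held-out data, matching the paper's setup.
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

CORPORA = ["i2b2", "MiPACQ", "NegEx", "SHARP"]

def load_corpus(name):
    """Hypothetical loader: returns (feature_dicts, polarity_labels)."""
    raise NotImplementedError

splits = {name: train_test_split(*load_corpus(name),
                                 test_size=0.2, random_state=0)
          for name in CORPORA}

grid = {}
for src in CORPORA:                      # training source (rows)
    X_tr, _, y_tr, _ = splits[src]
    vec = DictVectorizer()
    clf = LinearSVC().fit(vec.fit_transform(X_tr), y_tr)
    for tgt in CORPORA:                  # test corpus (columns)
        _, X_te, _, y_te = splits[tgt]
        pred = clf.predict(vec.transform(X_te))
        # src == tgt: in-domain; src != tgt: out-of-domain (OOD)
        grid[(src, tgt)] = f1_score(y_te, pred, pos_label="negated")
```

Under this framing, generalizability asks how far the off-diagonal cells fall below the diagonal, while optimizability asks how high the diagonal can be pushed with in-domain annotation.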

Figure 2 (pone-0112774-g002): Significance bands of model performance for each test corpus. These are labeled with successive letters from right to left in Table 4.

Mentions: The practical question a user might ask is: “How can I maximize negation detection performance for my data?” Table 4 illustrates the difficulty of answering this question by showing performance on four corpora (columns) by various systems (rows). Row 0 gives previously reported comparison statistics for the i2b2 data (MITRE [2]) and the NegEx TestSet (GenNegEx 1.2.0, see https://code.google.com/p/negex/wiki/TestSet); no previously reported results are available for SHARP and MiPACQ. We have grouped these systems to represent three strategies for negation detection used in the community: the unedited, rule-based YTEX algorithm (row 1); machine learning classifiers when only out-of-domain (OOD) data is available (rows 2–6); and machine learning classifiers when some in-domain data is available (rows 7–9). Note that row 7 is equivalent to the diagonal of rows 2–6, i.e., where the training set and test set are drawn from (different portions of) the same corpus. Table 4 also includes significance bands down each column: pair-wise approximate randomization significance tests on F1 score, aggregated by document, are reported at p<0.05. Values in a column labeled with different successive superscripted letters (e.g., 93.9^a and 92.6^b) differ significantly. These bands are further visualized in Figure 2.
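
The significance-testing protocol named above can be made concrete with a short sketch. The following is an illustrative implementation of a pair-wise approximate randomization test on document-aggregated F1, not the authors' actual code; the per-document (tp, fp, fn) count tuples are an assumed input format.

```python
# Hedged sketch: approximate randomization test for the F1 difference
# between two systems, aggregated by document. Assumes each system's
# output per document has been reduced to (tp, fp, fn) counts.
import random

def micro_f1(counts):
    """Micro-averaged F1 over a list of per-document (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def approx_randomization(counts_a, counts_b, rounds=10000, seed=0):
    """p-value for the null hypothesis that systems A and B have equal F1."""
    rng = random.Random(seed)
    observed = abs(micro_f1(counts_a) - micro_f1(counts_b))
    hits = 0
    for _ in range(rounds):
        shuf_a, shuf_b = [], []
        for ca, cb in zip(counts_a, counts_b):
            # Swap the two systems' outputs on this document with prob. 0.5;
            # under the null hypothesis the system labels are exchangeable.
            if rng.random() < 0.5:
                ca, cb = cb, ca
            shuf_a.append(ca)
            shuf_b.append(cb)
        if abs(micro_f1(shuf_a) - micro_f1(shuf_b)) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)
```

Running this test over every pair of systems in a column, then grouping systems whose pair-wise differences do not reach p<0.05, produces the lettered significance bands (a, b, c, ...) visualized in Figure 2.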

