Automatic classification of diseases from free-text death certificates for real-time surveillance.

Koopman B, Karimi S, Nguyen A, McGuire R, Muscatello D, Kemp M, Truran D, Zhang M, Thackway S - BMC Med Inform Decis Mak (2015)

Bottom Line: Precision and recall (positive predictive value and sensitivity) were used as evaluation measures, with F-measure providing a single, overall measure of effectiveness. More fine-grained ICD-10 classification effectiveness was more variable but still high (F-measure 0.80). In addition, anomalies in the ground truth likely led to an underestimation of the effectiveness.


Affiliation: Australian e-Health Research Centre, CSIRO, Royal Brisbane and Women's Hospital, Brisbane, Australia. bevan.koopman@csiro.au.

ABSTRACT

Background: Death certificates provide an invaluable source for mortality statistics which can be used for surveillance and early warnings of increases in disease activity and to support the development and monitoring of prevention or response strategies. However, their value can be realised only if accurate, quantitative data can be extracted from death certificates, an aim hampered by both the volume and variable nature of certificates written in natural language. This study aims to develop a set of machine learning and rule-based methods to automatically classify death certificates according to four high impact diseases of interest: diabetes, influenza, pneumonia and HIV.

Methods: Two classification methods are presented: i) a machine learning approach, where detailed features (terms, term n-grams and SNOMED CT concepts) are extracted from death certificates and used to train a set of supervised machine learning models (Support Vector Machines); and ii) a set of keyword-matching rules. These methods were used to identify the presence of diabetes, influenza, pneumonia and HIV in a death certificate. An empirical evaluation was conducted using 340,142 death certificates, divided between training and test sets, covering deaths from 2000-2007 in New South Wales, Australia. Precision and recall (positive predictive value and sensitivity) were used as evaluation measures, with F-measure providing a single, overall measure of effectiveness. A detailed error analysis was performed on classification errors.
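The keyword-matching method (ii) and the term n-gram features used in method (i) can be illustrated with a minimal sketch. The keyword lists below are hypothetical and chosen only for illustration, since the abstract does not list the actual rules; the function names are likewise assumptions, and the SVM training and SNOMED CT concept extraction steps are not shown.

```python
import re

# Hypothetical keyword lists for illustration only; the study's actual
# rules are not given in the abstract.
KEYWORD_RULES = {
    "diabetes": ["diabetes", "diabetic"],
    "influenza": ["influenza"],
    "pneumonia": ["pneumonia", "bronchopneumonia"],
    "hiv": ["hiv", "human immunodeficiency virus"],
}


def tokenize(text):
    """Lowercase the certificate text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_ngrams(tokens, n):
    """Return all contiguous n-term sequences, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def classify_by_keywords(text, rules=KEYWORD_RULES):
    """Return the set of diseases whose keywords appear as whole terms or n-grams."""
    tokens = tokenize(text)
    # The longest keyword (in words) decides how long the n-grams need to be.
    max_n = max(len(kw.split()) for kws in rules.values() for kw in kws)
    candidates = set()
    for n in range(1, max_n + 1):
        candidates.update(term_ngrams(tokens, n))
    return {
        disease
        for disease, kws in rules.items()
        if any(kw in candidates for kw in kws)
    }


if __name__ == "__main__":
    print(classify_by_keywords("Bronchopneumonia due to Type 2 diabetes mellitus"))
```

Because matching is done on whole terms, a spelling variant that is missing from the list is not matched, which is consistent with the reported finding that word variations adversely affected classification.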

Results: Classification of diabetes, influenza, pneumonia and HIV was highly accurate (F-measure 0.96). More fine-grained ICD-10 classification effectiveness was more variable but still high (F-measure 0.80). The error analysis revealed that word variations as well as certain word combinations adversely affected classification. In addition, anomalies in the ground truth likely led to an underestimation of the effectiveness.

Conclusions: The high accuracy and low cost of the classification methods allow for an effective means for automatic and real-time surveillance of diabetes, influenza, pneumonia and HIV deaths. In addition, the methods are generally applicable to other diseases of interest and to other sources of medical free-text besides death certificates.


Fig. 1: Classification performance results for diseases of interest: Influenza, Diabetes, Pneumonia and HIV. Error bars show 95% confidence intervals.

Mentions: Table 4 presents the detailed classification results for diseases of interest, with rule-based results shown in (a) and machine learning results shown in (b). In addition, a confusion matrix, which provides a breakdown of true positives, false positives, true negatives and false negatives, is shown for each disease. A graphical summary of the results is shown in the plot of Fig. 1.
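The relationship between the confusion matrix counts and the evaluation measures can be sketched as follows. This assumes the balanced F-measure (F1, the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name and the counts in the usage example are hypothetical and are not taken from Table 4.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and balanced F-measure from confusion matrix counts.

    True negatives are not needed for these three measures.
    """
    # Precision (positive predictive value): fraction of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (sensitivity): fraction of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Assumed balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical counts for one disease.
    p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```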

