Fine-grained information extraction from German transthoracic echocardiography reports.

Toepfer M, Corovic H, Fette G, Klügl P, Störk S, Puppe F - BMC Med Inform Decis Mak (2015)

Bottom Line: In particular, principal aspects as defined in a standardized external terminology were recognized with F1 = .989 (micro average) and F1 = .963 (macro average). As a result of keyword matching and restrained concept extraction, the system obtained high precision even on unstructured or exceptionally short documents, and on documents with uncommon layout. Extracted results populate a clinical data warehouse which supports clinical research.


Affiliation: Chair of Computer Science VI, University of Würzburg, Am Hubland, Würzburg, D-97074, Germany. martin.toepfer@uni-wuerzburg.de.

ABSTRACT

Background: Information extraction techniques that derive structured representations from unstructured data make a large amount of clinically relevant information about patients accessible to semantic applications. These methods typically rely on standardized terminologies that guide the process. Many languages and clinical domains, however, lack appropriate resources and tools, as well as evaluations of their application, especially when detailed conceptualizations of the domain are required. For instance, German transthoracic echocardiography reports had not been targeted sufficiently before, despite their importance for clinical trials. This work therefore aimed at the development and evaluation of an information extraction component with a fine-grained terminology that enables the recognition of almost all relevant information stated in German transthoracic echocardiography reports at the University Hospital of Würzburg.

Methods: A domain expert validated and iteratively refined an automatically inferred base terminology. The terminology was used by an ontology-driven information extraction system that outputs attribute-value pairs. The final component was mapped to the central elements of a standardized terminology and evaluated on documents with different layouts.

Results: The final system achieved state-of-the-art precision (micro average .996) and recall (micro average .961) on 100 test documents that represent more than 90 % of all reports. In particular, principal aspects as defined in a standardized external terminology were recognized with F1 = .989 (micro average) and F1 = .963 (macro average). As a result of keyword matching and restrained concept extraction, the system obtained high precision even on unstructured or exceptionally short documents, and on documents with uncommon layout.
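The micro and macro averages reported above follow the usual conventions: the micro average pools true positives, false positives, and false negatives across all concepts before computing one score, while the macro average computes F1 per concept and takes the unweighted mean. A minimal sketch of that distinction (the concept names and counts below are hypothetical illustrations, not the paper's data):

```python
# Illustrative sketch: micro- vs. macro-averaged F1 from per-concept
# true positives (tp), false positives (fp), and false negatives (fn).

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def micro_macro_f1(counts):
    """counts: dict mapping concept name -> (tp, fp, fn)."""
    # Micro average: pool all counts, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    # Macro average: compute F1 per concept, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro

# Hypothetical counts for two concepts: the frequent concept dominates the
# micro average, while the macro average weights both concepts equally.
counts = {"LVEF": (95, 1, 4), "aortic regurgitation": (8, 0, 2)}
micro, macro = micro_macro_f1(counts)
```

This is why the paper reports both figures: a gap between micro and macro F1 indicates that rarer concepts are recognized less reliably than frequent ones.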

Conclusions: The developed terminology and the proposed information extraction system make it possible to extract fine-grained information from German semi-structured transthoracic echocardiography reports with very high precision and high recall on the majority of documents at the University Hospital of Würzburg. Extracted results populate a clinical data warehouse which supports clinical research.

No MeSH data available.


Fig. 2: Overview of the terminology development and information extraction setting. Based on default resources (dictionaries, templates, etc.) and supported by automatically inferred concept proposals, domain experts iteratively refine the domain knowledge and the high-level extraction knowledge of the terminology for each clinical subdomain (top left); technical experts adapt preexisting segmentation and filtering rules to the needs of specific subdomains (bottom left). The terminology and the segmentation module are integrated into a generic ontology-driven information extraction method that remains the same across domains (mid right). It populates extracted attribute-value pairs into a clinical data warehouse (top). $: input documents are (pre)processed by a de-identification module in order to ensure patient privacy.


Mentions: As outlined in Fig. 2, the central elements of the intended terminology development and information extraction setting are the terminology of the clinical subdomain, a domain expert, a technical expert, a collection of clinical documents (training set), and algorithmic components for terminology learning (learning tools), refinement support (terminology editor), segmentation (rule scripts), and a generic ontology-driven information extraction algorithm. Mappings to external conceptualizations can be used to create standardized views on the data, as depicted in Fig. 3. Documents are de-identified in order to preserve patient privacy. Finally, deployed information extraction modules can be used to populate a clinical data warehouse, either directly in a clinical data warehouse environment like [19] or integrated as a text mining service into a cloud infrastructure like [20].
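The extraction step itself is not detailed in this excerpt, but the keyword-matching, terminology-driven approach the paper describes can be sketched roughly as follows. All attribute names, keyword variants, value patterns, and the report snippet below are hypothetical illustrations, not the authors' implementation:

```python
# Minimal sketch of terminology-driven attribute-value extraction:
# a toy terminology maps German keyword variants to standardized
# attributes; segmentation splits the report into statements, and a
# pair is emitted only when keyword and value co-occur in a segment.

import re

# Toy terminology: attribute -> (keyword variants, value pattern)
TERMINOLOGY = {
    "LVEF": (["LVEF", "Ejektionsfraktion"], r"(\d+)\s*%"),
    "LA-Diameter": (["linker Vorhof", "LA-Diameter"], r"(\d+)\s*mm"),
}

def extract(report: str):
    """Return attribute-value pairs found by keyword matching."""
    pairs = {}
    # Segmentation: treat each line or sentence as one statement.
    for segment in re.split(r"[\n.]", report):
        for attribute, (keywords, value_pattern) in TERMINOLOGY.items():
            if any(kw.lower() in segment.lower() for kw in keywords):
                match = re.search(value_pattern, segment)
                if match:
                    # Restrained extraction: only emit a pair when both
                    # the keyword and a plausible value are present, which
                    # keeps precision high on unusually formatted reports.
                    pairs[attribute] = match.group(1)
    return pairs

report = "LVEF 55 %. Linker Vorhof 42 mm."
pairs = extract(report)
```

Requiring both a known keyword and a matching value pattern in the same segment mirrors the precision-oriented behavior reported in the Results: segments that fit neither pattern simply yield no pair, rather than a wrong one.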

