Discovering relations between indirectly connected biomedical concepts.

Weissenborn D, Schroeder M, Tsatsaronis G - J Biomed Semantics (2015)

Bottom Line: Towards this direction, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. Results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8, a result which is a great improvement compared to the random classification, and which shows that good predictions can be prioritized by following the suggested approach. Furthermore, this work demonstrates that the constructed graph allows for the easy integration of heterogeneous information and discovery of indirect connections between biomedical concepts.


Affiliation: DFKI Projektbüro Berlin, Alt-Moabit 91c, Berlin, 10559 Germany ; Biotechnology Center, Technische Universität Dresden, Tatzberg 47/49, Dresden, 01307 Germany.

ABSTRACT

Background: The complexity and scale of the knowledge in the biomedical domain have motivated research on mining heterogeneous data from both structured and unstructured knowledge bases. To this end, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. This work addresses the problem by using indirect knowledge connecting two concepts in a knowledge graph to discover hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (textual) data. In this graph, path patterns, i.e., sequences of relations, that potentially characterize a biomedical relation are mined using distant supervision.
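As a rough illustration of what a path pattern is, the following minimal sketch enumerates the label sequences along simple paths between two concepts in a small directed graph. The function names, the toy triples, and the `max_len` cutoff are assumptions made for this example and are not taken from the paper; the actual mining and learning procedure may differ.

```python
from collections import defaultdict


def build_graph(triples):
    """Index (subject, label, object) triples as an adjacency list."""
    graph = defaultdict(list)
    for subj, label, obj in triples:
        graph[subj].append((label, obj))
    return graph


def path_patterns(graph, source, target, max_len=2):
    """Return the set of label sequences on simple paths source -> target.

    max_len is a hypothetical cutoff on the number of edges per path.
    """
    patterns = set()

    def walk(node, labels, visited):
        if node == target and labels:
            patterns.add(tuple(labels))
            return
        if len(labels) == max_len:
            return
        for label, nxt in graph.get(node, []):
            if nxt not in visited:
                walk(nxt, labels + [label], visited | {nxt})

    walk(source, [], {source})
    return patterns


# Toy triples, invented for illustration only.
toy_triples = [
    ("drug", "inhibits", "protein"),
    ("protein", "associated_with", "disease"),
    ("drug", "may_treat", "disease"),
]
toy_graph = build_graph(toy_triples)
print(path_patterns(toy_graph, "drug", "disease"))
```

In this toy graph the pattern `("inhibits", "associated_with")` is the indirect connection between the two concepts, while `("may_treat",)` is the direct edge that distant supervision would use as the label.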

Results: It is possible to identify characteristic path patterns of biomedical relations from this representation using machine learning. For experimental evaluation, two frequent biomedical relations, namely "has target" and "may treat", are chosen. Results suggest that relation discovery using indirect knowledge is possible, with an AUC of up to 0.8, a clear improvement over random classification, which shows that good predictions can be prioritized by following the suggested approach.
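To make the AUC figure concrete: the area under the ROC curve equals the probability that a randomly chosen positive pair is scored higher than a randomly chosen negative pair, so a random classifier is expected to score about 0.5. The sketch below computes it directly from that definition; the function name and the toy labels and scores are assumptions for illustration, not the paper's evaluation code or data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented labels (1 = relation holds) and classifier scores.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based formulation would be preferable at scale.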

Conclusions: Analysis of the results indicates that the models can successfully learn expressive path patterns for the examined relations. Furthermore, this work demonstrates that the constructed graph allows for the easy integration of heterogeneous information and discovery of indirect connections between biomedical concepts.




Figure 5: Distribution of vertex degree and edge labels in the unpruned, unstructured part of the knowledge graph, in log-scale. Panel (a) shows the distribution of vertex degrees; panel (b) shows the distribution of edge labels.

Mentions: For the already annotated MEDLINE corpus of 2012, triples were obtained by extracting dependency paths between two annotated concepts in each sentence. ClearNLP [28] was used for dependency parsing because it is very fast and provides existing models trained on medical text. The resulting set of triples was stored in a Titan [29] graph database. During extraction, only dependency paths of length up to 6 were considered. The resulting graph contains 278,061 vertices (i.e., concepts) with an average degree of 600 in- and outgoing edges, resulting in 83 million edges (i.e., extracted triples) with around 16 million different labels (i.e., dependency paths), so each label occurs on average 5.2 times. In total, 29.7 million pairs of vertices are connected to each other. Both vertex degrees and edge label occurrences follow a very heavy-tailed distribution (see Figure 5), i.e., most vertices and edge labels occur only rarely. Because there is so little data for those concepts and dependency paths, keeping them adds no value for statistical learning methods. Therefore, after manual inspection of the occurrence statistics (see Figure 5), the graph was pruned to keep only vertices with a total concept occurrence of at least 40 and edges with a total label occurrence of at least 50. The pruned, unstructured part of the knowledge graph contains 84,635 vertices and around 39 million edges with 104,953 different labels between around 9 million connected concept pairs. Another 2.8 million pairs for relations stemming from UMLS and DrugBank were added to the graph as edges, but no new concepts were introduced, because the graph would have grown too large if all UMLS concepts had been included as vertices.
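The pruning step can be sketched as a frequency filter over the extracted triples. This is a minimal sketch under stated assumptions: "concept occurrence" is taken here to mean the number of triples in which a concept appears as subject or object, counts are computed once on the unpruned triples, and the function and parameter names are hypothetical. The paper's exact counting may differ.

```python
from collections import Counter


def prune_triples(triples, min_concept_count=40, min_label_count=50):
    """Keep triples whose concepts and label are all frequent enough.

    Defaults mirror the thresholds quoted in the text (40 and 50);
    the counting scheme itself is an assumption of this sketch.
    """
    concept_counts = Counter()
    label_counts = Counter()
    for subj, label, obj in triples:
        concept_counts[subj] += 1
        concept_counts[obj] += 1
        label_counts[label] += 1
    return [
        (subj, label, obj)
        for subj, label, obj in triples
        if concept_counts[subj] >= min_concept_count
        and concept_counts[obj] >= min_concept_count
        and label_counts[label] >= min_label_count
    ]


# Toy data with small thresholds, invented for illustration.
toy = [("a", "treats", "b")] * 3 + [("a", "inhibits", "c")]
print(prune_triples(toy, min_concept_count=2, min_label_count=2))
```

The reported statistics are also internally consistent: 83 million edges each contribute to two vertex degrees, so 2 × 83,000,000 / 278,061 ≈ 597, matching the stated average degree of about 600, and 83 million edges over 16 million labels gives roughly 5.2 occurrences per label.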

