Benchmarking ontologies: bigger or better?

Yao L, Divoli A, Mayzus I, Evans JA, Rzhetsky A - PLoS Comput. Biol. (2011)

Bottom Line: Our results also demonstrate key features of medical ontologies, English thesauri, and discourse from different domains. Moreover, dialects characteristic of distinct domains vary strikingly as many of the same words are used quite differently in medicine, news, and novels. As ontologies are intended to mirror the state of knowledge, our methods to tighten the fit between ontology and domain will increase their relevance for new areas of biomedical science and improve the accuracy and power of inferences computed across them.


Affiliation: Department of Biomedical Informatics, Columbia University, New York, New York, USA.

ABSTRACT
A scientific ontology is a formal representation of knowledge within a domain, typically including central concepts, their properties, and relations. With the rise of computers and high-throughput data collection, ontologies have become essential to data mining and sharing across communities in the biomedical sciences. Powerful approaches exist for testing the internal consistency of an ontology, but not for assessing the fidelity of its domain representation. We introduce a family of metrics that describe the breadth and depth with which an ontology represents its knowledge domain. We then test these metrics using (1) four of the most common medical ontologies with respect to a corpus of medical documents and (2) seven of the most popular English thesauri with respect to three corpora that sample language from medicine, news, and novels. Here we show that our approach captures the quality of ontological representation and guides efforts to narrow the breach between ontology and collective discourse within a domain. Our results also demonstrate key features of medical ontologies, English thesauri, and discourse from different domains. Medical ontologies have a small intersection, as do English thesauri. Moreover, dialects characteristic of distinct domains vary strikingly as many of the same words are used quite differently in medicine, news, and novels. As ontologies are intended to mirror the state of knowledge, our methods to tighten the fit between ontology and domain will increase their relevance for new areas of biomedical science and improve the accuracy and power of inferences computed across them.


Four examples of synonym substitution probabilities in three corpora in our study. Plots A–D correspond to the headwords futile (adjective), stretch (verb), headache (noun) and cat (noun) respectively. The horizontal position of each synonym represents the substitution probability on a logarithmic scale, as does the font size. The color of each synonym indicates the corpus in which the substitution is most probable: black – medicine, red – novels, and blue – news. The frequency of each headword in the three corpora is also listed using the same color codes.

Mentions: While thesauri typically aim to capture universal properties of language, corpora can be surprisingly dissimilar and sometimes disjoint in their use of words and synonym substitutions. Figures 3 and 4 visualize ten words whose synonym substitution probabilities are most unlike one another across the medicine, news and novels corpora. Some words carry a different semantic sense in each corpus (e.g., cat as feline versus CT scan versus Caterpillar construction equipment), while other words have very different distributions of common senses.
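The excerpt above does not say how substitution probabilities are estimated or how "most unlike" is measured. As a minimal sketch only, the following assumes that a synonym's substitution probability is its relative frequency among all occurrences of the headword's synonyms in a corpus, and that two corpora are compared with the Jensen-Shannon divergence. The function names, the synonym list, and the toy corpora are hypothetical and are not taken from the paper.

```python
from collections import Counter
import math


def synonym_distribution(tokens, synonyms):
    """Relative frequency of each synonym among all synonym hits in a corpus.

    Assumption: this stands in for a 'substitution probability'; the
    source does not specify the estimator.
    """
    counts = Counter(t for t in tokens if t in synonyms)
    total = sum(counts.values())
    if total == 0:
        return {s: 0.0 for s in synonyms}
    return {s: counts[s] / total for s in synonyms}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) of two distributions over the same keys.

    Returns 0 for identical distributions and 1 for disjoint ones.
    """
    def kl(a, m):
        # Terms with a[k] == 0 contribute nothing to the sum.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up toy corpora and hypothetical synonyms for the headword "cat".
synonyms = ["feline", "scan", "tractor"]
medicine = "scan scan scan feline".split()
novels = "feline feline feline feline".split()

p_med = synonym_distribution(medicine, synonyms)
p_nov = synonym_distribution(novels, synonyms)
divergence = jensen_shannon(p_med, p_nov)
```

Under these assumptions, ranking headwords by their average pairwise divergence across corpora would surface words such as cat, whose dominant sense differs by domain. Any other divergence between distributions could be substituted for the Jensen-Shannon divergence.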

