Limits...
Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts.

Duan W, Song M, Yates A - BMC Bioinformatics (2009)

Bottom Line: We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word.On a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78%, outperforming a state-of-the-art unsupervised system by 2% and a baseline technique by 23%.On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6% and comes within 5% of the accuracy of a supervised system.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA. weisi.duan@temple.edu

ABSTRACT

Background: We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort.

Methods: We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the principle of finding a maximum margin between clusters, determining a split of the data that maximizes the minimum distance between pairs of data points belonging to two different clusters.

Results: On a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78%, outperforming a state-of-the-art unsupervised system by 2% and a baseline technique by 23%. On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6% and comes within 5% of the accuracy of a supervised system.

Conclusion: Our system is a novel, state-of-the-art technique for efficiently finding word sense clusters, and does not require training data or human effort for each new word to be disambiguated.

Show MeSH

Related in: MedlinePlus

A minimum spanning tree with slot annotations. Our technique for handling outliers performs a breadth-first traversal of the minimum spanning tree in order to count the number of nodes in each subtrees.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2665052&req=5

Figure 2: A minimum spanning tree with slot annotations. Our technique for handling outliers performs a breadth-first traversal of the minimum spanning tree in order to count the number of nodes in each subtrees.

Mentions: The first step in the algorithm is to calculate the size of each subtree in the MST. We first define a set of slots to store the number of nodes in a subtree. Each slot stores the number of nodes in a subtree rooted at a node x and connected to the rest of the MST by edge (x, y). Figure 2 shows an example MST with attached slots. In this figure, for example, slot [(V1, V0), V1] is the size of the subtree that is rooted at V1 and is connected to the rest of the tree along edge (V1, V0). It includes nodes V7 through V11. Notice that this subtree is different from the subtree rooted at V1 and connected to the rest of the tree by edge (V1, V10) – that subtree includes all nodes except V9, V10, and V11.


Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts.

Duan W, Song M, Yates A - BMC Bioinformatics (2009)

A minimum spanning tree with slot annotations. Our technique for handling outliers performs a breadth-first traversal of the minimum spanning tree in order to count the number of nodes in each subtrees.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2665052&req=5

Figure 2: A minimum spanning tree with slot annotations. Our technique for handling outliers performs a breadth-first traversal of the minimum spanning tree in order to count the number of nodes in each subtrees.
Mentions: The first step in the algorithm is to calculate the size of each subtree in the MST. We first define a set of slots to store the number of nodes in a subtree. Each slot stores the number of nodes in a subtree rooted at a node x and connected to the rest of the MST by edge (x, y). Figure 2 shows an example MST with attached slots. In this figure, for example, slot [(V1, V0), V1] is the size of the subtree that is rooted at V1 and is connected to the rest of the tree along edge (V1, V0). It includes nodes V7 through V11. Notice that this subtree is different from the subtree rooted at V1 and connected to the rest of the tree by edge (V1, V10) – that subtree includes all nodes except V9, V10, and V11.

Bottom Line: We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word.On a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78%, outperforming a state-of-the-art unsupervised system by 2% and a baseline technique by 23%.On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6% and comes within 5% of the accuracy of a supervised system.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA. weisi.duan@temple.edu

ABSTRACT

Background: We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort.

Methods: We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the principle of finding a maximum margin between clusters, determining a split of the data that maximizes the minimum distance between pairs of data points belonging to two different clusters.

Results: On a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78%, outperforming a state-of-the-art unsupervised system by 2% and a baseline technique by 23%. On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6% and comes within 5% of the accuracy of a supervised system.

Conclusion: Our system is a novel, state-of-the-art technique for efficiently finding word sense clusters, and does not require training data or human effort for each new word to be disambiguated.

Show MeSH
Related in: MedlinePlus