Synonym set extraction from the biomedical literature by lexical pattern discovery.

McCrae J, Collier N - BMC Bioinformatics (2008)

Bottom Line: The assumption of extant resources such as parsers is also a limiting factor for many languages, so it is desirable to find patterns that do not use syntactical analysis. We conclude that automatic methods can play a practical role in developing new thesauri or expanding on existing ones, and this can be done with only a small amount of training data and no need for resources such as parsers. We also conclude that the accuracy can be improved by grouping into synonym sets.


Affiliation: National Institute of Informatics, Hitotsubashi 2-1-2, Chiyoda-ku, Tokyo, 101-8430, Japan. jmccrae@nii.ac.jp

ABSTRACT

Background: Although there are a large number of thesauri for the biomedical domain, many of them lack coverage of terms and their variant forms. Automatic thesaurus construction based on patterns was first suggested by Hearst [1], but it is still not clear how to automatically construct such patterns for different semantic relations and domains. In particular, it is not certain which patterns are useful for capturing synonymy. The assumption of extant resources such as parsers is also a limiting factor for many languages, so it is desirable to find patterns that do not use syntactical analysis. Finally, to give a more consistent and applicable result, it is desirable to use these patterns to form synonym sets in a sound way.

Results: We present a method that automatically generates regular expression patterns by expanding seed patterns in a heuristic search and then develops a feature vector based on the occurrence of term pairs in each developed pattern. This allows for a binary classification of term pairs as synonymous or non-synonymous. We then model this result as a probability graph to find synonym sets, which is equivalent to the well-studied problem of finding an optimal set cover. Our method achieved 73.2% precision and 29.7% recall, outperforming hand-made resources such as MeSH and Wikipedia.

Conclusion: We conclude that automatic methods can play a practical role in developing new thesauri or expanding on existing ones, and this can be done with only a small amount of training data and no need for resources such as parsers. We also conclude that the accuracy can be improved by grouping into synonym sets.
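
The feature vector described in the Results can be illustrated with a minimal sketch. It assumes three hand-written patterns with TERM1 and TERM2 placeholders and two invented example sentences; the patterns the authors' heuristic search actually discovers are not given here, so every name below (PATTERNS, feature_vector, the sentences) is hypothetical.

```python
import re

# Hypothetical lexical patterns. TERM1 and TERM2 are placeholders that are
# filled in with the (escaped) candidate terms before matching.
PATTERNS = [
    r"TERM1 \(also known as TERM2\)",
    r"TERM1, also called TERM2",
    r"TERM1 or TERM2",
]


def feature_vector(term1, term2, sentences, patterns=PATTERNS):
    """Return, for each pattern, the number of sentences in which the
    pattern matches with term1 and term2 substituted in."""
    vector = []
    for pattern in patterns:
        filled = pattern.replace("TERM1", re.escape(term1))
        filled = filled.replace("TERM2", re.escape(term2))
        regex = re.compile(filled, re.IGNORECASE)
        vector.append(sum(1 for s in sentences if regex.search(s)))
    return vector


# Invented sentences, for illustration only.
sentences = [
    "Dengue fever (also known as breakbone fever) is a viral infection.",
    "Patients with dengue fever or breakbone fever were included.",
]
pair_vector = feature_vector("dengue fever", "breakbone fever", sentences)
```

A vector of this kind, one count per pattern, is what a binary classifier would take as input to label a term pair as synonymous or non-synonymous. No parser is involved, only string matching.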


Figure 2: Random Graph to Synsets. Illustration of the conversion from a random graph to a set of synsets. The top graph shows the random output from the classifier, which includes a spurious link from "dandy fever" to "yellow fever". Grouping the terms into synsets corrects this graph by removing the spurious link and adding two links that the classifier did not find.

Mentions: The results we gained from the statistical classification procedure gave only the probability of a particular term pair being synonymous. However, we would expect every pair of terms in a synset to be synonymous, and these binary classification results do not guarantee that such a transitivity relation exists. As such, we shall assume that every pair of terms in a synset is synonymous and that no pair of terms in different synsets is synonymous (although this is technically incorrect, as some words may be polysemous). It follows that synsets are complete graphs, so we can consider our goal to be that of finding the set of complete sub-graphs closest to our random graph. As an example, consider Figure 2, which shows a graph representation of the classifier output on top, where the nodes represent terms and two nodes are connected if the classifier predicts that the terms are synonymous. The top graph should give the two synsets shown in the bottom graph.
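
The goal of finding the closest set of complete sub-graphs can be made concrete with a small sketch. It is not the authors' procedure: it ignores the link probabilities, treats every link as equally reliable, and simply tries every partition of the terms, which is only feasible for a handful of terms. The graph below is invented for illustration; only "dandy fever" and "yellow fever" and the spurious link between them come from the Figure 2 caption, and all function names are hypothetical.

```python
from itertools import combinations


def partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ...or into a new block of its own.
        yield [[first]] + partial


def edit_cost(partition, edges):
    """Count the links that must be added or removed so that the graph
    becomes exactly one complete sub-graph per block."""
    implied = {frozenset(p) for block in partition
               for p in combinations(block, 2)}
    return len(implied ^ edges)


def closest_synsets(terms, links):
    """Exhaustively search for the partition with the fewest link edits."""
    edges = {frozenset(link) for link in links}
    return min(partitions(list(terms)), key=lambda p: edit_cost(p, edges))


# Invented classifier output: two groups, each missing one link,
# joined by one spurious link.
dengue = ["dengue", "dengue fever", "dandy fever", "breakbone fever"]
yellow = ["yellow fever", "yellow jack", "black vomit", "American plague"]
missed = {frozenset(("dandy fever", "breakbone fever")),
          frozenset(("black vomit", "American plague"))}
links = [p for group in (dengue, yellow)
         for p in combinations(group, 2) if frozenset(p) not in missed]
links.append(("dandy fever", "yellow fever"))  # spurious link

synsets = closest_synsets(dengue + yellow, links)
```

On this input the cheapest partition separates the two groups, which corresponds to removing the spurious link and adding the two missed ones. A practical method would weight each edit by the classifier's probability and replace the exhaustive search, since the number of partitions grows very quickly with the number of terms.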

