Learning Semantic Tags from Big Data for Clinical Text Representation.

Li Y, Liu H - AMIA Jt Summits Transl Sci Proc (2015)

Bottom Line: To address this issue, we propose a novel method for word and n-gram representation at the semantic level. We first represent each word by its distances from a set of reference features, calculated by a reference distance estimator (RDE) learned from labeled and unlabeled data, and then generate new features using the simple techniques of discretization, random sampling and merging. We show that the new features significantly outperform classical bag-of-words and n-grams in the task of heart disease risk factor extraction in the i2b2 2014 challenge.

View Article: PubMed Central - PubMed

Affiliation: Mayo Clinic, Rochester, MN.

ABSTRACT
In clinical text mining, one of the biggest challenges is representing medical terminologies and n-gram terms in sparse medical reports using either supervised or unsupervised methods. To address this issue, we propose a novel method for word and n-gram representation at the semantic level. We first represent each word by its distances from a set of reference features, calculated by a reference distance estimator (RDE) learned from labeled and unlabeled data, and then generate new features using the simple techniques of discretization, random sampling and merging. The new features are a set of binary rules that can be interpreted as semantic tags derived from words and n-grams. We show that the new features significantly outperform classical bag-of-words and n-grams in the task of heart disease risk factor extraction in the i2b2 2014 challenge. It is promising that semantic tags can replace the original text entirely with even better prediction performance, and that they can yield new rules beyond the lexical level.

No MeSH data available.



An example of the RDSM algorithm for semantic tag generation. The reference features are for the task of identifying sentences related to medications in i2b2 2014. Features in red are the final features for classification.

Mentions: The n-gram generation algorithm simply merges the semantic tags of consecutive words by taking their product, and then selects a subset of the merged word semantic tags by random sampling. The referring step represents each word by its co-occurrence with reference features in big data, which incorporates rich information beyond the small annotated data set [5][6][7]. Discretization generates interpretable binary rules from the real-valued distances. The merging step is necessary because combinations of discretized distances from multiple reference features tend to provide more accurate information than any individual reference feature. However, enumerating all possible combinations of reference features is an NP-hard problem, so we used random sampling to select subsets within the available computational power. In the experiment section we show that the reference feature size R, discretization size D, sample size S and merging size M are important parameters for achieving good performance. In our experiments, R was set to 200, and the discretization intervals were divided at {0, 0.005}. For word representation we combined the two feature sets (S=40, M=8) and (S=40, M=10). For bigrams, we used the model (S=60, M=4) for both the first and second words. For trigrams, (S=20, M=3) was used for all three words. Figure 1 shows an example of the RDSM method for word and bigram representation.
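The pipeline described above (discretize the reference distances, randomly sample subsets, merge each subset into a binary-rule tag, and form n-gram tags by the product of consecutive words' tags) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the three-bin discretization implied by the thresholds {0, 0.005}, and conjunction-as-product merging are assumptions based on the description.

```python
import random

def discretize(distances, thresholds=(0.0, 0.005)):
    """Map each real-valued reference distance to a discrete bin.

    With thresholds {0, 0.005} each distance falls into one of three bins:
    0 if d <= 0, 1 if 0 < d <= 0.005, 2 if d > 0.005.
    """
    return [sum(d > t for t in thresholds) for d in distances]

def sample_merged_tags(bins, sample_size, merge_size, seed=0):
    """Randomly sample `sample_size` subsets of `merge_size` reference
    features; each subset's discretized values merge into one binary-rule
    tag ("feature i is in bin b AND feature j is in bin c AND ...")."""
    rng = random.Random(seed)
    tags = set()
    for _ in range(sample_size):
        subset = sorted(rng.sample(range(len(bins)), merge_size))
        tags.add(tuple((i, bins[i]) for i in subset))
    return tags

def bigram_tags(tags_w1, tags_w2):
    """Merge the tags of two consecutive words by their product
    (Cartesian conjunction), as in the n-gram generation step."""
    return {(t1, t2) for t1 in tags_w1 for t2 in tags_w2}
```

In the paper's setting, `bins` would hold R = 200 entries per word (one per reference feature), and a further random sample would be drawn from the bigram products; that final subsampling is omitted here for brevity.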

