The Ineffectiveness of Within-Document Term Frequency in Text Classification.

Wilbur WJ, Kim W - Inf Retr Boston (2009)

Bottom Line: Examples are the naïve Bayes multinomial model (MM), the Dirichlet compound multinomial model (DCM) and the Exponential-Family Approximation of the DCM (EDCM), as well as support vector machines (SVM). We also present a new form of the naïve Bayes multivariate Bernoulli model (MBM) which is able to make use of local term frequency and show again that it offers no significant advantage over the plain MBM. We conclude that word burstiness is so strong that additional occurrences of a word essentially add no useful information to a classifier.


Affiliation: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, U.S.A.

ABSTRACT
For the purposes of classification it is common to represent a document as a bag of words. Such a representation consists of the individual terms making up the document together with the number of times each term appears in the document. All classification methods make use of the terms. It is also common to make use of the local term frequencies, at the price of some added complication in the model. Examples are the naïve Bayes multinomial model (MM), the Dirichlet compound multinomial model (DCM) and the Exponential-Family Approximation of the DCM (EDCM), as well as support vector machines (SVM). Although it is usually claimed that incorporating local word frequency in a document improves text classification performance, here we test whether such claims are true. In this paper we show experimentally that simplified forms of the MM, EDCM, and SVM models which ignore the frequency of each word in a document perform at about the same level as MM, DCM, EDCM and SVM models which incorporate local term frequency. We also present a new form of the naïve Bayes multivariate Bernoulli model (MBM) which is able to make use of local term frequency, and show again that it offers no significant advantage over the plain MBM. We conclude that word burstiness is so strong that additional occurrences of a word essentially add no useful information to a classifier.
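To make the contrast concrete, the following is a minimal sketch in Python (not from the paper) of the two document representations at issue: raw within-document term counts versus a "simplified" binarized form that records only whether a term occurs. The helper names bag_of_words and binarized are our own illustration.

    from collections import Counter

    def bag_of_words(text):
        # Term -> within-document frequency (the standard bag-of-words counts).
        return dict(Counter(text.lower().split()))

    def binarized(text):
        # Term -> 1 if the term occurs at all; local frequency is discarded.
        return {term: 1 for term in set(text.lower().split())}

    doc = "the cell divides and the cell grows"
    print(bag_of_words(doc))  # {'the': 2, 'cell': 2, 'divides': 1, 'and': 1, 'grows': 1}
    print(binarized(doc))     # every occurring term maps to 1

The paper's finding is that classifiers built on the second representation perform about as well as those built on the first.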



Fig. 1: A comparison of performance coming from the MM, SMM, DCM, EDCM, TF–IDF SVM, and IDF SVM models over the different binary and multiclass classification tasks we studied, except TREC_GEN

Mentions: The strongest trend one sees in examining the results of Fig. 1 is that the SMM results are almost identical to the DCM and EDCM results. The only exceptions are the WebKB data, where DCM and EDCM slightly outperform SMM, and the Industry Sector data, where SMM is ahead of DCM and EDCM. The fact that DCM and EDCM produce almost identical results is expected based on the results of Elkan (2006) showing that these two approaches produce almost the same probabilities. Here we also see that SMM and MM are relatively close in performance most of the time, with SMM appearing to have a small advantage. If we note that the data we report here are based on 27 individual classification problems (MeSH Terms 10, Reuters 10, MDR 3, REBASE 1, 20 Newsgroups 1, Industry Sector 1, WebKB 1) and tally which classifier performs best on each problem, we see (data not shown) that SMM wins 20 times, MM wins 6 times, and there is 1 tie. By the sign test this is statistically significant in favor of SMM. The observation that local frequency does not benefit the MM model was also made by Schneider (2005). Also in Fig. 1, one sees that TF–IDF SVM and IDF SVM appear to be very close in performance, with TF–IDF SVM having a small advantage. Again, if we consider the 27 individual classification tasks and tally the winners, we see that TF–IDF SVM wins 12 times, IDF SVM wins 6 times, and there are 9 ties. This is not statistically significant by the sign test.
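The sign-test tallies above are straightforward to reproduce. The following is a minimal Python sketch (not from the paper): ties are dropped, and under the null hypothesis each remaining task is a fair coin flip between the two classifiers, so the two-sided p-value comes from the binomial tail. The function name sign_test_pvalue is our own.

    from math import comb

    def sign_test_pvalue(wins_a, wins_b):
        # Two-sided sign test for a win/loss tally (ties already excluded).
        n = wins_a + wins_b
        k = max(wins_a, wins_b)
        # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    # SMM vs. MM: 20 wins to 6 (1 tie dropped) -> p ~ 0.009, significant.
    print(sign_test_pvalue(20, 6))
    # TF-IDF SVM vs. IDF SVM: 12 to 6 (9 ties dropped) -> p ~ 0.24, not significant.
    print(sign_test_pvalue(12, 6))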

