      A New Fine-Grained Weighting Method in Multi-Label Text Classification

                                                         Chang-Hwan Lee
                                          Department of Information and Communications
                                                        Dongguk University
                                                           Seoul, Korea
                                                         chlee@dgu.ac.kr



                           Abstract

Multi-label classification is one of the important research areas in data mining. In this paper, a new multi-label classification method using multinomial naive Bayes is proposed. We use a new fine-grained weighting method for calculating the weights of feature values in multinomial naive Bayes. Our experiments show that the value weighting method can improve the performance of multinomial naive Bayes learning.

                         Introduction

Correctly classifying documents into particular categories is still a challenging task because of the large number of features in the dataset. In particular, multi-label text classification problems have received considerable attention, since each document may be associated with multiple class labels. For example, documents may belong to multiple genres, such as entertainment and sports, or religion and culture. In this paper we explore the multi-label text classification problem, and there are two important issues in it.

The first issue is to improve the performance of multi-label text classification tasks. In multi-label text classification, the multinomial naive Bayes (MNB) algorithm has been most commonly used. The MNB classifier is an efficient and reliable text classifier, and many researchers regard it as the standard naive Bayes text classifier. However, its performance is not as good as that of some other learning methods such as support vector machines and boosting. In this paper, we investigate the reasons behind the poor performance of MNB. Then, to enhance the performance of the MNB method, we propose a new paradigm of weighting, called the value weighting method, for MNB learning.

The second issue in multi-label classification is to effectively make use of the dependency relationships between class labels. One common approach to handling these class dependencies is to treat each class as a separate binary classification problem; this is called the binary relevance (BR) method (Read et al. 2009). When building the classifiers, BR does not directly model correlations which exist between labels in the training data. However, in many real-world tasks, labels are highly interdependent. Therefore, the key to successful multi-label learning is how to effectively exploit dependencies between different labels.
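To make the binary relevance decomposition concrete, the following minimal sketch trains one independent binary classifier per label and stacks the per-label predictions back into an indicator matrix. The class name, the `base_factory` parameter, and the use of NumPy indicator matrices are illustrative assumptions rather than part of the paper's own description.

```python
import numpy as np

class BinaryRelevance:
    """Binary relevance (BR): one independent binary classifier per label."""

    def __init__(self, base_factory):
        # base_factory is any callable returning an object with fit(X, y) and predict(X)
        self.base_factory = base_factory
        self.models = []

    def fit(self, X, Y):
        # Y is an (n_docs, n_labels) binary indicator matrix; label correlations are ignored
        self.models = []
        for j in range(Y.shape[1]):
            clf = self.base_factory()
            clf.fit(X, Y[:, j])          # each label becomes its own binary task
            self.models.append(clf)
        return self

    def predict(self, X):
        # stack the per-label binary predictions back into an indicator matrix
        return np.column_stack([m.predict(X) for m in self.models])
```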
This paper presents a new weighting method for multinomial naive Bayes learning. In current multinomial naive Bayes, each word is associated with its frequency, and the frequency can be regarded as the importance of the word; therefore, each word (feature) is given a weight (its frequency). Furthermore, we compare the performance of the proposed model with that of other state-of-the-art multi-label classifiers.

The rest of this paper is structured as follows. The next section reviews related work on naive Bayesian document classifiers and the multi-label problem. The following sections discuss the multinomial naive Bayes algorithm and the new value weighting method, present the experimental results of the proposed methods, and finally summarize the contributions made in this paper.

                         Related Work

Text classifiers based on naive Bayes have been studied extensively in the literature. In particular, there has been much research on using the MNB model in text classification.

McCallum and Nigam (McCallum and Nigam 1998) compare the classification performance of the multi-variate Bernoulli model and the multinomial model. In the multi-variate Bernoulli model, a document is considered as a binary feature vector that expresses whether each word is present or absent. They show that the multi-variate Bernoulli model performs well with small vocabulary sizes, but the multinomial model usually performs even better at larger vocabulary sizes.

Rennie et al. (Rennie et al. 2003) introduced Complement Naive Bayes (CNB) for skewed training data. CNB estimates parameters using data from all classes except the currently estimated class. Furthermore, they demonstrated that MNB can achieve better accuracy by adopting a TFIDF representation, traditionally used in information retrieval.

Schneider (Schneider 2005) addressed the problems of the naive Bayesian text classifier and showed that they can be solved by some simple corrections. He effectively removed duplicate words in a document to account for the burstiness phenomenon in text, and he proposed to use uniform priors to avoid problems with skewed class distributions when the documents are very short.

In recent years, there has been much study of the multi-label classification problem, motivated by emerging applications. As mentioned above, the key to successful multi-label learning is how to effectively exploit dependencies between different labels.

Ghamrawi and McCallum (Ghamrawi and McCallum 2005) proposed two undirected graphical models that directly parameterize label dependencies in multi-label classification. The first is the Collective Multi-Label classifier (CML), which jointly learns parameters for each pair of labels. The second is the Collective Multi-Label with Features classifier (CMLF), which learns parameters for feature-label-label triples.

McCallum (McCallum 1999) defines a probabilistic generative model according to which each label generates different words. Based on this model, a multi-label document is produced by a mixture of the word distributions of its labels. The mixture models are trained by EM, selecting the most probable set of labels from the power set of possible classes.

Read et al. (Read et al. 2009) introduced the classifier chain. Classifier chains are binary relevance-based methods consisting of binary classifiers that are linked in a chain. They further propose an ensemble of classifier chains that combines several chains built with different label orderings.

  Improving the Performance of Multinomial Naive Bayes in Document Classification Tasks

In this paper, we assume that documents are generated according to a multinomial event model. Thus a document is represented as a vector x = (f_1, ..., f_{|V|}) of word counts, where |V| is the vocabulary size and each f_t indicates how often the t-th word W_t occurs in x. Given model parameters p(W_t|c) and p(c), and assuming independence of the words, the most likely class value c for a document x is computed as

    $$ c^*_{MNB}(x) = \arg\max_c \; p(c) \prod_{t=1}^{|V|} p(W_t \mid c)^{f_t} \quad (1) $$

where p(W_t|c) is the conditional probability that word W_t occurs in a document x given the class value c, and p(c) is the prior probability that a document with class label c occurs in the document collection. The values of p(W_t|c) and p(c) are estimated from training documents using maximum likelihood estimation with a Laplacean prior (Schneider 2004):

    $$ p(W_t \mid c) = \frac{1 + \sum_{x_i \in c} f_{it}}{|V| + \sum_{t=1}^{|V|} \sum_{x_i \in c} f_{it}}, \qquad p(c) = \frac{|c|}{N} \quad (2) $$

where f_{it} represents the frequency of the t-th term in the i-th document, |c| represents the number of documents with class value c, and N is the number of training documents.
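As a concrete illustration of Equations (1) and (2), the sketch below estimates the smoothed word probabilities and class priors from a term-count matrix and scores a new document in log space. The function names, the array layout, and the NumPy usage are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def train_mnb(X, y, n_classes):
    """Estimate log p(W_t|c) with a Laplacean prior and log p(c), as in Equation (2).

    X: (n_docs, |V|) matrix of term counts f_it; y: class index of each document.
    """
    n_docs, V = X.shape
    log_prior = np.zeros(n_classes)
    log_cond = np.zeros((n_classes, V))
    for c in range(n_classes):
        Xc = X[y == c]
        counts = Xc.sum(axis=0)                        # sum of f_it over the documents of class c
        log_cond[c] = np.log((1.0 + counts) / (V + counts.sum()))
        log_prior[c] = np.log(len(Xc) / n_docs)        # p(c) = |c| / N
    return log_prior, log_cond

def predict_mnb(x, log_prior, log_cond):
    """Equation (1) in log space: argmax_c  log p(c) + sum_t f_t * log p(W_t|c)."""
    scores = log_prior + log_cond @ x                  # the count f_t multiplies the log-probability
    return int(np.argmax(scores))
```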
MNB provides reasonable prediction performance and is easy to implement, but it has some unrealistic assumptions that affect the overall performance of classification. The first is that all features are equally important in MNB learning. Because this assumption is rarely true in real-world applications, the predictions estimated by MNB are sometimes poor. The second is that the importance of each word grows proportionally with its frequency: the more a word appears in the document, the more important it becomes. However, when a certain word appears for the first time, it is a very important word with respect to the category of the document. For example, when the word 'computer' is encountered for the first time, it gives us a lot of information that this document is about technology; but if the same word 'computer' has already appeared 100 times, the subsequent occurrence of 'computer' has virtually no importance.

In current MNB, we cannot discriminate the importance of the individual frequency values of a word; the importance of a word is simply assumed to increase linearly with its frequency. The performance of MNB can be improved by mitigating these assumptions. The following section describes these issues in detail.

     A Fine-Grained Weighting Method in Text Classification

In classification learning, a classifier assigns a class label to a new instance. Naive Bayesian learning uses Bayes' theorem to calculate the most likely class label of the new instance. A new instance d is classified to the class with the maximum posterior probability. In naive Bayesian learning, since all features are considered to be independent given the class value, the classification of d is defined as follows:

    $$ V_{nb}(d) = \arg\max_c \; P(c) \prod_{a_{ij} \in d} P(a_{ij} \mid c) $$

where a_{ij} represents the j-th value of the i-th feature.

Since the assumption that all features are equally important hardly holds true in real-world applications, there have been some attempts to relax this assumption in machine learning methods. Feature weighting in the naive Bayesian approach is one way of easing this assumption. Feature weighting assigns a continuous weight to each feature, and is thus a more flexible method than feature selection. Naive Bayesian classification with feature weighting is represented as follows:

    $$ c^*_{NB\text{-}FW}(x) = \arg\max_c \; p(c) \prod_i p(a_i \mid c)^{w_i} \quad (3) $$

In this formula, unlike the ordinary naive Bayesian approach, each feature i has its own weight w_i. The weight w_i can be any real number, representing the significance of feature i. The feature-weighted naive Bayesian method involves a much larger search space than feature selection, and is generally known to improve the performance of naive Bayesian learning (Lee et al. 2011).

MNB classification is a special form of feature-weighted naive Bayesian learning. MNB classification with feature weighting is represented as follows:

    $$ c^*_{MNB\text{-}FW}(x) = \arg\max_c \; p(c) \prod_{t=1}^{|V|} p(W_t \mid c)^{w_t} \quad (4) $$

The basic idea of feature weighting is that the more important a feature is, the higher its weight is. In feature-weighted naive Bayes, each word W_t has its own weight w_t.
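A minimal sketch of the feature-weighted scoring rule in Equation (4), reusing the log-probabilities estimated above; the per-word weights are simply passed in as an array, and where they come from (for example, the information-gain scheme discussed next) is left open. The function name and array shapes are illustrative assumptions.

```python
import numpy as np

def predict_mnb_fw(log_prior, log_cond, w):
    """Equation (4): argmax_c  log p(c) + sum_t w_t * log p(W_t|c).

    log_cond: (n_classes, |V|) array of log p(W_t|c);
    w: per-word weights that replace the raw counts f_t as the exponent.
    """
    scores = log_prior + log_cond @ w
    return int(np.argmax(scores))
```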
In traditional MNB (Equation 1), the frequency f_t of each word W_t plays the role of the significance of the word. Therefore, the basic assumption in MNB is that when a certain word appears frequently in a document, the word grows important in proportion to its occurrence: each word is given a weight which is its frequency in the document.

Kim et al. (Kim et al. 2006) proposed a feature weighting scheme using information gain. The information gain of a word given the class, which becomes the weight of the word, is calculated as follows:

    $$ w_t = f_t \cdot \{H(C) - H(C \mid W_t)\} = f_t \cdot \sum_c \sum_{W_i \in \{W_t, \bar{W}_t\}} p(c, W_i) \log \frac{p(c, W_i)}{p(c)\,p(W_i)} \quad (5) $$

where p(c, W_i) is the number of documents containing the word W_i with class label c divided by the total number of documents, and p(W_i) is the number of documents containing the word W_i divided by the total number of documents.
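The sketch below computes the document-independent part of Equation (5), the mutual information H(C) − H(C|W_t) between the class and the presence or absence of each word, using document-level probabilities as described above; multiplying by the document's count f_t then gives Kim et al.'s weight. The function name and the guard against empty events are illustrative assumptions.

```python
import numpy as np

def information_gain(X, y, n_classes):
    """H(C) - H(C|W_t) for every word t, from document-level presence statistics.

    X: (n_docs, |V|) term counts; y: class index per document.
    The full weight of Equation (5) for a given document is w_t = f_t * ig[t].
    """
    n_docs, V = X.shape
    present = X > 0
    p_class = np.array([(y == c).mean() for c in range(n_classes)])
    ig = np.zeros(V)
    for t in range(V):
        for event in (present[:, t], ~present[:, t]):       # W_t present / absent
            p_w = event.mean()
            if p_w == 0.0:
                continue
            for c in range(n_classes):
                p_cw = (event & (y == c)).mean()             # joint p(c, W_i)
                if p_cw > 0.0:
                    ig[t] += p_cw * np.log(p_cw / (p_class[c] * p_w))
    return ig
```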
In this paper, we consider a new method in which weights are assigned in a more fine-grained way: we treat each occurrence of a word differently in terms of its importance. When a certain word appears for the first time in a document, it is very important with respect to the classification of the document. For example, when the word 'diabetes' appears for the first time in a document, it provides a significant implication that this document is about health. However, when the same word 'diabetes' has already appeared many times, say 100, the next occurrence has virtually no importance; the probability of the second occurrence is much higher than that of the first occurrence. Because MNB treats the significance of each occurrence of a word equally, the multinomial model does not account for this phenomenon. In this paper we investigate whether assigning weights to each word count can improve the performance of classification.

In order to implement the fine-grained weighting, we first discretize the term frequencies of each word. The discretization task converts a continuous term frequency f_t into a categorical word frequency bin a_{tj}, which represents the j-th discretized value of the term frequency. In other words, instead of assigning a weight to each word feature (as in MNB), we assign a weight to each word frequency bin. After that, the weights of these word frequency bins are automatically calculated using training data.

We call this method the value weighting method. Unlike current feature weighting methods, the value weighting method calculates a weight for each word frequency bin. The value weighting method in MNB is defined as follows:

    $$ c^*_{MNB\text{-}VW}(x) = \arg\max_c \; p(c) \prod_{a_{tj} \in x} p(a_{tj} \mid c)^{w_{tj}} \quad (6) $$

where w_{tj} represents the weight of word frequency bin a_{tj}. Note that each word frequency bin is assigned its own weight.

Calculating Value Weights. This section describes the value weighting method for calculating the weights of frequency bins. In this paper, we use an information-theoretic method for assigning weights to each word frequency bin. The basic assumption of the value weighting method is that when a certain word frequency bin is observed, it provides a certain amount of information about the target class. The more information a word frequency bin provides to the target class, the more important the bin becomes. The critical part is then how to define or select a proper measure which can correctly quantify this amount of information.

In this paper, we employ the Hellinger measure (Beran 1977) in order to calculate the difference between the a priori distribution and the a posteriori distribution of the target class. The Hellinger measure (denoted HM) for a word frequency bin a_{tj} is defined as

    $$ HM(C \mid a_{tj}) = \left( \sum_c \left( \sqrt{p(c \mid a_{tj})} - \sqrt{p(c)} \right)^2 \right)^{1/2} \quad (7) $$

HM(C|a_{tj}) measures the divergence between the prior distribution of C and its posterior distribution after observing a_{tj}, and can therefore be used as a proper weighting function. The value weight for a word frequency bin a_{tj} is defined as

    $$ w_{tj} = \frac{1}{Z_t} HM(C \mid a_{tj}) = \frac{1}{Z_t} \left( \sum_c \left( \sqrt{p(c \mid a_{tj})} - \sqrt{p(c)} \right)^2 \right)^{1/2} \quad (8) $$

where Z_t is a normalization constant given as Z_t = (1/|a_t|) Σ_j HM(C|a_{tj}), and |a_t| represents the number of word frequency bins of word feature t.
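The following sketch puts Equations (6)-(8) together: it estimates the Hellinger-based weight of every (word, frequency-bin) pair from discretized training data and then scores a new document with the value-weighted rule. The function names, the array layout, and the reading of Z_t as the average Hellinger value of a word's bins are assumptions made for illustration; smoothed estimates of p(a_tj|c) are assumed to be computed separately, analogously to Equation (2).

```python
import numpy as np

def hellinger_value_weights(B, y, n_classes):
    """Equations (7)-(8): weight w_tj for every word t and frequency bin j.

    B: (n_docs, |V|) matrix of bin indices a_tj (e.g., 0..4 after 5-bin
    discretization); y: class index per document.
    """
    n_docs, V = B.shape
    n_bins = int(B.max()) + 1
    p_c = np.array([(y == c).mean() for c in range(n_classes)])   # prior p(c)
    w = np.zeros((V, n_bins))
    for t in range(V):
        for j in range(n_bins):
            mask = B[:, t] == j
            if not mask.any():
                continue
            p_c_bin = np.array([(y[mask] == c).mean() for c in range(n_classes)])  # p(c | a_tj)
            w[t, j] = np.sqrt(np.sum((np.sqrt(p_c_bin) - np.sqrt(p_c)) ** 2))      # HM(C | a_tj)
        z_t = w[t].mean()                      # Z_t read as the average HM value of word t's bins
        if z_t > 0:
            w[t] /= z_t
    return w

def predict_mnb_vw(b, log_prior, log_cond_bins, w):
    """Equation (6): argmax_c  log p(c) + sum_t w_{t,b_t} * log p(a_{t,b_t} | c).

    b: bin index of every word in the new document; log_cond_bins has shape
    (n_classes, |V|, n_bins) and holds log p(a_tj | c) estimated from training data.
    """
    scores = log_prior.copy()
    for t in range(len(b)):
        scores = scores + w[t, b[t]] * log_cond_bins[:, t, b[t]]
    return int(np.argmax(scores))
```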
                    Experimental Evaluation

In order to evaluate the performance of the proposed methods, we divide the experiment into two parts. First, the performance of the value weighting method is compared with that of MNB. Second, we compare the co-training model with other multi-label algorithms such as BR and CC.

In this section, we describe how we conducted the experiments for measuring the effects of the value weighting method, and then present the empirical results obtained using these methods.

We used three text datasets to conduct our empirical study of text classification. All these datasets have been widely used in text classification and are publicly available. Table 1 provides a brief description of each dataset.

             Table 1: Test datasets for the value weighting method.

             Dataset      #(Data)    #(Label)    #(Feature)
             New3         3204       6           13196
             Ohsumed      1003       10          3183
             Amazon       1500       50          10000

'New3' (TunedIT 2012) contains a collection of news stories, and 'Ohsumed' (TunedIT 2012) is a dataset of medical articles. The 'Amazon' (Frank and Asuncion 2010) dataset is derived from customer reviews on the Amazon Commerce Website for authorship identification. The continuous features in the datasets were discretized using the equal-distance method with 5 bins.

In this experiment, single-label classification was conducted, so the value weighting method (MNB-VW) was used without employing the co-training model. The proposed MNB-VW is compared with regular naive Bayes (NB) and multinomial naive Bayes (MNB). We used the Weka software to run NB and MNB.

             Table 2: Accuracies of the methods.

             Dataset      NB        MNB        MNB-VW
             New3         0.842     0.935*     0.929
             Ohsumed      0.918     0.951      0.986*
             Amazon       0.971     0.976      0.994*

Table 2 shows the accuracies of these methods; the values marked with * are the best accuracy among NB, MNB, and MNB-VW for each dataset. MNB-VW shows the best performance in two of the three cases, and it always performs better than the NB method. These results clearly indicate that assigning weights to each word frequency bin can improve the classification performance of naive Bayes in document classification.

We also calculated the average value weights of all features in each dataset. Because the number of bins is 5, all features have 5 discretized word frequency bins. The average value weights used in the above experiment are shown in Figure 1.

             [Figure 1: Average value weights of each dataset]

When the term frequency is zero, it is discretized into the first word frequency bin. The remaining term frequencies are discretized by equal distance, and the discretized values are assigned in order of frequency; the highest term frequencies are discretized into the fifth word frequency bin.
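A small sketch of one plausible reading of this discretization scheme: zero counts go to the first bin, and the non-zero count range of each word is split into equal-width intervals for the remaining bins. The exact bin boundaries are not spelled out in the paper, so the function below is an assumption made for illustration.

```python
import numpy as np

def discretize_equal_distance(X, n_bins=5):
    """Map term counts to bin indices 0..n_bins-1: bin 0 holds zero counts,
    bins 1..n_bins-1 split each word's non-zero count range into equal widths."""
    n_docs, V = X.shape
    B = np.zeros_like(X, dtype=int)
    for t in range(V):
        col = X[:, t]
        nz = col > 0
        if not nz.any():
            continue                                  # word never occurs: everything stays in bin 0
        lo, hi = col[nz].min(), col[nz].max()
        width = max((hi - lo) / (n_bins - 1), 1e-9)   # equal-width intervals over the non-zero range
        B[nz, t] = 1 + np.minimum(((col[nz] - lo) / width).astype(int), n_bins - 2)
    return B
```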
             Table 3: Weights of feature value bins.

             Dataset      1st       2nd       3rd       4th       5th
             New3         0.0023    1.9674    1.2436    0.8458    0.7523
             Ohsumed      0.0022    1.7143    1.4672    0.8310    0.4902
             Amazon       0.1152    1.6844    1.2484    1.3550    1.0888

Table 3 shows the weights of each frequency bin, and it clearly shows that the low word frequency bins generally receive higher weights than the high word frequency bins; that is, higher weights are generally assigned to low term frequencies. Notice that the 1st bin means the term frequency is zero, so it is reasonable that the 1st bin (TF = 0) has a very low weight. Indeed, the event that a word occurs once or a small number of times carries more significance than the event that a word occurs many times. These experiments demonstrate that each frequency bin has a different importance in terms of document classification, and that our fine-grained weighting method can clearly address this issue.

                 Conclusions and Future Work

In this paper, the multinomial naive Bayes algorithm, applied to multi-label document classification, is improved in a number of ways. A new paradigm of weighting, called the value weighting method, is proposed for multinomial naive Bayesian learning. The proposed method is a more fine-grained weighting method, and assigns a different weight to each feature value.

The experimental results show that the value weighting method performs better than its counterpart algorithms in most cases. As a result, this work suggests that we can improve the performance of multinomial naive Bayes in multi-label text classification by using the value weighting approach. In the future, it will be interesting to apply this approach to some large-scale document datasets in order to examine the scalability of the proposed method.

                      Acknowledgments

This work was supported in part by the National Research Foundation of Korea (NRF) (Grant number: 2011-0023296).

                        References

R. J. Beran. Minimum Hellinger distances for parametric models. Annals of Statistics, 5:445–463, 1977.

A. Frank and A. Asuncion. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2010.

N. Ghamrawi and A. McCallum. Collective multi-label classification. In International Conference on Information and Knowledge Management, 2005.

S. Kim, K. Han, H. Rim, and S. Myaeng. Some effective techniques for naive Bayes text classification. IEEE Transactions on Knowledge and Data Engineering, 18(11):1457–1466, 2006.

Chang-Hwan Lee, Fernando Gutierrez, and Dejing Dou. Calculating feature weights in naive Bayes with Kullback-Leibler measure. In 11th IEEE International Conference on Data Mining, 2011.

A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.

A. McCallum. Multi-label text classification with a mixture model trained by EM. In AAAI'99 Workshop on Text Learning, 1999.

J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. In ECML/PKDD, pages 254–269, 2009.

J. Rennie, L. Shih, J. Teevan, and D. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 616–623, 2003.

K. M. Schneider. A new feature selection score for multinomial naive Bayes text classification based on KL-divergence. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL), pages 186–189, 2004.

K. Schneider. Techniques for improving the performance of naive Bayes for text classification. In LNCS Vol. 3406, pages 682–693, 2005.

TunedIT. TunedIT data mining blog. http://www.tunedit.org, 2012.