=Paper= {{Paper |id=Vol-1458/G07_CRC81_Ghasem |storemode=property |title=A New Approach for Selecting Informative Features For Text Classification |pdfUrl=https://ceur-ws.org/Vol-1458/G07_CRC81_Ghasem.pdf |volume=Vol-1458 |dblpUrl=https://dblp.org/rec/conf/lwa/GhasemFM15a }} ==A New Approach for Selecting Informative Features For Text Classification== https://ceur-ws.org/Vol-1458/G07_CRC81_Ghasem.pdf
      A New Approach For Selecting Informative
          Features For Text Classification

             Zinnar Ghasem1 , Ingo Frommholz1 , and Carsten Maple2
                           1
                            University of Bedfordshire, UK
                               2
                              University of Warwick, UK
                    {zinnar.ghasem,ingo.frommholz}@beds.ac.uk
                           carsten.maple@warwick.ac.uk



        Abstract. Selecting useful and informative features to classify text is
        not only important to decrease the size of the feature space, but as well
        for the overall performance and precision of machine learning. In this
        study we propose a new feature selection method called Informative Fea-
        ture Selector (IFS). Different machine learning algorithms and datasets
        have been utilised to examine the effectiveness of IFS, and it is compared
        to well-established methods, namely Information Gain, Odd Ratio, Chi
        Square, Mutual Information and Class Discriminative Measure. Our ex-
        periments show that IFS is able to outperform aforementioned methods
        and to produce effective and efficient results.

        Keywords: Feature selection, text classification, text preprocessing


1     Introduction

Automatic text classification is increasingly imperative for managing a substan-
tial amount of data and information on the Web, for instance for search, spam
filtering, malware classification, etc. It has been well studied over the past half
century [1, 2], and a number of machine learning and statistical algorithms have
extensively been used in various types of classification. In text classification,
each document needs to be represented as a vector of features, for which various
methods have been employed – Bag-of-Words is one of the major methods, where
all unique features (in this case terms) of documents are utilised. This approach
results in a high dimensional vector space of features, particularly in the case of
a big dataset or where we find a large volume of document content. This large
set of features is one of main issues of text classification, therefore it is highly
desirable to reduce the size of the vector space by selecting useful features.
     For this purpose, we propose a probabilistic Informative Feature Selector
(IFS). The IFS is compared using ten folds cross validation to some of the
    Copyright c 2015 by the papers authors. Copying permitted only for private and
    academic purposes. In: R. Bergmann, S. Görg, G. Müller (Eds.): Proceedings of
    the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9.
    October 2015, published at http://ceur-ws.org




                                        382
2       Zinnar Ghasem, Ingo Frommholz, and Carsten Maple

well-known methods, namely Chi Square (chi), Odd Ratio(OR), Information
Gain (IG), Class Discriminative Measures (CDM) and Mutual Information (MI).
The results of our experiment show that IFS is comparable to some and superior
to others in term classification accuracy. The rest of this paper is organised as
follows: Section 2 lists all feature selection methods which have been compared
with IFS. Section 3 introduces IFS method. Section 4 outlines the experiments;
datasets and performance measures, while results are analysed in section 5 and
conclusion is the last section.


2     Related Work

As aforementioned, there exist a number of feature selection methods for text
classification; among them IG, OR, CDM, and Chi have been claimed to perform
well in experimental studies [3–6, 11, 8]. Thus, specifically those methods have
been selected to be evaluated against IFS. In the following subsections, a brief
description of each method is provided.


2.1   Information Gain (IG)

Information gain evaluates the usefulness of a feature for classification based on
absence and presence of that feature in documents [9]. Information Gain defines
the relationship between class cj and feature ti in the following formula.
                         m
                         X                                      m
                                                                X
      IG(ti , cj ) = −         P (cj ) log(P (cj )) + P (ti )         P (cj |ti ) log(P (cj |ti ))
                         j=1                                    j=1
                                                                m
                                                                X
                                                    +P (t¯i )         P (cj |t¯i ) log(P (cj |t¯i )),   (1)
                                                                j=1


where P (cj ) is the probability of cj , P (ti ), and P (t¯i ) is the probability of ti
occurring and not occurring in cj correspondingly , while P (cj |ti ) is the proba-
bility of cj given the probability of ti , and P (cj |t¯i ) is the probability of cj given
the probability of t¯i and m is number of classes in the training set.


2.2   Mutual Information (MI)

Mutual information evaluates the association between a feature and a class as
                                                                     
                                   P (ti ∧ cj )             P (ti |cj )
             M I(ti , cj ) = log                    = log                 . (2)
                                   P (ti )P (cj )            P (ti )

However, using the the two ways contingency table shown in Table 1, Eq. 2 can
be represented in terms of the number of documents in a class as shown in Eq. 3
where α and λ represent the number of documents containing and not containing
term ti in class cj respectively, similarly β and δ are the number of documents




                                              383
A New Approach For Selecting Informative Features For Text Classification           3

                        Table 1. Two ways contingency table

                                             Cj              ¬Cj
                        term ti              α                 β
                        term t¯i             λ                 δ



containing and not containing ti in ¬Cj . If η = α + β + λ + δ is total number of
documents in all classes, MI can be estimated as
                                                          
                                                α×η
                      M I(ti , cj ) = log                                     (3)
                                            (α + λ)(α + β)

2.3   Chi Square (χ2 )
χ2 or chi has been reported as one of top feature selection methods and it
measures the correlation between feature ti and class cj [3]. Using Table 1, χ2 is
defined as
                                             η × (αδ − βλ)2
                    χ2 (ti , cj ) =                                               (4)
                                      (α + λ)(β + δ)(α + β)(λ + δ)

The goodness of ti decreases as the independence between ti and cj increases
to the point where the value is zero, that is, when the number of documents
containing term ti in classes is equal.

2.4   Odd Ratio (OR)
Odd Ratio was initially developed for a binary classification; it has been reported
by [6] that this value has additional advantages to other classification methods.
Odd Ratio can be computed as follow:
                                                                                
                      
                        odd(ti |pos)
                                              P (ti |pos)(P (1 − P (ti |neg)))
        OR(ti ) = log odd(t  i |neg)
                                       = log                                       (5)
                                               P (ti |neg)(P (1 − P (ti |P os)))

where both “pos” and “neg” represent a positive and negative class, respectively.

2.5   Class Discrimination Measure (CDM)
The CDM is an improved version of Multi-class Odd Ratio (MOR), where MOR
is originally based on Odd Ratio. CDM is introduced in [10] and has reportedly
outperformed IG.
                                             m
                                             X           P (ti |cj )
                          CM D(ti , cj ) =         log                            (6)
                                             j=1
                                                         P (ti |c¯j )

where P (ti |c¯j ) is the probability that ti occurs in another class than cj .




                                           384
4        Zinnar Ghasem, Ingo Frommholz, and Carsten Maple

3     Informative Feature Selector (IFS)

The principal aim of a feature selection method is to differentiate between infor-
mative and uninformative features by giving highest values to most informative
and lowest values to least informative features, which subsequently facilitates the
process of reducing the size of the feature vector space. However, the following
issues must be taken into account by a feature selection method in the process
of weighting the features for text classification. A feature that appears in all
documents of a class and does not appear in any documents of other class(es)
is a good discriminator and should be given the highest value. Consequently, a
feature which equally appears in all classes should be given a lower value. The
value assigned to a feature should reflect its degree of discrimination and its
correlation between the classes.3 .
    Based on these consideration, the IFS method is formulated as follows:

               IF S(ti , cj ) = log(|(P (ti |cj )P (t¯i |c¯j ) − P (ti |c¯j )P (t¯i |cj ))|
                                     |P (ti |cj )−P (ti |c¯j )|
                                   min(P (ti |cj ),P (ti |c¯j ))+1 + 1)                        (7)

where P (ti |cj ) is the probability of ti appearing in class cj and P (t¯i |cj ) is the
probability of ti not appearing in class cj . Similarly P (ti |c¯j ) is the probability
of ti appearing in another class c¯j and P (t¯i |c¯j ) is probability of ti not occurring
in c¯j .
     IFS has been formulated in a way so that the overall value given to a feature
is sensitive to changes in the number of features which are absent, shared or
present in classes. In order for IFS to assign a value which reflects the usefulness
of a feature for classification, and to adhere to above considerations, IFS makes
use of the difference between the probability of a feature that appears in both
classes, then dividing it by the smallest value between P (ti |cj ) and P (ti |c¯j ); 1 is
added in case the smallest value is zero (this is the 2nd part of the equation). This
makes sure the feature is assigned an appropriate value not only according to the
differences between P (ti |cj ) and P (ti |c¯j ) but also according to the probability of
ti occurring in the intersection between cj and c¯j . However, the calculation so far
does not consider the number of documents which do not contain features in all
classes; therefore, both the absence and presence of features in classes P (ti |cj ),
P (t¯i |cj ), P (ti |c¯j ) and P (t¯i |c¯j ) are used in the calculation to reflect the probability
of a feature being absent or present in classes. The whole formula makes sure
the feature is assigned a value which reflects its usefulness for classification.
This is not always the case in some other approaches such as OR, in which
the intersection or number of shared features between classes is not taken into
account. The maximum and minimum values are between zero and one, and
a feature which equally appears in both classes is assigned a minimum value,
3
    We find similar considerations in classical probabilistic information retrieval models
    (where often the probability that a term occurs in the relevant/non-relevant docu-
    ments is used) and also in the well-known concept of the inverse document frequency
    (terms appearing in less documents are deemed good discriminators).




                                               385
A New Approach For Selecting Informative Features For Text Classification        5

which is zero in case of balanced binary classes. The method adheres to all above
mentioned considerations.


4     Evaluation
A 10-fold cross-validation of text classification on both unbalanced and balanced
datasets of spam emails and SMS have been carried out to test the performance
and sensitivity of the IFS method to the number of documents in classes and
the distribution ratio of documents between classes. In this experiment we fol-
lowed [11] and used binary classification, because the results of binary classifi-
cation reflect the effectiveness of the used measure more directly than that of
multi-class classification [11]. Moreover, resolving the binary text classification
also means resolving multi-classes [2]. For this experiment, two supervised algo-
rithms namely Support Vector Machines (SVM) and Neural Network (NN) have
been employed, as both algorithms are performing well in text classification [1,
12].

4.1   Dataset and Performance Measure
Two sets of data have been used in our experiments, namely the enron1 email
collection [13], in total 5172 emails, in which 1500 are spam email and 3672 are
legitimate emails, and an SMS collection, which was introduced in[14, 15] and
consists of 5574 SMS messages, where 4827 are legitimate SMS and 747 spam
SMS. From these two datasets we have created 3 test datasets, to enable us to
test the sensitivity of the methods with respect to allocation and distribution
of documents in classes. The first (balanced ) test dataset consist of 3000 emails
with 1500 spam and randomly chosen 1500 from legitimate emails. The 2nd
set (unbalanced ) consists of 4500 emails in which 1500 are spam emails and
randomly chosen 3000 legitimate emails. Finally third test dataset (unbalanced )
is based on the SMS dataset, which consists of 747 spam and randomly chosen
2576 legitimate ones from a total of 3323 SMS messages. The top n subset of
features with highest weight were used in experiment, and n was set to 10, 15,
25, 50, 100, 200, 300, 400 and 500.
    The performance of all methods was assessed using the F-measure (F1 ) based
on precision and recall as defined in [2] using micro-averaging.


5     Result Analysis
5.1   Unbalanced datasets
The F-measures of the 10-fold cross validation of the email and SMS datasets
based on SVM and NN classifiers are shown in Figures 1, 2, 3, and 4 respectively.
Based on the results, IFS is performing well with both classifiers. Across all
used feature sets, IFS is either superior or comparable to others, while in some
instances it is shown as second best.




                                     386
6      Zinnar Ghasem, Ingo Frommholz, and Carsten Maple

                         1.00
                                                                                            CDM
                         0.95                                                               Chi




            F-measures
                                                                                            IG
                         0.90
                                                                                            OR
                         0.85                                                               MI
                                                                                            IFS
                         0.80
                                0   50   100 150 200 250 300 350 400 450 500
                                                number of features


                     Fig. 1. F-measures of SVM with unbalanced emails dataset

                         1.00
                                                                                            CDM
                         0.95
                                                                                            Chi
          F-measure




                         0.90
                                                                                            IG
                         0.85
                                                                                            OR
                         0.80
                                                                                            MI
                         0.75
                                                                                            IFS
                         0.70
                                0   50   100 150 200 250 300 350 400 450 500
                                                number of features


                         Fig. 2. F-measures of NN with unbalanced emails dataset

                      1.00
                                                                                            CDM
                      0.95                                                                  Chi
         F-measure




                                                                                            IG
                      0.90
                                                                                            OR
                      0.85                                                                  MI

                      0.80                                                                  IFS
                                0   50   100   150    200 250 300 350     400   450   500
                                                     number of features


                         Fig. 3. F-measures of NN with unbalanced SMS collection



5.2   Balanced datasets

The outcome of F-measures of the 10-fold cross validation for above mentioned
methods using SVM and the NN classifier with datasets, 3000 emails with ratio
of 1:1 respectively, is shown in Figures 5 and 6. Considering F-measure, IFS yet
again is performing well in balanced datasets, and it could be noticed from the
results that IFS is superior to others and only in some cases slightly scoring
behind. In general, we can conclude that IFS is robust in performing well in
both SVM and NN classifiers, compared to other methods.




                                                         387
A New Approach For Selecting Informative Features For Text Classification                        7

                        1.00
                                                                                           CDM
                                                                                           Chi




            F-measure
                        0.95
                                                                                           IG
                                                                                           OR
                        0.90
                                                                                           MI
                                                                                           IFS
                        0.85
                               0   50   100   150    200 250 300 350     400   450   500
                                                    number of features


                        Fig. 4. F-measures of SVM with unbalanced SMS collection

                        1.00
                                                                                           CDM
                        0.95
                        0.90                                                               Chi
           F-measure




                        0.85                                                               IG
                        0.80                                                               OR
                        0.75
                                                                                           MI
                        0.70
                        0.65                                                               IFS
                               0   50   100 150 200 250 300 350 400 450 500
                                               number of features


                         Fig. 5. F-measures of SVM with balanced emails dataset

                       1.00
                       0.95                                                                CDM
                       0.90
                       0.85                                                                Chi
         F-measure




                       0.80
                       0.75                                                                IG
                       0.70
                       0.65                                                                OR
                       0.60
                       0.55                                                                MI
                       0.50
                       0.45                                                                IFS
                               0   50   100 150 200 250 300 350 400 450 500
                                               number of features


                          Fig. 6. F-measures of NN with balanced emails dataset



6   Conclusion

This paper has introduced a new method called IFS to select informative features
for classification. The method has been evaluated against some well established
feature space reduction methods. Throughout the F-measure values of the 10-
fold cross validations result, and processing time, IFS has shown to be robust
and often superior, at least competitive compared to other methods. In future
work, we aim to test IFS on multi classes datasets. We also aim to improve our
method by considering and calculating the frequency of a feature in document
and add its weight to the overall usefulness of the feature.




                                                       388
8       Zinnar Ghasem, Ingo Frommholz, and Carsten Maple

References
1. Alan S. Abrahams, Eloise Coupey, Eva X. Zhong, Reza Barkhi, and Pete S. Man-
  asantivongs.: Audience targeting by B-to-B advertisement classification: A neural
  network approach. Expert Systems with Applications, Elsevier, 40(8): 2777-2791,
  (2013)
2. Fabrizio Sebastiani.: Machine Learning in Automated Text Categorization. ACM
  computing surveys (CSUR), 34(1):1-47, (2002)
3. Yiming Yang and Jan O Pedersen.: A Comparative Study of Features selection in
  text categorization. The Fourteenth International Conference on Machine Learning
  (ICML), 97:412-420 (1997)
4. Jingnian Chen, Houkuan Huang, Shengfeng Tian, and Youli Qu.: Expert Systems
  with Applications Feature selection for text classification with Naive Bayes. Expert
  Systems With Applications, Elsevier, 36(3):5432-5435 (2009)
5. George Foreman.: An Extensive Empirical Study of Feature Selection Metrics for
  Text Classification. Journal of Machine Learning Research, 3:1289-1305 (2003)
6. Jun-Ming Xu, Xiaojin Zhu, and Amy Bellmore. Fast learning for sentiment analysis
  on bullying.: In Proceedings of the First International Workshop on Issues of Senti-
  ment Discovery and Opinion Mining - WISDOM, New York, USA, ACM Press, 1-6
  (2012)
7. Hiroshi Ogura, Hiromi Amano, and Masato Kondo.: Feature selection with a mea-
  sure of deviations from Poisson in text categorization. Expert Systems with Appli-
  cations, Elsevier, 36(3):6826-6832 (2009)
8. Dunja Mladenic and Marko Grobelnik.: Feature selection for unbalanced class dis-
  tribution and Naive Bayes. In Proceedings of the Sixteenth International Conference
  on Machine Learning, 258-267 (1999)
9. Yan Xu. A Comparative Study on Feature Selection in Unbalance Text Classifica-
  tion. In Information Science and Engineering (ISISE), 2012 International Symposium,
  IEEE-CPS, 44-47 (2012)
10. Matthew Chang and Chung Keung Poon.: Using phrases as features in email clas-
  sification.The Journal of Systems and Software, Elsevier, 82(6): 1036-1045 (2009)
11. Hiroshi Ogura, Hiromi Amano, and Masato Kondo.: Feature selection with a mea-
  sure of deviations from Poisson in text categorization. Expert Systems with Appli-
  cations, Elsevier, 36(3):6826-6832 (2009)
12. Yang, Y. and Liu, X., A re-examination of text categorization methods. Proceed-
  ings of the 22nd annual international ACM SIGIR conference on Research and de-
  velopment in information retrieval SIGIR 99, pp.4249 (1999)
13. Vangelis Metsis, I Androutsopoulos, and G Paliouras.: Spam filtering with naive
  bayes - which naive bayes?. CEAS, the third conference on Email and Anti-Spam,
  27-28 (2006)
14. Gordon V. Cormack, Jose Mara Gomez Hidalgo, and Enrique Puertas Sanz.: Fea-
  ture engineering for mobile (SMS) spam filtering. In Proceedings of the 30th annual
  international ACM SIGIR conference on Research and development in information
  retrieval, 871-872 (2007)
15. Tiago A Almeida, Jose Maria G Hidalgo, and Akebo Yamakami.: Contributions to
  the study of SMS spam filtering: new collection and results. In Proceedings of the
  11th ACM symposium on Document engineering, 259-262 (2011)




                                      389