<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>A New Approach For Selecting Informative Features For Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zinnar Ghasem</string-name>
          <email>zinnar.ghasem@beds.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ingo Frommholz</string-name>
          <email>ingo.frommholz@beds.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carsten Maple</string-name>
          <email>carsten.maple@warwick.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bedfordshire</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Warwick</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <fpage>382</fpage>
      <lpage>389</lpage>
      <abstract>
<p>Selecting useful and informative features to classify text is important not only for decreasing the size of the feature space, but also for the overall performance and precision of machine learning. In this study we propose a new feature selection method called Informative Feature Selector (IFS). Different machine learning algorithms and datasets have been utilised to examine the effectiveness of IFS, and it is compared to well-established methods, namely Information Gain, Odd Ratio, Chi Square, Mutual Information and Class Discriminative Measure. Our experiments show that IFS is able to outperform the aforementioned methods and to produce effective and efficient results.</p>
      </abstract>
      <kwd-group>
        <kwd>Feature selection</kwd>
<kwd>text classification</kwd>
        <kwd>text preprocessing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Automatic text classification is increasingly imperative for managing the
substantial amount of data and information on the Web, for instance for search, spam
filtering, or malware classification. It has been well studied over the past half
century [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], and a number of machine learning and statistical algorithms have been
used extensively in various types of classification. In text classification,
each document needs to be represented as a vector of features, for which various
methods have been employed; Bag-of-Words is one of the major ones, where
all unique features (in this case terms) of the documents are utilised. This approach
results in a high-dimensional feature vector space, particularly in the case of
a big dataset or documents with a large volume of content. This large
set of features is one of the main issues of text classification; therefore, it is
highly desirable to reduce the size of the vector space by selecting useful features.
      </p>
      <p>For this purpose, we propose the probabilistic Informative Feature Selector
(IFS). IFS is compared, using ten-fold cross-validation, to some of the
well-known methods, namely Chi Square (chi), Odd Ratio (OR), Information
Gain (IG), Class Discriminative Measure (CDM) and Mutual Information (MI).
The results of our experiments show that IFS is comparable to some and superior
to others in terms of classification accuracy. The rest of this paper is organised as
follows: Section 2 reviews the feature selection methods which have been compared
with IFS. Section 3 introduces the IFS method. Section 4 outlines the experiments,
datasets and performance measures, while the results are analysed in Section 5, and
the conclusion closes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>As mentioned above, a number of feature selection methods exist for text
classification; among them, IG, OR, CDM and Chi have been claimed to perform
well in experimental studies [3-6, 8, 11]. Those methods have therefore
been selected for evaluation against IFS. In the following subsections, a brief
description of each method is provided.</p>
      <sec id="sec-2-1">
        <title>Information Gain (IG)</title>
        <p>
          Information gain evaluates the usefulness of a feature for classification based on
the absence and presence of that feature in documents [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Information Gain defines
the relationship between class $c_j$ and feature $t_i$ by the following formula:
        </p>
        <p>$IG(t_i, c_j) = -\sum_{j=1}^{m} P(c_j)\log P(c_j) + P(t_i)\sum_{j=1}^{m} P(c_j|t_i)\log P(c_j|t_i) + P(\bar{t}_i)\sum_{j=1}^{m} P(c_j|\bar{t}_i)\log P(c_j|\bar{t}_i)$ (1)</p>
        <p>where $P(c_j)$ is the probability of $c_j$; $P(t_i)$ and $P(\bar{t}_i)$ are the
probabilities of $t_i$ occurring and not occurring in a document, respectively;
$P(c_j|t_i)$ is the probability of $c_j$ given the presence of $t_i$;
$P(c_j|\bar{t}_i)$ is the probability of $c_j$ given the absence of $t_i$; and $m$ is
the number of classes in the training set.</p>
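<p>As an illustration, Eq. 1 can be estimated from raw document counts. The sketch below is our own (the function name and count-based argument layout are assumptions, not from the paper) and mirrors the three terms of the formula:</p>

```python
from math import log

def information_gain(n_tc, n_c, n_t, n):
    """Estimate IG(t) of Eq. 1 from document counts.

    n_tc[j]: documents of class j containing term t
    n_c[j]:  documents of class j
    n_t:     documents containing t (over all classes)
    n:       total number of documents
    """
    p_t = n_t / n                 # P(t): term present
    p_not_t = 1.0 - p_t           # term absent
    ig = 0.0
    for ntc, nc in zip(n_tc, n_c):
        p_c = nc / n
        ig -= p_c * log(p_c)                      # -sum_j P(c_j) log P(c_j)
        if n_t and ntc:                           # P(c_j | t) term
            p_c_t = ntc / n_t
            ig += p_t * p_c_t * log(p_c_t)
        if n != n_t and nc != ntc:                # P(c_j | t absent) term
            p_c_nt = (nc - ntc) / (n - n_t)
            ig += p_not_t * p_c_nt * log(p_c_nt)
    return ig
```

A term confined to a single class yields the full class entropy, while a term spread evenly over the classes yields zero.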
      </sec>
      <sec id="sec-2-2">
        <title>Mutual Information (MI)</title>
        <p>Mutual information evaluates the association between a feature and a class as
$MI(t_i, c_j) = \log\frac{P(t_i \wedge c_j)}{P(t_i)\,P(c_j)}$ (2)</p>
        <p>However, using the two-way contingency table shown in Table 1, Eq. 2 can be
expressed in terms of the number of documents in a class, as shown in Eq. 3, where $A$
and $B$ denote the number of documents containing and not containing term $t_i$ in
class $c_j$, respectively; similarly, $C$ and $D$ are the numbers of documents
containing and not containing $t_i$ in $\bar{c}_j$. If $N = A + B + C + D$ is the total
number of documents in all classes, MI can be estimated as
$MI(t_i, c_j) \approx \log\frac{A \cdot N}{(A + C)(A + B)}$ (3)</p>
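<p>Eq. 3 reduces to the four contingency counts; a minimal sketch (the argument names follow the letters used above, which are themselves a reconstruction):</p>

```python
from math import log

def mutual_information(a, b, c, d):
    """Estimate MI(t, c) via Eq. 3 from the Table 1 counts.

    a: docs of class c containing t     b: docs of class c without t
    c: docs of other classes with t     d: docs of other classes without t
    """
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # log 0: t never occurs inside the class
    return log(a * n / ((a + c) * (a + b)))
```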
        <p>
          Chi Square ($\chi^2$, or chi) has been reported as one of the top feature
selection methods; it measures the correlation between feature $t_i$ and class $c_j$ [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Using Table 1, $\chi^2$ is defined as
$\chi^2(t_i, c_j) = \frac{N\,(AD - CB)^2}{(A+B)(C+D)(A+C)(B+D)}$ (4)
        </p>
        <p>The goodness of $t_i$ decreases as the independence between $t_i$ and $c_j$
increases, down to the point where the value is zero, that is, when the number of
documents containing term $t_i$ is equal across the classes.</p>
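<p>Eq. 4 uses the same Table 1 counts; a small sketch of the statistic:</p>

```python
def chi_square(a, b, c, d):
    """Chi-square of Eq. 4 from the Table 1 counts (letters as in the MI section)."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - c * b) ** 2 / den if den else 0.0
```

For a perfectly correlated term the statistic equals $N$; for an independent term it is zero.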
      </sec>
      <sec id="sec-2-3">
        <title>Odd Ratio (OR)</title>
        <p>
          Odd Ratio was initially developed for binary classification; it has been
reported by [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] that it has advantages over other feature selection methods.
Odd Ratio can be computed as follows:
        </p>
        <p>$OR(t_i) = \log\frac{odd(t_i|pos)}{odd(t_i|neg)} = \log\frac{P(t_i|pos)\,(1 - P(t_i|neg))}{P(t_i|neg)\,(1 - P(t_i|pos))}$ (5)</p>
        <p>where "pos" and "neg" denote the positive and negative class, respectively.</p>
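<p>Eq. 5 can be computed directly from the two conditional probabilities; a minimal sketch (the eps clamping is our assumption, guarding against zero or unit probabilities):</p>

```python
from math import log

def odds_ratio(p_t_pos, p_t_neg, eps=1e-6):
    """Odds ratio of Eq. 5; eps keeps both probabilities strictly inside (0, 1)."""
    p_t_pos = min(max(p_t_pos, eps), 1.0 - eps)
    p_t_neg = min(max(p_t_neg, eps), 1.0 - eps)
    return log(p_t_pos * (1.0 - p_t_neg) / (p_t_neg * (1.0 - p_t_pos)))
```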
      </sec>
      <sec id="sec-2-4">
        <title>Class Discrimination Measure (CDM)</title>
        <p>
          The CDM is an improved version of the Multi-class Odd Ratio (MOR), which is
itself based on Odd Ratio. CDM was introduced in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and has reportedly
outperformed IG.
        </p>
        <p>$CDM(t_i) = \sum_{j=1}^{m} \left|\log\frac{P(t_i|c_j)}{P(t_i|\bar{c}_j)}\right|$ (6)</p>
        <p>where $P(t_i|\bar{c}_j)$ is the probability that $t_i$ occurs in a class other than $c_j$.</p>
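<p>One summand of Eq. 6 can be sketched as follows (the eps floor is our assumption, avoiding a logarithm of zero):</p>

```python
from math import log

def cdm(p_t_c, p_t_rest, eps=1e-6):
    """One summand of Eq. 6: |log P(t|c) / P(t|other classes)|."""
    return abs(log(max(p_t_c, eps) / max(p_t_rest, eps)))
```

A term occurring with equal probability in both groups scores zero; the score grows with the imbalance of the two conditionals.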
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Informative Feature Selector (IFS)</title>
      <p>The principal aim of a feature selection method is to differentiate between
informative and uninformative features by assigning the highest values to the most
informative and the lowest values to the least informative features, which subsequently
facilitates reducing the size of the feature vector space. However, a feature selection
method must take the following issues into account when weighting features for text
classification. A feature that appears in all documents of a class and does not appear
in any document of the other class(es) is a good discriminator and should be given the
highest value. Conversely, a feature which appears equally in all classes should be
given a lower value. The value assigned to a feature should reflect its degree of
discrimination and its correlation with the classes.</p>
      <p>Based on these considerations, the IFS method is formulated as follows:</p>
      <p>$IFS(t_i, c_j) = \log\Big(\big|P(t_i|c_j)P(\bar{t}_i|\bar{c}_j) - P(t_i|\bar{c}_j)P(\bar{t}_i|c_j)\big| \cdot \frac{|P(t_i|c_j) - P(t_i|\bar{c}_j)|}{\min(P(t_i|c_j), P(t_i|\bar{c}_j)) + 1} + 1\Big)$ (7)</p>
      <p>where $P(t_i|c_j)$ is the probability of $t_i$ appearing in class $c_j$ and
$P(\bar{t}_i|c_j)$ is the probability of $t_i$ not appearing in class $c_j$. Similarly,
$P(t_i|\bar{c}_j)$ is the probability of $t_i$ appearing in a class other than $c_j$,
and $P(\bar{t}_i|\bar{c}_j)$ is the probability of $t_i$ not occurring in $\bar{c}_j$.</p>
      <p>IFS has been formulated so that the overall value given to a feature is
sensitive to changes in the number of features which are absent, shared or present in
the classes. In order to assign a value which reflects the usefulness of a feature for
classification, and to adhere to the above considerations, IFS uses the difference
between the probabilities of a feature appearing in the two classes, divided by the
smaller of $P(t_i|c_j)$ and $P(t_i|\bar{c}_j)$; 1 is added to the denominator in case
the smaller value is zero (this is the second part of the equation). This ensures that
the feature is assigned an appropriate value not only according to the difference
between $P(t_i|c_j)$ and $P(t_i|\bar{c}_j)$, but also according to the probability of
$t_i$ occurring in the intersection of $c_j$ and $\bar{c}_j$. However, this part alone
does not consider the number of documents which do not contain the feature; therefore,
both the absence and presence probabilities $P(t_i|c_j)$, $P(\bar{t}_i|c_j)$,
$P(t_i|\bar{c}_j)$ and $P(\bar{t}_i|\bar{c}_j)$ enter the calculation. The complete
formula thus assigns each feature a value which reflects its usefulness for
classification. This is not always the case in some other approaches such as OR, in
which the intersection, i.e. the number of shared features between classes, is not
taken into account. The scores range between zero and one; a feature which appears
equally in both classes is assigned the minimum value, which is zero in the case of
balanced binary classes. We find similar considerations in classical probabilistic
information retrieval models (where often the probability that a term occurs in the
relevant/non-relevant documents is used) and also in the well-known concept of inverse
document frequency (terms appearing in fewer documents are deemed good discriminators).
The method adheres to all of the above considerations.</p>
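<p>Under our reading of Eq. 7 (including a base-2 logarithm, an assumption chosen so that the scores fall in the stated [0, 1] range), IFS can be sketched as:</p>

```python
from math import log2

def ifs(p_t_c, p_t_nc):
    """IFS score of term t for class c, following our reading of Eq. 7.

    p_t_c:  P(t | c),   probability of t appearing in class c
    p_t_nc: P(t | c'),  probability of t appearing in the other class(es)
    The base-2 logarithm is an assumption, not stated in the paper.
    """
    q_t_c, q_t_nc = 1.0 - p_t_c, 1.0 - p_t_nc           # absence probabilities
    sep = abs(p_t_c * q_t_nc - p_t_nc * q_t_c)          # presence/absence contrast
    gap = abs(p_t_c - p_t_nc) / (min(p_t_c, p_t_nc) + 1.0)
    return log2(sep * gap + 1.0)
```

A perfect discriminator (present in every document of one class, absent from the other) scores 1, while a feature appearing equally in both classes scores 0, matching the behaviour described above.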
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        A 10-fold cross-validation of text classification on both unbalanced and balanced
datasets of spam emails and SMS messages has been carried out to test the performance
of the IFS method and its sensitivity to the number of documents per class and
the distribution ratio of documents between classes. In this experiment we
followed [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and used binary classification, because the results of binary
classification reflect the effectiveness of the used measure more directly than
those of multi-class classification [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Moreover, resolving binary text classification
also means resolving the multi-class case [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For this experiment, two supervised
algorithms, namely Support Vector Machines (SVM) and Neural Networks (NN), have
been employed, as both algorithms perform well in text classification [
        <xref ref-type="bibr" rid="ref1 ref12">1,
12</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>Dataset and Performance Measure</title>
        <p>
          Two sets of data have been used in our experiments: the enron1 email
collection [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], 5172 emails in total, of which 1500 are spam and 3672 are
legitimate, and an SMS collection, introduced in [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ], which
consists of 5574 SMS messages, of which 4827 are legitimate and 747 spam.
From these two datasets we created three test datasets, to enable us to
test the sensitivity of the methods with respect to the allocation and distribution
of documents in classes. The first (balanced) test dataset consists of 3000 emails:
the 1500 spam emails and 1500 randomly chosen legitimate emails. The second
(unbalanced) set consists of 4500 emails: the 1500 spam emails and
3000 randomly chosen legitimate emails. The third (unbalanced) test dataset
is based on the SMS collection and consists of 3323 messages: the 747 spam SMS and
2576 randomly chosen legitimate ones. The top-n subset of
features with the highest weight was used in the experiments, with n set to 10, 15,
25, 50, 100, 200, 300, 400 and 500.
        </p>
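<p>The protocol above (rank all terms, keep the top n, then cross-validate) can be sketched as follows. The snippet uses scikit-learn and a toy corpus, both our own assumptions for illustration, with the chi2 ranking standing in for any of the compared selectors:</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the spam/legitimate documents (assumed data).
docs = ["win cash prize now"] * 10 + ["meeting agenda attached"] * 10
labels = [1] * 10 + [0] * 10

pipe = make_pipeline(
    CountVectorizer(),       # Bag-of-Words features
    SelectKBest(chi2, k=3),  # keep the top-n ranked terms (n = 3 here)
    LinearSVC(),             # SVM classifier
)
scores = cross_val_score(pipe, docs, labels, cv=10, scoring="f1_micro")
print(round(scores.mean(), 2))
```

Building the vectoriser and the selector inside the pipeline ensures that ranking happens on the training folds only, which the cross-validation protocol requires.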
        <p>
          The performance of all methods was assessed using the F-measure (F1) based
on precision and recall as de ned in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] using micro-averaging.
        </p>
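<p>Micro-averaging pools the per-class decision counts before computing precision and recall; a minimal sketch (the (tp, fp, fn) triple layout is our assumption):</p>

```python
def micro_f1(per_class):
    """Micro-averaged F1 from per-class (tp, fp, fn) triples."""
    tp = sum(t[0] for t in per_class)  # pooled true positives
    fp = sum(t[1] for t in per_class)  # pooled false positives
    fn = sum(t[2] for t in per_class)  # pooled false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```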
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Result Analysis</title>
      <sec id="sec-5-1">
        <title>Unbalanced datasets</title>
        <p>The F-measures of the 10-fold cross-validation on the unbalanced email and SMS
datasets with the SVM and NN classifiers are shown in Figures 1, 2, 3 and 4,
respectively. Based on these results, IFS performs well with both classifiers. Across
all feature set sizes, IFS is either superior or comparable to the other methods,
while in some instances it ranks second best.</p>
        <p>[Figures 1-4: F-measure versus number of features (0-500) for CDM, Chi, IG, OR, MI and IFS.]</p>
        <p>The F-measures of the 10-fold cross-validation of the above-mentioned
methods using the SVM and NN classifiers on the balanced dataset of 3000 emails
(ratio 1:1) are shown in Figures 5 and 6. Considering the F-measure, IFS again
performs well on the balanced dataset; the results show that IFS is superior to the
other methods and only in some cases scores slightly behind. In general, we can
conclude that IFS robustly performs well with both the SVM and NN classifiers,
compared to the other methods.</p>
        <p>[Figures 5 and 6: F-measure versus number of features (0-500) for CDM, Chi, IG, OR, MI and IFS on the balanced email dataset.]</p>
        <p>This paper has introduced a new method, called IFS, to select informative
features for classification. The method has been evaluated against several
well-established feature space reduction methods. Judging by the F-measure values of
the 10-fold cross-validation results and by processing time, IFS has shown itself to
be robust and often superior, and at least competitive, compared to the other methods.
In future work, we aim to test IFS on multi-class datasets. We also aim to improve the
method by calculating the frequency of a feature within a document and adding its
weight to the overall usefulness score of the feature.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alan</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Abrahams</surname>
          </string-name>
          , Eloise Coupey,
          <string-name>
            <surname>Eva</surname>
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Zhong</surname>
          </string-name>
          , Reza Barkhi, and
          <string-name>
            <surname>Pete</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Manasantivongs</surname>
          </string-name>
          .:
          <article-title>Audience targeting by B-to-B advertisement classification: A neural network approach</article-title>
          .
          <source>Expert Systems with Applications</source>
          , Elsevier,
          <volume>40</volume>
          (
          <issue>8</issue>
          ):
          <fpage>2777</fpage>
          -
          <lpage>2791</lpage>
          , (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .:
          <source>Machine Learning in Automated Text Categorization. ACM computing surveys (CSUR)</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          , (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          and Jan O Pedersen.
          <article-title>: A Comparative Study on Feature Selection in Text Categorization</article-title>
          .
          <source>The Fourteenth International Conference on Machine Learning (ICML)</source>
          ,
          <volume>97</volume>
          :
          <fpage>412</fpage>
          -
          <lpage>420</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Jingnian</given-names>
            <surname>Chen</surname>
          </string-name>
          , Houkuan Huang,
          <string-name>
            <given-names>Shengfeng</given-names>
            <surname>Tian</surname>
          </string-name>
          , and Youli Qu.
          <article-title>: Feature selection for text classification with Naive Bayes</article-title>
          .
          <source>Expert Systems with Applications</source>
          , Elsevier,
          <volume>36</volume>
          (
          <issue>3</issue>
          ):
          <fpage>5432</fpage>
          -
          <lpage>5435</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>George</given-names>
            <surname>Forman</surname>
          </string-name>
          .:
          <article-title>An Extensive Empirical Study of Feature Selection Metrics for Text Classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>1289</fpage>
          -
          <lpage>1305</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Jun-Ming</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaojin</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Amy</given-names>
            <surname>Bellmore</surname>
          </string-name>
          .
          <article-title>Fast learning for sentiment analysis on bullying</article-title>
          .:
          <source>In Proceedings of the First International Workshop on Issues of Sentiment Discovery and Opinion Mining - WISDOM</source>
          , New York, USA, ACM Press, 1-
          <fpage>6</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Hiroshi</given-names>
            <surname>Ogura</surname>
          </string-name>
          , Hiromi Amano, and Masato Kondo.:
          <article-title>Feature selection with a measure of deviations from Poisson in text categorization</article-title>
          .
          <source>Expert Systems with Applications</source>
          , Elsevier,
          <volume>36</volume>
          (
          <issue>3</issue>
          ):
          <fpage>6826</fpage>
          -
          <lpage>6832</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Dunja</given-names>
            <surname>Mladenic</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marko</given-names>
            <surname>Grobelnik</surname>
          </string-name>
          .:
          <article-title>Feature selection for unbalanced class distribution and Naive Bayes</article-title>
          .
          <source>In Proceedings of the Sixteenth International Conference on Machine Learning</source>
          ,
          <fpage>258</fpage>
          -
          <lpage>267</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Yan</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>A Comparative Study on Feature Selection in Unbalance Text Classification</article-title>
          .
          <source>In Information Science and Engineering (ISISE)</source>
          ,
          <source>2012 International Symposium, IEEE-CPS</source>
          ,
          <fpage>44</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <article-title>Matthew Chang and Chung Keung Poon.: Using phrases as features in email classification</article-title>
          .
          <source>The Journal of Systems and Software, Elsevier</source>
          ,
          <volume>82</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1036</fpage>
          -
          <lpage>1045</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hiroshi</surname>
            <given-names>Ogura</given-names>
          </string-name>
          , Hiromi Amano, and Masato Kondo.:
          <article-title>Feature selection with a measure of deviations from Poisson in text categorization</article-title>
          .
          <source>Expert Systems with Applications</source>
          , Elsevier,
          <volume>36</volume>
          (
          <issue>3</issue>
          ):
          <fpage>6826</fpage>
          -
          <lpage>6832</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <article-title>A re-examination of text categorization methods</article-title>
          .
          <source>Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval SIGIR 99</source>
          , pp.
          <fpage>42</fpage>
          -
          <lpage>49</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Vangelis</given-names>
            <surname>Metsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G</given-names>
            <surname>Paliouras</surname>
          </string-name>
          .
          <article-title>: Spam filtering with naive bayes - which naive bayes?</article-title>
          .
          <source>CEAS, the third conference on Email and Anti-Spam</source>
          ,
          <fpage>27</fpage>
          -
          <lpage>28</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Gordon</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Cormack</surname>
          </string-name>
          , Jose Maria Gomez Hidalgo, and Enrique Puertas Sanz.:
          <article-title>Feature engineering for mobile (SMS) spam filtering</article-title>
          .
          <source>In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <fpage>871</fpage>
          -
          <lpage>872</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Tiago</surname>
            <given-names>A Almeida</given-names>
          </string-name>
          , Jose Maria G Hidalgo,
          <article-title>and Akebo Yamakami.: Contributions to the study of SMS spam filtering: new collection and results</article-title>
          .
          <source>In Proceedings of the 11th ACM symposium on Document engineering</source>
          ,
          <fpage>259</fpage>
          -
          <lpage>262</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>