<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Micro-Documents for Feature Selection: The Case of Ordinal Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Baccianella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Sebastiani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Most popular feature selection methods for text classification (TC) are based on binary information concerning the presence/absence of the feature in each training document. As such, these methods do not exploit term frequency information. In order to overcome this drawback we break down each training document of length k into k training "micro-documents", each consisting of a single word occurrence and endowed with the class information of the original training document. We study the impact of this strategy in the case of ordinal TC; the experiments show that this strategy substantially improves effectiveness.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Feature selection (FS) is a technique for reducing the dimensionality of a vector
space in learning tasks (see e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). It consists in identifying a subset S ⊆ T
of the original feature set T such that |S| ≪ |T| (with ξ = |S|/|T| called the
reduction level) and such that S reaches the best compromise between (a) the
efficiency of the learning process and of the classifiers, and (b) the effectiveness
of the resulting classifiers. In text classification (TC) the most popular approach
to FS is the filter approach: a real-valued function f is applied to each feature
in T in order to compute its expected contribution to solving the classification
task, and only the |S| features with the highest f value are retained.
      </p>
      <p>The most popular instances of function f, such as information gain (a.k.a.
mutual information), chi-square, odds ratio, pointwise mutual information, and
the like, are based on binary information concerning the presence/absence of the
feature in each training document. For instance, in pointwise mutual information,
defined as PMI(tk, cj) = log2 [ P(tk, cj) / (P(tk)P(cj)) ], the value P(tk) is the probability that
feature tk occurs at all in a random training document. As such, PMI and all
the other above-mentioned functions do not exploit a rich source of information,
namely, the information concerning how many times tk occurs in a given training
document (term frequency). In this paper we propose a filter approach to FS
which attempts to overcome this drawback. The approach consists in breaking
down each training document di into length(di) training "micro-documents"
(μ-documents), each consisting of a single word occurrence and endowed with the
same class information as the original training document. In this paper we limit
our experiments to the case of "ordinal" text classification (see below).</p>
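To make the presence/absence point concrete, the following sketch (ours, not from the paper; the toy data and all names are illustrative) computes PMI from binary presence counts; note that repeated occurrences of a term inside a document do not change its score:

```python
import math

# Toy training set: (document tokens, class label).
train = [
    (["good", "great", "good"], "pos"),
    (["bad", "awful"], "neg"),
    (["good", "bad"], "pos"),
    (["awful", "bad", "bad"], "neg"),
]

def pmi(term, cls, docs):
    """Pointwise mutual information from binary presence/absence:
    PMI(t, c) = log2( P(t, c) / (P(t) P(c)) )."""
    n = len(docs)
    p_t = sum(1 for toks, _ in docs if term in toks) / n
    p_c = sum(1 for _, c in docs if c == cls) / n
    p_tc = sum(1 for toks, c in docs if term in toks and c == cls) / n
    if p_tc == 0:
        return float("-inf")
    return math.log2(p_tc / (p_t * p_c))

print(pmi("good", "pos", train))  # 1.0: "good" occurs only in "pos" documents
```

Because only presence counts enter the formula, the two occurrences of "good" in the first document contribute exactly as much as a single occurrence would; this is the information the μ-document trick recovers.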
      <p>This paper is organized as follows. In Section 2 we present our
μ-document-based approach to FS. Section 3 describes experiments we have conducted using
two SVM-based learning methods and two large datasets of product reviews.</p>
    </sec>
    <sec id="sec-2">
      <title>Feature Selection for OC based on Training μ-documents</title>
      <p>Let us fix some terminology and notation. Ordinal classification (OC, also
known as ordinal regression) consists in estimating (from a training set Tr) a
target function Φ : X → R which maps each object xi ∈ X into exactly one of
an ordered sequence (here called rankset) R = ⟨r1, …, rn⟩ of ranks (a.k.a.
"scores", or "labels", or "classes"). The result of the estimation is a function Φ̂,
called the classifier, which we will evaluate on a test set Te. Our FS methods
will typically consist of (a) scoring each feature tk ∈ T by means of a function
that measures the predicted utility of tk for the classification process, and (b),
given a predetermined reduction level ξ, selecting the |S| = ξ|T| top-scoring
features.</p>
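The generic filter scheme in (a)-(b) can be sketched as follows (a minimal illustration under our own naming; the actual scoring functions f are those defined in [2]):

```python
def select_features(scores, xi):
    """Filter-style FS: keep the |S| = xi * |T| features with the
    highest utility score f(t). 'scores' maps feature -> f value."""
    k = max(1, int(xi * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

# Toy scores; with reduction level xi = 0.5, half the features survive.
scores = {"good": 1.0, "bad": 0.9, "the": 0.01, "a": 0.0}
print(select_features(scores, 0.5))  # {'good', 'bad'} (set order may vary)
```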
      <p>
        The FS methods that we use in this paper are the Var*IDF, RR(Var*IDF),
RR(IGOR) and RR(AC*IDF) methods originally defined in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (an extended
version of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), to which we refer the reader for details. All these functions only
use information concerning the presence/absence of feature tk in training
document di. We attempt to overcome this drawback by breaking down each training
document di into length(di) training "μ-documents", each consisting of a single
word occurrence and endowed with the same class information as the original
training document. The training set Tr is then replaced, for FS purposes only,
by the set of the training "μ-documents" obtained from it. All the original FS
methods are obviously still applicable after this move; however, these
methods are now de facto sensitive to term frequency, since a training document dj
belonging to class ci and containing r occurrences of feature tk has generated
(among others) r training μ-documents containing (only) tk and belonging to ci.
      </p>
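The μ-document transformation itself is straightforward; a sketch (ours, with illustrative names) of the expansion, showing that r occurrences of a term yield r μ-documents:

```python
def to_micro_documents(tokens, label):
    """Break a training document of length k into k micro-documents,
    each holding a single word occurrence plus the original class."""
    return [([tok], label) for tok in tokens]

micro = to_micro_documents(["good", "great", "good"], "pos")
# The document had 2 occurrences of "good", hence 2 micro-documents
# whose only content is "good"; a presence-based FS function applied
# to the micro-document collection therefore sees term frequency.
```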
      <p>
        The move from training documents to training μ-documents is, as far as FS
is concerned, akin to the move, in naïve Bayesian learners, from a multivariate
Bernoulli event model (where documents are events) to a multinomial event
model (where word occurrences are events). In the context of TC this move
was originally discussed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, in that case the authors reported that
little difference in performance was found when selecting features via the former
model rather than via the latter model (no actual effectiveness figures were given,
though). Our work may be seen as exporting that idea outside the realm of naïve
Bayesian learners, and outside the realm of single-label TC, neither of which has
been done before to the best of our knowledge.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>
        We have tested the proposed method on two different datasets for ordinal text
classification. The first is the TripAdvisor-15763 dataset, with the same split
between training and test documents as used in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], resulting in 10,508 documents
used for training and 5,255 for test. The second dataset is Amazon-83713,
with the same split between training and test documents as in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], resulting in
20,000 documents used for training and 63,713 for test. Both datasets consist
of textual product reviews scored on a scale from 1 to 5 "stars". As our main
evaluation measure we use the macroaveraged mean absolute error (MAE<sup>M</sup>)
measure proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
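A sketch (our own, following the definition in [5]; variable names are illustrative) of the macroaveraged MAE, which averages the per-rank MAE so that infrequent ranks count as much as frequent ones:

```python
def macro_mae(true, pred, ranks):
    """Macroaveraged mean absolute error: mean over ranks of the
    per-true-rank MAE (ranks absent from the test set are skipped)."""
    per_rank = []
    for r in ranks:
        idx = [i for i, t in enumerate(true) if t == r]
        if idx:
            per_rank.append(sum(abs(pred[i] - true[i]) for i in idx) / len(idx))
    return sum(per_rank) / len(per_rank)

true = [1, 1, 1, 5]          # mostly 1-star reviews, one 5-star
pred = [1, 1, 1, 3]
print(macro_mae(true, pred, range(1, 6)))  # 1.0 (plain MAE would be 0.5)
```

Macroaveraging prevents a classifier that is accurate only on the majority rank from looking deceptively good, which matters on star-rating datasets with skewed rank distributions.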
      <p>
        We have tested our methods with two different SVM-based learning
algorithms for ordinal regression, ε-SVR and SVOR; see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for more details. As the
baselines against which to test our μ-document-based approach we have used
the results we obtained in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (on the same datasets and with the same
learning algorithms) with the versions of the Var*IDF, RR(Var*IDF),
RR(IGOR) and RR(AC*IDF) methods based on "regular" training documents. We
have set the ε and C parameters of both learners to the optimal values that we
had obtained in the experiments of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; this means that the parameters are
optimal for the baselines but not necessarily for the methods proposed here, which
lends even higher value to the results obtained by the methods proposed here.
      </p>
      <p>
        The experimental protocol essentially conforms to that of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As a vectorial
representation, after stop word removal (and no stemming) we have used a
standard bag-of-words model with cosine-normalized tf*idf weighting. We have run all our
experiments for all the 100 reduction levels ξ ∈ {0.001, 0.01, 0.02, 0.03, …, 0.99}.
      </p>
      <p>
        For the Var*IDF, RR(Var*IDF) and RR(AC*IDF) methods we have
set the smoothing parameter (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) to 0.1. For the same methods we have
used the optimal (individually for each method) values of the a parameter (see
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) that we had obtained in the experiments of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; again, this means that the
parameters are optimal for the baselines but not necessarily for the methods
proposed here. For RR(AC*IDF), the E error measure (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) was taken
to be |Φ̂(di) − Φ(di)| (i.e., absolute error), given that it is the document-level
analogue of MAE<sup>M</sup>.
      </p>
      <sec id="sec-3-1">
        <title>Results</title>
        <p>
          The main observation to be made from the results of our experiments (which
are not reported here in detail for reasons of space; see [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for more details) is that
the use of training μ-documents substantially enhances the accuracy of ordinal
TC, since it is practically always the case that the MAE<sup>M</sup> values of the
μ-document-based versions are better than the corresponding values of the
"regular document"-based versions, irrespective of FS function, dataset, and learner.
Overall, these results derive from a massive experimental effort, consisting of
(100 reduction levels × 2 datasets × 2 learners × 4 FS functions =) 1600 new
train-and-test experiments (in addition to the 1600 that produced
our baselines and that were already presented in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]).
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have shown that using μ-documents in place of "regular" training documents
in feature selection substantially improves the effectiveness of ordinal text
classification. In future experiments we plan to validate this method on the more
standard cases of binary classification and multiclass classification.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>J.O.:</given-names>
          </string-name>
          <article-title>A comparative study on feature selection in text categorization</article-title>
          .
          <source>In: Proceedings of the 14th International Conference on Machine Learning (ICML'97)</source>
          , Nashville, US (
          <year>1997</year>
          )
          <fpage>412</fpage>
          –
          <lpage>420</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baccianella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Feature selection for ordinal text classification</article-title>
          .
          <source>Technical Report 2010-TR-014</source>
          ,
          <institution>Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa</institution>
          , IT
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baccianella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Feature selection for ordinal regression</article-title>
          .
          <source>In: Proceedings of the 25th ACM Symposium on Applied Computing (SAC'10)</source>
          , Sierre, CH (
          <year>2010</year>
          )
          <fpage>1748</fpage>
          –
          <lpage>1754</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nigam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A comparison of event models for naive Bayes text classification</article-title>
          .
          <source>In: Proceedings of the AAAI Workshop on Learning for Text Categorization</source>
          , Madison, US (
          <year>1998</year>
          )
          <fpage>41</fpage>
          –
          <lpage>48</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Baccianella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Evaluation measures for ordinal text classification</article-title>
          .
          <source>In: Proceedings of the 9th IEEE International Conference on Intelligent Systems Design and Applications (ISDA'09)</source>
          , Pisa, IT (
          <year>2009</year>
          )
          <fpage>283</fpage>
          –
          <lpage>287</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Baccianella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Using micro-documents for feature selection: The case of ordinal text classification</article-title>
          .
          <source>Technical Report 2011-TR-001</source>
          ,
          <institution>Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa</institution>
          , IT
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>