<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Selecting Features for Ordinal Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Baccianella</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Sebastiani</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>We present four new feature selection methods for ordinal regression and test them against four different baselines on two large datasets of product reviews.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>In (text) classification, feature selection (FS) consists in
identifying a subset S ⊆ T of the original feature set T such
that |S| ≪ |T| (ξ = |S|/|T| being called the reduction level)
and such that S reaches the best compromise between (a)
the effectiveness of the resulting classifiers and (b) the
efficiency of the learning process and of the classifiers. While
feature selection has been extensively investigated for
standard classification, it has not for a related and important
learning task, namely, ordinal classification (OC, aka
ordinal regression). OC consists in estimating a target function
Φ : D → R mapping each object di ∈ D into exactly one of
an ordered sequence R = ⟨r1, ..., rn⟩ of ranks, by means
of a function Φ̂ called the classifier.</p>
      <p>
        We here address the problem of feature selection for OC.
We use a "filter" approach, in which a scoring function Score
is applied to each feature tk ∈ T in order to measure the
predicted utility of tk for the classification task (the higher
the value of Score, the higher the predicted utility), after
which only the |S| top-scoring features are retained. We
have designed four novel feature scoring functions for OC,
and tested them on two datasets of product review data,
using as baselines the only three feature scoring functions for
OC previously proposed in the literature, i.e., the probability
redistribution procedure (PRP) function proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and
the minimum variance (Var) and round robin on minimum
variance (RR(Var)) functions proposed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
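      <p>As an illustration of the filter approach, selection reduces to scoring each feature and truncating the ranking at the chosen reduction level (a minimal sketch, not code from the paper; the document-frequency scoring function is a made-up stand-in for the functions discussed next):

```python
def filter_select(features, score, reduction_level):
    """Keep the top-scoring fraction of the feature set.

    features: the original feature set T, as a list of identifiers
    score: a function mapping a feature to its predicted utility
    reduction_level: xi = |S| / |T|, the fraction of features to keep
    """
    keep = max(1, int(len(features) * reduction_level))
    # rank features by decreasing predicted utility, retain the top |S|
    ranked = sorted(features, key=score, reverse=True)
    return ranked[:keep]

# Toy usage: document frequency as a stand-in scoring function
df = {"good": 120, "excellent": 35, "the": 980, "zorblatt": 1}
selected = filter_select(list(df), score=lambda t: df[t], reduction_level=0.5)
```
</p>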
      <p>This is a short summary of a paper forthcoming in the
Proceedings of the 25th ACM Symposium on Applied
Computing (SAC'10), Sierre, CH, 2010.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. FEATURE SELECTION FOR OC</title>
      <p>
        Our first method, Var*IDF, is a variant of the Var method
described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and is meant to prevent features occurring
very infrequently, such as hapax legomena, from being top-ranked,
which would obviously be undesirable. It is defined as
Score(tk) = Var(tk) · (IDF(tk))^a   (1)
where IDF is the standard inverse document frequency and
a is a parameter (to be optimized on a validation set) that
allows one to fine-tune the relative contributions of variance and
IDF to the product. Given that Var(tk) = 0 implies that
Score(tk) = 0, we smooth Var(tk) by adding to it a small
value ε = 0.1 prior to multiplying it by IDF(tk).
      </p>
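      <p>A sketch of Equation 1 with the ε = 0.1 smoothing, under the assumption (suggested by the Var method of [2]) that Var(tk) is the variance of the ranks of the training documents containing tk, and that IDF(tk) = log(N / df(tk)); the function and variable names are invented:

```python
import math
from statistics import pvariance

def var_idf(ranks_of_docs_with_tk, num_docs, a=1.0, eps=0.1):
    """Compute (Var(tk) + eps) * IDF(tk)**a: Equation 1 with the
    smoothing value eps = 0.1 described in the text."""
    variance = pvariance(ranks_of_docs_with_tk)            # Var(tk)
    idf = math.log(num_docs / len(ranks_of_docs_with_tk))  # IDF(tk)
    return (variance + eps) * idf ** a

# A hapax legomenon has zero variance, so without the smoothing its
# (large) IDF would be silenced by the multiplication with zero
hapax = var_idf([3], num_docs=1000)
common = var_idf([3, 3, 4, 3, 2, 3], num_docs=1000)
```
</p>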
      <p>Our second method, RR(Var*IDF), is a variant of Var*IDF
meant to prevent it from exclusively catering for a
certain rank and disregarding the others. It consists in (i)
provisionally "assigning" each feature tk to the rank closest
to the mean of its distribution across the ranks; (ii) sorting,
for each rank rj, the features provisionally assigned to rj
in decreasing order of their value of the Score function of
Equation 1; and (iii) enforcing a "round robin" policy in
which the n ranks take turns, for x/n rounds, in picking their
favourite features from the top-most elements of their
rank-specific orderings.</p>
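      <p>The round-robin step (iii) can be sketched as follows (a minimal illustration, not the authors' implementation; the feature names are invented):

```python
def round_robin(per_rank_orderings, x):
    """The ranks take turns picking their favourite not-yet-taken
    feature until x features have been selected.

    per_rank_orderings: dict mapping each rank to its provisionally
        assigned features, sorted by decreasing Score (step (ii))
    x: total number of features to select
    """
    selected = []
    taken = set()
    queues = {rank: list(feats) for rank, feats in per_rank_orderings.items()}
    while len(selected) != x:
        progress = False
        for rank in queues:
            # skip features already picked by another rank
            while queues[rank] and queues[rank][0] in taken:
                queues[rank].pop(0)
            if queues[rank] and len(selected) != x:
                feature = queues[rank].pop(0)
                taken.add(feature)
                selected.append(feature)
                progress = True
        if not progress:
            break  # every queue exhausted before reaching x
    return selected

orderings = {
    1: ["awful", "bad", "poor"],
    5: ["awful", "great", "superb"],  # "awful" also scores high for rank 5
}
picked = round_robin(orderings, x=4)
```

Rank 1 takes "awful" first, so rank 5 skips it and takes its next favourite, keeping the selected set balanced across ranks.
</p>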
      <p>Our third method, RR(IGOR), is based on the idea of
viewing ordinal classification on R = ⟨r1, ..., rn⟩ as the
generation of (n−1) binary classifiers Φ̂j, each of which
is in charge of separating Rj = {r1, ..., rj} from R̄j =
{rj+1, ..., rn}, for j = 1, ..., (n−1). For each feature tk
we thus compute (n−1) different values of IG(tk, cj) (the
classic information gain function of binary classification),
by taking cj = r1 ∪ ... ∪ rj and c̄j = rj+1 ∪ ... ∪ rn, for
j = 1, ..., (n−1). Similarly to RR(Var*IDF), we (i) sort,
for each of the (n−1) binary classifiers Φ̂j, the features in
decreasing order of IG(tk, cj) value, and (ii) enforce a
round-robin policy in which the (n−1) classifiers Φ̂j take turns,
for x/(n−1) rounds, in picking their favourite features from the
top-most elements of their classifier-specific orderings.</p>
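      <p>A sketch of how the (n−1) information-gain values might be computed from per-rank document counts (the counts, helper names, and data below are invented for illustration):

```python
import math

def entropy(pos, neg):
    """Binary entropy, in bits, of a class split with pos/neg documents."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def ig_ordinal_splits(docs_with_tk, docs_without_tk, n):
    """For j = 1..(n-1), the information gain of feature tk for the
    binary task separating ranks {r1..rj} from {r(j+1)..rn}.

    docs_with_tk / docs_without_tk: dicts mapping each rank (1..n) to
    the number of training documents of that rank containing / lacking tk.
    """
    gains = []
    for j in range(1, n):
        low = set(range(1, j + 1))  # c_j = r1 ∪ ... ∪ rj
        pos_t = sum(c for r, c in docs_with_tk.items() if r in low)
        neg_t = sum(c for r, c in docs_with_tk.items() if r not in low)
        pos_no = sum(c for r, c in docs_without_tk.items() if r in low)
        neg_no = sum(c for r, c in docs_without_tk.items() if r not in low)
        total = pos_t + neg_t + pos_no + neg_no
        # IG = prior entropy minus expected entropy after observing tk
        prior = entropy(pos_t + pos_no, neg_t + neg_no)
        cond = ((pos_t + neg_t) / total) * entropy(pos_t, neg_t)
        cond += ((pos_no + neg_no) / total) * entropy(pos_no, neg_no)
        gains.append(prior - cond)
    return gains

# tk occurs mostly in 1- and 2-star reviews, so the split between
# ranks {1,2} and {3,4,5} (j = 2) yields the highest gain
with_tk = {1: 8, 2: 6, 3: 1, 4: 0, 5: 0}
without_tk = {1: 2, 2: 4, 3: 9, 4: 10, 5: 10}
gains = ig_ordinal_splits(with_tk, without_tk, n=5)
```
</p>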
      <p>Our fourth method, RR(NC*IDF), directly optimizes
the chosen error measure E. Let us define the negative
correlation of tk with rj in the training set Tr as
NC_Tr(tk, rj) = ( Σ_{di ∈ Tr : tk ∈ di} E(Φ̃j, di) ) / |{di ∈ Tr | tk ∈ di}|
where Φ̃j is the "trivial" classifier that assigns all di ∈ D to rj
and E(Φ̂, di) represents the error that Φ̂ makes in classifying
di. Let the rank R(tk) associated to a feature tk be
R(tk) = arg min_{rj ∈ R} NC_Tr(tk, rj)
Score(tk) is then defined, analogously to Equation 1, as the
product of a term based on NC_Tr(tk, R(tk)) and (IDF(tk))^a,
where the a parameter serves the same purpose as in
Equation 1. Similarly to the 2nd and 3rd methods, to select
the best x features we apply a round-robin policy in which
each rj is allowed to pick, among the features such that
R(tk) = rj, the x/n features with the best Score.</p>
      <p>[Figure 1: MAE^M as a function of the reduction level on
Amazon-83713, for the FFS (full feature set), Triv, Var, RR(Var),
PRP, RR(IGOR), Var*IDF, RR(Var*IDF), and RR(NC*IDF) methods.]</p>
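      <p>The NC_Tr and R(tk) definitions can be sketched as follows (a minimal illustration using the absolute-error choice of E reported in the experiments; the names are invented):

```python
def negative_correlation(ranks_of_docs_with_tk, rj, error):
    """NC_Tr(tk, rj): the average error made, over the training
    documents containing tk, by the trivial classifier that assigns
    rank rj to every document."""
    errors = [error(rj, true_rank) for true_rank in ranks_of_docs_with_tk]
    return sum(errors) / len(errors)

def best_rank(ranks_of_docs_with_tk, ranks, error):
    """R(tk) = arg min over rj in R of NC_Tr(tk, rj)."""
    return min(ranks,
               key=lambda rj: negative_correlation(ranks_of_docs_with_tk, rj, error))

# Absolute distance between predicted and true rank as E
error = lambda predicted, true: abs(predicted - true)
# tk occurs almost only in 5-star reviews, so its associated rank is 5
r_tk = best_rank([5, 5, 4, 5], ranks=range(1, 6), error=error)
```
</p>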
      <p>
        More details on these methods can be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. EXPERIMENTS</title>
      <p>
        We have tested the proposed measures on two different datasets,
TripAdvisor-15763 and Amazon-83713, whose
characteristics are summarized in Table 1. Both datasets consist of
product reviews scored on a scale of one to five "stars", and
(as shown by Table 1) are highly imbalanced. See [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for
more details.
      </p>
      <p>
        As the evaluation measure we use the macroaveraged mean
absolute error (MAE^M), proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and defined as
MAE^M(Φ̂, Te) = (1/n) Σ_{j=1..n} (1/|Tej|) Σ_{di ∈ Tej} |Φ̂(di) − Φ(di)|
where Tej denotes the set of test documents whose true rank
is rj and the "M" superscript indicates "macroaveraging".
      </p>
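      <p>A minimal sketch of this measure (not the authors' code; for simplicity it skips ranks with no test documents rather than dividing by a fixed n):

```python
def macro_mae(predicted, true, ranks):
    """Macroaveraged mean absolute error: MAE is computed separately on
    the test documents of each true rank, then averaged across ranks,
    so that on imbalanced test sets infrequent ranks weigh as much as
    frequent ones."""
    per_rank = []
    for rj in ranks:
        errors = [abs(p, ) for p in ()]  # placeholder, replaced below
    per_rank = []
    for rj in ranks:
        errors = [abs(p - t) for p, t in zip(predicted, true) if t == rj]
        if errors:  # ranks absent from the test set are skipped
            per_rank.append(sum(errors) / len(errors))
    return sum(per_rank) / len(per_rank)

# A classifier that always predicts 1 star looks good under plain MAE
# (0.8 on this toy test set) but is punished by macroaveraging
mae_m = macro_mae(predicted=[1, 1, 1, 1, 1], true=[1, 1, 1, 1, 5],
                  ranks=range(1, 6))
```
</p>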
      <p>We compare our methods with the three baselines
mentioned at the end of Section 1 and with the "trivial baseline"
that consists in scoring all test documents as 3 stars.</p>
      <p>
        As a learning device we use ν-support vector regression
(ν-SVR) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as implemented in the freely available LibSvm
library. As a vectorial representation, after stop word
removal (and no stemming) we use standard bag-of-words
with cosine-normalized tf·idf weighting. We have run all
our experiments for all the 100 reduction levels ξ ∈ {0.001,
0.01, 0.02, 0.03, ..., 0.99}. For the Var*IDF, RR(Var*IDF)
and RR(NC*IDF) methods we have (individually
for each method) optimized the a parameter on a validation
set Va extracted from the training set Tr, and then
retrained the optimized classifier on the full training set Tr.
For RR(NC*IDF), E(Φ̂, di) was taken to be |Φ̂(di) − Φ(di)|.
      </p>
      <p>
        The results of our tests are displayed in Figure 1, in which
effectiveness is plotted as a function of the tested
reduction level. For reasons of space only the Amazon-83713
results are displayed (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for the TripAdvisor-15763 results,
which are anyway fairly similar). The experiments show that
(i) the four novel techniques proposed here are
dramatically superior to the four baselines; (ii)
our four techniques are fairly stable across ξ ∈ [0.05, 1.0],
and deteriorate, sometimes rapidly, only for the very
aggressive levels, i.e., for ξ ∈ [0.001, 0.05], in
stark contrast with the instability of the baselines; and
(iii) for ξ ∈ [0.01, 0.3] the proposed techniques even
outperform the full feature set. This indicates that one
can reduce the feature set by an order of magnitude
(with the ensuing benefits in terms of training-time
and testing-time efficiency) and obtain an accuracy
equal to or even slightly superior (roughly a 10%
improvement, in the best cases) to that obtainable with the full
feature set.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Baccianella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Evaluation measures for ordinal text classification</article-title>
          .
          <source>In Proceedings of the 9th IEEE Int'l Conference on Intelligent Systems Design and Applications (ISDA'09)</source>
          , pages
          <fpage>283</fpage>
          –
          <lpage>287</lpage>
          , Pisa, IT
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Baccianella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Multi-facet rating of product reviews</article-title>
          .
          <source>In Proceedings of the 31st European Conference on Information Retrieval (ECIR'09)</source>
          , pages
          <fpage>461</fpage>
          –
          <lpage>472</lpage>
          , Toulouse, FR,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Baccianella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Feature selection for ordinal regression</article-title>
          .
          <source>In Proceedings of the 25th ACM Symposium on Applied Computing (SAC'10)</source>
          , Sierre, CH
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mukras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wiratunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lothian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakraborti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Harper</surname>
          </string-name>
          .
          <article-title>Information gain feature selection for ordinal text classification using probability re-distribution</article-title>
          .
          <source>In Proceedings of the IJCAI'07 Workshop on Text Mining and Link Analysis</source>
          , Hyderabad, IN,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Williamson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Bartlett</surname>
          </string-name>
          .
          <article-title>New support vector algorithms</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>12</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1207</fpage>
          –
          <lpage>1245</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>