Selecting Features for Ordinal Text Classification∗

Stefano Baccianella, Andrea Esuli and Fabrizio Sebastiani
Istituto di Scienza e Tecnologie dell'Informazione
Consiglio Nazionale delle Ricerche
Via Giuseppe Moruzzi, 1 – 56124 Pisa, Italy
firstname.lastname@isti.cnr.it

ABSTRACT
We present four new feature selection methods for ordinal regression and test them against four different baselines on two large datasets of product reviews.

1. INTRODUCTION
In (text) classification, feature selection (FS) consists in identifying a subset S ⊂ T of the original feature set T such that |S| ≪ |T| (ξ = |S|/|T| being called the reduction level) and such that S reaches the best compromise between (a) the effectiveness of the resulting classifiers and (b) the efficiency of the learning process and of the classifiers. While feature selection has been extensively investigated for standard classification, it has not for a related and important learning task, namely, ordinal classification (OC – aka ordinal regression). OC consists in estimating a target function Φ : D → R mapping each object di ∈ D into exactly one of an ordered sequence R = ⟨r1 ≺ … ≺ rn⟩ of ranks, by means of a function Φ̂ called the classifier.

We here address the problem of feature selection for OC. We use a "filter" approach, in which a scoring function Score is applied to each feature tk ∈ T in order to measure the predicted utility of tk for the classification task (the higher the value of Score, the higher the predicted utility), after which only the |S| top-scoring features are retained. We have designed four novel feature scoring functions for OC, and tested them on two datasets of product review data, using as baselines the only three feature scoring functions for OC previously proposed in the literature, i.e., the probability redistribution procedure (PRP) function proposed in [4], and the minimum variance (Var) and round robin on minimum variance (RR(Var)) functions proposed in [2].

∗This is a short summary of a paper forthcoming in the Proceedings of the 25th ACM Symposium on Applied Computing (SAC'10), Sierre, CH, 2010.

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy.
http://ims.dei.unipd.it/websites/iir10/index.html
Copyright owned by the authors.

2. FEATURE SELECTION FOR OC
Our first method, Var∗IDF, is a variant of the Var method described in [2], and is meant to prevent features that occur very infrequently, such as hapax legomena, from being top-ranked, which would obviously be undesirable. It is defined as

    Score(tk) = −Var(tk) ∗ (IDF(tk))^a    (1)

where IDF is the standard inverse document frequency and a is a parameter (to be optimized on a validation set) that allows one to fine-tune the relative contributions of variance and IDF to the product. Given that Var(tk) = 0 implies that Score(tk) = 0, we smooth Var(tk) by adding to it a small value ε = 0.1 prior to multiplying it by IDF(tk).

Our second method, RR(Var∗IDF), is a variant of Var∗IDF meant to prevent it from exclusively catering for a certain rank and disregarding the others. It consists in (i) provisionally "assigning" each feature tk to the rank closest to the mean of its distribution across the ranks; (ii) sorting, for each rank rj, the features provisionally assigned to rj in decreasing order of their value of the Score function of Equation 1; and (iii) enforcing a "round robin" policy in which the n ranks take turns, for x/n rounds, in picking their favourite features from the top-most elements of their rank-specific orderings.

Our third method, RR(IGOR), is based on the idea of viewing ordinal classification on R = ⟨r1 ≺ … ≺ rn⟩ as the generation of (n−1) binary classifiers Φ̈j, each of which is in charge of separating Rj = {r1, …, rj} from R̄j = {rj+1, …, rn}, for j = 1, …, (n−1). For each feature tk we thus compute (n−1) different values of IG(tk, cj) (the classic information gain function of binary classification), by taking cj = r1 ∪ … ∪ rj and c̄j = rj+1 ∪ … ∪ rn, for j = 1, …, (n−1). Similarly to RR(Var∗IDF), we (i) sort, for each of the (n−1) binary classifiers Φ̈j, the features in decreasing order of IG(tk, cj) value, and (ii) enforce a round-robin policy in which the (n−1) classifiers Φ̈j take turns, for x/(n−1) rounds, in picking their favourite features from the top-most elements of their classifier-specific orderings.

Our fourth method, RR(NC∗IDF), directly optimizes the chosen error measure E. Let us define the negative correlation of tk with rj in the training set Tr as

    NC_Tr(tk, rj) = ( Σ_{di ∈ Tr | tk ∈ di} E(Φ̃j, di) ) / |{di ∈ Tr | tk ∈ di}|

where Φ̃j is the "trivial" classifier that assigns all di ∈ D to rj and E(Φ̂, di) represents the error that Φ̂ makes in classifying di. Let the rank R(tk) associated to a feature tk be

    R(tk) = arg min_{rj ∈ R} NC_Tr(tk, rj)

We can now define

    Score(tk) = −NC_Tr(tk, R(tk)) ∗ (IDF(tk))^a    (2)

where the a parameter serves the same purpose as in Equation 1. Similarly to the 2nd and 3rd methods, to select the best x features we apply a round-robin policy in which each rj is allowed to pick, among the features such that R(tk) = rj, the x/n features with the best Score. More details on these methods can be found in [3].

3. EXPERIMENTS
We have tested the proposed measures on two different datasets, TripAdvisor-15763 and Amazon-83713, whose characteristics are summarized in Table 1. Both datasets consist of product reviews scored on a scale of one to five "stars", and (as shown by Table 1) are highly imbalanced. See [3] for more details.

Dataset            |Tr|     |Va|    |Te|     1      2     3     4      5
TripAdvisor-15763  10,508   3,941   5,255    3.9%   7.2%  9.4%  34.5%  45.0%
Amazon-83713       20,000   4,000   63,713   16.2%  7.9%  9.1%  23.2%  43.6%

Table 1: Main characteristics of the two datasets used in this paper; the last five columns indicate the fraction of documents that have a given number of "stars".

As the evaluation measure we use the macroaveraged mean absolute error (MAE^M), proposed in [1] and defined as

    MAE^M(Φ̂, Te) = (1/n) Σ_{j=1}^{n} (1/|Te_j|) Σ_{di ∈ Te_j} |Φ̂(di) − Φ(di)|    (3)

where Te_j denotes the set of test documents whose true rank is rj and the "M" superscript indicates "macroaveraging".

We compare our methods with the three baselines mentioned at the end of Section 1 and with the "trivial baseline" that consists in scoring all test documents as 3 stars.

As a learning device we use ε-support vector regression (ε-SVR) [5] as implemented in the freely available LibSvm library. As a vectorial representation, after stop word removal (and no stemming), we use a standard bag of words with cosine-normalized tfidf weighting. We have run all our experiments for all the 100 reduction levels ξ ∈ {0.001, 0.01, 0.02, 0.03, …, 0.99}. For the Var∗IDF, RR(Var∗IDF) and RR(NC∗IDF) methods we have (individually for each method) optimized the a parameter on a validation set Va extracted from the training set Tr, and then retrained the optimized classifier on the full training set Tr. For RR(NC∗IDF), E(Φ̂, di) was taken to be |Φ̂(di) − Φ(di)|.

The results of our tests are displayed in Figure 1, in which effectiveness is plotted as a function of the tested reduction level. For reasons of space only the Amazon-83713 results are displayed (see [3] for the TripAdvisor-15763 results, which are anyway fairly similar). The experiments show that

• the four novel techniques proposed here are dramatically superior to the four baselines;
• our four techniques are fairly stable across ξ ∈ [0.05, 1.0], and deteriorate, sometimes rapidly, only for the very aggressive levels, i.e., for ξ ∈ [0.001, 0.05]. This is in stark contrast with the instability of the baselines;
• for ξ ∈ [0.01, 0.3] the proposed techniques even outperform the full feature set. This indicates that one can reduce the feature set by an order of magnitude (with the ensuing benefits in terms of training-time and testing-time efficiency) and obtain an accuracy equal or even slightly superior (roughly a 10% improvement, in the best cases) to that obtainable with the full feature set.

Figure 1: Results obtained on the Amazon-83713 dataset (MAE^M plotted against the reduction level). Results are evaluated with MAE^M; lower values are better. "FFS" refers to the full feature set (i.e., no feature selection), while "Triv" refers to uniform assignment to the 3 Stars class.

4. REFERENCES
[1] S. Baccianella, A. Esuli, and F. Sebastiani. Evaluation measures for ordinal text classification. In Proceedings of the 9th IEEE Int'l Conference on Intelligent Systems Design and Applications (ISDA'09), pages 283–287, Pisa, IT, 2009.
[2] S. Baccianella, A. Esuli, and F. Sebastiani. Multi-facet rating of product reviews. In Proceedings of the 31st European Conference on Information Retrieval (ECIR'09), pages 461–472, Toulouse, FR, 2009.
[3] S. Baccianella, A. Esuli, and F. Sebastiani. Feature selection for ordinal regression. In Proceedings of the 25th ACM Symposium on Applied Computing (SAC'10), Sierre, CH, 2010.
[4] R. Mukras, N. Wiratunga, R. Lothian, S. Chakraborti, and D. Harper. Information gain feature selection for ordinal text classification using probability re-distribution. In Proceedings of the IJCAI'07 Workshop on Text Mining and Link Analysis, Hyderabad, IN, 2007.
[5] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.
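As an illustrative sketch (not part of the paper itself), Equation 1 with the ε = 0.1 smoothing can be computed as follows. All function and variable names are ours; we assume Var(tk) is the variance of the term's rank distribution and that IDF(tk) = log(N / df), where N is the number of training documents and df the term's document frequency.

```python
import math

def var_idf_score(term_ranks, n_docs, doc_freq, a=1.0, eps=0.1):
    """Sketch of Score(t_k) = -(Var(t_k) + eps) * IDF(t_k)^a  (Equation 1).

    term_ranks: ranks (e.g. star ratings) of the training documents
                containing the term.
    """
    mean = sum(term_ranks) / len(term_ranks)
    var = sum((r - mean) ** 2 for r in term_ranks) / len(term_ranks)
    idf = math.log(n_docs / doc_freq)
    # eps prevents zero-variance terms from all collapsing to Score = 0
    return -(var + eps) * idf ** a
```

A term that always co-occurs with the same rank scores higher (closer to zero) than a term spread across distant ranks, while the IDF factor penalizes hapax legomena.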
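The round-robin policy shared by RR(Var∗IDF), RR(IGOR) and RR(NC∗IDF) can be sketched generically: given one ranked feature list per group (a rank rj or a binary classifier Φ̈j), the groups take turns picking their top-most not-yet-selected feature until x features are chosen. The function name and data layout are our own assumptions.

```python
def round_robin_select(per_group_rankings, x):
    """Groups take turns picking their best unselected feature until
    x features have been chosen (or all rankings are exhausted)."""
    selected, seen = [], set()
    pointers = [0] * len(per_group_rankings)
    while len(selected) < x:
        progressed = False
        for g, ranking in enumerate(per_group_rankings):
            if len(selected) >= x:
                break
            # skip features already picked by another group
            while pointers[g] < len(ranking) and ranking[pointers[g]] in seen:
                pointers[g] += 1
            if pointers[g] < len(ranking):
                feature = ranking[pointers[g]]
                selected.append(feature)
                seen.add(feature)
                pointers[g] += 1
                progressed = True
        if not progressed:
            break  # every ranking exhausted
    return selected
```

Skipping already-selected features (rather than wasting a turn on them) is one possible design choice when the same feature tops several group-specific orderings.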
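For RR(IGOR), the quantity IG(tk, cj) for the binary split cj = r1 ∪ … ∪ rj versus c̄j = rj+1 ∪ … ∪ rn can be sketched as below, using the standard information gain of binary classification. Function names and the parallel-lists input format are our own assumptions.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) count pair, in bits."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def ordinal_split_ig(doc_ranks, doc_has_term, j):
    """IG(t_k, c_j) for the split c_j = {ranks <= j} vs the rest."""
    labels = [r <= j for r in doc_ranks]
    n = len(labels)
    ig = entropy(sum(labels), n - sum(labels))  # class entropy
    for present in (True, False):
        idx = [i for i in range(n) if doc_has_term[i] == present]
        if idx:
            pos = sum(labels[i] for i in idx)
            ig -= len(idx) / n * entropy(pos, len(idx) - pos)
    return ig
```

A term whose presence perfectly predicts the side of the split yields the full class entropy as gain; a term independent of the split yields zero.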
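The negative correlation NC_Tr(tk, rj), the rank assignment R(tk), and the Score of Equation 2 can likewise be sketched. The dictionary-based training-set representation, the function names, and the use of absolute error as E (the choice the paper reports for its experiments) are our assumptions for illustration only.

```python
import math

def neg_correlation(train, term, rank, error=lambda pred, true: abs(pred - true)):
    """NC_Tr(t_k, r_j): mean error of the trivial classifier assigning
    every document to r_j, over the training documents containing t_k."""
    docs = [d for d in train if term in d['terms']]
    return sum(error(rank, d['rank']) for d in docs) / len(docs)

def nc_idf_score(train, term, ranks, n_docs, doc_freq, a=1.0):
    """R(t_k) = argmin_j NC_Tr(t_k, r_j), then
    Score(t_k) = -NC_Tr(t_k, R(t_k)) * IDF(t_k)^a  (Equation 2)."""
    best_rank = min(ranks, key=lambda r: neg_correlation(train, term, r))
    idf = math.log(n_docs / doc_freq)
    return best_rank, -neg_correlation(train, term, best_rank) * idf ** a
```

The selected features are then drawn per rank via the same round-robin policy as the second and third methods.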
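Equation 3 (macroaveraged MAE) is straightforward to sketch. Note one assumption of ours: ranks absent from the test set are skipped to avoid division by zero, whereas Equation 3 simply averages over all n ranks.

```python
def macro_mae(predictions, truths, ranks):
    """MAE^M: MAE computed separately on the test documents of each
    true rank, then averaged over the ranks, so rare ranks weigh as
    much as frequent ones (important on imbalanced star-rating data)."""
    per_rank = []
    for r in ranks:
        idx = [i for i, t in enumerate(truths) if t == r]
        if idx:  # skip ranks with no test documents (our assumption)
            per_rank.append(
                sum(abs(predictions[i] - truths[i]) for i in idx) / len(idx))
    return sum(per_rank) / len(per_rank)
```

On imbalanced data such as Table 1's, this differs markedly from plain (microaveraged) MAE, which a majority-rank classifier can game.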