<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Selecting Features for Ordinal Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Baccianella</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Sebastiani</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>We present four new feature selection methods for ordinal regression and test them against four different baselines on two large datasets of product reviews.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>In (text) classification, feature selection (FS) consists in
identifying a subset S ⊆ T of the original feature set T such
that |S| ≪ |T| (ξ = |S|/|T| being called the reduction level)
and such that S reaches the best compromise between (a)
the effectiveness of the resulting classifiers and (b) the
efficiency of the learning process and of the classifiers. While
feature selection has been extensively investigated for
standard classification, it has not for a related and important
learning task, namely, ordinal classification (OC, aka
ordinal regression). OC consists in estimating a target function
Φ : D → R mapping each object di ∈ D into exactly one of
an ordered sequence R = ⟨r1, ..., rn⟩ of ranks, by means
of a function Φ̂ called the classifier.</p>
      <p>
        We here address the problem of feature selection for OC.
We use a "filter" approach, in which a scoring function Score
is applied to each feature tk ∈ T in order to measure the
predicted utility of tk for the classification task (the higher
the value of Score, the higher the predicted utility), after
which only the |S| top-scoring features are retained. We
have designed four novel feature scoring functions for OC,
and tested them on two datasets of product review data,
using as baselines the only three feature scoring functions for
OC previously proposed in the literature, i.e., the probability
redistribution procedure (PRP) function proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and
the minimum variance (Var) and round robin on minimum
variance (RR(Var)) functions proposed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
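      <p>As an illustration of the filter approach, selection reduces to scoring each feature and truncating the ranking at the chosen reduction level (a minimal sketch, not code from the paper; the document-frequency scoring function is a made-up stand-in for the functions discussed next):

```python
def filter_select(features, score, reduction_level):
    """Keep the top-scoring fraction of the feature set.

    features: the original feature set T, as a list of identifiers
    score: a function mapping a feature to its predicted utility
    reduction_level: xi = |S| / |T|, the fraction of features to keep
    """
    keep = max(1, int(len(features) * reduction_level))
    # rank features by decreasing predicted utility, retain the top |S|
    ranked = sorted(features, key=score, reverse=True)
    return ranked[:keep]

# Toy usage: document frequency as a stand-in scoring function
df = {"good": 120, "excellent": 35, "the": 980, "zorblatt": 1}
selected = filter_select(list(df), score=lambda t: df[t], reduction_level=0.5)
```
</p>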
      <p>This is a short summary of a paper forthcoming in the
Proceedings of the 25th ACM Symposium on Applied
Computing (SAC'10), Sierre, CH, 2010.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. FEATURE SELECTION FOR OC</title>
      <p>
        Our first method, Var*IDF, is a variant of the Var method
described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and is meant to prevent features occurring
very infrequently, such as hapax legomena, from being top-ranked,
which would obviously be undesirable. It is defined as
Score(tk) = Var(tk) · (IDF(tk))^a   (1)
where IDF is the standard inverse document frequency and
a is a parameter (to be optimized on a validation set) that
allows one to fine-tune the relative contributions of variance and
IDF to the product. Given that Var(tk) = 0 implies that
Score(tk) = 0, we smooth Var(tk) by adding to it a small
value ε = 0.1 prior to multiplying it by IDF(tk).
      </p>
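      <p>A sketch of Equation 1 with the ε = 0.1 smoothing, under the assumption (suggested by the Var method of [2]) that Var(tk) is the variance of the ranks of the training documents containing tk, and that IDF(tk) = log(N / df(tk)); the function and variable names are invented:

```python
import math
from statistics import pvariance

def var_idf(ranks_of_docs_with_tk, num_docs, a=1.0, eps=0.1):
    """Compute (Var(tk) + eps) * IDF(tk)**a: Equation 1 with the
    smoothing value eps = 0.1 described in the text."""
    variance = pvariance(ranks_of_docs_with_tk)            # Var(tk)
    idf = math.log(num_docs / len(ranks_of_docs_with_tk))  # IDF(tk)
    return (variance + eps) * idf ** a

# A hapax legomenon has zero variance, so without the smoothing its
# (large) IDF would be silenced by the multiplication with zero
hapax = var_idf([3], num_docs=1000)
common = var_idf([3, 3, 4, 3, 2, 3], num_docs=1000)
```
</p>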
      <p>Our second method, RR(Var*IDF), is a variant of Var*IDF
meant to prevent it from exclusively catering for a
certain rank and disregarding the others. It consists in (i)
provisionally "assigning" each feature tk to the rank closest
to the mean of its distribution across the ranks; (ii) sorting,
for each rank rj, the features provisionally assigned to rj
in decreasing order of their value of the Score function of
Equation 1; and (iii) enforcing a "round robin" policy in
which the n ranks take turns, for x/n rounds, in picking their
favourite features from the top-most elements of their
rank-specific orderings.</p>
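      <p>The round-robin step (iii) can be sketched as follows (a minimal illustration, not the authors' implementation; the feature names are invented):

```python
def round_robin(per_rank_orderings, x):
    """The ranks take turns picking their favourite not-yet-taken
    feature until x features have been selected.

    per_rank_orderings: dict mapping each rank to its provisionally
        assigned features, sorted by decreasing Score (step (ii))
    x: total number of features to select
    """
    selected = []
    taken = set()
    queues = {rank: list(feats) for rank, feats in per_rank_orderings.items()}
    while len(selected) != x:
        progress = False
        for rank in queues:
            # skip features already picked by another rank
            while queues[rank] and queues[rank][0] in taken:
                queues[rank].pop(0)
            if queues[rank] and len(selected) != x:
                feature = queues[rank].pop(0)
                taken.add(feature)
                selected.append(feature)
                progress = True
        if not progress:
            break  # every queue exhausted before reaching x
    return selected

orderings = {
    1: ["awful", "bad", "poor"],
    5: ["awful", "great", "superb"],  # "awful" also scores high for rank 5
}
picked = round_robin(orderings, x=4)
```

Rank 1 takes "awful" first, so rank 5 skips it and takes its next favourite, keeping the selected set balanced across ranks.
</p>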
      <p>Our third method, RR(IGOR), is based on the idea of
viewing ordinal classification on R = ⟨r1, ..., rn⟩ as the
generation of (n−1) binary classifiers Φ̂j, each of which
is in charge of separating Rj = {r1, ..., rj} from R̄j =
{rj+1, ..., rn}, for j = 1, ..., (n−1). For each feature tk
we thus compute (n−1) different values of IG(tk, cj) (the
classic information gain function of binary classification),
by taking cj = r1 ∪ ... ∪ rj and c̄j = rj+1 ∪ ... ∪ rn, for
j = 1, ..., (n−1). Similarly to RR(Var*IDF), we (i) sort,
for each of the (n−1) binary classifiers Φ̂j, the features in
decreasing order of IG(tk, cj) value, and (ii) enforce a
round-robin policy in which the (n−1) classifiers Φ̂j take turns,
for x/(n−1) rounds, in picking their favourite features from the
top-most elements of their classifier-specific orderings.</p>
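      <p>A sketch of how the (n−1) information-gain values might be computed from per-rank document counts (the counts, helper names, and data below are invented for illustration):

```python
import math

def entropy(pos, neg):
    """Binary entropy, in bits, of a class split with pos/neg documents."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def ig_ordinal_splits(docs_with_tk, docs_without_tk, n):
    """For j = 1..(n-1), the information gain of feature tk for the
    binary task separating ranks {r1..rj} from {r(j+1)..rn}.

    docs_with_tk / docs_without_tk: dicts mapping each rank (1..n) to
    the number of training documents of that rank containing / lacking tk.
    """
    gains = []
    for j in range(1, n):
        low = set(range(1, j + 1))  # c_j = r1 ∪ ... ∪ rj
        pos_t = sum(c for r, c in docs_with_tk.items() if r in low)
        neg_t = sum(c for r, c in docs_with_tk.items() if r not in low)
        pos_no = sum(c for r, c in docs_without_tk.items() if r in low)
        neg_no = sum(c for r, c in docs_without_tk.items() if r not in low)
        total = pos_t + neg_t + pos_no + neg_no
        # IG = prior entropy minus expected entropy after observing tk
        prior = entropy(pos_t + pos_no, neg_t + neg_no)
        cond = ((pos_t + neg_t) / total) * entropy(pos_t, neg_t)
        cond += ((pos_no + neg_no) / total) * entropy(pos_no, neg_no)
        gains.append(prior - cond)
    return gains

# tk occurs mostly in 1- and 2-star reviews, so the split between
# ranks {1,2} and {3,4,5} (j = 2) yields the highest gain
with_tk = {1: 8, 2: 6, 3: 1, 4: 0, 5: 0}
without_tk = {1: 2, 2: 4, 3: 9, 4: 10, 5: 10}
gains = ig_ordinal_splits(with_tk, without_tk, n=5)
```
</p>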
      <p>Our fourth method, RR(NC*IDF), directly optimizes
the chosen error measure E. Let us define the negative
correlation of tk with rj in the training set Tr as
NC_Tr(tk, rj) = ( Σ_{di ∈ Tr : tk ∈ di} E(Φ̃j, di) ) / |{di ∈ Tr | tk ∈ di}|
where Φ̃j is the "trivial" classifier that assigns all di ∈ D to rj
and E(Φ̂, di) represents the error that Φ̂ makes in classifying
di. Let the rank R(tk) associated to a feature tk be
R(tk) = arg min_{rj ∈ R} NC_Tr(tk, rj)
Score(tk) is then defined, analogously to Equation 1, as the
product of a term based on NC_Tr(tk, R(tk)) and (IDF(tk))^a,
where the a parameter serves the same purpose as in
Equation 1. Similarly to the 2nd and 3rd methods, to select
the best x features we apply a round-robin policy in which
each rj is allowed to pick, among the features such that
R(tk) = rj, the x/n features with the best Score.</p>
      <p>[Figure 1: MAE^M as a function of the reduction level on
Amazon-83713, for the FFS (full feature set), Triv, Var, RR(Var),
PRP, RR(IGOR), Var*IDF, RR(Var*IDF), and RR(NC*IDF) methods.]</p>
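      <p>The NC_Tr and R(tk) definitions can be sketched as follows (a minimal illustration using the absolute-error choice of E reported in the experiments; the names are invented):

```python
def negative_correlation(ranks_of_docs_with_tk, rj, error):
    """NC_Tr(tk, rj): the average error made, over the training
    documents containing tk, by the trivial classifier that assigns
    rank rj to every document."""
    errors = [error(rj, true_rank) for true_rank in ranks_of_docs_with_tk]
    return sum(errors) / len(errors)

def best_rank(ranks_of_docs_with_tk, ranks, error):
    """R(tk) = arg min over rj in R of NC_Tr(tk, rj)."""
    return min(ranks,
               key=lambda rj: negative_correlation(ranks_of_docs_with_tk, rj, error))

# Absolute distance between predicted and true rank as E
error = lambda predicted, true: abs(predicted - true)
# tk occurs almost only in 5-star reviews, so its associated rank is 5
r_tk = best_rank([5, 5, 4, 5], ranks=range(1, 6), error=error)
```
</p>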
      <p>
        More details on these methods can be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. EXPERIMENTS</title>
      <p>
        We have tested the proposed measures on two different datasets,
TripAdvisor-15763 and Amazon-83713, whose
characteristics are summarized in Table 1. Both datasets consist of
product reviews scored on a scale of one to five "stars", and
(as shown by Table 1) are highly imbalanced. See [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for
more details.
      </p>
      <p>
        As the evaluation measure we use the macroaveraged mean
absolute error (MAE^M), proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and defined as
MAE^M(Φ̂, Te) = (1/n) Σ_{j=1..n} (1/|Tej|) Σ_{di ∈ Tej} |Φ̂(di) − Φ(di)|
where Tej denotes the set of test documents whose true rank
is rj and the "M" superscript indicates "macroaveraging".
      </p>
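      <p>A minimal sketch of this measure (not the authors' code; for simplicity it skips ranks with no test documents rather than dividing by a fixed n):

```python
def macro_mae(predicted, true, ranks):
    """Macroaveraged mean absolute error: MAE is computed separately on
    the test documents of each true rank, then averaged across ranks,
    so that on imbalanced test sets infrequent ranks weigh as much as
    frequent ones."""
    per_rank = []
    for rj in ranks:
        errors = [abs(p, ) for p in ()]  # placeholder, replaced below
    per_rank = []
    for rj in ranks:
        errors = [abs(p - t) for p, t in zip(predicted, true) if t == rj]
        if errors:  # ranks absent from the test set are skipped
            per_rank.append(sum(errors) / len(errors))
    return sum(per_rank) / len(per_rank)

# A classifier that always predicts 1 star looks good under plain MAE
# (0.8 on this toy test set) but is punished by macroaveraging
mae_m = macro_mae(predicted=[1, 1, 1, 1, 1], true=[1, 1, 1, 1, 5],
                  ranks=range(1, 6))
```
</p>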
      <p>We compare our methods with the three baselines
mentioned at the end of Section 1 and with the "trivial baseline"
that consists in scoring all test documents as 3 stars.</p>
      <p>
        As a learning device we use ν-support vector regression
(ν-SVR) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as implemented in the freely available LibSvm
library. As a vectorial representation, after stop word
removal (and no stemming) we use standard bag-of-words
with cosine-normalized tf·idf weighting. We have run all
our experiments for all the 100 reduction levels ξ ∈ {0.001,
0.01, 0.02, 0.03, ..., 0.99}. For the Var*IDF, RR(Var*IDF)
and RR(NC*IDF) methods we have (individually
for each method) optimized the a parameter on a validation
set Va extracted from the training set Tr, and then
retrained the optimized classifier on the full training set Tr.
For RR(NC*IDF), E(Φ̂, di) was taken to be |Φ̂(di) − Φ(di)|.
      </p>
      <p>
        The results of our tests are displayed in Figure 1, in which
effectiveness is plotted as a function of the tested
reduction level. For reasons of space only the Amazon-83713
results are displayed (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for the TripAdvisor-15763 results,
which are anyway fairly similar). The experiments show that
(i) the four novel techniques proposed here are
dramatically superior to the four baselines; (ii)
our four techniques are fairly stable across ξ ∈ [0.05, 1.0],
and deteriorate, sometimes rapidly, only for the very
aggressive levels, i.e., for ξ ∈ [0.001, 0.05], in
stark contrast with the instability of the baselines; and
(iii) for ξ ∈ [0.01, 0.3] the proposed techniques even
outperform the full feature set. This indicates that one
can reduce the feature set by an order of magnitude
(with the ensuing benefits in terms of training-time
and testing-time efficiency) and obtain an accuracy
equal to or even slightly superior (roughly a 10%
improvement, in the best cases) to that obtainable with the full
feature set.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Baccianella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Evaluation measures for ordinal text classification</article-title>
          .
          <source>In Proceedings of the 9th IEEE Int'l Conference on Intelligent Systems Design and Applications (ISDA'09)</source>
          , pages
          <fpage>283</fpage>
          –
          <lpage>287</lpage>
          , Pisa, IT
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Baccianella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Multi-facet rating of product reviews</article-title>
          .
          <source>In Proceedings of the 31st European Conference on Information Retrieval (ECIR'09)</source>
          , pages
          <fpage>461</fpage>
          –
          <lpage>472</lpage>
          , Toulouse, FR,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Baccianella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Feature selection for ordinal regression</article-title>
          .
          <source>In Proceedings of the 25th ACM Symposium on Applied Computing (SAC'10)</source>
          , Sierre, CH
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mukras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wiratunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lothian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakraborti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Harper</surname>
          </string-name>
          .
          <article-title>Information gain feature selection for ordinal text classification using probability re-distribution</article-title>
          .
          <source>In Proceedings of the IJCAI'07 Workshop on Text Mining and Link Analysis</source>
          , Hyderabad, IN,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Williamson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Bartlett</surname>
          </string-name>
          .
          <article-title>New support vector algorithms</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>12</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1207</fpage>
          –
          <lpage>1245</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>