Selecting Features for Ordinal Text Classification∗

Stefano Baccianella, Andrea Esuli and Fabrizio Sebastiani
Istituto di Scienza e Tecnologie dell'Informazione
Consiglio Nazionale delle Ricerche
Via Giuseppe Moruzzi, 1 – 56124 Pisa, Italy
firstname.lastname@isti.cnr.it

ABSTRACT
We present four new feature selection methods for ordinal regression and test them against four different baselines on two large datasets of product reviews.

1. INTRODUCTION
In (text) classification, feature selection (FS) consists in identifying a subset S ⊂ T of the original feature set T such that |S| ≪ |T| (ξ = |S|/|T| being called the reduction level) and such that S reaches the best compromise between (a) the effectiveness of the resulting classifiers and (b) the efficiency of the learning process and of the classifiers. While feature selection has been extensively investigated for standard classification, it has not for a related and important learning task, namely, ordinal classification (OC – aka ordinal regression). OC consists in estimating a target function Φ : D → R mapping each object di ∈ D into exactly one of an ordered sequence R = ⟨r1 ≺ … ≺ rn⟩ of ranks, by means of a function Φ̂ called the classifier.

We here address the problem of feature selection for OC. We use a "filter" approach, in which a scoring function Score is applied to each feature tk ∈ T in order to measure the predicted utility of tk for the classification task (the higher the value of Score, the higher the predicted utility), after which only the |S| top-scoring features are retained. We have designed four novel feature scoring functions for OC, and tested them on two datasets of product review data, using as baselines the only three feature scoring functions for OC previously proposed in the literature, i.e., the probability redistribution procedure (PRP) function proposed in [4], and the minimum variance (Var) and round robin on minimum variance (RR(Var)) functions proposed in [2].

∗This is a short summary of a paper forthcoming in the Proceedings of the 25th ACM Symposium on Applied Computing (SAC'10), Sierre, CH, 2010.

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy.
http://ims.dei.unipd.it/websites/iir10/index.html
Copyright owned by the authors.

2. FEATURE SELECTION FOR OC
Our first method, Var∗IDF, is a variant of the Var method described in [2], and is meant to prevent features that occur very infrequently, such as hapax legomena, from being top-ranked, which would obviously be undesirable. It is defined as

    Score(tk) = −Var(tk) ∗ (IDF(tk))^a    (1)

where IDF is the standard inverse document frequency and a is a parameter (to be optimized on a validation set) that allows one to fine-tune the relative contributions of variance and IDF to the product. Given that Var(tk) = 0 implies that Score(tk) = 0, we smooth Var(tk) by adding to it a small value ε = 0.1 prior to multiplying it by IDF(tk).

Our second method, RR(Var∗IDF), is a variant of Var∗IDF meant to prevent it from exclusively catering for a certain rank and disregarding the others. It consists in (i) provisionally "assigning" each feature tk to the rank closest to the mean of its distribution across the ranks; (ii) sorting, for each rank rj, the features provisionally assigned to rj in decreasing order of their value of the Score function of Equation 1; and (iii) enforcing a "round robin" policy in which the n ranks take turns, for x/n rounds, in picking their favourite features from the top-most elements of their rank-specific orderings.

Our third method, RR(IGOR), is based on the idea of viewing ordinal classification on R = ⟨r1 ≺ … ≺ rn⟩ as the generation of (n−1) binary classifiers Φ̈j, each of which is in charge of separating Rj = {r1, …, rj} from R̄j = {rj+1, …, rn}, for j = 1, …, (n−1). For each feature tk we thus compute (n−1) different values of IG(tk, cj) (the classic information gain function of binary classification), by taking cj = r1 ∪ … ∪ rj and c̄j = rj+1 ∪ … ∪ rn, for j = 1, …, (n−1). Similarly to RR(Var∗IDF), we (i) sort, for each of the (n−1) binary classifiers Φ̈j, the features in decreasing order of IG(tk, cj) value, and (ii) enforce a round-robin policy in which the (n−1) classifiers Φ̈j take turns, for x/(n−1) rounds, in picking their favourite features from the top-most elements of their classifier-specific orderings.

Our fourth method, RR(NC∗IDF), directly optimizes the chosen error measure E. Let us define the negative correlation of tk with rj in the training set Tr as

    NC_Tr(tk, rj) = ( Σ_{di ∈ Tr | tk ∈ di} E(Φ̃j, di) ) / |{di ∈ Tr | tk ∈ di}|

where Φ̃j is the "trivial" classifier that assigns all di ∈ D to rj and E(Φ̂, di) represents the error that Φ̂ makes in classifying di. Let the rank R(tk) associated to a feature tk be

    R(tk) = arg min_{rj ∈ R} NC_Tr(tk, rj)

We can now define

    Score(tk) = −NC_Tr(tk, R(tk)) ∗ (IDF(tk))^a    (2)

where the a parameter serves the same purpose as in Equation 1. Similarly to the 2nd and 3rd methods, to select the best x features we apply a round-robin policy in which each rj is allowed to pick, among the features such that R(tk) = rj, the x/n features with the best Score. More details on these methods can be found in [3].

3. EXPERIMENTS
We have tested the proposed measures on two different datasets, TripAdvisor-15763 and Amazon-83713, whose characteristics are summarized in Table 1. Both datasets consist of product reviews scored on a scale of one to five "stars", and (as shown by Table 1) are highly imbalanced. See [3] for more details.

Dataset            |Tr|     |Va|    |Te|     1      2     3     4      5
TripAdvisor-15763  10,508   3,941   5,255    3.9%   7.2%  9.4%  34.5%  45.0%
Amazon-83713       20,000   4,000   63,713   16.2%  7.9%  9.1%  23.2%  43.6%

Table 1: Main characteristics of the two datasets used in this paper; the last five columns indicate the fraction of documents that have a given number of "stars".

As the evaluation measure we use the macroaveraged mean absolute error (MAE^M), proposed in [1] and defined as

    MAE^M(Φ̂, Te) = (1/n) Σ_{j=1}^{n} (1/|Te_j|) Σ_{di ∈ Te_j} |Φ̂(di) − Φ(di)|    (3)

where Te_j denotes the set of test documents whose true rank is rj and the "M" superscript indicates "macroaveraging".

We compare our methods with the three baselines mentioned at the end of Section 1 and with the "trivial baseline" that consists in scoring all test documents as 3 stars.

As a learning device we use ε-support vector regression (ε-SVR) [5] as implemented in the freely available LibSvm library. As a vectorial representation, after stop word removal (and no stemming), we use a standard bag of words with cosine-normalized tfidf weighting. We have run all our experiments for all the 100 reduction levels ξ ∈ {0.001, 0.01, 0.02, 0.03, …, 0.99}. For the Var∗IDF, RR(Var∗IDF) and RR(NC∗IDF) methods we have (individually for each method) optimized the a parameter on a validation set Va extracted from the training set Tr, and then retrained the optimized classifier on the full training set Tr. For RR(NC∗IDF), E(Φ̂, di) was taken to be |Φ̂(di) − Φ(di)|.

The results of our tests are displayed in Figure 1, in which effectiveness is plotted as a function of the tested reduction level. For reasons of space only the Amazon-83713 results are displayed (see [3] for the TripAdvisor-15763 results, which are anyway fairly similar). The experiments show that

• the four novel techniques proposed here are dramatically superior to the four baselines;
• our four techniques are fairly stable across ξ ∈ [0.05, 1.0], and deteriorate, sometimes rapidly, only for the very aggressive levels, i.e., for ξ ∈ [0.001, 0.05]. This is in stark contrast with the instability of the baselines;
• for ξ ∈ [0.01, 0.3] the proposed techniques even outperform the full feature set. This indicates that one can reduce the feature set by an order of magnitude (with the ensuing benefits in terms of training-time and testing-time efficiency) and obtain an accuracy equal or even slightly superior (roughly a 10% improvement, in the best cases) to that obtainable with the full feature set.

Figure 1: Results obtained on the Amazon-83713 dataset (MAE^M plotted against the reduction level). Results are evaluated with MAE^M; lower values are better. "FFS" refers to the full feature set (i.e., no feature selection), while "Triv" refers to uniform assignment to the 3 Stars class.

4. REFERENCES
[1] S. Baccianella, A. Esuli, and F. Sebastiani. Evaluation measures for ordinal text classification. In Proceedings of the 9th IEEE Int'l Conference on Intelligent Systems Design and Applications (ISDA'09), pages 283–287, Pisa, IT, 2009.
[2] S. Baccianella, A. Esuli, and F. Sebastiani. Multi-facet rating of product reviews. In Proceedings of the 31st European Conference on Information Retrieval (ECIR'09), pages 461–472, Toulouse, FR, 2009.
[3] S. Baccianella, A. Esuli, and F. Sebastiani. Feature selection for ordinal regression. In Proceedings of the 25th ACM Symposium on Applied Computing (SAC'10), Sierre, CH, 2010.
[4] R. Mukras, N. Wiratunga, R. Lothian, S. Chakraborti, and D. Harper. Information gain feature selection for ordinal text classification using probability re-distribution. In Proceedings of the IJCAI'07 Workshop on Text Mining and Link Analysis, Hyderabad, IN, 2007.
[5] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.
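As an illustrative sketch (not part of the paper itself), Equation 1 with the ε = 0.1 smoothing can be computed as follows. All function and variable names are ours; we assume Var(tk) is the variance of the term's rank distribution and that IDF(tk) = log(N / df), where N is the number of training documents and df the term's document frequency.

```python
import math

def var_idf_score(term_ranks, n_docs, doc_freq, a=1.0, eps=0.1):
    """Sketch of Score(t_k) = -(Var(t_k) + eps) * IDF(t_k)^a  (Equation 1).

    term_ranks: ranks (e.g. star ratings) of the training documents
                containing the term.
    """
    mean = sum(term_ranks) / len(term_ranks)
    var = sum((r - mean) ** 2 for r in term_ranks) / len(term_ranks)
    idf = math.log(n_docs / doc_freq)
    # eps prevents zero-variance terms from all collapsing to Score = 0
    return -(var + eps) * idf ** a
```

A term that always co-occurs with the same rank scores higher (closer to zero) than a term spread across distant ranks, while the IDF factor penalizes hapax legomena.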
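The round-robin policy shared by RR(Var∗IDF), RR(IGOR) and RR(NC∗IDF) can be sketched generically: given one ranked feature list per group (a rank rj or a binary classifier Φ̈j), the groups take turns picking their top-most not-yet-selected feature until x features are chosen. The function name and data layout are our own assumptions.

```python
def round_robin_select(per_group_rankings, x):
    """Groups take turns picking their best unselected feature until
    x features have been chosen (or all rankings are exhausted)."""
    selected, seen = [], set()
    pointers = [0] * len(per_group_rankings)
    while len(selected) < x:
        progressed = False
        for g, ranking in enumerate(per_group_rankings):
            if len(selected) >= x:
                break
            # skip features already picked by another group
            while pointers[g] < len(ranking) and ranking[pointers[g]] in seen:
                pointers[g] += 1
            if pointers[g] < len(ranking):
                feature = ranking[pointers[g]]
                selected.append(feature)
                seen.add(feature)
                pointers[g] += 1
                progressed = True
        if not progressed:
            break  # every ranking exhausted
    return selected
```

Skipping already-selected features (rather than wasting a turn on them) is one possible design choice when the same feature tops several group-specific orderings.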
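For RR(IGOR), the quantity IG(tk, cj) for the binary split cj = r1 ∪ … ∪ rj versus c̄j = rj+1 ∪ … ∪ rn can be sketched as below, using the standard information gain of binary classification. Function names and the parallel-lists input format are our own assumptions.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) count pair, in bits."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def ordinal_split_ig(doc_ranks, doc_has_term, j):
    """IG(t_k, c_j) for the split c_j = {ranks <= j} vs the rest."""
    labels = [r <= j for r in doc_ranks]
    n = len(labels)
    ig = entropy(sum(labels), n - sum(labels))  # class entropy
    for present in (True, False):
        idx = [i for i in range(n) if doc_has_term[i] == present]
        if idx:
            pos = sum(labels[i] for i in idx)
            ig -= len(idx) / n * entropy(pos, len(idx) - pos)
    return ig
```

A term whose presence perfectly predicts the side of the split yields the full class entropy as gain; a term independent of the split yields zero.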
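The negative correlation NC_Tr(tk, rj), the rank assignment R(tk), and the Score of Equation 2 can likewise be sketched. The dictionary-based training-set representation, the function names, and the use of absolute error as E (the choice the paper reports for its experiments) are our assumptions for illustration only.

```python
import math

def neg_correlation(train, term, rank, error=lambda pred, true: abs(pred - true)):
    """NC_Tr(t_k, r_j): mean error of the trivial classifier assigning
    every document to r_j, over the training documents containing t_k."""
    docs = [d for d in train if term in d['terms']]
    return sum(error(rank, d['rank']) for d in docs) / len(docs)

def nc_idf_score(train, term, ranks, n_docs, doc_freq, a=1.0):
    """R(t_k) = argmin_j NC_Tr(t_k, r_j), then
    Score(t_k) = -NC_Tr(t_k, R(t_k)) * IDF(t_k)^a  (Equation 2)."""
    best_rank = min(ranks, key=lambda r: neg_correlation(train, term, r))
    idf = math.log(n_docs / doc_freq)
    return best_rank, -neg_correlation(train, term, best_rank) * idf ** a
```

The selected features are then drawn per rank via the same round-robin policy as the second and third methods.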
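Equation 3 (macroaveraged MAE) is straightforward to sketch. Note one assumption of ours: ranks absent from the test set are skipped to avoid division by zero, whereas Equation 3 simply averages over all n ranks.

```python
def macro_mae(predictions, truths, ranks):
    """MAE^M: MAE computed separately on the test documents of each
    true rank, then averaged over the ranks, so rare ranks weigh as
    much as frequent ones (important on imbalanced star-rating data)."""
    per_rank = []
    for r in ranks:
        idx = [i for i, t in enumerate(truths) if t == r]
        if idx:  # skip ranks with no test documents (our assumption)
            per_rank.append(
                sum(abs(predictions[i] - truths[i]) for i in idx) / len(idx))
    return sum(per_rank) / len(per_rank)
```

On imbalanced data such as Table 1's, this differs markedly from plain (microaveraged) MAE, which a majority-rank classifier can game.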