<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Micro-Documents for Feature Selection: The Case of Ordinal Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Baccianella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Sebastiani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Most popular feature selection methods for text classification (TC) are based on binary information concerning the presence/absence of the feature in each training document. As such, these methods do not exploit term frequency information. In order to overcome this drawback we break down each training document of length k into k training "micro-documents", each consisting of a single word occurrence and endowed with the class information of the original training document. We study the impact of this strategy in the case of ordinal TC; the experiments show that this strategy substantially improves effectiveness.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Feature selection (FS) is a technique for reducing the dimensionality of a vector
space in learning tasks (see e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). It consists in identifying a subset S ⊆ T
of the original feature set T such that |S| ≪ |T| (with ξ = |S|/|T| called the
reduction level) and such that S reaches the best compromise between (a) the
efficiency of the learning process and of the classifiers, and (b) the effectiveness
of the resulting classifiers. In text classification (TC) the most popular approach
to FS is the filter approach: a real-valued function f is applied to each feature
in T in order to compute its expected contribution to solving the classification
task, and only the |S| features with the highest f value are retained.
      </p>
      <p>The most popular instances of function f, such as information gain (a.k.a.
mutual information), chi-square, odds ratio, pointwise mutual information, and
the like, are based on binary information concerning the presence/absence of the
feature in each training document. For instance, in pointwise mutual information,
defined as PMI(tk, cj) = log2 [ P(tk, cj) / (P(tk)P(cj)) ], the value P(tk) is the probability that
feature tk occurs at all in a random training document. As such, PMI and all
the other above-mentioned functions do not exploit a rich source of information,
namely, the information concerning how many times tk occurs in a given training
document (term frequency). In this paper we propose a filter approach to FS
which attempts to overcome this drawback. The approach consists in breaking
down each training document di into length(di) training "micro-documents"
(μ-documents), each consisting of a single word occurrence and endowed with the
same class information as the original training document. In this paper we limit
our experiments to the case of "ordinal" text classification (see below).</p>
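To make the presence/absence point concrete, the following sketch (ours, not from the paper; the toy data and all names are illustrative) computes PMI from binary presence counts; note that repeated occurrences of a term inside a document do not change its score:

```python
import math

# Toy training set: (document tokens, class label).
train = [
    (["good", "great", "good"], "pos"),
    (["bad", "awful"], "neg"),
    (["good", "bad"], "pos"),
    (["awful", "bad", "bad"], "neg"),
]

def pmi(term, cls, docs):
    """Pointwise mutual information from binary presence/absence:
    PMI(t, c) = log2( P(t, c) / (P(t) P(c)) )."""
    n = len(docs)
    p_t = sum(1 for toks, _ in docs if term in toks) / n
    p_c = sum(1 for _, c in docs if c == cls) / n
    p_tc = sum(1 for toks, c in docs if term in toks and c == cls) / n
    if p_tc == 0:
        return float("-inf")
    return math.log2(p_tc / (p_t * p_c))

print(pmi("good", "pos", train))  # 1.0: "good" occurs only in "pos" documents
```

Because only presence counts enter the formula, the two occurrences of "good" in the first document contribute exactly as much as a single occurrence would; this is the information the μ-document trick recovers.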
      <p>This paper is organized as follows. In Section 2 we present our
μ-document-based approach to FS. Section 3 describes experiments we have conducted using
two SVM-based learning methods and two large datasets of product reviews.</p>
    </sec>
    <sec id="sec-2">
      <title>Feature Selection for OC based on Training μ-documents</title>
      <p>Let us fix some terminology and notation. Ordinal classification (OC, also
known as ordinal regression) consists in estimating (from a training set Tr) a
target function Φ : X → R which maps each object xi ∈ X into exactly one of
an ordered sequence (here called rankset) R = ⟨r1, …, rn⟩ of ranks (a.k.a.
"scores", or "labels", or "classes"). The result of the estimation is a function Φ̂,
called the classifier, which we will evaluate on a test set Te. Our FS methods
will typically consist of (a) scoring each feature tk ∈ T by means of a function
that measures the predicted utility of tk for the classification process, and (b),
given a predetermined reduction level ξ, selecting the |S| = ξ|T| top-scoring
features.</p>
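The generic filter scheme in (a)-(b) can be sketched as follows (a minimal illustration under our own naming; the actual scoring functions f are those defined in [2]):

```python
def select_features(scores, xi):
    """Filter-style FS: keep the |S| = xi * |T| features with the
    highest utility score f(t). 'scores' maps feature -> f value."""
    k = max(1, int(xi * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

# Toy scores; with reduction level xi = 0.5, half the features survive.
scores = {"good": 1.0, "bad": 0.9, "the": 0.01, "a": 0.0}
print(select_features(scores, 0.5))  # {'good', 'bad'} (set order may vary)
```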
      <p>
        The FS methods that we use in this paper are the Var*IDF, RR(Var*IDF),
RR(IGOR) and RR(AC*IDF) methods originally defined in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (an extended
version of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), to which we refer the reader for details. All these functions only
use information concerning the presence/absence of feature tk in training
document di. We attempt to overcome this drawback by breaking down each training
document di into length(di) training "μ-documents", each consisting of a single
word occurrence and endowed with the same class information as the original
training document. The training set Tr is then replaced, for FS purposes only,
by the set of the training "μ-documents" obtained from it. All the original FS
methods are obviously still applicable after this move; however, these
methods are now de facto sensitive to term frequency, since a training document dj
belonging to class ci and containing r occurrences of feature tk has generated
(among others) r training μ-documents containing (only) tk and belonging to ci.
      </p>
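The μ-document transformation itself is straightforward; a sketch (ours, with illustrative names) of the expansion, showing that r occurrences of a term yield r μ-documents:

```python
def to_micro_documents(tokens, label):
    """Break a training document of length k into k micro-documents,
    each holding a single word occurrence plus the original class."""
    return [([tok], label) for tok in tokens]

micro = to_micro_documents(["good", "great", "good"], "pos")
# The document had 2 occurrences of "good", hence 2 micro-documents
# whose only content is "good"; a presence-based FS function applied
# to the micro-document collection therefore sees term frequency.
```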
      <p>
        The move from training documents to training μ-documents is, as far as FS
is concerned, akin to the move, in naïve Bayesian learners, from a multivariate
Bernoulli event model (where documents are events) to a multinomial event
model (where word occurrences are events). In the context of TC this move
was originally discussed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, in that case the authors reported that
little difference in performance was found when selecting features via the former
model rather than via the latter model (no actual effectiveness figures were given,
though). Our work may be seen as exporting that idea outside the realm of naïve
Bayesian learners, and outside the realm of single-label TC, neither of which has
been done before to the best of our knowledge.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>
        We have tested the proposed method on two different datasets for ordinal text
classification. The first is the TripAdvisor-15763 dataset, with the same split
between training and test documents as used in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], resulting in 10,508 documents
used for training and 5,255 for test. The second dataset is Amazon-83713,
with the same split between training and test documents as in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], resulting in
20,000 documents used for training and 63,713 for test. Both datasets consist
of textual product reviews scored on a scale from 1 to 5 "stars". As our main
evaluation measure we use the macroaveraged mean absolute error (MAE<sup>M</sup>)
measure proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
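A sketch (our own, following the definition in [5]; variable names are illustrative) of the macroaveraged MAE, which averages the per-rank MAE so that infrequent ranks count as much as frequent ones:

```python
def macro_mae(true, pred, ranks):
    """Macroaveraged mean absolute error: mean over ranks of the
    per-true-rank MAE (ranks absent from the test set are skipped)."""
    per_rank = []
    for r in ranks:
        idx = [i for i, t in enumerate(true) if t == r]
        if idx:
            per_rank.append(sum(abs(pred[i] - true[i]) for i in idx) / len(idx))
    return sum(per_rank) / len(per_rank)

true = [1, 1, 1, 5]          # mostly 1-star reviews, one 5-star
pred = [1, 1, 1, 3]
print(macro_mae(true, pred, range(1, 6)))  # 1.0 (plain MAE would be 0.5)
```

Macroaveraging prevents a classifier that is accurate only on the majority rank from looking deceptively good, which matters on star-rating datasets with skewed rank distributions.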
      <p>
        We have tested our methods with two different SVM-based learning
algorithms for ordinal regression, ε-SVR and SVOR; see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for more details. As the
baselines against which to test our μ-document-based approach we have used
the results we obtained in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (on the same datasets and with the same
learning algorithms) with the versions of the Var*IDF, RR(Var*IDF),
RR(IGOR) and RR(AC*IDF) methods based on "regular" training documents. We
have set the ε and C parameters of both learners to the optimal values that we
had obtained in the experiments of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; this means that the parameters are
optimal for the baselines but not necessarily for the methods proposed here, which
lends even higher value to the results obtained by the methods proposed here.
      </p>
      <p>
        The experimental protocol essentially conforms to that of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As a vectorial
representation, after stop word removal (and no stemming) we have used a
standard bag-of-words model with cosine-normalized tf*idf weighting. We have run all our
experiments for all the 100 reduction levels ξ ∈ {0.001, 0.01, 0.02, 0.03, …, 0.99}.
      </p>
      <p>
        For the Var*IDF, RR(Var*IDF) and RR(AC*IDF) methods we have
set the smoothing parameter (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) to 0.1. For the same methods we have
used the optimal (individually for each method) values of the a parameter (see
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) that we had obtained in the experiments of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; again, this means that the
parameters are optimal for the baselines but not necessarily for the methods
proposed here. For RR(AC*IDF), the E error measure (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) was taken
to be |Φ̂(di) − Φ(di)| (i.e., absolute error), given that it is the document-level
analogue of MAE<sup>M</sup>.
      </p>
      <sec id="sec-3-1">
        <title>Results</title>
        <p>
          The main observation to be made from the results of our experiments (which
are not reported here in detail for reasons of space; see [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for more details) is that
the use of training μ-documents substantially enhances the accuracy of ordinal
TC, since it is practically always the case that the MAE<sup>M</sup> values of the
μ-document-based versions are better than the corresponding values of the
"regular document"-based versions, irrespective of FS function, dataset, and learner.
Overall, these results derive from a massive experimental effort, consisting of
(100 reduction levels × 2 datasets × 2 learners × 4 FS functions =) 1600 new
train-and-test experiments (in addition to the 1600 that produced
our baselines and that were already presented in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]).
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have shown that using μ-documents in place of "regular" training documents
in feature selection substantially improves the effectiveness of ordinal text
classification. In future experiments we plan to validate this method on the more
standard cases of binary classification and multiclass classification.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>J.O.:</given-names>
          </string-name>
          <article-title>A comparative study on feature selection in text categorization</article-title>
          .
          <source>In: Proceedings of the 14th International Conference on Machine Learning (ICML'97)</source>
          , Nashville, US (
          <year>1997</year>
          )
          <fpage>412</fpage>
          –
          <lpage>420</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baccianella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Feature selection for ordinal text classification</article-title>
          .
          <source>Technical Report 2010-TR-014</source>
          ,
          <institution>Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa</institution>
          , IT
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baccianella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Feature selection for ordinal regression</article-title>
          .
          <source>In: Proceedings of the 25th ACM Symposium on Applied Computing (SAC'10)</source>
          , Sierre, CH (
          <year>2010</year>
          )
          <fpage>1748</fpage>
          –
          <lpage>1754</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nigam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A comparison of event models for naive Bayes text classification</article-title>
          .
          <source>In: Proceedings of the AAAI Workshop on Learning for Text Categorization</source>
          , Madison, US (
          <year>1998</year>
          )
          <fpage>41</fpage>
          –
          <lpage>48</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Baccianella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Evaluation measures for ordinal text classification</article-title>
          .
          <source>In: Proceedings of the 9th IEEE International Conference on Intelligent Systems Design and Applications (ISDA'09)</source>
          , Pisa, IT (
          <year>2009</year>
          )
          <fpage>283</fpage>
          –
          <lpage>287</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Baccianella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Using micro-documents for feature selection: The case of ordinal text classification</article-title>
          .
          <source>Technical Report 2011-TR-001</source>
          ,
          <institution>Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa</institution>
          , IT
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>