-

Evaluating Feature Selection Methods for Multi-Label Text Classi cation

Newton Spola

newtonspolaor@gmail.com

Grigorios Tsoumakas

1 0 Laboratory of Computational Intelligence Institute of Mathematics and Computer Science University of Sa~o Paulo Sa~o Carlos , Brazil 1 Machine Learning and Knowledge Discovery Department of Informatics Aristotle University of Thessaloniki Thessaloniki , Greece

Multi-label text classi cation deals with problems in which each document is associated with a subset of categories. These documents often consist of a large number of words, which can hinder the performance of learning algorithms. Feature selection is a popular task to nd representative words and remove unimportant ones, which could speed up learning and even improve learning performance. This work evaluates eight feature selection algorithms in text benchmark datasets. The best algorithms are subsequently compared with random feature selection and classi ers built using all features. Results agree with literature by nding that well-known approaches, such as maximum chi-squared scoring across all labels, are good choices to reduce text dimensionality while reaching competitive multi-label classi cation performance.

problem transformation binary relevance round-robin rand-robin chi-squared bi-normal separation

Classical single-label learning deals with problems in which each dataset instance (or example) is described by a set of features and associated with only one label from a disjoint set of labels L. Single-label text classi cation (or text categorization), for example, learns from data in which each document has a unique category (topic) as label. If L = 2, this task is called binary text classi cation, and it is called multi-class text classi cation if L > 2.

Although a large amount of research has been carried out on single-label learning, the correspondent learning algorithms do not t well into applications composed of instances annotated with subsets of labels from L. Even in some text categorization problems, each document is labeled with several topics simultaneously, such that the learning algorithm should tackle more than one label simultaneously to learn accordingly. Motivated by this scenario, multi-label learning algorithms have been developed [ 32,26 ].

Irrelevant and/or redundant features can hinder the performance of singlelabel and multi-label learning algorithms due to the \curse of dimensionality" [ 16 ]. Thus, the Feature Selection (FS) task is often applied before learning to nd features which describe the dataset as well as, or even better than, the original set of features does, and remove the remaining ones. FS also speeds up learning algorithms and sometimes improves their performance [ 34 ].

Research on multi-label feature selection is still scarce. For example, many publications evaluate a number of FS algorithms in only a few multi-label datasets. This work contributes to reduce this gap by comparing 8 FS methods in 20 multi-label text classi cation datasets (9 from di erent sources and 11 from a web page). The methods combine 2 feature evaluation measures, Chi-squared (CS) and Bi-Normal Separation (BNS) [ 8 ], with 4 aggregation strategies to tackle multiple labels [ 9 ], some of them still unexplored for multi-label datasets. Results show that well-known approaches, such as considering the maximum CS score of each feature across all labels, led to some of the best classi cation models.

The rest of this work is organized as follows: Section 2 brie y presents multilabel learning, FS and related work. Section 3 describes the methods evaluated in Section 4, which is followed by the conclusion and future work in Section 5. 2

Background

This section describes basic notations and concepts related to multi-label learning and feature selection. Related work in multi-label feature selection for textual datasets is also considered. 2.1

Multi-label learning

Let D be a dataset composed of N examples Ei = (xi;Yi), i = 1::N . Each example (instance) Ei is associated with a feature vector xi = (xi1; xi2; : : : ; xiM ) described by M features (attributes) Xj , j = 1::M , and a subset of labels Yi L, where L = fy1; y2; : : : yqg is the set of q labels. Table 1 shows this representation. In this scenario, the multi-label learning task consists in generating a model H which, given an unseen instance E = (x; ?), is capable of accurately predicting its subset of labels Y , i.e., H(E) ! Y .

Multi-label learning methods can be organized into two main categories: algorithm adaptation and problem transformation [ 26 ]. The former one includes learning algorithms extended to deal with multi-label data directly, such as the Multi-label Naive Bayes algorithm [ 33 ]. On the other hand, the latter category consists of algorithm independent methods, as any state of the art single-label learning method can learn from each single-label problem generated by these methods. The Binary Relevance (BR) approach exempli es this category by transforming a multi-label dataset into q single-label datasets, learning from each single-label problem separately and combining the results.

Furthermore, exploiting label dependence during learning can improve performance [ 4 ]. An alternative categorization organizes multi-label learning methods based on the order of label correlations taken into account [ 32 ]. First-order strategies ignore co-existence of other labels during learning, as BR does. Secondorder strategies, exempli ed by Calibrated Label [ 10 ], consider pairwise relations between labels. High-order strategies, such as Random k-labelsets [ 28 ], consider relations among more labels.

Although high-order strategies potentially model wider label correlations, they are usually computationally more demanding. In this work, the problem transformation/ rst-order strategy BR is used for classi cation. 2.2

Feature selection

FS for multi-label text datasets often applies single-label feature evaluation measures, i.e., measures to score the quality of features, after using problem transformation approaches, such as BR [ 6,31 ]. Moreover, these measures usually follow the lter approach [ 16 ]. Unlike the wrapper and embedded approaches, lters remove irrelevant and/or redundant features regardless of the learning algorithm, which can save computational resources when working with large datasets. Both FS measures used in this work agree with these popular choices.

CS and BNS share the same notation [ 8 ]. Let tp, f p, f n and tn be the number of (feature) true positives, false positives, false negatives and true negatives in a binary dataset. In this scenario, tp counts when a feature and a label under evaluation co-occur, i.e., both are positive, while f p counts the cases in which only the feature is positive. De ning the remaining notations is straightforward.

Chi-squared estimates the independence between the occurrence of a feature Xj and the occurrence of a label yi, such that the higher the measure value, the more related Xj and yi are. CS is de ned by Equation 1, where Ppos = (tp + f n)=(tp + f p + f n + tn), Pneg = (f p + tn)=(tp + f p + f n + tn) and t(count;expect) = (count expect)2=expect.

CS(tp;f p;f n;tn) = t (tp; (tp + f p) Ppos) + t (f n; (f n + tn) Ppos) + t (f p; (tp + f p) Pneg) + t (tn; (f n + tn) Pneg) : (1)

Bi-Normal Separation measures the separation between two thresholds (positive and negative classes) in a Gaussian function. This measure models the occurrence of a feature Xj in the documents as a random Normal variable exceeding a hypothetical threshold, such that the frequency of Xj corresponds to the area under the curve past the threshold. As de ned by Equation 2, the higher the di erence between the thresholds, the better the feature Xj is. Let F 1 be the standard Normal distribution inverse cumulative probability function (z-score), tpr = tp=(tp + f n) and f pr = f p=(f p + tn).

BN S(tp;f p;f n;tn) = jF 1(tpr)

F 1(f pr)j: (2) 2.3

Related work

FS has been an active research topic in supervised learning, with several related publications and comprehensive surveys [ 34,26,16 ]. Most of this research has been mainly proposed to support single-label classi cation, but there are also many publications on feature selection for multi-label text classi cation.

The systematic review process, a method to perform a wide, replicable and rigorous literature review, was carried out in [ 22 ] and recently updated to search for multi-label FS publications. Most of the methods found are lters, which are useful to save time/space when working with large textual datasets. Table 2 summarizes some publications on lter FS. As mentioned, many publications use only few datasets to evaluate their methods. Moreover, Information Gain (IG) and CS, two potentially related measures [ 31 ], are the most usual ones. After transforming a multi-label dataset into q binary datasets by BR and counting the number of (feature) true/false positives/negatives, any feature evaluation measure for binary data can be applied according to the macro-averaged approach [ 15 ], exempli ed in [ 23,25,28,20 ].

In what follows, four aggregation strategies are described. Let tpyi , f pyi , tnyi and f nyi be the number of (feature) true/false positives/negatives for a label yi, i = 1::q and a feature Xj , j = 1::M .

The well-known Mean strategy (Mean) [ 31 ] averages the scores obtained after applying the measure f e in the binary dataset related to each label yi (Equation 3). Max (Max), which returns the maximum score obtained across all labels (Equation 4), is also popular.

q M ean(Xj ) = 1 X f e (tpyi ;f pyi ;tnyi ;f nyi ) : q i=1

q M ax(Xj ) = max f e (tpyi ;f pyi ;tnyi ;f nyi ) : i=1 (3) (4)

A nding that some feature evaluation measures can be blinded by a surplus of strongly predictive features for frequent labels, while largely ignoring features needed to discriminate hard (low frequency) labels, motivated the proposal of Round-Robin (RoR) and Rand-Robin (RaR) [ 9 ]. After calculating q feature rankings, the former variation takes the best feature in the ranking related to each label in turn. On the other hand, the latter one takes the best feature for a label randomly chosen with probability inversely proportional to its frequency. Each feature taken in turn is removed from the q feature rankings.

Algorithm 3.1 suggests a generic implementation for all the strategies, which can be optimized according to each one for code optimization. In what follows, the main procedures and variables of this algorithm are described.

Textual data is often represented as sparse data, as not all features (words) will occur in every instance (document). This property is considered by the procedure invertedIndexes (Line 2) for countings. As result, there are M + q rows, such that each row consists in the inverted indexes linking a feature or a label to the instances they occur. Thus, a (feature) true positive is veri ed every time a feature and a label co-occur in the same instance (Line 7). Based on the number of inverted indexes, i.e., the frequency of each feature or label, f p, f n and tn are easily set. Then the score calculated by a feature evaluation measure f e (Line 15) is set to the matrix of feature rankings F RM .

Algorithm 3.1 ends with the application of one of the strategies in Line 19. It should be emphasized that the Equations 3 and 4 would use the matrix F RM instead of reapplying the feature evaluation measure f e.

Algorithm 3.1 implementation can support parallelization and serialization, enabling user to save time/space in large datasets. First, the algorithm is split into several independent tasks, which is helpful to successful parallelization [ 11 ]. Second, serialization is considered to enable the algorithm to save a stream of bytes in the disk and load it back when necessary, releasing space in memory. Algorithm 3.1 Generic implementation for the aggregation strategies Input: M ulti-label dataset D Output: F eature ranking F R 1: Initialize tp and F RM 2: finvertedF eatureIndexes;invertedLabelIndexesg 3: for each label of invertedLabelIndexes do 4: for each f eature of invertedF eatureIndexes do 5: for each invertedIndexF of f eature do 6: for each invertedIndexL of label do 7: if invertedIndexF = invertedIndexL then 8: tp tp + 1 9: end if 10: end for 11: end for 12: f p numberIndexes(f eature) tp 13: f n numberIndexes(label) tp 14: tn N (tp + f p + f n) 15: F RM [label][f eature] f e(tp;f p;f n;tn) 16: Reinitialize tp 17: end for 18: end for 19: F R aggregationStrategy(F RM ) 20: return F R invertedIndexes(D) 4

Experimental evaluation

In this work, 8 text FS methods (2 f eature evaluation measures 4 strategies) are applied in 20 benchmark datasets. The best methods are after compared with the classi ers built using All Features (AF) and using the features selected by Random Feature Selection (RFS) [ 8 ]. The RaR strategy and RFS were executed three times due to their stochasticity, and the correspondent Micro F-Measure values from RaR and RFS were averaged before calculating the average ranking.

Some implemented procedures use Weka [ 30 ] and LIBLINEAR [ 7 ] resources. All the reported classi cation results were obtained by Mulan [ 27 ], a framework for multi-label classi cation, using 10-fold cross-validation with paired folds. 4.1

Datasets and experimental setup

3http://mulan.sourceforge.net/datasets.html 4http://meka.sourceforge.net/#datasets

After applying each FS method in a dataset, the BR + Linear SVM (BRLL) method, e cient to classify large sparse datasets [ 7 ], was used. BRLL classi ers were built from data described by the best t features found by a FS method, in which t = 10%; 20%; : : : ; 90% of the number of features M . The learning algorithm was executed with SVM C = 3, tolerance of stopping criterion e = 0:001 and remaining parameters with default values5.

All classi cation models built were evaluated according to Micro FMeasure [ 26 ]. This evaluation measure, de ned by Equation 5, has values in the interval [0::1] and the higher its value, the better the multi-label classi er performance is. Let TPyi , FPyi , TNyi and FNyi be, respectively, the number of true/false positives/negatives for a label yj from the set of labels L. 2 Pq M icro F -M easure(H;D) = 2 Pjq=1 TPyj + Pjqj==11 FTPPyyjj + Pjq=1 FNyj : (5) 4.2

Results and discussion

The micro F-measure of the 8 feature selection methods at the 9 percentages of selected features for each one of the 20 datasets are available in an online appendix6. Following the recommendations in [ 5 ], we will here compare di erent feature selection approaches at speci c percentages of selected features based on their average rankings across all datasets.

We rst discuss the relative performance of the 4 aggregation strategies (Max, Mean, RoR, RaR) for each feature evaluation measure (CS, BNS) separately. 5Solvers in LIBLINEAR are insensitive to C.

6http://tiny.cc/e0ke3w

We notice that for both CS and BNS the Mean aggregation performs best when the percentage of features is low (up to 50% for BNS and 40% for CS). For larger percentages of features RaR (and RoR in the case of 90%) performs best for BNS, while Max performs best for CS. Recall that for each feature, Mean averages the scores across all labels, while Max, RoR and RaR are based on a single label. Therefore, at lower number of features, it is probably the case that Max, RoR and RaR are not considering enough features for some of the labels in contrast to Mean. As the percentage of selected features increases, Max, RoR and Rar manage to select enough features for all labels and outweigh Mean, which is selecting features that work well for all labels on average.

The nding that Mean and Max lead to good classi cation models agrees with earlier publications which combine them with di erent feature evaluation measures, such as ReliefF and IG [ 23 ], CS [ 25,28,20 ] and mutual information [ 31 ].

We now focus on methods that in the previous comparison achieved the best average ranking for more than one percentage of selected features. These are: MeanBNS, RaRBNS, MaxCS and MeanCS. We will discuss the relative performance of these methods along with the baselines of using All Features (AF) and Random Feature Selection (RFS). Table 5 shows the corresponding average rankings and standard deviations. Note that the performance of AF is the same independently of the percentage of selected features, yet its relative ranking with respect to competing methods can and does di er.

We notice that the best (lowest) average ranking is achieved by Max and Mean combined with CS. Besides obtaining better ranking than RFS, they also outperform AF, ful lling the requirements of any reasonable feature selection method. CS, which behaves erratically for very small expected counts common in text classi cation, was found worse than BNS for single-label classi cation [ 8 ]. However, we here see that this scenario is reversed in the case of multi-label classi cation. The strategies Mean and Max seem to mitigate the disadvantage of CS. This could be because they consider more than one label, with di erent expected counts, when evaluating features.

We complete the comparison of the 8 feature selection methods by analyzing the similarity of the feature subsets that are selected by each method. In particular, we calculate a similarity index between each pair of methods [ 12 ]. This could be useful, for example, to identify diverse feature selection methods for constructing ensembles [ 1 ]. In this analysis, only one run of RaR is considered. For the sake of saving space, Table 6 shows the similarity values averaged across all datasets at a speci c percentage of features (t = 50%), highlighting similarity values larger than 0:7 with bold typeface. Nevertheless, the patterns found also occur for other percentage of selected features.

We rst notice that CS methods are quite di erent from BNS methods, as one would expect. Within BNS methods, we see that MeanBNS selects quite di erent feature subsets from the ones found by the other BNS methods, which in turn select relatively similar feature subsets. Within CS methods, we see 3 pairs of methods selecting similar features: RaR/RoR, Mean/Max and Mean/RoR. 5

Conclusion

This work evaluated 8 FS methods to support multi-label text classi cation in 20 benchmark datasets. They are based on 2 feature evaluation measures and 4 strategies to consider label information while evaluating features. The best methods from this group also highlighted in an experimental comparison with the classi ers built using all features and using features randomly selected.

The popular algorithms MeanCS and MaxCS, which respectively rank features according to the average or the maximum Chi-squared score across all labels, led to most of the best classi ers while using less features. The former was the best choice when the number of features was smaller. As the number of features increased, the latter yielded the best classi ers.

Future work will apply some of the best FS methods and their optimized implementation to rank features in large textual datasets. Furthermore, we plan to evaluate e cient FS methods which are able to consider label information in a higher level than the one considered by MeanCS and MaxCS [ 32 ].

Acknowledgment

This research was supported by the S~ao Paulo Research Foundation (FAPESP), grant 2012/23906-2. The authors would like to thank Maria C. Monard, Huei D. Lee and Everton A. Cherman for their support in multi-label feature selection.

1. Cannas , L.M. , Dess , N. , Pes , B. : Assessing similarity of feature selection techniques in high-dimensional domains . Pattern Recognition Letters 34 ( 12 ), 1446 { 1453 ( 2013 )

2 . Chang , Y.C. , Chen , S.M. , Liau , C.J.: Multilabel text categorization based on a new linear classi er learning method and a category-sensitive re nement method . Expert Systems with Applications 34 ( 3 ), 1948 { 1953 ( 2008 )

3. Chen , W. , Yan , J. , Zhang, B. , Chen , Z. , Yang , Q. : Document transformation for multi-label feature selection in text categorization . In: IEEE International Conference on Data Mining . pp. 451 { 456 ( 2007 )

4. Dembczynski , K. , Waegeman , W., Cheng, W., H?llermeier, E.: On label dependence and loss minimization in multi-label classi cation . Machine Learning 88 , 5{ 45 ( 2012 )

5. Demsar , J.: Statistical comparisons of classi ers over multiple data sets . Journal of Machine Learning Research 7 , 1 { 30 ( 2006 )

6. Dendamrongvit , S. , Vateekul , P. , Kubat , M. : Irrelevant attributes and imbalanced classes in multi-label text-categorization domains . Intelligent Data Analysis 15 ( 6 ), 843 { 859 ( 2011 )

7. Fan , R.E. , Chang , K.W. , Hsieh , C.J. , Wang , X.R. , Lin , C.J.: LIBLINEAR: A library for large linear classi cation . Journal of Machine Learning Research 9 , 1871 { 1874 ( 2008 )

8. Forman , G. : An extensive empirical study of feature selection metrics for text classi cation . Journal of Machine Learning Research 3 , 1289 { 1305 ( 2003 )

9. Forman , G.: A pitfall and solution in multi-class feature selection for text classi - cation . HPL-2004- 86 ( 2004 ), Hewlett Packard

10. Furnkranz, J., Hullermeier, E., Menc

, E.L. , Brinker , K. : Multilabel classi cation via calibrated label ranking . Machine Learning 73 ( 2 ), 133 { 153 ( 2008 )

11. Kononenko, I., Robnik-Sikonja , M. : Weighting and local methods non-myopic feature quality evaluation with (R)ReliefF . In: Liu, H. , Motoda , H. (eds.) Computational Methods of Feature Selection , pp. 169 { 191 . Chapman & Hall/CRC ( 2008 )

12. Kuncheva , L.I.: A stability index for feature selection . In: IASTED International Multi-Conference: arti cial intelligence and applications . pp. 390 { 395 ( 2007 )

13. Lastra , G. , Luaces , O. , Quevedo , J.R. , Bahamonde , A. : Graphical feature selection for multilabel classi cation tasks . In: International Conference on Advances in Intelligent Data Analysis . pp. 246 { 257 ( 2011 )

14. Lee , J. , Kim , D.W. : Feature selection for multi-label classi cation using multivariate mutual information . Pattern Recognition Letters 34 ( 3 ), 349 { 357 ( 2013 )

15. Lewis , D.D. , Yang , Y. , Rose , T.G. , Li , F. : Rcv1: A new benchmark collection for text categorization research . Journal of Machine Learning Research 5 , 361 { 397 ( 2004 )

16. Liu , H. , Motoda , H.: Computational Methods of Feature Selection . Chapman & Hall/CRC ( 2008 )

17. Mayne , A. , Perry , R.: Hierarchically classifying documents with multiple labels . In: IEEE Symposium on Computational Intelligence and Data Mining . pp. 133 { 139 ( 2009 )

18. Nardiello , P. , Sebastiani , F. , Sperduti , A. : Discretizing continuous attributes in adaboost for text categorization . In: European Conference on Information Retrieval Research . pp. 320 { 334 ( 2003 )

19. Novovicova , J. , Somol , P. , Haindl , M. , Pudil , P. : Conditional mutual information based feature selection for classi cation task . In: Iberoamerican conference on Progress in pattern recognition, image analysis and applications . pp. 417 { 426 ( 2007 )

20. Olsson , J.O.S. , Oard , D.W.: Combining feature selectors for text classi cation . In: ACM International Conference on Information and Knowledge Management . pp. 798 { 799 ( 2006 )

21. Saleh , S.N. , El-Sonbaty , Y. : A feature selection algorithm with redundancy reduction for text classi cation . In: International Symposium on Computer and Information Sciences . pp. 1 { 6 ( 2007 )

22. Spolao^r, N., Monard , M.C. , Lee , H.D.: A systematic review to identify feature selection publications in multi-labeled data . ICMC Technical Report No 374. 31 pg . ( 2012 ), University of S?o Paulo

23. Spolao^r, N., Cherman , E.A. , Monard , M.C. , Lee , H.D.: A comparison of multi-label feature selection methods using the problem transformation approach . Electronic Notes in Theoretical Computer Science 292 , 135 { 151 ( 2013 )

24. Toutanova , K. , Chen , F. , Popat , K. , Hofmann , T. : Text classi cation in a hierarchical mixture model for small training sets . In: International Conference on Information and Knowledge Management . pp. 105 { 113 ( 2001 )

25. Trohidis , K. , Tsoumakas , G. , Kalliris , G. , Vlahavas, I.: Multi-label classi cation of music into emotions . In: International Conference on Music Information Retrieval . pp. 1 { 6 ( 2008 )

26. Tsoumakas , G. , Katakis , I. , Vlahavas , I.P. : Mining multi-label data . In: Maimon, O. , Rokach , L . (eds.) Data Mining and Knowledge Discovery Handbook , pp. 667 { 685 . Springer ( 2010 )

27. Tsoumakas , G. , Spyromitros-Xiou s , E., Vilcek , J. , Vlahavas , I. : Mulan: A java library for multi-label learning . Journal of Machine Learning Research 12 , 2411 { 2414 ( 2011 )

28. Tsoumakas , G. , Vlahavas , I. : Random k-labelsets: An ensemble method for multilabel classi cation . In: European Conference on Machine Learning . pp. 406 { 417 ( 2007 )

29. Wang , B. , Jia , Y ., Han, Y ., Han , W.: E ective feature selection on data with uncertain labels . In: IEEE International Conference on Data Engineering . pp. 1657 { 1662 ( 2009 )

30. Witten , I.H. , Frank , E. : Data Mining: Practical Machine Learning Tools and Techniques . Morgan Kaufmann ( 2011 )

31. Yang , Y. , Pedersen , J.O.: A comparative study on feature selection in text categorization . In: International Conference on Machine Learning . pp. 412 { 420 ( 1997 )

32. Zhang , M. , Zhou , Z. : A review on multi-label learning algorithms (in press) . IEEE Transactions on Knowledge and Data Engineering PrePrints(PrePrints) , 1 { 1 ( 2013 )

33. Zhang , M.L. , Pen~a, J.M. , Robles , V.: Feature selection for multi-label Naive Bayes classi cation . Information Sciences 179 , 3218 { 3229 ( 2009 )

34. Zhao , Z. , Morstatter , F. , Sharma , S. , Alelyani , S. , Anand , A. , Liu , H.: Advancing feature selection research - ASU feature selection repository . Technical Report ( 2011 ), Arizona State University

35. Zheng , Z. , Wu , X. , Srihari , R.: Feature selection for text categorization on imbalanced data . SIGKDD Explorations Newsletter 6 ( 1 ), 80 { 89 ( 2004 )