Evaluating Feature Selection Methods for Multi-Label Text Classification

Newton Spolaôr^1 and Grigorios Tsoumakas^2

^1 Laboratory of Computational Intelligence, Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, Brazil
^2 Machine Learning and Knowledge Discovery, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
newtonspolaor@gmail.com, greg@csd.auth.gr

Abstract. Multi-label text classification deals with problems in which each document is associated with a subset of categories. These documents often consist of a large number of words, which can hinder the performance of learning algorithms. Feature selection is a popular task to find representative words and remove unimportant ones, which can speed up learning and even improve learning performance. This work evaluates eight feature selection algorithms on text benchmark datasets. The best algorithms are subsequently compared with random feature selection and with classifiers built using all features. The results agree with the literature in finding that well-known approaches, such as taking the maximum chi-squared score across all labels, are good choices for reducing text dimensionality while reaching competitive multi-label classification performance.

Keywords: problem transformation, binary relevance, round-robin, rand-robin, chi-squared, bi-normal separation

1 Introduction

Classical single-label learning deals with problems in which each dataset instance (or example) is described by a set of features and associated with only one label from a set of labels L. Single-label text classification (or text categorization), for example, learns from data in which each document has a unique category (topic) as its label. If |L| = 2, this task is called binary text classification; if |L| > 2, it is called multi-class text classification.

Although a large amount of research has been carried out on single-label learning, the corresponding learning algorithms are not well suited to applications in which instances are annotated with subsets of labels from L. Even in some text categorization problems, each document is labeled with several topics simultaneously, so the learning algorithm should tackle more than one label at a time to learn accordingly. Motivated by this scenario, multi-label learning algorithms have been developed [32,26].

Irrelevant and/or redundant features can hinder the performance of single-label and multi-label learning algorithms due to the "curse of dimensionality" [16]. Thus, the Feature Selection (FS) task is often applied before learning to find features which describe the dataset as well as, or even better than, the original set of features does, and to remove the remaining ones. FS also speeds up learning algorithms and sometimes improves their performance [34].

Research on multi-label feature selection is still scarce. For example, many publications evaluate a number of FS algorithms on only a few multi-label datasets. This work contributes to reducing this gap by comparing 8 FS methods on 20 multi-label text classification datasets (9 from different sources and 11 from a single web page). The methods combine 2 feature evaluation measures, Chi-squared (CS) and Bi-Normal Separation (BNS) [8], with 4 aggregation strategies to tackle multiple labels [9], some of them still unexplored for multi-label datasets. Results show that well-known approaches, such as considering the maximum CS score of each feature across all labels, led to some of the best classification models.
The rest of this work is organized as follows: Section 2 briefly presents multi-label learning, FS and related work. Section 3 describes the methods evaluated in Section 4, which is followed by the conclusion and future work in Section 5.

2 Background

This section describes basic notation and concepts related to multi-label learning and feature selection. Related work on multi-label feature selection for textual datasets is also considered.

2.1 Multi-label learning

Let D be a dataset composed of N examples Ei = (xi, Yi), i = 1..N. Each example (instance) Ei is associated with a feature vector xi = (xi1, xi2, . . . , xiM) described by M features (attributes) Xj, j = 1..M, and with a subset of labels Yi ⊆ L, where L = {y1, y2, . . . , yq} is the set of q labels. Table 1 shows this representation. In this scenario, the multi-label learning task consists of generating a model H which, given an unseen instance E = (x, ?), is capable of accurately predicting its subset of labels Y, i.e., H(E) → Y.

Table 1. Multi-label data.

        X1    X2    . . .  XM    Y
E1      x11   x12   . . .  x1M   Y1
E2      x21   x22   . . .  x2M   Y2
...     ...   ...   . . .  ...   ...
EN      xN1   xN2   . . .  xNM   YN

Multi-label learning methods can be organized into two main categories: algorithm adaptation and problem transformation [26]. The former includes learning algorithms extended to deal with multi-label data directly, such as the Multi-label Naive Bayes algorithm [33]. The latter category consists of algorithm-independent methods: any state-of-the-art single-label learning algorithm can learn from each single-label problem generated by these methods. The Binary Relevance (BR) approach exemplifies this category by transforming a multi-label dataset into q single-label datasets, learning from each single-label problem separately and combining the results.

Furthermore, exploiting label dependence during learning can improve performance [4]. An alternative categorization organizes multi-label learning methods based on the order of label correlations taken into account [32]. First-order strategies ignore the co-existence of other labels during learning, as BR does. Second-order strategies, exemplified by Calibrated Label Ranking [10], consider pairwise relations between labels. High-order strategies, such as Random k-labelsets [28], consider relations among more labels. Although high-order strategies can potentially model wider label correlations, they are usually computationally more demanding. In this work, the problem transformation, first-order strategy BR is used for classification.

2.2 Feature selection

FS for multi-label text datasets often applies single-label feature evaluation measures, i.e., measures that score the quality of features, after using problem transformation approaches such as BR [6,31]. Moreover, these measures usually follow the filter approach [16]. Unlike the wrapper and embedded approaches, filters remove irrelevant and/or redundant features regardless of the learning algorithm, which can save computational resources when working with large datasets. Both FS measures used in this work agree with these popular choices.

CS and BNS share the same notation [8]. Let tp, fp, fn and tn be the number of (feature) true positives, false positives, false negatives and true negatives in a binary dataset. In this scenario, tp counts the cases in which the feature and the label under evaluation co-occur, i.e., both are positive, while fp counts the cases in which only the feature is positive.
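To make these counts concrete, the following minimal sketch derives tp, fp, fn and tn for one feature-label pair from a Binary Relevance view of the data; the toy document-term matrix, the label sets and the helper name contingency are illustrative assumptions, not part of the original study.

```python
import numpy as np

# Hypothetical bag-of-words matrix (N documents x M binary word indicators)
# and label sets; both are toy data, not taken from the paper's benchmarks.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
Y = [{"sports"}, {"sports", "politics"}, {"politics"}, set()]

def contingency(feature_col, label):
    """Return (tp, fp, fn, tn) for one feature and one label (BR view)."""
    has_label = np.array([label in y for y in Y])
    has_feature = X[:, feature_col] == 1
    tp = int(np.sum(has_feature & has_label))
    fp = int(np.sum(has_feature & ~has_label))
    fn = int(np.sum(~has_feature & has_label))
    tn = int(np.sum(~has_feature & ~has_label))
    return tp, fp, fn, tn

print(contingency(0, "sports"))  # (2, 0, 0, 2) for this toy data
```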
Defining the remaining notation is straightforward. Chi-squared estimates the degree to which the occurrence of a feature Xj and the occurrence of a label yi deviate from independence, such that the higher the measure value, the more related Xj and yi are. CS is defined by Equation 1, where P_{pos} = (tp + fn)/(tp + fp + fn + tn), P_{neg} = (fp + tn)/(tp + fp + fn + tn) and t(count, expect) = (count − expect)^2/expect.

CS(tp, fp, fn, tn) = t(tp, (tp + fp)P_{pos}) + t(fn, (fn + tn)P_{pos}) + t(fp, (tp + fp)P_{neg}) + t(tn, (fn + tn)P_{neg}).   (1)

Bi-Normal Separation measures the separation between two thresholds (one for the positive and one for the negative class) under a Gaussian curve. This measure models the occurrence of a feature Xj in the documents as the event of a random Normal variable exceeding a hypothetical threshold, such that the frequency of Xj corresponds to the area under the curve past the threshold. As defined by Equation 2, the larger the difference between the thresholds, the better the feature Xj is. Let F^{-1} be the inverse cumulative probability function of the standard Normal distribution (z-score), tpr = tp/(tp + fn) and fpr = fp/(fp + tn).

BNS(tp, fp, fn, tn) = |F^{-1}(tpr) − F^{-1}(fpr)|.   (2)

2.3 Related work

FS has been an active research topic in supervised learning, with several related publications and comprehensive surveys [34,26,16]. Most of this research has been proposed to support single-label classification, but there are also many publications on feature selection for multi-label text classification.

The systematic review process, a method to perform a wide, replicable and rigorous literature review, was carried out in [22] and recently updated to search for multi-label FS publications. Most of the methods found are filters, which are useful for saving time and space when working with large textual datasets. Table 2 summarizes some publications on filter FS. As mentioned, many publications use only a few datasets to evaluate their methods. Moreover, Information Gain (IG) and CS, two potentially related measures [31], are the most usual ones.

Table 2. Multi-label filter feature selection publications.

Paper  Number of datasets used  Feature evaluation measure
[31]   2   IG, CS, document frequency, term strength and mutual information
[24]   2   IG
[18]   1   CS
[15]   1   CS
[35]   1   IG, CS, correlation coefficient, odds ratio, odds ratio-square and signed IG
[20]   1   IG, CS and document frequency
[3]    2   IG, CS and orthogonal centroid feature selection
[19]   1   conditional mutual information
[21]   1   minimal redundancy maximal relevance
[28]   3   CS
[2]    1   mutual information
[17]   1   BNS
[29]   5   Hilbert-Schmidt independence criterion
[6]    6   IG
[13]   8   symmetrical uncertainty
[14]   3   mutual information
[23]   10  IG and ReliefF

3 Multi-label feature selection methods

After transforming a multi-label dataset into q binary datasets by BR and counting the number of (feature) true/false positives/negatives, any feature evaluation measure for binary data can be applied according to the macro-averaged approach [15], exemplified in [23,25,28,20]. In what follows, four aggregation strategies are described. Let tp_{yi}, fp_{yi}, tn_{yi} and fn_{yi} be the number of (feature) true/false positives/negatives for a label yi, i = 1..q, and a feature Xj, j = 1..M.

The well-known Mean strategy (Mean) [31] averages the scores obtained after applying the measure fe to the binary dataset related to each label yi (Equation 3). Max (Max), which returns the maximum score obtained across all labels (Equation 4), is also popular.
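Both measures can be computed directly from the four counts. The sketch below follows Equations 1 and 2; the scipy dependency and the clipping of the rates away from 0 and 1 (to keep the inverse Normal finite) are our own assumptions rather than choices reported in the paper.

```python
from scipy.stats import norm

def chi_squared(tp, fp, fn, tn):
    """Equation 1: chi-squared score of a feature for one label."""
    total = tp + fp + fn + tn
    p_pos = (tp + fn) / total
    p_neg = (fp + tn) / total
    def t(count, expected):
        return (count - expected) ** 2 / expected if expected > 0 else 0.0
    return (t(tp, (tp + fp) * p_pos) + t(fn, (fn + tn) * p_pos) +
            t(fp, (tp + fp) * p_neg) + t(tn, (fn + tn) * p_neg))

def bns(tp, fp, fn, tn, eps=0.0005):
    """Equation 2: bi-normal separation. Rates are clipped away from 0 and 1
    because the inverse Normal CDF is unbounded there (an assumption of this sketch)."""
    tpr = min(max(tp / (tp + fn), eps), 1 - eps)
    fpr = min(max(fp / (fp + tn), eps), 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

print(chi_squared(20, 5, 10, 65), bns(20, 5, 10, 65))
```

These per-label scores play the role of the measure fe that the aggregation strategies formalized by Equations 3 and 4 combine across labels.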
Mean(Xj) = (1/q) \sum_{i=1}^{q} fe(tp_{yi}, fp_{yi}, tn_{yi}, fn_{yi}).   (3)

Max(Xj) = \max_{i=1..q} fe(tp_{yi}, fp_{yi}, tn_{yi}, fn_{yi}).   (4)

The finding that some feature evaluation measures can be blinded by a surplus of strongly predictive features for frequent labels, while largely ignoring features needed to discriminate hard (low frequency) labels, motivated the proposal of Round-Robin (RoR) and Rand-Robin (RaR) [9]. After calculating q feature rankings, the former takes the best feature in the ranking related to each label in turn. The latter takes the best feature for a label randomly chosen with probability inversely proportional to its frequency. Each feature taken in turn is removed from the q feature rankings.

Algorithm 3.1 suggests a generic implementation for all the strategies, which can be further optimized for each particular strategy. In what follows, the main procedures and variables of this algorithm are described.

Textual data is often represented as sparse data, as not all features (words) occur in every instance (document). This property is exploited by the procedure invertedIndexes (Line 2) when counting. As a result, there are M + q rows, each consisting of the inverted indexes that link a feature or a label to the instances in which it occurs. Thus, a (feature) true positive is counted every time a feature and a label co-occur in the same instance (Line 7). Based on the number of inverted indexes, i.e., the frequency of each feature or label, fp, fn and tn are easily set. The score calculated by a feature evaluation measure fe (Line 15) is then stored in the matrix of feature rankings FRM. Algorithm 3.1 ends with the application of one of the strategies in Line 19. It should be emphasized that Equations 3 and 4 would use the matrix FRM instead of reapplying the feature evaluation measure fe.

The implementation of Algorithm 3.1 can support parallelization and serialization, enabling the user to save time and space on large datasets. First, the algorithm can be split into several independent tasks, which is helpful for successful parallelization [11]. Second, serialization enables the algorithm to save a stream of bytes to disk and load it back when necessary, releasing memory.

Algorithm 3.1 Generic implementation for the aggregation strategies
Input: Multi-label dataset D
Output: Feature ranking FR
 1: Initialize tp and FRM
 2: {invertedFeatureIndexes, invertedLabelIndexes} ← invertedIndexes(D)
 3: for each label of invertedLabelIndexes do
 4:   for each feature of invertedFeatureIndexes do
 5:     for each invertedIndexF of feature do
 6:       for each invertedIndexL of label do
 7:         if invertedIndexF = invertedIndexL then
 8:           tp ← tp + 1
 9:         end if
10:       end for
11:     end for
12:     fp ← numberIndexes(feature) − tp
13:     fn ← numberIndexes(label) − tp
14:     tn ← N − (tp + fp + fn)
15:     FRM[label][feature] ← fe(tp, fp, fn, tn)
16:     Reinitialize tp
17:   end for
18: end for
19: FR ← aggregationStrategy(FRM)
20: return FR

4 Experimental evaluation

In this work, 8 text FS methods (2 feature evaluation measures × 4 strategies) are applied to 20 benchmark datasets. The best methods are afterwards compared with classifiers built using All Features (AF) and with classifiers using the features selected by Random Feature Selection (RFS) [8]. The RaR strategy and RFS were executed three times due to their stochasticity, and the corresponding Micro F-Measure values from RaR and RFS were averaged before calculating the average rankings.
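For illustration, the sketch below applies the Mean, Max and Round-Robin strategies to a score matrix FRM such as the one filled by Algorithm 3.1, and keeps the best t% of features; the random placeholder scores, the function names and the value of t are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
q, M = 5, 12                       # toy numbers of labels and features
FRM = rng.random((q, M))           # placeholder for fe scores per (label, feature)

def mean_strategy(frm):
    return frm.mean(axis=0)        # Equation 3: average score across labels

def max_strategy(frm):
    return frm.max(axis=0)         # Equation 4: best score across any label

def round_robin(frm, k):
    """RoR: cycle over the q per-label rankings, taking each best unseen feature."""
    rankings = [list(np.argsort(row)[::-1]) for row in frm]
    chosen = []
    while len(chosen) < min(k, frm.shape[1]):
        for ranking in rankings:
            while ranking and ranking[0] in chosen:
                ranking.pop(0)     # drop features already taken by another label
            if ranking and len(chosen) < k:
                chosen.append(ranking.pop(0))
    return chosen

def top_t_percent(scores, t):
    k = max(1, int(round(t * len(scores))))
    return np.argsort(scores)[::-1][:k]          # best features first

print(top_t_percent(max_strategy(FRM), 0.30))    # a MaxCS/MaxBNS-style subset
print(round_robin(FRM, 4))                       # a RoR-style subset
```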
Some implemented procedures use Weka [30] and LIBLINEAR [7] resources. All the reported classification results were obtained with Mulan [27], a framework for multi-label learning, using 10-fold cross-validation with paired folds.

4.1 Datasets and experimental setup

Table 3 shows, for each dataset, its name, the number of instances (N), features (M) and labels (q), the label cardinality (LC), which is the average number of labels associated with each example, the label density (LD), which is LC normalized by the total number of labels, and the number of distinct combinations (DC) of labels. The datasets were obtained from the Mulan^3 and Meka^4 repositories.

^3 http://mulan.sourceforge.net/datasets.html
^4 http://meka.sourceforge.net/#datasets

Table 3. Benchmark datasets used.

Name                  N      M      q    LC     LD    DC
arts^3                7484   23146  26   1.65   0.06  599
bibtex^3              7395   1836   159  2.40   0.02  2856
bookmarks^3           87856  2150   208  2.03   0.01  18716
business^3            11214  21924  30   1.60   0.05  233
computers^3           12444  34096  33   1.51   0.05  428
delicious^3           16105  500    983  19.02  0.02  15806
education^3           12030  27534  33   1.46   0.04  511
enron^3               1702   1001   53   5.31   0.06  753
entertainment^3       12730  32001  21   1.41   0.07  337
health^3              9205   30605  32   1.64   0.05  335
medical^3             978    1449   45   1.25   0.03  94
language log^4        13929  1002   23   1.66   0.07  1147
rcv1v2 (subset1)^3    6000   47236  101  2.88   0.03  1028
recreation^3          12828  30324  22   1.43   0.06  530
reference^3           8027   39679  33   1.17   0.04  275
science^3             6428   37187  40   1.45   0.04  457
social^3              12111  52350  39   1.28   0.03  361
society^3             14512  31802  27   1.67   0.06  1054
slashdot^4            3782   1079   22   1.18   0.05  156
tmc2007-500^3         28596  500    22   2.21   0.10  1341

After applying each FS method to a dataset, the BR + Linear SVM (BRLL) method, which is efficient for classifying large sparse datasets [7], was used. BRLL classifiers were built from data described by the best t features found by an FS method, where t = 10%, 20%, . . . , 90% of the number of features M. The learning algorithm was executed with SVM C = 3, tolerance of the stopping criterion e = 0.001 and the remaining parameters at their default values^5.

^5 Solvers in LIBLINEAR are insensitive to C.

All classification models built were evaluated according to the Micro F-Measure [26]. This evaluation measure, defined by Equation 5, takes values in the interval [0, 1], and the higher its value, the better the multi-label classifier performance is. Let TP_{yj}, FP_{yj}, TN_{yj} and FN_{yj} be, respectively, the number of true/false positives/negatives for a label yj from the set of labels L.

Micro F-Measure(H, D) = 2 \sum_{j=1}^{q} TP_{yj} / (2 \sum_{j=1}^{q} TP_{yj} + \sum_{j=1}^{q} FP_{yj} + \sum_{j=1}^{q} FN_{yj}).   (5)

4.2 Results and discussion

The Micro F-Measure values of the 8 feature selection methods at the 9 percentages of selected features for each one of the 20 datasets are available in an online appendix^6. Following the recommendations in [5], we here compare the different feature selection approaches at specific percentages of selected features based on their average rankings across all datasets. We first discuss the relative performance of the 4 aggregation strategies (Max, Mean, RoR, RaR) for each feature evaluation measure (CS, BNS) separately.

^6 http://tiny.cc/e0ke3w

Table 4 shows the mean and standard deviation of the ranking of the 4 aggregation strategies (columns) at the 9 different percentages of selected features (rows) across all datasets for BNS (left) and CS (right). The best (lowest) average ranking for each evaluation measure and percentage of selected features is highlighted in bold.
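These average rankings can be reproduced with a short routine: rank the methods on each dataset by Micro F-Measure (rank 1 is best, ties share the mean rank) and average the ranks across datasets. The sketch below assumes a small dictionary of placeholder scores; none of the numbers come from the actual experiments.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical Micro F-Measure values: method -> score on each dataset (placeholders).
scores = {
    "MaxCS":  [0.71, 0.64, 0.58],
    "MeanCS": [0.72, 0.63, 0.60],
    "MaxBNS": [0.69, 0.61, 0.57],
}

def average_rankings(scores):
    names = list(scores)
    table = np.array([scores[n] for n in names])   # rows: methods, columns: datasets
    # Rank per dataset: higher Micro F-Measure -> better -> lower rank (1 is best).
    ranks = np.apply_along_axis(lambda col: rankdata(-col), 0, table)
    return {n: (ranks[i].mean(), ranks[i].std()) for i, n in enumerate(names)}

print(average_rankings(scores))
```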
Figure 1 presents the same values as graphs of the average ranking (y-axis) with respect to the percentage of selected features (x-axis) for BNS (left) and CS (right).

Table 4. Mean and standard deviation of the ranking of the 4 aggregation strategies (columns) at the 9 different percentages of selected features (rows) across all datasets for BNS (left) and CS (right).

t    MaxBNS      MeanBNS     RoRBNS      RaRBNS      MaxCS       MeanCS      RoRCS       RaRCS
10%  2.00 (0.7)  1.95 (1.4)  3.15 (0.8)  2.90 (1.0)  2.50 (1.0)  1.40 (0.8)  3.10 (0.9)  3.00 (1.0)
20%  2.10 (0.8)  2.05 (1.4)  3.05 (1.0)  2.80 (1.0)  2.50 (0.9)  1.65 (1.1)  2.80 (1.1)  3.05 (1.0)
30%  2.10 (1.0)  2.10 (1.4)  3.15 (0.9)  2.65 (0.9)  2.30 (1.0)  1.90 (1.2)  2.80 (1.0)  3.00 (1.0)
40%  2.40 (1.1)  2.15 (1.4)  3.10 (1.0)  2.35 (0.8)  2.25 (1.0)  1.95 (1.2)  2.50 (1.0)  3.30 (1.0)
50%  2.40 (1.1)  2.30 (1.5)  2.95 (1.1)  2.35 (0.7)  2.00 (0.8)  2.20 (1.3)  2.80 (1.1)  3.00 (1.0)
60%  2.50 (1.1)  2.50 (1.3)  2.95 (0.9)  2.05 (1.1)  1.90 (1.1)  2.15 (1.0)  2.60 (0.9)  3.35 (1.0)
70%  2.65 (1.4)  2.68 (1.3)  2.48 (0.9)  2.20 (0.8)  2.00 (1.0)  2.20 (1.0)  2.80 (1.0)  3.00 (1.2)
80%  2.80 (1.1)  2.55 (1.4)  2.50 (1.2)  2.15 (0.8)  2.25 (1.2)  2.35 (1.0)  2.65 (1.2)  2.75 (1.1)
90%  2.90 (1.1)  2.40 (1.2)  2.30 (1.1)  2.40 (1.1)  2.10 (1.2)  2.70 (0.9)  2.65 (1.2)  2.55 (1.2)

[Figure 1 omitted: two line charts of average ranking versus percentage of selected features (10% to 90%), one for the BNS-based methods and one for the CS-based methods.]
Fig. 1. Average rankings of the eight methods.

We notice that for both CS and BNS the Mean aggregation performs best when the percentage of selected features is low (up to 50% for BNS and 40% for CS). For larger percentages of features, RaR (and RoR in the case of 90%) performs best for BNS, while Max performs best for CS. Recall that, for each feature, Mean averages the scores across all labels, while Max, RoR and RaR are based on a single label. Therefore, at lower numbers of features, it is probably the case that Max, RoR and RaR are not considering enough features for some of the labels, in contrast to Mean. As the percentage of selected features increases, Max, RoR and RaR manage to select enough features for all labels and overtake Mean, which selects features that work well for all labels on average.

The finding that Mean and Max lead to good classification models agrees with earlier publications which combine them with different feature evaluation measures, such as ReliefF and IG [23], CS [25,28,20] and mutual information [31].

We now focus on the methods that in the previous comparison achieved the best average ranking for more than one percentage of selected features. These are: MeanBNS, RaRBNS, MaxCS and MeanCS. We discuss the relative performance of these methods along with the baselines of using All Features (AF) and Random Feature Selection (RFS). Table 5 shows the corresponding average rankings and standard deviations. Note that the performance of AF is the same regardless of the percentage of selected features, yet its relative ranking with respect to the competing methods can and does differ.
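For reference, the RFS baseline that appears in Table 5 simply keeps a uniformly random fraction t of the features; as described at the beginning of this section, its Micro F-Measure values (like RaR's) are averaged over three runs. The sketch below assumes a placeholder evaluate callable standing in for the BRLL training and evaluation pipeline.

```python
import numpy as np

def random_feature_selection(M, t, seed):
    """RFS baseline: keep a uniformly random fraction t of the M features."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(t * M)))
    return rng.choice(M, size=k, replace=False)

def averaged_rfs_score(evaluate, M, t, runs=3):
    """Average the evaluation over three random subsets, mirroring the protocol
    used for the stochastic methods (RaR and RFS) in this section."""
    # `evaluate` is a placeholder for the BRLL + Micro F-Measure pipeline.
    return float(np.mean([evaluate(random_feature_selection(M, t, seed=r))
                          for r in range(runs)]))

# Example with a dummy evaluation function (returns subset size / M):
print(averaged_rfs_score(lambda subset: len(subset) / 100, M=100, t=0.5))
```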
Table 5. Mean and standard deviation of the ranking of MeanBNS, RaRBNS, MaxCS, MeanCS, RFS and AF (columns) at the 9 different percentages of selected features (rows) across all datasets.

t    MeanBNS     RaRBNS      MaxCS       MeanCS      RFS         AF (t = 100%)
10%  3.60 (1.4)  4.15 (1.2)  3.10 (0.9)  2.25 (1.1)  5.35 (1.1)  2.55 (2.2)
20%  3.60 (1.4)  4.05 (1.3)  2.75 (1.2)  1.95 (1.0)  5.40 (1.1)  3.25 (2.0)
30%  3.35 (1.0)  4.05 (1.4)  2.45 (1.5)  2.15 (1.1)  5.40 (1.1)  3.60 (1.9)
40%  3.30 (1.2)  3.80 (1.3)  2.15 (1.2)  2.15 (1.3)  5.45 (1.1)  4.15 (1.6)
50%  3.50 (1.4)  3.55 (1.2)  2.00 (1.0)  2.00 (1.1)  5.45 (1.1)  4.50 (1.4)
60%  3.55 (1.2)  3.20 (1.2)  1.80 (1.1)  2.25 (1.1)  5.45 (1.1)  4.75 (1.3)
70%  3.25 (1.3)  2.85 (1.0)  2.15 (1.2)  2.40 (1.4)  5.45 (1.1)  4.90 (1.1)
80%  3.35 (1.4)  2.75 (1.2)  2.25 (1.3)  2.30 (1.1)  5.50 (1.0)  4.85 (1.2)
90%  3.38 (1.5)  2.90 (1.5)  2.15 (1.5)  2.78 (1.0)  5.55 (0.8)  4.25 (1.5)

We notice that the best (lowest) average rankings are achieved by Max and Mean combined with CS. Besides obtaining better rankings than RFS, they also outperform AF, fulfilling the requirements of any reasonable feature selection method. CS, which behaves erratically for the very small expected counts that are common in text classification, was found worse than BNS for single-label classification [8]. However, here we see that this scenario is reversed in the case of multi-label classification. The strategies Mean and Max seem to mitigate this disadvantage of CS. This could be because they consider more than one label, with different expected counts, when evaluating features.

We complete the comparison of the 8 feature selection methods by analyzing the similarity of the feature subsets selected by each method. In particular, we calculate a similarity index between each pair of methods [12]. This could be useful, for example, to identify diverse feature selection methods for constructing ensembles [1]. In this analysis, only one run of RaR is considered.

To save space, Table 6 shows the similarity values averaged across all datasets at a single percentage of selected features (t = 50%), highlighting similarity values larger than 0.7 in bold typeface. Nevertheless, the patterns found also occur at other percentages of selected features.

Table 6. Similarity values of the feature subsets yielded by the eight FS algorithms.

         MaxBNS  MeanBNS  RoRBNS  RaRBNS  MaxCS  MeanCS  RoRCS  RaRCS
MaxBNS   -       0.22     0.75    0.72    0.55   0.56    0.47   0.40
MeanBNS  0.22    -        0.15    0.17    0.33   0.42    0.23   0.14
RoRBNS   0.75    0.15     -       0.82    0.48   0.47    0.44   0.48
RaRBNS   0.72    0.17     0.82    -       0.47   0.49    0.44   0.40
MaxCS    0.55    0.33     0.48    0.47    -      0.73    0.71   0.53
MeanCS   0.56    0.42     0.47    0.49    0.73   -       0.55   0.44
RoRCS    0.47    0.23     0.44    0.44    0.71   0.55    -      0.76
RaRCS    0.40    0.14     0.48    0.40    0.53   0.44    0.76   -

We first notice that the CS methods are quite different from the BNS methods, as one would expect. Within the BNS methods, we see that MeanBNS selects quite different feature subsets from the ones found by the other BNS methods, which in turn select relatively similar feature subsets. Within the CS methods, we see 3 pairs of methods selecting similar features: RaR/RoR, Mean/Max and Max/RoR.

5 Conclusion

This work evaluated 8 FS methods to support multi-label text classification on 20 benchmark datasets. They are based on 2 feature evaluation measures and 4 strategies to consider label information while evaluating features. The best methods from this group were also highlighted in an experimental comparison with classifiers built using all features and classifiers using randomly selected features.

The popular algorithms MeanCS and MaxCS, which respectively rank features according to the average or the maximum Chi-squared score across all labels, led to most of the best classifiers while using fewer features.
The former was the best choice when the number of selected features was smaller. As the number of features increased, the latter yielded the best classifiers.

Future work will apply some of the best FS methods and their optimized implementation to rank features in large textual datasets. Furthermore, we plan to evaluate efficient FS methods which are able to consider label information at a higher level than the one considered by MeanCS and MaxCS [32].

Acknowledgment

This research was supported by the São Paulo Research Foundation (FAPESP), grant 2012/23906-2. The authors would like to thank Maria C. Monard, Huei D. Lee and Everton A. Cherman for their support in multi-label feature selection.

References

1. Cannas, L.M., Dessì, N., Pes, B.: Assessing similarity of feature selection techniques in high-dimensional domains. Pattern Recognition Letters 34(12), 1446–1453 (2013)
2. Chang, Y.C., Chen, S.M., Liau, C.J.: Multilabel text categorization based on a new linear classifier learning method and a category-sensitive refinement method. Expert Systems with Applications 34(3), 1948–1953 (2008)
3. Chen, W., Yan, J., Zhang, B., Chen, Z., Yang, Q.: Document transformation for multi-label feature selection in text categorization. In: IEEE International Conference on Data Mining. pp. 451–456 (2007)
4. Dembczynski, K., Waegeman, W., Cheng, W., Hüllermeier, E.: On label dependence and loss minimization in multi-label classification. Machine Learning 88, 5–45 (2012)
5. Demsar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
6. Dendamrongvit, S., Vateekul, P., Kubat, M.: Irrelevant attributes and imbalanced classes in multi-label text-categorization domains. Intelligent Data Analysis 15(6), 843–859 (2011)
7. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871–1874 (2008)
8. Forman, G.: An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3, 1289–1305 (2003)
9. Forman, G.: A pitfall and solution in multi-class feature selection for text classification. Technical Report HPL-2004-86, Hewlett-Packard (2004)
10. Fürnkranz, J., Hüllermeier, E., Mencía, E.L., Brinker, K.: Multilabel classification via calibrated label ranking. Machine Learning 73(2), 133–153 (2008)
11. Kononenko, I., Robnik-Šikonja, M.: Non-myopic feature quality evaluation with (R)ReliefF. In: Liu, H., Motoda, H. (eds.) Computational Methods of Feature Selection, pp. 169–191. Chapman & Hall/CRC (2008)
12. Kuncheva, L.I.: A stability index for feature selection. In: IASTED International Multi-Conference: Artificial Intelligence and Applications. pp. 390–395 (2007)
13. Lastra, G., Luaces, O., Quevedo, J.R., Bahamonde, A.: Graphical feature selection for multilabel classification tasks. In: International Conference on Advances in Intelligent Data Analysis. pp. 246–257 (2011)
14. Lee, J., Kim, D.W.: Feature selection for multi-label classification using multivariate mutual information. Pattern Recognition Letters 34(3), 349–357 (2013)
15. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5, 361–397 (2004)
16. Liu, H., Motoda, H.: Computational Methods of Feature Selection. Chapman & Hall/CRC (2008)
17. Mayne, A., Perry, R.: Hierarchically classifying documents with multiple labels.
In: IEEE Symposium on Computational Intelligence and Data Mining. pp. 133–139 (2009)
18. Nardiello, P., Sebastiani, F., Sperduti, A.: Discretizing continuous attributes in AdaBoost for text categorization. In: European Conference on Information Retrieval Research. pp. 320–334 (2003)
19. Novovičová, J., Somol, P., Haindl, M., Pudil, P.: Conditional mutual information based feature selection for classification task. In: Iberoamerican Conference on Progress in Pattern Recognition, Image Analysis and Applications. pp. 417–426 (2007)
20. Olsson, J.O.S., Oard, D.W.: Combining feature selectors for text classification. In: ACM International Conference on Information and Knowledge Management. pp. 798–799 (2006)
21. Saleh, S.N., El-Sonbaty, Y.: A feature selection algorithm with redundancy reduction for text classification. In: International Symposium on Computer and Information Sciences. pp. 1–6 (2007)
22. Spolaôr, N., Monard, M.C., Lee, H.D.: A systematic review to identify feature selection publications in multi-labeled data. ICMC Technical Report No. 374, 31 pg., University of São Paulo (2012)
23. Spolaôr, N., Cherman, E.A., Monard, M.C., Lee, H.D.: A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science 292, 135–151 (2013)
24. Toutanova, K., Chen, F., Popat, K., Hofmann, T.: Text classification in a hierarchical mixture model for small training sets. In: International Conference on Information and Knowledge Management. pp. 105–113 (2001)
25. Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I.: Multi-label classification of music into emotions. In: International Conference on Music Information Retrieval. pp. 1–6 (2008)
26. Tsoumakas, G., Katakis, I., Vlahavas, I.P.: Mining multi-label data. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer (2010)
27. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: Mulan: A Java library for multi-label learning. Journal of Machine Learning Research 12, 2411–2414 (2011)
28. Tsoumakas, G., Vlahavas, I.: Random k-labelsets: An ensemble method for multilabel classification. In: European Conference on Machine Learning. pp. 406–417 (2007)
29. Wang, B., Jia, Y., Han, Y., Han, W.: Effective feature selection on data with uncertain labels. In: IEEE International Conference on Data Engineering. pp. 1657–1662 (2009)
30. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2011)
31. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: International Conference on Machine Learning. pp. 412–420 (1997)
32. Zhang, M., Zhou, Z.: A review on multi-label learning algorithms (in press). IEEE Transactions on Knowledge and Data Engineering PrePrints(PrePrints), 1–1 (2013)
33. Zhang, M.L., Peña, J.M., Robles, V.: Feature selection for multi-label Naive Bayes classification. Information Sciences 179, 3218–3229 (2009)
34. Zhao, Z., Morstatter, F., Sharma, S., Alelyani, S., Anand, A., Liu, H.: Advancing feature selection research - ASU feature selection repository. Technical Report, Arizona State University (2011)
35. Zheng, Z., Wu, X., Srihari, R.: Feature selection for text categorization on imbalanced data. SIGKDD Explorations Newsletter 6(1), 80–89 (2004)