<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Feature Selection Methods for Multi-Label Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Newton Spolaôr</string-name>
          <email>newtonspolaor@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigorios Tsoumakas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory of Computational Intelligence, Institute of Mathematics and Computer Science, University of São Paulo, São Carlos</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Machine Learning and Knowledge Discovery, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Multi-label text classification deals with problems in which each document is associated with a subset of categories. These documents often consist of a large number of words, which can hinder the performance of learning algorithms. Feature selection is a popular task to find representative words and remove unimportant ones, which can speed up learning and even improve learning performance. This work evaluates eight feature selection algorithms on text benchmark datasets. The best algorithms are subsequently compared with random feature selection and with classifiers built using all features. The results agree with the literature in finding that well-known approaches, such as maximum chi-squared scoring across all labels, are good choices to reduce text dimensionality while reaching competitive multi-label classification performance.</p>
      </abstract>
      <kwd-group>
        <kwd>problem transformation</kwd>
        <kwd>binary relevance</kwd>
        <kwd>round-robin</kwd>
        <kwd>rand-robin</kwd>
        <kwd>chi-squared</kwd>
        <kwd>bi-normal separation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Classical single-label learning deals with problems in which each dataset instance
(or example) is described by a set of features and associated with only one label
from a set of disjoint labels L. Single-label text classification (or text
categorization), for example, learns from data in which each document has a unique
category (topic) as its label. If |L| = 2, this task is called binary text classification,
and it is called multi-class text classification if |L| &gt; 2.</p>
      <p>
        Although a large amount of research has been carried out on single-label
learning, the corresponding learning algorithms do not fit well into applications
composed of instances annotated with subsets of labels from L. In some
text categorization problems, for instance, each document is labeled with several topics
simultaneously, so the learning algorithm must handle more than one
label at a time. Motivated by this scenario, multi-label
learning algorithms have been developed [
        <xref ref-type="bibr" rid="ref26 ref32">32,26</xref>
        ].
      </p>
      <p>
        Irrelevant and/or redundant features can hinder the performance of
single-label and multi-label learning algorithms due to the "curse of
dimensionality" [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Thus, the Feature Selection (FS) task is often applied before learning
to find features that describe the dataset as well as, or even better than, the
original set of features does, and to remove the remaining ones. FS also speeds up
learning algorithms and sometimes improves their performance [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ].
      </p>
      <p>
        Research on multi-label feature selection is still scarce. For example, many
publications evaluate a number of FS algorithms on only a few multi-label
datasets. This work helps to narrow this gap by comparing 8 FS methods
on 20 multi-label text classification datasets (9 from different sources and 11 from
a web page). The methods combine 2 feature evaluation measures, Chi-squared
(CS) and Bi-Normal Separation (BNS) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], with 4 aggregation strategies to tackle
multiple labels [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], some of them still unexplored for multi-label datasets. Results
show that well-known approaches, such as considering the maximum CS score
of each feature across all labels, led to some of the best classification models.
      </p>
      <p>The rest of this work is organized as follows: Section 2 briefly presents
multi-label learning, FS and related work. Section 3 describes the methods evaluated
in Section 4, which is followed by the conclusion and future work in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>This section describes basic notation and concepts related to multi-label
learning and feature selection. Related work on multi-label feature selection for textual
datasets is also considered.</p>
      <sec id="sec-2-1">
        <title>Multi-label learning</title>
        <p>Let D be a dataset composed of N examples E<sub>i</sub> = (x<sub>i</sub>, Y<sub>i</sub>), i = 1..N. Each
example (instance) E<sub>i</sub> is associated with a feature vector x<sub>i</sub> = (x<sub>i1</sub>, x<sub>i2</sub>, …, x<sub>iM</sub>)
described by M features (attributes) X<sub>j</sub>, j = 1..M, and a subset of labels Y<sub>i</sub> ⊆ L,
where L = {y<sub>1</sub>, y<sub>2</sub>, …, y<sub>q</sub>} is the set of q labels. Table 1 shows this representation.
In this scenario, the multi-label learning task consists of generating a model H
which, given an unseen instance E = (x, ?), is capable of accurately predicting
its subset of labels Y, i.e., H(E) → Y.</p>
        <p>
          Multi-label learning methods can be organized into two main categories:
algorithm adaptation and problem transformation [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. The former one includes
learning algorithms extended to deal with multi-label data directly, such as the
Multi-label Naive Bayes algorithm [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. On the other hand, the latter category
consists of algorithm-independent methods, as any state-of-the-art single-label
learning method can learn from each single-label problem generated by these
methods. The Binary Relevance (BR) approach exemplifies this category by
transforming a multi-label dataset into q single-label datasets, learning from
each single-label problem separately and combining the results.
        </p>
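        <p>As an illustration only, the BR transformation described above can be sketched in Python as follows. The function name <monospace>binary_relevance_split</monospace> and the toy data are hypothetical and not part of the original work; the sketch only shows how one multi-label dataset becomes q binary datasets, leaving the learning and combination steps aside.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Sequence, Set, Tuple

Example = Tuple[Sequence[float], Set[str]]


def binary_relevance_split(
    examples: Sequence[Example], labels: Sequence[str]
) -> Dict[str, List[Tuple[Sequence[float], int]]]:
    """Turn one multi-label dataset into q binary datasets, one per label."""
    binary = {}
    for label in labels:
        # An example is positive (1) for this label if the label belongs
        # to its label set, and negative (0) otherwise.
        binary[label] = [(x, int(label in y)) for x, y in examples]
    return binary


# Toy dataset (hypothetical): three documents, three features, two labels.
toy = [
    ((1, 0, 1), {"sports"}),
    ((0, 1, 1), {"sports", "politics"}),
    ((0, 1, 0), set()),
]
per_label = binary_relevance_split(toy, ["sports", "politics"])
sports_targets = [target for _, target in per_label["sports"]]
politics_targets = [target for _, target in per_label["politics"]]
```
]]></preformat>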
        <p>
          Furthermore, exploiting label dependence during learning can improve
performance [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. An alternative categorization organizes multi-label learning
methods based on the order of label correlations taken into account [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. First-order
strategies ignore the co-existence of other labels during learning, as BR does.
Second-order strategies, exemplified by Calibrated Label Ranking [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], consider pairwise relations
between labels. High-order strategies, such as Random k-labelsets [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], consider
relations among more labels.
        </p>
        <p>Although high-order strategies potentially model wider label correlations,
they are usually computationally more demanding. In this work, the problem
transformation/first-order strategy BR is used for classification.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Feature selection</title>
        <p>
          FS for multi-label text datasets often applies single-label feature evaluation
measures, i.e., measures to score the quality of features, after using problem
transformation approaches, such as BR [
          <xref ref-type="bibr" rid="ref31 ref6">6,31</xref>
          ]. Moreover, these measures usually follow
the filter approach [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Unlike the wrapper and embedded approaches, filters
remove irrelevant and/or redundant features regardless of the learning algorithm,
which can save computational resources when working with large datasets. Both
FS measures used in this work agree with these popular choices.
        </p>
        <p>
          CS and BNS share the same notation [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Let tp, fp, fn and tn be the number
of (feature) true positives, false positives, false negatives and true negatives in
a binary dataset. In this scenario, tp counts the cases in which a feature and a label under
evaluation co-occur, i.e., both are positive, while fp counts the cases in which
only the feature is positive. The remaining counts are defined analogously.
        </p>
        <p>Chi-squared estimates the lack of independence between the occurrence of a feature
X<sub>j</sub> and the occurrence of a label y<sub>i</sub>, such that the higher the measure value,
the more related X<sub>j</sub> and y<sub>i</sub> are. CS is defined by Equation 1, where P<sub>pos</sub> =
(tp + fn)/(tp + fp + fn + tn), P<sub>neg</sub> = (fp + tn)/(tp + fp + fn + tn) and
t(count, expect) = (count − expect)<sup>2</sup>/expect.</p>
        <p>
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math>CS(tp, fp, fn, tn) = t(tp, (tp + fp) P_{pos}) + t(fn, (fn + tn) P_{pos}) + t(fp, (tp + fp) P_{neg}) + t(tn, (fn + tn) P_{neg})</tex-math>
          </disp-formula>
        </p>
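        <p>A minimal Python sketch of Equation 1 is given below for illustration. The function name <monospace>chi_squared</monospace> is hypothetical, and returning 0 for an empty table or a zero expected count is an assumption made here to avoid division by zero; the text does not state how such cases are handled.</p>
        <preformat><![CDATA[
```python
def chi_squared(tp: int, fp: int, fn: int, tn: int) -> float:
    """Chi-squared score of one feature for one label (Equation 1)."""
    n = tp + fp + fn + tn
    if n == 0:
        return 0.0  # assumption: an empty table carries no evidence
    p_pos = (tp + fn) / n  # proportion of positive documents
    p_neg = (fp + tn) / n  # proportion of negative documents

    def t(count: float, expect: float) -> float:
        # Assumption: a cell with zero expected count contributes nothing.
        if expect == 0:
            return 0.0
        return (count - expect) ** 2 / expect

    return (
        t(tp, (tp + fp) * p_pos)
        + t(fn, (fn + tn) * p_pos)
        + t(fp, (tp + fp) * p_neg)
        + t(tn, (fn + tn) * p_neg)
    )


# A feature that co-occurs perfectly with the label scores high,
# while a feature that is independent of the label scores zero.
perfect = chi_squared(10, 0, 0, 10)
independent = chi_squared(5, 5, 5, 5)
```
]]></preformat>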
        <p>Bi-Normal Separation measures the separation between two thresholds
(one for the positive and one for the negative class) under a Gaussian curve. This measure models the
occurrence of a feature X<sub>j</sub> in the documents as a random Normal variable
exceeding a hypothetical threshold, such that the frequency of X<sub>j</sub> corresponds
to the area under the curve past the threshold. As defined by Equation 2, the
larger the difference between the thresholds, the better the feature X<sub>j</sub> is. Let
F<sup>−1</sup> be the inverse cumulative probability function of the standard Normal distribution
(z-score), tpr = tp/(tp + fn) and fpr = fp/(fp + tn).</p>
        <p>
          <disp-formula id="eq-2">
            <label>(2)</label>
            <tex-math>BNS(tp, fp, fn, tn) = \left| F^{-1}(tpr) - F^{-1}(fpr) \right|</tex-math>
          </disp-formula>
        </p>
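        <p>Equation 2 can be sketched in Python with the standard library, as shown below for illustration. The function name <monospace>bi_normal_separation</monospace> is hypothetical. The clipping bound <monospace>eps</monospace> = 0.0005 is an assumption added here so that the inverse cumulative function stays finite when a rate is exactly 0 or 1; this section does not state how such cases are handled.</p>
        <preformat><![CDATA[
```python
from statistics import NormalDist


def bi_normal_separation(
    tp: int, fp: int, fn: int, tn: int, eps: float = 0.0005
) -> float:
    """Bi-Normal Separation score of one feature for one label (Equation 2)."""
    inverse_cdf = NormalDist().inv_cdf  # standard Normal inverse CDF
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # Assumption: clip both rates into [eps, 1 - eps], because the
    # inverse CDF is undefined at exactly 0 and 1.
    tpr = min(max(tpr, eps), 1.0 - eps)
    fpr = min(max(fpr, eps), 1.0 - eps)
    return abs(inverse_cdf(tpr) - inverse_cdf(fpr))


strong = bi_normal_separation(10, 0, 0, 10)   # feature matches the label
weak = bi_normal_separation(7, 3, 3, 7)       # partial association
useless = bi_normal_separation(5, 5, 5, 5)    # tpr equals fpr
```
]]></preformat>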
      </sec>
      <sec id="sec-2-3">
        <title>Related work</title>
        <p>
          FS has been an active research topic in supervised learning, with several related
publications and comprehensive surveys [
          <xref ref-type="bibr" rid="ref16 ref26 ref34">34,26,16</xref>
          ]. Most of this research
supports single-label classification, but there are also
many publications on feature selection for multi-label text classification.
        </p>
        <p>
          The systematic review process, a method to perform a wide, replicable and
rigorous literature review, was carried out in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and recently updated to search
for multi-label FS publications. Most of the methods found are filters, which
are useful to save time/space when working with large textual datasets. Table 2
summarizes some publications on filter FS. As mentioned, many publications use
only a few datasets to evaluate their methods. Moreover, Information Gain (IG)
and CS, two potentially related measures [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], are the most commonly used ones.
After transforming a multi-label dataset into q binary datasets by BR and
counting the number of (feature) true/false positives/negatives, any feature
evaluation measure for binary data can be applied according to the macro-averaged
approach [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], exemplified in [
          <xref ref-type="bibr" rid="ref20 ref23 ref25 ref28">23,25,28,20</xref>
          ].
        </p>
        <p>In what follows, four aggregation strategies are described. Let tp<sub>y<sub>i</sub></sub>, fp<sub>y<sub>i</sub></sub>, tn<sub>y<sub>i</sub></sub>
and fn<sub>y<sub>i</sub></sub> be the number of (feature) true/false positives/negatives for a label y<sub>i</sub>,
i = 1..q, and a feature X<sub>j</sub>, j = 1..M.</p>
        <p>
          The well-known Mean strategy (Mean) [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] averages the scores obtained after
applying the measure <italic>fe</italic> to the binary dataset related to each label y<sub>i</sub>
(Equation 3). Max (Max), which returns the maximum score obtained across all labels
(Equation 4), is also popular.
        </p>
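        <p>
          <disp-formula id="eq-3">
            <label>(3)</label>
            <tex-math>Mean(X_j) = \frac{1}{q} \sum_{i=1}^{q} fe(tp_{y_i}, fp_{y_i}, fn_{y_i}, tn_{y_i})</tex-math>
          </disp-formula>
        </p>
        <p>
          <disp-formula id="eq-4">
            <label>(4)</label>
            <tex-math>Max(X_j) = \max_{i=1}^{q} fe(tp_{y_i}, fp_{y_i}, fn_{y_i}, tn_{y_i})</tex-math>
          </disp-formula>
        </p>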
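        <p>For illustration, Equations 3 and 4 can be sketched in Python over a precomputed score matrix, where row i holds the scores of all features for label y<sub>i</sub>. The function names and the toy scores below are hypothetical; the sketch only shows that the two strategies can rank the same features differently.</p>
        <preformat><![CDATA[
```python
from typing import List, Sequence

ScoreMatrix = Sequence[Sequence[float]]  # scores[i][j]: label i, feature j


def mean_scores(scores: ScoreMatrix) -> List[float]:
    """Average score of each feature across all q labels (Equation 3)."""
    q = len(scores)
    return [sum(column) / q for column in zip(*scores)]


def max_scores(scores: ScoreMatrix) -> List[float]:
    """Maximum score of each feature across all q labels (Equation 4)."""
    return [max(column) for column in zip(*scores)]


def rank_features(feature_scores: Sequence[float]) -> List[int]:
    """Feature indexes ordered from the best to the worst score."""
    return sorted(
        range(len(feature_scores)),
        key=lambda j: feature_scores[j],
        reverse=True,
    )


# Toy scores (hypothetical): 2 labels, 3 features.
toy_scores = [
    [0.9, 0.1, 0.4],  # label 0
    [0.0, 0.2, 0.6],  # label 1
]
mean_ranking = rank_features(mean_scores(toy_scores))
max_ranking = rank_features(max_scores(toy_scores))
```
]]></preformat>
        <p>In this toy case, Mean prefers feature 2, which is moderately good for both labels, whereas Max prefers feature 0, which is very good for a single label.</p>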
        <p>
          The finding that some feature evaluation measures can be blinded by a surplus
of strongly predictive features for frequent labels, while largely ignoring features
needed to discriminate hard (low-frequency) labels, motivated the proposal of
Round-Robin (RoR) and Rand-Robin (RaR) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. After calculating q feature
rankings, one per label, RoR takes the best remaining feature from the ranking of each
label in turn. RaR, on the other hand, takes the best remaining feature for a label
chosen at random with probability inversely proportional to its frequency. Each
feature taken is removed from all q feature rankings.
        </p>
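        <p>The two strategies can be sketched in Python as follows, for illustration only. All names are hypothetical, the input is a score matrix with one row per label, and label frequencies are assumed to be strictly positive. A set of already taken features stands in for physically removing each feature from the q rankings.</p>
        <preformat><![CDATA[
```python
import random
from typing import List, Sequence

ScoreMatrix = Sequence[Sequence[float]]  # scores[i][j]: label i, feature j


def per_label_rankings(scores: ScoreMatrix) -> List[List[int]]:
    """One best-first feature ranking per label."""
    return [
        sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        for row in scores
    ]


def round_robin(scores: ScoreMatrix) -> List[int]:
    """Visit the labels in turn, taking each label's best remaining feature."""
    rankings = per_label_rankings(scores)
    n_features = len(scores[0])
    selected, taken = [], set()
    while len(selected) < n_features:
        for ranking in rankings:
            for j in ranking:
                if j not in taken:
                    taken.add(j)
                    selected.append(j)
                    break
    return selected


def rand_robin(
    scores: ScoreMatrix, label_frequencies: Sequence[int], seed: int = 0
) -> List[int]:
    """Pick labels at random, rarer labels more often (weight 1/frequency)."""
    rng = random.Random(seed)
    rankings = per_label_rankings(scores)
    weights = [1.0 / f for f in label_frequencies]  # assumes f > 0
    n_features = len(scores[0])
    selected, taken = [], set()
    while len(selected) < n_features:
        i = rng.choices(range(len(rankings)), weights=weights)[0]
        j = next(j for j in rankings[i] if j not in taken)
        taken.add(j)
        selected.append(j)
    return selected


toy_scores = [[0.9, 0.1, 0.4], [0.0, 0.2, 0.6]]
ror_ranking = round_robin(toy_scores)
rar_ranking = rand_robin(toy_scores, label_frequencies=[10, 2], seed=1)
```
]]></preformat>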
        <p>Algorithm 3.1 suggests a generic implementation for all the strategies, which
can be further optimized for each individual strategy. In what follows,
the main procedures and variables of this algorithm are described.</p>
        <p>Textual data is often represented as sparse data, as not all features (words)
occur in every instance (document). The
procedure invertedIndexes (Line 2) exploits this property when counting. As a result, there are M + q
rows, such that each row consists of the inverted indexes linking a feature or a
label to the instances in which it occurs. Thus, a (feature) true positive is counted every
time a feature and a label co-occur in the same instance (Line 7). Based on the
number of inverted indexes, i.e., the frequency of each feature or label, fp, fn
and tn are easily set. Then the score calculated by a feature evaluation measure
<italic>fe</italic> (Line 15) is stored in the matrix of feature rankings FRM.</p>
        <p>Algorithm 3.1 ends with the application of one of the strategies in Line 19.
It should be emphasized that Equations 3 and 4 would use the matrix FRM
instead of reapplying the feature evaluation measure <italic>fe</italic>.</p>
        <p>
          Algorithm 3.1 can be implemented to support parallelization and serialization,
enabling users to save time and space on large datasets. First, the algorithm is split
into several independent tasks, which helps successful parallelization [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
Second, serialization enables the algorithm to save a stream of
bytes to disk and load it back when necessary, releasing space in memory.
          <preformat>Algorithm 3.1 Generic implementation for the aggregation strategies
Input: Multi-label dataset D
Output: Feature ranking FR
 1: Initialize tp and FRM
 2: {invertedFeatureIndexes, invertedLabelIndexes} ← invertedIndexes(D)
 3: for each label of invertedLabelIndexes do
 4:   for each feature of invertedFeatureIndexes do
 5:     for each invertedIndexF of feature do
 6:       for each invertedIndexL of label do
 7:         if invertedIndexF = invertedIndexL then
 8:           tp ← tp + 1
 9:         end if
10:       end for
11:     end for
12:     fp ← numberIndexes(feature) − tp
13:     fn ← numberIndexes(label) − tp
14:     tn ← N − (tp + fp + fn)
15:     FRM[label][feature] ← fe(tp, fp, fn, tn)
16:     Reinitialize tp
17:   end for
18: end for
19: FR ← aggregationStrategy(FRM)
20: return FR</preformat>
        </p>
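        <p>The counting part of Algorithm 3.1 (Lines 1 to 18) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: documents are given as pairs of feature and label sets, all function names are hypothetical, and a set intersection replaces the two innermost loops (Lines 5 to 11). The aggregation step of Line 19 is left out.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Dict, Hashable, Sequence, Set, Tuple

Document = Tuple[Set[Hashable], Set[Hashable]]  # (features present, labels)
Measure = Callable[[int, int, int, int], object]  # fe(tp, fp, fn, tn)


def inverted_indexes(docs: Sequence[Document]):
    """Map every feature and every label to the documents where it occurs."""
    feature_index: Dict[Hashable, Set[int]] = {}
    label_index: Dict[Hashable, Set[int]] = {}
    for doc_id, (features, labels) in enumerate(docs):
        for feature in features:
            feature_index.setdefault(feature, set()).add(doc_id)
        for label in labels:
            label_index.setdefault(label, set()).add(doc_id)
    return feature_index, label_index


def feature_ranking_matrix(docs: Sequence[Document], fe: Measure):
    """Score every (label, feature) pair from inverted-index counts."""
    n = len(docs)
    feature_index, label_index = inverted_indexes(docs)
    frm: Dict[Hashable, Dict[Hashable, object]] = {}
    for label, label_docs in label_index.items():
        frm[label] = {}
        for feature, feature_docs in feature_index.items():
            # Documents containing both the feature and the label.
            tp = len(feature_docs & label_docs)
            fp = len(feature_docs) - tp
            fn = len(label_docs) - tp
            tn = n - (tp + fp + fn)
            frm[label][feature] = fe(tp, fp, fn, tn)
    return frm


# Toy corpus (hypothetical). The "measure" just echoes the four counts.
docs = [
    ({"goal", "match"}, {"sports"}),
    ({"vote", "match"}, {"politics"}),
    ({"goal"}, {"sports"}),
]
counts = feature_ranking_matrix(docs, lambda tp, fp, fn, tn: (tp, fp, fn, tn))
```
]]></preformat>
        <p>Because each (label, feature) pair is scored independently, the two outer loops are natural units for the parallel tasks mentioned above.</p>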
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental evaluation</title>
      <p>
        In this work, 8 text FS methods (2 feature evaluation measures × 4 strategies)
are applied to 20 benchmark datasets. The best methods are then compared with
the classifiers built using All Features (AF) and using the features selected by
Random Feature Selection (RFS) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The RaR strategy and RFS were executed
three times due to their stochasticity, and the corresponding Micro F-Measure
values from RaR and RFS were averaged before calculating the average ranking.
      </p>
      <p>
        Some implemented procedures use Weka [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and LIBLINEAR [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] resources.
All the reported classification results were obtained with Mulan [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], a framework
for multi-label classification, using 10-fold cross-validation with paired folds.
      </p>
      <sec id="sec-3-1">
        <title>Datasets and experimental setup</title>
        <p><sup>3</sup> http://mulan.sourceforge.net/datasets.html
<sup>4</sup> http://meka.sourceforge.net/#datasets</p>
        <p>
          After applying each FS method to a dataset, the BR + Linear SVM (BRLL)
method, which is efficient for classifying large sparse datasets [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], was used. BRLL classifiers
were built from data described by the best t features found by an FS method,
in which t = 10%, 20%, …, 90% of the number of features M. The learning
algorithm was executed with SVM C = 3, tolerance of the stopping criterion e =
0.001 and the remaining parameters at their default values<sup>5</sup>.
        </p>
        <p>
          All classification models built were evaluated according to the Micro
F-Measure [
          <xref ref-type="bibr" rid="ref26">26</xref>
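          ]. This evaluation measure, defined by Equation 5, has values in
the interval [0, 1], and the higher its value, the better the multi-label classifier
performs. Let TP<sub>y<sub>j</sub></sub>, FP<sub>y<sub>j</sub></sub>, TN<sub>y<sub>j</sub></sub> and FN<sub>y<sub>j</sub></sub> be, respectively, the number of
true/false positives/negatives for a label y<sub>j</sub> from the set of labels L.
          <disp-formula id="eq-5">
            <label>(5)</label>
            <tex-math>\text{Micro F-Measure}(H, D) = \frac{2 \sum_{j=1}^{q} TP_{y_j}}{2 \sum_{j=1}^{q} TP_{y_j} + \sum_{j=1}^{q} FP_{y_j} + \sum_{j=1}^{q} FN_{y_j}}</tex-math>
          </disp-formula>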
        </p>
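        <p>As an illustration of Equation 5, the sketch below pools the per-label counts over all documents before computing the measure. The function name <monospace>micro_f_measure</monospace> is hypothetical, and returning 0 when the denominator is zero is a convention assumed here rather than one stated in the text.</p>
        <preformat><![CDATA[
```python
from typing import Hashable, Sequence, Set


def micro_f_measure(
    true_sets: Sequence[Set[Hashable]],
    predicted_sets: Sequence[Set[Hashable]],
    labels: Sequence[Hashable],
) -> float:
    """Micro-averaged F-Measure over all labels and documents (Equation 5)."""
    tp = fp = fn = 0
    for y_true, y_pred in zip(true_sets, predicted_sets):
        for label in labels:
            in_true = label in y_true
            in_pred = label in y_pred
            tp += int(in_true and in_pred)
            fp += int(in_pred and not in_true)
            fn += int(in_true and not in_pred)
    denominator = 2 * tp + fp + fn
    # Assumption: define the score as 0 when no label is true or predicted.
    return 2 * tp / denominator if denominator else 0.0


# Toy predictions (hypothetical): pooled counts are tp = 2, fp = 1, fn = 1.
truth = [{"a", "b"}, {"a"}]
predictions = [{"a"}, {"a", "b"}]
score = micro_f_measure(truth, predictions, ["a", "b"])
```
]]></preformat>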
      </sec>
      <sec id="sec-3-2">
        <title>Results and discussion</title>
        <p>
          The Micro F-Measure values of the 8 feature selection methods at the 9 percentages
of selected features for each of the 20 datasets are available in an online
appendix<sup>6</sup>. Following the recommendations in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we compare different
feature selection approaches at specific percentages of selected features based on
their average rankings across all datasets.
        </p>
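        <p>The average-ranking comparison can be sketched in Python as follows, for illustration. The function names and the toy results are hypothetical. Giving tied methods the average of the ranks they span is an assumption of this sketch; the text does not say how ties were handled.</p>
        <preformat><![CDATA[
```python
from typing import List, Sequence


def ranks_with_ties(scores: Sequence[float]) -> List[float]:
    """Rank 1 is the highest score; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = average_rank
        pos = end + 1
    return ranks


def average_rankings(results: Sequence[Sequence[float]]) -> List[float]:
    """results[d][m] is the score of method m on dataset d."""
    per_dataset = [ranks_with_ties(row) for row in results]
    n_datasets = len(per_dataset)
    return [sum(column) / n_datasets for column in zip(*per_dataset)]


# Toy results (hypothetical): 2 datasets, 3 methods.
toy_results = [
    [0.8, 0.7, 0.7],
    [0.6, 0.9, 0.5],
]
avg_ranks = average_rankings(toy_results)
```
]]></preformat>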
        <p>We first discuss the relative performance of the 4 aggregation strategies (Max,
Mean, RoR, RaR) for each feature evaluation measure (CS, BNS) separately.</p>
        <p><sup>5</sup> Solvers in LIBLINEAR are insensitive to C.</p>
        <p><sup>6</sup> http://tiny.cc/e0ke3w</p>
        <p>We notice that, for both CS and BNS, the Mean aggregation performs best
when the percentage of features is low (up to 50% for BNS and 40% for CS). For
larger percentages of features, RaR (and RoR in the case of 90%) performs best
for BNS, while Max performs best for CS. Recall that, for each feature, Mean
averages the scores across all labels, while Max, RoR and RaR are based on a
single label. Therefore, with fewer features, it is probably the case that
Max, RoR and RaR do not consider enough features for some of the labels,
in contrast to Mean. As the percentage of selected features increases, Max, RoR
and RaR manage to select enough features for all labels and outperform Mean,
which selects features that work well for all labels on average.</p>
        <p>
          The finding that Mean and Max lead to good classification models agrees
with earlier publications which combine them with different feature evaluation
measures, such as ReliefF and IG [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], CS [
          <xref ref-type="bibr" rid="ref20 ref25 ref28">25,28,20</xref>
          ] and mutual information [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ].
        </p>
        <p>We now focus on methods that in the previous comparison achieved the
best average ranking for more than one percentage of selected features. These
are: MeanBNS, RaRBNS, MaxCS and MeanCS. We discuss the relative
performance of these methods along with the baselines of using All Features
(AF) and Random Feature Selection (RFS). Table 5 shows the corresponding
average rankings and standard deviations. Note that the performance of AF is
the same regardless of the percentage of selected features, yet its relative
ranking with respect to competing methods can and does differ.</p>
        <p>
          We notice that the best (lowest) average ranking is achieved by Max and
Mean combined with CS. Besides obtaining a better ranking than RFS, they also
outperform AF, fulfilling the requirements of any reasonable feature selection
method. CS, which behaves erratically for the very small expected counts common
in text classification, was found to be worse than BNS for single-label classification [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
However, we see here that this scenario is reversed in the case of multi-label
classification. The strategies Mean and Max seem to mitigate the disadvantage
of CS. This could be because they consider more than one label, with different
expected counts, when evaluating features.
        </p>
        <p>
          We complete the comparison of the 8 feature selection methods by analyzing
the similarity of the feature subsets that are selected by each method. In
particular, we calculate a similarity index between each pair of methods [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. This
could be useful, for example, to identify diverse feature selection methods for
constructing ensembles [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In this analysis, only one run of RaR is considered.
To save space, Table 6 shows the similarity values averaged across
all datasets at a specific percentage of features (t = 50%), highlighting similarity
values larger than 0.7 in bold typeface. Nevertheless, the patterns found also
occur for other percentages of selected features.
        </p>
        <p>We first notice that CS methods are quite different from BNS methods, as one
would expect. Within BNS methods, we see that MeanBNS selects quite different
feature subsets from the ones found by the other BNS methods, which in turn
select relatively similar feature subsets. Within CS methods, we see 3 pairs of
methods selecting similar features: RaR/RoR, Mean/Max and Mean/RoR.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This work evaluated 8 FS methods to support multi-label text classification on
20 benchmark datasets. They are based on 2 feature evaluation measures and
4 strategies to consider label information while evaluating features. The best
methods from this group also stood out in an experimental comparison with
the classifiers built using all features and using randomly selected features.</p>
      <p>The popular algorithms MeanCS and MaxCS, which respectively rank
features according to the average or the maximum Chi-squared score across all
labels, led to most of the best classifiers while using fewer features. The former
was the best choice when the number of features was smaller. As the number of
features increased, the latter yielded the best classifiers.</p>
      <p>
        Future work will apply some of the best FS methods and their optimized
implementation to rank features in large textual datasets. Furthermore, we plan
to evaluate efficient FS methods which are able to consider label information at
a higher level than the one considered by MeanCS and MaxCS [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This research was supported by the São Paulo Research Foundation (FAPESP),
grant 2012/23906-2. The authors would like to thank Maria C. Monard, Huei D.
Lee and Everton A. Cherman for their support in multi-label feature selection.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cannas</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dessì</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pes</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Assessing similarity of feature selection techniques in high-dimensional domains</article-title>
          .
          <source>Pattern Recognition Letters</source>
          <volume>34</volume>
          (
          <issue>12</issue>
          ),
          <fpage>1446</fpage>
          –
          <lpage>1453</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>Y.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liau</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>Multilabel text categorization based on a new linear classifier learning method and a category-sensitive refinement method</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>34</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1948</fpage>
          –
          <lpage>1953</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Document transformation for multi-label feature selection in text categorization</article-title>
          .
          <source>In: IEEE International Conference on Data Mining</source>
          . pp.
          <fpage>451</fpage>
          –
          <lpage>456</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dembczynski</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waegeman</surname>
          </string-name>
          , W., Cheng, W., Hüllermeier, E.:
          <article-title>On label dependence and loss minimization in multi-label classification</article-title>
          .
          <source>Machine Learning</source>
          <volume>88</volume>
          ,
          <fpage>5</fpage>
          –
          <lpage>45</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Demšar</surname>
          </string-name>
          , J.:
          <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>7</volume>
          ,
          <fpage>1</fpage>
          –
          <lpage>30</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dendamrongvit</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vateekul</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kubat</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Irrelevant attributes and imbalanced classes in multi-label text-categorization domains</article-title>
          .
          <source>Intelligent Data Analysis</source>
          <volume>15</volume>
          (
          <issue>6</issue>
          ),
          <fpage>843</fpage>
          –
          <lpage>859</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          ,
          <fpage>1871</fpage>
          –
          <lpage>1874</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Forman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>An extensive empirical study of feature selection metrics for text classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          ,
          <fpage>1289</fpage>
          –
          <lpage>1305</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Forman</surname>
          </string-name>
          , G.:
          <article-title>A pitfall and solution in multi-class feature selection for text classification</article-title>
          . HPL-2004-86, Hewlett Packard (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Furnkranz, J., Hullermeier, E.,
          <string-name>
            <surname>Menc</surname>
            <given-names>a</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.L.</given-names>
            ,
            <surname>Brinker</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          :
          <article-title>Multilabel classi cation via calibrated label ranking</article-title>
          .
          <source>Machine Learning</source>
          <volume>73</volume>
          (
          <issue>2</issue>
          ),
          <volume>133</volume>
          {
          <fpage>153</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kononenko</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robnik-Šikonja</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Weighting and local methods non-myopic feature quality evaluation with (R)ReliefF</article-title>
          . In:
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motoda</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (eds.)
          <source>Computational Methods of Feature Selection</source>
          , pp.
          <fpage>169</fpage>
          –
          <lpage>191</lpage>
          .
          <publisher-name>Chapman &amp; Hall/CRC</publisher-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kuncheva</surname>
            ,
            <given-names>L.I.</given-names>
          </string-name>
          :
          <article-title>A stability index for feature selection</article-title>
          . In:
          <source>IASTED International Multi-Conference: Artificial Intelligence and Applications</source>
          . pp.
          <fpage>390</fpage>
          –
          <lpage>395</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lastra</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luaces</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quevedo</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahamonde</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Graphical feature selection for multilabel classification tasks</article-title>
          . In:
          <source>International Conference on Advances in Intelligent Data Analysis</source>
          . pp.
          <fpage>246</fpage>
          –
          <lpage>257</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          :
          <article-title>Feature selection for multi-label classification using multivariate mutual information</article-title>
          .
          <source>Pattern Recognition Letters</source>
          <volume>34</volume>
          (
          <issue>3</issue>
          ),
          <fpage>349</fpage>
          –
          <lpage>357</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rose</surname>
            ,
            <given-names>T.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>RCV1: A new benchmark collection for text categorization research</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>5</volume>
          ,
          <fpage>361</fpage>
          –
          <lpage>397</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motoda</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <source>Computational Methods of Feature Selection</source>
          .
          <publisher-name>Chapman &amp; Hall/CRC</publisher-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Mayne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perry</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Hierarchically classifying documents with multiple labels</article-title>
          . In:
          <source>IEEE Symposium on Computational Intelligence and Data Mining</source>
          . pp.
          <fpage>133</fpage>
          –
          <lpage>139</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Nardiello</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sperduti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Discretizing continuous attributes in AdaBoost for text categorization</article-title>
          . In:
          <source>European Conference on Information Retrieval Research</source>
          . pp.
          <fpage>320</fpage>
          –
          <lpage>334</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Novovičová</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Somol</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haindl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pudil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Conditional mutual information based feature selection for classification task</article-title>
          . In:
          <source>Iberoamerican Conference on Progress in Pattern Recognition, Image Analysis and Applications</source>
          . pp.
          <fpage>417</fpage>
          –
          <lpage>426</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Olsson</surname>
            ,
            <given-names>J.O.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oard</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          :
          <article-title>Combining feature selectors for text classification</article-title>
          . In:
          <source>ACM International Conference on Information and Knowledge Management</source>
          . pp.
          <fpage>798</fpage>
          –
          <lpage>799</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Saleh</surname>
            ,
            <given-names>S.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El-Sonbaty</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>A feature selection algorithm with redundancy reduction for text classification</article-title>
          . In:
          <source>International Symposium on Computer and Information Sciences</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Spolaôr</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monard</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.D.</given-names>
          </string-name>
          :
          <article-title>A systematic review to identify feature selection publications in multi-labeled data</article-title>
          .
          <source>ICMC Technical Report No. 374</source>
          , 31 pp. (
          <year>2012</year>
          ), University of São Paulo
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Spolaôr</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherman</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monard</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.D.</given-names>
          </string-name>
          :
          <article-title>A comparison of multi-label feature selection methods using the problem transformation approach</article-title>
          .
          <source>Electronic Notes in Theoretical Computer Science</source>
          <volume>292</volume>
          ,
          <fpage>135</fpage>
          –
          <lpage>151</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popat</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text classification in a hierarchical mixture model for small training sets</article-title>
          . In:
          <source>International Conference on Information and Knowledge Management</source>
          . pp.
          <fpage>105</fpage>
          –
          <lpage>113</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Trohidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalliris</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Multi-label classification of music into emotions</article-title>
          . In:
          <source>International Conference on Music Information Retrieval</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katakis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          :
          <article-title>Mining multi-label data</article-title>
          . In:
          <string-name>
            <surname>Maimon</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <source>Data Mining and Knowledge Discovery Handbook</source>
          , pp.
          <fpage>667</fpage>
          –
          <lpage>685</lpage>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spyromitros-Xioufis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vilcek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Mulan: A Java library for multi-label learning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2411</fpage>
          –
          <lpage>2414</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Random k-labelsets: An ensemble method for multilabel classification</article-title>
          . In:
          <source>European Conference on Machine Learning</source>
          . pp.
          <fpage>406</fpage>
          –
          <lpage>417</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Effective feature selection on data with uncertain labels</article-title>
          . In:
          <source>IEEE International Conference on Data Engineering</source>
          . pp.
          <fpage>1657</fpage>
          –
          <lpage>1662</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <source>Data Mining: Practical Machine Learning Tools and Techniques</source>
          .
          <publisher-name>Morgan Kaufmann</publisher-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>J.O.</given-names>
          </string-name>
          :
          <article-title>A comparative study on feature selection in text categorization</article-title>
          . In:
          <source>International Conference on Machine Learning</source>
          . pp.
          <fpage>412</fpage>
          –
          <lpage>420</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>A review on multi-label learning algorithms</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          , PrePrints, in press
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peña</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robles</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Feature selection for multi-label Naive Bayes classification</article-title>
          .
          <source>Information Sciences</source>
          <volume>179</volume>
          ,
          <fpage>3218</fpage>
          –
          <lpage>3229</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morstatter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alelyani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anand</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Advancing feature selection research - ASU feature selection repository</article-title>
          .
          <source>Technical Report</source>
          (
          <year>2011</year>
          ), Arizona State University
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srihari</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Feature selection for text categorization on imbalanced data</article-title>
          .
          <source>SIGKDD Explorations Newsletter</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>80</fpage>
          –
          <lpage>89</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>