Performance Measures Fusion for Experimental Comparison of Methods for Multi-label Classification

Tome Eftimov (Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia)
Dragi Kocev (Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia)

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Abstract

Over the past few years, multi-label classification has been widely explored in the machine learning community. This resulted in a number of multi-label classification methods requiring benchmarking to determine their strengths and weaknesses. For these reasons, the authors typically compare the methods using a set of benchmark problems (datasets) with regard to different performance measures. At the end, the results are discussed for each performance measure separately. In order to give a general conclusion in which the contribution of each performance measure is included, we propose a performance measures fusion approach based on multi-criteria decision analysis. The approach provides rankings of the compared methods for each benchmark problem separately. These rankings can then be aggregated to discover sets of correlated measures as well as sets of evaluation measures that are least correlated. The performance and the robustness of the proposed methodology are investigated and illustrated on the results from a comprehensive experimental study including 12 multi-label classification methods according to 16 performance measures on a set of 11 benchmark problems.

Introduction

Supervised learning is one of the most widely researched and investigated areas of machine learning. The goal in supervised learning is to learn, from a set of examples with known class, a function that outputs a prediction for the class of a previously unseen example. If the examples belong to two classes (e.g., the example has some property or not), the task is called binary classification. The task where the examples can belong to a single class from a given set of m classes (m ≥ 3) is known as multi-class classification. The case where the output is a real value is called regression.

However, in many real life problems of predictive modelling the output (i.e., the target) can be structured, meaning that there can be more complex output structures such as vectors of variables with some dependencies among them. One type of structured output is a vector of binary variables, i.e., the examples can belong to multiple classes simultaneously. This task is known as multi-label classification (MLC). The issue of learning from multi-label data has recently attracted significant attention from many researchers, motivated by an increasing number of new applications. The latter include semantic annotation of images and video (news clips, movie clips), functional genomics (gene and protein function), music categorization into emotions, text classification (news articles, web pages, patents, emails, bookmarks, ...), directed marketing and others. An exhaustive list of multi-label applications is presented in (Gibaja and Ventura 2015).

In recent years, many different approaches have been developed for solving MLC problems. Tsoumakas and Katakis (Tsoumakas and Katakis 2007) summarize them into two main categories: a) algorithm adaptation methods, and b) problem transformation methods. Algorithm adaptation methods extend specific learning algorithms to handle multi-label data directly. Examples include lazy learning (Zhang and Zhou 2007), neural networks (Crammer and Singer 2003), boosting (De Comité, Gilleron, and Tommasi 2003), classification rules (Thabtah, Cowling, and Peng 2004), decision trees (Clare and King 2001; Blockeel, Raedt, and Ramon 1998) and ensembles thereof (Kocev et al. 2013), ensembles with label subgroups (RAkEL) (Tsoumakas and Vlahavas 2007), ensembles of classifier chains (Read et al. 2011), etc. Problem transformation methods, on the other hand, transform the MLC problem into one or more single-label classification problems.
The single-label classification problems are solved with a commonly used single-label classification approach and the output is transformed back into a multi-label representation. The simplest strategies include the one-against-all and one-against-one strategies, also referred to as the binary relevance method (Tsoumakas and Katakis 2007) and the pair-wise method (Fürnkranz 2002), respectively.

Performance evaluation for MLC is a more complex task than that of classical single-label classification, due to the nature of the task: one example can be labelled with multiple labels. Namely, it is difficult to assess which error is worse: two instances with two incorrect labels each or four instances with a single incorrect label each. To this end, in any typical multi-label experiment, it is essential to include multiple and contrasting measures because of the additional degrees of freedom that the multi-label setting introduces (Madjarov et al. 2012).

The relations among the different evaluation measures have been theoretically studied in the literature and the main findings can be summarized as follows. To begin with, Hamming loss and subset accuracy have a different structure and minimization of one may cause a high regret for the other (Dembczyński et al. 2010). Next, a study on surrogate losses for MLC showed that none of the convex surrogate losses is consistent with ranking loss (Gao and Zhou 2013). Furthermore, the F-measure optimality of the inference algorithm has been studied with decision theoretic approaches (Waegeman et al. 2014). Finally, an investigation of the shared properties among different measures yielded a unified understanding for MLC evaluation (Wu and Zhou 2017). All in all, when benchmarking novel MLC methods, it is necessary to compare their performance with existing state-of-the-art methods. However, due to the multitude of evaluation measures, drawing clear summaries and conclusions is not easy: the methods have different performance compared to the competing methods on the different evaluation measures. This makes providing a summary recommendation a complex task.
Considering this, we propose an approach for experimental comparison of methods for multi-label classification. It is developed for making a general conclusion using a set of user-specified performance measures. For this reason, the approach follows the idea of PROMETHEE methods, which are applicable in different domains such as business, chemistry, manufacturing, social sciences, agriculture and medicine (Ishizaka and Nemery 2011; Nikouei, Oroujzadeh, and Mehdipour-Ataei 2017). Recently, they were also used in a data-driven approach for evaluating multi-objective optimization algorithms regarding different performance measures (Eftimov, Korošec, and Koroušić Seljak 2018). To the best of our knowledge, they have not been used in the domain of MLC. The PROMETHEE methodology works as a ranking scheme for transforming the data for each benchmark dataset instead of using some traditional statistical ranking scheme (e.g., a fractional ranking scheme). Further, the obtained rankings, which fuse several performance measures, are involved in a statistical test to provide a general conclusion from the benchmark experiment.

The main contributions of the paper are:
• A methodology for fusing the various evaluation measures for the task of MLC.
• The proposed methodology is robust considering the inclusion or exclusion of correlated measures.
• We elucidate sets of evaluation measures that should be used together when assessing the predictive performance.
• We identify the correlated measures for each measure separately.

In the remainder of the paper, we first present the proposed method for fusion of the performance measures for MLC. Then, the experimental design is explained, followed by the results and discussion. Finally, the conclusions of the paper are presented.

Fusion of performance measures

Let us assume that a comparison needs to be made among m methods (i.e., alternatives) regarding n performance measures (i.e., criteria) on a single multi-label classification problem (i.e., dataset). Let M = {M_1, M_2, ..., M_m} be the set of methods we want to compare regarding the set of performance measures Q = {q_1, q_2, ..., q_n}. The decision matrix is an m × n matrix (see Table 1) that contains the values of the performance measures obtained for the methods.

Table 1: Decision matrix

        q_1        q_2        ...  q_n
M_1     q_1(M_1)   q_2(M_1)   ...  q_n(M_1)
M_2     q_1(M_2)   q_2(M_2)   ...  q_n(M_2)
...     ...        ...        ...  ...
M_m     q_1(M_m)   q_2(M_m)   ...  q_n(M_m)

For drawing conclusions and making recommendations on methods' usage by considering a set of performance measures, we propose a performance measures fusion approach that follows the idea of the PROMETHEE methodology (Brans and Mareschal 2005). More specifically, we exploit the method PROMETHEE II. It is based on making pairwise comparisons within all methods for each performance measure. The differences between the values for each pair of methods according to a specified performance measure are taken into consideration. For larger differences the decision maker might consider larger preferences. The preference function of a performance measure for two methods is defined as the degree of preference of method M_1 over method M_2, as seen in the following equation:

P_j(M_1, M_2) = \begin{cases} p_j(d_j(M_1, M_2)), & \text{maximization of } q_j \\ p_j(-d_j(M_1, M_2)), & \text{minimization of } q_j \end{cases}   (1)

where d_j(M_1, M_2) = q_j(M_1) − q_j(M_2) is the difference between the values of the methods for the performance measure q_j and p_j(·) is a generalized preference function assigned to that performance measure. There exist six types of generalized preference functions (Brans and Vincke 1985). Some of them require certain preferential parameters to be defined, such as the preference and indifference thresholds. The preference threshold is the smallest amount of difference that is assumed as a preference, while the indifference threshold is the greatest amount of difference that is insignificant.
After selecting the preference function for each performance measure, the next step is to define the average preference index and the outranking (preference and net) flows. The average preference index for each pair of methods gives information about the global comparison between them using all performance measures. The average preference index can be calculated as:

\pi(M_1, M_2) = \frac{1}{n} \sum_{j=1}^{n} w_j P_j(M_1, M_2),   (2)

where w_j represents the relative significance (weight) of the j-th performance measure. The higher the weight value of a given performance measure, the higher its relative significance. The selection of the weights is a crucial step in the PROMETHEE II method because it defines the priorities used by the decision-maker. In our case, we used the Shannon entropy weighted method. For the average preference index, we need to point out that it is not a symmetric function, so π(M_1, M_2) ≠ π(M_2, M_1).

To rank the methods, the net flow for each method needs to be calculated. It is the difference between the positive preference flow, φ^+(M_i), and the negative preference flow of the method, φ^−(M_i). The positive preference flow gives information about how a given method is globally better than the other methods, while the negative preference flow gives information about how a given method is outranked by all the other methods. The positive preference flow is defined as:

\phi^{+}(M_i) = \frac{1}{m - 1} \sum_{x \in M} \pi(M_i, x),   (3)

while the negative preference flow is defined as:

\phi^{-}(M_i) = \frac{1}{m - 1} \sum_{x \in M} \pi(x, M_i).   (4)

The net flow of a method is defined as:

\phi(M_i) = \phi^{+}(M_i) - \phi^{-}(M_i).   (5)

The PROMETHEE II method ranks the methods by ordering them according to decreasing values of the net flows.
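To make the ranking step concrete, the following Python sketch implements Equations 1-5 for a single dataset. It is an illustration rather than the authors' implementation: the function and variable names (promethee_ii_ranking, decision_matrix, maximize, weights) are ours, the usual preference function is used by default, and ties in the net flows are resolved with average (fractional) ranks.

```python
import numpy as np
from scipy.stats import rankdata

def usual_preference(d):
    """Usual generalized preference function: full preference for any positive difference."""
    return (d > 0).astype(float)

def promethee_ii_ranking(decision_matrix, maximize, weights, pref_fn=usual_preference):
    """Rank methods (rows) over measures (columns) with PROMETHEE II (Eqs. 1-5).

    decision_matrix : (m, n) array with the raw value of measure j for method i.
    maximize        : length-n boolean array, True where larger values are better.
    weights         : length-n array of measure weights (e.g., Shannon entropy weights).
    """
    m, n = decision_matrix.shape
    pi = np.zeros((m, m))                                   # average preference indices
    for a in range(m):
        for b in range(m):
            if a == b:
                continue
            d = decision_matrix[a] - decision_matrix[b]     # d_j(M_a, M_b)
            d = np.where(maximize, d, -d)                   # Eq. 1: flip sign for minimized measures
            pi[a, b] = np.sum(weights * pref_fn(d)) / n     # Eq. 2
    phi_plus = pi.sum(axis=1) / (m - 1)                     # positive preference flow, Eq. 3
    phi_minus = pi.sum(axis=0) / (m - 1)                    # negative preference flow, Eq. 4
    net_flow = phi_plus - phi_minus                         # net flow, Eq. 5
    ranks = rankdata(-net_flow)                             # rank 1 = largest net flow; ties get fractional ranks
    return net_flow, ranks

# Toy usage: 3 methods, 2 measures (the first maximized, the second minimized), equal weights.
values = np.array([[0.80, 0.10],
                   [0.75, 0.05],
                   [0.60, 0.20]])
net, ranks = promethee_ii_ranking(values, np.array([True, False]), np.full(2, 0.5))
print(net, ranks)
```

A different generalized preference function (e.g., the V-shape function defined later in Equation 14) can be supplied through the pref_fn argument.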
Shannon entropy weighted method

To calculate the weights of the performance measures, we use the Shannon entropy weighted method (Boroushaki 2017). For this reason, the decision matrix presented in Table 1 needs to be normalized. Depending on the value that is preferred (smaller or larger), the matrix is normalized using one of the following equations:

q'_j(M_i) = \frac{\max_i(q_j(M_i)) - q_j(M_i)}{\max_i(q_j(M_i)) - \min_i(q_j(M_i))},   (6)

or

q'_j(M_i) = \frac{q_j(M_i) - \min_i(q_j(M_i))}{\max_i(q_j(M_i)) - \min_i(q_j(M_i))},   (7)

where q'_j(M_i) is the normalized value of q_j(M_i). The sums of the performance measures over all methods are defined as

D_j = \sum_{i=1}^{m} q'_j(M_i), \quad j = 1, \dots, n.   (8)

The entropy of each performance measure is defined as

e_j = K \sum_{i=1}^{m} W\!\left(\frac{q'_j(M_i)}{D_j}\right),   (9)

where K is the normalization coefficient defined as

K = \frac{1}{(e^{0.5} - 1)\, m},   (10)

and W is a function defined as

W(x) = x e^{(1-x)} + (1 - x) e^{x} - 1.   (11)

The weight of each performance measure used in Equation 2 is then calculated using the following equation:

w_j = \frac{\frac{1}{n - E}(1 - e_j)}{\sum_{j=1}^{n} \left[\frac{1}{n - E}(1 - e_j)\right]},   (12)

where E is the sum of the entropies, E = \sum_{j=1}^{n} e_j.
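The sketch below shows how the entropy-based weights of Equations 6-12 could be computed. It is our illustrative reading of the procedure (names such as shannon_entropy_weights and maximize are ours), with a small guard against constant columns added for robustness.

```python
import numpy as np

def shannon_entropy_weights(decision_matrix, maximize):
    """Shannon entropy weights for the measures (columns), following Eqs. 6-12.

    decision_matrix : (m, n) array of raw measure values.
    maximize        : length-n boolean array, True where larger values are preferred.
    """
    m, n = decision_matrix.shape
    col_min = decision_matrix.min(axis=0)
    col_max = decision_matrix.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)   # guard against constant columns
    # Eqs. 6-7: min-max normalisation oriented so that larger normalised values are better.
    q_norm = np.where(maximize,
                      (decision_matrix - col_min) / span,
                      (col_max - decision_matrix) / span)
    D = q_norm.sum(axis=0)                                       # Eq. 8
    K = 1.0 / ((np.exp(0.5) - 1.0) * m)                          # Eq. 10
    W = lambda x: x * np.exp(1 - x) + (1 - x) * np.exp(x) - 1    # Eq. 11
    e = K * W(q_norm / np.where(D > 0, D, 1.0)).sum(axis=0)      # Eq. 9
    E = e.sum()
    return (1.0 - e) / (n - E)                                   # Eq. 12 (the common 1/(n-E) factor cancels)

# Toy usage: 4 methods, 3 measures (the first two maximized, the third minimized).
vals = np.array([[0.9, 0.7, 0.10],
                 [0.8, 0.9, 0.05],
                 [0.7, 0.6, 0.20],
                 [0.6, 0.8, 0.15]])
print(shannon_entropy_weights(vals, np.array([True, True, False])))
```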
Correlation analysis

The existing literature on evaluation methodology for machine learning, and especially the works referring to the task of MLC, correctly identifies that some of the typically used measures are correlated among themselves. Furthermore, it points out that one needs to consider different uncorrelated measures to get a better insight into the performance of the evaluated methods. To this end, we perform a correlation analysis of the proposed methodology to assess its robustness to correlated measures, and as an additional result we empirically elucidate the correlations among the measures widely used for MLC.

We used a correlation analysis that considers the absolute values of the pairwise correlations. Namely, we performed a correlation analysis on each dataset, starting by calculating a correlation matrix for each decision matrix presented in Table 1. In our case, the correlation matrix is an n × n matrix showing the Pearson correlation coefficients between the performance measures (Benesty et al. 2009). The Pearson correlation coefficient is a measure of the linear correlation between two performance measures; its value is between -1 and 1. We then averaged the correlation matrices across datasets. Furthermore, we removed the performance measures that have an average absolute correlation greater than some threshold, thus obtaining sets of evaluation measures that are least correlated. Finally, by applying a threshold on the correlation coefficients we obtain the measures that are most correlated among themselves.
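A minimal sketch of this analysis is given below, assuming the per-dataset decision matrices are available as NumPy arrays. The greedy removal order (always dropping the measure most correlated with the rest) is our assumption; the paper only states that measures whose average absolute correlation exceeds the threshold are removed.

```python
import numpy as np

def average_abs_correlation(decision_matrices):
    """Average the absolute Pearson correlation matrices of the measures across datasets.

    decision_matrices : list of (m, n) arrays, one decision matrix (Table 1) per dataset.
    """
    mats = [np.abs(np.corrcoef(dm, rowvar=False)) for dm in decision_matrices]
    return np.mean(mats, axis=0)

def least_correlated_measures(decision_matrices, names, threshold):
    """Greedily drop the measure most correlated with the rest until every remaining
    measure has an average absolute correlation with the others below the threshold."""
    corr = average_abs_correlation(decision_matrices)
    keep = list(range(len(names)))
    while len(keep) > 1:
        sub = corr[np.ix_(keep, keep)]
        mean_corr = (sub.sum(axis=1) - 1.0) / (len(keep) - 1)   # exclude self-correlation
        worst = int(np.argmax(mean_corr))
        if mean_corr[worst] <= threshold:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Toy usage: 2 datasets, 5 methods, 4 measures.
rng = np.random.default_rng(0)
dms = [rng.random((5, 4)) for _ in range(2)]
print(least_correlated_measures(dms, ["accuracy", "F1", "hamming_loss", "coverage"], threshold=0.9))
```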
Experimental design

The data used to evaluate the performance of the fusion method is taken from (Madjarov et al. 2012). In that study, 12 MLC methods are compared according to a set of 16 performance measures separately. The methods are divided into three groups according to the base machine learning algorithm: (1) SVMs (BR (Tsoumakas and Katakis 2007), CC (Read et al. 2011), CLR (Park and Fürnkranz 2007), QWML (Mencía, Park, and Fürnkranz 2010), HOMER (Tsoumakas, Katakis, and Vlahavas 2008), RAkEL (Tsoumakas and Vlahavas 2007), ECC (Read et al. 2011)), (2) decision trees (ML-C4.5 (Clare and King 2001), PCT (Blockeel, Raedt, and Ramon 1998), RFML-C4.5 (Breiman 2001), RF-PCT (Kocev et al. 2013)), and (3) nearest neighbors (ML-kNN (Zhang and Zhou 2007)).

The evaluation measures of predictive performance are divided into two groups (Madjarov et al. 2012; Tsoumakas and Katakis 2007): bipartitions-based and rankings-based. The bipartitions-based evaluation measures are calculated based on the comparison of the predicted relevant labels with the ground truth relevant labels. This group of evaluation measures is further divided into example-based and label-based. The example-based evaluation measures (Hamming loss, accuracy, precision, recall, F1 score and subset accuracy) are based on the average differences of the actual and the predicted sets of labels over all examples of the evaluation dataset. The label-based evaluation measures (micro precision, micro recall, micro F1, macro precision, macro recall and macro F1), on the other hand, assess the predictive performance for each label separately and then average the performance over all labels. The ranking-based evaluation measures (one-error, coverage, ranking loss and average precision) compare the predicted ranking of the labels with the ground truth ranking.

Using the set of performance measures, the methods are compared on 11 MLC benchmark datasets: emotions, scene, yeast, medical, enron, corel5k, tmc2007, mediamill, bibtex, delicious, and bookmarks. A detailed explanation of the implementation of the methods, the definitions of the performance measures, and the basic statistics of the datasets are given in (Madjarov et al. 2012).

We selected and tested two generalized preference functions defined in Equation 1. First, a usual preference function is used for each performance measure, so we do not need to select the preference and indifference thresholds. The usual preference function is presented in Equation 13. Using this preference function, we can only say whether there is a difference or not, but we do not take into account the value of the difference.

p(x) = \begin{cases} 0, & x \le 0 \\ 1, & x > 0 \end{cases}   (13)

Second, a V-shape generalized preference function is used for each performance measure, in which the threshold of strict preference, q, is set to the maximum difference that exists for each performance measure on a given benchmark problem. The V-shape preference function is presented in Equation 14. Using this preference function, all difference values are taken into account using a linear function.

p(x) = \begin{cases} 0, & x \le 0 \\ \frac{x}{q}, & 0 \le x \le q \\ 1, & x > q \end{cases}   (14)

According to whether a smaller or a larger value of each performance measure is preferable, the 16 performance measures can be split into two groups: (1) Minimization (Hamming loss, One error, Coverage, Ranking loss) and (2) Maximization (Precision, Accuracy, Recall, F1 score, Subset accuracy, Macro precision, Macro recall, Macro F1, Micro precision, Micro recall, Micro F1, Average precision).
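As an illustration of how the two preference functions differ, the sketch below implements Equations 13 and 14 together with the per-measure estimation of the strict preference threshold q as the maximum pairwise difference on a benchmark problem; the function names are ours.

```python
import numpy as np

def usual_preference(d):
    """Usual preference function (Eq. 13): any positive difference gives full preference."""
    return (d > 0).astype(float)

def v_shape_preference(d, q):
    """V-shape preference function (Eq. 14): preference grows linearly up to the threshold q."""
    return np.clip(np.asarray(d, dtype=float) / q, 0.0, 1.0)

def strict_preference_thresholds(decision_matrix):
    """Per-measure threshold q: the maximum difference between any two methods
    on the given benchmark problem (i.e., the column range of the decision matrix)."""
    return decision_matrix.max(axis=0) - decision_matrix.min(axis=0)

# Toy usage on a single measure with four methods.
col = np.array([0.81, 0.78, 0.60, 0.59])
q = col.max() - col.min()                 # 0.22
d = col[0] - col[1]                       # 0.03: a small win
print(usual_preference(np.array([d])))    # [1.0]  -- counted as a full win
print(v_shape_preference(d, q))           # ~0.136 -- the size of the win matters
```

This difference is what later makes the usual function behave like a majority vote over measures, while the V-shape function also weighs the magnitude of each win.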
Results and discussion

We compared the 12 MLC methods using the set of 16 performance measures on each dataset separately by using the performance measures fusion ranking. We performed the analysis for the two preference functions (the usual generalized and the V-shape generalized preference function). The latter was used with a different threshold of strict preference for each performance measure. The threshold of strict preference for each performance measure was estimated on each dataset separately and it was set as the maximum difference that exists among all pairwise comparisons of the values between the methods regarding that performance measure on that dataset.

The performance measures fusion rankings of the methods obtained using the usual generalized preference function are presented in Table 2, while the rankings obtained using the V-shape generalized preference function are presented in Table 3. Comparing the rankings from the tables, both generalized preference functions yield equal rankings only on the bookmarks dataset. The main reason for this is the size of the bookmarks dataset. Namely, most of the methods were not able to return a result given the experimental setting as provided in the study by (Madjarov et al. 2012). This in turn means that the preference functions are calculated on a small number of different values for the performance measures (the experiments that did not finish on time were given the equally worst performance, as stipulated by (Madjarov et al. 2012)). For all other datasets, the rankings of the methods differ. For example, let us focus on the delicious dataset, for which the rankings of only two methods differ. In the case of the usual generalized preference function RFML-C4.5 is ranked second and RF-PCT is ranked first, while in the case of the V-shape generalized preference function they swap their rankings: RFML-C4.5 is ranked first and RF-PCT second. So, to understand why this happens, we will analyze the performance measures fusion approach on the delicious dataset.

Table 2: Performance measures fusion rankings using the usual generalized preference function.

Dataset     BR     CC     CLR    QWML   HOMER  ML-C4.5  PCT    ML-kNN  RAkEL  ECC    RFML-C4.5  RF-PCT
emotions    7.00   10.00  8.00   11.00  5.00   3.00     4.00   12.00   6.00   9.00   2.00       1.00
scene       2.00   3.00   5.00   7.00   6.00   11.00    12.00  9.00    1.00   4.00   10.00      8.00
yeast       2.00   4.00   1.00   6.00   3.00   11.00    12.00  8.00    9.00   5.00   10.00      7.00
medical     8.00   5.00   3.00   1.00   2.00   4.00     12.00  10.00   7.00   9.00   11.00      6.00
enron       2.00   7.00   1.00   9.00   4.00   10.00    12.00  11.00   8.00   5.00   6.00       3.00
corel5k     4.00   5.00   1.00   2.00   3.00   9.00     11.00  7.00    12.00  10.00  8.00       6.00
tmc2007     3.00   1.00   4.00   5.00   6.00   11.00    12.00  10.00   7.00   9.00   8.00       2.00
mediamill   4.00   5.00   11.00  10.00  6.00   12.00    7.00   3.00    9.00   8.00   2.00       1.00
bibtex      2.00   1.00   3.00   4.00   5.00   10.00    11.00  8.00    12.00  7.00   9.00       6.00
delicious   4.00   3.00   10.50  10.50  5.00   7.00     8.00   6.00    10.50  10.50  2.00       1.00
bookmarks   9.00   9.00   9.00   9.00   9.00   2.00     5.00   3.00    9.00   9.00   4.00       1.00

Table 3: Performance measures fusion rankings using the V-shape generalized preference function.

Dataset     BR     CC     CLR    QWML   HOMER  ML-C4.5  PCT    ML-kNN  RAkEL  ECC    RFML-C4.5  RF-PCT
emotions    8.00   10.00  9.00   11.00  5.00   3.00     4.00   12.00   6.00   7.00   2.00       1.00
scene       2.00   1.00   4.00   7.00   6.00   11.00    12.00  8.00    3.00   5.00   10.00      9.00
yeast       2.00   4.00   1.00   9.00   3.00   12.00    11.00  6.00    8.00   5.00   10.00      7.00
medical     10.00  9.00   2.00   1.00   4.00   3.00     12.00  6.00    7.00   8.00   11.00      5.00
enron       1.00   10.00  2.00   7.00   4.00   8.00     12.00  11.00   9.00   6.00   5.00       3.00
corel5k     4.00   5.00   1.00   2.00   3.00   10.00    9.00   7.00    11.00  12.00  6.00       8.00
tmc2007     3.00   2.00   4.00   6.00   5.00   11.00    12.00  10.00   7.00   8.00   9.00       1.00
mediamill   4.00   5.00   12.00  11.00  7.00   10.00    6.00   3.00    8.00   9.00   2.00       1.00
bibtex      2.00   1.00   3.00   4.00   5.00   10.00    11.00  7.00    12.00  8.00   9.00       6.00
delicious   4.00   3.00   10.50  10.50  5.00   7.00     8.00   6.00    10.50  10.50  1.00       2.00
bookmarks   9.00   9.00   9.00   9.00   9.00   2.00     5.00   3.00    9.00   9.00   4.00       1.00

When different generalized preference functions are used, it follows that the methods have different net flows. The net flows depend on the positive and negative flows, which are related to the average preference index. Furthermore, the average preference index depends on the weights of the performance measures and on the selected generalized preference function. In our case, the weight for each performance measure is estimated according to the entropy it conveys using the Shannon entropy weighted method; the result is that the performance measures are weighted uniformly on each dataset, so they all have the same influence on the end result, w_j = w, j = 1, ..., n, for both versions of the performance fusion approach. Having this result, it follows that the difference between the rankings in the two versions comes from the selection of different generalized preference functions.

For this reason, Figures 1 and 2 present the average preference indices, π(RF-PCT, M_i) and π(RFML-C4.5, M_i), used for calculating the positive flows, obtained on the delicious dataset. Using these figures, we can see that the average preference indices obtained using the usual generalized preference function between RFML-C4.5 and each of the methods CLR, QWML, PCT, RAkEL, and ECC are the same as the average preference indices obtained between RF-PCT and each of the methods CLR, QWML, PCT, RAkEL, and ECC. However, this is not the case when the V-shape generalized preference function is used. In that case, the same abovementioned average preference indices obtained for RFML-C4.5 are greater than those obtained for RF-PCT.

Figure 1: Average preference indices for RFML-C4.5 and RF-PCT obtained on the delicious dataset using the usual preference function (π(RFML-C4.5, M_i) and π(RF-PCT, M_i)). [Bar chart over the 12 methods; only the caption is reproduced here.]

Figure 2: Average preference indices for RFML-C4.5 and RF-PCT obtained on the delicious dataset using the V-shape preference function (π(RFML-C4.5, M_i) and π(RF-PCT, M_i)). [Bar chart over the 12 methods; only the caption is reproduced here.]
We inspect the results more closely through a pairwise comparison with the ECC method. Using the usual generalized preference function, we can see that π(RFML-C4.5, ECC) = π(RF-PCT, ECC), while if the V-shape generalized preference function is used, π(RFML-C4.5, ECC) > π(RF-PCT, ECC). Having the weights uniformly distributed, all of them have the same value, w, and Equation 2 is transformed into:

\pi(M_1, M_2) = \frac{w}{n} \sum_{j=1}^{n} P_j(M_1, M_2).   (15)

Using the usual generalized preference function, we can see that both methods, RFML-C4.5 and RF-PCT, win against ECC according to all performance measures, but with it we only count wins and losses without taking into account how large the wins of RFML-C4.5 and RF-PCT against ECC are. By using the usual generalized preference function, the performance measures fusion approach behaves as a majority vote in the case when the influence of each performance measure is uniform, which happens in our case. However, using the V-shape generalized preference function, the information about how large the win is is also taken into account. Both methods also win against ECC on all performance measures, but here the magnitude of the wins is also considered, which results in different average preference indices. So it follows that RFML-C4.5 (\sum_{j=1}^{n} P_j(\text{RFML-C4.5}, \text{ECC}) = 13.63) has a greater average preference index than RF-PCT (\sum_{j=1}^{n} P_j(\text{RF-PCT}, \text{ECC}) = 12.43).

After describing the inner workings of the proposed method on a single dataset in detail, the obtained rankings for each dataset can be further used with a statistical test to provide a general overall conclusion of the benchmarking of the MLC methods. The Friedman test was selected as an appropriate test for this purpose. The p-value for the rankings obtained with the usual generalized preference function is 0.0005, while the p-value for the rankings obtained using the V-shape generalized preference function is 0.0061. In both cases, the null hypothesis is rejected, so there is a difference between the methods according to the set of 16 performance measures compared on the set of 11 benchmark datasets. To further check where the difference comes from, the Nemenyi post-hoc test (all vs. all) was used with a significance level of 0.05. In the case of the usual generalized preference function, the difference comes from the pairs of methods (RF-PCT, PCT) and (BR, PCT), while in the case of the V-shape generalized preference function, there is only a difference in the pair (RF-PCT, PCT). This implies that the differences in the rankings of the methods are very small.
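For the statistical comparison, a small sketch using SciPy's Friedman test on the per-dataset fusion rankings (one row per dataset, one column per method) could look as follows; the helper name is ours and the post-hoc step is only indicated.

```python
import numpy as np
from scipy.stats import friedmanchisquare

def friedman_on_fusion_rankings(ranking_table):
    """Friedman test over the fusion rankings.

    ranking_table : (datasets, methods) array such as Table 2 or Table 3,
                    one fused ranking per dataset and method.
    Returns the test statistic, the p-value, and the average rank per method.
    """
    stat, p_value = friedmanchisquare(*[ranking_table[:, j]
                                        for j in range(ranking_table.shape[1])])
    avg_ranks = ranking_table.mean(axis=0)
    return stat, p_value, avg_ranks

# Toy usage: 4 datasets, 3 methods (in the paper: 11 datasets, 12 methods).
table = np.array([[1, 2, 3],
                  [1, 3, 2],
                  [2, 1, 3],
                  [1, 2, 3]], dtype=float)
stat, p, avg = friedman_on_fusion_rankings(table)
print(p, avg)
# If the null hypothesis is rejected, a post-hoc test such as Nemenyi (available in
# third-party packages, e.g. scikit-posthocs) can locate the differing method pairs.
```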
Re- rankings are robust to the correlated measures, they can vary, call that some of the evaluation measures for MLC are cor- but with a very small differences. related among themselves. For this reason, we performed a Finally, we use the aggregated correlation matrix across correlation analysis to investigate whether the method rank- datasets to elucidate the correlated measures. The results are ings will be disturbed by removing the correlated measures. given in Figure 3. The results show a large group of inter- We performed this analysis using the results from the V - connected measures. We can note that accuracy, F1 score shape generalized preference function. We investigate three and micro F1 are connected with most measures (each has predefined correlation thresholds: 0.7, 0.8, and 0.9. The ex- 8 connections). The least connected are the ranking based act values of the thresholds were selected for illustrative pur- measures. Figure 3: Correlation between performance measures. Red edges are for correlation greater than 0.9, blue and red edges are for correlation greater than 0.8, and green, blue, and red edges correspond to correlation more than 0.7. The evaluation measures Hamming loss, one-error and micro precision are not correlated with the other measures. Table 4: Average rankings for each method across datasets. Table 5: Practical rankings for each method across datasets. 0.7 0.8 0.9 All 0.7 0.8 0.9 All BR 4.82 4.82 4.45 4.45 BR 2.00 2.00 2.00 2.00 CC 5.09 5.18 5.45 5.36 CC 3.00 3.00 5.00 5.00 CLR 5.14 5.32 5.23 5.23 CLR 4.00 4.00 3.00 4.00 QWML 6.77 6.77 6.41 7.05 QWML 7.00 7.00 6.00 7.00 HOMER 6.27 6.55 6.73 5.09 HOMER 6.00 6.00 7.00 3.00 ML-C4.5 8.27 8.09 7.91 7.91 ML-C4.5 9.00 10.00 9.00 9.00 PCT 9.09 9.00 9.27 9.27 PCT 12.00 12.00 12.00 12.00 ML-kNN 6.91 6.91 6.91 7.18 ML-kNN 8.00 8.00 8.00 8.00 RAkEL 8.41 7.95 8.50 8.23 RAkEL 10.00 9.00 10.00 11.00 ECC 8.59 8.59 8.59 7.95 ECC 11.00 11.00 11.00 10.00 RFML-C4.5 5.45 5.45 5.36 6.27 RFML-C4.5 5.00 5.00 4.00 6.00 RF-PCT 3.18 3.36 3.18 4.00 RF-PCT 1.00 1.00 1.00 1.00 This is the first attempt at treating the versatile results of enormous. From a user perspective, the proposed method MLC experiments in an unified way. More specifically, most takes as input the tables with the results does the necessary of the works in the area report performance along many in- calculations and outputs the overall rankings of the meth- dividual measures and making general conclusions in such ods across the different evaluation measures. This is very a setting is heavily impaired. This is evident also in the ex- convenient considering the number of evaluation measures tensive experimental comparison performed by (Madjarov et typically used for MLC. This way benchmarking of new al. 2012), where the results are extensively discussed along methods for MLC can be performed with a great ease. More- multiple evaluation measures. We consider the results from over, it provides the user a nice overview of the methods per- this study to evaluate and illustrate our method because it formance: The proposed methodology shows its robustness is the most extensive and most complete study for MLC. We on correlated measures and also defines sets of performance could easily use also other experimental results, but there are measure that are not correlated and can be further included not many that follow the same experimental design and have in individual analyses. the results readily publicly available. 
This is the first attempt at treating the versatile results of MLC experiments in a unified way. More specifically, most of the works in the area report performance along many individual measures, and making general conclusions in such a setting is heavily impaired. This is evident also in the extensive experimental comparison performed by (Madjarov et al. 2012), where the results are extensively discussed along multiple evaluation measures. We consider the results from this study to evaluate and illustrate our method because it is the most extensive and most complete study for MLC. We could easily use other experimental results as well, but there are not many that follow the same experimental design and have the results publicly available.

The potential for practical use of the proposed method is enormous. From a user perspective, the proposed method takes as input the tables with the results, does the necessary calculations, and outputs the overall rankings of the methods across the different evaluation measures. This is very convenient considering the number of evaluation measures typically used for MLC. This way, benchmarking of new methods for MLC can be performed with great ease. Moreover, it provides the user a nice overview of the methods' performance: the proposed methodology shows its robustness on correlated measures and also defines sets of performance measures that are not correlated and can be further included in individual analyses.

We need to mention that the proposed methodology can easily consider also other performance measures, such as running times and memory consumption.

Conclusions

In this paper, we propose an approach for fusing multiple evaluation measures for MLC into an overall assessment of performance. The benefit of using this approach is manifold. First, it is designed for making a general conclusion using a set of performance measures. Second, it avoids the comparison according to multiple performance measures separately and then reporting on the results in a biased manner. Third, it is robust to the inclusion of correlated evaluation measures. Finally, it gives lists of evaluation measures that are correlated among themselves, thus avoiding comparisons only on correlated measures.

For future work, we plan to extend this approach by investigating different preference functions and selecting the best suitable one for each performance measure regarding its properties. Next, we will investigate the building of hybrid methods (a mix of several generalized preference functions) that can be used for experimental comparison of methods for MLC. Finally, we will extend the experimental study by including more datasets and methods.
Acknowledgments

This work was supported by the Slovenian Research Agency (research core funding No. P2-0098 and No. P2-0103).

References

Benesty, J.; Chen, J.; Huang, Y.; and Cohen, I. 2009. Pearson correlation coefficient. In Noise Reduction in Speech Processing. Springer. 1–4.
Blockeel, H.; Raedt, L. D.; and Ramon, J. 1998. Top-down induction of clustering trees. In Proceedings of the 15th International Conference on Machine Learning, 55–63. Morgan Kaufmann.
Boroushaki, S. 2017. Entropy-based weights for multicriteria spatial decision-making. Yearbook of the Association of Pacific Coast Geographers 79:168–187.
Brans, J.-P., and Mareschal, B. 2005. PROMETHEE methods. In Multiple Criteria Decision Analysis: State of the Art Surveys. Springer. 163–186.
Brans, J.-P., and Vincke, P. 1985. Note—A preference ranking organisation method (the PROMETHEE method for multiple criteria decision-making). Management Science 31(6):647–656.
Breiman, L. 2001. Random forests. Machine Learning 45(1):5–32.
Clare, A., and King, R. D. 2001. Knowledge discovery in multi-label phenotype data. In European Conference on Principles of Data Mining and Knowledge Discovery, 42–53. Springer.
Crammer, K., and Singer, Y. 2003. A family of additive online algorithms for category ranking. Journal of Machine Learning Research 3:1025–1058.
De Comité, F.; Gilleron, R.; and Tommasi, M. 2003. Learning multi-label alternating decision trees from texts and data. In Proc. of the 3rd International Conference on Machine Learning and Data Mining in Pattern Recognition, 35–49.
Dembczyński, K.; Waegeman, W.; Cheng, W.; and Hüllermeier, E. 2010. Regret analysis for performance metrics in multi-label classification: The case of Hamming and subset zero-one loss. In Machine Learning and Knowledge Discovery in Databases, 280–295. Berlin, Heidelberg: Springer Berlin Heidelberg.
Eftimov, T.; Korošec, P.; and Koroušić Seljak, B. 2018. Data-driven preference-based deep statistical ranking for comparing multi-objective optimization algorithms. In International Conference on Bioinspired Methods and Their Applications, 138–150. Springer.
Fürnkranz, J. 2002. Round robin classification. Journal of Machine Learning Research 2:721–747.
Gao, W., and Zhou, Z.-H. 2013. On the consistency of multi-label learning. Artificial Intelligence 199-200:22–44.
Gibaja, E., and Ventura, S. 2015. A tutorial on multilabel learning. ACM Computing Surveys 47(3):1–38.
Ishizaka, A., and Nemery, P. 2011. Selecting the best statistical distribution with PROMETHEE and GAIA. Computers & Industrial Engineering 61(4):958–969.
Kocev, D.; Vens, C.; Struyf, J.; and Džeroski, S. 2013. Tree ensembles for predicting structured outputs. Pattern Recognition 46(3):817–833.
Madjarov, G.; Kocev, D.; Gjorgjevikj, D.; and Džeroski, S. 2012. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition 45(9):3084–3104.
Mencía, E. L.; Park, S.-H.; and Fürnkranz, J. 2010. Efficient voting prediction for pairwise multilabel classification. Neurocomputing 73(7-9):1164–1176.
Nikouei, M. A.; Oroujzadeh, M.; and Mehdipour-Ataei, S. 2017. The PROMETHEE multiple criteria decision making analysis for selecting the best membrane prepared from sulfonated poly(ether ketone)s and poly(ether sulfone)s for proton exchange membrane fuel cell. Energy 119:77–85.
Park, S.-H., and Fürnkranz, J. 2007. Efficient pairwise classification. In European Conference on Machine Learning, 658–665. Springer.
Read, J.; Pfahringer, B.; Holmes, G.; and Frank, E. 2011. Classifier chains for multi-label classification. Machine Learning 85(3):333.
Thabtah, F. A.; Cowling, P.; and Peng, Y. 2004. MMAC: A new multi-class, multi-label associative classification approach. In Proc. of the 4th IEEE International Conference on Data Mining, 217–224.
Tsoumakas, G., and Katakis, I. 2007. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM) 3(3):1–13.
Tsoumakas, G., and Vlahavas, I. 2007. Random k-labelsets: An ensemble method for multilabel classification. In European Conference on Machine Learning, 406–417. Springer.
Tsoumakas, G.; Katakis, I.; and Vlahavas, I. 2008. Effective and efficient multilabel classification in domains with large number of labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08), volume 21, 53–59.
Waegeman, W.; Dembczyński, K.; Jachnik, A.; Cheng, W.; and Hüllermeier, E. 2014. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research 15:3513–3568.
Wu, X.-Z., and Zhou, Z.-H. 2017. A unified view of multi-label performance measures. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017.
Zhang, M.-L., and Zhou, Z.-H. 2007. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition 40(7):2038–2048.