<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Performance Measures Fusion for Experimental Comparison of Methods for Multi-label Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tome Eftimov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dragi Kocev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Systems Department, Jožef Stefan Institute</institution>
          ,
          <addr-line>Jamova cesta 39, 1000 Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Knowledge Technologies, Jožef Stefan Institute</institution>
          ,
          <addr-line>Jamova cesta 39, 1000 Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019), Stanford University</institution>
          ,
          <addr-line>Palo Alto, California, USA, March 25-27, 2019</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Over the past few years, multi-label classification has been widely explored in the machine learning community. This has resulted in a number of multi-label classification methods that require benchmarking to determine their strengths and weaknesses. For this reason, the authors typically compare the methods on a set of benchmark problems (datasets) with regard to different performance measures, and at the end the results are discussed for each performance measure separately. In order to reach a general conclusion that includes the contribution of each performance measure, we propose a performance measures fusion approach based on multi-criteria decision analysis. The approach provides rankings of the compared methods for each benchmark problem separately. These rankings can then be aggregated to discover sets of correlated measures as well as sets of evaluation measures that are least correlated. The performance and the robustness of the proposed methodology are investigated and illustrated on the results from a comprehensive experimental study including 12 multi-label classification methods evaluated according to 16 performance measures on a set of 11 benchmark problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Supervised learning is one of the most widely researched
and investigated areas of machine learning. The goal in
supervised learning is to learn, from a set of examples with
known class, a function that outputs a prediction for the class
of a previously unseen example. If the examples belong to
two classes (e.g., the example has some property or not) the
task is called binary classification. The task where the
examples can belong to a single class from a given set of m
classes (m ≥ 3) is known as multi-class classification. The
case where the output is a real value is called regression.</p>
      <p>However, in many real life problems of predictive
modelling the output (i.e., the target) can be structured, meaning
that there can be more complex output structures such as
vectors of variables with some dependencies among them.
One type of structured output is a vector of binary
variables, i.e., the examples can belong to multiple classes
simultaneously. This task is known as multi-label classification (MLC).</p>
      <p>Performance evaluation for MLC is a more complex task
than that of classical single-label classification, due to the
nature of the task: one example can be labelled with
multiple labels. Namely, it is difficult to assess which error is
worse: two instances with two incorrect labels each, or four
instances with a single incorrect label each. To this end, in
any typical multi-label experiment, it is essential to include
multiple and contrasting measures because of the additional
degrees of freedom that the multi-label setting introduces
(Madjarov et al. 2012).</p>
      <p>
        The relations among the different evaluation measures in
the literature have been theoretically studied and the main
findings can be summarized as follows. To begin with,
Hamming loss and subset accuracy have a different structure and
minimization of one may cause a high regret for the other
        <xref ref-type="bibr" rid="ref12">(Dembczyński et al. 2010)</xref>
        . Next, a study on surrogate losses
for MLC showed that none of the convex surrogate losses is
consistent with the ranking loss
        <xref ref-type="bibr" rid="ref16">(Gao and Zhou 2013)</xref>
        .
Furthermore, the F-measure optimality of the inference algorithm
is studied with decision theoretic approaches
        <xref ref-type="bibr" rid="ref31">(Waegeman et
al. 2014)</xref>
        . Finally, an investigation on the shared properties
among different measures yielded a unified understanding
for MLC evaluation
        <xref ref-type="bibr" rid="ref32">(Wu and Zhou 2017)</xref>
        . All in all, when
benchmarking novel MLC methods, it is necessary to
compare their performance with existing state-of-the-art
methods. However, due to the multitude of evaluation measures,
drawing a clear summaries and conclusions is not easy: the
methods have different performance compared to the
competing methods on the different evaluation measures. This
makes proving a summary recommendation a complex task.
      </p>
      <p>
        Considering this, we propose an approach for
experimental comparison of methods for multi-label classification. It
is developed for making a general conclusion using a set
of user-specified performance measures. For this reason,
the approach follows the idea of PROMETHEE methods,
which are applicable in different domains such as business,
chemistry, manufacturing, social sciences, agriculture and
medicine
        <xref ref-type="bibr" rid="ref18 ref32">(Ishizaka and Nemery 2011; Nikouei, Oroujzadeh,
and Mehdipour-Ataei 2017)</xref>
        . Recently, they were also used
in a data-driven approach for evaluating multi-objective
optimization algorithms regarding different performance
measures
        <xref ref-type="bibr" rid="ref13">(Eftimov, Korošec, and Koroušič Seljak 2018)</xref>
        . To the
best of our knowledge, they were not used in the domain
of MLC. The PROMETHEE methodology works as a
ranking scheme for transforming the data for each benchmark
dataset instead of using some traditional statistical ranking
scheme (e.g., fractional ranking scheme). Further, the
obtained rankings, which fuse multiple performance
measures, are used in a statistical test to provide a general
conclusion from the benchmark experiment.
      </p>
      <p>The main contributions of the paper are:</p>
      <list list-type="bullet">
        <list-item><p>A methodology for fusing the various evaluation measures for the task of MLC.</p></list-item>
        <list-item><p>Evidence that the proposed methodology is robust to the inclusion or exclusion of correlated measures.</p></list-item>
        <list-item><p>Sets of evaluation measures that should be used together when assessing the predictive performance, with the correlated measures identified for each measure separately.</p></list-item>
      </list>
      <p>In the remainder of the paper, we first present the proposed
method for fusion of the performance measures for MLC.
Then, the experimental design is explained followed by the
results and discussion. Finally, the conclusions of the paper
are presented.</p>
    </sec>
    <sec id="sec-2">
      <title>Fusion of performance measures</title>
      <p>Let us assume that a comparison needs to be made among m
methods (i.e., alternatives) regarding n performance
measures (i.e., criteria) on a single multi-label classification
problem (i.e., dataset). Let M = {M1, M2, ..., Mm} be
the set of methods we want to compare regarding the set of
performance measures Q = {q1, q2, ..., qn}. The decision
matrix is an m × n matrix (see Table 1) that contains the values
of the performance measures obtained for the methods.</p>
      <p>
        For drawing conclusions and making recommendations
on methods’ usage by considering a set of performance
measures, we propose a performance measures fusion
approach that follows the idea of PROMETHEE
methodology
        <xref ref-type="bibr" rid="ref4">(Brans and Mareschal 2005)</xref>
        . More specifically, we
exploit the PROMETHEE II method. It is based on making
pairwise comparisons between all methods for each
performance measure. The differences between the values for each
pair of methods according to a specified performance
metric are taken into consideration. For larger differences the
decision maker might consider larger preferences. The
preference function of a performance measure for two methods
is defined as the degree of preference of method M1 over
method M2 as seen in the following equation:
      </p>
      <p>
        P_j(M1, M2) = p_j(d_j(M1, M2)) if q_j is to be maximized, and
P_j(M1, M2) = p_j(−d_j(M1, M2)) if q_j is to be minimized, (1)
where d_j(M1, M2) = q_j(M1) − q_j(M2) is the difference
between the values of the methods for the performance
measure q_j, and p_j(·) is a generalized preference function
assigned to that performance measure. There exist six types of
generalized preference functions
        <xref ref-type="bibr" rid="ref6">(Brans and Vincke 1985)</xref>
        .
Some of them require certain preferential parameters to be
defined, such as the preference and indifference thresholds.
The preference threshold is the smallest difference that is
assumed to express preference, while the indifference threshold is the
greatest difference that is considered insignificant.
      </p>
      <p>After selecting the preference function for each
performance measure, the next step is to define the average
preference index and the outranking (preference and net) flows. The
average preference index for each pair of methods gives
information about the global comparison between them using all
performance measures. The average preference index can be
calculated as:
π(M1, M2) = (1/n) Σ_{j=1}^{n} w_j P_j(M1, M2), (2)
where w_j represents the relative significance (weight) of the
jth performance measure. The higher the weight value of
a given performance measure, the higher its relative
significance. The selection of the weights is a crucial step in the
PROMETHEE II method because it defines the priorities
used by the decision-maker. In our case, we used the
Shannon entropy weighted method. For the average preference
index, we need to point out that it is not a symmetric
function, so π(M1, M2) ≠ π(M2, M1).</p>
      <p>To rank the methods, the net flow for each method needs
to be calculated. It is the difference between the positive
preference flow, φ+(Mi), and the negative preference flow of the
method, φ−(Mi). The positive preference flow gives information about how
a given method is globally better than the other methods,
while the negative preference flow gives information
about how a given method is outranked by all the other
methods. The positive preference flow is defined as:
φ+(Mi) = (1/(m − 1)) Σ_{x∈M} π(Mi, x),
while the negative preference flow is defined as:
φ−(Mi) = (1/(m − 1)) Σ_{x∈M} π(x, Mi).
The net flow of a method is defined as:
φ(Mi) = φ+(Mi) − φ−(Mi).</p>
      <p>The PROMETHEE II method ranks the methods by
ordering them according to decreasing values of net flows.</p>
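      <p>As an illustration, the ranking pipeline described above (Equations 1 and 2, the outranking flows, and the final ordering) can be sketched in plain Python. This is our own minimal sketch, not the authors' implementation; all names are illustrative, and the default preference function is the usual 0/1 function.</p>

```python
def promethee_ii(decision, weights, maximize,
                 pref=lambda d: 1.0 if d > 0 else 0.0):
    """Rank m methods on n measures with PROMETHEE II (sketch).

    decision[i][j]: value of measure j for method i;
    weights[j]: weight of measure j (entropy weights in the paper);
    maximize[j]: True when larger values of measure j are better;
    pref: generalized preference function applied to value differences.
    """
    m, n = len(decision), len(decision[0])
    # Average preference index pi(a, b), Equation 2.
    pi = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            if a == b:
                continue
            for j in range(n):
                d = decision[a][j] - decision[b][j]  # d_j(Ma, Mb)
                if not maximize[j]:
                    d = -d  # measures where smaller values are preferred
                pi[a][b] += weights[j] * pref(d) / n
    # Positive, negative, and net outranking flows.
    phi = [sum(pi[a]) / (m - 1)
           - sum(pi[b][a] for b in range(m)) / (m - 1)
           for a in range(m)]
    # Methods ordered by decreasing net flow (best first).
    return sorted(range(m), key=lambda a: -phi[a]), phi
```

      <p>For example, three methods evaluated on one measure to be maximized and one loss to be minimized are ranked by calling promethee_ii([[1.0, 0.2], [0.5, 0.2], [0.1, 0.9]], [0.5, 0.5], [True, False]).</p>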
      <sec id="sec-2-1">
        <title>Shannon entropy weighted method</title>
        <p>
          To calculate the weights of each performance measure,
we use the Shannon entropy weighted method
          <xref ref-type="bibr" rid="ref3">(Boroushaki 2017)</xref>
          . For this reason, the decision matrix presented in
Table 1 needs to be normalized. Depending on whether a smaller
or a larger value of the measure is preferred, the matrix is normalized using
q_j(Mi)' = (q_j(Mi) − min_i(q_j(Mi))) / (max_i(q_j(Mi)) − min_i(q_j(Mi)))
for measures to be maximized, or
q_j(Mi)' = (max_i(q_j(Mi)) − q_j(Mi)) / (max_i(q_j(Mi)) − min_i(q_j(Mi)))
for measures to be minimized, where q_j(Mi)' is the normalized value for q_j(Mi).
The sums of the normalized performance measures over all methods are defined as:
D_j = Σ_{i=1}^{m} q_j(Mi)', j = 1, ..., n.
The entropy of each performance measure is defined as:
e_j = K Σ_{i=1}^{m} W(q_j(Mi)' / D_j),
where K is the normalizing coefficient defined as:
K = 1 / ((e^{0.5} − 1) m),
and W is a function defined as:
W(x) = x e^{(1 − x)} + (1 − x) e^{x}.
The weight w_j of each performance measure used in Equation
2 is then computed from these entropy values (see Boroushaki 2017 for the exact formula).
        </p>
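        <p>The weighting scheme can be illustrated with the classical Shannon entropy weighting, of which the variant used above is a modification (the W-function replaces the −p log p term). This is a sketch under that simplification, not the paper's exact formula:</p>

```python
import math

def entropy_weights(decision, maximize):
    """Classical Shannon-entropy weights for an m x n decision matrix (sketch).

    Measures whose values vary little across methods carry little
    discriminating information and therefore receive a small weight."""
    m, n = len(decision), len(decision[0])
    raw = []
    for j in range(n):
        col = [row[j] for row in decision]
        lo, hi = min(col), max(col)
        if hi == lo:
            raw.append(0.0)  # constant measure: no discriminating information
            continue
        # Min-max normalization, oriented so that larger is always better.
        norm = [(v - lo) / (hi - lo) if maximize[j] else (hi - v) / (hi - lo)
                for v in col]
        total = sum(norm)  # D_j
        p = [v / total for v in norm]
        entropy = -sum(v * math.log(v) for v in p if v > 0) / math.log(m)
        raw.append(1.0 - entropy)  # degree of diversification
    s = sum(raw)
    return [v / s for v in raw]
```

        <p>A measure that is identical for all methods gets weight zero, while the remaining weight mass is shared among the discriminating measures.</p>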
      </sec>
      <sec id="sec-2-2">
        <title>Correlation analysis</title>
        <p>The existing literature on evaluation methodology for
machine learning and especially the ones referring to the task
of MLC correctly identify that some of the typically used
measures are correlated among themselves. Furthermore, it
points out that one needs to consider different uncorrelated
measure to get a better insight into the performance of the
evaluated methods. To this end, we perform a correlation
analysis of the proposed methodology to assess its
robustness to correlated measures, and as an additional result we
empirically elucidate the correlations among the measures
widely used for MLC.</p>
        <p>
          We used a correlation analysis that considers the
absolute values of pairwise correlations. Namely, we performed a
correlation analysis on each dataset starting by calculating a
correlation matrix for each decision matrix presented in
Table 1. In our case, the correlation matrix is an n × n matrix
showing Pearson correlation coefficients between the
performance measures
          <xref ref-type="bibr" rid="ref1">(Benesty et al. 2009)</xref>
          . The Pearson
correlation coefficient is a measure of the linear correlation between
two performance measures. Its value is between -1 and 1. We
then averaged the correlation matrices across datasets.
Furthermore, we removed the performance measures that have
the average absolute correlation greater than some threshold
thus obtaining sets of evaluation measures that are least
correlated. Finally, by applying a threshold on the correlation
coefficients we obtain the measures that are most correlated
among themselves.
        </p>
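        <p>The filtering step can be sketched as follows; the paper does not specify the order in which correlated measures are dropped, so the greedy left-to-right order below is our assumption:</p>

```python
def least_correlated(corr_matrices, names, threshold):
    """Keep the measures whose average absolute pairwise correlation
    (averaged across datasets) with every already-kept measure does not
    exceed the threshold. Greedy sketch; the drop order is an assumption."""
    n = len(names)
    k = len(corr_matrices)
    # Average the absolute correlation matrices across datasets.
    avg = [[sum(abs(c[i][j]) for c in corr_matrices) / k for j in range(n)]
           for i in range(n)]
    keep = []
    for j in range(n):
        if all(avg[j][i] <= threshold for i in keep):
            keep.append(j)
    return [names[j] for j in keep]
```

        <p>With a lower threshold, fewer measures survive, matching the shrinking measure sets reported for thresholds 0.9, 0.8, and 0.7 below.</p>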
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental design</title>
      <p>
        The data used to evaluate the performance of the fusion
method is taken from (Madjarov et al. 2012). In that study,
12 MLC methods are compared according to a set of 16
performance measures separately. The methods are divided
into three groups using the base machine learning
algorithm: (1) SVMs (BR
        <xref ref-type="bibr" rid="ref24 ref28 ref29 ref33">(Tsoumakas and Katakis 2007)</xref>
        , CC
        <xref ref-type="bibr" rid="ref25">(Read et al. 2011)</xref>
        , CLR
        <xref ref-type="bibr" rid="ref24 ref28 ref29 ref33">(Park and Fürnkranz 2007)</xref>
        , QWML
        <xref ref-type="bibr" rid="ref21">(Mencía, Park, and Fürnkranz 2010)</xref>
        , HOMER
        <xref ref-type="bibr" rid="ref30">(Tsoumakas,
Katakis, and Vlahavas 2008)</xref>
        , RAkEL
        <xref ref-type="bibr" rid="ref24 ref28 ref29 ref33">(Tsoumakas and
Vlahavas 2007)</xref>
        , ECC
        <xref ref-type="bibr" rid="ref25">(Read et al. 2011)</xref>
        , (2) Decision trees
(ML-C4.5
        <xref ref-type="bibr" rid="ref8">(Clare and King 2001)</xref>
        , PCT
        <xref ref-type="bibr" rid="ref2">(Blockeel, Raedt,
and Ramon 1998)</xref>
        , RFML-C4.5
        <xref ref-type="bibr" rid="ref7">(Breiman 2001)</xref>
        , RF-PCT
        <xref ref-type="bibr" rid="ref19">(Kocev et al. 2013)</xref>
        ), and (3) Nearest neighbors (ML-kNN
        <xref ref-type="bibr" rid="ref24 ref28 ref29 ref33">(Zhang and Zhou 2007)</xref>
        ).
      </p>
      <p>
        The evaluation measures of predictive performance are
divided into two groups
        <xref ref-type="bibr" rid="ref24 ref28 ref29 ref33">(Madjarov et al. 2012; Tsoumakas and
Katakis 2007)</xref>
        : bipartitions-based and rankings-based. The
bipartitions-based evaluation measures are calculated based
on the comparison of the predicted relevant labels with the
ground truth relevant labels. This group of evaluation
measures is further divided into example-based and label-based.
The example-based evaluation measures (Hamming loss,
accuracy, precision, recall, F1 score and subset accuracy)
are based on the average differences of the actual and the
predicted sets of labels over all examples of the evaluation
dataset. The label-based evaluation measures (micro
precision, micro recall, micro F1, macro precision, macro recall
and macro F1), on the other hand, assess the predictive
performance for each label separately and then average the
performance over all labels. The ranking-based evaluation
measures (one-error, coverage, ranking loss and average
precision) compare the predicted ranking of the labels with the
ground truth ranking.
      </p>
      <p>Using the set of performance measures, the methods are
compared using 11 MLC benchmark datasets: emotions,
scene, yeast, medical, enron, corel5k, tmc2007, mediamill,
bibtex, delicious, and bookmarks. A detailed explanation of
the implementation of the methods, definitions of the
performance measures, and the basic statistics of the datasets are
given in (Madjarov et al. 2012).</p>
      <p>We selected and tested two generalized preference
functions defined in Equation 1. First, a usual preference function
is used for each performance measure, so we do not need to
select the preference and indifference thresholds. The usual
preference function is presented in Equation 13:
p(x) = 0 if x ≤ 0, and p(x) = 1 if x &gt; 0. (13)
Using this preference function, we can only say whether there is
a difference or not, but we do not take the magnitude of the
difference into account. The threshold of strict preference of the
V-shape preference function (Equation 14) was estimated on each
dataset separately and it was set as the maximum difference
that exists from all pairwise comparisons of the values
between the methods regarding the performance measure on
that dataset.</p>
      <p>The performance measures fusion rankings of the
methods obtained using the usual generalized preference function
are presented in Table 2, while the rankings obtained using
the V -shape generalized preference function are presented
in Table 3. Comparing the rankings from the tables, both
generalized preference functions yield equal ranking only
on the bookmarks dataset. The main reason for this is the
size of the bookmarks dataset. Namely, most of the methods
were not able to return a result given the experimental
setting as provided in the study by (Madjarov et al. 2012). This
in turn means that the preference functions are calculated on
small number of different values for the performance
measures (the experiments that did not finish on time were given
the equally worst performance as stipulated by (Madjarov et
al. 2012)). For all other datasets, the rankings of the
methods differ. For example, let us focus on the delicious dataset,
for which the rankings only for two methods differ. In the
case of usual generalized preference function the
RFMLC4.5 is ranked as the second and the RF-PCT is ranked as
the first, while in the case of the V -shape generalized
preference function they swap their rankings, the RFML-C4.5 is
ranked as the first and the RF-PCT as the second. So, to
understand why this happens, we will analyze the performance
measures fusion approach on the delicious dataset.</p>
      <p>When different generalized preference functions are used,
it follows that the methods have different net flows. The net
flows are dependent on the positive and negative flows,
which are related to the average preference index.
Furthermore, the average preference index depends on the
weights of the performance measures and the selected
generalized preference function. In our case, using the
Shannon entropy weighted method, the resulting weights of the
performance measures are uniform on each dataset,
so they all have the same influence on the end result, w_j =
w, j = 1, ..., n, for both versions of the performance
fusion approach. The weight for each performance measure is
estimated according to the entropy it conveys. Having this
result, it follows that the difference between the rankings
in both versions comes from the selection of different
generalized preference functions. For this reason, in Figures 1
and 1, the average preference indices, (RF -P CT; Mi)
and (RF M L-C4:5; Mi) used for calculating the positive
flows, obtained on the delicious dataset, are presented.
Using this figure, we can see that the average preference indices
obtained using the usual generalized preference function
between the RFML-C4.5 and each of the methods: CLR,
QWML, PCT, RAkEL, and ECC, are the same with the
average preference indices obtained between the RF-PCT and
each of the methods: CLR, QWML, PCT, RAkEL, and ECC.
However this is not a case when the V -shape generalized
preference function is used. In this case, the same
abovementioned average preference indices obtained for
RFMLC4.5 are greater than the same average preference indices
obtained for RF-PCT.</p>
      <p>We inspect the results more closely through a pairwise
comparison with the ECC method, recalling the usual preference
function of Equation 13, p(x) = 0 for x ≤ 0 and p(x) = 1 for x &gt; 0.</p>
      <p>Second, a V-shape generalized preference function is
used for each performance measure, in which the threshold
of strict preference, q, is set to the maximum difference that
exists for each performance measure on a given benchmark
problem. The V-shape preference function is presented in
Equation 14. Using this preference function, all difference
values are taken into account using a linear function:
p(x) = 0 if x ≤ 0, p(x) = x/q if 0 &lt; x ≤ q, and p(x) = 1 if x &gt; q. (14)</p>
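      <p>Both generalized preference functions (Equations 13 and 14) are straightforward to express in code; this is a direct transcription of the definitions above:</p>

```python
def usual_pref(d):
    # Equation 13: any positive difference counts as full preference.
    return 1.0 if d > 0 else 0.0

def v_shape_pref(d, q):
    # Equation 14: preference grows linearly with the difference,
    # up to the strict-preference threshold q.
    if d <= 0:
        return 0.0
    return min(d / q, 1.0)
```
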
      <p>According to the value of each performance measure that
is preferable (smaller or larger), the 16 performance
measures can be split into two groups: (1) Minimization
(Hamming loss, One error, Coverage, Ranking loss) and (2)
Maximization (Precision, Accuracy, Recall, F1 score, Subset
accuracy, Macro precision, Macro recall, Macro F1, Micro
precision, Micro recall, Micro F1, Average precision).</p>
    </sec>
    <sec id="sec-4">
      <title>Results and discussion</title>
      <p>We compared the 12 MLC methods using the set of 16
performance measures on each dataset separately by using
the performance measures fusion ranking. We performed
the analysis for the two preference functions (the usual
generalized and the V-shape generalized preference function). The
latter was used with a different threshold of strict preference
for each performance measure, estimated on each dataset
separately. Using the usual
generalized preference function, we can see that
π(RFML-C4.5, ECC) = π(RF-PCT, ECC), while if the V-shape
generalized preference function is used,
π(RFML-C4.5, ECC) &gt; π(RF-PCT, ECC). Having the weights
uniformly distributed, all of them with the same value w,
Equation 2 is transformed into:
π(M1, M2) = (w/n) Σ_{j=1}^{n} P_j(M1, M2).</p>
      <p>Using the usual generalized preference function, we can
see that both methods, RFML-C4.5 and RF-PCT, win
against ECC according to all performance measures, but
using it we only count wins and losses, without taking
into account how large the wins of RFML-C4.5 and
RF-PCT against ECC are. With the usual generalized
preference function, the performance measures fusion approach
behaves as a majority vote in the case when the influence
of each performance measure is uniform, which happens
in our case. However, using the V-shape generalized
preference function, the information of how large the win is
is also taken into account. Both methods also win against
ECC on all performance measures, but here the magnitude
of the wins is also considered, which results in different
average preference indices. It follows that RFML-C4.5
(Σ_{j=1}^{n} P_j(RFML-C4.5, ECC) = 13.63) has a greater
average preference index than
RF-PCT (Σ_{j=1}^{n} P_j(RF-PCT, ECC) = 12.43).</p>
      <p>After describing the inner working of the proposed
method for a single dataset in detail, the obtained
rankings for each dataset could be further used with some
statistical test to provide a general overall conclusion of the
benchmarking of the MLC methods. The Friedman test was
selected as an appropriate test for use. The p-value for
the rankings obtained with the usual generalized preference
function is 0.0005, while the p-value for the rankings
obtained using the V -shape generalized preference function is
0.0061. In both cases, the null hypothesis is rejected, so there
is a difference between the methods according to the set of
16 performance measures compared on a set of 11
benchmark datasets. To further check where the difference comes
from, the Nemenyi post-hoc test (all vs. all) was used with
a significance level of 0.05. In the case of usual generalized
preference function, the difference comes from the pairs of
methods (RF-PCT, PCT) and (BR, PCT), while in the case
of the V -shape generalized preference function, there is only
a difference in the pair (RF-PCT, PCT). This implies that the
differences in the rankings of the methods are very small.</p>
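      <p>As a sketch of the test used here, the Friedman statistic can be computed directly from the per-dataset ranking table (a simplified version without tie correction; in practice the statistic is compared against the chi-square distribution with k − 1 degrees of freedom, e.g. via scipy.stats.friedmanchisquare):</p>

```python
def friedman_statistic(rank_table):
    """Friedman chi-square statistic for N datasets (rows) ranking
    k methods (columns); sketch without tie correction."""
    n_datasets = len(rank_table)
    k = len(rank_table[0])
    # Average rank of each method across datasets.
    avg = [sum(row[j] for row in rank_table) / n_datasets for j in range(k)]
    return (12.0 * n_datasets / (k * (k + 1)) * sum(r * r for r in avg)
            - 3.0 * n_datasets * (k + 1))
```

      <p>Perfectly consistent rankings across datasets maximize the statistic, while perfectly contradictory rankings drive it to zero.</p>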
      <p>We next focus on assessing the robustness of the proposed
methodology w.r.t. the presence of correlated measures.
Recall that some of the evaluation measures for MLC are
correlated among themselves. For this reason, we performed a
correlation analysis to investigate whether the method
rankings will be disturbed by removing the correlated measures.
We performed this analysis using the results from the V
shape generalized preference function. We investigate three
predefined correlation thresholds: 0.7, 0.8, and 0.9. The
exact values of the thresholds were selected for illustrative
purposes. The performance measures that are not removed for
each predefined threshold are (i.e., the least correlated):
0.7: coverage, macro precision, micro precision, micro
recall, subset accuracy.
0.8: hamming loss, macro precision, micro precision,
micro recall, precision, ranking loss, subset accuracy.
0.9: average precision, hamming loss, macro precision,
micro precision, one error, precision, recall, ranking loss,
subset accuracy.</p>
      <p>The rankings obtained for each predefined threshold are
further tested with the Friedman test. In all cases the
p-values are smaller than 0.05, so the null hypothesis is rejected
and the Nemenyi test was used to get the source of the
difference. In all cases there are no big differences in the
results from the post-hoc test. When the correlation threshold
is set at 0.9, the difference comes from the pairs of methods:
(RF-PCT, PCT), (RF-PCT, ECC), and (RF-PCT, RAkEL);
in the case of 0.8 from the pairs of methods (RF-PCT, PCT)
and (RF-PCT, ECC); and in the case of 0.7 from the pairs
of methods (RF-PCT, PCT), (RF-PCT, ECC), and (RF-PCT,
RAkEL). If we compare these results with the result
obtained when all performance measures are used, there are no
big changes; the question that arises is only whether there is a
statistical significance between the pairs (RF-PCT, ECC) and
(RF-PCT, RAkEL), which can be further explored within a
one vs. all analysis.</p>
      <p>However, in a lot of papers authors are also interested
in the practical significance of the results. The rankings for
each method across the datasets for each predefined
threshold are thus averaged (Table 4). Next, we check for
statistical difference between them using the Friedman test. The
p-value is 0.935, so it follows that there is no difference
between the average rankings that are obtained for each
predefined correlation threshold. Also, for each predefined
threshold, we ranked them starting from the best till the worst
method according to its average ranking (Table 5). From
here, it follows that there are no big differences regarding the
correlation threshold that is used. Notwithstanding, the
difference for the HOMER method is noticeable. This is due
to the fact that HOMER performs better on the correlated
measures (thus its high score). Conversely, CC seems to
perform worse on the correlated measures.</p>
      <p>Furthermore, to quantify the robustness, the absolute
difference between the rankings obtained on each dataset for
each predefined threshold and the rankings obtained
using all performance measures are calculated. Next, for each
method, the average absolute difference is calculated across
datasets to investigate how much the methods change their
ranking (Table 6). Using these results, it follows that the
rankings are robust to the correlated measures: they can vary,
but only with very small differences.</p>
      <p>Finally, we use the aggregated correlation matrix across
datasets to elucidate the correlated measures. The results are
given in Figure 3. The results show a large group of
interconnected measures. We can note that accuracy, F1 score
and micro F1 are connected with most measures (each has
8 connections). The least connected are the ranking based
measures.</p>
      <p>This is the first attempt at treating the versatile results of
MLC experiments in a unified way. More specifically, most
of the works in the area report performance along many
individual measures and making general conclusions in such
a setting is heavily impaired. This is evident also in the
extensive experimental comparison performed by (Madjarov et
al. 2012), where the results are extensively discussed along
multiple evaluation measures. We consider the results from
this study to evaluate and illustrate our method because it
is the most extensive and most complete study for MLC. We
could easily use other experimental results as well, but there are
not many that follow the same experimental design and have
the results readily publicly available.</p>
      <p>The potential for practical use of the proposed method is
enormous. From a user perspective, the proposed method
takes as input the tables with the results, does the necessary
calculations, and outputs the overall rankings of the
methods across the different evaluation measures. This is very
convenient considering the number of evaluation measures
typically used for MLC. This way, benchmarking of new
methods for MLC can be performed with great ease.
Moreover, it provides the user with a clear overview of the methods'
performance. The proposed methodology shows its robustness
to correlated measures and also defines sets of performance
measures that are not correlated and can be further included
in individual analyses.</p>
      <p>We need to mention that the proposed methodology can
easily consider other performance measures as well.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper, we propose an approach for fusing multiple
evaluation measures for MLC into an overall assessment of
performance. The benefits of using this approach are manifold.
First, it is designed for drawing a general conclusion from a
set of performance measures. Second, it avoids comparing along
multiple performance measures separately and then reporting
the results in a biased manner. Third, it is robust to the
inclusion of correlated evaluation measures. Finally, it
provides lists of evaluation measures that are correlated
among themselves, thus avoiding comparisons made only on
correlated measures.</p>
      <p>For future work, we plan to extend this approach by
investigating different preference functions and selecting the
most suitable one for each performance measure based on its
properties. Next, we will investigate building hybrid methods
(mixes of more general preference functions) that can be used
for experimental comparison of MLC methods. Finally, we will
extend the experimental study by including more datasets and
methods.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the Slovenian Research Agency
(research core funding No. P2-0098 and No. P2-0103).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Benesty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Pearson correlation coefficient</article-title>
          .
          <source>In Noise reduction in speech processing. Springer. 1-4.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Blockeel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Raedt</surname>
          </string-name>
          , L. D.; and
          <string-name>
            <surname>Ramon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>Top-down induction of clustering trees</article-title>
          .
          <source>In Proceedings of the 15th International Conference on Machine Learning</source>
          ,
          <fpage>55</fpage>
          -
          <lpage>63</lpage>
          . Morgan Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Boroushaki</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Entropy-based weights for multicriteria spatial decision-making</article-title>
          .
          <source>Yearbook of the Association of Pacific Coast Geographers</source>
          <volume>79</volume>
          :
          <fpage>168</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Brans</surname>
            ,
            <given-names>J.-P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Mareschal</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Promethee methods</article-title>
          .
          <source>In Multiple criteria decision analysis: state of the art surveys</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          Springer.
          <fpage>163</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Brans</surname>
            ,
            <given-names>J.-P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vincke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>1985</year>
          .
          <article-title>Note: A preference ranking organisation method (the PROMETHEE method for multiple criteria decision-making)</article-title>
          .
          <source>Management Science</source>
          <volume>31</volume>
          (
          <issue>6</issue>
          ):
          <fpage>647</fpage>
          -
          <lpage>656</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Random forests</article-title>
          .
          <source>Machine Learning</source>
          <volume>45</volume>
          (
          <issue>1</issue>
          )
          :
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Clare</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>King</surname>
            ,
            <given-names>R. D.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Knowledge discovery in multi-label phenotype data</article-title>
          .
          <source>In European Conference on Principles of Data Mining and Knowledge Discovery</source>
          ,
          <fpage>42</fpage>
          -
          <lpage>53</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Crammer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>A family of additive online algorithms for category ranking</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          :
          <fpage>1025</fpage>
          -
          <lpage>1058</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>De Comité</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gilleron</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tommasi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Learning multi-label alternating decision trees from texts and data</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>In Proc. of the 3rd international conference on Machine learning and data mining in pattern recognition</source>
          ,
          <fpage>35</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Dembczyński</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Waegeman</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hüllermeier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Regret analysis for performance metrics in multi-label classification: The case of hamming and subset zero-one loss</article-title>
          .
          <source>In Machine Learning and Knowledge Discovery in Databases</source>
          ,
          <fpage>280</fpage>
          -
          <lpage>295</lpage>
          . Berlin, Heidelberg: Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Eftimov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Korošec</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Koroušić Seljak</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>Data-driven preference-based deep statistical ranking for comparing multi-objective optimization algorithms</article-title>
          .
          <source>In International Conference on Bioinspired Methods and Their Applications</source>
          ,
          <fpage>138</fpage>
          -
          <lpage>150</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Fürnkranz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Round robin classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>2</volume>
          :
          <fpage>721</fpage>
          -
          <lpage>747</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Z.-H.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>On the consistency of multilabel learning</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>199-200</volume>
          :
          <fpage>22</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Gibaja</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ventura</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>A Tutorial on Multilabel Learning</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>47</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Ishizaka</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Nemery</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Selecting the best statistical distribution with promethee and gaia</article-title>
          .
          <source>Computers &amp; Industrial Engineering</source>
          <volume>61</volume>
          (
          <issue>4</issue>
          ):
          <fpage>958</fpage>
          -
          <lpage>969</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Kocev</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vens</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Struyf</surname>
            , J.; and Dzˇeroski,
            <given-names>S.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Tree ensembles for predicting structured outputs</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>46</volume>
          (
          <issue>3</issue>
          ):
          <fpage>817</fpage>
          -
          <lpage>833</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          2012.
          <article-title>An extensive experimental comparison of methods for multi-label learning</article-title>
          .
          <source>Pattern recognition 45</source>
          <volume>(9)</volume>
          :
          <fpage>3084</fpage>
          -
          <lpage>3104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Mencía</surname>
            ,
            <given-names>E. L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>S.-H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fürnkranz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Efficient voting prediction for pairwise multilabel classification</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <source>Neurocomputing</source>
          <volume>73</volume>
          (
          <issue>7-9</issue>
          ):
          <fpage>1164</fpage>
          -
          <lpage>1176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <year>2017</year>
          .
          <article-title>The promethee multiple criteria decision making analysis for selecting the best membrane prepared from sulfonated poly (ether ketone) s and poly (ether sulfone) s for proton exchange membrane fuel cell</article-title>
          .
          <source>Energy</source>
          <volume>119</volume>
          :
          <fpage>77</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>S.-H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fürnkranz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Efficient pairwise classification</article-title>
          .
          <source>In European Conference on Machine Learning</source>
          ,
          <fpage>658</fpage>
          -
          <lpage>665</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Read</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Holmes</surname>
          </string-name>
          , G.; and
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <article-title>Classifier chains for multi-label classification</article-title>
          .
          <source>Machine Learning</source>
          <volume>85</volume>
          (
          <issue>3</issue>
          )
          :
          <fpage>333</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Thabtah</surname>
            ,
            <given-names>F. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cowling</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>MMAC: A New Multi-class, Multi-label Associative Classification Approach</article-title>
          .
          <source>In Proc. of the 4th IEEE International Conference on Data Mining</source>
          ,
          <fpage>217</fpage>
          -
          <lpage>224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Katakis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Multi-label classification: An overview</article-title>
          .
          <source>International Journal of Data Warehousing and Mining (IJDWM)</source>
          <volume>3</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Random k-labelsets: An ensemble method for multilabel classification</article-title>
          .
          <source>In European conference on machine learning</source>
          ,
          <fpage>406</fpage>
          -
          <lpage>417</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Katakis</surname>
            ,
            <given-names>I.;</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Effective and efficient multilabel classification in domains with large number of labels</article-title>
          .
          <source>In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD08)</source>
          , volume
          <volume>21</volume>
          ,
          <fpage>53</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Waegeman</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dembczyński</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jachnik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hüllermeier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>On the bayes-optimality of f-measure maximizers</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          :
          <fpage>3513</fpage>
          -
          <lpage>3568</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.-Z.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Z.-H.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>A unified view of multilabel performance measures</article-title>
          .
          <source>In Proceedings of the 34th International Conference on Machine Learning</source>
          , ICML'17
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.-L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Z.-H.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Ml-knn: A lazy learning approach to multi-label learning</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>40</volume>
          (
          <issue>7</issue>
          ):
          <fpage>2038</fpage>
          -
          <lpage>2048</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>