Performance Measures Fusion for Experimental Comparison of Methods for Multi-label Classification

Tome Eftimov (Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia)
Dragi Kocev (Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia)

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Abstract

Over the past few years, multi-label classification has been widely explored in the machine learning community. This resulted in a number of multi-label classification methods requiring benchmarking to determine their strengths and weaknesses. For these reasons, the authors typically compare the methods using a set of benchmark problems (datasets) with regard to different performance measures. At the end, the results are discussed for each performance measure separately. In order to give a general conclusion in which the contribution of each performance measure is included, we propose a performance measures fusion approach based on multi-criteria decision analysis. The approach provides rankings of the compared methods for each benchmark problem separately. These rankings can then be aggregated to discover sets of correlated measures as well as sets of evaluation measures that are least correlated. The performance and the robustness of the proposed methodology are investigated and illustrated on the results from a comprehensive experimental study including 12 multi-label classification methods according to 16 performance measures on a set of 11 benchmark problems.

Introduction

Supervised learning is one of the most widely researched and investigated areas of machine learning. The goal in supervised learning is to learn, from a set of examples with known class, a function that outputs a prediction for the class of a previously unseen example. If the examples belong to two classes (e.g., the example has some property or not), the task is called binary classification. The task where the examples can belong to a single class from a given set of m classes (m ≥ 3) is known as multi-class classification. The case where the output is a real value is called regression.

However, in many real life problems of predictive modelling the output (i.e., the target) can be structured, meaning that there can be more complex output structures such as vectors of variables with some dependencies among them. One type of structured output is a vector of binary variables, i.e., the examples can belong to multiple classes simultaneously. This task is known as multi-label classification (MLC). The issue of learning from multi-label data has recently attracted significant attention from many researchers, motivated by an increasing number of new applications. The latter include semantic annotation of images and video (news clips, movie clips), functional genomics (gene and protein function), music categorization into emotions, text classification (news articles, web pages, patents, emails, bookmarks, ...), directed marketing and others. An exhaustive list of multi-label applications is presented in (Gibaja and Ventura 2015).

In recent years, many different approaches have been developed for solving MLC problems. Tsoumakas and Katakis (Tsoumakas and Katakis 2007) summarize them into two main categories: a) algorithm adaptation methods, and b) problem transformation methods. Algorithm adaptation methods extend specific learning algorithms to handle multi-label data directly. Examples include lazy learning (Zhang and Zhou 2007), neural networks (Crammer and Singer 2003), boosting (De Comité, Gilleron, and Tommasi 2003), classification rules (Thabtah, Cowling, and Peng 2004), decision trees (Clare and King 2001; Blockeel, Raedt, and Ramon 1998) and ensembles thereof (Kocev et al. 2013), ensembles with label subgroups (RAkEL) (Tsoumakas and Vlahavas 2007), ensembles of classifier chains (Read et al. 2011), etc. Problem transformation methods, on the other hand, transform the MLC problem into one or more single-label classification problems.
The single-label classification problems are solved with a commonly used single-label classification approach and the output is transformed back into a multi-label representation. The simplest strategies include the one-against-all and one-against-one strategies, also referred to as the binary relevance method (Tsoumakas and Katakis 2007) and the pair-wise method (Fürnkranz 2002), respectively.

Performance evaluation for MLC is a more complex task than that of classical single-label classification, due to the nature of the task: one example can be labelled with multiple labels. Namely, it is difficult to assess which error is worse: two instances with two incorrect labels each or four instances with a single incorrect label each. To this end, in any typical multi-label experiment, it is essential to include multiple and contrasting measures because of the additional degrees of freedom that the multi-label setting introduces (Madjarov et al. 2012).

The relations among the different evaluation measures have been theoretically studied in the literature and the main findings can be summarized as follows. To begin with, Hamming loss and subset accuracy have a different structure and minimization of one may cause a high regret for the other (Dembczyński et al. 2010). Next, a study on surrogate losses for MLC showed that none of the convex surrogate losses is consistent with ranking loss (Gao and Zhou 2013). Furthermore, the F-measure optimality of the inference algorithm has been studied with decision theoretic approaches (Waegeman et al. 2014). Finally, an investigation of the shared properties among different measures yielded a unified understanding for MLC evaluation (Wu and Zhou 2017). All in all, when benchmarking novel MLC methods, it is necessary to compare their performance with existing state-of-the-art methods. However, due to the multitude of evaluation measures, drawing clear summaries and conclusions is not easy: the methods have different performance compared to the competing methods on the different evaluation measures. This makes providing a summary recommendation a complex task.
Considering this, we propose an approach for experimental comparison of methods for multi-label classification. It is developed for making a general conclusion using a set of user-specified performance measures. For this reason, the approach follows the idea of PROMETHEE methods, which are applicable in different domains such as business, chemistry, manufacturing, social sciences, agriculture and medicine (Ishizaka and Nemery 2011; Nikouei, Oroujzadeh, and Mehdipour-Ataei 2017). Recently, they were also used in a data-driven approach for evaluating multi-objective optimization algorithms regarding different performance measures (Eftimov, Korošec, and Koroušić Seljak 2018). To the best of our knowledge, they have not been used in the domain of MLC. The PROMETHEE methodology works as a ranking scheme for transforming the data for each benchmark dataset instead of using some traditional statistical ranking scheme (e.g., a fractional ranking scheme). Further, the obtained rankings, which fuse several performance measures, are involved in a statistical test to provide a general conclusion from the benchmark experiment.

The main contributions of the paper are:
• A methodology for fusing the various evaluation measures for the task of MLC.
• The proposed methodology is robust considering the inclusion or exclusion of correlated measures.
• We elucidate sets of evaluation measures that should be used together when assessing the predictive performance.
• We identify the correlated measures for each measure separately.

In the remainder of the paper, we first present the proposed method for fusion of the performance measures for MLC. Then, the experimental design is explained, followed by the results and discussion. Finally, the conclusions of the paper are presented.

Fusion of performance measures

Let us assume that a comparison needs to be made among m methods (i.e., alternatives) regarding n performance measures (i.e., criteria) on a single multi-label classification problem (i.e., dataset). Let M = {M_1, M_2, ..., M_m} be the set of methods we want to compare regarding the set of performance measures Q = {q_1, q_2, ..., q_n}. The decision matrix is an m × n matrix (see Table 1) that contains the values of the performance measures obtained for the methods.

Table 1: Decision matrix

        q_1        q_2        ...  q_n
M_1     q_1(M_1)   q_2(M_1)   ...  q_n(M_1)
M_2     q_1(M_2)   q_2(M_2)   ...  q_n(M_2)
...     ...        ...        ...  ...
M_m     q_1(M_m)   q_2(M_m)   ...  q_n(M_m)

For drawing conclusions and making recommendations on methods' usage by considering a set of performance measures, we propose a performance measures fusion approach that follows the idea of the PROMETHEE methodology (Brans and Mareschal 2005). More specifically, we exploit the method PROMETHEE II. It is based on making pairwise comparisons within all methods for each performance measure. The differences between the values for each pair of methods according to a specified performance measure are taken into consideration. For larger differences the decision maker might consider larger preferences. The preference function of a performance measure for two methods is defined as the degree of preference of method M_1 over method M_2, as seen in the following equation:

P_j(M_1, M_2) = \begin{cases} p_j(d_j(M_1, M_2)), & \text{maximization of } q_j \\ p_j(-d_j(M_1, M_2)), & \text{minimization of } q_j \end{cases}   (1)

where d_j(M_1, M_2) = q_j(M_1) − q_j(M_2) is the difference between the values of the methods for the performance measure q_j and p_j(·) is a generalized preference function assigned to that performance measure. There exist six types of generalized preference functions (Brans and Vincke 1985). Some of them require certain preferential parameters to be defined, such as the preference and indifference thresholds. The preference threshold is the smallest amount of difference that is assumed as a preference, while the indifference threshold is the greatest amount of difference that is insignificant.
After selecting the preference function for each performance measure, the next step is to define the average preference index and the outranking (preference and net) flows. The average preference index for each pair of methods gives information about the global comparison between them using all performance measures. The average preference index can be calculated as:

\pi(M_1, M_2) = \frac{1}{n} \sum_{j=1}^{n} w_j P_j(M_1, M_2),   (2)

where w_j represents the relative significance (weight) of the j-th performance measure. The higher the weight value of a given performance measure, the higher its relative significance. The selection of the weights is a crucial step in the PROMETHEE II method because it defines the priorities used by the decision-maker. In our case, we used the Shannon entropy weighted method. For the average preference index, we need to point out that it is not a symmetric function, so π(M_1, M_2) ≠ π(M_2, M_1).

To rank the methods, the net flow for each method needs to be calculated. It is the difference between the positive preference flow, φ^+(M_i), and the negative preference flow of the method, φ^−(M_i). The positive preference flow gives information about how a given method is globally better than the other methods, while the negative preference flow gives information about how a given method is outranked by all the other methods. The positive preference flow is defined as:

\phi^{+}(M_i) = \frac{1}{m - 1} \sum_{x \in M} \pi(M_i, x),   (3)

while the negative preference flow is defined as:

\phi^{-}(M_i) = \frac{1}{m - 1} \sum_{x \in M} \pi(x, M_i).   (4)

The net flow of a method is defined as:

\phi(M_i) = \phi^{+}(M_i) - \phi^{-}(M_i).   (5)

The PROMETHEE II method ranks the methods by ordering them according to decreasing values of the net flows.
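To make the ranking step concrete, the following Python sketch implements Equations 1-5 for a single dataset. It is an illustration rather than the authors' implementation: the function and variable names (promethee_ii_ranking, decision_matrix, maximize, weights) are ours, the usual preference function is used by default, and ties in the net flows are resolved with average (fractional) ranks.

```python
import numpy as np
from scipy.stats import rankdata

def usual_preference(d):
    """Usual generalized preference function: full preference for any positive difference."""
    return (d > 0).astype(float)

def promethee_ii_ranking(decision_matrix, maximize, weights, pref_fn=usual_preference):
    """Rank methods (rows) over measures (columns) with PROMETHEE II (Eqs. 1-5).

    decision_matrix : (m, n) array with the raw value of measure j for method i.
    maximize        : length-n boolean array, True where larger values are better.
    weights         : length-n array of measure weights (e.g., Shannon entropy weights).
    """
    m, n = decision_matrix.shape
    pi = np.zeros((m, m))                                   # average preference indices
    for a in range(m):
        for b in range(m):
            if a == b:
                continue
            d = decision_matrix[a] - decision_matrix[b]     # d_j(M_a, M_b)
            d = np.where(maximize, d, -d)                   # Eq. 1: flip sign for minimized measures
            pi[a, b] = np.sum(weights * pref_fn(d)) / n     # Eq. 2
    phi_plus = pi.sum(axis=1) / (m - 1)                     # positive preference flow, Eq. 3
    phi_minus = pi.sum(axis=0) / (m - 1)                    # negative preference flow, Eq. 4
    net_flow = phi_plus - phi_minus                         # net flow, Eq. 5
    ranks = rankdata(-net_flow)                             # rank 1 = largest net flow; ties get fractional ranks
    return net_flow, ranks

# Toy usage: 3 methods, 2 measures (the first maximized, the second minimized), equal weights.
values = np.array([[0.80, 0.10],
                   [0.75, 0.05],
                   [0.60, 0.20]])
net, ranks = promethee_ii_ranking(values, np.array([True, False]), np.full(2, 0.5))
print(net, ranks)
```

A different generalized preference function (e.g., the V-shape function defined later in Equation 14) can be supplied through the pref_fn argument.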
Shannon entropy weighted method

To calculate the weights of the performance measures, we use the Shannon entropy weighted method (Boroushaki 2017). For this reason, the decision matrix presented in Table 1 needs to be normalized. Depending on the value that is preferred (smaller or larger), the matrix is normalized using one of the following equations:

q'_j(M_i) = \frac{\max_i(q_j(M_i)) - q_j(M_i)}{\max_i(q_j(M_i)) - \min_i(q_j(M_i))},   (6)

or

q'_j(M_i) = \frac{q_j(M_i) - \min_i(q_j(M_i))}{\max_i(q_j(M_i)) - \min_i(q_j(M_i))},   (7)

where q'_j(M_i) is the normalized value of q_j(M_i). The sums of the performance measures over all methods are defined as

D_j = \sum_{i=1}^{m} q'_j(M_i), \quad j = 1, \dots, n.   (8)

The entropy of each performance measure is defined as

e_j = K \sum_{i=1}^{m} W\!\left(\frac{q'_j(M_i)}{D_j}\right),   (9)

where K is the normalization coefficient defined as

K = \frac{1}{(e^{0.5} - 1)\, m},   (10)

and W is a function defined as

W(x) = x e^{(1-x)} + (1 - x) e^{x} - 1.   (11)

The weight of each performance measure used in Equation 2 is then calculated using the following equation:

w_j = \frac{\frac{1}{n - E}(1 - e_j)}{\sum_{j=1}^{n} \left[\frac{1}{n - E}(1 - e_j)\right]},   (12)

where E is the sum of the entropies, E = \sum_{j=1}^{n} e_j.
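The sketch below shows how the entropy-based weights of Equations 6-12 could be computed. It is our illustrative reading of the procedure (names such as shannon_entropy_weights and maximize are ours), with a small guard against constant columns added for robustness.

```python
import numpy as np

def shannon_entropy_weights(decision_matrix, maximize):
    """Shannon entropy weights for the measures (columns), following Eqs. 6-12.

    decision_matrix : (m, n) array of raw measure values.
    maximize        : length-n boolean array, True where larger values are preferred.
    """
    m, n = decision_matrix.shape
    col_min = decision_matrix.min(axis=0)
    col_max = decision_matrix.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)   # guard against constant columns
    # Eqs. 6-7: min-max normalisation oriented so that larger normalised values are better.
    q_norm = np.where(maximize,
                      (decision_matrix - col_min) / span,
                      (col_max - decision_matrix) / span)
    D = q_norm.sum(axis=0)                                       # Eq. 8
    K = 1.0 / ((np.exp(0.5) - 1.0) * m)                          # Eq. 10
    W = lambda x: x * np.exp(1 - x) + (1 - x) * np.exp(x) - 1    # Eq. 11
    e = K * W(q_norm / np.where(D > 0, D, 1.0)).sum(axis=0)      # Eq. 9
    E = e.sum()
    return (1.0 - e) / (n - E)                                   # Eq. 12 (the common 1/(n-E) factor cancels)

# Toy usage: 4 methods, 3 measures (the first two maximized, the third minimized).
vals = np.array([[0.9, 0.7, 0.10],
                 [0.8, 0.9, 0.05],
                 [0.7, 0.6, 0.20],
                 [0.6, 0.8, 0.15]])
print(shannon_entropy_weights(vals, np.array([True, True, False])))
```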
Correlation analysis

The existing literature on evaluation methodology for machine learning, and especially the works referring to the task of MLC, correctly identifies that some of the typically used measures are correlated among themselves. Furthermore, it points out that one needs to consider different uncorrelated measures to get a better insight into the performance of the evaluated methods. To this end, we perform a correlation analysis of the proposed methodology to assess its robustness to correlated measures, and as an additional result we empirically elucidate the correlations among the measures widely used for MLC.

We used a correlation analysis that considers the absolute values of the pairwise correlations. Namely, we performed a correlation analysis on each dataset, starting by calculating a correlation matrix for each decision matrix presented in Table 1. In our case, the correlation matrix is an n × n matrix showing the Pearson correlation coefficients between the performance measures (Benesty et al. 2009). The Pearson correlation coefficient is a measure of the linear correlation between two performance measures; its value is between -1 and 1. We then averaged the correlation matrices across datasets. Furthermore, we removed the performance measures that have an average absolute correlation greater than some threshold, thus obtaining sets of evaluation measures that are least correlated. Finally, by applying a threshold on the correlation coefficients we obtain the measures that are most correlated among themselves.
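A minimal sketch of this analysis is given below, assuming the per-dataset decision matrices are available as NumPy arrays. The greedy removal order (always dropping the measure most correlated with the rest) is our assumption; the paper only states that measures whose average absolute correlation exceeds the threshold are removed.

```python
import numpy as np

def average_abs_correlation(decision_matrices):
    """Average the absolute Pearson correlation matrices of the measures across datasets.

    decision_matrices : list of (m, n) arrays, one decision matrix (Table 1) per dataset.
    """
    mats = [np.abs(np.corrcoef(dm, rowvar=False)) for dm in decision_matrices]
    return np.mean(mats, axis=0)

def least_correlated_measures(decision_matrices, names, threshold):
    """Greedily drop the measure most correlated with the rest until every remaining
    measure has an average absolute correlation with the others below the threshold."""
    corr = average_abs_correlation(decision_matrices)
    keep = list(range(len(names)))
    while len(keep) > 1:
        sub = corr[np.ix_(keep, keep)]
        mean_corr = (sub.sum(axis=1) - 1.0) / (len(keep) - 1)   # exclude self-correlation
        worst = int(np.argmax(mean_corr))
        if mean_corr[worst] <= threshold:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Toy usage: 2 datasets, 5 methods, 4 measures.
rng = np.random.default_rng(0)
dms = [rng.random((5, 4)) for _ in range(2)]
print(least_correlated_measures(dms, ["accuracy", "F1", "hamming_loss", "coverage"], threshold=0.9))
```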
Experimental design

The data used to evaluate the performance of the fusion method is taken from (Madjarov et al. 2012). In that study, 12 MLC methods are compared according to a set of 16 performance measures separately. The methods are divided into three groups according to the base machine learning algorithm: (1) SVMs (BR (Tsoumakas and Katakis 2007), CC (Read et al. 2011), CLR (Park and Fürnkranz 2007), QWML (Mencía, Park, and Fürnkranz 2010), HOMER (Tsoumakas, Katakis, and Vlahavas 2008), RAkEL (Tsoumakas and Vlahavas 2007), ECC (Read et al. 2011)), (2) decision trees (ML-C4.5 (Clare and King 2001), PCT (Blockeel, Raedt, and Ramon 1998), RFML-C4.5 (Breiman 2001), RF-PCT (Kocev et al. 2013)), and (3) nearest neighbors (ML-kNN (Zhang and Zhou 2007)).

The evaluation measures of predictive performance are divided into two groups (Madjarov et al. 2012; Tsoumakas and Katakis 2007): bipartitions-based and rankings-based. The bipartitions-based evaluation measures are calculated based on the comparison of the predicted relevant labels with the ground truth relevant labels. This group of evaluation measures is further divided into example-based and label-based. The example-based evaluation measures (Hamming loss, accuracy, precision, recall, F1 score and subset accuracy) are based on the average differences of the actual and the predicted sets of labels over all examples of the evaluation dataset. The label-based evaluation measures (micro precision, micro recall, micro F1, macro precision, macro recall and macro F1), on the other hand, assess the predictive performance for each label separately and then average the performance over all labels. The ranking-based evaluation measures (one-error, coverage, ranking loss and average precision) compare the predicted ranking of the labels with the ground truth ranking.

Using the set of performance measures, the methods are compared on 11 MLC benchmark datasets: emotions, scene, yeast, medical, enron, corel5k, tmc2007, mediamill, bibtex, delicious, and bookmarks. A detailed explanation of the implementation of the methods, the definitions of the performance measures, and the basic statistics of the datasets are given in (Madjarov et al. 2012).

We selected and tested two generalized preference functions defined in Equation 1. First, a usual preference function is used for each performance measure, so we do not need to select the preference and indifference thresholds. The usual preference function is presented in Equation 13. Using this preference function, we can only say whether there is a difference or not, but we do not take into account the value of the difference.

p(x) = \begin{cases} 0, & x \le 0 \\ 1, & x > 0 \end{cases}   (13)

Second, a V-shape generalized preference function is used for each performance measure, in which the threshold of strict preference, q, is set to the maximum difference that exists for each performance measure on a given benchmark problem. The V-shape preference function is presented in Equation 14. Using this preference function, all difference values are taken into account using a linear function.

p(x) = \begin{cases} 0, & x \le 0 \\ \frac{x}{q}, & 0 \le x \le q \\ 1, & x > q \end{cases}   (14)

According to whether a smaller or a larger value of each performance measure is preferable, the 16 performance measures can be split into two groups: (1) Minimization (Hamming loss, One error, Coverage, Ranking loss) and (2) Maximization (Precision, Accuracy, Recall, F1 score, Subset accuracy, Macro precision, Macro recall, Macro F1, Micro precision, Micro recall, Micro F1, Average precision).
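As an illustration of how the two preference functions differ, the sketch below implements Equations 13 and 14 together with the per-measure estimation of the strict preference threshold q as the maximum pairwise difference on a benchmark problem; the function names are ours.

```python
import numpy as np

def usual_preference(d):
    """Usual preference function (Eq. 13): any positive difference gives full preference."""
    return (d > 0).astype(float)

def v_shape_preference(d, q):
    """V-shape preference function (Eq. 14): preference grows linearly up to the threshold q."""
    return np.clip(np.asarray(d, dtype=float) / q, 0.0, 1.0)

def strict_preference_thresholds(decision_matrix):
    """Per-measure threshold q: the maximum difference between any two methods
    on the given benchmark problem (i.e., the column range of the decision matrix)."""
    return decision_matrix.max(axis=0) - decision_matrix.min(axis=0)

# Toy usage on a single measure with four methods.
col = np.array([0.81, 0.78, 0.60, 0.59])
q = col.max() - col.min()                 # 0.22
d = col[0] - col[1]                       # 0.03: a small win
print(usual_preference(np.array([d])))    # [1.0]  -- counted as a full win
print(v_shape_preference(d, q))           # ~0.136 -- the size of the win matters
```

This difference is what later makes the usual function behave like a majority vote over measures, while the V-shape function also weighs the magnitude of each win.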
Results and discussion

We compared the 12 MLC methods using the set of 16 performance measures on each dataset separately by using the performance measures fusion ranking. We performed the analysis for the two preference functions (the usual generalized and the V-shape generalized preference function). The latter was used with a different threshold of strict preference for each performance measure. The threshold of strict preference for each performance measure was estimated on each dataset separately and it was set as the maximum difference that exists among all pairwise comparisons of the values between the methods regarding that performance measure on that dataset.

The performance measures fusion rankings of the methods obtained using the usual generalized preference function are presented in Table 2, while the rankings obtained using the V-shape generalized preference function are presented in Table 3. Comparing the rankings from the tables, both generalized preference functions yield equal rankings only on the bookmarks dataset. The main reason for this is the size of the bookmarks dataset. Namely, most of the methods were not able to return a result given the experimental setting as provided in the study by (Madjarov et al. 2012). This in turn means that the preference functions are calculated on a small number of different values for the performance measures (the experiments that did not finish on time were given the equally worst performance, as stipulated by (Madjarov et al. 2012)). For all other datasets, the rankings of the methods differ. For example, let us focus on the delicious dataset, for which the rankings of only two methods differ. In the case of the usual generalized preference function RFML-C4.5 is ranked second and RF-PCT is ranked first, while in the case of the V-shape generalized preference function they swap their rankings: RFML-C4.5 is ranked first and RF-PCT second. So, to understand why this happens, we will analyze the performance measures fusion approach on the delicious dataset.

Table 2: Performance measures fusion rankings using the usual generalized preference function.

Dataset     BR     CC     CLR    QWML   HOMER  ML-C4.5  PCT    ML-kNN  RAkEL  ECC    RFML-C4.5  RF-PCT
emotions    7.00   10.00  8.00   11.00  5.00   3.00     4.00   12.00   6.00   9.00   2.00       1.00
scene       2.00   3.00   5.00   7.00   6.00   11.00    12.00  9.00    1.00   4.00   10.00      8.00
yeast       2.00   4.00   1.00   6.00   3.00   11.00    12.00  8.00    9.00   5.00   10.00      7.00
medical     8.00   5.00   3.00   1.00   2.00   4.00     12.00  10.00   7.00   9.00   11.00      6.00
enron       2.00   7.00   1.00   9.00   4.00   10.00    12.00  11.00   8.00   5.00   6.00       3.00
corel5k     4.00   5.00   1.00   2.00   3.00   9.00     11.00  7.00    12.00  10.00  8.00       6.00
tmc2007     3.00   1.00   4.00   5.00   6.00   11.00    12.00  10.00   7.00   9.00   8.00       2.00
mediamill   4.00   5.00   11.00  10.00  6.00   12.00    7.00   3.00    9.00   8.00   2.00       1.00
bibtex      2.00   1.00   3.00   4.00   5.00   10.00    11.00  8.00    12.00  7.00   9.00       6.00
delicious   4.00   3.00   10.50  10.50  5.00   7.00     8.00   6.00    10.50  10.50  2.00       1.00
bookmarks   9.00   9.00   9.00   9.00   9.00   2.00     5.00   3.00    9.00   9.00   4.00       1.00

Table 3: Performance measures fusion rankings using the V-shape generalized preference function.

Dataset     BR     CC     CLR    QWML   HOMER  ML-C4.5  PCT    ML-kNN  RAkEL  ECC    RFML-C4.5  RF-PCT
emotions    8.00   10.00  9.00   11.00  5.00   3.00     4.00   12.00   6.00   7.00   2.00       1.00
scene       2.00   1.00   4.00   7.00   6.00   11.00    12.00  8.00    3.00   5.00   10.00      9.00
yeast       2.00   4.00   1.00   9.00   3.00   12.00    11.00  6.00    8.00   5.00   10.00      7.00
medical     10.00  9.00   2.00   1.00   4.00   3.00     12.00  6.00    7.00   8.00   11.00      5.00
enron       1.00   10.00  2.00   7.00   4.00   8.00     12.00  11.00   9.00   6.00   5.00       3.00
corel5k     4.00   5.00   1.00   2.00   3.00   10.00    9.00   7.00    11.00  12.00  6.00       8.00
tmc2007     3.00   2.00   4.00   6.00   5.00   11.00    12.00  10.00   7.00   8.00   9.00       1.00
mediamill   4.00   5.00   12.00  11.00  7.00   10.00    6.00   3.00    8.00   9.00   2.00       1.00
bibtex      2.00   1.00   3.00   4.00   5.00   10.00    11.00  7.00    12.00  8.00   9.00       6.00
delicious   4.00   3.00   10.50  10.50  5.00   7.00     8.00   6.00    10.50  10.50  1.00       2.00
bookmarks   9.00   9.00   9.00   9.00   9.00   2.00     5.00   3.00    9.00   9.00   4.00       1.00

When different generalized preference functions are used, it follows that the methods have different net flows. The net flows depend on the positive and negative flows, which are related to the average preference index. Furthermore, the average preference index depends on the weights of the performance measures and on the selected generalized preference function. In our case, the weight for each performance measure is estimated according to the entropy it conveys using the Shannon entropy weighted method; the result is that the performance measures are weighted uniformly on each dataset, so they all have the same influence on the end result, w_j = w, j = 1, ..., n, for both versions of the performance fusion approach. Having this result, it follows that the difference between the rankings in the two versions comes from the selection of different generalized preference functions.

For this reason, Figures 1 and 2 present the average preference indices, π(RF-PCT, M_i) and π(RFML-C4.5, M_i), used for calculating the positive flows, obtained on the delicious dataset. Using these figures, we can see that the average preference indices obtained using the usual generalized preference function between RFML-C4.5 and each of the methods CLR, QWML, PCT, RAkEL, and ECC are the same as the average preference indices obtained between RF-PCT and each of the methods CLR, QWML, PCT, RAkEL, and ECC. However, this is not the case when the V-shape generalized preference function is used. In that case, the same abovementioned average preference indices obtained for RFML-C4.5 are greater than those obtained for RF-PCT.

Figure 1: Average preference indices for RFML-C4.5 and RF-PCT obtained on the delicious dataset using the usual preference function (π(RFML-C4.5, M_i) and π(RF-PCT, M_i)). [Bar chart over the 12 methods; only the caption is reproduced here.]

Figure 2: Average preference indices for RFML-C4.5 and RF-PCT obtained on the delicious dataset using the V-shape preference function (π(RFML-C4.5, M_i) and π(RF-PCT, M_i)). [Bar chart over the 12 methods; only the caption is reproduced here.]
We inspect the results more closely through a pairwise comparison with the ECC method. Using the usual generalized preference function, we can see that π(RFML-C4.5, ECC) = π(RF-PCT, ECC), while if the V-shape generalized preference function is used, π(RFML-C4.5, ECC) > π(RF-PCT, ECC). Having the weights uniformly distributed, all of them have the same value, w, and Equation 2 is transformed into:

\pi(M_1, M_2) = \frac{w}{n} \sum_{j=1}^{n} P_j(M_1, M_2).   (15)

Using the usual generalized preference function, we can see that both methods, RFML-C4.5 and RF-PCT, win against ECC according to all performance measures, but with it we only count wins and losses without taking into account how large the wins of RFML-C4.5 and RF-PCT against ECC are. By using the usual generalized preference function, the performance measures fusion approach behaves as a majority vote in the case when the influence of each performance measure is uniform, which happens in our case. However, using the V-shape generalized preference function, the information about how large the win is is also taken into account. Both methods also win against ECC on all performance measures, but here the magnitude of the wins is also considered, which results in different average preference indices. So it follows that RFML-C4.5 (\sum_{j=1}^{n} P_j(\text{RFML-C4.5}, \text{ECC}) = 13.63) has a greater average preference index than RF-PCT (\sum_{j=1}^{n} P_j(\text{RF-PCT}, \text{ECC}) = 12.43).

After describing the inner workings of the proposed method on a single dataset in detail, the obtained rankings for each dataset can be further used with a statistical test to provide a general overall conclusion of the benchmarking of the MLC methods. The Friedman test was selected as an appropriate test for this purpose. The p-value for the rankings obtained with the usual generalized preference function is 0.0005, while the p-value for the rankings obtained using the V-shape generalized preference function is 0.0061. In both cases, the null hypothesis is rejected, so there is a difference between the methods according to the set of 16 performance measures compared on the set of 11 benchmark datasets. To further check where the difference comes from, the Nemenyi post-hoc test (all vs. all) was used with a significance level of 0.05. In the case of the usual generalized preference function, the difference comes from the pairs of methods (RF-PCT, PCT) and (BR, PCT), while in the case of the V-shape generalized preference function, there is only a difference in the pair (RF-PCT, PCT). This implies that the differences in the rankings of the methods are very small.
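For the statistical comparison, a small sketch using SciPy's Friedman test on the per-dataset fusion rankings (one row per dataset, one column per method) could look as follows; the helper name is ours and the post-hoc step is only indicated.

```python
import numpy as np
from scipy.stats import friedmanchisquare

def friedman_on_fusion_rankings(ranking_table):
    """Friedman test over the fusion rankings.

    ranking_table : (datasets, methods) array such as Table 2 or Table 3,
                    one fused ranking per dataset and method.
    Returns the test statistic, the p-value, and the average rank per method.
    """
    stat, p_value = friedmanchisquare(*[ranking_table[:, j]
                                        for j in range(ranking_table.shape[1])])
    avg_ranks = ranking_table.mean(axis=0)
    return stat, p_value, avg_ranks

# Toy usage: 4 datasets, 3 methods (in the paper: 11 datasets, 12 methods).
table = np.array([[1, 2, 3],
                  [1, 3, 2],
                  [2, 1, 3],
                  [1, 2, 3]], dtype=float)
stat, p, avg = friedman_on_fusion_rankings(table)
print(p, avg)
# If the null hypothesis is rejected, a post-hoc test such as Nemenyi (available in
# third-party packages, e.g. scikit-posthocs) can locate the differing method pairs.
```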
Re- rankings are robust to the correlated measures, they can vary, call that some of the evaluation measures for MLC are cor- but with a very small differences. related among themselves. For this reason, we performed a Finally, we use the aggregated correlation matrix across correlation analysis to investigate whether the method rank- datasets to elucidate the correlated measures. The results are ings will be disturbed by removing the correlated measures. given in Figure 3. The results show a large group of inter- We performed this analysis using the results from the V - connected measures. We can note that accuracy, F1 score shape generalized preference function. We investigate three and micro F1 are connected with most measures (each has predefined correlation thresholds: 0.7, 0.8, and 0.9. The ex- 8 connections). The least connected are the ranking based act values of the thresholds were selected for illustrative pur- measures. Figure 3: Correlation between performance measures. Red edges are for correlation greater than 0.9, blue and red edges are for correlation greater than 0.8, and green, blue, and red edges correspond to correlation more than 0.7. The evaluation measures Hamming loss, one-error and micro precision are not correlated with the other measures. Table 4: Average rankings for each method across datasets. Table 5: Practical rankings for each method across datasets. 0.7 0.8 0.9 All 0.7 0.8 0.9 All BR 4.82 4.82 4.45 4.45 BR 2.00 2.00 2.00 2.00 CC 5.09 5.18 5.45 5.36 CC 3.00 3.00 5.00 5.00 CLR 5.14 5.32 5.23 5.23 CLR 4.00 4.00 3.00 4.00 QWML 6.77 6.77 6.41 7.05 QWML 7.00 7.00 6.00 7.00 HOMER 6.27 6.55 6.73 5.09 HOMER 6.00 6.00 7.00 3.00 ML-C4.5 8.27 8.09 7.91 7.91 ML-C4.5 9.00 10.00 9.00 9.00 PCT 9.09 9.00 9.27 9.27 PCT 12.00 12.00 12.00 12.00 ML-kNN 6.91 6.91 6.91 7.18 ML-kNN 8.00 8.00 8.00 8.00 RAkEL 8.41 7.95 8.50 8.23 RAkEL 10.00 9.00 10.00 11.00 ECC 8.59 8.59 8.59 7.95 ECC 11.00 11.00 11.00 10.00 RFML-C4.5 5.45 5.45 5.36 6.27 RFML-C4.5 5.00 5.00 4.00 6.00 RF-PCT 3.18 3.36 3.18 4.00 RF-PCT 1.00 1.00 1.00 1.00 This is the first attempt at treating the versatile results of enormous. From a user perspective, the proposed method MLC experiments in an unified way. More specifically, most takes as input the tables with the results does the necessary of the works in the area report performance along many in- calculations and outputs the overall rankings of the meth- dividual measures and making general conclusions in such ods across the different evaluation measures. This is very a setting is heavily impaired. This is evident also in the ex- convenient considering the number of evaluation measures tensive experimental comparison performed by (Madjarov et typically used for MLC. This way benchmarking of new al. 2012), where the results are extensively discussed along methods for MLC can be performed with a great ease. More- multiple evaluation measures. We consider the results from over, it provides the user a nice overview of the methods per- this study to evaluate and illustrate our method because it formance: The proposed methodology shows its robustness is the most extensive and most complete study for MLC. We on correlated measures and also defines sets of performance could easily use also other experimental results, but there are measure that are not correlated and can be further included not many that follow the same experimental design and have in individual analyses. the results readily publicly available. 
This is the first attempt at treating the versatile results of MLC experiments in a unified way. More specifically, most of the works in the area report performance along many individual measures, and making general conclusions in such a setting is heavily impaired. This is evident also in the extensive experimental comparison performed by (Madjarov et al. 2012), where the results are extensively discussed along multiple evaluation measures. We consider the results from this study to evaluate and illustrate our method because it is the most extensive and most complete study for MLC. We could easily use other experimental results as well, but there are not many that follow the same experimental design and have the results publicly available.

The potential for practical use of the proposed method is enormous. From a user perspective, the proposed method takes as input the tables with the results, does the necessary calculations, and outputs the overall rankings of the methods across the different evaluation measures. This is very convenient considering the number of evaluation measures typically used for MLC. This way, benchmarking of new methods for MLC can be performed with great ease. Moreover, it provides the user a nice overview of the methods' performance: the proposed methodology shows its robustness on correlated measures and also defines sets of performance measures that are not correlated and can be further included in individual analyses.

We need to mention that the proposed methodology can easily consider also other performance measures, such as running times and memory consumption.

Conclusions

In this paper, we propose an approach for fusing multiple evaluation measures for MLC into an overall assessment of performance. The benefit of using this approach is manifold. First, it is designed for making a general conclusion using a set of performance measures. Second, it avoids the comparison according to multiple performance measures separately and then reporting on the results in a biased manner. Third, it is robust to the inclusion of correlated evaluation measures. Finally, it gives lists of evaluation measures that are correlated among themselves, thus avoiding comparisons only on correlated measures.

For future work, we plan to extend this approach by investigating different preference functions and selecting the best suitable one for each performance measure regarding its properties. Next, we will investigate the building of hybrid methods (a mix of several generalized preference functions) that can be used for experimental comparison of methods for MLC. Finally, we will extend the experimental study by including more datasets and methods.
Acknowledgments

This work was supported by the Slovenian Research Agency (research core funding No. P2-0098 and No. P2-0103).

References

Benesty, J.; Chen, J.; Huang, Y.; and Cohen, I. 2009. Pearson correlation coefficient. In Noise Reduction in Speech Processing. Springer. 1–4.
Blockeel, H.; Raedt, L. D.; and Ramon, J. 1998. Top-down induction of clustering trees. In Proceedings of the 15th International Conference on Machine Learning, 55–63. Morgan Kaufmann.
Boroushaki, S. 2017. Entropy-based weights for multicriteria spatial decision-making. Yearbook of the Association of Pacific Coast Geographers 79:168–187.
Brans, J.-P., and Mareschal, B. 2005. PROMETHEE methods. In Multiple Criteria Decision Analysis: State of the Art Surveys. Springer. 163–186.
Brans, J.-P., and Vincke, P. 1985. Note—A preference ranking organisation method (the PROMETHEE method for multiple criteria decision-making). Management Science 31(6):647–656.
Breiman, L. 2001. Random forests. Machine Learning 45(1):5–32.
Clare, A., and King, R. D. 2001. Knowledge discovery in multi-label phenotype data. In European Conference on Principles of Data Mining and Knowledge Discovery, 42–53. Springer.
Crammer, K., and Singer, Y. 2003. A family of additive online algorithms for category ranking. Journal of Machine Learning Research 3:1025–1058.
De Comité, F.; Gilleron, R.; and Tommasi, M. 2003. Learning multi-label alternating decision trees from texts and data. In Proc. of the 3rd International Conference on Machine Learning and Data Mining in Pattern Recognition, 35–49.
Dembczyński, K.; Waegeman, W.; Cheng, W.; and Hüllermeier, E. 2010. Regret analysis for performance metrics in multi-label classification: The case of Hamming and subset zero-one loss. In Machine Learning and Knowledge Discovery in Databases, 280–295. Berlin, Heidelberg: Springer Berlin Heidelberg.
Eftimov, T.; Korošec, P.; and Koroušić Seljak, B. 2018. Data-driven preference-based deep statistical ranking for comparing multi-objective optimization algorithms. In International Conference on Bioinspired Methods and Their Applications, 138–150. Springer.
Fürnkranz, J. 2002. Round robin classification. Journal of Machine Learning Research 2:721–747.
Gao, W., and Zhou, Z.-H. 2013. On the consistency of multi-label learning. Artificial Intelligence 199-200:22–44.
Gibaja, E., and Ventura, S. 2015. A tutorial on multilabel learning. ACM Computing Surveys 47(3):1–38.
Ishizaka, A., and Nemery, P. 2011. Selecting the best statistical distribution with PROMETHEE and GAIA. Computers & Industrial Engineering 61(4):958–969.
Kocev, D.; Vens, C.; Struyf, J.; and Džeroski, S. 2013. Tree ensembles for predicting structured outputs. Pattern Recognition 46(3):817–833.
Madjarov, G.; Kocev, D.; Gjorgjevikj, D.; and Džeroski, S. 2012. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition 45(9):3084–3104.
Mencía, E. L.; Park, S.-H.; and Fürnkranz, J. 2010. Efficient voting prediction for pairwise multilabel classification. Neurocomputing 73(7-9):1164–1176.
Nikouei, M. A.; Oroujzadeh, M.; and Mehdipour-Ataei, S. 2017. The PROMETHEE multiple criteria decision making analysis for selecting the best membrane prepared from sulfonated poly(ether ketone)s and poly(ether sulfone)s for proton exchange membrane fuel cell. Energy 119:77–85.
Park, S.-H., and Fürnkranz, J. 2007. Efficient pairwise classification. In European Conference on Machine Learning, 658–665. Springer.
Read, J.; Pfahringer, B.; Holmes, G.; and Frank, E. 2011. Classifier chains for multi-label classification. Machine Learning 85(3):333.
Thabtah, F. A.; Cowling, P.; and Peng, Y. 2004. MMAC: A new multi-class, multi-label associative classification approach. In Proc. of the 4th IEEE International Conference on Data Mining, 217–224.
Tsoumakas, G., and Katakis, I. 2007. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM) 3(3):1–13.
Tsoumakas, G., and Vlahavas, I. 2007. Random k-labelsets: An ensemble method for multilabel classification. In European Conference on Machine Learning, 406–417. Springer.
Tsoumakas, G.; Katakis, I.; and Vlahavas, I. 2008. Effective and efficient multilabel classification in domains with large number of labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08), volume 21, 53–59.
Waegeman, W.; Dembczyński, K.; Jachnik, A.; Cheng, W.; and Hüllermeier, E. 2014. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research 15:3513–3568.
Wu, X.-Z., and Zhou, Z.-H. 2017. A unified view of multi-label performance measures. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017.
Zhang, M.-L., and Zhou, Z.-H. 2007. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition 40(7):2038–2048.