<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Benchmarking Multi-label Classification Algorithms</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Arjun</forename><surname>Pakrashi</surname></persName>
							<email>arjun.pakrashi@insight-centre.org</email>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Derek</forename><surname>Greene</surname></persName>
							<email>derek.greene@ucd.ie</email>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Brian</forename><forename type="middle">Mac</forename><surname>Namee</surname></persName>
							<email>brian.macnamee@ucd.ie</email>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Benchmarking Multi-label Classification Algorithms</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">241AE2AB5D4C98C65251F57424564C2B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T14:13+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Multi-label classification is an approach to classification problems that allows each data point to be assigned to more than one class at the same time. Real-life machine learning problems are often multi-label in nature: for example, image labelling, topic identification in texts, and gene expression prediction. Many multi-label classification algorithms have been proposed in the literature and, although there have been some benchmarking experiments, many questions still remain about which approaches perform best for certain kinds of multi-label datasets. This paper presents a comprehensive benchmark experiment of eleven multi-label classification algorithms on eleven different datasets. Unlike many existing studies, we perform detailed parameter tuning for each algorithm-dataset pair so as to allow a fair comparative analysis of the algorithms. Also, we report on a preliminary experiment which seeks to understand how the performance of different multi-label classification algorithms changes as the characteristics of multi-label datasets are adjusted.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>There are many important real-life classification problems in which a data point can be a member of more than one class simultaneously <ref type="bibr" target="#b8">[9]</ref>. For example, a gene sequence can be a member of multiple functional classes, or a piece of music can be tagged with multiple genres. These types of problems are known as multi-label classification problems <ref type="bibr" target="#b22">[23]</ref>. In multi-label problems there is typically a finite set of potential labels that can be applied to data points. The labels that are applicable to a specific data point are known as its relevant labels, while those that are not applicable are known as its irrelevant labels.</p><p>Early, naïve approaches to the multi-label problem (e.g. <ref type="bibr" target="#b0">[1]</ref>) consider each label independently, using a one-versus-all binary classification approach to predict the relevance of an individual label to a data point. The outputs of a set of these individual classifiers are then aggregated into a set of relevant labels. Although these approaches can work well <ref type="bibr" target="#b10">[11]</ref>, their performance tends to degrade significantly as the number of potential labels increases. Predicting a group of relevant labels effectively involves finding a point in a multi-dimensional label space, and as the number of labels increases this becomes more challenging because the space becomes increasingly sparse. An added challenge is that multi-label problems can suffer from a very high degree of label imbalance. To address these challenges, more sophisticated multi-label classification algorithms <ref type="bibr" target="#b8">[9]</ref> attempt to exploit the associations between labels, and use ensemble approaches to break the problem into a series of less complex problems (e.g. 
<ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b13">14]</ref>).</p><p>We describe an experiment to benchmark the performance of eleven of the most widely-cited approaches to multi-label classification on a set of eleven multi-label classification datasets. While there are existing benchmarks of this type (e.g. <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>), they do not sufficiently tune the hyper-parameters for each algorithm, and so do not compare approaches in a fair way. In the experiment presented here, extensive hyper-parameter tuning is performed. The paper also presents the results of an initial experiment to investigate how the performance of different multi-label classification algorithms changes as the characteristics of datasets (e.g. the size of the set of potential labels) change.</p><p>The remainder of the paper is structured as follows. Section 2 provides a brief survey of existing multi-label classification algorithms and previous benchmark studies. Section 3 describes the benchmark experiment, along with an analysis of its results. Section 4 describes the experiment performed to explore the performance of multi-label classification algorithms as the characteristics of the dataset change. Section 5 draws conclusions from the experimental results and outlines a path for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Multi-Label Classification Algorithms</head><p>Multi-label classification algorithms can be divided into two categories: problem transformation and algorithm adaptation <ref type="bibr" target="#b22">[23]</ref>. The problem transformation approach transforms the multi-label dataset so that existing multi-class algorithms can be used to solve the transformed problem. Algorithm adaptation methods extend multi-class algorithms to directly work with multi-label datasets. In this section the most widely used approaches in each category will be described (including those used in the experiment described in Section 3). The section will end with a review of existing benchmark experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Problem Transformation</head><p>The most trivial approach to multi-label classification is the binary relevance method <ref type="bibr" target="#b0">[1]</ref>. Binary relevance adopts a one-vs-all ensemble approach, training independent binary classifiers to predict the relevance of each label to a data point. The independent predictions are then aggregated to form a set of relevant labels. Although binary relevance is a simple approach, Luaces et al. <ref type="bibr" target="#b10">[11]</ref> show that a properly implemented binary relevance model, with a carefully selected base classifier, can achieve good results.</p><p>Classifier chains <ref type="bibr" target="#b13">[14]</ref> take a similar approach to binary relevance but explicitly take the associations between labels into account. Again a one-vs-all classifier is built for each label, but these classifiers are chained together in order such that the outputs of classifiers early in the chain (the relevance of specific labels) are used as inputs into subsequent classifiers.</p><p>Rather than trying to transform the multi-label classification problem into multiple binary classification problems, the label powerset method <ref type="bibr" target="#b0">[1]</ref> transforms the multi-label problem into a single multi-class classification problem. Each unique combination of relevant labels is mapped to a class to create a transformed multi-class dataset which can be used to train a classification model using any multi-class learning algorithm. Although the label powerset method can perform well, as the number of labels increases the number of possible unique label combinations grows exponentially giving rise to a very sparse and imbalanced equivalent multi-class dataset.</p><p>The random k-label set (RAkEL) approach <ref type="bibr" target="#b19">[20]</ref> attempts to strike a balance between the binary relevance and label powerset approaches. 
RAkEL divides the full set of potential labels in a multi-label problem into a series of label subsets, and for each subset builds a label powerset model. By creating multiple multilabel problems with small numbers of labels, RAkEL reduces the sparseness and imbalance that affects the label powerset method, but still takes advantage of the associations that exist between labels.</p><p>Hierarchy of multi-label classifiers (HOMER) <ref type="bibr" target="#b16">[17]</ref> also divides the multi-label dataset into smaller subsets of labels, but in a hierarchical manner. Calibrated label ranking (CLR) <ref type="bibr" target="#b7">[8]</ref> takes a paired approach by training an ensemble of classifiers for each possible pair of labels in the dataset using only the data points which have either of the labels in the pair assigned to them.</p></div>
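The transformation strategies above can be sketched in a few lines of code. The following is a minimal illustrative Python sketch of binary relevance and classifier chains (these are not the MULAN implementations used in this paper, and the tiny 1-nearest-neighbour base learner, `OneNN`, is a stand-in for the SVM base classifiers used later; all function names here are our own):

```python
import numpy as np

class OneNN:
    """Tiny 1-nearest-neighbour base classifier; a stand-in for the
    SVM base classifiers used in the paper (illustrative only)."""
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        d = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=-1)
        return self.y[d.argmin(axis=1)]

def binary_relevance_fit(X, Y, base=OneNN):
    # One independent binary model per label column.
    return [base().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    return np.column_stack([m.predict(X) for m in models])

def classifier_chain_fit(X, Y, base=OneNN):
    # Like binary relevance, but classifier j also sees labels 0..j-1
    # as extra input features, capturing label associations.
    X = np.asarray(X, float)
    return [base().fit(np.hstack([X, Y[:, :j]]), Y[:, j])
            for j in range(Y.shape[1])]

def classifier_chain_predict(models, X):
    # Predict labels in chain order, feeding each prediction forward.
    X = np.asarray(X, float)
    preds = np.empty((X.shape[0], 0))
    for m in models:
        y = m.predict(np.hstack([X, preds]))
        preds = np.hstack([preds, y[:, None]])
    return preds.astype(int)
```

The label powerset transformation differs only in the target: each unique row of `Y` is mapped to one multi-class label before fitting a single model.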
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Algorithm Adaptation</head><p>Multi-label k-nearest neighbour (MLkNN ) <ref type="bibr" target="#b24">[25]</ref> is one of the most widely cited algorithm adaptation approaches. MLkNN is essentially a binary relevance algorithm, which acts on the labels individually, but instead of applying the standard k-nearest neighbour algorithm directly, it combines it with the maximum a posteriori principle. Dependent MLkNN (DMLkNN ) <ref type="bibr" target="#b21">[22]</ref> follows the same principle as MLkNN but incorporates all of the labels when deciding the probability for each label, therefore taking label associations into account. IBLR-ML <ref type="bibr" target="#b3">[4]</ref> is another modification of the k-nearest neighbour algorithm. It finds the nearest neighbours of the data point to be labelled, and trains a logistic regression model for each label using the labels of these neighbouring points as features, thus taking the label associations into account. An algorithmic performance improvement of binary relevance combined with standard k-nearest neighbour, BRkNN, has also been proposed <ref type="bibr" target="#b15">[16]</ref>.</p><p>Multi-label decision tree (ML-DT) <ref type="bibr" target="#b4">[5]</ref> extends the C4.5 decision tree algorithm to allow multiple labels in the leaves, and chooses node splits based on a re-defined multi-label entropy function. Rank-SVM <ref type="bibr" target="#b6">[7]</ref> is a support vector machine based approach that defines one-vs-all SVM classifiers for each label, but uses a cost function across all of these models that captures incorrect predictions of pairs of relevant and irrelevant labels. Backpropagation for multi-label learning (BPMLL) <ref type="bibr" target="#b23">[24]</ref> is a neural network adaptation that trains on multi-label datasets using a single-hidden-layer feed-forward architecture and the backpropagation algorithm.</p></div>
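To illustrate the algorithm adaptation idea, the following is a compact, simplified Python reading of the core of MLkNN (not the MULAN implementation benchmarked here): for each label, a smoothed prior and neighbour-count likelihoods are estimated from the training set, and the maximum a posteriori rule decides relevance. The class name and smoothing convention are our own:

```python
import numpy as np

class SimpleMLkNN:
    """Simplified MLkNN sketch: per-label MAP decision based on how many
    of a point's k nearest neighbours carry that label."""
    def __init__(self, k=3, s=1.0):
        self.k, self.s = k, s   # s is the Laplace smoothing constant

    def fit(self, X, Y):
        X, Y = np.asarray(X, float), np.asarray(Y, int)
        n, L = Y.shape
        self.X, self.Y = X, Y
        # Prior P(label j relevant), smoothed.
        self.p1 = (self.s + Y.sum(axis=0)) / (2 * self.s + n)
        # Neighbour label counts for every training point (self excluded).
        d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d, np.inf)
        nbrs = d.argsort(axis=1)[:, :self.k]
        C = Y[nbrs].sum(axis=1)                      # shape (n, L)
        # Likelihoods P(count = c | label relevant / irrelevant).
        self.like1 = np.zeros((self.k + 1, L))
        self.like0 = np.zeros((self.k + 1, L))
        for c in range(self.k + 1):
            self.like1[c] = (self.s + ((C == c) & (Y == 1)).sum(0)) / \
                            (self.s * (self.k + 1) + (Y == 1).sum(0))
            self.like0[c] = (self.s + ((C == c) & (Y == 0)).sum(0)) / \
                            (self.s * (self.k + 1) + (Y == 0).sum(0))
        return self

    def predict(self, X):
        out = []
        for x in np.asarray(X, float):
            d = ((self.X - x) ** 2).sum(-1)
            c = self.Y[d.argsort()[:self.k]].sum(axis=0)
            j = np.arange(self.Y.shape[1])
            post1 = self.p1 * self.like1[c, j]
            post0 = (1 - self.p1) * self.like0[c, j]
            out.append((post1 > post0).astype(int))
        return np.array(out)
```

On a toy two-cluster dataset where each cluster carries one label, a query near a cluster inherits that cluster's label via the MAP rule.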
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Multi-label Classification Benchmark Studies</head><p>A number of papers that describe new multi-label classification approaches <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref> benchmark different multi-label classification algorithms against their newly proposed method. One of the limitations of these studies, however, is a lack of hyper-parameter tuning, and a reliance on default hyper-parameter settings. Rather than proposing a new algorithm, Madjarov et al. <ref type="bibr" target="#b12">[13]</ref> describe a benchmark study of several multi-label classification algorithms over several datasets. Hyper-parameter tuning is performed in this study. There is, however, a mismatch between the Hamming loss measure used to select hyper-parameters and the measures used to evaluate performance in the benchmark. The study identifies HOMER, binary relevance, and classifier chains as promising approaches.</p><p>To perform a fair comparison of algorithms, the benchmark experiment described in this paper uses extensive parameter tuning. For consistency, the measure used to guide this parameter tuning, the label based macro averaged F-Score (see Section 3.2), is the same as the measure used to compare algorithms in the benchmark. The set of algorithms used overlaps with, but is different from, those in Madjarov et al. <ref type="bibr" target="#b12">[13]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Multi-label Classification Algorithm Benchmark</head><p>This section describes a benchmark experiment performed to evaluate the performance of a collection of multi-label classification algorithms across several datasets. This section introduces the datasets and performance measure used in the experiment as well as the experimental methodology. Finally, the results of the experiment are presented and discussed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Datasets</head><p>Table <ref type="table" target="#tab_0">1</ref> describes the eleven datasets used in this experiment. The datasets chosen are widely used in the multi-label literature, and have a diverse set of properties, listed in Table <ref type="table" target="#tab_0">1</ref>. Instances, inputs and labels indicate the total number of data points, the number of predictor variables, and the number of potential labels, respectively. Total labelsets indicates the number of unique combinations of relevant labels in the dataset, where each such unique label combination is a labelset. Single labelsets indicates the number of data points whose combination of relevant labels occurs only once in the dataset. Cardinality indicates the average number of labels assigned per data point. Density is a normalised, dimensionless version of cardinality, computed by dividing cardinality by the number of labels. MeanIR <ref type="bibr" target="#b1">[2]</ref> indicates the average degree of label imbalance in the multi-label dataset: a higher value indicates more imbalance. Together these parameters describe the properties of the datasets that may influence the performance of the algorithms; collectively, they will be referred to as label complexity in the remainder of this text. All datasets were acquired from <ref type="bibr" target="#b17">[18]</ref>. In the birds dataset, several data points have no assigned labels. To avoid problems when computing performance scores, we added an extra other label to this dataset, which is assigned to a data point when it has no other labels.</p></div>
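All of these label complexity properties can be computed directly from the binary label matrix. A minimal sketch (the function name is ours; MeanIR follows the per-label imbalance ratio definition of Charte et al. [2], assuming every label has at least one positive instance):

```python
import numpy as np

def label_properties(Y):
    """Label complexity measures of a binary label matrix Y
    (rows = data points, columns = labels)."""
    Y = np.asarray(Y, int)
    n, L = Y.shape
    keys = [tuple(r) for r in Y]           # labelset of each data point
    counts = Y.sum(axis=0)                 # positive instances per label
    cardinality = Y.sum() / n              # mean labels per data point
    return {
        "total_labelsets": len(set(keys)),
        "single_labelsets": sum(1 for k in set(keys) if keys.count(k) == 1),
        "cardinality": cardinality,
        "density": cardinality / L,        # cardinality normalised by L
        # IRLbl(l) = count of most frequent label / count of label l,
        # averaged over labels.
        "mean_ir": (counts.max() / counts).mean(),
    }
```

For example, a three-point dataset with labelsets {1,0}, {1,1}, {1,0} has two total labelsets, one single labelset, cardinality 4/3, density 2/3 and MeanIR 2.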
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Experimental Methodology</head><p>In this study we use the label based macro averaged F-measure <ref type="bibr" target="#b22">[23]</ref> for both hyper-parameter selection and performance comparison. Higher values indicate better performance. This measure was selected because it captures the performance of algorithms on minority labels and balances precision and recall for each label <ref type="bibr" target="#b9">[10]</ref>.</p><p>The algorithms used in this experiment are: binary relevance (BR) <ref type="bibr" target="#b0">[1]</ref>, classifier chains (CC) <ref type="bibr" target="#b13">[14]</ref>, label powerset (LP) <ref type="bibr" target="#b0">[1]</ref>, RAkEL-d <ref type="bibr" target="#b19">[20]</ref>, HOMER <ref type="bibr" target="#b16">[17]</ref>, CLR <ref type="bibr" target="#b7">[8]</ref>, BRkNN <ref type="bibr" target="#b15">[16]</ref>, MLkNN <ref type="bibr" target="#b24">[25]</ref>, DMLkNN <ref type="bibr" target="#b21">[22]</ref>, IBLR-ML <ref type="bibr" target="#b3">[4]</ref> and BPMLL <ref type="bibr" target="#b23">[24]</ref>. All algorithm implementations come from the Java library MULAN <ref type="bibr" target="#b18">[19]</ref>. For each algorithm-dataset pair, a grid search over different parameter combinations was performed: for each parameter combination in the grid, a 2 × 5-fold cross-validation run was performed and the F-measure recorded. Once the grid search was complete, the parameter combination with the highest F-measure was selected. These selected scores are shown in Table <ref type="table" target="#tab_1">2</ref> and used to compare the classifiers.</p><p>For each problem transformation method (CC, BR, LP and CLR), a support vector machine with a radial basis kernel (SVM-RBK) was used as the base classifier. 
The SVM models were tuned over 12 combinations of the regularisation parameter (from the set {1, 10, 100}) and the kernel spread parameter (from the set {0.01, 0.05, 0.001, 0.005}). For RAkEL-d the subset size was varied between 3 and 6, and for HOMER the cluster size was varied between 3 and 6. For both RAkEL-d and HOMER, the base classifiers were label powerset models using SVM-RBK models tuned as outlined above. The BRkNN, MLkNN, DMLkNN and IBLR-ML models were tuned over 4 to 26 nearest neighbours, with a step size of 2. For BPMLL, tuning was performed in two steps to keep it computationally feasible. First, a grid of 120 different parameter combinations covering the regularisation weight, learning rate, number of iterations and number of hidden units was created, and the best combination was found using only the yeast dataset. Next, using this best combination of hyper-parameters, BPMLL was tuned on the other datasets over hidden layers containing a number of units equal to 20%, 40%, 60%, 80% and 100% of the number of inputs of each dataset, as recommended by Zhang et al. <ref type="bibr" target="#b23">[24]</ref>.</p></div>
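The selection measure can be stated precisely: for each label, precision and recall are computed from that label's binary predictions, combined into an F-score, and the scores are averaged with equal weight per label, so minority labels count fully. A minimal sketch (the function name and the zero-division convention of F = 0 for a label with no positives are our assumptions):

```python
import numpy as np

def macro_f_measure(Y_true, Y_pred):
    """Label based macro averaged F-measure over binary label matrices."""
    Y_true, Y_pred = np.asarray(Y_true, int), np.asarray(Y_pred, int)
    f_scores = []
    for j in range(Y_true.shape[1]):
        t, p = Y_true[:, j], Y_pred[:, j]
        tp = int(((t == 1) & (p == 1)).sum())   # true positives
        fp = int(((t == 0) & (p == 1)).sum())   # false positives
        fn = int(((t == 1) & (p == 0)).sum())   # false negatives
        denom = 2 * tp + fp + fn
        f_scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f_scores))
```

In the grid search described above, this quantity would be averaged over the 2 × 5 cross-validation folds for each parameter combination, and the combination maximising it selected.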
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Benchmark results</head><p>The results of the benchmark experiment described in Section 3.2 are summarised in Table <ref type="table" target="#tab_1">2</ref>. The columns of the table are ordered in increasing order of the average rank of the algorithms over all the datasets (a lower average rank is better). The best performance per dataset is highlighted in bold-face. Direct inspection of Table <ref type="table" target="#tab_1">2</ref> shows that CC achieved the top score on 4 of the datasets, whereas BPMLL achieved the top score 3 times, RAkEL-d twice, and LP and HOMER once each. It is also interesting to note that the k-nearest neighbour based algorithms (IBLR-ML, MLkNN, BRkNN and DMLkNN) are ranked in that order and close to each other. DNF appears in Table <ref type="table" target="#tab_1">2</ref> for the CLR algorithm on the corel5k dataset as the experiment did not finish, due to the huge number of label pairs generated for the 374 labels in this dataset (this is a common outcome for this dataset, e.g. <ref type="bibr" target="#b11">[12]</ref>).</p><p>To further explore these results, as recommended by Demšar <ref type="bibr" target="#b5">[6]</ref>, a Friedman test was first performed, which indicated that a significant difference between the performance of the algorithms over the datasets did exist; a pairwise Nemenyi test with a significance level of α = 0.05 was then performed. The results indicate that the performance of most pairs of algorithms does not differ significantly across the datasets. Figure <ref type="figure" target="#fig_0">1</ref> shows the critical difference plot for the pairwise Nemenyi test. The algorithms indicated on the line are ordered by average rank over the datasets. 
Algorithms that are not significantly different from each other over the datasets, according to the Nemenyi test at significance level α = 0.05, are connected with bold horizontal lines.</p><p>Overall, Figure <ref type="figure" target="#fig_0">1</ref> indicates that CC, RAkEL-d, BPMLL and LP performed well, whereas the nearest neighbour based algorithms performed relatively poorly. Among the nearest neighbour based algorithms, IBLR-ML performs better than the others over the datasets, but all of them perform significantly worse than CC. Hence, although none of the algorithms decisively outperforms the others over the different datasets, CC, RAkEL-d, BPMLL and LP perform well, and the nearest neighbour based algorithms perform poorly in general.</p></div>
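The comparison procedure from Demšar [6] can be sketched as follows: rank the algorithms on each dataset, average the ranks, then declare a pair significantly different if their average ranks differ by more than the Nemenyi critical difference CD = q<sub>α</sub>·√(k(k+1)/6N) for k algorithms on N datasets. This is an illustrative re-implementation, not the code used in the paper; the q values are the standard two-tailed α = 0.05 constants from Demšar's paper, listed here only up to k = 6:

```python
import numpy as np

# Two-tailed Nemenyi q values at alpha = 0.05 (Demšar, 2006), indexed
# by the number of algorithms k (truncated at k = 6 for brevity).
Q_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def average_ranks(scores):
    """scores: (datasets x algorithms) matrix, higher is better.
    Returns the average rank of each algorithm (rank 1 = best),
    with tied scores given the mean of their ranks."""
    scores = np.asarray(scores, float)
    ranks = np.zeros_like(scores)
    for i, row in enumerate(scores):
        order = (-row).argsort()
        r = np.empty(len(row))
        r[order] = np.arange(1, len(row) + 1)
        for v in np.unique(row):          # average tied ranks
            r[row == v] = r[row == v].mean()
        ranks[i] = r
    return ranks.mean(axis=0)

def nemenyi_cd(k, n, q=Q_05):
    """Nemenyi critical difference for k algorithms on n datasets."""
    return q[k] * np.sqrt(k * (k + 1) / (6.0 * n))
```

Two algorithms whose average ranks differ by less than `nemenyi_cd(k, n)` would be joined by a bold line in a critical difference plot such as Figure 1.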
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Label Analysis</head><p>A preliminary experiment was also performed to understand how multi-label classification approaches perform when the number of labels is increased, while the input space is kept the same. Section 4.1 describes the experimental setup and Section 4.2 discusses the results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental Setup</head><p>The corel5k dataset has more than 50 times as many potential labels as the scene dataset. There are also significant differences in their MeanIR values: 1.254 for scene and 189.568 for corel5k (see Table <ref type="table" target="#tab_0">1</ref>). Table <ref type="table" target="#tab_1">2</ref> indicates that all of the multi-label classification approaches perform much better on scene than on corel5k. It is tempting to conclude that this is because of the complexity of the labelsets, but this would probably be a mistake. One multi-label classification problem can be inherently more difficult than another: the prediction performance of an algorithm on a multi-label dataset depends not only on the label properties, but also on the predictor variables in the input space. Therefore, attempting to establish a relationship between the performance of algorithms on different datasets with varying label properties can be misleading.</p><p>To assess the impact of changing label complexity on the performance of multi-label classification algorithms, a group of datasets was generated synthetically that vary in label complexity but keep all input variables the same. Thirteen synthetic datasets were formed from the yeast dataset. The input space of these 13 datasets is kept identical, with the k th dataset having the first k labels of the original dataset in the original order, where 2 ≤ k ≤ 14. Similarly, the emotions dataset was used to generate 5 such synthetic datasets. The yeast and emotions datasets were selected for this preliminary study for two reasons. First, these are widely used datasets that are somewhat typical of multi-label classification problems: they have medium cardinality and the frequencies of the different labels are relatively well balanced. 
Second, this experiment is computationally quite expensive (multiple days are required for each run), and the sizes of these datasets make repeated runs feasible for this preliminary study.</p><p>Following the experimental methodology explained in Section 3.2, the performance of BR, CC, LP, RAkEL, IBLR-ML, BRkNN, CLR and BPMLL was assessed on the 13 synthetic datasets based on the yeast data and the 5 synthetic datasets based on the emotions data. The results of this experiment are discussed in the following section.</p></div>
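The synthetic dataset construction described above is simply a truncation of the label matrix: the inputs X are untouched, and the k-th dataset keeps the first k label columns in their original order. A sketch (the helper name is ours):

```python
import numpy as np

def label_truncated_datasets(X, Y, k_min=2):
    """Yield (X, Y[:, :k]) for k = k_min .. L, keeping the input space
    identical while label complexity grows, as in the yeast/emotions
    experiment described above."""
    Y = np.asarray(Y, int)
    for k in range(k_min, Y.shape[1] + 1):
        yield X, Y[:, :k]
```

For yeast (14 labels) this yields the 13 datasets with 2 ≤ k ≤ 14; for emotions (6 labels) it yields the 5 datasets with 2 ≤ k ≤ 6.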
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Label Analysis Results</head><p>In Figures <ref type="figure">2a and 2b</ref> the number of labels used in the dataset (either yeast or emotions) is shown on the x-axis and the label based macro averaged F-measure is shown on the y-axis (note that the graphs do not use a zero baseline for F-measure, so as to emphasise the differences between approaches). These plots indicate that all the algorithms respond similarly with respect to F-measure as the number of labels varies. Figures <ref type="figure">2c and 2d</ref>, however, show how the relative ranking of the performance of the different algorithms changes as label complexity increases, and here interesting patterns are observed.</p><p>Figure <ref type="figure">2c</ref>, relating to the yeast dataset, indicates that the performance of BR starts at a high rank, but falls as the number of labels increases. CLR does better in rank than BR, but also keeps decreasing as the number of labels increases. For LP and CC, the relative performance increases as the number of labels increases, ending at the first and second positions respectively. BPMLL starts with the lowest rank, but quickly improves, maintaining the best rank most of the time. RAkEL-d stays in the middle, while BRkNN and IBLR-ML stay at the bottom ranks.</p><p>Fig. <ref type="figure">2</ref>: Number of labels selected from the yeast and emotions datasets, compared against classifier performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>[Figure 2 panels: (a) macro averaged F-measure for increasing numbers of yeast labels; (b) macro averaged F-measure for increasing numbers of emotions labels; (c) relative rank changes, yeast; (d) relative rank changes, emotions. Legend: BPMLL, CLR, IBLR-ML, RAkEL-d, LP, CC, BR, BRkNN.]</p><p>This preliminary study indicates that LP, CC and BPMLL were able to perform comparatively better than the others, while BR showed a consistent decrease in rank. To establish a definite relationship, a more detailed study should be performed.</p><p>Figure <ref type="figure">3</ref> shows how the label complexity parameters (cardinality and density) for the yeast and emotions datasets change as the number of labels is varied in the synthetically generated datasets. Although it appears that there is some relationship between the change in density in Figure <ref type="figure">3</ref> and the change in performance in Figures <ref type="figure">2a and 2b</ref>, such a conclusion from this experiment may be misleading, and hence requires further study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Discussion and Future Work</head><p>This paper has focused on two aspects: first, the benchmarking of several multi-label classification algorithms over a diverse collection of datasets; and second, a preliminary study of the performance of the algorithms when the input space is kept identical while the label complexity is varied. For the benchmark experiment, the hyper-parameters for each algorithm-dataset pair were tuned based on the label based macro averaged F-measure to provide the fairest possible comparison between approaches. The algorithms DMLkNN, BRkNN and MLkNN performed poorly overall. On the other hand, CC, RAkEL-d and BPMLL were the top three algorithms, in that order. The pairwise Nemenyi test, however, indicates that overall there is no statistically significant difference between the performance of most pairs of algorithms. This is perhaps unsurprising, and reinforces the no free lunch theorem <ref type="bibr" target="#b20">[21]</ref> in the context of multi-label classification.</p><p>The preliminary label analysis provides some interesting results. The performance of BPMLL, LP and CC improves as the number of labels increases, whereas the performance of BR decreases in comparison. IBLR-ML appears to achieve consistently better ranks than BRkNN.</p><p>The level of research in the multi-label classification field continues to increase, with new methods being proposed and existing methods being improved. Further investigations could examine the performance of additional algorithms over even more datasets to understand their overall effectiveness. Our label analysis experiment was limited to two datasets. 
Given the preliminary observations from this study, it would be interesting to further investigate whether any consistent relationship exists between algorithm performance and the label properties of the dataset under consideration, which could provide guidelines for the suitable application of multi-label algorithms.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Comparison of algorithms based on the pairwise Nemenyi test. Groups connected with a bold line are not significantly different at significance level α = 0.05</figDesc><graphic coords="7,220.80,460.86,173.76,156.23" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>(a) Macro average F-Measure performance changes, yeast.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>Relative rank changes, emotions.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>though IBLR-ML was able to get a better rank than BRkNN most of the time. In Figure 2d, relating to the emotions dataset, BPMLL and CC continued to rise, CLR and BR drifted down, and IBLR-ML and BRkNN were relatively flat, with IBLR-ML achieving the better ranking most of the time.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head></head><label></label><figDesc>Fig. 3: Change of a few label complexity parameters (cardinality and density) as the number of labels changes, for the yeast and emotions datasets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Datasets</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell></cell><cell>Total</cell><cell>Single</cell><cell></cell><cell></cell></row><row><cell cols="8">Dataset Instances Inputs Labels Labelsets Labelsets Cardinality Density MeanIR</cell></row><row><cell>yeast</cell><cell>2417</cell><cell>103</cell><cell>14</cell><cell>198</cell><cell>77</cell><cell>4.237 0.303</cell><cell>7.197</cell></row><row><cell>scene</cell><cell>2407</cell><cell>294</cell><cell>6</cell><cell>15</cell><cell>3</cell><cell>1.074 0.179</cell><cell>1.254</cell></row><row><cell>emotions</cell><cell>593</cell><cell>72</cell><cell>6</cell><cell>27</cell><cell>4</cell><cell>1.869 0.311</cell><cell>1.478</cell></row><row><cell>medical</cell><cell cols="2">978 1449</cell><cell>45</cell><cell>94</cell><cell>33</cell><cell cols="2">1.245 0.028 89.501</cell></row><row><cell>enron</cell><cell cols="2">1702 1001</cell><cell>53</cell><cell>753</cell><cell>573</cell><cell cols="2">3.378 0.064 73.953</cell></row><row><cell>birds</cell><cell>322</cell><cell>260</cell><cell>20</cell><cell>89</cell><cell>55</cell><cell cols="2">1.503 0.075 13.004</cell></row><row><cell>genbase</cell><cell cols="2">662 1186</cell><cell>27</cell><cell>32</cell><cell>10</cell><cell cols="2">1.252 0.046 37.315</cell></row><row><cell>cal500</cell><cell>502</cell><cell>68</cell><cell>174</cell><cell>502</cell><cell>502</cell><cell cols="2">26.044 0.150 20.578</cell></row><row><cell>llog</cell><cell cols="2">1460 1004</cell><cell>75</cell><cell>304</cell><cell>189</cell><cell cols="2">1.180 0.016 39.267</cell></row><row><cell>slashdot</cell><cell cols="2">3782 1079</cell><cell>22</cell><cell>156</cell><cell>56</cell><cell cols="2">1.181 0.054 17.693</cell></row><row><cell>corel5k</cell><cell>5000</cell><cell>499</cell><cell>374</cell><cell>3175</cell><cell>2523</cell><cell cols="2">3.522 0.009 189.568</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Best Mean Label-Based Macro-Averaged F-Measure</figDesc><table><row><cell>Dataset</cell><cell cols="3">CC RAkEL-d BPMLL</cell><cell cols="6">LP HOMER BR CLR IBLR-ML MLkNN BRkNN DMLkNN</cell></row><row><cell>yeast</cell><cell>0.451</cell><cell>0.437</cell><cell cols="2">0.436 0.451</cell><cell>0.448 0.387 0.399</cell><cell>0.394</cell><cell>0.377</cell><cell>0.392</cell><cell>0.380</cell></row><row><cell>scene</cell><cell>0.804</cell><cell>0.802</cell><cell cols="2">0.778 0.802</cell><cell>0.800 0.799 0.793</cell><cell>0.749</cell><cell>0.742</cell><cell>0.695</cell><cell>0.750</cell></row><row><cell>emotions</cell><cell>0.624</cell><cell cols="3">0.628 0.690 0.596</cell><cell>0.621 0.604 0.616</cell><cell>0.658</cell><cell>0.629</cell><cell>0.633</cell><cell>0.634</cell></row><row><cell>medical</cell><cell>0.692</cell><cell>0.697</cell><cell cols="2">0.558 0.659</cell><cell>0.611 0.676 0.520</cell><cell>0.434</cell><cell>0.540</cell><cell>0.474</cell><cell>0.505</cell></row><row><cell>enron</cell><cell>0.289</cell><cell>0.288</cell><cell cols="2">0.281 0.278</cell><cell>0.281 0.284 0.286</cell><cell>0.153</cell><cell>0.177</cell><cell>0.169</cell><cell>0.163</cell></row><row><cell>birds</cell><cell>0.158</cell><cell cols="3">0.181 0.343 0.181</cell><cell>0.155 0.157 0.156</cell><cell>0.255</cell><cell>0.226</cell><cell>0.273</cell><cell>0.216</cell></row><row><cell>genbase</cell><cell>0.944</cell><cell>0.943</cell><cell cols="2">0.815 0.941</cell><cell>0.939 0.941 0.931</cell><cell>0.910</cell><cell>0.850</cell><cell>0.837</cell><cell>0.821</cell></row><row><cell>cal500</cell><cell>0.185</cell><cell cols="3">0.179 0.237 0.178</cell><cell>0.199 0.181 0.169</cell><cell>0.178</cell><cell>0.101</cell><cell>0.124</cell><cell>0.107</cell></row><row><cell>llog</cell><cell>0.292</cell><cell>0.300</cell><cell cols="2">0.295 0.297</cell><cell>0.256 0.296 0.281</cell><cell>0.110</cell><cell>0.263</cell><cell>0.255</cell><cell>0.248</cell></row><row><cell>slashdot</cell><cell>0.469</cell><cell>0.472</cell><cell cols="2">0.209 0.474</cell><cell>0.477 0.466 0.151</cell><cell>0.214</cell><cell>0.194</cell><cell>0.164</cell><cell>0.200</cell></row><row><cell>corel5k</cell><cell>0.222</cell><cell>0.217</cell><cell cols="2">0.219 0.210</cell><cell>0.197 0.213 DNF</cell><cell>0.084</cell><cell>0.190</cell><cell>0.186</cell><cell>0.181</cell></row><row><cell cols="2">Average Rank 3.364</cell><cell>3.455</cell><cell cols="2">4.818 4.909</cell><cell>5.455 5.546 7.300</cell><cell>7.909</cell><cell>8.091</cell><cell>8.364</cell><cell>8.546</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Relative rankings on increasing number of Emotions labels</head><label></label><figDesc></figDesc><table></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgement. This research was supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Learning multi-label scene classification</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Boutell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Brown</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="1757" to="1771" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Addressing imbalance in multilabel classification: Measures and random resampling algorithms</title>
		<author>
			<persName><forename type="first">F</forename><surname>Charte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Rivera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Del Jesus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Herrera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">163</biblScope>
			<biblScope unit="page" from="3" to="16" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">MLTSVM: A novel twin support vector machine to multi-label learning</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">H</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">N</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">Y</forename><surname>Deng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="61" to="74" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Combining instance-based learning and logistic regression for multilabel classification</title>
		<author>
			<persName><forename type="first">W</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hüllermeier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">76</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="211" to="225" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Knowledge discovery in multi-label phenotype data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Clare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">D</forename><surname>King</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<biblScope unit="page" from="42" to="53" />
			<date type="published" when="2001">2001</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Statistical comparisons of classifiers over multiple data sets</title>
		<author>
			<persName><forename type="first">J</forename><surname>Demšar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2006-12">Dec 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A kernel method for multi-labelled classification</title>
		<author>
			<persName><forename type="first">A</forename><surname>Elisseeff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 14</title>
				<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="681" to="687" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Multilabel classification via calibrated label ranking</title>
		<author>
			<persName><forename type="first">J</forename><surname>Fürnkranz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hüllermeier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Loza Mencía</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Brinker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">73</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="133" to="153" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Multi-label learning: a review of the state of the art and ongoing research</title>
		<author>
			<persName><forename type="first">E</forename><surname>Gibaja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ventura</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="411" to="444" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kelleher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mac Namee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>D'arcy</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>The MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Binary relevance efficacy for multilabel classification</title>
		<author>
			<persName><forename type="first">O</forename><surname>Luaces</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Díez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Barranquero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Del Coz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bahamonde</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Progress in Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="303" to="313" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Two stage architecture for multi-label learning</title>
		<author>
			<persName><forename type="first">G</forename><surname>Madjarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gjorgjevikj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Džeroski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="1019" to="1034" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">An extensive experimental comparison of methods for multi-label learning</title>
		<author>
			<persName><forename type="first">G</forename><surname>Madjarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kocev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gjorgjevikj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Džeroski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Best Papers of Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA&apos;</title>
				<imprint>
			<date type="published" when="2011">2012. 2011</date>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="3084" to="3104" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Classifier chains for multi-label classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Read</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pfahringer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Holmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Frank</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">85</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="333" to="359" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Multi-label classification based on multi-objective optimization</title>
		<author>
			<persName><forename type="first">C</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Kong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Intell. Syst. Technol</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page">22</biblScope>
			<date type="published" when="2014-04">Apr 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">An empirical study of lazy multilabel classification algorithms</title>
		<author>
			<persName><forename type="first">E</forename><surname>Spyromitros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Tsoumakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vlahavas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th Hellenic Conference on Artificial Intelligence: Theories, Models and Applications</title>
				<meeting>the 5th Hellenic Conference on Artificial Intelligence: Theories, Models and Applications<address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="401" to="406" />
		</imprint>
	</monogr>
	<note>SETN &apos;08</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Effective and Efficient Multilabel Classification in Domains with Large Number of Labels</title>
		<author>
			<persName><forename type="first">G</forename><surname>Tsoumakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Katakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vlahavas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD&apos;08)</title>
				<meeting>ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD&apos;08)</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">MULAN multi-label dataset repository</title>
		<author>
			<persName><forename type="first">G</forename><surname>Tsoumakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Xioufis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vilcek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vlahavas</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Mulan: A Java library for multi-label learning</title>
		<author>
			<persName><forename type="first">G</forename><surname>Tsoumakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Spyromitros-Xioufis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vilcek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vlahavas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2411" to="2414" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Random k-labelsets: An ensemble method for multilabel classification</title>
		<author>
			<persName><forename type="first">G</forename><surname>Tsoumakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vlahavas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning: ECML 2007: 18th European Conference on Machine Learning</title>
				<meeting><address><addrLine>Warsaw, Poland; Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2007">September 17-21, 2007. 2007</date>
			<biblScope unit="page" from="406" to="417" />
		</imprint>
	</monogr>
	<note>Proceedings.</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">No free lunch theorems for optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Wolpert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">G</forename><surname>Macready</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Trans. Evol. Comp</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="67" to="82" />
			<date type="published" when="1997-04">Apr 1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Multi-label classification algorithm derived from k-nearest neighbor rule with label dependencies</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Younes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Abdallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denoeux</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">16th European Signal Processing Conference</title>
		<imprint>
			<date type="published" when="2008-08">Aug 2008</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">A review on multi-label learning algorithms</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">H</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1819" to="1837" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Multilabel neural networks with applications to functional genomics and text categorization</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">H</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="1338" to="1351" />
			<date type="published" when="2006-10">Oct 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">ML-KNN: A lazy learning approach to multi-label learning</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">H</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="2038" to="2048" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
