=Paper=
{{Paper
|id=Vol-2126/paper10
|storemode=property
|title=Approaches for the Improvement of the Multilabel Multiclass Classification with a Huge Number of Classes
|pdfUrl=https://ceur-ws.org/Vol-2126/paper10.pdf
|volume=Vol-2126
|authors=Martha Tatusch
|dblpUrl=https://dblp.org/rec/conf/gvd/Tatusch18
}}
==Approaches for the Improvement of the Multilabel Multiclass Classification with a Huge Number of Classes==
Approaches for the Improvement of the Multilabel Multiclass Classification with a Huge Number of Classes

Martha Tatusch
Institute of Computer Science, Heinrich Heine University Düsseldorf, D-40225 Düsseldorf, Germany
tatusch@cs.uni-duesseldorf.de

ABSTRACT

In the field of data analysis, multilabel multiclass classification is still a major problem in the case of a large number of classes. With the help of deep learning methods, impressive information can be extracted from a wide variety of data. For example, people can be recognized in images and videos, or fonts can be imitated. Nevertheless, these algorithms also encounter limitations. One of these limits when classifying objects is the treatment of multiple classes. For example, if an image is supposed to be described in a few keywords with the help of a dictionary, there are countless words that can be selected, but only very few that apply to the object. Another aggravating fact is that the number of words per image is not fixed. This paper presents two basic approaches to improve the classification accuracy with neural networks compared to a common approach. One strategy describes a parallel model that requires clustered label sets. For this purpose, different distributions are considered. In the second approach, the effects of different loss functions are investigated. It is shown that the presented approaches obtain a very significant improvement of the results compared to the basic model. Both approaches show an improvement of at least 400%. The parallel architecture even achieves 31 times better results than the basic model. We also show under which conditions the individual approaches can achieve the most effective enhancement of quality.

Categories and Subject Descriptors

I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search; H.2.8 [Database Management]: Database Applications—Data Mining; I.4.m [Image Processing and Computer Vision]: Miscellaneous—Image Classification

Keywords

Neural Networks, Image Processing, Artificial Intelligence, Classification, Information Retrieval

30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany. Copyright is held by the author/owner(s).

1. INTRODUCTION

Today, Deep Learning and Artificial Neural Networks are widespread terms, especially in the fields of information technology and data science. A few years ago, these methods were launched and immediately met with great enthusiasm. They stand for a specific concept of machine learning (ML) in which a machine learns by itself and opens up new knowledge only on the basis of training data that does not necessarily have to be preprocessed. This discovery was a major breakthrough in the field of data science because there finally was a way to avoid the difficulties of the feature selection that would otherwise be required in ML. Although there already was a wide variety of classifiers[10] that could learn from training data, the developer always had to manually explore which features of the objects were meaningful and extract them beforehand, so that the human being still had a great influence on the quality of the results.

In Deep Learning, the relevant features are determined and processed automatically. The used construct is an artificial neural network with multiple layers between the input and the output. With these networks it is possible to find correlations between data that cannot be readily grasped by the human mind. In addition, problems that seem simple for humans but are difficult on a programmatic level, such as the artificial generation of realistic images or fonts, and the generation of meaningful answers to freely formulated questions, can be solved.

But these models also have their weaknesses. When classifying objects, large amounts of training data are required so that each class – also called label – can be learned with a moderate number of representatives. If we now want to label a collection of different images – for example, patient images of a hospital – thousands of different words are possible. The number of possible words can of course be limited, for example by choosing a subject area, but the number of possibilities will still be large. This means that the number of images per class is on average very low. The use of classifiers that require a previous feature selection is not possible, since there are no recognizable consistent properties of relevance that can be extracted. This only leaves the possibility of using deep neural networks. Due to the large number of classes, however, this task also represents a great challenge in Deep Learning, which is dealt with in this paper.
2. APPROACHES

Multilabel multiclass classification describes the task of classifying data into classes, whereby there are a lot of classes and each data point can be assigned to any number of classes. This work deals with the situation in which each input object is assigned to only a fraction of the possible classes. Let |C| be the number of possible classes and |C_o| the number of classes assigned to an object o. A particular difficulty in dealing with this problem using machine learning algorithms is that the ratio |C_o| / |C| is a very small value, so the system has difficulties to learn sensibly.

For example, it is possible that the network adapts itself to assign objects to no class at all. This can be explained by the fact that the hit rate is mainly influenced by the classes which are not assigned to the considered object. The accuracy metric is calculated by the formula

accuracy = number of correctly classified classes / number of all classes.

During the learning process, many systems strive to maximize this value. Suppose there are 1 000 classes and one object is assigned to exactly 5 of these classes. If a classifier does not assign the object to any class, it will have an accuracy of (1000 − 5) / 1000 = 995 / 1000 = 99.5%. There is a high probability that this value will deteriorate if the system tries to find the correct classes, which can cause incorrect assignments. Since the accuracy of non-assignment is still very close to 100%, this method proves to be the best option for the network. For humans, however, this approach does not make sense. The aim is, of course, to make the assignment as accurate as possible, but also to assure that an assignment will be made. This means that in this situation it is much more important to identify the associated classes than to prevent other classes from being incorrectly assigned to an object.

Figure 1: Example of the parallel execution of multiple CNNs on clustered label sets. The final result is calculated by combining the individual results using the indicator method. The rectangles marked in gray represent classes into which the sample object has been classified.

2.1 Parallel Network Architecture

Since the high number of classes is the greatest difficulty, it is very likely that splitting the problem into several smaller sub-problems can improve the results. The division into several easier problems can be achieved by clustering[10] the set of labels. If then, for each cluster, one separate net is trained, the number of classes gets considerably lower and the ratio |C_avg| / |C| increases. The numerator |C_avg| = (1 / N) · Σ_o |C_o|, the average number of labels over all N training objects, stands for the average number of labels per training object, and the denominator |C| for the number of all possible classes. Now, the question is how to divide the classes in order to achieve the best possible results.

Since often nothing is known about the labels other than their names, properties must be determined with which the clustering can be performed. One possibility is to look at the independent occurrences of the different classes. Labels that are assigned to a similar number of data objects would then be placed in the same cluster. It is unlikely that only labels with similar occurrences will be assigned to the same object. This means that often multiple clusters contain labels that belong to one data object. This increases the probability of correct assignment. Another possibility is to divide the labels randomly into several groups.

In contrast to classification, which falls under the term supervised learning, clustering belongs to unsupervised learning. This means that objects are grouped without knowing the classes in advance. Therefore there is no training data, since no information about class affiliation is known. In this work, the clustering by occurrences is done with the KMeans algorithm. The parameter K represents the number of clusters to be calculated. First, K random data points of the training set are selected as the centers of the individual clusters. These are called centroids. All objects are then assigned to the cluster whose centroid is closest to the object. The distance is usually calculated with the Euclidean distance. Now the centers are recalculated by computing the mean value of all data points of the respective cluster. All data points are then reassigned and the resulting centroids are recalculated. This is repeated until the assignment of the data objects does not show any changes anymore.
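As an illustration, the occurrence-based label clustering can be sketched as follows. This is not the original implementation: it assumes a binary label matrix Y (objects × labels), uses scikit-learn's MiniBatchKMeans (the implementation mentioned in Section 3.1), and all variable and function names are chosen for the example only.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_labels_by_occurrence(Y, n_clusters=50, seed=0):
    """Group labels into clusters based on their occurrence counts.

    Y is assumed to be a binary matrix of shape (n_objects, n_labels),
    where Y[o, l] = 1 if object o carries label l.
    Returns a list of label-index arrays, one per non-empty cluster.
    """
    occurrences = Y.sum(axis=0).reshape(-1, 1)   # one feature per label: its frequency
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed)
    assignment = km.fit_predict(occurrences)     # cluster id for every label
    return [np.where(assignment == c)[0] for c in np.unique(assignment)]

# Toy example: 1 000 objects with 2 000 labels, roughly 5 labels per object.
rng = np.random.default_rng(0)
Y = (rng.random((1000, 2000)) < 0.0025).astype(int)
label_groups = cluster_labels_by_occurrence(Y, n_clusters=50)
print(len(label_groups), [len(g) for g in label_groups[:5]])
```

Random (disjoint or redundant) label groups can be generated analogously by shuffling the label indices instead of clustering them.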
With the help of the determined clusters, the classification problem can now be broken down. We consider a construction in which a CNN is trained separately for each calculated label set. When the model is executed, all CNNs are evaluated in parallel. Here, parallelism does not refer to the temporal context, but symbolizes the fact that all CNNs are used for testing at the same level. All resulting assignments are finally merged and contribute with the same importance to the final result.

Figure 1 illustrates an example of a model that can classify into 16 classes. The individual clusters contain different numbers of labels and some overlaps. In the leftmost cluster, for example, a network is used that can categorize into the classes 1, 3, 4, 7, 9, 11, 12 and 14. In this example, it has chosen the labels 3 and 7. In the final result, the outcomes of all CNNs are considered equally.

There are several ways to combine the results. For example, all classes selected by at least one CNN can be considered assigned in the final result. It is also possible to use a majority voting system or an average value. In the case of majority voting, this means that for each label all results of the clusters containing it are considered. Only if the majority has assigned the object to this class is it also selected in the final result. For the calculation of the average values, either the binary or the unrounded predicted values can be used. The consideration of the decimal values is more suitable, since a prediction of 0.51 for a class would already result in a 1 in the binary case, which would flow into the average much more strongly than a 0.51. Finally, the average value itself is rounded, whereby in the binary case a "double rounding" would result, which can falsify the result.
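The three merge strategies (indicator, majority voting, average) can be sketched as follows. This is a minimal illustration under the assumption that each cluster's CNN returns a probability vector for its own label subset; the function and variable names are hypothetical.

```python
import numpy as np

def merge_predictions(cluster_labels, cluster_probs, n_labels, strategy="indicator"):
    """Combine per-cluster predictions into one binary assignment vector.

    cluster_labels: list of index arrays, the label subset handled by each CNN.
    cluster_probs:  list of probability vectors, aligned with cluster_labels.
    """
    votes = [[] for _ in range(n_labels)]          # collect per-label predictions
    for labels, probs in zip(cluster_labels, cluster_probs):
        for label, p in zip(labels, probs):
            votes[label].append(p)

    result = np.zeros(n_labels, dtype=int)
    for label, ps in enumerate(votes):
        if not ps:
            continue
        if strategy == "indicator":                # assigned if any CNN predicts it
            result[label] = int(any(p >= 0.5 for p in ps))
        elif strategy == "majority":               # assigned if most CNNs predict it
            result[label] = int(sum(p >= 0.5 for p in ps) > len(ps) / 2)
        elif strategy == "average":                # round the mean of the raw probabilities
            result[label] = int(np.mean(ps) >= 0.5)
    return result

# Two overlapping clusters over 6 labels.
clusters = [np.array([0, 2, 4]), np.array([2, 3, 5])]
probs = [np.array([0.1, 0.7, 0.4]), np.array([0.3, 0.9, 0.6])]
print(merge_predictions(clusters, probs, 6, strategy="average"))
```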
2.2 Propensity Loss Function

A further approach to improving the results of a multilabel multiclass CNN, which has nothing to do with the construction of the model and the clustering of classes, is to adjust the loss function. If it is set up in such a way that, for example, false negatives are penalized strongly and false positives hardly at all, this would already have a great influence on the learning process of the classifier and would prevent objects from not being classified at all.

The learning process of a convolutional neural network requires a loss and an optimization function. Depending on the resulting error value of a run, all weights of the CNN are adjusted during backpropagation. A frequently used loss function in multilabel classification is the binary cross entropy. It is calculated by H(p, q) = −Σ_x p(x) · log(q(x)), where p(x) stands for the actual probability and q(x) for the calculated probability that the considered object belongs to class x. The resulting probabilities are rounded, so that p(x) and q(x) can only have values of 0 or 1. The largest costs are incurred if the network does not classify into the class to which the object in question belongs. If it assigns the data point to a class to which it does not belong, no costs are caused by the object.

In [5] a new type of loss functions is introduced. It is primarily designed for multiclass classification with an enormous number of classes. According to [5], the functions prioritize the assignment to the correct classes and promote classification to rarely occurring labels. Their special characteristic is the relation to the propensity of the individual labels.

The Hamming Loss is cited as a bad example for a loss function for the multiclass problem. For a model M, it is defined by

HL(M) = (1 / (N · L)) · Σ_{i=1}^{N} Σ_{j=1}^{L} (y_{i,j} − ŷ_{i,j})²,    (1)

with N the number of data points, L the number of labels, y_{i,j} the actual assignment of an object i to class j, and ŷ_{i,j} the predicted assignment of an object i to class j. Because of the squared difference, the model is punished for both false negatives and false positives. In addition, the costs for all individual class assignments are calculated in the same way, as is usual in most cases. In an unbalanced dataset, however, there may be labels that have very few representatives but are nevertheless as important as frequently represented labels. These are easily overlooked during the training, because the probability of an incorrect assignment is significantly lower than for classes that belong to many data points. Even a correct classification to such a minority class does not have much influence on the training result, as this happens so rarely that the relevant weights hardly get changed.

For this reason, a cost function which treats each label individually and calculates weighted costs is desirable. In [5] a type of loss functions is presented which is based on propensity values that can be calculated from subjective relevance ratings. The developer can assign relevance values to the individual classes, which are then incorporated into the cost function. Since this paper assumes that the relevance ratings of the different labels are not known, another variant is used that was presented in the same paper and is independent of subjective evaluations.

Based on previous observations, the authors have decided that the propensity of a class can be represented by a sigmoid function. For a label l with unknown relevance value the propensity p_l is calculated by

p_l = 1 / (1 + log(N − 1) · 1.4 · e^(−0.5 · log(N_l + 0.4))),    (2)

where N_l represents the number of data objects that contain the label l and N stands for the number of all training objects. The values for the optimization parameters are the same as those chosen in the paper.
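As an illustration (not the original code), the propensity scores can be computed directly from the label frequencies following Eq. (2) as reconstructed above; the helper name is hypothetical.

```python
import numpy as np

def propensity_scores(label_counts, n_train):
    """Propensity p_l per label, following Eq. (2).

    label_counts: array with N_l, the number of training objects that carry label l.
    n_train:      N, the total number of training objects.
    """
    label_counts = np.asarray(label_counts, dtype=float)
    return 1.0 / (1.0 + np.log(n_train - 1) * 1.4 * np.exp(-0.5 * np.log(label_counts + 0.4)))

# Frequent labels get a propensity close to 1, rare labels a small one.
print(propensity_scores([1, 10, 100, 1000], n_train=10000))
```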
In [5] the integration of propensity scores into different known loss functions was presented. In this work, the decision was made to use an adapted version of the Hamming Loss function:

HL(M) = (1 / N) · Σ_{i=1}^{N} Σ_{j=1}^{L} (1 / p_j) · (2ŷ_{i,j} − 1) · (y_{i,j} − ŷ_{i,j})².    (3)

The subterm (2ŷ_{i,j} − 1) has the function of an indicator that checks whether the object i has been assigned to class j or not. In the binary case, it is 1 if the object has been classified into the observed class, and −1 if it has not. As a result, predictions in which i has incorrectly been assigned to class j are punished, and those in which an object has wrongly not been assigned to the class are rewarded. By positioning the propensity in the denominator of the fraction, misclassifications to labels with high propensity are weighted less than those to the rarely occurring ones. Except for this factor, nothing has changed in the original Hamming Loss function.

As one of the problems discussed here is that the classifier possibly learns not to classify at all, it is not advisable to take the formula from [5] unchanged. The indicator function only punishes false positives and even rewards false negatives. As a result, the likelihood that the network does not make a classification at all rather than misclassifying an object is increased. In this work, it makes more sense to use a loss function that punishes false positives and false negatives equally or possibly prefers false negatives. In any case, however, incorrect allocations must increase the error value. For this reason, the absolute value of the indicator function is used in the following process:

HL(M) = (1 / N) · Σ_{i=1}^{N} Σ_{j=1}^{L} (1 / p_j) · |2ŷ_{i,j} − 1| · (y_{i,j} − ŷ_{i,j})².    (4)

This ensures that all incorrect classifications are treated in the same way.
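To make the adapted loss concrete, the following is a minimal Keras-style sketch of Eq. (4). It is an illustration, not the code used in the paper: it assumes per-label propensities have been precomputed (for example with the helper sketched earlier), assumes the network outputs sigmoid probabilities, and keeps the squared difference unrounded so that gradients can flow.

```python
import numpy as np
import tensorflow as tf

def make_propensity_hamming_loss(propensities):
    """Return a Keras-compatible loss implementing Eq. (4).

    propensities: array of shape (L,) with p_j for every label j.
    """
    inv_p = tf.constant(1.0 / np.asarray(propensities, dtype=np.float32))

    def loss(y_true, y_pred):
        y_hat = tf.round(y_pred)                    # binary prediction ŷ
        indicator = tf.abs(2.0 * y_hat - 1.0)       # |2ŷ - 1|, always 1 in the binary case
        squared_error = tf.square(y_true - y_pred)  # (y - ŷ)², unrounded here by choice
        return tf.reduce_mean(tf.reduce_sum(inv_p * indicator * squared_error, axis=-1))

    return loss

# Hypothetical usage:
# model.compile(optimizer="adam",
#               loss=make_propensity_hamming_loss(propensity_scores(label_counts, n_train)))
```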
3. REALISATION

Before creating the model, the input data has to be preprocessed. All images are mapped to the RGB color space. Since the net expects a fixed image size, a squared size of 800×800 pixels has been chosen. If necessary, the image is enlarged by adding black borders, i.e. by adding zero values on the sides. If the image is too big, it is scaled down by means of interpolation until the largest side length is 800 pixels. The smaller side is then evenly filled with zero values from both sides.

3.1 Clustering

In order to accomplish the approach of the parallel model, the first step is to cluster the label set. In this work, 50 clusters were requested in every case. The determination of random label groups is self-explanatory. The result is a distribution of the classes that is similar to a uniform distribution. This balanced arrangement is illustrated in Figure 2a. The smallest cluster contains 28 and the largest 52 labels. All label groups are therefore relatively small.

Although the used MiniBatchKMeans implementation of scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html) receives a desired number of clusters as parameter, it only creates as many clusters as actually make sense. The effect of this is that during clustering by occurrences 39 label groups with 8 to 391 classes are created. Besides a few exceptions, these are again relatively small clusters. Even the largest number of labels is more than four times smaller than the total amount of classes and therefore represents a significant decimation of the label set. Nevertheless, the distribution is very heterogeneous. The differences between the label distributions of the random clusters and the clustering by occurrences become clear in Figure 2, as the Y-axis is scaled the same in both cases.

Figure 2: Distributions of the labels with different clusterings: (a) Random, (b) By Occurrences. A dataset of 2 000 labels has been used.

Both the random distribution and the clustering by occurrences generate disjoint label groups. Since it is interesting to see which effect it would have if the clusters showed overlaps, an additional distribution of the labels has been made. The labels were randomly distributed into 50 clusters, with each label being assigned to a maximum of 5 clusters.

3.2 Architecture

Due to the problem of determining a suitable Convolutional Neural Network architecture and the usually time-consuming training sessions of a network, it is advisable to use a pre-trained network which has already achieved convincing results on similar data. In [9] several strongly resembling architectures for Deep Convolutional Neural Networks are presented. They were developed as part of the ImageNet Challenge 2014[8]. The VGG16 net achieved the best results with a depth of 16 trainable layers.

Since both the input data and the required output differ from the original architecture, the model needs to be slightly modified. The entire chosen section of the architecture includes 14 714 688 pretrained weights. These can be set untrainable so that only the weights which have been added by the own layers are trained. In Figure 3 the final architecture used in all cases is displayed. Since the weights of the VGG16 block were no longer trained, the output of this part of the net could be calculated once and reused to increase efficiency.

Figure 3: Uniformly used CNN architecture. The grayed-out part was computed only once.

Since the result of this area still contained 512 channels, the idea arose to pool the result. When using the VGG16 block without pooling, it was noticeable that the output sometimes contained a lot of zeros. For this reason, pooling over the maximum makes sense to reduce the number of zeros. Since it is not the size of the feature maps but the number of channels that should be reduced, we wrote a custom layer. It is named ChannelsMaxPooling layer and pools each pixel one-dimensionally over a given number of feature maps. It can be found on GitHub (http://github.com/tatusch/ChannelsMaxPoolingLayer). In this work a filter size of 32 and a step size of 8 were used. This means that the sliding window covers 32 channels and is moved in steps of 8 across all channels. The number of feature maps is thereby reduced from 512 to (512 − 32) / 8 + 1 = 61.

The white components in Figure 3 must be trained for each approach. The second dense layer generates an N-dimensional vector, where N stands for the number of classes the CNN considers. Unlike all other layers, it does not have ReLU as an activation function, but Sigmoid. By using this activation function, all values of the resulting vector are normalized to the interval [0, 1] and correspond to the probabilities of an assignment to the respective class. Between the two dense layers a dropout layer is applied, which randomly rejects 20% of the tensor values to prevent overfitting.
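A rough sketch of the trainable head described above is given below. It is an illustrative reconstruction under several assumptions: the hidden width of 512 in the first dense layer is hypothetical (the exact sizes are not stated in the text), and the channel-wise max pooling is expressed with standard Keras layers (reshape, permute, MaxPooling1D) rather than the author's custom ChannelsMaxPooling layer.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_head(n_classes, feature_shape=(25, 25, 512), hidden_units=512):
    """Trainable head on top of the precomputed (frozen) VGG16 features."""
    features = layers.Input(shape=feature_shape)
    # Channel-wise max pooling: a window of 32 channels moved in steps of 8,
    # reducing 512 channels to (512 - 32) / 8 + 1 = 61.
    x = layers.Reshape((feature_shape[0] * feature_shape[1], feature_shape[2]))(features)
    x = layers.Permute((2, 1))(x)                       # put the channel axis first
    x = layers.MaxPooling1D(pool_size=32, strides=8)(x) # slide over the channel axis
    x = layers.Permute((2, 1))(x)
    x = layers.Reshape((feature_shape[0], feature_shape[1], 61))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(hidden_units, activation="relu")(x)
    x = layers.Dropout(0.2)(x)                          # randomly reject 20% of the values
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)
    return Model(features, outputs)

# The frozen VGG16 convolutional base, used to precompute the features once.
vgg = VGG16(include_top=False, weights="imagenet", input_shape=(800, 800, 3))
vgg.trainable = False
head = build_head(n_classes=40)   # e.g. one CNN for a cluster of 40 labels
```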
4. EXPERIMENTAL RESULTS

The dataset used for the evaluation was provided during ImageCLEF 2017[6]. All images come from the medical field, but can vary greatly in size, color coding and content. For example, there are images of wounds, patients, CT scans and maps of areas where a disease has spread. The meanings of the labels can be looked up in the Unified Medical Language System (UMLS, https://www.nlm.nih.gov/research/umls/). They, however, were not taken into account in this work.

The dataset contains 164 614 training and 10 000 test images in different formats and a total of 20 812 labels. The test results of the competition listed in [3] clearly show that a precise classification of the objects is a difficult problem. The best achieved average F1 score is only 15.83%. The next 10 places are occupied with values of 12 to 14%. With additional external resources, a maximum F1 score of 17.18% was achieved. All these values are far away from a score which could be considered a precise classification.

Due to limited technical resources, it was necessary to reduce the total amount of labels to 2 000 labels, which were chosen randomly. Although this is an enormous decimation of the number of classes, the dataset is still large enough to represent the described problem. The decimation of the label set causes the reduced dataset to contain in total 89 113 training images and 5 451 test images.

Since this selected dataset included a lot of labels with very few representatives, another subset of the dataset was chosen for comparison. This time the 1 000 most common labels were selected. All these labels are represented by at least 139 training images. The resulting dataset includes 150 339 training and 9 201 test images.

Table 1: Comparison of the achieved precision, recall and F1 score values with different clusterings and loss functions on the two datasets. The results of the redundant clusters with the Binary Cross Entropy were calculated using the indicator and with the Propensity Loss using the average method.

Dataset              Clustering           Loss Function         Precision   Recall     F1-Score
2000-Labels Dataset  None                 Binary Cross Entropy  0.000206    0.000900   0.000336
                                          Propensity Loss       0.001118    0.461154   0.002230
                     Random               Binary Cross Entropy  0.005398    0.282772   0.010594
                                          Propensity Loss       0.002225    0.455054   0.004429
                     Random (redundant)   Binary Cross Entropy  0.002082    0.225877   0.004126
                                          Propensity Loss       0.001842    0.431557   0.003669
                     By Occurrences       Binary Cross Entropy  0.000709    0.066093   0.001403
                                          Propensity Loss       0.000475    0.188481   0.000948
1000-Labels Dataset  None                 Binary Cross Entropy  0.002608    0.001677   0.002042
                                          Propensity Loss       0.005148    0.201733   0.010040
                     Random               Binary Cross Entropy  0.013830    0.151201   0.025343
                                          Propensity Loss       0.007573    0.209422   0.014618

In Table 1 the results of the models with different clusterings and loss functions on the two datasets are displayed. All metrics are calculated by the formulas used in [1].
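For reference, precision, recall and F1 on rounded predictions follow the usual definitions; the following is a minimal sketch under that assumption, not the exact source referenced in [1].

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, eps=1e-7):
    """Micro-averaged precision, recall and F1 on rounded predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.round(np.asarray(y_pred, dtype=float))
    true_positives = np.sum(y_true * y_hat)
    precision = true_positives / (np.sum(y_hat) + eps)
    recall = true_positives / (np.sum(y_true) + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# Toy example: 2 objects, 4 labels.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[0.9, 0.2, 0.1, 0.6], [0.1, 0.8, 0.3, 0.2]]
print(precision_recall_f1(y_true, y_pred))
```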
It becomes clear that all results are anything but good. The basic network with all labels reaches only an F1 score of 0.000336 on the 2000-Labels Dataset and 0.002042 on the 1000-Labels Dataset. In relative terms, however, a strong improvement was achieved by the different approaches. Regarding the F1 score – the most meaningful measure for this task – the best results have been obtained with the parallel architecture, the random disjoint clusters and the binary cross entropy. The resulting F1 score is nearly eight times larger than that of the clusters by occurrences. Random redundant clustering achieves the second-best values and is still much better than the worst clustering. Nevertheless, its results are considerably worse than those of the other alternative. This fact once again confirms the assumption that a network achieves poorer results the more classes it has to consider.

In order to determine which merge method is the most suitable, the individual merge strategies have been evaluated on the 2000-Labels Dataset. The results are shown in Table 2. As can be seen, for the binary cross entropy the best strategy is given by the indicator method. This outcome was expected, as the assignment rate with this loss function is very low and gets supported by the indicator function.

Since the propensity loss already promotes the allocation rate itself, the best results are achieved with the average method. This result can also easily be explained. Since the loss function increases the number of assignments, the false positive rate is increased as well. To reduce this effect, the most appropriate results must be selected. Since both the average and the majority method provide this functionality, both are suitable for the propensity loss. Table 2 shows that their results are very close to each other. In Table 1 the results of the most suitable strategies are displayed.

Table 2: Comparison of the achieved results with redundant clusters using different merge strategies on the 2000-Labels Dataset.

Method      Binary Cross Entropy (R / F1)    Propensity Loss (R / F1)
Indicator   0.225877 / 0.004126              0.642536 / 0.002345
Average     0.053195 / 0.003831              0.431557 / 0.003669
Majority    0.056994 / 0.003796              0.433857 / 0.003649

The usage of the propensity loss function improves the results of the basic network by around 400% on both datasets, as can be seen in Table 1. Particularly noticeable is the improvement of the recall. On the 2000-Labels Dataset a recall of 0.4612 is achieved. This can be explained by the promotion of classification, which causes a decrease of the false negative rate. In combination with the parallel architecture, a significant improvement of the results compared with the basic model is achieved as well, but it becomes clear that in all cases the parallel approach scores better without the propensity loss function.
Furthermore, it can be said that on both datasets the results of the basic model using the propensity loss are outperformed by the parallel approach with random clusters (disjoint as well as redundant).

5. RELATED WORK

To the present day, there are a few approaches for multilabel and multiclass classification[7][2][12][13][4], but only few publications work on the combination of the two tasks. In addition, the difficult facts that the number of labels per object is not fixed and that the number of classes is very high are usually not taken into account. In [11] a classifier is presented that deals with the multilabel multiclass classification and uses association rules[10] to classify the input objects. The authors achieve very good accuracy results; however, the solution is not suitable for a classification with a large number of classes.

6. CONCLUSION

It has been shown that the division of the original label set into smaller label groups and the parallel execution of multiple CNNs on separate clusters significantly improves the results compared with a basic network which deals with all labels at a time. Depending on the choice of the clustering, the quality can be improved even further. On the used datasets, the use of clusters by occurrences has not been very advantageous. Significantly better results were achieved with random disjoint clusters. The artificially generated redundancy of the label sets reduces the quality of the results, as this again leads to an increase in the average cluster size.

The customized propensity loss function reveals a strong improvement of the results as well. The usage of this function makes sense particularly if the assignment rate is very low. If the structure of the model already achieves a relatively high assignment rate, it has been shown that the propensity loss can also reduce the quality of the results. For this reason, it should not be assumed that the propensity loss always leads to an improvement in quality.

7. FUTURE WORK

Unfortunately the usage of association rules for the clustering of the labels was not possible on the datasets considered here. Due to the diversity of the data, it was not possible to find suitable parameters that covered most of the labels and did not produce rules that only appeared once or twice. Rules that occur so rarely are meaningless and therefore not useful. For future research, it would be interesting to look at a different set of data and examine the effects of such a clustering, especially since in [11] it was shown that association rules can achieve convincing results.

Another interesting aspect would be the development of a hierarchical model, in whose uppermost levels it is decided which cluster the input object can be assigned to. A classification to the concrete labels is only made at the lowest level. It can also be promising to perform backpropagation across all levels so that the first levels can learn from the final results.

Principally, the multilabel multiclass classification with a large number of classes remains a difficult problem that can be investigated extensively in the future.

8. REFERENCES
[1] F. Chollet. Metrics File in Keras' GitHub Repository. https://github.com/keras-team/keras/blob/ac1a09c787b3968b277e577a3709cd3b6c931aa5/keras/metrics.py. Accessed: 2018-04-06.
[2] O. Dekel and O. Shamir. Multiclass-Multilabel Classification with More Classes than Examples. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[3] C. Eickhoff, I. Schwall, A. G. S. de Herrera, and H. Müller. Overview of ImageCLEFcaption 2017 - Image Caption Prediction and Concept Detection for Biomedical Images. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, 2017.
[4] D. J. Hsu, S. M. Kakade, J. Langford, and T. Zhang. Multi-Label Prediction via Compressed Sensing. In Advances in Neural Information Processing Systems 22, 2009.
[5] H. Jain, Y. Prabhu, and M. Varma. Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[6] H. Müller, P. Clough, T. Deselaers, and B. Caputo. ImageCLEF: Experimental Evaluation in Visual Information Retrieval. 2010.
[7] M.-E. Nilsback and A. Zisserman. Automated Flower Classification over a Large Number of Classes. In Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.
[9] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Computing Research Repository.
[10] P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining: Pearson New International Edition. 2013.
[11] F. A. Thabtah, P. I. Cowling, and Y. Peng. MMAC: A New Multi-Class, Multi-Label Associative Classification Approach. In ICDM, 2004.
[12] L. Wang, M. Chang, and J. Feng. Parallel and Sequential Support Vector Machines for Multi-Label Classification. In International Journal of Information Technology, 2005.
[13] T. Zhang. Class-size Independent Generalization Analysis of Some Discriminative Multi-Category Classification. In Advances in Neural Information Processing Systems 17, 2004.