=Paper=
{{Paper
|id=Vol-2126/paper10
|storemode=property
|title=Approaches for the Improvement of the Multilabel Multiclass Classification with a Huge Number of Classes
|pdfUrl=https://ceur-ws.org/Vol-2126/paper10.pdf
|volume=Vol-2126
|authors=Martha Tatusch
|dblpUrl=https://dblp.org/rec/conf/gvd/Tatusch18
}}
==Approaches for the Improvement of the Multilabel Multiclass Classification with a Huge Number of Classes==
Approaches for the Improvement of the Multilabel Multiclass Classification with a Huge Number of Classes

Martha Tatusch
Institute of Computer Science, Heinrich Heine University Düsseldorf, D-40225 Düsseldorf, Germany
tatusch@cs.uni-duesseldorf.de

ABSTRACT

In the field of data analysis, multilabel multiclass classification is still a major problem in the case of a large number of classes. With the help of deep learning methods, impressive information can be extracted from a wide variety of data. For example, people can be recognized in images and videos, or fonts can be imitated. Nevertheless, these algorithms also encounter limitations. One of these limits when classifying objects is the treatment of multiple classes. For example, if an image is supposed to be described in a few keywords with the help of a dictionary, there are countless words that can be selected, but only very few that apply to the object. Another aggravating fact is that the number of words per image is not fixed. This paper presents two basic approaches to improve the classification accuracy with neural networks compared to a common approach. One strategy describes a parallel model that requires clustered label sets. For this purpose, different distributions are considered. In the second approach, the effects of different loss functions are investigated. It is shown that the presented approaches obtain a very significant improvement of the results compared to the basic model. Both approaches show an improvement of at least 400%. The parallel architecture even achieves 31 times better results than the basic model. We also show under which conditions the individual approaches can achieve the most effective enhancement of quality.

Categories and Subject Descriptors

I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search; H.2.8 [Database Management]: Database Applications—Data Mining; I.4.m [Image Processing and Computer Vision]: Miscellaneous—Image Classification

Keywords

Neural Networks, Image Processing, Artificial Intelligence, Classification, Information Retrieval

30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany. Copyright is held by the author/owner(s).

1. INTRODUCTION

Today, Deep Learning and Artificial Neural Networks are widespread terms, especially in the fields of information technology and data science. A few years ago, these methods were launched and immediately met with great enthusiasm. They stand for a specific concept of machine learning (ML) in which a machine learns by itself and opens up new knowledge only on the basis of training data that does not necessarily have to be preprocessed. This discovery was a major breakthrough in the field of data science because there finally was a way to avoid the difficulties of the feature selection that would otherwise be required in ML. Although there already was a wide variety of classifiers[10] that could learn from training data, the developer always had to manually explore which features of the objects were meaningful and extract them beforehand, so that the human being still had a great influence on the quality of the results.

In Deep Learning, the relevant features are determined and processed automatically. The used construct is an artificial neural network with multiple layers between the input and the output. With these networks it is possible to find correlations between data that cannot be readily grasped by the human mind. In addition, problems that seem simple for humans but are difficult on a programmatic level, such as the artificial generation of realistic images or fonts, and the generation of meaningful answers to freely formulated questions, can be solved.

But these models also have their weaknesses. When classifying objects, large amounts of training data are required so that each class – also called label – can be learned with a moderate number of representatives. If we now want to label a collection of different images – for example, patient images of a hospital – thousands of different words are possible. The number of possible words can of course be limited, for example by choosing a subject area, but the number of possibilities will still be large. This means that the number of images per class is on average very low. The use of classifiers that require a previous feature selection is not possible, since there are no recognizable consistent properties of relevance that can be extracted. This only leaves the possibility of using deep neural networks. Due to the large number of classes, however, this task also represents a great challenge in Deep Learning, which is dealt with in this paper.
2. APPROACHES

Multilabel multiclass classification describes the task of classifying data into classes, whereby there are a lot of classes and each data point can be assigned to any number of classes. This work deals with the situation in which each input object is assigned to only a fraction of the possible classes. Let |C| be the number of possible classes and |C_o| the number of classes assigned to an object o. A particular difficulty in dealing with this problem using machine learning algorithms is that the ratio |C_o| / |C| is a very small value, so the system has difficulties to learn sensibly.

For example, it is possible that the network adapts itself to assign objects to no class at all. This can be explained by the fact that the hit rate is mainly influenced by the classes which are not assigned to the considered object. The accuracy metric is calculated by the formula

accuracy = number of correctly classified classes / number of all classes.

During the learning process, many systems strive to maximize this value. Suppose there are 1 000 classes and one object is assigned to exactly 5 of these classes. If a classifier does not assign the object to any class, it will have an accuracy of (1000 − 5) / 1000 = 995 / 1000 = 99.5%. There is a high probability that this value will deteriorate if the system tries to find the correct classes, which can cause incorrect assignments. Since the accuracy of non-assignment is still very close to 100%, this method proves to be the best option for the network. For humans, however, this approach does not make sense. The aim is, of course, to make the assignment as accurate as possible, but also to assure that an assignment will be made. This means that in this situation it is much more important to identify the associated classes than to prevent other classes from being incorrectly assigned to an object.

Figure 1: Example of the parallel execution of multiple CNNs on clustered label sets. The final result is calculated by combining the individual results using the indicator method. The rectangles marked in gray represent classes into which the sample object has been classified.

2.1 Parallel Network Architecture

Since the high number of classes is the greatest difficulty, it is very likely that splitting the problem into several smaller sub-problems can improve the results. The division into several easier problems can be achieved by clustering[10] the set of labels. If then, for each cluster, one separate net is trained, the number of classes gets considerably lower and the ratio |C_avg| / |C| increases. The numerator |C_avg| = (1 / N) · Σ_o |C_o|, the average number of labels over all N training objects, stands for the average number of labels per training object, and the denominator |C| for the number of all possible classes. Now, the question is how to divide the classes in order to achieve the best possible results.

Since often nothing is known about the labels other than their names, properties must be determined with which the clustering can be performed. One possibility is to look at the independent occurrences of the different classes. Labels that are assigned to a similar number of data objects would then be placed in the same cluster. It is unlikely that only labels with similar occurrences will be assigned to the same object. This means that often multiple clusters contain labels that belong to one data object. This increases the probability of correct assignment. Another possibility is to divide the labels randomly into several groups.

In contrast to classification, which falls under the term supervised learning, clustering belongs to unsupervised learning. This means that objects are grouped without knowing the classes in advance. Therefore there is no training data, since no information about class affiliation is known. In this work, the clustering by occurrences is done with the KMeans algorithm. The parameter K represents the number of clusters to be calculated. First, K random data points of the training set are selected as the centers of the individual clusters. These are called centroids. All objects are then assigned to the cluster whose centroid is closest to the object. The distance is usually calculated with the Euclidean distance. Now the centers are recalculated by computing the mean value of all data points of the respective cluster. All data points are then reassigned and the resulting centroids are recalculated. This is repeated until the assignment of the data objects does not show any changes anymore.
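As an illustration, the occurrence-based label clustering can be sketched as follows. This is not the original implementation: it assumes a binary label matrix Y (objects × labels), uses scikit-learn's MiniBatchKMeans (the implementation mentioned in Section 3.1), and all variable and function names are chosen for the example only.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_labels_by_occurrence(Y, n_clusters=50, seed=0):
    """Group labels into clusters based on their occurrence counts.

    Y is assumed to be a binary matrix of shape (n_objects, n_labels),
    where Y[o, l] = 1 if object o carries label l.
    Returns a list of label-index arrays, one per non-empty cluster.
    """
    occurrences = Y.sum(axis=0).reshape(-1, 1)   # one feature per label: its frequency
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed)
    assignment = km.fit_predict(occurrences)     # cluster id for every label
    return [np.where(assignment == c)[0] for c in np.unique(assignment)]

# Toy example: 1 000 objects with 2 000 labels, roughly 5 labels per object.
rng = np.random.default_rng(0)
Y = (rng.random((1000, 2000)) < 0.0025).astype(int)
label_groups = cluster_labels_by_occurrence(Y, n_clusters=50)
print(len(label_groups), [len(g) for g in label_groups[:5]])
```

Random (disjoint or redundant) label groups can be generated analogously by shuffling the label indices instead of clustering them.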
With the help of the determined clusters, the classification problem can now be broken down. We consider a construction in which a CNN is trained separately for each calculated label set. When the model is executed, all CNNs are evaluated in parallel. Here, parallelism does not refer to the temporal context, but symbolizes the fact that all CNNs are used for testing at the same level. All resulting assignments are finally merged and contribute with the same importance to the final result.

Figure 1 illustrates an example of a model that can classify into 16 classes. The individual clusters contain different numbers of labels and some overlaps. In the leftmost cluster, for example, a network is used that can categorize into the classes 1, 3, 4, 7, 9, 11, 12 and 14. In this example, it has chosen the labels 3 and 7. In the final result, the outcomes of all CNNs are considered equally.

There are several ways to combine the results. For example, all classes selected by at least one CNN can be considered assigned in the final result. It is also possible to use a majority voting system or an average value. In the case of majority voting, this means that for each label all results of the clusters containing it are considered. Only if the majority has assigned the object to this class is it also selected in the final result. For the calculation of the average values, either the binary or the unrounded predicted values can be used. The consideration of the decimal values is more suitable, since a prediction of 0.51 for a class would already result in a 1 in the binary case, which would flow into the average much more strongly than a 0.51. Finally, the average value itself is rounded, whereby in the binary case a "double rounding" would result, which can falsify the result.
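The three merge strategies (indicator, majority voting, average) can be sketched as follows. This is a minimal illustration under the assumption that each cluster's CNN returns a probability vector for its own label subset; the function and variable names are hypothetical.

```python
import numpy as np

def merge_predictions(cluster_labels, cluster_probs, n_labels, strategy="indicator"):
    """Combine per-cluster predictions into one binary assignment vector.

    cluster_labels: list of index arrays, the label subset handled by each CNN.
    cluster_probs:  list of probability vectors, aligned with cluster_labels.
    """
    votes = [[] for _ in range(n_labels)]          # collect per-label predictions
    for labels, probs in zip(cluster_labels, cluster_probs):
        for label, p in zip(labels, probs):
            votes[label].append(p)

    result = np.zeros(n_labels, dtype=int)
    for label, ps in enumerate(votes):
        if not ps:
            continue
        if strategy == "indicator":                # assigned if any CNN predicts it
            result[label] = int(any(p >= 0.5 for p in ps))
        elif strategy == "majority":               # assigned if most CNNs predict it
            result[label] = int(sum(p >= 0.5 for p in ps) > len(ps) / 2)
        elif strategy == "average":                # round the mean of the raw probabilities
            result[label] = int(np.mean(ps) >= 0.5)
    return result

# Two overlapping clusters over 6 labels.
clusters = [np.array([0, 2, 4]), np.array([2, 3, 5])]
probs = [np.array([0.1, 0.7, 0.4]), np.array([0.3, 0.9, 0.6])]
print(merge_predictions(clusters, probs, 6, strategy="average"))
```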
2.2 Propensity Loss Function

A further approach to improving the results of a multilabel multiclass CNN, which has nothing to do with the construction of the model and the clustering of classes, is to adjust the loss function. If it is set up in such a way that, for example, false negatives are penalized strongly and false positives hardly at all, this would already have a great influence on the learning process of the classifier and would prevent objects from not being classified at all.

The learning process of a convolutional neural network requires a loss and an optimization function. Depending on the resulting error value of a run, all weights of the CNN are adjusted during backpropagation. A frequently used loss function in multilabel classification is the binary cross entropy. It is calculated by H(p, q) = −Σ_x p(x) · log(q(x)), where p(x) stands for the actual probability and q(x) for the calculated probability that the considered object belongs to class x. The resulting probabilities are rounded, so that p(x) and q(x) can only have values of 0 or 1. The largest costs are incurred if the network does not classify into the class to which the object in question belongs. If it assigns the data point to a class to which it does not belong, no costs are caused by the object.

In [5] a new type of loss functions is introduced. It is primarily designed for multiclass classification with an enormous number of classes. According to [5], the functions prioritize the assignment to the correct classes and promote classification to rarely occurring labels. Their special characteristic is the relation to the propensity of the individual labels.

The Hamming Loss is cited as a bad example for a loss function for the multiclass problem. For a model M, it is defined by

HL(M) = (1 / (N · L)) · Σ_{i=1}^{N} Σ_{j=1}^{L} (y_{i,j} − ŷ_{i,j})²,    (1)

with N the number of data points, L the number of labels, y_{i,j} the actual assignment of an object i to class j, and ŷ_{i,j} the predicted assignment of an object i to class j. Because of the squared difference, the model is punished for both false negatives and false positives. In addition, the costs for all individual class assignments are calculated in the same way, as is usual in most cases. In an unbalanced dataset, however, there may be labels that have very few representatives but are nevertheless as important as frequently represented labels. These are easily overlooked during the training, because the probability of an incorrect assignment is significantly lower than for classes that belong to many data points. Even a correct classification to such a minority class does not have much influence on the training result, as this happens so rarely that the relevant weights hardly get changed.

For this reason, a cost function which treats each label individually and calculates weighted costs is desirable. In [5] a type of loss functions is presented which is based on propensity values that can be calculated from subjective relevance ratings. The developer can assign relevance values to the individual classes, which are then incorporated into the cost function. Since this paper assumes that the relevance ratings of the different labels are not known, another variant is used that was presented in the same paper and is independent of subjective evaluations.

Based on previous observations, the authors have decided that the propensity of a class can be represented by a sigmoid function. For a label l with unknown relevance value the propensity p_l is calculated by

p_l = 1 / (1 + log(N − 1) · 1.4 · e^(−0.5 · log(N_l + 0.4))),    (2)

where N_l represents the number of data objects that contain the label l and N stands for the number of all training objects. The values for the optimization parameters are the same as those chosen in the paper.
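As an illustration (not the original code), the propensity scores can be computed directly from the label frequencies following Eq. (2) as reconstructed above; the helper name is hypothetical.

```python
import numpy as np

def propensity_scores(label_counts, n_train):
    """Propensity p_l per label, following Eq. (2).

    label_counts: array with N_l, the number of training objects that carry label l.
    n_train:      N, the total number of training objects.
    """
    label_counts = np.asarray(label_counts, dtype=float)
    return 1.0 / (1.0 + np.log(n_train - 1) * 1.4 * np.exp(-0.5 * np.log(label_counts + 0.4)))

# Frequent labels get a propensity close to 1, rare labels a small one.
print(propensity_scores([1, 10, 100, 1000], n_train=10000))
```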
In [5] the integration of propensity scores into different known loss functions was presented. In this work, the decision was made to use an adapted version of the Hamming Loss function:

HL(M) = (1 / N) · Σ_{i=1}^{N} Σ_{j=1}^{L} (1 / p_j) · (2ŷ_{i,j} − 1) · (y_{i,j} − ŷ_{i,j})².    (3)

The subterm (2ŷ_{i,j} − 1) has the function of an indicator that checks whether the object i has been assigned to class j or not. In the binary case, it is 1 if the object has been classified into the observed class, and −1 if it has not. As a result, predictions in which i has incorrectly been assigned to class j are punished, and those in which an object has wrongly not been assigned to the class are rewarded. By positioning the propensity in the denominator of the fraction, misclassifications to labels with high propensity are weighted less than those to the rarely occurring ones. Except for this factor, nothing has changed in the original Hamming Loss function.

As one of the problems discussed here is that the classifier possibly learns not to classify at all, it is not advisable to take the formula from [5] unchanged. The indicator function only punishes false positives and even rewards false negatives. As a result, the likelihood that the network does not make a classification at all rather than misclassifying an object is increased. In this work, it makes more sense to use a loss function that punishes false positives and false negatives equally or possibly prefers false negatives. In any case, however, incorrect allocations must increase the error value. For this reason, the absolute value of the indicator function is used in the following process:

HL(M) = (1 / N) · Σ_{i=1}^{N} Σ_{j=1}^{L} (1 / p_j) · |2ŷ_{i,j} − 1| · (y_{i,j} − ŷ_{i,j})².    (4)

This ensures that all incorrect classifications are treated in the same way.
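To make the adapted loss concrete, the following is a minimal Keras-style sketch of Eq. (4). It is an illustration, not the code used in the paper: it assumes per-label propensities have been precomputed (for example with the helper sketched earlier), assumes the network outputs sigmoid probabilities, and keeps the squared difference unrounded so that gradients can flow.

```python
import numpy as np
import tensorflow as tf

def make_propensity_hamming_loss(propensities):
    """Return a Keras-compatible loss implementing Eq. (4).

    propensities: array of shape (L,) with p_j for every label j.
    """
    inv_p = tf.constant(1.0 / np.asarray(propensities, dtype=np.float32))

    def loss(y_true, y_pred):
        y_hat = tf.round(y_pred)                    # binary prediction ŷ
        indicator = tf.abs(2.0 * y_hat - 1.0)       # |2ŷ - 1|, always 1 in the binary case
        squared_error = tf.square(y_true - y_pred)  # (y - ŷ)², unrounded here by choice
        return tf.reduce_mean(tf.reduce_sum(inv_p * indicator * squared_error, axis=-1))

    return loss

# Hypothetical usage:
# model.compile(optimizer="adam",
#               loss=make_propensity_hamming_loss(propensity_scores(label_counts, n_train)))
```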
3. REALISATION

Before creating the model, the input data has to be preprocessed. All images are mapped to the RGB color space. Since the net expects a fixed image size, a squared size of 800×800 pixels has been chosen. If necessary, the image is enlarged by adding black borders, i.e. by adding zero values on the sides. If the image is too big, it is scaled down by means of interpolation until the largest side length is 800 pixels. The smaller side is then evenly filled with zero values from both sides.

3.1 Clustering

In order to accomplish the approach of the parallel model, the first step is to cluster the label set. In this work, 50 clusters were requested in every case. The determination of random label groups is self-explanatory. The result is a distribution of the classes that is similar to a uniform distribution. This balanced arrangement is illustrated in Figure 2a. The smallest cluster contains 28 and the largest 52 labels. All label groups are therefore relatively small.

Although the used MiniBatchKMeans implementation of scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html) receives a desired number of clusters as parameter, it only creates as many clusters as actually make sense. The effect of this is that during clustering by occurrences 39 label groups with 8 to 391 classes are created. Besides a few exceptions, these are again relatively small clusters. Even the largest number of labels is more than four times smaller than the total amount of classes and therefore represents a significant decimation of the label set. Nevertheless, the distribution is very heterogeneous. The differences between the label distributions of the random clusters and the clustering by occurrences become clear in Figure 2, as the Y-axis is scaled the same in both cases.

Figure 2: Distributions of the labels with different clusterings: (a) Random, (b) By Occurrences. A dataset of 2 000 labels has been used.

Both the random distribution and the clustering by occurrences generate disjoint label groups. Since it is interesting to see which effect it would have if the clusters showed overlaps, an additional distribution of the labels has been made. The labels were randomly distributed into 50 clusters, with each label being assigned to a maximum of 5 clusters.

3.2 Architecture

Due to the problem of determining a suitable Convolutional Neural Network architecture and the usually time-consuming training sessions of a network, it is advisable to use a pre-trained network which has already achieved convincing results on similar data. In [9] several strongly resembling architectures for Deep Convolutional Neural Networks are presented. They were developed as part of the ImageNet Challenge 2014[8]. The VGG16 net achieved the best results with a depth of 16 trainable layers.

Since both the input data and the required output differ from the original architecture, the model needs to be slightly modified. The entire chosen section of the architecture includes 14 714 688 pretrained weights. These can be set untrainable so that only the weights which have been added by the own layers are trained. In Figure 3 the final architecture used in all cases is displayed. Since the weights of the VGG16 block were no longer trained, the output of this part of the net could be calculated once and reused to increase efficiency.

Figure 3: Uniformly used CNN architecture. The grayed-out part was computed only once.

Since the result of this area still contained 512 channels, the idea arose to pool the result. When using the VGG16 block without pooling, it was noticeable that the output sometimes contained a lot of zeros. For this reason, pooling over the maximum makes sense to reduce the number of zeros. Since it is not the size of the feature maps but the number of channels that should be reduced, we wrote a custom layer. It is named ChannelsMaxPooling layer and pools each pixel one-dimensionally over a given number of feature maps. It can be found on GitHub (http://github.com/tatusch/ChannelsMaxPoolingLayer). In this work a filter size of 32 and a step size of 8 were used. This means that the sliding window covers 32 channels and is moved in steps of 8 across all channels. The number of feature maps is thereby reduced from 512 to (512 − 32) / 8 + 1 = 61.

The white components in Figure 3 must be trained for each approach. The second dense layer generates an N-dimensional vector, where N stands for the number of classes the CNN considers. Unlike all other layers, it does not have ReLU as an activation function, but Sigmoid. By using this activation function, all values of the resulting vector are normalized to the interval [0, 1] and correspond to the probabilities of an assignment to the respective class. Between the two dense layers a dropout layer is applied, which randomly rejects 20% of the tensor values to prevent overfitting.
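A rough sketch of the trainable head described above is given below. It is an illustrative reconstruction under several assumptions: the hidden width of 512 in the first dense layer is hypothetical (the exact sizes are not stated in the text), and the channel-wise max pooling is expressed with standard Keras layers (reshape, permute, MaxPooling1D) rather than the author's custom ChannelsMaxPooling layer.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_head(n_classes, feature_shape=(25, 25, 512), hidden_units=512):
    """Trainable head on top of the precomputed (frozen) VGG16 features."""
    features = layers.Input(shape=feature_shape)
    # Channel-wise max pooling: a window of 32 channels moved in steps of 8,
    # reducing 512 channels to (512 - 32) / 8 + 1 = 61.
    x = layers.Reshape((feature_shape[0] * feature_shape[1], feature_shape[2]))(features)
    x = layers.Permute((2, 1))(x)                       # put the channel axis first
    x = layers.MaxPooling1D(pool_size=32, strides=8)(x) # slide over the channel axis
    x = layers.Permute((2, 1))(x)
    x = layers.Reshape((feature_shape[0], feature_shape[1], 61))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(hidden_units, activation="relu")(x)
    x = layers.Dropout(0.2)(x)                          # randomly reject 20% of the values
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)
    return Model(features, outputs)

# The frozen VGG16 convolutional base, used to precompute the features once.
vgg = VGG16(include_top=False, weights="imagenet", input_shape=(800, 800, 3))
vgg.trainable = False
head = build_head(n_classes=40)   # e.g. one CNN for a cluster of 40 labels
```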
4. EXPERIMENTAL RESULTS

The dataset used for the evaluation was provided during ImageCLEF 2017[6]. All images come from the medical field, but can vary greatly in size, color coding and content. For example, there are images of wounds, patients, CT scans and maps of areas where a disease has spread. The meanings of the labels can be looked up in the Unified Medical Language System (UMLS, https://www.nlm.nih.gov/research/umls/). They, however, were not taken into account in this work.

The dataset contains 164 614 training and 10 000 test images in different formats and a total of 20 812 labels. The test results of the competition listed in [3] clearly show that a precise classification of the objects is a difficult problem. The best achieved average F1 score is only 15.83%. The next 10 places are occupied with values of 12 to 14%. With additional external resources, a maximum F1 score of 17.18% was achieved. All these values are far away from a score which could be considered a precise classification.

Due to limited technical resources, it was necessary to reduce the total amount of labels to 2 000 labels, which were chosen randomly. Although this is an enormous decimation of the number of classes, the dataset is still large enough to represent the described problem. The decimation of the label set causes the reduced dataset to contain in total 89 113 training images and 5 451 test images.

Since this selected dataset included a lot of labels with very few representatives, another subset of the dataset was chosen for comparison. This time the 1 000 most common labels were selected. All these labels are represented by at least 139 training images. The resulting dataset includes 150 339 training and 9 201 test images.

Table 1: Comparison of the achieved precision, recall and F1 score values with different clusterings and loss functions on the two datasets. The results of the redundant clusters with the Binary Cross Entropy were calculated using the indicator and with the Propensity Loss using the average method.

Dataset              Clustering           Loss Function         Precision   Recall     F1-Score
2000-Labels Dataset  None                 Binary Cross Entropy  0.000206    0.000900   0.000336
                                          Propensity Loss       0.001118    0.461154   0.002230
                     Random               Binary Cross Entropy  0.005398    0.282772   0.010594
                                          Propensity Loss       0.002225    0.455054   0.004429
                     Random (redundant)   Binary Cross Entropy  0.002082    0.225877   0.004126
                                          Propensity Loss       0.001842    0.431557   0.003669
                     By Occurrences       Binary Cross Entropy  0.000709    0.066093   0.001403
                                          Propensity Loss       0.000475    0.188481   0.000948
1000-Labels Dataset  None                 Binary Cross Entropy  0.002608    0.001677   0.002042
                                          Propensity Loss       0.005148    0.201733   0.010040
                     Random               Binary Cross Entropy  0.013830    0.151201   0.025343
                                          Propensity Loss       0.007573    0.209422   0.014618

In Table 1 the results of the models with different clusterings and loss functions on the two datasets are displayed. All metrics are calculated by the formulas used in [1].
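For reference, precision, recall and F1 on rounded predictions follow the usual definitions; the following is a minimal sketch under that assumption, not the exact source referenced in [1].

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, eps=1e-7):
    """Micro-averaged precision, recall and F1 on rounded predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.round(np.asarray(y_pred, dtype=float))
    true_positives = np.sum(y_true * y_hat)
    precision = true_positives / (np.sum(y_hat) + eps)
    recall = true_positives / (np.sum(y_true) + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# Toy example: 2 objects, 4 labels.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[0.9, 0.2, 0.1, 0.6], [0.1, 0.8, 0.3, 0.2]]
print(precision_recall_f1(y_true, y_pred))
```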
It becomes clear that all results are anything but good. The basic network with all labels reaches only an F1 score of 0.000336 on the 2000-Labels Dataset and 0.002042 on the 1000-Labels Dataset. In relative terms, however, a strong improvement was achieved by the different approaches. Regarding the F1 score – the most meaningful measure for this task – the best results have been obtained with the parallel architecture, the random disjoint clusters and the binary cross entropy. The resulting F1 score is nearly eight times larger than that of the clusters by occurrences. Random redundant clustering achieves the second-best values and is still much better than the worst clustering. Nevertheless, its results are considerably worse than those of the other alternative. This fact once again confirms the assumption that a network achieves poorer results the more classes it has to consider.

In order to determine which merge method is the most suitable, the individual merge strategies have been evaluated on the 2000-Labels Dataset. The results are shown in Table 2. As can be seen, for the binary cross entropy the best strategy is given by the indicator method. This outcome was expected, as the assignment rate with this loss function is very low and gets supported by the indicator function.

Since the propensity loss already promotes the allocation rate itself, the best results are achieved with the average method. This result can also easily be explained. Since the loss function increases the number of assignments, the false positive rate is increased as well. To reduce this effect, the most appropriate results must be selected. Since both the average and the majority method provide this functionality, both are suitable for the propensity loss. Table 2 shows that their results are very close to each other. In Table 1 the results of the most suitable strategies are displayed.

Table 2: Comparison of the achieved results with redundant clusters using different merge strategies on the 2000-Labels Dataset.

Method      Binary Cross Entropy (R / F1)    Propensity Loss (R / F1)
Indicator   0.225877 / 0.004126              0.642536 / 0.002345
Average     0.053195 / 0.003831              0.431557 / 0.003669
Majority    0.056994 / 0.003796              0.433857 / 0.003649

The usage of the propensity loss function improves the results of the basic network by around 400% on both datasets, as can be seen in Table 1. Particularly noticeable is the improvement of the recall. On the 2000-Labels Dataset a recall of 0.4612 is achieved. This can be explained by the promotion of classification, which causes a decrease of the false negative rate. In combination with the parallel architecture, a significant improvement of the results compared with the basic model is achieved as well, but it becomes clear that in all cases the parallel approach scores better without the propensity loss function.
Furthermore, it can be said that on both datasets the results of the basic model using the propensity loss are outperformed by the parallel approach with random clusters (disjoint as well as redundant).

5. RELATED WORK

To the present day, there are a few approaches for multilabel and multiclass classification[7][2][12][13][4], but only few publications work on the combination of the two tasks. In addition, the difficult facts that the number of labels per object is not fixed and that the number of classes is very high are usually not taken into account. In [11] a classifier is presented that deals with the multilabel multiclass classification and uses association rules[10] to classify the input objects. The authors achieve very good accuracy results; however, the solution is not suitable for a classification with a large number of classes.

6. CONCLUSION

It has been shown that the division of the original label set into smaller label groups and the parallel execution of multiple CNNs on separate clusters significantly improves the results compared with a basic network which deals with all labels at a time. Depending on the choice of the clustering, the quality can be improved even further. On the used datasets, the use of clusters by occurrences has not been very advantageous. Significantly better results were achieved with random disjoint clusters. The artificially generated redundancy of the label sets reduces the quality of the results, as this again leads to an increase in the average cluster size.

The customized propensity loss function reveals a strong improvement of the results as well. The usage of this function makes sense particularly if the assignment rate is very low. If the structure of the model already achieves a relatively high assignment rate, it has been shown that the propensity loss can also reduce the quality of the results. For this reason, it should not be assumed that the propensity loss always leads to an improvement in quality.

7. FUTURE WORK

Unfortunately the usage of association rules for the clustering of the labels was not possible on the datasets considered here. Due to the diversity of the data, it was not possible to find suitable parameters that covered most of the labels and did not produce rules that only appeared once or twice. Rules that occur so rarely are meaningless and therefore not useful. For future research, it would be interesting to look at a different set of data and examine the effects of such a clustering, especially since in [11] it was shown that association rules can achieve convincing results.

Another interesting aspect would be the development of a hierarchical model, in whose uppermost levels it is decided which cluster the input object can be assigned to. A classification to the concrete labels is only made at the lowest level. It can also be promising to perform backpropagation across all levels so that the first levels can learn from the final results.

Principally, the multilabel multiclass classification with a large number of classes remains a difficult problem that can be investigated extensively in the future.

8. REFERENCES
[1] F. Chollet. Metrics File in Keras' GitHub Repository. https://github.com/keras-team/keras/blob/ac1a09c787b3968b277e577a3709cd3b6c931aa5/keras/metrics.py. Accessed: 2018-04-06.
[2] O. Dekel and O. Shamir. Multiclass-Multilabel Classification with More Classes than Examples. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[3] C. Eickhoff, I. Schwall, A. G. S. de Herrera, and H. Müller. Overview of ImageCLEFcaption 2017 - Image Caption Prediction and Concept Detection for Biomedical Images. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, 2017.
[4] D. J. Hsu, S. M. Kakade, J. Langford, and T. Zhang. Multi-Label Prediction via Compressed Sensing. In Advances in Neural Information Processing Systems 22, 2009.
[5] H. Jain, Y. Prabhu, and M. Varma. Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[6] H. Müller, P. Clough, T. Deselaers, and B. Caputo. ImageCLEF: Experimental Evaluation in Visual Information Retrieval. 2010.
[7] M.-E. Nilsback and A. Zisserman. Automated Flower Classification over a Large Number of Classes. In Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.
[9] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Computing Research Repository.
[10] P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining: Pearson New International Edition. 2013.
[11] F. A. Thabtah, P. I. Cowling, and Y. Peng. MMAC: A New Multi-Class, Multi-Label Associative Classification Approach. In ICDM, 2004.
[12] L. Wang, M. Chang, and J. Feng. Parallel and Sequential Support Vector Machines for Multi-Label Classification. In International Journal of Information Technology, 2005.
[13] T. Zhang. Class-size Independent Generalization Analysis of Some Discriminative Multi-Category Classification. In Advances in Neural Information Processing Systems 17, 2004.