<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Approaches for the Improvement of the Multilabel Multiclass Classification with a huge Number of Classes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martha Tatusch</string-name>
          <email>tatusch@cs.uni-duesseldorf.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science Heinrich Heine University Düsseldorf D-40225 Düsseldorf</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>In the field of data analysis, multilabel multiclass classification is still a major problem when the number of classes is large. With the help of deep learning methods, impressive information can be extracted from a wide variety of data. For example, people can be recognized in images and videos, or fonts can be imitated. Nevertheless, these algorithms also encounter limitations. One of these limits when classifying objects is the treatment of multiple classes. For example, if an image is supposed to be described in a few keywords with the help of a dictionary, there are countless words that can be selected, but only very few that apply to the object. Another aggravating fact is that the number of words per image is not fixed. This paper presents two basic approaches to improve the classification accuracy with neural networks compared to a common approach. One strategy describes a parallel model that requires clustered label sets. For this purpose, different distributions are considered. In the second approach, the effects of different loss functions are investigated. It is shown that the presented approaches obtain a very significant improvement of the results compared to the basic model. Both approaches show an improvement of at least 400%. The parallel architecture even achieves 31 times better results than the basic model. We also show under which conditions the individual approaches can achieve the most effective enhancement of quality.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>I.2.8 [Artificial Intelligence]: Problem Solving, Control
Methods, and Search; H.2.8 [Database Management]:
Database Applications - Data Mining; I.4.m [IMAGE
PROCESSING AND COMPUTER VISION]: Miscellaneous -
Image Classification</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Today, Deep Learning and Artificial Neural Network are
widespread terms, especially in the fields of information
technology and data science. A few years ago, these methods
were launched and immediately met with great enthusiasm.
They stand for a specific concept of machine learning (ML),
in which a machine can learn by itself and open up new
knowledge only on the basis of training data that does not
necessarily have to be preprocessed. This discovery was a
major breakthrough in the field of data science because
there finally was a way to avoid the difficulties of the feature
selection that would otherwise be required in ML.</p>
      <p>
        Although there already was a wide variety of classifiers [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
that could learn from training data, the developer always
had to manually explore which features of the objects were
meaningful and extract them beforehand, so that the human
being still had a great influence on the quality of the results.
      </p>
      <p>In Deep Learning, the relevant features are automatically
determined and processed. The construct used is an artificial
neural network with multiple layers between the input
and the output. With these networks it is possible to find
correlations between data that cannot be readily grasped by
the human mind. In addition, problems that seem simple
for humans but are difficult on a programmatic level, such
as the artificial generation of realistic images or fonts and
the generation of meaningful answers to freely formulated
questions, can be solved.</p>
      <p>But these models also have their weaknesses. When
classifying objects, large amounts of training data are required
so that each class (also called a label) can be learned with
a moderate number of representatives. If we now want to
label a collection of different images, for example patient
images of a hospital, thousands of different words are
possible. The number of possible words can of course be limited,
for example by choosing a subject area, but the number of
possibilities will still be large. This means that the number
of images per class is very low on average. The use of
classifiers that require a previous feature selection is not possible,
since there are no recognizable consistent properties of
relevance that can be extracted. This only leaves the possibility
of using deep neural networks. Due to the large number of
classes, however, this task also represents a great challenge
in Deep Learning, which is dealt with in this paper.</p>
    </sec>
    <sec id="sec-3">
      <title>2. APPROACHES</title>
      <p>Multilabel multiclass classification describes the task of
classifying data into classes, whereby there are many
classes and each data point can be assigned to any number of
classes. This work deals with the situation in which each
input object is assigned to only a fraction of the possible
classes. Let |C| be the number of possible classes and |C_o| the
number of classes assigned to an object o. A particular
difficulty in dealing with this problem using machine learning
algorithms is that, because the ratio <inline-formula><tex-math>|C_o| / |C|</tex-math></inline-formula>
is very small, the system has difficulties learning sensibly.</p>
      <p>For example, it is possible that the network adapts
itself to assign objects to no class at all. This can be
explained by the fact that the hit rate is mainly influenced by the
classes which are not assigned to the considered object. The
accuracy metric is calculated by the formula
<disp-formula><tex-math>\text{accuracy} = \frac{\text{number of correctly classified classes}}{\text{number of all classes}}.</tex-math></disp-formula>
During the learning process, many systems strive to
maximize this value. Suppose there are 1 000 classes and one object
is assigned to exactly 5 of these classes. If a classifier does
not assign the object to any class, it will have an
accuracy of <inline-formula><tex-math>\frac{1000 - 5}{1000} = \frac{995}{1000} = 99.5\%</tex-math></inline-formula>.
There is a high probability
that this value will deteriorate if the system tries to find the
correct classes, because this can cause incorrect assignments. Since
the accuracy of non-assignment is still very close to 100%,
this method proves to be the best option for the network.
For humans, however, this approach does not make
sense. The aim is, of course, to make the assignment as accurate
as possible, but also to ensure that an assignment is made at all.
In this situation, it is therefore much more important to
identify the associated classes than to prevent other classes
from being incorrectly assigned to an object.</p>
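      <p>The arithmetic of this example can be reproduced in a few lines of Python. The following is only a minimal sketch: the helper <monospace>label_accuracy</monospace> and the two example predictions are made up for illustration and are not taken from the experiments.</p>

```python
def label_accuracy(y_true, y_pred):
    """Fraction of per-class decisions that are correct for one object."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical object: 1 000 classes, 5 of them assigned.
y_true = [1] * 5 + [0] * 995

# A classifier that assigns no class at all.
predict_nothing = [0] * 1000

# A classifier that finds 3 of the 5 labels but adds 10 wrong ones.
predict_some = [1, 1, 1, 0, 0] + [1] * 10 + [0] * 985

print(label_accuracy(y_true, predict_nothing))
print(label_accuracy(y_true, predict_some))
```

      <p>Although the second prediction recovers three of the five labels, its accuracy is lower than that of the empty prediction, which is exactly the incentive described in this section.</p>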
    </sec>
    <sec id="sec-4">
      <title>2.1 Parallel Network Architecture</title>
      <p>
        Since the high number of classes is the greatest difficulty,
it is very likely that splitting the problem into several
smaller sub-problems can improve the results. The division into
several easier problems can be achieved by clustering [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] the
set of labels. If one separate net is then trained for each cluster,
the number of classes per net gets considerably lower and the
ratio <inline-formula><tex-math>|C_{avg}| / |C|</tex-math></inline-formula> increases. The numerator
<inline-formula><tex-math>|C_{avg}| = \frac{1}{N} \sum_{o=1}^{N} |C_o|</tex-math></inline-formula>
stands for the average number of labels per training object
and the denominator |C| for the number of all possible classes.
Now, the question is how to divide the classes in order to
achieve the best possible results.
      </p>
      <p>Since often nothing is known about the labels other than
their names, properties must be determined with which the
clustering can be performed. One possibility is to look at the
independent occurrences of the different classes. Labels that
are assigned to a similar number of data objects would then
be placed in the same cluster. It is unlikely that only labels
with similar occurrences will be assigned to the same object.
This means that often multiple clusters contain labels that
belong to one data object. This increases the probability
of correct assignment. Another possibility is to divide the
labels randomly into several groups.</p>
      <p>In contrast to classification, which falls under the term
supervised learning, clustering belongs to unsupervised
learning. This means that objects are grouped without
knowing the classes in advance. Therefore there is no training
data, since no information about class affiliation is known.
In this work, the clustering by occurrences is done with the
KMeans algorithm.</p>
      <p>The parameter K represents the number of clusters to be
calculated. First, K random data points of the training set
are selected as the centers of the individual clusters.
These are called centroids. All objects are then assigned to the
cluster whose centroid is closest to the object. The distance
is usually calculated with the Euclidean distance. Next, the
centers are recalculated by computing the mean value of all
data points of the respective cluster. All data points are then
reassigned and the resulting centroids are calculated again. This is
repeated until the assignment of the data objects no longer
changes.</p>
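      <p>The iteration described above can be sketched in plain Python for the one-dimensional case, in which each label is represented only by its number of occurrences. The function name <monospace>kmeans_1d</monospace>, the seed handling and the example counts are assumptions made for illustration; this is not the implementation used for the experiments.</p>

```python
import random


def kmeans_1d(values, k, seed=0):
    """Cluster scalar values with the K-Means steps described in the text."""
    rng = random.Random(seed)
    # Step 1: pick K distinct data points as initial centroids.
    centroids = rng.sample(sorted(set(values)), k)
    assignment = None
    while True:
        # Step 2: assign every value to its closest centroid
        # (in one dimension the Euclidean distance is the absolute difference).
        new_assignment = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Stop once the assignment no longer changes.
        if new_assignment == assignment:
            return assignment, centroids
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = sum(members) / len(members)


# Hypothetical occurrence counts (training images per label).
occurrences = [2, 3, 4, 150, 160, 155, 900, 950]
assignment, centroids = kmeans_1d(occurrences, k=3)
print(assignment, centroids)
```

      <p>Because the initial centroids are chosen at random, different seeds can lead to different final groupings.</p>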
      <p>With the help of the determined clusters, the classification
problem can now be broken down. We consider a
construction in which a CNN is trained separately for each
calculated label set. When the model is executed, all CNNs
are evaluated in parallel. Here, parallelism does not refer to
timing, but symbolizes the fact that all
CNNs are used for testing at the same level. All resulting
assignments are finally merged and contribute with the same
importance to the final result.</p>
      <p>Figure 1 illustrates an example of a model that can
classify into 16 classes. The individual clusters contain different
numbers of labels and some overlaps. In the leftmost
cluster, for example, a network is used that can categorize into
the classes 1, 3, 4, 7, 9, 11, 12 and 14. In this example, it has
chosen the labels 3 and 7. In the final result, the outcomes
of all CNNs are considered equally.</p>
      <p>There are several ways to combine the results. For
example, all classes selected by at least one CNN can be considered
assigned in the final result. It is also possible to use a
majority voting system or an average value. With majority voting,
all results of the clusters containing a label are considered
for that label. Only if the majority has assigned
the object to this class is it also selected in the final result.
For the calculation of the average values, either the binary or
the unrounded predicted values can be used. The consideration
of the decimal values is more suitable, since a prediction of
0.51 for a class would already result in a 1 in the binary case,
which would flow into the average much more strongly
than a 0.51. Finally, the average value itself is rounded,
whereby the binary case would result in a "double rounding",
which can falsify the result.</p>
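      <p>The three merge strategies can be illustrated with a short Python sketch. The function <monospace>merge_predictions</monospace>, the strategy names and the threshold of 0.5 used for rounding are assumptions for illustration; the average is taken over the unrounded values, as recommended above.</p>

```python
def merge_predictions(cluster_labels, cluster_scores, num_classes, method):
    """Merge per-cluster sigmoid scores into one binary label vector.

    cluster_labels[k] lists the class ids handled by network k and
    cluster_scores[k] holds its predicted probabilities in the same order.
    """
    votes = [[] for _ in range(num_classes)]
    for labels, scores in zip(cluster_labels, cluster_scores):
        for label, score in zip(labels, scores):
            votes[label].append(score)

    merged = []
    for scores in votes:
        binary = [1 if s >= 0.5 else 0 for s in scores]
        if not scores:
            merged.append(0)  # class not covered by any cluster
        elif method == "indicator":
            # Assigned as soon as at least one network selects the class.
            merged.append(1 if any(binary) else 0)
        elif method == "majority":
            # Assigned only if more than half of the networks select it.
            merged.append(1 if 2 * sum(binary) > len(binary) else 0)
        elif method == "average":
            # Average the unrounded scores, then round once.
            merged.append(1 if sum(scores) / len(scores) >= 0.5 else 0)
        else:
            raise ValueError(f"unknown method: {method}")
    return merged


# Hypothetical example: three overlapping clusters over four classes.
cluster_labels = [[0, 1, 2], [1, 2, 3], [2]]
cluster_scores = [[0.9, 0.51, 0.6], [0.1, 0.2, 0.7], [0.3]]
for method in ("indicator", "majority", "average"):
    print(method, merge_predictions(cluster_labels, cluster_scores, 4, method))
```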
    </sec>
    <sec id="sec-5">
      <title>2.2 Propensity Loss Function</title>
      <p>A further approach to improving the results of a
multilabel multiclass CNN, which is independent of the
construction of the model and the clustering of classes, is to
adjust the loss function. If it is set up in such a way that,
for example, false negatives are strongly penalized and false positives
are hardly penalized, then this would already have a great
influence on the learning process of the classifier and would
prevent objects from not being classified at all.</p>
      <p>The learning process of a convolutional neural network
requires a loss function and an optimization function. Depending on
the resulting error value of a run, all weights of the CNN
are adjusted during backpropagation. A frequently used
loss function in multilabel classification is the binary cross
entropy. It is calculated by <inline-formula><tex-math>H(p, q) = -\sum_x p(x) \log(q(x))</tex-math></inline-formula>,
where p(x) stands for the actual probability and q(x) for the
calculated probability that the considered object belongs to
class x. The resulting probabilities are rounded, so that p(x)
and q(x) can only have values of 0 or 1. The largest costs are
incurred if the network does not classify into a class which
the object in question belongs to. If it assigns the data point
to a class which it does not belong to, no costs are caused
by the object.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] a new type of loss function is introduced. It is
primarily designed for multiclass classification with an
enormous number of classes. According to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the functions
prioritize the assignment to the correct classes and promote
classification to rarely occurring labels. Their special
characteristic is the relation to the propensity of the individual
labels.
      </p>
      <p>The Hamming Loss is cited as a bad example of a loss
function for the multiclass problem. For a model M, it is
defined by
<disp-formula><label>(1)</label><tex-math>HL(M) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{L} (y_{i,j} - \hat{y}_{i,j})^2</tex-math></disp-formula>
with N the number of data points, L the number of labels,
<inline-formula><tex-math>y_{i,j}</tex-math></inline-formula> the actual assignment of an object i to class j, and
<inline-formula><tex-math>\hat{y}_{i,j}</tex-math></inline-formula> the predicted assignment of an object i to class j.
Because of the squared difference, the model is punished for both
false negatives and false positives. In addition, the costs for
all individual class assignments are calculated in the same
way, as is usual in most cases. In an unbalanced dataset,
however, there may be labels that contain very few
representatives but are nevertheless as important as frequently
represented labels. These are easily overlooked during
training because the probability of an incorrect assignment
is significantly lower than for classes that belong to many
data points. Even a correct classification to such minority
classes has little influence on the training result, as this
happens so rarely that the relevant weights are hardly
changed.</p>
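      <p>As a small illustration, the Hamming Loss can be computed as follows, assuming the form in which the squared differences are summed over all labels and averaged over the N objects. The function name and the toy data are hypothetical.</p>

```python
def hamming_loss(y_true, y_pred):
    """Sum of squared differences over all labels, averaged over N objects."""
    n = len(y_true)
    total = 0.0
    for row_true, row_pred in zip(y_true, y_pred):
        total += sum((t - p) ** 2 for t, p in zip(row_true, row_pred))
    return total / n


# Two hypothetical objects with four labels each.
y_true = [[1, 0, 0, 1], [0, 1, 0, 0]]
# First object: one false positive and one false negative.
# Second object: one false negative.
y_pred = [[1, 1, 0, 0], [0, 0, 0, 0]]

print(hamming_loss(y_true, y_pred))
```

      <p>Every wrong entry contributes the same cost of 1, regardless of whether it is a false positive or a false negative and regardless of how rare the label is.</p>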
      <p>
        For this reason, a cost function which treats each label
individually and calculates weighted costs is desirable. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] a
type of loss function is presented which is based on
propensity values that can be calculated with subjective relevance
ratings. The developer can assign relevance values to the
individual classes, which are then incorporated into the cost
function. Since this paper assumes that the relevance ratings
of the different labels are not known, another variant is used
that was presented in the same paper and is independent of
subjective evaluations.
      </p>
      <p>Based on previous observations, the authors have decided
that the propensity of a class can be represented by a
sigmoid function. For a label l with unknown relevance value,
the propensity <inline-formula><tex-math>p_l</tex-math></inline-formula> is calculated by
<disp-formula><label>(2)</label><tex-math>p_l = \frac{1}{1 + \log(N - 1) \, \sqrt{1.4} \, e^{-0.5 \log(N_l + 0.4)}},</tex-math></disp-formula>
where <inline-formula><tex-math>N_l</tex-math></inline-formula> represents the number of data objects that
contain the label l and N stands for the number of all training
objects. The values for the optimization parameters are the
same as those chosen in the paper.</p>
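      <p>To get a feeling for these values, the propensity can be evaluated for a few label frequencies. The sketch below assumes the reading of Equation (2) with the factor log(N - 1), the constant 1.4 under the square root and the offset 0.4; the function name and the example numbers are hypothetical.</p>

```python
import math


def propensity(label_count, num_objects):
    """Propensity of a label that occurs in label_count of num_objects objects."""
    scale = math.log(num_objects - 1) * math.sqrt(1.4)
    return 1.0 / (1.0 + scale * math.exp(-0.5 * math.log(label_count + 0.4)))


# Hypothetical training set with 1 000 objects.
for count in (1, 10, 100, 900):
    print(count, round(propensity(count, 1000), 3))
```

      <p>Under this reading, rare labels receive a small propensity and frequent labels a larger one, so dividing by the propensity increases the weight of rare labels.</p>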
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] the integration of propensity scores into different
known loss functions was presented. In this work, the
decision was made to use an adapted version of the Hamming
Loss function:
      </p>
      <p><disp-formula><label>(3)</label><tex-math>HL(M) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{L} \left( \frac{1}{p_j} (2\hat{y}_{i,j} - 1) \right) (y_{i,j} - \hat{y}_{i,j})^2</tex-math></disp-formula>
The subterm <inline-formula><tex-math>(2\hat{y}_{i,j} - 1)</tex-math></inline-formula> has the function of an indicator that
checks whether the object i has been assigned to class j or
not. In the binary case, it is 1 if the object has been classified into the observed class, and
-1 if it has not. As a
result, predictions in which i has incorrectly been assigned
to class j are punished, and those in which i has wrongly not
been assigned to the class are rewarded. By
positioning the propensity in the denominator of the fraction,
misclassifications to labels with high propensity are
weighted less than those to the rarely occurring ones. Except for
this factor, nothing has changed in the original Hamming
Loss function.</p>
      <p>
        As one of the problems discussed here is that the classifier
possibly learns not to classify at all, it is not advisable to
take the formula from [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] unchanged. The indicator function
only punishes false positives and even rewards false
negatives. As a result, the likelihood increases that the network does not
make a classification at all rather than misclassifying an
object. In this work, it makes more sense to use a
loss function that punishes false positives and false
negatives equally or possibly penalizes false negatives more strongly. In any case,
however, incorrect allocations must increase the error value.
For this reason, the absolute value of the indicator function
is used in the following process:
      </p>
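      <p><disp-formula><label>(4)</label><tex-math>HL(M) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{L} \left( \frac{1}{p_j} \left| 2\hat{y}_{i,j} - 1 \right| \right) (y_{i,j} - \hat{y}_{i,j})^2</tex-math></disp-formula>
This ensures that all incorrect classifications are treated in
the same way.</p>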
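      <p>The difference between the signed and the absolute indicator can be checked on a toy example. The following sketch assumes binary predictions and a propensity-weighted Hamming Loss that is averaged over the objects; the function name, the flag <monospace>use_abs</monospace> and the values are hypothetical.</p>

```python
def propensity_hamming_loss(y_true, y_pred, propensities, use_abs=True):
    """Propensity-weighted Hamming Loss with signed or absolute indicator."""
    n = len(y_true)
    total = 0.0
    for row_true, row_pred in zip(y_true, y_pred):
        for t, p, prop in zip(row_true, row_pred, propensities):
            indicator = 2 * p - 1  # +1 if assigned, -1 if not (binary case)
            if use_abs:
                indicator = abs(indicator)
            total += indicator / prop * (t - p) ** 2
    return total / n


# One hypothetical object with two labels, both with propensity 0.5.
y_true = [[1, 0]]
propensities = [0.5, 0.5]
false_negative = [[0, 0]]  # label 0 is missed
false_positive = [[1, 1]]  # label 1 is wrongly assigned

for name, pred in (("FN", false_negative), ("FP", false_positive)):
    signed = propensity_hamming_loss(y_true, pred, propensities, use_abs=False)
    absolute = propensity_hamming_loss(y_true, pred, propensities, use_abs=True)
    print(name, signed, absolute)
```

      <p>With the signed indicator the missed label lowers the loss, whereas with the absolute value both kinds of error raise it by the same amount.</p>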
    </sec>
    <sec id="sec-6">
      <title>3. REALISATION</title>
      <p>Before creating the model, the input data has to be
preprocessed. All images are mapped to the RGB color space. Since
the net expects a fixed image size, a square size of 800 × 800
pixels has been chosen. If necessary, the
image size is increased by adding black borders. This can be
done by adding zero values on the sides. If the image is too
big, it is scaled down by means of interpolation until the
largest side length is 800 pixels. The smaller side
is then evenly filled with zero values from both sides.</p>
      <p>[Figure 2: (a) Random, (b) By Occurrences]</p>
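      <p>The resizing and padding rule for the input images can be sketched as a small geometry helper. The function <monospace>square_layout</monospace> and its return format are assumptions for illustration; it only computes the scaled size and the zero padding and leaves the actual interpolation to an image library.</p>

```python
def square_layout(height, width, size=800):
    """Return the scaled size and zero padding for a size x size input.

    The result is ((new_height, new_width), (top, bottom), (left, right)).
    """
    longest = max(height, width)
    if longest > size:
        # Scale down so that the longest side is exactly `size` pixels.
        scale = size / longest
        height = max(1, round(height * scale))
        width = max(1, round(width * scale))
    pad_h, pad_w = size - height, size - width
    # Split the padding evenly; an odd remainder goes to the bottom/right.
    top, left = pad_h // 2, pad_w // 2
    return (height, width), (top, pad_h - top), (left, pad_w - left)


print(square_layout(1600, 1200))  # too large: scaled, then padded left/right
print(square_layout(400, 300))    # small enough: only padded
```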
    </sec>
    <sec id="sec-7">
      <title>3.1 Clustering</title>
      <p>In order to realize the approach of the parallel
model, the first step is to cluster the label set. In this work,
50 clusters were requested in every case. The
determination of random label groups is self-explanatory. The result
is a distribution of the classes that is similar to a uniform
distribution. This balanced arrangement is illustrated in
Figure 2a. The smallest cluster contains 28 and the largest 52
labels. All label groups are therefore relatively small.</p>
      <p>Although the MiniBatchKMeans implementation of
scikit-learn<sup>1</sup> receives a desired number of clusters as a
parameter, it only creates as many clusters as actually make
sense. As a result, clustering by
occurrences creates 39 label groups with 8 to 391 classes.
Apart from a few exceptions, these are again relatively small
clusters. Even the largest cluster is more than four
times smaller than the total number of classes and therefore
represents a significant reduction of the label set.
Nevertheless, the distribution is very heterogeneous. The differences
between the label distributions of the random clusters and
the clustering by occurrences become clear in Figure 2, as
the Y-axis has the same scale in both cases.</p>
      <p>Both the random distribution and the clustering by
occurrences generate disjoint label groups. Since it is
interesting to see what effect overlapping clusters would have,
an additional distribution of the labels has been
made. The labels were randomly distributed into 50 clusters,
with each label being assigned to a maximum of 5 clusters.</p>
      <p><sup>1</sup> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html</p>
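      <p>Both random distributions can be generated with the same small routine. The sketch below is one possible way to do this, not necessarily the procedure used in this work: the function <monospace>random_clusters</monospace>, the parameter <monospace>max_copies</monospace> and the uniform choice of the number of copies per label are assumptions.</p>

```python
import random


def random_clusters(labels, num_clusters, max_copies=1, seed=0):
    """Distribute labels randomly over clusters.

    With max_copies=1 the clusters are disjoint; with a larger value each
    label is placed into between 1 and max_copies different clusters.
    """
    rng = random.Random(seed)
    clusters = [[] for _ in range(num_clusters)]
    for label in labels:
        copies = rng.randint(1, max_copies)
        for index in rng.sample(range(num_clusters), copies):
            clusters[index].append(label)
    return clusters


labels = list(range(2000))
disjoint = random_clusters(labels, 50)
redundant = random_clusters(labels, 50, max_copies=5)
print(min(len(c) for c in disjoint), max(len(c) for c in disjoint))
print(min(len(c) for c in redundant), max(len(c) for c in redundant))
```

      <p>Since every label may be copied into several clusters, the redundant variant produces larger clusters on average than the disjoint one.</p>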
      <p>
        Due to the problem of determining a suitable
Convolutional Neural Network architecture and the usually
time-consuming training sessions of a network, it is advisable to
use a pre-trained network which has already achieved
convincing results on similar data. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] several closely
related architectures for Deep Convolutional Neural Networks
are presented. They were developed as part of the ImageNet
Challenge 2014 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The VGG16 net achieved the best results
with a depth of 16 trainable layers.
      </p>
      <p>Since both the input data and the required output differ
from the original architecture, the model needs to be
slightly modified. The entire chosen section of the architecture
includes 14,714,688 pretrained weights. These can be frozen
so that only the weights of the newly added
layers are trained. Figure 3 displays the final
architecture used in all cases. Since the weights of
the VGG16 block were no longer trained, the output of this
part of the net could be calculated once and reused to
increase efficiency. Since the result of this block still contained
512 channels, the idea arose to pool the result. When using
the VGG16 block without pooling, it was noticeable that
the output sometimes contained a lot of zeros. For this
reason, max pooling makes sense to reduce the
number of zeros. Since it is not the size of the feature maps
but the number of channels that should be reduced, we
wrote a custom layer. It is named ChannelsMaxPooling Layer
and pools each pixel one-dimensionally over a given number
of feature maps. It can be found on GitHub<sup>2</sup>. In this work
a filter size of 32 and a step size of 8 were used. This
means that the sliding window covers 32 channels and is
moved in steps of 8 channels across all channels. The number of
feature maps is reduced from 512 to <inline-formula><tex-math>\frac{512 - 32}{8} + 1 = 61</tex-math></inline-formula>.</p>
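      <p>The effect of pooling across channels can be illustrated for a single pixel in plain Python. This is only a sketch of the idea and not the ChannelsMaxPooling Layer itself; the function name and the example vector are hypothetical.</p>

```python
def channels_max_pooling(pixel_channels, window=32, stride=8):
    """Max-pool the channel vector of one pixel with a sliding window."""
    last_start = len(pixel_channels) - window
    return [
        max(pixel_channels[start:start + window])
        for start in range(0, last_start + 1, stride)
    ]


# Hypothetical channel values of one pixel after the VGG16 block.
pixel = list(range(512))
pooled = channels_max_pooling(pixel)
print(len(pooled))
```

      <p>The window starts at the channels 0, 8, ..., 480, which yields (512 - 32) / 8 + 1 = 61 pooled values per pixel.</p>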
      <p>The white components in Figure 3 must be trained for
each approach. The second dense layer generates an
N-dimensional vector, where N stands for the number of classes
considered by the CNN. Unlike all other layers, it does not use ReLU as
its activation function, but sigmoid. By using this activation
function, all values of the resulting vector are normalized to
the interval [0, 1] and correspond to the probabilities of an
assignment to the respective class. Between the two dense
layers a dropout layer is applied, which randomly drops
20% of the tensor values to prevent overfitting.</p>
      <p><sup>2</sup> http://github.com/tatusch/ChannelsMaxPoolingLayer</p>
      <p>[Table 1: Dataset (2000-Labels Dataset); Clustering (None, Random, Random (redundant), By Occurrences); Loss function (Binary Cross Entropy, Propensity Loss); Precision]</p>
    </sec>
    <sec id="sec-8">
      <title>4. EXPERIMENTAL RESULTS</title>
      <p>
        The dataset used for the evaluation was provided during
ImageCLEF2017 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. All images come from the medical field,
but can vary greatly in size, color coding and content. For
example, there are images of wounds, patients, CT scans and
maps of areas where a disease has spread. The
meanings of the labels can be looked up in the Unified Medical
Language System (UMLS)<sup>3</sup>. They were, however, not taken
into account in this work.
      </p>
      <p>
        The dataset contains 164 614 training and 10 000 test images
in different formats and a total of 20 812 labels. The test
results of the competition listed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] clearly show that a
precise classification of the objects is a difficult problem. The
best achieved average F1 score is only 15.83%. The next 10
places are occupied with values of 12 to 14%. With
additional external resources, a maximum F1 score of 17.18% was
achieved. All these values are far away from a score which
can be taken for a precise classification.
      </p>
      <p>Due to limited technical resources, it was necessary to
reduce the total number of labels to 2 000, which were
chosen randomly. Although this is an enormous reduction
of the number of classes, the dataset is still large enough
to represent the described problem. As a consequence of the
reduced label set, the reduced dataset contains a total of
89 113 training images and 5 451 test images.</p>
      <p>Since in this selected dataset a lot of labels with very few
representatives were included, another subset of the dataset
was chosen for comparison. This time the 1 000 most
common labels were selected. All these labels are represented by
at least 139 training images. The resulting data set includes
150 339 training and 9 201 test images.</p>
      <p>
        In Table 1 the results of the models with different
clusterings and loss functions on the two datasets are displayed.
All metrics are calculated by the formulas used in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It
becomes clear that all results are low in absolute terms. The basic
network with all labels reaches only an F1 score of 0.000336
on the 2000-Labels Dataset and 0.002042 on the 1000-Labels
Dataset. In relative terms, however, a strong improvement
was achieved by the different approaches. Regarding the F1
score, which is the most meaningful measure
for the task, the best results have been obtained with
the parallel architecture, the random disjoint clusters and
the binary cross entropy. The resulting F1 score is nearly
eight times larger than that of the clusters by occurrences.
Random redundant clustering achieves the second-best
values and is still much better than the worst clustering.
Nevertheless, its results are considerably worse than those of the
disjoint random clusters. This fact once again confirms the
assumption that a network achieves poorer results the more classes
it has to consider.
      </p>
      <p><sup>3</sup> https://www.nlm.nih.gov/research/umls/</p>
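      <p>For orientation, precision, recall and F1 score on binary label vectors can be computed as in the following sketch. It pools all label decisions of all objects before computing the ratios, which is an assumption made here for simplicity; it is not a copy of the formulas in [1], and the function name and toy data are hypothetical.</p>

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 over all (object, label) decisions."""
    true_positives = sum(
        t * p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    predicted = sum(sum(row) for row in y_pred)
    actual = sum(sum(row) for row in y_true)
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two hypothetical objects with four labels each.
y_true = [[1, 0, 0, 1], [0, 1, 0, 0]]
y_pred = [[1, 1, 0, 0], [0, 1, 0, 0]]
print(precision_recall_f1(y_true, y_pred))
```

      <p>A model that assigns no label at all receives a recall and an F1 score of 0 under this definition, which is why the F1 score is more informative here than the accuracy.</p>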
      <p>In order to determine which method is the most suitable,
the individual merge strategies have been evaluated on the
2000-Labels Dataset. The results are shown in Table 2. As
you can see, for the binary cross entropy the best strategy is
given by the indicator method. This outcome was expected
as the assignment rate with this loss function is very low
and gets supported by the indicator function.</p>
      <p>Since the propensity loss already promotes the
allocation rate itself, the best results are achieved with the average
method. This result can also easily be explained. Since the
loss function increases the number of assignments, the
false positive rate is increased, as well. To reduce this effect,
the most appropriate results must be selected. Since both
the average and majority method provide this
functionality, both are suitable for the propensity loss. Table 2 shows
that both results are very close to each other. In Table 1 the
results of the most suitable strategies are displayed.</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption><p>Recall (R) and F1 score of the merge strategies on the 2000-Labels Dataset.</p></caption>
        <table>
          <thead>
            <tr><th rowspan="2">Method</th><th colspan="2">Binary Cross Entropy</th><th colspan="2">Propensity Loss</th></tr>
            <tr><th>R</th><th>F1</th><th>R</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>Indicator</td><td>0.225877</td><td>0.004126</td><td>0.642536</td><td>0.002345</td></tr>
            <tr><td>Average</td><td>0.053195</td><td>0.003831</td><td>0.431557</td><td>0.003669</td></tr>
            <tr><td>Majority</td><td>0.056994</td><td>0.003796</td><td>0.433857</td><td>0.003649</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The usage of the propensity loss function improves the
results of the basic network by around 400% on both datasets,
as can be seen in Table 1. Particularly noticeable is the
improvement of the recall. On the 2000-Labels Dataset a recall
of 0.4612 is achieved. This can be explained by the
promotion of classification, which decreases the false
negative rate. In combination with the parallel architecture,
a significant improvement over
the basic model is achieved as well, but it becomes clear that
in all cases the parallel approach scores better without the
propensity loss function.</p>
      <p>Furthermore, it can be said, that on both datasets, the
results of the basic model using the propensity loss are
outperformed by the parallel approach with random clusters
(disjoint as well as redundant).</p>
    </sec>
    <sec id="sec-9">
      <title>5. RELATED WORK</title>
      <p>
        To the present day, there are a few approaches for
multilabel and multiclass classification [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref12">12</xref>
        ][
        <xref ref-type="bibr" rid="ref13">13</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], but only
few publications work on the combination of the two tasks.
In addition, the complicating facts that the number of labels per
object is not fixed and the number of classes is very high are
usually not taken into account. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] a classifier is
presented that deals with multilabel multiclass classification
and uses association rules [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to classify the input objects.
The authors achieve very good accuracy results; however,
the solution is not suitable for a classification with a large
number of classes.
      </p>
    </sec>
    <sec id="sec-10">
      <title>6. CONCLUSION</title>
      <p>It has been shown that the division of the original label set
into smaller label groups and the parallel execution of
multiple CNNs on separate clusters significantly improves the
results compared to a basic network which deals with all labels
at once. Depending on the choice of the clustering, the
quality can be improved even further. On the datasets used,
clustering by occurrences has not been very
advantageous. Significantly better results were achieved with
random disjoint clusters. The artificially generated
redundancy of the label sets reduces the quality of the results, as
this again leads to an increase in the average cluster size.</p>
      <p>The customized propensity loss function yields a strong
improvement of the results as well. The usage of this
function makes sense particularly if the assignment rate is very
low. If the structure of the model has already achieved a
relatively high assignment rate, it has been shown that the
propensity loss can also reduce the quality of the results. For
this reason, it should not be assumed that the propensity loss
always leads to an improvement in quality.</p>
    </sec>
    <sec id="sec-11">
      <title>7. FUTURE WORK</title>
      <p>
        Unfortunately, the usage of association rules for the
clustering of the labels was not possible on the datasets considered
here. Due to the diversity of the data, it was not possible
to find suitable parameters that covered most of the labels
and did not produce any rules that only appeared once or
twice. Rules that occur so rarely are meaningless and
therefore not useful. For future research, it would be interesting
to look at a different set of data and examine the effects of
such a clustering, especially since in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] it was shown that
association rules can achieve convincing results.
      </p>
      <p>Another interesting aspect would be the development of a
hierarchical model, in whose uppermost levels it is decided
which cluster the input object can be assigned to. A
classification to the concrete labels is only made at the lowest
level. It may also be promising to perform backpropagation
across all levels so that the first levels can learn from the
final results.</p>
      <p>In general, multilabel multiclass classification with a
large number of classes remains a difficult problem that
can be investigated extensively in the future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          . Metrics File in Keras' GitHub Repository. https://github.com/keras-team/keras/blob/ac1a09c787b3968b277e577a3709cd3b6c931aa5/keras/metrics.py. Accessed: 2018-04-06.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Dekel</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Shamir</surname>
          </string-name>
          .
          <article-title>Multiclass-Multilabel Classification with More Classes than Examples</article-title>
          .
          <source>In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Eickhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Schwall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Overview of ImageCLEFcaption 2017 - Image Caption Prediction and Concept Detection for Biomedical Images</article-title>
          .
          <source>In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Kakade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Multi-Label Prediction via Compressed Sensing</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>22</volume>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          .
          <article-title>Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking &amp; Other Missing Label Applications</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Deselaers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          .
          <article-title>ImageCLEF: Experimental Evaluation in Visual Information Retrieval</article-title>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.-E.</given-names>
            <surname>Nilsback</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Automated Flower Classification over a Large Number of Classes</article-title>
          .
          <source>In Indian Conference on Computer Vision, Graphics and Image Processing</source>
          ,
          <month>Dec</month>
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satheesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          .
          <source>Computing Research Repository</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Steinbach</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <source>Introduction to Data Mining: Pearson New International Edition</source>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Thabtah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. I.</given-names>
            <surname>Cowling</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          .
          <article-title>MMAC: A New Multi-Class, Multi-Label Associative Classification Approach</article-title>
          .
          <source>In ICDM</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>C. M. Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>Parallel and Sequential Support Vector Machines for Multi-Label Classification</article-title>
          . In
          <source>International Journal of Information Technology</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Class-size Independent Generalization Analysis of Some Discriminative Multi-Category Classification</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>17</volume>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>