                  IIS at ImageCLEF 2015:
                 Multi-label classification task

             Antonio J Rodrı́guez-Sánchez1 , Sabrina Fontanella1,2 ,
                    Justus Piater1 , and Sandor Szedmak1
      1
         Intelligent and Interactive Systems, Department of Computer Science,
                            University of Innsbruck, Austria
    {antonio.rodriguez-sanchez,justus.piater,sandor.szedmak}@uibk.ac.at
                              https://iis.uibk.ac.at/
           2
             Department of Computer Science, University of Salerno, Italy
                            fontanellasabrina@gmail.com




       Abstract. We propose an image decomposition technique that captures
       the structure of a scene. An image is decomposed into a matrix that
       represents the adjacency between the elements of the image and their
       distance. Images decomposed this way are then classified using a max-
       imum margin regression (MMR) approach, where the normal vector of
       the separating hyperplane maps the input feature vectors into the output
       vectors. Multiclass and multilabel classification are native to MMR,
       unlike more classical maximum margin approaches such as the SVM. We
       have tested our approach on the ImageCLEF 2015 multi-label classifi-
       cation task, obtaining high rankings in that task.

       Keywords: ImageCLEF, Kronecker decomposition, Maximum Margin,
       MMR, SVM, multi-label classification, medical images



1    Introduction

Automatic image classification is a fundamental part of computer vision. An
image is classified according to the visual content it contains. Tasks in image
classification include determining whether an image contains a certain object,
person, animal or plant; whether the image shows a street scene or an indoor
environment; or, in the case that applies to this paper, whether it is a medical
figure and which types of medical images and/or graphs it contains. Image
classification spans several decades, from the first character and digit recognition
challenges (still used today), such as the MNIST dataset [1], to more recent,
challenging image classification tasks, such as the Pascal [2], ImageNet [3] or
ImageCLEF [4, 5] challenges.
    Image classification algorithms usually consist of a first step where keypoints
or regions are found, which are then assigned a representation in terms of a fea-
ture vector. During training, the feature vectors extracted from a set of training
images are usually grouped into histograms that approximate the distribution
of the features for the different types of images. These histograms compose the
input to a classification algorithm, such as an SVM (discriminative) or Naive
Bayes (generative).
    One of the central problems in exploring the general structure of an image is
to recognize the relations between the objects appearing in the image. The task
is not really the recognition of the objects themselves but rather building a model
of the structure: what belongs to what and how the parts can be related. It is
similar to the way an animal might observe the world: without labels attached to
the objects, relying only on the relations among them in a certain environment.
Those relations could provide the knowledge needed to identify scenes.
    One of the most popular streams of machine learning research is the search for
efficient methods for learning structured outputs. Several researchers have intro-
duced similar approaches to this kind of problem [6–10]. Those methods directly
incorporate the structural learning into a specially chosen optimization framework.
    It is generally assumed that learning a discriminant function when the out-
put space is a labeled hierarchy is a much more complex problem than binary
classification. In this paper we show that the complexity of this kind of problem
can be detached from the optimization model and expressed by an embedding
into a Hilbert space. This allows us to apply a universal optimization model,
processing inputs and outputs represented in a properly chosen Hilbert space,
which can solve the corresponding optimization task without tackling the
underlying structural complexity. The optimization model is an implementation
of a certain type of maximum margin regression, an algebraic generalization
of the well-known Support Vector Machine. The computational complexity of
the optimization scales only with the number of input-output pairs and is
independent of the dimensions of both spaces. Furthermore, its overall com-
plexity is equal to that of a binary classification. Our approach can easily be
extended towards other structural learning problems without giving up the
efficiency of the basic optimization framework.
    Three fundamental steps are needed for structural learning:
Embedding The structures of the input and output objects are represented as
   abstract vectors in properly chosen Hilbert spaces reflecting the similarity
   and the dissimilarity of the objects.
Optimization The optimization phase is implemented via a universal solver
   which tries to find the best similarity-based matching between the input
   and the output representations. Since these representations are expressed
   as general vectors, the optimizer need not directly tackle the underlying
   structural complexity.
Inversion The optimizer provides a decision function which emits a vector. The
   inversion phase has to find the best fitting output structure by projecting
   the image vector back. This is often referred to as the pre-image problem;
   see some alternatives presented in [10]. If the embedding is realized as a
   bijective mapping, the inversion task is well defined.
    We will make use of the following mathematical notation conventions in the
rest of the paper: X stands for the space of the input objects, Y for the space
of the outputs. H_φ is a Hilbert space comprising the feature vectors, the images
of the input vectors with respect to the embedding φ(). H_ψ is a Hilbert space
comprising the images of the label vectors with respect to the embedding ψ(). W is
a matrix representing the linear operator projecting the feature space H_φ into
H_ψ. ⟨·, ·⟩_{H_z} denotes the inner product in the Hilbert space H_z, and ‖·‖_{H_z} is the
norm defined in the Hilbert space H_z. tr(W) is the trace of the matrix W. dim(H)
is the dimension of the space H. x_1 ⊗ x_2 denotes the tensor product of the vectors
x_1 ∈ H_1 and x_2 ∈ H_2, and it represents a linear operator A : H_2 → H_1 which
acts on a vector z ∈ H_2 as (x_1 ⊗ x_2)z := (x_1 x_2′)z = x_1 ⟨x_2, z⟩_{H_2}. ⟨A, B⟩_F is the
Frobenius inner product of the matrices represented by the linear operators A and B,
and it is defined by tr(A′B). ‖A‖_F stands for the Frobenius norm of a matrix
represented by the linear operator A, defined by √⟨A, A⟩_F. A · B is the
element-wise (Schur) product of the matrices A and B. A′, a′ is the transpose
of any matrix A or any vector a.


2   Image feature generation via decomposition
Let us consider the decomposition of a real 2D image, where we can expect that
points close to each other within continuous 2D blocks relate more strongly to
each other than points connected only along 1D rows and columns. To represent
the image decomposition, the Kronecker product is applied, which can be
expressed as

                      [ A_{1,1}B     A_{1,2}B     · · ·  A_{1,n_A}B   ]
                      [ A_{2,1}B     A_{2,2}B     · · ·  A_{2,n_A}B   ]
          X = A ⊗ B = [     ⋮            ⋮          ⋱         ⋮       ]          (1)
                      [ A_{m_A,1}B   A_{m_A,2}B   · · ·  A_{m_A,n_A}B ]

          A ∈ ℝ^{m_A×n_A},  B ∈ ℝ^{m_B×n_B},  m_X = m_A m_B,  n_X = n_A n_B

   In the Kronecker decomposition the second component (B) can be interpreted
as a 2D filter of the image represented by the matrix X. We can try to find a
sequence of filters by the following procedure:

 1. k = 1
 2. X^(1) = X
 3. DO
 4.   d(A^(k), B^(k)) = min_{A^(k), B^(k)} ‖X^(k) − A^(k) ⊗ B^(k)‖²
 5.   IF d(A^(k), B^(k)) ≤ ε THEN STOP
 6.   X^(k+1) = X^(k) − A^(k) ⊗ B^(k)
 7.   k = k + 1
 8. GOTO 3
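
   The loop above can be sketched in a few lines of Python/NumPy. This is only an
illustrative sketch: the routine nkp, the parameter block_shape (the shape of the
filter B) and the tolerance eps are our own names and assumptions; nkp stands for
any routine returning the nearest Kronecker factors of a matrix, such as the SVD-
based one sketched at the end of Section 2.2.

    import numpy as np

    def kronecker_filter_sequence(X, nkp, block_shape, eps=1e-6, max_iter=10):
        """Sketch of the residual loop above: repeatedly extract the nearest
        Kronecker factors of the current residual and subtract the approximation.

        `nkp(M, block_shape)` is assumed to return factors (A, B), with B of
        shape `block_shape`, minimizing ||M - A kron B||^2 (see Section 2.2)."""
        residual, filters = X.astype(float), []
        for _ in range(max_iter):
            A, B = nkp(residual, block_shape)
            filters.append(B)                       # B acts as a 2D filter of the image
            approx = np.kron(A, B)
            if np.linalg.norm(residual - approx) ** 2 <= eps:
                break                               # d(A^(k), B^(k)) <= eps: stop
            residual = residual - approx            # X^(k+1) = X^(k) - A^(k) kron B^(k)
        return filters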

   The question is, if X is given, how do we compute A and B? It turns out
that the Kronecker decomposition can be carried out by Singular Value Decom-
position (SVD) working on a reordered representation of the matrix X.
   For an arbitrary matrix X of size m × n the SVD is given by X = USV^T,
where U ∈ ℝ^{m×m} is an orthogonal matrix of left singular vectors, UU^T = I_m,
V ∈ ℝ^{n×n} is an orthogonal matrix of right singular vectors, VV^T = I_n, and
S ∈ ℝ^{m×n} is a diagonal matrix with the nonnegative singular values on its
diagonal.
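
   As a quick illustration (not part of the original method description), the SVD
and its orthogonality properties can be checked with NumPy on an arbitrary random
matrix:

    import numpy as np

    X = np.random.rand(6, 4)
    U, s, Vt = np.linalg.svd(X)              # X = U S V^T, s holds the singular values
    S = np.zeros_like(X)
    np.fill_diagonal(S, s)                   # place the singular values on the diagonal
    assert np.allclose(U @ S @ Vt, X)        # reconstruction of X
    assert np.allclose(U @ U.T, np.eye(6))   # U U^T = I_m
    assert np.allclose(Vt.T @ Vt, np.eye(4)) # V V^T = I_n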

2.1   Reordering of the matrix
Since the algorithm solving the SVD problem does not depend directly on the
order of the elements of the matrix [11], any permutation of the indexes, i.e.
reordering the columns and/or rows, preserves the solution.

2.2   Kronecker decomposition as SVD
The solution to the Kronecker decomposition via the SVD can be found in [11].
This approach relies on the aforementioned observation regarding the invariance
of the SVD under reordering of the matrix elements.
    In order to show how the reordering of matrix X can help to solve the Kro-
necker decomposition problem we present the following example. The matrices
in the Kronecker product

              X = A ⊗ B:

              [ x11 x12 x13 x14 x15 x16 ]
              [ x21 x22 x23 x24 x25 x26 ]     [ a11 a12 a13 ]
              [ x31 x32 x33 x34 x35 x36 ]  =  [ a21 a22 a23 ]  ⊗  [ b11 b12 ]
              [ x41 x42 x43 x44 x45 x46 ]     [ a31 a32 a33 ]     [ b21 b22 ]
              [ x51 x52 x53 x54 x55 x56 ]
              [ x61 x62 x63 x64 x65 x66 ]

can be reordered into

                              [ x11 x13 x15 x31 x33 x35 x51 x53 x55 ]
              X̃ = Ã ⊗ B̃  =  [ x12 x14 x16 x32 x34 x36 x52 x54 x56 ]
                              [ x21 x23 x25 x41 x43 x45 x61 x63 x65 ]
                              [ x22 x24 x26 x42 x44 x46 x62 x64 x66 ]

                              [ b11 ]
                           =  [ b12 ]  ⊗  [ a11 a12 a13 a21 a22 a23 a31 a32 a33 ],
                              [ b21 ]
                              [ b22 ]
where the blocks of X and the matrices A and B are vectorized in row-wise
order. In this vectorization we follow the row-wise order applied in most well-
known programming languages, such as C, Java and Python, instead of the
column-wise order used, e.g., in Fortran.
    We can recognize that X̃ = Ã ⊗ B̃ can be interpreted as the first step of the
SVD algorithm, where we might apply the substitutions √s u = Ã and √s v = B̃.
The proof that this reordering generally provides the correct solution to the
Kronecker decomposition can be found in [11].
    The main steps of the Kronecker decomposition can be summarized as follows:
 1. Reorder (reshape) the matrix,
 2. Compute the SVD decomposition,
 3. Compute the approximation of X̃ by Ã ⊗ B̃,
 4. Invert the reordering.
   This kind of Kronecker decomposition is often referred to as the Nearest Orthogonal
Kronecker Product [11]; a code sketch of these steps is given below.
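
   The four steps above can be sketched in NumPy as follows. This is a sketch under
our reading of [11]: the function name, the row-wise block ordering and the fixed
shape of B are our own choices, not part of the original text.

    import numpy as np

    def nearest_kronecker_product(X, shape_B):
        """Return A, B minimizing ||X - A kron B||_F for a given shape of B [11]."""
        mB, nB = shape_B
        mA, nA = X.shape[0] // mB, X.shape[1] // nB
        # 1. Reorder (reshape): one row per block of X, each block vectorized row-wise
        R = (X.reshape(mA, mB, nA, nB)
              .transpose(0, 2, 1, 3)
              .reshape(mA * nA, mB * nB))
        # 2. SVD of the reordered matrix
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        # 3. The rank-1 approximation of R gives the two Kronecker factors
        A = np.sqrt(s[0]) * U[:, 0].reshape(mA, nA)
        B = np.sqrt(s[0]) * Vt[0, :].reshape(mB, nB)
        # 4. Inverting the reordering is implicit: A kron B approximates X
        return A, B

For a matrix X that is exactly a Kronecker product, np.kron(A, B) recovers X up
to the usual sign ambiguity of the singular vectors.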


3      Learning task
The learning task that we are going to solve is the following: there is a set, called
the sample, of pairs of output and input objects {(y_i, x_i) : y_i ∈ Y, x_i ∈ X, i =
1, . . . , m} independently and identically chosen from an unknown multivariate
distribution P(Y, X). Here we would like to emphasize that the input and the
output objects can be arbitrary, e.g. they may be graphs, matrices, functions,
probability distributions etc. For these objects, consider two functions φ :
X → H_φ and ψ : Y → H_ψ mapping the input and output objects respectively
into linear vector spaces, called from now on the feature space in the case of the
inputs and the label space when the outputs are considered.
     The objective is to find a linear function acting on the feature space

                                f (φ(x)) = Wφ(x) + b,                             (2)

that produces a prediction of every input object in the label space and in this
way could implicitly give back a corresponding output object. Formally we have

                            y = ψ⁻¹(ψ(y)) = ψ⁻¹(f(φ(x))).                          (3)

      The learning procedure can be summarized as follows:
                Embedding        φ : input space (X) → feature space (H_φ),
                                 ψ : output space (Y) → label space (H_ψ),

                Similarity
                transformation   f = (W, b)  ⇒  ψ(y) ∼ Wφ(x),

                Inversion        ψ⁻¹ : label space (H_ψ) → output space (Y).


4      Optimization model
4.1     The “Classical” scheme of Support Vector Machine (SVM)
In the framework of the Support Vector Machine the outputs represent two
classes and the labels are chosen from the set y_i ∈ {−1, +1}. The aim is to find
a separating hyperplane, via its normal vector, such that the distance between
the elements of the two classes, called the margin, is the largest when measured
in the direction of this normal vector. This basic scheme can be extended by
allowing some sample items to fall closer to the separating hyperplane than the
margin.
    This learning scenario can be formulated as an optimization problem:

                      min    ½‖w‖₂² + C1′ξ
                      w.r.t. w : H_φ → ℝ, normal vector,
                             b ∈ ℝ, bias,  ξ ∈ ℝ^m, error vector,
                      s.t.   y_i(w′φ(x_i) + b) ≥ 1 − ξ_i,
                             ξ ≥ 0,  i = 1, . . . , m.

4.2      Reinterpretation of the normal vector w
The normal vector w formally behaves as a linear transformation acting on the
feature vectors, and its capabilities can be extended even further. This extension
can be briefly characterized in the following way:

              SVM                                  Extended view
  – w is the normal vector of the      – W is a linear operator projecting the
    separating hyperplane.               feature space into the label space.
  – y_i ∈ {−1, +1} are binary outputs. – y_i ∈ Y are arbitrary outputs.
  – The labels are equal to the        – ψ(y_i) ∈ H_ψ are the labels, the embed-
    binary outputs.                      ded outputs in a linear vector space.
   If we apply a one-dimensional normalized label space with binary labels
{−1, +1} in the general framework, one can restore the original scenario of the
SVM, where the normal vector is a projection into the one-dimensional label space.
   To summarize the learning task, we end up with the following optimization
problem when compared to the original primal form of the SVM:
               Primal problems for maximum margin learning

                   Binary class learning             Vector label learning
                   Support Vector Machine (SVM)      Maximum Margin Regression (MMR)
  min              ½‖w‖₂² + C1′ξ                     ½‖W‖_F² + C1′ξ
  w.r.t.           w : H_φ → ℝ, normal vector,       W : H_φ → H_ψ, linear operator,
                   b ∈ ℝ, bias,                      b ∈ H_ψ, translation (bias),
                                     ξ ∈ ℝ^m, error vector,
  s.t.             y_i(w′φ(x_i) + b) ≥ 1 − ξ_i,      ⟨ψ(y_i), Wφ(x_i) + b⟩_{H_ψ} ≥ 1 − ξ_i,
                                     ξ ≥ 0,  i = 1, . . . , m.

   In the extended formulation we exploit the fact that the Frobenius norm and
inner product correspond to the linear vector space of matrices with dimension
equal to the number of elements of the matrices; hence they give an isomorphism
between the space spanned by the normal vector of the hyperplane occurring in
the SVM and the space spanned by the linear transformations.
   One can recognize that if no bias term is included in the MMR problem, then
we have a completely symmetric relationship between the label and the feature
space via the representations of the input and the output items, namely

       ⟨ψ(y_i), Wφ(x_i)⟩_{H_ψ} = ⟨W*ψ(y_i), φ(x_i)⟩_{H_φ} = ⟨φ(x_i), W*ψ(y_i)⟩_{H_φ}.

   Thus, in predicting the input items as the image of a linear function defined
on the outputs, the adjoint of W, W*, is involved. This adjoint is equal to
the transpose of the matrix representation of W whenever both the label space
and the feature space are finite-dimensional.

4.3   Dual problem
The dual problem of MMR presented in (4.2) is given by
           min    Σ_{i,j=1}^m α_i α_j ⟨φ(x_i), φ(x_j)⟩ ⟨ψ(y_i), ψ(y_j)⟩ − Σ_{i=1}^m α_i,
           w.r.t. α_i ∈ ℝ,
           s.t.   Σ_{i=1}^m (ψ(y_i))_t α_i = 0,  t = 1, . . . , dim(H_ψ),
                  0 ≤ α_i ≤ C,  i = 1, . . . , m,

where the kernel items κ^φ_{ij} = ⟨φ(x_i), φ(x_j)⟩ correspond to the feature vectors,
and the kernel items κ^ψ_{ij} = ⟨ψ(y_i), ψ(y_j)⟩ correspond to the label vectors.
     The symmetry of the objective function is clearly recognizable, showing that
the underlying problem without bias is completely reversible. The explicit oc-
currences of the label vectors can be transformed into implicit ones by exploiting
that the feasibility domain covered by the constraints Σ_{i=1}^m (ψ(y_i))_t α_i = 0, t =
1, . . . , dim(H_ψ), coincides with the domain of Σ_{i=1}^m κ^ψ_{ij} α_i = 0, j = 1, . . . , m,
which involves only inner products of the label vectors.
     In the case of the original SVM, κ^ψ_{ij} collapses into the product y_i y_j of the
binary labels +1 and −1.
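
    As an illustration of the structure of this dual, the following projected-gradient
sketch minimizes the objective over the box constraints only; the equality constraints,
which are tied to the bias term, are dropped here (the bias-free case discussed above),
and the function name, step size and iteration count are arbitrary choices of ours.

    import numpy as np

    def mmr_dual_solve(K_phi, K_psi, C=1.0, step=1e-3, n_iter=2000):
        """Projected-gradient sketch for the bias-free MMR dual:
        minimize alpha' (K_phi * K_psi) alpha - 1' alpha  s.t.  0 <= alpha_i <= C."""
        M = K_phi * K_psi                 # element-wise (Schur) product of the two kernels
        alpha = np.zeros(M.shape[0])
        for _ in range(n_iter):
            grad = 2.0 * M @ alpha - 1.0  # gradient of the quadratic objective
            alpha = np.clip(alpha - step * grad, 0.0, C)  # project back onto the box
        return alpha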

4.4   Prediction
After solving the dual problem, with the help of the optimum dual variables we
can write up the optimal linear operator

                          W = Σ_{i=1}^m α_i ψ(y_i) φ(x_i)′.
   We can interpret this expression by comparing it to the corresponding formula
which gives the optimal solution of the SVM, i.e. w = Σ_{i=1}^m α_i y_i φ(x_i). The
new part is the vectors representing the output items, which in the SVM were
only scalar values but in the new interpretation can be seen as one-dimensional
vectors. With the expression of the linear operator W at hand, the prediction
for a new input item x can be written as
            ψ(y) = Wφ(x) = Σ_{i=1}^m α_i ψ(y_i) ⟨φ(x_i), φ(x)⟩ = Σ_{i=1}^m α_i ψ(y_i) κ^φ(x_i, x),

which involves only the input kernel κ^φ and provides the implicit representation
of the prediction ψ(y) of the corresponding output y.
    Because only the implicit image of the output is given, we need to invert the
function ψ to obtain the corresponding y. This inversion problem is called the
pre-image problem. Unfortunately there is no general procedure for it. We
mention here a scheme that can be applied when the set of all possible outputs is
finite with a reasonably small cardinality. The meaning of "reasonably small"
depends on the given problem, e.g. on how expensive it is to compute the inner
product between the output items in the label space where they are represented.
    Under the conditions mentioned we can follow this scheme:

      y* = arg max_{y∈Ỹ} ψ(y)′Wφ(x)
         = arg max_{y∈Ỹ} Σ_{i=1}^m α_i ⟨ψ(y), ψ(y_i)⟩ ⟨φ(x_i), φ(x)⟩
         = arg max_{y∈Ỹ} Σ_{i=1}^m α_i κ^ψ(y, y_i) κ^φ(x_i, x),

where Ỹ = {y_1, . . . , y_N} is the set of possible outputs.
    The main advantage of this approach is that it requires only the inner prod-
ucts in label space. In addition, it is independent of the representation of the
output items and can be applied to any complex structural learning problem,
e.g. on graphs. The best candidate for Ỹ is probably the training set.
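
    Taking the training set as the candidate set Ỹ, the prediction rule above reduces
to an argmax over kernel values. A minimal sketch follows; the function name and
array layout are our own assumptions.

    import numpy as np

    def mmr_preimage(alpha, K_psi_train, k_phi_x):
        """Pre-image by exhaustive search over the training outputs.

        k_phi_x[i]        = kappa_phi(x_i, x) for the new input x,
        K_psi_train[j, i] = kappa_psi(y_j, y_i) between candidate and training outputs.
        Returns the index j of the best-scoring candidate output y_j."""
        scores = K_psi_train @ (alpha * k_phi_x)   # sum_i alpha_i kpsi(y_j, y_i) kphi(x_i, x)
        return int(np.argmax(scores))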

4.5    Hierarchy learning
As mentioned above, in this paper we focus on the case where the output space
is a labeled hierarchy (Figure 1a). The hierarchy learning is realized via an
embedding of each path going from a node to the root of the tree. Let V be
the set of nodes in the tree. A path p(v) ⊆ V is defined as the shortest path
from the node v to the root of the tree, and its length is equal to |p(v)|. The set
I = {1, . . . , |V|} gives an indexing of the nodes. The embedding is realized by a
vector-valued function ψ : V → ℝ^{|V|}, and the components of ψ(v) are given by

                  ψ(v)_i = r        if v_i ∉ p(v),
                  ψ(v)_i = s q^k    if v_i ∈ p(v) and k = |p(v)| − |p(v_i)|,        (4)
where r, q and s are the parameters of the embedding. The parameter q expresses
the diminishing weight of the nodes closer to the root. If q = 0, assuming
0⁰ = 1, then the intermediate nodes and the root are disregarded, and thus we have
a simple multiclass classification problem. The value of r can be 0, but some
experiments show that a nonzero value may help to improve the classification
performance. We conjecture that the best choice of the parameters is the one that
minimizes the correlation between all pairs of the label vectors.
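
    A small sketch of the embedding in (4), assuming the hierarchy is given as a
parent map over node indices 0, …, |V|−1 (the data structure and function name are
our own, not part of the original text):

    import numpy as np

    def path_embedding(parent, v, n_nodes, r=0.0, q=0.5, s=1.0):
        """Components of psi(v) as in Eq. (4): r off the path, s*q^k on the path,
        where k is the number of edges from v up to the path node."""
        psi = np.full(n_nodes, r, dtype=float)
        node, k = v, 0
        while node is not None:            # walk the path p(v) up to the root
            psi[node] = s * q ** k         # k = |p(v)| - |p(node)|
            node = parent[node]
            k += 1
        return psi

    # Example: root 0 with children 1 and 2, node 2 having a child 3
    # parent = {0: None, 1: 0, 2: 0, 3: 2};  path_embedding(parent, 3, 4)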

4.6    Input and output kernels
For the concrete learning task we need to construct the input and output kernels.
To build the input kernel, the second component of the Kronecker decomposition
of each image, the matrix B in (1), is used. The inner product between those
matrices is computed by applying the Frobenius inner product. The output ker-
nel is created from the inner products of the vectors representing the paths in the
hierarchy as defined in (4).
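
   The two kernels of this section can be sketched directly from their definitions;
the helper names and the stacking of the data are our own assumptions:

    import numpy as np

    def input_kernel(filters):
        """Frobenius inner products <B_i, B_j>_F between the Kronecker filters B."""
        F = np.stack([B.ravel() for B in filters])   # one row per vectorized filter
        return F @ F.T

    def output_kernel(psi_vectors):
        """Inner products of the hierarchy path embeddings of Eq. (4)."""
        Y = np.stack(psi_vectors)                    # one row per label vector psi(y_i)
        return Y @ Y.T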


5     Experimental evaluation

5.1   ImageCLEF multi-label classification [4, 5]

The challenge we participated in was the characterization of compound figures.
These figures contain subfigures of different types and from different sources (see
Figure 1b,c for two examples). The task consists of labeling the compound figures
with each of the 30 classes that appear in the hierarchy in Figure 1a, without
knowing where the separation lines are. The training set consists of 1,071 figures,
and the test set consists of 927 figures.


5.2   Results on the challenge

In the computation of the prediction results, a 5-fold cross-validation procedure
is applied. The original dataset is split uniformly at random into 5 equal
parts. Then, each part is chosen as test data in a loop and the remaining four
parts are taken as training data. In the learning procedure, first, a kernel is
computed from the corresponding features. The parameters corresponding to each
kernel are found by cross-validation restricted to the training data: it is divided
into a validation-training and a validation-test part, the learner is trained only
on the validation-training items, and the parameter values which maximize the
F1 score on the validation-test part are chosen. We report here on two
types of kernels: polynomial and Gaussian.
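
     For reference, the two kernel families used in the runs could be computed from
row-wise feature vectors as sketched below; the exact feature construction and the
offset in the polynomial kernel are not specified in the text, so they are assumptions
here.

    import numpy as np

    def polynomial_kernel(F, degree=3, c=1.0):
        """Polynomial kernel of a given degree on row-wise feature vectors F."""
        return (F @ F.T + c) ** degree

    def gaussian_kernel(F, sigma=0.5):
        """Gaussian kernel with standard deviation sigma."""
        sq = np.sum(F ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * F @ F.T   # squared pairwise distances
        return np.exp(-d2 / (2.0 * sigma ** 2))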
     We submitted ten runs to the challenge, providing our predictions on the
labels for the test set (before it was made public). For the ten runs we used
a third-degree polynomial kernel; the only factor changing between runs was the
random selection in the 5-fold cross-validation. Generation of the feature vectors
for the training set took around 60 minutes. The training of MMR took
around one minute, and obtaining the labels for the test set took less than a
minute for each run. The ImageCLEF organizers provided the Hamming loss,
which is a classical measure for multi-label classification tasks and evaluates the
fraction of wrong labels relative to the total number of labels. The perfect case
would be a Hamming loss of 0. The Hamming loss values for the challenge in our
case were exceptionally low (very close to 0) and ranged from 0.0671 to 0.0817.
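
     The Hamming loss as described above, for binary label matrices with one row per
figure and one column per class (a sketch of ours, not the organizers' evaluation
code):

    import numpy as np

    def hamming_loss(Y_true, Y_pred):
        """Fraction of wrong labels over all (figure, class) pairs; 0 is perfect."""
        return float(np.mean(Y_true != Y_pred))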
     Once the test set was made available we performed an extra evaluation. We used
three other evaluation measures that are popular in multi-label classification,
namely precision, recall and their combination into the F1 score. They are given
by a combination of the true positives T_p, false positives F_p and false negatives
F_n:

           P = T_p / (T_p + F_p),   R = T_p / (T_p + F_n),   F1 = 2PR / (P + R),    (5)

Fig. 1. a) The hierarchy of classes in the ImageCLEF 2015 [4, 5] multi-label challenge;
b) and c) are two figure examples.



where P is the precision and R is the recall. Here, the perfect case would have a
recall value of 1 for any precision. The F1 measure combines both values into one
so that false positives and false negatives are taken into account in this one value.
Precision-Recall curves for six different Kronecker 2D filter sizes (4, 8, 12, 20, 28,
34) are given in figure 2a for polynomial kernels of different degrees and in figure
3a for Gaussian kernels having different standard deviations.
F1 scores are in figures 2b and 3b. The parameter for the Precision-Recall curve
(and the F1 plot) in the polynomial kernel was the degree of the polynomial,
from 1 to 10. The parameter that was varied in the Gaussian kernel to generate
its Precision-Recall curve (and the F1 plot) was the standard deviation of the
Gaussian: 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 5 and 10.
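
    A micro-averaged reading of Eq. (5) for binary label matrices (the averaging
convention is our assumption; the paper does not state it):

    import numpy as np

    def precision_recall_f1(Y_true, Y_pred):
        """Micro-averaged precision, recall and F1 as in Eq. (5)."""
        tp = np.sum((Y_true == 1) & (Y_pred == 1))
        fp = np.sum((Y_true == 0) & (Y_pred == 1))
        fn = np.sum((Y_true == 1) & (Y_pred == 0))
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return p, r, 2 * p * r / (p + r)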

Fig. 2. Results for six filter sizes: 4, 8, 12, 20, 18 and 32, training with a polynomial
kernel of degrees 1 to 10. a) Precision and Recall, b) F1 score.

Fig. 3. Results for six filter sizes: 4, 8, 12, 20, 18 and 32, training with a Gaussian
kernel with stdev 0.01 to 10. a) Precision and Recall, b) F1 score.

    These results show that larger filter sizes provide better results, although at
the largest filter sizes Precision, Recall and F1 score are very similar. Regarding
kernels, when using a polynomial kernel there is a dramatic increase in F1
scores when using a cubic kernel as compared to a linear or quadratic one,
although for kernels of degree 4 and larger the F1 scores are very similar. In the
case of a Gaussian kernel, the best scores occur at standard deviations smaller
than 1, although intermediate values (e.g. 0.5, 0.6) provide better results than
very small values (e.g. 0.01, 0.05). The best F1 score using a polynomial kernel
was 0.38; in the case of a Gaussian kernel, the highest F1 score was 0.43.


        6    Conclusions

We have presented an approach based on a structured decomposition of the en-
vironment. For example, the elements appearing in a scene can be incorporated
into a graph in which the objects play the role of the vertices and the edges relate
to the distances between those objects. The knowledge about the environment
can then be represented by the adjacency matrix of the graph. By decomposing the
image matrix into a similar structure, e.g. into a sequence of Kronecker prod-
ucts, the structure behind the scene can be captured. For classification we have
applied a version of a maximum margin based regression (MMR) technique [12].
MMR relies on the fact that the normal vector of the separating hyperplane
can be interpreted as a linear operator mapping the feature vectors of input
items into the space of the feature vectors of the outputs. The evaluation of
our methodology on the ImageCLEF 2015 [4, 5] multi-label challenge provided
promising results.

7    Acknowledgement
The research leading to these results received funding from the EU 7th Frame-
work Programme FP7/2007-2013 under grant agreement no. 270273, Xperience.

References
 1. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
    document recognition. Proceedings of the IEEE 86(11) (1998) 2278–2324
 2. Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisser-
    man, A.: The pascal visual object classes challenge: A retrospective. International
    Journal of Computer Vision 111(1) (2014) 98–136
 3. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
    Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large
    Scale Visual Recognition Challenge. International Journal of Computer Vision
    (IJCV) (2015)
 4. Villegas, M., Müller, H., Gilbert, A., Piras, L., Wang, J., Mikolajczyk, K., de Her-
    rera, A.G.S., Bromuri, S., Amin, M.A., Mohammed, M.K., Acar, B., Uskudarli, S.,
    Marvasti, N.B., Aldana, J.F., del Mar Roldán Garcı́a, M.: General Overview of
    ImageCLEF at the CLEF 2015 Labs. Lecture Notes in Computer Science. (2015)
 5. Garcı́a Seco de Herrera, A., Müller, H., Bromuri, S.: Overview of the ImageCLEF
    2015 medical classification task. In: Working Notes of CLEF 2015 (Cross Language
    Evaluation Forum). CEUR Workshop Proceedings (September 2015)
 6. Taskar, B., Guestrin, C., Koller, D.: Max-margin Markov networks. In: NIPS 2003.
    (2003)
 7. Altun, Y., Tsochantaridis, I., Hofmann, T.: Hidden Markov support vector ma-
    chines. In: ICML'03. (2003) 3–10
 8. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods
    for structured and interdependent output variables. Journal of Machine Learning
    Research (JMLR) 6(Sep) (2005) 1453–1484
 9. Rousu, J., Saunders, C., Szedmak, S., Shawe-Taylor, J.: Learning hierarchical
    multi-category text classification models. In: ICML. (2005)
10. Bakir, G., Hofmann, T., Schölkopf, B., Smola, A.J., Taskar, B., Vishwanathan,
    S.V.N., eds.: Predicting Structured Data. MIT Press (2007)
11. Van Loan, C.F.: The ubiquitous Kronecker product. Journal of Computational
    and Applied Mathematics 123 (2000) 85–100
12. Xiong, H., Szedmak, S., Piater, J.: Scalable, Accurate Image Annotation with
    Joint SVMs and Output Kernels. Neurocomputing (2015)