-

IIS at ImageCLEF 2015: Multi-label classi cation task

Antonio J Rodr guez-Sanchez

antonio.rodriguez-sanchez@uibk.ac.at 1

Sabrina Fontanella

fontanellasabrina@gmail.com 0 1

Justus Piater

justus.piater@uibk.ac.at 1

Sandor Szedmak

sandor.szedmak@uibk.ac.at 1 0 Department of Computer Science, University of Salerno , Italy 1 Intelligent and Interactive Systems, Department of Computer Science, University of Innsbruck, Austria https://iis.uibk.ac.at/

We propose an image decomposition technique that captures the structure of a scene. An image is decomposed into a matrix that represents the adjacency between the elements of the image and their distance. Images decomposed this way are then classi ed using a maximum margin regression (MMR) approach where the normal vector of the separating hyperplane maps the input feature vectors into the outputs vectors. Multiclass and multilabel classi cation are native to MMR, unlike other more classical maximum margin approaches, like SVM. We have tested our approach with the ImageCLEF 2015 multi-label classi cation task, obtaining high rankings at that task.

ImageCLEF Kronecker decomposition Maximum Margin MMR SVM multi-label classi cation medical images

Automatic image classi cation is a fundamental part of computer vision. An image is classi ed according to the visual content it contains. Tasks in image classi cation include if an image contains a certain object, person, animal or plant; if the image is from a street or it is indoors; or in the case that applies to this paper, if it is a medical gure and the type of medical images and/or graphs it contains. Image classi cation spans several decades from the rst character or digit recognition challenges (still used today), such as the MNIST dataset [ 1 ] to more recent, challenging image classi cation tasks, such as the Pascal [ 2 ], Imagenet [ 3 ] or ImageCLEF [ 4, 5 ] challenges.

Image classi cation algorithms usually consist of a rst step where keypoints or regions are found, which are then assigned a representation in terms of a feature vector. During training, the feature vectors extracted from a set of training images are usually grouped into histograms that approximate the distribution of the features for the di erent types of images. These histograms compose the input to a classi cation algorithm, such as an SVM (discriminative) or Naive Bayes (generative).

One of the central problems in exploring the general structure of an image is to recognize the relations between the objects appearing on the image. The task is not really the recognition of the objects but rather building a model on the structure: what belongs to what and how they can be related. It might be similar to the way how an animal could observe the world without labels attached to the object but only relying on relations among them in a certain environment. Those relations could provide the knowledge needed to identify scenes.

One of the most popular streams of machine learning research is to nd efcient methods for learning structured outputs. Several researchers introduced similar approaches to these kind of problems [6{10]. Those methods directly incorporate the structural learning into a specially chosen optimization framework.

It is generally assumed that to learn a discriminating function when the output space is a labeled hierarchy is a much more complex problem than binary classi cation. In this paper we show that the complexity of this kind of problem can be detached from the optimization model and can be expressed by an embedding into Hilbert space. This allows us to apply a universal optimization model, processing inputs and outputs represented in a properly chosen Hilbert space which can solve the corresponding optimization task without tackling with the underlying structural complexity. The optimization model is an implementation of a certain type of maximum margin regression, an algebraic generalization of the well-known Support Vector Machine. The computational complexity of the optimization scales only with the number of input-output pairs and it is independent from the dimensions of both spaces. Furthermore its overall complexity is equal to a binary classi cation. Our approach can be easily extended towards other structural learning problems without giving up e ciency on the basic optimization framework.

Three fundamental steps are needed for structural learning: Embedding The structures of the input and output objects are represented as abstract vectors in properly chosen Hilbert spaces re ecting the similarity and the dissimilarity of the objects.

Optimization The optimization phase is implemented via a universal solver which tries to nd the best similarity based matching between the input and the output representations. Since these representation are expressed as general vectors, the optimizer needs not directly tackle the underlying structural complexity.

Inversion The optimizer provides a decision function which emits a vector. The inversion phase has to nd the best tting output structure by projecting the image vector back. This is often referred to as the pre-Image problem, see some alternatives presented in [ 10 ]. If the embedding is realized as a bijective mapping the inversion task is well de ned.

We will make use of the following mathematical notation conventions in the rest of the paper: X stands for the space of the input objects, Y for the space of the outputs. H is a Hilbert space comprising the feature vectors, the images of the input vectors with respect to the embedding (). H is a Hilbert space comprising the image of label vectors with respect to the embedding (). W is a matrix representing the linear operator projecting the feature space H into H . h:; :iHz denotes the inner product in Hilbert space Hz, k:kHz is the norm de ned in Hilbert space Hz. tr(W) is the trace of matrix W. dim(H) is the dimension of the space H. x1 x2 denotes the tensor product of the vectors x1 2 H1 and x2 2 H2, and it represents a linear operator A : H2 ! H1 which acts on a vector z 2 H2 as (x1 x2)z d=ef (x1x02)z = x1hx2; ziH2 . hA; BiF is the Frobenius inner product of a matrix represented by the linear operators A and B and it is de ned by tr(A0B). kAkF stands for the Frobenius norm of a matrix represented by the linear operator A and de ned by phA; AiF . A B is the element-wise(Schur) product of the matrices A and B. A0; a0 is the transpose of any matrix A or any vector a. 2

Image feature generation via decomposition

Let us consider a real 2D image decomposition, where we can expect that the points close to each other within continuous 2D blocks relate more strongly to each other than only considering their connection in 1D rows and columns. To represent the image decomposition, the Kronecker product is applied, which can be expressed as

X = A 7. k = k + 1 8. Goto 3

A(k)

B(k)k2

The question is, if X is given, how do we compute A and B? It turns out that the Kronecker decomposition can be carried out by Singular Value Decomposition (SVD) working on a reordered representation of the matrix X.

For an arbitrary matrix X with size m n the SVD is given by X = USVT where U 2 Rm m is an orthogonal matrix, UUT = Im, of left singular vectors, V 2 Rn n, is an orthogonal matrix, VVT = In, of right singular vectors, and S 2 Rm n, is a diagonal matrix containing the singular values with nonnegative components in its diagonal. 2.1

Reordering of the matrix

Since the algorithm solving the SVD problem does not depend directly on the order of the elements of the matrix [ 11 ], any permutation of the indexes, i.e. reordering the columns and(or) rows, preserves the same solution. 2.2

Kronecker decomposition as SVD

The solution to the Kronecker decomposition via the SVD can be found in [ 11 ]. This approach considers the aforementioned observation regarding the invariance of the SVD on the reordering of the matrix elements.

In order to show how the reordering of matrix X can help to solve the Kronecker decomposition problem we present the following example. The matrices in the Kronecker product

X = A 2 x11 x12 x13 x14 x15 x16 3 6 x21 x22 x23 x24 x25 x26 77 2 a11 a12 a13 3 66 x31 x32 x33 x34 x35 x36 77 = 4 a21 a22 a23 5 66 x41 x42 x43 x44 x45 x46 77 a31 a32 a33 64 x51 x52 x53 x54 x55 x56 5 x61 x62 x63 x64 x65 x66 B b11 b12 ; b21 b22 can be reordered into

X~ = A~

2 b11 3 = 664 bb1221 757 b22

2 x11 x13 x15 x31 x33 x35 x51 x53 x55 3 B~ = 664 xx1221 xx1243 xx1265 xx3421 xx3443 xx3465 xx5621 xx5643 xx5665 757

x22 x24 x26 x42 x44 x46 x62 x64 x66 a11 a12 a13 a21 a22 a23 a31 a32 a33 ; where the blocks of X and the matrices A and B are vectorized in row wise order. In this vectorization we follow that order which is applied in most of the well known programming languages, C, Java, Python, MATLAB, instead of the column wise order, e.g. used in the Fortran language.

We can recognize that X~ = A~ B~ can be interpreted as the rst step in the SVD algorithm where we might apply the substitution psu = A~ and psv = B~. The proof that this reordering generally provides the correct solution to the Kronecker decomposition can be found in [ 11 ].

We can summarize the main steps of the Kronecker decomposition in the following steps: 1. Reorder(reshape) the matrix, 2. Compute the SVD decomposition, 3. Compute the approximation of X~ by A~ 4. Invert the reordering. ~ B

This kind of Kronecker decomposition is often referred as Nearest Orthogonal Kronecker Product as well [ 11 ]. 3

Learning task

The learning task that we are going to solve is the following: There is a set - called sample - of pairs of output and input objects f(yi; xi) : yi 2 Y; xi 2 X ; i = 1; : : : ; m; g independently and identically chosen out of an unknown multivariate distribution P(Y; X). Here we would like to emphasize that the input and the output objects can be arbitrary, e.g. they may be graphs, matrices, functions, probability distributions etc. To these objects, let's consider two functions : X ! H and : Y ! H mapping the input and output objects respectively into linear vector spaces, called from now on, the feature space in case of the inputs and the label space when the outputs are considered.

The objective is to nd a linear function acting on the feature space f ( (x)) = W (x) + b; that produces a prediction of every input object in the label space and in this way could implicitly give back a corresponding output object. Formally we have y = 1( (y)) = 1(f ( (x))): (2) (3) The learning procedure can be summarized as follows:

Embedding Similarity transformation Inversion

X H : iznput}|space{ ! fzeatur}e|space{; H

Y : zoutpu}t|space{ ! lzabel}|spac{e; Wf = (W; b) ) (y)

Wf (x);

H Y 1 : lzabel}|spac{e ! zoutpu}t|spac{e : 4 4.1

Optimization model The \Classical" scheme of Support Vector Machine (SVM)

In the framework of the Support Vector Machine the outputs represent two classes and the labels are chosen out of the set yi 2 f 1; +1g. The aim is to nd a separating hyperplane, via its normal vector, such that the distance between the elements of the two classes, called margin, is the largest one measured in the direction of this normal vector. This base scheme can be extended allowing some sample items to fall closer to the separating hyperplane than to the margin.

This learning scenario can be formulated as an optimization problem: min 12 kwk22 + C10 w.r.t. w : H ! R; normal vector

b 2 R; bias; 2 Rm; error vector s.t. yi(w0 (xi) + b) 1 i

0; i = 1; : : : ; m: 4.2

Reinterpretation of the normal vector w

The normal vector w formally behaves as a linear transformation acting on the feature vectors whose capabilities can be even further extended. This extension can be characterized brie y in the following way

SVM ExtendedView

{ w is the normal vector of the { W is a linear operator projecting the separating hyperplane. feature space into the label space. { yi 2 f 1; +1g binary outputs. { yi 2 Y arbitrary outputs { The labels are equal to the bi- { (yi) 2 H are the labels, the embednary objects. ded outputs in a linear vector space If we apply a one-dimensional normalized label space invoking binary labels f 1; +1g in the general framework, one can restore the original scenario of the SVM, and the normal vector is a projection into the one dimensional label space.

To summarize the learning task, we end up in the following optimization problem when compared to the original primal form of the SVM: min

Primal problems for maximum margin learning

Binary class learning Vector label learning Support Vector Machine(SVM) Maximum Margin Regression(MMR) 21 kwk22 + C10 12 kWk2F + C10 w.r.t. w : H ! R; normal vector

W : H ! H ; linear operator; b 2 R; bias;

b 2 H ; translation(bias); s.t.

yi(w0 (xi) + b) 1 i;

In the extended formulation we exploit the fact that the Frobenius norm and inner product correspond to the linear vector space of matrices with dimension equal to the number of elements of the matrices, hence it gives an isomorphism between the space spanned by the normal vector of the hyperplane occurring in the SVM and the space spanned by the linear transformations.

One can recognize that if no bias term is included in the MMR problem then we have a completely symmetric relationship between the label and the feature space via the representations of the input and the output items, namely After solving the dual problem with the help of the optimum dual variables we can write up the optimal linear operator

W = Pm

i=1 i (yi) (xi)0:

We can solve this expression by comparing it to the corresponding formula which gives the optimal solution to the SVM, i.e. w = Pim=1 iyi (xi). The new part includes the vectors representing the output items which in the SVM were only scalar values but we could say in the new interpretation that they are one-dimensional vectors. With the expression of the linear operator W at hand, the prediction to a new input item x can be written as (y) = W (x) = Pm i=1 i (yi) h (xi); (x)i : | ({xzi;x) } which involves only the input kernel and provides the implicit representation of the prediction (y) to the corresponding output y.

Because only the implicit image of the output is given, we need to invert the function to obtain its corresponding y. This inversion problem is called the pre-image problem. Unfortunately there is no general procedure to do that. We mention here a scheme that can be applied when the set of all possible outputs is nite with a reasonable small cardinality. The meaning of the \reasonable small" cardinality depends on the given problem, e.g. how expensive is to compute the inner product between the output items in the label space where they are represented.

At the conditions mentioned we can follow this scenario y = arg maxy2Ye (y)0W (x)

(y;yi) (xi;x) = arg maxy2Ye Pim=1 i hz (y); (yi){i hz (xi})|0 (x)i

}| { where y 2 Ye = fy1; : : : ; yN g ( is the set of the possible outputs The main advantage of this approach is that it requires only the inner products in label space. in addition to this, it is independent from the representation of the output items and can be applied in any complex structural learning problem, e.g. on graphs. Probably the best candidate for Ye could be the training set. 4.5

Hierarchy learning

As mentioned above in this paper we focus on the case where the output space is a labeled hierarchy (Figure 1a). The hierarchy learning is realized via an embedding of each path going from a node to the root of the tree. Let V be the set of nodes in the tree. A path p(v) V is de ned as a shortest path from the node v to the root of the tree and its length is equal to jp(v)j. The set I = 1; : : : ; jV j gives an indexing of the nodes. The embedding is realized by a vector valued function : V ! RjV j, and the components of (v) are given by (v)i = r if vi 2= p(v); sqk if vi 2 v(p) and k = jp(v)j jp(vi)j; (4) where r; q; s are the parameters of the embedding. The parameter q expresses the diminishing weight of the nodes being closer to the root. If q = 0, assuming 00 = 1, then the intermediate nodes and the root are disregarded, thus we have a simple multiclass classi cation problem. The value of r can be 0 but some experiments show it may help to improve the classi cation performance. We might conjecture the best choice of the parameters are those which minimize the correlation between all pairs of the label vectors.

4.6 Input and output kernels

For the concrete learning task we need to construct the input and output kernels. To build the input kernel, the second component of the Kronecker decomposition of each image - the matrix B in (1) - is used. The inner product between those matrices is computed by applying the Frobenius inner product. The output kernel is created from the inner products of the vectors representing the path in the hierarchy in (4). 5 5.1

Experimental evaluation ImageCLEF multi-label classi cation [4, 5]

The challenge we participated was the characterization of compound gures. These gures contain sub gures from di erent types and sources (see gure 1b,c for two examples). The task consists of labeling the compound gures with each of the 30 classes that appear in the hierarchy in gure 1a without knowing where the separation lines are. The training set consists of 1,071 gures, the test set consists of 927 gures. 5.2

Results on the challenge

In the computation of the prediction results, a 5-fold cross-validation procedure is applied. The original dataset is split uniformly and randomly into 5 equal parts. Then, each part is chosen as test data in a loop and the remaining four parts are taken as training. In the learning procedure, rst, a kernel is computed from the corresponding features. Parameters corresponding to each kernel are found by cross validation restricted to the training data, namely it is divided into validation test and validation training parts. Then the learner is trained only on the validation training items. The values of the parameters are chosen which maximize the F 1 score on the validation test. We will report here on two types of kernels: polynomial and Gaussian.

We submitted ten runs to the challenge providing our predictions on the labels for the test set (before it was made public). For the ten runs, we used a third degree polynomial kernel, the only factor changing at each run was the random selection in the 5-fold cross-validation. Generation of the feature vectors for the training set took around 60 minutes. The training of MMR would take around one minute, and obtaining the labels for the test set took less than a minute for each run. The ImageCLEF organizers provided the Hamming loss, which is a classical measure for multi-label classi cation tasks and evaluates the fraction of wrong labels to the total number of labels. The perfect case would be obtaining a Hamming loss of 0. Hamming loss values for the challenge in our case were exceptionally low (very close to 0) and ranged from 0.0671 to 0.0817.

Once the test set was available we performed extra evaluation. We used three other di erent evaluation measures that are popular in multi-label classi cation, namely precision, recall and their combination into the F1 score. They are given by a combination of the true positives Tp, false positives Fp and false negatives Fn:

P = TpT+pFp ; R = TpT+pFn ; F 1 = P2P+RR (5) where P is the precision and R is the recall. Here, the perfect case would have a recall value of 1 for any precision. The F 1 measure combines both values into one so that false positives and false negatives are taken into account in this one value. Precision-Recall curves for six di erent Kronecker 2D lter sizes (4, 8, 12, 20, 28, 34) are given in gure 2a for polynomial kernerls of di erent degrees and in gure 3a for Gaussian kernels having di erent standard deviations. Their respective F1 scores are in gures 2b and 3b. The parameter for the Precision-Recall curve (and the F1 plot) in the polynomial kernel was the degree of the polynomial, from 1 to 10. The parameter that was varied in the Gaussian kernel to generate its Precision-Recall curve (and the F1 plot) was the standard deviation of the Gaussian: 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 5 and 10.

Recall Recall

Degree of polynomial Standard deviation (b)

These results show that larger lter sizes provide better results, although at the largest lter sizes, Precision, Recall and F1 score are very similar. Regarding kernels, when using a polynomial kernel, there is a dramatically increase in F1 scores when using a cubic kernel as compared to a linear or quadratic one. Although at kernels of degree 4 and larger, F1 scores are very similar. In the case of a Gaussian kernel, the best scores happen at standard deviations smaller than 1, although values in the middle (e.g. 0.5, 0.6) provide better results than very small values (e.g. 0.01, 0.05). The best F1 score using a polynomial kernel was 0.38, in the case of a Gaussian kernel, the highest F1 score was 0.43. 6

Conclusions

We have presented an approach based on a structured decomposition of the environment. For example, elements appearing on a scene can be incorporated into a graph in which the objects play the role of the vertices and the edges related to the distances between those objects. Then the knowledge about the environment can be represented by the adjacency matrix of the graph. By decomposing the image matrix into a similar structure, e.g. into a sequence of Kronecker products, the structure behind the scene could be captured. For classi cation we have applied a version of a maximum margin based regression (MMR) technique [ 12 ]. MMR relies on the fact that the normal vector of the separating hyperplane can be interpreted as a linear operator mapping the feature vectors of input items into the space of the feature vectors of the outputs. The evaluation of our methodology in the ImageCLEF 2015 [ 4, 5 ] multi-label challenge provided promising results. 7

Acknowledgement

The research leading to these results received funding from the EU 7th Framework Programme FP7/2007-2013 under grant agreement no. 270273, Xperience.

1. LeCun , Y., Bottou , L. , Bengio , Y. , Ha

ner

, P.: Gradient-based learning applied to document recognition . Proceedings of the IEEE 86 ( 11 ) ( 1998 ) 2278 { 2324

2. Everingham , M. , Eslami , S.A. , Van Gool , L. , Williams , C.K. , Winn , J. , Zisserman , A. : The pascal visual object classes challenge: A retrospective . International Journal of Computer Vision 111 ( 1 ) ( 2014 ) 98 { 136

3. Russakovsky , O. , Deng , J. , Su , H. , Krause , J. , Satheesh , S. , Ma, S. , Huang , Z. , Karpathy , A. , Khosla , A. , Bernstein , M. , Berg , A.C. , Fei-Fei , L. : ImageNet Large Scale Visual Recognition Challenge . International Journal of Computer Vision (IJCV) ( 2015 )

4. Villegas , M. , Muller, H., Gilbert , A. , Piras , L. , Wang , J. , Mikolajczyk , K. , de Herrera , A.G.S. , Bromuri , S. , Amin , M.A. , Mohammed , M.K. , Acar , B. , Uskudarli , S. , Marvasti , N.B. , Aldana , J.F. , del Mar Roldan Garc a, M.: General Overview of ImageCLEF at the CLEF 2015 Labs . Lecture Notes in Computer Science. ( 2015 )

5. Garc a Seco de Herrera, A. , Muller, H., Bromuri , S. : Overview of the ImageCLEF 2015 medical classi cation task . In: Working Notes of CLEF 2015 ( Cross Language Evaluation Forum) . CEUR Workshop Proceedings (September 2015 )

6. Taskar , B. , Guestrin , C. , Koller , D. : Max-margin markov networks . In: NIPS 2003 . ( 2003 )

7. Altun , Y. , Tsochantaridis , I. , Hofmann , T. : Hidden markov support vector machines . In: ICML' 03 . ( 2003 ) 3 { 10

8. Tsochantaridis , I. , Joachims , T. , Hofmann , T. , Altun , Y. : Large margin methods for structured and interdependent output variables . Journal of Machine Learning Research (JMLR) 6(Sep) ( 2005 ) 1453 { 1484

9. Rousu , J. , Saunders , C. , Szedmak , S. , Shawe-Taylor , J.: Learning hierarchical multi-category text classi cation models . In: ICML. ( 2005 )

10. Bakir , G. , Hofman , T. , B. Scholkopf, A.J.S. , Taskar , B. , Vishwanathan , S.V.N., eds.: Predicting Structured Data . MIT Press ( 2007 )

11. Loan , C. : The ubiquitous kronecker product . Journal of Computational and Applied Mathematics 123 ( 2000 ) 85{100 The nearest Kronecker product .

12. Xiong , H. , Szedmak , S. , Piater , J.: Scalable, Accurate Image Annotation with Joint SVMs and Output Kernels . Neurocomputing ( 2015 )