Graph Based Method Approach to the ImageCLEF 2015 Task 1 - Image Annotation

Ludovic Dos Santos, Benjamin Piwowarski, and Patrick Gallinari

Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6 UMR 7606, 4 place Jussieu, 75005 Paris.
firstname.lastname@lip6.fr

Abstract. We address the tasks of image classification and tagging through a transductive approach that automatically learns to project the different images onto a common latent space. This learned representation is then used to classify the images. We construct a graph between images and concepts using the deep representations given by ImageCLEF and the WordNet database, and we exploit the idea that two connected nodes will tend to have similar latent representations. This assumption allows us to learn correlations between the labels of connected images.

Key words: Representation Learning, Graph-based Method, Deep Relations

1 Introduction

The ImageCLEF 2015 [3] image annotation task [2] is composed of two subtasks: Image Concept Detection and Localisation, and Generation of Textual Descriptions of Images. We participated in the former, whose objective is to predict which concepts are present in a given input image. The use of labeled data is allowed, and we therefore used the available ImageNet Convolutional Neural Network (CNN).

The system we propose is based on a model [?] that projects the nodes of a graph into a vector space (typically of dimension 100-200). This model requires that some nodes be labeled, and uses this information within the embedding process. To leverage the ImageNet data, where no relationship links images together, we had to build a graph. We experimented with different ways to construct a simple graph based on the output of the ImageNet CNN representations and the WordNet database [1], and present results on the CLEF task corresponding to different settings of the hyperparameters of the model.

The article is organized as follows: Section 2 presents an overview of our model and how we construct the graph based on the images and the WordNet database, Section 3 presents the representation and classifier learning, Section 4 our algorithm, and Section 5 outlines our results and conclusions.

2 Overview and graph construction

We propose a solution that relies on exploiting the correlations that exist between images with similar 1000-dimensional activations of the fc8 layer of the Oxford VGG 16-layer Convolutional Neural Network model (OV-CNN), provided in the ImageCLEF challenge. It aims at directly taking into account the correlations between the CNN labels of connected images. The underlying idea of this model is the following:

– We construct a graph mixing images and concepts, based on the CNN representation and the WordNet database.
– Each node is mapped onto a latent representation in a vector space ℝ^Z. The latent space is common to all node types (images and concepts).
– This latent representation defines a metric in ℝ^Z such that two connected nodes tend to have close representations (smoothness assumption).
– Not all nodes and relations are equal w.r.t. their position in the graph, their number of neighbors, etc. We use two functions, ψ_i and φ_{i,j}, to model the importance of nodes and relationships in the embedding process.
– A classification function is learned for the images. It takes as input a latent image representation and computes the associated class labels. A minimal sketch of the smoothness idea is given after this list.
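To make the smoothness assumption concrete, here is a minimal Python sketch; the node names, the toy graph, and the smoothness function are illustrative choices of ours, not part of the challenge code. It computes the quantity that the embedding step will drive down for connected nodes, using uniform edge weights in place of the φ_{i,j} defined in Section 3.2:

```python
import numpy as np

# Toy latent space: every node (image or concept) gets a vector in R^Z.
Z = 4
rng = np.random.default_rng(0)
nodes = ["img1", "img2", "bomber", "airplane"]
z = {n: rng.normal(size=Z) for n in nodes}

# Edges mix image-concept and concept-concept relations.
edges = [("img1", "bomber"), ("img2", "bomber"), ("bomber", "airplane")]

def smoothness(z, edges):
    """Sum over edges of ||z_i - z_j||^2 (uniform weights).

    Minimizing this pulls connected nodes towards close latent
    representations, which is the smoothness assumption."""
    return sum(float(np.sum((z[i] - z[j]) ** 2)) for i, j in edges)

print(smoothness(z, edges))  # decreases as connected nodes move closer
```

In the full model, each edge term is reweighted by φ_{i,j} and this penalty is minimized jointly with the classification loss of Section 3.1.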
All the notation used in these working notes is summarized in Table 1.

Notation     Meaning
x_i          node of the graph
A_k          1000-dimensional activations of the fc8 layer of an image k
z_i          latent representation of node x_i
z_i^copy     numerical copy of z_i (written z'_i in Equations 3 and 4)
𝒯            set of possible node types
T            number of node types
t_i          type of node x_i
r(i,j)       relation type between nodes x_i and x_j
i →_r j      relation of type r from i to j
R            number of relation types
N            number of nodes
x_1 … x_ℓ    labeled nodes used for the classification task
N_j^r        number of neighbors of x_j considering the relation r
E            total number of edges
E^r          total number of edges of type r
Y^t          set of categories associated with the node type t
C^t          cardinality of Y^t
y_i          vector of categories for the node x_i
y_i^c        value of category c for the node x_i

Table 1. Notations

2.1 Graph construction

To apply our method, we need to construct a graph. The resulting graph is composed of nodes of two types, images and (WordNet) concepts. To construct the graph, we connect images to a subset of the concepts and then enrich this graph using WordNet relationships between concepts.

We first directly use the representation A_k of an image, i.e. the 1000-dimensional activations of the OV-CNN. Each dimension of this vector corresponds to a label, and we suppose that an image is labeled with the i-th label if the corresponding component is above a threshold μ ∈ ℝ. In that case, each label being a precise WordNet concept, we connect the image to the corresponding concept in the graph.

We then use WordNet to connect concepts together. We selected two relationships that may be important when trying to label images: the hypernym-hyponym relationship ("is-a") and the meronymy relationship ("part-of"). In order to keep the graph size manageable, we started from the concepts of the CNN representation and added all the concepts that were necessary to link the different concepts together.

For example, suppose we have four images, two classified "bomber" by the OV-CNN and two classified "jet plane". Both concepts are hyponyms of "airplane". In the corresponding graph, the first two images are thus connected to the "bomber" node, the other two to the "jet plane" node, and both concept nodes are connected to their common hypernym "airplane". The final graph has 25,730,835 image-concept edges and 4,309 concept-concept edges.

2.2 Training set

The initial training set for subtask 1 contains 1,979 labeled images (out of the 500,000 images in the entire dataset) for 250 concepts. That gives us 7.9 images per concept on average, which is a relatively small number.

To increase the number of training examples, we again used the A_k representation and the corresponding OV-CNN concepts. We used another threshold ν ∈ ℝ to determine when an image should be labeled with a given OV-CNN concept. In practice, given an image k, if the i-th component of A_k is bigger than ν, we check whether the corresponding OV-CNN concept is a hypernym of an ImageCLEF concept C_clef. If so, we add image k to the training set, labeled with C_clef. When ν = 20, our training set contains 43,970 labeled images (175.9 images per concept), and when ν = 10, 446,130 images (1,784.5 images per concept).
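As an illustration of this weak-labeling step, the sketch below uses NLTK's WordNet interface. The threshold value, the two-entry CNN_SYNSETS list, and the CLEF_CONCEPTS mapping are hypothetical stand-ins for the real 1000-class mapping, and since the hypernym relation could reasonably be tested in either direction, the sketch checks both:

```python
import numpy as np
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

NU = 10.0  # threshold nu on the fc8 activations

# Hypothetical stand-ins: each OV-CNN output dimension maps to a WordNet
# synset, and each ImageCLEF concept is mapped to a synset as well.
CNN_SYNSETS = [wn.synset("bomber.n.01"), wn.synset("jet.n.01")]
CLEF_CONCEPTS = {"airplane": wn.synset("airplane.n.01")}

def hypernyms(syn):
    """All transitive hypernyms of a synset."""
    return set(syn.closure(lambda s: s.hypernyms()))

def weak_labels(A_k):
    """ImageCLEF labels derived from the fc8 activations A_k of image k."""
    labels = set()
    for i, activation in enumerate(A_k):
        if activation <= NU:
            continue  # component below the threshold: ignore this label
        cnn_syn = CNN_SYNSETS[i]
        for name, clef_syn in CLEF_CONCEPTS.items():
            # hypernym test between the OV-CNN and the ImageCLEF synsets
            if clef_syn in hypernyms(cnn_syn) or cnn_syn in hypernyms(clef_syn):
                labels.add(name)
    return labels

print(weak_labels(np.array([15.0, 3.0])))  # -> {'airplane'}
```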
3 Learning latent node representations

3.1 Classifier

The mapping onto the latent space is learned so that the labels of each type of node can be predicted from the latent representations. For that, we consider a linear classification function for each type of node k, denoted f_{θ^k}. This function takes as input a node representation and outputs the predicted label(s). The f-functions can be learned by minimizing a loss on labeled data as follows:

$$\sum_{i=1}^{\ell} \psi_i \, \Delta\left(f_{\theta^{t_i}}(z_i), y_i\right) \qquad (1)$$

where Δ(f_{θ^{t_i}}(z_i), y_i) is the loss of predicting the labels f_{θ^{t_i}}(z_i) instead of the observed labels y_i. Here ψ_i represents the relative importance of node i in the graph. For example, ψ_i = 1/N would define a model where all nodes have the same importance.

In our case, we consider that all nodes do not have the same importance: we want to be able to emphasize nodes that are more central with regard to the different relations. To do so, we consider that a node is more important when it has many neighbors of the different types:

$$\psi_i = \frac{1}{R} \sum_{r=1}^{R} \frac{N_i^r}{E^r}$$

In our implementation, we use a hinge-loss function for Δ:

$$\Delta\left(f_{\theta^t}(z), y\right) = \sum_{k=1}^{C^t} \max\left(0, 1 - y^k f_{\theta^t,k}(z)\right) \qquad (2)$$

where y^k is the desired score of category k for node x (−1 or +1) and f_{θ^t,k}(z) is the score of category k predicted by the model.

3.2 Transductive classification model

Let us denote by z_i ∈ ℝ^Z the hidden representation of node x_i, which is a vector of size Z. When updating z_i, in order to capture the graph metric in the latent space, we use the following loss, which embodies the metric smoothness assumption and forces linked nodes to have similar representations:

$$\sum_{j} \phi_{i,j} \, \|z_i - z'_j\|^2 \qquad (3)$$

We use an L2 norm in the latent space, but other metrics could be used as well. Furthermore, φ_{i,j} accounts for the importance of an edge w.r.t. the whole graph and for the shape of the graph around this specific edge.

We assume that all the edges of the graph should have the same impact regardless of their predominance (φ_{i,j} ∝ 1/E^{r(i,j)}). Then, as in the PageRank algorithm, we assume that the information coming from a neighbor j of i should be divided equally among its neighbors of the same type as i (φ_{i,j} ∝ 1/N_j^{r(i,j)}). Thus we set:

$$\phi_{i,j} = \frac{1}{R \, N_j^{r(i,j)} \, E^{r(i,j)}}$$

3.3 Loss function

The final objective loss of our model combines the classification and regularization losses (1) and (3):

$$L(z, \theta) = \sum_{i=1}^{\ell} \psi_i \, \Delta\left(f_{\theta^{t_i}}(z_i), y_i\right) + \lambda \sum_{i,j} \phi_{i,j} \, \|z_i - z'_j\|^2 \qquad (4)$$

The minimization of this function aims at finding a trade-off between the smoothness over the latent representations of correlated nodes and the prediction of the observed labels. Optimizing this loss allows us to learn:

– The projection z_i of each node x_i in the latent space.
– The classification function f_{θ^k} for each node type k, which maps the latent space to category scores.

4 Algorithm

Learning consists in minimizing the loss function defined in Equation 4. Different optimization methods can be considered; we have used Stochastic Gradient Descent (SGD) to learn the latent representations.

Furthermore, we can remark that φ_{i,j} and ψ_i can also be understood as a specific way of sampling during the SGD. Indeed, if we first choose the type of edge uniformly, then choose an edge uniformly given that type, and pick the nodes from the chosen edge, we update a given representation z_i in expectation $\frac{1}{R} \sum_{r=1}^{R} \frac{N_i^r}{E^r}$ times for the classification term, and the same number of times for the graph regularization term. Adding the PageRank-style division in the graph regularization term, we recover our previous assumptions. The algorithm is detailed below:

1:  for a fixed number of iterations do
2:      choose a relation type r uniformly at random
3:      pick uniformly at random an edge i →_r j of type r
4:      if i ≤ ℓ then
5:          θ ← θ − ε ∇_θ Δ(f_{θ^{t_i}}(z_i), y_i)
6:          z_i ← z_i − ε ∇_{z_i} Δ(f_{θ^{t_i}}(z_i), y_i)
7:      end if
8:      if j ≤ ℓ then
9:          θ ← θ − ε ∇_θ Δ(f_{θ^{t_j}}(z_j), y_j)
10:         z_j ← z_j − ε ∇_{z_j} Δ(f_{θ^{t_j}}(z_j), y_j)
11:     end if
12:     z_i ← z_i − ε λ ∇_{z_i} ||z_i − z_j||^2
13:     z_j ← z_j − ε λ ∇_{z_j} ||z_i − z_j||^2
14: end for
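As a complement to the pseudocode, here is a self-contained NumPy sketch of the SGD loop, under simplifying assumptions of ours: a single node type, a toy graph, random ±1 labels, and a linear classifier f_θ(z) = θz whose hinge subgradients follow Equation 2. All names and sizes are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: N nodes in total, the first ELL of them labeled, C categories,
# latent dimension Z; EPS is the gradient step epsilon, LAM the trade-off lambda.
N, ELL, C, Z = 10, 4, 3, 5
EPS, LAM = 0.01, 0.1

# Hypothetical graph: edges grouped by relation type, as in Section 2.1.
edges_by_type = {
    "image-concept": [(0, 5), (1, 5), (2, 6), (3, 7)],
    "concept-concept": [(5, 8), (6, 8), (7, 9)],
}

theta = rng.normal(scale=0.1, size=(C, Z))  # linear classifier (one node type here)
z = rng.normal(scale=0.1, size=(N, Z))      # latent representations z_i
y = rng.choice([-1.0, 1.0], size=(ELL, C))  # +-1 labels of the labeled nodes

def hinge_grads(theta, z_i, y_i):
    """Subgradients of the hinge loss (Equation 2) for f_theta(z) = theta @ z."""
    coef = -y_i * (1.0 - y_i * (theta @ z_i) > 0)  # d loss / d scores
    return np.outer(coef, z_i), theta.T @ coef     # d/d theta and d/d z_i

for _ in range(1000):
    r = rng.choice(list(edges_by_type))            # line 2: relation type at random
    i, j = edges_by_type[r][rng.integers(len(edges_by_type[r]))]  # line 3
    for n in (i, j):                               # lines 4-11: classification updates
        if n < ELL:
            g_theta, g_z = hinge_grads(theta, z[n], y[n])
            theta -= EPS * g_theta
            z[n] -= EPS * g_z
    diff = z[i] - z[j]                             # lines 12-13: smoothness updates
    z[i] -= EPS * LAM * 2.0 * diff
    z[j] += EPS * LAM * 2.0 * diff
```

Sampling first the relation type and then an edge uniformly is exactly what makes the weights ψ_i and φ_{i,j} appear in expectation, as argued above.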
The algorithm iteratively chooses, with the sampling method explained above, a pair of connected nodes, and then makes a gradient update of the parameters of the model. If one of the chosen nodes is among the ℓ labeled training nodes, the algorithm first performs an update according to the classification loss (Equation 1). This update (lines 5-6 and 9-10) consists in successively modifying the parameters θ of the classification function and the latent representations z_i and z_j so as to minimize the classification loss term. Here, ε is the gradient step, and λ is the trade-off between the classification and smoothness terms.

5 Results

As our model is not able to localize a concept in a given image, we only report results for the mean average precision with zero overlap (MAP-0).

Model                       MAP, 0 overlap
Z = 250, ν = 10, λ = 50     0.057422
Z = 100, ν = 10, λ = 50     0.056822
Z = 250, ν = 10, λ = 150    0.05625
Z = 100, ν = 10, λ = 150    0.056141
Z = 250, ν = 20, λ = 150    0.052665
Z = 250, ν = 20, λ = 50     0.052585

As we can see, the bigger the training set (i.e. the smaller ν is), the better our model performs. As the graph has many edges, our model needs a larger latent dimension Z in order to adapt to the graph topology. Furthermore, the more we constrain the representations of two connected nodes to be close (i.e. the bigger λ is), the worse the performance gets.

The results are quite low compared to other approaches, but we did not exploit (at least directly) any image-related features. Compared to a random approach, we are still able to extract some information from a graph constructed using only basic information about the classification of an image and the relationships between the labels/concepts. Our methodology would benefit from a real graph linking the images (e.g. a social network), and results would be much improved by using image-based features. In future work, we will combine this approach with image features to check whether they bring further information.

Acknowledgments. This work was supported in part by a grant from the Research Agency ANR (Agence Nationale de la Recherche) under the MLVIS project.

References

1. Christiane Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books, 1998.
2. Andrew Gilbert, Luca Piras, Josiah Wang, Fei Yan, Emmanuel Dellandrea, Robert Gaizauskas, Mauricio Villegas, and Krystian Mikolajczyk. Overview of the ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task. In CLEF2015 Working Notes, CEUR Workshop Proceedings, Toulouse, France, September 8-11, 2015. CEUR-WS.org.
3. Mauricio Villegas, Henning Müller, Andrew Gilbert, Luca Piras, Josiah Wang, Krystian Mikolajczyk, Alba García Seco de Herrera, Stefano Bromuri, M. Ashraful Amin, Mahmood Kazi Mohammed, Burak Acar, Suzan Uskudarli, Neda B. Marvasti, José F. Aldana, and María del Mar Roldán García. General Overview of ImageCLEF at the CLEF 2015 Labs. Lecture Notes in Computer Science. Springer International Publishing, 2015.