<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CNRS - TELECOM ParisTech at ImageCLEF 2013 Scalable Concept Image Annotation Task: Winning Annotations with Context Dependent SVMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hichem SAHBI</string-name>
          <email>hichem.sahbi@telecom-paristech.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNRS TELECOM ParisTech</institution>
          ,
          <addr-line>46 rue Barrault, 75013 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>In this paper, we describe the participation of CNRS TELECOM ParisTech in the ImageCLEF 2013 Scalable Concept Image Annotation challenge. This edition promotes the use of many contextual cues attached to visual contents. Image collections are supplied with visual features as well as tags taken from different sources (web pages, etc.). Our framework is based on training support vector machines (SVMs) using a class of kernels referred to as context dependent. These kernels are designed by minimizing objective functions mixing visual features and their contextual cues resulting from surrounding tags. The results clearly corroborate the complementarity of tags and visual features and the effectiveness of these context dependent SVMs for image annotation.</p>
      </abstract>
      <kwd-group>
        <kwd>Context-Dependent Kernels</kwd>
        <kwd>Support Vector Machines</kwd>
        <kwd>Image Annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Conventionally, visual information search requires a preliminary step known as image annotation. The latter is a major challenge (see for instance [
        <xref ref-type="bibr" rid="ref14 ref17 ref23 ref24 ref31 ref33 ref5 ref8">14, 33, 31, 23, 24, 17, 5, 8</xref>
        ]) and consists in assigning a list of keywords (a.k.a. concepts) to a given visual content. These concepts may either correspond to physical entities (pedestrians, cars, etc.) or to high-level aspects resulting from the interaction of many entities in scenes (races, fights, etc.). In both cases, image annotation is challenging because of the ambiguity of assigning concepts to scenes, especially when the possible concepts are taken from a large vocabulary and when analyzing highly semantic contents.
      </p>
      <p>
        Existing annotation methods (see for instance [
        <xref ref-type="bibr" rid="ref17 ref5">5, 17</xref>
        ]) are usually content-based: they first model image observations using low level features (color, texture, shape, etc.), treat each concept as an independent class, and then train the corresponding concept-specific classifier to identify images belonging to that concept, using a variety of machine learning and inference techniques such as latent Dirichlet allocation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Markov models [
        <xref ref-type="bibr" rid="ref17 ref23">17, 23</xref>
        ], probabilistic latent semantic analysis [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and support vector machines (SVMs) [
        <xref ref-type="bibr" rid="ref10 ref30">10, 30</xref>
        ]. These learning machines are used to model correspondences between concepts and low level features, and they make it possible to assign concepts to new images.
      </p>
      <p>
        The above annotation methods rely heavily on visual content for image annotation [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Due to the semantic gap, they are unable to fully explore the semantic information inside images; this comes from the statistical inconsistency of low level features with respect to the learned concepts, and also from the complexity of scenes. Another class of annotation methods, referred to as context-based, has emerged that takes advantage of extra information (such as contextual cues in social networks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) in order to better capture the correlations between images and their semantic concepts. Early methods emerged for text documents in social networks [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ], and recent work now handles visual content annotation in different contexts, such as the approach of [
        <xref ref-type="bibr" rid="ref11 ref18">18, 11</xref>
        ], which uses visual links as context in social networks in order to propagate image tags, and the method of [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], which uses friendship connections and conditional random fields in order to improve the performance of photo annotation. Other works consider distances between tags using Flickr [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], or contextual information taken from personal calendars [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], GPS locations [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], visual appearances [
        <xref ref-type="bibr" rid="ref19 ref4">4, 19</xref>
        ] and multiple cues [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] in order to improve annotation.
      </p>
      <p>
        In this paper, we describe the participation of "CNRS-TELECOM ParisTech" in the ImageCLEF 2013 Scalable Concept Image Annotation Task [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. Our proposed solution is based on the design of similarity functions that compare images using context-dependent kernels. The latter are designed using multiple visual features as well as the multiple contextual (text) sources of information provided in this task. When plugged into SVMs for image classification and annotation, these kernels turned out to be very effective.
      </p>
      <p>The rest of this paper is organized as follows: in Section 2, we describe the motivation and the proposed method at a glance. In Section 3, we describe our participation and the different runs submitted to this task, as well as our results and a comparison against other participants' runs. Finally, we conclude the paper in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>Motivation and Proposed Method at a Glance</title>
      <p>
        Among the image annotation methods mentioned earlier, those based on machine learning, and particularly on kernel methods such as SVMs, are particularly successful, but their success remains highly dependent on the choice of kernels. The latter, defined as symmetric and positive semi-definite functions [
        <xref ref-type="bibr" rid="ref32 ref35">35, 32</xref>
        ], should assign large values to very similar contents and vice-versa. Usual kernels, either holistic [
        <xref ref-type="bibr" rid="ref16 ref22">16, 22</xref>
        ] or alignment-based [
        <xref ref-type="bibr" rid="ref1 ref12 ref13 ref20 ref25 ref3 ref37 ref6">12, 1, 13, 3, 20, 37, 6, 25</xref>
        ], consider similarities as decreasing functions of the distances between patterns, or as proportional to the quality of aligning primitives inside patterns. In both cases, kernels rely only on the intrinsic properties of patterns without taking into account their contextual cues. In this work, we are interested in the integration of context into kernels in order to further enhance their discrimination power for image annotation, while ensuring their positive definiteness and also their efficiency. The guiding principle relies on a basic assertion: kernels should not depend only on the intrinsic aspects of images (as images with the same semantic may have different visual and textual features), but also on different sources of knowledge, including context. The designed family of kernels takes high values not only when images share the same content but also when they share the same context. The context of an image is defined as the set of images sharing links (e.g., tags) and exhibiting better semantic descriptions, compared to both pure visual and tag-based descriptions. The issue of combining context and visual content for image annotation and search has been investigated in previous related work (see for instance [
        <xref ref-type="bibr" rid="ref27 ref28 ref29 ref30 ref39 ref4 ref40 ref9">9, 4, 40, 39, 29, 28, 30, 27</xref>
        ] and the work discussed earlier); the novel part of this work aims to integrate context (from the ImageCLEF 2013 collection) into kernel design for classification and annotation, and to plug these kernels into support vector machines in order to benefit from their well-established generalization power [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ].
      </p>
      <p>
        In this work, we use a novel class of kernels (referred to as explicit and context-dependent) for image annotation [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] (see also [
        <xref ref-type="bibr" rid="ref28 ref29">29, 28</xref>
        ]). An image database is modeled as a graph where nodes are pictures and edges correspond to the shared tagged links. The proposed kernel design method is based on the optimization of an objective function mixing a fidelity term, a context criterion and a regularization term. The fidelity term takes into account the visual content of images, so highly visually similar contents encourage high kernel values. The context criterion considers the local graph structure and allows us to further enhance the relevance of our designed kernel, by diffusing and restoring the similarity when pairs of images are also surrounded by highly similar images that should also, recursively, share the same context. The regularization term controls the smoothness of the learned kernel and makes it possible to obtain a closed-form solution. Solving this minimization problem results in a recursive similarity function (with an explicit kernel map) that converges to a positive semi-definite fixed point.
      </p>
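      <p>As a minimal numerical sketch of this recursion (the base Gram matrix, the context matrix and the context weight gamma below are invented toy data; the update rule is the CDK recursion as we reconstruct it from [27]), the following Python/NumPy snippet illustrates that the iterate stays symmetric and positive semi-definite when the initialization is a Gram matrix and P is column-stochastic:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic base (visual) Gram matrix K0: PSD by construction.
X = rng.normal(size=(5, 3))
K0 = X @ X.T

# Synthetic left-stochastic context matrix P (each column sums to 1).
A = rng.random((5, 5))
P = A / A.sum(axis=0, keepdims=True)

gamma = 0.1  # illustrative context weight, gamma nonnegative
K = K0.copy()
for _ in range(2):  # two iterations, as used in the runs
    K = K0 + gamma * (P @ K @ P.T)

# K remains symmetric and PSD: the congruence P K P.T preserves
# PSD-ness, and a nonnegative sum of PSD matrices is PSD.
eigvals = np.linalg.eigvalsh(K)
```
      </preformat>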
      <p>Again, our proposed method goes beyond the naive use of low level features and the usual context-free kernels (established as the standard baseline in image annotation) in order to design a family of kernels applicable to annotation and suitable for integrating the "contextual" information taken from tagged links in interconnected datasets. In the proposed context-dependent kernel, two images (even with different visual contents and even sharing different tags) will be declared as similar if they share the same visual context (see also Fig. 1). This is usually useful as tags in interconnected data may be noisy and misspelled. Furthermore, the intrinsic visual content of images might not always be relevant, especially for concepts exhibiting large variations of the underlying visual aspects.</p>
    </sec>
    <sec id="sec-3">
      <title>ImageCLEF 2013 Evaluation</title>
      <p>The targeted task is image annotation, also known as "concept detection": given a picture of a database, the goal is to predict which concepts (classes) are present in that picture.</p>
      <sec id="sec-3-1">
        <title>ImageCLEF 2013 Collection</title>
        <p>This year's annotation task concentrated on developing annotation algorithms that rely only on data obtained automatically from the web. A very large amount of images was gathered from the web by the organizers, and tags were obtained from the associated web pages. As tags are noisy (i.e., the degree of relationship between images and their surrounding tags varies greatly), we apply some preprocessing in order to assign tags to images.</p>
        <p>Dev set: this set is labeled and consists of 1,000 images belonging to 95 categories including "aerial", "bridges", "clouds", etc. A sample of images belonging to the dev set is shown in Fig. 2, top.</p>
        <p>Test set: as the objective of this year's task is to develop algorithms that can easily change or scale the list of concepts used for image annotation, an unlabeled test set was provided; it includes 2,000 images belonging to 116 categories, 21 of which are not available in the dev set and are considered as out-of-list concepts. These concepts include "bottle", "butterfly", "chair", etc. A sample of images belonging to the out-of-list concepts is shown in Fig. 2, bottom.</p>
        <p>Training set label generation: a larger set of 250,000 images was provided with meta-data but without labels. The meta-data associated with a given image includes the list of keywords used to retrieve that image on the web with different search engines.</p>
        <p>For a given concept (among the 116 concepts), we extract a training set by collecting, among the 250k images, those which include that concept in their meta-data. As the keywords associated with a given concept may appear in different forms, we applied some morphological expansions in order to increase the recall when searching for training images belonging to a given concept.</p>
        <p>Context matrix generation: we design a left stochastic adjacency matrix (denoted P) between images, with each entry proportional to the number of shared keywords in the meta-data of the underlying images. We use this adjacency matrix in order to build our context dependent kernels, as discussed in Section 3.3.</p>
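        <p>The construction of P can be sketched as follows (a minimal Python/NumPy illustration; the keyword sets are invented toy data, and the normalization convention is one plausible reading of "left stochastic"):</p>
        <preformat>
```python
import numpy as np

# Toy meta-data: one keyword set per image (invented for illustration).
keywords = [
    {"lake", "sunset", "water"},
    {"sunset", "sky"},
    {"car", "road"},
]

n = len(keywords)
A = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            # Entry proportional to the number of shared keywords.
            A[i, j] = len(keywords[i].intersection(keywords[j]))

# Column-normalize so that P is left stochastic (columns sum to 1);
# images sharing no keywords with others keep an all-zero column.
col_sums = A.sum(axis=0)
P = np.divide(A, col_sums, out=np.zeros_like(A), where=col_sums != 0)
```
        </preformat>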
      </sec>
      <sec id="sec-3-2">
        <title>ImageCLEF 2013 Visual Features</title>
        <p>We used only the visual features provided in this ImageCLEF task, including GIST, color histograms, SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT. For all the SIFT-based descriptors, a bag-of-words representation is provided. Even though provided, the images themselves were not used to extract any extra features.</p>
      </sec>
      <sec id="sec-3-3">
        <title>CNRS-TELECOM ParisTech Runs and Comparison</title>
        <p>All our submitted runs (discussed below) are based on SVM training. Again, the goal is image annotation, also known as concept detection. For this purpose, we trained "one-versus-all" SVM classifiers for each concept; we use many random folds (taken from the training data) for multiple SVM trainings, and we use these SVMs to predict the concepts on the dev and test sets. We repeat this training process, for each concept, over different random folds from the training set, and we take the average scores of the underlying SVM classifiers. This makes classification results less sensitive to the sampling of the training set. For all the submitted runs (see runs 1-6 below), the only difference resides in the kernels used; we plug the latter into SVMs in order to achieve concept detection. Performances are evaluated using the mean F-measures (at concept and sample levels) as well as the mean average precisions; details about these measures are given on the ImageCLEF 2013 web page (http://imageclef.org/2013/photo/annotation).</p>
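        <p>The per-concept decision rule (average the per-fold SVM scores, then threshold) can be sketched as follows; the fold scores and the number of folds below are invented for illustration, and the threshold value 0.5 matches the value used in runs 1, 3 and 5:</p>
        <preformat>
```python
import numpy as np

tau = 0.5  # cut-off threshold (set to 1 in runs 2, 4 and 6)

# Hypothetical SVM scores for 4 test images over 3 random folds.
fold_scores = np.array([
    [0.9, 0.2, 0.8, -0.1],
    [0.8, 0.4, 0.7,  0.0],
    [1.0, 0.3, 0.9, -0.2],
])

# Average the classifiers' scores over folds, then threshold: an
# image is assigned the concept when its mean score reaches tau.
mean_scores = fold_scores.mean(axis=0)
assigned = mean_scores >= tau
```
        </preformat>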
        <p>Run 1: for this run, we build seven Gram matrices (based on the histogram intersection kernel) associated with the visual features mentioned earlier. Then, we linearly combine those matrices into a single one. Notice that this combination does not result from multiple kernel learning but is just a convex combination of kernels with uniform weights. We plug the resulting kernel into SVMs for training and testing. A given test image is assigned to a given concept iff the underlying SVM score reaches a threshold τ (with τ = 0.5 in practice).</p>
        <p>Run 2: the setting of this run is exactly the same as run 1, except that the cut-off threshold τ is set to 1.</p>
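        <p>This uniform-weight combination can be sketched as follows (Python/NumPy; the two random histogram matrices below are invented stand-ins for the seven provided bag-of-words descriptors):</p>
        <preformat>
```python
import numpy as np

def hist_intersection_gram(H):
    # K[i, j] = sum over d of min(H[i, d], H[j, d])
    # (histogram intersection kernel).
    return np.minimum(H[:, None, :], H[None, :, :]).sum(axis=2)

rng = np.random.default_rng(0)

# Invented bag-of-words histograms: 4 images, 2 feature types.
features = [rng.random((4, 6)), rng.random((4, 6))]

grams = [hist_intersection_gram(H) for H in features]

# Convex combination with uniform weights (no multiple kernel
# learning), yielding the single Gram matrix plugged into the SVMs.
K = sum(grams) / len(grams)
```
        </preformat>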
        <p>
          Run 3: the linear combination of kernel matrices (denoted K(0)) obtained in runs 1 and 2 is used as an initialization of the context dependent kernel (CDK), defined as K(t+1) = K(0) + γ P K(t) P′, with γ ≥ 0 (see [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]). The latter is computed iteratively (in two iterations) using the adjacency matrix P introduced earlier. Once designed, we plug the CDK into SVMs for training and testing. A given test image is again assigned to a given concept iff the underlying SVM score reaches a threshold τ (with τ = 0.5 in practice).
        </p>
        <p>Run 4: the setting of this run is exactly the same as run 3, except that the cut-off threshold τ is set to 1.</p>
        <p>Run 5: before computing the convex combination of kernels (as done in runs 3 and 4), we first evaluate, for each kernel matrix (associated with a given visual feature), its underlying CDK (K(t+1) = K(0) + γ P K(t) P′, with K(0) being the linear kernel matrix). Then, we apply the histogram intersection kernel to these CDKs and linearly combine the resulting kernels with uniform weights. Again, the number of iterations in the CDK is set to 2. Once designed, we plug the final kernel matrix into SVMs for training and testing. A given test image is again assigned to a given concept iff the underlying SVM score reaches a threshold τ (with τ = 0.5 in practice).</p>
        <p>Run 6: the setting of this run is exactly the same as run 5, except that the cut-off threshold τ is set to 1.</p>
        <p>The diagrams in Figs. 3, 4 and 5 show the mean F-measures and mean average precisions of our runs and their comparison with the different participants' runs. All these results make clear that our best runs (runs 6 and 4) outperform the others for almost all the evaluation measures.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We discussed in this paper the participation of "CNRS-TELECOM ParisTech" in the ImageCLEF 2013 Scalable Concept Image Annotation Task. Our submissions include pure visual runs based on a linear combination of elementary histogram intersection kernels, as well as combined visual/textual runs that consider the context of images through context dependent kernels. The latter turned out to be the most effective and achieved the best performance among the 57 participants' runs.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported in part by a grant from the Research Agency ANR
(Agence Nationale de la Recherche) under the MLVIS project and a grant from
DIGITEO under the RELIR project.</p>
      <p>[Figure: performances (see http://imageclef.org/2013/photo/annotation) of our runs (denoted TPT-*) and other participants' runs on the dev set. Acronyms stand for ISI: Tokyo U., UNIMORE: U. of Modena and Reggio Emilia, RUC: Renmin U. of China, UNEDUV: National U. of Distance Education in Spain, CEALIST: CEA, France, KDEVIR: Toyohashi U. of Technology in Japan, URJCyUNED: King Juan Carlos U. in Spain, MICC: Florence U. in Italy, SZTAKI: Hungarian Academy of Sciences, INAOE: National Institute of Astrophysics, Optics and Electronics in Mexico, THSSMPAM: Tsinghua U., Beijing, China, LMCHFUT: Hefei University of Technology, China. Top diagram: mean F-measures for samples; middle: mean F-measures for concepts; bottom: mean average precision.]</p>
      <p>[Figure: performances of our runs (denoted TPT-*) and other participants' runs on the test set; acronyms as in the previous figure. Top diagram: mean F-measures for samples; middle: mean F-measures for concepts; bottom: mean F-measures for concepts unseen in the dev set.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bahlmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haasdonk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Burkhardt</surname>
          </string-name>
          .
          <article-title>On-line handwriting recognition with support vector machines, a kernel approach</article-title>
          .
          <source>IWFHR</source>
          , pages
          <volume>49</volume>
          -
          <fpage>54</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>K.</given-names>
            <surname>Barnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Duygululu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Forsyth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Matching words and pictures</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Boughorbel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tarel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Boujemaa</surname>
          </string-name>
          .
          <article-title>The intermediate matching kernel for image local features</article-title>
          .
          <source>IEEE International Joint Conference on Neural Networks</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>L.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Annotating photo collection by label propagation according to multiple similarity cues</article-title>
          .
          <source>ACM Multimedia</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>G.</given-names>
            <surname>Carneiro</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasconcelos</surname>
          </string-name>
          .
          <article-title>Formulating semantic image annotation as a supervised learning problem</article-title>
          .
          <source>In: Proc. of CVPR</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Cuturi</surname>
          </string-name>
          .
          <article-title>Fast global alignment kernels</article-title>
          .
          <source>In Proceedings of the International Conference on Machine Learning</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Good</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Sarvas</surname>
          </string-name>
          .
          <article-title>From context to content: leveraging context to infer media metadata</article-title>
          .
          <source>In: Proceedings of 12th Annual ACM International Conference on Multimedia, MM</source>
          <year>2004</year>
          ,
          <article-title>Brave New Topics Session on From Context to Content: Leveraging Contextual Metadata to infer Multimedia Content in New York</article-title>
          , ACM Press,
          <fpage>188</fpage>
          -
          <lpage>195</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>P.</given-names>
            <surname>Duygulu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Barnard</surname>
          </string-name>
          , J. deFreitas, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Forsyth</surname>
          </string-name>
          .
          <article-title>Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary</article-title>
          . In: Heyden,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Sparr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Nielsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Johansen</surname>
          </string-name>
          , P. (eds.)
          <article-title>ECCV 2002</article-title>
          .
          <article-title>LNCS</article-title>
          , vol.
          <volume>2353</volume>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>112</lpage>
          . Springer, Heidelberg,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Neustaedter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Image annotation using personal calendars as context</article-title>
          .
          <source>ACM Multimedia</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <article-title>Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers</article-title>
          .
          <source>in Proc. of ACM MULTIMEDIA</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-J.</given-names>
            <surname>Zha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Visual-textual joint relevance learning for tag-based social image search</article-title>
          .
          <source>IEEE Trans. Image Processing</source>
          ,
          <volume>22</volume>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>T.</given-names>
            <surname>Gartner</surname>
          </string-name>
          .
          <article-title>A survey of kernels for structured data</article-title>
          .
          <source>Multi Relational Data Mining</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>K.</given-names>
            <surname>Grauman</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <article-title>The pyramid match kernel: Efficient learning with sets of features</article-title>
          .
          <source>Journal of Machine Learning Research (JMLR)</source>
          ,
          <volume>8</volume>
          :
          <fpage>725</fpage>
          -
          <lpage>760</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Carreira-Perpinan</surname>
          </string-name>
          .
          <article-title>Multiscale conditional random fields for image labeling</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>D.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          .
          <article-title>Inferring photographic location using geotagged web images</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          ,
          <volume>56</volume>
          (
          <issue>1</issue>
          ):
          <fpage>131</fpage>
          -
          <lpage>153</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>R.</given-names>
            <surname>Kondor</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Jebara</surname>
          </string-name>
          .
          <article-title>A kernel between sets of vectors</article-title>
          .
          <source>In proceedings of the 20th International conference on Machine Learning</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Automatic linguistic indexing of pictures by a statistical modeling approach</article-title>
          .
          <source>IEEE Trans. on PAMI</source>
          ,
          <volume>25</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1075</fpage>
          -
          <lpage>1088</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Snoek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          .
          <article-title>Learning tag relevance by neighbor voting for social image retrieval</article-title>
          .
          <source>In MIR conference</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-H.</given-names>
            <surname>Tsang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          .
          <article-title>Textual query of personal photos facilitated by large-scale web data</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>33</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1022</fpage>
          -
          <lpage>1036</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>S.</given-names>
            <surname>Lyu</surname>
          </string-name>
          .
          <article-title>Mercer kernels for object recognition with local features</article-title>
          .
          <source>In the proceedings of the IEEE Computer Vision and Pattern Recognition</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>F.</given-names>
            <surname>Monay</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gatica-Perez</surname>
          </string-name>
          .
          <article-title>PLSA-based image auto-annotation: Constraining the latent space</article-title>
          .
          <source>in Proc. of ACM International Conference on Multimedia</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>P.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasconcelos</surname>
          </string-name>
          .
          <article-title>A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications</article-title>
          .
          <source>In Neural Information Processing Systems</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>G.</given-names>
            <surname>Moser</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Serpico</surname>
          </string-name>
          .
          <article-title>Combining support vector machines and Markov random fields in an integrated framework for contextual image classification</article-title>
          .
          <source>In TGRS</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>S.</given-names>
            <surname>Nowak</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Huiskes</surname>
          </string-name>
          .
          <article-title>New strategies for image annotation: Overview of the photo annotation task at ImageCLEF 2010</article-title>
          .
          <source>in The Working Notes of CLEF 2010</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben-Hur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Vert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Noble</surname>
          </string-name>
          .
          <article-title>A structural alignment kernel for protein structures</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>23</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1090</fpage>
          -
          <lpage>1098</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>R.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Image retrieval: Ideas, influences, and trends of the new age</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>H.</given-names>
            <surname>Sahbi</surname>
          </string-name>
          .
          <article-title>Explicit context-aware kernel map learning for image annotation</article-title>
          .
          <source>The 9th International Conference on Computer Vision systems</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <given-names>H.</given-names>
            <surname>Sahbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Audibert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Keriven</surname>
          </string-name>
          .
          <article-title>Context-dependent kernels for object classification</article-title>
          .
          <source>In Pattern Analysis and Machine Intelligence (PAMI)</source>
          ,
          <volume>33</volume>
          (
          <issue>4</issue>
          ),
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <given-names>H.</given-names>
            <surname>Sahbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Audibert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rabarisoa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Keriven</surname>
          </string-name>
          .
          <article-title>Context-dependent kernel design for object matching and recognition</article-title>
          .
          <source>In the proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <given-names>H.</given-names>
            <surname>Sahbi</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Context based support vector machines for interconnected image annotation (the saburo tsuji best regular paper award)</article-title>
          .
          <source>In the Asian Conference on Computer Vision (ACCV)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>D.</given-names>
            <surname>Semenovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Sowmya</surname>
          </string-name>
          .
          <article-title>Geometry aware local kernels for object recognition</article-title>
          .
          <source>In ACCV</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Cristianini</surname>
          </string-name>
          .
          <article-title>Support vector machines and other kernel-based learning methods</article-title>
          . Cambridge University Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <given-names>A.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Probabilistic spatial context models for scene content understanding</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zickler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <article-title>Auto-tagging facebook: Social network context improves photo annotation</article-title>
          .
          <source>in IVW</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <article-title>Statistical learning theory</article-title>
          . A Wiley-Interscience Publication
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Paredes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          .
          <article-title>Overview of the ImageCLEF 2013 scalable concept image annotation subtask</article-title>
          .
          <source>CLEF 2013 working notes</source>
          , Valencia, Spain,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <given-names>C.</given-names>
            <surname>Wallraven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Graf</surname>
          </string-name>
          .
          <article-title>Recognition with local features: the kernel recipe</article-title>
          .
          <source>ICCV</source>
          , pages
          <fpage>257</fpage>
          -
          <lpage>264</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Flickr distance</article-title>
          .
          <source>In: Proc. of ACM MULTIMEDIA</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.-S.</given-names>
            <surname>Hua</surname>
          </string-name>
          .
          <article-title>A unified context model for web image retrieval</article-title>
          .
          <source>ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>28</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hsu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>ContextSeer: Context search and recommendation at query time for shared consumer photos</article-title>
          .
          <source>ACM Multimedia</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Giles</surname>
          </string-name>
          .
          <article-title>Exploring social annotations for information retrieval</article-title>
          .
          <source>in the WWW conference</source>
          , Beijing, China,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>