=Paper=
{{Paper
|id=Vol-1179/CLEF2013wn-ImageCLEF-LeBorgneEt2013
|storemode=property
|title=CEA LIST@ImageCLEF 2013: Scalable Concept Image Annotation
|pdfUrl=https://ceur-ws.org/Vol-1179/CLEF2013wn-ImageCLEF-LeBorgneEt2013.pdf
|volume=Vol-1179
|dblpUrl=https://dblp.org/rec/conf/clef/BorgnePZ13
}}
==CEA LIST@ImageCLEF 2013: Scalable Concept Image Annotation==
CEA LIST@ImageCLEF 2013: Scalable Concept Image Annotation

Hervé Le Borgne, Adrian Popescu, and Amel Znaidia
CEA, LIST, Vision & Content Engineering, Gif-sur-Yvette, France
firstname.lastname@cea.fr

Abstract. We report the participation of CEA LIST in the Scalable Concept Image Annotation subtask of ImageCLEF 2013. The full system is based on both textual and visual similarity to each concept, merged by late fusion. Each image is visually represented by a bag of visterms computed from SIFT descriptors extracted on a dense grid every 3 pixels, locally soft-coded and max-pooled on a codebook of size 1024, and spatially extended with a 1×1 + 3×1 + 2×2 pyramid, resulting in a vector of size 8192. The visual neighbors of a query are found with the L1 distance to the images of the training database. The similarity of a query to one of the 95/116 concepts to identify is the sum of the similarities of each neighbor to that concept. The decision is set to 1 for all concepts whose score is more than one standard deviation above the average similarity of the query to all concepts. The similarity between a training image and a concept is computed from an intermediate vectorial representation of its tags. We tested several representation spaces for the tags, including Wikipedia concepts sorted according to their popularity or characterized from Flickr data, and the size of the space was pruned to several values, from 5,000 to 200,000. The tag representations are max-pooled to build the training image vector, so that the resulting vector contains the maximal similarity to each concept of the intermediate space. The 95/116 concepts to identify are represented in the same space, and their similarity to a training image is the cosine between the intermediate-space representations. We also ranked all training images with respect to each concept to identify, in order to learn visual classifiers (linear SVMs). We tested several strategies to select positive and negative examples, including visual coherency, but the simplest one was finally the most efficient: it consists in taking the 100 most similar images as positives and the 500 least similar as negatives. Finally, a simple weighted late fusion of the visual and textual similarity scores appeared to be more efficient than more sophisticated strategies, resulting in a mAP of 0.40 on the development queries and 0.34 on the test ones.

1 Introduction

The Scalable Concept Image Annotation subtask of ImageCLEF 2013 [1] is described in detail in [7]. The system we propose relies on both visual and textual cues. We conducted many preliminary experiments in order to iteratively improve the provided baseline system (see Section 1.1). These experiments dealt with the visual features used to find the image neighbors of the queries (Section 2), several tag models (Sections 3 to 5), the way we learnt visual models (Section 6) and finally the decision process (Section 7).

1.1 Baseline system

A baseline system based on the co-occurrence of concepts and tags of the visual neighbors of each query is provided [7]. Each image I of the development set has to be annotated according to concepts C_p, p = 1...95. Such an image has K_v neighbors in the training set, according to a visual descriptor (the provided C-SIFT BoV). Each of these neighbors (k = 1...K_v) has T_k tags with a given score (t_{k,1}, s_{k,1}, ..., t_{k,i}, s_{k,i}, ..., t_{k,T_k}, s_{k,T_k}). Each of these tags is described by N_{k,i} weighted concepts (C_{k,i,1}, W_{k,i,1}, ..., C_{k,i,j}, W_{k,i,j}, ..., C_{k,i,N_{k,i}}, W_{k,i,N_{k,i}}). Thus, the score of concept C_p for image I is:

\mathrm{Score}_I(C_p) = \frac{1}{K_v} \sum_{k=1}^{K_v} \frac{\sum_{i=1}^{T_k} s_{k,i} \, W_{k,i,C_p}}{\sum_{i=1}^{T_k} s_{k,i}}    (1)

In practice, each tag is described by the same number K_concepts of concepts (default: 6).
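To make the baseline scoring rule concrete, the following Python sketch computes Eq. (1) for one image and one concept. The data layout (a list of neighbors, each given as a list of (tag score, concept-weight dictionary) pairs) is a hypothetical, illustrative choice and not the organizers' actual data format.

    def baseline_score(neighbors, concept):
        """Eq. (1): for each of the K_v visual neighbors, average the concept
        weights of its tags using the tag scores s_{k,i} as weights, then
        average these per-neighbor values over the neighbors."""
        total = 0.0
        for tags in neighbors:                                   # k = 1 .. K_v
            num = sum(s * w.get(concept, 0.0) for s, w in tags)  # sum_i s_{k,i} W_{k,i,C_p}
            den = sum(s for s, _ in tags)                        # sum_i s_{k,i}
            if den > 0.0:
                total += num / den
        return total / max(len(neighbors), 1)                    # 1/K_v factor

    # Toy example: two neighbors, each described by (tag score, concept weights).
    # neighbors = [[(0.9, {"dog": 0.7, "pet": 0.2}), (0.4, {"grass": 0.5})],
    #              [(0.6, {"dog": 0.3})]]
    # baseline_score(neighbors, "dog")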
2 Searching visual neighbors

In the original system, visual neighbors are provided and said to be found with a C-SIFT based descriptor. We tested several alternative methods. The descriptors are bags of visterms: SIFT local descriptors are densely extracted every 3 pixels, then coded using local soft coding [3] and max pooling. Two different spatial pyramid matching schemes [2] are used: 1×1 + 3×1 + 2×2 for BoV1 and 1×1 + 2×2 + 4×4 for BoV2. Further details on the bag-of-visterms design can be found in [6]. Several distances were tested on these vectors to find the neighbours: the histogram intersection (HI) distance, implemented as

\mathrm{Dist}_{HI}(x, y) = 1 - \frac{1}{D} \sum_{i=1}^{D} \frac{\min(x_i, y_i)}{\max(x_i, y_i)}    (2)

and the classical L1 distance:

\mathrm{Dist}_{L1}(x, y) = \frac{1}{D} \sum_{i=1}^{D} |x_i - y_i|    (3)

Results are shown in Table 1, showing a non-significant improvement with the BoV1 signature and the L1 distance.

Table 1. Result of the baseline system with several methods to search the K = 32 visual neighbors.

System              mAP (K = 32)
Provided baseline   24.235
Random neighbors    17.878
BoV1, HI            23.830
BoV2, HI            23.468
BoV1, L1            24.305
BoV2, L1            23.229

3 Using a Flickr-based tag model

We used a Flickr-based tag model built from the selection of the 95 concepts (Flickr95) and another one built from 30,000 Wikipedia concepts (Flickr30k). See [5, 8] for details about the way similarities are computed for these models. Note that both models were built from the Flickr tags. The Flickr95 tag model was injected into the provided system, in conjunction with two different visual models. The mAP is reported in Table 2. Note that this performance measure should be independent of the parameter K_concepts, which was fixed to 6. The F-measure only uses the annotation decisions and is computed in two ways, one by analyzing each of the testing samples and the other by analyzing each of the concepts. Results are reported in Tables 3 and 4.

Table 2. Result (mAP) of the system with different tag models and methods to search the visual neighbors. (*) the mAP grows with K_concepts here; result reported with K_concepts = 6.

Tag model      Visual      K_visual=8   16      32      64      128
co-occurrence  csift       24.71(*)     24.77   24.24   23.63   23.10
co-occurrence  BoV1 + L1   25.01(*)     25.08   24.31   23.60   22.80
Flickr95       csift       25.08        25.92   26.61   27.33   27.51
Flickr95       BoV1 + L1   25.96        27.30   28.16   28.18   27.67
Flickr30k      csift       30.25        29.46   29.23   28.80   28.44
Flickr30k      BoV1 + L1   31.05        30.25   29.50   29.07   28.48
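As a rough illustration of how such a tag-based similarity can be computed (the abstract describes max-pooling the tag representations and taking a cosine in the intermediate concept space; the exact Flickr95/Flickr30k models are detailed in [5, 8]), here is a minimal sketch. The intermediate tag and concept vectors are assumed to be given, and all names are illustrative rather than taken from the paper.

    import numpy as np

    def image_tag_vector(tag_vectors):
        # Max-pooling over the tag representations: component d of the image
        # vector is the maximal similarity of any of its tags to the d-th
        # concept of the intermediate space. tag_vectors: (T, D) array.
        return np.max(tag_vectors, axis=0)

    def text_similarity(image_vec, concept_vec, eps=1e-12):
        # Cosine similarity between the image representation and the
        # representation of a target concept in the same intermediate space.
        # eps avoids division by zero for all-zero vectors (our own addition).
        return float(np.dot(image_vec, concept_vec) /
                     (np.linalg.norm(image_vec) * np.linalg.norm(concept_vec) + eps))

In this picture, pruning the representation (as done in the following sections) simply amounts to keeping only the highest-ranked dimensions of the intermediate space.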
4 Window-restricted Flickr-based tag models

The tag model of each training document is built with a restriction based on the word-image distance: in the original web page from which a training image was taken, only the words that are closer than a given distance to the considered image are taken into account. Moreover, we considered lemmatized and non-lemmatized versions of the models. Results are comparable to those obtained with Flickr30k (around 0.31), but no actual improvement is observed.

Table 3. Result (mF-samp, F-measure averaged over samples) of the system with different tag models and methods to search the visual neighbors.

Tag model      Visual      K_concepts   K_visual=8   16      32      64      128
co-occurrence  csift       2            11.92        12.49   11.54   10.81   10.32
                           4            15.53        16.48   16.60   16.37   15.77
                           6            17.90        18.96   18.60   17.88   17.54
                           8            19.27        19.71   19.61   19.03   18.65
                           10           20.09        20.14   19.86   19.59   19.23
co-occurrence  BoV1 + L1   2            12.46        13.00   12.20   11.64   10.47
                           4            16.21        16.78   16.33   15.48   14.96
                           6            17.76        18.43   18.20   17.81   17.38
                           8            19.07        19.60   19.29   19.06   18.85
                           10           19.83        20.19   19.96   19.63   18.97
Flickr95       csift       2            13.96        14.82   15.31   16.39   15.76
                           4            16.56        17.47   18.38   18.98   18.93
                           6            17.67        18.49   19.67   19.98   19.98
                           8            18.19        19.17   19.82   20.48   20.72
                           10           18.65        19.36   19.96   20.61   21.29
Flickr95       BoV1 + L1   2            14.83        15.72   16.66   16.25   15.80
                           4            17.48        18.68   19.36   19.44   19.07
                           6            18.64        20.04   20.73   21.01   20.90
                           8            19.28        20.33   21.19   21.37   21.18
                           10           19.38        20.51   21.20   21.68   21.68

5 Wikipedia-based tag models

Similarly to the Flickr-based tag models, the tags are projected onto 1,187,980 Wikipedia concepts. The concepts being ranked according to the number of their incoming links, the representation can be pruned to a lower dimension.

6 Learning visual models

For a given tag model, the training images are sorted according to their score for each concept. We then select positive and negative examples, according to different strategies, to learn linear SVM models from the BoV1 signatures. The text model used was Flickr30k_0, i.e. the Flickr tags projected onto the 30k Wikipedia concepts, with a restriction window of size 0.

Table 4. Result (mF-cnpt, F-measure averaged over concepts) of the system with different tag models and methods to search the visual neighbors.

Tag model      Visual      K_concepts   K_visual=8   16      32      64      128
co-occurrence  csift       2            7.25         5.49    4.33    3.60    2.91
                           4            11.33        9.13    8.47    7.81    6.11
                           6            13.98        12.48   10.67   9.41    8.00
                           8            15.34        13.17   11.87   10.46   9.39
                           10           16.18        14.15   12.67   11.72   10.43
co-occurrence  BoV1 + L1   2            8.28         7.95    6.01    5.24    3.00
                           4            12.26        11.49   9.15    7.87    7.12
                           6            14.40        13.14   11.62   10.38   8.95
                           8            15.42        14.45   13.72   11.93   10.34
                           10           16.21        15.70   14.40   13.03   11.18
Flickr95       csift       2            15.11        15.74   15.35   15.39   13.79
                           4            16.10        17.20   17.95   18.13   17.26
                           6            16.30        17.54   18.91   19.34   18.78
                           8            16.26        17.75   18.41   19.43   19.89
                           10           16.35        17.61   17.96   19.16   20.53
Flickr95       BoV1 + L1   2            17.00        17.17   17.15   15.61   14.42
                           4            18.64        19.08   19.16   17.98   17.05
                           6            18.62        19.28   20.47   19.56   18.85
                           8            18.25        19.33   20.56   20.00   18.85
                           10           17.88        18.99   20.34   19.97   20.06

The simplest strategy consisted in taking the 100 most similar images as positives and the 500 least similar as negatives. It led to a mAP of 0.219 on the devel queries. A second strategy consisted in selecting as positives all images above a given score (0.8) and as negatives all those below a smaller threshold (0.1). Given that the classes were then strongly unbalanced, we limited the negative images to nine times the positive ones for each SVM model, leading to a mAP of 0.207. When the number of negative samples is forced to be equal to the number of positive ones, the mAP is 0.212. A last strategy was tested, for which the choice of the images was based on visual coherency [4]: the 1,000 most similar images to each concept are re-sorted according to their VC-score and the 100 best are selected as positive examples. Negative examples are chosen as the 1,000 least similar images to the concept. Although promising, this approach led to a mAP of 0.209 only. We thus finally decided to keep the first and simplest strategy.
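The retained strategy can be sketched as follows; the use of scikit-learn's LinearSVC and the array layout are illustrative assumptions, the paper only specifying a linear SVM trained on the 100 highest- and 500 lowest-scored images for each concept.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_concept_svm(bov_train, concept_scores, n_pos=100, n_neg=500):
        # Retained selection strategy: the n_pos training images with the
        # highest tag-model score for the concept are positives, the n_neg
        # images with the lowest score are negatives; a linear SVM is then
        # fit on their BoV1 signatures.
        # bov_train: (N, D) array of signatures, concept_scores: (N,) array.
        order = np.argsort(concept_scores)            # ascending scores
        pos, neg = order[-n_pos:], order[:n_neg]
        X = np.vstack([bov_train[pos], bov_train[neg]])
        y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
        return LinearSVC().fit(X, y)

One such classifier is learnt per concept; its scores are later combined with the textual similarities by the weighted late fusion mentioned in the abstract and in Section 8.1.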
7 Decision

The decision value (0/1) is taken independently for each query, according to its similarity to the 95 or 116 concepts. We compute the average (µ) and the standard deviation (σ) of the scores and set the decision to 1 for all concepts having a score above µ + σ.

8 Participation to the campaign

8.1 Submitted runs

We submitted five runs to the campaign, based on the observations made during the preliminary experiments:

Run 1: we computed the Flickr-based tag model and merged it with the visual similarities. The weights were respectively 0.8 and 0.2.
Run 2: we added to Run 1 a Wikipedia-based tag model with a representation pruned to 5,000 dimensions.
Run 3: we added to Run 2 a Flickr-based tag model with a representation pruned to 50,000 dimensions.
Run 4: similar to Run 3, with a visual model selected on scores.
Run 5: similar to Run 4, with a Flickr-based tag model with a representation pruned to 200,000 dimensions.

8.2 Results

Table 5. Result of our five runs to the campaign.

         devel set                          test set
         mAP    MF-sample   MF-concepts    mAP    MF-sample   MF-concepts
Run 1    34.6   28.7        23.6           29.4   23.0        19.0
Run 2    39.6   30.2        24.6           33.6   24.2        20.1
Run 3    40.4   31.8        25.3           34.1   25.2        20.2
Run 4    40.3   32.2        26.1           34.2   26.0        21.2
Run 5    39.2   31.6        25.4           33.6   25.7        21.0

The results are quite close to each other and around 10 points (in terms of mAP) below the best run of the campaign.

References

1. B. Caputo, H. Muller, B. Thomee, M. Villegas, R. Paredes, D. Zellhofer, H. Goeau, A. Joly, P. Bonnet, J. Martinez Gomez, I. Garcia Varea, and M. Cazorla. ImageCLEF 2013: the vision, the data and the open challenges. In Proc. CLEF 2013, LNCS, 2013.
2. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2169–2178, 2006.
3. Lingqiao Liu, Lei Wang, and Xinwang Liu. In defense of soft-assignment coding. In IEEE International Conference on Computer Vision, 2011.
4. Débora Myoupo, Adrian Popescu, Hervé Le Borgne, and Pierre-Alain Moëllic. Multimodal image retrieval over a large database. In Proceedings of the 10th International Conference on Cross-Language Evaluation Forum: Multimedia Experiments, CLEF'09, pages 177–184, Berlin, Heidelberg, 2010. Springer-Verlag.
5. Adrian Popescu and Gregory Grefenstette. Social media driven image retrieval. In ACM International Conference on Multimedia Retrieval, pages 33:1–33:8, 2011.
6. Aymen Shabou and Hervé Le Borgne. Locality-constrained and spatially regularized coding for scene categorization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3618–3625, 2012.
7. Mauricio Villegas, Roberto Paredes, and Bart Thomee. Overview of the ImageCLEF 2013 scalable concept image annotation subtask. In CLEF 2013 Working Notes, 2013.
8. Amel Znaidia, Aymen Shabou, Adrian Popescu, Hervé Le Borgne, and Céline Hudelot. Multimodal feature generation framework for semantic image classification. In ACM International Conference on Multimedia Retrieval (ICMR '12), Hong Kong, China, June 5-8, 2012, page 38, 2012.