<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MLIA at ImageCLEF 2014 Scalable Concept Image Annotation Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xing Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Atsushi Shimada</string-name>
          <email>atsushig@limu.ait.kyushu-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rin-ichiro Taniguchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Advanced Information Technology, Kyushu University</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <fpage>411</fpage>
      <lpage>420</lpage>
      <abstract>
        <p>In this paper, we propose a large-scale image annotation system for the ImageCLEF 2014 Scalable Concept Image Annotation task. This year's annotation task concentrated on developing annotation algorithms that rely only on data obtained automatically from the web. Since sophisticated SVM-based annotation techniques were widely applied in last year's task (ImageCLEF 2013), we also adopt SVM-based annotation techniques for this year's task and put our effort mainly into obtaining a more accurate concept assignment for the training images. More specifically, we propose a two-fold scheme to assign concepts to unlabeled training images: (1) a traditional process which stems the extracted web data of each training image on the textual side and assigns concepts based on the appearance of each concept; (2) an additional process which leverages the deep convolutional network toolbox Overfeat to predict labels (ImageNet nouns) for each training image on the visual side; the predicted tags are then mapped to ImageCLEF concepts based on WordNet synonym and hyponym semantic relations. Finally, the concepts allocated to each training image are generated by a fusion step over the two concept assignment processes. Experimental results show that the proposed concept assignment scheme is effective in improving the assignment results of traditional textual processing and in allocating reasonable concepts to training images. Consequently, with an efficient SVM solver based on stochastic gradient descent, our annotation system achieves competitive performance in the annotation task.</p>
      </abstract>
      <kwd-group>
        <kwd>imageclef</kwd>
        <kwd>image annotation</kwd>
        <kwd>social web data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this year's ImageCLEF 2014 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we participated in the Scalable Concept Image
Annotation challenge (http://www.imageclef.org/2014/annotation) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] which aimed at developing more scalable image
annotation systems. The goal of this challenge is to develop annotation systems that
for training rely only on unsupervised web data and other automatically
obtainable resources. In contrast to traditional image annotation evaluations with
labeled training data, this challenge requires work on more fronts, such as handling
noisy data, textual processing, multi-label annotation, and scalability to
unobserved labels.
      </p>
      <p>
        Since this year is the third edition of the annotation challenge, we can make several
observations regarding annotation methodology from the
overview reports [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] of the previous editions:
- The best-performing system, TPT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], only used the provided visual features,
which indicates that the visual features provided by the organizers are
sufficient and that the additional features extracted by several teams might be
complementary.
- The top 3 teams (TPT, MIL [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and UNIMORE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) all utilized SVM-based
algorithms to learn a separate classifier for each concept, which was verified
to be superior to the K-nearest-neighbor (KNN) based annotation techniques
used by other groups, such as RUC [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], MICC [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
- The textual processing and concept assignment for training images were
significant, since they directly affected the learning accuracy of the concept
classifiers.
      </p>
      <p>The major difference of this year's challenge compared with previous
editions is the proportion of "scaled" concepts. In last year's challenge, there were
116 concepts in total (95 concepts for the development set and 21 more for the test set),
so the proportion of "scaled" concepts was 21/116 &#8776; 0.181. In contrast, this
year there are 207 concepts in total (107 concepts for the development set and 100
more for the test set), so the proportion of "scaled" concepts is 100/207 &#8776; 0.483. This
underlines the importance of an annotation system that is scalable and generalizes
well to new concepts.</p>
      <p>
        To develop a robust and scalable annotation system, we believe that one of
the intrinsic issues is to assign more appropriate concepts to training images.
Once we have collected more accurate (positive/negative) samples for each
concept, it is possible to improve the performance of the concept classifiers. Thus, for
the contest, we mainly focus on the issue of accurate concept assignment for
training images. Besides traditional textual information processing, such as
stopword removal and stemming, which has been widely applied in previous
editions, we also leverage the recently popular convolutional neural networks
(CNN) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to allocate tags (1K WordNet nouns) to each training image from the
visual aspect. As the CNN-based method uses a deep neural network to
improve classification, we can rely on the tags predicted by Overfeat and map
them to concepts of the ImageCLEF vocabulary. A late fusion approach is then
used to decide the final concept assignment for each training image. Finally, we
train a linear SVM classifier for each concept (similarly for the development and test
sets) with the visual features provided by the organizers. To tackle the high-dimensional,
large volume of training data, we adopt the online learning strategy
of stochastic gradient descent (SGD). We finally obtain competitive annotation
performance in terms of the MAP-samples, MF-concepts and MF-samples measures
and rank 4th among all 11 groups on the overall measure.
      </p>
      <p>The rest of the paper is organized as follows. Section 2 presents the
architecture of the proposed annotation system, where we mainly discuss our concept
assignment scheme for training images. In Section 3, we describe our
experimental setup and report the evaluation results obtained on both the development
and the test sets. Section 4 concludes and outlines future directions of our work.</p>
    </sec>
      <sec id="sec-2-1">
        <title>Proposed annotation system</title>
        <p>
          The proposed annotation system is depicted in Figure 1. To assign more
appropriate concepts to training images, we conduct a two-fold scheme which explicitly
leverages the provided textual information semantically (Section 2.1) and the
training images visually (Section 2.2). Based on the reliably labeled training
images, we further learn SVM-based concept classifiers using the standard visual
features provided by the organizers. To tackle the high-dimensional features and
large volumes of data, we use an online learning method combined with the SGD
algorithm. The learned concept classifiers are then used to predict concepts for
images in the development and test sets. In the following subsections, we describe
the detailed procedure of each module of the diagram in Figure 1.
        </p>
        <p>
          The organizers of ImageCLEF 2014 provided several kinds of textual features
for the training images. Following the traditional text processing approach used
last year, we applied multiple filtering steps to the textual features in order to
process them efficiently. Regarding the "Stopword removal and
stemming" module in Figure 1, the detailed processing procedures are:
- "Stopword removal and stemming" is performed on the "scofeats" files,
where stopwords, misspelled words, words from languages other
than English, and the titles of the original web pages are extracted and parsed.
- We then match the remaining words against the list of
concepts of the development set based on WordNet 3.0 (http://wordnet.princeton.edu/). We extend the list of
concepts with their synonyms and examine whether the current word matches
a concept or one of its synonyms (a rough sketch of this matching step is given after this list).
- The Lucene [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] stemmer is applied if the word does not exactly match
the list of concepts.
        </p>
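        <p>As a rough illustration of this matching step, the sketch below (assuming Python with NLTK and its WordNet and stopword corpora installed) checks whether a word from the processed web text matches a concept or one of its WordNet synonyms, with a stemmed comparison as a fallback. The variables words and concepts are hypothetical stand-ins for one image's filtered text and the development concept list, and the Porter stemmer merely stands in for the Lucene stemmer used in our system.</p>
        <preformat>
# Sketch of the concept-matching step (assumes NLTK with WordNet data).
from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

def synonym_set(concept):
    """The concept name plus its WordNet synonyms (lemma names)."""
    syns = {concept}
    for synset in wn.synsets(concept):
        syns.update(l.name().lower().replace('_', ' ') for l in synset.lemmas())
    return syns

def assign_concepts(words, concepts):
    """Return the concepts whose name or synonyms appear in the word list."""
    words = [w.lower() for w in words if w.lower() not in stop]
    stems = {stemmer.stem(w) for w in words}
    assigned = set()
    for concept in concepts:
        if synonym_set(concept).intersection(words) or stemmer.stem(concept) in stems:
            assigned.add(concept)
    return assigned
        </preformat>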
        <p>The output of the "Result filtering and refinement" module produces a candidate set of
concepts for each training image. Indeed, the processing approach in this
subsection could be considered a baseline, as it assigns many false negative and
false positive concepts to training images. Therefore, besides the textual features
of training images, it is reasonable to also consider the visual content of training
images. For example, for a training image depicting an "airplane" whose
textual features (web page, title, etc.) contain the words "airplane pilot hats", simply
applying the text processing approach would result in the concepts "airplane",
"person", and "hat" being assigned to the training image. However, if it is possible to
estimate the content of the image visually in advance, then the unrelated concepts
"person" and "hat" could be rejected for this training image. Thus, in the next
subsection, we introduce a context mapping method to predict tags
for training images in advance.</p>
        <sec id="sec-2-1-1">
          <title>Context mapping using CNN</title>
          <p>To estimate the content of training images visually, we take advantage of the
recently proposed toolbox Overfeat (http://cilvr.nyu.edu/doku.php?id=software:overfeat:start), an image recognizer and feature
extractor built around a deep convolutional neural network (CNN). We chose
this toolbox for two reasons: (1) it achieved competitive classification
results in the ImageNet 2013 contest (http://www.image-net.org/challenges/LSVRC/2013/results.php); (2) the Overfeat convolutional net was trained
on 1K WordNet nouns, which is consistent with the concept list of ImageCLEF.
Thus it is reasonable to predict tags for training images with Overfeat and
map the tags to ImageCLEF concepts using a context mapping rule. Regarding
the "Tag prediction with CNN" and "Context mapping" modules in Figure 1,
the detailed processing procedures are (a sketch of the mapping step follows this list):
- For a given training image, we directly use the Overfeat toolbox to predict
tags for it.
- For each tag predicted by Overfeat, we calculate its semantic
similarity to the concepts of the development set and map it to the most
similar concept.
          </p>
          <p>[Figure 2 (example of context mapping using CNN): candidate tags from the text processing stage, tags with confidence scores predicted by Overfeat, and the finally assigned concepts for a sample training image.]</p>
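          <p>The following is a minimal sketch of this mapping step, assuming the NLTK WordNet interface; the Overfeat call itself is not shown, so predicted_tags is only a hypothetical example of its output.</p>
          <preformat>
# Sketch of the context-mapping step: map each Overfeat tag to the most
# semantically similar ImageCLEF concept using WordNet path similarity.
# 'predicted_tags' and 'concepts' are hypothetical example inputs.
from nltk.corpus import wordnet as wn

def best_concept(tag, concepts):
    """Return the concept with the highest path similarity to the tag."""
    best, best_sim = None, 0.0
    for concept in concepts:
        for s1 in wn.synsets(tag, pos=wn.NOUN):
            for s2 in wn.synsets(concept, pos=wn.NOUN):
                sim = s1.path_similarity(s2) or 0.0
                if sim > best_sim:
                    best, best_sim = concept, sim
    return best

predicted_tags = ['airliner', 'warplane', 'wing']   # hypothetical Overfeat output
concepts = ['airplane', 'sky', 'vehicle', 'boat', 'person']
candidates = {best_concept(t, concepts) for t in predicted_tags}
candidates.discard(None)                            # tags with no noun synset
          </preformat>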
          <p>In Figure 2, we give an example of the context mapping using CNN. The tags
in the blue rectangle are obtained from the preceding text processing stage. The tags
(with confidence scores) in the green rectangle are predicted by Overfeat. For
the context mapping procedure (in practice, we use the path similarity measure
in the NLTK toolbox as the semantic measure), we obtain the candidate concept set
{sky, airplane, vehicle, boat} based on the tags in the green rectangle.</p>
          <p>The "Result filtering and refinement" module in Figure 1 fuses the
candidate concept sets from the textual processing approach and from the context mapping
with CNN. The textual processing approach produces many more concepts than the
context mapping with CNN, but its concept set is also coarser. Thus, for the fusion
strategy, we rely more on the concept set from the context mapping with CNN and preserve only
the concepts with high similarity scores from the textual processing
concept set. In Figure 2, the concepts in the red rectangle are the finally assigned
concepts for the training image, which are considered to be semantically related to
it.</p>
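          <p>A minimal sketch of this fusion rule is given below; the similarity threshold sim_threshold is an illustrative value rather than the one used in our system, while the cap of 4 concepts per image corresponds to the limit used in our experiments.</p>
          <preformat>
# Sketch of the "Result filtering and refinement" fusion: keep the CNN-derived
# concepts, add only high-similarity textual concepts, and cap the label count.
def fuse_concepts(cnn_concepts, text_concepts, text_scores,
                  sim_threshold=0.5, max_concepts=4):
    fused = list(cnn_concepts)
    extra = [c for c in text_concepts
             if c not in fused and text_scores.get(c, 0.0) >= sim_threshold]
    extra.sort(key=lambda c: text_scores[c], reverse=True)
    fused.extend(extra)
    return fused[:max_concepts]
          </preformat>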
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Experimental results</title>
        <sec id="sec-2-2-1">
          <title>Visual features</title>
          <p>
            Similar to the best-performing system TPT [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] in the ImageCLEF 2013 annotation task, we use
the visual features provided by the organizers, including GIST, Color Histogram,
SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT. For all SIFT-based
descriptors, a bag-of-words (BoW) representation is provided. An early fusion is performed
by concatenating all the provided features (global color histogram, getlf, C-SIFT,
GIST, OPPONENT-SIFT, RGB-SIFT, SIFT), resulting in a 21,312-dimensional space.
The global features GIST and Color Histogram are normalized using the L2 norm, and
the SIFT-based features are normalized using the L1 norm.
          </p>
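          <p>A sketch of this early fusion is shown below, assuming NumPy; the per-image arrays gist, color_hist and sift_bows are hypothetical placeholders for the provided descriptors.</p>
          <preformat>
# Sketch of the early fusion: L2-normalise the global descriptors,
# L1-normalise the SIFT-based BoW histograms, then concatenate.
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x) + eps)

def l1_normalize(x, eps=1e-12):
    return x / (np.abs(x).sum() + eps)

def fuse_features(gist, color_hist, sift_bows):
    """Concatenate normalised features into one long vector (about 21,312 dims)."""
    parts = [l2_normalize(gist), l2_normalize(color_hist)]
    parts += [l1_normalize(b) for b in sift_bows]   # C-SIFT, RGB-SIFT, SIFT, ...
    return np.concatenate(parts)
          </preformat>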
        </sec>
        <sec id="sec-2-2-2">
          <title>Evaluation measures</title>
          <p>Three standard measures are used to evaluate the runs: the mean F-measure over
the samples (MF-samples), the mean F-measure over
the concepts (MF-concepts), and the mean average precision over the samples
(MAP-samples). The MF is computed by analyzing both the samples (MF-samples)
and the concepts (MF-concepts), whereas the MAP is computed by analyzing the
samples.</p>
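          <p>The sketch below, assuming scikit-learn, shows roughly how such measures can be computed from binary ground-truth and prediction matrices and real-valued scores; it is only an approximation of the official evaluation tool, and the inputs are hypothetical.</p>
          <preformat>
# Rough sketch of the three measures, assuming scikit-learn and binary
# ground-truth / prediction matrices Y_true, Y_pred (n_images x n_concepts)
# plus real-valued classifier scores S.
from sklearn.metrics import f1_score, label_ranking_average_precision_score

def evaluate(Y_true, Y_pred, S):
    mf_samples  = f1_score(Y_true, Y_pred, average='samples')   # mean F over images
    mf_concepts = f1_score(Y_true, Y_pred, average='macro')     # mean F over concepts
    map_samples = label_ranking_average_precision_score(Y_true, S)
    return mf_samples, mf_concepts, map_samples
          </preformat>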
        </sec>
        <sec id="sec-2-2-3">
          <title>Training SVM classifiers for concepts</title>
          <p>
            Following the SVM-based annotation techniques which achieved the best
annotation performance last year [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], we again trained a "one-versus-all" SVM
classifier for each concept. Popular SVM solvers such as SVMlight and LibSVM
are not feasible for training on large volumes of high-dimensional data,
since these batch methods need to pre-load the entire training set into memory to
compute the gradient in each iteration. Thus it is difficult to use these
SVM solvers directly. Given the configuration of our machine (an Intel Core i7
2600 CPU (3.4 GHz) and 16 GB RAM), we consider a better solution
based on the stochastic gradient descent (SGD) algorithm, which is more efficient for
training SVM classifiers on large-scale data. Different from the batch methods,
the SGD algorithm feeds training samples one by one to calculate the gradients
and update the model parameters. Although the SGD algorithm may need
more iterations to reach convergence, it requires much less memory,
which makes it more appropriate for large-scale training data and an online learning
setting.
          </p>
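          <p>For illustration, a single SGD update for a linear SVM with hinge loss and L2 regularisation can be sketched as follows; the regularisation constant lam and learning rate eta are hypothetical values, not the settings of our actual solver.</p>
          <preformat>
# Sketch of one SGD step for a linear SVM (hinge loss + L2 regularisation).
# w, b: current model; x: feature vector; y: label in {-1, +1}.
import numpy as np

def sgd_step(w, b, x, y, lam=1e-4, eta=0.01):
    """Return updated (w, b) after one stochastic gradient step."""
    margin = y * (np.dot(w, x) + b)
    loss_active = (1.0 - margin) > 0.0        # hinge term contributes only inside the margin
    grad_w = lam * w - (y * x if loss_active else 0.0)
    grad_b = -y if loss_active else 0.0
    return w - eta * grad_w, b - eta * grad_b
          </preformat>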
          <p>
            Following the advice in [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], we randomize the training data and load the
data in chunks which fit in memory, then train the different classifiers on further
randomizations of the chunks, so that different epochs see the chunks in
different orders, which makes the learned classifiers more stable. We repeat this
training process over the training set 5 times to train an SVM classifier for each
concept of the development set and cross-validate the F-measure on the development
set. Then we select the parameters with the best performance on the development set to
further learn classifiers for the concepts of the test set. To predict concepts for images in
the development and test sets, we use the trained concept classifiers and obtain
decisions for each concept by thresholding the confidence score at zero.
          </p>
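          <p>A rough sketch of this chunk-wise training loop, assuming scikit-learn's SGDClassifier, is given below; iter_chunks is a hypothetical generator standing in for our chunk loader, and the hyper-parameters are illustrative.</p>
          <preformat>
# Sketch of the chunk-wise, multi-epoch training of one concept classifier.
# iter_chunks() is a hypothetical generator yielding (features, labels) blocks
# that fit in memory, re-shuffled on every call so each epoch sees a new order.
import numpy as np
from sklearn.linear_model import SGDClassifier

def train_concept_classifier(iter_chunks, n_epochs=5):
    clf = SGDClassifier(loss='hinge', alpha=1e-4)
    for epoch in range(n_epochs):
        for X, y in iter_chunks(shuffle=True):
            clf.partial_fit(X, y, classes=np.array([-1, 1]))
    return clf

# Prediction: a concept is assigned when the decision score exceeds zero, e.g.
#   assigned = clf.decision_function(X_test) > 0
          </preformat>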
        </sec>
        <sec id="sec-2-2-4">
          <title>Inside analysis of annotation results</title>
          <p>We first discuss the proposed two-fold concept assignment to training images and
evaluate its influence on the learning accuracy of the concept classifiers. We first
conduct experiments on the development set and then extend the two-fold scheme
to the test set. Here we consider three settings: (1) "Single-Fold A": the single-fold
scheme of traditional textual information processing (the "Stopword removal and
stemming" module in Figure 1); (2) "Single-Fold B": the single-fold scheme of the CNN-based
tag prediction process (the "Tag prediction with CNN" and "Context
mapping" modules); (3) "Two-Folds": the fusion of both "Single-Fold A"
and "Single-Fold B". We limited the maximum number of concepts assigned to
each training image to 4. Then we use the SVM classifiers learned from the
labeled training data to predict concepts for images in the development set (the top
5 ranked concepts are taken as the final predicted concepts). Table 1
shows the annotation performance of the three settings, and one of the baselines
provided by the organizers is also included for comparison. It can be observed
that all three settings consistently improve on the baseline. In
particular, the tags predicted by Overfeat are considerably accurate for the training images:
"Single-Fold B" outperforms the "Single-Fold A" setting of the traditional textual
information scheme, which implies that the tags are highly coherent with the concepts
in ImageCLEF. Moreover, when the two settings are fused into the proposed
"Two-Folds" setting, the result is further improved on all three measures.</p>
          <p>We then evaluate the effect of the "Result filtering and refinement" module.
In the experiment settings above, we restricted the number (denoted by K)
of concepts assigned to each training image to K = 4. It is reasonable that
the value of K could influence the learning accuracy of the concept classifiers, as
it directly determines the quality of the training samples for each concept. Thus,
we further vary the value of K (from 1 to 10) and explore the optimal
K for the concept assignment of training images. The annotation performance on the
development set with varying K is shown in Figure 3. It can be observed from
Figure 3 that: (1) both MF-concept and MF-sample peak
at K = 6, while MAP-sample peaks at K = 9; (2) the
MAP-sample is more sensitive to K, since the number of ground-truth concepts
per image in the development set ranges from 1 to 11 (3.52 on average). Based
on these observations, we finally choose K &#8712; {6, 7, 8, 9, 10} for our later submitted
runs on the test set.</p>
          <p>For the test set, we submitted ten runs (full results at http://www.imageclef.org/2014/annotation/results). Here we present our
best 5 runs together with the baselines provided by the organizers and the best runs of the
other groups. We can see from the overall results in Table 2 that all our
submitted runs exceed the best baseline result for the test set on
all measures. Looking at the overall participant results, our best runs rank
6th, 3rd and 5th by MF-sample, MF-concept and MAP-sample
respectively for the test set, and 4th by overall performance. This means
that our best runs are competitive with the other results.</p>
          <p>However, there is still a considerable gap between our best runs and the
top-ranked runs from the KDEVIR group. Although we are currently not able to examine
the details of their annotation technique, there is still room to
improve our annotation system itself in the following respects.</p>
          <p>[Table 2: overall results on the test set for the official baseline (oppsift) and the runs kdevir 09, MIL 03, MindLab 01, DISA-MU 04, RUC 05, IPL 09, IMC-FU 01, INAOE 05, NII 01, FINKI 01, and our runs MLIA 06-10; numeric scores omitted.]</p>
          <p>(1) In our current
system, we directly utilized the Overfeat toolbox for tag prediction of training
images; a more reasonable choice would be to generate CNN visual features
and use them directly to learn the concept classifiers. Indeed, several
teams such as MIL and MindLab used CNN visual features. (2) Currently,
the "Context mapping" module only considers mapping the tags from Overfeat
to ImageCLEF via their synonyms/hyponyms in WordNet, and the similarity
measure from the NLTK toolbox may not be precise enough to produce the correct mappings.
An alternative is to model a context-based similarity measure of tags
using Flickr image metadata, which is more effective at capturing the
semantic associations in a practical setting. (3) Our concept
modeling (the learning of SVM-based concept classifiers) is not elaborately optimized and
tuned, because of hardware limitations and resource consumption. Our system's
capability should improve if we could overcome these limitations.</p>
        </sec>
      </sec>
      <sec id="sec-3-1">
        <title>Conclusion</title>
        <p>In this paper, we presented the annotation system we developed for the
ImageCLEF 2014 Scalable Concept Image Annotation task. Our proposal
focuses on improving the accuracy of concept assignment for training images.
We proposed a two-fold concept assignment scheme which explicitly leverages the
provided textual information semantically (Section 2.1) and the training images
visually. To learn concept classifiers, we adopted the established SVM-based
model and used the SGD algorithm to deal with the large-scale setting of this
task. Experimental results show that our proposals for both visual and textual
information processing are necessary to build a competitive system. We
also discussed potential future directions to further improve the current system.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Caputo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Martinez-Gomez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patricia</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marvasti</surname>
            , N., Uskudarl ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cazorla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Varea</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morell</surname>
          </string-name>
          , V.:
          <article-title>ImageCLEF 2014: Overview and analysis of the results</article-title>
          .
          <source>In: CLEF proceedings. Lecture Notes in Computer Science</source>
          , Springer Berlin Heidelberg (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Grana</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serra</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manfredi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cucchiara</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martoglia</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandreoli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          : Unimore at imageclef 2013:
          <article-title>Scalable concept image annotation</article-title>
          .
          <source>In: CLEF 2013 Evaluation Labs andWorkshop, OnlineWorking Notes</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hatcher</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gospodnetic</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCandless</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Lucene in action</article-title>
          .
          <source>Second Edition</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hidaka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gunji</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harada</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Mil at imageclef 2013: Scalable system for image annotation</article-title>
          .
          <source>In: CLEF 2013 Evaluation Labs andWorkshop, OnlineWorking Notes</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Renmin university of china at imageclef 2013 scalable concept image annotation</article-title>
          .
          <source>In: CLEF 2013 Evaluation Labs and Workshop</source>
          , Online Working Notes (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sahbi</surname>
          </string-name>
          , H.:
          <article-title>Telecom paristech at imageclef 2013 scalable concept image annotation task: Winning annotations with context dependent svms</article-title>
          .
          <source>In: CLEF 2013 Evaluation Labs and Workshop</source>
          , Online Working Notes (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sermanet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eigen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mathieu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergus</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , LeCun, Y.:
          <article-title>Overfeat: Integrated recognition, localization and detection using convolutional networks</article-title>
          .
          <source>CoRR abs/1312</source>
          .6229 (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Uricchio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Bimbo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Micc at imageclef 2013 image annotation subtask</article-title>
          .
          <source>In: CLEF 2013 Evaluation Labs andWorkshop, Online Working Notes</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
          </string-name>
          , R.:
          <article-title>Overview of the imageclef 2012 scalable concept image annotation subtask</article-title>
          .
          <source>In: CLEF 2012 Evaluation Labs and Workshop</source>
          , Online Working Notes (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
          </string-name>
          , R.:
          <article-title>Overview of the ImageCLEF 2014 Scalable Concept Image Annotation Task</article-title>
          . In:
          <article-title>CLEF 2014 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the imageclef 2013 scalable concept image annotation subtask</article-title>
          .
          <source>In: CLEF 2013 Evaluation Labs and Workshop</source>
          , Online Working Notes. pp.
          <volume>1</volume>
          {
          <issue>19</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>