
MIL at ImageCLEF 2013: Scalable System for Image Annotation

Masatoshi Hidaka

Naoyuki Gunji

Tatsuya Harada

haradag@mi.t.u-tokyo.ac.jp

Machine Intelligence Lab., The University of Tokyo

2013

We give details of our methods in the ImageCLEF 2013 Scalable Concept Image Annotation task. For the textual feature, we propose a method for selecting text closely related to an image from its webpage. In addition, to take the meaning of each concept into account, we propose using WordNet to obtain words related to the concept. For visual features, we use the Fisher Vector (FV), which is regarded as an extension of the Bag-of-Visual-Words representation. We train linear classifiers with Passive-Aggressive with Averaged Pairwise Loss (PAAPL), an online multilabel learning method based on Passive-Aggressive. Since PAAPL is computationally efficient and copes with multilabel data appropriately, it is suitable for this task. Results show that our annotation pipeline is simple but works well in this task.

Keywords: ImageCLEF, Textual Feature, WordNet, Annotation

In the ImageCLEF 2013 Scalable Concept Image Annotation task, the goal is multi-label image annotation [1]. The dataset is extracted from general webpages, so the cost of collecting data is low [2]. However, the collected images have no explicit labels, so we must extract labels for each image from its webpage. The simplest solution is to assign to an image every concept label that appears in its webpage. This often fails to yield correct labels because it does not consider the meanings of the concepts. Furthermore, the text of a webpage is not necessarily related to the image. We therefore try several methods to obtain more accurate labels: we use information from WordNet [3] to get words related to the concepts, and we limit the range of text extraction to omit text unrelated to the image.

For visual features, we adopt the Fisher Vector (FV) [4], an improvement on Bag-of-Visual-Words (BoVW). We use a linear classifier for each concept because linear classifiers are computationally efficient and suitable for large-scale data. Because the labels assigned to the training images are not ground-truth labels, they must be regarded as noisy, so we pay attention to robustness to noise in the training data. To train the linear classifiers, we use PAAPL [5], an online multilabel learning method based on Passive–Aggressive (PA) [6]. PAAPL converges faster than PA and shares PA's robustness to noise.

Feature Extraction

Visual Feature

As a visual feature, we use the Fisher Vector (FV). Because it achieves good classification performance with a linear classifier, it is often used for large-scale visual categorization. Indeed, in the ImageNet Large-Scale Visual Recognition Challenge 2012 (ILSVRC2012), four out of seven teams used FVs to represent images. We use four local descriptors: SIFT, C-SIFT, LBP, and GIST. GIST is usually used to describe a whole image, but we use it as a local descriptor. All local descriptors are reduced to 64 dimensions using Principal Component Analysis (PCA). Local descriptors are densely extracted from patches at five scales on a regular grid with a step of six pixels, and we learn a Gaussian Mixture Model (GMM) with 256 components, each of which has a diagonal covariance matrix. To use spatial information, we divide images into 1 × 1, 2 × 2, and 3 × 1 cells. Then FVs are calculated over each region as follows.

Let $X = \{x_1, x_2, \ldots, x_N\}$ be a set of $N$ local descriptors extracted from an image, and let $w_i$, $\mu_i$, and $\Sigma_i$ be the mixture weight, mean vector, and covariance matrix of the $i$-th Gaussian, respectively. Then we define

$$\mathbf{u}_i = \frac{1}{N\sqrt{w_i}} \sum_{n=1}^{N} \gamma_n(i)\, \Sigma_i^{-1/2} (x_n - \mu_i),$$

$$\mathbf{v}_i = \frac{1}{N\sqrt{2 w_i}} \sum_{n=1}^{N} \gamma_n(i) \left[ \Sigma_i^{-1} \operatorname{diag}\left((x_n - \mu_i)(x_n - \mu_i)^T\right) - \mathbf{1} \right],$$

where $\mathbf{1}$ is a column vector whose components are all 1, and $\operatorname{diag}(X)$ for a matrix $X$ is the column vector composed of the diagonal components of $X$. $\gamma_n(i)$ is the soft assignment of $x_n$ to the $i$-th Gaussian, also known as the posterior probability:

$$\gamma_n(i) = \frac{w_i\, u_i(x_n)}{\sum_{j=1}^{K} w_j\, u_j(x_n)},$$

where $u_i(\cdot)$ is the $i$-th Gaussian and $K$ is the number of GMM components. The FV representation is therefore given as

$$G = [\mathbf{u}_1^T\ \mathbf{v}_1^T\ \cdots\ \mathbf{u}_K^T\ \mathbf{v}_K^T]^T.$$
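The definitions above can be sketched in plain Python for a GMM with diagonal covariances. This is only an illustrative sketch: the function names, the list-based data layout, and the toy GMM parameters are assumptions made here, and a practical implementation would use vectorized numerical code.

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a Gaussian whose covariance is diagonal (var holds the diagonal)."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)

def fisher_vector(X, weights, means, variances):
    """Return G = [u_1, v_1, ..., u_K, v_K] as a flat list of length 2 * K * D."""
    N, K, D = len(X), len(weights), len(X[0])
    u = [[0.0] * D for _ in range(K)]
    v = [[0.0] * D for _ in range(K)]
    for x in X:
        # Soft assignment gamma_n(i): posterior of each Gaussian given x.
        lik = [weights[i] * gaussian_diag(x, means[i], variances[i]) for i in range(K)]
        total = sum(lik)
        for i in range(K):
            gamma = lik[i] / total
            for d in range(D):
                z = (x[d] - means[i][d]) / math.sqrt(variances[i][d])
                u[i][d] += gamma * z               # Sigma^{-1/2} (x - mu)
                v[i][d] += gamma * (z * z - 1.0)   # Sigma^{-1} diag(...) - 1
    G = []
    for i in range(K):
        G.extend(val / (N * math.sqrt(weights[i])) for val in u[i])
        G.extend(val / (N * math.sqrt(2.0 * weights[i])) for val in v[i])
    return G

# Toy GMM with K = 2 components in D = 2 dimensions (made-up values).
X = [[0.0, 0.1], [1.0, 0.9], [0.2, -0.1]]
weights = [0.5, 0.5]
means = [[0.0, 0.0], [1.0, 1.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
G = fisher_vector(X, weights, means, variances)
print(len(G))  # 8 (= 2 * K * D)
```

Each Gaussian contributes one block for the mean deviation and one for the variance deviation, so the vector length grows as 2KD.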

Following [4], we apply power normalization and L2 normalization to each of the extracted FVs. Power normalization applies the function $g(z) = \operatorname{sign}(z)|z|^{a}$ to each component of the FVs, where $a$ is a parameter set to 1/2 in this work. After normalization, we concatenate the FVs into a single vector. The dimension of our FVs is 262144.
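As a small illustration, the two normalizations and the stated dimension can be checked as follows. The helper names are hypothetical. The dimension count assumes that each region's FV has 2 × 64 × 256 components (mean and variance blocks for 64-dimensional descriptors and 256 Gaussians) and that the 1 × 1, 2 × 2, and 3 × 1 grids give 8 regions in total, which reproduces the figure of 262144.

```python
import math

def power_normalize(fv, a=0.5):
    """Apply g(z) = sign(z) * |z|**a to every component."""
    return [math.copysign(abs(z) ** a, z) for z in fv]

def l2_normalize(fv):
    """Scale the vector to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(z * z for z in fv))
    return [z / norm for z in fv] if norm > 0 else list(fv)

print(power_normalize([4.0, -9.0, 0.0]))  # [2.0, -3.0, 0.0]

# Dimension check: 2 * D * K components per region, 8 spatial regions.
D, K = 64, 256
regions = 1 * 1 + 2 * 2 + 3 * 1
dim = 2 * D * K * regions
print(dim)  # 262144
```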

Textual Feature

To assign correct labels to images, we take two steps. First, we extract text closely related to an image from its webpage. Then, if a concept word exists in the extracted text, the concept label is assigned to the image. The pipeline is presented in Fig. 1.

Fig. 1. Label assignment pipeline. The webpage XML is parsed into the set of image-related words T. For a concept C, the WordNet database provides the collection W_C = {C, synonym(C), hyponym(C)}. Label C is assigned if T ∩ W_C ≠ ∅.

Text Extraction. To extract text closely related to an image, we consider three types of text in the webpage: the text around the image, the img tag attributes (src, alt, title), and the page title. First, we parse the XML file of the webpage and extract the page title, the text, and the img tag. Then we select some of them and split them into a set of single words T. For the text around the image, we consider the distance from the image (the img tag position) because the entire webpage does not necessarily focus on one image. We treat the maximum distance from the image as a parameter and use only the words within that distance. To normalize words, we singularize nouns.
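The extraction step might look like the sketch below. Several things here are assumptions rather than details from this note: the page body is taken to be already parsed into a word list, the img tag position is given as a word index, the distance is counted in words on each side of that position, and `singularize` is a deliberately crude stand-in for whatever noun singularization was actually used.

```python
import re

def singularize(word):
    # Crude plural stripping for illustration only; not the rule used in the paper.
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def tokenize(text):
    # Lowercase, keep alphabetic runs, and singularize each word.
    return [singularize(w) for w in re.findall(r"[a-z]+", text.lower())]

def extract_words(body_words, img_index, max_distance,
                  img_attrs=(), page_title="", use_attrs=True, use_title=True):
    """Build the word set T for one image."""
    lo = max(0, img_index - max_distance)
    hi = img_index + max_distance
    T = set()
    for w in body_words[lo:hi]:        # words within max_distance of the img tag
        T.update(tokenize(w))
    if use_attrs:                      # src, alt, title values of the img tag
        for value in img_attrs:
            T.update(tokenize(value))
    if use_title:
        T.update(tokenize(page_title))
    return T

body = "welcome to my blog here are two parrots in the garden thanks for reading".split()
T = extract_words(body, img_index=7, max_distance=2,
                  img_attrs=["birds.jpg", "Two parrots"], page_title="My blog")
print(sorted(T))
```

The `use_attrs` and `use_title` switches correspond to the three text sources being selectable independently, and a small `max_distance` keeps words such as "garden" out of T when they lie far from the image.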

Label Assignment. To assign labels to an image, we first collect words related to each concept C given in the task. We denote the set of collected words by $W_C$, which consists of C together with its synonyms and hyponyms:

$$W_C = \{C, \mathrm{synonym}(C), \mathrm{hyponym}(C)\}.$$

For example, given the target concept “bird”, we get

$$W_{\mathrm{bird}} = \{\mathrm{bird}, \mathrm{parrot}, \mathrm{pigeon}, \ldots\}.$$

For collecting synonyms and hyponyms, we use WordNet. To simplify the implementation, we use no compound words. Hyponyms are organized hierarchically, so we collect words at all depths recursively. Words that have multiple meanings are omitted; a word is judged to have multiple meanings if it appears in multiple entries in WordNet.

Then, if the extracted text T contains any of the concept-related words in $W_C$ (the concept word, its synonyms, and its hyponyms), we assign the concept label C to the image. Consequently, we obtain a training dataset in which some images have multiple labels and some images have no label.
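A minimal sketch of the collection and assignment steps follows. To stay self-contained it uses a small hand-written hierarchy in place of WordNet, so the dictionaries, the `ambiguous` set, and the function names are all illustrative assumptions. Keeping the concept word itself even when filtering is also an assumption of this sketch.

```python
def related_words(concept, synonyms, hyponyms, ambiguous=frozenset()):
    """Collect W_C: the concept, its synonyms, and its hyponyms at all depths."""
    found = {concept} | set(synonyms.get(concept, ()))
    stack = list(hyponyms.get(concept, ()))
    while stack:                       # walk the hyponym hierarchy to all depths
        word = stack.pop()
        if word not in found:
            found.add(word)
            stack.extend(hyponyms.get(word, ()))
    # Drop compound words and words with multiple meanings; keep C itself.
    return {w for w in found
            if w == concept or (" " not in w and w not in ambiguous)}

def assign_labels(T, concept_words):
    """Assign label C whenever T and W_C share at least one word."""
    return sorted(c for c, W in concept_words.items() if T & W)

# Hand-written toy hierarchy standing in for WordNet (not real WordNet output).
hyponyms = {"bird": ["parrot", "pigeon", "bird of prey", "crane"],
            "parrot": ["cockatoo"],
            "bird of prey": ["hawk"]}
synonyms = {"car": ["auto", "automobile"]}
ambiguous = {"crane"}

concept_words = {c: related_words(c, synonyms, hyponyms, ambiguous)
                 for c in ["bird", "car"]}
print(assign_labels({"two", "parrot", "in", "garden"}, concept_words))  # ['bird']
print(assign_labels({"crane", "truck"}, concept_words))                  # []
```

The second call returns no label because "crane" was marked as having multiple meanings, which mirrors how ambiguous words are omitted from $W_C$.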

Multilabel Annotation

In this section, we describe how we train the classifiers and annotate the test images. For scalability, we use a linear classifier for each concept label. With linear classifiers, a test image is annotated by computing the score of each label as the inner product of the visual feature and the weight vector of that label, and assigning the five top-scoring labels.
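The annotation rule can be illustrated with a short sketch. The function name, the dictionary layout of the models, and the toy weights are assumptions made for the example. The appended constant 1 plays the role of the bias component of the feature.

```python
def annotate(feature, models, k=5):
    """Score every label by an inner product and return the k top-scoring labels."""
    f = list(feature) + [1.0]   # append the bias term
    scores = {c: sum(w_i * f_i for w_i, f_i in zip(w, f))
              for c, w in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy weight vectors for six concepts over a 2-dimensional feature plus bias.
models = {
    "bird": [1.0, 0.0, 0.0],
    "car":  [0.0, 1.0, 0.0],
    "sky":  [0.5, 0.5, 0.1],
    "tree": [-1.0, 0.0, 0.0],
    "dog":  [0.1, 0.1, 0.0],
    "sea":  [0.0, -1.0, 0.0],
}
print(annotate([0.9, 0.2], models))  # ['bird', 'sky', 'car', 'dog', 'sea']
```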

To learn the model for each concept label from various images, the learning method must not only scale to the amount of data and estimate labels accurately, but also tolerate noise.

For that reason, we use Passive–Aggressive with Averaged Pairwise Loss (PAAPL). PAAPL is based on the Passive–Aggressive (PA) method, which is known to be tolerant of noise in training samples.

First, we describe the model update rule of PA.

Given the t-th training sample, we denote the visual feature by $f_t$, the set of concept labels assigned to the sample by $Y_t$, the set of concept labels not assigned to the sample by $\bar{Y}_t$, and the current model (weight vector) corresponding to concept label $C$ by $\mu_t^{C}$. In our setting, the dimension of $f_t$ is 262144 (Fisher Vector) + 1 (bias).

1. Fetch the t-th training sample and compute the score of each label using the current models.

2. Find a label $r_t \in Y_t$ associated with the sample and a label $s_t \in \bar{Y}_t$ not associated with the sample as follows.

$$r_t = \arg\min_{r \in Y_t} \mu_t^{r} \cdot f_t, \qquad s_t = \arg\max_{s \in \bar{Y}_t} \mu_t^{s} \cdot f_t$$

Given these labels, compute the hinge-loss $l$ from the current model. The hinge-loss is given as

$$l(\mu_t^{r_t}, \mu_t^{s_t}; (f_t, Y_t)) = \begin{cases} 0 & \mu_t^{r_t} \cdot f_t - \mu_t^{s_t} \cdot f_t > 1 \\ 1 - (\mu_t^{r_t} \cdot f_t - \mu_t^{s_t} \cdot f_t) & \text{otherwise} \end{cases}$$

3. Update the models with the update rule below.

$$\mu_{t+1}^{r_t} = \mu_t^{r_t} + \frac{l}{2|f_t|^2 + \frac{1}{D}} f_t, \qquad \mu_{t+1}^{s_t} = \mu_t^{s_t} - \frac{l}{2|f_t|^2 + \frac{1}{D}} f_t$$

$D$ is a parameter which controls the sensitivity to label prediction mistakes.
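One PA step as described above can be sketched as follows. The function names and the dictionary-of-lists model layout are assumptions of this sketch, and the toy feature is tiny compared with the 262145-dimensional vectors used in practice.

```python
def dot(w, f):
    return sum(a * b for a, b in zip(w, f))

def pa_update(models, f, positive, D=1.0):
    """One PA step on sample (f, positive); models is updated in place.

    models maps each label to its weight list; positive is the set Y_t.
    Returns the hinge-loss of the selected pair.
    """
    negative = [c for c in models if c not in positive]
    if not positive or not negative:
        return 0.0
    scores = {c: dot(w, f) for c, w in models.items()}
    r = min(positive, key=scores.get)   # lowest-scoring assigned label
    s = max(negative, key=scores.get)   # highest-scoring unassigned label
    loss = max(0.0, 1.0 - (scores[r] - scores[s]))
    if loss > 0.0:
        tau = loss / (2.0 * dot(f, f) + 1.0 / D)
        models[r] = [w + tau * x for w, x in zip(models[r], f)]
        models[s] = [w - tau * x for w, x in zip(models[s], f)]
    return loss

# Three concepts, feature of dimension 2 plus a bias component.
models = {c: [0.0, 0.0, 0.0] for c in ["bird", "car", "sky"]}
f = [1.0, 0.0, 1.0]
loss = pa_update(models, f, positive={"bird"}, D=1.0)
print(loss)            # 1.0
print(models["bird"])  # [0.2, 0.0, 0.2]
```

With all-zero models the margin is 0, so the loss is 1 and the step size is 1 / (2 · 2 + 1) = 0.2. Only the selected pair of models moves.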

Next, we describe the PAAPL method.

1. Fetch the t-th training sample and compute the score of each label using the current models.

2. For all combinations of a label $r_t \in Y_t$ associated with the sample and a label $s_t \in \bar{Y}_t$ not associated with the sample, compute the hinge-loss as in PA.

3. For every combination whose hinge-loss is not 0, update the corresponding models with the update rule of PA.

In PA, only one pair of models is updated per sample. In PAAPL, on the other hand, all pairs of models are updated per sample, which reduces the number of training iterations and the number of time-consuming score computations. Therefore, the models converge faster.
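The three PAAPL steps as summarized here can be sketched as below. This is not the reference implementation: it follows only the summary in this section, computing the scores once and applying the PA step for every violating pair, and the exact averaging that gives PAAPL its name is defined in [5] and is not reproduced. The function names and data layout are assumptions.

```python
def dot(w, f):
    return sum(a * b for a, b in zip(w, f))

def paapl_update(models, f, positive, D=1.0):
    """Update all violating (assigned, unassigned) label pairs for one sample.

    Scores are computed once from the current models; returns the number of
    pairs whose hinge-loss was non-zero.
    """
    scores = {c: dot(w, f) for c, w in models.items()}
    negative = [c for c in models if c not in positive]
    denom = 2.0 * dot(f, f) + 1.0 / D
    step = {c: 0.0 for c in models}      # accumulated step size per label
    n_pairs = 0
    for r in positive:
        for s in negative:
            loss = max(0.0, 1.0 - (scores[r] - scores[s]))
            if loss > 0.0:
                tau = loss / denom
                step[r] += tau           # push the assigned label up
                step[s] -= tau           # push the unassigned label down
                n_pairs += 1
    for c, t in step.items():
        if t != 0.0:
            models[c] = [w + t * x for w, x in zip(models[c], f)]
    return n_pairs

models = {c: [0.0, 0.0, 0.0] for c in ["bird", "sky", "car", "tree"]}
f = [1.0, 0.0, 1.0]
n = paapl_update(models, f, positive={"bird", "sky"}, D=1.0)
print(n)  # 4
```

In this toy run all four label models change after a single score computation, whereas a PA step on the same sample would change only two.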

Results

Using the visual feature and the textual feature stated in the previous section, the image classifier was trained by PAAPL. The number of training iterations was 5.

First, we determined whether we should use the synonyms and hyponyms of a concept when assigning labels to an image. For extracting text from a webpage, we used 10 words of text around the image, the img tag attributes, and the page title. The visual feature was the provided BoVW representation of C-SIFT. As a result, we chose to use synonyms and hyponyms. The result is presented in Table 1.

Second, we conducted a grid search over the text extraction conditions: how many words around the image should be considered, and whether the img tag attributes and the page title should be used. The visual feature was the same as in the first step. Results show that using only the img tag attributes was best; text far from the image decreased label assignment accuracy notably. The result is shown in Table 2, which also gives the number of images that have at least one label and the average number of labels assigned to one image. Because of the property of PAAPL, only images that have at least one label are used for training. It is worth noting that in the best condition, the number of images used and the average number of labels are both the lowest.

After this optimization, we repeated the earlier evaluation (of whether we should use synonyms and hyponyms), but the result was the same.

Finally, using the textual feature extraction condition stated above, we trained the weight vectors corresponding to each visual feature (Fisher Vector). Training took 2 hr for each visual feature. The final score of each test image is calculated by summing the scores of all the classifiers (C-SIFT+FV, GIST+FV, LBP+FV, and SIFT+FV). Final results are presented in Table 3, which also shows the evaluation with the provided C-SIFT + Bag-of-Visual-Words. The Fisher Vector exhibited much higher performance than Bag-of-Visual-Words. We performed learning and annotation for the test set with the top 5 ranked combinations.
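The score summation across descriptors can be illustrated with a small sketch. The function name and the toy scores are made up for the example. Each inner dictionary stands for the per-concept scores produced by the classifiers trained on one descriptor's FV.

```python
def fuse_scores(per_descriptor_scores, k=5):
    """Sum per-concept scores over descriptors and return the k best concepts."""
    total = {}
    for scores in per_descriptor_scores.values():
        for concept, s in scores.items():
            total[concept] = total.get(concept, 0.0) + s
    return sorted(total, key=total.get, reverse=True)[:k]

# Toy scores for one test image from two of the descriptor-specific classifiers.
per_descriptor = {
    "C-SIFT+FV": {"bird": 0.8, "sky": 0.3, "car": -0.2},
    "SIFT+FV":   {"bird": 0.1, "sky": 0.9, "car": -0.1},
}
print(fuse_scores(per_descriptor, k=2))  # ['sky', 'bird']
```

Evaluating a different descriptor combination only requires changing which entries are passed in.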

According to the results presented by the task organizers, our best run achieved the second-best score among all teams.

Conclusions

In this working note, we described our methods for annotating images in the ImageCLEF 2013 Scalable Concept Image Annotation task, with particular emphasis on extracting labels for images from websites. Results show that using the concepts’ synonyms and hyponyms from WordNet was useful, and that limiting the range of text taken from the website was also important. For visual features, we applied the Fisher Vector, a state-of-the-art coding method. We tried four local descriptors for the FV; the combination of C-SIFT, GIST, and SIFT showed superior performance. Our textual and visual features are simple, but they achieve good performance.

Table 3

C-SIFT  GIST  LBP  SIFT  MF-samples
X       -     -    -     0.312
-       X     -    -     0.324
-       -     X    -     0.279
-       -     -    X     0.311
X       X     -    -     0.338
X       -     X    -     0.321
X       -     -    X     0.336
-       X     X    -     0.331
-       X     -    X     0.340
-       -     X    X     0.317
X       X     X    -     0.342
X       X     -    X     0.346
X       -     X    X     0.332
-       X     X    X     0.339
X       X     X    X     0.343
Provided C-SIFT + BoVW   0.276

[1] Villegas, Paredes, and Thomee. Overview of the ImageCLEF 2013 Scalable Concept Image Annotation Subtask. CLEF 2013 working notes, 2013.

[2] Villegas and Paredes. Image-Text Dataset Generation for Image Annotation and Retrieval. CERI, 2012.

[3] Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[4] Perronnin, Sanchez, and Mensink. Improving the fisher kernel for large-scale image classification. European Conference on Computer Vision, 2010.

[5] Ushiku, Harada, and Kuniyoshi. Efficient image annotation for automatic sentence generation. International Conference on Multimedia, 2012.

[6] Crammer, Dekel, Keshet, Shalev-Shwartz, and Singer. Online Passive-Aggressive Algorithms. The Journal of Machine Learning Research, Vol. 7, pp. 551-585, 2006.