<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Image Annotation using Weakly Labelled Web Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pravin Kakar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangyu Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex Yong-Sang Chia</string-name>
          <email>yschiag@i2r.a-star.edu.sg</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>21-01</institution>
          ,
          <addr-line>1 Fusionopolis Way</addr-line>
          ,
          <country country="SG">Singapore</country>
          <addr-line>138632</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Social Media and Internet Vision Analytics Lab, Institute for Infocomm Research</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we propose and describe a method for localizing and annotating objects in images for the Scalable Concept Image Annotation challenge at ImageCLEF 2015. The unique feature of our proposed method is its almost exclusive reliance on a single modality (visual data) for annotating images. Additionally, we do not utilize any of the provided training data, but instead create our own similarly-sized training set. By exploiting the latest research in deep learning and computer vision, we are able to test the applicability of these techniques to a problem of extremely noisy learning. We obtain state-of-the-art results on an inherently multi-modal problem, thereby demonstrating that computer vision can also serve as a primary classification modality, instead of relying primarily on text to determine context prior to image annotation.</p>
      </abstract>
      <kwd-group>
        <kwd>visual recognition</kwd>
        <kwd>scalable annotation</kwd>
        <kwd>learning from noisy data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Scalable Concept Image Annotation challenge (SCIA) at
ImageCLEF 2015 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is designed to evaluate methods that automatically
annotate, localize and/or describe concepts and objects in images. In
contrast to previous years, there have been several notable changes
to the challenge. Some of them are highlighted below:
- Localization of objects within images has been introduced. As a
result, the focus on more "object"-like concepts has increased
this year.
- Use of hand-labelled data has been allowed. Although this is done
to technically allow the use of deep learning models trained on the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], it opens up the possibility of a potential clean-up of the
training data. Note that we have not done this in this work, but it
appears to be legal within the regulatory framework of the
challenge.
- The training and test sets are identical. Therefore, a method that
is able to exploit the noisy training data (e.g. via data cleaning)
could, in theory, benefit from potentially overfitting the training
data.
      </p>
      <p>
        From a computer vision perspective, SCIA is more challenging
than the current benchmark challenge [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in at least two senses:
1) the training data provided is fairly noisy, which makes learning a
difficult problem, and 2) the test set is 5× the size of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. While this
does not indicate a clear increase in the level of difficulty (for example,
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] has 4× the number of concepts of SCIA), certain aspects are
definitely more demanding.
      </p>
      <p>In the rest of these notes, we discuss our proposed method,
including data collection, classifier training and post-processing tweaks.
We also discuss the challenges posed by the fact that the test data
is annotated via crowd-sourcing, which adds another source of label
noise to the "ground-truth" data. Finally, we present our results on SCIA
along with proposals for future research to improve the automatic
annotation capabilities of techniques in this field.</p>
    </sec>
    <sec id="sec-2">
      <title>Algorithm Design</title>
      <p>As mentioned earlier, our algorithm is designed to rely mostly on
visual data. We do not employ extensive ontologies to augment
training data, nor do we use them during the training process, thus
helping understand the importance of having a strong visual recognition
pipeline. The various stages of our annotation pipeline are discussed
below.</p>
      <sec id="sec-2-1">
        <title>Data Collection</title>
        <p>We do not use the provided training set for two main reasons: 1) as
the training and test sets are identical, there is no penalty for
overfitting on the training data, which could provide an artificial boost
to performance results, and 2) there is little direct relationship
between the target concepts and the image keywords in the training
data, making it difficult to decouple the significance of a good
ontology from that of a good visual learning mechanism. Therefore, we
create our own training data of approximately the same size as the
SCIA dataset.</p>
        <p>[Figure 1: data collection pipeline: concept → augmented search terms → image crawling → filtering → training set.]</p>
        <p>The data collection pipeline is shown in Figure 1. We first
consider the target concept names as keywords for which appropriate
images need to be found. There is an issue of non-specificity with some
of the concept names. For example, the concept "dish" can refer
to both the vessel and the food content, although only the
former is the target concept. Additionally, it is difficult to achieve
both specificity and diversity using a single keyword when doing a
web search for images. As an example, searching for "candy" yields
generic images of candy, which, while containing diverse instances of
candy, do not closely match single, specific instances of candy.</p>
        <p>
          Both of the above issues are conventionally dealt with by using
ontologies to determine the coverage for each concept. We do not build
our own challenge-specific ontology here, but instead simply rely on
WordNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to augment the individual keywords. In particular, this
is done by also considering hyponyms (sub-categories) and lemmas
(similar words) of the target concept. The hyponyms help target
specific instances of the target concept, while the lemmas help increase
the coverage of the target concept.
        </p>
        <p>
          This augmented set of keywords per concept is then passed into
an image search engine. We use Bing Image Search [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] in this pipeline.
Note that we search for the hyponyms and lemmas of the target
concept with the target concept appended, in order to ensure that the
correct sense of images is being searched for. For example, searching
for "truffle" rather than "truffle candy" results in a very different
set of images that include fungi, which fall outside the scope of the
target concept.
        </p>
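        <p>As a rough illustration of this keyword augmentation, the sketch below uses NLTK's WordNet interface as a stand-in for our actual tooling; the synset selection, underscore handling and query formatting are illustrative assumptions.</p>
        <preformat>
from nltk.corpus import wordnet as wn  # requires the NLTK wordnet corpus

def augmented_queries(concept):
    """Build web-search queries from a concept's lemmas and hyponyms."""
    queries = {concept}
    for synset in wn.synsets(concept, pos=wn.NOUN):
        # lemmas (similar words) widen the coverage of the concept
        for lemma in synset.lemmas():
            queries.add(lemma.name().replace("_", " "))
        # hyponyms (sub-categories) target specific instances; the concept
        # itself is appended to pin down the intended word sense,
        # e.g. "truffle candy" rather than just "truffle"
        for hyponym in synset.hyponyms():
            for lemma in hyponym.lemmas():
                queries.add(lemma.name().replace("_", " ") + " " + concept)
    return sorted(queries)

print(augmented_queries("candy")[:10])
</preformat>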
        <p>We gather up to 4000 images per target concept from our
crawling engine. These images are passed through a filtering step where
images that are corrupted, too small or almost completely uniform
in appearance are discarded. The remaining images then form our
training dataset: an automatically created, noisily labelled dataset.</p>
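        <p>A minimal sketch of this filtering step, using Pillow and NumPy; the size and uniformity thresholds are illustrative placeholders, not the values we used.</p>
        <preformat>
import numpy as np
from PIL import Image

def keep_image(path, min_side=64, min_std=10.0):
    """Return True if a crawled image is usable as a training sample."""
    try:
        img = Image.open(path)
        img.load()  # force a full decode to catch corrupted files
    except OSError:
        return False
    big_enough = min(img.size) >= min_side  # reject tiny thumbnails
    gray = np.asarray(img.convert("L"), dtype=np.float32)
    varied_enough = gray.std() >= min_std   # reject near-uniform images
    return big_enough and varied_enough
</preformat>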
      </sec>
      <sec id="sec-2-2">
        <title>Feature Extraction</title>
        <p>
          For the images collected by the above process, we extract features
that will be useful for image classification. We choose to use
features from the winner of the latest ILSVRC classification challenge,
GoogLeNet [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], a deep learning model trained on the ILSVRC
dataset and consisting of a highly-interconnected network-in-network
architecture. Despite its depth, the model is small enough to fit
within our available computing resources of a single GeForce GTX
480 GPU with 1.5 GB of memory.
        </p>
        <p>
          For each training image, we scale it down to 256×256 pixels and
use the center crop of 224×224 pixels. The intuition behind this is
that, as the images are retrieved using specific keywords, it is likely
that the object of interest is the focus of the image and should be
dominant. This also reduces the computational complexity of the
feature extraction process considerably. We extract features from
the pooled 5B layer of the GoogLeNet model (see [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for details),
yielding a 1024-dimensional vector per training image. Each feature
vector is then normalized to unit length.
        </p>
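        <p>The sketch below reproduces this preprocessing using a pretrained GoogLeNet from torchvision as a stand-in for our Caffe model; hooking the global average-pooling layer yields the 1024-dimensional vector. The layer name and the input normalization constants are torchvision's, not ours.</p>
        <preformat>
import torch
from torchvision import models, transforms
from PIL import Image

model = models.googlenet(weights="IMAGENET1K_V1").eval()

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),  # scale down to 256x256 pixels
    transforms.CenterCrop(224),     # keep the (likely dominant) central object
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

feats = {}
model.avgpool.register_forward_hook(
    lambda mod, inp, out: feats.update(pool=torch.flatten(out, 1)))

x = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    model(x)
vec = feats["pool"][0]
vec = vec / vec.norm()  # normalize each feature vector to unit length
</preformat>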
        <p>
          We then train linear SVM classifiers [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] in a one-versus-all
fashion. This is not strictly correct, as some concepts will almost certainly
overlap (e.g. "face" will contain "eye", "nose", "mouth", etc.).
However, making such an independence assumption greatly simplifies the
learning process. Moreover, it allows us to avoid using a relationship
ontology to determine the appropriate weight of every concept for
each classifier. This is also in line with the goal of the challenge to
design a scalable system, as the addition of a new target concept does
not necessitate a recomputation of the weights against every existing
concept. In order to manage space constraints, we uniformly sample
the negative training samples for each concept, only selecting 60,000
of them.
        </p>
        <p>[Figure 2: annotation pipeline: object proposals → proposal classification → non-maximal suppression (with morphological merging for "tree" boxes) → fusion with face and gender detection, yielding per-concept confidences and bounding boxes.]</p>
        <p>Thus, we train a single 1024-dimensional linear classifier per
target concept to use for annotating test images.</p>
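        <p>A sketch of this one-versus-all training, using the LIBLINEAR backend of scikit-learn; the arrays "features" and "labels" and the default regularization are placeholders.</p>
        <preformat>
import numpy as np
from sklearn.svm import LinearSVC

def train_concept_classifiers(features, labels, concepts,
                              n_neg=60000, seed=0):
    """features: (N, 1024) unit-normalized vectors; labels: (N,) array of
    concept names, one per training image. Returns one SVM per concept."""
    rng = np.random.default_rng(seed)
    classifiers = {}
    for concept in concepts:
        pos = np.flatnonzero(labels == concept)
        neg = np.flatnonzero(labels != concept)
        # uniformly subsample negatives to manage space constraints
        neg = rng.choice(neg, size=min(n_neg, neg.size), replace=False)
        idx = np.concatenate([pos, neg])
        y = (labels[idx] == concept).astype(int)
        clf = LinearSVC()          # LIBLINEAR-backed linear SVM [3]
        clf.fit(features[idx], y)
        classifiers[concept] = clf
    return classifiers
</preformat>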
      </sec>
      <sec id="sec-2-3">
        <title>Annotation Pipeline</title>
        <p>Figure 2 shows our processing pipeline for a single test image.</p>
        <p>
          We first create object proposals using the technique of [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which
uses selective search to find likely "object" regions. Experimentally,
we observe that many of these proposals are near-duplicates. In
order to increase diversity, we limit the returned proposals to those
that overlap others by at most 90%. This is found to be an
acceptable value for controlling the tradeoff between increasing diversity
and losing significant objects. We restrict the number of proposals
returned to the first 150, in decreasing order of size.
        </p>
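        <p>The following sketch shows this proposal pruning; the 90% overlap limit and the 150-proposal cap come from the text above, while the particular overlap measure (intersection over the smaller box's area) is our assumption.</p>
        <preformat>
def box_area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def overlap(a, b):
    """Intersection area over the smaller box's area (an assumption)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / max(min(box_area(a), box_area(b)), 1)

def filter_proposals(boxes, max_overlap=0.9, max_keep=150):
    """Keep diverse proposals, examined in decreasing order of size."""
    kept = []
    for b in sorted(boxes, key=box_area, reverse=True):
        worst = max((overlap(b, k) for k in kept), default=0.0)
        if worst > max_overlap:
            continue  # near-duplicate of an already-kept proposal
        kept.append(b)
        if len(kept) == max_keep:
            break
    return kept
</preformat>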
        <p>Each proposal is then passed through the same feature extraction
pipeline as in Section 2.2, and the classifiers trained therein are run
to yield the most likely concepts per region. In general, non-maximal
suppression is done per concept across all regions to limit the effect
of overlapping proposals reporting the same concept for the same
object.</p>
        <p>There are two branches from this primary pipeline that we
employ based on our observations on the SCIA development set. Firstly,
we observe that many object proposals are labeled as "tree" if they
contain heavy foliage. While not incorrect for the individual region,
this may be incorrect for the overall image, where it is often difficult
to localize a single tree. In order to mitigate this effect, we perform
morphological merging for all tree boxes, taking the convex hull of
each merged region as the bounding box of the tree and assigning it
the highest confidence of all the merged boxes, as sketched below. We
observe this to improve the localization performance for the "tree"
concept on the development set. We also believe that this idea can be
extended to other non-uniformly shaped, difficult-to-localize concepts
such as "rock", "leaf", "brick", etc., but we do not have sufficient
annotations in the development data to verify this.</p>
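        <p>A sketch of the merging step for "tree" boxes: groups of mutually overlapping boxes are merged, and each group reports its enclosing box with the highest member confidence. The flood-fill grouping rule is our assumption here; for axis-aligned boxes, the convex hull of a group reduces to its enclosing box.</p>
        <preformat>
def boxes_touch(a, b):
    """Axis-aligned overlap test on (x1, y1, x2, y2) boxes."""
    return a[2] > b[0] and b[2] > a[0] and a[3] > b[1] and b[3] > a[1]

def merge_tree_boxes(boxes, scores):
    """Merge connected groups of overlapping boxes into single detections."""
    unseen = set(range(len(boxes)))
    merged = []
    while unseen:
        group = [unseen.pop()]
        frontier = list(group)
        while frontier:  # flood-fill one connected component
            i = frontier.pop()
            linked = [j for j in unseen if boxes_touch(boxes[i], boxes[j])]
            unseen.difference_update(linked)
            group.extend(linked)
            frontier.extend(linked)
        hull = (min(boxes[i][0] for i in group),
                min(boxes[i][1] for i in group),
                max(boxes[i][2] for i in group),
                max(boxes[i][3] for i in group))
        merged.append((hull, max(scores[i] for i in group)))
    return merged
</preformat>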
        <p>
          Secondly, we observe that for any generic image dataset, humans
are an important object. This is true for SCIA as well as for [
          <xref ref-type="bibr" rid="ref12 ref2 ref8">12,8,2</xref>
          ].
Note that this is in contrast to domain-specific datasets such as
[
          <xref ref-type="bibr" rid="ref11 ref9">9,11</xref>
          ]. To this end, we use face and gender detection from [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to
detect persons with frontal faces in images. We supplement this with a
simple regression to an upper-body annotation using the data from
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Finally, we use the information from [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to determine the
locations of various other person attributes.
        </p>
        <p>A fusion step merges the results from the primary and two
secondary pipelines. Specifically, person results from multiple pipelines
are suppressed or modified based on overlaps between returned
localizations for the same concepts. Additionally, localizations that have
too high or too low aspect ratios are suppressed, along with localizations
that fall below a preset score threshold. Finally, if all localizations
have been suppressed, we report a single localization comprising
the entire image, corresponding to the global scene classification.
This is based on the premise that all the development set images
contain at least one concept, and we extend that assumption to all
the test images.</p>
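        <p>A sketch of the suppression and fallback logic in this fusion step; the aspect-ratio bounds and score threshold are illustrative placeholders, and the whole-image fallback follows the premise stated above.</p>
        <preformat>
def fuse(localizations, image_size, global_concept, global_score,
         min_ar=0.1, max_ar=10.0, min_score=0.5):
    """localizations: list of (concept, score, (x1, y1, x2, y2))."""
    kept = []
    for concept, score, (x1, y1, x2, y2) in localizations:
        ar = (x2 - x1) / max(y2 - y1, 1)  # aspect ratio of the box
        if max_ar >= ar >= min_ar and score >= min_score:
            kept.append((concept, score, (x1, y1, x2, y2)))
    if not kept:
        # every image contains at least one concept: fall back to a single
        # whole-image localization from the global scene classification
        w, h = image_size
        kept = [(global_concept, global_score, (0, 0, w, h))]
    return kept
</preformat>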
        <p>Optionally, the fusion step can also include multiple textual
refinement steps. One option is to search URL filenames for concept
names and, if found, assign them to the entire image. Another
approach uses correlation between concepts from an ontology. This is
done to test the impact of simple context addition to the annotation
pipeline. Details of this latter approach are provided in the following
subsection.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Ontology and correlation</title>
        <p>With the feature extraction and annotation pipeline, a set of
bounding boxes $\{B_i\}$ is obtained for each test image. We denote the
prediction scores for the target concepts in $B_i$ as $S_i = [s_{i1}, \dots, s_{im}]$, where
$m$ is the total number of target concepts. By combining the
prediction scores for all the bounding boxes $\{B_i\}$, the prediction score for
the image is calculated as $S = [s_1, \dots, s_m]$, where $s_i = \max_j s_{ji}$.</p>
        <p>
          Because concepts do not occur in isolation (e.g.
bathroom and bathtub, mountain and cliff), semantic context can be used
to improve annotation accuracy. Following an approach similar to [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we
adopt semantic diffusion to refine the concept annotation scores. We
denote $C = \{c_1, \dots, c_m\}$ as the set of target concepts. Let $W$ be the
concept affinity matrix, where $W_{ij}$ indicates the affinity between
concepts $c_i$ and $c_j$, and let $D$ denote the diagonal node degree matrix,
where $D_{ii} = d_i = \sum_j W_{ij}$. Then the graph Laplacian is $\Delta = D - W$ and the
normalized graph Laplacian is $L = I - D^{-1/2} W D^{-1/2}$. In this
problem, we measure the concept affinity on the Wikipedia dataset.
Let $M$ denote the total number of pages in Wikipedia. For concept
$c_i$, we set $y_{ik} = 1$ if the concept keyword $c_i$ appears in page $k$, and
$y_{ik} = 0$ otherwise. The affinity $W_{ij}$ between concepts $c_i$ and $c_j$ can
then be computed using the Pearson product-moment correlation as
$$W_{ij} = \frac{\sum_{k=1}^{M} (y_{ik} - \mu_i)(y_{jk} - \mu_j)}{(M - 1)\,\sigma_i \sigma_j} \qquad (1)$$
where $\mu_i$ and $\sigma_i$ are the sample mean and standard deviation for
$c_i$, respectively. Based on our study, the original predictions are already
quite precise. We therefore employ only positive correlations, to boost
concepts and improve recall.
        </p>
        <p>Let $g \in \mathbb{R}^{m \times 1}$ denote the refined score vector; the values $g_i$ and
$g_j$ should be consistent with $W_{ij}$ (the affinity between concepts $c_i$
and $c_j$). Motivated by this semantic consistency, we formulate the
score refinement problem as minimizing the loss function
$$\varepsilon = \frac{1}{2}\,\mathrm{tr}(g^T L g) \qquad (2)$$
The loss function can be optimized using a gradient descent algorithm as
$$g \leftarrow g - \eta\,\nabla_g \varepsilon \qquad (3)$$
where
$$\nabla_g \varepsilon = L g \qquad (4)$$
and $\eta$ is the learning rate.</p>
        <p>Initially, $g = S$. By iteratively optimizing the loss function, we
obtain the refined, smooth score vector $g$ for the image. A threshold
$\theta$ is chosen, so that we consider concept $c_i$ to appear if $g_i > \theta$;
otherwise we consider the concept absent from the image (and consequently
from all of the bounding boxes $B_i$). That is, for each bounding box
in an image, we report the concept with the maximum confidence
among the concepts that appear in the image.</p>
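        <p>A sketch of the refinement loop of equations (2)-(4); the learning rate, iteration count and threshold value are illustrative assumptions.</p>
        <preformat>
import numpy as np

def refine_scores(S, W, lr=0.1, n_iter=50, theta=0.5):
    """Diffuse image-level scores S (length m) over the concept affinity W."""
    d = np.maximum(W.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(d ** -0.5)
    L = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    g = S.astype(float).copy()                        # initially, g = S
    for _ in range(n_iter):
        g -= lr * (L @ g)  # gradient step: the gradient of (2) is L g, eq. (4)
    return g > theta       # concepts considered present in the image
</preformat>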
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Dataset Limitations</title>
      <p>We tune the various parameters of our algorithm by validating its
performance on the development set. Unfortunately, the
development set (and, by extension, the SCIA test set) has multiple
problems that make it very difficult to correctly gauge the effect of tuning.
Most of these problems arise from the limitations of crowd-sourced
ground-truth annotation, and need to be addressed to make SCIA a
more consistent evaluation. We summarize the major issues involved
below as the four I's.</p>
      <p>Inconsistency Many images in the development set exhibit
inconsistent annotations. An example of this is shown in Figure 3. Despite
there clearly being 4 legs in the picture, only 1 is annotated. This
is inconsistent, as none of the unannotated instances are any less of
a "leg" than the one annotated. Other examples of inconsistencies
seen in the development set include unclear demarcations of when
multiple instances of the same concept are to be grouped into a
single instance or vice versa, and annotating partially-visible
instances in some images while not annotating completely visible
instances in other images.</p>
      <p>Incompleteness Several images in the development set are
incompletely annotated. This is most prevalent in the case of humans,
where various body parts are skipped altogether in the annotations.
Apart from this, there appears to be a certain level of arbitrariness
in choosing which concepts to annotate. For instance, in Figure 3,
"shirt" is annotated but "short pants" is not, when clearly both have
about the same level of "interestingness". Additionally, concepts like
"chair", "sock", "shoe", "stadium", etc., which are also present in the
image, are not annotated. This makes it extremely difficult to judge
the performance of a proposed technique on the development set.
Moreover, it seems to run counter to the challenge assertion that the
proportion of missing or incomplete annotations is insignificant.</p>
      <p>Incorrectness Although not as prevalent as the previous two
problems, many annotations are incorrect. In Figure 3, the two balls are
labelled as balloons, which is clearly wrong. There are other cases of
wrong annotations in items of clothing (shirt/jacket/suit) as well as
the gender and age of persons ("man" labelled as "male child", "woman"
labelled as "man", etc.).</p>
      <p>Impossibility This issue is the least prevalent of the four discussed
in this section. The image shown in Figure 4 was flagged by our
image annotation pipeline as having a very large number of object
instances. It can be seen that the image contains more than 200
faces. This implies that there are more than 100 instances of at least
one of "man" or "woman" (a quick inspection of the image shows no
children, eliminating the possibility of instances of "male child" and
"female child"). Within the rules of the challenge, each concept is
limited to 100 instances, making it impossible to annotate all
instances correctly. Grouping multiple instances into a single instance,
if one were inclined to do so, is not straightforward, as there is no
clear group of men and women as in some other images.</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>In this section, we evaluate the performance of different settings of
the algorithm on the development and test datasets. It is to be noted
that there appear to be significant differences in the quality of the
annotations between the two sets, so results on one are not
indicative of results on the other. Moreover, as no particulars were
provided about the measures being used for evaluation beyond
"performance" on both concepts and images, we used the F-score as a
measure on the development set, which formed the basis of 2 out of
3 measures in the previous iteration of the challenge. As it turns out,
the evaluation measure used on the test set was the mAP, and so
results between the development and test sets are again not directly
comparable. These statistics are shown in Table 1.
We run two base versions of our pipeline, one aimed at garnering
better precision (BP) and one aimed at getting better recall (BR).
These are shown in the first two rows of the table. Following this, we
notice that in some images no concepts were predicted, as their
confidence scores fell below their respective thresholds. In these cases,
we forced at least 1 prediction to be made (≥1 pred.), giving rise to
two more variants, BP1 and BR1.</p>
      <p>URL-search corresponds to the URL filename-concept name matching
discussed earlier. Agg. NMS refers to aggressive non-maximal
suppression that employs an NMS threshold of 0, causing all
overlapping bounding boxes for the same concept to be reduced to a
single one. From human attributes, we either report hair + mouth,
which showed no deleterious effects on the development set in the
face of incomplete annotations, or face parts, which also adds eyes
and noses, or body parts, which further adds faces, heads, necks
and arms. In the case of ontologies, while we obtain slightly better
results on the development set, output errors in the submission cause
the performance to be quite low, which is an outlier.</p>
      <p>From the results, it can be seen that all human attributes
significantly help boost performance. Moreover, URL-search causes a drop
in performance, while aggressive NMS again boosts performance.
Hence, a possible configuration that yields even better performance could
be BR1 + Agg. NMS + body parts.</p>
      <p>It is also to be noted that the runner-up in the challenge attains
a performance about 15% lower than ours. As the details of their
technique are not available, it is difficult to pinpoint the cause of the
large difference, but we believe that the use of an external training
set, combined with human part extraction, played an important role.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this work, we have presented our proposed method for the
ImageCLEF Scalable Concept Image Annotation challenge. Our method
places heavy emphasis on the visual aspect of annotating images and
demonstrates the performance that can be achieved by building an
appropriate pipeline of state-of-the-art visual recognition techniques.
The interconnections between the techniques are modified and
enhanced to improve overall annotation performance by branching off
secondary recognition pipelines for certain highly common concepts.</p>
      <p>We also highlight the limitations of the current challenge dataset
with respect to the ground-truth annotations, categorizing the
major shortcomings. Despite these and our technique's general lack of
reliance on textual data, we are able to outperform competing
methods by a margin of at least 15%. In the future, we plan to refine our
annotation pipeline based on an analysis of the results. As most of
the target concepts in this iteration of the challenge were localizable
in a well-defined manner, it will be interesting to examine
localization for other, more abstract concepts. We also hope to combine
advances in natural language processing and semantic ontologies to
appropriately weigh training instances when learning classifiers, as well
as to look at the problem from a multi-modal point of view.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Open biometrics, http://openbiometrics.org/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Everingham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.K.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <source>The PASCAL Visual Object Classes Challenge</source>
          <year>2011</year>
          (
          <article-title>VOC2011) Results</article-title>
          . http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>The Journal of Machine Learning Research 9</source>
          ,
          <year>1871</year>
          –
          <year>1874</year>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ferrari</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marin-Jimenez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Progressive search space reduction for human pose estimation</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition</source>
          ,
          <year>2008</year>
          .
          <article-title>CVPR 2008</article-title>
          . IEEE Conference on. pp.
          <volume>1</volume>
          –
          <issue>8</issue>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>S.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngo</surname>
            ,
            <given-names>C.W.</given-names>
          </string-name>
          :
          <article-title>Domain adaptive semantic diffusion for large scale context-based video annotation</article-title>
          .
          <source>In: Computer Vision</source>
          ,
          <year>2009</year>
          IEEE 12th International Conference on. pp.
          <volume>1420</volume>
          –
          <fpage>1427</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jusko</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>Human figure drawing proportions</article-title>
          , http://www.realcolorwheel.com/human.htm
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Microsoft:
          <article-title>Bing image search</article-title>
          (
          <year>2015</year>
          ), http://www.bing.com/images
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Opelt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fussenegger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Generic object recognition with boosting</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>28</volume>
          (
          <issue>3</issue>
          ),
          <volume>416</volume>
          –
          <fpage>431</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Parkhi</surname>
            ,
            <given-names>O.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vedaldi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jawahar</surname>
            ,
            <given-names>C.V.</given-names>
          </string-name>
          :
          <article-title>Cats and dogs</article-title>
          .
          <source>In: IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>3498</volume>
          –
          <issue>3505</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Princeton University: About wordnet (
          <year>2010</year>
          ), https://wordnet.princeton.edu/wordnet/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Quattoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Recognizing indoor scenes</article-title>
          .
          <source>Computer Vision and Pattern Recognition</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Russakovsky</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          (IJCV) (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sermanet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anguelov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabinovich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>CoRR abs/1409.4842</source>
          (
          <year>2014</year>
          ), http://arxiv.org/abs/1409.4842
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Uijlings</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.W.</given-names>
          </string-name>
          :
          <article-title>Selective search for object recognition</article-title>
          .
          <source>International journal of computer vision 104(2)</source>
          ,
          <volume>154</volume>
          –
          <fpage>171</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bromuri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amin</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohammed</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uskudarli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marvasti</surname>
            ,
            <given-names>N.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aldana</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>del Mar Roldán García</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>General Overview of ImageCLEF at the CLEF 2015 Labs</article-title>
          . Lecture Notes in Computer Science, Springer International Publishing (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>