    Automatic Image Annotation using
       Weakly Labelled Web Data

     Pravin Kakar, Xiangyu Wang and Alex Yong-Sang Chia
                 Social Media and Internet Vision Analytics Lab,
                        Institute for Infocomm Research,
                          #21-01, 1 Fusionopolis Way,
                                Singapore 138632.
                 {kakarpv, wangx, yschia}@i2r.a-star.edu.sg



     Abstract. In this work, we propose and describe a method for local-
     izing and annotating objects in images for the Scalable Concept Image
     Annotation challenge at ImageCLEF 2015. The unique feature of our
     proposed method is in its almost exclusive reliance on a single modality
     – visual data – for annotating images. Additionally, we do not utilize any
     of the provided training data, but instead create our own similarly-sized
     training set. By exploiting the latest research in deep learning and com-
     puter vision, we are able to test the applicability of these techniques to a
     problem of extremely noisy learning. We obtain state-of-the-art
     results on an inherently multi-modal problem, thereby demonstrating
     that computer vision can also serve as a primary classification
     modality, rather than relying primarily on text to determine context
     prior to image annotation.

     Keywords: visual recognition, scalable annotation, learning from noisy
     data



1   Introduction
The Scalable Concept Image Annotation challenge (SCIA) at Im-
ageCLEF 2015 [15] is designed to evaluate methods to automatically
annotate, localize and/or describe concepts and objects in images. In
contrast to previous years, there have been several notable changes
to the challenge. Some of them are highlighted below:

– Localization of objects within images has been introduced. As a
  result, the focus on more “object”-like concepts has increased
  this year.
– Use of hand-labelled data has been allowed. Although this is done
  to technically allow the use of deep learning models trained on the
  ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
  [12], it opens up the possibility of potential clean-up of the train-
  ing data. Note that we have not done this in this work, but it
  appears to be legal within the regulatory framework of the chal-
  lenge.
– The training and test sets are identical. Therefore, a method that
  is able to exploit the noisy training data (e.g. via data cleaning)
  could, in theory, benefit from potentially overfitting the training
  data.
    From a computer vision perspective, SCIA is more challenging
than the current benchmark challenge [12] in at least two senses:
1) the training data provided is fairly noisy, which makes learning
difficult, and 2) the test set is 5× the size of that of [12]. While this
does not indicate a clear increase in level of difficulty (for example,
[12] has 4× the number of concepts of SCIA), certain aspects are
definitely more demanding.
    In the rest of these notes, we discuss our proposed method, in-
cluding data collection, classifier training and post-processing tweaks.
We also discuss the challenges posed by the fact that the test data
is annotated via crowd-sourcing, which adds another source of label
noise to the “ground-truth” data. Finally, we present our results on SCIA
along with proposals for future research to improve the automatic
annotation capabilities of techniques in this field.

2     Algorithm Design
As mentioned earlier, our algorithm is designed to mostly rely on vi-
sual data. We do not employ extensive ontologies to augment train-
ing data, nor do we use them during the training process. This helps
us understand the importance of having a strong visual recognition
pipeline. The various stages of our annotation pipeline are discussed
below.

2.1   Data Collection
We do not use the provided training set for two main reasons: 1) as the
training and test sets are identical, there is no penalty for overfitting
on the training data, which could provide an artificial boost to
performance results, and 2) there is little direct relationship between
the target concepts and the image keywords in the training data, making
it difficult to decouple the significance of a good ontology from that of
a good visual learning mechanism. Therefore, we create our own training
data of approximately the same size as the SCIA dataset.

Fig. 1. Data collection pipeline: each target concept is augmented with
WordNet-derived search terms, images are crawled for these terms, and a
filtering step produces the final training set.
    The data collection pipeline is shown in Figure 1. We first con-
sider the target concept names as keywords for which appropriate
images need to be found. There is an issue of non-specificity of some
of the concept names. For example, the concept “dish” can refer
to both the vessel as well as the food content, although only the
former is the target concept. Additionally, it is difficult to achieve
both specificity and diversity using a single keyword when doing a
web search for images. As an example, searching for “candy” yields
generic images of candy which, while containing diverse instances,
rarely depict single, specific instances of candy.
    Both the above issues are conventionally dealt with by using on-
tologies to determine the coverage for each concept. We do not build
our own challenge-specific ontology here, but instead simply rely on
WordNet [10] to augment the individual keywords. In particular, this
is done by also considering hyponyms (sub-categories) and lemmas
(similar words) of the target concept. The hyponyms help target spe-
cific instances of the target concept, while the lemmas help increase
the coverage of the target concept.
    This augmented set of keywords per concept is then passed into
an image search engine. We use Bing Image Search [7] in this pipeline.
Note that we search for the hyponyms and lemmas of the target
concept with the target concept appended, in order to ensure that
images of the correct sense are searched for. For example, searching
for “truffle” rather than “truffle candy” returns a very different
set of images, including fungi, which fall outside the scope of the
target concept.
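    As an illustration, the keyword augmentation can be sketched with
NLTK's WordNet interface as below. This is a minimal sketch rather than
our exact crawler code; the handling of word senses and the exact query
construction are assumptions.

```python
# Hedged sketch of keyword augmentation: lemmas broaden coverage, hyponyms
# target specific instances, and the concept name is appended to hyponym
# terms so the correct sense is searched for (as with "truffle candy").
# Requires the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def augment_keywords(concept):
    """Return search keywords for a concept: its lemmas plus hyponym terms."""
    keywords = {concept}
    for synset in wn.synsets(concept, pos=wn.NOUN):
        # Lemmas (similar words) increase coverage of the target concept.
        keywords.update(l.name().replace('_', ' ') for l in synset.lemmas())
        # Hyponyms (sub-categories) target specific instances of the concept.
        for hypo in synset.hyponyms():
            keywords.update('%s %s' % (l.name().replace('_', ' '), concept)
                            for l in hypo.lemmas())
    return sorted(keywords)

# Example: augment_keywords('candy') yields terms such as 'truffle candy'.
```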
    We gather up to 4000 images per target concept from our crawl-
ing engine. These images are passed through a filtering step where
images that are corrupted, too small or almost completely uniform
in appearance are discarded. The remaining images then form our
training dataset - an automatically created, noisily labelled dataset.
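    A minimal sketch of the filtering step, assuming Pillow and NumPy, is
shown below; the size and uniformity thresholds are illustrative values
rather than the ones used in our pipeline.

```python
# Hedged sketch of the filtering step: discard crawled images that are
# corrupted, too small, or almost completely uniform in appearance.
# MIN_SIDE and MIN_STD are illustrative thresholds, not the values we used.
import numpy as np
from PIL import Image

MIN_SIDE = 64     # assumed minimum width/height in pixels
MIN_STD = 5.0     # assumed minimum grey-level standard deviation

def keep_image(path):
    try:
        Image.open(path).verify()              # cheap corruption check
        img = Image.open(path).convert('L')    # re-open; verify() exhausts the file
    except Exception:
        return False                           # corrupted or unreadable
    if min(img.size) < MIN_SIDE:
        return False                           # too small
    if np.asarray(img, dtype=np.float32).std() < MIN_STD:
        return False                           # almost completely uniform
    return True
```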

2.2   Feature Extraction
For the images collected by the above process, we extract features
that will be useful for image classification. We choose to use the fea-
tures from the winner of the latest ILSVRC classification challenge
- GoogLeNet [13], a deep learning model trained on the ILSVRC
dataset and consisting of a highly-interconnected network-in-network
architecture. Despite its depth, the model is small enough to fit
within our available computing resources of a single GeForce GTX
480 GPU with 1.5 GB of memory.
    For each training image, we scale it down to 256×256 pixels and
use the center crop of 224×224 pixels. The intuition behind this is
that as the images are retrieved using specific keywords, it is likely
that the object of interest is the focus of the image and should be
dominant. This also reduces the computational complexity of the
feature extraction process considerably. We extract features from
the pooled 5B layer of the GoogLeNet model (see [13] for details),
yielding a 1024-dimensional vector per training image. Each feature
vector is then normalized to unit length.
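    The crop-and-normalize portion of this step is sketched below,
assuming NumPy and Pillow; `extract_pool5b` is a placeholder for the
framework-specific forward pass through the pretrained GoogLeNet up to
the pooled 5B layer, which we do not reproduce here.

```python
# Hedged sketch of per-image feature extraction: resize to 256x256, take the
# 224x224 centre crop, run GoogLeNet up to the pooled 5B layer and
# L2-normalize the resulting 1024-D vector. `extract_pool5b` is a placeholder
# for the actual forward pass through the pretrained model.
import numpy as np
from PIL import Image

def centre_crop_224(path):
    img = Image.open(path).convert('RGB').resize((256, 256))
    arr = np.asarray(img, dtype=np.float32)
    off = (256 - 224) // 2
    return arr[off:off + 224, off:off + 224, :]

def image_feature(path, extract_pool5b):
    crop = centre_crop_224(path)
    feat = np.asarray(extract_pool5b(crop), dtype=np.float32)  # 1024-D vector
    return feat / (np.linalg.norm(feat) + 1e-12)               # unit length
```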
    We then train linear SVM classifiers [3] in a one-versus-all fash-
ion. This is not strictly correct, as some concepts will almost certainly
overlap (e.g. “face” will contain “eye”, “nose”, “mouth”, etc.). How-
ever, making such an independence assumption greatly simplifies the
learning process. Moreover, it allows us to avoid using a relationship
ontology to determine the appropriate weight of every concept for
each classifier. This is also in line with the goal of the challenge to
design a scalable system, as the addition of a new target concept does
not necessitate a recomputation of the weights against every existing
concept. In order to manage space constraints, we uniformly sample
the negative training samples for each concept, selecting only 60,000
of them.

Fig. 2. Image annotation pipeline: object proposals are classified; “tree”
detections are merged morphologically while other concepts undergo
non-maximal suppression; the results are fused with the outputs of face and
gender detection and body part regression to yield per-concept confidences
and bounding boxes (e.g. WOMAN CONF1 BOX1, WAGON CONF2 BOX2, TREE CONF3 BOX3, ...).
    Thus, we train a single 1024-dimensional linear classifier per tar-
get concept to use for annotating test images.
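    The per-concept training step can be sketched as follows. We use
LIBLINEAR [3]; the scikit-learn wrapper shown here is an equivalent
stand-in for illustration, and the regularization constant is an assumed
value.

```python
# Hedged sketch of one-versus-all training: one linear SVM per target concept
# on the 1024-D features, with negatives uniformly subsampled to 60,000.
# The C value is an assumption; the exact setting is not reproduced here.
import numpy as np
from sklearn.svm import LinearSVC

def train_concept_classifier(pos_feats, neg_feats, max_neg=60000, seed=0):
    """pos_feats, neg_feats: (n, 1024) NumPy arrays of unit-length features."""
    rng = np.random.default_rng(seed)
    if len(neg_feats) > max_neg:
        idx = rng.choice(len(neg_feats), size=max_neg, replace=False)
        neg_feats = neg_feats[idx]                       # uniform subsample
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
    return LinearSVC(C=1.0).fit(X, y)                    # single 1024-D linear classifier
```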

2.3   Annotation Pipeline
Figure 2 shows our processing pipeline for a single test image.
    We first create object proposals using the technique of [14], which
uses selective search to find likely “object” regions. Experimentally,
we observe that many of these proposals are near-duplicates. In
order to increase diversity, we limit the returned proposals to those
that overlap others by at most 90%. This is found to be an acceptable
value for controlling the tradeoff between increasing diversity and
losing significant objects. We restrict the number of proposals
returned to the first 150, in decreasing order of size.
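    A sketch of this pruning step is given below. The overlap measure is
taken to be intersection-over-union, which is our assumption about the
exact criterion.

```python
# Hedged sketch of proposal pruning: keep at most 150 proposals in decreasing
# order of size, discarding any proposal that overlaps an already-kept one by
# more than 90%. Overlap is measured here as intersection-over-union (an
# assumption). Boxes are (x1, y1, x2, y2) tuples.
def area(b):
    x1, y1, x2, y2 = b
    return max(0, x2 - x1) * max(0, y2 - y1)

def iou(a, b):
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    return inter / float(area(a) + area(b) - inter + 1e-12)

def prune_proposals(boxes, max_keep=150, max_overlap=0.9):
    kept = []
    for b in sorted(boxes, key=area, reverse=True):      # largest first
        if all(iou(b, k) <= max_overlap for k in kept):
            kept.append(b)
            if len(kept) == max_keep:
                break
    return kept
```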
    Each object is then passed through the same feature extraction
pipeline as in Section 2.2, and the classifiers trained therein are run
to yield the most likely concepts per region. In general, non-maximal
suppression is done per concept across all regions to limit the effect
of overlapping proposals reporting the same concept for the same
object.
    There are two branches from this primary pipeline that we em-
ploy based on our observations on the SCIA development set. Firstly,
we observe that many object proposals are labelled as “tree” if they
contain heavy foliage. While not incorrect for the individual region,
it may be incorrect for an overall image, where it is often difficult
to localize a single tree. In order to mitigate this effect, we perform
morphological merging for all tree boxes, taking the convex hull for
each merged region as the bounding box of the tree and assigning it
the highest confidence of all the merged boxes. We observe this to
help improve the localization performance for the “tree” concept on
the development set. We also believe that this idea can be extended
to other non-uniformly shaped, difficult-to-localize concepts such as
“rock”, “leaf”, “brick”, etc., but we do not have sufficient annotations
in the development data to verify this.
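    A minimal sketch of this merging is shown below; it groups
transitively overlapping tree boxes and replaces each group with a single
enclosing box (an axis-aligned stand-in for the convex hull), carrying the
group's maximum confidence.

```python
# Hedged sketch of morphological merging for "tree" detections: transitively
# overlapping boxes are grouped, and each group becomes one enclosing box
# (standing in for the convex hull of the merged region) with the highest
# confidence among its members.
def overlaps(a, b):
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def merge_tree_boxes(dets):
    """dets: list of (box, confidence), box = (x1, y1, x2, y2)."""
    groups = []
    for box, conf in dets:
        hits = [i for i, g in enumerate(groups)
                if any(overlaps(box, b) for b, _ in g)]
        merged = [(box, conf)]
        for i in sorted(hits, reverse=True):
            merged.extend(groups.pop(i))                 # join overlapping groups
        groups.append(merged)
    out = []
    for g in groups:
        xs1, ys1, xs2, ys2 = zip(*(b for b, _ in g))
        out.append(((min(xs1), min(ys1), max(xs2), max(ys2)),
                    max(c for _, c in g)))
    return out
```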
    Secondly, we observe that for any generic image dataset, humans
are an important object. This is true for SCIA as well as for [12,8,2].
Note that this is in contrast to domain-specific datasets such as
[9,11]. To this end, we use face and gender detection from [1] to de-
tect persons with frontal faces in images. We supplement this with a
simple regression to an upper-body annotation using the data from
[4]. Finally, we use the information from [6] to determine the loca-
tions of various other person attributes.
    A fusion step merges the results from the primary and two sec-
ondary pipelines. Specifically, person results from multiple pipelines
are suppressed or modified based on overlaps between returned local-
izations for the same concepts. Additionally, localizations that have
too high or too low an aspect ratio are suppressed, along with
localizations that fall below a preset score threshold. Finally, if all
localizations have been suppressed, we report a single localization
comprising the entire image, corresponding to the global scene
classification.
This is based on the premise that all the development set images
contain at least one concept, and we extend that assumption to all
the test images.
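    The suppression and fallback logic at the end of the fusion step can
be sketched as follows; the aspect-ratio bounds and the score threshold
are illustrative assumptions.

```python
# Hedged sketch of the final suppression in the fusion step: drop localizations
# with extreme aspect ratios or low scores, and fall back to a single
# whole-image localization (the global scene classification) if nothing
# survives. All thresholds here are illustrative assumptions.
def finalize(dets, img_w, img_h, scene_concept, scene_score,
             min_ar=0.1, max_ar=10.0, min_score=0.0):
    """dets: list of (concept, score, (x1, y1, x2, y2))."""
    kept = []
    for concept, score, (x1, y1, x2, y2) in dets:
        ar = max(x2 - x1, 1) / float(max(y2 - y1, 1))
        if min_ar <= ar <= max_ar and score >= min_score:
            kept.append((concept, score, (x1, y1, x2, y2)))
    if not kept:
        # Every image contains at least one concept, so report the global
        # scene classification over the entire image.
        kept = [(scene_concept, scene_score, (0, 0, img_w, img_h))]
    return kept
```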
    Optionally, the fusion section can also contain multiple textual
refinement steps. One option is to search URL filenames for concept
names, and if found, assign them to the entire image. Another ap-
proach uses correlation between concepts from an ontology. This is
done to test the impact of simple context addition to the annotation
pipeline. Details of this latter approach are provided in the following
subsection.

2.4   Ontology and correlation
With the feature extraction and annotation pipeline, a set of bounding
boxes $\{B_i\}$ is obtained for each test image. We denote the prediction
scores for the target concepts in $B_i$ as $S_i = [s_{i1}, \ldots, s_{im}]$,
where $m$ is the total number of target concepts. By combining the
prediction scores over all bounding boxes $\{B_i\}$, the prediction score
for the image is calculated as $S = [s_1, \ldots, s_m]$, where
$s_k = \max_i s_{ik}$.
     Because concepts do not occur in isolation (e.g. bathroom and bathtub,
mountain and cliff), semantic context can be used to improve annotation
accuracy. Following an approach similar to [5], we adopt semantic diffusion
to refine the concept annotation scores. We denote by
$C = \{c_1, \ldots, c_m\}$ the set of target concepts. Let $W$ be the
concept affinity matrix, where $W_{ij}$ indicates the affinity between
concepts $c_i$ and $c_j$, and let $D$ denote the diagonal node degree
matrix with $D_{ii} = d_i = \sum_j W_{ij}$. The graph Laplacian is then
$\Delta = D - W$ and the normalized graph Laplacian is
$L = I - D^{-1/2} W D^{-1/2}$. In this problem, we measure the concept
affinity on the Wikipedia dataset. Let $M$ denote the total number of
pages in Wikipedia. For concept $c_i$, we set $y_{ik} = 1$ if the concept
keyword $c_i$ appears in page $k$, and $y_{ik} = 0$ otherwise. The affinity
$W_{ij}$ between concepts $c_i$ and $c_j$ can then be computed using the
Pearson product-moment correlation as:

$$W_{ij} = \frac{\sum_{k=1}^{M} (y_{ik} - \mu_i)(y_{jk} - \mu_j)}{(M-1)\,\sigma_i \sigma_j} \qquad (1)$$
where $\mu_i$ and $\sigma_i$ are the sample mean and standard deviation for
$c_i$, respectively. In our experiments, the original predictions are
already quite precise, so we employ only positive correlations to boost
concepts and thereby improve recall.
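    Given the binary occurrence matrix $Y$ over Wikipedia pages, the
affinity matrix of Eq. (1) can be computed compactly as sketched below;
zeroing the diagonal is our assumption.

```python
# Hedged sketch of the concept affinity matrix: Y[i, k] = 1 if the keyword of
# concept c_i appears in Wikipedia page k, else 0. np.corrcoef implements the
# Pearson product-moment correlation of Eq. (1); only positive correlations
# are kept, and the zeroed diagonal is an assumption.
import numpy as np

def concept_affinity(Y):
    W = np.corrcoef(Y)               # (m x m) Pearson correlations across pages
    np.fill_diagonal(W, 0.0)
    return np.clip(W, 0.0, None)     # keep positive correlation only
```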
    Let $g \in \mathbb{R}^{m \times 1}$ denote the refined score vector;
the values $g_i$ and $g_j$ should be consistent with $W_{ij}$, the affinity
between concepts $c_i$ and $c_j$. Motivated by this semantic consistency,
we formulate score refinement as the minimization of the loss function

$$\varepsilon = \frac{1}{2} \sum_{i,j=1}^{m} W_{ij} \left\| \frac{g_i}{\sqrt{d_i}} - \frac{g_j}{\sqrt{d_j}} \right\|^2 \qquad (2)$$

    The loss function can be rewritten as

$$\varepsilon = \tfrac{1}{2}\,\mathrm{tr}(g^{T} L g) \qquad (3)$$

    The loss function can be minimized using gradient descent:

$$g \leftarrow g - \alpha \nabla_g \varepsilon \qquad (4)$$

where $\nabla_g \varepsilon = L g$ and $\alpha$ is the learning rate.
    Initially, $g = S$. By iteratively optimizing the loss function, we
obtain the refined, smoothed score vector $g$ for the image. A threshold
$\tau$ is chosen such that concept $c_i$ is considered present in the image
if $g_i > \tau$, and absent otherwise (and consequently absent from every
bounding box $B_i$). That is, for each bounding box in an image, we report
the concept with the maximum confidence among the concepts deemed present
in the image.
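    The refinement loop can be sketched as follows; the learning rate, the
number of iterations and the threshold τ are assumed values that would be
tuned on the development set.

```python
# Hedged sketch of the score refinement: build the normalized graph Laplacian
# from the affinity matrix and run gradient descent on the smoothness loss,
# starting from the per-image score vector S. alpha, iters and tau are assumed
# values, not the settings used in our submission.
import numpy as np

def refine_scores(S, W, alpha=0.1, iters=50):
    d = np.maximum(W.sum(axis=1), 1e-12)
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - D_isqrt @ W @ D_isqrt    # normalized graph Laplacian
    g = np.array(S, dtype=np.float64)
    for _ in range(iters):
        g = g - alpha * (L @ g)                   # gradient step, grad = L g
    return g

def present_concepts(g, tau=0.5):
    # Concepts with refined score above tau are considered present in the image.
    return [i for i, gi in enumerate(g) if gi > tau]
```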


3    Dataset Limitations

We tune the various parameters of our algorithm by validating its
performance on the development set. Unfortunately, the develop-
ment set (and by extension, the SCIA test set) has multiple prob-
lems that make it very difficult to correctly gauge the effect of tuning.
Most of these problems arise from the limitations of crowd-sourcing
ground-truth annotation, and need to be addressed to make SCIA a
more consistent evaluation. We summarize the major issues involved
below with the 4 I’s.
Fig. 3. Image demonstrating inconsistency, incompleteness and incorrectness of anno-
tations. Annotations are from the SCIA development set.



Inconsistency Many images in the development set exhibit inconsis-
tent annotations. An example of this is shown in Figure 3. Despite
there clearly being 4 legs in the picture, only 1 is annotated. This
is inconsistent as none of the unannotated instances are any less of
“leg”s than the one annotated. Other inconsistencies seen in the
development set include unclear demarcation of when multiple instances of
the same concept should be grouped into a single instance (or vice versa),
and partially visible instances being annotated in some images while
completely visible instances are left unannotated in others.

Incompleteness Several images in the development set are incom-
pletely annotated. This is most prevalent in the case of humans
where various body parts are skipped altogether in the annotations.
Apart from this, there appears to be a certain level of arbitrariness
in choosing which concepts to annotate. For instance, in Figure 3,
“shirt” is annotated, but “short pants” is not when clearly both have
about the same level of “interestingness”. Additionally, concepts like
“chair”, “sock”, “shoe”, “stadium”, etc. which are also present in the
image are not annotated. This makes it extremely difficult to judge
the performance of a proposed technique on the development set.
Moreover, it seems to run counter to the challenge assertion that the
proportion of missing or incomplete annotations is insignificant.

Incorrectness Although not as prevalent as the previous two prob-
lems, many annotations are incorrect. In Figure 3, the two balls are
labelled as balloons, which is clearly wrong. There are other cases of
wrong annotations in items of clothing (shirt/jacket/suit) as well as
gender and age of persons (“man” labelled as “male child”, “woman”
labelled as “man”, etc.).




              Fig. 4. Image demonstrating impossibility of annotations.




Impossibility This issue is the least prevalent of the four discussed
in this section. The image shown in Figure 4 was flagged by our
image annotation pipeline as having a very large number of object
instances. It can be seen that the image contains more than 200
faces. This implies that there are more than 100 instances of at least
one of “man” or “woman” (a quick inspection of the image shows no
children, eliminating the possibility of instances of “male child” and
“female child”). Within the rules of the challenge, each concept is
limited to 100 instances, making it impossible to annotate all instances
correctly. Grouping multiple instances into a single instance, if one
were inclined to do so, is not straightforward, as there is no clear
group of men and women as in some other images.


4      Evaluation
In this section, we evaluate the performance of different settings of
the algorithm on the development and test datasets. It is to be noted
that there appear to be significant differences in the quality of the
annotations between the two sets, so results on one are not indica-
tive of results on the other. Moreover, as there were no particulars
provided about the measures used for evaluation beyond “performance”
on both concepts and images, we used the F-score as a measure on the
development set; it formed the basis of 2 of the 3 measures in the
previous iteration of the challenge. As it turned out, the evaluation
measure used on the test set was mAP, so results on the development and
test sets are again not directly comparable. These statistics are shown
in Table 1.

               Table 1. Performance statistics for various runs.

                                      Development set       Test set
    Method                            F(0.5)    F(0)     mAP(0.5)  mAP(0)
    Better precision (BP)             0.1463    0.2921    0.3039   0.4571
    Better recall (BR)                0.1462    0.2934    0.2949   0.4466
    BP + ≥ 1 pred (BP1)               0.1459    0.2940    0.2949   0.4466
    BR + ≥ 1 pred (BR1)               0.1459    0.2927    0.3039   0.4569
    BR1 + URL search                  0.1450    0.2948    0.2948   0.4465
    BR1 + Agg. NMS                    0.1380    0.2894    0.3536   0.5045
    BR1 + hair + mouth + URL search   0.1442    0.2923    0.5707   0.7603
    BR1 + hair + mouth                0.1450    0.2899    0.5786   0.7685
    BR1 + body parts + URL search     0.1279    0.2615    0.6024   0.7918
    BR1 + face parts                  0.1407    0.2818    0.6595   0.7954
    Runner-up                         NA        NA        0.5100   0.6426


    We run two base versions of our pipeline, one aimed at garnering
better precision (BP) and one aimed at getting better recall (BR).
These are shown in the first two rows of the table. Following this, we
notice that in some images, no concepts were predicted, as their
confidence scores fell below their respective thresholds. In these cases,
we forced at least 1 prediction to be made (≥ 1 pred.) giving rise to
two more variants, BP1 and BR1.
URL search corresponds to the URL filename-concept name matching
discussed earlier. Agg. NMS refers to aggressive non-maximal suppression
with an NMS threshold of 0, so that all overlapping bounding boxes for the
same concept are reduced to a single one. For human attributes, we report
either hair + mouth, which showed no deleterious effects on the development
set despite its incomplete annotations; face parts, which additionally adds
eyes and noses; or body parts, which further adds faces, heads, necks and
arms. In the case of ontologies, while we obtained slightly better results
on the development set, output errors in the submission caused the test
performance to be very low, so we regard that run as an outlier.
    From the results, it can be seen that all human attributes signifi-
cantly help boost performance. Moreover, URL search causes a drop
in performance, while aggressive NMS again boosts performance.
Hence, a possible solution that yields even better performance could
be BR1 + Agg. NMS + body parts.
    It is also to be noted that the runner-up in the challenge attains
a performance about 15% lower than ours. As the details of their
technique are not available, it is difficult to pinpoint the cause of the
large difference, but we believe that the use of an external training
set, combined with human part extraction, played an important role.



5    Conclusions and Future Work

In this work, we have presented our proposed method for the Image-
CLEF Scalable Concept Image Annotation challenge. Our method
places heavy emphasis on the visual aspect of annotating images and
demonstrates the performance that can be achieved by building an
appropriate pipeline of state-of-the-art visual recognition techniques.
The interconnections between the techniques are modified and en-
hanced to improve overall annotation performance by branching off
secondary recognition pipelines for certain highly common concepts.
    We also highlight the limitations of the current challenge dataset
with respect to the ground-truth annotations, categorizing the ma-
jor shortcomings. Despite these and our technique’s general lack of
reliance on textual data, we are able to outperform competing meth-
ods by a margin of at least 15%. In the future, we plan to refine our
annotation pipeline based on the analysis of the results. As most of
the target concepts in this iteration of the challenge were localizable
in a well-defined manner, it will be interesting to examine localiza-
tion for other, more abstract concepts. We also hope to combine
advances in natural language processing and semantic ontologies to
appropriately weigh training instances in learning classifiers as well
as look at the problem from a multi-modal point of view.
References
 1. Open biometrics, http://openbiometrics.org/
 2. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The
    PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.
    pascal-network.org/challenges/VOC/voc2011/workshop/index.html
 3. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library
    for large linear classification. The Journal of Machine Learning Research 9, 1871–
    1874 (2008)
 4. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction
    for human pose estimation. In: Computer Vision and Pattern Recognition, 2008.
    CVPR 2008. IEEE Conference on. pp. 1–8. IEEE (2008)
 5. Jiang, Y.G., Wang, J., Chang, S.F., Ngo, C.W.: Domain adaptive semantic dif-
    fusion for large scale context-based video annotation. In: Computer Vision, 2009
    IEEE 12th International Conference on. pp. 1420–1427. IEEE (2009)
 6. Jusko, D.A.: Human figure drawing proportions, http://www.realcolorwheel.
    com/human.htm
 7. Microsoft: Bing image search (2015), http://www.bing.com/images
 8. Opelt, A., Pinz, A., Fussenegger, M., Auer, P.: Generic object recognition with
    boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(3),
    416–431 (2006)
 9. Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: IEEE
    Conference on Computer Vision and Pattern Recognition. pp. 3498–3505 (2012)
10. Princeton University: About WordNet (2010), https://wordnet.princeton.edu/
    wordnet/
11. Quattoni, A., Torralba, A.: Recognizing indoor scenes. Computer Vision and Pat-
    tern Recognition (2009)
12. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
    Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large
    Scale Visual Recognition Challenge. International Journal of Computer Vision
    (IJCV) (2015)
13. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Van-
    houcke, V., Rabinovich, A.: Going deeper with convolutions. CoRR abs/1409.4842
    (2014), http://arxiv.org/abs/1409.4842
14. Uijlings, J.R., van de Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search
    for object recognition. International journal of computer vision 104(2), 154–171
    (2013)
15. Villegas, M., Müller, H., Gilbert, A., Piras, L., Wang, J., Mikolajczyk, K., de Her-
    rera, A.G.S., Bromuri, S., Amin, M.A., Mohammed, M.K., Acar, B., Uskudarli,
    S., Marvasti, N.B., Aldana, J.F., del Mar Roldán García, M.: General Overview of
    ImageCLEF at the CLEF 2015 Labs. Lecture Notes in Computer Science, Springer
    International Publishing (2015)