    MLIA at ImageCLEF 2014 Scalable Concept
          Image Annotation Challenge

                 Xing Xu, Atsushi Shimada, and Rin-ichiro Taniguchi

        Department of Advanced Information Technology, Kyushu University, Japan
         {xing, atsushi}@limu.ait.kyushu-u.ac.jp, {rin}@ait.kyushu-u.ac.jp



          Abstract. In this paper, we describe a large-scale image annotation
          system for the ImageCLEF 2014 Scalable Concept Image Annotation task.
          This year's annotation task concentrated on developing annotation
          algorithms that rely only on data obtained automatically from the
          web. Since sophisticated SVM-based annotation techniques were widely
          applied in last year's task (ImageCLEF 2013), we also adopt SVM-based
          annotation techniques this year and put our effort mainly into
          obtaining more accurate concept assignments for the training images.
          More specifically, we propose a two-fold scheme to assign concepts
          to unlabeled training images: (1) a traditional process which stems
          the extracted web data of each training image on the textual side
          and assigns concepts based on the appearance of each concept;
          (2) an additional process which leverages the deep convolutional
          network toolbox Overfeat to predict labels (ImageNet nouns) for each
          training image on the visual side; the predicted tags are then
          mapped to ImageCLEF concepts via WordNet synonyms and hyponyms and
          their semantic relations. Finally, the concepts allocated to each
          training image are generated by fusing the two concept assignment
          processes. Experimental results show that the proposed concept
          assignment scheme effectively improves on the results of traditional
          textual processing and allocates reasonable concepts to training
          images. Consequently, with an efficient SVM solver based on
          Stochastic Gradient Descent, our annotation system achieves
          competitive performance in the annotation task.

          Keywords: imageclef, image annotation, social web data


1       Introduction

In this year's ImageCLEF 2014 [1], we participated in the Scalable Concept Image
Annotation challenge1 [10], which aims at developing more scalable image anno-
tation systems. The goal of this challenge is to develop annotation systems that
rely only on unsupervised web data and other automatically obtainable resources
for training. In contrast to traditional image annotation evaluations with
labeled training data, this challenge requires work on more fronts, such as
handling noisy data, textual processing, multi-label annotation, and scalability
to unobserved labels.
    1
        http://www.imageclef.org/2014/annotation
    Since this year is the third edition of the annotation challenge, regarding
the methodology of annotation systems we can make several observations from the
overview reports [9] [11] of the previous editions:
 – The best performing system, TPT [6], only used the provided visual features,
   which indicates that the visual features provided by the organizers are
   sufficiently discriminative, and the other features extracted by several
   teams might be complementary.
 – The top 3 teams (TPT, MIL [4], and UNIMORE [2]) all utilized SVM-based
   algorithms to learn a separate classifier for each concept, which proved
   superior to the K-nearest-neighbor (KNN) based annotation techniques used
   by other groups such as RUC [5] and MICC [8].
 – The textual processing and the concept assignment for training images were
   crucial, since they directly affect the learning accuracy of the concept
   classifiers.
    The major difference of this year's challenge compared with the previous
editions is the proportion of “scaled” concepts. In last year's challenge there
were 116 concepts in total (95 concepts for the development set and 21 more for
the test set), so the proportion of “scaled” concepts was 21/116 ≈ 0.181. In
contrast, this year there are 207 concepts in total (107 concepts for the
development set and 100 more for the test set), so the proportion of “scaled”
concepts is 100/207 ≈ 0.483. This underlines how important it is for the
annotation system to be scalable and to generalize well to new concepts.
    To develop a robust and scalable annotation system, we believe that one of
the intrinsic issues is to assign more appropriate concepts to training images.
Once more accurate (positive/negative) samples have been collected for each
concept, the performance of the concept classifiers can be improved. Thus, for
this contest, we mainly focus on the issue of accurate concept assignment for
training images. Besides the traditional textual information processing such as
stopword removal and stemming, which has been widely applied in previous
editions, we also leverage the recently popular convolutional neural networks
(CNN) [7] to allocate tags (1K WordNet nouns) to each training image from the
visual aspect. As the CNN-based method uses a deep neural network to improve
classification, we can rely on the tags predicted by Overfeat and map them to
concepts of the ImageCLEF vocabulary. A late fusion approach is then used to
decide the final concept assignment for each training image. Finally, we train
a linear SVM classifier for each concept (in the same manner for the develop-
ment and test sets) with the visual features provided by the organizers. To
tackle the large volume of high-dimensional training data, we adopt the online
learning strategy of stochastic gradient descent (SGD). We obtain competitive
annotation performance in terms of the MAP-samples, MF-concepts and MF-samples
measures, and are ranked 4th among all 11 groups on the overall measure.
    The rest of the paper is organized as follows. Section 2 presents the
architecture of the proposed annotation system, with emphasis on our concept
assignment scheme for training images. In Section 3, we describe our experimen-
tal setup and report the evaluation results obtained on both the development
and the test sets. Section 4 concludes and discusses future directions of our
work.


2     Proposed annotation system

The proposed annotation system is depicted in Figure 1. To assign more appro-
priate concepts to training images, we apply a two-fold scheme which explicitly
leverages the provided textual information semantically (Section 2.1) and the
training images visually (Section 2.2). Based on the reliably labeled training
images, we then learn SVM-based concept classifiers using the standard visual
features provided by the organizers. To handle the high-dimensional features
and the large volume of data, we use an online learning method based on the SGD
algorithm. The learned concept classifiers are then used to predict concepts
for the images in the development and test sets. In the following subsections,
we describe the detailed procedure of each module of the diagram in Figure 1.




           Fig. 1. Overview of proposed annotation system architecture




2.1   Text Processing Approach

The organizers of ImageCLEF 2014 provided several kinds of textual features
for the training images. Following the traditional text processing approach
used last year, we applied multiple filtering steps to these textual features
in order to process them efficiently. For the “Stopword removal and stemming”
module in Figure 1, the detailed processing procedure is as follows:
 – “Stopword removal and stemming” is performed on the “scofeats” files, where
   stopwords, misspelled words, and words from languages other than English are
   filtered out, and the titles of the original web pages are extracted and
   parsed.
 – We then match the semantic relations of the remaining words against the list
   of concepts of the development set based on WordNet 3.02 . We extend the list
   of concepts with their synonyms and examine whether the current word matches
   a concept or one of its synonyms.
 – The Lucene [3] stemmer is applied if the word does not exactly match the list
   of concepts (a sketch of this pipeline is given after the list).
The output of this text processing stage is a candidate set of concepts for each
training image. Indeed, the approach in this subsection can be considered a
baseline, as it assigns many false negative and false positive concepts to the
training images. Therefore, besides the textual features of the training images,
it is reasonable to also consider their visual content. For example, for a
training image depicting an “airplane” whose textual features (web page, title,
etc.) contain the words “airplane pilot hats”, simply applying the text process-
ing approach would result in the concepts “airplane”, “person”, and “hat” being
assigned to the training image. However, if the content of the image can be
estimated visually in advance, the unrelated concepts “person” and “hat” can be
rejected for that training image. Thus, in the next subsection, we introduce a
context mapping method to predict tags for training images in advance.

2.2    Context mapping using CNN

To estimate the content of the training images visually, we take advantage of
the recently proposed toolbox Overfeat3 , an image recognizer and feature
extractor built around a deep convolutional neural network (CNN). We consider
this powerful toolbox for two reasons: (1) it achieved competitive classification
results in the ImageNet 2013 contest4 ; (2) the OverFeat convolutional net was
trained on the 1K ImageNet classes (WordNet nouns), which is consistent with the
concept list of ImageCLEF. Thus it is rational to predict tags for the training
images with Overfeat and map the tags to ImageCLEF concepts using a context
mapping rule. For the “Tag prediction with CNN” and “Context mapping” modules in
Figure 1, the detailed processing procedure is:
 – For a given training image, we directly use the Overfeat toolbox to predict
   tags for it.
 – For each tag predicted by Overfeat, we calculate its semantic similarity to
   the concepts of the development set and map the tag to the most similar
   concept.

   2
     http://wordnet.princeton.edu/
   3
     http://cilvr.nyu.edu/doku.php?id=software:overfeat:start
   4
     http://www.image-net.org/challenges/LSVRC/2013/results.php




   [Figure content: for one example training image, the tags predicted by
   Overfeat with their confidence scores are sky 0.20241, trimaran 0.150161,
   speedboat 0.0877134, wing 0.0797028, warplane 0.0761503, paddle 0.0544154,
   airliner 0.047078, canoe 0.0456207, flagpole 0.0329707, snowmobile 0.028857;
   the text-derived and finally assigned concept sets shown in the figure are
   drawn from instrument, airplane, space, sky, toy, vehicle, cloudless,
   cityscape, person, cartoon, sign, boat, and protest.]


         Fig. 2. Example of context mapping using CNN on a training image.


    In Figure 2, we give an example of context mapping using CNN. The concepts
in the blue rectangle are obtained from the preceding text processing stage. The
tags (with confidence scores) in the green rectangle are predicted by Overfeat.
Through the context mapping procedure (in practice, we use the path similarity
measure from the NLTK toolkit as the semantic measure), we obtain the candidate
concept set {sky, airplane, vehicle, boat} from the tags in the green rectangle.
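A minimal sketch of this mapping step, assuming the WordNet path similarity from
NLTK as the semantic measure (the function name, the similarity cutoff min_sim,
and variable names such as overfeat_tags are illustrative assumptions):

# Hedged sketch of "Context mapping": map an Overfeat tag to the most similar
# ImageCLEF concept by WordNet path similarity.
from nltk.corpus import wordnet as wn

def map_tag_to_concept(tag, concepts, min_sim=0.2):
    tag_synsets = wn.synsets(tag, pos=wn.NOUN)
    best_concept, best_sim = None, 0.0
    for concept in concepts:
        for cs in wn.synsets(concept, pos=wn.NOUN):
            for ts in tag_synsets:
                sim = ts.path_similarity(cs) or 0.0   # None when no path exists
                if sim > best_sim:
                    best_concept, best_sim = concept, sim
    return best_concept if best_sim >= min_sim else None

# Example usage: map each predicted tag and discard failed mappings.
# cnn_concepts = {map_tag_to_concept(t, concepts) for t in overfeat_tags} - {None}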
    The “Result filtering and refinement” module in Figure 1 fuses the candidate
concept sets from the textual processing approach and from the context mapping
with CNN. The textual processing approach produces many more concepts than the
context mapping with CNN, but its concept set is also coarser. Therefore, in the
fusion strategy we rely more on the concept set from the context mapping with
CNN and preserve only the concepts with high similarity scores from the textual
processing approach. In Figure 2, the concepts in the red rectangle are the
concepts finally assigned to the training image, which are considered to be
semantically related to it.
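The fusion itself can be sketched as follows; the scoring convention for the
text-derived concepts and the cap K are assumptions made for illustration:

# Hedged sketch of "Result filtering and refinement": trust the CNN-derived
# concepts first, then keep only the best-scoring text-derived concepts up to K.
def fuse_concepts(cnn_concepts, text_concepts_scored, K=4):
    fused = list(dict.fromkeys(cnn_concepts))          # CNN concepts, deduplicated
    for concept, score in sorted(text_concepts_scored, key=lambda cs: -cs[1]):
        if len(fused) >= K:
            break
        if concept not in fused:
            fused.append(concept)
    return fused[:K]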

3     Experimental results
3.1   Visual features
Similar to the best-performing system, TPT [6], in the ImageCLEF 2013 annotation
task, we use the visual features provided by the organizers, including GIST,
Color Histogram, SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT. For all SIFT-based
descriptors, a bag-of-words (BoW) representation is provided. An early fusion is
made by concatenating all the provided features (global color histogram, getlf,
C-SIFT, GIST, OPPONENT-SIFT, RGB-SIFT, SIFT), resulting in a 21,312-dimensional
space. The global features GIST and Color Histogram are normalized using the L2
norm, and the SIFT-based features are normalized using the L1 norm.
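As a rough illustration of how such an early-fusion vector can be assembled
(feature loading is omitted and the variable names are illustrative):

# Hedged sketch of the early fusion: L2-normalize the global descriptors,
# L1-normalize the BoW SIFT variants, then concatenate everything.
import numpy as np

def build_feature_vector(global_feats, sift_feats):
    parts = []
    for f in global_feats:                 # e.g. GIST, color histogram
        norm = np.linalg.norm(f, ord=2)
        parts.append(f / norm if norm > 0 else f)
    for f in sift_feats:                   # SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT BoW
        total = np.abs(f).sum()
        parts.append(f / total if total > 0 else f)
    return np.concatenate(parts)           # ~21,312-dimensional vector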

3.2   Evaluation measures
For the performance evaluation of the runs there are three standard measures:
the mean F-measure over the samples (MF-samples), the mean F-measure over the
concepts (MF-concepts), and the mean average precision over the samples
(MAP-samples). The MF is computed by analyzing both the samples (MF-samples)
and the concepts (MF-concepts), whereas the MAP is computed by analyzing the
samples.
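For clarity, these measures can be written in their standard form (the official
evaluation script provided by the organizers is authoritative; this is only the
usual textbook formulation):

% Hedged, standard formulation of the sample-wise measures
F_i = \frac{2 P_i R_i}{P_i + R_i}, \qquad
\mathrm{MF}_{\text{samples}} = \frac{1}{N}\sum_{i=1}^{N} F_i, \qquad
\mathrm{MAP}_{\text{samples}} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i

where P_i and R_i are the precision and recall of the concepts predicted for
image i, AP_i is the average precision of its ranked concept list, N is the
number of images, and MF-concepts is obtained analogously by averaging a
per-concept F-measure over concepts instead of images.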

3.3   Training SVM classifiers for concepts
Following the SVM-based annotation techniques which achieved the best annotation
performance last year [2] [6], we again trained a one-versus-all SVM classifier
for each concept. Popular SVM solvers such as SVMlight and LibSVM are not
feasible for training on large volumes of high-dimensional data, since these
batch methods need to load the entire training set into memory to compute the
gradient in each iteration. It is therefore difficult to use these SVM solvers
directly. Given the configuration of our machine (an Intel Core i7 2600 CPU
(3.4 GHz) and 16 GB RAM), we instead adopt the stochastic gradient descent (SGD)
algorithm, which is more efficient for training SVM classifiers on large-scale
data. In contrast to the batch methods, the SGD algorithm feeds training samples
one by one to compute the gradients and update the model parameters. Although
SGD might need more iterations to converge, it requires much less memory, which
makes it more appropriate for large-scale training sets and online learning.
    Following the advice in [2], we randomize the training data and load it in
chunks that fit in memory, then train the different classifiers on further
randomizations of the chunks, so that different epochs see the chunks in a
different order, which leads to more stable learned classifiers. We repeat this
training process over the training set 5 times to train an SVM classifier for
each concept of the development set and cross-validate the F-measure on the
development set. We then select the parameters with the best performance on the
development set to learn classifiers for the concepts of the test set. To
predict concepts for images in the development and test sets, we use the trained
concept classifiers and obtain the decision for each concept by thresholding its
confidence score at zero.
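A compact sketch of this chunked one-versus-all training, using scikit-learn's
SGDClassifier with hinge loss as a stand-in for our actual SGD solver (the chunk
iterator and the hyperparameters are simplified placeholders):

# Hedged sketch: one-versus-all linear SVMs trained with SGD over memory-sized
# chunks; chunk_iterator and the settings below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import SGDClassifier

def train_concept_classifiers(chunk_iterator, concepts, n_epochs=5):
    # chunk_iterator(epoch) yields (X, label_sets) chunks in a fresh random order.
    classifiers = {c: SGDClassifier(loss="hinge", alpha=1e-5) for c in concepts}
    classes = np.array([0, 1])
    for epoch in range(n_epochs):
        for X, label_sets in chunk_iterator(epoch):
            for concept, clf in classifiers.items():
                y = np.array([1 if concept in labels else 0 for labels in label_sets])
                clf.partial_fit(X, y, classes=classes)
    return classifiers

# At prediction time a concept is assigned when its decision score exceeds zero:
# predicted = [c for c, clf in classifiers.items()
#              if clf.decision_function(x[None, :])[0] > 0]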

3.4   Inside analysis of annotation results
We first discuss the proposed two-fold concept assignment for training images
and evaluate its influence on the learning accuracy of the concept classifiers.
We conduct experiments on the development set and then extend the two-fold
scheme to the test set. Here we consider three settings: (1) “Single-Fold A”:
the single-fold scheme of traditional textual information processing (the
“Stopword removal and stemming” module in Figure 1); (2) “Single-Fold B”: the
single-fold scheme of CNN-based tag prediction (the “Tag prediction with CNN”
and “Context mapping” modules); (3) “Two-Folds”: the fusion of both
“Single-Fold A” and “Single-Fold B”. We limited the maximum number of concepts
assigned to each training image to 4. We then use the SVM classifiers learned
from the labeled training data to predict concepts for the images in the
development set (the top 5 ranked concepts are taken as the final predicted
concepts). Table 1 shows the annotation performance of the three settings; one
of the baselines provided by the organizers is also included for comparison. It
can be observed that all three settings consistently improve on the baseline.
In particular, the tags predicted by Overfeat are considerably accurate for the
training images: “Single-Fold B” outperforms the “Single-Fold A” setting of the
traditional textual information scheme, which implies that the predicted tags
are highly coherent with the concepts in ImageCLEF. Moreover, when the two
settings are fused into the proposed “Two-Folds” setting, the result is further
improved on all three measures.


Table 1. Annotation results on the development set for the three concept
assignment settings for training images.

                     Run      MF-sample MF-concept MAP-sample
               Baseline (SIFT) 0.1342    0.2261      0.2254
                Single-Fold A   0.218     0.203      0.3321
                Single-Fold B  0.2693    0.2445      0.3622
                 Two-Folds     0.3105    0.3224      0.3781




        Fig. 3. Annotation performance on development set with varying K.



    We then evaluate the effect of the “Result filtering and refinement” module.
In the experimental setting above, we restricted the number (denoted by K) of
concepts assigned to each training image to K = 4. It is reasonable to expect
that the value of K influences the learning accuracy of the concept classifiers,
as it directly determines the quality of the training samples for each concept.
Thus, we further vary the value of K (from 1 to 10) and explore the optimal K
for assigning concepts to training images. The annotation performance on the
development set with varying K is shown in Figure 3. It can be observed from
Figure 3 that: (1) both MF-concept and MF-sample peak at K = 6, while MAP-sample
peaks at K = 9; (2) MAP-sample is more sensitive to K, since the number of
ground-truth concepts for each image in the development set ranges from 1 to 11
(3.52 on average). Based on these observations, we finally chose
K ∈ {6, 7, 8, 9, 10} for our later submitted runs on the test set.
    For the test set, we submitted ten runs5 . Here we present our best 5 runs
together with the baselines provided by the organizers and the best runs from
the other groups. From the overall results in Table 2 we can see that: (1) all
our submitted runs exceed the best baseline result on the test set according to
all measures; (2) in the overall participant results, our best runs rank 6th,
3rd and 5th by MF-sample, MF-concept and MAP-sample respectively on the test
set, and 4th in overall performance. This means that our best runs are
competitive compared with the other results.


Table 2. Annotation results of our best 5 runs on the test set, compared with
the best runs of the baselines and the other groups.

                      Run         MF-sample MF-concept MAP-sample
               Baseline (oppsift)   16.7       9.8       20.2
                   kdevir 09        37.7      54.7       36.8
                    MIL 03          27.5      34.7       36.9
                 MindLab 01         25.8      30.7       37.0
                DISA-MU 04          29.7      19.1       34.3
                    RUC 05          31.1      25.0       27.5
                    IPL 09          18.4      15.8       23.4
                 IMC-FU 01          16.3      12.5       25.1
                  INAOE 05           5.3      10.3        9.6
                    NII 01          13.0       2.3       14.7
                   FINKI 01          7.2       4.7        6.9
                   MLIA 09          24.8      33.2       27.8
                   MLIA 10          24.8      33.2       27.9
                   MLIA 08          24.6      33.3       27.4
                   MLIA 07          24.4      33.5       26.9
                   MLIA 06          24.1      33.6       26.3



    However, there is still a considerable gap between our best runs and the
top-ranked runs from the KDEVIR group. Although we are currently not able to
examine the details of their annotation technique, there is still room to
improve our own annotation system in the following respects: (1) In our current
system, we directly used the Overfeat toolbox for tag prediction on the training
images; a more reasonable choice would be to generate CNN visual features and
use them directly to learn the concept classifiers. Indeed, several teams such
as MIL and MindLab used CNN visual features. (2) Currently, the “Context
mapping” module only maps the tags from Overfeat to ImageCLEF via their
synonyms/hyponyms in WordNet, and the similarity measure from the NLTK toolkit
might not be precise enough to produce correct mappings. An alternative is to
model a context-based similarity measure of tags using Flickr image metadata,
which may capture semantic associations from practical usage more effectively.
(3) Our concept modeling (learning the SVM-based concept classifiers) is not
elaborately optimized and tuned, because of hardware limitations and resource
consumption. The capability of our system should improve if we could overcome
these limitations.
   5
       http://www.imageclef.org/2014/annotation/results

4    Conclusion
In this paper, we presented the annotation system we developed to participate in
the ImageCLEF 2014 Scalable Concept Image Annotation task. Our proposal focuses
on improving the accuracy of the concept assignments for training images. We
proposed a two-fold concept assignment scheme which explicitly leverages the
provided textual information semantically (Section 2.1) and the training images
visually (Section 2.2). To learn the concept classifiers, we adopted the well-
established SVM-based model and used the SGD algorithm to handle the large-scale
setting of this task. Experimental results show that both the visual and the
textual information processing in our proposal are necessary to build a
competitive system. We also discussed potential future directions to further
improve the current system.

References
 1. Caputo, B., Müller, H., Martinez-Gomez, J., Villegas, M., Acar, B., Patricia, N.,
    Marvasti, N., Üsküdarlı, S., Paredes, R., Cazorla, M., Garcia-Varea, I., Morell, V.:
    ImageCLEF 2014: Overview and analysis of the results. In: CLEF proceedings.
    Lecture Notes in Computer Science, Springer Berlin Heidelberg (2014)
 2. Grana, C., Serra, G., Manfredi, M., Cucchiara, R., Martoglia, R., Mandreoli, F.:
    Unimore at imageclef 2013: Scalable concept image annotation. In: CLEF 2013
    Evaluation Labs and Workshop, Online Working Notes (2013)
 3. Hatcher, E., Gospodnetic, O., McCandless, M.: Lucene in Action, Second Edition.
    Manning Publications (2010)
 4. Hidaka, M., Gunji, N., Harada, T.: Mil at imageclef 2013: Scalable system for image
    annotation. In: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes
    (2013)
 5. Li, X., Liao, S., Liu, B., Yang, G., Jin, Q., Xu, J., Du, X.: Renmin university of china
    at imageclef 2013 scalable concept image annotation. In: CLEF 2013 Evaluation
    Labs and Workshop, Online Working Notes (2013)
 6. Sahbi, H.: Telecom paristech at imageclef 2013 scalable concept image annotation
    task: Winning annotations with context dependent svms. In: CLEF 2013 Evalua-
    tion Labs and Workshop, Online Working Notes (2013)




 7. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat:
    Integrated recognition, localization and detection using convolutional networks.
    CoRR abs/1312.6229 (2013)
 8. Uricchio, T., Bertini, M., Ballan, L., Del Bimbo, A.: Micc at imageclef 2013 image
    annotation subtask. In: CLEF 2013 Evaluation Labs and Workshop, Online Working
    Notes (2013)
 9. Villegas, M., Paredes, R.: Overview of the imageclef 2012 scalable concept im-
    age annotation subtask. In: CLEF 2012 Evaluation Labs and Workshop, Online
    Working Notes (2012)
10. Villegas, M., Paredes, R.: Overview of the ImageCLEF 2014 Scalable Concept
    Image Annotation Task. In: CLEF 2014 Evaluation Labs and Workshop, Online
    Working Notes (2014)
11. Villegas, M., Paredes, R., Thomee, B.: Overview of the imageclef 2013 scalable
    concept image annotation subtask. In: CLEF 2013 Evaluation Labs and Workshop,
    Online Working Notes. pp. 1–19 (2013)



