Overview of the ImageCLEF 2013 Scalable Concept Image Annotation Subtask

Mauricio Villegas,† Roberto Paredes† and Bart Thomee‡
† ITI/DSIC, Universitat Politècnica de València, Camí de Vera s/n, 46022 València, Spain
{mvillegas,rparedes}@iti.upv.es
‡ Yahoo! Research, Avinguda Diagonal 177, 08018 Barcelona, Spain
bthomee@yahoo-inc.com

Abstract. The ImageCLEF 2013 Scalable Concept Image Annotation Subtask was the second edition of a challenge aimed at developing more scalable image annotation systems. Unlike traditional image annotation challenges, which rely on a set of manually annotated images as training data for each concept, the participants were only allowed to use automatically gathered web data instead. The main objective of the challenge was to focus not only on the image annotation algorithms developed by the participants, where given an input image and a set of concepts they were asked to decide which of them were present in the image and which ones were not, but also on the scalability of their systems, such that the concepts to detect were not exactly the same between the development and test sets. The participants were provided with web data consisting of 250,000 images, which included textual features obtained from the web pages on which the images appeared, as well as various visual features extracted from the images themselves. To evaluate the performance of the submitted systems, a development set was provided containing 1,000 images that were manually annotated for 95 concepts, and a test set containing 2,000 images that were annotated for 116 concepts. In total 13 teams participated, submitting a total of 58 runs, most of which significantly outperformed the baseline system for both the development and test sets, including for the test concepts not present in the development set, thus clearly demonstrating potential for scalability.

1 Introduction

Automatic concept detection within images is a challenging and as of yet unsolved research problem. Over the past decades impressive improvements have been achieved, albeit admittedly not yet successfully solving the problem. Yet, these improvements have typically been obtained on datasets for which all images have been manually, and thus reliably, labeled. For instance, it has become common in past image annotation benchmark campaigns [10,16] to use crowdsourcing approaches, such as the Amazon Mechanical Turk (www.mturk.com), in order to let multiple annotators label a large collection of images. Nonetheless, crowdsourcing is expensive and difficult to scale to a very large amount of concepts. The image annotation datasets furthermore usually include exactly the same concepts in the training and test sets, which may mean that the evaluated visual concept detection algorithms are not necessarily able to cope with detecting additional concepts beyond what they were trained on. To address these shortcomings a novel image annotation task [20] was proposed last year, for which automatically gathered web data was to be used for concept detection, and where the concepts varied between the evaluation sets.
The aim of that task was to reduce the reliance on cleanly annotated data for concept detection and to focus instead on uncovering structure from noisy data, emphasizing the need for scalable annotation algorithms able to determine for any given concept whether or not it is present in an image. The rationale behind the scalable image annotation task was that there are billions of images available online appearing on webpages, where the text surrounding the image may be directly or indirectly related to its content, thus providing clues as to what is actually depicted in the image. Moreover, images and the webpages on which they appear can be easily obtained for virtually any topic using a web crawler. In existing work such noisy data has indeed proven useful, e.g. [17,22,21].

The second edition of the scalable image annotation task is what is presented in this overview paper, which is one of several ImageCLEF benchmark campaigns [3]. The paper is organized as follows. In Section 2 we describe the task in more detail, which includes introducing the dataset that was created specifically for this challenge, the baseline system and the evaluation measures. In Section 3 we then present and discuss the results submitted by the participants. Finally, we conclude the paper with final remarks and future outlooks in Section 4.

2 Overview of the Subtask

2.1 Motivation and Objectives

Image concept detection has generally relied on training data that has been manually, and thus reliably, annotated, which is an expensive and laborious endeavor that cannot easily scale. To address this issue, the ImageCLEF 2013 scalable annotation subtask concentrated exclusively on developing annotation systems that rely only on automatically obtained data. A very large amount of images can easily be gathered from the web, and furthermore, from the webpages that contain the images, text associated with them can be obtained. However, the degree of relationship between the surrounding text and the image varies greatly. Moreover, the webpages can be in any language or even a mixture of languages, and they tend to contain many writing mistakes. Overall the data can be considered to be very noisy.

Fig. 1: Example of images retrieved by a commercial image search engine: (a) images from a search query of "rainbow"; (b) images from a search query of "sun".

To illustrate the objective of the evaluation, consider for example that someone searches for the word "rainbow" in a popular image search engine. It would be expected that many results are of landscapes in which a rainbow is visible in the sky. However, other types of images will also appear, see Figure 1a. The images will be related to the query in different senses, and there might even be images that do not have any apparent relationship. In the example of Figure 1a, one image is a text page of a poem about a rainbow, and another is a photograph of an old cave painting of a rainbow serpent. See Figure 1b for a similar example on the query "sun". As can be observed, the data is noisy, although it does have the advantage that it can also capture the possible different senses that a word can have, or the different types of images that exist, such as natural photographs, paintings and computer-generated imagery.

In order to handle the web data, there are several resources that could be employed in the development of scalable annotation systems. Many resources can be used to help match general text to given concepts, amongst which some examples are stemmers, word disambiguators, definition dictionaries, ontologies and encyclopedia articles.
There are also tools that can help to deal with the noisy text commonly found on webpages, such as language models, stop word lists and spell checkers. And last but not least, language detectors and statistical machine translation systems are able to process webpage data written in various languages.

In summary, the goal of the scalable image annotation subtask was to evaluate different strategies to deal with noisy data, so that the unsupervised web data can be reliably used for annotating images for practically any topic.

2.2 Challenge Description

The subtask (website: http://imageclef.org/2013/photo/annotation) consisted of the development of an image annotation system given training data that only included images crawled from the Internet and the corresponding webpages on which they appeared, as well as precomputed visual and textual features. As mentioned in the previous section, the aim of the subtask was for the annotation systems to be able to easily change or scale the list of concepts used for image annotation. Apart from the image and webpage data, the participants were also permitted and encouraged to use any other automatically obtainable resources to help in the processing and usage of the training data. However, the most important rule was that the systems were not permitted to use any kind of data that had been explicitly and manually labeled for the concepts to detect.

For the development of the annotation systems, the participants were provided with the following:

– A training dataset of images and corresponding webpages compiled specifically for the subtask, including precomputed visual and textual features (see Section 2.3).
– Source code of a simple baseline annotation system (see Section 2.4).
– Tools for computing the appropriate performance measures (see Section 2.5).
– A development set of images with ground truth annotations (including precomputed visual features) for estimating the system performance.

After a period of two months, a test set of images was released that did not include any ground truth labels. The participants had to use their developed systems to predict the concepts for each of the input images and submit these results to the subtask organizers. A maximum of 6 submissions (also referred to as runs) were allowed per participating group. Since one of the objectives was that the annotation systems be able to scale or change the list of concepts for annotation, the list of concepts for the test set was not exactly the same as for the development set. The development set consisted of 1,000 images labeled for 95 concepts, and the test set consisted of 2,000 images labeled for 116 concepts (the same 95 concepts as for development and 21 more). To observe possible overfitting to the development set and the difference in performance with respect to the test set, the participants were also required to submit the concept predictions for the development set, using exactly the same system and parameters as for the test set.

The concepts to be used for annotation were defined as one or more WordNet synsets [4]. So, for each concept there was a concept name, the type (either noun or adjective), the synset offset(s), and the sense number(s). Defining the concepts this way made it straightforward to obtain the concept definition, synonyms, hyponyms, etc. (see the sketch below). Additionally, for most of the concepts, a link to a Wikipedia article about the respective concept was provided.
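To illustrate how little effort such a concept definition requires, the following sketch expands a concept given by its WordNet offset into a definition, synonyms and hyponym terms. It is only an example written for this overview using NLTK's WordNet interface (assuming the WordNet corpus is installed and that the public synset_from_pos_and_offset method of recent NLTK versions is available); it is not part of the tools provided to the participants.

```python
from nltk.corpus import wordnet as wn

def expand_concept(offset, pos="n"):
    """Return the definition and a set of related terms (synonyms and
    hyponym lemmas) for a concept given as a WordNet 3.0 synset offset."""
    synset = wn.synset_from_pos_and_offset(pos, offset)
    terms = {lemma.name().replace("_", " ") for lemma in synset.lemmas()}
    # transitively collect hyponym lemmas, e.g. 'biplane', 'jet' for 'airplane'
    for hypo in synset.closure(lambda s: s.hyponyms()):
        terms.update(lemma.name().replace("_", " ") for lemma in hypo.lemmas())
    return synset.definition(), sorted(terms)

# 'airplane' is listed in Appendix A as noun sense 1, offset 02691156
definition, terms = expand_concept(2691156, pos="n")
print(definition)
print(terms[:10])
```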
The complete list of concepts, as well as the number of images in both the development and test sets, is included in Appendix A.

2.3 Dataset

The dataset (available at http://risenet.iti.upv.es/webupv250k) was mostly the same as the one used in ImageCLEF 2012 for the first edition of this task [20]. To create the dataset, initially a database of over 31 million images was created by querying Google, Bing and Yahoo! using words from the Aspell English dictionary [19]. The images and corresponding webpages were downloaded, taking care to avoid data duplication. Then, a subset of 250,000 images (to be used as the training set) was selected from this database by choosing the top images from a ranked list. The motivation for selecting a subset was to provide smaller data files that would not be so prohibitive for the participants to download and handle, and because a limited amount of concepts had to be chosen for evaluation. The ranked list was generated by retrieving images from our database using a manually defined list of concepts, in essence more or less as if the search engines had only been queried for these concepts. From this ranked list, some types of problematic images were removed, and it was guaranteed that each image had at least one webpage in which it appeared. Unlike the training set, the development (1,000 images) and test (2,000 images) sets were manually selected and labeled for the concepts being evaluated. For further details on how the dataset was created, please refer to [20].

The 250,000 training set images were exactly the same as the ones for ImageCLEF 2012. However, some images from the development and test sets had been changed. To guarantee that the visual features were computed in the same way for the new images, and due to changes in software versions, the features were recalculated and therefore differ from those supplied in the previous edition of this subtask. Also, this year the original images and webpages were provided. The most significant change of the dataset with respect to 2012 was the labeling of the development and test sets, where the images have now been labeled and linked to concepts in WordNet [4], thus making it much easier to automatically obtain more information for each concept. Moreover, for most of the concepts a corresponding Wikipedia article was additionally supplied, which may prove to be a useful resource.

Textual Data: Since the textual data was to be used only during training, it was only provided for the training set. Four sets of data were made available to the participants, all but the raw webpages being identical to those of the 2012 edition [20]. The first one was the list of words used to find the image when querying the search engines, along with the rank position of the image in the respective query and the search engine it was found on. The second set of textual data contained the image URLs as referenced in the webpages they appeared in. In many cases the image URLs tend to be formed with words that relate to the content of the image, which is why they can also be useful as textual features. The third set of data consisted of the webpages in which the images appeared, for which the only preprocessing was a conversion to valid XML just to make any subsequent processing simpler. The final set of data contained features obtained from the text extracted near the position(s) of the image in each webpage it appeared in. To extract the text near the image, after conversion to valid XML, the script and style elements were removed.
The extracted text comprised the webpage title and all the terms within a word distance of 600 from the image, not including the HTML tags and attributes. Then a weight s(t_n) was assigned to each of the words near the image, defined as

  s(t_n) = \frac{1}{\sum_{\forall t \in T} s(t)} \sum_{\forall t_{n,m} \in T} F_{n,m} \, \mathrm{sigm}(d_{n,m}) ,    (1)

where t_{n,m} are each of the appearances of the term t_n in the document T, F_{n,m} is a factor depending on the DOM (e.g. title, alt, etc.), similar to what is done in the work of La Cascia et al. [7], and d_{n,m} is the word distance from t_{n,m} to the image. The sigmoid function was centered at 35, had a slope of 0.15, and minimum and maximum values of 1 and 10, respectively. The resulting features include for each image at most the 100 word-score pairs with the highest scores (a small illustrative sketch of this weighting is given below, after the description of the visual features).

Visual Features: Seven types of visual features were made available to the participants. Before feature extraction, images were filtered and resized so that the width and height were at most 240 pixels while preserving the original aspect ratio. The first feature set, Colorhist, consisted of 576-dimensional color histograms extracted using our own implementation. These features correspond to dividing the image into 3 × 3 regions and for each region obtaining a color histogram quantized to 6 bits. The second feature set, GETLF, contained 256-dimensional histogram-based features. First, local color histograms were extracted on a dense grid every 21 pixels for windows of size 41 × 41. Second, these local color histograms were randomly projected to a binary space using 8 random vectors, taking the sign of each resulting projection to produce a bit. This yields an 8-bit representation of each local color histogram that can be considered as a word. Finally, the image is represented as a bag-of-words, leading to a 256-dimensional histogram representation. The third set of features consisted of GIST [11] descriptors. The other four feature types were obtained using the colorDescriptors software [15]. Features were computed for SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT. The configuration was dense sampling with default parameters and a hard-assignment 1,000-word codebook using a spatial pyramid of 1 × 1 and 2 × 2 [8]. Since the vectors of the spatial pyramid were concatenated, this resulted in 5,000-dimensional feature vectors; keeping only the first fifth of the dimensions would be equivalent to not using the spatial pyramid. The codebooks were generated using 1.25 million randomly selected features and the k-means algorithm.
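As an illustration of the term weighting of Eq. (1), the following sketch computes the word-score pairs of one document from a list of term occurrences. It is only a minimal example written for this overview: the occurrence list, the DOM factors and the assumption that terms closer to the image receive a higher sigmoid value are illustrative, and the organizers' actual extraction code may differ in its details.

```python
import math
from collections import defaultdict

def sigm(d, center=35.0, slope=0.15, lo=1.0, hi=10.0):
    # Sigmoid of Eq. (1): assumed to decrease with the word distance d,
    # from a maximum of 10 (terms next to the image) down to a minimum of 1.
    return lo + (hi - lo) / (1.0 + math.exp(slope * (d - center)))

def term_scores(occurrences, max_pairs=100):
    """occurrences: iterable of (term, dom_factor, word_distance) tuples for one
    webpage, where dom_factor plays the role of F_{n,m} in Eq. (1).
    Returns the normalized scores s(t_n), keeping the top `max_pairs` pairs."""
    raw = defaultdict(float)
    for term, factor, distance in occurrences:
        raw[term] += factor * sigm(distance)
    total = sum(raw.values()) or 1.0          # normalization term of Eq. (1)
    scores = {t: v / total for t, v in raw.items()}
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:max_pairs]
    return dict(top)

# hypothetical occurrences: ("sunset", factor for an alt attribute, 3 words away), ...
print(term_scores([("sunset", 2.0, 3), ("beach", 1.0, 40), ("menu", 1.0, 500)]))
```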
2.4 Baseline Systems

A toolkit was supplied to the participants as a performance reference for the evaluation, as well as to serve as a starting point. This toolkit included software that computed the evaluation measures (see Section 2.5) and the implementations of two baselines. The first baseline was a simple random annotator, which is important since any system that performs worse than random is useless. The other baseline, referred to as the Co-occurrence Baseline, was a basic technique that gives better performance than random, although it was simple enough to give the participants a wide margin for improvement. The latter technique, given an input image, obtains its nearest K = 32 images from the training set using only the 1,000-dimensional bag-of-words C-SIFT visual features and the L1 norm. Then, the textual features corresponding to these K nearest images are used to derive a score for each of the concepts. This is done by using a concept-word co-occurrence matrix estimated from all of the training set textual features. In order to make the vocabulary size more manageable, the textual features are first processed keeping only English words. Finally, the annotations assigned to the image are always the top 6 ranked concepts.

2.5 Performance Measures

Ultimately the goal of an image annotation system is to make decisions about which concepts to assign to a given image from a predefined list of concepts. Thus, to measure annotation performance, what should be considered is how good those decisions are. On the other hand, in practice many annotation systems are based on estimating a score for each of the concepts, after which a second technique uses these scores to finally decide which concepts are chosen. For systems of this type a measure of performance can be based only on the concept scores, which considers all aspects of the system except for the technique used for the concept decisions, making it an interesting characteristic to measure.

For this task, two basic performance measures have been used for comparing the results of the different submissions. The first one is the F-measure (F1), which takes into account the final annotation decisions, and the other is the Average Precision (AP), which considers the concept scores. The F-measure is defined as

  F_1 = \frac{2PR}{P + R} ,    (2)

where P is the precision and R is the recall. In the context of image annotation, the F1 can be estimated from two different perspectives, one being concept-based and the other sample-based. In the former, one F1 is computed for each concept, and in the latter, one F1 is computed for each image to annotate. In both cases, the arithmetic mean is used as a global measure of performance, referred to as MF1-concepts and MF1-samples, respectively.

The AP is algebraically defined as

  AP = \frac{1}{|K|} \sum_{k=1}^{|K|} \frac{k}{\mathrm{rank}(k)} ,    (3)

where K is the set of ground truth annotations ordered as induced by the annotation scores, and rank(k) is the position of the k-th ground truth annotation in the full ranking of concepts. The fraction k/rank(k) is actually the precision at the k-th ground truth annotation, and has been written like this to be explicit about the way it is computed. In the case of ties in the scores, a random permutation is applied within the ties. The AP can also be estimated from both the concept-based and sample-based perspectives; however, the concept-based AP is not a suitable measure of annotation performance (it is more adequate for a retrieval scenario), so only the sample-based AP has been considered in this evaluation. As a global measure of performance, again the arithmetic mean is used, which will be referred to as MAP-samples.

A bit of care must be taken when comparing systems using the MAP-samples measure. What MAP-samples expresses is how well the true concepts of a given image would be ranked if the scores were used to sort the concepts. Depending on the system, its scores may or may not be optimal for ranking the concepts. Thus a system with a relatively low MAP-samples could still have a good annotation performance if the method used to select the concepts is adequate for its concept scores. Because of this, as well as the fact that there can be systems that do not rely on scores, it was optional for the participants of the task to provide scores.
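The following is a minimal sketch of the sample-based measures described above, written for this overview and not taken from the organizers' evaluation toolkit; per-image results would still need to be averaged into MF1-samples and MAP-samples, and ties in the scores are not randomly permuted here.

```python
def f1_score(predicted, truth):
    """F1 of Eq. (2) for one image (sample-based), given sets of concepts."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def average_precision(scores, truth):
    """Sample-based AP of Eq. (3): scores maps every concept to its score,
    truth is the (non-empty) set of ground-truth concepts of the image."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    positions = sorted(ranking.index(c) + 1 for c in truth)   # rank(k), 1-based
    return sum(k / r for k, r in enumerate(positions, start=1)) / len(positions)

# toy example with three annotated concepts out of five scored ones
scores = {"sky": 0.9, "tree": 0.7, "car": 0.4, "dog": 0.2, "sun": 0.1}
truth = {"sky", "car", "sun"}
print(f1_score({"sky", "tree", "car"}, truth), average_precision(scores, truth))
```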
3 Evaluation Results

3.1 Participation

The participation was excellent, especially considering that this was the second edition of the task and that last year there was only one participant. In total 13 groups took part, submitting 58 runs overall. The following teams participated:

– CEA LIST: The team from the Vision & Content Engineering group of CEA LIST (Gif-sur-Yvette, France) was represented by Hervé Le Borgne, Adrian Popescu and Amel Znaidia.
– INAOE: The team from the Instituto Nacional de Astrofísica, Óptica y Electrónica (Puebla, Mexico) was represented by Hugo Jair Escalante.
– KDEVIR: The team from the Computer Science and Engineering department of the Toyohashi University of Technology (Aichi, Japan) was represented by Ismat Ara Reshma, Md Zia Ullah and Masaki Aono.
– LMCHFUT: The team from Hefei University of Technology (Hefei, China) was represented by Yan Zigeng.
– MICC: The team from the Media Integration and Communication Center of the Università degli Studi di Firenze (Florence, Italy) was represented by Tiberio Uricchio, Marco Bertini, Lamberto Ballan and Alberto Del Bimbo.
– MIL: The team from the Machine Intelligence Lab of the University of Tokyo (Tokyo, Japan) was represented by Masatoshi Hidaka, Naoyuki Gunji and Tatsuya Harada.
– RUC: The team from the School of Information of the Renmin University of China (Beijing, China) was represented by Xirong Li, Shuai Liao, Binbin Liu, Gang Yang, Qin Jin, Jieping Xu and Xiaoyong Du.
– SZTAKI: The team from the Datamining and Search Research Group of the Hungarian Academy of Sciences (Budapest, Hungary) was represented by Bálint Daróczy.
– THSSMPAM: The team from Beijing, China was represented by Jile Zhou.
– TPT: The team of CNRS TELECOM ParisTech (Paris, France) was represented by Hichem Sahbi.
– UNED&UV: The team from the Universidad Nacional de Educación a Distancia (Madrid, Spain) and the Universitat de València was represented by Xaro Benavent, Angél Castellanos Gonzáles, Esther de Ves, D. Hernández-Aranda, Ruben Granados and Ana Garcia-Serrano.
– UNIMORE: The team from the University of Modena and Reggio Emilia (Modena, Italy) was represented by Costantino Grana, Giuseppe Serra, Marco Manfredi, Rita Cucchiara, Riccardo Martoglia and Federica Mandreoli.
– URJC&UNED: The team of the Universidad Rey Juan Carlos (Móstoles, Spain) and the Universidad Nacional de Educación a Distancia (Madrid, Spain) was represented by Jesús Sánchez-Oro, Soto Montalvo, Antonio Montemayor, Juan Pantrigo, Abraham Duarte, Víctor Fresno and Raquel Martínez.

In Table 1 we provide a comparison of the key details of the best submission of each group. For a more in-depth look at the annotation system of each team, please refer to the corresponding paper listed in the table. Note that four groups did not submit a working notes paper describing their system, so for those submissions less information could be listed.

Table 1: Comparison of the systems for the best submission of each group (visual features with total dimensionality, other used resources, training data processing highlights, and annotation technique highlights).

– TPT #6 [13]. Visual features: provided by organizers (all 7) [total dim. 21,312]. Other resources: morphological expansions. Training data processing: manual morphological expansions of the concepts (plural forms); training images selected by appearance of the concept in the supplied textual features. Annotation: multiple SVMs per concept, with context dependent kernels; annotation based on a threshold (the same for all concepts).
– MIL #4 [6]. Visual features: Fisher vectors (SIFT, C-SIFT, LBP, GIST) [total dim. 262,144]. Other resources: WordNet; ActiveSupport library for word singularization. Training data processing: extract webpage title, image attributes and surrounding text, and singularize nouns; label training images by appearance of the concept, defined by WordNet synonyms and hyponyms with a single meaning. Annotation: linear multilabel classifier learned by PAAPL; annotation of the top 5 concepts.
– UNIMORE #2 [5]. Visual features: multivariate Gaussian distributions of local descriptors (HSV-SIFT, OPP-SIFT, RGB-SIFT) [total dim. 201,216]. Other resources: WordNet; NLTK (stopwords and stemmer); +100k training images. Training data processing: stopword removal and stemming of the supplied preprocessed features and webpage title; label training images by appearance of the concept, defined by WordNet synonyms and hyponyms with a single meaning; disambiguation by negative context from other senses of the concept word. Annotation: linear SVMs learned by stochastic gradient descent; annotation based on a threshold (the same for all concepts).
– RUC #4 [9]. Visual features: provided by organizers (all 7) [total dim. 21,312]. Other resources: search engine keywords; Flickr tags dataset. Training data processing: positive training images selected by a combination of the supplied textual features and search engine keywords, weighted by a tag co-occurrence measure derived from a Flickr dataset; negative examples selected by Negative Bootstrap. Annotation: multiple stacked hikSVMs and kNNs (L1 distance); annotation of the top 6 concepts.
– UNED&UV #3 [1]. Visual features: none (unlike the other systems, which take image visual features as input, this system receives as input the image's webpage). Other resources: webpage of test images; WordNet; Lucene. Training data processing: concept indexing (using Lucene) of WordNet definitions, forms, hypernyms, hyponyms and related components. Annotation: text retrieval of concepts using the webpage img field; annotation by a cut-off percentage of the maximum scored concept.
– CEA LIST #4 [2]. Visual features: bag of visterms (SIFT) [total dim. 8,192]. Other resources: Wikipedia; Flickr tags dataset. Training data processing: training images selected by ranking the images using tag models learned from Flickr and Wikipedia data; the first 100 ranked images as positives and the last 500 images as negatives. Annotation: linear SVM; annotation of concepts with score above µ + σ (µ, σ being the mean and standard deviation of all concept scores).
– KDEVIR #1 [12]. Visual features: provided by organizers (Colorhist, C-SIFT, OPP-SIFT, RGB-SIFT) [total dim. 15,576]. Other resources: WordNet; Lucene stemmer. Training data processing: stopword and non-English word removal and stemming of the supplied textual features; matching of features with concepts defined by WordNet synonyms and application of bm25 to obtain concept scores per image. Annotation: kNN (IDsim) and aggregation of concept scores (bm25); annotation of the top 10 concepts.
– URJC&UNED #3 [14]. Visual features: HSV histograms, LBP and provided by organizers (C-SIFT) [total dim. 5,384]. Other resources: search engine keywords; WordNet; English stopword list; Porter stemmer. Training data processing: stopword removal and stemming of the supplied textual features, enriched by WordNet synonyms and hypernyms; generation of a keywords-concepts co-occurrence matrix. Annotation: kNN (Bhattacharyya, χ² and L2 distances) and aggregation of concept scores (co-occurrence); annotation based on a threshold (the same for all concepts).
– MICC #5 [18]. Visual features: provided by organizers (all 7) [total dim. 21,312]. Other resources: WordNet; Wikipedia; search engine keywords; training image URLs. Training data processing: stopword removal of the supplied textual features, search engine keywords and URL-extracted words; textual features enriched with WordNet synonyms and Wikipedia link structure. Annotation: kNN (Gaussian kernel distance) and ranking of concepts by tagRelevance; annotation of the top 7 concepts.
– SZTAKI #1. Visual features: Fisher vectors [total dim. unknown]. Other resources: Wikipedia. Training data processing: textual ranking of images based on Wikipedia concept descriptions. Annotation: Fisher vector-based learning of a visual model given the training images per category; prediction via the visual models.
– INAOE #3. Visual features: SIFT [total dim. unknown]. Other resources: unknown. Training data processing: documents are represented by a distribution of occurrences over other documents in the corpus, so that documents are represented by their context, yielding a prototype per concept. Annotation: ensemble of linear classifiers per concept.
– THSSMPAM #2. Visual features: global: CEDD, color, bag of visterms (SIFT); local: SIFT, SURF [total dim. unknown]. Other resources: WordNet. Training data processing: unknown. Annotation: tags of nearest-neighbor images ranked by TF-IDF; similarity between tags and concepts using WordNet; annotation by a bipartite graph algorithm.
– LMCHFUT #1. Visual features: provided by organizers (SIFT) [total dim. 5,000]. Other resources: unknown. Training data processing: training images selected by appearance of the concept in the supplied textual features. Annotation: single SVM learned per concept given the visual features of positive and negative training examples.
3.2 Results

Table 2 presents the performance measures (described in Section 2.5) for the baseline techniques and all of the runs submitted by the participants. The last column of the table corresponds to the MF1-concepts measure computed only for the 21 concepts that did not appear in the development set. The systems are ordered by performance, beginning at the top with the best performing one. This order has been derived by considering, for the test set, the average rank when comparing all of the systems using the MF1-samples, MF1-concepts and MF1-concepts unseen in dev. measures, breaking ties by the average of the same three performance measures.

For an easier comparison and a more intuitive visualization, the same results of Table 2 are presented as graphs in Figure 2 (only for the test set). These graphs include for each result the 95% confidence interval. These intervals have been estimated by Wilson's method, employing the standard deviation of the individual measures (for the samples or concepts, and for the average precisions (AP) or F-measures (F1), depending on the case); a textbook version of the Wilson interval is sketched after Table 2 for reference.

Finally, Figure 3 shows, for each of the 116 test set concepts, a boxplot (also known as a box-and-whisker plot) of the F1-measures when combining all runs. In order to fit all of the concepts in the same graph, when there are multiple outliers with the same value only one is shown. The concepts have been sorted by the median performance of all submissions, which in a way orders them by difficulty.

Table 2: Performance measures (in %) for the baseline techniques and all submissions. The columns are MAP-samples (dev. / test), MF1-samples (dev. / test), MF1-concepts (dev. / test) and MF1-concepts unseen in dev. (test only). In the original table the best submission of each team was highlighted in bold; these are the runs listed in Table 1.

Baseline OPP-SIFT | 24.6 / 21.4 | 19.2 / 16.4 | 13.8 / 11.8 | 10.3
Baseline C-SIFT | 24.2 / 21.2 | 18.6 / 16.2 | 10.7 / 10.5 | 10.8
Baseline RGB-SIFT | 24.3 / 21.2 | 18.5 / 15.8 | 13.0 / 11.7 | 10.5
Baseline SIFT | 24.0 / 21.0 | 17.8 / 15.9 | 11.0 / 11.0 | 10.1
Baseline Colorhist | 22.1 / 19.0 | 16.1 / 13.9 | 8.0 / 8.0 | 9.6
Baseline GIST | 20.9 / 17.8 | 14.5 / 12.5 | 6.1 / 6.9 | 7.3
Baseline GETLF | 21.0 / 17.7 | 14.9 / 12.5 | 6.6 / 5.4 | 5.9
Baseline Random | 10.9 / 8.7 | 6.2 / 4.6 | 4.8 / 3.6 | 2.3
TPT #6 | 50.4 / 44.4 | 51.3 / 42.6 | 45.0 / 34.1 | 45.1
TPT #4 | 48.9 / 43.2 | 50.7 / 41.8 | 42.5 / 33.7 | 45.3
MIL #4 | 43.8 / 41.4 | 34.0 / 32.4 | 34.7 / 32.3 | 35.8
MIL #1 | 44.5 / 42.1 | 34.6 / 33.2 | 35.2 / 32.6 | 33.8
MIL #2 | 43.1 / 40.7 | 34.3 / 32.7 | 33.9 / 31.8 | 31.4
UNIMORE #2 | 46.0 / 44.1 | 27.3 / 27.5 | 34.2 / 33.1 | 34.8
UNIMORE #5 | 47.9 / 45.6 | 33.3 / 31.5 | 33.7 / 31.9 | 31.9
UNIMORE #1 | 39.2 / 36.7 | 33.0 / 31.1 | 34.1 / 32.0 | 31.3
TPT #2 | 38.5 / 37.0 | 41.4 / 38.1 | 30.9 / 30.0 | 30.9
UNIMORE #6 | 46.0 / 44.1 | 33.0 / 31.1 | 34.1 / 32.0 | 31.3
RUC #4 | 41.2 / 38.0 | 31.6 / 29.0 | 33.4 / 30.4 | 32.8
MIL #5 | 42.2 / 39.7 | 34.0 / 31.7 | 33.4 / 30.9 | 30.2
MIL #3 | 42.5 / 39.6 | 34.2 / 31.8 | 33.4 / 30.2 | 29.5
RUC #5 | 40.5 / 37.6 | 31.0 / 28.3 | 32.7 / 29.6 | 31.5
UNED&UV #3 | 27.1 / 26.6 | 22.5 / 23.1 | 31.5 / 31.3 | 43.2
UNIMORE #3 | 43.7 / 41.9 | 23.1 / 23.1 | 32.4 / 31.5 | 35.5
UNED&UV #5 | 35.5 / 33.2 | 27.6 / 24.4 | 31.7 / 29.2 | 35.4
TPT #5 | 49.8 / 44.3 | 38.7 / 32.5 | 33.0 / 26.7 | 27.3
RUC #3 | 39.4 / 36.9 | 29.8 / 27.8 | 31.4 / 29.2 | 30.2
TPT #3 | 49.0 / 43.6 | 38.8 / 31.9 | 30.2 / 24.8 | 24.7
RUC #2 | 38.2 / 35.5 | 28.8 / 26.5 | 30.8 / 28.5 | 29.9
UNIMORE #4 | 39.7 / 36.2 | 26.8 / 24.1 | 31.7 / 29.5 | 28.0
UNED&UV #4 | 31.0 / 29.8 | 29.9 / 30.0 | 26.3 / 22.8 | 24.6
UNED&UV #1 | 32.8 / 30.3 | 25.0 / 23.0 | 27.5 / 25.0 | 31.7
RUC #1 | 36.1 / 32.4 | 28.8 / 25.4 | 26.6 / 23.9 | 22.7
UNED&UV #2 | 32.4 / 30.6 | 24.4 / 22.9 | 26.1 / 24.0 | 30.6
CEA LIST #4 | 40.3 / 34.2 | 32.2 / 26.0 | 26.1 / 21.2 | 20.1
CEA LIST #5 | 39.2 / 33.6 | 31.6 / 25.7 | 25.4 / 21.0 | 20.0
CEA LIST #3 | 40.4 / 34.1 | 31.8 / 25.2 | 25.3 / 20.2 | 20.5
CEA LIST #2 | 39.6 / 33.6 | 30.2 / 24.2 | 24.6 / 20.1 | 20.1
CEA LIST #1 | 34.6 / 29.4 | 28.7 / 23.0 | 23.6 / 19.0 | 19.8
KDEVIR #1 | 28.7 / 26.1 | 25.3 / 22.2 | 21.1 / 18.0 | 17.3
URJC&UNED #3 | 32.6 / 28.1 | 27.9 / 24.1 | 19.8 / 17.3 | 14.8
MICC #5 | 29.1 / 26.2 | 22.7 / 20.0 | 21.4 / 18.0 | 18.6
MICC #4 | 29.2 / 26.1 | 22.4 / 20.0 | 21.0 / 18.0 | 18.6
MICC #3 | 29.0 / 26.1 | 22.3 / 20.0 | 21.0 / 18.1 | 18.5
URJC&UNED #2 | 32.2 / 27.6 | 27.7 / 23.8 | 19.7 / 17.2 | 14.6
URJC&UNED #1 | 32.0 / 27.6 | 27.4 / 23.7 | 19.2 / 17.1 | 14.6
MICC #2 | 29.0 / 26.1 | 23.3 / 20.4 | 20.7 / 17.5 | 17.0
MICC #1 | 28.7 / 25.9 | 20.4 / 18.7 | 20.3 / 17.3 | 17.6
KDEVIR #3 | 28.6 / 24.8 | 24.8 / 21.1 | 18.7 / 15.9 | 15.6
TPT #1 | 38.6 / 36.8 | 30.2 / 23.0 | 24.2 / 19.2 | 8.2
KDEVIR #6 | 28.3 / 24.3 | 24.5 / 20.8 | 18.4 / 15.7 | 15.0
KDEVIR #4 | 29.2 / 26.4 | 24.7 / 20.5 | 18.5 / 15.4 | 15.3
KDEVIR #5 | 29.0 / 25.6 | 24.6 / 20.2 | 18.5 / 15.1 | 14.5
KDEVIR #2 | 26.4 / 23.5 | 25.0 / 20.7 | 19.2 / 14.8 | 12.6
SZTAKI #1 | 32.9 / 28.2 | 10.4 / 9.5 | 17.7 / 16.4 | 16.7
INAOE #3 | 24.0 / 19.1 | 19.7 / 15.4 | 17.7 / 15.2 | 11.1
SZTAKI #2 | 32.7 / 28.0 | 9.8 / 8.8 | 17.1 / 15.1 | 16.0
THSSMPAM #3 | 20.9 / 15.9 | 17.0 / 14.8 | 13.0 / 12.7 | 11.1
THSSMPAM #2 | 21.7 / 16.1 | 17.0 / 14.8 | 13.0 / 12.7 | 11.1
LMCHFUT #1 | N/A* / N/A* | 12.2 / 11.0 | 13.6 / 12.1 | 11.3
INAOE #1 | 21.5 / 17.5 | 21.3 / 16.9 | 9.0 / 6.9 | 5.1
THSSMPAM #1 | 16.3 / 12.0 | 18.2 / 11.8 | 13.7 / 10.0 | 6.6
INAOE #2 | 23.6 / 19.0 | 24.8 / 16.7 | 6.3 / 4.8 | 4.7
THSSMPAM #4 | 15.9 / 11.9 | 15.5 / 11.8 | 12.2 / 10.0 | 6.6
THSSMPAM #5 | 15.8 / 11.9 | 15.5 / 11.8 | 12.2 / 10.0 | 6.6
INAOE #4 | 17.9 / 8.3 | 15.9 / 6.2 | 11.7 / 3.4 | 2.3

* Concept scores not provided, only annotation decisions.
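Section 3.2 states that the error bars in Figure 2 are 95% confidence intervals estimated with Wilson's method. As a rough point of reference only, the sketch below computes the textbook Wilson score interval for a proportion; the paper describes a variant that additionally uses the standard deviation of the individual per-sample or per-concept measures, so this is an approximation of the general idea, not a reproduction of the actual procedure.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Textbook Wilson score interval for a proportion p_hat observed over n items
    (z = 1.96 corresponds to a 95% confidence level)."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

# e.g. a mean F1 of 0.30 averaged over the 2,000 test images
print(wilson_interval(0.30, 2000))
```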
Fig. 2: Graphs showing the test set performance measures (in %) for all the submissions. The error bars correspond to the 95% confidence intervals computed using Wilson's method.

3.3 Discussion

Due to the considerable participation in this evaluation, very interesting results have been obtained. As can be observed in Table 2 and Figure 2, most of the submitted runs significantly outperformed the baseline system for both the development and test sets. When analyzing the sample-based performances, very large differences can be observed amongst the systems. For both MAP-samples and MF1-samples the improvement ranges from below 10% to over 40%. Moreover, the confidence intervals are relatively narrow, making the improvements quite significant. An interesting detail to note is that for MAP-samples there are several top performing systems; however, when comparing the respective MF1-samples measures, three of the TPT submissions clearly outperform the rest. The key difference between these is the method for deciding which concepts are selected for a given image. This leads us to believe that many of the systems could improve greatly by changing that last step of their pipelines. As a side note, many of the participants chose to use the same scheme as the baseline system for selecting the concepts: the top N, fixed for all images. Since the number of concepts per image is expected to be variable, this strategy is less than optimal (a small sketch contrasting a fixed top-N rule with a per-image adaptive rule is given further below). Future work should be addressed in this direction.

The MF1-concepts results in Figure 2, in contrast to the sample-based performances, present much wider confidence intervals. This is due to two reasons: there are fewer concepts than sample images, and the performance for different concepts varies greatly (see Figure 3). This effect is even greater for MF1-concepts unseen, since there were only 21 such concepts. Nevertheless, for MF1-concepts unseen, the top performing systems are statistically significantly better than the baselines and some of the lower performing systems. Moreover, in Figure 3 it can be observed that the unseen concepts do not tend to perform worse. The difficulty of each particular concept affects the performance more than the fact that it was not seen during development; or, from another perspective, the systems have been able to generalize rather well to the new concepts. Thus, this demonstrates potential for scalability of the systems. It would be desirable for future benchmarking campaigns of this type to have more labeled data available for the evaluation, or to find an alternative, more automatic analysis, to be able to better compare the systems in this scalability aspect.

In contrast to usual image annotation evaluations with labeled training data, this challenge required work on more fronts, such as handling the noisy data, textual processing and multilabel annotation. This has given considerable freedom to the participants to concentrate their efforts on different aspects.
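To make the point about the final decision step concrete, the following sketch contrasts the fixed top-N rule used by the baseline (and several submissions) with a per-image adaptive rule that keeps concepts scoring above a fraction of the image's best score, loosely similar in spirit to the cut-off percentage used by UNED&UV. The fraction used here is purely illustrative and not a value taken from any submission.

```python
def decide_top_n(scores, n=6):
    """Fixed rule: always annotate the N highest-scoring concepts (baseline behaviour)."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])

def decide_relative_threshold(scores, fraction=0.5):
    """Adaptive rule: keep every concept scoring at least `fraction` of the best score,
    so the number of annotations can vary from image to image."""
    best = max(scores.values())
    return {concept for concept, s in scores.items() if s >= fraction * best}

scores = {"sky": 0.92, "tree": 0.85, "cloud": 0.80, "car": 0.15, "dog": 0.08, "sun": 0.05}
print(decide_top_n(scores))                 # always 6 concepts, including unlikely ones
print(decide_relative_threshold(scores))    # only the 3 clearly supported concepts
```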
Several teams extracted their own visual features, for which they did observe improvements with respect to the features provided by the organizers. On the other hand, for the textual processing, several different approaches were tried by the participants. Some of these teams (namely MIL, UNIMORE, CEA LIST, and URJC&UNED) reported in their working notes papers, and/or it can be observed in the results in this paper, that as more information and additional resources are used (e.g. synonyms, plus hyponyms, etc.) the performance of the systems improved. Curiously, the best performing system, TPT, only used the provided visual features and did a very simple expansion of the concepts. Overall it seems that several of the ideas proposed by the participants are complementary, and thus considerable improvements could be expected in future work.

4 Conclusions

This paper presented an overview of the ImageCLEF 2013 Scalable Concept Image Annotation Subtask, the second edition of a challenge aimed at developing more scalable image annotation systems. The goal was to develop annotation systems that for training rely only on unsupervised web data and other automatically obtainable resources, thus making it easy to add or change the concepts for annotation.

Considering that it is a relatively new challenge, the participation was excellent: 13 teams submitted in total 58 system runs. The performance of the submitted systems was considerably superior to the provided baselines, improving from below 10% to over 40% for both the MAP-samples and MF1-samples measures. With respect to the performance of the systems when analyzed per concept, it was observed that the concepts vary greatly in difficulty. An important result was that for the concepts that were not seen during development the improvement was also significant, thus showing that the systems are capable of successfully using the noisy web data and generalizing well to new concepts. This clearly demonstrates potential for scalability of the systems. Finally, the participating teams presented several interesting approaches to address the proposed challenge, concentrating their efforts on different aspects of the problem. Many of these approaches are complementary, thus considerable improvements could be expected in future work.

Due to the success of this year's campaign and the very interesting results obtained, it would be important to continue organizing future editions. To be able to derive better conclusions about the performance generalization to unseen concepts, it would be desirable to have more labeled data available and/or to find an alternative, more automatic analysis which can help in giving more insight in this respect. Also, related challenges could be organized; for instance, it could be assumed that for some concepts labeled data is available, and the question would be how to take advantage of both the supervised and unsupervised data.

Acknowledgments

The authors are very grateful for the support of the CLEF campaign for the ImageCLEF initiative. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under the tranScriptorium project (#600707), the LiMoSINe project (#288024), and from the Spanish MEC under the STraDA project (TIN2012-37475-C02-01).

Fig. 3: Boxplots (also known as box-and-whisker plots) of the per-concept annotation F1-measures (in %) on the test set for all runs combined.
The plots are ordered by the median performance. Concepts in red font are the ones not seen in development.

References

1. Benavent, X., Castellanos, A., de Ves, E., Hernández-Aranda, D., Granados, R., Garcia-Serrano, A.: A multimedia IR-based system for the Photo Annotation Task at ImageCLEF2013. In: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes. Valencia, Spain (September 23-26 2013)
2. Borgne, H.L., Popescu, A., Znaidia, A.: CEA LIST@imageCLEF 2013: Scalable Concept Image Annotation. In: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes. Valencia, Spain (September 23-26 2013)
3. Caputo, B., Müller, H., Thomee, B., Villegas, M., Paredes, R., Zellhöfer, D., Goeau, H., Joly, A., Bonnet, P., Martínez-Gómez, J., García-Varea, I., Cazorla, M.: ImageCLEF 2013: the vision, the data and the open challenges. In: CLEF. Lecture Notes in Computer Science, Springer, Valencia, Spain (September 23-26 2013)
4. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA; London (May 1998)
5. Grana, C., Serra, G., Manfredi, M., Cucchiara, R., Martoglia, R., Mandreoli, F.: UNIMORE at ImageCLEF 2013: Scalable Concept Image Annotation. In: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes. Valencia, Spain (September 23-26 2013)
6. Hidaka, M., Gunji, N., Harada, T.: MIL at ImageCLEF 2013: Scalable System for Image Annotation. In: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes. Valencia, Spain (September 23-26 2013)
7. La Cascia, M., Sethi, S., Sclaroff, S.: Combining textual and visual cues for content-based image retrieval on the World Wide Web. In: Content-Based Access of Image and Video Libraries, 1998. Proceedings. IEEE Workshop on. pp. 24–28 (1998)
8. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2. pp. 2169–2178. CVPR '06, IEEE Computer Society, Washington, DC, USA (2006), http://dx.doi.org/10.1109/CVPR.2006.68
9. Li, X., Liao, S., Liu, B., Yang, G., Jin, Q., Xu, J., Du, X.: Renmin University of China at ImageCLEF 2013 Scalable Concept Image Annotation. In: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes. Valencia, Spain (September 23-26 2013)
10. Nowak, S., Nagel, K., Liebetrau, J.: The CLEF 2011 Photo Annotation and Concept-based Retrieval Tasks.
In: Petras, V., Forner, P., Clough, P.D. (eds.) CLEF 2011 Labs and Workshop, Notebook Papers, 19-22 September 2011, Amsterdam, The Netherlands (2011)
11. Oliva, A., Torralba, A.: Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. Int. J. Comput. Vision 42(3), 145–175 (May 2001), http://dx.doi.org/10.1023/A:1011139631724
12. Reshma, I.A., Ullah, M.Z., Aono, M.: KDEVIR at ImageCLEF 2013 Image Annotation Subtask. In: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes. Valencia, Spain (September 23-26 2013)
13. Sahbi, H.: CNRS - TELECOM ParisTech at ImageCLEF 2013 Scalable Concept Image Annotation Task: Winning Annotations with Context Dependent SVMs. In: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes. Valencia, Spain (September 23-26 2013)
14. Sánchez-Oro, J., Montalvo, S., Montemayor, A.S., Pantrigo, J.J., Duarte, A., Fresno, V., Martínez, R.: URJC&UNED at ImageCLEF 2013 Photo Annotation Task. In: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes. Valencia, Spain (September 23-26 2013)
15. van de Sande, K.E., Gevers, T., Snoek, C.G.: Evaluating Color Descriptors for Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1582–1596 (2010)
16. Thomee, B., Popescu, A.: Overview of the ImageCLEF 2012 Flickr Photo Annotation and Retrieval Task. In: CLEF 2012 Working Notes. Rome, Italy (2012)
17. Torralba, A., Fergus, R., Freeman, W.: 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11), 1958–1970 (November 2008)
18. Uricchio, T., Bertini, M., Ballan, L., Bimbo, A.D.: KDEVIR at ImageCLEF 2013 Image Annotation Subtask. In: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes. Valencia, Spain (September 23-26 2013)
19. Villegas, M., Paredes, R.: Image-Text Dataset Generation for Image Annotation and Retrieval. In: Berlanga, R., Rosso, P. (eds.) II Congreso Español de Recuperación de Información, CERI 2012. pp. 115–120. Universidad Politécnica de Valencia, Valencia, Spain (June 18-19 2012)
20. Villegas, M., Paredes, R.: Overview of the ImageCLEF 2012 Scalable Web Image Annotation Task. In: Forner, P., Karlgren, J., Womser-Hacker, C. (eds.) CLEF 2012 Evaluation Labs and Workshop, Online Working Notes. Rome, Italy (September 17-20 2012)
21. Wang, X.J., Zhang, L., Liu, M., Li, Y., Ma, W.Y.: ARISTA - image search to annotation on billions of web photos. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 2987–2994 (2010)
22. Weston, J., Bengio, S., Usunier, N.: Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning 81, 21–35 (2010)

A Concept List

The following table lists each concept with its type, the WordNet 3.0 sense number(s) and synset offset(s), the linked Wikipedia article, and the number of annotated images in the development and test sets (dev. / test): aerial adj.
1 01380267 Aerial photography 39 72 airplane noun 1 02691156 Airplane 13 20 baby noun 1 09827683 Baby 9 29 beach noun 1 09217230 Beach 36 51 bicycle noun 1 02834778 Bicycle 17 15 bird noun 1 01503061 Bird 25 30 boat noun 1 02858304 Boat 59 83 book noun 2, 1 02870092, 06410904 Book 22 22 bridge noun 1 02898711 Bridge 35 44 building noun 1 02913152 Building 188 288 car noun 1 02958343 Car 47 86 cartoon noun 1 06780678 Cartoon 31 73 castle noun 2 02980441 Castle 18 20 cat noun 1 02121620 Cat 12 21 child noun 1 09917593 Child 23 60 church noun 2 03028079 Church (building) 12 16 cityscape noun 1 06209770 Cityscape 67 95 closeup noun 1 03049695 Closeup 15 56 cloud noun 2 09247410 Cloud 239 340 cloudless adj. 1 00460946 - 99 159 coast noun 1 09428293 Coast 46 64 countryside noun 1 08645033 Countryside 43 74 daytime noun 1 15164957 Daytime (astronomy) 587 989 desert noun 1 08505573 Desert 19 27 diagram noun 1 03186399 Diagram 11 23 dog noun 1 02084071 Dog 28 34 drum noun 1 03249569 Drum 12 9 elder noun 1 10048218 Elderly 12 37 embroidery noun 2 03282933 Embroidery 10 14 female noun 2 09619168 Female 41 149 fire noun 3, 1 13480848, 07302836 Fire 28 34 firework noun 1 03348454 Firework 11 20 fish noun 1 02512053 Fish 17 33 flower noun 2 11669335 Flower 46 111 fog noun 2 14521648 Fog 17 39 food noun 2, 1 07555863, 00021265 Food 20 59 footwear noun 1, 2 03381126, 03380867 Footwear 19 40 forest noun 1, 2 08438533, 09284015 Forest 96 129 furniture noun 1 03405725 Furniture 52 120 garden noun 1 03417345 Garden 14 21 grass noun 1 12102133 Grass 162 253 guitar noun 1 03467517 Guitar 7 13 harbor noun 1 08639058 Harbor 20 35 helicopter noun 1 03512147 Helicopter 8 14 highway noun 1 03519981 Highway 15 16 horse noun 1 02374451 Horse 18 44 indoor adj. 1 01692786 - 87 218 instrument noun 6 03800933 Musical instrument 34 58 lake noun 1 09328904 Lake 43 65 lightning noun 1, 2 11475279, 07412993 Lightning 10 16 logo noun 1 07272084 Logo 15 35 male noun 2 09624168 Male 53 115 monument noun 1 03743902 Monument 8 19 moon noun 1 09358358 Moon 7 31 motorcycle noun 1 03790512 Motorcycle 12 20 mountain noun 1 09359803 Mountain 100 181 newspaper noun 3, 1 03822171, 06267145 Newspaper 9 9 continues in next page Overview of the ImageCLEF 2013 Annotation Subtask 19 WN 3.0 WN 3.0 #images Concept Type Wikipedia article sense# offset dev. test nighttime noun 1 15167027 Nighttime 54 90 outdoor adj. 
1, 2 01692222, 03095372 - 615 1023 overcast noun 1 14524198 Overcast 64 71 painting noun 1 03876519 Painting 41 82 park noun 2 08615374 Park 19 28 person noun 1 00007846 Person 233 538 plant noun 2 00017222 Plant 393 694 portrait noun 1 07202391 Portrait 10 26 poster noun 1 06793426 Poster 6 16 protest noun 2 01177033 Protest 9 19 rain noun 1 11501381 Rain 12 29 rainbow noun 1 09403427 Rainbow 9 15 reflection noun 4, 5 04747115, 04068976 Mirror image 63 78 river noun 1 09411430 River 79 77 road noun 1 04096066 Road 122 212 sand noun 1 15019030 Sand 47 79 sculpture noun 2 00937656 Sculpture 23 55 sea noun 1 09426788 Sea 97 133 shadow noun 2 08646306 Shadow 50 98 sign noun 2 06793231 Sign 55 78 silhouette noun 1 08613345 Silhouette 26 39 sky noun 1 09436708 Sky 440 678 smoke noun 1 11508092 Smoke 26 17 snow noun 2 15043763 Snow 52 80 soil noun 2 14844693 Soil 40 79 sport noun 1 00523513 Sport 31 87 sun noun 1 09450163 Sun 30 62 sunrise/sunset noun 1, 1 15168790, 15169248 Sunrise/Sunset 36 52 teenager noun 1 09772029 Teenager 18 27 toy noun 1 03964744 Toy 24 31 traffic noun 1 08425303 Traffic 24 39 train noun 1 04468005 Train 32 24 tree noun 1 13104059 Tree 272 451 truck noun 1 04490091 Truck 23 38 underwater adj. 1, 2 02472252, 00124685 Underwater 23 55 unpaved adj. 1 01739987 - 13 21 vehicle noun 1 04524313 Vehicle 203 374 water noun 6 07935504 Water 288 430 arthropod noun 1 01767661 Arthropod - 78 bottle noun 1 02876657 Bottle - 32 bus noun 1 02924116 Bus - 45 butterfly noun 1 02274259 Butterfly - 16 chair noun 1 03001627 Chair - 63 drink noun 1 07885223 Drink - 53 galaxy noun 3 08271042 Galaxy - 21 hat noun 1 03497657 Hat - 78 nebula noun 3 09366940 Nebula - 16 phone noun 1 04401088 Phone - 26 pool noun 1 03982060 Swimming pool - 30 reptile noun 1 01661091 Reptile - 34 rodent noun 1 02329401 Rodent - 44 space noun 4 08500433 Outer space - 84 spectacles noun 1 04272054 Spectacles - 62 spider noun 1 01772222 Spider - 19 submarine noun 1 04347754 Submarine - 24 table noun 2 04379243 Table (furniture) - 49 tricycle noun 1 04482393 Tricycle - 15 violin noun 1 04536866 Violin - 23 wagon noun 1 04543158 Wagon - 29