<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrew Gilbert</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Piras</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josiah Wang</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fei Yan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanuel Dellandrea</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Gaizauskas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauricio Villegas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krystian Mikolajczyk</string-name>
        </contrib>
      </contrib-group>
      <abstract>
<p>The ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task was the fourth edition of a challenge aimed at developing more scalable image annotation systems. This year the three subtasks offered to participants had the goal of developing techniques to allow computers to reliably describe images, localize the different concepts depicted in the images, and generate a description of the scene. All three subtasks use a single mixed-modality data source of 500,000 web page items, which included raw images, textual features obtained from the web pages on which the images appeared, and various visual features extracted from the images themselves. Unlike previous years, the test set was also the training set, and in this edition of the task hand-labelled data was allowed. The images were obtained from the Web by querying popular image search engines. The development and test sets for subtasks 1 and 2 were both taken from the "training set" and had 1,979 and 3,070 samples respectively, while the subtask 3 track had 500 development and 450 test samples. The 251 concepts this year were chosen to be visual objects that are localizable and useful for generating textual descriptions of the visual content of images, and were mined from the texts of our large database of image-webpage pairs. This year 14 groups participated in the task, submitting a total of 122 runs across the 3 subtasks, and 11 of the participants also submitted working notes papers. Compared to the 11 participants and 58 submitted runs of last year, this shows that interest in this topic remains very high.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Every day, users struggle with the ever-increasing quantity of data available to
them: trying to find "that" photo they took on holiday last year, the image
on Google of their favourite actress or band, or the images of the news article
someone mentioned at work. A large number of images can be
cheaply found and gathered from the Internet. However, more valuable is
mixed-modality data, for example, web pages containing both images and text. A large
amount of information about the image is present on these web pages, and
vice versa. However, the relationship between the surrounding text and images varies
greatly, with much of the text being redundant and/or unrelated. Moreover,
images and the webpages on which they appear can be easily obtained for
virtually any topic using a web crawler. In existing work such noisy data has indeed
proven useful, e.g. [
        <xref ref-type="bibr" rid="ref19 ref27 ref29">19,27,29</xref>
        ]. Despite the obvious benefits of using such
information in automatic learning, the very weak supervision it provides means that
it remains a challenging problem. The Scalable Image Annotation, Localization
and Sentence Generation task aims to develop techniques to allow computers to
reliably describe images, localize the different concepts depicted in the images
and generate a description of the scene.
      </p>
      <p>
        The Scalable Image Annotation, Localization and Sentence Generation task
is a continuation of the general image annotation and retrieval task that has been
part of ImageCLEF since its very first edition in 2003. In the early years the
focus was on retrieving relevant images from a web collection given (multilingual)
queries, while from 2006 onwards annotation tasks were also held, initially aimed
at object detection, but more recently also covering semantic concepts. In its
current form, the 2015 Scalable Concept Image Annotation task is its fourth
edition, having been organized in 2012 [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], 2013 [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and 2014 [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. In light of
recent interest in annotating images beyond just concept labels, we introduced
two new subtasks this year where participants developed systems to describe an
image with a textual description of the visual content depicted in the image.
      </p>
      <p>
        This paper presents the overview of the fourth edition of the Scalable
Concept Image Annotation task [
        <xref ref-type="bibr" rid="ref23 ref24 ref25">23,25,24</xref>
        ], one of the four benchmark campaigns
organized by ImageCLEF [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] in 2015 under the CLEF initiative (http://www.clef-initiative.eu). Section 2
describes the task in detail, including the participation rules and the provided
data and resources. Following this, Section 3 presents and discusses the results
of the submissions. Finally, Section 4 concludes the paper with final remarks and
a future outlook.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Overview of the Task</title>
      <sec id="sec-2-1">
        <title>Motivation and Objectives</title>
        <p>Image concept annotation, localization and natural sentence generation has
generally relied on training data that has been manually, and thus reliably,
annotated: an expensive and laborious endeavour that cannot easily scale,
particularly as the number of concepts grows. However, images for any topic can
be cheaply gathered from the web, along with associated text from the webpages
that contain the images. The degree of relationship between these web images
and the surrounding text varies greatly, i.e., the data are very noisy, but overall
these data contain useful information that can be exploited to develop annotation
systems. Motivated by the need to exploit these useful data, the ImageCLEF 2015
Scalable Concept annotation, localization and sentence generation task (challenge
website: http://imageclef.org/2015/annotation) aims to develop techniques to
allow computers to reliably describe images, localize the different concepts
depicted in the images and generate a description of the scene. Figure 1 shows
examples of typical images found by querying search engines.</p>
        <p>Fig. 1: (a) images from a search query of "rainbow"; (b) images from a
search query of "sun".</p>
        <p>As can be seen, the data obtained are useful, and furthermore a wider variety
of images is expected: not only photographs but also drawings and
computer-generated graphics. This diversity has the advantage that the data can also
cover the different senses that a word can have, and the different types of images
that exist. Likewise, there are other resources available that can help to
determine the relationships between text and semantic concepts, such as dictionaries
or ontologies. There are also tools that can help to deal with the noisy text
commonly found on webpages, such as language models, stop word lists and spell
checkers. The goal of this task was to evaluate different strategies to deal with
the noisy data so that it can be reliably used for annotating, localizing, and
generating natural sentences for practically any topic.
1. Subtask 1: The image annotation task continues in the same line as past
years. Participants were required to develop a system that receives as input
an image and produces as output a prediction of which concepts are present
in that image, selected from a predefined list of concepts, and, starting this
year, where they are located within the image.
2. Subtask 2 (Noisy Track): In light of recent interest in annotating images
beyond just concept labels, this subtask required the participants to describe
images with a textual description of the visual content depicted in the image.</p>
        <p>It is thought of as an extension of subtask 1. This track was geared towards
participants interested in developing systems that generate textual
descriptions directly from images, e.g. by using visual detectors to identify concepts
and generating textual descriptions from the detected concepts. This had a
large overlap with subtask 1.
3. Subtask 3 (Clean Track): Aimed primarily at those interested only in the
natural language generation aspects of the task: a gold standard input
(bounding boxes labelled with concepts) was provided, from which participants
were to develop systems that generate natural language descriptions.</p>
        <p>As a common training set, the participants were provided with 500,000 images
crawled from the Internet, the corresponding webpages on which they appeared,
as well as precomputed visual and textual features. Apart from the image and
webpage data, the participants were also permitted and encouraged to use
similar datasets and any other automatically obtainable resources to help in the
processing and usage of the training data. In contrast to previous years, in this
edition of the task hand-labelled data was allowed. Thus, the available
pre-trained ImageNet CNNs could be used, and the participants were also encouraged
to use other resources such as ontologies, word disambiguators, language
models, language detectors, spell checkers, and automatic translation systems.
Unlike previous years, the test set was also the training set.</p>
        <p>For the development of the annotation systems, the participants were
provided with the following:
- A training dataset of images and corresponding webpages compiled specifically
for the three subtasks, including precomputed visual and textual features (see
Section 2.3).
- A development set of images with ground truth labelled bounding box
annotations and precomputed visual features for estimating system performance.
- A development set of images with at least five textual descriptions per image
for Subtask 2 and Subtask 3.
- A subset of the development set for Subtask 3 with gold standard inputs
(bounding boxes labelled with concepts) and correspondence annotations
between bounding box inputs and terms in textual descriptions.</p>
        <p>This year the training and the test images are all contained within the 500,000
images released at the beginning of the competition. At test time, it was expected
that participants provided a classification for all images. After a period of two
months, the development set, which included ground truth localized annotations,
was released, and about two months were given for participants to work on the
development data. A maximum of 10 submissions per subtask (also referred to
as runs) was allowed per participating group.</p>
        <p>
          The 251 concepts this year were chosen to be visual objects that are
localizable and that are useful for generating textual descriptions of the visual content
of images. They include animate objects such as people, dogs and cats,
inanimate objects such as houses, cars and balls, and scenes such as city, sea and
mountains. The concepts were mined from the texts of our database of 31
million image-webpage pairs [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Nouns that are subjects or objects of sentences
were extracted and mapped onto WordNet synsets [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. These were then filtered
to "natural", basic-level categories (dog rather than Yorkshire terrier), based
on the WordNet hierarchy and heuristics from large-scale text corpora [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ].
The final list of concepts was manually shortlisted by the organisers such that
the concepts were (i) visually concrete and localizable; (ii) suitable for use in image
descriptions; (iii) at a suitable "everyday" level of specificity, neither too
general nor too specific. The complete list of concepts, as well as the number of
samples in the test sets, is included in Appendix A.
        </p>
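        <p>As an illustration of this mining step, the following toy sketch (our own, not the organisers' code) maps a noun onto a WordNet synset and applies a crude concreteness and specificity filter; the actual shortlisting also used corpus heuristics [26] and manual review, which are not reproduced here. It assumes NLTK with the WordNet data installed, and the depth threshold is an arbitrary stand-in.</p>
        <preformat>
from nltk.corpus import wordnet as wn

def basic_level_candidate(noun, max_depth=9):
    # Map a noun onto its most frequent WordNet noun synset and keep it
    # only if it looks like an "everyday", localizable object: concrete
    # (a descendant of physical_entity.n.01) and not too deep in the
    # hypernym hierarchy (a crude proxy for being too specific).
    synsets = wn.synsets(noun, pos=wn.NOUN)
    if not synsets:
        return None
    synset = synsets[0]  # senses are ordered by frequency in WordNet
    concrete = any(s.name() == "physical_entity.n.01"
                   for path in synset.hypernym_paths() for s in path)
    too_deep = synset.min_depth() > max_depth
    return synset if concrete and not too_deep else None

# Since yorkshire_terrier.n.01 lies strictly below dog.n.01 in the
# hierarchy, a depth threshold can separate basic-level synsets from
# overly specific ones (exact depths depend on the WordNet version).
</preformat>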
      </sec>
      <sec id="sec-2-2">
        <title>Dataset</title>
        <p>
          The dataset used (available at http://risenet.prhlt.upv.es/webupv-datasets)
was very similar to that of the first three editions of the
task [
          <xref ref-type="bibr" rid="ref23 ref24 ref25">23,25,24</xref>
          ]. To create the dataset, initially a database of over 31 million
images was created by querying Google, Bing and Yahoo! using words from the
Aspell English dictionary [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The images and corresponding webpages were
downloaded, taking care to avoid data duplication. Then, a subset of 500,000
images was selected from this database by choosing the top images from a ranked
list. For further details on how the dataset was created, please refer to [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. The
ranked list was generated by retrieving images from our database using the list of
concepts, in essence, more or less as if the search engines had been queried
only for these. From the ranked list, some types of problematic images were removed,
and it was guaranteed that each image had at least one webpage in which it
appeared.
        </p>
        <p>The development and test sets were both taken from the "training set". A
set of 5,520 images was selected for this purpose using a CNN trained to
identify images suitable for sentence generation. The images were then annotated
via crowd-sourcing in three stages: (i) image-level annotation for the 251
concepts; (ii) bounding box annotation; (iii) textual description annotation. For
the textual descriptions, basic spell correction was performed manually by the
organisers using Aspell (http://aspell.net/). Both American and British English
spelling variants (color vs. colour) were retained to reflect the challenge of
real-world English spelling variants. A subset of these samples was then selected
for subtask 3 and further annotated by the organisers with correspondence
annotations between bounding box instances and terms in textual descriptions.</p>
        <p>The development set contained 2,000 samples, out of which 500 samples were
further annotated and used as the development set for subtask 3. Note that
only 1,979 samples from the development set contain at least one bounding box
annotation. The number of textual descriptions for the development set ranged
from 5 to 51 per image (with a mean of 9.5 and a median of 8 descriptions).
The test set for subtasks 1 and 2 contains 3,070 samples, while the test set for
subtask 3 comprises 450 samples, which are disjoint from the test set of subtasks
1 and 2.</p>
        <p>Textual Data: Four sets of data were made available to the participants. The
first one was the list of words used to find the image when querying the search
engines, along with the rank position of the image in the respective query and
the search engine it was found on. The second set of textual data contained the image
URLs as referenced in the webpages they appeared in. In many cases, image
URLs tend to be formed with words that relate to the content of the image,
which is why they can also be useful as textual features. The third set of data
was the webpages in which the images appeared, for which the only preprocessing
was a conversion to valid XML, just to make any subsequent processing simpler.
The final set of data consisted of features obtained from the text extracted near the
position(s) of the image in each webpage it appeared in.</p>
        <p>To extract the text near the image, after conversion to valid XML, the script
and style elements were removed. The extracted texts were the webpage title
and all the terms closer than 600 in word distance to the image, not including
the HTML tags and attributes. Then a weight s(t_n) was assigned to each of the
words near the image, defined as
s(t_n) = \frac{1}{\sum_{\forall t \in T} s(t)} \sum_{\forall t_{n,m} \in T} F_{n,m} \, \mathrm{sigm}(d_{n,m}) \quad (1)
where t_{n,m} are each of the appearances of the term t_n in the document T, F_{n,m}
is a factor depending on the DOM (e.g. title, alt, etc.), similar to what is done
in the work of La Cascia et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and d_{n,m} is the word distance from t_{n,m} to
the image. The sigmoid function was centered at 35, had a slope of 0.15, and
minimum and maximum values of 1 and 10 respectively. The resulting features
include for each image at most the 100 word-score pairs with the highest scores.</p>
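        <p>For concreteness, the following sketch (our illustration, not the organisers' implementation) computes these scores for one webpage, assuming term occurrences are given as (term, F, d) tuples, with F the DOM-dependent factor and d the word distance:</p>
        <preformat>
import math
from collections import defaultdict

def sigm(d, center=35.0, slope=0.15, lo=1.0, hi=10.0):
    # Decreasing sigmoid over word distance: terms right next to the
    # image score close to `hi` (10), distant terms decay towards `lo` (1).
    return lo + (hi - lo) / (1.0 + math.exp(slope * (d - center)))

def term_scores(occurrences, top_k=100):
    # `occurrences`: iterable of (term, F, d) tuples, one per appearance
    # t_{n,m} of a term in the document.
    raw = defaultdict(float)
    for term, F, d in occurrences:
        raw[term] += F * sigm(d)          # inner sum of eq. (1)
    total = sum(raw.values()) or 1.0      # normalizer: sum over all terms
    scores = {t: v / total for t, v in raw.items()}
    # Keep at most the 100 word-score pairs with the highest scores.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
</preformat>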
        <p>Visual Features: Before visual feature extraction, images were filtered and
resized so that the width and height had at most 240 pixels, while preserving
the original aspect ratio. These raw resized images were provided to the
participants, along with eight types of precomputed visual features. The first feature
set, Colorhist, consisted of 576-dimensional color histograms extracted using our
own implementation. These features correspond to dividing the image into 3×3
regions and for each region obtaining a color histogram quantized to 6 bits. The
second feature set, GETLF, contained 256-dimensional histogram-based features.
First, local color-histograms were extracted on a dense grid every 21 pixels for
windows of size 41×41. Then, these local color-histograms were randomly
projected to a binary space using 8 random vectors, taking the sign of each resulting
projection to produce a bit. This yields an 8-bit representation of each
local color-histogram that can be considered as a word. Finally, the image is
represented as a bag-of-words, leading to a 256-dimensional histogram
representation. The third set of features consisted of GIST [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] descriptors. The
following four feature types were obtained using the colorDescriptors software [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ],
namely SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT. The configuration
was dense sampling with default parameters and a hard-assignment 1,000-word
codebook using a spatial pyramid of 1×1 and 2×2 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Since the vectors of the spatial
pyramid were concatenated, this resulted in 5,000-dimensional feature vectors.
The codebooks were generated using 1.25 million randomly selected features
and the k-means algorithm. Finally, CNN feature vectors were provided,
computed as the seventh-layer feature representations extracted from a
deep CNN model pre-trained on the ImageNet dataset [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] using the Berkeley
Caffe library (https://github.com/BVLC/caffe/wiki/Model-Zoo).
        </p>
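        <p>As an illustration of the GETLF encoding described above, the following NumPy sketch (ours; the choice of Gaussian random vectors and the final normalization are assumptions) maps precomputed local color-histograms to 8-bit visual words and accumulates the 256-bin bag-of-words:</p>
        <preformat>
import numpy as np

def getlf_histogram(local_hists, n_bits=8, seed=0):
    # `local_hists`: (N, D) array of local color-histograms sampled on a
    # dense grid (every 21 pixels, 41x41 windows in the task description).
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((local_hists.shape[1], n_bits))
    bits = (local_hists @ proj) > 0                  # sign of each projection
    words = bits.astype(int) @ (2 ** np.arange(n_bits))  # pack 8 bits per word
    hist = np.bincount(words, minlength=2 ** n_bits)     # 256-bin bag-of-words
    return hist / max(hist.sum(), 1)                 # normalized representation
</preformat>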
      </sec>
      <sec id="sec-2-3">
        <title>Performance Measures</title>
        <p>Subtask 1 Ultimately the goal of an image annotation system is to make
decisions about which concepts to assign and localize in a given image from a
predefined list of concepts. Thus to measure annotation performance, what should
be considered is how good and accurate those decisions are, i.e. the precision of a
system. Ideally a recall measure would also be used to penalize a system that
produces additional false positive output. However, given the difficulties with and
unreliability of the hand labelling of the concepts for the test images, it was not
possible to guarantee that all concepts were labelled; however, it was assumed
that the labels present were accurate and of high quality.</p>
        <p>
          The annotation and localization of Subtask 1 were evaluated using the
PASCAL VOC [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] style metric of intersection over union (IoU). IoU is defined as
\mathrm{IoU} = \frac{|BB_{fg} \cap BB_{gt}|}{|BB_{fg} \cup BB_{gt}|} \quad (2)
where BB is a rectangle bounding box, fg denotes the foreground proposed
annotation label, and gt the ground truth label of the concept. It calculates the area
of intersection between the foreground in the proposed output localization and
the ground-truth bounding box localization, divided by the area of their union.
IoU is superior to a more naive measure of the percentage of correctly labelled
pixels, as IoU is automatically normalized by the size of the object and penalizes
segmentations that include the background. This means that small changes in
the percentage of correctly labelled pixels can correspond to large differences in
IoU, and as the dataset has a wide variation in object size, performance
increases are more reliably measured. The overlap between the ground truth and
the proposed output was recorded from 0% to 90%. At 0% this is equivalent
to an image-level annotation output, while 50% is the standard
PASCAL VOC operating point. The localized IoU is then used to compute
the mean average precision (MAP) of each concept independently. This is then
reported both per concept and averaged over all concepts.</p>
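        <p>A minimal sketch of this computation for axis-aligned boxes with coordinates (x1, y1, x2, y2); this is our illustration of equation (2), not the evaluation code used for the task:</p>
        <preformat>
def iou(bb_fg, bb_gt):
    # Equation (2): intersection area over union area of the proposed
    # foreground box and the ground-truth box, both (x1, y1, x2, y2).
    ix1, iy1 = max(bb_fg[0], bb_gt[0]), max(bb_fg[1], bb_gt[1])
    ix2, iy2 = min(bb_fg[2], bb_gt[2]), min(bb_fg[3], bb_gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_fg = (bb_fg[2] - bb_fg[0]) * (bb_fg[3] - bb_fg[1])
    area_gt = (bb_gt[2] - bb_gt[0]) * (bb_gt[3] - bb_gt[1])
    union = area_fg + area_gt - inter
    return inter / union if union else 0.0

# iou((0, 0, 10, 10), (5, 0, 15, 10)) -> 1/3 (intersection 50, union 150);
# at the standard PASCAL VOC operating point a detection needs an IoU of
# at least 0.5 to count as correct.
</preformat>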
        <p>
          Subtask 2 Subtask 2 was evaluated using the METEOR evaluation metric [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],
which is an F-measure of word overlaps taking into account stemmed words,
synonyms, and paraphrases, with a fragmentation penalty to penalize gaps and
word order differences. This measure was chosen as it was shown to correlate
well with human judgments in evaluating image descriptions [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Please refer to
Denkowski and Lavie [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for details about this measure.
        </p>
        <p>Subtask 3 Subtask 3 was also evaluated using the METEOR evaluation metric
(see above). In addition, we have pioneered a fine-grained metric to evaluate
the content selection capabilities of sentence generation systems. The content
selection metric is the F1 score averaged across all 450 test images, where each
F1 score is computed from the precision and recall averaged over all gold
standard descriptions for the image. Intuitively, this measure evaluates how well the
sentence generation system selects the correct concepts to be described, against
gold standard image descriptions. Formally, let I = \{I_1, I_2, \ldots, I_N\} be the set of
test images. Let G^{I_i} = \{G^{I_i}_1, G^{I_i}_2, \ldots, G^{I_i}_M\} be the set of gold standard
descriptions for image I_i, where each G^{I_i}_m represents the set of unique bounding box
instances referenced in gold standard description m of image I_i. Let S^{I_i} be the
set of unique bounding box instances referenced by the participant's generated
sentence for image I_i. The precision P^{I_i} for test image I_i is computed as
P^{I_i} = \frac{1}{M} \sum_{m=1}^{M} \frac{|G^{I_i}_m \cap S^{I_i}|}{|S^{I_i}|} \quad (3)
where |G^{I_i}_m \cap S^{I_i}| is the number of unique bounding box instances referenced in
both the gold standard description and the generated sentence, and M is the
number of gold standard descriptions for image I_i.</p>
        <p>Similarly, the recall R^{I_i} for test image I_i is computed as
R^{I_i} = \frac{1}{M} \sum_{m=1}^{M} \frac{|G^{I_i}_m \cap S^{I_i}|}{|G^{I_i}_m|} \quad (4)
The content selection score for image I_i, F^{I_i}, is computed as the harmonic mean
of P^{I_i} and R^{I_i}:
F^{I_i} = \frac{2 \, P^{I_i} R^{I_i}}{P^{I_i} + R^{I_i}} \quad (5)
The final P, R and F scores are computed as the mean P, R and F scores across
all test images.</p>
        <p>The advantage of the macro-averaging process in equations (3) and (4) is that
it implicitly captures the relative importance of the bounding box instances,
based on how frequently they are referred to across the gold standard
descriptions.</p>
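        <p>The per-image computation can be transcribed directly from equations (3)-(5); in this sketch (ours) bounding box instances are represented as hashable identifiers collected into Python sets:</p>
        <preformat>
def content_selection_scores(gold_sets, system_set):
    # `gold_sets`: one set of referenced bounding-box ids per gold standard
    # description G_m of the image; `system_set`: ids referenced by the
    # participant's generated sentence, S.
    M = len(gold_sets)
    if M == 0 or not system_set:
        return 0.0, 0.0, 0.0
    p = sum(len(g.intersection(system_set)) / len(system_set)
            for g in gold_sets) / M                  # eq. (3)
    r = sum(len(g.intersection(system_set)) / len(g)
            for g in gold_sets if g) / M             # eq. (4)
    f = 2 * p * r / (p + r) if p + r else 0.0        # eq. (5)
    return p, r, f

# The final P, R and F reported for a system are the means of these
# per-image scores over all 450 test images.
</preformat>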
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Results</title>
      <sec id="sec-3-1">
        <title>Participation</title>
        <p>The participation was excellent, with a greater number of teams, including a
number of new groups. In total, 14 groups took part in the task and submitted
122 system runs overall, nearly double the number of runs of the previous year.
Among the 14 participating groups, 11 submitted a working notes paper
describing their system, so for these, specific details were available. Participating
teams included SMIVA, IVANLPR, RUC, CEA, Kdevir, ISIA, CNRS-TPT,
IRIP-iCC, UAIC, MLVISP6, REGIM and Lip6.</p>
        <p>Subtask 1 was well received despite the additional requirement of labelling
and localizing all 500,000 images. All submissions were able to provide results
on all 500,000 images, indicating that all groups have developed systems that
are scalable enough to annotate large amounts of images. The final results are
presented in Table 1 in terms of mean average precision (MAP) over all images of
all concepts, with both 0% overlap (i.e. no localization) and 50% overlap. It can
be seen that three groups achieved over 0.50 MAP across the evaluation set
with 50% overlap with the ground truth. This seems an excellent result given
the challenging nature of the images used and the wide range of concepts
provided. The graph in Figure 2 shows the performance of each submission for an
increasing amount of overlap with the ground truth labels.</p>
        <p>The results from the groups seem encouraging, and it would seem that the
use of CNNs has allowed for large improvements in performance: the top four
groups all use CNNs in their pipeline for feature description.</p>
        <p>SMIVA used a deep learning framework with additional annotated data,
while IVANLPR implemented a two-stage process, initially classifying at the
image level with an SVM classifier, and then applying deep learning feature
classification to provide localization. RUC trained, per concept, an
ensemble of linear SVMs by Negative Bootstrap, using CNN features as the
image representation; concept localization was achieved by classifying object
proposals generated by Selective Search. The approach by CEA LIST could be
thought of as a baseline: they used the learnt CNN features in a small grid-based
approach for localization.</p>
        <p>Examples of the most and least successfully localized concepts are shown in
Tables 2 and 3 respectively, together with the number of labelled occurrences of
these concepts in the test data.</p>
        <p>Discussion for subtask 1 As can be observed in Table 1, the performance
of many submissions was high this year, even given the additional constraint of
localization. In fact, four teams managed to achieve over 0.5 MAP with 50%
overlap with the ground truth. This perhaps indicates that, in conjunction with
the improvements from CNNs, real progress is starting to be made in
image annotation.</p>
        <p>Figure 2 shows the change in performance as the required intersection with
the ground truth labels increases. All the approaches show a steady drop-off in
performance, which is encouraging, illustrating that the approaches can still
detect a number of concepts correctly even when a high localization accuracy is
required. Even at 90% overlap with the ground truth, the MAP for SMIVA was
0.35, which is impressive. Table 2 shows the most correctly localized concepts,
together with the number of occurrences of each concept. It is important to
remember that, due to the imperfect annotation, no recall level is calculated;
this is likely why the concept bee scores so highly. However, there is encouraging
performance for mountain, statue, bench and suit. These are all quite varied
concepts, in terms of scale and the percentage of the image the concept will cover.
Examining Table 3 shows a number of concepts that should be detected but are
not, such as leaf and wheel. However, many in that table are quite small concepts
and, therefore, harder to localize and intersect with the labelled ground truth.
This could be an area towards which to direct the challenge objectives in future
years.</p>
        <p>
          From a computer vision perspective, we would argue that the ImageCLEF
challenge has two key differences in its dataset construction from the other
popular datasets, ImageNet [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and MSCOCO [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. All three concern the
detection and classification of concepts within images. However, the ImageCLEF
dataset is created from Internet web pages. This is a key difference from the
other popular datasets: the web pages are unsorted and unconstrained, meaning
the relationship or quality of the text and image in relation to a concept can be
very variable. Therefore, instead of a high-quality Flickr-style photo of a car as in
ImageNet, the image in the ImageCLEF dataset could be a fuzzy abstract car
shape in the corner of the image. This allows the ImageCLEF image annotation
challenge to provide additional opportunities to test proposed approaches.
Another important difference is that, in addition to the image, text data from
web pages can be used to train and to generate the output description of the
image in natural language form.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Results for Subtask 2</title>
        <p>For subtask 2, participants were asked to generate sentence-level textual
descriptions for all 500,000 training images. The systems were evaluated on a
subset of 3,070 instances. Four teams participated in this pilot subtask. Table 4
shows the METEOR scores for subtask 2, for all submitted runs by all four
participants. Three teams achieved METEOR scores of over 0.10. RUC achieved the
highest METEOR score, followed by ISIA, MindLab, and UAIC. We observed a
large variety of approaches used by participants to tackle this subtask: RUC
used a state-of-the-art deep-learning-based CNN-LSTM caption generation
system, MindLab employed a joint image-text retrieval approach, and UAIC
a template-based approach.</p>
        <p>As a comparison, we estimated a human upper-bound for this subtask by
evaluating one description against the other descriptions for the same image
and repeating the process for all descriptions. The METEOR score for the
human upper-bound is estimated to be 0.3385 (Table 4). Therefore, there is
clear scope for future work to improve image description generation
systems.</p>
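        <p>The upper-bound estimate can be reproduced with a simple leave-one-out loop; in this sketch (ours), meteor(hypothesis, references) is a placeholder for the actual METEOR scorer of Denkowski and Lavie [2], which we do not re-implement:</p>
        <preformat>
def human_upper_bound(descriptions_per_image, meteor):
    # `descriptions_per_image`: list of lists of human descriptions, one
    # list per image. Each description is scored against the remaining
    # descriptions of the same image, and all scores are averaged.
    scores = []
    for descs in descriptions_per_image:
        for i, hypothesis in enumerate(descs):
            references = descs[:i] + descs[i + 1:]   # leave one out
            if references:
                scores.append(meteor(hypothesis, references))
    return sum(scores) / len(scores)
</preformat>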
      </sec>
      <sec id="sec-3-3">
        <title>Results for Subtask 3</title>
        <p>For subtask 3, participants were provided with gold standard labelled bounding
box inputs for 450 test images (released one week before the submission
deadline), and were asked to generate textual descriptions for each image based on
the gold standard input bounding boxes. To enable evaluation using the content
selection metric (Section 2.4), participants were also asked to indicate within the
textual descriptions the bounding box(es) to which the relevant term(s)
correspond.</p>
        <p>Two teams participated in this subtask (both of whom also participated in
subtask 2). Table 5 shows the content selection and METEOR scores for the
subtask, again for all submitted runs by the two participants. RUC performed
marginally better than UAIC in terms of the F and METEOR scores.
Interestingly, it can be observed that RUC's sentence generation system has higher
precision P, while UAIC in general achieved higher recall R than RUC. This is
possibly due to RUC's use of a deep-learning-based sentence generator coupled
with re-ranking based on the gold standard input, which yielded higher precision,
while UAIC's template-based generator selected more bounding boxes to be
described, resulting in a higher recall. Note that the METEOR scores are generally
higher in subtask 3 compared to subtask 2, as participants were provided with
gold standard input concepts, and the subtask also had a smaller test set of
450 samples.</p>
        <p>As a baseline, we generated textual descriptions per image by selecting at
most three bounding boxes from the gold standard at random (the average
number of unique instance mentions per description in the development set is 2.89).
These concept terms were then connected with words or phrases selected
randomly from a predefined list of prepositions and conjunctions, followed by an
optional article "the"; a sketch of this procedure is given below. As with subtask 2,
we also computed a human upper-bound. The results for these are shown in
Table 5. As observed, all participants performed significantly better than the
random baseline. Compared to the human upper-bound, again much work can
still be done. An interesting note is that RUC achieved a high precision P almost
on par with the human upper-bound, at the expense of a lower R.</p>
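        <p>The following sketch illustrates such a random baseline; the connector word list and function names are our own, since the organisers' exact word list is not reproduced here:</p>
        <preformat>
import random

CONNECTORS = ["with", "and", "on", "near", "beside", "under"]  # illustrative

def random_baseline_sentence(gold_boxes, max_boxes=3, rng=None):
    # `gold_boxes`: list of (box_id, concept_label) pairs from the gold
    # standard annotation of one image.
    if not gold_boxes:
        return ""
    rng = rng or random.Random()
    chosen = rng.sample(gold_boxes, min(max_boxes, len(gold_boxes)))
    parts = [chosen[0][1]]                 # start with the first concept
    for _, label in chosen[1:]:
        parts.append(rng.choice(CONNECTORS))
        if rng.choice([True, False]):      # optional article "the"
            parts.append("the")
        parts.append(label)
    return " ".join(parts)
</preformat>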
        <p>There are two major limitations that we have identified with the challenge
this year. Very few of the groups used the provided dataset and features, which
we found surprising, considering that state-of-the-art CNN features and many
others were included. However, this is likely to be due to the complexity and
challenge of the 500,000 web-page-based images. Given that they were collected
from the Internet with little filtering, a large number of the images are poor
representations of the concept. In fact, a number of participants annotated a
large amount of their own cleaner training data, as their learning processes
assume perfect or near-perfect training examples and will otherwise fail. As the
number of classes increases and the classes become more varied, annotating such
clean data will become more difficult.</p>
        <p>Another shortcoming of the overall challenge is the difficulty of ensuring that
the ground truth has 100% of concepts labelled, which would allow a recall
measure to be used. This is especially problematic as the selected concepts include
fine-grained categories such as eyes and hands that are generally small but occur
frequently in the dataset. In addition, it was difficult for annotators to reach a
consensus in annotating bounding boxes for less well-defined categories such as
trees and field. Given the current crowd-sourced hand-labelling of the ground
truth, the concepts have missing annotations. Thus, in this edition a recall
measure was not evaluated for subtask 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>This paper presented an overview of the ImageCLEF 2015 Scalable Concept
Image Annotation task, the fourth edition of a challenge aimed at developing
more scalable image annotation systems. The three subtasks available to
participants had the goal of developing techniques to allow computers to
reliably annotate images, localize the different concepts depicted in the images
and generate a description of the scene.</p>
      <p>
        The participation increased this year compared to last year, with 14 teams
submitting in total 122 system runs. The performance of the submitted systems
was somewhat superior to last year's results for subtask 1, especially
considering the requirement to label all 500,000 images in the training/test set. This
was probably due in part to the increased use of CNNs as the feature
representation. The clear winner of this year's subtask 1 evaluation was the SMIVA [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
team, which placed heavy emphasis on the visual aspect of annotating images
and improved their overall annotation performance by branching off secondary
recognition pipelines for certain highly common concepts. The participation rate
for subtasks 2 and 3 is encouraging for pilot subtasks. For subtask 3, we also
pioneered a concept selection metric to encourage fine-grained evaluation of image
descriptions. RUC [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] led both subtasks using a state-of-the-art CNN-LSTM
caption generator, improving performance by exploiting concept detections from
subtask 1. Other teams, however, varied in their approaches to the problem. The
encouraging participation rate and promising results in these pilot subtasks are
sufficient motivation for them to be included in future editions of the challenge.
      </p>
      <p>The results of the task have been very interesting and show that useful
annotation systems can be built using noisy web-crawled data. Since the problem
requires covering many fronts, there is still a lot of work that can be done, so
it would be interesting to continue this line of research. Papers on this topic
should be published, demonstration systems based on these ideas built, and
more evaluations of this sort organized. Also, it remains to be seen how this can
be used to complement systems that are based on clean hand-labelled data, and
to find ways to take advantage of both the supervised and unsupervised data.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The Scalable Image Annotation, Localization and Sentence Generation task was
co-organized by the ViSen consortium under the EU CHIST-ERA D2K Programme,
supported by EPSRC Grants EP/K01904X/1 and EP/K019082/1, and by French ANR
Grant ANR-12-CHRI-0002-04. This work was also supported by the European Science
Foundation (ESF) through the research networking programme Evaluating Information
Access Systems (ELIAS).</p>
    </sec>
    <sec id="sec-6">
      <title>A Concept List 2015</title>
      <p>hog
hole
hook
horse
hospital
house
jacket
jean
key
keyboard
kitchen
knife
ladder
lake
leaf
leg
letter
library
lighter
lion
lotion
magazine
male child
man
mask
mat
mattress
microphone
milk
mirror
monkey
motorcycle
mountain
mouse
mouth
mushroom
neck
necklace
necktie
nest
newspaper
nose
nut
office
onion
orange
oven
painting
pan
park
pen
pencil
piano
picture
pillow
planet
pool
pot
potato
prison
pumpkin
rabbit
rack
radio
ramp
ribbon</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Calfa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iftene</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Using Textual and Visual Processing in Scalable Concept Image Annotation Challenge</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Denkowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavie</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Meteor universal: Language specific translation evaluation for any target language</article-title>
          .
          <source>In: Proceedings of the EACL 2014 Workshop on Statistical Machine Translation</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Elliott</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Keller, F.:
          <article-title>Comparing automatic evaluation measures for image description</article-title>
          .
          <source>In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          . pp.
          <volume>452</volume>
          –
          <fpage>457</fpage>
          . Association for Computational Linguistics, Baltimore, Maryland (
          <year>June 2014</year>
          ), http://www.aclweb.org/anthology/P14-2074
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Everingham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eslami</surname>
            ,
            <given-names>S.M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.K.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The pascal visual object classes challenge: A retrospective</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>111</volume>
          (
          <issue>1</issue>
          ),
          <volume>98</volume>
          –136 (Jan
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fellbaum</surname>
          </string-name>
          , C. (ed.):
          <article-title>WordNet: An Electronic Lexical Database</article-title>
          . The MIT Press, Cambridge, MA; London (May
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gadeski</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borgne</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>CEA LIST's participation to the Scalable Concept Image Annotation task of ImageCLEF 2015</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kakar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chia</surname>
            ,
            <given-names>A.Y.S.</given-names>
          </string-name>
          : SMIVA at ImageCLEF 2015:
          <article-title>Automatic Image Annotation using Weakly Labelled Web Data</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>La</given-names>
            <surname>Cascia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Sethi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Sclaroff</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.:</surname>
          </string-name>
          <article-title>Combining textual and visual cues for content-based image retrieval on the World Wide Web</article-title>
          .
          <source>In: Content-Based Access of Image and Video Libraries</source>
          ,
          <year>1998</year>
          . Proceedings. IEEE Workshop on. pp.
          <volume>24</volume>
          –
          <issue>28</issue>
          (
          <year>1998</year>
          ), doi:10.1109/IVL.
          <year>1998</year>
          .694480
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponce</surname>
          </string-name>
          , J.:
          <article-title>Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories</article-title>
          .
          <source>In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision</source>
          and Pattern Recognition - Volume
          <volume>2</volume>
          . pp.
          <volume>2169</volume>
          –
          <fpage>2178</fpage>
          . CVPR '06, IEEE Computer Society, Washington, DC, USA (
          <year>2006</year>
          ), doi:10.1109/CVPR.
          <year>2006</year>
          .68
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huo</surname>
            ,
            <given-names>Y.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>RUC-Tencent at ImageCLEF</source>
          <year>2015</year>
          :
          <article-title>Concept Detection, Localization and Sentence Generation</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bingyuan</surname>
            <given-names>Liu</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.F.</given-names>
            ,
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Lu</surname>
          </string-name>
          , H.:
          <article-title>Hybrid Learning Framework for Large-Scale Web Image Annotation and Localization</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Microsoft COCO: common objects in context</article-title>
          .
          <source>CoRR abs/1405</source>
          .0312 (
          <year>2014</year>
          ), http://arxiv.org/abs/1405.0312
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope</article-title>
          .
          <source>Int. J. Comput. Vision</source>
          <volume>42</volume>
          (
          <issue>3</issue>
          ),
          <volume>145</volume>
          –175 (May
          <year>2001</year>
          ), doi:10.1023/A:1011139631724
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Pellegrin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanegas</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arevalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beltran</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gomez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>INAOE-UNAL at ImageCLEF 2015: Scalable Concept Image Annotation</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Russakovsky</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Ma,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Berg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.C.</given-names>
            ,
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          :
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          (IJCV) pp.
          <volume>1</volume>
          –
          <issue>42</issue>
          (April
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sahbi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>CNRS TELECOM ParisTech at ImageCLEF 2015 Scalable Concept Image Annotation Task: Concept Detection with Blind Localization Proposals</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          :
          <article-title>Evaluating Color Descriptors for Object and Scene Recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>32</volume>
          ,
          <fpage>1582</fpage>
          –
          <lpage>1596</lpage>
          (
          <year>2010</year>
          ), doi:10.1109/TPAMI.2009.154
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piwowarski</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Denoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Graph Based Method Approach to the ImageCLEF2015 Task1 - Image Annotation</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergus</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>30</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1958</fpage>
          –
          <lpage>1970</lpage>
          (November
          <year>2008</year>
          ), doi:10.1109/TPAMI.2008.128
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Ullah</surname>
            ,
            <given-names>M.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aono</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>KDEVIR at ImageCLEF 2015 Scalable Image Annotation, Localization, and Sentence Generation task: Ontology based Multi-label Image Annotation</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bromuri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amin</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohammed</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uskudarli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marvasti</surname>
            ,
            <given-names>N.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aldana</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>del Mar Roldán García</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>General Overview of ImageCLEF at the CLEF 2015 Labs</article-title>
          . Lecture Notes in Computer Science, Springer International Publishing (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Image-Text Dataset Generation for Image Annotation and Retrieval</article-title>
          . In:
          <string-name>
            <surname>Berlanga</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (eds.) II Congreso Español de Recuperación de Información, CERI
          <year>2012</year>
          . pp.
          <fpage>115</fpage>
          –
          <lpage>120</lpage>
          . Universidad Politécnica de Valencia, Valencia, Spain (June 18-19
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2012 Scalable Web Image Annotation Task</article-title>
          . In:
          <string-name>
            <surname>Forner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karlgren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Womser-Hacker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <article-title>CLEF 2012 Evaluation Labs</article-title>
          and Workshop, Online Working Notes. Rome, Italy (September 17-20
          <year>2012</year>
          ), http://mvillegas.info/pub/Villegas12_CLEF_Annotation-Overview.pdf
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2014 Scalable Concept Image Annotation Task</article-title>
          . In:
          <source>CLEF2014 Working Notes. CEUR Workshop Proceedings</source>
          , vol.
          <volume>1180</volume>
          , pp.
          <fpage>308</fpage>
          –
          <lpage>328</lpage>
          . CEUR-WS.org, Sheffield, UK (September 15-18
          <year>2014</year>
          ), http://ceur-ws.org/Vol-1180/CLEF2014wn-Image-VillegasEt2014.pdf
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2013 Scalable Concept Image Annotation Subtask</article-title>
          . In:
          <article-title>CLEF 2013 Evaluation Labs</article-title>
          and Workshop, Online Working Notes. Valencia, Spain (September 23-26
          <year>2013</year>
          ), http://mvillegas.info/pub/Villegas13_CLEF_Annotation-Overview.pdf
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaizauskas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A poodle or a dog? Evaluating automatic image annotation using human descriptions at different levels of granularity</article-title>
          . In:
          <source>Proceedings of the Third Workshop on Vision and Language</source>
          . pp.
          <fpage>38</fpage>
          –
          <lpage>45</lpage>
          . Dublin City University and the Association for Computational Linguistics, Dublin, Ireland (August
          <year>2014</year>
          ), http://www.aclweb.org/anthology/W14-5406
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>W.Y.</given-names>
          </string-name>
          :
          <article-title>ARISTA - image search to annotation on billions of web photos</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>2010 IEEE Conference on</source>
          . pp.
          <fpage>2987</fpage>
          –
          <lpage>2994</lpage>
          (June
          <year>2010</year>
          ), doi:10.1109/CVPR.2010.5540046
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>BUAA-iCC at ImageCLEF 2015 Scalable Concept Image Annotation Challenge</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Usunier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Large scale image annotation: learning to rank with joint word-image embeddings</article-title>
          .
          <source>Machine Learning</source>
          <volume>81</volume>
          ,
          <fpage>21</fpage>
          –
          <lpage>35</lpage>
          (
          <year>2010</year>
          ), doi:10.1007/s10994-010-5198-3
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Zarka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ammar</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alimi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Regimvid at ImageCLEF 2015 Scalable Concept Image Annotation Task: Ontology based Hierarchical Image Annotation</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>