<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CVPR.2006.68</article-id>
      <title-group>
        <article-title>Overview of the ImageCLEF 2012 Scalable Web Image Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mauricio Villegas</string-name>
          <email>mvillegas@iti.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Paredes</string-name>
          <email>rparedes@iti.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Institut Tecnològic d'Informàtica, Universitat Politècnica de València, Camí de Vera</institution>
          <addr-line>s/n, 46022 València</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2006</year>
      </pub-date>
      <volume>2</volume>
      <fpage>24</fpage>
      <lpage>28</lpage>
      <abstract>
<p>The ImageCLEF 2012 Scalable Image Annotation Using General Web Data Task proposed a challenge in which, instead of relying only on a set of manually annotated images as training data, the objective was to make use of automatically gathered Web data, with the aim of developing more scalable image annotation systems. To this end, the participants were provided with a new dataset composed of 250,000 training images, which included several visual feature types as well as textual features obtained from the websites on which the images appeared. Two subtasks were defined. The first subtask employed the same test set as the ImageCLEF 2012 Flickr Photo Annotation subtask, with the particularity that both the Flickr and the Web training sets had to be used. The idea was to determine whether the Web data could help to enhance annotation performance in comparison to using only manually annotated data. The second subtask consisted of using only automatically gathered Web data to develop an image annotation system. For this, we provided development and test sets of 1,000 and 2,000 images, manually annotated for 95 and 105 concepts, respectively. The participants of the first subtask were not able to take advantage of the Web data to enhance annotation performance. In the second subtask, on the contrary, interesting results were obtained. As expected, the overall performance of the systems is worse than when using manually annotated data; nonetheless, the results are promising when analyzed per concept. For some concepts the performance is relatively good, confirming that Web data can in fact be quite useful. Moreover, due to the low participation and the relatively simple techniques used, it is believed that there is considerable room for improvement on both subtasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The rapidly increasing amount of digital information that people have to deal
with every day has created huge interest in developing automatic indexing
systems, so that information needs can be fulfilled easily and efficiently. In the
case of images and video, this indexing can be addressed by means of an
automatic image annotation system, in which images are associated with one or
more concepts. Research on image concept detection has generally relied on
training data that have been manually, and thus reliably, labeled, an expensive
and laborious endeavor that cannot easily scale. Because of this, it has become
common in past image annotation benchmark campaigns [6,10] to use
crowdsourcing approaches such as Amazon Mechanical Turk1 (MTurk), in order
to label a large amount of images. Still, crowdsourcing is expensive and
difficult to scale to a very large number of concepts, thus it is advisable to explore
possible alternatives.</p>
<p>With the advance of multimedia technology and the Internet, we have at our
disposal billions of images available online. Furthermore, the images are found on
webpages surrounded by text which might have a direct relationship with the
content of the image. Even though this surrounding unsupervised text is noisy
and sometimes unrelated to the image, it potentially contains useful information;
furthermore, it can be gathered cheaply and obtained for practically
any topic. Thus, determining whether this kind of data can be used for reliably
annotating images is important. Previous research indicates that this information
is useful; the work of Torralba et al. [11] is an example of this, in which
almost 80 million tiny images were effectively used for several tasks such as
person detection. More closely related, in the work of Weston et al. [16] an image
annotation learning method is proposed that scales to millions of images and
thousands of possible annotations. Another related work is the Arista project [15],
in which accurate tags can be generated for popular Web images that have
near-duplicates included in its Web image database of billions of images.</p>
<p>This paper presents an overview of the ImageCLEF 2012 Scalable Image
Annotation Using General Web Data Task, a benchmark campaign oriented
at using automatically gathered Web data for image annotation. The paper is
organized as follows. Section 2 describes the generation of the dataset that was
created specifically for this evaluation. Following this, the two subtasks that
were defined are presented in Section 3. Then, Section 4 presents the results
submitted by the participants and a discussion of these. Finally, Section 5
concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Creation of the Dataset</title>
      <sec id="sec-2-1">
        <title>Web Crawling</title>
<p>Among the objectives for the dataset being created [14] was to have a wide
variety of images while keeping the total amount of images relatively small (not
billions). Thus, in order to obtain a good set of image URLs, we opted to use the
same crawling strategy as in [11], where the image URLs (and the corresponding
URLs of the webpages that contain the images) are obtained by querying popular
image search engines. We selected Google, Bing and Yahoo, and queried them
using words from the English dictionary that comes with the aspell spell checker.</p>
<p>The next step in the crawling process was to download the images and
corresponding webpages, and to store a snapshot of each. In the end, in total we</p>
        <sec id="sec-2-1-1">
          <title>1 www.mturk.com</title>
<p>obtained over 31 million each of images and webpages. In order to avoid
duplicate images, several precautions were taken. First, the URLs were normalized
to prevent different versions of the same URL from being downloaded several times.
However, there was also the possibility that the same image was found under
different URLs. To account for this, the images were stored using a unique code
or image identifier, composed of part of the MD5 checksum of an 864-bit image
signature (in some aspects similar to the one presented in [17]) and part of the
MD5 checksum of the file. This scheme guarantees storing exactly the same file
only once and easily identifying duplicates or near duplicates (accounting for
images in various formats, at different resolutions and with minor modifications
such as some watermarks). The final image identifiers are 16 base64url digits.</p>
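          <p>This identifier scheme can be sketched as follows. This is only an illustration under assumptions: the text specifies the final code (16 base64url digits, i.e. 12 bytes) but not the exact number of bytes taken from each MD5 checksum, so the 6/6 split below is a placeholder.</p>

```python
import base64
import hashlib

def image_identifier(signature: bytes, file_bytes: bytes) -> str:
    """Combine part of the MD5 of an 864-bit perceptual image signature
    with part of the MD5 of the raw file. Identical files always map to
    identical codes, while the signature half groups near-duplicates."""
    sig_md5 = hashlib.md5(signature).digest()
    file_md5 = hashlib.md5(file_bytes).digest()
    # 6 bytes from each digest -> 12 bytes -> exactly 16 base64url digits.
    return base64.urlsafe_b64encode(sig_md5[:6] + file_md5[:6]).decode("ascii")
```

          <p>Since 12 bytes are a multiple of 3, the base64url encoding produces exactly 16 digits with no padding.</p>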
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Image Subset Selection</title>
<p>Even though the set of downloaded images was obtained using all of the words
of the English dictionary, and therefore contains images from practically any
topic, a subset of the images was selected for practical reasons. Basically,
selecting a subset made it possible to provide smaller data files that would not
be so prohibitive for the participants to download and handle. Furthermore, since
the test sets had to be manually labeled, and this could be done only for a
relatively small list of concepts, we could select only the images indexed with
words related to the list of concepts. The size of the training set was chosen to
be 250,000 images, which results in feature vector sets of moderate size that can
be easily handled on current personal computers.</p>
<p>Another reason for selecting a subset was to discard some types of
images. Even though the URLs were obtained from trustworthy search engines,
inevitably there is a certain amount of problematic images that we decided to
remove. Among the problematic images are, for instance, a message saying
“Image removed”, or dummy images some servers send specifically to web crawlers.
Removing this type of image is in itself a difficult problem; however, we noticed
that most of these tended to have many different URLs linking to them, or to be
images that appeared in a large amount of webpages. So the approach to remove
most of these was simply not to include images that had more than N URLs
linking to them or that appeared in more than M webpages. The values of N
and M were set manually from a quick look at the images being considered for
removal.</p>
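        <p>The removal rule amounts to a simple threshold filter, sketched below. The actual values of N and M were chosen by manual inspection and are not given in the text; the defaults here are placeholders.</p>

```python
def remove_problematic(images, max_urls=10, max_pages=50):
    """Drop images with more than max_urls distinct URLs linking to them,
    or appearing in more than max_pages distinct webpages: error banners
    and crawler dummy images tend to exceed these counts."""
    return {
        img: meta for img, meta in images.items()
        if len(meta["urls"]) <= max_urls and len(meta["pages"]) <= max_pages
    }
```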
<p>When crawling the Web, another problem encountered was that an image
could still be reachable although the webpage where it appeared had changed
and no longer included the image, or had been removed without supplying the
proper HTTP 404 code. Resolving this issue was simple due to the requirements
of the dataset: for each image in the dataset there was supposed to be at least
one webpage that contained the image, thus we verified the location of the
images within the webpages. Any image without a corresponding webpage was
simply not considered for inclusion.</p>
<p>
          The image identifier codes guaranteed storing exactly the same image file only
once. However, for this subset selection we also employed a very simple
near-duplicate image removal scheme. This was done using the same image signature
mentioned in the previous section. To reduce the amount of computation required
for duplicate detection, the 864-bit image signatures were first reduced to 128
bits using PCA followed by a random rotation and thresholding [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The
duplicate removal scheme consisted of not including images that had a normalized
Hamming distance lower than 0.1 to any of the images already in the subset.
        </p>
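        <p>A minimal sketch of this greedy near-duplicate filter, assuming the reduced 128-bit signatures are held as Python integers (the PCA, random rotation and thresholding step that produces them is omitted):</p>

```python
def hamming_norm(a: int, b: int, nbits: int = 128) -> float:
    """Normalized Hamming distance between two nbits-bit hashes."""
    return bin(a ^ b).count("1") / nbits

def select_non_duplicates(hashes, threshold=0.1):
    """Keep an image only if its 128-bit hash is at normalized Hamming
    distance >= threshold from every hash already accepted."""
    kept = []
    for h in hashes:
        if all(hamming_norm(h, k) >= threshold for k in kept):
            kept.append(h)
    return kept
```

        <p>With a 0.1 threshold, two 128-bit hashes must differ in at least 13 bits for both images to be kept.</p>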
<p>The final selection of the 250,000 images was based on a list of 158 concepts
that were manually defined. This list included all of the concepts of the test
sets of both subtasks described in Section 3. A small set of 3,000 images was
first manually labeled using 115 of the concepts. These are the same images
and ground truth labels used as development and test sets in subtask 2. Then,
for each concept (defined by the concept words and their synonyms) and for
co-occurrences of concepts in the labeled set, we retrieved ranked lists of images,
using the query results from the search engines and also querying our own image
index generated from the downloaded webpages. The lists were sorted by rank,
and the first 250,000 images were the ones selected for the final dataset.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Available Data</title>
<p>This dataset was made available under a Creative Commons license. However,
since the data was gathered from the Internet, and the original copyright
conditions are difficult to determine automatically, only the feature vectors were
distributed2. Nonetheless, as is commonly done on image search engines,
thumbnails of the images could be obtained from a web server by using the image
identifiers3.</p>
        <p>For each of the 250,000 training images, both the textual and visual features
described next were available. For the additional 3,000 manually labeled images
used in subtask 2, and the 25,000 Flickr images used in subtask 1, only the
visual features were included.</p>
<p>Textual Features: Four sets of textual features were extracted. The first is
the list of the words used to find the image when querying the search engines,
along with the rank position of the image in the respective query and the search
engine it was found on. The second set of textual features consists of the image
URLs as referenced in the webpages they appeared in; in many cases image URLs
tend to be formed with words that relate to the content of the image, which is
why they can also be useful as textual features. The other two textual feature
sets correspond to text extracted from the webpages near the position of the
image; the difference between these two sets is the amount of preprocessing.</p>
<p>To extract the text near the image, we first converted the webpages to valid
XML, to ease processing, and removed the script and style elements. The text
considered close was the webpage title and all terms within a word distance of
600 from the image, not counting the HTML tags and attributes.</p>
        <sec id="sec-2-3-1">
          <title>2 http://risenet.iti.upv.es/webupv250k 3 http://risenet.iti.upv.es/db/img/{IID}.jpg</title>
<p>The first level of processing of the features included this raw text, although
some types of terms, such as words with non-Latin characters, were converted to
a special symbol. In these features, the position of the image was also indicated,
and the words replaced by a special symbol serve to preserve word distances.</p>
<p>For the next level of processing, a weight s(t_n) was assigned to each of the
words near the image, defined as</p>
          <p>s(t_n) = (1 / Σ_{∀t∈T} s(t)) Σ_{∀t_{n,m}∈T} F_{n,m} sigm(d_{n,m}) ,   (1)</p>
          <p>where t_{n,m} are each of the appearances of the term t_n in the document T,
F_{n,m} is a factor depending on the DOM (e.g. title, alt, etc.), similar to what
is done in the work of La Cascia et al. [4], and d_{n,m} is the word distance from
t_{n,m} to the image. The sigmoid function was centered at 35, had a slope of
0.15, and minimum and maximum values of 1 and 10, respectively. The resulting
features include, for each image, at most the 100 word-score pairs with the
highest scores.</p>
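          <p>The weighting of Eq. (1) can be sketched as follows. One assumption to note: the text does not state whether the sigmoid increases or decreases with word distance; we take it to decrease, so that words closer to the image weigh more (from 10 near the image down to 1 far away).</p>

```python
import math

def sigm(d, center=35.0, slope=0.15, lo=1.0, hi=10.0):
    """Sigmoid distance weight: centered at 35, slope 0.15, bounded
    between 1 and 10, decreasing with word distance d (assumed)."""
    return lo + (hi - lo) / (1.0 + math.exp(slope * (d - center)))

def term_scores(occurrences):
    """occurrences: {term: [(F, d), ...]}, with the DOM factor F and the
    word distance d for each appearance of the term in the document.
    Returns scores normalized to sum to 1, as in Eq. (1)."""
    raw = {t: sum(F * sigm(d) for F, d in occs)
           for t, occs in occurrences.items()}
    total = sum(raw.values())
    return {t: s / total for t, s in raw.items()}
```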
<p>Visual Features: We made available seven types of features extracted from
the images. As preprocessing, we filtered the images and resized them so that
the width and height were at most 240 pixels while preserving the original aspect
ratio. The first feature set consisted of 576-dimensional color histograms extracted
using our own implementation. The second set of features was GIST [7]. Another
four feature sets were obtained using the colorDescriptors software [8]: we
computed features for SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT. As
configuration we used dense sampling with default parameters, and hard assignment
to codebooks of 1,000 and 10,000 words using a spatial pyramid of 1 × 1 and
2 × 2 [5]. Since the vectors of the spatial pyramid were concatenated, this resulted
in 5,000-dimensional and 50,000-dimensional feature vectors, respectively. Keeping
only the first fifth of the dimensions would be equivalent to not using the spatial
pyramid. The codebooks were generated using 1.25 million randomly selected
features and the k-means algorithm.</p>
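          <p>To make the dimensionalities concrete, here is a sketch (not the colorDescriptors implementation) of how concatenating a 1 × 1 and a 2 × 2 spatial pyramid of bag-of-words histograms yields 5 × 1,000 = 5,000 dimensions, with the first 1,000 being the pyramid-free histogram:</p>

```python
def spatial_pyramid_bow(words, xs, ys, width, height, vocab=1000):
    """words: codebook assignment of each local descriptor; (xs, ys): its
    keypoint coordinates, with 0 <= x < width and 0 <= y < height.
    Builds 5 histograms (whole image plus four 2x2 cells), concatenated."""
    hist = [0.0] * (5 * vocab)
    for w, x, y in zip(words, xs, ys):
        hist[w] += 1                                     # 1x1 level
        cell = 2 * (2 * y // height) + (2 * x // width)  # 2x2 cell, 0..3
        hist[(1 + cell) * vocab + w] += 1
    return hist
```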
<p>Individual features (i.e. without using a codebook) were also made available
for SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT, as well as for the seventh
feature type, SURF, extracted using the TOP-SURF software [9]. In this case the
preprocessing was slightly different, since these were extracted exactly as for the
ImageCLEF 2012 Flickr Photo Annotation and Retrieval Task [10], to ease
participation in both tasks. The images were filtered with the catrom filter and
resized to 256 × 256 pixels, ignoring the original aspect ratio.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Task Description</title>
<p>As commented before, a very large amount of images can be cheaply gathered
from the Web and, furthermore, text associated with the images can be obtained
from the webpages that contain them. However, the degree of relationship
between the surrounding text and the image varies greatly, so this data can be
considered very noisy. Moreover, the webpages can be in any language, or even a
mixture of languages, and they tend to contain many writing mistakes. The goal
of this task is to evaluate different strategies for dealing with noisy data so that
it can be reliably used for annotating images from practically any topic.</p>
      <p>[Fig. 1: (a) Images from a Web search query of “rainbow”. (b) Images from a Web search query of “sun”.]</p>
<p>To illustrate the objective of the task, consider for example that we searched
for the word “rainbow” in a popular image search engine. It would be expected
that many results would be landscapes in which a rainbow is visible in the sky.
However, other types of images will also appear; see Figure 1a. The images will
be related to the query in different senses, and there might even be images that
do not have any apparent relationship. In the example of Figure 1a, one image
is a text page of a poem about a rainbow, and another is a photograph of an
old cave painting of a rainbow serpent. See Figure 1b for a similar example for
the query “sun”. As can be observed, the data is noisy, although it has the
advantage of covering the different senses that a word can have.</p>
      <p>Based on these observations, an interesting research topic would be: how
to use and handle the automatically retrieved noisy Web data to complement
the manually labeled training data and obtain a better performing annotation
system than when using the manually labeled data alone. On the other hand,
since the Web data can easily be obtained for any topic, another research topic
would be: how to use the noisy Web data to develop an annotation system with a
somewhat unbounded list of concepts, using only automatically retrieved image
and textual Web data.</p>
<p>Both of the research topics just mentioned have been addressed in two
separate subtasks.</p>
      <sec id="sec-3-1">
        <title>Subtask 1: Complementing Manually Annotated Data</title>
<p>In this subtask the list of concepts and the test samples were exactly the same
as the ones used in the ImageCLEF 2012 Flickr Photo Annotation subtask [10].
The ImageCLEF 2012 Flickr dataset consisted of training and test sets of
15,000 and 10,000 images, respectively, which were manually labeled by means of
crowdsourcing using a list of 94 concepts. For further details on this dataset, the
reader should refer to the overview paper of the ImageCLEF 2012 Flickr Photo
Annotation and Retrieval Task [10].</p>
<p>In this subtask, the participants had both the Flickr and the Web training
datasets available for developing their annotation systems. The objective was to
develop techniques that take advantage of the Web data, trying to obtain better
concept annotation performance in comparison to using only the manually
annotated Flickr data. The participants had to submit results using only the
Flickr dataset as training, and using both the Flickr and Web datasets.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Subtask 2: Scalable Concept Image Annotation</title>
<p>In this subtask, the objective was to develop systems that could easily change
or scale the list of concepts used for image annotation. In other words, the list
of concepts is also considered an input to the system. Thus, when given an input
image and a list of concepts, the system's job is to give a score to each of the
concepts in the list and to decide how many of them, and which, to assign as
annotations. To observe this scalable characteristic of the systems, the list of
concepts was different for the development and test sets, and the participants
only had the ground truth annotations available for the development set.</p>
        <p>
          The idea was that the participants use the 250,000 images of the Web
training set, including the visual and textual features (see Section 2), to develop
and estimate the models for image annotation. It was not permitted to use any
manually annotated data, such as the Flickr training set. However, the use of
other additional language resources, such as language models, language
detectors, stemmers, WordNet [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], spell checkers, etc., was permitted and encouraged.
        </p>
<p>The development set consisted of 1,000 images annotated for 95 concepts,
and the test set of 2,000 images annotated for 105 concepts, of which 85 were
common to the development set, i.e. 10 concepts were removed and 20 were
added. The list of concepts and the number of images for each can be observed
in Table 1. So that the Web training set would be the same as for subtask 1, for
this first edition of the task the list of concepts overlapped considerably with
the concepts of the Flickr annotation task.</p>
<p>For this subtask, so that there could be a reference performance that could
also serve as a starting point, a toolkit was supplied to the participants. This
toolkit included software that computed the evaluation measures (see Section 4.1),
and implementations of two baselines. The first baseline was a simple random
assignment, which is important since any system that performs worse than the
random baseline is effectively doing nothing. The second baseline was a
co-occurrence technique for this image annotation task, which obviously gives better
performance than random, although it was simple enough to give the participants a
wide margin for improvement. In this technique, when given an input image,
its K = 32 nearest images from the training set are obtained, using only the
1,000 bag-of-words C-SIFT visual features and the L1 norm. Then, the textual
features corresponding to these K nearest images are used to derive a score for
each of the concepts. This is done by using a concept-word co-occurrence matrix
estimated from all of the training set textual features. In order to make the
vocabulary size more manageable, the textual features are first processed keeping
only the words from the English dictionary. Finally, for the selection of concepts
for annotation, for all input images the first 5 ranked concepts are always chosen
as annotations.</p>
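        <p>The co-occurrence baseline just described can be sketched as follows. This is a toy reconstruction, not the supplied toolkit code; the co-occurrence matrix is assumed to be a nested dict cooc[concept][word].</p>

```python
import heapq

def l1(a, b):
    """L1 distance between two bag-of-words feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def annotate(query, train_feats, train_words, cooc, concepts, k=32, top=5):
    """Score each concept from the textual features of the K nearest
    training images (L1 distance on bag-of-words visual features), via a
    concept-word co-occurrence matrix; always pick the top-5 concepts."""
    nearest = heapq.nsmallest(k, range(len(train_feats)),
                              key=lambda i: l1(query, train_feats[i]))
    pooled = [w for i in nearest for w in train_words[i]]
    scores = {c: sum(cooc.get(c, {}).get(w, 0) for w in pooled)
              for c in concepts}
    ranked = sorted(concepts, key=scores.get, reverse=True)
    return scores, ranked[:top]
```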
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation Results</title>
      <sec id="sec-4-1">
        <title>Performance Measures</title>
        <p>The participants were asked to submit the results in the following way. For each
image to annotate, a score had to be given for every one of the concepts in the
list and also indicate which concepts had finally been selected as annotations.</p>
<p>Two basic performance measures have been used for comparing the results of
the different submissions: the Average Precision (AP) and the F-measure (F1).
The AP only takes into account the scores assigned to the concepts and ignores
the decisions on the selected annotations. On the other hand, the F1 only
considers the selected annotations.</p>
<p>The AP is algebraically defined as</p>
        <p>AP = (1/|K|) Σ_{k=1}^{|K|} k / rank(k) ,   (2)</p>
        <p>where K is the ordered set of the ground truth annotations, being the order
induced by the annotation scores, and rank(k) is the order position of the k-th
ground truth annotation. The fraction k/rank(k) is actually the precision at the
k-th ground truth annotation, and has been written like this to be explicit on
the way it is computed. In the cases that there are ties in the scores, a random
permutation is applied within the ties.</p>
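        <p>Eq. (2) translates almost directly into code. A sketch, assuming distinct scores (the random tie-breaking permutation is left out):</p>

```python
def average_precision(scores, relevant):
    """AP = (1/|K|) * sum over ground-truth items of k / rank(k), where
    rank(k) is the position of the k-th ground-truth item once all items
    are sorted by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = sorted(order.index(i) + 1 for i in relevant)
    return sum(k / r for k, r in enumerate(ranks, start=1)) / len(ranks)
```

        <p>For instance, if the two ground-truth items receive the two highest scores, every fraction k/rank(k) equals 1 and the AP is 1.</p>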
        <p>In the context of image annotation, the AP can be estimated from two
different perspectives, one being concept-based and the other example-based. In the
former, one AP is computed for each concept, and in the latter one AP is
computed for each image to annotate. Which of these is more correct to use actually
depends on exactly what the scores are. If the scores for example relate to the
probability that the concept is present for a given image, and the comparison
between scores for different images is not clearly defined, then the concept-based
AP does not make sense and will probably not be a good indicator of the
performance of the system. On the other hand, in this case the example-based AP
will be a good indicator of the performance of the system. In the instructions
given to the participants, this was not clearly explained, however, for all of the
submissions, the scores seemed to be image based. Therefore, in this paper we
present results only for the example-based AP. Finally, to obtain a global
performance measure of the systems, we have taken the arithmetic mean, in which
case it is known as the Mean Average Precision (MAP).</p>
        <p>The other performance measure used, the F1, is defined as</p>
<p>F1 = 2PR / (P + R) ,
where P is the precision and R is the recall. Again this measure can also be
estimated from the concept-based and the example-based perspectives. In this
case both approaches are adequate and serve to analyze different aspects. For
the example-based F1, as a global system performance measure the arithmetic
mean is used, thus obtaining a mean F-measure (MF1). On the other hand, the
concept-based F1 is used to analyze the behavior for different concepts.</p>
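        <p>A sketch of the example-based MF1 computation, with predicted and ground-truth annotations per image held as sets (the concept-based F1 would instead iterate over concepts, comparing the sets of images annotated with each):</p>

```python
def f1(tp, fp, fn):
    """F1 = 2PR / (P + R), written via counts; 0 when both sets are empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def example_based_mf1(predicted, truth):
    """One F1 per image, comparing its predicted and ground-truth
    concept sets, then the arithmetic mean over all images (MF1)."""
    per_image = [
        f1(len(p & t), len(p - t), len(t - p))
        for p, t in zip(predicted, truth)
    ]
    return sum(per_image) / len(per_image)
```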
<p>Other performance measures were computed and analyzed; however, for the
received submissions they do not reveal any important details that are not already
observed with the previously mentioned measures. Therefore, for simplicity, we
do not include them in this paper. These other measures were the AP using
the geometric mean and the interpolated versions, i.e. the Geometric Mean
Average Precision (GMAP), the Interpolated Average Precision (IAP), the Mean
Interpolated Average Precision (MIAP) and the Geometric Mean Interpolated
Average Precision (GMIAP).</p>
      </sec>
      <sec id="sec-4-2">
        <title>Participation</title>
<p>In total, 47 groups registered for the task and signed the license agreement,
and therefore had access to download the datasets. Unfortunately, in the end the
participation was considerably low. For subtask 1 we received 15 runs from three
groups, and for subtask 2 we received 10 runs from one group. Also, one of the
groups that submitted results for subtask 1 said that they had made a mistake
and did not intend to participate in the task.</p>
        <p>
          KIDS-NUTN: The Knowledge, Information, and Database System
Laboratory (KIDS-NUTN), from the National University of Tainan [
          <xref ref-type="bibr" rid="ref1">1</xref>
] submitted in
total 9 runs. All of the runs were for subtask 1: 5 using only the Flickr
training set, and the other 4 using both the Flickr and Web training sets. They
used a combination of several visual feature types, namely AutoColorCorrelogram,
ColorLayout, FCTH, Gabor, GIST, and ROI background, and as textual
features they used the EXIF data. For the annotation they used Random Forests,
and for comparison they also tried the Multiple Bernoulli Relevance
Models (MBRM) as a baseline. For further details, please refer to [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
<p>ISI: The Intelligent Systems and Informatics Laboratory (ISI), from the
University of Tokyo [13] submitted in total 20 runs. Half of the runs were for
subtask 1: 6 using only the Flickr training set, and the other 4 using both
the Flickr and Web training sets. Of the other 10 runs, for subtask 2, half
correspond to the development set and the other half to the test set. Their
effort was targeted at making the system scalable, so for annotation they used
the Passive-Aggressive with Averaged Pairwise Loss (PAAPL) [12], an
online learning method they propose for multiclass multilabel classification using
a linear model. As visual features they used the provided *SIFT features, and to
tackle the Web data they labeled it artificially by looking at the textual features:
if a word that defined a concept appeared, then that concept was assumed to
be present. Images that did not have any concept were simply discarded. In
subtask 1, they first tried learning models for the Flickr and Web data separately
and combining the results, and second they tried learning the models by merging
all of the data. All of their submissions used the latter approach, since
during development it was the one that performed best. The difference between
submissions is simply the combination of visual features. In subtask 2, they again
labeled the training data artificially for learning, and the submissions differ
in the combination of visual features. For further details, please refer to [13].</p>
      </sec>
      <sec id="sec-4-3">
        <title>Results and Discussion for Subtask 1</title>
<p>The results for subtask 1, given by the example-based MAP and the MF1, are
presented in Tables 2a and 2b. The first table includes the best result of each
group when using only the manually annotated Flickr data as training, and
a baseline which randomly assigns scores to the concepts and randomly selects
the top N as annotations. The second table includes results for all of the
submissions using both the Flickr and Web training data.</p>
        <p>As can be observed in the tables, all of the results using both the Flickr and
Web training data have a worse performance than when using only Flickr data.
In the case of KIDS-NUTN, the difference between using or not using the Web
data is not very high. When we inquired about these results, they answered
that the textual data of the Web dataset did not help, although they did not
give us a clearer explanation of how they arrived at this conclusion or of exactly
what the submissions correspond to. Moreover, they said that they were not able
to dedicate much time to the problem.</p>
<p>Regarding the results of ISI, the difference between using or not using the Web
data is considerably high. In fact, it seems that during development [13] they
obtained better results, and these did not generalize to the test set. For the
Flickr task they obtained an MF1 higher than 0.5 both during development and
during test. However, using both the Flickr and Web training data they obtained
an MF1 on the order of 0.48 during development, in contrast to the 0.18 they
obtained for the test set. This suggests that possibly there was some mistake, or
something was different in the models used for annotating the test set. In fact,
in subtask 2 (see Section 4.4) they obtain better results, even though they used
the same technique and the problem is harder, since only Web data can be used
for training and the random baseline is lower.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Results and Discussion for Subtask 2</title>
<p>Tables 3a and 3b present the results for the example-based MAP and MF1 for
the development and test sets, respectively. The tables include the results for the
submitted runs and two baselines: assigning random scores and selecting the
random top N concepts per image, and the co-occurrence baseline described
in Section 3.2. The first thing to note is that the results for the development set
generalize well to the test set, unlike what was observed in the ISI results for
subtask 1. The second thing to note is that the submitted runs perform
considerably better than the supplied co-occurrence baseline. This is a great
achievement, even though the co-occurrence baseline is rather simple.
Unfortunately, there was only one participant, so there was not much competition,
and it is definitely not possible to say that this level of performance is more or
less what can be achieved using Web data as training, so that it could be
compared to using labeled data as training. Furthermore, the ISI system can also be
considered a relatively simple technique. It does seem that their proposed
PAAPL technique is able to learn from the data despite the large amount of noise
it contains. However, their use of the textual features is extremely simple, only
searching for exactly the words that define the concepts. They did not use
synonym information, stemming, WordNet, or any other resources that could be
quite useful, so the performance could surely be improved.</p>
        <p>[Tables 4a and 4b: concepts categorized by the range of their
concept-based F1, for the best runs using (4a) manually labeled data (ISI
1424), and (4b) automatically gathered Web data (ISI 1411).]</p>
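        <p>As a rough illustration of the example-based MF1 reported in these tables (a sketch of the standard definition, not the official evaluation code; all names are ours), the F1 between each image's predicted and ground-truth concept sets is averaged over images:</p>

```python
def example_based_mf1(predictions, ground_truth):
    """Mean example-based F1: average, over images, of the F1 between
    the predicted and ground-truth concept sets of each image."""
    f1s = []
    for pred, gold in zip(predictions, ground_truth):
        pred, gold = set(pred), set(gold)
        if not pred and not gold:
            f1s.append(1.0)  # image with no concepts at all: perfect match
            continue
        tp = len(pred.intersection(gold))
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        f1s.append(f1)
    return sum(f1s) / len(f1s)

# e.g. two images with partially correct annotations
score = example_based_mf1([["dog", "grass"], ["sky"]],
                          [["dog", "outdoor"], ["sky", "sun"]])
print(score)  # mean of 0.5 and 2/3
```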
        <p>Even though the test sets of subtasks 1 and 2 differ in difficulty, image quality,
number of concepts, etc., if we dare to compare the results of image annotation
using manually labeled data (see Table 2a) with those using automatically gathered
data (see Table 3b), the performance is lower for the latter. This is to be
expected, since learning with Web data is considerably more challenging. The real
objective is to observe how much can be achieved using the Web data. Ultimately,
in practice, one or the other or a combination of both approaches will be better in
a given circumstance. It is also understandable
that an annotation system will work better for some concepts than for others.
Tables 4a and 4b list the concepts categorized by the range of their
concept-based F1, for the best system in subtasks 1 and 2, respectively. Here, it
can be observed that for some concepts the Web data performs rather well, and
in general it does not look too bad with respect to the results using manually
labeled data. The same thing can be observed in Tables 5a and 5b, which show
the top performing concepts according to the relative improvement4 with respect
to the random baseline.</p>
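        <p>The relative improvement used for Tables 5a and 5b can be sketched as follows (a minimal illustration of the footnoted definition; the function name is ours):</p>

```python
def relative_improvement(system_score, baseline_score, perfect_score=1.0):
    """Absolute improvement over the baseline, normalized by the room
    left between the baseline and the perfect score (1 for F1)."""
    return (system_score - baseline_score) / (perfect_score - baseline_score)

# e.g. a concept F1 of 0.4 against a random baseline of 0.1:
print(relative_improvement(0.4, 0.1))  # 0.3 / 0.9, i.e. one third
```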
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The ImageCLEF 2012 Scalable Image Annotation Using General Web Data task
proposed two subtasks. The overall objective was to take advantage of automatically
gathered image and textual Web data for training, in order to develop more
scalable image annotation systems. In the first subtask, the participants could use
both manually labeled data and automatically gathered Web data to develop their
annotation systems. In this subtask, none of the participants were
able to use the Web data to obtain a better performance than when using only
manually labeled data. The participation was extremely low, with only
three groups, and it seemed that they were not able to invest much time in the
problem. Due to this, few conclusions can be drawn from the results. Nonetheless,
it certainly cannot be stated that the Web data is simply not useful, since in
subtask 2 the results were somewhat positive, suggesting that good results could
also be achieved in subtask 1.
(4: Relative improvement is defined as the absolute improvement divided by the
difference between the baseline performance and perfect performance, which for
F1 is 1.)</p>
      <p>Subtask 2 consisted in using only automatically gathered Web data, and
possibly additional external language resources, to develop a more scalable image
annotation system. A special characteristic was that the list of concepts was
different for development than for test. In this subtask the participation was also
low, with only one group taking part. However, the obtained results were especially
interesting. The submissions obtained a considerably better performance than
the two provided baselines, and the results generalized well to the test set,
despite the change of concept list. Furthermore, the participant's system was
specifically targeted at scalability, using an online learning method adequate for
this type of problem, thus fulfilling the initial objective. On the other hand, the
processing of the textual data can only be considered very basic,
suggesting that a much better performance could be achieved.</p>
      <p>Another interesting aspect of the results of subtask 2 was that, when analyzed
on a per-concept basis, in some cases the performance was comparable to that of
good annotation systems learned using manually labeled data. Therefore, for some
concepts the Web data is considerably effective. This also suggests that one
possible way to address what was proposed in subtask 1 is to perform some type of
fusion per concept.</p>
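      <p>Such a per-concept fusion could be sketched as follows (purely illustrative, under the assumption that per-concept development F1 scores are available for both training sources; all names are ours):</p>

```python
def per_concept_fusion(scores_manual, scores_web, dev_f1_manual, dev_f1_web):
    """For each concept, keep the annotation scores of whichever training
    source (manually labeled or Web) scored higher on development data."""
    fused = {}
    for concept in scores_manual:
        if dev_f1_web.get(concept, 0.0) > dev_f1_manual.get(concept, 0.0):
            fused[concept] = scores_web[concept]
        else:
            fused[concept] = scores_manual[concept]
    return fused

# e.g. Web data was better for "nebula" on development, so its score is kept
fused = per_concept_fusion({"dog": 0.9, "nebula": 0.2},
                           {"dog": 0.4, "nebula": 0.8},
                           {"dog": 0.7, "nebula": 0.1},
                           {"dog": 0.3, "nebula": 0.5})
print(fused)  # {'dog': 0.9, 'nebula': 0.8}
```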
      <p>Since the participation was low and the results were positive, it would be
interesting to repeat this benchmark, making a greater effort to get more
groups to participate. However, even though it is believed that good results
could be achieved in subtask 1, it is not so interesting from a scalability point of
view. For a future edition it could be changed slightly: for example, for the
concepts where manually labeled data is available, the annotation systems
could use a combination of manual and automatically gathered data, while
otherwise only automatically gathered data would be used.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank the CLEF campaign for supporting the ImageCLEF
initiative. Work supported by the Spanish MICINN under the MIPRCV Consolider
Ingenio 2010 program (CSD2007-00018) and by the Generalitat Valenciana
under grant Prometeo/2009/014.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chien</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>G.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaou</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ku</surname>
            ,
            <given-names>C.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          :
          <article-title>KIDSNUTN at ImageCLEF 2012 Photo Annotation and Retrieval Task</article-title>
          . In:
          <article-title>CLEF 2012 working notes</article-title>
          . Rome, Italy (
          <year>2012</year>
          )
          <fpage>10</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Fellbaum</surname>
          </string-name>
          , C. (ed.):
          <article-title>WordNet An Electronic Lexical Database</article-title>
          . The MIT Press, Cambridge, MA; London (May
          <year>1998</year>
          )
          <fpage>7</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Iterative quantization: A procrustean approach to learning binary codes</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>2011 IEEE Conference on</source>
          . pp.
          <fpage>817</fpage>
          -
          <lpage>824</lpage>
          (June
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>