<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MIL at ImageCLEF 2013: Scalable System for Image Annotation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Masatoshi Hidaka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naoyuki Gunji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tatsuya Harada</string-name>
          <email>haradag@mi.t.u-tokyo.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Machine Intelligence Lab., The University of Tokyo</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>We give details of our methods in the ImageCLEF 2013 Scalable Concept Image Annotation task. For the textual feature, we propose a method for selecting text closely related to an image from its webpage. In addition, to consider the meaning of the concept, we propose to use WordNet for getting words related to the concept. For visual features, we use Fisher Vector (FV), which is regarded as an extension of the Bag-of-Visual-Words representation. We trained linear classifiers by Passive-Aggressive with Averaged Pairwise Loss (PAAPL), an online multilabel learning method based on Passive-Aggressive. Since PAAPL is computationally efficient and able to cope with multilabel data appropriately, it is suitable for this task. Results show that our annotation pipeline is simple but works well in this task.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF</kwd>
        <kwd>Textual Feature</kwd>
        <kwd>WordNet</kwd>
        <kwd>Annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In ImageCLEF 2013 Scalable Concept Image Annotation, our task is multi-label
image annotation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The dataset is extracted from general webpages, so the cost of
collecting data is low [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, the collected images have no explicit
labels, so we need to extract the correct labels for each image from its
webpage. The simplest solution is to assign to an image every concept label
that appears in its webpage. However, this method often fails to get correct
labels because it does not consider the meanings of concepts. Furthermore, the
text of a webpage is not necessarily related to its images. We therefore try
several methods to obtain more accurate labels. First, we use information from
WordNet [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to get words related to the concepts. In
addition, we limit the range of text extraction to omit text not related
to the images. For visual features, we adopt the Fisher Vector (FV) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which improves on the
Bag-of-Visual-Words (BoVW) representation. We use a linear classifier for
each concept because linear classifiers are computationally efficient and
suitable for large-scale data. Because the labels assigned to the training
images are not ground-truth labels, they must be regarded as noisy. Therefore,
we pay attention to robustness against noise in the training data. To train
the linear classifiers, we use PAAPL [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], an online multilabel learning method based on
Passive–Aggressive (PA) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. PAAPL converges faster than PA and shares
PA's robustness to noise.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Feature Extraction</title>
      <p>Visual Feature. As a visual feature, we use the Fisher Vector (FV). Because it achieves
good classification performance with a linear classifier, it is often used for
large-scale visual categorization. Indeed, in the ImageNet Large-Scale Visual
Recognition Challenge 2012 (ILSVRC2012), four out of seven teams used FVs
to represent images. We use four local descriptors: SIFT, C-SIFT, LBP, and
GIST. GIST is usually used to describe a whole image, but we use it
as a local descriptor. Local descriptors are densely extracted from patches at
five scales on a regular grid with a step of six pixels. All local descriptors
are reduced to 64 dimensions using Principal Component Analysis (PCA). We then
learn a Gaussian Mixture Model (GMM) with 256 components, each of which has a
diagonal covariance matrix. To use spatial information, we divide images into
1 × 1, 2 × 2, and 3 × 1 cells. Then FVs are calculated over each region as follows.</p>
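      <p>Let <inline-formula><tex-math>X = \{x_1, x_2, \ldots, x_N\}</tex-math></inline-formula> be a set of <inline-formula><tex-math>N</tex-math></inline-formula> local descriptors extracted from an
image, and let <inline-formula><tex-math>w_i</tex-math></inline-formula>, <inline-formula><tex-math>\mu_i</tex-math></inline-formula>, <inline-formula><tex-math>\Sigma_i</tex-math></inline-formula> be the mixture weight, mean vector, and covariance matrix of
the i-th Gaussian, respectively. Then we define
        <disp-formula><tex-math>\mathbf{u}_i = \frac{1}{N\sqrt{w_i}} \sum_{n=1}^{N} \gamma_n(i)\, \Sigma_i^{-1/2} (x_n - \mu_i),</tex-math></disp-formula>
        <disp-formula><tex-math>\mathbf{v}_i = \frac{1}{N\sqrt{2 w_i}} \sum_{n=1}^{N} \gamma_n(i) \left[ \Sigma_i^{-1} \operatorname{diag}\left((x_n - \mu_i)(x_n - \mu_i)^{T}\right) - \mathbf{1} \right],</tex-math></disp-formula>
where <inline-formula><tex-math>\mathbf{1}</tex-math></inline-formula> is a column vector whose components are all 1 and <inline-formula><tex-math>\operatorname{diag}(X)</tex-math></inline-formula> for a matrix
<inline-formula><tex-math>X</tex-math></inline-formula> is the column vector composed of the diagonal components of <inline-formula><tex-math>X</tex-math></inline-formula>. <inline-formula><tex-math>\gamma_n(i)</tex-math></inline-formula> is
the soft assignment of <inline-formula><tex-math>x_n</tex-math></inline-formula> to the i-th Gaussian,
        <disp-formula><tex-math>\gamma_n(i) = \frac{w_i u_i(x_n)}{\sum_{j=1}^{K} w_j u_j(x_n)},</tex-math></disp-formula>
where <inline-formula><tex-math>u_i(\cdot)</tex-math></inline-formula> is the i-th Gaussian (not to be confused with the vector <inline-formula><tex-math>\mathbf{u}_i</tex-math></inline-formula>); <inline-formula><tex-math>\gamma_n(i)</tex-math></inline-formula> is also known as the posterior probability.
The FV representation is therefore given as
        <disp-formula><tex-math>G = [\mathbf{u}_1^{T}\ \mathbf{v}_1^{T}\ \cdots\ \mathbf{u}_K^{T}\ \mathbf{v}_K^{T}]^{T},</tex-math></disp-formula>
where <inline-formula><tex-math>K</tex-math></inline-formula> is the number of GMM components.</p>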
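      <p>The encoding defined above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in our system: the function name fisher_vector and the argument layout (diagonal covariances passed as per-dimension variances sigma2) are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import numpy as np

def fisher_vector(X, w, mu, sigma2):
    """Encode descriptors X (N, D) with a diagonal-covariance GMM.

    w: (K,) mixture weights, mu: (K, D) means, sigma2: (K, D) variances.
    Returns G = [u_1, v_1, ..., u_K, v_K] with 2 * K * D components.
    """
    N, _ = X.shape
    diff = X[:, None, :] - mu[None, :, :]                  # (N, K, D)
    # log(w_i * u_i(x_n)) for a diagonal Gaussian u_i
    log_p = (np.log(w)[None, :]
             - 0.5 * np.sum(np.log(2.0 * np.pi * sigma2), axis=1)[None, :]
             - 0.5 * np.sum(diff ** 2 / sigma2[None, :, :], axis=2))
    # soft assignment gamma_n(i), normalised stably in the log domain
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)              # (N, K)

    z = diff / np.sqrt(sigma2)[None, :, :]                 # Sigma_i^{-1/2} (x_n - mu_i)
    u = np.einsum("nk,nkd->kd", gamma, z) / (N * np.sqrt(w)[:, None])
    v = np.einsum("nk,nkd->kd", gamma, z ** 2 - 1.0) / (N * np.sqrt(2.0 * w)[:, None])
    # interleave per component: u_1, v_1, u_2, v_2, ...
    return np.concatenate([u, v], axis=1).ravel()
```
]]></preformat>
      <p>With a diagonal covariance, the term inside the brackets of the second-order statistic reduces to the squared, variance-normalised residual minus one, which is what z ** 2 - 1.0 computes.</p>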
      <p>
        Following [
        <xref ref-type="bibr" rid="ref4">4</xref>
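        ], we apply power normalization and L2 normalization to each of
the extracted FVs. Power normalization is done by applying the function
        <disp-formula><tex-math>g(z) = \operatorname{sign}(z)\,|z|^{a}</tex-math></disp-formula>
to each component of the FVs, where <inline-formula><tex-math>a</tex-math></inline-formula> is a parameter set to 1/2 in this work.
After normalization, we concatenate the FVs of all regions into a single vector.
The dimension of our FVs is 262144 (2 × 64 × 256 = 32768 per region, over 1 + 4 + 3 = 8 regions).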
      </p>
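      <p>As a small illustration of the normalization step, the following sketch applies the power function with a = 1/2 followed by L2 normalization, and checks the stated dimension under the reading that each of the 8 spatial regions contributes 2 × 64 × 256 components. The function name power_l2_normalize is hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np

def power_l2_normalize(fv, a=0.5):
    """Apply g(z) = sign(z) * |z|**a component-wise, then L2-normalise."""
    fv = np.sign(fv) * np.abs(fv) ** a
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv

# Dimension check: 2 * D * K per region, over 1x1 + 2x2 + 3x1 = 8 regions.
D, K = 64, 256
regions = 1 * 1 + 2 * 2 + 3 * 1
total_dim = 2 * D * K * regions   # 262144
```
]]></preformat>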
      <p>[Fig. 1. Label assignment pipeline: the webpage XML is parsed into a set of image-related words T; for a concept C, WordNet provides the collection W_C = {C, synonym(C), hyponym(C)}; label C is assigned if T ∩ W_C ≠ ∅.]</p>
      <sec id="sec-2-6">
        <title>Textual Feature</title>
        <p>
To assign correct labels to images, we take two steps. First we extract text
closely related to an image from its webpage. Then if a concept word exists in
the extracted text, the concept label is assigned to the image. The pipeline is
presented in Fig. 1.</p>
        <p>Text Extraction. To extract text closely related to an image, we consider three
types of text in the webpage: the text around the image, the img tag attributes
(src, alt, title), and the page title. First, we parse the XML file of the webpage
and extract the page title, the text, and the img tag. Then we select some of them
and split them into a set of single words T. For the text around the image, we
consider the distance from the image (the img tag position) because an entire
webpage does not necessarily focus on one image. We use the maximum distance from
the image as a parameter and keep only the words within that distance. To
normalize words, we singularize nouns.</p>
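        <p>A minimal sketch of this windowing step is given below. It assumes the page text has already been split into a list of words in document order and that the position of the img tag within that list is known; the helper names and the crude plural-stripping rule are placeholders for illustration, not the tools we actually used.</p>
        <preformat><![CDATA[
```python
import re

def words_near_image(tokens, img_index, max_distance):
    """Return the words within max_distance positions of the img tag.

    tokens: page text split into words, in document order (assumed layout).
    img_index: the img tag sits just before tokens[img_index].
    """
    start = max(0, img_index - max_distance)
    end = min(len(tokens), img_index + max_distance)
    return tokens[start:end]

def singularize(word):
    """Crude plural stripping; a stand-in for a real noun singularizer."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 2 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def build_word_set(tokens, img_index, max_distance, img_attributes=(), page_title=""):
    """Build the word set T from nearby text, img attributes, and page title."""
    words = list(words_near_image(tokens, img_index, max_distance))
    for text in list(img_attributes) + [page_title]:
        words.extend(re.findall(r"[A-Za-z]+", text))
    return {singularize(w.lower()) for w in words}
```
]]></preformat>
        <p>Setting max_distance to 0 and passing only img_attributes corresponds to using the img tag attributes alone.</p>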
        <p>Label Assignment. To assign labels to an image, we first collect words related
to each concept C given in the task. We denote the set of collected words by
<inline-formula><tex-math>W_C</tex-math></inline-formula>. For <inline-formula><tex-math>W_C</tex-math></inline-formula>, we collect the synonyms and hyponyms of C:
          <disp-formula><tex-math>W_C = \{C, \mathrm{synonym}(C), \mathrm{hyponym}(C)\}.</tex-math></disp-formula>
For example, given the target concept “bird”, we get
          <disp-formula><tex-math>W_{\mathrm{bird}} = \{\mathrm{bird}, \mathrm{parrot}, \mathrm{pigeon}, \ldots\}.</tex-math></disp-formula>
        </p>
        <p>For collecting synonyms and hyponyms, we use WordNet. To make the
implementation simpler, we use no compound words. Hyponyms are hierarchical, so
we collect words of all depths recursively. Words which have multiple meanings
are omitted; a word is judged to have multiple meanings if it appears in
multiple entries in WordNet.</p>
        <p>Then, if the extracted text T contains any of the concept-related words in <inline-formula><tex-math>W_C</tex-math></inline-formula>
(the concept word, its synonyms, and its hyponyms), we assign the concept label C to
the image. Consequently, we obtain a training dataset in which some images
have multiple labels, and some images have no label.</p>
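        <p>The assignment rule can be illustrated with a toy stand-in for WordNet. In the sketch below, the synonym and hyponym dictionaries are hypothetical hand-written data, not WordNet output, and keeping the concept word itself regardless of ambiguity is an assumption of the example.</p>
        <preformat><![CDATA[
```python
def collect_hyponyms(concept, hyponyms):
    """Collect hyponyms of all depths from a parent -> children mapping."""
    found = set()
    stack = [concept]
    while stack:
        for child in hyponyms.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def concept_words(concept, synonyms, hyponyms, ambiguous=frozenset()):
    """Build W_C = {C, synonym(C), hyponym(C)} without compound or ambiguous words."""
    related = set(synonyms.get(concept, ())) | collect_hyponyms(concept, hyponyms)
    kept = {w for w in related
            if " " not in w and "_" not in w and w not in ambiguous}
    # Assumption: the concept word itself is always kept.
    return {concept} | kept

def assign_labels(T, concepts, synonyms, hyponyms, ambiguous=frozenset()):
    """Assign every concept C whose word set W_C intersects the extracted words T."""
    return {c for c in concepts if T & concept_words(c, synonyms, hyponyms, ambiguous)}

# Toy data (hypothetical): "macaw" is a hyponym of "parrot", itself a hyponym of "bird".
toy_hyponyms = {"bird": ["parrot", "pigeon"], "parrot": ["macaw"]}
toy_synonyms = {"car": ["automobile"]}
labels = assign_labels({"macaw", "garden"}, ["bird", "car"], toy_synonyms, toy_hyponyms)
```
]]></preformat>
        <p>Here labels contains only "bird", because "macaw" is reached through the two-level hyponym chain while nothing in T matches the word set of "car".</p>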
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Multilabel Annotation</title>
      <p>In this section, we describe how we train the classifiers and annotate the
test images. For scalability, we use a linear classifier for each concept label.
With linear classifiers, a test image is annotated by computing the score of each
label as the inner product of the visual feature and the weight vector of that
label, and assigning the 5 top-scoring labels.</p>
      <p>To learn the model for each concept label from diverse images, the learning
method must not only scale with the amount of data and estimate labels accurately,
but also tolerate noise.</p>
      <p>For that reason, we use Passive–Aggressive with Averaged Pairwise Loss
(PAAPL). PAAPL is based on the Passive–Aggressive (PA) method, which is known
to be tolerant of noise in training samples.</p>
      <p>First, we describe the model update rule of PA.</p>
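      <p>Given the t-th training sample, we denote the visual feature by <inline-formula><tex-math>f_t</tex-math></inline-formula>, the set
of concept labels assigned to the sample by <inline-formula><tex-math>Y_t</tex-math></inline-formula>, the set of concept labels not
assigned to the sample by <inline-formula><tex-math>\bar{Y}_t</tex-math></inline-formula>, and the current model (weight vector) corresponding
to concept label C by <inline-formula><tex-math>\mu_t^{C}</tex-math></inline-formula>. In our setting, the dimension of <inline-formula><tex-math>f_t</tex-math></inline-formula> is 262144 (Fisher
Vector) + 1 (bias).</p>
      <p>1. Fetch the t-th training sample and compute scores for each label using the
current models.</p>
      <p>2. Find a label <inline-formula><tex-math>r_t \in Y_t</tex-math></inline-formula> associated with the sample and a label <inline-formula><tex-math>s_t \in \bar{Y}_t</tex-math></inline-formula> not
associated with the sample as follows.
        <disp-formula><tex-math>r_t = \operatorname*{arg\,min}_{r \in Y_t} \mu_t^{r} \cdot f_t</tex-math></disp-formula>
        <disp-formula><tex-math>s_t = \operatorname*{arg\,max}_{s \in \bar{Y}_t} \mu_t^{s} \cdot f_t</tex-math></disp-formula>
Given these labels, compute the hinge loss <inline-formula><tex-math>l</tex-math></inline-formula> from the current model. The
hinge loss is given as
        <disp-formula><tex-math><![CDATA[l(\mu_t^{r_t}, \mu_t^{s_t}; (f_t, Y_t)) = \begin{cases} 0 & \mu_t^{r_t} \cdot f_t - \mu_t^{s_t} \cdot f_t > 1 \\ 1 - (\mu_t^{r_t} \cdot f_t - \mu_t^{s_t} \cdot f_t) & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
      </p>
      <p>3. Update the models with the update rule below.
        <disp-formula><tex-math>\mu_{t+1}^{r_t} = \mu_t^{r_t} + \frac{l}{2|f_t|^2 + \frac{1}{D}} f_t</tex-math></disp-formula>
        <disp-formula><tex-math>\mu_{t+1}^{s_t} = \mu_t^{s_t} - \frac{l}{2|f_t|^2 + \frac{1}{D}} f_t</tex-math></disp-formula>
<inline-formula><tex-math>D</tex-math></inline-formula> is a parameter which controls the sensitivity to label prediction mistakes.</p>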
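      <p>The three PA steps can be sketched as follows. This is a minimal NumPy illustration, assuming a step size of l / (2|f_t|^2 + 1/D) and one weight row per concept label; the function name pa_update and the in-place matrix layout are choices made for the example rather than a description of our code.</p>
      <preformat><![CDATA[
```python
import numpy as np

def pa_update(M, f, positives, D=1.0):
    """One Passive-Aggressive step on a sample, updating M in place.

    M: (num_labels, dim) float weight matrix, one row per concept label.
    f: (dim,) feature vector; positives: indices of the labels in Y_t.
    Returns the hinge loss measured before the update.
    """
    scores = M @ f                                   # step 1: score every label
    pos = np.asarray(sorted(positives), dtype=int)
    neg = np.setdiff1d(np.arange(M.shape[0]), pos)
    if pos.size == 0 or neg.size == 0:
        return 0.0                                   # no (r, s) pair exists
    r = pos[np.argmin(scores[pos])]                  # lowest-scoring assigned label
    s = neg[np.argmax(scores[neg])]                  # highest-scoring unassigned label
    loss = max(0.0, 1.0 - (scores[r] - scores[s]))   # step 2: hinge loss
    if loss > 0.0:                                   # step 3: move the pair apart
        tau = loss / (2.0 * (f @ f) + 1.0 / D)
        M[r] += tau * f
        M[s] -= tau * f
    return loss
```
]]></preformat>
      <p>A larger D makes the 1/D term smaller and the step larger, so a single mislabeled sample moves the models further.</p>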
      <p>Then we describe the PAAPL method.</p>
      <p>1. Fetch the t-th training sample and compute scores for each label using the
current models.</p>
      <p>2. For all combinations of a label <inline-formula><tex-math>r_t \in Y_t</tex-math></inline-formula> associated with the sample and a label
<inline-formula><tex-math>s_t \in \bar{Y}_t</tex-math></inline-formula> not associated with the sample, compute the hinge loss as in PA.</p>
      <p>3. For all combinations for which the hinge loss is not 0, update the
corresponding models according to the update rule of PA.</p>
      <p>In PA, only one pair of models is updated for each sample. In PAAPL, on the
other hand, all pairs of models with non-zero hinge loss are updated for each
sample, which reduces the number of training iterations and the number of
time-consuming score computations. Therefore, the models converge faster.</p>
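      <p>A sketch of one PAAPL-style step, following the three steps listed above, is shown below. It computes the scores once and applies the PA step for every violated pair; any additional averaging or rescaling of the per-pair steps, as the name of the method suggests, is not reproduced here, so this should be read as an approximation of [5] under that assumption. The function name paapl_update is hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np

def paapl_update(M, f, positives, D=1.0):
    """One PAAPL-style step: a PA update for every violated (r, s) pair.

    M: (num_labels, dim) float weight matrix, updated in place.
    Scores are computed once from the current models.
    Returns the number of pairs with non-zero hinge loss.
    """
    scores = M @ f
    pos = np.asarray(sorted(positives), dtype=int)
    neg = np.setdiff1d(np.arange(M.shape[0]), pos)
    denom = 2.0 * (f @ f) + 1.0 / D
    delta = np.zeros(M.shape[0])          # accumulated step size per label
    violated = 0
    for r in pos:
        for s in neg:
            loss = max(0.0, 1.0 - (scores[r] - scores[s]))
            if loss > 0.0:
                delta[r] += loss / denom
                delta[s] -= loss / denom
                violated += 1
    M += np.outer(delta, f)               # apply all pairwise steps at once
    return violated
```
]]></preformat>
      <p>Because every violated pair is handled from a single score computation, one pass over a sample can adjust many label models, whereas PA adjusts exactly two.</p>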
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Using the visual feature and the textual feature stated in the previous section,
the image classifier was trained by PAAPL. The number of training iterations
was 5.</p>
      <p>First, we determined whether to use synonyms and hyponyms of each
concept when assigning labels to images. For extracting text from a webpage,
we used 10 words of text around the image, the img tag attributes, and the page title.
As the visual feature, we used the provided BoVW representation of C-SIFT. As a result, we
chose to use synonyms and hyponyms. The result is presented in Table 1.</p>
      <p>Second, we conducted a grid search over the text extraction conditions: how
many words around the image should be considered, and whether to use
the img tag attributes and the page title. The visual feature was the same as in
the first step. Results show that using only the img tag attributes was best.
Text far from the image decreased label assignment accuracy notably. The result
is shown in Table 2, together with the number of images which have at least one
label and the average number of labels assigned to one image.
Because of the property of PAAPL, only images which have at least one label
are used for training. It is worth noting that in the best condition, the number
of images used and the average number of labels are both the lowest.</p>
      <p>After this optimization, we repeated the earlier evaluation (of whether to
use synonyms and hyponyms), and the conclusion was the same.</p>
      <p>Finally, using the textual feature extraction condition stated above,
we trained the weight vectors corresponding to each visual feature (Fisher
Vector). Training took 2 hr for each visual feature. The final score of each test
image is calculated by summing the scores of all the classifiers (C-SIFT+FV,
GIST+FV, LBP+FV, and SIFT+FV). Final results are presented in Table 3, which
also shows the evaluation with the provided C-SIFT + Bag-of-Visual-Words.
The Fisher Vector exhibited much higher performance than
Bag-of-Visual-Words. We performed learning and annotation for the test set with
the top 5 ranked combinations.</p>
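      <p>The score fusion and top-5 annotation can be sketched as below. The dictionary layout, keyed by descriptor name, and the function name fuse_and_annotate are assumptions made for the example; restricting the dictionaries to a subset of descriptors gives the score for that combination.</p>
      <preformat><![CDATA[
```python
import numpy as np

def fuse_and_annotate(features, models, labels, top_k=5):
    """Sum per-descriptor linear scores and return the top_k labels.

    features: dict mapping descriptor name -> feature vector (dim,).
    models: dict mapping descriptor name -> weight matrix (num_labels, dim).
    labels: list of concept names, in the row order of the weight matrices.
    """
    total = sum(models[name] @ features[name] for name in features)
    order = np.argsort(-total)[:top_k]    # indices by descending summed score
    return [labels[i] for i in order]
```
]]></preformat>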
      <p>According to the results released by the task organizers, our best run
achieved the second-highest score among all teams.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this working note, we described our methods for annotating images in the
ImageCLEF 2013 Scalable Concept Image Annotation task, with particular emphasis on
extracting labels for images from webpages. Results show that using concepts’
synonyms and hyponyms from WordNet was useful and that limiting the range of text
extracted from a webpage was also important. For visual features, we applied the
Fisher Vector, a state-of-the-art coding method, with four local descriptors.
The combination of C-SIFT, GIST, and SIFT showed the best performance. Our
textual and visual features are simple, yet they achieve good performance.</p>
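      <table-wrap id="tab3">
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th>C-SIFT</th><th>GIST</th><th>LBP</th><th>SIFT</th><th>MF-samples</th></tr>
          </thead>
          <tbody>
            <tr><td>X</td><td>-</td><td>-</td><td>-</td><td>0.312</td></tr>
            <tr><td>-</td><td>X</td><td>-</td><td>-</td><td>0.324</td></tr>
            <tr><td>-</td><td>-</td><td>X</td><td>-</td><td>0.279</td></tr>
            <tr><td>-</td><td>-</td><td>-</td><td>X</td><td>0.311</td></tr>
            <tr><td>X</td><td>X</td><td>-</td><td>-</td><td>0.338</td></tr>
            <tr><td>X</td><td>-</td><td>X</td><td>-</td><td>0.321</td></tr>
            <tr><td>X</td><td>-</td><td>-</td><td>X</td><td>0.336</td></tr>
            <tr><td>-</td><td>X</td><td>X</td><td>-</td><td>0.331</td></tr>
            <tr><td>-</td><td>X</td><td>-</td><td>X</td><td>0.340</td></tr>
            <tr><td>-</td><td>-</td><td>X</td><td>X</td><td>0.317</td></tr>
            <tr><td>X</td><td>X</td><td>X</td><td>-</td><td>0.342</td></tr>
            <tr><td>X</td><td>X</td><td>-</td><td>X</td><td>0.346</td></tr>
            <tr><td>X</td><td>-</td><td>X</td><td>X</td><td>0.332</td></tr>
            <tr><td>-</td><td>X</td><td>X</td><td>X</td><td>0.339</td></tr>
            <tr><td>X</td><td>X</td><td>X</td><td>X</td><td>0.343</td></tr>
            <tr><td colspan="4">Provided C-SIFT + BoVW</td><td>0.276</td></tr>
          </tbody>
        </table>
      </table-wrap>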
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Paredes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          .
          <article-title>Overview of the ImageCLEF 2013 Scalable Concept Image Annotation Subtask</article-title>
          .
          <source>CLEF 2013 working notes</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Paredes</surname>
          </string-name>
          .
          <article-title>Image-Text Dataset Generation for Image Annotation and Retrieval</article-title>
          . CERI,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          .
          <article-title>WordNet: An Electronic Lexical Database</article-title>
          . MIT Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mensink</surname>
          </string-name>
          .
          <article-title>Improving the Fisher kernel for large-scale image classification</article-title>
          .
          <source>European Conference on Computer Vision</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ushiku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harada</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kuniyoshi</surname>
          </string-name>
          .
          <article-title>Efficient image annotation for automatic sentence generation</article-title>
          .
          <source>International Conference on Multimedia</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>K.</given-names>
            <surname>Crammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dekel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Keshet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shalev-Shwartz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          .
          <article-title>Online Passive-Aggressive Algorithms</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          , Vol.
          <volume>7</volume>
          , pp.
          <fpage>551</fpage>
          -
          <lpage>585</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>