<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>INAOE's participation at ImageCLEF 2016: Text Illustration Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis Pellegrin</string-name>
          <email>pellegrin@inaoep.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A. Pastor Lopez-Monroy</string-name>
          <email>pastor@inaoep.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hugo Jair Escalante</string-name>
          <email>hugojair@inaoep.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Montes-y-Gomez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Instituto Nacional de Astrofísica</institution>
          ,
<addr-line>Óptica y Electrónica (INAOE)</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe the participation of the Language Technologies Lab of INAOE at ImageCLEF 2016 teaser 1: Text Illustration (TI). The goal of the TI task is to find the best image to describe a given document query. The task is evaluated on a dataset of web pages containing both text and images. We address TI as a purely Information Retrieval (IR) task: for a given document query, we search for the most similar web pages and use the images associated with them as illustrations. In this way, queries are used to retrieve related images from web pages, but only the associated images constitute the retrieval result. To this end, we represent web pages and queries using state-of-the-art text representations that allow us to exploit textual or semantic aspects. According to the ImageCLEF 2016 evaluation, the proposed approach achieves the best performance in the TI task.</p>
      </abstract>
      <kwd-group>
        <kwd>text illustration</kwd>
        <kwd>image retrieval</kwd>
        <kwd>document representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Since 2010, ImageCLEF has promoted research into the annotation of images using noisy
web page data. Following the same path, for the 2016 edition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] two new tasks
were introduced as teasers: Text Illustration and Geolocation; this paper focuses
on the former. The goal of the Text Illustration (TI) task is to find the best
illustration, from a set of reference images, for a given text document. Unlike
the problem of illustrating a sentence formed by a few words, TI is a much
more challenging task, because the aim is to illustrate a whole
document (i.e., a web page) covering a number of different topics. In this regard,
the dataset used consists of images embedded in web pages.
      </p>
      <p>
        We address the TI problem as an Information Retrieval (IR) task. The
hypothesis is that related web pages have related images. Thus, the document
queries to be illustrated are matched against a target set of web pages, and
the embedded images of the retrieved web pages serve as illustrations. For this, we bring in two
popular representations from the IR field that do not take into account the visual
characteristics of images. On the one hand, the bag-of-words representation
defines each document as a histogram of word occurrences. On the other hand,
the Word2vec representation incorporates distributional semantics into text
documents through learned word vectors [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Finally, as work in progress, we experiment
with a third, novel multimodal representation, where textual and visual
information are combined to produce a multimodal representation of the queries. Such a
representation allows us to retrieve images directly from a reference image dataset.
The official results of the evaluation are encouraging and lay the groundwork
for future avenues of inquiry.
      </p>
      <p>The remainder of the paper is organized as follows. Section 2 describes our
method; Section 3 shows the obtained experimental results; finally, Section 4
presents some conclusions of this work.</p>
    </sec>
    <sec id="sec-2">
      <title>Text illustration using an IR-based approach</title>
      <p>To approach the TI task we consider the following elements in our strategy.
Let Q = {q1, ..., qm} be the set of document queries to be illustrated. Also,
let D = {(d1, I1), ..., (dn, In)} be the collection of web pages, where each pair
consists of a document di and its embedded images Ii. Finally, let V = {w1, ..., wr} be the textual
features extracted from the documents in the reference collection D. The general
process of the proposed approach has two stages. The first consists in representing
each query qj and each document di in the same space R^r. In the second
stage, each query qj ∈ Q is used to retrieve the k most similar web pages
{(dh, Ih) : (dh, Ih) ∈ D} to qj. The final result considers only the Ih elements
as the resultant illustration set. The rest of this section explains the stages in
detail.</p>
      <sec id="sec-2-1">
        <title>Representing documents</title>
        <p>The first stage in our strategy requires computing query vectors qj = ⟨w1, ..., wr⟩
and document vectors di = ⟨w1, ..., wr⟩ in a space R^r. For this, we relied on two
different textual representations exploiting word occurrences (i.e., BoW) and
co-occurrences (i.e., Word2vec) in documents, as described below. Note that
r is defined according to each representation: for BoW, r = |V|;
for Word2vec, r is the number of hidden units used to represent the
learned word vectors.</p>
        <p>
          Bag-of-Words (BoW) Under BoW, each document is represented by
taking each word in the vocabulary as an attribute to build document vectors
di = ⟨w1, ..., wr⟩. Intuitively, the BoW is a histogram representing word
frequencies in each document. The BoW representation was built by filtering out
high-frequency terms and using the TF-IDF (Term Frequency-Inverse Document
Frequency) weighting scheme [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
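        <p>A minimal sketch of this kind of representation, assuming scikit-learn; the max_df threshold below is only a proxy for the filtering of the most frequent terms described in Section 3, not the exact pipeline used for run1:</p>
        <preformat>
# Minimal sketch of a BoW + TF-IDF representation (assumes scikit-learn).
# max_df=0.95 drops terms occurring in more than 95% of the documents,
# a proxy for filtering out the highest-frequency terms.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["text of web page 1 ...", "text of web page 2 ..."]  # toy corpus

vectorizer = TfidfVectorizer(max_df=0.95)
D = vectorizer.fit_transform(docs)                # document vectors in R^|V|
Q = vectorizer.transform(["document query ..."])  # queries in the same space
</preformat>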
        <p>
          Word2vec Adaptation The purpose of Word2vec is to build accurate
representations of words in a space R^r. The main goal is that semantically related
words should have similar word vectors in R^r [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. For instance, the vector for Paris is
close to the vector for Berlin, since both are capitals. Surprisingly, Mikolov et al. (2013)
also showed other generalizations using specific linear operations; for example,
France - Paris + Berlin results in a vector very close to Germany. In this paper,
we exploit word vectors learned from Wikipedia using Word2vec [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
For our experiments, the learned word vectors of each document are used to
compute the average document vector, as in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The idea is that the average of
these word representations should capture rich notions of semantic relatedness
and compositionality of the whole document.
        </p>
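        <p>A minimal sketch of this averaging, assuming gensim and a Word2vec model pretrained on Wikipedia (the file name is hypothetical):</p>
        <preformat>
# Sketch of the averaged-Word2vec document representation (assumes gensim).
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to word vectors pretrained on Wikipedia.
wv = KeyedVectors.load_word2vec_format("wiki_vectors.bin", binary=True)

def doc_vector(tokens):
    """Average the learned vectors of the in-vocabulary words."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)
</preformat>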
      </sec>
      <sec id="sec-2-2">
        <title>Retrieval stage</title>
        <p>In this stage, a document query qj under a specific representation is used to
retrieve a set of relevant items {(dh, Ih) : (dh, Ih) ∈ D}. Note that only the textual
information from web pages and textual queries is used in the retrieval stage,
but the reported results correspond to the images embedded in the retrieved
items. For the retrieval stage we used the cosine similarity measure, which is
defined in Equation 1.</p>
        <p>similarity(qj, di) = cosine(qj, di) = (qj · di) / (||qj|| ||di||)   (1)</p>
        <p>where qj and di are the representations of the document query qj and the i-th
document di from the collection, respectively. The similarity is computed against
all documents in D, and the images associated with the k most similar documents
to qj are then used to illustrate it.</p>
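        <p>A minimal sketch of this retrieval stage (Equation 1 applied against all documents in D), written with numpy only; the data layout is our assumption:</p>
        <preformat>
# Retrieval stage sketch: cosine similarity of a query against every document,
# keeping the images embedded in the k most similar web pages.
import numpy as np

def retrieve_images(q, doc_matrix, image_lists, k=10):
    """q: query vector in R^r; doc_matrix: one document vector per row;
    image_lists: image identifiers embedded in each web page."""
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(q)
    sims = doc_matrix @ q / np.maximum(norms, 1e-12)  # Equation 1
    top = np.argsort(-sims)[:k]                       # k most similar pages
    return [img for i in top for img in image_lists[i]]
</preformat>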
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Results</title>
      <p>In this section we present quantitative and qualitative results of the proposed
approach on the TI task.</p>
      <sec id="sec-3-1">
        <title>Quantitative results</title>
        <p>
          Table 1 shows the performance of the proposed representations for
TI. The table reports scores using the metric proposed in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which evaluates the
recall of the ground-truth images at the k-th rank position (R@K).
Several values of k are reported; the scores in Table 1 correspond
to the test set.
        </p>
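        <p>As an illustration of the metric (a simplified sketch; the official evaluation follows [5]), R@K is the fraction of queries whose ground-truth image appears within the first k retrieved images:</p>
        <preformat>
# Simplified R@K: fraction of queries whose ground-truth image
# is ranked within the first k retrieved images.
def recall_at_k(rankings, ground_truth, k):
    """rankings: query -> ranked image list; ground_truth: query -> image."""
    hits = sum(1 for q, ranked in rankings.items()
               if ground_truth[q] in ranked[:k])
    return hits / len(rankings)
</preformat>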
        <p>Our best score is reported by run1, which uses the BoW under a TF-IDF
weighting scheme, filtering out the 5% most frequent terms. The results obtained
by run1 validate our hypothesis that related images appear in related web pages.
On the other hand, run2 and run3 report the scores obtained by the Word2vec
representation (denoted as d2v); both runs also use a filtering of 5%, with and
without TF-IDF weighting, respectively. In these latter results, we consider that
the representation is affected by noise as the number of words used to build it
increases. Although the Word2vec representation helps to retrieve similar
documents (as shown in Figure 1), we have found that this representation is more
reliable on short documents or in specific domains. However, on documents
with a diversity of topics its performance decreases (see Figure 2) because of the
great variety of different words involved.</p>
        <p>[Table 1: R@K scores on the test set for the baseline, CEA, and INAOE runs.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>Qualitative results</title>
        <p>In this subsection we compare the proposed representations. Figure 1 shows the
top retrieved images that illustrate a document query under two representations.
In this case the document query consists of a short text, and both
representations retrieve relevant images to illustrate it. An interesting
output is obtained by run3, which shows diversity in the retrieved images.</p>
        <p>On the other hand, Figure 2 shows a long document used as query. Again,
the outputs of run1 and run3 are compared. Although the document covers a
great number of topics, the image retrieval of run1 is effective, whereas
the image retrieval of run3 includes few relevant images. Taking
Figures 1 and 2 as examples, we can see that the number of terms and the richness
of the vocabulary contained in a document are important factors when selecting
the representation. While the Word2vec representation seems to be robust on short
documents or documents in a specific domain, the BoW representation with TF-IDF
weighting shows itself to be a better option in the case of long documents.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Work in progress: representing documents in a visual space</title>
        <p>We have also worked with a visual representation, but it is not reported in Table 1:
unfortunately, we were not able to submit a run because of the tight
submission deadline. Nevertheless, we present an in-house evaluation showing
qualitative results.</p>
        <p>
          For representing documents in a visual space, we used a multimodal
representation M composed of visual prototypes. The construction of M is performed
in an unsupervised way by using the images embedded in web pages. The idea is that
images can be represented in two different modalities: a visual representation
extracted from the image I, and a textual representation extracted from the web
pages D. In M, a visual prototype is formed for every word in D, where each
prototype is a distribution over visual representations (more detail on this approach
in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]). We used a reference image dataset (the training set of [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]) for the construction
of M.
        </p>
        <p>The aim of this representation is to include visual information in the text
illustration. Under this representation, the words of a given document query
are seen as a function of their visual representations. First, the visual
prototypes of the words extracted from the query are averaged to form an average
visual prototype. Second, this average visual prototype is used as a query to
retrieve related images. In other words, the document query is translated into a
visual document, which is then used to retrieve images, as in a CBIR
(Content-Based Image Retrieval) task.</p>
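        <p>A minimal sketch of this two-step query translation, assuming the prototype table and image features are precomputed (the names and data layout are hypothetical, not the exact method of [6]):</p>
        <preformat>
# Sketch of querying in the visual space: average the visual prototypes of
# the query words, then retrieve the nearest images (a CBIR-style query).
import numpy as np

def visual_query(query_words, prototypes, image_features, k=10):
    """prototypes: dict word -> prototype vector; image_features: one row per image."""
    known = [prototypes[w] for w in query_words if w in prototypes]
    avg = np.mean(known, axis=0)               # average visual prototype
    norms = np.linalg.norm(image_features, axis=1) * np.linalg.norm(avg)
    sims = image_features @ avg / np.maximum(norms, 1e-12)
    return np.argsort(-sims)[:k]               # indices of the retrieved images
</preformat>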
        <p>In Figure 3, we show one favorable case for the visual representation. In this
case, the average visual prototype is formed from the three words of the document
query with the highest weight. For this kind of representation, we have observed
that the more terms in the document query, the noisier the visual representation
is. We conclude that a visual document representation should be formed from only
a few words, so a keyword extraction step on the document query is necessary.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper we presented an IR-based approach to address the Text Illustration
task. Documents are defined by textual, semantic, or visual representations.
The experiments performed under these different representations give an initial
point of comparison for future approaches. According to the performed evaluation
we conclude that related web pages have related images, and thus it is possible
to retrieve highly relevant elements using IR techniques. On the one hand, the
BoW obtained an outstanding performance because of the filtering of high-frequency
terms and the discriminative information captured by the TF-IDF weighting
scheme. On the other hand, Word2vec did not yield reliable representations
because of the great diversity of words involved in web pages. Such diversity
makes it difficult to build accurate document representations using a simple
average of word vectors. Our perspectives for future work include exploring
relationships between representations to incorporate mixed (textual-visual)
information and adding keyword extraction for the document query.</p>
      <p>Acknowledgments. This work was supported by CONACyT under project
grant CB-2014-241306 (Clasificación y recuperación de imágenes mediante técnicas
de minería de textos). The first author was supported by CONACyT under
scholarship No. 214764, and Lopez-Monroy thanks CONACyT-Mexico for doctoral
scholarship 243957.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramisa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dellandrea</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaizauskas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2016 Scalable Concept Image Annotation Task</article-title>
          .
          <source>In: CLEF2016 Working Notes. CEUR Workshop Proceedings</source>
          , Évora, Portugal (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>CoRR abs/1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          :
          <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
          .
          <source>Journal of Documentation</source>
          <volume>28</volume>
          (
          <year>1972</year>
          )
          <fpage>11</fpage>
          –
          <lpage>21</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of sentences and documents</article-title>
          .
          <source>CoRR abs/1405.4053</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hodosh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hockenmaier</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Framing image description as a ranking task: Data, models and evaluation metrics</article-title>
          .
          <source>J. Artif. Int. Res</source>
          .
          <volume>47</volume>
          (
          <year>2013</year>
          )
          <fpage>853</fpage>
          –
          <lpage>899</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pellegrin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanegas</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arevalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beltran</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-YGomez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>INAOE-UNAL at ImageCLEF 2015: Scalable Concept Image Annotation</article-title>
          .
          <source>In: CLEF2015 Working Notes. CEUR Workshop Proceedings</source>
          , Toulouse, France (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>