<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Recognition in the EAGLE Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Amato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Bolettieri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fausto Rabitti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Vadicamo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Scienza e Tecnologie dell'Informazione Consiglio Nazionale delle Ricerche</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present a system for visually retrieving ancient inscriptions, developed in the context of the ongoing Europeana network of Ancient Greek and Latin Epigraphy (EAGLE) EU Project. The system allows a user standing in front of an inscription (e.g., in a museum, street, or archaeological site) or looking at a reproduction (e.g., in a book, on a monitor) to automatically recognize the inscription and obtain information about it using just a smartphone or a tablet. The experimental results show that the Vector of Locally Aggregated Descriptors is a promising encoding strategy for performing visual recognition in this specific context.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The large availability of digital cameras, especially those embedded in smartphones
and tablets, allows final users to take photos of their objects of interest at
almost no cost. On one side, there are users making thousands of photos. On
the other side, cultural heritage institutions typically have photos and metadata,
both in digital form, related to the objects they preserve. In this context, there
is a growing demand for technologies for content-based multimedia information
retrieval.</p>
      <p>
        In the last few years, research on object recognition has focused on local
features [
        <xref ref-type="bibr" rid="ref11 ref8">8, 11</xref>
        ]. Following this approach, an image is represented by describing
the visual content of typically thousands of automatically selected regions of
interest. Then, images are compared by matching their local features
and searching for a geometric transformation that can associate the regions of
both images. To deal with large datasets, compact image signatures based on
the aggregation of local features have been proposed [
        <xref ref-type="bibr" rid="ref10 ref6">10, 6</xref>
        ].
      </p>
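      <p>As a concrete illustration of this pipeline, the following Python fragment (a minimal sketch assuming OpenCV, not the code used in the project) extracts SIFT local features from two images and matches them with Lowe's ratio test; a geometric verification step in the same spirit is sketched at the end of Section 2.</p>
      <preformat>
# Minimal sketch of local feature extraction and matching
# (assumes OpenCV; illustrative only, not the EAGLE implementation).
import cv2

def match_sift(query_path, archive_path, ratio=0.8):
    img1 = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(archive_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    # Each image typically yields thousands of keypoints, each described
    # by a 128-dimensional SIFT descriptor.
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    # For every query descriptor, take the two nearest archive descriptors
    # and keep the match only if the best is clearly better (ratio test).
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if ratio * n.distance > m.distance]
    return kp1, kp2, good
      </preformat>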
      <p>
        Visual object recognition has also been studied in the context of cultural
heritage and computing. As an example, the VISITO Tuscany project
(http://www.visitotuscany.it/) has investigated the visual recognition of
cultural heritage objects (such as monuments, landmarks, etc.) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, to the best of our knowledge, there are no results
in the literature regarding experiments conducted on ancient inscriptions.
      </p>
      <p>
        The research reported in this paper summarizes the results presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
The focus is on searching an archive for the inscriptions most similar to the
one represented in a photo. This functionality will be integrated into an
official EAGLE mobile application in order to allow the user to take a picture
of an inscription (e.g., in a museum, in an archaeological site, in a book, etc.),
send it to the central repository, and receive back the information associated with
that inscription.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>
        The dataset we used consists of 17,155 photos related to 14,560 ancient
inscriptions that were made available by Sapienza University of Rome within the
EAGLE project. In order to visually recognize inscriptions, we selected and tested
the most promising approaches from the recent literature. As local features we
used the well-known Scale Invariant Feature Transform (SIFT) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Given an
image, thousands of local features are extracted. In our case, we obtained an
average of 1,591 SIFT descriptors per image. However, the fact that some of them refer to
bigger regions than others makes it possible to select a subset of local features that are
in principle more relevant [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Thus, in the experiments we also tried to reduce
the number of local features by selecting only the most important ones. With
the goal of efficiently searching the archive, we tested the best-known local
feature aggregation techniques: the Bag-of-Features (BoF) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the Vector
of Locally Aggregated Descriptors (VLAD) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Both approaches use a
codebook of visual words and the cosine similarity. For BoF, we applied the TF-IDF
weighting [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
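      <p>For illustration, the following sketch (our own, assuming NumPy and scikit-learn; not the project's code) shows how a VLAD vector can be computed from the SIFT descriptors of an image given a k-means codebook, following [6]; since the vector is L2-normalized, the dot product between two such vectors equals their cosine similarity.</p>
      <preformat>
# VLAD encoding sketch (illustrative; assumes NumPy and scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(sample_descriptors, k=256):
    # The codebook of k visual words is learned with k-means,
    # as done for both BoF and VLAD in our experiments.
    return KMeans(n_clusters=k).fit(sample_descriptors)

def vlad_encode(descriptors, codebook):
    centers = codebook.cluster_centers_      # shape (k, 128) for SIFT
    words = codebook.predict(descriptors)    # nearest word per descriptor
    v = np.zeros_like(centers)
    for w, d in zip(words, descriptors):
        v[w] += d - centers[w]               # accumulate residuals per word
    v = v.ravel()
    # L2-normalize so that the dot product equals the cosine similarity.
    return v / (np.linalg.norm(v) + 1e-12)
      </preformat>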
      <p>To recognize the actual object in a query image, we perform a visual
similarity search over all the images in the dataset. The optimum would be to have
an image of the same inscription as the first result. Whenever this is not the case, it
is interesting to understand at which position in the result list the most visually
similar photo of the same object appears. In fact, traditional computer vision
techniques could be applied to the results in order to achieve better effectiveness.
Thus, as quality measure we use the probability p of finding an image of the same
object within the first r results. For r = 1, p equals the accuracy of a classifier
that classifies the query inscription as the most similar one found (i.e.,
a 1-NN classifier). A common measure of effectiveness in similarity search
applications is the mean Average Precision (mAP), which effectively summarizes the
precision and recall curves.</p>
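      <p>These measures can be computed directly from the ranked result lists. The following sketch (a hypothetical helper, assuming that each query comes with the ranked inscription ids of its results and with its true inscription id) estimates p for several values of r; p at r = 1 is exactly the 1-NN classifier accuracy.</p>
      <preformat>
# Sketch of the quality measure p (illustrative; hypothetical data layout).
def probability_at_r(ranked_ids_per_query, true_ids, rs=(1, 10, 100)):
    # ranked_ids_per_query: for each query, the inscription ids of the
    # results, ordered by decreasing visual similarity.
    hits = {r: 0 for r in rs}
    for ranked_ids, true_id in zip(ranked_ids_per_query, true_ids):
        for r in rs:
            if true_id in ranked_ids[:r]:
                hits[r] += 1
    n = len(true_ids)
    return {r: hits[r] / n for r in rs}  # p at r = 1 is the 1-NN accuracy
      </preformat>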
      <p>
        In Table 1, we report the best results obtained, ordered with respect to the
mAP. In the first column, we report a brief description of the approach. In the
second column, the average number of SIFT descriptors considered is shown (i.e., 235 when
local feature selection was applied and 1,591 otherwise). The third column
reports the number of words used in the aggregation phase. While the words have
been selected using k-means for both BoF and VLAD, their use is very
different. Thus, in the bytes column, we report the average size in bytes of the
resulting representation. As quality measures, we used the probability p of
having at least one relevant image among the first r results for r = 1, 10, 100, and
the mAP. In case we use these approaches to recognize the query image relying
on the nearest image in the dataset, the best approach is VLAD, which
obtained an accuracy of 0.69 for a codebook size of 256 while selecting the 235 most
relevant local features. The second best is BoF in conjunction with
geometric consistency checks performed using RANSAC [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, this approach is
not indexable and was only used as an effective but not efficient baseline. The
more traditional cosine TF-IDF similarity applied to BoF obtained good results
only in conjunction with a very large codebook (i.e., 400k). It is worth noting
that this approach outperforms VLAD for r = 10, 100. We believe that VLAD is
still preferable, since recent works such as [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] have shown that VLAD can be indexed more
efficiently than BoF.
      </p>
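      <p>For illustration, the geometric consistency check used by the baseline can be sketched as follows (assuming OpenCV and reusing the hypothetical match_sift helper above; not the exact baseline code): candidate results are re-ranked by the number of matches that survive a RANSAC [5] homography estimation.</p>
      <preformat>
# RANSAC geometric verification sketch (assumes OpenCV; illustrative).
import numpy as np
import cv2

def inlier_count(kp1, kp2, matches, thresh=5.0):
    # A homography needs at least 4 correspondences.
    if len(matches) >= 4:
        src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        # RANSAC fits a homography while rejecting outlier matches;
        # the surviving inliers are the geometrically consistent ones.
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, thresh)
        if mask is not None:
            return int(mask.sum())
    return 0
      </preformat>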
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>In this work, we tested state-of-the-art object recognition techniques on a dataset
of 17,155 photos related to 14,560 inscriptions. The best accuracy was obtained
by using the VLAD approach, which has been recently proposed for performing
object recognition on a large scale. Surprisingly, even the BoF approach in
conjunction with geometric consistency checks was not able to outperform the VLAD
representation, which can also be indexed more efficiently than BoF. The obtained
accuracy was 0.69, which is good considering the difficulties of the task and
the few images available for each inscription in the dataset. However, we plan to
improve these results by performing re-ranking and direct local feature
matching. To this goal, we also reported the probability of having a relevant image
among the retrieved images. The results show that it is possible to have a
relevant image among the first 100 retrieved with probability 0.90 using the VLAD
approach with a visual vocabulary of size 256 and filtering the SIFT descriptors. Thus, we
plan to try binary local features and other techniques in order to improve the
obtained 0.69 accuracy up to the 0.90 obtainable, in theory, by re-ranking the
first 100 images retrieved using VLAD.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work was partially supported by EAGLE (Europeana network of Ancient
Greek and Latin Epigraphy, co-funded by the European Commission,
CIP-ICT-PSP.2012.2.1 - Europeana and creativity, Project Reference 325122).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolettieri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gennaro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Large scale image retrieval using vector of locally aggregated descriptors</article-title>
          .
          <source>In: Similarity Search and Applications, Lecture Notes in Computer Science</source>
          , vol.
          <volume>8199</volume>
          , pp.
          <fpage>245</fpage>
          -
          <lpage>256</lpage>
          . Springer Berlin Heidelberg (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gennaro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>On reducing the number of visual words in the bag-of-features representation</article-title>
          .
          <source>In: VISAPP 2013 - Proceedings of the International Conference on Computer Vision Theory and Applications</source>
          , Volume
          <volume>1</volume>
          , Barcelona, Spain, 21-24 February, 2013. pp.
          <fpage>657</fpage>
          -
          <lpage>662</lpage>
          . SciTePress (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabitti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Landmark recognition in VISITO: VIsual Support to Interactive TOurism in Tuscany</article-title>
          .
          <source>In: Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR2011)</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabitti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vadicamo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Inscriptions visual recognition. A comparison of state-of-the-art object recognition approaches</article-title>
          .
          <source>In: Proceedings of the First EAGLE International Conference</source>
          . vol.
          <volume>26</volume>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>131</lpage>
          . Sapienza Universita Editrice
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fischler</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolles</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          :
          <article-title>Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography</article-title>
          .
          <source>Commun. ACM</source>
          <volume>24</volume>
          (
          <issue>6</issue>
          ),
          <fpage>381</fpage>
          -
          <lpage>395</lpage>
          (Jun
          <year>1981</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perronnin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Aggregating local image descriptors into compact codes</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>34</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1704</fpage>
          -
          <lpage>1716</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          :
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>60</volume>
          (
          <issue>2</issue>
          ),
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A performance evaluation of local descriptors</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>27</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1615</fpage>
          -
          <lpage>1630</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGill</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <source>Introduction to Modern Information Retrieval</source>
          . McGraw-Hill, Inc., New York, NY, USA (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sivic</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Video Google: A text retrieval approach to object matching in videos</article-title>
          .
          <source>In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on</source>
          . pp.
          <fpage>1470</fpage>
          -
          <lpage>1477</lpage>
          . IEEE
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Tuytelaars</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Local invariant feature detectors: a survey</article-title>
          .
          <source>Foundations and Trends® in Computer Graphics and Vision</source>
          <volume>3</volume>
          (
          <issue>3</issue>
          ),
          <fpage>177</fpage>
          -
          <lpage>280</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>