<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Surrogate Text Representation of Visual Features for Fast Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information Science and Technologies (ISTI), Italian National Research Council (CNR)</institution>
          ,
          <addr-line>Via G. Moruzzi 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Supervised by Dr. Giuseppe Amato</institution>
          ,
          <addr-line>Dr. Claudio Gennaro, and Prof. Francesco Marcelloni</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>We propose a simple and effective methodology to index and retrieve image features without the need for a time-consuming codebook learning step. We employ a scalar quantization approach combined with Surrogate Text Representation (STR) to perform large-scale image retrieval relying on the latest text search engine technologies. Experiments on large-scale image retrieval benchmarks show that we improve the effectiveness-efficiency trade-off of current STR approaches while performing comparably to state-of-the-art main-memory methods without requiring a codebook learning procedure.</p>
      </abstract>
      <kwd-group>
        <kwd>image retrieval</kwd>
        <kwd>deep features</kwd>
<kwd>surrogate text representation</kwd>
        <kwd>inverted index</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>As part of the contributions of our thesis, we explore new approaches for making
image retrieval as similar as possible to text retrieval, so as to reuse the technologies and
platforms exploited today for text retrieval without the need for dedicated access
methods. In a nutshell, the idea is to take image representations extracted from
a CNN, often referred to as deep features, and to transform them into a
surrogate text that a standard text search engine can index. Our general approach
is based on the transformation of deep features, which are (dense) vectors of
real numbers, into sparse vectors of integer numbers that can be mapped into
term frequencies. Moreover, sparseness is necessary to achieve sufficient levels of
efficiency, as it is for search engines over text documents. We introduce and
evaluate a novel approach for surrogate text representation of deep features based
on Scalar Quantization (SQ).</p>
    </sec>
    <sec id="sec-2">
      <title>Surrogate Text Representation</title>
<p>
        We define a family of transformations that map a feature vector into a
textual representation without the need for tedious training procedures. We require
that such transformations preserve as much as possible the proximity relations
between the data, i.e., similar feature vectors are mapped to similar textual
documents. This basic idea was first exploited in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where the authors defined
the Surrogate Text Representation (STR) to represent a generic metric object,
i.e., an object living in a space where a distance function is defined [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The STR
of an object is a space-separated concatenation of alphanumeric codewords
selected from a pre-defined dictionary. We observe that a surrogate text
representation for an object o of a data domain X can be obtained more generally by
defining a transformation
        <disp-formula id="eq1"><label>(1)</label><tex-math>f : X \to \mathbb{N}^n, \quad o \mapsto f_o = [f_o(1), \ldots, f_o(n)],</tex-math></disp-formula>
where f_o will act as the vector of the term frequencies of a synthetic text
document t_o with respect to a dictionary of n words. Given a dictionary {τ_1, …, τ_n}
and the transformation f : ℝ^m → ℕ^n, we define the surrogate text t_o = STR_f(o)
as
        <disp-formula id="eq2"><label>(2)</label><tex-math>\mathrm{STR}_f(o) = \bigcup_{i=1}^{n} \bigcup^{f_o(i)} \tau_i,</tex-math></disp-formula>
where, by abuse of notation, we denote the space-separated concatenation of the
codewords with the union operator ∪. Thus, by construction, the integer value
of the i-th component of the vector f_o is the frequency of the codeword τ_i in
the text STR_f(o). For example, given f_o = [1, 3, 0, 2] and a codebook
{τ_1 = "A", τ_2 = "B", τ_3 = "C", τ_4 = "D"}, we have STR_f(o) = "A B B B D D".
Using this representation, the search engine indexes the text by using inverted
files, i.e., each object o is stored in the posting lists associated with the codewords
appearing in STR_f(o). We demonstrate indexing and searching of D-dimensional
vectors compared with the dot product, i.e., X = ℝ^D.
      </p>
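      <p>To make Eq. (2) concrete, the following Python sketch (function and variable names are ours, for illustration only) rebuilds the surrogate text of the example above by repeating each codeword according to its term frequency:</p>
      <preformat>
# Sketch of Eq. (2): turn a term-frequency vector into a surrogate text
# by repeating each codeword f_o(i) times (space-separated concatenation).

def surrogate_text(frequencies, codebook):
    return " ".join(
        word
        for freq, word in zip(frequencies, codebook)
        for _ in range(freq)
    )

codebook = ["A", "B", "C", "D"]
f_o = [1, 3, 0, 2]
assert surrogate_text(f_o, codebook) == "A B B B D D"
      </preformat>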
    </sec>
    <sec id="sec-3">
      <title>Scalar Quantization-Based Approach</title>
<p>
        The first step in our approach is the application of a random orthogonal
transformation to the vectors, v → R(v − α), q → Rq, where v and q are vectors of
database and query images, respectively, R is a random orthogonal matrix (‖R‖₂ = 1), and
α ∈ ℝ^D can be arbitrarily chosen. The benefit of this transformation is
twofold: a) it re-balances the variance of the components of the feature vectors [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
thus preventing unbalanced posting lists in the inverted file; b) it is
ordering-preserving with respect to the dot product: given a rotation matrix R ∈ ℝ^{D×D}
and a vector α ∈ ℝ^D, then
        <disp-formula id="eq3"><label>(3)</label><tex-math>\forall\, q, v_1, v_2 \in \mathbb{R}^D:\quad q \cdot v_1 \geq q \cdot v_2 \iff Rq \cdot R(v_1 - \alpha) \geq Rq \cdot R(v_2 - \alpha).</tex-math></disp-formula>
      </p>
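      <p>As an illustration of this step, here is a minimal Python sketch using NumPy. The QR-based construction of the random orthogonal matrix and the synthetic data are our assumptions for illustration, since the construction of R is not otherwise specified:</p>
      <preformat>
import numpy as np

def random_rotation(d, seed=0):
    """Random orthogonal matrix via QR of a Gaussian matrix (one common
    construction; any orthogonal R with ||R||_2 = 1 works)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform rotation

D = 2048
rng = np.random.default_rng(1)
database = rng.standard_normal((1000, D))  # stand-in for deep features
queries = rng.standard_normal((10, D))

R = random_rotation(D)
alpha = database.mean(axis=0)          # centering vector alpha of Eq. (3)
rotated_db = (database - alpha) @ R.T  # v -> R(v - alpha)
rotated_q = queries @ R.T              # q -> Rq
      </preformat>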
      <p>Then, we transform the rotated vectors into integer term-frequency vectors
by scaling and quantization: v → ⌊sv⌋, where ⌊·⌋ denotes the floor function and
s &gt; 1 is the quantization factor. This process introduces a quantization error,
due to the representation of float components as integers, that does not affect
the retrieval effectiveness. In summary, the SQ-based STR is obtained using
the transformation f_SQ : v ↦ ⌊sR(v − α)⌋. For instance, suppose that after the
random rotation we have the feature vector v = [0.1, 0.3, 0.4, 0, 0.2]; by adopting
a multiplication factor s = 10, the term-frequency vector will be
f_SQ(v) = [1, 3, 4, 0, 2], and thus STR_{f_SQ}(v) = "A B B B C C C C E E".</p>
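      <p>A minimal sketch of f_SQ on the worked example, assuming NumPy and the codebook {"A", …, "E"} (the helper names are ours):</p>
      <preformat>
import numpy as np

def f_sq(v_rotated, s=10):
    """Scale by s and floor: the SQ term-frequency vector. Assumes the
    input is already R(v - alpha); negative components are handled by
    the CReLU and thresholding steps described next."""
    return np.floor(s * v_rotated).astype(int)

v = np.array([0.1, 0.3, 0.4, 0.0, 0.2])
tf = f_sq(v)  # -> [1, 3, 4, 0, 2]
codebook = ["A", "B", "C", "D", "E"]
text = " ".join(w for f, w in zip(tf, codebook) for _ in range(f))
assert text == "A B B B C C C C E E"
      </preformat>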
      <p>In order to sparsify the term-frequency vectors, that is, to cancel the less
significant components of the vectors, we adopt thresholding:
        <disp-formula id="eq4"><label>(4)</label><tex-math>v_\tau(i) = \begin{cases} v(i) &amp; \text{if } v(i) \geq \tau \\ 0 &amp; \text{otherwise,} \end{cases}</tex-math></disp-formula>
where v(i) indicates the i-th dimension of v. To optimize this step, we set α of
Eq. (3) to the mean vector so as to obtain zero-centered components. Thresholding
alone ignores the negative components of the original real-valued feature
vectors, discarding useful information. To prevent this, before our proposed
transformation, we also apply the Concatenated ReLU (CReLU) transformation
to the feature vectors, defined as v⁺ = max([v, −v], 0), where max(·, 0) is applied
element-wise. Notice that, in general, the CReLU operation is lossy, since the dot
product between vectors is not preserved, i.e., v₁ · v₂ ≠ v₁⁺ · v₂⁺. However, this
transformation allows us not to completely neglect the negative components.</p>
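      <p>Both operations are elementary; a NumPy sketch with our own names and example values (not the exact implementation) follows:</p>
      <preformat>
import numpy as np

def crelu(v):
    """Concatenated ReLU: v+ = max([v, -v], 0) element-wise.
    Doubles the dimensionality so negative components survive."""
    return np.maximum(np.concatenate([v, -v]), 0)

def threshold(v, tau):
    """Eq. (4): keep components >= tau, zero out the rest (sparsification)."""
    return np.where(v >= tau, v, 0)

v = np.array([0.4, -0.3, 0.1])
v_plus = crelu(v)                # [0.4, 0.0, 0.1, 0.0, 0.3, 0.0]
sparse = threshold(v_plus, 0.2)  # [0.4, 0.0, 0.0, 0.0, 0.3, 0.0]
      </preformat>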
    </sec>
    <sec id="sec-4">
      <title>Experimental Evaluation and Discussion</title>
      <p>
        To validate our approach, we extract R-MAC features [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], 2048-dimensional
real-valued feature vectors, from the images of INRIA Holidays +
MIRFlickr1M [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a standard benchmark for image retrieval. We evaluate our
approach for different values of the thresholding parameter τ. We compare with
Deep Permutations (DP) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a permutation-based STR approach. We generate
different sets of permutations from the original and CReLU-ed features by
considering the top-k truncated sorting permutations at different values of k. We
also compare with state-of-the-art main-memory approximate nearest neighbor
algorithms based on Product Quantization, as implemented in FAISS [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For a fair comparison, we
use a configuration of FAISS which gives the best trade-off between effectiveness
and efficiency: we choose a relatively big code size (C = 1,024) and an optimal
number of Voronoi cells (N = 16k), resulting in 1 GB in main memory against about
0.7 GB in secondary memory for our solution in its larger configuration. Since
FAISS methods need to learn a codebook from data, we report results with
two different training datasets: the indexed dataset itself, on which the mAP
evaluation is performed, and T4SA, an uncorrelated dataset of Twitter images.
      </p>
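      <p>For reference, an IVFPQ baseline of this kind can be sketched with the FAISS index factory as below. The parameters and data are illustrative, deliberately smaller than the experimental configuration above, and the snippet only highlights the codebook-learning (train) step that our STR approach avoids:</p>
      <preformat>
import faiss
import numpy as np

d = 2048      # R-MAC dimensionality
nlist = 1024  # Voronoi cells (illustrative; see text for actual values)
m = 64        # sub-quantizers -> m bytes per code (illustrative)

rng = np.random.default_rng(0)
xt = rng.standard_normal((20_000, d)).astype("float32")  # training set
xb = rng.standard_normal((10_000, d)).astype("float32")  # database

index = faiss.index_factory(d, f"IVF{nlist},PQ{m}",
                            faiss.METRIC_INNER_PRODUCT)
index.train(xt)  # codebook learning: the step our method does not need
index.add(xb)
index.nprobe = 16                       # cells visited per query
scores, ids = index.search(xb[:5], 10)  # top-10 by inner product
      </preformat>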
      <p>Results in Figure 1 show a satisfactory trade-off trend between effectiveness and
query selectivity and the general benefit introduced by the CReLU
preprocessing.</p>
      <p>[Figure 1: mAP (y-axis, 0.1–0.8) vs. query selectivity S_DB (x-axis, 10^-3 to 10^-2, log scale)
for Brute-force, DP, DP + CReLU, SQ + Threshold, SQ + Threshold + CReLU,
IVFPQ* (N=16010, C=1024), and IVFPQ (N=16010, C=1024).]</p>
      <p>We notice that the impact of codebook learning on PQ-based methods is
really strong, and in real applications it could influence the scalability of the
system or require continuous codebook adjustments, forcing periodic re-indexing of
the data. Our solution achieves intermediate performance but does not
require any training procedure and therefore no re-adjustments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolettieri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carrara</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gennaro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Large-scale image retrieval with Elasticsearch</article-title>
          .
          <source>In: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval</source>
          . pp.
          <fpage>925</fpage>
          –
          <lpage>928</lpage>
          . ACM (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gennaro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolettieri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savino</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An approach to content-based image retrieval based on the Lucene search engine library</article-title>
          .
          <source>In: International Conference on Theory and Practice of Digital Libraries</source>
          . pp.
          <fpage>55</fpage>
          –
          <lpage>66</lpage>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Hamming embedding and weak geometric consistency for large scale image search</article-title>
          .
          <source>In: European conference on computer vision</source>
          . pp.
          <fpage>304</fpage>
          –
          <lpage>317</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Aggregating local descriptors into a compact image representation</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>2010 IEEE Conference on</source>
          . pp.
          <fpage>3304</fpage>
          –
          <lpage>3311</lpage>
          . IEEE (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Billion-scale similarity search with GPUs</article-title>
          .
          <source>arXiv preprint arXiv:1702.08734</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Tolias</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sicre</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Particular object retrieval with integral max-pooling of CNN activations</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zezula</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dohnal</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Similarity search: the metric space approach</article-title>
          , vol.
          <volume>32</volume>
          . Springer Science &amp; Business Media (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>