<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scalar Quantization-Based Text Encoding for Large Scale Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author"><string-name>Giuseppe Amato</string-name></contrib>
        <contrib contrib-type="author"><string-name>Fabio Carrara</string-name></contrib>
        <contrib contrib-type="author"><string-name>Fabrizio Falchi</string-name></contrib>
        <contrib contrib-type="author"><string-name>Claudio Gennaro</string-name></contrib>
        <contrib contrib-type="author"><string-name>Fausto Rabitti</string-name></contrib>
        <contrib contrib-type="author"><string-name>Lucia Vadicamo</string-name></contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Information Science and Technologies (ISTI), CNR</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The great success of visual features learned from deep neural networks has led to a significant effort to develop efficient and scalable technologies for image retrieval. This paper presents an approach to transform neural network features into text codes suitable for being indexed by a standard full-text retrieval engine such as Elasticsearch. The basic idea is to provide a transformation of neural network features with the twofold aim of promoting sparsity and avoiding the need for unsupervised pre-training. We validate our approach on a recent convolutional neural network feature, namely the Regional Maximum Activations of Convolutions (R-MAC), which is a state-of-the-art descriptor for image retrieval. An extensive experimental evaluation conducted on standard benchmarks shows the effectiveness and efficiency of the proposed approach and how it compares to state-of-the-art main-memory indexes.</p>
      </abstract>
      <kwd-group>
        <kwd>Image retrieval</kwd>
        <kwd>Deep Features</kwd>
        <kwd>Inverted index</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Full-text search engines on the Web have achieved great results in terms of
efficiency thanks to the use of inverted index technology. In recent years, we have
experienced an increasing interest in the retrieval of other forms of expression,
such as images; nevertheless, development in those cases has not been as rapid as
in text-based paradigms.</p>
      <p>
        In the field of image retrieval, since 2014 we have witnessed a great
development of learned features obtained by neural networks, in particular
Convolutional Neural Networks (CNNs), which have emerged as effective image
descriptors. Differently from text, where inverted indexes perfectly marry the sparse
document representation of standard vector models, learned image descriptors
tend to be dense and compact, thus making the direct use of mature text-tailored
index technologies unfeasible. While efficient index structures for this
type of data exist [
        <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
        ], they usually come with caveats that prevent their
usage in very large-scale scenarios, such as main-memory-only implementations
and computationally expensive indexing or codebook-learning phases.
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). This volume is published
and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy.</p>
      <p>
        This paper summarizes the main contributions presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where we
explored new approaches to make image retrieval as similar as possible to text
retrieval, so as to reuse the technologies and platforms exploited today for text
retrieval without the need for dedicated access methods. In a nutshell, the idea
is to use image representations extracted from a CNN (deep features) and to
transform them into text so that they can be indexed with a standard text
search engine. In particular, we propose a Scalar Quantization approach that
transforms deep features, which are (dense) vectors of real numbers, into sparse
vectors of integer numbers. The components of these integer vectors are then
translated into "term frequencies" of synthetic textual documents. Sparseness is
necessary to achieve sufficient levels of efficiency, exactly as
it is for search engines for text documents.
      </p>
      <p>We consider the problem of image retrieval in a large-scale context, with
an eye to scalability. This aspect is often overlooked in the literature: most
image retrieval systems are designed to work in main memory, and many of
them cannot be distributed across a cluster of nodes. Many techniques in the
literature try to tackle this problem by heavily compressing the representation of
visual features so as to fit more and more data into secondary memory. However,
these approaches do not scale, because sooner or later response times
become unacceptable as the size of the data to be managed increases.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In the last years, image features extracted using deep CNNs have been widely
employed as effective image descriptors. Deep features have achieved state-of-the-art
results in several vision tasks, including image retrieval [
        <xref ref-type="bibr" rid="ref5 ref6">6,5</xref>
        ] and object detection
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As a consequence, there is an increasing interest in identifying techniques
able to efficiently index and search large sets of deep features.
      </p>
      <p>
        To frame our work in the context of the scientific literature, we focus on
techniques that deal with emerging deep features using an inverted index. Liu et al.
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed a framework that adapts the BoW model and inverted table to
index deep features. However, it needs to learn a large visual dictionary when
dealing with a large-scale dataset. Other works treat the features in a
convolutional layer as local features by using aggregation schemes like BoW and VLAD
[
        <xref ref-type="bibr" rid="ref19 ref4">4,19</xref>
        ] or try to quantize the deep features using a codebook [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Jegou et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
proposed an approximate nearest neighbor algorithm based on product
quantization (PQ), which exploits an inverted index. In PQ, the original vector is divided
into M sub-vectors that are independently quantized. A codebook is learned via
k-means for each of the M sub-spaces, and each sub-vector is compressed by
storing the index of its nearest centroid. An implementation of PQ-compressed
inverted indexes, denoted IVFPQ, is available in the FAISS library.
      </p>
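<p>To make the PQ scheme concrete, the following is a minimal NumPy sketch (our own illustration, not the FAISS implementation; names such as train_pq are hypothetical): each vector is split into M sub-vectors, a K-centroid codebook is learned per sub-space via plain k-means, and a vector is stored as M centroid indices.</p>

```python
import numpy as np

def train_pq(data, M=4, K=16, iters=10, seed=0):
    """Learn one K-centroid codebook per sub-vector via plain k-means."""
    rng = np.random.default_rng(seed)
    N, D = data.shape
    d = D // M  # sub-vector dimensionality
    codebooks = []
    for m in range(M):
        sub = data[:, m * d:(m + 1) * d]
        cent = sub[rng.choice(N, K, replace=False)]  # random init (a copy)
        for _ in range(iters):
            # assign each sub-vector to its nearest centroid
            idx = np.argmin(((sub[:, None] - cent[None]) ** 2).sum(-1), axis=1)
            for k in range(K):
                if np.any(idx == k):
                    cent[k] = sub[idx == k].mean(axis=0)
        codebooks.append(cent)
    return codebooks

def encode_pq(v, codebooks):
    """Compress a vector to M centroid indices (one byte each for K <= 256)."""
    d = codebooks[0].shape[1]
    return [int(np.argmin(((v[m * d:(m + 1) * d] - cb) ** 2).sum(-1)))
            for m, cb in enumerate(codebooks)]

def decode_pq(code, codebooks):
    """Reconstruct the (lossy) vector from its centroid indices."""
    return np.concatenate([cb[k] for k, cb in zip(code, codebooks)])
```

<p>A 32-dimensional vector compressed with M=4, K=16 thus occupies only four small indices; the price is the lossy reconstruction that decode_pq returns.</p>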
      <p>
        In our work, we propose a novel approach to generate sparse
representations of deep features that can be employed to efficiently index and search the
deep features using inverted files. Without loss of generality, we use the
Regional Maximum Activations of Convolutions (R-MAC) feature vector defined
by Gordo et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as representative of the family of deep-learned dense
real-valued representations for instance-level image retrieval. This kind of feature
poses to current sparse-encoding techniques the challenge of dealing with
non-sparse, real-valued feature vectors, which this work aims to tackle. On
the other hand, other types of deep features typically used in this field, such as
ReLU-ed features extracted from pretrained image classifiers, have already been
explored and indexed with a similar approach in a preliminary work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Scalar Quantization</title>
      <p>In this section, we present a novel approach to generate sparse representations for
D-dimensional vectors originally compared with the dot product. In particular,
we focus on R-MAC descriptors, which are real-valued dense vectors that are
particularly powerful for applications of instance-level and content-based image
retrieval. However, our method can be adapted to general Euclidean vectors.</p>
      <p>
        In a nutshell, we aim to define a transformation f : R^D → N^n that generates
sparse vectors and preserves the similarities between objects as much as possible. The
reason why we want to transform real-valued dense vectors into sparse vectors
of natural numbers is that we want to use a full-text search engine to index
the vectors so transformed. In fact, a full-text search engine based on the vector
space model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] generates a vector representation of a text via term frequencies,
i.e., the number of occurrences of the words in it. These systems transform the
texts into vector representations using the well-known TF scheme and
practically use the dot product as the similarity function between vectors. So, the
abstract transformation f(·) represents a function that exactly generates the
vectors that are internally represented by the search engine in the case of the
simple term-weighting scheme. In other words, given a dictionary of n codewords,
we transform an object o into a synthetic text encoding t_o that is obtained as
a space-separated concatenation of codewords, so that the i-th codeword is
repeated a number of times equal to the i-th element of the vector f(o). Using
this representation, the search engine indexes the text using inverted files,
i.e., each object o is stored in the posting lists associated with the codewords
appearing in the text representation of o. The number of posting lists equals the
number of codewords of the considered vocabulary. This approach is known in
the literature as Surrogate Text Representation [
        <xref ref-type="bibr" rid="ref1 ref7">1,7</xref>
        ].
      </p>
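<p>The codeword-repetition encoding above can be sketched in a few lines of Python (an illustration of ours; the toy inverted index mimics what a text engine builds internally from the surrogate texts):</p>

```python
from collections import defaultdict

def to_surrogate_text(tf_vector, codewords):
    """Repeat the i-th codeword tf_vector[i] times; zero entries vanish."""
    return " ".join(w for w, tf in zip(codewords, tf_vector) for _ in range(tf))

def build_inverted_index(docs):
    """Toy inverted index: one posting list per codeword, as a text engine would build."""
    index = defaultdict(list)  # codeword -> [(doc_id, term_frequency)]
    for doc_id, text in enumerate(docs):
        counts = defaultdict(int)
        for w in text.split():
            counts[w] += 1
        for w, tf in counts.items():
            index[w].append((doc_id, tf))
    return index

codewords = ["c0", "c1", "c2", "c3"]
doc = to_surrogate_text([2, 0, 1, 0], codewords)  # -> "c0 c0 c2"
```

<p>Note how the sparse integer vector [2, 0, 1, 0] yields a short text and touches only two posting lists: sparsity directly translates into fewer and shorter posting lists to scan at query time.</p>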
      <p>
        The idea behind our Scalar Quantization approach is to map the real-valued
vector components independently into a smaller set of integer values, which act as
the term frequencies of a predefined set of codewords. The first step is applying
a transformation to the vectors that helps prevent the presence of unbalanced
posting lists in the inverted file (which is important for the efficiency of inverted indexes).
To understand why, note that each component of the vectors is associated with
a posting list storing the id of the vector and the value of the component itself, if
nonzero. Therefore, if on average some component is nonzero for many data
vectors, then the corresponding posting list will be accessed many times, provided
that the queries follow the same distribution as the data. The ideal case occurs
when the components share exactly the same distribution (same mean and
variance is sufficient). To this end, we apply a random orthogonal transformation to
the entire set of data vectors, which is known to provide good balancing for high-dimensional
vectors without the need to search for an optimal balancing
transformation [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. An important aspect of the orthogonal transformation is that
it preserves the ranking when we search using the kNN approach by ordering
the vectors on the basis of their Euclidean distance to the query. Moreover, if we
apply the orthogonal transformation and the mean centering to all the data
objects, and just the orthogonal transformation to the query, we obtain an
order-preserving transformation with respect to the dot product (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for further
details). Thus, our preprocessing step is defined as:
v → R(v − μ)   (1)
q → R q   (2)
where R is a random orthogonal matrix and μ ∈ R^D is set to center the data
to zero mean. The next step is transforming the rotated vectors into term
frequency vectors. We do it by quantizing the vectors so that posting entries will
contain numeric values proportional to the float values of the deep feature
entries. Specifically, we use the transformation w → ⌊sw⌋, where ⌊·⌋ denotes the
floor function and s is a multiplication factor s &gt; 1.
      </p>
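<p>A minimal NumPy sketch of this preprocessing and quantization step (our own illustration; random_orthogonal and preprocess are hypothetical names) draws R from the QR decomposition of a Gaussian matrix, a standard way to sample a random orthogonal matrix:</p>

```python
import numpy as np

def random_orthogonal(D, seed=0):
    """Random orthogonal matrix R via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
    return Q

def preprocess(data):
    """Eq. (1)-(2): data vectors are centered and rotated, queries only rotated."""
    R = random_orthogonal(data.shape[1])
    mu = data.mean(axis=0)  # centering vector
    return R, mu            # data: v -> R @ (v - mu); query: q -> R @ q

def quantize(w, s=100.0):
    """w -> floor(s * w): rotated floats become integer 'term frequencies' (s > 1)."""
    return np.floor(s * np.asarray(w)).astype(int)
```

<p>Since R is orthogonal, (R(v − μ))·(Rq) = (v − μ)·q = v·q − μ·q, and the subtracted term is the same for every data vector v, so the ranking with respect to the dot product is unchanged.</p>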
      <p>The approach presented so far is intended to encode a vector of real numbers
into a vector of integers, preserving as much as possible the order with respect
to the dot product. However, this approach does not solve the problem that in
most cases these vectors are dense, which leads to low efficiency when using
inverted files to index textual documents. To sparsify the term frequency vectors,
that is, to discard their less significant components, we must accept a further
loss in precision. To achieve this, we propose to keep the components above a
certain threshold 1/γ and to zero the others. The parameter γ ∈ N controls the
sparseness of the thresholded feature. This approach is optimal when we have
many components near or equal to zero; thus, we exploit the previously defined
transformation (Eq. 1) to center the mean values of each dimension to zero.</p>
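<p>The thresholding step can be sketched as follows (an illustration under our reconstruction of the threshold 1/γ; the name sparsify is hypothetical):</p>

```python
import numpy as np

def sparsify(x, gamma=1):
    """g_gamma: keep components strictly above the threshold 1/gamma, zero the rest."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 1.0 / gamma, x, 0.0)
```

<p>Smaller γ means a higher threshold and hence a sparser, cheaper-to-index vector, at the cost of discarding more components.</p>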
      <p>
        To sum up, our proposed transformation is f : v ↦ g_γ(⌊sR(v − μ)⌋), where g_γ
is a component-wise thresholding function, i.e., g_γ(x) = x if x &gt; 1/γ, and 0 otherwise.
Dealing with negative values. In order to index R-MAC features and represent
them in the vector space model of text retrieval, we encode each dimension of the
R-MAC features as a different codeword, and we use the TF field to represent
a single value of our feature vector. However, TF must be positive (most search
engines admit positive-only TFs, even though negative TFs would in principle be possible);
nonetheless, both negative and positive elements contribute to informativeness.
In the scalar quantization approach presented above, the negative values are
flattened to zero. Naive techniques, such as taking the absolute value, result in
degraded performance due to, respectively, loss or aliasing of information. In order
to prevent this imbalance towards positive activations at the expense of negative
ones, we use the Concatenated Rectified Linear Unit (CReLU) transformation
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. It simply makes an identical copy of the vector elements, negates it, concatenates
the original vector and its negation, and then applies ReLU to the result. More
precisely, the CReLU of the vector v is defined as v+ = ReLU([v; −v]), where
ReLU(·) = max(·, 0) is applied element-wise. After applying CReLU, we
apply the transformation f to v+ as described in the previous section.
      </p>
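<p>The CReLU transformation is a one-liner in NumPy (our own sketch of the definition above):</p>

```python
import numpy as np

def crelu(v):
    """CReLU: concatenate v and -v, then apply ReLU element-wise.
    Negative activations survive (sign-flipped) in the second half,
    so clipping to non-negative values loses no information."""
    v = np.asarray(v, dtype=float)
    return np.maximum(np.concatenate([v, -v]), 0.0)
```

<p>The dimensionality doubles, but every component of v is recoverable: the first half holds the positive parts, the second half the magnitudes of the negative parts.</p>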
    </sec>
    <sec id="sec-4">
      <title>Experimental Evaluation</title>
      <p>
        To assess the performance of our Scalar Quantization technique in the content-based
image retrieval task, we performed an extensive experimental evaluation on two
instance-level retrieval benchmarks. A complete presentation of the results is
given in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; here we report some of the most relevant results.
      </p>
      <p>
        The experiments were conducted on two benchmarks. INRIA Holidays [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
is a collection of 1,491 images representing a large variety of scene types, and 500
queries for which result lists are provided. Usually, this benchmark is extended
with a distractor dataset, namely MIRFlickr1M1, that contains 1M images.
Oxford Buildings [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is composed of 5,062 images of 11 Oxford landmarks
downloaded from Flickr. A manually labeled ground truth is available for five queries
for each landmark, for a total of 55 queries. As for INRIA Holidays, we merged
the dataset with the distractor dataset Flickr100k, which includes 100k images2.
      </p>
      <p>
        We used the ResNet-101 trained model as an R-MAC feature extractor, which
has been shown to achieve the best performance on standard benchmarks of
instance-level image retrieval. We extracted the R-MAC features using fixed
grid regions at two different scales, as proposed in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Then, we produced the
sparse representations using different sparsification thresholds γ.
      </p>
      <p>
        We compared the performance of our approach with FAISS3, which includes
state-of-the-art approximate nearest neighbor algorithms based on PQ [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Note
that PQ constitutes a suitable competitor to our approach, since it is a single-stage
inverted-file-based index, like the one used in standard textual search engines.
For a fair comparison, we used the configuration for FAISS that gives the best
effectiveness-efficiency trade-off for each dataset (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for further details).
PQ-based methods need a training set and an offline training phase to initialize
the inverted index. We examined two possible scenarios, dubbed IVFPQ and
IVFPQ*. In the former, the index is trained on the data to be indexed, while in
the latter, a set of unrelated images (the T4SA dataset4) is used as training set.
1 http://press.liacs.nl/mirflickr/
2 http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/
3 https://github.com/facebookresearch/faiss
4 http://t4sa.it
      </p>
      <p>[Figure 1: plots of mAP vs Query Selectivity SDB; panel title: INRIA Holidays + MIRFlickr1M]</p>
      <p>Fig. 1. Effectiveness (mAP) vs efficiency (SDB, the fraction of the dataset accessed)
trade-offs of Scalar Quantization based (SQ) and Product Quantization based (IVFPQ)
methods. Curves are produced varying γ for SQ (reported near each point) and the number
of accessed lists (nprobe) for IVFPQ. Brute-force represents the sequential scan
baseline. IVFPQ* represents IVFPQ trained on out-of-distribution images.</p>
      <p>To assess the quality of the results, we used the mean Average Precision
(mAP), which is a standard evaluation measure in information retrieval. It is
defined as the mean of the average precision scores over a set of queries, where
the average precision equals the area under the precision-recall curve. As we put
our work in the context of large-scale image search, in the experiments we report
the mAP as a function of the Query Selectivity SDB, i.e., the average fraction of
the database accessed per query, on both the considered datasets (Figure 1). Each
line is obtained by varying the most effective parameter (γ for SQ, the number
of accessed lists nprobe for PQ). Well-trained PQ-based indexes perform best, but
SQ-based methods provide a slightly degraded off-the-shelf performance without
the need for any initial training set, training phase, or specialized index structure,
which instead highly influences IVFPQ. Moreover, the CReLU transformation
consistently boosts performance over plain SQ in all regimes.</p>
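<p>For reference, the mAP computation over ranked result lists can be sketched as follows (our own illustration; this common AP variant averages precision at the ranks of the relevant results, assuming all relevant items appear in the ranked list):</p>

```python
def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k over the ranks k where a
    relevant item appears (1 = relevant, 0 = not relevant)."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """mAP: average of the per-query AP scores."""
    return sum(average_precision(r) for r in runs) / len(runs)
```

<p>For example, a ranked list with relevance pattern [1, 0, 1] gives AP = (1/1 + 2/3) / 2 = 5/6.</p>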
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>This paper presented a simple and effective methodology to index and retrieve
deep features without the need for a time-consuming codebook learning step. Our
approach relies on transforming the deep features into text encodings, which can
be subsequently indexed and searched using off-the-shelf text search engines. An
important aspect is that our encoding technique is completely independent of
the technology used for indexing. We can rely on the latest text search engine
technologies, without having to worry about implementation issues
such as software maintenance, updates to new hardware technologies,
bugs, etc. Furthermore, with our approach, it is possible to include in the image
records, in addition to the visual features (which are in textual form), other
information such as text metadata, geotags, etc.</p>
      <p>Acknowledgements The work was partially supported by Smart News (CUP CIPE
D58C15000270008), VISECH, ARCO-CNR (CUP B56J17001330004), ADA (CUP CIPE
D55F17000290009), and the AI4EU project (funded by the EC, H2020 - Contract n.
825619). We gratefully acknowledge the support of NVIDIA Corporation with the
donation of the Tesla K40 GPU used for this research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolettieri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gennaro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabitti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Combining local and global visual feature similarity using a text search engine</article-title>
          .
          <source>In: Proceedings of the 2011 9th International Workshop on Content-Based Multimedia Indexing (CBMI)</source>
          . pp.
          <volume>49</volume>
          –
          <issue>54</issue>
          (June
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carrara</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gennaro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vadicamo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Large-scale instance-level image retrieval</article-title>
          .
          <source>Information Processing &amp;</source>
          Management p.
          <volume>102100</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gennaro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vadicamo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep Permutations: Deep convolutional neural networks and permutation-based indexing</article-title>
          .
          <source>In: Proceedings of the 9th International Conference on Similarity Search and Applications</source>
          . pp.
          <volume>93</volume>
          –
          <fpage>106</fpage>
          .
          <source>SISAP</source>
          <year>2016</year>
          , LNCS, Springer International Publishing (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Arandjelovic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gronat</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torii</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pajdla</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sivic</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>NetVLAD: CNN architecture for weakly supervised place recognition</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>5297</volume>
          –
          <fpage>5307</fpage>
          .
          <source>CVPR</source>
          <year>2016</year>
          , IEEE (
          <year>June 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Babenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slesarev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chigorin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lempitsky</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Neural codes for image retrieval</article-title>
          .
          <source>In: Proceedings of 13th European Conference on Computer Vision</source>
          . pp.
          <volume>584</volume>
          –
          <fpage>599</fpage>
          . ECCV 2014, Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoffman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tzeng</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>DeCAF: A deep convolutional activation feature for generic visual recognition</article-title>
          .
          <source>CoRR abs/1310</source>
          .1531 (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gennaro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolettieri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savino</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>An approach to content-based image retrieval based on the lucene search engine library</article-title>
          .
          <source>In: Proceedings of the International Conference on Theory and Practice of Digital Libraries</source>
          . pp.
          <volume>55</volume>
          –
          <fpage>66</fpage>
          . TPDL 2010, Springer Berlin Heidelberg (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>580</volume>
          –
          <fpage>587</fpage>
          .
          <source>CVPR</source>
          <year>2014</year>
          , IEEE (
          <year>June 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gordo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almazan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Revaud</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larlus</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>End-to-end learning of deep visual representations for image retrieval</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>124</volume>
          (
          <issue>2</issue>
          ),
          <volume>237</volume>
          –254 (Sep
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Hamming embedding and weak geometric consistency for large scale image search</article-title>
          .
          <source>In: Proceedings of the European Conference on Computer Vision</source>
          , ECCV
          <year>2008</year>
          ,
          <article-title>LNCS</article-title>
          , vol.
          <volume>5302</volume>
          , pp.
          <fpage>304</fpage>
          –
          <lpage>317</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Product quantization for nearest neighbor search</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>33</volume>
          (
          <issue>1</issue>
          ),
          <fpage>117</fpage>
          –
          <lpage>128</lpage>
          (Jan
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Aggregating local descriptors into a compact image representation</article-title>
          .
          <source>In: IEEE Conference on Computer Vision &amp; Pattern Recognition</source>
          . pp.
          <fpage>3304</fpage>
          –
          <lpage>3311</lpage>
          (Jun
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Indexing of the CNN features for the large scale image search</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          <volume>77</volume>
          (
          <issue>24</issue>
          ),
          <fpage>32107</fpage>
          –
          <lpage>32131</lpage>
          (Dec
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Mohedano</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>N.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salvador</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marques</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giró-i-Nieto</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Bags of local convolutional features for scalable instance search</article-title>
          .
          <source>In: Proceedings of the ACM International Conference on Multimedia Retrieval</source>
          . pp.
          <fpage>327</fpage>
          –
          <lpage>331</lpage>
          .
          <source>ICMR</source>
          <year>2016</year>
          , ACM (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Philbin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chum</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sivic</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Object retrieval with large vocabularies and fast spatial matching</article-title>
          .
          <source>In: Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          .
          <source>CVPR</source>
          <year>2007</year>
          , IEEE (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGill</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <source>Introduction to Modern Information Retrieval</source>
          . McGraw-Hill, Inc., New York, NY, USA (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Shang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almeida</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Understanding and improving convolutional neural networks via concatenated rectified linear units</article-title>
          .
          <source>In: Proceedings of the 33rd International Conference on Machine Learning. ICML 2016</source>
          , vol.
          <volume>48</volume>
          , pp.
          <fpage>2217</fpage>
          –
          <lpage>2225</lpage>
          . JMLR.org (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Tolias</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sicre</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Particular object retrieval with integral maxpooling of CNN activations</article-title>
          .
          <source>CoRR abs/1511.05879</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Yue-Hei Ng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>L.S.</given-names>
          </string-name>
          :
          <article-title>Exploiting local features from deep networks for image retrieval</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops</source>
          . pp.
          <fpage>53</fpage>
          –
          <lpage>61</lpage>
          .
          <source>CVPRW</source>
          <year>2015</year>
          , IEEE (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>