<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Fisher Vector and Convolutional Neural Networks for Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Amato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fausto Rabitti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Vadicamo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISTI-CNR</institution>
          ,
          <addr-line>via G. Moruzzi 1, Pisa 56124</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Fisher Vector (FV) and deep Convolutional Neural Networks (CNNs) are two popular approaches for extracting effective image representations. FV aggregates local information (e.g., SIFT) and was the state of the art before the recent success of deep learning approaches. Recently, the combination of FV and CNN has been investigated. However, only the aggregation of SIFT has been tested. In this work, we propose combining CNN with a FV built upon binary local features, called BMM-FV. The results show that the combination of BMM-FV and CNN improves the retrieval performance of the latter with less computational effort than the traditional FV, which relies on non-binary features.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Convolutional neural networks (CNNs) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have attracted enormous interest within the
research community because of the state-of-the-art results achieved in several
domains, such as image classification, image retrieval, object recognition, and speech
recognition, to cite a few. Several works [
        <xref ref-type="bibr" rid="ref2 ref3 ref8">8, 3, 2</xref>
        ] have shown that the outputs
of the intermediate layers of a CNN can be effectively used as high-level image
descriptors. These CNN features have topped the already high results achieved
by other image descriptors, such as Fisher Vectors (FVs) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. FV and CNN
features capture different aspects of the image visual content and have different
strengths. For example, CNN features achieve very high effectiveness but have
a limited level of rotation invariance. The FV, instead, is robust to rotations but
seems to be more affected by small scale changes than CNN [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In order to leverage the positive aspects of both these methods, Chandrasekhar
et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have proposed a fusion of FV and CNN features. Their results show
that FV can help improve the already high effectiveness of CNN features.
However, the cost of extracting SIFT can be considered too high with respect to
the small increase in retrieval quality. ORB binary local features, whose
extraction is typically two orders of magnitude faster than SIFT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], have been
used in computer vision whenever high efficiency is needed. In this work, we
tested a FV built upon ORB [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] binary local features in conjunction with a CNN
feature. Our results show that while FVs of binary local features are less effective
than the ones obtained from SIFT, when used in combination with a CNN feature
their effectiveness is comparable. Thus, the proposed combination of CNN and
BMM-FV results in a profitable solution for both efficiency and effectiveness.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background on Image Representations</title>
      <p>
        State-of-the-art image representations are mainly based on the use of local
features, which are mathematical representations of the local structure of images. To
date, the most used and cited local feature is SIFT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which allows correct matches between images to be found effectively.
However, SIFT extraction is costly due to
the computation of local image gradients. Recently, the cost of extracting, representing,
and comparing local visual descriptors has been dramatically reduced by the
introduction of binary local features, such as ORB [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Even if binary local descriptors are more compact and faster to compute than non-binary
ones, each image is still represented by thousands of local descriptors, making
it difficult to scale up the search to large digital archives. Encoding techniques
such as Fisher Vectors (FVs) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] allow a more compact image
representation to be obtained. The FV approach transforms an incoming set of local descriptors into a
fixed-size vector representation that describes how the sample of descriptors
deviates from a "probabilistic visual vocabulary", which is usually modeled by a
Gaussian Mixture Model (GMM). In this work we want to encode local binary
descriptors, and so we used a Bernoulli Mixture Model (BMM), which describes
binary outcomes better than a GMM. The resulting image representation is referred
to as BMM-FV.
      </p>
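      <p>For illustration, the following is a minimal sketch of how a BMM-FV can be computed from a set of binary descriptors, considering only the gradient with respect to the Bernoulli means and the standard Fisher normalization; the full BMM-FV formulation may include additional components (e.g., gradients with respect to the mixture weights). With K mixtures and D-bit descriptors, the vector has K·D dimensions (e.g., K = 128 and D = 256 give 32,768).</p>
      <preformat>
import numpy as np

def bmm_fisher_vector(X, w, mu, eps=1e-8):
    """Sketch of a BMM-FV: gradient w.r.t. the Bernoulli means only.
    X:  T x D matrix of binary descriptors (0/1), e.g. unpacked ORB bits.
    w:  K mixture weights; mu: K x D Bernoulli parameters (learned off-line)."""
    T = X.shape[0]
    # log-likelihood of each descriptor under each Bernoulli component (T x K)
    log_p = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T
    log_p += np.log(w + eps)
    # posterior probabilities (soft assignments) via a softmax over components
    g = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)
    # accumulate the normalized mean gradients (K x D), then flatten
    grad = (g.T @ X - g.sum(axis=0)[:, None] * mu) / (np.sqrt(mu * (1 - mu)) + eps)
    fv = (grad / (T * np.sqrt(w)[:, None])).ravel()
    return fv / (np.linalg.norm(fv) + eps)  # final L2 normalization
      </preformat>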
      <p>
        Recently, a new class of image descriptors built upon deep Convolutional
Neural Networks (CNNs) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] has been used as an effective alternative to
descriptors built upon local features. In particular, it has been proven that the activations
produced by an image within the intermediate layers of a CNN can be used
as a high-level descriptor. To improve retrieval results and balance the lack of
geometric invariance of CNNs, in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] a fusion of FV and CNN features has
been proposed.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>In the following we evaluate the performance of the combination of BMM-FV and
CNN features for image retrieval tasks.</p>
      <p>
        Experimental Setup. The evaluation was performed on the public INRIA
Holidays dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is a benchmark for image retrieval. It contains 1,491
images, 500 of them being used as queries. All the learning stages were
performed off-line using the independent Flickr60k dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The retrieval
performance was measured by the mean average precision (mAP), with the query
image removed from the ranking list.
      </p>
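      <p>For clarity, the following is a minimal sketch of how mAP can be computed under this protocol; the ranking and ground-truth structures are hypothetical placeholders, not artifacts of our library.</p>
      <preformat>
import numpy as np

def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of the precision values at each relevant hit."""
    hits, prec_sum = 0, 0.0
    for rank, img_id in enumerate(ranked_ids, start=1):
        if img_id in relevant_ids:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(rankings, ground_truth):
    """rankings: {query_id: ranked db ids}; ground_truth: {query_id: relevant ids}."""
    aps = []
    for q, ranked in rankings.items():
        ranked = [i for i in ranked if i != q]  # remove the query image itself
        aps.append(average_precision(ranked, ground_truth[q]))
    return float(np.mean(aps))
      </preformat>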
      <p>The ORB descriptors were extracted using OpenCV (http://opencv.org/). The BMM-FVs
were computed on the ORB binary features using our Visual Information
Retrieval library, which is publicly available on GitHub
(https://github.com/ffalchi/it.cnr.isti.vir).</p>
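      <p>As an illustration, ORB extraction with OpenCV reduces to a few calls; the file name and keypoint budget below are arbitrary choices for the sketch, not the settings of our experiments.</p>
      <preformat>
import cv2

# Load an image in grayscale (hypothetical file name).
img = cv2.imread("holidays_example.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute ORB binary descriptors.
orb = cv2.ORB_create(nfeatures=1000)
keypoints, descriptors = orb.detectAndCompute(img, None)

# descriptors is an N x 32 uint8 array, i.e. N 256-bit binary descriptors.
      </preformat>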
      <p>[Figure 1: mAP of the combination of BMM-FV and HybridNet fc6 as a function of the weight α, for BMM-FV with K = 32 (8,192 dim), K = 64 (16,384 dim), and K = 128 (32,768 dim).]</p>
      <p>
        The CNN features were computed using the pre-trained HybridNet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], publicly
available on the Caffe Model Zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo). We used Caffe to extract the output of the
first fully-connected layer (fc6). The resulting 4,096-dimensional descriptors
were L2-normalized.
      </p>
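      <p>A minimal sketch of this extraction step with the Caffe Python interface follows; the prototxt/caffemodel file names are placeholders, and mean subtraction is omitted for brevity.</p>
      <preformat>
import caffe
import numpy as np

# Load the pre-trained network (hypothetical file names).
net = caffe.Net("hybridnet_deploy.prototxt", "hybridnet.caffemodel", caffe.TEST)

# Standard Caffe preprocessing: HWC to CHW, RGB to BGR, [0,1] to [0,255].
transformer = caffe.io.Transformer({"data": net.blobs["data"].data.shape})
transformer.set_transpose("data", (2, 0, 1))
transformer.set_channel_swap("data", (2, 1, 0))
transformer.set_raw_scale("data", 255)

img = caffe.io.load_image("query.jpg")
net.blobs["data"].data[...] = transformer.preprocess("data", img)
net.forward()

fc6 = net.blobs["fc6"].data[0].copy()  # 4,096-dimensional activation
fc6 /= np.linalg.norm(fc6)             # L2 normalization
      </preformat>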
      <p>Combination of FV and CNN Features. We represented each image by a pair
(c, f), where c and f are, respectively, the CNN descriptor and the BMM-FV
of the image. Then, we evaluated the distance d between two pairs (c₁, f₁)
and (c₂, f₂) as the convex combination of the L2 distances between the CNN
descriptors and between the BMM-FV descriptors, i.e.</p>
      <p>d((c₁, f₁), (c₂, f₂)) = α ‖c₁ − c₂‖₂ + (1 − α) ‖f₁ − f₂‖₂     (1)</p>
      <p>with 0 ≤ α ≤ 1. Choosing α = 0 corresponds to using only the FV approach, while
α = 1 corresponds to using only the CNN features.</p>
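      <p>In code, equation (1) is a one-liner; the sketch below assumes the descriptors are already L2-normalized NumPy vectors.</p>
      <preformat>
import numpy as np

def combined_distance(c1, f1, c2, f2, alpha):
    """Convex combination (equation (1)) of the L2 distances between
    the CNN descriptors (c) and the BMM-FV descriptors (f)."""
    return alpha * np.linalg.norm(c1 - c2) + (1.0 - alpha) * np.linalg.norm(f1 - f2)
      </preformat>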
      <p>
        Results. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] it has been shown that combining the HybridNet fc6 with the FV
representation leads to a relative improvement of 4.9% in mAP with respect to
the use of the CNN feature alone. Specifically, they used a FV computed on
64-dimensional PCA-reduced SIFTs, using K = 256 mixtures of Gaussians, which
results in a 32,768-dimensional vector.
      </p>
      <p>
        In this work we propose to combine the CNN feature with the less expensive
BMM-FV built on ORB binary features. In Figure 1 we plot the mAP obtained for
three different numbers K of Bernoulli mixtures used in the BMM-FV
representation, namely K = 32, 64, 128. It is worth noting that all three BMM-FVs
improved the performance when combined with the HybridNet fc6, and that
there exists an optimal α to be used in the convex combination (equation (1)). When
using K = 64, the optimal α was obtained around 0.5, which corresponds to giving
the same importance to both the FV and the CNN feature. In this case the achieved
mAP was 79.2%, which corresponds to a relative improvement of 4.9% with respect
to the use of the CNN feature alone (whose mAP was 75.5%). The combination
with BMM-FV for K = 128 achieves the best effectiveness (mAP of 79.5%) for
α = 0.4. However, since the cost of computing and storing the FV increases with
the number K of Bernoullis, the improvement obtained using K = 128 with respect
to K = 64 is not worth the extra cost of using a bigger value of K.
Moreover, for K = 64 we obtain the same relative improvement as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] using a less
expensive FV representation that takes advantage of both the use of binary
local features and a smaller number K of mixtures.
      </p>
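      <p>In practice, the optimal α can be found with a simple grid search; in the sketch below, rank_all_queries is a hypothetical helper that ranks the dataset with the combined distance of equation (1), and mean_average_precision is as sketched above.</p>
      <preformat>
import numpy as np

best_alpha, best_map = None, -1.0
for alpha in np.arange(0.0, 1.01, 0.1):
    rankings = rank_all_queries(alpha)  # assumed helper using equation (1)
    m = mean_average_precision(rankings, ground_truth)
    if m > best_map:
        best_alpha, best_map = alpha, m
      </preformat>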
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>The retrieval performance of CNN features is improved when using information
provided by other image representations. In particular, the combination of CNN
and FV features has proved to be effective. However, the state-of-the-art FV is
generally computed using non-binary features, such as SIFT, whose extraction is
time-consuming. This paper shows that the more efficient BMM-FV, built upon ORB
features, can be profitably used for this purpose. In fact, the relative improvement
in retrieval performance obtained using the BMM-FV is similar to that
obtained using the more expensive FV built upon SIFT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 27, pp. 487–495. Curran Associates, Inc. (2014)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Chandrasekhar, V., Lin, J., Morere, O., Goh, H., Veillard, A.: A practical guide to CNNs and Fisher vectors for image instance retrieval. CoRR abs/1508.02496 (2015)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR abs/1310.1531 (2013)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Jegou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for large scale image search. In: European Conference on Computer Vision. LNCS, vol. I, pp. 304–317. Springer (Oct 2008)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (May 2015)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on. pp. 1–8 (June 2007)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on. pp. 512–519. IEEE (2014)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: An efficient alternative to SIFT or SURF. In: Computer Vision (ICCV), 2011 IEEE International Conference on. pp. 2564–2571 (Nov 2011)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>