<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Features Selection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Amato</string-name>
          <email>giuseppe.amato@isti.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
          <email>fabrizio.falchi@isti.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Gennaro</string-name>
          <email>claudio.gennaro@isti.cnr.it</email>
        </contrib>
        <aff>ISTI-CNR, via G. Moruzzi, Italy</aff>
      </contrib-group>
      <abstract>
        <p>State-of-the-art algorithms for large-scale visual content recognition and content-based similarity search today use the "Bag of Features" (BoF) or "Bag of Words" (BoW) approach. The idea, borrowed from text retrieval, enables the use of inverted files. A very well known issue with the BoF approach is that the query images, as well as the stored data, are described with thousands of words. This poses obvious efficiency problems when using inverted files to perform efficient image matching. In this paper, we propose and compare various techniques to reduce the number of words describing an image in order to improve efficiency.</p>
      </abstract>
      <kwd-group>
        <kwd>bag of features</kwd>
        <kwd>bag of words</kwd>
        <kwd>local features</kwd>
        <kwd>content based image retrieval</kwd>
        <kwd>landmark recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>During the last decade, the use of local features, such as SIFT [Lowe, 2004],
has gained increasing appreciation for its good performance in tasks of
image matching, object recognition, landmark recognition, and image classification.
The total number of local features extracted from an image depends on its visual
content and size. However, the average number of features extracted from an
image is in the order of thousands. The BoF approach [Sivic and Zisserman, 2003]
quantizes the local features extracted from an image, representing each of them with the
closest element chosen from a fixed visual vocabulary of local features (visual
words). Matching of images represented with the BoF approach is performed
with traditional text retrieval techniques.</p>
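      <p>As a concrete illustration of this quantization step, the following is a minimal sketch (not taken from the paper): given a matrix of local descriptors and a matrix of vocabulary centroids, each descriptor is replaced by the ID of its nearest visual word. All names and shapes are illustrative assumptions.</p>
      <preformat>
import numpy as np

def quantize_to_visual_words(descriptors, vocabulary):
    """Map each local descriptor to the ID of the nearest visual word.

    descriptors: (n, d) array of local features (e.g. 128-D SIFT vectors)
    vocabulary:  (k, d) array of cluster centroids (the visual words)
    Returns an (n,) array of word IDs, i.e. the BoF representation of the image.
    """
    # Squared Euclidean distance between every descriptor and every visual word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Toy example: 3 descriptors quantized against a vocabulary of 5 words.
rng = np.random.default_rng(0)
print(quantize_to_visual_words(rng.random((3, 128)), rng.random((5, 128))))
      </preformat>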
      <p>However, a query image is associated with thousands of visual words.
Therefore, the search algorithm on inverted files has to access thousands of different
posting lists. As mentioned in [Zhang et al., 2009], "a fundamental difference
between an image query (e.g. 1500 visual terms) and a text query is largely ignored in existing index
design. This difference makes the inverted list inappropriate to index images."
From the very beginning [Sivic and Zisserman, 2003], some word-reduction
techniques were used (e.g. removing the 10% most frequent visual words).</p>
      <p>To improve efficiency, many different approaches have been considered,
including GIST descriptors [Douze et al., 2009], the Fisher kernel [Zhang et al., 2009],
and the Vector of Locally Aggregated Descriptors (VLAD) [Jegou et al., 2010].
However, their use does not allow relying on a traditional text search engine, which
has actually been another benefit of the BoF approach.</p>
      <p>In order to mitigate the above problems, this paper proposes, discusses, and
evaluates some methods to reduce the number of visual words assigned to images.
This paper is a summary of a longer paper that will be presented at VISAPP 2013
[Amato et al., 2013].</p>
    </sec>
    <sec id="sec-2">
      <title>PROPOSED APPROACH</title>
      <p>The goal of the BoF approach is to substitute each description of the region
around an interest point (i.e., each local feature) of an image with a visual
word obtained from a predefined vocabulary, in order to apply traditional text
retrieval techniques to content-based image retrieval. At the end of the process,
each image is described as a set of visual words. The retrieval phase is then
performed using text retrieval techniques, treating a query image as a
disjunctive text query. Typically, the cosine similarity measure in conjunction with a
term weighting scheme is adopted for evaluating the similarity between any two
images.</p>
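      <p>A minimal sketch of the weighting and matching just described, assuming a standard tf*idf variant (the paper does not fix a specific weighting formula at this point): each image becomes a sparse vector over the vocabulary, and two images are compared with the cosine measure. All function and variable names are illustrative.</p>
      <preformat>
import numpy as np
from collections import Counter

def tfidf_vector(word_ids, idf, vocab_size):
    """Build the tf*idf-weighted vector of one image from its visual-word IDs."""
    v = np.zeros(vocab_size)
    for w, tf in Counter(word_ids).items():
        v[w] = tf * idf[w]                      # assumed weighting: raw tf times idf
    return v

def cosine(a, b):
    """Cosine similarity between two weighted word vectors."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

# Toy vocabulary of 6 words with precomputed idf values.
idf = np.array([0.1, 1.2, 0.8, 2.0, 1.5, 0.3])
query = tfidf_vector([1, 1, 3], idf, 6)         # visual words of the query image
image = tfidf_vector([1, 3, 3, 5], idf, 6)      # visual words of a database image
print(cosine(query, image))
      </preformat>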
      <p>The proposed word-reduction criteria are: random, scale, tf, idf, and tf*idf. Each
proposed criterion is based on the definition of a score that allows us to assign
to each local feature or word describing an image an estimate of its importance.
Thus, local features or words can be ordered and only the most important ones
retained. The percentage of information to discard is configurable through
a score threshold, allowing a trade-off between efficiency and effectiveness. The
random criterion is used as a baseline: it assigns a random score to each feature.
The scale criterion is based on the size of the region from
which the local feature was extracted: the larger the region, the higher the score.</p>
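      <p>The scoring-and-thresholding idea can be sketched as follows; the keep_fraction parameter and the use of region size as the scale score are illustrative assumptions, not values taken from the paper.</p>
      <preformat>
import numpy as np

def reduce_features(features, scores, keep_fraction=0.25):
    """Keep only the highest-scoring fraction of an image's local features.

    features: (n, d) array of descriptors (or an array of word IDs)
    scores:   (n,) importance score per feature (scale, tf, idf, tf*idf, or random)
    """
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    top = np.argsort(scores)[::-1][:n_keep]     # indices of the most important features
    return features[top]

rng = np.random.default_rng(0)
feats = rng.random((1000, 128))
region_sizes = rng.random(1000)                 # stand-in for the detected region sizes

reduced_scale = reduce_features(feats, region_sizes)        # scale criterion
reduced_random = reduce_features(feats, rng.random(1000))   # random baseline
print(reduced_scale.shape, reduced_random.shape)
      </preformat>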
      <p>The retrieval engine used in the experiments is built as follows:
1. For each image in the dataset, the SIFT local features are extracted from the
identified regions around interest points.
2. A vocabulary of words is selected among all the local features using the
k-means algorithm.
3. The random or scale reduction technique is performed (if requested).
4. Each image is described following the BoF approach, i.e., each local feature is replaced
by the ID of the nearest word in the vocabulary.
5. The tf, idf, or tf*idf reduction technique is performed (if requested).
6. Each image of the test set is used as a query for searching in the training
set. The similarity measure adopted for comparing two images is the cosine
between the query vector and the image vectors corresponding to the sets
of words assigned to the images. The weights assigned to the words of the
vectors are calculated using the tf*idf measure.
7. If the system is requested to identify the content of the image, the
landmark of the most similar image in the dataset (which is labeled) is
assigned to the query image.</p>
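      <p>A compressed sketch of this pipeline (steps 2, 4, 6 and 7, without the reduction steps) is given below, assuming the local descriptors of step 1 are already available; scikit-learn's KMeans stands in for the vocabulary construction, and random vectors replace real SIFT descriptors so that the example is self-contained. Everything here is illustrative rather than the exact setup used in the experiments.</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Stand-ins for step 1: per-image SIFT descriptors (random 128-D vectors here).
train_descs = [rng.random((200, 128)) for _ in range(20)]
train_labels = [f"landmark_{i % 4}" for i in range(20)]   # hypothetical landmark labels
query_descs = rng.random((180, 128))

# Step 2: build the visual vocabulary with k-means (tiny k for the example).
k = 50
vocab = KMeans(n_clusters=k, n_init=4, random_state=0).fit(np.vstack(train_descs))

def bof_histogram(descs):
    """Step 4: assign each descriptor to its nearest word and count occurrences."""
    return np.bincount(vocab.predict(descs), minlength=k).astype(float)

train_tf = np.array([bof_histogram(d) for d in train_descs])

# Step 6: tf*idf weighting (standard variant, assumed) and cosine similarity.
idf = np.log(len(train_tf) / (1.0 + (train_tf > 0).sum(axis=0)))
sims = cosine_similarity((bof_histogram(query_descs) * idf)[None, :], train_tf * idf)[0]

# Step 7: assign the landmark label of the most similar training image.
print("predicted landmark:", train_labels[int(sims.argmax())])
      </preformat>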
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTAL RESULTS</title>
      <p>The quality of the retrieved images is typically evaluated by means of precision
and recall measures. As in many other papers, we combined this information by
means of the mean Average Precision (mAP), which represents the area below
the precision-recall curve.</p>
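      <p>For reference, a minimal sketch of how average precision and mAP can be computed from a ranked result list; this is the usual interpolation-free formulation and is not code from the paper.</p>
      <preformat>
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one query: mean of the precision values measured
    at each rank where a relevant image is retrieved."""
    relevant_ids = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example with two queries.
print(mean_average_precision([
    (["a", "b", "c", "d"], {"a", "c"}),   # AP = (1/1 + 2/3) / 2
    (["d", "b", "a", "c"], {"b"}),        # AP = 1/2
]))
      </preformat>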
      <p>For evaluating the performance of the various reduction techniques,
we use the Oxford Buildings dataset, which was presented in [Philbin et al., 2007]
and has been used in many other papers. The dataset consists of 5,062 images of
11 buildings in Oxford. The ground truth consists of 55 queries and related sets
of results divided into best, correct, ambiguous, and not relevant. The vocabulary
used has one million words.</p>
      <p>We first report the results obtained in a content-based image retrieval scenario
on the Oxford Buildings dataset, using the ground truth given by the authors
[Philbin et al., 2007]. In Figure 1 we report the mAP obtained. On the x-axis
we report the average number of words per image after the reduction. Note
that the x-axis is logarithmic. We first note that all the reduction techniques
significantly outperform the naive random approach and that both the idf and scale
approaches are able to achieve very good mAP results (about 0.5) while reducing
the average number of words per image from 3,200 to 800. Thus, by keeping just
25% of the most relevant words, we achieve 80% of the effectiveness. The
comparison between the idf and scale approaches reveals that scale is preferable
for reductions down to 500 words per image. Thus, it seems very important to
discard small regions of interest down to this level of reduction.</p>
      <p>While the average number of words is useful to describe the length of the
image description, it is actually the number of distinct words per image that has
more impact on the efficiency of searching using an inverted index. Thus, in Figure
2 we report the mAP with respect to the average number of distinct words. In this
case the results obtained by tf*idf and tf are very similar to the ones obtained
by idf. In fact, considering tf in the reduction results in a smaller number of
average distinct words per image for the same values of average number of words.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[Amato et al., <year>2013</year>] <string-name><surname>Amato</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Falchi</surname>, <given-names>F.</given-names></string-name>, and <string-name><surname>Gennaro</surname>, <given-names>C.</given-names></string-name> (2013). <article-title>On reducing the number of visual words in the bag-of-features representation</article-title>. <source>Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP 2013)</source>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[Douze et al., <year>2009</year>] <string-name><surname>Douze</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Jegou</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Sandhawalia</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Amsaleg</surname>, <given-names>L.</given-names></string-name>, and <string-name><surname>Schmid</surname>, <given-names>C.</given-names></string-name> (2009). <article-title>Evaluation of GIST descriptors for web-scale image search</article-title>. <source>Proceedings of the ACM International Conference on Image and Video Retrieval, CIVR '09</source>, pages 19:1-19:8, New York, NY, USA. ACM.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[Jegou et al., <year>2010</year>] <string-name><surname>Jegou</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Douze</surname>, <given-names>M.</given-names></string-name>, and <string-name><surname>Schmid</surname>, <given-names>C.</given-names></string-name> (2010). <article-title>Improving bag-of-features for large scale image search</article-title>. <source>Int. J. Comput. Vision</source>, <volume>87</volume>:<fpage>316</fpage>-<lpage>336</lpage>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[Lowe, <year>2004</year>] <string-name><surname>Lowe</surname>, <given-names>D. G.</given-names></string-name> (2004). <article-title>Distinctive image features from scale-invariant keypoints</article-title>. <source>International Journal of Computer Vision</source>, <volume>60</volume>(<issue>2</issue>):<fpage>91</fpage>-<lpage>110</lpage>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[Philbin et al., <year>2007</year>] <string-name><surname>Philbin</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Chum</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Isard</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Sivic</surname>, <given-names>J.</given-names></string-name>, and <string-name><surname>Zisserman</surname>, <given-names>A.</given-names></string-name> (2007). <article-title>Object retrieval with large vocabularies and fast spatial matching</article-title>. <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[Sivic and Zisserman, <year>2003</year>] <string-name><surname>Sivic</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Zisserman</surname>, <given-names>A.</given-names></string-name> (2003). <article-title>Video Google: A text retrieval approach to object matching in videos</article-title>. <source>Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2, ICCV '03</source>, pages <fpage>1470</fpage>-, Washington, DC, USA. IEEE Computer Society.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[Zhang et al., <year>2009</year>] <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>W.-Y.</given-names></string-name>, and <string-name><surname>Shum</surname>, <given-names>H.-Y.</given-names></string-name> (2009). <article-title>Efficient indexing for large scale visual search</article-title>. <source>Computer Vision, 2009 IEEE 12th International Conference on</source>, pages <fpage>1103</fpage>-<lpage>1110</lpage>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>