<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ETH-CVL @ MediaEval 2015: Learning Objective Functions for Improved Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sai Srivatsa R</string-name>
          <email>saisrivatsan12@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Gygli</string-name>
          <email>gygli@vision.ee.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luc Van Gool</string-name>
          <email>vangool@vision.ee.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision Laboratory, ETH Zurich</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Technology</institution>,
          <addr-line>Kharagpur</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present a method to select a refined subset of images, given an initial list of retrieved images. The goal of any image retrieval system is to present results that are maximally relevant as well as diverse. We formulate this as a subset selection problem and address it using submodularity. In order to select the best subset, we learn an objective function as a linear combination of submodular functions. This objective quantifies how relevant and representative a selected subset is. Using this method, we obtain promising results at MediaEval 2015.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Image retrieval using text queries is a central topic in
multimedia retrieval. While early approaches relied solely on
the text associated with images, more recent approaches
combine textual and visual cues to return more relevant
results [
        <xref ref-type="bibr" rid="ref12 ref6">12, 6</xref>
        ]. Nonetheless, the search engines of photo-sharing
sites such as Flickr still retrieve results that are often
irrelevant and redundant. The MediaEval 2015 Retrieving
Diverse Social Images Task fosters research on improving the results
retrieved by Flickr. It asks the participants to develop
algorithms that refine a ranked list of photos retrieved from Flickr
using the photos' visual, textual and meta information. An
overview of the task is presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>We formulate the task of diversifying image retrieval
results as a subset selection problem. Given a set of retrieved
images $I = \{I_1, I_2, \ldots, I_n\}$ and a budget $B$, the task is to
find a subset $S \subseteq I$, $|S| = B$, such that $S$ is maximally
relevant as well as diverse. Such problems are usually solved
using a scoring function $F : 2^I \to \mathbb{R}$ that assigns a higher
score to diverse and relevant subsets. Letting $V$ be the power
set of $I$, we obtain the best subset $S^\ast$ by computing
$$S^\ast = \operatorname*{argmax}_{S \in V,\, |S| = B} F(S). \qquad (1)$$</p>
      <p>Evaluating the scores of all $2^n$ possible subsets is
intractable. We address this issue with submodularity.</p>
      <p>
        A set function $f(\cdot)$ is said to be submodular if
$$f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B), \qquad (2)$$
where $A \subseteq B \subseteq V \setminus \{v\}$ and $V$ is the ground set of elements [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Submodular functions naturally model properties such as
representativeness and relevance, as they exhibit a
diminishing-returns property.
      </p>
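      <p>To make the diminishing-returns property concrete, the following sketch (an illustration we add for exposition, not part of the original paper) checks Equation 2 for a simple coverage function, a classic monotone submodular function:</p>
      <preformat>
# Coverage: f(S) = number of distinct items covered by the sets
# indexed by S. Coverage functions are monotone submodular, so the
# marginal gain of an element can only shrink as the set grows.
def coverage(subsets, S):
    return len(set().union(*(subsets[i] for i in S))) if S else 0

subsets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}
A, B, v = {0}, {0, 1}, 2            # A is contained in B, v not in B
gain_A = coverage(subsets, A | {v}) - coverage(subsets, A)   # 5 - 2 = 3
gain_B = coverage(subsets, B | {v}) - coverage(subsets, B)   # 5 - 3 = 2
assert gain_A >= gain_B             # Equation 2 holds
      </preformat>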
      <p>
        If the scoring function is monotone submodular, we can
find a near-optimal solution to Equation 1 using greedy
submodular maximization methods [
        <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
        ]. A linear
combination of submodular functions with non-negative weights is
still submodular. Thus we define our scoring function as
$$F(S) = w^T f(S), \qquad (3)$$
where $f(S) = [f_1(S), f_2(S), \ldots, f_k(S)]^T$ are normalized
monotone submodular functions and $w \in \mathbb{R}^k_+$ is a weight vector.
We learn these weights with sub-gradient descent [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], using the implementation of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for submodular maximization and for learning the weights.
      </p>
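      <p>As an illustration of how Equation 1 is then solved (a sketch we add here; the authors rely on the toolbox of [<xref ref-type="bibr" rid="ref3">3</xref>]), the plain greedy algorithm of [<xref ref-type="bibr" rid="ref10">10</xref>] applied to the weighted mixture looks as follows:</p>
      <preformat>
import numpy as np

def greedy_maximize(fs, w, ground_set, budget):
    """Greedily maximize F(S) = w^T f(S) subject to |S| = budget.

    fs: list of monotone submodular functions mapping a set of image
    indices to a score; w: non-negative weights. For monotone
    submodular F this achieves a (1 - 1/e) approximation [10].
    """
    S = set()
    F = lambda T: float(np.dot(w, [f(T) for f in fs]))
    for _ in range(budget):
        # Add the element with the largest marginal gain F(S + v) - F(S).
        best = max(ground_set - S, key=lambda v: F(S | {v}) - F(S))
        S.add(best)
    return S
      </preformat>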
    </sec>
    <sec id="sec-3">
      <title>2.1 Submodular Scoring Functions</title>
      <p>We use several submodular functions, aimed at
quantifying how relevant or diverse the selected subset is.</p>
      <p>
        Visual Representativeness. We define the
representativeness score as 1 - (k-medoid loss). The k-medoid loss of
a subset is obtained by summing the Euclidean
distances between each image of the query and its nearest
selected medoid (an image in the selected subset) in feature
space [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (using CNN features [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). The k-medoid loss is thus
minimal when the selected subset is representative, which
results in a higher representativeness score.
      </p>
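      <p>A minimal sketch of this score (our illustration; the exact normalization is an assumption, as the paper only states that the component functions are normalized):</p>
      <preformat>
import numpy as np

def representativeness(features, S):
    """1 - normalized k-medoid loss for the selected subset S.

    features: (n, d) array of CNN features for all images of a query;
    S: indices of the selected images, which act as the medoids.
    """
    medoids = features[list(S)]                              # (|S|, d)
    # Euclidean distance from every image to every selected medoid.
    dist = np.linalg.norm(features[:, None, :] - medoids[None, :, :], axis=-1)
    loss = dist.min(axis=1).sum()                            # k-medoid loss
    # Assumed normalizer: loss of the trivial "mean feature" solution.
    norm = np.linalg.norm(features - features.mean(axis=0), axis=1).sum()
    return 1.0 - loss / max(norm, 1e-12)
      </preformat>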
      <p>Visual Relevance. We use the relevance ground truth
provided for the devset topics to train a generic SVM on
CNN features, with the relevance ground truth as labels. The
relevance score of a subset is the number of images in the
subset that are predicted as relevant.</p>
      <p>
        Text Relevance. To obtain a text-based score for
an image, given a query, we use a bag-of-words model. We
represent the Wikipedia page associated with the query as a vector.
Similarly, each image is represented as a vector obtained by
encoding its title, tags and description (with the same relative
weighting as [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]). The text relevance of an image is
computed as its cosine similarity to the Wikipedia page, using
tf-idf weighting (we use the implementation provided in scikit-learn [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]). Finally, the text relevance score of a set of
images is simply the sum of the relevance scores of its individual
elements.
      </p>
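      <p>A sketch of the text score (our illustration; wiki_text and image_texts are assumed inputs, and the relative weighting of title, tags and description from [<xref ref-type="bibr" rid="ref13">13</xref>] is omitted):</p>
      <preformat>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# wiki_text: text of the Wikipedia page associated with the query;
# image_texts: one string per image (title, tags and description).
vec = TfidfVectorizer()
X = vec.fit_transform([wiki_text] + image_texts)
scores = cosine_similarity(X[0], X[1:]).ravel()   # one score per image

def text_relevance(S):
    """Text relevance of a subset: sum over its images' scores."""
    return float(scores[list(S)].sum())
      </preformat>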
      <p>
        Flickr Ranks. For an image with Flickr rank $i$ in a topic
containing $n$ images, its Flickr score is given by $(n - i)/n$;
e.g. the top-ranked image ($i = 1$) of a topic with $n = 100$ images
scores $0.99$. The Flickr score of a subset is the sum of the
Flickr scores of the images it contains.
      </p>
      <p>[Figure 1: Distribution of the learnt weights per objective (e.g. Vis. Rel, Vis. Rep); x-axis: Weight, ranging from 0.0 to 0.6.]</p>
      <p>Time Representativeness. This function quantifies how
diverse the images are with respect to the time they were taken. Photos
taken at different times of the day, or during different
seasons, can also increase diversity. This
score is computed using the same k-medoid loss as for visual
representativeness, but using the timestamp as the feature
representation.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Learning</title>
      <p>
        Using the relevance and cluster ground truth, for a given
query and a budget $B$, we construct a ground-truth subset
$S_t^{gt}$ for each query $t$ in the devset. To learn the weights,
we optimize the following large-margin formulation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
      </p>
      <p>
        $$\min_{w \ge 0} \; \frac{1}{T} \sum_{t=1}^{T} \hat{L}_t(w) + \frac{\lambda}{2} \|w\|^2, \qquad (4)$$
where $T$ is the total number of queries in the devset and
$\hat{L}_t(w)$, the generalized hinge loss for training example $t$, is given by
$$\hat{L}_t(w) = \max_{S_t \in V_t} \big( F(S_t) + \ell(S_t) \big) - F(S_t^{gt}), \qquad (5)$$
where $\ell(\cdot)$ is the loss function. We use the F1-loss
($\ell(S_t) = |S_t| - F1(S_t)$). As the F1-loss is not
submodular, we use its (pointwise) modular approximation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
We perform the optimization using sub-gradient descent [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
with an adaptive learning rate [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
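      <p>A schematic of this learning loop, as a sketch under simplifying assumptions (our illustration, not the authors' code, which relies on [<xref ref-type="bibr" rid="ref3">3</xref>]). The helper loss_aug_greedy, which greedily maximizes the loss-augmented objective, is hypothetical; the adaptive step follows the AdaGrad scheme of [<xref ref-type="bibr" rid="ref2">2</xref>]:</p>
      <preformat>
import numpy as np

def learn_weights(queries, k, budget, lr=1.0, lam=1e-3, epochs=10):
    """Sub-gradient descent for Eq. 4 with an AdaGrad-style step.

    queries: iterable of (fs, V, S_gt) with fs the k component
    functions, V the ground set and S_gt the ground-truth subset.
    loss_aug_greedy (hypothetical helper) greedily maximizes the
    loss-augmented objective F(S) + loss(S) at the given budget.
    """
    w = np.ones(k) / k
    G = np.zeros(k)                     # accumulated squared gradients
    for _ in range(epochs):
        for fs, V, S_gt in queries:
            S_hat = loss_aug_greedy(fs, w, V, budget)
            f = lambda S: np.array([fi(S) for fi in fs])
            # Sub-gradient of the hinge loss in Eq. 5 plus the regularizer.
            grad = f(S_hat) - f(S_gt) + lam * w
            G += grad ** 2
            w = w - lr * grad / np.sqrt(G + 1e-8)
            w = np.maximum(w, 0.0)      # project back onto w >= 0
    return w
      </preformat>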
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>
        We evaluated our method on the MediaEval 2015 Retrieving
Diverse Social Images task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The test data consists of 139 queries
with more than 40,000 images. It includes single-topic
(location) queries as well as multi-topic queries (events associated with
locations). In Fig. 2 we show the performance for different
configurations and varying budgets. The configurations are: (i)
Run 1 - Visual only, i.e. relevance prediction and
representativeness. (ii) Run 2 - Meta only: in this run we use only the
information associated with an image, but not the image
itself, i.e. text relevance, Flickr rank and time
representativeness. (iii) Run 3 - a combination of the
above-mentioned objectives. In Tab. 1 we provide the results
using the official performance metrics computed by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The
distribution of the weights learnt for each shell is shown in
Fig. 1.
      </p>
      <p>[Table 1: Official results for Run 1 (visual), Run 2 (meta) and Run 3 (combined), reported for all, single-topic and multi-topic queries.]</p>
      <p>
        The visual run yields a higher cluster recall, while the
textual run yields a better precision. This suggests
that visual information is effective for diversifying the
retrieval results, while textual information is more effective
for retrieving relevant images. The lower precision of the
visual run is not surprising, as it only uses a generic relevance
prediction. While this allows us to filter out images of
people and several non-landmarks, it does not score relevance
in a query-specific way. In order to improve our visual
approach it is thus necessary to compute similarities between
text queries and images. This could be done by learning a
joint embedding of text and images, similar to e.g. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We
also note that our method performs better on
the single-topic sets than on the multi-topic sets.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tzeng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <article-title>DeCAF: A deep convolutional activation feature for generic visual recognition</article-title>
          .
          <source>In International Conference on Machine Learning (ICML)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Duchi</surname>
          </string-name>
          , E. Hazan, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          .
          <article-title>Adaptive subgradient methods for online learning and stochastic optimization</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gygli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Grabner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Gool</surname>
          </string-name>
          .
          <article-title>Video Summarization by Learning Submodular Mixtures of Objectives</article-title>
          .
          <source>In Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation</article-title>
          .
          <source>In Proceedings of MediaEval Benchmarking Initiative for Multimedia Evaluation</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Golovin</surname>
          </string-name>
          .
          <article-title>Submodular function maximization</article-title>
          .
          <source>In Tractability: Practical Approaches to Hard Problems</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Lew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Djeraba</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <article-title>Content-based multimedia information retrieval: State of the art and challenges</article-title>
          .
          <source>ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Bilmes</surname>
          </string-name>
          .
          <article-title>Learning mixtures of submodular shells with application to document summarization</article-title>
          .
          <source>In Uncertainty in Artificial Intelligence (UAI)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Che</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          .
          <article-title>Multi-task deep visual-semantic embedding for video thumbnail selection</article-title>
          .
          <source>In Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Bilmes</surname>
          </string-name>
          .
          <article-title>A submodular-supermodular procedure with applications to discriminative structure learning</article-title>
          .
          <source>In Uncertainty in Artificial Intelligence (UAI)</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Nemhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Wolsey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Fisher</surname>
          </string-name>
          .
          <article-title>An analysis of approximations for maximizing submodular set functions - I</article-title>
          .
          <source>Mathematical Programming</source>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Image retrieval: Current techniques, promising directions, and open issues</article-title>
          .
          <source>Journal of Visual Communication and Image Representation</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Spyromitros-Xioufis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Improving diversity in image search via supervised relevance scoring</article-title>
          .
          <source>In ACM International Conference on Multimedia Retrieval (ICMR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>