=Paper=
{{Paper
|id=Vol-1436/Paper18
|storemode=property
|title=ETH-CVL @ MediaEval 2015: Learning Objective Functions for Improved Image Retrieval
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper18.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/RavindranathGG15
}}
==ETH-CVL @ MediaEval 2015: Learning Objective Functions for Improved Image Retrieval==
'''Sai Srivatsa R''' (Indian Institute of Technology, Kharagpur), saisrivatsan12@gmail.com
'''Michael Gygli''' (Computer Vision Laboratory, ETH Zurich), gygli@vision.ee.ethz.ch
'''Luc Van Gool''' (Computer Vision Laboratory, ETH Zurich), vangool@vision.ee.ethz.ch

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14–15, 2015, Wurzen, Germany.

===ABSTRACT===
In this paper, we present a method to select a refined subset of images, given an initial list of retrieved images. The goal of any image retrieval system is to present results that are maximally relevant as well as diverse. We formulate this as a subset selection problem and address it using submodularity. In order to select the best subset, we learn an objective function as a linear combination of submodular functions. This objective quantifies how relevant and representative a selected subset is. Using this method we obtain promising results at MediaEval 2015.

===1. INTRODUCTION===
Image retrieval using text queries is a central topic in multimedia retrieval. While early approaches relied solely on text associated with images, more recent approaches combine textual and visual cues to return more relevant results [12, 6]. Nonetheless, search engines of photo-sharing sites such as Flickr still retrieve results that are often irrelevant and redundant. The MediaEval 2015 Retrieving Diverse Social Images Task fosters research to improve the results retrieved by Flickr. It asks the participants to develop algorithms that refine a ranked list of photos retrieved from Flickr using the photos' visual, textual and meta information. An overview of the task is presented in [4].

===2. METHODOLOGY===
We formulate the task of diversifying image retrieval results as a subset selection problem. Given a set of retrieved images I = (I_1, I_2, ..., I_n) and a budget B, the task is to find a subset S ⊆ I with |S| = B such that S is maximally relevant as well as diverse. Such problems are usually solved using a scoring function F : 2^n → R that assigns a higher score to diverse and relevant subsets. Letting V be the power set of I, we obtain the best subset S* by computing

:S* = argmax_{S ⊆ V, |S| = B} F(S).   (1)

Evaluating the scores of all possible subsets (2^n of them) is intractable. We address this issue with submodularity. A set function f(·) is said to be submodular if

:f(A ∪ {v}) − f(A) ≥ f(B ∪ {v}) − f(B),   (2)

where A ⊆ B ⊆ V \ {v}, V being the ground set of elements [9]. Submodular functions naturally model properties such as representativeness and relevance, as they exhibit a diminishing-returns property.

If the scoring function is monotone submodular, we can find a near-optimal solution to Equation 1 using greedy submodular maximization methods [10, 5]. A linear combination of submodular functions with non-negative weights is still submodular. Thus we define our scoring function as

:F(S) = w^T f(S),   (3)

where f(S) = [f_1(S), f_2(S), ..., f_k(S)]^T are normalized monotone submodular functions and w ∈ R^k_+ is a weight vector. We learn these weights with sub-gradient descent¹ [7].

====2.1 Submodular Scoring Functions====
We use several submodular functions, each aimed at quantifying how relevant or diverse the selected subset is.

'''Visual Representativeness.''' We define the representativeness score as 1 − (k-medoid loss). The k-medoid loss of a subset is obtained by summing, over the images of the query, the Euclidean distance to the nearest selected medoid (i.e. to the closest image of the selected subset) in feature space [3] (using CNN features [1]). The k-medoid loss is thus minimal when the selected subset is representative, which results in a higher representativeness score.

'''Visual Relevance.''' We use the relevance ground truth provided for the devset topics to train a generic SVM on CNN features, with the relevance ground truth as labels. The relevance score of a subset is the number of images in the subset that are predicted as relevant.

'''Text Relevance.''' In order to obtain a text-based score for an image, given a query, we use a bag-of-words model. We represent the wikipage associated with the query as a vector. Similarly, each image is represented as a vector encoding its title, tags and description (with the same relative weighting as [13]). The text relevance of an image is computed as its cosine similarity to the Wikipedia page, using tf-idf weighting². Finally, the text relevance score of a set of images is simply the sum of the relevance scores of its individual elements.

'''Flickr Ranks.''' For an image with Flickr rank i belonging to a topic with n images, the Flickr score is (n − i)/n. The Flickr score of a subset is the sum of the Flickr scores of its images.

'''Time Representativeness.''' This function quantifies how diverse the images are with respect to the time they were taken: photos taken at different times of the day, or during different seasons, can also increase diversity. The score is computed with the same k-medoid loss as for visual representativeness, but using the timestamp as the feature representation.

¹ We use the implementation of [3] for submodular maximization and for learning the weights.
² Using the implementation provided in scikit-learn [11].

====2.2 Learning====
Using the relevance and cluster ground truth, for a given query and a budget B, we construct a ground-truth subset S_t^gt for each query t in the devset. To learn the weights, we optimize the following large-margin formulation [7]:

:min_{w ≥ 0} (1/T) Σ_{t=1}^{T} L̂_t(w) + (λ/2) ||w||²,   (4)

where T is the total number of queries in the devset and L̂_t(w), the hinge loss for training example t, is given by

:L̂_t(w) = max_{S_t ∈ V_t} (F(S_t) + ℓ(S_t)) − F(S_t^gt),   (5)

where ℓ(·) is the loss function. We use the F1 loss, ℓ(S_t) = 1 − F1(S_t). As the F1 loss is not submodular, we use its (pointwise) modular approximation [9]. We perform the optimization using sub-gradient descent [7] with an adaptive learning rate [2].

===3. RESULTS AND DISCUSSION===
We evaluated our method on the MediaEval 2015 Retrieving Diverse Social Images task [4]. The test data consists of 139 queries with more than 40,000 images. It includes single-topic (location) as well as multi-topic queries (events associated with locations). In Fig. 2 we show the performance of different configurations for varying budgets. The configurations are: (i) Run 1, visual only, i.e. relevance prediction and representativeness; (ii) Run 2, meta only, i.e. using only information associated with the image, but not the image itself: text relevance, Flickr rank and time representativeness; (iii) Run 3, a combination of the above-mentioned objectives. In Tab. 1 we report the results using the official performance metrics computed by [4]. The distribution of weights learnt for each shell is shown in Fig. 1.

[Figure 1: Weights learnt for the normalized submodular objectives (Time Rep., Vis. Rel., Flickr Ranks, Text Rel., Vis. Rep.) for the configurations run1, run2 and run3 (see Sec. 3).]

[Figure 2: Precision, Cluster Recall and F1 score of the official runs on the dataset of [4], for budgets from 5 to 50.]

{| class="wikitable"
! Run !! Type !! P@20 !! CR@20 !! F1@20
|-
| Run 1 || all || 0.6853 || 0.4724 || 0.5453
|-
| Run 1 || single-topic || 0.6877 || '''0.4829''' || 0.5575
|-
| Run 1 || multi-topic || 0.6829 || 0.4622 || 0.5333
|-
| Run 2 || all || 0.7906 || 0.4051 || 0.5207
|-
| Run 2 || single-topic || 0.8290 || 0.4145 || 0.5406
|-
| Run 2 || multi-topic || 0.7529 || 0.3958 || 0.5010
|-
| Run 3 || all || 0.7709 || 0.4366 || 0.5393
|-
| Run 3 || single-topic || '''0.8420''' || 0.4420 || '''0.5674'''
|-
| Run 3 || multi-topic || 0.7007 || 0.4312 || 0.5116
|}
Table 1: Official results. We report performance metrics according to [4]. Best results are highlighted in bold.

The visual run yields a higher cluster recall, while the textual run yields better precision. This suggests that visual information is effective for diversifying the retrieval results, while textual information is more effective for retrieving relevant images. The lower precision of the visual run is not surprising, as it only uses a generic relevance prediction. While this allows filtering out images of people and several non-landmarks, it does not score relevance in a query-specific way. In order to improve our visual approach, it is thus necessary to compute similarities between text queries and images. This could be done by learning a joint embedding of text and images, similar to e.g. [8]. We also note that our method performs better on the single-topic sets than on the multi-topic sets.

===4. REFERENCES===
[1] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
[2] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011.
[3] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[4] B. Ionescu, A. L. Ginsca, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation, 2015.
[5] A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems, 2012.
[6] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2006.
[7] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), 2012.
[8] W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo. Multi-task deep visual-semantic embedding for video thumbnail selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[9] M. Narasimhan and J. Bilmes. A submodular-supermodular procedure with applications to discriminative structure learning. In Uncertainty in Artificial Intelligence (UAI), 2005.
[10] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 1978.
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
[12] Y. Rui, T. S. Huang, and S.-F. Chang. Image retrieval: Current techniques, promising directions, and open issues. Journal of Visual Communication and Image Representation, 1999.
[13] E. Spyromitros-Xioufis, S. Papadopoulos, A. L. Ginsca, A. Popescu, Y. Kompatsiaris, and I. Vlahavas. Improving diversity in image search via supervised relevance scoring. In ACM International Conference on Multimedia Retrieval, 2015.
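As an appendix-style illustration of the method of Sec. 2, the greedy maximization that approximately solves Equation 1 can be sketched in a few lines. This is a minimal reimplementation, not the code actually used (the paper relies on the implementation of [3]); the function name `greedy_select` and the toy coverage objective are our own.

```python
def greedy_select(images, objectives, weights, budget):
    """Approximately maximize F(S) = w^T f(S) subject to |S| = B (Eq. 1).

    images:     list of item identifiers
    objectives: set functions f_i, each mapping a list of items to a score
    weights:    non-negative weights w_i (Eq. 3)
    budget:     target subset size B

    For monotone submodular F, greedily adding the element with the largest
    marginal gain is near-optimal (Nemhauser et al. [10]).
    """
    def score(subset):
        return sum(w * f(subset) for w, f in zip(weights, objectives))

    selected = []
    while len(selected) < budget:
        remaining = [x for x in images if x not in selected]
        if not remaining:
            break
        # add the image whose marginal gain in F is largest
        selected.append(max(remaining, key=lambda x: score(selected + [x])))
    return selected


# Toy usage: a coverage objective counting the distinct "topics" a subset spans.
topics = {1: {"a"}, 2: {"a", "b"}, 3: {"c"}}
coverage = lambda S: len(set().union(*(topics[x] for x in S))) if S else 0
print(greedy_select([1, 2, 3], [coverage], [1.0], 2))  # -> [2, 3]
```

Note the diminishing-returns property in action: image 1 adds nothing once image 2 (which already covers topic "a") is selected, so the greedy rule prefers image 3.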
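Two of the per-subset scores of Sec. 2.1, the Flickr rank score and the k-medoid loss underlying the representativeness scores, follow directly from their definitions. The sketch below uses our own naming; how each score is normalized before entering Equation 3 is not specified above and is left out here.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def flickr_score(subset_ranks, n):
    """Flickr score of a subset: sum of (n - i)/n over the Flickr ranks i
    of the selected images, for a topic with n images (Sec. 2.1)."""
    return sum((n - i) / n for i in subset_ranks)


def k_medoid_loss(features, subset):
    """Sum, over every retrieved image, of the Euclidean distance to the
    nearest selected medoid, i.e. to the closest image of the subset.

    features: list of feature vectors; subset: indices of selected images.
    The representativeness score of Sec. 2.1 is 1 minus (a normalized
    version of) this loss.
    """
    return sum(min(dist(x, features[j]) for j in subset) for x in features)


# Toy usage: three 1-D "CNN features"; selecting the central image (index 1)
# yields a lower loss, hence a higher representativeness, than an extreme one.
feats = [(0.0,), (1.0,), (2.0,)]
print(k_medoid_loss(feats, [1]))  # -> 2.0  (distances 1 + 0 + 1)
print(k_medoid_loss(feats, [0]))  # -> 3.0  (distances 0 + 1 + 2)
```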
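The large-margin learning of Sec. 2.2 can likewise be sketched. This is a deliberately simplified version of Equations 4–5: loss-augmented inference is replaced by plain greedy maximization of the current F (the loss term ℓ is dropped inside the max, whereas the paper uses the modular approximation of the F1 loss), and a fixed learning rate replaces the adaptive one of [2]. All names are ours.

```python
def learn_weights(queries, objectives, budget, lam=0.1, lr=0.05, epochs=30):
    """Projected sub-gradient descent for the large-margin objective (Eq. 4).

    queries:    list of (items, gt_subset) pairs, one per devset query t
    objectives: the submodular scoring functions f_i of Sec. 2.1
    Returns a non-negative weight vector w (as a list of floats).
    """
    k = len(objectives)
    w = [1.0 / k] * k

    def score(subset, w):
        return sum(wi * f(subset) for wi, f in zip(w, objectives))

    for _ in range(epochs):
        # sub-gradient of (1/T) sum_t L_t(w) + (lam/2) ||w||^2
        grad = [lam * wi for wi in w]
        for items, gt in queries:
            # (simplified) loss-augmented inference: greedy maximization of F
            sel = []
            while len(sel) < budget:
                rest = [x for x in items if x not in sel]
                if not rest:
                    break
                sel.append(max(rest, key=lambda x: score(sel + [x], w)))
            # hinge-loss sub-gradient: f(argmax subset) - f(ground truth)
            for i, f in enumerate(objectives):
                grad[i] += (f(sel) - f(list(gt))) / len(queries)
        # gradient step, projected onto the constraint w >= 0
        w = [max(wi - lr * gi, 0.0) for wi, gi in zip(w, grad)]
    return w
```

On a toy query whose ground-truth subset is explained only by the second objective, the learnt weight of that objective ends up dominating the first, mirroring the per-run weight distributions of Fig. 1.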