UEC, Tokyo at MediaEval 2013
Retrieving Diverse Social Images Task

Keiji Yanai and Do Hang Nga
The University of Electro-Communications, Tokyo
1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, JAPAN
yanai@cs.uec.ac.jp, dohang@mm.cs.uec.ac.jp

ABSTRACT

In this paper, we describe our method and results for the MediaEval 2013 Retrieving Diverse Social Images Task. To accomplish the task objective, we adopt VisualRank [5] and Ranking with Sink Points [2], which are common methods for selecting representative and diverse photos. To obtain an affinity matrix for both ranking methods, we used only the officially provided features, including visual features and tag features. We submitted the three required runs: a visual-feature-only run, a textual-feature-only run, and a fused textual-visual run.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

1. INTRODUCTION

In this paper, we describe our method and results for the MediaEval 2013 Retrieving Diverse Social Images Task [4]. The objective of this task is to select relevant and diverse photos from a given set of photos of specific locations. To do so, we adopt VisualRank [5] and Ranking with Sink Points [2]; we chose these methods because we had previously used them for ranking geotagged photos [6]. First, we calculate a similarity matrix using the given features and apply VisualRank to select the most representative photo. We then re-rank the remaining photos with Ranking with Sink Points after removing the first-ranked photo, and repeat this re-ranking and removal until 50 photos are selected.

To obtain a similarity matrix for both ranking methods, we used only the officially provided features, including visual features and tag features. We submitted the three required runs: a visual-feature-only run, a textual-feature-only run, and a fused textual-visual run, which are the minimum requirements for participating in this task.

2. RANKING METHOD

To obtain representative and diverse photos in the upper ranks, we adopt VisualRank [5] and Ranking with Sink Points [2]. In this section, we briefly explain both methods and the features.

2.1 VisualRank

VisualRank is an image ranking method based on PageRank [1]. PageRank ranks Web pages using the hyper-link structure of the Web; the rank values are estimated as the steady-state distribution of a random-walk Markov-chain probabilistic model. VisualRank uses a similarity matrix of images instead of the hyper-link structure. Eq. (1) shows the VisualRank update:

    r_{i+1} = α S r_i + (1 − α) p,   (0 ≤ α ≤ 1)        (1)

S is the column-normalized similarity matrix of images, p is a damping vector, r is the ranking vector, each element of which represents the ranking score of one image, and α controls the extent of the effect of p. The final value of r is estimated by updating r iteratively with Eq. (1). Because S is column-normalized and the elements of p sum to 1, the sum of the ranking vector r does not change. Although p is set to a uniform vector in VisualRank, as in standard PageRank, it is known that p can act as a bias vector that affects the final value of r [3].
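The update in Eq. (1) amounts to a few lines of matrix code. Below is a minimal sketch of the iteration, assuming a precomputed column-normalized similarity matrix S; the function name and convergence settings are illustrative choices, not part of our submitted system.

```python
import numpy as np

def visual_rank(S, alpha=0.85, n_iter=100, tol=1e-9):
    """Iterate r <- alpha * S r + (1 - alpha) * p, as in Eq. (1).

    S: (n, n) similarity matrix, column-normalized so each column sums to 1.
    p: uniform damping vector, as in standard VisualRank.
    """
    n = S.shape[0]
    p = np.full(n, 1.0 / n)   # uniform damping vector
    r = np.full(n, 1.0 / n)   # initial ranking vector
    for _ in range(n_iter):
        r_next = alpha * (S @ r) + (1.0 - alpha) * p
        if np.abs(r_next - r).sum() < tol:   # L1 convergence check
            return r_next
        r = r_next
    return r
```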
2.2 Ranking with Sink Points

Because VisualRank considers only the representativeness of items, the higher ranks are sometimes occupied by items that are similar to each other. That is, VisualRank cannot produce a ranking that accounts for the diversity of items. Therefore, we adopt Ranking with Sink Points [2], which can be regarded as an extension of PageRank [1] that makes the obtained ranking both relevant and diverse.

To address diversity in ranking, the concept of sink points is useful. Sink points are data objects whose ranking scores are fixed at zero during the ranking process; hence, they never spread any ranking score to their neighbors. Intuitively, sink points can be imagined as "black holes" on the ranking manifold: ranking scores spreading to them are absorbed, and no ranking score escapes from them.

First, we apply VisualRank with the obtained affinity matrix to select the most representative photo. We then re-rank the remaining photos with Ranking with Sink Points, as shown in Eq. (2), after removing the first-ranked photo as a sink point. We repeat this re-ranking and removal of the first-ranked photo until 50 photos are selected:

    r_{i+1} = α S I_i r_i + (1 − α) p        (2)

I_i is a diagonal indicator matrix whose (j, j) element is 0 if x_j ∈ X_s and 1 otherwise, where X_s is the set of sink points. Note that α is set to 0.85 in the experiments.
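The whole selection procedure can be sketched as follows. This is an illustrative reconstruction assuming a fixed number of update iterations per round, not the exact code used for our runs.

```python
import numpy as np

def select_diverse(S, k=50, alpha=0.85, n_iter=100):
    """Select k photos: VisualRank first, then repeated re-ranking with
    the already selected photos treated as sink points (Eq. (2))."""
    n = S.shape[0]
    p = np.full(n, 1.0 / n)
    selected = []
    for _ in range(min(k, n)):
        mask = np.ones(n)
        mask[selected] = 0.0              # diagonal of I_i as a vector
        r = np.full(n, 1.0 / n)
        for _ in range(n_iter):
            r = alpha * (S @ (mask * r)) + (1.0 - alpha) * p
            r[selected] = 0.0             # sink scores stay fixed at zero
        r[selected] = -np.inf             # never re-select a sink point
        selected.append(int(np.argmax(r)))
    return selected
```

On the first round the mask is all ones, so the loop reduces to plain VisualRank and picks the most representative photo, exactly as described above.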

2.3 Visual Features

We used the ten kinds of visual features officially provided by the task organizers, such as the Global Histogram of Oriented Gradients and Color Moments on the HSV color space; the official visual features are described in detail in [4].

We calculate similarities for each visual feature with histogram intersection. Finally, we construct an affinity matrix by averaging the similarities over the ten kinds of visual features.
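A minimal sketch of this construction, assuming L1-normalized histograms so that the intersection similarity lies in [0, 1] (the helper names are illustrative):

```python
import numpy as np

def intersection_matrix(H):
    """Pairwise histogram intersection: sim(a, b) = sum_d min(H[a,d], H[b,d]).

    H: (n_images, n_bins) array of L1-normalized histograms.
    """
    # Broadcast to (n, n, d) and reduce; adequate for the few hundred
    # photos per location in this task.
    return np.minimum(H[:, None, :], H[None, :, :]).sum(axis=2)

def visual_affinity(feature_list):
    """Average the intersection matrices over the feature types
    (ten kinds in our runs)."""
    return np.mean([intersection_matrix(H) for H in feature_list], axis=0)
```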
2.4 Textual Features

We use the social TF-IDF weights provided by the task organizers. We extract bag-of-words vectors with social TF-IDF weights from the Flickr metadata of all the given images, and calculate an affinity matrix using the cosine similarity between the bag-of-words vectors within each place.

To obtain an affinity matrix for the visual-textual fused runs, we simply average the visual-feature-based and textual-feature-based affinity matrices.
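A corresponding sketch for the textual affinity and the fused affinity (again with illustrative names; the guard against empty metadata vectors is our own addition):

```python
import numpy as np

def textual_affinity(T):
    """Pairwise cosine similarity of TF-IDF bag-of-words vectors.

    T: (n_images, n_terms) array of social TF-IDF weights.
    """
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    Tn = T / np.maximum(norms, 1e-12)   # avoid division by zero
    return Tn @ Tn.T

def fused_affinity(A_visual, A_textual):
    """Visual-textual run: a plain average of the two affinity matrices."""
    return 0.5 * (A_visual + A_textual)
```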



Figure 1: An example of the ranking by the three kinds of features: "The Gate of Forbidden City in China".

Table 1: Results evaluated by experts for the entire test set of 346 locations.

    Runs             P@5      P@10     P@20     P@30     P@40     P@50
    Only visual      0.7164   0.7056   0.7092   0.7076   0.6948   0.6752
    Only textual     0.7082   0.6863   0.6845   0.6904   0.6841   0.6667
    Visual-textual   0.7135   0.7155   0.7063   0.7026   0.6934   0.6723

    Runs             CR@5     CR@10    CR@20    CR@30    CR@40    CR@50
    Only visual      0.2233   0.3633   0.5448   0.6743   0.7572   0.8154
    Only textual     0.2130   0.3579   0.5515   0.6706   0.7549   0.8094
    Visual-textual   0.2258   0.3621   0.5414   0.6642   0.7427   0.8015

    Runs             F1@5     F1@10    F1@20    F1@30    F1@40    F1@50
    Only visual      0.3288   0.4617   0.5926   0.6618   0.6936   0.7068
    Only textual     0.3180   0.4531   0.5869   0.6544   0.6890   0.7001
    Visual-textual   0.3303   0.4614   0.5879   0.6545   0.6869   0.6995

Table 2: Results evaluated by the crowd (the average of GT1, GT2, and GT3) for a subset of 50 locations from the test set.

    Runs             P@5      P@10     P@20     P@30     P@40     P@50
    Only visual      0.7061   0.6959   0.6857   0.6878   0.6847   0.6845
    Only textual     0.6857   0.6673   0.6847   0.6966   0.6964   0.6865
    Visual-textual   0.6367   0.6531   0.6653   0.6823   0.6765   0.6747

    Runs             CR@5     CR@10    CR@20    CR@30    CR@40    CR@50
    Only visual      0.5947   0.7198   0.8070   0.8803   0.9153   0.9394
    Only textual     0.5875   0.7331   0.8429   0.9118   0.9355   0.9449
    Visual-textual   0.5573   0.6824   0.8050   0.8829   0.9223   0.9447

    Runs             F1@5     F1@10    F1@20    F1@30    F1@40    F1@50
    Only visual      0.5915   0.6657   0.7052   0.7435   0.7586   0.7675
    Only textual     0.5818   0.6659   0.7366   0.7669   0.7770   0.7735
    Visual-textual   0.5441   0.6261   0.6971   0.7446   0.7578   0.7638

3. EXPERIMENTAL RESULTS

Tables 1 and 2 show the results of our three submitted runs as evaluated by experts and by crowd workers, respectively. Note that the expert results are based on an evaluation of the entire dataset of 346 locations, while the crowd results are based on an evaluation of only 50 locations in the dataset and are obtained by averaging the evaluations of three crowd workers.

Overall, the results of the visual-only run were better than those of the textual-only and visual-textual runs, although the differences were not large.

Figure 1 shows the top six photos of a successful example produced by the proposed method with the three kinds of features: textual, visual, and visual-textual. These photos represent "The Gate of Forbidden City in Beijing, China." In this example, the photos selected with the visual-textual features are more representative and diverse than those selected with only the textual or only the visual features, which indicates that our proposed method works successfully.

In the case of the above example, most of the photos in the given photo set are relevant and only a few noise photos are included. However, the given photo sets of some landmarks include many noise photos. In such cases, the proposed method sometimes failed to select relevant photos and placed noise photos in the upper ranks. Therefore, the removal of noise photos is one of our important future works.
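For reference, assuming the task's definition of F1@X as the harmonic mean of precision P@X and cluster recall CR@X [4], the metric can be sketched as below. Since F1 is computed per location and then averaged, the table values are not simply the harmonic mean of the averaged P and CR columns.

```python
def f1_at_x(p, cr):
    """Harmonic mean of precision and cluster recall at cutoff X
    for a single location."""
    return 0.0 if p + cr == 0 else 2.0 * p * cr / (p + cr)

# Example with hypothetical per-location values:
# f1_at_x(0.70, 0.36) ~= 0.475
```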
4. CONCLUSIONS

We tackled the MediaEval 2013 Retrieving Diverse Social Images Task with VisualRank [5] and Ranking with Sink Points [2]. Unfortunately, due to time limitations, we had to give up using some useful additional data, including the training dataset, GPS coordinates, and Wikipedia photos. In fact, if we had had enough time, we should have used the training data to estimate optimal parameters such as α in the VisualRank formulation and the mixing weight of the visual and textual similarities.

5. REFERENCES
[1] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proc. of the Seventh International World Wide Web Conference, 1998.
[2] X.-Q. Cheng, P. Du, J. Guo, X. Zhu, and Y. Chen. Ranking on data manifold with sink points. IEEE Transactions on Knowledge and Data Engineering, 25(1):177-191, 2013.
[3] T. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for Web search. IEEE Transactions on Knowledge and Data Engineering, 15(4):784-796, 2003.
[4] B. Ionescu, M. Menendez, H. Muller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation. In MediaEval 2013 Workshop, CEUR-WS.org, ISSN: 1613-0073, Barcelona, Spain, October 18-19, 2013.
[5] Y. Jing and S. Baluja. VisualRank: Applying PageRank to large-scale image search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1870-1890, 2008.
[6] H. Kawakubo and K. Yanai. GeoVisualRank: A ranking method of geotagged images considering visual similarity and geo-location proximity. In Proc. of the ACM International World Wide Web Conference, 2011.