=Paper=
{{Paper
|id=None
|storemode=property
|title=Ghent University at the 2011 Placing Task
|pdfUrl=https://ceur-ws.org/Vol-807/vanLaere_UGENT_Placing_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/LaereSD11
}}
==Ghent University at the 2011 Placing Task==
Ghent University at the 2011 Placing Task∗

Olivier Van Laere, Department of Information Technology, IBBT, Ghent University, Belgium (olivier.vanlaere@ugent.be)
Steven Schockaert, Dept. of Applied Mathematics and Computer Science, Ghent University, Belgium (steven.schockaert@ugent.be)
Bart Dhoedt, Department of Information Technology, IBBT, Ghent University, Belgium (bart.dhoedt@ugent.be)

∗ Postdoctoral Fellow of the Research Foundation – Flanders (FWO).

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

ABSTRACT

We present the results of a system that georeferences Flickr videos using a combination of language models and similarity search. The system extends our approach from last year by using language models with a more adaptive granularity, and by taking into account the home location of the user.

Keywords

Georeferencing, Language models, Dempster-Shafer theory

1. INTRODUCTION

The Placing Task requires participants to estimate the geographical coordinates of a video, based on the visual and auditory features of the video, textual tags that have been assigned to it by its owner, context information about the owner, etc. Training data consists of a portion of the georeferenced photos on Flickr. For a detailed description of this task, we refer to [2]. Participants were allowed to submit five runs, which differ in the kind of meta-data and external resources that are allowed.

We participated in the 2010 Placing Task with a system based on a two-step approach [6]. In the first step, language models are used to determine the area which is most likely to contain the location of a previously unseen video. The second step determines the location of the most similar photo within the chosen area and uses its location as the prediction. An important lesson drawn from last year's participation was that the chosen granularity of the areas in the first step crucially influences the performance, and that moreover this optimal granularity varies greatly across different test videos. Therefore, this year we have experimented with two methods to determine a suitable granularity. As a second extension, this year we have included the possibility of using the home location of the user, which is available in textual form for a majority of all test videos.

2. METHODOLOGY

A total number of 3 185 258 georeferenced photos from Flickr were provided as training data by the task organizers. As last year, photos that have been uploaded on the same day by the same user with identical tags are treated as duplicates, to reduce the impact of bulk uploads, after which 2 096 712 photos remained. For run 5, a larger training set was used, crawled using the Flickr API, consisting of 11 770 000 photos with the highest level of location accuracy (i.e. level 16). We ensured not to crawl any videos and thus any possible items from the test set.

In both cases, the locations of the photos in the training set were clustered using agglomerative hierarchical clustering, from which flat clusterings into 500, 2500, 5000 and 7500 clusters have been obtained; these clusterings will be referred to as C_500, C_2500, C_5000 and C_7500 respectively. For each cluster within these four clusterings, the most relevant tags are determined using χ² feature selection, leading to the vocabularies (i.e. sets of tags) V_500, V_2500, V_5000 and V_7500.

Finding the most likely area.

To determine the probability P(a|x) that a video x was taken in area a ∈ C_k, a unigram language modeling approach is used (except for run 4, which does not permit the use of textual tags), whereby [3]

P(a|x) ∝ ( ∏_{t ∈ tags_k(x)} P(t|a) ) · P(a)    (1)

where tags_k(x) is the set of tags from V_k that have been assigned to video x. The probability P(t|a) is estimated using Bayesian smoothing (see [6] for more details). Different to our system of last year, we estimate the prior probability P(a) using the home location of the owner of video x, in those runs where the use of gazetteer look-up was allowed, and for those videos where a textual home location was available and georeferencing did not fail. Specifically, we take

P(a) ∝ ( 1 / d(p_a, p_home) )^θ    (2)

where d refers to geodesic distance, p_a are the coordinates of the most central photo of area a (i.e. the medoid of the locations of the photos from the training data located in area a) and p_home are the coordinates obtained from the textual home location using the Google Geocoding API (http://code.google.com/apis/maps/documentation/geocoding/). The parameter θ was set to 0.75 in our experiments. If coordinates of the home location cannot be obtained, P(a) is estimated as the percentage of all photos from the training data that are contained in area a, i.e.

P(a) = |a| / ∑_{a' ∈ C_k} |a'|    (3)

identifying a with the set of photos from area a in the training data. In run 1, where a textual home location may be available, but gazetteer look-up is not allowed, (3) can be refined by looking at tags from the vocabulary V_k that appear in it:

P(a) ∝ ( ∏_{t ∈ homeTags(x) ∩ tags_k(x)} P(t|a) )^µ · |a| / ∑_{a' ∈ C_k} |a'|    (4)

where µ was set to 0.45 in the experiments.
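For concreteness, the sketch below shows one way equations (1)-(3) could be combined in practice, scoring candidate areas in log space. The area dictionaries, the Dirichlet smoothing parameter mu and the haversine helper are illustrative assumptions of this sketch, not details taken from the paper, which only specifies that Bayesian smoothing and geodesic distance are used.

<pre>
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def tag_log_prob(tag, area, corpus_tag_probs, mu=1750.0):
    """log P(t|a) with Dirichlet (Bayesian) smoothing; the value of mu is illustrative,
    the paper defers the exact smoothing formulation to [6]."""
    p = (area["tag_counts"].get(tag, 0) + mu * corpus_tag_probs.get(tag, 1e-9)) \
        / (area["total_tag_count"] + mu)
    return math.log(p)

def log_prior(area, areas, home_coords, theta=0.75):
    """log P(a): Eq. (2) when a geocoded home location is available, Eq. (3) otherwise."""
    if home_coords is not None:
        d = max(haversine_km(area["medoid"], home_coords), 1e-3)  # avoid log(0)
        return -theta * math.log(d)                               # log of (1/d)^theta
    total_photos = sum(a["num_photos"] for a in areas)
    return math.log(area["num_photos"] / total_photos)

def most_likely_area(video_tags, areas, corpus_tag_probs, home_coords=None):
    """Return the area a in C_k maximizing Eq. (1), scored in log space."""
    def score(area):
        s = log_prior(area, areas, home_coords)
        for tag in video_tags & area["vocabulary"]:  # tags_k(x): tags of x kept by chi-squared selection
            s += tag_log_prob(tag, area, corpus_tag_probs)
        return s
    return max(areas, key=score)
</pre>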
Determining the level of granularity.

The language modeling approach to georeferencing requires an appropriate level of granularity to be determined: for videos with more informative tags, it is beneficial to consider a finer-grained clustering. As a baseline technique for selecting the optimal value of k, we check the number of tags a video x has in common with the different vocabularies. If |tags_7500(x) ∩ V_7500| ≥ t_7500, with t_7500 an appropriate threshold value, k = 7500 is chosen. Otherwise, if |tags_5000(x) ∩ V_5000| ≥ t_5000 we select k = 5000, etc. For runs 1 and 2 the threshold values were chosen as t_500 = 1 and t_2500 = t_5000 = t_7500 = 2. For run 3, on the other hand, we set t_500 = t_2500 = t_5000 = t_7500 = 1.
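As an illustration only, this cascade for choosing k can be sketched as follows. The tuple of thresholds mirrors the values reported for runs 1 and 2, while returning None when even the coarsest threshold is not met is an assumption of the sketch (the paper's fall-back strategy for untagged videos is described later).

<pre>
# Threshold configuration as reported for runs 1 and 2 (t_500 = 1, the others = 2).
THRESHOLDS = [(7500, 2), (5000, 2), (2500, 2), (500, 1)]

def select_granularity(video_tags, vocabularies, thresholds=THRESHOLDS):
    """Pick the finest clustering whose vocabulary shares enough tags with the video.

    vocabularies maps k to the tag set V_k obtained by chi-squared feature selection.
    Returns None when even the coarsest threshold is not met, in which case the
    fall-back strategy for untagged videos would apply.
    """
    for k, t_k in thresholds:  # from finest (7500) to coarsest (500)
        if len(video_tags & vocabularies[k]) >= t_k:
            return k
    return None
</pre>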
Run 4 is not based on language models. For run 5, we used a technique based on Dempster-Shafer theory which was proposed in [5]. Intuitively, this approach combines the probability distributions obtained at each of the granularity levels into a single structure, called a belief function, and then determines the most likely area at the most appropriate level of granularity. Specifically, the most likely area was determined using the pignistic probability decision rule [4], choosing the granularity level as the most fine-grained level for which the pignistic probability was above the threshold of 0.6. While this approach allows for a better informed decision, it requires language model probabilities to be calibrated, which necessitates the use of a sufficiently large development set which is disjoint from the training set. Initial experiments revealed that the training set provided by the task organizers was not sufficiently large to allow for both accurate training and accurate calibration. Therefore this technique was only applied in run 5, using 10.7M photos for training and 1.07M photos for calibration.

Determining the location.

Once a suitable value of k has been chosen, the area a from C_k that maximizes (1) is determined. Subsequently the photo from area a (in the training data) which is most similar to the video x is determined, and its location is used as the prediction for the location of x. Similarity is determined by comparing the tags assigned to each photo with the tags assigned to x using Jaccard similarity (without feature selection).

As a fall-back strategy, if no tags have been assigned to x at all, the home location of x is used as the prediction (in those runs where the use of a gazetteer is allowed). If no home location is available, we use the location of the photo which is visually most similar to video x. To measure visual similarity, a photo is compared against the key frames of video x that were provided by the task organizers. Visual features were extracted using the Color and Edge Directivity Descriptor (CEDD) of the LIRE tool [1]. When different key frames of the video yield conflicting predictions (i.e. when they are most similar to different photos), the (key frame, photo) pair which provided the highest degree of similarity is used.
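To make the second step and its fall-back chain concrete, the sketch below picks a prediction from a pool of candidate photos. The photo dictionaries, the 'cedd' field and the L1 descriptor distance are assumptions of this illustration; the paper only states that Jaccard similarity is computed on tags and that CEDD descriptors from LIRE are used for the visual comparison.

<pre>
def jaccard(tags_a, tags_b):
    """Jaccard similarity between two tag sets (no feature selection applied here)."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0

def cedd_distance(d1, d2):
    """Placeholder distance between CEDD descriptor vectors; the paper does not
    state which metric is used, so a simple L1 distance stands in here."""
    return sum(abs(a - b) for a, b in zip(d1, d2))

def predict_location(video_tags, candidate_photos, home_coords=None, keyframe_descriptors=None):
    """Second step: return the coordinates of the most similar training photo.

    candidate_photos: photos of the chosen area, as dicts with hypothetical
    'tags', 'coords' and 'cedd' fields. keyframe_descriptors: CEDD descriptors
    of the key frames provided by the task organizers (visual fall-back only).
    """
    if video_tags:
        best = max(candidate_photos, key=lambda p: jaccard(video_tags, p["tags"]))
        return best["coords"]
    if home_coords is not None:
        # No tags at all: fall back to the geocoded home location (gazetteer runs).
        return home_coords
    # Last resort: the (key frame, photo) pair with the highest visual similarity,
    # i.e. the smallest descriptor distance over all pairs.
    _, best_photo = min(
        ((kf, p) for kf in keyframe_descriptors for p in candidate_photos),
        key=lambda pair: cedd_distance(pair[0], pair[1]["cedd"]),
    )
    return best_photo["coords"]
</pre>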
3. RESULTS AND DISCUSSION

The results of the five runs are provided in Table 1. In particular, the table shows how many of the 5347 videos in the test collection were localized within 1 km, 10 km, 100 km, 1000 km and 10 000 km of the correct location.

         1 km    10 km   100 km   1000 km   10000 km
run 1    1245    2386    3340     4010      5207
run 2    1294    2753    3883     4578      5232
run 3    1263    2665    3759     4499      5231
run 4       2       6      49      624      4332
run 5    2567    3528    4109     4672      5263

Table 1: Overview of the results on the test collection of 5347 videos, using textual tags and visual features (run 1); using textual tags, gazetteer services and visual features (runs 2 and 3); using only visual features (run 4); and using tags, gazetteers and visual features on an extended training set with the Dempster-Shafer approach (run 5).

As can be concluded by comparing the results of runs 1 and 2, using the geocoded home location substantially improves the results. Also, determining a good threshold value to fall back to a coarser clustering can impact the results, as is demonstrated by run 3, which only differs from run 2 in its choice of the threshold values t_500, t_2500, t_5000 and t_7500. Run 4 is a baseline run which only uses visual features. Unsurprisingly, run 5, which is based on a larger training set, yielded the best results. As further experiments have indicated, however, this increased performance is not only due to the larger training set, but also to the use of Dempster-Shafer theory to combine the different granularity levels.

4. REFERENCES

[1] M. Lux and S. A. Chatzichristofis. LIRE: Lucene Image Retrieval: an extensible Java CBIR library. In Proc. ACM Multimedia, pages 1085–1088, 2008.
[2] A. Rae, V. Murdock, P. Serdyukov, and P. Kelm. Working Notes for the Placing Task at MediaEval 2011. In Working Notes of the MediaEval Workshop, 2011.
[3] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In Proc. ACM SIGIR, pages 484–491, 2009.
[4] P. Smets. Constructing the pignistic probability function in a context of uncertainty. In Proc. UAI, pages 29–40, 1990.
[5] O. Van Laere, S. Schockaert, and B. Dhoedt. Combining multi-resolution evidence for georeferencing Flickr images. In Proc. SUM, pages 347–360, 2010.
[6] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In Proc. ACM ICMR, 2011.