=Paper=
{{Paper
|id=None
|storemode=property
|title=Ghent University at the 2011 Placing Task
|pdfUrl=https://ceur-ws.org/Vol-807/vanLaere_UGENT_Placing_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/LaereSD11
}}
==Ghent University at the 2011 Placing Task==
Ghent University at the 2011 Placing Task∗

Olivier Van Laere, Department of Information Technology, IBBT, Ghent University, Belgium (olivier.vanlaere@ugent.be)
Steven Schockaert, Dept. of Applied Mathematics and Computer Science, Ghent University, Belgium (steven.schockaert@ugent.be)
Bart Dhoedt, Department of Information Technology, IBBT, Ghent University, Belgium (bart.dhoedt@ugent.be)

∗ Postdoctoral Fellow of the Research Foundation – Flanders (FWO).

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

ABSTRACT

We present the results of a system that georeferences Flickr videos using a combination of language models and similarity search. The system extends our approach from last year by using language models with a more adaptive granularity, and by taking into account the home location of the user.

Keywords

Georeferencing, Language models, Dempster-Shafer theory

1. INTRODUCTION

The Placing Task requires participants to estimate the geographical coordinates of a video, based on the visual and auditory features of the video, textual tags that have been assigned to it by its owner, context information about the owner, etc. Training data consists of a portion of the georeferenced photos on Flickr. For a detailed description of this task, we refer to [2]. Participants were allowed to submit five runs, which differ in the kind of meta-data and external resources that are allowed.

We participated in the 2010 Placing Task with a system based on a two-step approach [6]. In the first step, language models are used to determine the area which is most likely to contain the location of a previously unseen video. The second step determines the location of the most similar photo within the chosen area and uses its location as the prediction. An important lesson drawn from last year's participation was that the chosen granularity of the areas in the first step crucially influences the performance, and that moreover this optimal granularity varies greatly across different test videos. Therefore, this year we have experimented with two methods to determine a suitable granularity. As a second extension, this year we have included the possibility of using the home location of the user, which is available in textual form for a majority of all test videos.

2. METHODOLOGY

A total number of 3 185 258 georeferenced photos from Flickr were provided as training data by the task organizers. As last year, photos that have been uploaded on the same day by the same user with identical tags are treated as duplicates, to reduce the impact of bulk uploads, after which 2 096 712 photos remained. For run 5, a larger training set was used, crawled using the Flickr API, consisting of 11 770 000 photos with the highest level of location accuracy (i.e. level 16). We ensured not to crawl any videos and thus any possible items from the test set.

In both cases, the locations of the photos in the training set were clustered using agglomerative hierarchical clustering, from which flat clusterings into 500, 2500, 5000 and 7500 clusters have been obtained; these clusterings will be referred to as C_500, C_2500, C_5000 and C_7500 respectively. For each cluster within these four clusterings, the most relevant tags are determined using χ² feature selection, leading to the vocabularies (i.e. sets of tags) V_500, V_2500, V_5000 and V_7500.

Finding the most likely area.

To determine the probability P(a|x) that a video x was taken in area a ∈ C_k, a unigram language modeling approach is used (except for run 4, which does not permit the use of textual tags), whereby [3]

P(a|x) ∝ ( ∏_{t ∈ tags_k(x)} P(t|a) ) · P(a)    (1)

where tags_k(x) is the set of tags from V_k that have been assigned to video x. The probability P(t|a) is estimated using Bayesian smoothing (see [6] for more details). Different to our system of last year, we estimate the prior probability P(a) using the home location of the owner of video x, in those runs where the use of gazetteer look-up was allowed, and for those videos where a textual home location was available and georeferencing did not fail. Specifically, we take

P(a) ∝ ( 1 / d(p_a, p_home) )^θ    (2)

where d refers to geodesic distance, p_a are the coordinates of the most central photo of area a (i.e. the medoid of the locations of the photos from the training data located in area a) and p_home are the coordinates obtained from the textual home location using the Google Geocoding API (http://code.google.com/apis/maps/documentation/geocoding/). The parameter θ was set to 0.75 in our experiments. If coordinates of the home location cannot be obtained, P(a) is estimated as the percentage of all photos from the training data that are contained in area a, i.e.

P(a) = |a| / ∑_{a' ∈ C_k} |a'|    (3)

identifying a with the set of photos from area a in the training data. In run 1, where a textual home location may be available, but gazetteer look-up is not allowed, (3) can be refined by looking at tags from the vocabulary V_k that appear in it:

P(a) ∝ ( ∏_{t ∈ homeTags(x) ∩ tags_k(x)} P(t|a) )^µ · |a| / ∑_{a' ∈ C_k} |a'|    (4)

where µ was set to 0.45 in the experiments.
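For concreteness, the sketch below shows one way equations (1)-(3) could be combined in practice, scoring candidate areas in log space. The area dictionaries, the Dirichlet smoothing parameter mu and the haversine helper are illustrative assumptions of this sketch, not details taken from the paper, which only specifies that Bayesian smoothing and geodesic distance are used.

<pre>
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def tag_log_prob(tag, area, corpus_tag_probs, mu=1750.0):
    """log P(t|a) with Dirichlet (Bayesian) smoothing; the value of mu is illustrative,
    the paper defers the exact smoothing formulation to [6]."""
    p = (area["tag_counts"].get(tag, 0) + mu * corpus_tag_probs.get(tag, 1e-9)) \
        / (area["total_tag_count"] + mu)
    return math.log(p)

def log_prior(area, areas, home_coords, theta=0.75):
    """log P(a): Eq. (2) when a geocoded home location is available, Eq. (3) otherwise."""
    if home_coords is not None:
        d = max(haversine_km(area["medoid"], home_coords), 1e-3)  # avoid log(0)
        return -theta * math.log(d)                               # log of (1/d)^theta
    total_photos = sum(a["num_photos"] for a in areas)
    return math.log(area["num_photos"] / total_photos)

def most_likely_area(video_tags, areas, corpus_tag_probs, home_coords=None):
    """Return the area a in C_k maximizing Eq. (1), scored in log space."""
    def score(area):
        s = log_prior(area, areas, home_coords)
        for tag in video_tags & area["vocabulary"]:  # tags_k(x): tags of x kept by chi-squared selection
            s += tag_log_prob(tag, area, corpus_tag_probs)
        return s
    return max(areas, key=score)
</pre>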
Determining the level of granularity.

The language modeling approach to georeferencing requires an appropriate level of granularity to be determined: for videos with more informative tags, it is beneficial to consider a finer-grained clustering. As a baseline technique for selecting the optimal value of k, we check the number of tags a video x has in common with the different vocabularies. If |tags_7500(x) ∩ V_7500| ≥ t_7500, with t_7500 an appropriate threshold value, k = 7500 is chosen. Otherwise, if |tags_5000(x) ∩ V_5000| ≥ t_5000 we select k = 5000, etc. For runs 1 and 2 the threshold values were chosen as t_500 = 1 and t_2500 = t_5000 = t_7500 = 2. For run 3, on the other hand, we set t_500 = t_2500 = t_5000 = t_7500 = 1.
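As an illustration only, this cascade for choosing k can be sketched as follows. The tuple of thresholds mirrors the values reported for runs 1 and 2, while returning None when even the coarsest threshold is not met is an assumption of the sketch (the paper's fall-back strategy for untagged videos is described later).

<pre>
# Threshold configuration as reported for runs 1 and 2 (t_500 = 1, the others = 2).
THRESHOLDS = [(7500, 2), (5000, 2), (2500, 2), (500, 1)]

def select_granularity(video_tags, vocabularies, thresholds=THRESHOLDS):
    """Pick the finest clustering whose vocabulary shares enough tags with the video.

    vocabularies maps k to the tag set V_k obtained by chi-squared feature selection.
    Returns None when even the coarsest threshold is not met, in which case the
    fall-back strategy for untagged videos would apply.
    """
    for k, t_k in thresholds:  # from finest (7500) to coarsest (500)
        if len(video_tags & vocabularies[k]) >= t_k:
            return k
    return None
</pre>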
Run 4 is not based on language models. For run 5, we used a technique based on Dempster-Shafer theory which was proposed in [5]. Intuitively, this approach combines the probability distributions obtained at each of the granularity levels into a single structure, called a belief function, and then determines the most likely area at the most appropriate level of granularity. Specifically, the most likely area was determined using the pignistic probability decision rule [4], choosing the granularity level as the most fine-grained level for which the pignistic probability was above the threshold of 0.6. While this approach allows for a better informed decision, it requires language model probabilities to be calibrated, which necessitates the use of a sufficiently large development set which is disjoint from the training set. Initial experiments revealed that the training set provided by the task organizers was not sufficiently large to allow for both accurate training and accurate calibration. Therefore this technique was only applied in run 5, using 10.7M photos for training and 1.07M photos for calibration.

Determining the location.

Once a suitable value of k has been chosen, the area a from C_k that maximizes (1) is determined. Subsequently the photo from area a (in the training data) which is most similar to the video x is determined, and its location is used as the prediction for the location of x. Similarity is determined by comparing the tags assigned to each photo with the tags assigned to x using Jaccard similarity (without feature selection).

As a fall-back strategy, if no tags have been assigned to x at all, the home location of x is used as the prediction (in those runs where the use of a gazetteer is allowed). If no home location is available, we use the location of the photo which is visually most similar to video x. To measure visual similarity, a photo is compared against the key frames of video x that were provided by the task organizers. Visual features were extracted using the Color and Edge Directivity Descriptor (CEDD) of the LIRE tool [1]. When different key frames of the video yield conflicting predictions (i.e. when they are most similar to different photos), the (key frame, photo) pair which provided the highest degree of similarity is used.
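To make the second step and its fall-back chain concrete, the sketch below picks a prediction from a pool of candidate photos. The photo dictionaries, the 'cedd' field and the L1 descriptor distance are assumptions of this illustration; the paper only states that Jaccard similarity is computed on tags and that CEDD descriptors from LIRE are used for the visual comparison.

<pre>
def jaccard(tags_a, tags_b):
    """Jaccard similarity between two tag sets (no feature selection applied here)."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0

def cedd_distance(d1, d2):
    """Placeholder distance between CEDD descriptor vectors; the paper does not
    state which metric is used, so a simple L1 distance stands in here."""
    return sum(abs(a - b) for a, b in zip(d1, d2))

def predict_location(video_tags, candidate_photos, home_coords=None, keyframe_descriptors=None):
    """Second step: return the coordinates of the most similar training photo.

    candidate_photos: photos of the chosen area, as dicts with hypothetical
    'tags', 'coords' and 'cedd' fields. keyframe_descriptors: CEDD descriptors
    of the key frames provided by the task organizers (visual fall-back only).
    """
    if video_tags:
        best = max(candidate_photos, key=lambda p: jaccard(video_tags, p["tags"]))
        return best["coords"]
    if home_coords is not None:
        # No tags at all: fall back to the geocoded home location (gazetteer runs).
        return home_coords
    # Last resort: the (key frame, photo) pair with the highest visual similarity,
    # i.e. the smallest descriptor distance over all pairs.
    _, best_photo = min(
        ((kf, p) for kf in keyframe_descriptors for p in candidate_photos),
        key=lambda pair: cedd_distance(pair[0], pair[1]["cedd"]),
    )
    return best_photo["coords"]
</pre>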
3. RESULTS AND DISCUSSION

The results of the five runs are provided in Table 1. In particular, the table shows how many of the 5347 videos in the test collection were localized within 1 km, 10 km, 100 km, 1000 km and 10 000 km of the correct location.

         1 km    10 km   100 km   1000 km   10000 km
run 1    1245    2386    3340     4010      5207
run 2    1294    2753    3883     4578      5232
run 3    1263    2665    3759     4499      5231
run 4       2       6      49      624      4332
run 5    2567    3528    4109     4672      5263

Table 1: Overview of the results on the test collection of 5347 videos, using textual tags and visual features (run 1); using textual tags, gazetteer services and visual features (runs 2 and 3); using only visual features (run 4); and using tags, gazetteers and visual features on an extended training set with the Dempster-Shafer approach (run 5).

As can be concluded by comparing the results of runs 1 and 2, using the geocoded home location substantially improves the results. Also, determining a good threshold value to fall back to a coarser clustering can impact the results, as is demonstrated by run 3, which only differs from run 2 in its choice of the threshold values t_500, t_2500, t_5000 and t_7500. Run 4 is a baseline run which only uses visual features. Unsurprisingly, run 5, which is based on a larger training set, yielded the best results. As further experiments have indicated, however, this increased performance is not only due to the larger training set, but also to the use of Dempster-Shafer theory to combine the different granularity levels.

4. REFERENCES

[1] M. Lux and S. A. Chatzichristofis. LIRE: Lucene Image Retrieval: an extensible Java CBIR library. In Proc. ACM Multimedia, pages 1085–1088, 2008.
[2] A. Rae, V. Murdock, P. Serdyukov, and P. Kelm. Working Notes for the Placing Task at MediaEval 2011. In Working Notes of the MediaEval Workshop, 2011.
[3] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In Proc. ACM SIGIR, pages 484–491, 2009.
[4] P. Smets. Constructing the pignistic probability function in a context of uncertainty. In Proc. UAI, pages 29–40, 1990.
[5] O. Van Laere, S. Schockaert, and B. Dhoedt. Combining multi-resolution evidence for georeferencing Flickr images. In Proc. SUM, pages 347–360, 2010.
[6] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In Proc. ACM ICMR, 2011.