WISTUD at MediaEval 2011: Placing Task

Claudia Hauff and Geert-Jan Houben
Web Information Systems, Delft University of Technology, Delft, The Netherlands
c.hauff@tudelft.nl, g.j.p.m.houben@tudelft.nl

ABSTRACT
In this paper, we describe our approach to estimating the geographic location of videos. Our system relies on textual meta-data and includes two basic term filtering strategies: filtering according to the general use of terms and filtering according to the geographic spread of terms. Combining both filtering steps yields 50% accuracy within a 10 km range.

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

1. INTRODUCTION
The goal of the Placing Task [4] is to assign geographic locations (latitude and longitude values) to test videos, using textual meta-data, audio and visual features, as well as social information that is available in the training and test data. The training corpus consists of approximately three million Flickr images and ten thousand videos. All but one of our experiments rely exclusively on the textual meta-data.

2. SYSTEM AND RUNS
In the first year of participating at MediaEval, we focused our efforts on building a system that predicts a video's location based on the textual meta-data assigned to it, in particular the tags and the title terms. We follow the approach described in [5] and divide the world map into a number of cells with varying latitude/longitude ranges, assigning all items of the training data to their respective cells. All available images from the development set with an accuracy of 11 or higher, as well as all training videos, were used for training (2,974,635 items in total). Since we rely on textual meta-data, we can treat images and videos in the same manner. Then, for each cell, a language model [7] is derived from the items' textual meta-data. Predicting the location of a test video is a two-step process: first, the cell Cmax is identified whose language model generates the test video's bag-of-words (tags and title terms) Ttest with the highest probability. In a second step, the same process is repeated within Cmax to find the most closely matching training item Imax. The latitude/longitude of Imax is returned as the estimated location of the test video.
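The two-step prediction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy data structures, and the back-off constant for terms unseen everywhere are our assumptions; only the Dirichlet-smoothed unigram language model and the value µ = 5000 come from the paper.

```python
import math
from collections import Counter

MU = 5000  # Dirichlet smoothing parameter (the value used in the paper)

def log_prob(query, cell_counts, bg_counts, bg_total, mu=MU):
    """Log-probability of a bag of words under a cell's Dirichlet-smoothed unigram LM."""
    cell_total = sum(cell_counts.values())
    score = 0.0
    for term in query:
        p_bg = bg_counts.get(term, 0) / bg_total          # collection (background) model
        p = (cell_counts.get(term, 0) + mu * p_bg) / (cell_total + mu)
        score += math.log(p if p > 0 else 1e-12)          # guard: term unseen everywhere
    return score

def predict(query, cells, items_by_cell, bg_counts, bg_total):
    """Two-step estimate: pick the best-scoring cell, then the best item inside it."""
    c_max = max(cells, key=lambda c: log_prob(query, cells[c], bg_counts, bg_total))
    i_max = max(items_by_cell[c_max],
                key=lambda it: log_prob(query, Counter(it["terms"]), bg_counts, bg_total))
    return i_max["latlng"]
```

With toy cells for Sydney and London, a query such as ["sydney", "harbour"] selects the Sydney cell and returns the latitude/longitude of its best-matching item.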
In contrast to [5], the grid cells in our approach are of varying size: starting with a grid cell that spans the entire world map (if viewed as a graph, this cell is the root node), the training items are added to the cell one at a time. Once the number of items in a cell exceeds the set limit ℓ_split, the cell is split into four equally sized cells (four children nodes are added) and the training items are re-distributed to these cells. To avoid too many splits in areas where large amounts of training data are available, a cell may not be split any further if its latitude/longitude range reaches a lower limit ℓ_latlng. This process yields cells of small size in areas where the training data is dense, and cells of large size in areas where the training data is sparse.

If a test video contains no tags or title terms (or all terms are filtered out as described below), the terms in the user location are used instead, a fall-back strategy inspired by [2]: if a user does not tag a video with its location, it is likely to have been taken at the user's home location. In contrast to [2], we add the user location terms to Ttest instead of relying on an external resource to convert the user location to latitude/longitude coordinates. Finally, if the user location yields no usable terms, a latitude/longitude of 0/0 is assigned to the test video.
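The adaptive splitting above amounts to a quadtree over latitude/longitude. The sketch below is our illustration under stated assumptions: the class name `Cell` and the small default limits are hypothetical (the paper uses ℓ_split = 5000 and ℓ_latlng = 0.01), and only the latitude range is checked against the lower limit since both extents halve together.

```python
class Cell:
    """Quadtree cell over (lat, lng); splits once it holds more than l_split items."""
    def __init__(self, lat0, lat1, lng0, lng1, l_split=2, l_latlng=0.01):
        self.bounds = (lat0, lat1, lng0, lng1)
        self.l_split, self.l_latlng = l_split, l_latlng
        self.items, self.children = [], None

    def add(self, lat, lng, item):
        if self.children is not None:          # internal node: delegate to a child
            self._child(lat, lng).add(lat, lng, item)
            return
        self.items.append((lat, lng, item))
        lat0, lat1, lng0, lng1 = self.bounds
        # split into four children unless the cell reached the minimum extent
        if len(self.items) > self.l_split and (lat1 - lat0) > self.l_latlng:
            mid_lat, mid_lng = (lat0 + lat1) / 2, (lng0 + lng1) / 2
            self.children = [Cell(a, b, c, d, self.l_split, self.l_latlng)
                             for a, b in ((lat0, mid_lat), (mid_lat, lat1))
                             for c, d in ((lng0, mid_lng), (mid_lng, lng1))]
            for plat, plng, pitem in self.items:   # re-distribute stored items
                self._child(plat, plng).add(plat, plng, pitem)
            self.items = []

    def _child(self, lat, lng):
        lat0, lat1, lng0, lng1 = self.bounds
        mid_lat, mid_lng = (lat0 + lat1) / 2, (lng0 + lng1) / 2
        return self.children[(2 if lat >= mid_lat else 0) + (1 if lng >= mid_lng else 0)]

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for ch in self.children for leaf in ch.leaves()]
```

Dense regions therefore end up with many small leaf cells, while sparse regions keep a few large ones.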
2.1 Term Filtering
We experiment with two basic term filters. Filtered out from Ttest are (i) terms that are used by fewer than U users in the training data, and (ii) terms with a geographic spread score greater than a threshold θgeo. Excluding terms that are used by very few users is hypothesized to improve the robustness of the approach.

Geographic spread filtering is applied for a similar reason: a video may be tagged with a number of non-geographic terms such as "wedding" or "bowling" in addition to tags that are likely to refer to locations, such as "london" or "sydney". Whether a term is likely to have a geographic scope can either be determined by matching the term against a geographical database (such as GeoNames, http://www.geonames.org/) or by considering how localized the term is in the training data. We follow the latter approach here as it does not require any external resources. While in the development data the term "sydney" occurs primarily in one particular grid cell (as expected, the cell containing the location of Sydney, Australia), the term "bowling" is spread considerably wider, mainly across North America. This observation leads to a simple but effective geographic spread score: a grid is placed over the world map (1 degree latitude/longitude range per cell) and the number of training items in each cell that contain the term is recorded. Neighbouring grid cells with a non-zero count are merged (in order to avoid penalizing geographic terms that cover a wide area) and the number of non-zero connected components is determined. This score is normalized by the maximum cell count. Thus, the smaller the score, the more localized the term is in the training data. Our approach is simpler than the χ² feature selection based geo-term filtering of [6], which determines the geographic score for the tags of each cell separately. Examples of terms and their geographic spread scores are shown in Table 1. While the scores of most terms appear reasonable, "british" is incorrectly identified as non-geographic (if we assume a threshold of θgeo = 0.1), as it is not only used to tag pictures taken in the United Kingdom: in the development data it is also used to describe British Columbia (Canada), the British Virgin Islands (Caribbean), British restaurants (mainly in the USA) and places where historical battles against the British took place (mainly in the USA).

Table 1: Examples of geographic spread scores. In our experiments, we use a threshold of θgeo = 0.1.

    Term            Geographic Spread Score
    bowling         3.237
    baby            1.809
    valley          1.512
    british         0.363
    lakepukaki      0.049
    españa          0.021
    thenetherlands  0.011
    london          0.010
    sydney          0.007
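The spread score described above can be computed with a simple flood fill. This is our sketch, not the authors' code: the function name and input format are assumptions, cells are binned with `floor`, the 4-neighbourhood is used for merging, and longitude wrap-around at ±180 degrees is ignored for simplicity.

```python
import math
from collections import Counter

def geo_spread(term_items):
    """Geographic spread score for a term: the number of connected components
    of occupied 1-degree grid cells, normalized by the maximum per-cell count.
    term_items holds the (lat, lng) coordinates of the training items that
    contain the term."""
    counts = Counter((math.floor(lat), math.floor(lng)) for lat, lng in term_items)
    if not counts:
        return float("inf")  # term never observed: treat as maximally spread
    todo, components = set(counts), 0
    while todo:  # flood-fill over the 4-neighbourhood of occupied cells
        components += 1
        stack = [todo.pop()]
        while stack:
            la, ln = stack.pop()
            for nb in ((la + 1, ln), (la - 1, ln), (la, ln + 1), (la, ln - 1)):
                if nb in todo:
                    todo.remove(nb)
                    stack.append(nb)
    return components / max(counts.values())
```

A term concentrated in one cell with many occurrences gets a score far below 1, while a term scattered thinly across distant cells scores well above 1, matching the ordering in Table 1.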
2.2 Run Descriptions
Based on the results of preliminary experiments, we fixed a number of parameters across all submitted runs: language modeling with Dirichlet smoothing (µ = 5000), ℓ_split = 5000 and ℓ_latlng = 0.01. These settings result in a total of 1,786 non-empty cells. The maximum cell extents in latitude and longitude are 22.5 and 45.0 degrees, respectively, in areas of the world map where the development data is sparse. Listed below are the details of the submitted runs:

Basic: baseline run without term filtering.

Gen: run with general term filtering applied, U = 2.

GeoGen: run with geographic and general term filtering applied, U = 2 and θgeo = 0.1.

UserSpecific: run with geographic and general term filtering applied, U = 2 and θgeo = 0.1. If the user who uploaded the test video has contributed at least one item to the training data set, only the user's training items are utilized to create the grid cells and language models (similar to [2]).

Visual: run based on the provided visual features. The partitioning of the training data is the same as in the text-based approaches, though for performance reasons only 10% of the training data was used. The Naive-Bayes nearest neighbour approach [1] with all visual features was implemented.
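The combined filtering used in the GeoGen and UserSpecific runs can be expressed as one predicate over Ttest. The sketch below is illustrative: the function name and the per-term lookup tables are our assumptions, and terms lacking a recorded spread score are conservatively dropped; only the thresholds U = 2 and θgeo = 0.1 come from the run descriptions.

```python
def filter_terms(terms, user_counts, spread_scores, u_min=2, theta_geo=0.1):
    """Keep only terms used by at least u_min distinct users and whose
    geographic spread score is at most theta_geo (the GeoGen configuration).
    Terms with no recorded spread score are treated as non-geographic."""
    return [t for t in terms
            if user_counts.get(t, 0) >= u_min
            and spread_scores.get(t, float("inf")) <= theta_geo]
```

Setting theta_geo to infinity recovers the Gen run (general filtering only), and skipping the call entirely gives the Basic run.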
3. RESULTS
The results of the listed runs are shown in Table 2. Reported is the accuracy within {1, 10, 50, 1000} km of the ground truth location.

Table 2: Prediction accuracy of the runs for a number of distance cutoffs.

    Run           1 km    10 km   50 km   1000 km
    Basic         20.3%   38.2%   49.2%   66.4%
    Gen           21.5%   40.5%   51.2%   67.8%
    GeoGen        17.2%   50.8%   70.0%   82.6%
    UserSpecific  17.8%   38.0%   52.1%   72.7%
    Visual         0.0%    0.1%    0.7%   10.9%

4. DISCUSSION
The biggest improvements over the baseline run are achieved by filtering out terms that have a large geographic spread. The only exception is the 1 km cutoff, where Basic outperforms GeoGen. We hypothesize that once the correct cell Cmax is identified in the first step of the estimation process, finding the closest match among the training documents of Cmax may be more robust if all terms of Ttest are used. Although more than 80% of the test set users also contributed items to the training set (on average 582 items), relying only on the user's contributed items for training did not yield improvements over relying on all available training items.

Our implementation of the nearest neighbour approach based on visual features did not result in a usable location estimator. Future work will focus on a failure analysis of this sub-system. Exploiting weather and daylight information to place outdoor images on a map, e.g., [3], will also be investigated. Finally, we plan to research to what extent social network information (such as the home location of the user's contacts, the locations of the images the user comments on, etc.) can improve the text-based location estimation of images that are geographically underspecified.
Acknowledgments
This research has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. ICT 257831 (ImREAL project).

5. REFERENCES
[1] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR '08, pages 1-8, 2008.
[2] J. Choi, A. Janin, and G. Friedland. The 2010 ICSI video location estimation system. In MediaEval 2010 Workshop, 2010.
[3] N. Jacobs, K. Miskell, and R. Pless. Webcam geo-localization using aggregate light levels. In WACV '11, pages 132-138, 2011.
[4] A. Rae, V. Murdock, P. Serdyukov, and P. Kelm. Working notes for the Placing Task at MediaEval 2011. In MediaEval 2011 Workshop, 2011.
[5] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In SIGIR '09, pages 484-491, 2009.
[6] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR '11, pages 48:1-48:8, 2011.
[7] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In SIGIR '01, pages 334-342, 2001.