The 2014 ICSI/TU Delft Location Estimation System

Jaeyoung Choi (1,2), Xinchao Li (2)
(1) International Computer Science Institute, Berkeley, CA, USA
(2) Multimedia Computing Group, Delft University of Technology, Netherlands
jaeyoung@icsi.berkeley.edu, x.li-3@tudelft.nl

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

ABSTRACT

In this paper, we describe the ICSI/TU Delft video location estimation system presented at the MediaEval 2014 Placing Task. We describe two text-based approaches, based on spatial variance and on a graphical model framework, a visual-content-based geo-visual ranking approach, and a multimodal approach that combines the text- and visual-based algorithms.

1. INTRODUCTION

The Placing Task 2014 [2] is to automatically estimate the geo-location of each query video using any or all of metadata, visual/audio content, and user information. For the text-based approaches, we used the spatial-variance-based baseline system [3] and the graphical model framework [1] that poses the geo-tagging problem as one of inference over a graph. The graphical model jointly estimates the geo-locations of all the test videos, which helps obtain performance improvements. The visual-based location estimation is based on evidence collected from images that are not only geographically close to the query's location but also visually similar to the query image within the considered image collection [4]. To fuse these systems' results, we ran both systems and chose the result of the text-based system as the overall result, except when its confidence was low, in which case we chose the visual-based result.

2. SYSTEM DESCRIPTION

2.1 Text-based Approach

2.1.1 Spatial Variance approach

The intuition behind this approach is that if the spatial distribution of a tag, based on the anchors in the development data set, is concentrated in a very small area, the tag is likely a toponym. If the spatial variance of the distribution is high, the tag is likely something other than a toponym. For a detailed description of our algorithm, see [3]. This approach was used as a baseline to evaluate the performance of the graphical model based algorithm. For each query, the confidence of the estimation was represented by e^{-v^2}, where v^2 is the lowest spatial variance among the query's keywords. From all available textual metadata, we utilized the user-annotated tags and the title. Machine tags were treated the same way as the user-annotated tags. This also applies to the following graphical model based approach.
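As an illustration of this selection rule, the following Python sketch picks the query tag with the lowest spatial variance and derives the e^{-v^2} confidence from it. The helper names and the centroid-based location estimate are our assumptions for the sketch, not the exact procedure of [3].

    import numpy as np

    def spatial_variance(anchor_coords):
        """Spatial variance of one tag's anchor locations.

        anchor_coords: (n, 2) array of [latitude, longitude] rows taken
        from the development set for a single tag.
        """
        coords = np.asarray(anchor_coords, dtype=float)
        centroid = coords.mean(axis=0)
        # Mean squared deviation from the centroid. A production system
        # would use geodesic distance; plain degrees keep the sketch short.
        return float(np.mean(np.sum((coords - centroid) ** 2, axis=1)))

    def estimate_location(query_tags, tag_anchors):
        """Return (lat/lon estimate, confidence e^{-v^2}) for one query.

        tag_anchors: dict mapping tag -> (n, 2) anchor coordinate array.
        The tag with the lowest spatial variance is treated as the toponym.
        """
        best_tag, lowest_v2 = None, float("inf")
        for tag in query_tags:
            if tag in tag_anchors:
                v2 = spatial_variance(tag_anchors[tag])
                if v2 < lowest_v2:
                    best_tag, lowest_v2 = tag, v2
        if best_tag is None:
            return None, 0.0  # no tag seen in the development set
        # Assumption for this sketch: estimate with the anchor centroid.
        estimate = np.asarray(tag_anchors[best_tag], dtype=float).mean(axis=0)
        return estimate, float(np.exp(-lowest_v2))  # confidence from Sec. 2.1.1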
2.1.2 Graphical model based approach

The random variables in our graphical model setup are the geo-locations of the query videos that need to be estimated [1]. We treat the textual tags as observed random variables that are probabilistically related to the geo-location of that video. The goal is to obtain the best estimate of the unobserved random variables (the locations of the query videos) given all the observed variables. We used graphical models to characterize the dependencies amongst the different random variables and efficient message-passing algorithms to obtain the desired estimates.

An undirected graphical model, or Markov Random Field (MRF), G(V, E) consists of a vertex set V and an edge set E. The vertices (nodes) of the graph represent random variables {x_v}_{v \in V}, and the edges capture the conditional independencies amongst the random variables through graph separation. The joint probability distribution of an N-node pairwise MRF can be written as follows:

    p(x_1, ..., x_N) = \prod_{i \in V} \psi(x_i) \prod_{(i,j) \in E} \psi(x_i, x_j).    (1)

The \psi(.)'s are known as potential functions and depend on the probability distribution of the random variables.

Given the training data, we fit a Gaussian Mixture Model (GMM) to the distribution of the location given a particular tag t, i.e., p(x|t). The intuition is that tags usually correspond to one or more specific locations, and the distribution is multi-modal (e.g., the tag "washington" can refer to two distinct geographic places). Given that for many of the tags the GMM will have one strong mixture component, the distribution \psi(x_i) can be approximated by a Gaussian distribution with mean \tilde{\mu}_i and variance \tilde{\sigma}_i^2 given by

    (\tilde{\mu}_i, \tilde{\sigma}_i^2) = \left( \frac{\sum_{k=1}^{n_i} \mu_i^k / \sigma_i^{k2}}{\sum_{k=1}^{n_i} 1 / \sigma_i^{k2}}, \; \frac{1}{\sum_{k=1}^{n_i} 1 / \sigma_i^{k2}} \right),    (2)

where \mu_i^k and \sigma_i^{k2} are the mean and variance of the mixture component with the largest weight of the distribution p(x_i | t_i^k). The location estimate \hat{x}_i for the i-th query video is taken to be \tilde{\mu}_i, and the variance \tilde{\sigma}_i^2 provides a confidence metric on the location estimate.
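Equation (2) is the standard precision-weighted combination of Gaussians. A minimal sketch of this fusion step follows; it is our illustration, assuming the dominant component of each tag's GMM has already been extracted, and not the authors' released code.

    import numpy as np

    def combine_tag_components(means, variances):
        """Precision-weighted fusion of per-tag Gaussians, as in Eq. (2).

        means:     list of mu_i^k, the mean of the highest-weight mixture
                   component of p(x_i | t_i^k) for each of the n_i tags;
                   each mean may be a scalar or a lat/lon pair.
        variances: matching list of sigma_i^{k2} values.
        Returns (mu_tilde, sigma2_tilde); sigma2_tilde doubles as the
        confidence metric on the location estimate.
        """
        means = np.asarray(means, dtype=float)
        precisions = 1.0 / np.asarray(variances, dtype=float)
        sigma2_tilde = 1.0 / precisions.sum()
        if means.ndim == 1:                      # scalar means
            mu_tilde = sigma2_tilde * (precisions * means).sum()
        else:                                    # one row per tag, e.g. (n_i, 2)
            mu_tilde = sigma2_tilde * (precisions[:, None] * means).sum(axis=0)
        return mu_tilde, sigma2_tilde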
2.2 Visual content-based approach

For the visual-based location estimation, we propose the Geo-Visual Ranking (GVR) approach [4]. The basic intuition is that, compared to images from a wrong location, images from the ground truth location will likely contain more elements of the visual content of the query image. Thus, instead of choosing the nearest-neighbor image, or relying on the biggest cluster of visual neighbors of the query image, we searched for geo-visual neighbors of the query image. Geo-visual neighbors are images that are sufficiently visually similar to the query image and were also taken at the same location as the query image. Consider a case where a query image has two visually similar geo-tagged images taken at different locations (which we refer to as candidate images). The nearest-neighbor approach faces difficulty in this situation, as the probability of selecting the wrong reference image from the two candidates is high. The GVR approach's estimation, however, is affected by additional sets of images found around both candidate images' locations (referred to as candidate geo-visual neighbors at candidate locations). These candidate geo-visual neighbors' contribution to the decision is based not just on the number of images in each set, but on their combined visual proximity to the query image, aggregated over all images in a set. Using the set's visual proximity makes it possible to point to the right candidate image even if it has a smaller set of geo-neighbors than the other candidate image. We used SURF descriptors extracted with the BoofCV software using the default parameters, and used exact k-means to cluster these descriptors and generate visual words.

2.3 Multimodal Approach

To fuse these systems' results, we ran both systems and, for text-based estimations with low confidence, used the visual-based result instead. The optimal confidence threshold was found using grid search over the development set; the chosen value corresponds to a variance v^2 of 25.
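A sketch of this selection rule follows; the function name and interface are hypothetical, and only the v^2 = 25 threshold comes from the text above.

    def fuse_estimates(text_estimate, text_variance, visual_estimate,
                       variance_threshold=25.0):
        """Late fusion as described in Sec. 2.3.

        Falls back to the visual-based estimate whenever the text-based
        confidence is low, i.e. its variance v^2 exceeds the threshold.
        The default of 25 is the grid-searched value reported above.
        """
        if text_estimate is None or text_variance > variance_threshold:
            return visual_estimate
        return text_estimate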
3. RESULTS AND DISCUSSION

We submitted five runs: run1, the spatial variance approach (text only); run2, the visual-content-based geo-visual ranking approach; run3, the graphical model based approach (text only); run4, spatial variance + GVR; and run5, graphical model based approach + GVR. Each column in Table 1 shows what percentage of test videos and images were placed within 10m, 100m, 1km, 10km, 100km, and 1000km of the ground truth location.

            10m    100m   1km    10km   100km  1000km
    run1    0.24   3.15   16.65  34.70  45.58  60.67
    run2    0.17   1.60   3.88   5.86   6.82   17.43
    run3    0.22   2.75   16.28  46.20  52.81  72.19
    run4    0.31   3.41   12.13  19.95  22.82  33.79
    run5    0.30   3.12   12.75  24.82  27.33  42.89
    Oracle  0.41   4.52   19.05  37.02  47.86  65.81

Table 1: Percentage of correctly estimated query images/videos for each run.

For the text-based approaches (run1 and run3), the graphical model approach performed slightly worse than the spatial variance approach in locating videos within 10m, 100m, and 1km of the ground truth location, but outperformed it in the other ranges by a large margin. One theory behind this result is that the graphical model's belief propagation process moves the query node away from the ground truth location when reference images or videos that are far from the ground truth exert more influence than desired. For both text-based approaches, we ignored the description of the photo/video, as its usage degraded the performance.

The visual-based approach (run2) has lower accuracy in all ranges when compared to the text-based approaches (run1 and run3). However, note that the visual-only result does relatively well in the lower error ranges (10m, 100m, and 1km). This implies that local feature matching gives a very good estimate when a similar image can be found in the training set.

For the multimodal approaches (run4 and run5), replacing text-based estimations that had a low confidence score with the visual-based estimation helped improve the system's performance in the 10m and 100m ranges. The Oracle row in Table 1 shows the result of an oracle-condition experiment in which, for each query, we chose whichever estimate from run1 and run2 had the shorter error distance. It represents an upper bound for the multimodal approach and shows that the possible margin of performance improvement is large. Future work needs to investigate an optimal method for fusing multimodal features.

4. ACKNOWLEDGMENTS

This work was partially supported by funding provided to ICSI through National Science Foundation grant IIS:1251276 ("BIGDATA: Small: DCM: DA: Collaborative Research: SMASH - Scalable Multimedia content AnalysiS in a High-level language"). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of the National Science Foundation.

5. REFERENCES

[1] J. Choi, G. Friedland, V. Ekambaram, and K. Ramchandran. Multimodal location estimation of consumer media: Dealing with sparse training data. In Multimedia and Expo (ICME), 2012 IEEE International Conference on, pages 43-48. IEEE, 2012.
[2] J. Choi, B. Thomee, G. Friedland, L. Cao, K. Ni, D. Borth, B. Elizalde, L. Gottlieb, C. Carrano, R. Pearce, and D. Poland. The placing task: A large-scale geo-estimation challenge for social-media videos and images. In Proceedings of the 3rd ACM International Workshop on Geotagging and Its Applications in Multimedia, 2014.
[3] G. Friedland, J. Choi, H. Lei, and A. Janin. Multimodal location estimation on Flickr videos. In Proceedings of the 3rd SIGMM Workshop on Social Media, in conjunction with ACM Multimedia, 2011.
[4] X. Li, M. Riegler, M. Larson, and A. Hanjalic. Exploration of feature combination in geo-visual ranking for visual content-based location prediction. In MediaEval, 2013.