The 2014 ICSI/TU Delft Location Estimation System

Jaeyoung Choi (1,2), Xinchao Li (2)
(1) International Computer Science Institute, Berkeley, CA, USA
(2) Multimedia Computing Group, Delft University of Technology, Netherlands
jaeyoung@icsi.berkeley.edu, x.li-3@tudelft.nl

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

ABSTRACT

In this paper, we describe the ICSI/TU Delft video location estimation system presented at the MediaEval 2014 Placing Task. We describe two text-based approaches, based on spatial variance and on a graphical model framework, a visual-content-based geo-visual ranking approach, and a multimodal approach that combines the text- and visual-based algorithms.

1. INTRODUCTION

The Placing Task 2014 [2] is to automatically estimate the geo-location of each query video using any or all of metadata, visual/audio content, and user information. For the text-based approaches, we used the spatial-variance-based baseline system [3] and the graphical model framework [1] that poses the geo-tagging problem as one of inference over a graph. The graphical model jointly estimates the geo-locations of all the test videos, which helps obtain performance improvements. The visual-based location estimation is based on evidence collected from images that are not only geographically close to the query's location but also visually similar to the query image within the considered image collection [4]. To fuse these systems' results, we ran both systems and chose the result of the text-based system as the overall result, except when its confidence was low, in which case we chose the visual-based result.

2. SYSTEM DESCRIPTION

2.1 Text-based Approach

2.1.1 Spatial Variance approach

The intuition behind this approach is that if the spatial distribution of a tag, based on the anchors in the development data set, is concentrated in a very small area, the tag is likely a toponym. If the spatial variance of the distribution is high, the tag is likely something other than a toponym. For a detailed description of our algorithm, see [3]. This approach was used as a baseline to evaluate the performance of the graphical model based algorithm. For each query, the confidence of the estimation was represented by e^{-v^2}, where v^2 is the lowest spatial variance among the query's keywords. From all available textual metadata, we utilized the user-annotated tags and the title. Machine tags were treated the same way as the user-annotated tags. This also applies to the following graphical model based approach.
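As an illustration of this selection rule, the following Python sketch picks the query tag with the lowest spatial variance and derives the e^{-v^2} confidence from it. The helper names and the centroid-based location estimate are our assumptions for the sketch, not the exact procedure of [3].

    import numpy as np

    def spatial_variance(anchor_coords):
        """Spatial variance of one tag's anchor locations.

        anchor_coords: (n, 2) array of [latitude, longitude] rows taken
        from the development set for a single tag.
        """
        coords = np.asarray(anchor_coords, dtype=float)
        centroid = coords.mean(axis=0)
        # Mean squared deviation from the centroid. A production system
        # would use geodesic distance; plain degrees keep the sketch short.
        return float(np.mean(np.sum((coords - centroid) ** 2, axis=1)))

    def estimate_location(query_tags, tag_anchors):
        """Return (lat/lon estimate, confidence e^{-v^2}) for one query.

        tag_anchors: dict mapping tag -> (n, 2) anchor coordinate array.
        The tag with the lowest spatial variance is treated as the toponym.
        """
        best_tag, lowest_v2 = None, float("inf")
        for tag in query_tags:
            if tag in tag_anchors:
                v2 = spatial_variance(tag_anchors[tag])
                if v2 < lowest_v2:
                    best_tag, lowest_v2 = tag, v2
        if best_tag is None:
            return None, 0.0  # no tag seen in the development set
        # Assumption for this sketch: estimate with the anchor centroid.
        estimate = np.asarray(tag_anchors[best_tag], dtype=float).mean(axis=0)
        return estimate, float(np.exp(-lowest_v2))  # confidence from Sec. 2.1.1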
2.1.2 Graphical model based approach

The random variables in our graphical model setup are the geo-locations of the query videos that need to be estimated [1]. We treat the textual tags as observed random variables that are probabilistically related to the geo-location of that video. The goal is to obtain the best estimate of the unobserved random variables (the locations of the query videos) given all the observed variables. We used graphical models to characterize the dependencies amongst the different random variables and efficient message-passing algorithms to obtain the desired estimates.

An undirected graphical model, or Markov Random Field (MRF), G(V, E) consists of a vertex set V and an edge set E. The vertices (nodes) of the graph represent random variables {x_v}_{v \in V}, and the edges capture the conditional independencies amongst the random variables through graph separation. The joint probability distribution of an N-node pairwise MRF can be written as follows:

    p(x_1, ..., x_N) = \prod_{i \in V} \psi(x_i) \prod_{(i,j) \in E} \psi(x_i, x_j).    (1)

The \psi(.)'s are known as potential functions and depend on the probability distribution of the random variables.

Given the training data, we fit a Gaussian Mixture Model (GMM) to the distribution of the location given a particular tag t, i.e., p(x|t). The intuition is that tags usually correspond to one or more specific locations, and the distribution is multi-modal (e.g., the tag "washington" can refer to two distinct geographic places). Given that for many of the tags the GMM will have one strong mixture component, the distribution \psi(x_i) can be approximated by a Gaussian distribution with mean \tilde{\mu}_i and variance \tilde{\sigma}_i^2 given by

    (\tilde{\mu}_i, \tilde{\sigma}_i^2) = \left( \frac{\sum_{k=1}^{n_i} \mu_i^k / \sigma_i^{k2}}{\sum_{k=1}^{n_i} 1 / \sigma_i^{k2}}, \; \frac{1}{\sum_{k=1}^{n_i} 1 / \sigma_i^{k2}} \right),    (2)

where \mu_i^k and \sigma_i^{k2} are the mean and variance of the mixture component with the largest weight of the distribution p(x_i | t_i^k). The location estimate \hat{x}_i for the i-th query video is taken to be \tilde{\mu}_i, and the variance \tilde{\sigma}_i^2 provides a confidence metric on the location estimate.
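Equation (2) is the standard precision-weighted combination of Gaussians. A minimal sketch of this fusion step follows; it is our illustration, assuming the dominant component of each tag's GMM has already been extracted, and not the authors' released code.

    import numpy as np

    def combine_tag_components(means, variances):
        """Precision-weighted fusion of per-tag Gaussians, as in Eq. (2).

        means:     list of mu_i^k, the mean of the highest-weight mixture
                   component of p(x_i | t_i^k) for each of the n_i tags;
                   each mean may be a scalar or a lat/lon pair.
        variances: matching list of sigma_i^{k2} values.
        Returns (mu_tilde, sigma2_tilde); sigma2_tilde doubles as the
        confidence metric on the location estimate.
        """
        means = np.asarray(means, dtype=float)
        precisions = 1.0 / np.asarray(variances, dtype=float)
        sigma2_tilde = 1.0 / precisions.sum()
        if means.ndim == 1:                      # scalar means
            mu_tilde = sigma2_tilde * (precisions * means).sum()
        else:                                    # one row per tag, e.g. (n_i, 2)
            mu_tilde = sigma2_tilde * (precisions[:, None] * means).sum(axis=0)
        return mu_tilde, sigma2_tilde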
2.2 Visual content-based approach

For the visual-based location estimation, we propose the Geo-Visual Ranking (GVR) approach [4]. The basic intuition is that, compared to images from a wrong location, images from the ground truth location will likely contain more elements of the visual content of the query image. Thus, instead of choosing the nearest-neighbor image, or relying on the biggest cluster of visual neighbors of the query image, we searched for geo-visual neighbors of the query image. Geo-visual neighbors are images that are sufficiently visually similar to the query image and were also taken at the same location as the query image. Consider a case where a query image has two visually similar geo-tagged images taken at different locations (which we refer to as candidate images). The nearest-neighbor approach faces difficulty in this situation, as the probability of selecting the wrong reference image from the two candidates is high. The GVR approach's estimation, however, is affected by additional sets of images found around both candidate images' locations (referred to as candidate geo-visual neighbors at candidate locations). These candidate geo-visual neighbors' contribution to the decision is based not just on the number of images in each set, but on their combined visual proximity to the query image, aggregated over all images in a set. Using the set's visual proximity makes it possible to point to the right candidate image even if it has a smaller set of geo-neighbors than the other candidate image. We used SURF descriptors extracted with the BoofCV software using the default parameters, and used exact k-means to cluster these descriptors and generate visual words.

2.3 Multimodal Approach

To fuse these systems' results, we ran both systems and, for text-based estimations with low confidence, used the visual-based result instead. The optimal confidence threshold was found using grid search over the development set; the chosen value corresponds to a variance v^2 of 25.
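A sketch of this selection rule follows; the function name and interface are hypothetical, and only the v^2 = 25 threshold comes from the text above.

    def fuse_estimates(text_estimate, text_variance, visual_estimate,
                       variance_threshold=25.0):
        """Late fusion as described in Sec. 2.3.

        Falls back to the visual-based estimate whenever the text-based
        confidence is low, i.e. its variance v^2 exceeds the threshold.
        The default of 25 is the grid-searched value reported above.
        """
        if text_estimate is None or text_variance > variance_threshold:
            return visual_estimate
        return text_estimate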
3. RESULTS AND DISCUSSION

We submitted five runs: run1, the spatial variance approach (text only); run2, the visual-content-based geo-visual ranking approach; run3, the graphical model based approach (text only); run4, spatial variance + GVR; and run5, graphical model based approach + GVR. Each column in Table 1 shows what percentage of test videos and images were placed within 10m, 100m, 1km, 10km, 100km, and 1000km of the ground truth location.

            10m    100m   1km    10km   100km  1000km
    run1    0.24   3.15   16.65  34.70  45.58  60.67
    run2    0.17   1.60   3.88   5.86   6.82   17.43
    run3    0.22   2.75   16.28  46.20  52.81  72.19
    run4    0.31   3.41   12.13  19.95  22.82  33.79
    run5    0.30   3.12   12.75  24.82  27.33  42.89
    Oracle  0.41   4.52   19.05  37.02  47.86  65.81

Table 1: Percentage of correctly estimated query images/videos for each run.

For the text-based approaches (run1 and run3), the graphical model approach performed slightly worse than the spatial variance approach in locating videos within 10m, 100m, and 1km of the ground truth location, but outperformed it in the other ranges by a large margin. One theory behind this result is that the graphical model's belief propagation process moves the query node away from the ground truth location when reference images or videos that are far from the ground truth exert more influence than desired. For both text-based approaches, we ignored the description of the photo/video, as its usage degraded the performance.

The visual-based approach (run2) has lower accuracy in all ranges when compared to the text-based approaches (run1 and run3). However, note that the visual-only result does relatively well in the lower error ranges (10m, 100m, and 1km). This implies that local feature matching gives a very good estimate when a similar image can be found in the training set.

For the multimodal approaches (run4 and run5), replacing text-based estimations that had a low confidence score with the visual-based estimation helped improve the system's performance in the 10m and 100m ranges. The Oracle row in Table 1 shows the result of an oracle-condition experiment in which, for each query, we chose whichever estimate from run1 and run2 had the shorter error distance. It represents an upper bound for the multimodal approach and shows that the possible margin of performance improvement is large. Future work needs to investigate an optimal method for fusing multimodal features.

4. ACKNOWLEDGMENTS

This work was partially supported by funding provided to ICSI through National Science Foundation grant IIS:1251276 ("BIGDATA: Small: DCM: DA: Collaborative Research: SMASH - Scalable Multimedia content AnalysiS in a High-level language"). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of the National Science Foundation.

5. REFERENCES

[1] J. Choi, G. Friedland, V. Ekambaram, and K. Ramchandran. Multimodal location estimation of consumer media: Dealing with sparse training data. In Multimedia and Expo (ICME), 2012 IEEE International Conference on, pages 43-48. IEEE, 2012.
[2] J. Choi, B. Thomee, G. Friedland, L. Cao, K. Ni, D. Borth, B. Elizalde, L. Gottlieb, C. Carrano, R. Pearce, and D. Poland. The placing task: A large-scale geo-estimation challenge for social-media videos and images. In Proceedings of the 3rd ACM International Workshop on Geotagging and Its Applications in Multimedia, 2014.
[3] G. Friedland, J. Choi, H. Lei, and A. Janin. Multimodal location estimation on Flickr videos. In Proceedings of the 3rd SIGMM Workshop on Social Media, in conjunction with ACM Multimedia, 2011.
[4] X. Li, M. Riegler, M. Larson, and A. Hanjalic. Exploration of feature combination in geo-visual ranking for visual content-based location prediction. In MediaEval, 2013.