WISTUD at MediaEval 2011: Placing Task

Claudia Hauff and Geert-Jan Houben
Web Information Systems, Delft University of Technology, Delft, The Netherlands
c.hauff@tudelft.nl, g.j.p.m.houben@tudelft.nl

ABSTRACT
In this paper, we describe our approach to estimating the geographic location of videos. Our system relies on textual meta-data and includes two basic term filtering strategies: filtering according to the general use of terms and filtering according to the geographic spread of terms. Combining both filtering steps yields 50% accuracy within a 10 km range.

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

1. INTRODUCTION
The goal of the Placing Task [4] is to assign geographic locations (latitude and longitude values) to test videos, using textual meta-data, audio and visual features, as well as social information that is available in the training and test data. The training corpus consists of approximately three million Flickr images and ten thousand videos. All but one of our experiments rely exclusively on the textual meta-data.

2. SYSTEM AND RUNS
In the first year of participating at MediaEval, we focused our efforts on building a system that predicts a video's location based on the textual meta-data assigned to it, in particular the tags and the title terms. We follow the approach described in [5] and divide the world map into a number of cells with varying latitude/longitude ranges, assigning all items of the training data to their respective cells. All available images from the development set with an accuracy of 11 or higher, as well as all training videos, were used for training (2,974,635 items in total). Since we rely on textual meta-data, we can treat images and videos in the same manner. Then, for each cell, a language model [7] is derived from the items' textual meta-data. Predicting the location of a test video is a two-step process: first, the cell Cmax is identified whose language model generates the test video's bag-of-words (tags and title terms) Ttest with the highest probability. In a second step, the same process is repeated within Cmax to find the most closely matching training item Imax. The latitude/longitude of Imax is returned as the estimated location of the test video.
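The two-step prediction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy data structures, and the back-off constant for terms unseen everywhere are our assumptions; only the Dirichlet-smoothed unigram language model and the value µ = 5000 come from the paper.

```python
import math
from collections import Counter

MU = 5000  # Dirichlet smoothing parameter (the value used in the paper)

def log_prob(query, cell_counts, bg_counts, bg_total, mu=MU):
    """Log-probability of a bag of words under a cell's Dirichlet-smoothed unigram LM."""
    cell_total = sum(cell_counts.values())
    score = 0.0
    for term in query:
        p_bg = bg_counts.get(term, 0) / bg_total          # collection (background) model
        p = (cell_counts.get(term, 0) + mu * p_bg) / (cell_total + mu)
        score += math.log(p if p > 0 else 1e-12)          # guard: term unseen everywhere
    return score

def predict(query, cells, items_by_cell, bg_counts, bg_total):
    """Two-step estimate: pick the best-scoring cell, then the best item inside it."""
    c_max = max(cells, key=lambda c: log_prob(query, cells[c], bg_counts, bg_total))
    i_max = max(items_by_cell[c_max],
                key=lambda it: log_prob(query, Counter(it["terms"]), bg_counts, bg_total))
    return i_max["latlng"]
```

With toy cells for Sydney and London, a query such as ["sydney", "harbour"] selects the Sydney cell and returns the latitude/longitude of its best-matching item.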
In contrast to [5], the grid cells in our approach are of varying size: starting with a grid cell that spans the entire world map (if viewed as a graph, this cell is the root node), the training items are added to the cell one at a time. Once the number of items in a cell exceeds the set limit ℓ_split, the cell is split into four equally sized cells (four children nodes are added) and the training items are re-distributed to these cells. To avoid too many splits in areas where large amounts of training data are available, a cell may not be split any further if its latitude/longitude range reaches a lower limit ℓ_latlng. This process yields cells of small size in areas where the training data is dense, and cells of large size in areas where the training data is sparse.

If a test video contains no tags or title terms (or all terms are filtered out as described below), the terms in the user location are used instead, a fall-back strategy inspired by [2]: if a user does not tag a video with its location, it is likely to have been taken at the user's home location. In contrast to [2], we add the user location terms to Ttest instead of relying on an external resource to convert the user location to latitude/longitude coordinates. Finally, if the user location yields no usable terms, a latitude/longitude of 0/0 is assigned to the test video.
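The adaptive splitting above amounts to a quadtree over latitude/longitude. The sketch below is our illustration under stated assumptions: the class name `Cell` and the small default limits are hypothetical (the paper uses ℓ_split = 5000 and ℓ_latlng = 0.01), and only the latitude range is checked against the lower limit since both extents halve together.

```python
class Cell:
    """Quadtree cell over (lat, lng); splits once it holds more than l_split items."""
    def __init__(self, lat0, lat1, lng0, lng1, l_split=2, l_latlng=0.01):
        self.bounds = (lat0, lat1, lng0, lng1)
        self.l_split, self.l_latlng = l_split, l_latlng
        self.items, self.children = [], None

    def add(self, lat, lng, item):
        if self.children is not None:          # internal node: delegate to a child
            self._child(lat, lng).add(lat, lng, item)
            return
        self.items.append((lat, lng, item))
        lat0, lat1, lng0, lng1 = self.bounds
        # split into four children unless the cell reached the minimum extent
        if len(self.items) > self.l_split and (lat1 - lat0) > self.l_latlng:
            mid_lat, mid_lng = (lat0 + lat1) / 2, (lng0 + lng1) / 2
            self.children = [Cell(a, b, c, d, self.l_split, self.l_latlng)
                             for a, b in ((lat0, mid_lat), (mid_lat, lat1))
                             for c, d in ((lng0, mid_lng), (mid_lng, lng1))]
            for plat, plng, pitem in self.items:   # re-distribute stored items
                self._child(plat, plng).add(plat, plng, pitem)
            self.items = []

    def _child(self, lat, lng):
        lat0, lat1, lng0, lng1 = self.bounds
        mid_lat, mid_lng = (lat0 + lat1) / 2, (lng0 + lng1) / 2
        return self.children[(2 if lat >= mid_lat else 0) + (1 if lng >= mid_lng else 0)]

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for ch in self.children for leaf in ch.leaves()]
```

Dense regions therefore end up with many small leaf cells, while sparse regions keep a few large ones.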
2.1 Term Filtering
We experiment with two basic term filters. Filtered out from Ttest are (i) terms that are used by fewer than U users in the training data, and (ii) terms with a geographic spread score greater than a threshold θgeo. Excluding terms that are used by very few users is hypothesized to improve the robustness of the approach.

Geographic spread filtering is applied for a similar reason: a video may be tagged with a number of non-geographic terms such as "wedding" or "bowling" in addition to tags that are likely to refer to locations, such as "london" or "sydney". Whether a term is likely to have a geographic scope can either be determined by matching the term against a geographical database (such as GeoNames, http://www.geonames.org/) or by considering how localized the term is in the training data. We follow the latter approach here as it does not require any external resources. While in the development data the term "sydney" occurs primarily in one particular grid cell (as expected, the cell containing the location of Sydney, Australia), the term "bowling" is spread considerably wider, mainly across North America. This observation leads to a simple but effective geographic spread score: a grid is placed over the world map (1 degree latitude/longitude range per cell) and the number of training items in each cell that contain the term is recorded. Neighbouring grid cells with a non-zero count are merged (in order to avoid penalizing geographic terms that cover a wide area) and the number of non-zero connected components is determined. This score is normalized by the maximum cell count. Thus, the smaller the score, the more localized the term is in the training data. Our approach is simpler than the χ² feature selection based geo-term filtering of [6], which determines the geographic score for the tags of each cell separately. Examples of terms and their geographic spread scores are shown in Table 1. While the scores of most terms appear reasonable, "british" is incorrectly identified as non-geographic (if we assume a threshold of θgeo = 0.1), as it is not only used to tag pictures taken in the United Kingdom: in the development data it is also used to describe British Columbia (Canada), the British Virgin Islands (Caribbean), British restaurants (mainly in the USA) and places where historical battles against the British took place (mainly in the USA).

Table 1: Examples of geographic spread scores. In our experiments, we use a threshold of θgeo = 0.1.

    Term            Geographic Spread Score
    bowling         3.237
    baby            1.809
    valley          1.512
    british         0.363
    lakepukaki      0.049
    españa          0.021
    thenetherlands  0.011
    london          0.010
    sydney          0.007
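The spread score described above can be computed with a simple flood fill. This is our sketch, not the authors' code: the function name and input format are assumptions, cells are binned with `floor`, the 4-neighbourhood is used for merging, and longitude wrap-around at ±180 degrees is ignored for simplicity.

```python
import math
from collections import Counter

def geo_spread(term_items):
    """Geographic spread score for a term: the number of connected components
    of occupied 1-degree grid cells, normalized by the maximum per-cell count.
    term_items holds the (lat, lng) coordinates of the training items that
    contain the term."""
    counts = Counter((math.floor(lat), math.floor(lng)) for lat, lng in term_items)
    if not counts:
        return float("inf")  # term never observed: treat as maximally spread
    todo, components = set(counts), 0
    while todo:  # flood-fill over the 4-neighbourhood of occupied cells
        components += 1
        stack = [todo.pop()]
        while stack:
            la, ln = stack.pop()
            for nb in ((la + 1, ln), (la - 1, ln), (la, ln + 1), (la, ln - 1)):
                if nb in todo:
                    todo.remove(nb)
                    stack.append(nb)
    return components / max(counts.values())
```

A term concentrated in one cell with many occurrences gets a score far below 1, while a term scattered thinly across distant cells scores well above 1, matching the ordering in Table 1.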
2.2 Run Descriptions
Based on the results of preliminary experiments, we fixed a number of parameters across all submitted runs: language modeling with Dirichlet smoothing (µ = 5000), ℓ_split = 5000 and ℓ_latlng = 0.01. These settings result in a total of 1,786 non-empty cells. The maximum cell extents in latitude and longitude are 22.5 and 45.0 degrees, respectively, in areas of the world map where the development data is sparse. Listed below are the details of the submitted runs:

Basic: baseline run without term filtering.

Gen: run with general term filtering applied, U = 2.

GeoGen: run with geographic and general term filtering applied, U = 2 and θgeo = 0.1.

UserSpecific: run with geographic and general term filtering applied, U = 2 and θgeo = 0.1. If the user who uploaded the test video has contributed at least one item to the training data set, only the user's training items are utilized to create the grid cells and language models (similar to [2]).

Visual: run based on the provided visual features. The partitioning of the training data is the same as in the text-based approaches, though for performance reasons only 10% of the training data was used. The Naive-Bayes nearest neighbour approach [1] with all visual features was implemented.
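The combined filtering used in the GeoGen and UserSpecific runs can be expressed as one predicate over Ttest. The sketch below is illustrative: the function name and the per-term lookup tables are our assumptions, and terms lacking a recorded spread score are conservatively dropped; only the thresholds U = 2 and θgeo = 0.1 come from the run descriptions.

```python
def filter_terms(terms, user_counts, spread_scores, u_min=2, theta_geo=0.1):
    """Keep only terms used by at least u_min distinct users and whose
    geographic spread score is at most theta_geo (the GeoGen configuration).
    Terms with no recorded spread score are treated as non-geographic."""
    return [t for t in terms
            if user_counts.get(t, 0) >= u_min
            and spread_scores.get(t, float("inf")) <= theta_geo]
```

Setting theta_geo to infinity recovers the Gen run (general filtering only), and skipping the call entirely gives the Basic run.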
3. RESULTS
The results of the listed runs are shown in Table 2. Reported is the accuracy within {1, 10, 50, 1000} km of the ground truth location.

Table 2: Prediction accuracy of the runs for a number of distance cutoffs.

    Run           1 km    10 km   50 km   1000 km
    Basic         20.3%   38.2%   49.2%   66.4%
    Gen           21.5%   40.5%   51.2%   67.8%
    GeoGen        17.2%   50.8%   70.0%   82.6%
    UserSpecific  17.8%   38.0%   52.1%   72.7%
    Visual         0.0%    0.1%    0.7%   10.9%

4. DISCUSSION
The biggest improvements over the baseline run are achieved by filtering out terms that have a large geographic spread. The only exception is the 1 km cutoff, where Basic outperforms GeoGen. We hypothesize that once the correct cell Cmax is identified in the first step of the estimation process, finding the closest match among the training documents of Cmax may be more robust if all terms of Ttest are used. Although more than 80% of the test set users also contributed items to the training set (on average 582 items), relying only on the user's contributed items for training did not yield improvements over relying on all available training items.

Our implementation of the nearest neighbour approach based on visual features did not result in a usable location estimator. Future work will focus on a failure analysis of this sub-system. Exploiting weather and daylight information to place outdoor images on a map, e.g., [3], will also be investigated. Finally, we plan to research to what extent social network information (such as the home location of the user's contacts, the locations of the images the user comments on, etc.) can improve the text-based location estimation of images that are geographically underspecified.
Acknowledgments
This research has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. ICT 257831 (ImREAL project).

5. REFERENCES
[1] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR '08, pages 1-8, 2008.
[2] J. Choi, A. Janin, and G. Friedland. The 2010 ICSI video location estimation system. In MediaEval 2010 Workshop, 2010.
[3] N. Jacobs, K. Miskell, and R. Pless. Webcam geo-localization using aggregate light levels. In WACV '11, pages 132-138, 2011.
[4] A. Rae, V. Murdock, P. Serdyukov, and P. Kelm. Working notes for the Placing Task at MediaEval 2011. In MediaEval 2011 Workshop, 2011.
[5] P. Serdyukov, V. Murdock, and R. van Zwol. Placing Flickr photos on a map. In SIGIR '09, pages 484-491, 2009.
[6] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR '11, pages 48:1-48:8, 2011.
[7] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In SIGIR '01, pages 334-342, 2001.