The 2011 ICSI Video Location Estimation System

Jaeyoung Choi, Howard Lei, Gerald Friedland
International Computer Science Institute
1947 Center St., Suite 600, Berkeley, CA 94704, USA
jaeyoung@icsi.berkeley.edu, hlei@icsi.berkeley.edu, fractor@icsi.berkeley.edu

ABSTRACT
In this paper, we describe the International Computer Science Institute's (ICSI's) multimodal video location estimation system presented at the MediaEval 2011 Placing Task. We describe how textual, visual, and audio cues were integrated into a multimodal location estimation system.

1. INTRODUCTION
The MediaEval 2011 Placing Task [6] is to automatically estimate the location (latitude and longitude) of each query video using any or all of the metadata, the visual/audio content, and/or social information. For a detailed explanation of the task, please refer to the Placing Task overview paper [6]. Please note that the videos for the Placing Task were not filtered or selected for content in any way and represent "found data"; this is described in more detail in [1] and [4]. The system presented herein utilizes the visual and acoustic content of a video together with textual metadata, whereas the 2010 system [1] leveraged only the metadata. As a result, the accuracy has improved significantly compared to 2010. The system is described in the following section.

2. SYSTEM DESCRIPTION
Our system integrates textual metadata with visual and audio cues from the video content into a multimodal system. Each component and the overall integration are described below.

2.1 Utilizing textual metadata
From all available textual metadata, we only utilized the user-annotated tags and ignored the title and description. Our intuition for using tags to find the geolocation of a video is the following: if the spatial distribution of a tag, based on the anchors in the development data set, is concentrated in a very small area, the tag is likely a toponym (location name); if the spatial variance of the distribution is high, the tag is likely something other than a toponym. For a detailed description of our algorithm, see [1]. In the permitted runs, we also use GeoNames [7], a geographical gazetteer, as a backup method when the spatial-variance algorithm returns no coordinate.
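As an illustration of this idea, the following is a minimal sketch in Python, assuming a development set of (tags, latitude, longitude) triples; the variance threshold, the minimum tag support, and the function names are illustrative placeholders rather than the actual implementation described in [1].

# Minimal sketch of the spatial-variance tag heuristic (Section 2.1).
# Not the ICSI implementation: thresholds and helpers are illustrative.
from collections import defaultdict

def build_tag_index(dev_items):
    """dev_items: iterable of (tags, lat, lon) triples from the development set."""
    index = defaultdict(list)
    for tags, lat, lon in dev_items:
        for tag in tags:
            index[tag].append((lat, lon))
    return index

def spatial_stats(coords):
    """Return (variance around the centroid in squared degrees, centroid)."""
    n = len(coords)
    clat = sum(lat for lat, _ in coords) / n
    clon = sum(lon for _, lon in coords) / n
    var = sum((lat - clat) ** 2 + (lon - clon) ** 2 for lat, lon in coords) / n
    return var, (clat, clon)

def estimate_from_tags(query_tags, index, max_var=0.01, min_support=3):
    """Place the video at the centroid of its most location-specific tag, if any."""
    best = None
    for tag in query_tags:
        coords = index.get(tag, [])
        if len(coords) < min_support:
            continue
        var, centroid = spatial_stats(coords)
        if var <= max_var and (best is None or var < best[0]):
            best = (var, centroid)
    return best[1] if best is not None else None

In this sketch, the location estimate is the centroid of the development items carrying the most location-specific query tag; when no tag passes the threshold, the system would fall back to the gazetteer lookup or to the visual and acoustic cues described next.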
2.2 Utilizing visual cues
In order to utilize the visual content of the video for location estimation, we reduce location estimation to an image retrieval problem, assuming that similar images imply similar locations. We therefore extract GIST features [5] for both the query and the reference videos and run a k-nearest-neighbor search on the reference data set to find the video frame or photo that is most similar. GIST features have been shown to be effective for the automatic geolocation of images [3]. We convert each image and video frame to grayscale and resize it to 128 × 128 pixels before extracting a GIST descriptor with a 5 × 5 spatial resolution, with each bin containing responses to 6 orientations and 4 scales. We use the Euclidean distance to compare the GIST descriptors and perform 1-nearest-neighbor matching between the pre-extracted frame closest to the temporal mid-point of a query video and all photos and frames from the reference videos.

2.3 Utilizing acoustic cues
Our approach for utilizing acoustic features is based on [4], which showed the feasibility and super-human accuracy of acoustic features for location estimation by describing a city-identification system derived from a state-of-the-art 128-mixture GMM-UBM speaker recognition system with simplified factor analysis and Mel-Frequency Cepstral Coefficient (MFCC) features. For each audio track, a set of MFCC features is extracted, and one Gaussian Mixture Model (GMM) is trained for each city using the MFCC features from all of its audio tracks (i.e., city-dependent audio tracks). This is done via MAP adaptation from a universal background GMM. The log-likelihood ratio of the MFCC features from the audio track of each query video is computed against the pre-trained GMM model of each city, yielding a likelihood score for each query video with respect to each city. The city with the highest score is picked as the query video's location. This approach, however, limits the range of estimated locations to 15 pre-selected cities around the world with the highest concentration of videos. This was due to the relatively small number of 10,000 videos provided, compared to more than 3 million images with metadata.

2.4 Multimodal integration
Although recent research on the automatic geolocation of images using visual and acoustic features has shown promise (e.g., [3, 2, 4]), the performance of these experiments is not in the same ballpark as that of approaches using textual metadata. When cues from multiple modalities are used together, we found that the textual metadata provided by the user plays a dominant role in providing cues for the Placing Task. Therefore, our system is designed to use visual features only as a second preference and acoustic features only as a third preference.

In order to integrate the visual features, we first run the tags-only algorithm from Section 2.1 and use the resulting top-3 tags as anchor points for a 1-NN search using visual features (see Section 2.2). We compare against all reference images and video frames within a 1 km radius of the 3 anchor points. As explained above (see Section 2.3), the number of audio references is much smaller, and the result of the audio matching is always one of the 15 pre-defined cities. Therefore, the acoustic approach was only used as a backup when the visual distance between the query video and every reference video was too large (i.e., the algorithm was unable to find a similar enough scene).
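To make the cascade concrete, below is a minimal sketch of this integration in Python, under assumed interfaces: the helpers top_tag_anchors, gist_descriptor, haversine_km, and best_acoustic_city, the query and reference record attributes, and the rejection threshold are hypothetical placeholders, not the actual system configuration.

# Minimal sketch of the multimodal cascade (Section 2.4): tags first, then a
# GIST 1-NN search restricted to references within 1 km of the tag anchors,
# and the acoustic city match only as a backup. All helper functions, the
# record attributes, and the threshold below are assumed, not the real system.
import numpy as np

GIST_REJECT_DIST = 0.9  # illustrative "no similar enough scene" threshold

def locate_video(query, references, top_tag_anchors, gist_descriptor,
                 haversine_km, best_acoustic_city):
    # 1) Tag-based anchors (Section 2.1): up to three candidate coordinates.
    anchors = top_tag_anchors(query.tags, k=3)
    if not anchors:
        # Assumed fallback when no usable tags exist at all.
        return best_acoustic_city(query.audio)

    # 2) Visual 1-NN (Section 2.2) over reference photos/frames that lie
    #    within 1 km of any anchor, using Euclidean distance between GISTs.
    q = gist_descriptor(query.mid_frame)
    candidates = [r for r in references
                  if any(haversine_km(r.lat, r.lon, lat, lon) <= 1.0
                         for lat, lon in anchors)]
    best_ref, best_dist = None, float("inf")
    for r in candidates:
        d = float(np.linalg.norm(q - gist_descriptor(r.image)))
        if d < best_dist:
            best_ref, best_dist = r, d
    if best_ref is not None and best_dist <= GIST_REJECT_DIST:
        return (best_ref.lat, best_ref.lon)

    # 3) Acoustic backup (Section 2.3): one of the 15 pre-defined cities.
    return best_acoustic_city(query.audio)

In the sketch, the acoustic branch is reached only when the visual search finds no reference within the distance threshold, mirroring the backup role described above.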
3. RESULTS AND DISCUSSION
Figure 1 shows the comparative results of our runs using the audio-only, visual-only, audio+visual, and tags+visual approaches. As explained above, the tag-based approach performs far better than the other approaches, which do not use textual metadata. Also, given the amount of reference data, the gazetteer information does seem to contribute to the accuracy, although very little. With so little data available for audio matching, acoustic cues did not contribute significantly to the performance, whether used alone or together with the visual features as described in Section 2.4.

Figure 1: Comparison of the run results (audio only, visual only, visual+audio, visual+tags without gazetteer, visual+tags with gazetteer) by distance between estimation and ground truth, as described in Section 3.

Figure 2 shows that using more development data helps, especially by boosting the number of correct estimations within a 1 km radius of the ground truth. A little over 14% of the test data contain no useful information at all in the metadata (tags, title, and description). The training curve for the test videos that were left over after applying the text-based algorithm (not shown here due to lack of space) reaches 14.6% when 3.2 million development data items are used.

Figure 2: Increasing the size of the development data (from 20K to 3.2M items) improves performance; percentage of test videos by distance between ground truth and estimation [km].

Figure 3 shows that the system works better in dense areas than in sparse areas. The whole map was divided into a grid of approximately 100 km by 100 km cells, and the number of development data items was counted for each cell.

Figure 3: Test videos from denser regions have a higher chance of being estimated within a closer range of the ground truth.

In conclusion, we believe that the biggest challenge for the future is being able to handle sparse reference data.

4. REFERENCES
[1] J. Choi, A. Janin, and G. Friedland. The 2010 ICSI Video Location Estimation System. In Proceedings of MediaEval, October 2010.
[2] A. Gallagher, D. Joshi, J. Yu, and J. Luo. Geo-location inference from image content and user tags. In Proceedings of IEEE CVPR. IEEE, 2009.
[3] J. Hays and A. Efros. IM2GPS: Estimating geographic information from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pages 1–8, 2008.
[4] H. Lei, J. Choi, and G. Friedland. City-Identification on Flickr Videos Using Acoustic Features. ICSI Technical Report TR-11-001, April 2011.
[5] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23–36, 2006.
[6] A. Rae, V. Murdock, and P. Serdyukov. Working Notes for the Placing Task at MediaEval 2011. In MediaEval 2011 Workshop, Pisa, Italy, September 2011.
[7] M. Wick. GeoNames. http://www.geonames.org, 2011.