The 2011 ICSI Video Location Estimation System

Jaeyoung Choi, Howard Lei, Gerald Friedland
International Computer Science Institute
1947 Center St., Suite 600, Berkeley, CA 94704, USA
jaeyoung@icsi.berkeley.edu, hlei@icsi.berkeley.edu, fractor@icsi.berkeley.edu

ABSTRACT
In this paper, we describe the International Computer Science Institute's (ICSI's) multimodal video location estimation system presented at the MediaEval 2011 Placing Task. We describe how textual, visual, and audio cues were integrated into a multimodal location estimation system.

1. INTRODUCTION
The MediaEval 2011 Placing Task [6] is to automatically estimate the location (latitude and longitude) of each query video using any or all of the metadata, the visual/audio content, and/or social information. For a detailed explanation of the task, please refer to the Placing Task overview paper [6]. Please note that the videos for the Placing Task were not filtered or selected for content in any way and represent "found data"; this is described in more detail in [1] and [4]. The system presented herein utilizes the visual and acoustic content of a video together with textual metadata, whereas the 2010 system [1] leveraged only the metadata. As a result, the accuracy has improved significantly compared to 2010. The system is described in the following section.

2. SYSTEM DESCRIPTION
Our system integrates textual metadata with visual and audio cues from the video content into a multimodal system. Each component and the overall integration are described below.

2.1 Utilizing textual metadata
From all available textual metadata, we only utilized the user-annotated tags and ignored the title and description. Our intuition for using tags to find the geolocation of a video is the following: if the spatial distribution of a tag, based on the anchors in the development data set, is concentrated in a very small area, the tag is likely a toponym (location name); if the spatial variance of the distribution is high, the tag is likely something other than a toponym. For a detailed description of our algorithm, see [1]. In the permitted runs, we also use GeoNames [7], a geographical gazetteer, as a backup method when the spatial-variance algorithm returns no coordinate.
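As an illustration of this idea, the following is a minimal sketch in Python, assuming a development set of (tags, latitude, longitude) triples; the variance threshold, the minimum tag support, and the function names are illustrative placeholders rather than the actual implementation described in [1].

# Minimal sketch of the spatial-variance tag heuristic (Section 2.1).
# Not the ICSI implementation: thresholds and helpers are illustrative.
from collections import defaultdict

def build_tag_index(dev_items):
    """dev_items: iterable of (tags, lat, lon) triples from the development set."""
    index = defaultdict(list)
    for tags, lat, lon in dev_items:
        for tag in tags:
            index[tag].append((lat, lon))
    return index

def spatial_stats(coords):
    """Return (variance around the centroid in squared degrees, centroid)."""
    n = len(coords)
    clat = sum(lat for lat, _ in coords) / n
    clon = sum(lon for _, lon in coords) / n
    var = sum((lat - clat) ** 2 + (lon - clon) ** 2 for lat, lon in coords) / n
    return var, (clat, clon)

def estimate_from_tags(query_tags, index, max_var=0.01, min_support=3):
    """Place the video at the centroid of its most location-specific tag, if any."""
    best = None
    for tag in query_tags:
        coords = index.get(tag, [])
        if len(coords) < min_support:
            continue
        var, centroid = spatial_stats(coords)
        if var <= max_var and (best is None or var < best[0]):
            best = (var, centroid)
    return best[1] if best is not None else None

In this sketch, the location estimate is the centroid of the development items carrying the most location-specific query tag; when no tag passes the threshold, the system would fall back to the gazetteer lookup or to the visual and acoustic cues described next.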
2.2 Utilizing visual cues
In order to utilize the visual content of the video for location estimation, we reduce location estimation to an image retrieval problem, assuming that similar images imply similar locations. We therefore extract GIST features [5] for both the query and the reference videos and run a k-nearest-neighbor search on the reference data set to find the video frame or photo that is most similar. GIST features have been shown to be effective for the automatic geolocation of images [3]. We convert each image and video frame to grayscale and resize it to 128 × 128 pixels before extracting a GIST descriptor with a 5 × 5 spatial resolution, with each bin containing responses to 6 orientations and 4 scales. We use the Euclidean distance to compare the GIST descriptors and perform 1-nearest-neighbor matching between the pre-extracted frame closest to the temporal mid-point of a query video and all photos and frames from the reference videos.

2.3 Utilizing acoustic cues
Our approach for utilizing acoustic features is based on [4], which showed the feasibility and super-human accuracy of acoustic features for location estimation by describing a city-identification system derived from a state-of-the-art 128-mixture GMM-UBM speaker recognition system with simplified factor analysis and Mel-Frequency Cepstral Coefficient (MFCC) features. For each audio track, a set of MFCC features is extracted, and one Gaussian Mixture Model (GMM) is trained for each city using the MFCC features from all of its audio tracks (i.e., city-dependent audio tracks). This is done via MAP adaptation from a universal background GMM. The log-likelihood ratio of the MFCC features from the audio track of each query video is computed against the pre-trained GMM model of each city, yielding a likelihood score for each query video with respect to each city. The city with the highest score is picked as the query video's location. This approach, however, limits the range of estimated locations to 15 pre-selected cities around the world with the highest concentration of videos. This was due to the relatively small number of 10,000 videos provided, compared to more than 3 million images with metadata.

2.4 Multimodal integration
Although recent research on the automatic geolocation of images using visual and acoustic features has shown promise (e.g., [3, 2, 4]), the performance of these experiments is not in the same ballpark as that of approaches using textual metadata. When cues from multiple modalities are used together, we found that the textual metadata provided by the user plays a dominant role in providing cues for the Placing Task. Therefore, our system is designed to use visual features only as a second preference and acoustic features only as a third preference.

In order to integrate the visual features, we first run the tags-only algorithm from Section 2.1 and use the resulting top-3 tags as anchor points for a 1-NN search using visual features (see Section 2.2). We compare against all reference images and video frames within a 1 km radius of the 3 anchor points. As explained above (see Section 2.3), the number of audio references is much smaller, and the result of the audio matching is always one of the 15 pre-defined cities. Therefore, the acoustic approach was only used as a backup when the visual distance between the query video and every reference video was too large (i.e., the algorithm was unable to find a similar enough scene).
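To make the cascade concrete, below is a minimal sketch of this integration in Python, under assumed interfaces: the helpers top_tag_anchors, gist_descriptor, haversine_km, and best_acoustic_city, the query and reference record attributes, and the rejection threshold are hypothetical placeholders, not the actual system configuration.

# Minimal sketch of the multimodal cascade (Section 2.4): tags first, then a
# GIST 1-NN search restricted to references within 1 km of the tag anchors,
# and the acoustic city match only as a backup. All helper functions, the
# record attributes, and the threshold below are assumed, not the real system.
import numpy as np

GIST_REJECT_DIST = 0.9  # illustrative "no similar enough scene" threshold

def locate_video(query, references, top_tag_anchors, gist_descriptor,
                 haversine_km, best_acoustic_city):
    # 1) Tag-based anchors (Section 2.1): up to three candidate coordinates.
    anchors = top_tag_anchors(query.tags, k=3)
    if not anchors:
        # Assumed fallback when no usable tags exist at all.
        return best_acoustic_city(query.audio)

    # 2) Visual 1-NN (Section 2.2) over reference photos/frames that lie
    #    within 1 km of any anchor, using Euclidean distance between GISTs.
    q = gist_descriptor(query.mid_frame)
    candidates = [r for r in references
                  if any(haversine_km(r.lat, r.lon, lat, lon) <= 1.0
                         for lat, lon in anchors)]
    best_ref, best_dist = None, float("inf")
    for r in candidates:
        d = float(np.linalg.norm(q - gist_descriptor(r.image)))
        if d < best_dist:
            best_ref, best_dist = r, d
    if best_ref is not None and best_dist <= GIST_REJECT_DIST:
        return (best_ref.lat, best_ref.lon)

    # 3) Acoustic backup (Section 2.3): one of the 15 pre-defined cities.
    return best_acoustic_city(query.audio)

In the sketch, the acoustic branch is reached only when the visual search finds no reference within the distance threshold, mirroring the backup role described above.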
3. RESULTS AND DISCUSSION
Figure 1 shows the comparative results of our runs using the audio-only, visual-only, audio+visual, and tags+visual approaches. As explained above, the tag-based approach performs far better than the other approaches, which do not use textual metadata. Also, given the amount of reference data, the gazetteer information does seem to contribute to the accuracy, although very little. With so little data available for audio matching, acoustic cues did not contribute significantly to the performance, whether used alone or together with the visual features as described in Section 2.4.

Figure 1: Comparison of the run results (audio only, visual only, visual+audio, visual+tags without gazetteer, visual+tags with gazetteer) by distance between estimation and ground truth, as described in Section 3.

Figure 2 shows that using more development data helps, especially by boosting the number of correct estimations within a 1 km radius of the ground truth. A little over 14% of the test data contain no useful information at all in the metadata (tags, title, and description). The training curve for the test videos that were left over after applying the text-based algorithm (not shown here due to lack of space) reaches 14.6% when 3.2 million development data items are used.

Figure 2: Increasing the size of the development data (from 20K to 3.2M items) improves performance; percentage of test videos by distance between ground truth and estimation [km].

Figure 3 shows that the system works better in dense areas than in sparse areas. The whole map was divided into a grid of approximately 100 km by 100 km cells, and the number of development data items was counted for each cell.

Figure 3: Test videos from denser regions have a higher chance of being estimated within a closer range of the ground truth.

In conclusion, we believe that the biggest challenge for the future is being able to handle sparse reference data.

4. REFERENCES
[1] J. Choi, A. Janin, and G. Friedland. The 2010 ICSI Video Location Estimation System. In Proceedings of MediaEval, October 2010.
[2] A. Gallagher, D. Joshi, J. Yu, and J. Luo. Geo-location inference from image content and user tags. In Proceedings of IEEE CVPR. IEEE, 2009.
[3] J. Hays and A. Efros. IM2GPS: Estimating geographic information from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pages 1–8, 2008.
[4] H. Lei, J. Choi, and G. Friedland. City-Identification on Flickr Videos Using Acoustic Features. ICSI Technical Report TR-11-001, April 2011.
[5] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23–36, 2006.
[6] A. Rae, V. Murdock, and P. Serdyukov. Working Notes for the Placing Task at MediaEval 2011. In MediaEval 2011 Workshop, Pisa, Italy, September 2011.
[7] M. Wick. GeoNames. http://www.geonames.org, 2011.