<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaeyoung Choi</string-name>
          <email>jaeyoung@icsi.berkeley.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Howard Lei</string-name>
          <email>hlei@icsi.berkeley.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerald Friedland</string-name>
          <email>fractor@icsi.berkeley.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Computer Science Institute</institution>
          ,
          <addr-line>1947 Center St., Suite 600, Berkeley, CA 94704</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>International Computer Science Institute</institution>
          ,
          <addr-line>1947 Center St., Suite 600, Berkeley, CA 94704</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <issue>4</issue>
      <fpage>1</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>In this paper, we describe the International Computer Science Institute's (ICSI's) multimodal video location estimation system presented at the MediaEval 2011 Placing Task. We describe how textual, visual, and audio cues were integrated into a multimodal location estimation system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2011 Placing Task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is to automatically
estimate the location (latitude and longitude) of each query
video using any or all of metadata, visual/audio content,
and/or social information. For a detailed explanation of the
task, please refer to the Placing Task overview paper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Please note that the videos for the Placing Task were not
filtered or selected for content in any way and represent "found
data". This is described in more detail in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The
system presented herein utilizes the visual and acoustic
content of a video together with textual metadata, whereas the
system from 2010 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] only leveraged metadata. As a result,
the accuracy has improved significantly compared to 2010.
The system is described below.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
      <p>Our system integrates textual metadata with visual and
audio cues from the video content into a multimodal system.
Each component and the overall integration are described
below.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Utilizing textual metadata</title>
      <p>Of all the available textual metadata, we used only the
user-annotated tags and ignored the titles and descriptions.</p>
      <p>
        Our intuition for using tags to find the geolocation of a
video is the following: if the spatial distribution of a tag,
based on the anchors in the development data set, is
concentrated in a very small area, the tag is likely a toponym
(location name). If the spatial variance of the distribution
is high, the tag is likely something other than a toponym. For
a detailed description of our algorithm, see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Also, we use
GeoNames [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a geographical gazetteer, in permitted runs
as a backup method when the spatial variance algorithm
returns no coordinates.
      </p>
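      <p>The tag intuition above can be illustrated with a minimal sketch. It is not the algorithm of [1]: the spread measure (sum of latitude and longitude variances), the choice of the centroid as the estimate, the min_count threshold, and all function names and sample data are assumptions made for illustration only.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from statistics import fmean, pvariance


def build_tag_index(dev_items):
    """Map each tag to the (lat, lon) anchors of the development items that carry it."""
    index = defaultdict(list)
    for tags, lat, lon in dev_items:
        for tag in set(tags):
            index[tag].append((lat, lon))
    return index


def spatial_variance(coords):
    """Crude spread measure: latitude variance plus longitude variance.

    Assumption: plain degrees, ignoring longitude wrap-around at +/-180.
    """
    return pvariance([c[0] for c in coords]) + pvariance([c[1] for c in coords])


def estimate_from_tags(query_tags, index, min_count=2):
    """Return the centroid of the most concentrated query tag, or None."""
    best = None
    for tag in query_tags:
        coords = index.get(tag, [])
        if len(coords) < min_count:  # too few anchors to judge the spread
            continue
        var = spatial_variance(coords)
        if best is None or var < best[0]:
            centroid = (fmean(c[0] for c in coords), fmean(c[1] for c in coords))
            best = (var, centroid)
    return None if best is None else best[1]


# Toy development set: (tags, latitude, longitude). Values are illustrative only.
dev_items = [
    (["berkeley", "sunset"], 37.87, -122.27),
    (["berkeley", "campus"], 37.88, -122.26),
    (["sunset", "beach"], -33.86, 151.21),
    (["sunset"], 48.86, 2.35),
]
index = build_tag_index(dev_items)
estimate = estimate_from_tags(["sunset", "berkeley"], index)
no_estimate = estimate_from_tags(["unseen_tag"], index)
```
]]></preformat>
      <p>In this toy data the tag "sunset" is spread across three continents while "berkeley" is concentrated, so the estimate is taken from "berkeley". A query with no usable tag yields None, which is the case where a gazetteer lookup would serve as the backup.</p>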
    </sec>
    <sec id="sec-4">
      <title>2.2 Utilizing visual cues</title>
      <p>
        In order to utilize the visual content of the video for
location estimation, we reduce location estimation to an image
retrieval problem, assuming that similar images mean
similar locations. We therefore extract GIST features [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for
both query and reference videos and run a k-nearest
neighbor search on the reference data set to find the video frame
or photo that is most similar. GIST features have been
shown to be effective in automatic geolocation of images [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
We convert each image and video frame to grayscale and
resize it to 128 × 128 pixels before extracting a GIST
descriptor with a 5 × 5 spatial resolution, with each bin
containing responses to 6 orientations and 4 scales. We use
Euclidean distance to compare the GIST descriptors and
perform 1-nearest-neighbor matching between the pre-extracted
frame closest to the temporal mid-point of a query video
and all photos and frames from the reference videos.
      </p>
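      <p>The matching step can be sketched as a brute-force 1-nearest-neighbor search under Euclidean distance. The sketch assumes the GIST descriptors have already been extracted; the three-dimensional vectors below are toy stand-ins (with the settings above a descriptor would hold 5 × 5 × 6 × 4 = 600 values if each bin stores one response per orientation and scale), and the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def nearest_reference(query_desc, references):
    """Return (distance, location) of the reference descriptor closest to the query.

    references: iterable of (descriptor, (lat, lon)) pairs.
    """
    return min(
        ((math.dist(query_desc, desc), loc) for desc, loc in references),
        key=lambda pair: pair[0],
    )


# Toy descriptors standing in for real GIST vectors; locations are illustrative.
references = [
    ([0.0, 0.0, 0.0], (37.87, -122.27)),
    ([1.0, 1.0, 1.0], (48.86, 2.35)),
]
distance, location = nearest_reference([0.9, 1.0, 1.1], references)
```
]]></preformat>
      <p>The location of the closest reference frame or photo is then taken as the estimate for the query video, and the returned distance indicates how similar the two scenes are.</p>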
    </sec>
    <sec id="sec-5">
      <title>2.3 Utilizing acoustic cues</title>
      <p>
        Our approach for utilizing acoustic features is based on [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
The article showed the feasibility and super-human accuracy
of acoustic features for location estimation by describing a
city identification system derived from a state-of-the-art
128-mixture GMM-UBM speaker recognition system, with
simplified factor analysis and Mel-Frequency Cepstral
Coefficient (MFCC) features. For each audio track, a set of MFCC
features is extracted, and one Gaussian Mixture Model (GMM)
is trained for each city using the MFCC features from all of its
audio tracks (i.e., the city-dependent audio tracks). This is done
via MAP adaptation from a universal background GMM.
The log-likelihood ratio of the MFCC features from the audio
track of each query video is computed using the pre-trained
GMM of each city, which yields a likelihood score for each query
video with respect to each of the cities. The city
with the highest score is picked as the query video's
location. This approach, however, limits the range of estimated
locations to 15 pre-selected cities around the world with the
highest concentration of videos. This restriction was due to the
relatively small number of videos provided (10,000) compared to
the more than 3 million images and metadata.
      </p>
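      <p>The scoring step can be illustrated with a small sketch of log-likelihood-ratio scoring against a universal background model. It is not the system of [4]: MFCC extraction, MAP adaptation, and factor analysis are omitted, the models are hand-set two-component diagonal-covariance GMMs rather than 128-mixture models, and the city names and feature values are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_avg_loglik(frames, gmm):
    """Average per-frame log-likelihood; gmm is a list of (weight, mean, var)."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        top = max(comps)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total / len(frames)


def pick_city(frames, city_gmms, ubm):
    """Score each city by its log-likelihood ratio against the UBM."""
    base = gmm_avg_loglik(frames, ubm)
    scores = {city: gmm_avg_loglik(frames, g) - base for city, g in city_gmms.items()}
    return max(scores, key=scores.get), scores


# Hand-set toy models standing in for a UBM and MAP-adapted city models.
ubm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])]
city_gmms = {
    "city_a": [(0.8, [0.0, 0.0], [1.0, 1.0]), (0.2, [3.0, 3.0], [1.0, 1.0])],
    "city_b": [(0.2, [0.0, 0.0], [1.0, 1.0]), (0.8, [3.0, 3.0], [1.0, 1.0])],
}
# Toy "MFCC" frames of one query audio track.
frames = [[0.1, -0.2], [0.3, 0.1], [2.9, 3.2]]
best_city, scores = pick_city(frames, city_gmms, ubm)
```
]]></preformat>
      <p>Because the city with the highest score is always returned, the output of such a classifier is limited to the cities for which a model was trained.</p>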
    </sec>
    <sec id="sec-6">
      <title>2.4 Multimodal integration</title>
      <p>
        Although recent research on automatic geolocation of
images using visual and acoustic features has shown
promise (e.g., [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">3, 2, 4</xref>
        ]), the accuracy of these
approaches is not yet comparable to that of approaches using textual
metadata. When cues from multiple modalities are used
together, we found that textual metadata provided by the
user plays a dominant role in providing cues for the
placing task. Therefore, our system is designed to use visual
features as second preference and acoustic features only as
third preference.
      </p>
      <p>In order to integrate the visual features, we first run the
tags-only algorithm from Section 2.1 and use the resulting
top-3 tags as anchor points for a 1-NN search using visual
features (see Section 2.2). We compare against all
reference images and video frames within a 1 km radius of the
3 anchor points. As explained above (see Section 2.3), the
number of audio references is much smaller and the result
of the audio matching is always one of 15 pre-defined cities.
Therefore, the acoustic approach was only used as a backup
when the visual distance between the query video and any
reference video was too large (i.e., the algorithm was unable to
find a similar enough scene).</p>
    </sec>
    <sec id="sec-7">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>Figure 1 shows the comparative results of our runs using the
audio-only, visual-only, audio+visual, and
tag+visual approaches. As explained above, the
tag-based approaches perform far better than the
approaches that do not use textual metadata. Also, given
the amount of reference data, the gazetteer information does
seem to contribute to the accuracy, although very little.
With so little data available for audio matching, acoustic
cues did not seem to contribute significantly to the performance
when used alone or together with visual features as
described in Section 2.4.</p>
      <p>[Figure 1: Percentage of test videos [%] versus distance between ground truth and estimation [km] (bins: 1, 10, 100, 1000, &gt;1000) for the Audio Only, Visual Only, Visual+Audio, Vis+Tags (Gazetteer), and Vis+Tags (No Gazetteer) runs.]</p>
      <p>Figure 2 shows that using more development data helps,
especially in boosting the number of correct estimates within a
1 km radius of the ground truth. A little over 14% of the test
data do not contain any useful information at all in the
metadata (tag, title, and description). For the test
videos that were left over after applying the text-based
algorithm, the training curve (not shown here for lack of space)
reaches 14.6% when 3.2 million development items
were used.</p>
      <p>[Figure 2: Results versus distance between ground truth and estimation [km] (axis ticks: 0, 1, 10, 100, 1000, 10000) for development set sizes of 20K, 80K, 320K, 1.2M, and 3.2M.]</p>
      <p>Figure 3 shows that the system works better in dense areas
than in sparse areas. The whole map was divided into a grid of
approximately 100 km by 100 km cells, and the number of
development items was counted for each cell.</p>
      <p>In conclusion, we believe that the biggest challenge for the
future is being able to handle sparse reference data.</p>
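      <p>The preference order described in Section 2.4 (tags first, visual matching second, acoustic matching only as a backup) can be sketched as a simple cascade. This is a rough illustration, not the actual implementation: the visual distance threshold, the behavior when no tag anchor exists, the function names, and the sample data are all assumptions.</p>
      <preformat><![CDATA[
```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def estimate_location(anchors, query_desc, references, audio_city_loc,
                      radius_km=1.0, max_visual_dist=0.5):
    """Visual 1-NN restricted to the tag anchors, with the audio city as backup.

    anchors: up to three (lat, lon) points from the tag-based step.
    references: (descriptor, (lat, lon)) pairs of reference photos and frames.
    max_visual_dist: assumed cut-off for "similar enough"; not from the paper.
    """
    candidates = [
        (math.dist(query_desc, desc), loc)
        for desc, loc in references
        if any(haversine_km(loc, anchor) <= radius_km for anchor in anchors)
    ]
    if candidates:
        dist, loc = min(candidates, key=lambda c: c[0])
        if dist <= max_visual_dist:
            return loc, "tags+visual"
    # No sufficiently similar scene near any anchor: fall back to the audio city.
    return audio_city_loc, "audio"


# Illustrative data: one anchor, one nearby reference, one far-away reference.
anchors = [(37.8700, -122.2700)]
references = [
    ([0.1, 0.2], (37.8705, -122.2695)),
    ([0.1, 0.2], (48.86, 2.35)),
]
audio_city = (51.51, -0.13)
close_match = estimate_location(anchors, [0.1, 0.25], references, audio_city)
fallback = estimate_location(anchors, [5.0, 5.0], references, audio_city)
```
]]></preformat>
      <p>In the first call the nearby reference is visually close to the query, so its location is returned. In the second call no reference near the anchor is similar enough, so the city chosen by the acoustic step is used instead.</p>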
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Janin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          .
          <article-title>The 2010 ICSI Video Location Estimation System</article-title>
          .
          <source>In Proceedings of MediaEval</source>
          ,
          <year>October 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          .
          <article-title>Geo-location inference from image content and user tags</article-title>
          .
          <source>In Proceedings of IEEE CVPR. IEEE</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Efros</surname>
          </string-name>
          .
          <article-title>IM2GPS: estimating geographic information from a single image</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008)</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          .
          <article-title>City-Identification on Flickr Videos Using Acoustic Features</article-title>
          .
          <source>In ICSI Technical Report TR-11-001</source>
          ,
          <year>April 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          .
          <article-title>Building the gist of a scene: The role of global image features in recognition</article-title>
          .
          <source>Progress in brain research</source>
          ,
          <volume>155</volume>
          :
          <fpage>23</fpage>
          –
          <lpage>36</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murdock</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          .
          <article-title>Working Notes for the Placing Task at MediaEval 2011</article-title>
          . In MediaEval 2011 Workshop, Pisa, Italy,
          <year>September 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wick</surname>
          </string-name>
          . Geonames. http://www.geonames.org,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>