<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Geotagging Flickr Photos And Videos Using Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sanket Kumar Singh</string-name>
          <email>sanketku@ualberta.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davood Rafiei</string-name>
          <email>drafiei@ualberta.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Alberta</institution>
          ,
          <addr-line>Edmonton, AB</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper presents an experimental framework for the Placing tasks, both estimation and verification, at MediaEval Benchmarking 2016. The proposed framework provides results for four runs: first, using metadata (such as user tags and titles of images and videos); second, using visual features extracted from the images (such as Tamura); third, using the textual and visual features together; and fourth, using metadata as in the first run but with the training data augmented from external sources. Our work mainly focuses on textual features, for which we develop a language-based model using a bag of tags with neighbour-based smoothing. The effectiveness of the framework is evaluated through experiments in the placing task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The goal of this work is to estimate the coordinates of an
image or a video on the world map and to verify whether an
image belongs to a given location. Tags assigned to a photo
may not be location-specific, and even location-specific
tags can be vague and may refer to multiple locations. Some
photos have no tags or have only tags that have not been
seen before (e.g., in the training phase). All these issues
make location prediction from user tags challenging. We
address these problems by learning the associations between
user tags and locations and by using this information in our
prediction.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Language modeling has been used for placing photos on a map.
In particular, Serdyukov et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] place a grid of fixed degree
over the world map and map training instances to cells based
on their coordinates. They learn a model which allows them
to predict the location of the test instances on the grid.
Though this work provides several smoothing techniques to
predict the location of a test instance whose tags have not been
seen, it does not differentiate between general and
location-specific tags. Kordopatis-Zilos et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] use a similar model but
capture information regarding how many users use a
particular tag in a particular region. Additionally, they use
Shannon's entropy to give small weights to tags which are
user-specific or general. Our base model is the same, as
it provides a weighting of each tag based on its
popularity.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. PROPOSED APPROACH</title>
      <p>
        The proposed framework consists of two phases: (1)
preprocessing the placing dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and (2) building the model
and making the predictions.
      </p>
      <p>
        Preprocessing. Each photo or video has a title, some
user tags, and the id of the user who posted it. After
removing punctuation and special characters from the title, the
remaining terms are included in the tag set. This helps in
cases where a photo has a title but no tags. In Run 4, which
is also based on textual metadata, we include in our training
photos instances extracted from the YFCC100M [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] dataset
which are uploaded by users other than those in our test set.
Furthermore, we augment the tag set with place names from
Geonames [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and assign the location tags to cells based on
location coordinates. In all runs, each tag that is used by
only one user is removed to reduce noise, and the remaining
tags are then used for training. For testing, we only use
user tags in each run except for Run 4, where we
additionally use the title and description for those test instances which
have no user tags or none of whose tags are found in the training data.
Our goal in Run 4 is to use as much data as possible. To
build a model for Run 2 (which uses visual features), we use
2,182,400 images with Tamura [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] features; the features are
preprocessed so they can be fed into Vowpal Wabbit (VW)
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is used to train the model. The dataset has 2,735
counties and these are used as labels for training; for our
training, county was the smallest region with enough data
points per label (812 on average, compared to 38 for town).
      </p>
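      <p>The preprocessing steps above can be sketched as follows. This is an illustrative sketch only; function names such as build_tag_set and remove_singleton_tags are assumptions, not the authors' implementation.</p>

```python
import re

def build_tag_set(title, user_tags):
    """Strip punctuation and special characters from the title and merge
    the remaining terms into the tag set (helps photos with no tags)."""
    title_terms = re.sub(r"[^\w\s]", " ", title.lower()).split()
    return set(user_tags) | set(title_terms)

def remove_singleton_tags(tag_user_pairs):
    """Drop tags used by only one user, to reduce noise."""
    users_per_tag = {}
    for tag, user in tag_user_pairs:
        users_per_tag.setdefault(tag, set()).add(user)
    return {tag for tag, users in users_per_tag.items() if len(users) > 1}
```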
      <p>
        Methodology. For the estimation task, we place grids of
1, 0.1 and 0.01 degrees and predict a cell c for each test photo
based on a generative model which estimates the probability
p(t_i | c) that the tags t_i in the photo are emitted from cell c.
The model captures the degree to which a tag is popular
among users in describing a location within a cell, i.e.,
p(t_i | c) = (number of users who use tag t_i in cell c) /
(number of users who use tag t_i globally),
and p(T | c) = p(t_1 | c) × ... × p(t_n | c),
where n is the number of tags in a test instance T. The cell
c that gives the maximum p(T | c) is considered the predicted
cell of the test instance T. We further extend the base model
by performing a neighbour-based smoothing as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], taking
into account the users who use tag t in the neighbouring cells of cell
c. Since we need to estimate the actual coordinates of a test
instance within a cell, we use the coordinates of the training
instance in the same cell that has the maximum Jaccard
similarity to the test instance.
      </p>
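      <p>The generative model above can be sketched in a few lines. This is an illustrative sketch under assumed data structures (tag_users maps a (tag, cell) pair to the set of users who used that tag in that cell; tag_users_global maps a tag to all users who used it anywhere); it is not the authors' code.</p>

```python
def p_tag_given_cell(tag, cell, tag_users, tag_users_global):
    """p(t|c): users who use tag t in cell c over users who use t globally."""
    global_count = len(tag_users_global.get(tag, set()))
    if global_count == 0:
        return 0.0
    return len(tag_users.get((tag, cell), set())) / global_count

def predict_cell(tags, cells, tag_users, tag_users_global):
    """Pick the cell maximizing the product of per-tag probabilities p(T|c)."""
    def score(cell):
        s = 1.0
        for t in tags:
            s *= p_tag_given_cell(t, cell, tag_users, tag_users_global)
        return s
    return max(cells, key=score)
```

Neighbour-based smoothing would additionally fold in user counts from the cells adjacent to c before taking the ratio.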
      <p>
        Test instances that have no tags (or whose tags are not seen
in the training set) are assigned to the cell with the largest
number of training instances. In this case, the coordinates
of the training instance which has the minimum Karney's
distance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to the other instances in the cell are considered
the estimated coordinates. To use visual features, we train
a one-against-all multiclass model using VW to predict a
county for the test instance. The coordinates are estimated
using the same strategy as before, based on the coordinates
of a training instance. Since textual features provide a more
accurate estimation, visual features are used in Run 3 only
if a photo has no textual features; otherwise, only textual
features are used. For the verification task, we use the place
information of the training instance used to predict the
coordinates in the estimation task, and mark a test instance
verified if its predicted location string contains the given
place name.
      </p>
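      <p>The within-cell coordinate estimation and the verification check described above can be sketched as follows; this is a sketch under assumed record shapes (each training instance carries tags, lat, lon), not the authors' implementation.</p>

```python
def jaccard(a, b):
    """Jaccard similarity between two tag sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

def estimate_coordinates(test_tags, cell_instances):
    """Return the coordinates of the training instance in the predicted
    cell with maximum Jaccard similarity to the test instance's tags."""
    best = max(cell_instances, key=lambda inst: jaccard(test_tags, inst["tags"]))
    return best["lat"], best["lon"]

def verify(predicted_location, place_name):
    """Mark a test instance verified if the predicted location string
    contains the given place name (case-insensitive)."""
    return place_name.lower() in predicted_location.lower()
```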
    </sec>
    <sec id="sec-4">
      <title>4. RESULTS AND ANALYSIS</title>
      <p>We performed our experiments for the estimation task
using grids of 0.01, 0.1 and 1.0 degrees and evaluated the
results using precision at each distance, average distance error
(ADE), median distance error (MDE) and the verification
accuracy (VA). The results are listed in Tables 1 and 2.
From Table 1, we can see that the precision for large
distances is high, as each target cell covers more area and has
more tags. Additionally, as we apply our neighbour-based
smoothing using adjacent cells, more tags from neighbours
are included, which is useful in cases where tags cover a wider
area, such as tags with a province name or a geographical
division that spans more than one grid cell. This results in
improved cell prediction accuracy.</p>
      <p>Analyzing the wrong predictions using the validation set,
we find that misspellings, mismatches between plural and
singular forms, and differences in spelling (such as
"barcellona" for "Barcelona", "nederland" for "Netherlands") are
some of the causes for the tags not being found in the correct
cell. Famous spots such as the "Empire State" building in
New York are easily predicted because of abundant
location-specific tags. However, instances with general words such as
"bogus" and "finding" lead to predictions of wrong cells. In an
experiment comparing top-k and top-1 predictions for test
instances, we found that top-10 accuracy was 47.74% while
top-1 accuracy was 31.80% (for photos and videos together)
using a 0.1 degree grid (about 10 km). Furthermore, the predicted
cells were closer to the real cells. Another set of instances
that were difficult to predict were the 335,845 test instances
(including photos and videos) which either had no tags or whose
tags were not used by any user in the training set. We assign
these instances to the most popular cell, which only gives a
correct prediction for 3,751 instances.</p>
      <p>For Run 2, we use Tamura features to train a multiclass
model using VW. As the dataset consists of different
landscapes, animals, places, etc., it is difficult to distinguish
between the different places from where a photo or video is taken,
and thus the model mainly predicts the most popular county. For
Run 4, we augment the cells with place names from
Geonames (giving them an arbitrary user id) and from the YFCC100M
dataset. Since the tags which are used by only one user are
removed, only Geonames tags which are used by an actual
user in the cell are retained. This increases the count of
place-specific tags which are used by real users. Using the title
and description for test instances which have no user tags
or whose tags are not found in the training set reduces the
median distance error for the estimation task.</p>
      <p>Before arriving at the proposed approach, we tried to find
location-specific tags by assessing their frequency concentration
in a region, as compared to the whole map. This approach,
however, did not work for instances where the same tag was
equally present at two or more places that were far from
each other. Further, we used the KL-divergence to separate
the probability distribution of general tags from that of
location-specific tags, but this approach also did not work well
as the model ended up giving more weight to user-specific tags
such as "lehmans", "gladston", etc.</p>
    </sec>
    <sec id="sec-5">
      <title>5. CONCLUSION AND FUTURE WORK</title>
      <p>In this paper, we study the problem of predicting
coordinates for multimedia objects. We adopt an approach which
identifies the tags that are frequently used by users at each
location. This, in turn, helps us predict the cell and
thereafter the coordinates for each object. Our analysis of wrong
predictions reveals that the true cells are often present in the
top-k and are close to the predicted cell. This seems to be an
area for improvement, where one needs to disambiguate between
the neighbouring cells, perhaps by considering cells of varying
sizes or by forming clusters based on the closeness of training
instances.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. V.</given-names>
            <surname>Laere</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          .
          <article-title>The placing task at mediaeval 2016</article-title>
          . MediaEval 2016 Workshop, Oct. 20-21,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          , et al.
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Karney</surname>
          </string-name>
          .
          <article-title>Algorithms for geodesics</article-title>
          .
          <source>Journal of Geodesy</source>
          ,
          <volume>87</volume>
          (
          <issue>1</issue>
          ):
          <fpage>43</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kordopatis-Zilos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Geotagging social media content with a refined language modelling approach</article-title>
          .
          <source>In Pacific-Asia Workshop on Intelligence and Security Informatics</source>
          , pages
          <fpage>21</fpage>
          -
          <lpage>40</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Strehl</surname>
          </string-name>
          .
          <article-title>Vowpal wabbit</article-title>
          . URL https://github.com/JohnLangford/vowpal_wabbit/wiki,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Makazhanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rafiei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Waqar</surname>
          </string-name>
          .
          <article-title>Predicting political preference of twitter users</article-title>
          .
          <source>Social Network Analysis and Mining</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murdock</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. Van</given-names>
            <surname>Zwol</surname>
          </string-name>
          .
          <article-title>Placing flickr photos on a map</article-title>
          .
          <source>In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>484</fpage>
          -
          <lpage>491</lpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mori</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Yamawaki</surname>
          </string-name>
          .
          <article-title>Textural features corresponding to visual perception</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          ,
          <volume>8</volume>
          (
          <issue>6</issue>
          ):
          <fpage>460</fpage>
          -
          <lpage>473</lpage>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Elizalde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Poland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Borth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Yfcc100m: The new data in multimedia research</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>59</volume>
          (
          <issue>2</issue>
          ):
          <fpage>64</fpage>
          -
          <lpage>73</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wick</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Boutreux</surname>
          </string-name>
          . Geonames.
          <source>GeoNames Geographical Database</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>