Geotagging Flickr Photos and Videos Using Language Models

                     Sanket Kumar Singh                                             Davood Rafiei
                       University of Alberta                                   University of Alberta
                      Edmonton, AB, Canada                                    Edmonton, AB, Canada
                    sanketku@ualberta.ca                                      drafiei@ualberta.ca


Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
This paper presents an experimental framework for the Placing Task, covering
both estimation and verification, at MediaEval Benchmarking 2016. The
framework provides results for four runs: the first uses metadata (such as
the user tags and titles of images and videos); the second uses visual
features extracted from the images (such as Tamura features); the third uses
textual and visual features together; and the fourth uses metadata as in the
first run, but with the training data augmented with external sources. Our
work mainly focuses on textual features, for which we develop a
language-based model using bag-of-tags with neighbour-based smoothing. The
effectiveness of the framework is evaluated through experiments in the
placing task.

1.   INTRODUCTION
The goal of this work is to estimate the coordinates of an image or a video
on the world map and to verify whether an image belongs to a given location.
Tags assigned to a photo may not be location-specific, and even
location-specific tags can be vague and may refer to multiple locations. Some
photos have no tags, or have only tags that have not been seen before (e.g.
in the training phase). All these issues make location prediction from user
tags challenging. We address these problems by learning the associations
between user tags and locations and by using this information in our
prediction.

2.   RELATED WORK
Language modeling has been used to place photos on a map. In particular,
Serdyukov et al. [7] place a grid of fixed degree over the world map and map
training instances to cells based on their coordinates. They learn a model
which allows them to predict the location of the test instances on the grid.
Though this work provides several smoothing techniques to predict the
location of a test instance whose tags have not been seen, it does not
differentiate between general and location-specific tags. Kordopatis-Zilos et
al. [4] use a similar model but capture information on how many users use a
particular tag in a particular region. Additionally, they use Shannon's
entropy to give small weights to tags which are user-specific or general. Our
base model is the same, as it provides a weighting of each tag based on its
popularity among users in describing a place, but we experiment with
additional components to boost the performance. Another related weighting
scheme is that of Makazhanov et al. [6], which uses the Kullback-Leibler
divergence to differentiate between class-specific and general terms. Even
though that work is done in a different context, we experiment with this
model in identifying location-specific tags.

3.   PROPOSED APPROACH
The proposed framework consists of two phases: (1) preprocessing the placing
dataset [1], and (2) building the model and making the predictions.

Preprocessing. Each photo or video has a title, some user tags and the id of
the user who posted it. After removing punctuation and special characters
from the title, the remaining terms are included in the tag set. This helps
in cases where a photo has a title but no tags. In Run 4, which is also based
on textual metadata, we add to our training set instances extracted from the
YFCC100M dataset [9] that were uploaded by users other than those in our test
set. Furthermore, we augment the tag set with place names from Geonames [10]
and assign the location tags to cells based on location coordinates. In all
runs, each tag that is used by only one user is removed to reduce noise, and
the remaining tags are then used for training. For testing, we use only user
tags in each run except for Run 4, where we additionally use the title and
description for those test instances which have no user tags or none of whose
tags are found in the training data. Our goal in Run 4 is to use as much data
as possible. To build a model for Run 2 (which uses visual features), we use
2,182,400 images with Tamura [8] features; the features are preprocessed so
they can be fed into Vowpal Wabbit (VW) [5], which is used to train the
model. The dataset has 2,735 counties and these are used as labels for
training; for our training, county was the smallest region with enough data
points per label (812 on average, compared to 38 for town).

Methodology. For the estimation task, we place grids of 1, 0.1 and 0.01
degrees and predict a cell $c$ for each test photo based on a generative
model which estimates the probability $p(t_i \mid c)$ that the tags $t_i$ in
the photo are emitted from cell $c$. The model captures the degree to which a
tag is popular among users in describing a location within a cell, i.e.

\[
p(t_i \mid c) = \frac{\text{number of users who use tag } t_i \text{ in cell } c}
                     {\text{number of users who use tag } t_i \text{ globally}},
\]

\[
p(T \mid c) = \sum_{i=1}^{n} p(t_i \mid c)
\]
where $n$ is the number of tags in a test instance $T$. The cell $c$ that
gives the maximum $p(T \mid c)$ is taken as the predicted cell of the test
instance $T$. We further extend the base model by performing neighbour-based
smoothing as in [7], taking into account the users who use tag $t$ in the
neighbouring cells of cell $c$. Since we need to estimate the actual
coordinates of a test instance within a cell, we use the coordinates of the
training instance in the same cell that has the maximum Jaccard similarity to
the test instance.

Test instances that have no tags (or whose tags are not seen in the training
set) are assigned to the cell with the largest number of training instances.
In this case, the coordinates of the training instance which has the minimum
Karney distance [3] to the other instances in the cell are taken as the
estimated coordinates. To use visual features, we train a one-against-all
multiclass model using VW to predict a county for the test instance. The
coordinates are estimated using the same strategy as before, based on the
coordinates of a training instance. Since textual features provide a more
accurate estimation, visual features are used in Run 3 only if a photo has no
textual features. Otherwise, only textual features are used. For the
verification task, we use the place information of the training instance
whose coordinates were used in the estimation task, and we mark a test
instance as verified if its predicted location string contains the given
place name.

Run   Media    10 m   100 m    1 km   10 km   100 km   1000 km
1     photo    0.27    2.88   14.13   35.28    50.28     64.17
1     video    0.27    3.03   13.50   33.24    47.60     60.08
2     photo    0.0     2.0     0.42    2.13     4.0      22.97
2     video    0.0     0.0     0.14    0.81     1.77      6.95
3     photo    0.27    2.89   14.13   35.26    50.25     64.03
3     video    0.27    3.03   13.50   33.24    47.60     60.08
4     photo    0.27    2.94   13.24   33.02    51.14     64.58
4     video    0.27    3.36   13.29   32.61    49.35     61.18

Table 1: Precision at different distances (in %)

Run   Media   ADE (km)   MDE (km)     VA
1     photo    2452.24      93.49   0.64
1     video    2744.00     185.10   0.62
2     photo    5243.38    5738.70   0.50
2     video    5719.46    6374.11   0.50
3     photo    2451.53      94.23   0.64
3     video    2744.00     185.09   0.63
4     photo    2457.29      82.23   0.64
4     video    2703.20     114.48   0.63

Table 2: Estimation & Verification results for runs
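
The text-based estimation step of Section 3 can be illustrated with a minimal
sketch. It assumes floor-based cell indexing and sums the per-tag user ratios
to score a cell. All function names (`cell_of`, `train`, `predict_cell`,
`estimate_coords`) and the input tuple layout are hypothetical, chosen only
for illustration. Neighbour-based smoothing and the fallback to the most
populated cell are left out to keep the example short, so this is not the
authors' implementation.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, size_deg):
    """Index of the grid cell of width size_deg degrees containing (lat, lon).
    Floor-based indexing is an assumption of this sketch."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def train(instances, size_deg):
    """Build p(t|c) from (user_id, tags, lat, lon) training tuples."""
    users_by_tag_cell = defaultdict(lambda: defaultdict(set))
    users_by_tag = defaultdict(set)
    by_cell = defaultdict(list)  # cell -> [(tag set, (lat, lon)), ...]
    for user, tags, lat, lon in instances:
        cell = cell_of(lat, lon, size_deg)
        by_cell[cell].append((set(tags), (lat, lon)))
        for tag in set(tags):
            users_by_tag_cell[tag][cell].add(user)
            users_by_tag[tag].add(user)
    prob = {}
    for tag, users in users_by_tag.items():
        if len(users) < 2:  # tags used by a single user are removed as noise
            continue
        # users of the tag in the cell / users of the tag globally
        prob[tag] = {cell: len(cell_users) / len(users)
                     for cell, cell_users in users_by_tag_cell[tag].items()}
    return prob, by_cell


def predict_cell(prob, tags):
    """Return the cell with the largest sum of p(t|c) over the known tags."""
    known = [t for t in set(tags) if t in prob]
    if not known:
        return None  # the paper assigns such instances to the largest cell
    candidates = set()
    for t in known:
        candidates.update(prob[t])
    return max(candidates,
               key=lambda cell: sum(prob[t].get(cell, 0.0) for t in known))


def jaccard(a, b):
    """Jaccard similarity of two tag sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate_coords(by_cell, cell, tags):
    """Coordinates of the training instance in `cell` most similar to `tags`."""
    tag_set = set(tags)
    _, coords = max(by_cell[cell], key=lambda inst: jaccard(inst[0], tag_set))
    return coords


# Toy data, invented purely for illustration.
toy_train = [
    ("u1", ["eiffel", "paris"], 48.858, 2.294),
    ("u2", ["eiffel", "tower"], 48.859, 2.295),
    ("u3", ["paris", "louvre"], 48.861, 2.336),
    ("u4", ["louvre", "museum"], 48.862, 2.337),
]
prob, by_cell = train(toy_train, 0.01)
query = ["eiffel", "paris", "sunset"]
cell = predict_cell(prob, query)
print(cell, estimate_coords(by_cell, cell, query))
```

In the toy data, "tower" and "museum" are each used by a single user and are
dropped, while "paris" is split evenly between two cells, so the tag "eiffel"
decides the prediction.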
4.   RESULTS AND ANALYSIS
We performed our experiments for the estimation task using grids of 0.01, 0.1
and 1.0 degrees and evaluated the results using precision at each distance,
average distance error (ADE), median distance error (MDE) and verification
accuracy (VA). The results are listed in Tables 1 and 2. From Table 1, we can
see that the precision for large distances is high, as each target cell
covers more area and has more tags. Additionally, as we apply our
neighbour-based smoothing using adjacent cells, more tags from neighbours are
included, which is useful where a tag covers a wider area, such as a province
name or a geographical division that spans more than one grid cell. This
results in an improved cell prediction accuracy.

Analyzing the wrong predictions using the validation set, we find that
misspellings, mismatches between plural and singular forms, and differences
in spelling (such as “barcellona” for “Barcelona”, “nederland” for
“Netherlands”) are some of the reasons why tags are not found in the correct
cell. Famous spots such as “the Empire State” building in New York are easily
predicted because of abundant location-specific tags. However, instances with
general words such as “bogus” and “finding” lead to the prediction of wrong
cells. In an experiment comparing top-k and top-1 predictions for test
instances, we found that top-10 accuracy was 47.74% while top-1 accuracy was
31.80% (for photos and videos together) using the 0.1 degree grid (~10 km).
Furthermore, the predicted cells were close to the true cells. Another set of
instances that were difficult to predict were the 335,845 test instances
(including photos and videos) which either had no tags or whose tags were not
used by any user in the training set. We assign these instances to the most
popular cell, which gives a correct prediction for only 3,751 instances.

For Run 2, we use Tamura features to train a multiclass model using VW. As
the dataset consists of different landscapes, animals, places etc., it is
difficult to distinguish between the different places where a photo or video
was taken, and thus the model mainly predicts the most popular county. For
Run 4, we augment the cells with place names from Geonames (assigning them an
arbitrary user id) and with instances from the YFCC100M dataset. Since the
tags which are used by only one user are removed, only Geonames tags which
are used by an actual user in the cell are retained. This increases the count
of place-specific tags which are used by real users. Using the title and
description for test instances which have no user tags, or whose tags are not
found in the training set, reduces the median distance error for the
estimation task.

Before reaching the proposed approach, we tried to find location-specific
tags by assessing their frequency concentration in a region, as compared to
the whole map. This approach, however, did not work for instances where the
same tag was equally present at two or more places that were far from each
other. Further, we used the KL divergence to separate the probability
distribution of general tags from that of location-specific tags, but this
approach also did not work well, as the model ended up giving more weight to
user-specific tags such as “lehmans”, “gladston”, etc.

5.   CONCLUSION AND FUTURE WORK
In this paper, we study the problem of predicting coordinates for multimedia
objects. We adopt an approach which identifies the tags which are frequently
used by users at each location. This, in turn, helps us predict the cell and
thereafter the coordinates for each object. Our analysis of wrong predictions
reveals that true cells are often present in the top-k and are close to the
predicted cell. This seems to be an area for improvement, where one needs to
disambiguate between the neighbouring cells, maybe by considering cells of
varying sizes or by forming clusters based on the closeness of training
instances.

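
The top-k analysis referred to above can be made concrete with a small
sketch. It assumes that each test instance comes with a dictionary of
per-cell scores and its true cell. The helper names (`top_k_cells`,
`top_k_accuracy`, `cell_gap`) are hypothetical, and measuring closeness as
the number of grid steps between two cells is an assumption of this sketch,
not a measure defined in the paper.

```python
def top_k_cells(scores, k):
    """Return the k cells with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def top_k_accuracy(examples, k):
    """Fraction of examples whose true cell is among the k best-scored cells.

    examples: list of (scores, true_cell), where scores maps cell -> score.
    """
    if not examples:
        return 0.0
    hits = sum(1 for scores, true_cell in examples
               if true_cell in top_k_cells(scores, k))
    return hits / len(examples)


def cell_gap(a, b):
    """Grid steps between two cells (Chebyshev distance on cell indices)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


# Toy scores, invented purely for illustration.
examples = [
    ({(0, 0): 0.9, (0, 1): 0.5, (5, 5): 0.1}, (0, 1)),  # true cell ranked 2nd
    ({(1, 1): 0.7, (2, 2): 0.2}, (1, 1)),               # true cell ranked 1st
    ({(3, 3): 0.6, (3, 4): 0.3}, (9, 9)),               # true cell not scored
]
print(top_k_accuracy(examples, 1), top_k_accuracy(examples, 2))
print(cell_gap((0, 0), (0, 1)))
```

In the first toy example the true cell is ranked second and lies one grid
step from the top-ranked cell, which is the kind of near miss that a
disambiguation step between neighbouring cells would target.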
6.   REFERENCES
 [1] J. Choi, C. Hauff, O. V. Laere, and B. Thomee. The
     Placing Task at MediaEval 2016. MediaEval 2016
     Workshop, Oct. 20-21, 2016.
 [2] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A
     density-based algorithm for discovering clusters in
     large spatial databases with noise.
 [3] C. F. Karney. Algorithms for geodesics. Journal of
     Geodesy, 87(1):43–55, 2013.
 [4] G. Kordopatis-Zilos, S. Papadopoulos, and
     Y. Kompatsiaris. Geotagging social media content
     with a refined language modelling approach. In
     Pacific-Asia Workshop on Intelligence and Security
     Informatics, pages 21–40. Springer, 2015.
 [5] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit.
     URL https://github.com/JohnLangford/vowpal_wabbit/wiki,
     2011.
 [6] A. Makazhanov, D. Rafiei, and M. Waqar. Predicting
     political preference of Twitter users. Social Network
     Analysis and Mining, 4(1):1–15, 2014.
 [7] P. Serdyukov, V. Murdock, and R. Van Zwol. Placing
     Flickr photos on a map. In Proceedings of the 32nd
     International ACM SIGIR Conference on Research and
     Development in Information Retrieval, pages 484–491.
     ACM, 2009.
 [8] H. Tamura, S. Mori, and T. Yamawaki. Textural
     features corresponding to visual perception. IEEE
     Transactions on Systems, Man, and Cybernetics,
     8(6):460–473, 1978.
 [9] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde,
     K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M:
     The new data in multimedia research.
     Communications of the ACM, 59(2):64–73, 2016.
[10] M. Wick and C. Boutreux. GeoNames. GeoNames
     Geographical Database, 2011.