Photo Set Refinement and Tag Segmentation in Georeferencing Flickr Photos

Jiewei Cao
South China University of Technology
Guangzhou, China
jonbakerfish@gmail.com


Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT
In this paper, we describe our approach as part of the MediaEval 2013 Placing Task
evaluation. We use a language model and similarity search as the baseline approach, and
improve its accuracy with two techniques: photo set refinement and tag segmentation. The
first technique takes advantage of the geo-location correlation among test photos, and the
second exploits the textual similarity between tags.

1. INTRODUCTION
The MediaEval 2013 Placing Task requires participants to assign geographical coordinates
(latitude and longitude) to each provided test image; we refer to [1] for a detailed
description. The framework proposed by [2] is used as our baseline approach. The main
contributions of this paper are two techniques that improve the accuracy of georeferencing.
First, we noticed that Flickr users can organize their photos by assigning them to different
sets and collections (http://www.flickr.com/help/collections/). Intuitively, photos in the
same set are highly correlated, and we can exploit these relations when estimating the
geo-location of given images. The outcome of our submitted runs justifies this assumption.
Second, when only the training data provided by the task organizers can be used, unseen
tags (tags that exist only in the test data) are useless for georeferencing. We nevertheless
try to exploit these tags by applying tag segmentation, which is similar to the word
segmentation pre-processing used for languages written without spaces between words, such
as Chinese. Both proposed techniques can be applied to other existing systems with few
changes.

2. METHODOLOGY

2.1 Data Pre-processing
A total of 8,539,050 geo-referenced photos from Flickr were provided as training data.
Following [2], we carried out two preliminary filter steps on this training set. First,
photos without tags are removed. Second, we removed duplicated photos in a slightly
different way: photos that are uploaded by the same user, have an identical tag set, and
lie within a Haversine distance threshold of each other are treated as duplicates, and only
one instance is retained. We use a distance threshold instead of requiring identical
latitude and longitude in order to relax the filtering restriction, so that we can remove
more or fewer duplicates according to the threshold we select. A smaller distance threshold
means that more photos with an identical tag set and different locations are retained, and
identical geo-location is the special case in which the threshold is zero. Finally, with
the threshold we selected, this resulted in a pre-processed training set of 4,538,784
photos. There are five different test sets, and we chose test3, whose size is 53,000. We
did not use any external resource for georeferencing except in run 5, in which we geocoded
the home location of the users in the test set using the Google Geocoding API
(https://developers.google.com/maps/documentation/geocoding/).

2.2 Baseline Approach
The framework proposed by [2] applies a two-step approach to estimate the location of test
photos. First, the locations of the training data are clustered into 500, 2500 and 10000
clusters, which gives three clusterings. Given a clustering, a Naïve Bayes classifier is
used to find the cluster most likely to contain the location of a given test photo. Second,
within the cluster found, a similarity search is used to find the training items whose tags
are closest to those of the test photo. In [3], the authors proposed an improved spatially
aware feature ranking method based on Ripley’s K statistic. We therefore use this framework
with Ripley’s K feature selection as our baseline approach.

2.3 Photo Set Refinement
Photos within the same set or collection tend to be highly correlated in geo-location. For
example, a user can upload the photos taken during a trip into a newly created set.
However, not every photo in the same set is well tagged, because a user may tag only the
photos he likes or is interested in, leaving the others untagged or poorly tagged. As a
result, photos with completely different tag sets or visual content can still be considered
as taken at the same or a nearby location if they are in the same photo set.

A test photo with poor tags will result in a bad estimation. However, if this photo belongs
to a photo set that contains one or more photos with a well estimated location (usually
well tagged), then we can use the centroid location of these photos as the estimation for
the bad one. This is the intuition behind our proposed photo set refinement, and it raises
two problems: 1. Given a photo, how do we find its neighbors within the same photo set?
2. How do we distinguish between well estimated photos and badly estimated ones? Although
we did not handle the Placeability sub-task of the Placing Task at MediaEval 2013, our
solution to the second problem may be considered a naive approach to error estimation.

To handle the first problem, it seems we could simply break down the test data into
different sets according to the original photo sets created by users. However, a photo set
in this user scenario can be
changed from time to time, whether by adding new photos or deleting old ones, and the
geo-location correlation among its photos then becomes weaker. Therefore we need a
different approach: given a photo, we find its neighbors in the test data by comparing the
user id and the timestamps at which the photos were taken and uploaded. If a photo has the
same user id as the given photo, the interval between their taken dates is less than one
threshold, and the interval between their upload dates is less than a second threshold,
then we consider the two photos to belong to the same photo set. Both thresholds are set to
7 days, because we consider a week-long vacation common for most people, and photos taken
and uploaded during these days can be considered a photo set.

There are three clusterings of the training data, and a given test photo is classified to
one medoid in each of them, which gives three medoids. Intuitively, these three medoids are
not far from each other if the photo is well estimated, and vice versa. So, given a photo
set, we consider a photo in it well estimated if all the Haversine distances among its
three medoids are less than 1000 km; otherwise the photo is marked as badly estimated.
Finally, we use the centroid location of the well estimated photos as the final estimation
for the poorly estimated ones; if no well estimated photo is found, we use the home
location of the user (in run 5 only) or simply leave the estimation unchanged.

2.4 Tag Segmentation
Consider the tags ‘southchinauniversityoftechnology’ and
‘southchinauniversityoftechnologylibrary’. If the second tag is unseen, it will be ignored,
even though we can assume that the two tags are correlated because of their textual
similarity. However, we can split the second tag into two terms,
‘southchinauniversityoftechnology’ and ‘library’; the first term is then identical to the
first tag and can be used for georeferencing. Our approach to tag segmentation is to model
the distribution of the segmentation output. First, we assume that all tags are
independently distributed, and we calculate the relative frequency of all tags in the
training data. We created a tag dictionary of size 2,080,618, sorted in descending order of
frequency. We also assume that the tags in the training data follow Zipf's law [4], which
means that the tag with rank n has probability approximately 1/(n ln N), where N is the
number of tags in the dictionary. Then we use dynamic programming to infer the positions of
the cut points. The most likely segmentation is the one that maximizes the product of the
probabilities of the individual split terms. Instead of using the tag probability directly,
we use a cost defined as the logarithm of the inverse of the probability, to avoid
numerical underflow.

Given a test photo, all of its tags are preprocessed by tag segmentation before
georeferencing. For each tag, we select its longest split term and assign it to the photo
as a new tag. The remaining terms (such as ‘library’) are discarded because they are
usually not spatially relevant.

3. RESULTS AND DISCUSSION
We submitted five runs; the results of our experiments are shown in Table 1.

run1: the baseline approach.
run2: uses visual features only, with K-nearest neighbor search.
run3: corrects poorly estimated photos in run1 by the photo set refinement proposed in
      Section 2.3.
run4: similar to run3, but tag segmentation is used to preprocess the test data before
      georeferencing.
run5: uses the user home location in the photo set refinement step. Note that this location
      is also used when estimating the prior probability in the language model framework;
      we refer to [2] for more details.

Table 1: Percentage of correctly detected locations and median error (ME) of each run in
kilometers.

        1 km    10 km    100 km    500 km    1000 km     ME (km)
run1    20.7     43.0      55.3      62.8       66.3    37.65831
run2     0.0      0.0       0.0       0.1        0.6    10026.17
run3    21.1     44.2      57.1      65.2       69.2    28.01581
run4    21.2     44.2      57.5      65.5       69.6     27.0791
run5    20.9     46.1      61.7      71.8       76.5    16.73021

The result of run3 justifies our assumption: estimating test photos jointly improves the
accuracy. In our experiment, the estimations of 4,963 photos differ between run1 and run3;
this is the number of photos changed during the photo set refinement step. Comparing the
georeferencing results of run1 and run3 with the ground truth for these 4,963 photos, we
found that the estimated location of 4,390 photos became closer to the real location in
run3 than in run1, while the remaining 573 photos had a larger error distance in run3 than
in run1. This is mainly caused by errors in differentiating well estimated photos from
badly estimated ones. For some well estimated photos, the three medoids can be far from
each other. Therefore, we need a much more robust way to estimate the error.

Run4 shows only a slight improvement over run3, because unseen tags are not always
segmentable; still, the proposed technique did improve the performance, and its extra time
and computational costs are low. Beyond tag segmentation, which exploits only the textual
similarity between unseen tags and training tags, we could also try to capture the semantic
similarity between them by using external resources or machine learning techniques.

Run5 indicates that the home location of the user is very important for georeferencing
most photos, which is consistent with previous research findings. In run2, we simply used
the extracted visual features provided by the task organizers and ran a K-nearest neighbor
search to find the most similar photo in the training set. However, we did not obtain
reasonable geo-location predictions, and more intensive study is needed in future work.

4. REFERENCES
[1] C. Hauff, B. Thomee, and M. Trevisiol. Working Notes for the Placing Task at MediaEval
    2013. In MediaEval 2013 Workshop, 18-19 October 2013, Barcelona, Spain.
[2] O. Van Laere, S. Schockaert, and B. Dhoedt. Georeferencing Flickr resources based on
    textual meta-data. Information Sciences, 2013,
    http://dx.doi.org/10.1016/j.ins.2013.02.045.
[3] O. Van Laere, J. Quinn, S. Schockaert, and B. Dhoedt. Spatially-Aware Term Selection
    for Geotagging. IEEE TKDE, 2013.
    http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.42
[4] G. K. Zipf. Human Behaviour and the Principle of Least-Effort. Addison-Wesley,
    Cambridge MA, 1949.
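
As an illustration of the tag segmentation described in Section 2.4, the following is a minimal Python sketch, not the implementation used for the submitted runs. The function names and the toy dictionary are hypothetical, and the Zipf normalizer ln N (N being the dictionary size) is an assumption about the exact form of the probability. The cost of a term is the logarithm of the inverse of its assumed probability, and dynamic programming finds the split with the lowest total cost.

```python
import math


def build_costs(ranked_tags):
    """Map each tag to a cost log(rank * ln N), i.e. -log of the assumed
    Zipf probability 1 / (rank * ln N). `ranked_tags` must be sorted by
    descending frequency."""
    n_tags = len(ranked_tags)
    if n_tags < 3:
        # ln N < 1 for N < 3 would give "probabilities" above 1.
        raise ValueError("need at least 3 tags for the ln N approximation")
    log_n = math.log(n_tags)
    return {tag: math.log((rank + 1) * log_n)
            for rank, tag in enumerate(ranked_tags)}


def segment(tag, costs):
    """Return the minimum-cost split of `tag` into dictionary terms,
    or None if no split into known terms exists."""
    max_len = max(map(len, costs))
    # best[i] = (lowest cost of segmenting tag[:i], start of its last term)
    best = [(0.0, 0)] + [(math.inf, 0)] * len(tag)
    for end in range(1, len(tag) + 1):
        for start in range(max(0, end - max_len), end):
            term_cost = costs.get(tag[start:end])
            if term_cost is None or best[start][0] == math.inf:
                continue
            candidate = best[start][0] + term_cost
            if candidate < best[end][0]:
                best[end] = (candidate, start)
    if best[-1][0] == math.inf:
        return None
    # Walk the cut points backwards to recover the terms.
    terms, end = [], len(tag)
    while end > 0:
        start = best[end][1]
        terms.append(tag[start:end])
        end = start
    return terms[::-1]


def longest_term(tag, costs):
    """Keep only the longest split term, as done before georeferencing."""
    terms = segment(tag, costs)
    return max(terms, key=len) if terms else None


# Toy dictionary, most frequent tag first (hypothetical data).
costs = build_costs(["china", "southchinauniversityoftechnology",
                     "library", "south", "university"])
print(segment("southchinauniversityoftechnologylibrary", costs))
print(longest_term("southchinauniversityoftechnologylibrary", costs))
```

In this sketch an unseen tag that cannot be covered by dictionary terms returns None, which corresponds to the case where the tag is not segmentable and stays unused.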
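
The photo set refinement of Section 2.3 can be sketched in the same spirit. This is a simplified illustration under stated assumptions, not the code behind the reported runs: the `Photo` record, the Unix-timestamp representation, the Earth radius constant, and the use of a plain arithmetic mean of latitude and longitude as the centroid (which ignores the antimeridian) are all hypothetical choices. Only the 7-day thresholds and the 1000 km medoid test come from the text.

```python
import math
from dataclasses import dataclass
from itertools import combinations

EARTH_RADIUS_KM = 6371.0        # assumed mean Earth radius
WEEK_SECONDS = 7 * 24 * 3600    # both time thresholds are 7 days


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


@dataclass
class Photo:
    user_id: str
    taken: float       # Unix timestamp (hypothetical representation)
    uploaded: float    # Unix timestamp
    medoids: tuple     # three (lat, lon) medoids, one per clustering
    estimate: tuple    # (lat, lon) produced by the baseline


def same_set(p, q):
    """Same user, and both taken and upload dates less than 7 days apart."""
    return (p.user_id == q.user_id
            and abs(p.taken - q.taken) < WEEK_SECONDS
            and abs(p.uploaded - q.uploaded) < WEEK_SECONDS)


def well_estimated(p, max_km=1000.0):
    """All pairwise distances among the three medoids are below max_km."""
    return all(haversine_km(a, b) < max_km
               for a, b in combinations(p.medoids, 2))


def centroid(points):
    """Arithmetic mean of (lat, lon) pairs; an assumed simplification."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def refine(photo, test_photos, home=None):
    """Return the refined (lat, lon) estimate for `photo`."""
    if well_estimated(photo):
        return photo.estimate
    good = [q.estimate for q in test_photos
            if q is not photo and same_set(photo, q) and well_estimated(q)]
    if good:
        return centroid(good)
    # No well estimated neighbor: fall back to the home location if given.
    return home if home is not None else photo.estimate
```

The `home` argument plays the role of the geocoded user home location that is available in run 5 only; leaving it as None keeps the baseline estimate unchanged.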
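
The duplicate filtering of Section 2.1 can likewise be illustrated with a short sketch. It assumes photos are given as hypothetical (user_id, tags, lat, lon) tuples and uses a greedy pass that keeps the first photo of each group of near-identical ones; the paper does not state which instance is retained, so that choice is an assumption. The comparison is written as "less than or equal" so that a threshold of zero reproduces the identical-location special case.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def filter_training_photos(photos, threshold_km):
    """Drop untagged photos, then drop duplicates: same user, identical
    tag set, and within threshold_km of an already kept photo."""
    kept_locations = defaultdict(list)   # (user, tag set) -> kept (lat, lon)
    kept = []
    for user_id, tags, lat, lon in photos:
        if not tags:
            continue                     # first filter step: no tags
        key = (user_id, frozenset(tags))
        if any(haversine_km(lat, lon, klat, klon) <= threshold_km
               for klat, klon in kept_locations[key]):
            continue                     # second filter step: duplicate
        kept_locations[key].append((lat, lon))
        kept.append((user_id, tags, lat, lon))
    return kept


# Hypothetical sample: the third photo is about 0.11 km from the first.
sample = [
    ("u1", ["a", "b"], 23.0, 113.0),
    ("u1", ["b", "a"], 23.0, 113.0),
    ("u1", ["a", "b"], 23.001, 113.0),
    ("u2", ["a", "b"], 23.0, 113.0),
    ("u1", [], 23.0, 113.0),
]
print(len(filter_training_photos(sample, 0.0)))
print(len(filter_training_photos(sample, 1.0)))
```

Raising the threshold removes more photos, which matches the observation that a smaller threshold retains more photos with an identical tag set and different locations.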