<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Photo Set Refinement and Tag Segmentation in Georeferencing Flickr Photos</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiewei Cao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>South China University of Technology, Guangzhou</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>In this paper, we describe our approach as part of the MediaEval 2013 Placing Task evaluation. We use a language model and similarity search as our baseline approach, and improve its accuracy with two techniques: photo set refinement and tag segmentation. The first technique takes advantage of the geo-location correlation among test photos, and the second exploits the textual similarity between tags.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. METHODOLOGY</title>
    </sec>
    <sec id="sec-2">
      <title>2.1 Data Pre-processing</title>
      <p>
        A total of 8,539,050 geo-referenced photos from Flickr were
provided as training data. Following [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we carried out two
preliminary filtering steps on this training set. First, photos without
tags were removed. Second, we removed duplicated photos in a
slightly different way: photos uploaded by the same user, with an
identical tag set, and whose pairwise Haversine distance is below a
distance threshold are treated as duplicates, and only one instance
is retained. We use a distance threshold instead of requiring identical
latitude and longitude in order to relax the filtering restriction, so
that more or fewer duplicates can be removed depending on the chosen
threshold. A smaller threshold means more photos with identical tag
sets but different locations are retained, and requiring identical
geo-locations is the special case of a zero threshold. With our chosen
threshold, this resulted in a pre-processed training set of 4,538,784
photos. There are five different test sets, and we chose test3, whose
size is 53,000 photos. We did not use any external resource for
georeferencing except in run 5, in which we geocoded the home
location of users in the test set using the Google Geocoding API
(https://developers.google.com/maps/documentation/geocoding/).
      </p>
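      <p>As a concrete sketch, the duplicate-removal step described above can be implemented as follows. This is only an illustration of the filter, not our actual implementation; the field names ('user', 'tags', 'lat', 'lon') and function names are our own choices, and the distance threshold is a free parameter:</p>

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (Haversine) distance between two points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def deduplicate(photos, threshold_km):
    """Keep one instance per group of photos sharing a user and an
    identical tag set whose location lies within threshold_km of an
    already retained photo. A threshold of 0 reduces to requiring
    identical coordinates."""
    kept, groups = [], {}
    for p in photos:
        key = (p['user'], frozenset(p['tags']))
        retained = groups.setdefault(key, [])
        if any(haversine_km(p['lat'], p['lon'], q['lat'], q['lon'])
               <= threshold_km for q in retained):
            continue  # duplicate of an already retained photo: drop it
        retained.append(p)
        kept.append(p)
    return kept
```

      <p>A larger threshold collapses more near-duplicates, while a zero threshold only collapses photos at exactly the same coordinates.</p>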
    </sec>
    <sec id="sec-3">
      <title>2.2 Baseline Approach</title>
      <p>
        The framework proposed by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] applies a two-step approach to
estimate the location of test photos. First, the locations of the
training data are clustered into 500, 2,500 and 10,000 clusters.
Given a clustering, a Naïve Bayes classifier is used to find the
cluster most likely to contain the location of a given test photo.
Second, within the found cluster, a similarity search finds the
training items whose tags are closest to those of the test photo. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
an improved spatially aware feature ranking
method based on Ripley's K statistic was proposed. We therefore use this
framework with Ripley's K feature selection as our baseline
approach.
      </p>
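      <p>To make the first step concrete, the cluster-selection stage might be sketched as below. This is a minimal illustration under assumed data structures (a cluster record holding its photo count, per-tag counts, and total tag count) and an assumed multinomial Naïve Bayes score with Bayesian smoothing; it is not the exact implementation of [2], and all identifiers are our own:</p>

```python
from math import log

def most_likely_cluster(test_tags, clusters, vocab_size, mu=1.0):
    """Return the cluster id maximising log P(c) + sum_t log P(t|c)
    over the test photo's tags.

    `clusters` maps cluster id -> {'n_photos': ..., 'tag_counts': {...},
    'n_tags': ...} (illustrative field names). Bayesian smoothing with
    parameter mu avoids zero probabilities for tags unseen in a cluster."""
    total_photos = sum(c['n_photos'] for c in clusters.values())
    best, best_score = None, float('-inf')
    for cid, c in clusters.items():
        score = log(c['n_photos'] / total_photos)  # cluster prior P(c)
        for t in test_tags:
            count = c['tag_counts'].get(t, 0)
            # smoothed tag likelihood P(t|c)
            score += log((count + mu / vocab_size) / (c['n_tags'] + mu))
        if score > best_score:
            best, best_score = cid, score
    return best
```

      <p>The similarity search of the second step then only has to compare the test photo's tags against training items inside the chosen cluster.</p>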
    </sec>
    <sec id="sec-4">
      <title>2.3 Photo Set Refinement</title>
      <p>Photos within the same set or collection
(http://www.flickr.com/help/collections/) tend to be highly
correlated in geo-location. For example, a user may upload the photos
taken during a trip into a newly created set.
However, not every photo in a set is well tagged, because a
user tends to tag only the photos he loves or is interested in, leaving
the others un-tagged or poorly tagged. As a result, photos with
completely different tag sets or visual content can be assumed to have
been taken in the same location or nearby, as long as they belong to the
same photo set.</p>
      <p>A test photo with poor tags will lead to a bad estimate.
However, if this photo belongs to a photo set that contains one or
more photos with well-estimated locations (usually well-tagged ones),
then we can use the centroid of those photos' locations as the
estimate for the bad one. This is the intuition behind our proposed
photo set refinement, and it raises two problems: 1. Given a
photo, how do we find its neighbors within the same photo set? 2.
How do we distinguish well-estimated photos from bad ones?
Although we did not handle the Placeability sub-task of the Placing
Task at MediaEval 2013, our solution to the second problem may
be considered a naive approach to error estimation.</p>
      <p>To handle the first problem, it seems we could simply break the
test data down into sets according to the original photo sets
created by users. However, a photo set in this scenario can be
changed over time, whether by adding new photos or deleting old
ones, and the geo-location correlation between its photos will
weaken. We therefore take a different approach: given a photo, we
find its neighbors in the test data by comparing user ids and the
timestamps at which photos were taken and uploaded. If a photo has
the same user id as the given photo, and both the interval between
their taken dates and the interval between their upload dates are
below a threshold, then we consider the two photos to belong to the
same photo set. Both thresholds are set to 7 days, because we
consider a week-long vacation common for most people, and photos
taken and uploaded during such a period can be considered a photo
set.</p>
      <p>There are three clusterings of the training data (into 500,
2,500 and 10,000 clusters), so a given test photo is classified to
three medoids, one per clustering. Intuitively, these three medoids
are not far from each other if the photo is well estimated, and vice
versa. So, given a photo set, we consider a photo well estimated if
all the pairwise Haversine distances among its three medoids are
less than 1000 km; otherwise it is marked as badly estimated.
Finally, we use the centroid location of the well-estimated photos
as the final estimate for the poorly estimated ones; if no
well-estimated photo is found, we use the home location of the user
(in run 5 only) or simply leave the estimate unchanged.</p>
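      <p>The well/badly-estimated test and the centroid refinement might be sketched as below. The medoids are assumed to be given as (latitude, longitude) pairs, one per clustering; all names are illustrative:</p>

```python
from math import radians, degrees, sin, cos, asin, sqrt, atan2

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (Haversine) distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def is_well_estimated(medoids, max_km=1000.0):
    """A photo counts as well estimated when all pairwise Haversine
    distances among its three medoids stay below max_km."""
    for i in range(len(medoids)):
        for j in range(i + 1, len(medoids)):
            if haversine_km(*medoids[i], *medoids[j]) > max_km:
                return False
    return True

def centroid(locations):
    """Spherical centroid of (lat, lon) pairs, used as the refined
    estimate for the badly estimated photos of a set."""
    x = y = z = 0.0
    for lat, lon in locations:
        la, lo = radians(lat), radians(lon)
        x += cos(la) * cos(lo)
        y += cos(la) * sin(lo)
        z += sin(la)
    n = len(locations)
    x, y, z = x / n, y / n, z / n
    return degrees(atan2(z, sqrt(x * x + y * y))), degrees(atan2(y, x))
```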
    </sec>
    <sec id="sec-5">
      <title>2.4 Tag Segmentation</title>
      <p>
        Consider the tag ‘southchinauniversityoftechnology’ and the
tag ‘southchinauniversityoftechnologylibrary’. If the latter were an
unseen tag, it would be ignored, even though its textual similarity to
the former suggests the two are correlated. However, we can split it
into the two terms ‘southchinauniversityoftechnology’ and ‘library’;
the first term is then identical to a known tag and can be used for
georeferencing. Our approach to tag segmentation is to model the
distribution of the segmentation output. First, we assume all tags
are independently distributed, and we calculated the relative
frequency of all tags in the training data. We created a tag
dictionary sorted in descending order of frequency, with size
2,080,618. We also assume that the tags in the training data follow
Zipf's law [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
which means that the tag with rank k has probability roughly
proportional to 1/(k ln N), where N is the number of tags in the
dictionary. Then we use dynamic programming to infer the positions
of the cut points. The most likely segmentation is the one that
maximizes the product of the probabilities of the individual split
terms. Instead of directly using the tag probabilities, we use a
cost defined as the logarithm of the inverse of the probability, to
avoid numerical underflow.
      </p>
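      <p>A minimal sketch of this segmentation, assuming a precomputed map from each dictionary tag to its 1-based frequency rank: under the Zipf assumption the cost of a term with rank k is log(k ln N), unknown terms are heavily penalized, and dynamic programming (here, memoized recursion) finds the minimum-cost split. The helper names and the max_len bound on candidate term length are our own:</p>

```python
from functools import lru_cache
from math import log

def make_segmenter(rank, n_tags, max_len=30):
    """Build a function that splits a concatenated tag into known terms.

    `rank` maps a dictionary tag to its frequency rank; `n_tags` is the
    dictionary size N. Minimizing the summed costs log(k * ln N) is
    equivalent to maximizing the product of Zipfian probabilities."""
    ln_n = log(n_tags)

    def cost(term):
        if term in rank:
            return log(rank[term] * ln_n)
        return 1e9  # effectively forbid unknown terms

    @lru_cache(maxsize=None)
    def seg(text):
        if not text:
            return (0.0, [])
        best = (float('inf'), [text])
        for i in range(1, min(len(text), max_len) + 1):
            head, tail = text[:i], text[i:]
            tail_cost, tail_parts = seg(tail)
            c = cost(head) + tail_cost
            if c < best[0]:
                best = (c, [head] + tail_parts)
        return best

    return lambda text: seg(text)[1]
```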
      <p>Given a test photo, all the tags in this photo are preprocessed by
tag segmentation before georeferencing. For each tag, we select
its longest split term and assign it to this photo as a new tag. The
remaining terms (such as ‘library’) are discarded because these
terms are usually not spatially relevant.</p>
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>We submitted five runs; the results of our experiments are
shown in Table 1.
run1: the baseline approach.
run2: uses visual features only, with a k-nearest-neighbor search.
run3: corrects poorly estimated photos of run1 with the photo set
refinement proposed in Section 2.3.
run4: like run3, but tag segmentation is used to preprocess
the test data before georeferencing.
run5: uses the user home location in the photo set refinement step.</p>
      <p>Table 1: Results of our five runs.
Run    (%)    ME (km)
run1   66.3   37.65831
run2   0.6    10026.17
run3   69.2   28.01581
run4   69.6   27.0791
run5   76.5   16.73021</p>
      <p>
        Note that this location is also used when estimating the prior
probability in the language model framework; we refer to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
for more details.
      </p>
      <p>The result of run3 justifies our assumption that we can
estimate test photos jointly to improve accuracy. In our experiment,
the number of photos estimated differently between run1 and run3 is
4,963; this is the number of photos changed during the photo set
refinement step. After comparing the georeferencing results of run1
and run3 with the ground truth, we found that among these 4,963
photos, 4,390 photos' estimated locations in run3 became closer to
the real locations than in run1, while the remaining 573 photos had
a larger error distance in run3 than in run1. This is mainly caused
by incorrectly differentiating well-estimated photos from bad ones:
for some well-estimated photos, the pairwise Haversine distances
among the three medoids can still be large. We therefore need a more
robust way of detecting erroneous estimates.</p>
      <p>Run4 does not show a promising improvement over run3. The
reason is that unseen tags are not always segmentable; still, the
proposed technique did improve performance slightly, and the extra
time and computational costs are low. Beyond tag segmentation, which
only exploits the textual similarity between unseen tags and
training tags, we could also try to capture the semantic similarity
between them by utilizing external resources or machine learning
techniques.</p>
      <p>Run5 indicates that the home location of the user is very
important for georeferencing most photos, which is consistent with
previous research findings. In run2, we simply used the extracted
visual features provided by the task organizers and ran a
k-nearest-neighbor search to find the most similar photo in the
training set. However, we did not obtain reasonable geo-location
predictions, and a more intensive study is needed in our future
work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Trevisiol</surname>
          </string-name>
          .
          <article-title>Working Notes for the Placing Task at MediaEval 2013</article-title>
          . In MediaEval 2013 Workshop, 18-19 October
          <year>2013</year>
          , Barcelona, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Van Laere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <article-title>Georeferencing Flickr resources based on textual meta-data</article-title>
          .
          <source>Information Sciences</source>
          ,
          <year>2013</year>
          , http://dx.doi.org/10.1016/j.ins.2013.02.045.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Van Laere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <article-title>Spatially-Aware Term Selection for Geotagging</article-title>
          .
          <source>IEEE TKDE</source>
          ,
          <year>2013</year>
          . http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.42
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Zipf</surname>
          </string-name>
          .
          <article-title>Human Behaviour and the Principle of Least Effort</article-title>
          . Addison-Wesley, Cambridge, MA, 1949.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>