=Paper= {{Paper |id=None |storemode=property |title=CERTH at MediaEval Placing Task 2013 |pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_22.pdf |volume=Vol-1043 |dblpUrl=https://dblp.org/rec/conf/mediaeval/Kordopatis-ZilosPXSK13 }} ==CERTH at MediaEval Placing Task 2013== https://ceur-ws.org/Vol-1043/mediaeval2013_submission_22.pdf
                      CERTH at MediaEval Placing Task 2013

Giorgos Kordopatis-Zilos
Electrical Engineering Dept., Aristotle University of Thessaloniki, Greece
gkordopa@auth.gr

Symeon Papadopoulos, Eleftherios Spyromitros-Xioufis
Information Technologies Institute, CERTH, Thessaloniki, Greece
[papadop,espyromi]@iti.gr

Andreas L. Symeonidis, Yiannis Kompatsiaris
Information Technologies Institute, CERTH, Thessaloniki, Greece
[asymeon,ikom]@iti.gr

ABSTRACT

We describe the participation of the CERTH team in the Placing Task of MediaEval 2013. We submitted five runs on the full test set: two based on tag information, two on visual content, and one using both tag and visual information. Our best performance (median error 650km) was achieved with the use of tag features.

Categories and Subject Descriptors

H.3 [Information Search and Retrieval]: Miscellaneous

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

1. INTRODUCTION

The goal of the task is to produce location estimates for a set of 262,000 images, using a set of over 8.5 million geotagged images and their metadata for training. More details regarding the challenge and the dataset can be found in [2]. For the tag-based runs, we built upon the scheme of [6], making use of a two-level LDA scheme [1]^1 to filter out non-geographic terms. For the visual runs, we relied on a simple Nearest Neighbour scheme using SURF+VLAD features [5] and an efficient indexing scheme for very fast retrieval [4]. Our hybrid run combined a tag run and a visual run using a simple fall-back scheme. All models were built solely on the training data provided by the organizers (i.e. no external gazetteers or Internet data were used).

  ^1 We used the JGibbLDA implementation, available at http://jgibblda.sourceforge.net/.

2. APPROACHES

2.1 Placing images using tags

The tag-based method relies on an offline analysis, in which a geographical-tag model is built from the tags and locations of the approximately 8.5 million images of the training set. The implemented approach comprises three steps.

A. Filtering: In this step, we aim at removing noisy and irrelevant tags from the training data. We therefore remove machine-tags from all images and then remove from the training set those images left with no tags, ending up with 7,266,903 images.

B. Spatial clustering and local LDA: Here, we first cluster the training set images based on their location, applying k-means on their latitude-longitude values. We opted for K = 5000 clusters (areas), so that on average each area would contain approximately 1450 images. For each such area, we then apply LDA to derive a local topic distribution, using 100 topics and 20 terms per topic. We denote an area j as Aj = {uj, {wj}, {τjk}}, where uj is the area id, {wj} is the set of images belonging to the area, and τjk is the k-th topic (set of terms) of the local distribution.

C. Creating the bag-of-excluded-words (BoEW): In this step, we attempt to create a set of non-geographic tags, i.e. tags that should not be taken into account for geotagging. To this end, we first apply LDA on the whole dataset (global LDA) using 500 topics and 50 terms per topic. For each of the resulting topics, we compute its frequency per area^2, ending up with a topic-area distribution (histogram) that essentially corresponds to the spatial distribution of the topic. Based on this distribution, we compute the entropy of each topic and flag the topics whose entropy exceeds a threshold of 180 (empirically selected). The terms of these topics form the bag-of-excluded-words. Example excluded words include the terms landscape, 35mm, kodak, boats, christmas, sunset and tree, while some terms that were mistakenly considered non-geographic include europe, usa, atlanticocean and newmexico, most probably due to their large geographic span.

  ^2 This is computed by counting the images contained in the area that activate the topic. These are determined by applying a threshold (= 0.9 × 1/500) on the image-topic distribution.

Having created the geographical-tag model, we then proceed with the online location estimation step for each test image Ti. We first filter out the tags of the image that are either machine-tags or belong to the BoEW, ending up with the set {ti} of clean tags. We then compute the Jaccard similarity between this set of tags and the set of terms of each topic of each local LDA:

    sijk = |ti ∩ τjk| / |ti ∪ τjk|                                    (1)

In the first variant of the approach (run 1), the image is assigned to the area with the highest Jaccard similarity with any local topic, areai = arg maxj,k sijk. In the second variant of the approach (run 3), we first compute the mean Jaccard similarity sij for each area over all topics that have at least one tag in common with the test image, and then select the area with the highest mean similarity, areai = arg maxj sij.

Having assigned the test image to an area, we then adopt the location estimation technique of [6]: we first determine the k most similar training images (using Jaccard similarity on the corresponding sets of tags) and use their center-of-gravity (weighted by the similarity values) as the location estimate for the test image. For test images with no clean tags, we set their location equal to the centroid of the largest area (a kind of maximum likelihood estimate).

2.2 Placing images using visual features

In the offline analysis step, we extract optimized SURF+VLAD features from each image in the training set (over 8.5 million images) and index the features using the IVFADC scheme proposed in [4]. The SURF+VLAD vectors are based on multiple vocabulary aggregation (four visual vocabularies with k = 128 centroids each) and joint dimensionality reduction (to only 48 dimensions, for efficiency) with PCA and whitening [3]. The vectors were then quantized using a coarse quantizer with 1024 centroids, and Product Quantization was applied on the residual vectors using an 8 × 10 scheme [4], which led to a signature of length 112 bits for each image. We have made the implementation of the above feature extraction and indexing methods publicly available^3.

  ^3 https://github.com/socialsensor/multimedia-indexing

For the online location estimation step, we retrieve the top k visually most similar images and use those for the estimate. In the first variant (run 2), k = 1 and we simply consider the location of the most similar image as the location of the test image. In the second variant (run 4), k = 20 and we apply a simple incremental spatial clustering scheme: if the j-th image (out of the k most similar) is within 1km of the closest of the previous j − 1 images, it is assigned to that image's cluster; otherwise it forms its own cluster. In the end, the largest cluster (or the first in case of equal size) is selected and its centroid is used as the location estimate.

3. RUNS AND RESULTS

As described above, we prepared two tag-based runs, which we will refer to as tmax (run 1) and tmean (run 3), and two visual runs, which we will refer to as vnn (run 2) and vclust (run 4). A fifth run, referred to as hyb, was prepared using a hybrid strategy: if the test image had at least one clean tag associated with it, the tmax approach was used; otherwise vnn was used. All runs were prepared for the full test set of 262,000 images. The tag-based runs took approximately 23 hours to complete (316msec per image), while the visual ones took only 13 hours (179msec per image). These times were recorded on a commodity quad-core 2.40GHz machine with 8GB RAM and a 1TB 7200rpm hard drive.

    measure         tmax     vnn    tmean   vclust     hyb
    acc(1km)       10.26    0.60     7.82     0.76   10.37
    acc(10km)      23.53    0.99    19.86     1.16   23.70
    acc(100km)     36.27    1.86    31.99     2.04   36.22
    acc(500km)     47.20    6.49    43.31     6.64   47.36
    acc(1000km)    53.12   13.43    49.74    13.50   53.27
    median error     651    6715     1028     6691     681

Table 1: Geotagging accuracy (%) for five ranges and median error (in km). Runs tmax and tmean were based on the approach of subsection 2.1, while vnn and vclust on the approach of subsection 2.2.

Table 1 summarises the obtained results for the full test set. The best performance in terms of median error was attained by tmax, closely followed by hyb. In terms of accuracy, the hyb run performs marginally better than tmax in the low ranges (<1km up to <1000km), which is expected since it has a fall-back scheme for the images with no tags. A noteworthy result is the very low performance of the visual runs. Potential reasons for this poor performance include the very "cheap" feature extraction and indexing settings (chosen for efficiency) and the nature of the training and test sets, i.e. the training set did not contain images that were sufficiently visually similar to those of the test set.

We also computed the selected performance measures on the test subsets proposed by the organizers. Surprisingly, we could not identify a trend with respect to test set size. For instance, on the small test set (5,300 images) the best median error (by hyb) was 1423km, while for the test set of 53,000 images the best median error (by tmax) was just 521km. The accuracy measures were more stable, e.g. for tmax they ranged from 9.62% to 10.95% (for <1km) and from 30.34% to 38.01% (for <100km). Similar fluctuations were noted for all other runs, which indicates that each test set has an inherent degree of placeability.

4. FUTURE WORK

In the future, we plan to conduct a more thorough analysis of the different sources of error for the proposed scheme, and to extend it to also include additional metadata of the input images, as well as external resources. Regarding the error analysis, we will look into the impact of (a) the number of topics and terms per topic, both for the local and the global LDAs, (b) the selection of the entropy threshold, (c) the number K of geographical areas, (d) the type and quality of visual features, and (e) the use of better visual matching methods (e.g. geometric verification applied on the list of top k most similar images). Regarding the use of additional metadata, we plan to incorporate the author of an image as an indicator of the image location. Finally, we will consider incorporating resources such as gazetteers, as well as additional geotagged image data collected from the Web, with the goal of increasing the visual coverage of the training set.

5. ACKNOWLEDGEMENTS

This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.

6. REFERENCES

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[2] C. Hauff, B. Thomee, and M. Trevisiol. Working Notes for the Placing Task at MediaEval 2013, 2013.
[3] H. Jégou and O. Chum. Negative evidences and co-occurrences in image retrieval: The benefit of PCA and whitening. In ECCV, 2012.
[4] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. PAMI, 33(1):117–128, 2011.
[5] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. An empirical study on the combination of SURF features with VLAD vectors for image search. In WIAMIS, 2012.
[6] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR '11, pages 48:1–48:8, New York, NY, USA, 2011. ACM.
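To make the two tag-based assignment variants of subsection 2.1 concrete, the following is a minimal Python sketch, not the authors' code: the jaccard and assign_area names and the dictionary mapping area ids to lists of topic term sets are our assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tags, as in equation (1)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_area(clean_tags, area_topics, variant="max"):
    """Assign a test image (its set of clean tags) to an area.

    clean_tags:  tags left after removing machine-tags and BoEW terms.
    area_topics: dict mapping area id -> list of topic term sets (local LDA).
    variant:     "max" (run 1) scores each area by its single best topic;
                 "mean" (run 3) averages over the topics sharing at least
                 one tag with the test image.
    """
    best_area, best_score = None, -1.0
    for area_id, topics in area_topics.items():
        if variant == "max":
            score = max(jaccard(clean_tags, t) for t in topics)
        else:  # "mean"
            overlapping = [jaccard(clean_tags, t) for t in topics
                           if clean_tags & t]
            score = sum(overlapping) / len(overlapping) if overlapping else 0.0
        if score > best_score:
            best_area, best_score = area_id, score
    return best_area
```

In both variants the loop simply realises areai = arg max over the per-area scores; only the per-area score (best topic vs. mean over overlapping topics) differs.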
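The incremental spatial clustering used by the vclust run (subsection 2.2) can likewise be sketched in Python. This is an illustrative reconstruction under stated assumptions: the haversine distance and the list-of-clusters representation are our choices, and neighbor_locations is assumed to be the non-empty, similarity-ordered list of (lat, lon) pairs of the top-k retrieved images.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def estimate_location(neighbor_locations, radius_km=1.0):
    """Incremental spatial clustering over the top-k neighbours (run 4):
    each image joins the cluster of the closest previously seen image if
    that image is within radius_km; otherwise it starts a new cluster.
    Returns the centroid of the largest cluster (the first one on ties)."""
    clusters = []  # list of clusters, each a list of (lat, lon) points
    for loc in neighbor_locations:
        best, best_dist = None, radius_km
        for cluster in clusters:
            d = min(haversine_km(loc, member) for member in cluster)
            if d <= best_dist:
                best, best_dist = cluster, d
        if best is not None:
            best.append(loc)
        else:
            clusters.append([loc])
    largest = max(clusters, key=len)  # max() keeps the first cluster on ties
    lat = sum(p[0] for p in largest) / len(largest)
    lon = sum(p[1] for p in largest) / len(largest)
    return (lat, lon)
```

Averaging latitudes and longitudes directly is a crude centroid that suffices at the 1km scale described in the paper; clusters that straddle the ±180° meridian would need special handling.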