=Paper= {{Paper |id=None |storemode=property |title=CERTH at MediaEval Placing Task 2013 |pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_22.pdf |volume=Vol-1043 |dblpUrl=https://dblp.org/rec/conf/mediaeval/Kordopatis-ZilosPXSK13 }} ==CERTH at MediaEval Placing Task 2013== https://ceur-ws.org/Vol-1043/mediaeval2013_submission_22.pdf
                      CERTH at MediaEval Placing Task 2013

Giorgos Kordopatis-Zilos
Electrical Engineering Dept., Aristotle University of Thessaloniki, Greece
gkordopa@auth.gr

Symeon Papadopoulos, Eleftherios Spyromitros-Xioufis
Information Technologies Institute, CERTH, Thessaloniki, Greece
[papadop,espyromi]@iti.gr

Andreas L. Symeonidis, Yiannis Kompatsiaris
Information Technologies Institute, CERTH, Thessaloniki, Greece
[asymeon,ikom]@iti.gr

ABSTRACT

We describe the participation of the CERTH team in the Placing Task of MediaEval 2013. We submitted five runs on the full test set: two based on tag information, two on visual content, and one using both tag and visual information. Our best performance (median error 650km) was achieved with the use of tag features.

Categories and Subject Descriptors

H.3 [Information Search and Retrieval]: Miscellaneous

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

1. INTRODUCTION

The goal of the task is to produce location estimates for a set of 262,000 images, using a set of over 8.5 million geotagged images and their metadata for training. More details regarding the challenge and the dataset can be found in [2]. For the tag-based runs, we built upon the scheme of [6], making use of a two-level LDA scheme [1]^1 to filter out non-geographic terms. For the visual runs, we relied on a simple Nearest Neighbour scheme using SURF+VLAD features [5] and an efficient indexing scheme for very fast retrieval [4]. Our hybrid run combined a tag run and a visual run using a simple fall-back scheme. All models were built solely on the training data provided by the organizers (i.e. no external gazetteers or Internet data were used).

  ^1 We used the JGibbLDA implementation, available at http://jgibblda.sourceforge.net/.

2. APPROACHES

2.1 Placing images using tags

The tag-based method relies on an offline analysis, in which a geographical-tag model is built from the tags and locations of the approximately 8.5 million images of the training set. The implemented approach comprises three steps.

A. Filtering: In this step, we aim at removing noisy and irrelevant tags from the training data. We therefore remove machine-tags from all images and then remove from the training set those images left with no tags, ending up with 7,266,903 images.

B. Spatial clustering and local LDA: Here, we first cluster the training set images based on their location, applying k-means on their latitude-longitude values. We opted for K = 5000 clusters (areas), so that on average each area would contain approximately 1450 images. For each such area, we then apply LDA to derive a local topic distribution, using 100 topics and 20 terms per topic. We denote an area j as Aj = {uj, {wj}, {τjk}}, where uj is the area id, {wj} is the set of images belonging to the area, and τjk is the k-th topic (set of terms) of the local distribution.

C. Creating the bag-of-excluded-words (BoEW): In this step, we attempt to create a set of non-geographic tags, i.e. tags that should not be taken into account for geotagging. To this end, we first apply LDA on the whole dataset (global LDA) using 500 topics and 50 terms per topic. For each of the resulting topics, we compute its frequency per area^2, ending up with a topic-area distribution (histogram) that essentially corresponds to the spatial distribution of the topic. Based on this distribution, we compute the entropy of each topic and flag the topics whose entropy exceeds a threshold of 180 (empirically selected). The terms of these topics form the bag-of-excluded-words. Example excluded words include the terms landscape, 35mm, kodak, boats, christmas, sunset and tree, while some terms that were mistakenly considered non-geographic include europe, usa, atlanticocean and newmexico, most probably due to their large geographic span.

  ^2 This is computed by counting the images contained in the area that activate the topic. These are determined by applying a threshold (= 0.9 × 1/500) on the image-topic distribution.

Having created the geographical-tag model, we then proceed with the online location estimation step for each test image Ti. We first filter out the tags of the image that are either machine-tags or belong to the BoEW, ending up with the set {ti} of clean tags. We then compute the Jaccard similarity between this set of tags and the set of terms of each topic of each local LDA:

    sijk = |ti ∩ τjk| / |ti ∪ τjk|                                    (1)

In the first variant of the approach (run 1), the image is assigned to the area with the highest Jaccard similarity with any local topic, areai = arg maxj,k sijk. In the second variant of the approach (run 3), we first compute the mean Jaccard similarity sij for each area over all topics that have at least one tag in common with the test image, and then select the area with the highest mean similarity, areai = arg maxj sij.

Having assigned the test image to an area, we then adopt the location estimation technique of [6]: we first determine the k most similar training images (using Jaccard similarity on the corresponding sets of tags) and use their center-of-gravity (weighted by the similarity values) as the location estimate for the test image. For test images with no clean tags, we set their location equal to the centroid of the largest area (a kind of maximum likelihood estimate).

2.2 Placing images using visual features

In the offline analysis step, we extract optimized SURF+VLAD features from each image in the training set (over 8.5 million images) and index the features using the IVFADC scheme proposed in [4]. The SURF+VLAD vectors are based on multiple vocabulary aggregation (four visual vocabularies with k = 128 centroids each) and joint dimensionality reduction (to only 48 dimensions, for efficiency) with PCA and whitening [3]. The vectors were then quantized using a coarse quantizer with 1024 centroids, and Product Quantization was applied on the residual vectors using an 8 × 10 scheme [4], which led to a signature of length 112 bits for each image. We have made the implementation of the above feature extraction and indexing methods publicly available^3.

  ^3 https://github.com/socialsensor/multimedia-indexing

For the online location estimation step, we retrieve the top k visually most similar images and use those for the estimate. In the first variant (run 2), k = 1 and we simply consider the location of the most similar image as the location of the test image. In the second variant (run 4), k = 20 and we apply a simple incremental spatial clustering scheme: if the j-th image (out of the k most similar) is within 1km of the closest of the previous j − 1 images, it is assigned to that image's cluster; otherwise it forms its own cluster. In the end, the largest cluster (or the first in case of equal size) is selected and its centroid is used as the location estimate.

3. RUNS AND RESULTS

As described above, we prepared two tag-based runs, which we will refer to as tmax (run 1) and tmean (run 3), and two visual runs, which we will refer to as vnn (run 2) and vclust (run 4). A fifth run, referred to as hyb, was prepared using a hybrid strategy: if the test image had at least one clean tag associated with it, the tmax approach was used; otherwise vnn was used. All runs were prepared for the full test set of 262,000 images. The tag-based runs took approximately 23 hours to complete (316msec per image), while the visual ones took only 13 hours (179msec per image). These times were recorded on a commodity quad-core 2.40GHz machine with 8GB RAM and a 1TB 7200rpm hard drive.

    measure         tmax     vnn    tmean   vclust     hyb
    acc(1km)       10.26    0.60     7.82     0.76   10.37
    acc(10km)      23.53    0.99    19.86     1.16   23.70
    acc(100km)     36.27    1.86    31.99     2.04   36.22
    acc(500km)     47.20    6.49    43.31     6.64   47.36
    acc(1000km)    53.12   13.43    49.74    13.50   53.27
    median error     651    6715     1028     6691     681

Table 1: Geotagging accuracy (%) for five ranges and median error (in km). Runs tmax and tmean were based on the approach of subsection 2.1, while vnn and vclust on the approach of subsection 2.2.

Table 1 summarises the obtained results for the full test set. The best performance in terms of median error was attained by tmax, closely followed by hyb. In terms of accuracy, the hyb run performs marginally better than tmax in the low ranges (<1km up to <1000km), which is expected since it has a fall-back scheme for the images with no tags. A noteworthy result is the very low performance of the visual runs. Potential reasons for this poor performance include the very "cheap" feature extraction and indexing settings (chosen for efficiency) and the nature of the training and test sets, i.e. the training set did not contain images that were sufficiently visually similar to those of the test set.

We also computed the selected performance measures on the test subsets proposed by the organizers. Surprisingly, we could not identify a trend with respect to test set size. For instance, on the small test set (5,300 images) the best median error (by hyb) was 1423km, while for the test set of 53,000 images the best median error (by tmax) was just 521km. The accuracy measures were more stable, e.g. for tmax they ranged from 9.62% to 10.95% (for <1km) and from 30.34% to 38.01% (for <100km). Similar fluctuations were noted for all other runs, which indicates that each test set has an inherent degree of placeability.

4. FUTURE WORK

In the future, we plan to conduct a more thorough analysis of the different sources of error for the proposed scheme, and to extend it to also include additional metadata of the input images, as well as external resources. Regarding the error analysis, we will look into the impact of (a) the number of topics and terms per topic, both for the local and the global LDAs, (b) the selection of the entropy threshold, (c) the number K of geographical areas, (d) the type and quality of visual features, and (e) the use of better visual matching methods (e.g. geometric verification applied on the list of top k most similar images). Regarding the use of additional metadata, we plan to incorporate the author of an image as an indicator of the image location. Finally, we will consider incorporating resources such as gazetteers, as well as additional geotagged image data collected from the Web, with the goal of increasing the visual coverage of the training set.

5. ACKNOWLEDGEMENTS

This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.

6. REFERENCES

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[2] C. Hauff, B. Thomee, and M. Trevisiol. Working Notes for the Placing Task at MediaEval 2013, 2013.
[3] H. Jégou and O. Chum. Negative evidences and co-occurrences in image retrieval: The benefit of PCA and whitening. In ECCV, 2012.
[4] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. PAMI, 33(1):117–128, 2011.
[5] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. An empirical study on the combination of SURF features with VLAD vectors for image search. In WIAMIS, 2012.
[6] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR '11, pages 48:1–48:8, New York, NY, USA, 2011. ACM.
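To make the two tag-based assignment variants of subsection 2.1 concrete, the following is a minimal Python sketch, not the authors' code: the jaccard and assign_area names and the dictionary mapping area ids to lists of topic term sets are our assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tags, as in equation (1)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_area(clean_tags, area_topics, variant="max"):
    """Assign a test image (its set of clean tags) to an area.

    clean_tags:  tags left after removing machine-tags and BoEW terms.
    area_topics: dict mapping area id -> list of topic term sets (local LDA).
    variant:     "max" (run 1) scores each area by its single best topic;
                 "mean" (run 3) averages over the topics sharing at least
                 one tag with the test image.
    """
    best_area, best_score = None, -1.0
    for area_id, topics in area_topics.items():
        if variant == "max":
            score = max(jaccard(clean_tags, t) for t in topics)
        else:  # "mean"
            overlapping = [jaccard(clean_tags, t) for t in topics
                           if clean_tags & t]
            score = sum(overlapping) / len(overlapping) if overlapping else 0.0
        if score > best_score:
            best_area, best_score = area_id, score
    return best_area
```

In both variants the loop simply realises areai = arg max over the per-area scores; only the per-area score (best topic vs. mean over overlapping topics) differs.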
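The incremental spatial clustering used by the vclust run (subsection 2.2) can likewise be sketched in Python. This is an illustrative reconstruction under stated assumptions: the haversine distance and the list-of-clusters representation are our choices, and neighbor_locations is assumed to be the non-empty, similarity-ordered list of (lat, lon) pairs of the top-k retrieved images.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def estimate_location(neighbor_locations, radius_km=1.0):
    """Incremental spatial clustering over the top-k neighbours (run 4):
    each image joins the cluster of the closest previously seen image if
    that image is within radius_km; otherwise it starts a new cluster.
    Returns the centroid of the largest cluster (the first one on ties)."""
    clusters = []  # list of clusters, each a list of (lat, lon) points
    for loc in neighbor_locations:
        best, best_dist = None, radius_km
        for cluster in clusters:
            d = min(haversine_km(loc, member) for member in cluster)
            if d <= best_dist:
                best, best_dist = cluster, d
        if best is not None:
            best.append(loc)
        else:
            clusters.append([loc])
    largest = max(clusters, key=len)  # max() keeps the first cluster on ties
    lat = sum(p[0] for p in largest) / len(largest)
    lon = sum(p[1] for p in largest) / len(largest)
    return (lat, lon)
```

Averaging latitudes and longitudes directly is a crude centroid that suffices at the 1km scale described in the paper; clusters that straddle the ±180° meridian would need special handling.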