=Paper=
{{Paper
|id=Vol-1263/paper44
|storemode=property
|title=SocialSensor at MediaEval Placing Task 2014
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_44.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Kordopatis-ZilosOPK14
}}
==SocialSensor at MediaEval Placing Task 2014==
Giorgos Kordopatis-Zilos, Giorgos Orfanidis, Symeon Papadopoulos, Yiannis Kompatsiaris

Information Technologies Institute (CERTH-ITI), Thessaloniki, Greece

{georgekordopatis, g.orfanidis, papadop, ikom}@iti.gr

ABSTRACT

We describe the participation of the SocialSensor team in the Placing Task of MediaEval 2014. We submitted three runs based on tag information for the full test set, using extensions over an existing language modelling approach, and two runs (one on the full test set and the other on a 25,500-image subset) based on visual content, using geospatial clustering and supervised learning. Our best performance (median error 230km, 23% at 1km) was achieved with the use of tag features, using only internal training data.

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

1. INTRODUCTION

The goal of the task is to produce location estimates for a set of 510K images, using a set of over 5M geotagged images and their metadata for training [1]. For the tag-based runs, we built upon the scheme of [4], extending it with the use of the Similarity Search method introduced in [6]. We also devised an internal grid technique and a Gaussian distribution model based on the spatial entropy of tags to adjust the corresponding probabilities. For the visual-based location estimation, we attempted to build visual location models, though with limited success. All models were built solely on the training data provided by the organizers (i.e. no external gazetteers or Internet data were used).
2. APPROACHES

2.1 Tag-based location estimation

Baseline approach: The baseline method relies on an offline step, in which a complex geographical-tag model is built from the tags and locations of the approximately 5M images of the training set. The metadata used to build the model and to estimate the location of a query image are the tags, the title and the description. A pre-processing step was first applied to remove all punctuation and symbols and to transform all characters to lower case. After the pre-processing, all training images left with empty tags and title are removed, resulting in a training set of approximately 4.1M images. Note that the same pre-processing is applied to the test images before the actual location estimation process.

In contrast to last year's clustering [3], we divide the earth surface into rectangular cells with a side length of 0.01° for both latitude and longitude (approximately 1km near the equator). Consequently, a grid of cells is created, which we use to build our language model following the approach described in [4]. More specifically, we estimate the most probable cell for a query (test) image based on the respective tag probabilities. A tag probability in a particular cell is calculated as the total number of different Flickr users that used the tag inside the cell, divided by the total count of different users over all cells. Note that in this way a user can be counted in the total count over all cells more than once.
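The per-cell tag probability described above can be sketched as follows. This is a minimal illustration assuming training images arrive as (user_id, tags, lat, lon) tuples — a hypothetical input format, not the task's actual file layout.

```python
from collections import defaultdict

def build_language_model(images, cell_side=0.01):
    """Build per-cell tag probabilities from geotagged training images.

    `images`: iterable of (user_id, tags, lat, lon) tuples (assumed format).
    Cell side length defaults to 0.01 degrees, as in the paper.
    """
    cell_tag_users = defaultdict(set)  # (cell, tag) -> set of user ids
    tag_cells = defaultdict(set)       # tag -> cells where the tag occurs

    for user, tags, lat, lon in images:
        cell = (int(lat // cell_side), int(lon // cell_side))
        for tag in tags:
            cell_tag_users[(cell, tag)].add(user)
            tag_cells[tag].add(cell)

    # p(tag|cell) = #users of the tag in the cell / total user count of the
    # tag over all cells. A user contributes once per cell, so the same user
    # may be counted several times in the denominator, as the paper notes.
    model = {}
    for (cell, tag), users in cell_tag_users.items():
        total = sum(len(cell_tag_users[(c, tag)]) for c in tag_cells[tag])
        model[(cell, tag)] = len(users) / total
    return model
```

The returned dictionary maps a (cell, tag) pair to P(t_k|c_i), which the cell estimation step then aggregates over a query image's tags.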
In order to assign a query image to a cell, we calculate the probability of each cell by summing up the contributions of individual tags and title words. The cell with the greatest probability is selected as the image cell. If during this process there is no outcome (i.e. the probability for all cells is zero), we use the description of the query image. For the test images where there is still no result (e.g. complete lack of text), we set their location equal to the center of the most populated cell of a coarse-granularity grid (100km×100km), a kind of maximum likelihood estimation.

Extensions: We devised the following extensions.

Similarity Search: Having assigned a query image to a cell, we then employ the location estimation technique of [6]: we first determine the k most similar training images (using Jaccard similarity on the corresponding sets of tags) and use their center of gravity (weighted by the similarity values) as the location estimate for the test image.

Internal Grid: In order to ensure more reliable prediction at finer granularities, we also built the language model using a finer grid (cell side length of 0.001° for both latitude and longitude, corresponding to a square of ≈100m×100m). Having computed the result for both the coarse and the fine granularity, we use an internal grid technique: for a query image, if the estimate based on the finer granularity falls within the borders of the estimated cell of the coarser granularity, then we consider the fine granularity trustworthy and apply similarity search inside the fine cell. Otherwise, we perform similarity search inside the coarser-granularity cell, since coarser-granularity language models are by default more trustworthy (due to the use of more data for building them).
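The Similarity Search refinement of [6] can be sketched as below: Jaccard similarity over tag sets, then a similarity-weighted center of gravity of the top-k matches. The (tags, lat, lon) tuple format and the value of k are assumptions for illustration; the paper does not report the k it used.

```python
def estimate_location(query_tags, cell_images, k=10):
    """Similarity-search refinement: rank the training images of the
    selected cell by Jaccard similarity on tag sets and return the
    similarity-weighted center of gravity of the k best matches.

    `cell_images`: list of (tags, lat, lon) tuples (assumed structure).
    The weighted mean of lat/lon is a flat-earth approximation, which is
    reasonable inside a single small grid cell.
    """
    q = set(query_tags)
    scored = []
    for tags, lat, lon in cell_images:
        t = set(tags)
        union = len(q | t)
        sim = len(q & t) / union if union else 0.0
        scored.append((sim, lat, lon))
    scored.sort(reverse=True)                     # most similar first
    top = [s for s in scored[:k] if s[0] > 0]     # drop zero-overlap images
    if not top:
        return None                               # no usable neighbours
    w = sum(sim for sim, _, _ in top)
    return (sum(sim * lat for sim, lat, _ in top) / w,
            sum(sim * lon for sim, _, lon in top) / w)
```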
Spatial Entropy: In order to adjust the original language model tag probabilities for each cell, we built a Gaussian weight function based on the values of the spatial tag entropy. The spatial entropy for each tag t_k is calculated based on its probabilities over all m cells of the grid:

e(t_k) = − Σ_{i=1}^{m} p(t_k|c_i) log p(t_k|c_i)    (1)

We chose a Gaussian model because tags with either too high or too low entropy values typically carry no geographic cues, and we therefore need to suppress their influence on the location estimation process. Equation 2 presents the entropy-based cell estimation:

p(c_i|j) = Σ_{k=1}^{T} P(t_k|c_i) · N(e(t_k), µ, σ)    (2)

where p(c_i|j) is the probability of cell c_i for image j, T is the number of tags of image j, P(t_k|c_i) is the probability of tag t_k for cell c_i, and e(t_k) is the entropy of tag t_k. N is the Gaussian function, and the parameters µ, σ are estimated using the entropy distribution over the training set.
2.2 Visual-based location estimation

To build the visual location models, we relied on two features, SURF+VLAD and CS-LBP+VLAD, concatenated in a single vector. In particular, we first calculated the interest points of each image, and then extracted both SURF and CS-LBP descriptors corresponding to them. The parameters used for CS-LBP [2] were P = 8, R = 2, and the number of bins N = 16. L2 normalization was applied for SURF and L1 for CS-LBP. For both features we used distinct multiple vocabularies learned on independent collections (four visual vocabularies with k = 128 centroids each) and applied dimensionality reduction using PCA separately to each VLAD vector, keeping more principal components for the SURF+VLAD vector at a ratio of 3:1 (due to the correspondingly higher dimensionality of the non-reduced SURF+VLAD). The final VLAD vectors had a concatenated length of 1024 and were L2-normalized. For VLAD, we used the implementation of [5].

The main part of the model building comprised the training of linear SVMs to separate the samples into a predefined number of spatial clusters and subclusters (we used 50 clusters and up to 50 subclusters for each cluster). The clusters/subclusters were created using k-means on the coordinates of the training images, while the number of subclusters was determined by the number of samples N assigned to each cluster (min(round(N/3000), 50)).

Subcluster Selection: For each cluster a one-vs-rest approach was applied, resulting in 50-dimensional prediction score vectors; for the subclusters a similar approach was used, but only on intra-cluster samples (resulting in better performance than using both intra- and inter-cluster samples). The decision about the cluster membership of each sample was a combination of the estimation scores provided by the cluster prediction score vectors and the scores corresponding to the best subcluster in each cluster. Finally, priors (based on the number of images per cluster/subcluster) were applied to the respective scores, since they were found to lead to some improvement.

Similarity Search: To achieve location estimates of finer granularity, we applied a similarity search step at the subcluster level. In particular, the query image was compared to 1000 samples from the selected subcluster (sampling was necessary for efficiency reasons), and the location of the most similar of those was returned. Similarity was computed based on a low-dimensional concept-based representation, using the 94 concepts of ImageCLEF 2012 (i.e. each image was represented by 94 prediction scores coming from a set of corresponding pre-trained concept models).
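The subcluster sizing rule from subsection 2.2, min(round(N/3000), 50), can be sketched as follows; the input is assumed to be the list of k-means cluster labels assigned to the training images.

```python
from collections import Counter

def subcluster_counts(cluster_labels, max_sub=50, samples_per_sub=3000):
    """Number of subclusters per spatial cluster, following the paper's
    rule min(round(N/3000), 50), with N the number of training samples
    assigned to the cluster. The paper does not say how clusters smaller
    than ~1500 samples are handled; this follows the formula literally,
    which would yield 0 subclusters for such clusters.
    """
    sizes = Counter(cluster_labels)
    return {c: min(round(n / samples_per_sub), max_sub)
            for c, n in sizes.items()}
```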
3. RUNS AND RESULTS

As described above, we prepared three tag-based runs and two visual runs. The tag-based runs are Run 1, using the language model, similarity search, internal grid and spatial entropy; Run 4, using the language model and the center of cells as estimated location; and Run 5, using the language model and similarity search. Run 2 was based on the Subcluster Selection step of subsection 2.2, using the center of the subcluster as location estimate. Run 3 was based on the combination of Subcluster Selection with Similarity Search (according to subsection 2.2). For Run 3 we used a subset of 25,500 images due to lack of time; for the rest of the runs we used the full test set of 510K images.

{| class="wikitable"
|+ Table 1: Geotagging accuracy (%) and median error (km). Runs 1, 4 and 5 used text metadata, while Runs 2 and 3 relied on visual features.
! measure !! Run 1 !! Run 2 !! Run 3 !! Run 4 !! Run 5
|-
| acc(10m) || 0.5 || 0 || 0 || 0.03 || 0.31
|-
| acc(100m) || 5.85 || 0 || 0.01 || 0.65 || 4.36
|-
| acc(1km) || 23.02 || 0.03 || 0.16 || 21.87 || 22.24
|-
| acc(10km) || 39.92 || 0.76 || 1.27 || 38.96 || 38.98
|-
| acc(100km) || 46.87 || 2.18 || 3.00 || 46.13 || 46.13
|-
| acc(1000km) || 60.11 || 17.35 || 17.72 || 59.87 || 59.87
|-
| median error || 230 || 6232 || 6086 || 258 || 259
|}

According to Table 1, the best performance in terms of both median error and accuracy at all ranges was attained by Run 1. Comparing Runs 4 and 5, it can be seen that similarity search had considerable impact on the accuracy at the lower ranges. The combination of all extensions in Run 1 further improves the overall performance (reaching 5.85% accuracy at the <100m range), but the median error remains quite high (230km), which means that further improvements can be achieved. The visual runs yielded very poor results.

In the future, we plan to look into utilizing external data for training, in particular the Flickr 100M Creative Commons dataset and gazetteers. Furthermore, we will look into alternative ways of utilizing visual information for geotagging.

Acknowledgements: This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.
of corresponding pre-trained concept models).