SocialSensor at MediaEval Placing Task 2014

Giorgos Kordopatis-Zilos, Giorgos Orfanidis, Symeon Papadopoulos, Yiannis Kompatsiaris
Information Technologies Institute (CERTH-ITI), Thessaloniki, Greece
{georgekordopatis, g.orfanidis, papadop, ikom}@iti.gr




ABSTRACT

We describe the participation of the SocialSensor team in the Placing Task of MediaEval 2014. We submitted three runs based on tag information for the full test set, using extensions over an existing language modelling approach, and two runs (one on the full test set and the other on the 25,500-image subset) based on visual content, using geospatial clustering and supervised learning. Our best performance (median error 230km, 23% accuracy at 1km) was achieved with tag features, using only internal training data.
1. INTRODUCTION

The goal of the task is to produce location estimates for a set of 510K images, using a set of over 5M geotagged images and their metadata for training [1]. For the tag-based runs, we built upon the scheme of [4], extending it with the Similarity Search method introduced in [6]. We also devised an internal grid technique and a Gaussian distribution model based on the spatial entropy of tags to adjust the corresponding probabilities. For the visual-based location estimation, we attempted to build visual location models, though with limited success. All models were built solely on the training data provided by the organizers (i.e. no external gazetteers or Internet data were used).
2. APPROACHES

2.1 Tag-based location estimation
Baseline approach: The baseline method relies on an offline step, in which a geographical tag model is built from the tags and locations of the approximately 5M images of the training set. The metadata used to build the model and to estimate the location of a query image are the tags, the title and the description. A pre-processing step is first applied to remove all punctuation and symbols and to transform all characters to lower case. After the pre-processing, all training images left with empty tags and title are removed, resulting in a training set of approximately 4.1M images. Note that the same pre-processing is applied to the test images before the actual location estimation process.

In contrast to last year's clustering [3], we divide the earth surface into rectangular cells with a side length of 0.01° for both latitude and longitude (approximately 1km near the equator). This creates a grid of cells, on which we build our language model following the approach described in [4]. More specifically, we estimate the most probable cell for a query (test) image based on the respective tag probabilities. The probability of a tag in a particular cell is calculated as the number of different Flickr users that used the tag inside the cell, divided by the total count of different users over all cells. Note that in this way a user can be counted more than once in the total over all cells.
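To make the counting concrete, the following minimal Python sketch builds these per-cell tag probabilities. It assumes the training metadata has already been flattened into (user, tag, cell) triples; this input format and the function name are illustrative assumptions, not part of the original pipeline.

from collections import defaultdict

def tag_probabilities(records):
    # records: iterable of (user_id, tag, cell_id) triples from the training
    # metadata (assumed input format). Returns p[tag][cell] = number of
    # distinct users of the tag in the cell, divided by the summed number of
    # distinct users of the tag over all cells (so a user active in several
    # cells is counted once per cell, as described above).
    users = defaultdict(set)                      # (tag, cell) -> distinct users
    for user_id, tag, cell_id in records:
        users[(tag, cell_id)].add(user_id)
    totals = defaultdict(int)                     # tag -> total distinct-user count
    for (tag, _), u in users.items():
        totals[tag] += len(u)
    p = defaultdict(dict)
    for (tag, cell_id), u in users.items():
        p[tag][cell_id] = len(u) / totals[tag]
    return p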
In order to assign a query image to a cell, we calculate the probability of each cell by summing up the contributions of the individual tags and title words. The cell with the greatest probability is selected as the image's cell. If this process yields no outcome (i.e. the probability of all cells is zero), we use the description of the query image instead. For test images where there is still no result (e.g. complete lack of text), we set their location to the center of the most populated cell of a coarse-granularity grid (100km×100km), a kind of maximum likelihood estimate.
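A possible implementation of this assignment step is sketched below; the helper names and the fallback_cell argument (the center of the most populated coarse cell) are illustrative, not taken from the authors' code.

def assign_cell(tags, title_words, description_words, p, fallback_cell):
    # p: tag -> {cell: probability} model as built in the previous sketch.
    def cell_scores(words):
        scores = {}
        for w in words:
            for cell, prob in p.get(w, {}).items():
                scores[cell] = scores.get(cell, 0.0) + prob
        return scores

    scores = cell_scores(list(tags) + list(title_words))
    if not scores:                       # probability zero for all cells: use the description
        scores = cell_scores(description_words)
    if not scores:                       # still no textual evidence: maximum-likelihood fallback
        return fallback_cell
    return max(scores, key=scores.get)   # cell with the greatest accumulated probability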
Extensions: We devised the following extensions:

Similarity Search: Having assigned a query image to a cell, we then employ the location estimation technique of [6]: we first determine the k most similar training images (using Jaccard similarity on the corresponding sets of tags) and use their center-of-gravity (weighted by the similarity values) as the location estimate for the test image.
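The sketch below illustrates this extension under the assumption that the candidate images are those of the selected cell; the value of k and the data layout are illustrative choices.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def similarity_search_estimate(query_tags, cell_images, k=5):
    # cell_images: list of (tags, (lat, lon)) training images, assumed to come
    # from the selected cell; at least one image is assumed to be present.
    ranked = sorted(cell_images, key=lambda it: jaccard(query_tags, it[0]), reverse=True)[:k]
    weights = [jaccard(query_tags, tags) for tags, _ in ranked]
    if sum(weights) == 0:                # no tag overlap at all: unweighted centre of gravity
        weights = [1.0] * len(ranked)
    total = sum(weights)
    lat = sum(w * loc[0] for w, (_, loc) in zip(weights, ranked)) / total
    lon = sum(w * loc[1] for w, (_, loc) in zip(weights, ranked)) / total
    return lat, lon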
Internal Grid: In order to obtain more reliable predictions at finer granularities, we also built the language model on a finer grid (cell side length of 0.001° for both latitude and longitude, corresponding to a square of ≈100m×100m). Having computed estimates at both the coarse and the fine granularity, we apply an internal grid technique: if, for a query image, the estimate based on the finer granularity falls within the borders of the estimated cell of the coarser granularity, we consider the fine-granularity estimate trustworthy and apply similarity search inside the fine cell. Otherwise, we perform similarity search inside the coarser-granularity cell, since coarser-granularity language models are by default more trustworthy (they are built from more data).
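The decision rule can be summarised as follows; the bounding-box representation of cells is an assumption made for the sketch.

def inside(point, cell):
    # point = (lat, lon); cell = (min_lat, min_lon, max_lat, max_lon)
    lat, lon = point
    min_lat, min_lon, max_lat, max_lon = cell
    return min_lat <= lat < max_lat and min_lon <= lon < max_lon

def cell_for_similarity_search(fine_estimate, fine_cell, coarse_cell):
    # Trust the fine-granularity result only when it falls inside the coarse cell;
    # otherwise fall back to the coarse cell, whose model is built from more data.
    return fine_cell if inside(fine_estimate, coarse_cell) else coarse_cell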
Spatial Entropy: In order to adjust the original language model tag probabilities for each cell, we built a Gaussian weight function based on the values of the spatial tag entropy. The spatial entropy of a tag t_k is calculated from its probabilities over all m cells of the grid:

    e(t_k) = -\sum_{i=1}^{m} p(t_k | c_i) \log p(t_k | c_i)                (1)

We chose a Gaussian model because tags with either too high or too low entropy values typically carry no geographic cues, and we therefore need to suppress their influence on the location estimation process. Equation 2 presents the entropy-based cell estimation:

    p(c_i | j) = \sum_{k=1}^{T} P(t_k | c_i) \cdot N(e(t_k), \mu, \sigma)    (2)

where p(c_i | j) is the probability of cell c_i for image j, T is the number of tags of image j, P(t_k | c_i) is the probability of tag t_k in cell c_i, e(t_k) is the entropy of tag t_k, and N is the Gaussian function whose parameters µ, σ are estimated from the entropy distribution over the training set.
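Equations 1 and 2 translate directly into code; the sketch below assumes the tag model p built earlier and treats mu and sigma as precomputed from the training-set entropy distribution.

import math

def spatial_entropy(cell_probs):
    # Eq. (1): cell_probs maps cell -> p(t_k|c_i) for a single tag.
    return -sum(q * math.log(q) for q in cell_probs.values() if q > 0)

def gaussian(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def entropy_weighted_scores(tags, p, mu, sigma):
    # Eq. (2): each tag's contribution is weighted by N(e(t_k), mu, sigma).
    scores = {}
    for t in tags:
        cells = p.get(t, {})
        if not cells:
            continue
        w = gaussian(spatial_entropy(cells), mu, sigma)
        for cell, prob in cells.items():
            scores[cell] = scores.get(cell, 0.0) + prob * w
    return scores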
2.2 Visual-based location estimation

To build the visual location models, we relied on two features, SURF+VLAD and CS-LBP+VLAD, concatenated into a single vector. In particular, we first detected the interest points of each image and then extracted the SURF and CS-LBP descriptors corresponding to them. The parameters used for CS-LBP [2] were P = 8, R = 2, and the number of bins N = 16. L2 normalization was applied to SURF and L1 to CS-LBP. For both features we used distinct multiple vocabularies learned on independent collections (four visual vocabularies with k = 128 centroids each) and applied PCA-based dimensionality reduction separately to each VLAD vector, keeping roughly three times as many principal components for the SURF+VLAD vector as for CS-LBP+VLAD (a 3:1 split, due to the correspondingly higher dimensionality of the non-reduced SURF+VLAD). The final concatenated VLAD vector had a length of 1024 and was L2 normalized. For VLAD, we used the implementation of [5].
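As an illustration of how the two reduced descriptors can be combined, the sketch below applies PCA separately to precomputed SURF-VLAD and CS-LBP-VLAD matrices and concatenates the results; the 768/256 split is our reading of the 3:1 factor mentioned above, and the use of scikit-learn is an implementation choice, not necessarily the one used for the submitted runs.

import numpy as np
from sklearn.decomposition import PCA

def combine_vlad(surf_vlad, cslbp_vlad):
    # surf_vlad: (n, d1) and cslbp_vlad: (n, d2) VLAD matrices, assumed to be
    # precomputed (e.g. with the implementation of [5]); n must exceed 768.
    surf_red = PCA(n_components=768).fit_transform(surf_vlad)    # 3:1 split, 768 + 256 = 1024
    cslbp_red = PCA(n_components=256).fit_transform(cslbp_vlad)
    x = np.hstack([surf_red, cslbp_red])
    return x / np.linalg.norm(x, axis=1, keepdims=True)          # final L2 normalization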
The main part of the model building consisted of training linear SVMs to separate the samples into a predefined number of spatial clusters and subclusters (we used 50 clusters and up to 50 subclusters per cluster). The clusters and subclusters were created by running k-means on the coordinates of the training images, while the number of subclusters of each cluster was determined by the number of samples N assigned to it, as min(round(N/3000), 50).
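A minimal sketch of this clustering step, using scikit-learn's k-means as a stand-in for whatever implementation was actually used, is given below.

import numpy as np
from sklearn.cluster import KMeans

def spatial_clusters(coords, n_clusters=50, per_subcluster=3000, max_sub=50):
    # coords: (n_images, 2) array of (lat, lon) training coordinates.
    top = KMeans(n_clusters=n_clusters, n_init=10).fit(coords)
    subclusters = {}
    for c in range(n_clusters):
        members = coords[top.labels_ == c]
        k = int(min(round(len(members) / per_subcluster), max_sub))
        if k >= 2:                                   # clusters with few samples keep no subclusters
            subclusters[c] = KMeans(n_clusters=k, n_init=10).fit(members)
    return top, subclusters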
Subcluster Selection: For each cluster a one-vs-rest approach was applied, resulting in 50-dimensional prediction score vectors, while for the subclusters a similar approach was used but trained only on intra-cluster samples (this led to better performance than using both intra- and inter-cluster samples). The decision about the cluster membership of each sample combined the estimation scores of the cluster prediction score vectors with the score of the best subcluster in each cluster. Finally, priors (based on the number of images per cluster/subcluster) were applied to the respective scores, since they were found to lead to some improvement.
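The text does not specify how exactly the cluster scores, best-subcluster scores and priors are combined; the following sketch shows one plausible additive combination and should be read as an illustration only.

import numpy as np

def select_cluster(cluster_scores, subcluster_scores, priors):
    # cluster_scores: (50,) one-vs-rest SVM scores; subcluster_scores: dict
    # cluster -> array of intra-cluster SVM scores; priors: (50,) positive
    # weights proportional to the number of training images per cluster.
    combined = np.asarray(cluster_scores, dtype=float).copy()
    for c, s in subcluster_scores.items():
        combined[c] += np.max(s)          # add the best subcluster score of each cluster
    combined += np.log(priors)            # one way of applying the priors
    best = int(np.argmax(combined))
    best_sub = int(np.argmax(subcluster_scores[best])) if best in subcluster_scores else None
    return best, best_sub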
Similarity Search: To obtain location estimates of finer granularity, we applied a similarity search step at the subcluster level. In particular, the query image was compared to 1,000 samples from the selected subcluster (sampling was necessary for efficiency reasons), and the location of the most similar of those was returned. Similarity was computed on a low-dimensional concept-based representation, using the 94 concepts of ImageCLEF 2012 (i.e. each image was represented by the 94 prediction scores of a set of corresponding pre-trained concept models).
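A sketch of this final step is shown below; the distance measure on the 94-dimensional concept vectors is not named in the text, so Euclidean distance is used here as an assumption.

import random
import numpy as np

def concept_similarity_search(query_concepts, subcluster_items, sample_size=1000):
    # subcluster_items: list of (concept_vector, (lat, lon)) for the selected
    # subcluster; concept_vector holds the 94 ImageCLEF 2012 prediction scores.
    sample = random.sample(subcluster_items, min(sample_size, len(subcluster_items)))
    q = np.asarray(query_concepts, dtype=float)
    _, best_loc = min(sample, key=lambda it: np.linalg.norm(q - np.asarray(it[0])))
    return best_loc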
3. RUNS AND RESULTS

As described above, we prepared three tag-based runs and two visual runs. The tag-based runs are Run 1, which uses the language model, similarity search, the internal grid and spatial entropy; Run 4, which uses the language model and returns the center of the estimated cell as the location; and Run 5, which uses the language model and similarity search. Run 2 was based on the Subcluster Selection step of subsection 2.2, using the center of the selected subcluster as the location estimate. Run 3 was based on the combination of Subcluster Selection with Similarity Search (as described in subsection 2.2). For Run 3 we used a subset of 25,500 images due to lack of time; for the rest of the runs we used the full test set of 510K images.

measure          Run 1    Run 2    Run 3    Run 4    Run 5
acc(10m)          0.5      0        0        0.03     0.31
acc(100m)         5.85     0        0.01     0.65     4.36
acc(1km)         23.02     0.03     0.16    21.87    22.24
acc(10km)        39.92     0.76     1.27    38.96    38.98
acc(100km)       46.87     2.18     3.00    46.13    46.13
acc(1000km)      60.11    17.35    17.72    59.87    59.87
median error      230     6232     6086      258      259

Table 1: Geotagging accuracy (%) at six ranges and median error (km). Runs 1, 4 and 5 used text metadata, while Runs 2 and 3 relied on visual features.

According to Table 1, the best performance in terms of both median error and accuracy at all ranges was attained by Run 1. Comparing Runs 4 and 5, it can be seen that similarity search had a considerable impact on the accuracy at the lower ranges. The combination of all extensions in Run 1 further improves the overall performance (reaching 5.85% accuracy within 100m), but the median error is still quite high (230km), which means that further improvements are possible. The visual runs yielded very poor results.

In the future, we plan to look into utilizing external data for training, in particular the Flickr 100M Creative Commons dataset and gazetteers. Furthermore, we will look into alternative ways of utilizing visual information for geotagging.

Acknowledgements: This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.

4. REFERENCES
[1] J. Choi et al. The placing task: A large-scale geo-estimation challenge for social-media videos and images. In Proceedings of the 3rd ACM GeoMM Workshop, 2014.
[2] M. Heikkilä, M. Pietikäinen, and C. Schmid. Description of interest regions with local binary patterns. Pattern Recognition, 42(3):425–436, 2009.
[3] G. Kordopatis-Zilos, S. Papadopoulos, E. Spyromitros-Xioufis, A. L. Symeonidis, and Y. Kompatsiaris. CERTH at MediaEval Placing Task 2013. In Proceedings of MediaEval 2013.
[4] A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In Proceedings of MediaEval 2013.
[5] E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and Product Quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 16(6):1713–1728, 2014.
[6] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In Proceedings of ICMR '11, pages 48:1–48:8, New York, NY, USA, 2011. ACM.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain