=Paper= {{Paper |id=None |storemode=property |title=Placing media items using the Xtrieval Framework |pdfUrl=https://ceur-ws.org/Vol-807/Krippner_CUT_Placing_me11wn.pdf |volume=Vol-807 |dblpUrl=https://dblp.org/rec/conf/mediaeval/KrippnerMHK11 }} ==Placing media items using the Xtrieval Framework== https://ceur-ws.org/Vol-807/Krippner_CUT_Placing_me11wn.pdf
          Placing Media Items Using the Xtrieval Framework
                          Fabian Krippner, Gerald Meier, Jens Hartmann, Robert Knauf
                                                   Chemnitz University of Technology
                                                       Straße der Nationen 62
                                                      09111 Chemnitz, Germany
                                      {fabk, gmei, hajen, knauf}@hrz.tu-chemnitz.de


ABSTRACT
In this paper we describe our approaches and results of evaluating the metadata in tagged user-generated videos as well as their visual features in order to extrapolate geographical relevance. The evaluation was done in the context of the MediaEval 2011 Placing Task, in which we had to determine and assign the best-fitting geographical coordinates to each video. Our main goal was to realize this task with a retrieval framework developed at Chemnitz University of Technology, using the bag-of-words model to compare parts of the metadata. This framework is used for indexing and comparison. In particular, it incorporates multiple stop word lists, stemming lists, and dictionaries. For enhancement purposes, we also used the GeoNames gazetteer, although we noticed that the overall results seem to be slightly better when using metadata comparisons alone.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]

Keywords
Geographical Coordinates, Geotagging, Flickr videos, Gazetteer, Bag-of-Words model, Geographical Location

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, Pisa, Italy.

1. INTRODUCTION
The internet is filled with many hosting sites for different kinds of media such as videos, pictures, and music. Because these resources are tagged with heterogeneous kinds of information, it is sometimes difficult or even impossible to clearly assign a resource’s origin to one specific location on the globe. Regardless of whether the given tags of the videos are sufficient, it is a challenging task to infer a fitting location using a ground truth data set. For that purpose, the main goal of the MediaEval 2011 Placing Task was to address the topic of automatically geotagging videos taken from the Flickr community.

Starting from a set of training videos, geographical coordinates were to be derived as accurately as possible by using three different variations of ground truth data.

The criterion for the relevance of the results was the distance between the actual coordinates of a video clip’s location, as provided by the respective Flickr user, and the coordinates estimated by the Placing Task participants’ algorithms. The exact requirements and conditions are explained in the Placing Task overview paper [1].

2. RELATED WORK
In the field of adding geographical information, “geotags”, or GPS coordinates, many works already address the annotation of images and videos as well as the extraction of comparable image features. For videos, Kelm et al. explain three ways to place a video on the globe using the metadata and external resources [2]. Serdyukov et al. make use of the textual annotations associated with uploaded images in combination with GeoNames¹ in order to retrieve geographically relevant information through a language-based model [3]. Hays and Efros suggest a large image database divided into locations for the purpose of scene matching by comparing images instead of text using the nearest-neighbor method [4].

Most inspiring, however, was the way of comparing metadata and scanning it for geographically relevant information that Perea-Ortega et al. used in their MediaEval 2010 work [5], as well as the metadata- and keyword-focused approach of Choi et al. [6]. This led us to try to realize their approaches using the Xtrieval framework [7], which has been developed at the Chair of Media Informatics at Chemnitz University of Technology. It is based on Lucene² and improves on it by using the bag-of-words model instead of straight keyword lookups.

In particular, we benefited from its incorporated whitespace tokenizer and the modified stop word list. Lucene DataDocuments provided an adequate object type to store all development data sets.

¹ http://www.geonames.org/
² http://lucene.apache.org/java/docs/index.html

3. EVALUATION OF RESULTS
Making use of the built-in methods of Xtrieval, we started by creating data collections from the development data and the data to be tested. We continued by indexing the development metadata collection and using the test data collection to search on it.

For both data collections we extracted the following metadata fields from the respective XML files of the videos: description, keywords, title, locality, region, country, and user ID.

Additionally, we extracted the data fields containing latitude and longitude information from the development data collection.

Furthermore, we created a collection based on the Flickr images’ metadata.
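The field extraction described in Section 3 can be illustrated with a minimal Python sketch. It is not the Xtrieval implementation: the element names (`description`, `keywords`, `userid`, `latitude`, and so on), the `<video>` wrapper, and the sample record are assumptions made for illustration, since the exact XML schema of the Placing Task data is not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the real Placing Task XML schema may differ.
TEXT_FIELDS = ("description", "keywords", "title",
               "locality", "region", "country", "userid")
GEO_FIELDS = ("latitude", "longitude")


def extract_fields(xml_text, with_coordinates=False):
    """Return a dict of metadata fields parsed from one video's XML.

    Missing text fields become empty strings. Coordinates are read only
    when with_coordinates is True (development data); missing or empty
    coordinate elements become None.
    """
    root = ET.fromstring(xml_text)
    record = {}
    for name in TEXT_FIELDS:
        node = root.find(f".//{name}")
        record[name] = (node.text or "").strip() if node is not None else ""
    if with_coordinates:
        for name in GEO_FIELDS:
            node = root.find(f".//{name}")
            has_value = node is not None and node.text and node.text.strip()
            record[name] = float(node.text) if has_value else None
    return record


# Made-up sample record, for illustration only.
SAMPLE_XML = """<video>
  <title>Harbour at dusk</title>
  <description>Short clip of the harbour</description>
  <keywords>harbour, ships, evening</keywords>
  <userid>12345</userid>
  <latitude>53.54</latitude>
  <longitude>9.98</longitude>
</video>"""

dev_record = extract_fields(SAMPLE_XML, with_coordinates=True)
test_record = extract_fields(SAMPLE_XML)
```

Here `dev_record` carries the coordinates as ground truth, while `test_record` contains only the textual fields, mirroring the split between development and test collections.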
Processing both metadata collections (video and image descriptions), we merged the particular data fields into a single field. Here, we left out the fields “latitude”, “longitude”, “userid”, and “docno” and handled them separately later on.

The extracted fields that coincided were merged into one field called “bag”. To this field, that is, to everything except the field “userid”, we applied the bag-of-words model, which is used for direct comparisons using a search method based on the Lucene searcher. This method works closely with the Lucene index. It reevaluates the given query by searching for frequently used terms in relevant documents and appending them to the initial search query with the intention of enhancing the subsequent Lucene search.

We derived the score for each hit between development data and test data from the score formula of the Similarity³ class in Lucene. From each hit set the procedure returned, we took the hit with the highest score as the best match between both data sets.

³ http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

In earlier trials in which we used different contents in the “bag”, we found our optimum in using all described fields. Leaving out even one field led to a large number of videos being missed in the results. For example, leaving out the field “Keywords” from the XML data led to 1,407 (26.31%) missing hits when querying the test data.

We noticed 278 videos which did not receive a hit. By initiating another search process in which we used the missing videos as the test data collection, we were able to reduce this number by 37 videos through repeating the former search process and adding an additional query for matching user IDs. Subsequently, we added the latter results to the results of the first search.

For the second run, which permitted the usage of a gazetteer, we created a new index using the development data and the GeoNames database. By creating a collection over the countries and features we gained a new basis for our search.

Our first search resulted in 23 more hits than without the gazetteer.

Table 1. Results determined by the distances between predicted and actual geographical coordinates

Run                 1km      10km     100km    1000km   10000km
Pure Development    9.37%    21.78%   30.67%   44.92%   86.37%
Gazetteer           9.86%    21.49%   29.79%   43.26%   84.16%

In the final results provided by the MediaEval team, the number of coordinates determined within 1km or less of the true location was higher when using the gazetteer than without. For the moment, this met our initial expectation.

Surprisingly, for ranges of more than 1km from the true coordinates, the results varied slightly, by up to 3.8%, but this time in favor of the search without the additional geographic database.

87.86% of all results were equal in both the searches with and without the gazetteer, including videos exceeding the 10000km threshold. 63.43% of all videos shared the same coordinates, while the rest was divided into two groups: one group (11.11% of the cases) delivered better results using the gazetteer, while the other (13.31%) performed better based only on the development data. This leads us to the conclusion that the use of a gazetteer could improve the search results by nearly the same amount as the original data. Further refinement of the selection process is therefore needed for more accurate results.

4. OUTLOOK
The applied bag-of-words model assigned a correct location to ca. 10% of the tested videos. For further development we will therefore include different sub-bag correlations, in contrast to the current state, where we found our optimum in a complete “bag”. We will try different “bags” for different data, stack searches one after another, and work out a better system of weighting between the distinct ground truth resources. We thereby expect to minimize the deviation of results and to benefit from each particular ground truth data set. Furthermore, applying filter adjustments that determine the origin by language and using speech recognition will be a next step, as well as using the image feature set, particularly the Color and Directivity Descriptor, which we were not able to fully realize by the time of submission.

5. REFERENCES
[1] A. Rae, V. Murdock, P. Serdyukov, and P. Kelm. Working Notes for the Placing Task at MediaEval. In Working Notes for the MediaEval 2011 Workshop, Pisa, Italy, 2011.
[2] P. Kelm, S. Schmiedeke, and T. Sikora. Multi-modal, multi-resource methods for placing flickr videos on the map. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ICMR '11, pages 52:1-52:8, New York, NY, USA, 2011. ACM.
[3] P. Serdyukov, V. Murdock, and R. van Zwol. Placing flickr photos on a map. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pages 484-491, New York, NY, USA, 2009. ACM.
[4] J. Hays and A. Efros. Im2gps: estimating geographic information from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pages 1-8, Anchorage, Alaska, USA, 2008.
[5] J.M. Perea-Ortega, M.A. Garcia-Cumbreras, L.A. Urena-Lopez, and M. Garcia-Vega. SINAI at Placing Task of MediaEval 2010. In Working Notes Proceedings of the MediaEval 2010 Workshop, Pisa, Italy, 2010.
[6] J. Choi, A. Janin, and G. Friedland. The 2010 ICSI Video Location Estimation System. In Working Notes Proceedings of the MediaEval 2010 Workshop, Pisa, Italy, 2010.
[7] J. Kürsten, T. Wilhelm, and M. Eibl. Extensible Retrieval and Evaluation Framework: Xtrieval. In Proceedings of the Lernen - Wissen - Adaption Workshop, LWA 2008, pages 107-110, Würzburg, Germany, 2008.
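
As an illustration of the distance-based criterion behind Table 1, the following Python sketch computes the share of predictions that fall within each distance threshold. It is only a sketch under stated assumptions: the paper does not specify the distance formula, so the haversine great-circle distance and a mean Earth radius of 6371 km are assumed here, and the function names are hypothetical. How videos without any hit were counted in the official evaluation is not modeled.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch
THRESHOLDS_KM = (1, 10, 100, 1000, 10000)  # thresholds listed in Table 1


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def share_within(predicted, actual, thresholds=THRESHOLDS_KM):
    """Fraction of predictions whose error is at most each threshold.

    predicted and actual are equally long, non-empty sequences of
    (lat, lon) pairs in degrees.
    """
    distances = [haversine_km(p[0], p[1], a[0], a[1])
                 for p, a in zip(predicted, actual)]
    total = len(distances)
    return {t: sum(d <= t for d in distances) / total for t in thresholds}


# Two made-up predictions: one exact, one about 111 km off along the equator.
example = share_within([(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 1.0)])
```

Because the thresholds are cumulative, the shares can only grow from left to right, which matches the shape of the rows in Table 1.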