=Paper=
{{Paper
|id=None
|storemode=property
|title=Placing media items using the Xtrieval Framework
|pdfUrl=https://ceur-ws.org/Vol-807/Krippner_CUT_Placing_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/KrippnerMHK11
}}
==Placing media items using the Xtrieval Framework==
Fabian Krippner, Gerald Meier, Jens Hartmann, Robert Knauf

Chemnitz University of Technology, Straße der Nationen 62, 09111 Chemnitz, Germany

{fabk, gmei, hajen, knauf}@hrz.tu-chemnitz.de

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, Pisa, Italy.

===Abstract===
In this paper we describe our approaches to, and results of, evaluating the metadata of tagged user-generated videos as well as their visual features in order to extrapolate geographical relevance. The evaluation was done in the context of the MediaEval 2011 Placing Task, in which we had to determine and assign the best fitting geographical coordinates to each video. Our main goal was to realize this task with a retrieval framework developed at the Chemnitz University, using the bag-of-words model to compare parts of the metadata. This framework is used for indexing and comparing purposes. In particular, it incorporates multiple lists of stop words, stemming lists, and dictionaries. To enhance the results, we also used the GeoNames gazetteer, although we noticed that the overall results seem to be slightly better when using metadata comparisons alone.

===Categories and Subject Descriptors===
H.3.3 [Information Search and Retrieval]

===Keywords===
Geographical Coordinates, Geotagging, Flickr videos, Gazetteer, Bag-of-Words model, Geographical Location

===1. Introduction===
The internet is filled with many hosting sites for different kinds of media such as videos, pictures, and music. Because these resources are tagged with heterogeneous kinds of information, it is sometimes difficult or impossible to clearly assign a resource's origin to one specific location on the globe. Regardless of whether the given tags of the videos are sufficient, it is a challenging task to infer a fitting location using a ground truth data set. For that purpose, the main goal of the MediaEval 2011 Placing Task was to address the topic of automatically geotagging videos taken from the Flickr community.

Starting from a set of training videos, geographical coordinates were to be derived as accurately as possible by using three different variations of ground truth data.

The criterion for the relevance of the results was the distance between the actual coordinates of a video clip's location, provided by the particular Flickr user, and the coordinates estimated by the Placing Task participants' algorithms. The exact requirements and conditions are explained in the Placing Task overview paper [1].

===2. Related Work===
In the field of adding geographical information, “geotags”, or GPS coordinates, many works already address the annotation of images and videos as well as the extraction of comparable image features. For videos, Kelm et al. explain three ways to place a video on the globe using the metadata and external resources [2]. Serdyukov et al. make use of the textual annotations associated with uploaded images in combination with GeoNames<sup>1</sup> in order to retrieve geographically relevant information through a language-based model [3]. Hays and Efros suggest a large image database divided into locations for the purpose of scene matching, comparing images instead of text using the nearest-neighbor method [4].

Most inspiring, however, was the way of comparing metadata and scanning them for geographically relevant information that Perea-Ortega et al. used in their MediaEval 2010 work [5], as well as the metadata- and keyword-focused approach of Choi et al. [6]. This led us to try to realize their approaches using the Xtrieval framework [7], which has been developed at the Chair of Media Informatics of the Chemnitz University. It is based on Lucene<sup>2</sup> and improves it by using the bag-of-words model instead of straight keyword lookups.

In particular, we benefitted from its incorporated whitespace tokenizer and the modified stop word list. Lucene DataDocuments provided an adequate object type to store all development data sets.

===3. Evaluation of Results===
Making use of the built-in methods of Xtrieval, we started by creating data collections from the development data and the data to be tested. We continued by indexing the development metadata collection and using the test data collection to search on it.

For both data collections we extracted the following metadata fields from the respective XML files of the videos: description, keywords, title, locality, region, country, and user ID. Additionally, we extracted the data fields containing latitude and longitude information from the development data collection. Furthermore, we created a collection based on the Flickr images' metadata.

Processing both metadata collections (video and image descriptions), we merged the particular data fields into a single field. Here, we left out the fields “latitude”, “longitude”, “userid”, and “docno” and handled them separately later on.

The extracted fields which coincided were merged into one field called “bag”. Thereby, except for the field “userid”, we applied the bag-of-words model, which is used for direct comparisons using a search method based on the Lucene searcher. This method works closely with the Lucene Index. It reevaluates the given query by searching for frequently used terms in relevant documents and appending them to the initial search query with the intention of enhancing the subsequent Lucene search.

We derived the score for each hit between development data and test data from the score formula of the Similarity<sup>3</sup> class in Lucene. From each hit set the procedure returned, we took the hit with the highest score as the best match between both data sets.

In former tryouts where we used different contents in the “bag”, we found our optimum in using all described fields in it. Leaving out even one field led to a very large number of videos being missed in the results. For example, leaving out the field “Keywords” from the XML data led to 1,407 (26.31%) missing hits when querying the test data.

We noticed 278 videos which did not receive a hit. By initiating another search process in which we used the missing videos as the test data collection, we were able to reduce this number by 37 videos through repeating the former search process and adding an additional query for matching user IDs. Subsequently, we added the latter results to the results of the first search.

For the second run, which permitted the usage of a gazetteer, we created a new index using the development data and the GeoNames database. By creating a collection over the countries and features we gained a new basis for our search.

Our first search resulted in 23 more hits than without the gazetteer.

{| class="wikitable"
|+ Table 1. Results determined by the distances between predicted and actual geographical coordinates
! Run !! 1km !! 10km !! 100km !! 1000km !! 10000km
|-
| Pure Development || 9.37% || 21.78% || 30.67% || 44.92% || 86.37%
|-
| Gazetteer || 9.86% || 21.49% || 29.79% || 43.26% || 84.16%
|}

According to the final results provided by the MediaEval team, the number of determined coordinates within a very close proximity of 1km or less was higher when using the gazetteer than without. For the moment, this met our initial expectation.

Surprisingly, for ranges of more than 1km difference to the true coordinates, the results varied slightly, to an extent of up to 3.8%, but this time in favor of the search without the additional geographic database.

87.86% of all results were equal in both the searches with and without the gazetteer, including videos exceeding the 10000km threshold. 63.43% of all videos shared the same coordinates, while the rest was divided into two groups: one group (11.11% of the cases) delivered better results using the gazetteer, while the other (13.31%) performed better based only on the development data. This leads us to the conclusion that the use of a gazetteer could improve the search results by nearly the same amount as the original data. Therefore, further refinement in the selection process is needed for more accurate results.

===4. Outlook===
The applied bag-of-words model assigned a correct location to ca. 10% of the tested videos. For further development we will therefore include different sub-bag correlations, in contrast to now, where we found our optimum in a complete “bag”. We will try different “bags” for different data, stacking searches after one another, and figuring out a better system of weighting between the distinct ground truth resources. Thereby, we expect to minimize the deviation of results and to gain the benefits of each particular ground truth data set. Furthermore, applying filter adjustments that determine the origin by language and using speech recognition will be a next step, as will using the image feature set, particularly the Color and Directivity Descriptor, which we could not fully realize by the time of submission.

===5. References===
[1] A. Rae, V. Murdock, P. Serdyukov, and P. Kelm. Working Notes for the Placing Task at MediaEval. In Working Notes for the MediaEval 2011 Workshop, Pisa, Italy, 2011.

[2] P. Kelm, S. Schmiedeke, and T. Sikora. Multi-modal, multi-resource methods for placing flickr videos on the map. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ICMR '11, pages 52:1-52:8, New York, NY, USA, 2011. ACM.

[3] P. Serdyukov, V. Murdock, and R. van Zwol. Placing flickr photos on a map. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pages 484-491, New York, NY, USA, 2009. ACM.

[4] J. Hays and A. Efros. Im2gps: estimating geographic information from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pages 1-8, Anchorage, Alaska, USA, 2008.

[5] J.M. Perea-Ortega, M.A. Garcia-Cumbreras, L.A. Urena-Lopez, and M. Garcia-Vega. SINAI at Placing Task of MediaEval 2010. In Working Notes Proceedings of the MediaEval 2010 Workshop, Pisa, Italy, 2010.

[6] J. Choi, A. Janin, and G. Friedland. The 2010 ICSI Video Location Estimation System. In Working Notes Proceedings of the MediaEval 2010 Workshop, Pisa, Italy, 2010.

[7] J. Kürsten, T. Wilhelm, and M. Eibl. Extensible Retrieval and Evaluation Framework: Xtrieval. In Proceedings of the Lernen - Wissen - Adaption Workshop, LWA 2008, pages 107-110, Würzburg, Germany, 2008.

<sup>1</sup> http://www.geonames.org/

<sup>2</sup> http://lucene.apache.org/java/docs/index.html

<sup>3</sup> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
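
The procedure described in Section 3 (merging metadata fields into one “bag”, taking the best-scoring development record as the location estimate, and evaluating by distance thresholds as in Table 1) can be illustrated with a minimal Python sketch. It is not the Xtrieval or Lucene implementation. Several things are assumptions made only for illustration: the plain cosine similarity on raw term counts stands in for Lucene's Similarity scoring, the haversine formula with a mean Earth radius of 6371 km stands in for whatever distance measure the task organizers applied, videos without a hit are counted as misses, and all function names are hypothetical.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0                    # assumed mean Earth radius
THRESHOLDS_KM = (1, 10, 100, 1000, 10000)   # distance ranges used in Table 1
# Textual fields merged into the "bag"; "userid" is deliberately left out.
BAG_FIELDS = ("description", "keywords", "title", "locality", "region", "country")


def make_bag(record):
    """Merge the textual metadata fields into one whitespace-tokenized bag."""
    text = " ".join(record.get(field, "") for field in BAG_FIELDS)
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two term-count bags (stand-in for Lucene scoring)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def place(test_record, dev_records):
    """Return (lat, lon) of the best-scoring development record, or None."""
    query = make_bag(test_record)
    best = max(dev_records, key=lambda d: cosine(query, make_bag(d)), default=None)
    if best is None or cosine(query, make_bag(best)) == 0.0:
        return None  # no shared term: the video receives no hit
    return best["latitude"], best["longitude"]


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def accuracy_by_threshold(predicted, actual):
    """Share of all videos placed within each threshold; None counts as a miss."""
    dists = [haversine_km(p, a) for p, a in zip(predicted, actual) if p is not None]
    total = len(actual)
    return {t: sum(d <= t for d in dists) / total for t in THRESHOLDS_KM}
```

Given a list of development records (dictionaries with the metadata fields plus `latitude` and `longitude`), `place` yields one estimate per test video, and `accuracy_by_threshold` turns the estimates into the kind of per-range shares reported in Table 1.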