=Paper=
{{Paper
|id=None
|storemode=property
|title=NTNU@MediaEval 2011 Social Event Detection Task
|pdfUrl=https://ceur-ws.org/Vol-807/Ruocco_SED_NTNU_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/RuoccoR11
}}
==NTNU@MediaEval 2011 Social Event Detection Task==
NTNU@MediaEval 2011 Social Event Detection Task (SED)

Massimiliano Ruocco and Heri Ramampiaro

Data and Information Management Group, Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Saelands vei 7-9, NO-7491 Trondheim, Norway. {ruocco,heri}@idi.ntnu.no

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

===ABSTRACT===
In this paper we present the system we used to address the challenges of the Social Event Detection (SED) task at MediaEval 2011.

===Categories and Subject Descriptors===
H.3 [Information Storage and Retrieval]: Information Search and Retrieval

===1. INTRODUCTION===
This work is part of the MediaEval 2011 benchmark, whose general purpose was to propose an event retrieval system. In particular, we propose a system that addresses two specific event extraction challenges. In the first challenge, the goal was to retrieve all soccer events in Rome and Barcelona; in the second challenge, we were asked to retrieve all events from two specified venues in Amsterdam (NL) and Barcelona (ES) within a certain temporal range. The results of the queries are presented as groups of images, i.e., one group per event. More specific details of the challenges can be found in [2].

===2. SYSTEM OVERVIEW===
In this section we give a detailed presentation of our proposed algorithm. Figure 1 shows an overview of our system.

Figure 1: System overview.

====2.1 Query Expansion====
In a social event retrieval context, a query can be split and mapped into three parts according to the general parameters characterizing an event: (1) ''what'', i.e., which kind of event we are looking for; (2) ''where'', i.e., the venue, place name, city or region where the event we are looking for takes place; and (3) ''when'', i.e., the time or interval in which the event happens. In both challenges, the ''where'' part of the query is expanded in the first block. For Challenge 1, the ''where'' part is built from all the stadium names in Rome and Barcelona, in all languages, while for Challenge 2 it is built from the names of the two venues specified in the challenge. For both challenges, the geographical location (latitude and longitude) is also extracted. To retrieve this information, a set of SPARQL queries is submitted to the DBpedia database (http://dbpedia.org/) through the Jena interface for Java (http://jena.sourceforge.net/). More specifically, for Challenge 1, names in different languages and geographical information are extracted by selecting, from the DBpedia category Football_venues_in_Italy, the occurrences matching the cities of Rome and Barcelona. For Challenge 2, the geographical location and the names related to the requested venues were extracted using the LastFM API (http://www.last.fm/api) as queries into the LastFM database. The output of this block is a set of queries Q = {Q<sub>1</sub>, ..., Q<sub>N</sub>}, where each subset Q<sub>i</sub> = {q<sub>i1</sub>, ..., q<sub>iM</sub>} contains all queries related to one venue, and each q<sub>ij</sub> = {T, g} is composed of two parts: a textual part T with the different names of the venue, and a spatial part g with a pair of real numbers representing the latitude and longitude of the venue.

====2.2 Search====
The queries are submitted to a search engine over the dataset. In our work, we use the Solr search engine (http://lucene.apache.org/solr/) to index the dataset and perform the search. The search is done as a mix of spatial search (using the latitude and longitude values) and textual search. The data are indexed on the textual metadata fields Title, Description and Tags, and the search is performed over all three fields. In particular, for Challenge 2 the queries are boolean queries with all terms combined with AND, while for Challenge 1 the conditions are more relaxed and the terms of each query are combined with the boolean operator OR. The reason for this is that in Challenge 1 a categorizer is applied as the next step to filter out non-relevant retrieved occurrences.

====2.3 Categorization====
The input of this block is a list of pictures with their metadata. This module is used only for Challenge 1, to extract the pictures related to soccer events. Categorization is performed over the three textual metadata fields of each picture, i.e., Title, Description and Tags; the different runs exploit the descriptiveness of each kind of metadata in the categorization process (see Section 3). To categorize the pictures, the SemanticHacker API (http://textwise.com/api) was applied to the different textual metadata. The categories produced are based on the Open Directory Project (http://www.dmoz.org/). The pictures are then filtered by keeping only those assigned a category whose path starts with the radix Sports/Soccer.

====2.4 Clustering and Merging====
The previous blocks return a set of filtered pictures related to soccer events (Challenge 1) or taken at the venues specified in the search step (Challenge 2), grouped by venue. In this step, the temporal information is used to group temporally related pictures, so that the resulting clusters are finally grouped according to both their temporal and their locational information. The clustering is performed with the Quality Threshold (QT) clustering algorithm [1]. This algorithm does not require the number of clusters to be specified in advance, and although it is computationally expensive, it is applied only to the retrieved documents. The resulting clusters may still be semantically related and belong to the same event. To merge semantically similar clusters, a graph is built whose nodes are the clusters; two nodes are connected if they share at least one tag representing a named entity of an event or of an artist. To extract the named entities, we submit the tags as queries to LastFM for artist names and to DBpedia for event names. Clusters are then merged by finding the connected components, as in [3].

====2.5 Refinement====
The resulting clusters may be incomplete, i.e., the dataset may contain other pictures that are related to the extracted event clusters but were not retrieved in the search step. The refinement module is used here to query the dataset using (1) the top-k frequent tags and (2) the top-k frequent entity names (artists and events) of each cluster. The results of the refinement step can again be filtered to avoid retrieving non-relevant occurrences.

===3. EXPERIMENTS AND RESULTS===
In this section, we present the different runs and their evaluation over different metrics. Table 1 provides a summary of our results.

{| class="wikitable"
!
! colspan="2" | Challenge 1
! colspan="3" | Challenge 2
|-
!
! Run 1 !! Run 2 !! Run 1 !! Run 2 !! Run 3
|-
! Precision
| 94.26 || 92.47 || 74.70 || 77.91 || 78.85
|-
! Recall
| 38.48 || 43.16 || 37.99 || 55.06 || 56.83
|-
! F-Measure
| 54.65 || 58.65 || 50.36 || 64.52 || 66.05
|-
! NMI
| 0.4613 || 0.4752 || 0.4101 || 0.5049 || 0.6448
|}

Table 1: Evaluation measures of the runs

====3.1 Challenge 1====
Two different runs were performed for the first challenge. In the first run (Run 1), the whole workflow of the system is executed, excluding the refinement step and the semantic merge between clusters, and categorization is performed using only the Tags metadata. For the second run (Run 2), we also include the Title and Description metadata. From the results obtained (see Table 1) we observe that including metadata other than tags decreased precision, probably due to the lack of descriptiveness of the other metadata.

====3.2 Challenge 2====
For this challenge, three different runs were performed. Run 1 (the baseline) executed the algorithm without the semantic merge and refinement steps. In both Run 2 and Run 3, the semantic merge and refinement steps were performed. Semantic merge was done by considering each cluster represented by the named entities denoting events or artists. Moreover, in Run 2, refinement is performed by querying with the top-100 tags and the temporal range enclosing each cluster. In Run 3, we instead used the entity names representing artists or events extracted from the set of tags of each cluster.

===4. CONCLUSIONS===
We have presented a system to extract events for the two given challenges. As described in this paper, the best result in terms of precision was obtained in the first challenge by using only the tags in the categorization step, while the other evaluation measures were better when using all the textual metadata. In the second challenge, the best result was obtained using the complete workflow of the algorithm, i.e., including the refinement step, and in particular using the entity names in the refinement query for each cluster. Our future experiments, especially for the first challenge, will include applying the refinement step and the semantic merge over the totality of the results, instead of over the groups of results coming from each query.

===5. REFERENCES===
[1] L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Research, 9(11):1106–1115, Nov. 1999.

[2] S. Papadopoulos, R. Troncy, V. Mezaris, B. Huet, and I. Kompatsiaris. Social Event Detection at MediaEval 2011: Challenges, Dataset and Evaluation. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.

[3] M. Ruocco and H. Ramampiaro. Event Clusters Detection on Flickr Images Using a Suffix-Tree Structure. In International Symposium on Multimedia, pages 41–48, 2010.
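The paper does not reproduce the SPARQL it submits to DBpedia for the query-expansion step (Section 2.1). As a rough sketch of what such a venue-expansion query might look like, a small helper could build it as below; the category name Football_venues_in_Italy is from the paper, while the prefixes, triple patterns, and the dbo:location property are our assumptions, not the authors' actual queries:

```python
def venue_query(city: str) -> str:
    """Build a SPARQL query retrieving venue labels and coordinates for
    stadiums in a given city via the DBpedia category named in the paper.
    Triple patterns and variable names are illustrative assumptions."""
    return f"""
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geo:     <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dbc:     <http://dbpedia.org/resource/Category:>
PREFIX dbo:     <http://dbpedia.org/ontology/>

SELECT ?venue ?label ?lat ?long WHERE {{
  ?venue dcterms:subject dbc:Football_venues_in_Italy ;
         rdfs:label ?label ;
         geo:lat ?lat ;
         geo:long ?long ;
         dbo:location ?loc .
  ?loc rdfs:label "{city}"@en .
}}
"""

query = venue_query("Rome")
```

In the actual system, queries of this kind were issued through the Jena interface for Java rather than built as raw strings.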
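To make the q<sub>ij</sub> = {T, g} query structure of Section 2.1 and the AND/OR distinction of Section 2.2 concrete, here is a minimal sketch; the VenueQuery class and the quoted-term syntax are illustrative assumptions, not the authors' code or Solr schema:

```python
from dataclasses import dataclass

@dataclass
class VenueQuery:
    """One expanded query q_ij = {T, g}: venue name variants T plus a
    (lat, lon) pair g, as defined in Section 2.1 of the paper."""
    names: list   # textual part T: the venue's names in several languages
    lat: float    # spatial part g
    lon: float

def text_clause(q: VenueQuery, challenge: int) -> str:
    """Combine the name terms with OR for Challenge 1 (relaxed, since the
    categorizer filters non-relevant hits later) and with AND for
    Challenge 2, as described in Section 2.2."""
    op = " OR " if challenge == 1 else " AND "
    return "(" + op.join(f'"{name}"' for name in q.names) + ")"

q = VenueQuery(["Stadio Olimpico", "Olympic Stadium"], 41.934, 12.455)
```

The spatial part g would additionally feed a geographic filter in the search engine, which is omitted here.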
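The Sports/Soccer filter of Section 2.3 amounts to a prefix test on the category path returned by the categorizer. A minimal sketch, where the dict-based picture representation is our assumption:

```python
def keep_soccer(pictures):
    """Keep only pictures whose assigned ODP category path starts with
    the radix Sports/Soccer (Section 2.3)."""
    return [p for p in pictures if p["category"].startswith("Sports/Soccer")]

pics = [
    {"id": 1, "category": "Sports/Soccer/UEFA"},
    {"id": 2, "category": "Arts/Music"},
]
kept = keep_soccer(pics)
```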
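The temporal grouping of Section 2.4 uses QT clustering [1], which needs no preset number of clusters. A minimal one-dimensional version applied to picture timestamps is sketched below; the greedy candidate construction follows the published algorithm, but this is a sketch under our own simplifications (distinct integer timestamps, diameter = time span), not the authors' implementation:

```python
def qt_cluster(timestamps, diameter):
    """Quality Threshold clustering [1] on timestamps: grow a candidate
    cluster around every point, keep the largest whose temporal spread
    stays within `diameter`, remove it, and repeat."""
    points = sorted(timestamps)
    clusters = []
    while points:
        best = []
        for seed in points:
            # Greedily add the nearest remaining points while the cluster
            # diameter (max - min) stays within the threshold.
            cand = [seed]
            for p in sorted(points, key=lambda x: abs(x - seed)):
                if p in cand:
                    continue
                if max(cand + [p]) - min(cand + [p]) <= diameter:
                    cand.append(p)
            if len(cand) > len(best):
                best = cand
        clusters.append(sorted(best))
        points = [p for p in points if p not in best]
    return clusters

clusters = qt_cluster([1, 2, 3, 10, 11, 30], diameter=3)
```

As the paper notes, QT is computationally expensive (here quadratic per iteration), which is tolerable because it runs only on the retrieved documents.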
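The semantic merge of Section 2.4 reduces to finding connected components in the graph whose nodes are clusters and whose edges link clusters sharing an event/artist named-entity tag. A sketch using union-find; the input representation (cluster id to set of entity tags) is our assumption, and the authors' implementation details are not given in the paper:

```python
def merge_clusters(entity_tags):
    """Return the connected components of the tag-sharing cluster graph,
    as in [3]. Input: {cluster_id: set of named-entity tags}."""
    ids = list(entity_tags)
    parent = {c: c for c in ids}   # union-find forest over cluster ids

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if entity_tags[a] & entity_tags[b]:   # shared named entity
                parent[find(a)] = find(b)

    comps = {}
    for c in ids:
        comps.setdefault(find(c), set()).add(c)
    return sorted(sorted(comp) for comp in comps.values())

groups = merge_clusters({
    1: {"U2", "concert"},
    2: {"U2"},
    3: {"opera"},
})
```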