=Paper= {{Paper |id=None |storemode=property |title=NTNU@MediaEval 2011 Social Event Detection Task |pdfUrl=https://ceur-ws.org/Vol-807/Ruocco_SED_NTNU_me11wn.pdf |volume=Vol-807 |dblpUrl=https://dblp.org/rec/conf/mediaeval/RuoccoR11 }} ==NTNU@MediaEval 2011 Social Event Detection Task == https://ceur-ws.org/Vol-807/Ruocco_SED_NTNU_me11wn.pdf
NTNU@MediaEval 2011 Social Event Detection Task (SED)

                                   Massimiliano Ruocco and Heri Ramampiaro
            Data and Information Management Group, Department of Computer and Information Science
                                 Norwegian University of Science and Technology
                                Sem Saelands vei 7-9 NO-7491 Trondheim, Norway
                                                {ruocco,heri}@idi.ntnu.no


ABSTRACT
In this paper we present the system used to solve the challenges of the
Social Event Detection (SED) task at the MediaEval 2011 challenge.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Information
Search and Retrieval

1.    INTRODUCTION
This work is part of the MediaEval 2011 challenge. The general purpose of the
challenge was to propose an event retrieval system. In particular, we proposed
a system to solve two specific event extraction challenges. In the first
challenge, the main purpose was to retrieve all soccer events in Rome and
Barcelona, and in the second challenge, we were asked to retrieve all events
from two specified venues in Amsterdam (NL) and Barcelona (ES) within a
certain temporal range. The results of the queries were presented as groups of
images, i.e., one group per event. More specific details of the challenge can
be found in [2].

2.    SYSTEM OVERVIEW
In this section we give a detailed presentation of our proposed algorithm.
Figure 1 shows an overview of our system.

Figure 1: System Overview

2.1    Query Expansion
In a social event retrieval context, a query can be split and mapped into
three different parts according to the general parameters characterizing an
event: (1) what, i.e., which kind of event we are looking for, (2) where,
i.e., the venue, name of place, city or region where the event that we are
looking for takes place, (3) when, i.e., the time interval in which the event
happens. In both challenges the where part of the query is expanded in the
first block. For Challenge 1 the where part is created with all the stadium
names in Rome and Barcelona, in all languages, while for Challenge 2 it was
created with the names of the two venues specified in the challenge. For both
challenges, the geographical location (latitude and longitude) is also
extracted. In order to retrieve this information, a set of SPARQL queries is
submitted to the DBpedia database (footnote 1) by using the Jena interface
(footnote 2) for Java. To be more specific, for Challenge 1, names in
different languages and geographical information are extracted by selecting
from the DBpedia category Football_venues_in_Italy the occurrences based on
the city of Rome and Barcelona. For Challenge 2, the geographical location and
names related to the requested venues were extracted using the LastFM API
(footnote 3) and used as queries into the LastFM database. The output of this
block is a set of queries Q = {Q_1, ..., Q_N}, where each subset
Q_i = {q_i1, ..., q_iM} refers to all queries related to a venue and each
q_ij = {T, g} is composed of two different parts: a textual part with
different names of the venues, and a spatial part with a pair of real numbers
representing the latitude and longitude of the venues.
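
The output of the query expansion block, Q = {Q_1, ..., Q_N} with
q_ij = {T, g}, can be pictured with a small data structure. The following is
a minimal sketch and not the implementation used in this work. The names
VenueQuery and build_query_set, the dictionary layout, and the venue names
and coordinates are illustrative assumptions, as is the choice of emitting
one query per name variant of a venue.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VenueQuery:
    """One expanded query q_ij = {T, g} (hypothetical representation)."""
    text: str                 # T: one name variant of the venue
    geo: Tuple[float, float]  # g: (latitude, longitude) of the venue


def build_query_set(venues: Dict[str, dict]) -> List[List[VenueQuery]]:
    """Return Q = [Q_1, ..., Q_N], one sub-list Q_i per venue.

    Assumption: every name variant of a venue yields its own query,
    and all queries of a venue share the same coordinates.
    """
    query_set = []
    for info in venues.values():
        q_i = [VenueQuery(text=name, geo=info["geo"]) for name in info["names"]]
        query_set.append(q_i)
    return query_set


# Illustrative input only; names and coordinates are approximate examples.
venues = {
    "stadio_olimpico": {
        "names": ["Stadio Olimpico", "Olympic Stadium"],
        "geo": (41.934, 12.455),
    },
    "camp_nou": {
        "names": ["Camp Nou"],
        "geo": (41.381, 2.123),
    },
}

Q = build_query_set(venues)
for q_i in Q:
    for q in q_i:
        print(q.text, q.geo)
```

Keeping the textual and the spatial part together in one record makes it
straightforward to pass both to a search step that mixes textual and spatial
conditions.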

2.2    Search
The queries are submitted to the search engine over the dataset. In our work,
we use the Solr search engine (footnote 4) to index the dataset and perform
the search. The search is done as a mix of spatial (by using latitude and
longitude values) and textual search. The data are indexed based on the
textual metadata, including Title, Description and Tags. The search is then
performed over all three metadata fields. In particular, for Challenge 2 the
queries are boolean queries with all the terms combined with AND, while in
Challenge 1 these conditions are more relaxed and the terms of each query are
combined with the boolean operator OR. The reason for this is that in this
challenge, a categorizer is provided as the next step to filter out
non-relevant retrieved occurrences.

2.3    Categorization
The input of this block is a list of pictures with their metadata. This module
is used only for Challenge 1 to extract pictures related to a soccer event.
The categorization is performed over the three textual metadata fields of each
picture, i.e., Title, Description and Tags. The different runs exploit the
descriptiveness of each kind of metadata in the categorization process (see
Section 3). To categorize the pictures, the SemanticHacker API (footnote 5)
was applied to the different textual metadata. The categories produced are
based on the Open Directory Project (footnote 6). The pictures are filtered by
keeping only those assigned a category whose path starts with Sports/Soccer.

2.4    Clustering and Merging
The previous block returns a set of filtered pictures related to soccer events
(Challenge 1) or pictures taken in the venues specified in the search step
(Challenge 2), grouped by venue. In this step the temporal information is used
to group temporally related pictures. In this way the resulting clusters are
grouped according to both their temporal and locational information. To
perform the clustering, the Quality Threshold (QT) clustering algorithm is
used [1]. This algorithm does not require the number of clusters to be
specified in advance and, even though it is computationally expensive, it is
applied only to the retrieved documents. The resulting clusters may be
semantically related and belong to the same event. To merge semantically
similar clusters a graph is built, where the nodes of the graph are the
clusters, and two nodes are connected if they share at least one tag
representing a named entity of an event or of an artist. To extract the named
entities, we use the tags and submit them as queries to LastFM for the artist
names and to DBpedia for event names. Clusters are merged by finding the
connected components, as in [3].

2.5    Refinement
The resulting clusters may be incomplete, i.e., the dataset may contain other
pictures that are related to the extracted event clusters but were not
retrieved in the search step. The refinement module is used here to query the
dataset by using the (1) top-k frequent tags and (2) top-k frequent entity
names (artists and events). The results of the refinement step can still be
filtered to avoid retrieving non-relevant occurrences.

3.    EXPERIMENTS AND RESULTS
In this section, we present the different runs and their evaluation over
different metrics. Table 1 provides a summary of our results.

                   Challenge 1           Challenge 2
                   Run 1     Run 2       Run 1     Run 2     Run 3
Precision (%)      94.26     92.47       74.70     77.91     78.85
Recall (%)         38.48     43.16       37.99     55.06     56.83
F-Measure (%)      54.65     58.65       50.36     64.52     66.05
NMI                0.4613    0.4752      0.4101    0.5049    0.6448

Table 1: Evaluation measures of the runs

3.1    Challenge 1
Two different runs were performed in the first challenge. In the first run
(Run 1) the whole workflow of the system is performed, excluding the
refinement step and the semantic merge between clusters. For Run 1 the
categorization is performed by using only the Tag metadata, while for the
second run (Run 2) we also include the Title and Description metadata. From
the results obtained (see Table 1) we can observe that including metadata
other than tags resulted in a decrease in precision, probably due to the lack
of descriptiveness of the other metadata.

3.2    Challenge 2
For this challenge, three different runs were performed. Run 1 (the baseline
run) executed the algorithm without the semantic merge and refinement steps.
In both Run 2 and Run 3 the semantic merge and refinement steps were
performed. Semantic merge was done by representing each cluster by the named
entities denoting events or artists. Moreover, in Run 2 refinement is
performed by querying the top-100 tags and the temporal range covered by each
cluster. In Run 3, we used the entity names representing artists or events
extracted from the set of tags of each cluster.

4.    CONCLUSIONS
We have presented a system to extract events for the two given challenges. As
described in this paper, the best result in terms of precision was obtained in
the first challenge by using only the tags for the categorization step, while
the other evaluation measures were better when using all the textual metadata.
In the second challenge the best result was obtained using the complete
workflow of the algorithm, i.e., including the refinement step, in particular
using the entity names in the refinement query for each cluster. Our future
experiments, especially for the first challenge, will include the use of the
refinement step and semantic merge over the totality of the results (instead
of applying it over groups of results coming from the query).

5.    REFERENCES
[1] L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring Expression Data:
    Identification and Analysis of Coexpressed Genes. Genome Research,
    9(11):1106–1115, Nov. 1999.
[2] S. Papadopoulos, R. Troncy, V. Mezaris, B. Huet, and I. Kompatsiaris.
    Social Event Detection at MediaEval 2011: Challenges, Dataset and
    Evaluation. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2 2011.
[3] M. Ruocco and H. Ramampiaro. Event clusters detection on flickr images
    using a suffix-tree structure. Multimedia, International Symposium on,
    0:41–48, 2010.

1 http://dbpedia.org/
2 http://jena.sourceforge.net/
3 http://www.last.fm/api
4 http://lucene.apache.org/solr/
5 http://textwise.com/api
6 http://www.dmoz.org/

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy
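
As an illustration of the temporal clustering step of Section 2.4, the
following is a minimal sketch of Quality Threshold clustering in the spirit
of [1], applied to one-dimensional capture times. It is not the
implementation used in this work. The function name qt_cluster, the use of
hours as the time unit, and the threshold value of 2.0 are assumptions made
only for the example; ties between equally good candidates are broken in
favour of the earliest value.

```python
from typing import List


def qt_cluster(values: List[float], max_diameter: float) -> List[List[float]]:
    """Quality Threshold clustering of 1-D values.

    Repeatedly grows one candidate cluster per remaining point, always
    adding the point that keeps the diameter (max - min) smallest, and
    stops growing when the diameter would exceed max_diameter. The
    largest candidate is kept, its points are removed, and the process
    restarts on the remaining points.
    """
    remaining = sorted(values)
    clusters: List[List[float]] = []
    while remaining:
        best: List[int] = []
        for seed in range(len(remaining)):
            members = [seed]
            lo = hi = remaining[seed]
            free = [j for j in range(len(remaining)) if j != seed]
            while free:
                # Point whose addition yields the smallest diameter.
                nxt = min(
                    free,
                    key=lambda j: max(hi, remaining[j]) - min(lo, remaining[j]),
                )
                new_lo = min(lo, remaining[nxt])
                new_hi = max(hi, remaining[nxt])
                if new_hi - new_lo > max_diameter:
                    break
                members.append(nxt)
                free.remove(nxt)
                lo, hi = new_lo, new_hi
            if len(members) > len(best):
                best = members
        chosen = set(best)
        clusters.append([remaining[i] for i in sorted(chosen)])
        remaining = [v for i, v in enumerate(remaining) if i not in chosen]
    return clusters


# Hypothetical capture times (in hours) of pictures from a single venue.
capture_hours = [0.0, 0.5, 1.0, 10.0, 10.2, 30.0]
events = qt_cluster(capture_hours, max_diameter=2.0)
print(events)
```

The number of clusters is not given in advance; only the maximum allowed
diameter is. The nested loops make the procedure expensive, which is why it
is reasonable to run it only on the retrieved pictures of each venue.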