       MediaEval 2013: Social Event Detection, Retrieval and
        Classification in Collaborative Photo Collections
                                              Markus Brenner, Ebroul Izquierdo
                                  School of Electronic Engineering and Computer Science
                                           Queen Mary University of London, UK
                                   {markus.brenner, ebroul.izquierdo}@eecs.qmul.ac.uk

ABSTRACT
We present a framework that detects social events, retrieves associated photos and classifies those photos according to event type in collaborative photo collections, as part of the MediaEval 2013 benchmarks. We incorporate various contextual cues using both a constraint-based clustering model and a classification model. Experiments on the MediaEval Social Event Detection Dataset demonstrate the effectiveness of our approach.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Design, Experimentation, Performance

Keywords
Benchmark, Photo Collections, Event Detection, Classification

1. INTRODUCTION
The Internet enables people to host and share their photos online through websites like Flickr. Collaborative annotations and tags are commonplace on such services. The information people assign varies greatly, but it often includes references to what happened where and who was involved. In other words, such references describe observed experiences that are planned and attended by people, which we simply refer to as events [1]. To let users exploit events in their photo collections or on online services, effective approaches are needed to detect events, retrieve the corresponding photos and, additionally, understand event types. The MediaEval Social Event Detection (SED) Benchmark [2] provides a platform to compare such approaches.

2. BACKGROUND AND RELATED WORK
There is much research on event detection in web resources in general. The subdomain we focus on is photo websites, on which users can share and collaboratively annotate photos. Recent research [3] puts the emphasis on detecting events from Flickr photos by primarily exploiting user-supplied tags. Other works [4], [5] extend this to place semantics, the latter also incorporating the visual similarity among photos. Our framework relates to event clustering approaches, particularly in personal photo collections [6]. However, we also exploit the context of social events to improve detection and retrieval performance. We believe that further understanding and research are needed on how best to exploit and process the information that collaborative photo collections hold in the context of social events.

This work is partially supported by EU project CUbRIK.
Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

3. OBJECTIVE AND APPROACH
In this paper, we outline a framework that builds upon and extends our previous works [7] and [8], with which we detect social events and retrieve associated photos in collaborative photo collections. Moreover, we classify photos according to event types such as music concerts or sports games. We test our approach against both challenges laid out by the MediaEval 2013 SED Benchmark: Challenge I concerns detecting social events and retrieving the associated photos, and Challenge II concerns classifying photos according to event type.

3.1 Preprocessing: Propagating Locations
The most useful information with respect to social events is: the people involved (based on the username of the person who uploaded the photos); the date and time the photos were captured; and the geographic location (venue) where an event takes place. Our reasoning is the assumed constraint that photos sharing the same involved people, date and time, and geographic location belong to the same event. Likewise, photos that differ in at least one constraint do not belong together. Thus, we extract, propagate and incorporate as much information from these three domains as possible. While date and time as well as usernames (involved people) are usually available, the geographic location is often missing (for example, only newer smartphones embed location coordinates). As detailed in our previous paper [8], we take advantage of this constraint in a preprocessing step to propagate geographic locations across a photo collection, starting from the photos that include geographic coordinates or textual references such as Barcelona.

3.2 Feature Extraction
To aid event detection, retrieval and classification as explained in the following two sections, we extract and compose textual features from each photo's title, description and keywords. First, we apply a romanizing preprocessor that converts text to lower case, strips punctuation and whitespace, and removes accents. Next, we split the text into tokens. To accommodate other languages as well as misspelled or varied terms, we apply a language-agnostic character-based tokenizer rather than a word-based tokenizer. We then use a vectorizer to convert the tokens into a matrix of occurrences. To compensate for photos with a large amount of textual annotation, we also consider the total number of tokens. This approach is commonly referred to as Term Frequencies. Instead of decomposing the resulting feature matrix, we simply limit the number of features to 9600, which yields almost comparable performance at much lower complexity.
   In addition to textual features, we also extract and incorporate visual GIST features (a feature vector with 960 elements) for each photo. To fuse textual and visual features, we normalize both features and concatenate them into a combined feature vector.
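As an illustration of the textual part of this pipeline, the following is a minimal sketch of romanized lower-casing, character-based tokenization and a capped term-frequency matrix. The 3-character n-gram length, the example annotations and all function names are our own assumptions for illustration; the paper does not specify these implementation details.

```python
import unicodedata
from collections import Counter

def preprocess(text):
    # Lower-case, remove accents (romanize) and strip punctuation.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    return "".join(c for c in text if c.isalnum() or c == " ")

def char_ngrams(text, n=3):
    # Language-agnostic character-based tokens instead of word tokens,
    # so misspelled or varied terms still share many tokens.
    text = " ".join(text.split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def term_frequencies(docs, max_features=9600, n=3):
    # Cap the vocabulary at the most frequent n-grams instead of
    # decomposing the matrix, then count occurrences per document.
    counts = [Counter(char_ngrams(preprocess(d), n)) for d in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = [g for g, _ in total.most_common(max_features)]
    return [[c.get(g, 0) for g in vocab] for c in counts], vocab

docs = ["Concert at Palau de la Musica, Barcelona",
        "concert palau musica barcelona 2013"]
matrix, vocab = term_frequencies(docs)
```

Note how the two differently written annotations above still share many character n-grams, which is the motivation for preferring character-based over word-based tokenization.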
We also incorporate a weighting ratio that allows us to emphasize one or the other feature.

3.3 Event Detection and Retrieval
We define an event as a distinct combination of a spatial window (5 km clusters) and a temporal window (8 h clusters). We start with a list of all suitable spatio-temporal window combinations (the results of Section 3.1). If we retrieve more than two photos, as we explain next, we consider the combination a detected event and the retrieved photos part of that event.
   For the actual retrieval, we first include all photos whose date, time and available geographic coordinates fall into an event's spatio-temporal window (we denote these as candidate photos). Thereafter, we employ a Linear Support Vector Classifier (using the features whose extraction we explain in Section 3.2) for all remaining photos that fall only into an event's temporal window but whose spatial window we do not know. For each event, we train a separate model and perform binary classification: photos are either related or not related to the event. We use the candidate photos for the related class, and a small random subset of photos (that do not fall within the same spatio-temporal window) for the not-related class.
   In the last step of our event-driven retrieval framework, we include photos that are likely relevant to a retrieval query but may have been mistakenly discarded by the classification step. In particular, these might be photos linked to users who have multiple photos relevant to a retrieval query. The assumption is that if a user attends a social event and takes photos, then most of the photos taken over the time he attends the event are likely of that event.

3.4 Classification of Event Type
In this section, we extend our framework to classify the event type that one or more photos belong to. We perform the same initial constraint-based spatio-temporal clustering as in Section 3.3. This allows us to compile a larger training set by including all photos of an event when the training ground truth is only given for some of its photos.
   Using this extended overall training set, we then train a multi-class Linear Support Vector Classifier (as in Section 3.3, based on the features that we extract in Section 3.2). In the simplest case, we can thereafter predict the event type of any given test photo. However, instead of treating test photos separately, it is also possible to consider multiple photos that belong to the same event together. To do so, we simply assign the most frequently predicted event type within an event to all its associated photos.

4. EXPERIMENTS AND RESULTS
We perform experiments on the MediaEval 2013 SED Dataset, which consists of a total of 437,370 Flickr photos (Challenge I) and 57,165 Instagram photos (Challenge II) with accompanying metadata. We use the provided training sets to estimate suitable parameter values and to train the event classification model required for Challenge II.
   In the following two tables, we present our results (as evaluated by the organizers of the MediaEval Benchmark) on the testing sets. The results of Challenge I show that it is important to consider temporal clusters (as newly detected events) that are not clearly associated with any geographic location (or spatial cluster). In our case, this improves the F1-score from 0.59 to 0.76. We also see that an additional classification-based expansion of an event's candidate set does not always improve detection and retrieval results, or does so only in conjunction with other steps. For example, if we consider and select only one spatial cluster per matching temporal window for each involved person (username), we can further improve results by a small margin.
   For Challenge II, the results show that we can better classify photos as non-events (F1-score of ~0.94) than as a specific event type. Of the eight event types that we trained our model on, we best classify concert (~0.52), protest (~0.37) and theater-dance (~0.31); we see the worst performance for fashion (~0.07) and other (~0.05). On average, we achieve an event classification F1-score of 0.50 in our best-performing configuration. Surprisingly, neither training set expansion nor event-wide joint classification notably improves results.
   Although we use the same feature extraction configuration for both challenges, the addition of visual features (compared to using only textual features) has a much larger positive impact for Challenge II than for Challenge I.

Table 1: Results of Challenge I depending on configuration
                                               F1      NMI
Run 1: Run 5 - visual features                0.78     0.94
Run 2: Basic                                  0.59     0.64
Run 3: Run 2 + include temporal clusters      0.76     0.94
Run 4: Run 3 + expansion                      0.74     0.93
Run 5: Run 4 + include rest + max. user       0.78     0.94

Table 2: Results of Challenge II depending on configuration
                                       F1 Non-Event   F1 Event
Run 1: Without visual features             0.93         0.37
Run 2-5: Default                           0.95         0.50

5. CONCLUSION
We present a framework to detect social events, retrieve associated photos and classify the photos according to event type in tagged photo collections such as Flickr. We combine various contextual cues using a constraint-based clustering and classification model. The listed benchmark results validate our approach. In the future, we wish to improve event detection by incorporating information from social networks.

REFERENCES
[1] R. Troncy, B. Malocha, and A. T. Fialho, "Linking events with media," in I-SEMANTICS, 2010, pp. 1–4.
[2] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva, "Social Event Detection at MediaEval 2013: Challenges, Datasets, and Evaluation," in MediaEval 2013 Workshop, 2013, pp. 2–3.
[3] L. Chen and A. Roy, "Event detection from flickr data through wavelet-based spatial analysis," in CIKM, 2009, pp. 523–532.
[4] T. Rattenbury, N. Good, and M. Naaman, "Towards automatic extraction of event and place semantics from Flickr tags," in SIGIR, 2007, pp. 103–110.
[5] S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, and A. Vakali, "Cluster-based landmark and event detection on tagged photo collections," IEEE MultiMedia, no. 99, pp. 1–1, 2010.
[6] M. Cooper, J. Foote, A. Girgensohn, and L. Wilcox, "Temporal event clustering for digital photo collections," TOMCCAP, pp. 269–288, 2005.
[7] M. Brenner and E. Izquierdo, "Social Event Detection and Retrieval in Collaborative Photo Collections," in ICMR, 2012.
[8] M. Brenner and E. Izquierdo, "Event-driven Retrieval in Collaborative Photo Collections," in WIAMIS, 2013.