=Paper=
{{Paper
|id=None
|storemode=property
|title=MediaEval 2013: Social Event Detection, Retrieval and Classification in Collaborative Photo Collections
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_64.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/BrennerI13
}}
==MediaEval 2013: Social Event Detection, Retrieval and Classification in Collaborative Photo Collections==
Markus Brenner, Ebroul Izquierdo
School of Electronic Engineering and Computer Science
Queen Mary University of London, UK
{markus.brenner, ebroul.izquierdo}@eecs.qmul.ac.uk
ABSTRACT
We present a framework to detect social events, retrieve associated photos and classify the photos according to event types in collaborative photo collections as part of the MediaEval 2013 benchmarks. We incorporate various contextual cues using both a constraint-based clustering model and a classification model. Experiments based on the MediaEval Social Event Detection Dataset demonstrate the effectiveness of our approach.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Design, Experimentation, Performance

Keywords
Benchmark, Photo Collections, Event Detection, Classification

1. INTRODUCTION
The Internet enables people to host and share their photos online through websites like Flickr. Collaborative annotations and tags are commonplace on such services. The information people assign varies greatly but often includes some sort of reference to what happened, where, and who was involved. In other words, such references describe observed experiences that are planned and attended by people, which we simply refer to as events [1]. To enable users to exploit events in their photo collections or on online services, effective approaches are needed to detect events and retrieve corresponding photos and, additionally, to understand event types. The MediaEval Social Event Detection (SED) Benchmark [2] provides a platform to compare such approaches.

2. BACKGROUND AND RELATED WORK
There is much research in the area of event detection in web resources in general. The subdomain we focus on is photo websites, wherein users can share and collaboratively annotate photos. Recent research [3] places emphasis on detecting events from Flickr photos by primarily exploiting user-supplied tags. Other works [4], [5] extend this to place semantics, the latter incorporating the visual similarity among photos as well. Our framework relates to event clustering approaches, particularly in personal photo collections [6]. However, we also embody the context of social events to improve detection and retrieval performance. We believe that further understanding and research are needed on how to best exploit and process the information collaborative photo collections hold in the context of social events.

3. OBJECTIVE AND APPROACH
In this paper, we outline a framework that builds upon and extends our previous works [7] and [8], in which we detect social events and retrieve associated photos in collaborative photo collections. Moreover, we classify photos according to event types such as music concerts or sports games. We test our approach against both challenges laid out by the MediaEval 2013 SED Benchmark: the goal of Challenge I relates to detecting social events and retrieving associated photos, and the goal of Challenge II relates to classifying photos according to event types.

3.1 Preprocessing: Propagating Locations
The most useful information to us with respect to social events is: the involved people (based on the username of the person who uploaded the photos); the date and time at which the photos are captured; and the geographic location (venue) at which an event takes place. Our reasoning for this is the assumed constraint that photos sharing the same involved people, date and time as well as geographic location shall belong to the same event. Likewise, photos that differ in at least one constraint shall not belong together. Thus, we extract, propagate and incorporate as much information from these three domains as possible. While date and time as well as usernames (involved people) are usually available, the geographic location is often unavailable (for example, only newer smartphones embed location coordinates). As detailed in our previous paper [8], we take advantage of this constraint in a preprocessing step to propagate geographic locations across the photo collection based on those photos that do include geographic coordinates or textual references such as Barcelona.
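The propagation rule itself is detailed in [8]; as a rough illustration only, the sketch below assumes photos are simple records with user, time and optional latlon fields, and lets a photo without coordinates inherit the location of the temporally closest geo-tagged photo by the same user within a fixed time gap. The field names and the 8-hour gap are assumptions, not the configuration used in the paper.

```python
from datetime import timedelta

# Hypothetical photo records: {"id", "user", "time" (datetime), "latlon" (tuple or None)}
def propagate_locations(photos, max_gap=timedelta(hours=8)):
    """Fill in missing geographic coordinates by borrowing them from photos
    of the same user taken close in time (a simplified reading of [8])."""
    by_user = {}
    for p in photos:
        by_user.setdefault(p["user"], []).append(p)

    for user_photos in by_user.values():
        user_photos.sort(key=lambda p: p["time"])
        known = [p for p in user_photos if p["latlon"] is not None]
        for p in user_photos:
            if p["latlon"] is not None or not known:
                continue
            # Borrow the location of the temporally closest geo-tagged photo,
            # but only if it lies within the allowed time gap.
            nearest = min(known, key=lambda k: abs(k["time"] - p["time"]))
            if abs(nearest["time"] - p["time"]) <= max_gap:
                p["latlon"] = nearest["latlon"]
    return photos
```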
3.2 Feature Extraction
To aid event detection, retrieval and classification as explained in the following two sections, we extract and compose textual features from each photo's title, description and keywords. First, we apply a Roman preprocessor that converts text into lower case, strips punctuation as well as whitespace and removes accents. In the next step, we split the words into tokens. To accommodate other languages as well as misspelled or varied terms, we apply a language-agnostic character-based tokenizer rather than a word-based tokenizer. We then use a vectorizer to convert the tokens into a matrix of occurrences. To make up for photos with a large amount of textual annotations, we also consider the total number of tokens. This approach is commonly referred to as Term Frequencies. Instead of decomposing the resulting feature matrix, we simply limit the number of features to 9600, which results in almost comparable performance at much lower computational complexity.

In addition to textual features, we also extract and incorporate visual GIST features (a feature vector with 960 elements) for each photo. To fuse textual and visual features, we normalize both features and concatenate them into a combined feature vector. We also incorporate a weighting ratio that allows us to emphasize one or the other feature.
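The terms "vectorizer" and "Linear Support Vector Classifier" suggest a scikit-learn style pipeline, though the paper does not name the library. The sketch below shows one plausible realization of the described textual features (lower-casing, accent stripping, character-based tokens, term frequencies capped at 9600 features) and of the normalized, weighted fusion with precomputed GIST vectors; the n-gram size, the text_weight parameter and the gist_features input are assumptions, not the authors' actual configuration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# texts: one string per photo (title + description + keywords, concatenated)
# gist_features: precomputed GIST descriptors, shape (n_photos, 960)
def build_features(texts, gist_features, text_weight=0.5):
    # Character n-gram term frequencies: language-agnostic, lower-cased,
    # accents stripped; vocabulary capped at 9600 features as in the paper.
    vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3),
                                 lowercase=True, strip_accents="unicode",
                                 max_features=9600)
    text_tf = vectorizer.fit_transform(texts)

    # Normalize each modality so neither dominates purely by scale, then
    # apply a weighting ratio and concatenate into one combined vector.
    text_norm = normalize(text_tf) * text_weight
    gist_norm = normalize(np.asarray(gist_features)) * (1.0 - text_weight)
    return hstack([text_norm, csr_matrix(gist_norm)]).tocsr()
```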
3.3 Event Detection and Retrieval
We define an event as a distinct combination of a spatial window (5 km clusters) and a temporal window (8 h clusters). We start with a list of all suitable spatio-temporal window combinations (the results of Section 3.1). If we retrieve more than two photos, as we explain next, we consider the combination a detected event and the retrieved photos as part of that event.
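As an illustration of this event definition, the sketch below (using the same assumed record layout as the propagation sketch above) quantizes capture time into 8-hour bins and location into roughly 5 km grid cells, and keeps only combinations holding more than two photos. The grid-based binning merely stands in for whatever spatio-temporal clustering the authors actually apply.

```python
from collections import defaultdict

def candidate_events(photos, hours=8, km=5.0):
    """Group photos into spatio-temporal windows; a combination with more
    than two photos is treated as a detected event (simplified sketch)."""
    events = defaultdict(list)
    for p in photos:
        if p["latlon"] is None:
            continue  # photos without a spatial window are handled by the classifier below
        lat, lon = p["latlon"]
        t_bin = int(p["time"].timestamp() // (hours * 3600))
        # ~111 km per degree of latitude; a crude fixed-size grid cell.
        s_bin = (int(lat * 111.0 / km), int(lon * 111.0 / km))
        events[(t_bin, s_bin)].append(p)
    return {key: group for key, group in events.items() if len(group) > 2}
```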
For the actual retrieval, we first include all photos whose date, time and available geographic coordinates fall into an event's spatio-temporal window (we denote these as candidate photos). Thereafter, we employ a Linear Support Vector Classifier (using the features whose extraction we explain in Section 3.2) for all remaining photos that only fall into an event's temporal window, but whose spatial window we are not aware of. For each event, we train a separate model and perform binary classification: photos are either related or not related to an event. We use the candidate photos for the related class, and a small, random subset of photos (that do not fall within the same spatio-temporal window) for the not-related class.

In the last step of our event-driven retrieval framework, we include photos that are likely relevant to a retrieval query but may have been mistakenly discarded by the classification step. In particular, these might be photos that are linked to users who have multiple photos relevant to a retrieval query. The assumption is that if a user attends a social event and takes photos, then most of the photos taken over the time the user attends the event are likely of that event.
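A minimal sketch of the per-event binary classification step, assuming the fused feature matrix from the earlier sketch together with index lists of candidate (in-window) photos, temporal-only photos and randomly sampled negatives. LinearSVC mirrors the Linear Support Vector Classifier named above, while the index bookkeeping and the final user-based re-inclusion heuristic are left out for brevity.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.svm import LinearSVC

def retrieve_event_photos(X, candidate_idx, temporal_only_idx, negative_idx):
    """Per-event binary classification (Section 3.3): in-window candidate
    photos form the 'related' class, a small random outside sample the
    'not related' class; temporal-only photos are kept if labeled related."""
    X_train = vstack([X[candidate_idx], X[negative_idx]])
    y_train = np.r_[np.ones(len(candidate_idx)), np.zeros(len(negative_idx))]

    clf = LinearSVC()
    clf.fit(X_train, y_train)

    accepted = [i for i in temporal_only_idx if clf.predict(X[i])[0] == 1]
    return list(candidate_idx) + accepted
```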
3.4 Classification of Event Type
In this section, we extend our framework to classify the event type that a photo or multiple photos belong to. We perform the same initial constraint-based spatio-temporal clustering as in Section 3.3. This allows us to compile a larger training set by including all photos of an event in case the training ground truth is only given for some photos of that event.

Using this extended overall training set, we then train a multi-class Linear Support Vector Classifier (as in Section 3.3, based on the features that we extract in Section 3.2). In the simplest case, we can thereafter predict the event type of any given test photo. However, instead of treating test photos separately, it is also possible to consider multiple photos (that belong to the same event) together. To do so, we simply assign the most frequently predicted event type within an event to all its associated photos.
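A sketch of the event-type classification with event-wide majority voting, again assuming precomputed feature matrices and a per-photo event assignment obtained from the spatio-temporal clustering; the function and variable names are hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.svm import LinearSVC

def classify_event_types(X_train, y_train, X_test, test_event_ids):
    """Multi-class event-type classification (Section 3.4): train a linear SVC
    on the expanded training set, predict per photo, then assign each event's
    most frequent predicted type to all of its photos (majority vote)."""
    clf = LinearSVC()
    clf.fit(X_train, y_train)
    per_photo = clf.predict(X_test)

    # Event-wide joint classification: majority vote within each event.
    votes = {}
    for event_id, label in zip(test_event_ids, per_photo):
        votes.setdefault(event_id, []).append(label)
    majority = {e: Counter(labels).most_common(1)[0][0] for e, labels in votes.items()}
    return np.array([majority[e] for e in test_event_ids])
```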
4. EXPERIMENTS AND RESULTS
We perform experiments on the MediaEval 2013 SED Dataset that consists of a total of 437,370 Flickr photos (Challenge I) and 57,165 Instagram photos (Challenge II) with accompanying metadata. We use the provided training sets to estimate suitable parameter values and to train the event classification model required for Challenge II.

In the following two tables, we present our results (as evaluated by the organizers of the MediaEval Benchmark) with respect to the testing sets. The results of Challenge I show that it is important to consider temporal clusters (as newly detected events) that are not clearly associated with any geographic location (or spatial cluster). In our case, this improves the F1-score from 0.59 to 0.76. We also see that an additional classification-based expansion of an event's candidate set does not necessarily improve detection and retrieval results, or does so only in conjunction with other steps. For example, if we consider and select only one spatial cluster per matching temporal window for each involved person (username), we can further improve results by a small margin.

For Challenge II, the results show that we can better classify photos as non-events (F1-score of ~0.94) than as a specific event type. Of the eight possible event types that we trained our model on, we can best classify the types concert (~0.52), protest (~0.37) and theater-dance (~0.31). We see the worst performance with fashion (~0.07) and other (~0.05). On average, we achieve an event classification F1-score of 0.50 in our best performing configuration. Surprisingly, neither training set expansion nor event-wide joint classification notably improves results. Although we use the same feature extraction configuration for both challenges, the addition of visual features (compared to using only textual features) has a much larger positive impact for Challenge II than for Challenge I.

Table 1: Results of Challenge I depending on configuration
  Configuration                               F1     NMI
  Run 1: Run 5 - visual features              0.78   0.94
  Run 2: Basic                                0.59   0.64
  Run 3: Run 2 + include temporal clusters    0.76   0.94
  Run 4: Run 3 + expansion                    0.74   0.93
  Run 5: Run 4 + include rest + max. user     0.78   0.94

Table 2: Results of Challenge II depending on configuration
  Configuration                     F1 Non-Event   F1 Event
  Run 1: Without visual features    0.93           0.37
  Runs 2-5: Default                 0.95           0.50

5. CONCLUSION
We present a framework to detect social events, retrieve associated photos and classify the photos according to event types in tagged photo collections such as Flickr. We combine various contextual information using a constraint-based clustering and classification model. The listed benchmark results validate our approach. In the future, we wish to improve event detection by incorporating information from social networks.

This work is partially supported by EU project CUbRIK.

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.

REFERENCES
[1] R. Troncy, B. Malocha, and A. T. S. Fialho, "Linking events with media," in I-SEMANTICS, 2010, pp. 1–4.
[2] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva, "Social Event Detection at MediaEval 2013: Challenges, Datasets, and Evaluation," in MediaEval 2013 Workshop, 2013.
[3] L. Chen and A. Roy, "Event detection from Flickr data through wavelet-based spatial analysis," in CIKM, 2009, pp. 523–532.
[4] T. Rattenbury, N. Good, and M. Naaman, "Towards automatic extraction of event and place semantics from Flickr tags," in SIGIR, 2007, pp. 103–110.
[5] S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, and A. Vakali, "Cluster-based landmark and event detection on tagged photo collections," IEEE MultiMedia, 2010.
[6] M. Cooper, J. Foote, A. Girgensohn, and L. Wilcox, "Temporal event clustering for digital photo collections," TOMCCAP, pp. 269–288, 2005.
[7] M. Brenner and E. Izquierdo, "Social Event Detection and Retrieval in Collaborative Photo Collections," in ICMR, 2012.
[8] M. Brenner and E. Izquierdo, "Event-driven Retrieval in Collaborative Photo Collections," in WIAMIS, 2013.