UPC at MediaEval 2013 Social Event Detection Task

Daniel Manchon-Vizuete, Pixable, New York, USA (dmanchon@gmail.com)
Xavier Giro-i-Nieto, Universitat Politecnica de Catalunya, Barcelona, Catalonia (xavier.giro@upc.edu)

ABSTRACT
These working notes present the contribution of the UPC team to the Social Event Detection (SED) task at MediaEval 2013. The proposal extends the previous PhotoTOC work to the domain of shared collections of photographs stored in cloud services. An initial over-segmentation of the photo collection is later refined by merging pairs of similar clusters.

1. INTRODUCTION
These working notes describe the algorithms tested by the UPC team in the MediaEval 2013 Social Event Detection (SED) task. The reader is referred to the task description [2] for further details about the use case, dataset and metrics. Our team participated only in Task 1, where all images were to be clustered into events.

The proposed approach aims at a computationally light solution capable of dealing with large amounts of data. This requirement is especially sensitive when dealing not only with large amounts of data, but also with large numbers of users. The SED task provides a dataset with photos from different users, so the events to be detected involve several users. This set-up suggests a computational solution run as a centralised and shared service in the cloud, in contrast to other scenarios where each user's data can be processed on the client side. Any computation in the cloud typically implies an economic cost for the server which, in many cases, is not directly charged to the user but assumed by the intermediate photo management service. For this reason, it is a high priority that any solution involve only light computations, discarding any pixel-level operation that would require decoding and processing the images.

In addition, the SED task presents an inherent challenge due to the incompleteness of the photo metadata. The provided dataset contains real photos with missing or corrupted information, such as non-geolocalised images, or identical time stamps for the moments when a photo was taken and uploaded. These situations are especially common when dealing with online photo management services, which face heterogeneous upload sources and, in many cases, remove the EXIF metadata of the photos. These drawbacks have been partially addressed in the proposed solution, which combines the diversity of metadata sources (time stamps, geolocation and textual labels) in this challenging context.

In our approach, no external data is used, so all submitted runs belong to the required type (as specified in the SED overview paper [2]).

These working notes are structured as follows. Section 2 describes the existing PhotoTOC system, which has been adopted to produce an initial over-segmentation of the dataset. Section 3 then presents how the over-segmented clusters are merged considering different metadata sources. The performance of the solution is assessed in Section 4 with the results obtained on the MediaEval SED 2013 task. Finally, Section 5 provides the insights learned and points at future research directions.

2. RELATED WORK
The adopted solution is inspired by an earlier work from Microsoft Research [1] named PhotoTOC (Photo Table of Contents). In that design, photos are initially sorted according to their creation time stamp and sequentially clustered by estimating the location of event boundaries. A new event boundary is created whenever the time gap (g_i) between two consecutive photos is much larger than the average of the time differences in a temporal window around it. In particular, a new event is created whenever the criterion shown in Equation 1 is satisfied,

    \log(g_N) \geq K + \frac{1}{2d+1} \sum_{i=-d}^{d} \log(g_{N+i})    (1)

where PhotoTOC empirically sets the configuration parameters to d = 10 and K = log(17).
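As a reference for the later sections, the following sketch (ours, not code from the original PhotoTOC paper) shows how the boundary criterion of Equation 1 can be applied to a sorted list of time stamps; the treatment of window borders and of zero-valued gaps is an assumption on our side.

```python
import math

def detect_event_boundaries(timestamps, K=math.log(17), d=10):
    """Indices i such that a new event starts at photo i (Equation 1).

    timestamps: photo creation times in seconds, sorted in increasing order.
    """
    # Time gaps between consecutive photos; a small floor avoids log(0)
    # when two photos share the same time stamp (an assumption on our side).
    gaps = [max(t2 - t1, 1e-6) for t1, t2 in zip(timestamps, timestamps[1:])]
    boundaries = []
    for n, gap in enumerate(gaps):
        # Average log-gap over a window of up to 2d+1 gaps centred on gap n;
        # the window is simply clipped at the borders of the collection.
        window = gaps[max(0, n - d): n + d + 1]
        avg_log_gap = sum(math.log(g) for g in window) / len(window)
        # Equation 1: a boundary is declared when the gap exceeds the local
        # average log-gap by at least the margin K.
        if math.log(gap) >= K + avg_log_gap:
            boundaries.append(n + 1)  # the new event starts at photo n + 1
    return boundaries
```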
When the creation time is missing from the EXIF metadata, PhotoTOC uses the file creation time instead. Whenever a cluster is larger than 23 photos, the event is considered too large and is split based on colour features; this content-based clustering generates a number of sub-clusters equal to 1/12 of the amount of photographs in the large cluster.

The main drawback of the PhotoTOC approach is the need for an image processing analysis to estimate the content-based similarity. In our work, the visual modality was discarded and substituted by geolocation and textual labels as additional information to the creation time. Moreover, in the SED task images from different users are taken with different cameras and from different points of view, all of which makes the visual analysis less reliable. There is no guarantee either that the empirically set values proposed in PhotoTOC would be useful on another dataset, nor is it clear from the paper how they were estimated.

3. APPROACH
Two solutions have been tested in our submission, both having a common starting point in the time-based clustering solution proposed by PhotoTOC. In both solutions, the initial time-based clusters are compared based on their associated geolocation, textual labels and user IDs. The first solution relies on manually tuned weights for each criterion, while the second introduces an estimation of the relevance of each feature type.

3.1 User and time-based over-segmentation
The first step in the proposed solution considers the photos of each user separately. The time-based clustering algorithm proposed by PhotoTOC is applied, independently optimising the configuration parameters K and d on the training dataset provided by MediaEval. The obtained values were K = log(150) and d = 40, which clearly differ from the ones proposed in [1]. During this first stage, those images whose Date taken matches their Date uploaded are not processed, as their time stamp is considered corrupted. As a result, an over-segmentation into mini-clusters is obtained, as sketched below.
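A minimal sketch of this first stage, assuming photos are given as records with the hypothetical fields user, taken and uploaded (the actual data layout is not specified in these notes), and reusing detect_event_boundaries from the earlier sketch:

```python
import math
from collections import defaultdict

def oversegment(photos, K=math.log(150), d=40):
    """Per-user time-based over-segmentation into mini-clusters."""
    by_user = defaultdict(list)
    for p in photos:
        # Photos whose Date taken equals their Date uploaded are skipped,
        # as their time stamp is considered corrupted.
        if p["taken"] != p["uploaded"]:
            by_user[p["user"]].append(p)
    mini_clusters = []
    for user_photos in by_user.values():
        user_photos.sort(key=lambda p: p["taken"])
        times = [p["taken"] for p in user_photos]
        # Event boundaries from the PhotoTOC criterion (see earlier sketch),
        # with the re-optimised parameters K = log(150) and d = 40.
        cuts = [0] + detect_event_boundaries(times, K=K, d=d) + [len(times)]
        mini_clusters += [user_photos[a:b] for a, b in zip(cuts, cuts[1:])]
    return mini_clusters
```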
Each of these mini-clusters is characterised by its averaged time, averaged geolocation, aggregated set of textual labels and associated user ID. These are the features used in the posterior stages to assess the similarity between mini-clusters.

3.2 Cluster merges
The set of time-sorted clusters is sequentially analysed in increasing time order. Each cluster is compared with the forthcoming 15 clusters, a window set to avoid excessive computational time. Two clusters are merged whenever a similarity measure is above an estimated threshold. The submitted runs have considered two options for assessing this similarity: a first one that takes a binary decision for each criterion and fuses them with manually set weights, and a second one where each individual similarity measure is normalised and later fused with a learned weight.

Method 1: Binary decisions and manual weights
This method compares each pair of clusters separately and takes a binary decision for each criterion. The geolocation coordinates are compared with the Haversine distance, the textual label sets with the Jaccard index, and the user IDs with a simple equality test. The three binary decisions are linearly fused with a weighting scheme of 0.2 for geolocation, 0.2 for text and 0.4 for user ID. Two clusters are merged if the fused combination exceeds 0.3.

The binary decision for each criterion is based on a specific similarity threshold learned by optimisation on the training dataset. This process has assumed independence between the different features, so each of them has been treated separately.

Method 2: Weighted fusion of normalised distances
This second solution emerged from the need for a more refined algorithm to combine the different metadata features. In this case, the individual binary decisions are replaced by a single fused similarity value.

This fusion requires a normalisation of the distance values based on the provided training data. This normalisation was based on the computation of the distances between 3,000 random pairs of photos selected from the training set and belonging to the same event. The estimated mean and standard deviation were used to compute the value of the phi function, which is basically a mapping of the z-score into the range between 0.0 and 1.0.

After normalisation, it is still necessary to estimate the weight of each modality to be later applied in the linear fusion. These weights were estimated according to the individual gain of each type of feature studied in Method 1. The results shown in Table 1 indicate that the most important cue for the fusion of two clusters is that both of them belong to the same user ID, while geolocation and textual labelling have similar relevance. These experimental values validate the empirical weighting adopted in Method 1.

                      Time    Geo     Label   User
    Geolocated        0.06    0.28    0.22    0.44
    Not geolocated    0.08    -       0.30    0.60

Table 1: Feature weights for photos with and without geolocation metadata.

Finally, the training dataset was used again to estimate the merging threshold for this fused score. The experiments indicated a maximum F1-score for values between 0.3 and 0.6, so a final threshold of 0.5 was adopted. A sketch of both merging criteria is shown below.
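The following sketch is our own reading of the two merging criteria, not code from the submission. The per-feature thresholds GEO_KM and MIN_JACCARD, the stats dictionary of per-modality (mean, std) pairs, and the cluster fields time, geo, labels and user are all hypothetical, since the notes do not report them.

```python
import math

def haversine_km(p1, p2):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def jaccard(labels1, labels2):
    union = labels1 | labels2
    return len(labels1 & labels2) / len(union) if union else 0.0

# --- Method 1: binary decisions with manual weights ------------------------
# GEO_KM and MIN_JACCARD stand in for the per-feature thresholds learned on
# the training set; their actual values are not reported in the notes.
GEO_KM, MIN_JACCARD = 1.0, 0.1

def method1_merge(c1, c2):
    score = (0.2 * (haversine_km(c1["geo"], c2["geo"]) <= GEO_KM)
             + 0.2 * (jaccard(c1["labels"], c2["labels"]) >= MIN_JACCARD)
             + 0.4 * (c1["user"] == c2["user"]))
    return score > 0.3  # manual merging threshold

# --- Method 2: weighted fusion of normalised distances ---------------------
def phi(x, mean, std):
    """Map a raw distance to (0, 1) via its z-score and the Gaussian CDF."""
    z = (x - mean) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def method2_merge(c1, c2, stats, weights=(0.06, 0.28, 0.22, 0.44)):
    """stats: hypothetical dict of per-modality (mean, std) pairs estimated
    from the 3,000 same-event training pairs. Default weights are the
    geolocated row of Table 1; the other row applies without geolocation."""
    w_time, w_geo, w_label, w_user = weights
    # A small phi means a small (similar) distance, hence similarity = 1 - phi.
    sim_time = 1.0 - phi(abs(c1["time"] - c2["time"]), *stats["time"])
    sim_geo = 1.0 - phi(haversine_km(c1["geo"], c2["geo"]), *stats["geo"])
    sim_label = jaccard(c1["labels"], c2["labels"])
    sim_user = 1.0 if c1["user"] == c2["user"] else 0.0
    score = (w_time * sim_time + w_geo * sim_geo
             + w_label * sim_label + w_user * sim_user)
    return score >= 0.5  # merging threshold estimated on the training data
```

In the full pipeline, each mini-cluster would be compared against its 15 forthcoming neighbours in time order and merged whenever one of these predicates fires.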
4. EXPERIMENTS AND RESULTS
UPC participated in Challenge 1 with the results shown in Table 2. The more optimised Method 2 corresponds to Run 1, while Runs 2 and 3 correspond to Method 1 optimised with respect to F1 or NMI, respectively. As expected, the values obtained for Method 2 outperform the two runs associated with Method 1.

                      F1        NMI       Divergence F1
    Method 1 (F1)     0.8798    0.9720    0.8268
    Method 1 (NMI)    0.8753    0.9710    0.8220
    Method 2          0.8833    0.9731    0.8316

Table 2: UPC results in Challenge 1.

5. CONCLUSIONS
The presented technique has allowed a fast resolution of the clustering of photos based only on numerical and textual metadata. The obtained results seem reasonable for assisting real users in the organisation of shared collections of photographs. However, the authors consider that the presented work may still benefit from an optimised set of similarity thresholds adapted to the type of event.

6. REFERENCES
[1] J. C. Platt, M. Czerwinski, and B. Field. PhotoTOC: Automatic clustering for browsing personal photographs. In Proc. 4th Pacific Rim Conference on Multimedia, vol. 1, pp. 6-10, 2003.
[2] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva. Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.