UPC at MediaEval 2013 Social Event Detection Task

Daniel Manchon-Vizuete, Pixable, New York, USA (dmanchon@gmail.com)
Xavier Giro-i-Nieto, Universitat Politecnica de Catalunya, Barcelona, Catalonia (xavier.giro@upc.edu)

ABSTRACT
These working notes present the contribution of the UPC team to the Social Event Detection (SED) task at MediaEval 2013. The proposal extends the previous PhotoTOC work to the domain of shared collections of photographs stored in cloud services. An initial over-segmentation of the photo collection is later refined by merging pairs of similar clusters.

1. INTRODUCTION
These working notes describe the algorithms tested by the UPC team in the MediaEval 2013 Social Event Detection (SED) task. The reader is referred to the task description [2] for further details about the use case, dataset and metrics. Our team participated only in Task 1, where all images were to be clustered into events.

The proposed approach aims at a computationally light solution capable of dealing with large amounts of data. This requirement is especially sensitive when dealing not only with large amounts of data, but also with large numbers of users. The SED task provides a dataset with photos from different users, so the events to be detected involve several users. This set-up suggests a computational solution run as a centralised and shared service in the cloud, in contrast to other scenarios where each user's data can be processed on the client side. Any computation in the cloud typically implies an economic cost for the server which, in many cases, is not directly charged to the user but assumed by the intermediate photo management service. For this reason, it is a high priority that any solution involve only light computations, discarding any pixel-level operation that would require decoding and processing the images.

In addition, the SED task presents an inherent challenge due to the incompleteness of the photo metadata. The provided dataset contains real photos with missing or corrupted information, such as non-geolocalised images, or identical time stamps for the moments when a photo was taken and uploaded. These situations are especially common when dealing with online photo management services, which face heterogeneous upload sources and, in many cases, remove the EXIF metadata of the photos. These drawbacks have been partially addressed in the proposed solution, which combines the diversity of metadata sources (time stamps, geolocation and textual labels) in this challenging context.

In our approach, no external data is used, so all submitted runs belong to the required type (as specified in the SED overview paper [2]).

These working notes are structured as follows. Section 2 describes the existing PhotoTOC system, which has been adopted to produce an initial over-segmentation of the dataset. Section 3 then presents how the over-segmented clusters are merged considering different metadata sources. The performance of the solution is assessed in Section 4 with the results obtained on the MediaEval SED 2013 task. Finally, Section 5 provides the insights learned and points at future research directions.

2. RELATED WORK
The adopted solution is inspired by an earlier work from Microsoft Research [1] named PhotoTOC (Photo Table of Contents). In that design, photos are initially sorted according to their creation time stamp and sequentially clustered by estimating the location of event boundaries. A new event boundary is created whenever the time gap (g_i) between two consecutive photos is much larger than the average of the time differences in a temporal window around it. In particular, a new event is created whenever the criterion shown in Equation 1 is satisfied,

    \log(g_N) \geq K + \frac{1}{2d+1} \sum_{i=-d}^{d} \log(g_{N+i})    (1)

where PhotoTOC empirically sets the configuration parameters to d = 10 and K = log(17).
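As a reference for the later sections, the following sketch (ours, not code from the original PhotoTOC paper) shows how the boundary criterion of Equation 1 can be applied to a sorted list of time stamps; the treatment of window borders and of zero-valued gaps is an assumption on our side.

```python
import math

def detect_event_boundaries(timestamps, K=math.log(17), d=10):
    """Indices i such that a new event starts at photo i (Equation 1).

    timestamps: photo creation times in seconds, sorted in increasing order.
    """
    # Time gaps between consecutive photos; a small floor avoids log(0)
    # when two photos share the same time stamp (an assumption on our side).
    gaps = [max(t2 - t1, 1e-6) for t1, t2 in zip(timestamps, timestamps[1:])]
    boundaries = []
    for n, gap in enumerate(gaps):
        # Average log-gap over a window of up to 2d+1 gaps centred on gap n;
        # the window is simply clipped at the borders of the collection.
        window = gaps[max(0, n - d): n + d + 1]
        avg_log_gap = sum(math.log(g) for g in window) / len(window)
        # Equation 1: a boundary is declared when the gap exceeds the local
        # average log-gap by at least the margin K.
        if math.log(gap) >= K + avg_log_gap:
            boundaries.append(n + 1)  # the new event starts at photo n + 1
    return boundaries
```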
When the creation time is missing from the EXIF metadata, PhotoTOC uses the file creation time instead. Whenever a cluster is larger than 23 photos, the event is considered too large and is split based on colour features; this content-based clustering generates a number of sub-clusters equal to 1/12 of the amount of photographs in the large cluster.

The main drawback of the PhotoTOC approach is the need for an image processing analysis to estimate the content-based similarity. In our work, the visual modality was discarded and substituted by geolocation and textual labels as additional information to the creation time. Moreover, in the SED task images from different users are taken with different cameras and from different points of view, all of which makes the visual analysis less reliable. There is no guarantee either that the empirically set values proposed in PhotoTOC would be useful on another dataset, nor is it clear from the paper how they were estimated.

3. APPROACH
Two solutions have been tested in our submission, both having a common starting point in the time-based clustering solution proposed by PhotoTOC. In both solutions, the initial time-based clusters are compared based on their associated geolocation, textual labels and user IDs. The first solution relies on manually tuned weights for each criterion, while the second introduces an estimation of the relevance of each feature type.

3.1 User and time-based over-segmentation
The first step in the proposed solution considers the photos of each user separately. The time-based clustering algorithm proposed by PhotoTOC is applied, independently optimising the configuration parameters K and d on the training dataset provided by MediaEval. The obtained values were K = log(150) and d = 40, which clearly differ from the ones proposed in [1]. During this first stage, those images whose Date taken matches their Date uploaded are not processed, as their time stamp is considered corrupted. As a result, an over-segmentation into mini-clusters is obtained, as sketched below.
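A minimal sketch of this first stage, assuming photos are given as records with the hypothetical fields user, taken and uploaded (the actual data layout is not specified in these notes), and reusing detect_event_boundaries from the earlier sketch:

```python
import math
from collections import defaultdict

def oversegment(photos, K=math.log(150), d=40):
    """Per-user time-based over-segmentation into mini-clusters."""
    by_user = defaultdict(list)
    for p in photos:
        # Photos whose Date taken equals their Date uploaded are skipped,
        # as their time stamp is considered corrupted.
        if p["taken"] != p["uploaded"]:
            by_user[p["user"]].append(p)
    mini_clusters = []
    for user_photos in by_user.values():
        user_photos.sort(key=lambda p: p["taken"])
        times = [p["taken"] for p in user_photos]
        # Event boundaries from the PhotoTOC criterion (see earlier sketch),
        # with the re-optimised parameters K = log(150) and d = 40.
        cuts = [0] + detect_event_boundaries(times, K=K, d=d) + [len(times)]
        mini_clusters += [user_photos[a:b] for a, b in zip(cuts, cuts[1:])]
    return mini_clusters
```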
Each of these mini-clusters is characterised by its averaged time, averaged geolocation, aggregated set of textual labels and associated user ID. These are the features used in the posterior stages to assess the similarity between mini-clusters.

3.2 Cluster merges
The set of time-sorted clusters is sequentially analysed in increasing time order. Each cluster is compared with the forthcoming 15 clusters, a window set to avoid excessive computational time. Two clusters are merged whenever a similarity measure is above an estimated threshold. The submitted runs have considered two options for assessing this similarity: a first one that takes a binary decision for each criterion and fuses them with manually set weights, and a second one where each individual similarity measure is normalised and later fused with a learned weight.

Method 1: Binary decisions and manual weights
This method compares each pair of clusters separately and takes a binary decision for each criterion. The geolocation coordinates are compared with the Haversine distance, the textual label sets with the Jaccard index, and the user IDs with a simple equality test. The three binary decisions are linearly fused with a weighting scheme of 0.2 for geolocation, 0.2 for text and 0.4 for user ID. Two clusters are merged if the fused combination exceeds 0.3.

The binary decision for each criterion is based on a specific similarity threshold learned by optimisation on the training dataset. This process has assumed independence between the different features, so each of them has been treated separately.

Method 2: Weighted fusion of normalised distances
This second solution emerged from the need for a more refined algorithm to combine the different metadata features. In this case, the individual binary decisions are replaced by a single fused similarity value.

This fusion requires a normalisation of the distance values based on the provided training data. This normalisation was based on the computation of the distances between 3,000 random pairs of photos selected from the training set and belonging to the same event. The estimated mean and standard deviation were used to compute the value of the phi function, which is basically a mapping of the z-score into the range between 0.0 and 1.0.

After normalisation, it is still necessary to estimate the weight of each modality to be later applied in the linear fusion. These weights were estimated according to the individual gain of each type of feature studied in Method 1. The results shown in Table 1 indicate that the most important cue for the fusion of two clusters is that both of them belong to the same user ID, while geolocation and textual labelling have similar relevance. These experimental values validate the empirical weighting adopted in Method 1.

                      Time    Geo     Label   User
    Geolocated        0.06    0.28    0.22    0.44
    Not geolocated    0.08    -       0.30    0.60

Table 1: Feature weights for photos with and without geolocation metadata.

Finally, the training dataset was used again to estimate the merging threshold for this fused score. The experiments indicated a maximum F1-score for values between 0.3 and 0.6, so a final threshold of 0.5 was adopted. A sketch of both merging criteria is shown below.
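The following sketch is our own reading of the two merging criteria, not code from the submission. The per-feature thresholds GEO_KM and MIN_JACCARD, the stats dictionary of per-modality (mean, std) pairs, and the cluster fields time, geo, labels and user are all hypothetical, since the notes do not report them.

```python
import math

def haversine_km(p1, p2):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def jaccard(labels1, labels2):
    union = labels1 | labels2
    return len(labels1 & labels2) / len(union) if union else 0.0

# --- Method 1: binary decisions with manual weights ------------------------
# GEO_KM and MIN_JACCARD stand in for the per-feature thresholds learned on
# the training set; their actual values are not reported in the notes.
GEO_KM, MIN_JACCARD = 1.0, 0.1

def method1_merge(c1, c2):
    score = (0.2 * (haversine_km(c1["geo"], c2["geo"]) <= GEO_KM)
             + 0.2 * (jaccard(c1["labels"], c2["labels"]) >= MIN_JACCARD)
             + 0.4 * (c1["user"] == c2["user"]))
    return score > 0.3  # manual merging threshold

# --- Method 2: weighted fusion of normalised distances ---------------------
def phi(x, mean, std):
    """Map a raw distance to (0, 1) via its z-score and the Gaussian CDF."""
    z = (x - mean) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def method2_merge(c1, c2, stats, weights=(0.06, 0.28, 0.22, 0.44)):
    """stats: hypothetical dict of per-modality (mean, std) pairs estimated
    from the 3,000 same-event training pairs. Default weights are the
    geolocated row of Table 1; the other row applies without geolocation."""
    w_time, w_geo, w_label, w_user = weights
    # A small phi means a small (similar) distance, hence similarity = 1 - phi.
    sim_time = 1.0 - phi(abs(c1["time"] - c2["time"]), *stats["time"])
    sim_geo = 1.0 - phi(haversine_km(c1["geo"], c2["geo"]), *stats["geo"])
    sim_label = jaccard(c1["labels"], c2["labels"])
    sim_user = 1.0 if c1["user"] == c2["user"] else 0.0
    score = (w_time * sim_time + w_geo * sim_geo
             + w_label * sim_label + w_user * sim_user)
    return score >= 0.5  # merging threshold estimated on the training data
```

In the full pipeline, each mini-cluster would be compared against its 15 forthcoming neighbours in time order and merged whenever one of these predicates fires.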
4. EXPERIMENTS AND RESULTS
UPC participated in Challenge 1 with the results shown in Table 2. The more optimised Method 2 corresponds to Run 1, while Runs 2 and 3 correspond to Method 1 optimised with respect to F1 or NMI, respectively. As expected, the values obtained for Method 2 outperform the two runs associated with Method 1.

                      F1        NMI       Divergence F1
    Method 1 (F1)     0.8798    0.9720    0.8268
    Method 1 (NMI)    0.8753    0.9710    0.8220
    Method 2          0.8833    0.9731    0.8316

Table 2: UPC results in Challenge 1.

5. CONCLUSIONS
The presented technique has allowed a fast resolution of the clustering of photos based only on numerical and textual metadata. The obtained results seem reasonable for assisting real users in the organisation of shared collections of photographs. However, the authors consider that the presented work may still benefit from an optimised set of similarity thresholds adapted to the type of event.

6. REFERENCES
[1] J. C. Platt, M. Czerwinski, and B. Field. PhotoTOC: Automatic clustering for browsing personal photographs. In Proc. 4th Pacific Rim Conference on Multimedia, vol. 1, pp. 6-10, 2003.
[2] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva. Social Event Detection at MediaEval 2013: Challenges, datasets, and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.