Synchronization of Multi-User Event Media at MediaEval 2015: Task Description, Datasets, and Evaluation

Nicola Conci, DISI - University of Trento, Trento, Italy, nicola.conci@unitn.it
Francesco De Natale, DISI - University of Trento, Trento, Italy, francesco.denatale@unitn.it
Vasileios Mezaris, CERTH - ITI, Thermi, Greece, bmezaris@iti.gr
Mike Matton, VRT, Belgium, mike.matton@innovatie.vrt.be

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT

The objective of this paper is to provide an overview of the Synchronization of Multi-User Event Media (SEM) Task, which is part of the MediaEval Benchmark for Multimedia Evaluation. The SEM task was initially presented at MediaEval in 2014, with the goal of proposing a challenge in aligning multiple users' photo galleries related to the same event but with unreliable timestamps. Besides aligning the pictures on a common timeline, participants were also required to detect the sub-events and cluster the pictures accordingly. For 2015 we have decided to extend the task to other types of media, thus including audio and video information for a more complete and diversified representation of the analyzed event.

1. INTRODUCTION

The ever increasing number of devices for the collection of personal data (smartphones, portable cameras, audio recorders) has led to the generation of huge amounts of data, which can be either stored for personal records or shared among friends, relatives, or social networks. In all cases, being able to arrange such a vast amount of media is of critical importance for indexing, categorization, and retrieval. This makes it possible for any user who attended, or is simply interested in the event, to recreate the event according to their own personal experience, namely through summaries, stories, and personalized albums [2][4].

However, such a large amount of data is often unstructured and heterogeneous. The strong variability (and sometimes similarity) in terms of content and archiving strategies makes it difficult to manually organize all the event-related material in a simple yet effective manner. In this respect, it would be desirable to find a consistent way of presenting the media galleries captured during an event [7]. This task is not trivial, since timing and location information attached to the captured media (mostly timestamps and GPS) could be inaccurate or missing [3]. This lack of information is even more pronounced when people use devices that do not have a direct connection to the Internet, thus requiring manual setting of the clock, especially after a battery discharge or replacement. If the temporal information is not represented correctly, there is a concrete risk of a misleading interpretation of the media collection, with a high probability of losing part of the semantics of the event due to a bad alignment along the temporal axis. Under such conditions, videos could be of great help, since they contain both audio and visual information that could be extremely relevant in providing additional details about the ongoing event compared to the sole presence of audio and still pictures.

The SEM task presented in 2014 dealt only with still pictures, and the results provided by the different teams were encouraging. Participating teams tackled the problem in different ways. The authors in [5] proposed an approach based on the extraction of visual features (SIFT) to find the image pairs across the galleries that exhibit strong similarity. Then a non-homogeneous linear equation system is constructed to constrain the time offsets between the galleries based on these matching pairs to determine an approximate solution. Sansone et al. [6] based their implementation on the use of a Markov Random Field to find the best correspondences between the images belonging to two different photo galleries. Zaharieva et al. [8] proposed two multimodal approaches that employ both visual and time information for the synchronization of different image galleries. The first approach relies on the pairwise comparison of images in order to link different galleries, while in the second approach X-means clustering is applied, and the time offsets are estimated by calculating the average time differences within the clusters. Apostolidis et al. [1] also proposed a method relying on the combination of different visual features, and using the images exhibiting the strongest similarity to compute the gallery offsets.

2. TASK DESCRIPTION

In our scenario we imagine a number of users attending the same event and taking photos and videos with different non-synchronized devices (smartphones, compact cameras, DSLRs, tablets). Each user contributes to the task with one gallery, which includes an arbitrary number of photos, audio files and videos. Assuming that users would like to merge their photo galleries into a single event-related collection, the best temporal alignment among the galleries should be found, so as to correctly report and preserve the temporal evolution of the event. Furthermore, considering the high variability in terms of acquisition devices, we cannot expect the clocks of the devices to be synchronized, either in terms of precision or in terms of the time zone set by the users. In addition, in some cases the location data could also be unavailable (not all devices have a GPS onboard), further reducing the chances of a correct event reconstruction. These factors may considerably hinder the quality of the alignment, thus different solutions should be envisaged, encompassing the joint analysis of temporal data, position information, and audio-visual similarity.

The SEM task expects teams to provide the estimated time offset between different galleries of pictures collected by different users and cameras. The goal can be summarised as follows: given a set of media collections (galleries) taken by different users/devices at the same event, find the best (relative) time alignment among them at gallery level, and detect the significant sub-events over the whole event collection.

3. DATASETS

For this challenge we make available four different datasets, exhibiting different challenges.

The first dataset is related to the Tour de France 2014. It consists of images taken during the event and collected from Flickr. The dataset is split into 33 galleries and covers the entire competition. Some images are also provided with GPS information together with the timestamp.

A second dataset concerns the famous exhibition held every year in California, namely NAMM 2015. The dataset consists of 420 images and 32 videos, split into 19 galleries. Each user gallery contains a variable number of media (ranging from 12 to 49). All images are downloaded from Flickr, while videos are downloaded from YouTube.

The Spring Party Salesiani 2015 is a dataset collected by the organizers, and recorded during a students' party held in Trento, Italy. It is composed of videos and pictures captured by the attendees during the event. Also in this case a gallery corresponds to the user's device, and media are complemented with the corresponding timestamps.

The last dataset, Salford Test Shoot, includes 403 audio and 58 video files. Time-codes are available for most of the media.

All datasets are provided with the corresponding ground truth, extracted by considering the acquisition time of the media and manually verified to check the consistency with respect to the captured event. The datasets related to the Tour de France 2014, NAMM 2015, and Spring Party Salesiani 2015 include material subject to Creative Commons license and are freely available for download (mmlab.disi.unitn.it/MediaEvalSEM2015). The Salford dataset is instead accessible via the ICoSOLE project website (https://icosole.lab.vrt.be/viewer/home).

4. METRICS AND EVALUATION

Each submission will be evaluated in terms of: i) time synchronization error, and ii) sub-event detection error.

Concerning the first one, the goal of the participants is to maximize the number of galleries for which the synchronization error is below a predefined threshold $\Delta E_{\max}$, and to minimize the time shift of those galleries. The synchronization error for a gallery $G_i$ with respect to the reference $G_r$ is defined as $\Delta E_{ir} = \Delta T_{ir} - \Delta T^{*}_{ir}$, where $\Delta T_{ir}$ and $\Delta T^{*}_{ir}$ are the delay between $G_i$ and $G_r$ calculated on the participants' submission and ground truth, respectively. The threshold $\Delta E_{\max}$ depends on the duration of the sub-events in the dataset, and represents the maximum accepted time lapse within which we consider a gallery as reasonably well-synchronized. We use the above quantities in order to estimate the synchronization precision (Eq. (1)) and accuracy (Eq. (2)):

\[ \mathrm{Precision} = \frac{M}{N-1} = \frac{\mathrm{Card}(\Delta E_{ir} < \Delta E_{\max})}{N-1} \tag{1} \]

\[ \mathrm{Accuracy} = 1 - \frac{\sum_{i=1}^{N-1} \Delta E_{ir}}{(N-1)\,\Delta E_{\max}} \tag{2} \]

Precision measures the number of galleries ($M$) over the total number of galleries ($N-1$, excluding the reference) that have been correctly synchronized. With the accuracy we instead evaluate the capabilities of the teams in minimizing the average time lapse calculated over the $M$ synchronized galleries, normalized with respect to the maximum accepted time lapse.

The synchronization task provides a basis for the clustering task. Once the galleries are synchronized, it is possible to cluster the whole event collection to detect sub-events occurring within the entire event. Sub-events are defined in a neutral and unbiased way (e.g., making reference to the calendar/schedule of the event) and coded into the ground truth. We measure the performance of the sub-event clustering over the whole synchronized collection of media. For this, we use the Jaccard index $JI$ and the clustering $F_1$ score (Eq. (3)), where for computing the latter we use $P$ and $R$, which represent the Precision and Recall, respectively.

\[ JI = \frac{TP}{TP + FP + FN}, \qquad F_1 = \frac{2PR}{P + R} \tag{3} \]

In the formulation above we declare a true positive (TP) when two images related to the same sub-event are put in the same cluster, and a true negative (TN) when two images belonging to different sub-events are assigned to two different clusters. False positives (FP) occur instead when two images are assigned to the same cluster although belonging to different sub-events. Symmetrically, false negatives (FN) occur when two images related to the same sub-event are assigned to different clusters.

5. CONCLUSIONS

In this paper we have presented the Synchronization of Multi-User Event Media task held at MediaEval 2015. The competing teams will be evaluated on four datasets collected by the organizers and made available online together with the corresponding ground truth. For the evaluation, both the synchronization and the clustering performances will be evaluated, by measuring the gallery offsets and computing the F1 score, respectively.

Acknowledgments

This work was supported in part by the EC under contract FP7-600826 ForgetIT. We would like to thank Alessio Xompero and Kostantinos Apostolidis for their precious help in collecting and annotating the images for the datasets used in the task.

6. REFERENCES

[1] K. Apostolidis, C. Papagiannopoulou, and V. Mezaris. CERTH at MediaEval 2014 Synchronization of Multi-User Event Media Task. In Proc. MediaEval 2014 Workshop, CEUR vol. 1263, 2014.
[2] M. Broilo, G. Boato, and F. De Natale. Content-based Synchronization for Multiple Photos Galleries. In Proc. IEEE Int. Conf. on Image Processing (ICIP), pages 1945–1948, 2012.
[3] N. Conci, F. De Natale, and V. Mezaris. Synchronization of Multi-User Event Media (SEM) at MediaEval 2014: Task Description, Datasets, and Evaluation. In Proc. MediaEval 2014 Workshop, CEUR vol. 1263, 2014.
[4] G. Kim and E. P. Xing. Jointly Aligning and Segmenting Multiple Web Photo Streams for the Inference of Collective Photo Storylines. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 620–627, 2013.
[5] P. Nowak, M. Thaler, H. Stiegler, and W. Bailer. JRS at Event Synchronization Task. In Proc. MediaEval 2014 Workshop, CEUR vol. 1263, 2014.
[6] E. Sansone, G. Boato, and M.-S. Dao. Synchronizing Multi-User Photo Galleries with MRF. In Proc. MediaEval 2014 Workshop, CEUR vol. 1263, 2014.
[7] J. Yang, J. Luo, J. Yu, and T. Huang. Photo Stream Alignment and Summarization for Collaborative Photo Collection and Sharing. IEEE Transactions on Multimedia, 14(6):1642–1651, Dec 2012.
[8] M. Zaharieva, M. Riegler, and M. Del Fabro. Multimodal Synchronization of Image Galleries. In Proc. MediaEval 2014 Workshop, CEUR vol. 1263, 2014.
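
As an illustration of the metrics in Section 4, the following is a minimal Python sketch of Eqs. (1)-(3), not the official evaluation script. The function names, the dictionary-based inputs, and the use of seconds are assumptions made for this example. Two further assumptions are worth stating: the synchronization error is taken in absolute value so that it can be compared against the threshold, and the accuracy sum follows Eq. (2) as printed, running over all N-1 non-reference galleries. Precision and Recall for the F1 score are computed with the usual pair-counting definitions, P = TP/(TP+FP) and R = TP/(TP+FN), which the task description does not spell out.

```python
from itertools import combinations


def sync_metrics(estimated, truth, max_error):
    """Synchronization precision (Eq. 1) and accuracy (Eq. 2).

    `estimated` and `truth` map each non-reference gallery id to its time
    offset (here in seconds) relative to the reference gallery.
    Assumption: the error is taken as an absolute value.
    """
    errors = [abs(estimated[g] - truth[g]) for g in truth]
    n_minus_1 = len(errors)                           # galleries excluding the reference
    synced = sum(1 for e in errors if e < max_error)  # M in Eq. (1)
    precision = synced / n_minus_1
    accuracy = 1.0 - sum(errors) / (n_minus_1 * max_error)
    return precision, accuracy


def clustering_metrics(predicted, truth):
    """Pair-counting Jaccard index and F1 score (Eq. 3).

    `predicted` and `truth` map each media item id to a cluster /
    sub-event label. Every unordered pair of items is classified as
    TP, FP, FN, or TN.
    """
    tp = fp = fn = 0
    for a, b in combinations(list(truth), 2):
        same_pred = predicted[a] == predicted[b]
        same_true = truth[a] == truth[b]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        # pairs that differ in both labelings are true negatives (unused)
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return jaccard, f1


if __name__ == "__main__":
    # Hypothetical offsets for three galleries and a 50-second threshold.
    est = {"g1": 10.0, "g2": 100.0, "g3": -5.0}
    gt = {"g1": 0.0, "g2": 0.0, "g3": 0.0}
    print(sync_metrics(est, gt, max_error=50.0))

    # Hypothetical sub-event labels for four images.
    pred = {"a": 0, "b": 0, "c": 0, "d": 1}
    true = {"a": 0, "b": 0, "c": 1, "d": 1}
    print(clustering_metrics(pred, true))
```

Note that, with Eq. (2) taken as printed, a single gallery with a very large error can drive the accuracy below zero; restricting the sum to the M synchronized galleries, as the accompanying text suggests, would keep it within [0, 1].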