     Synchronization of Multi-User Event Media at MediaEval
        2015: Task Description, Datasets, and Evaluation

                                    Nicola Conci                     Francesco De Natale
                               DISI - University of Trento           DISI - University of Trento
                                      Trento, Italy                         Trento, Italy
                               nicola.conci@unitn.it               francesco.denatale@unitn.it
                                 Vasileios Mezaris                       Mike Matton
                                     CERTH - ITI                                VRT
                                    Thermi, Greece                             Belgium
                                  bmezaris@iti.gr                  mike.matton@innovatie.vrt.be

ABSTRACT
The objective of this paper is to provide an overview of the Synchronization of Multi-User Event Media (SEM) Task, which is part of the MediaEval Benchmark for Multimedia Evaluation. The SEM task was initially presented at MediaEval in 2014, with the goal of proposing a challenge in aligning multiple users' photo galleries related to the same event but with unreliable timestamps. Besides aligning the pictures on a common timeline, participants were also required to detect the sub-events and cluster the pictures accordingly. For 2015 we have decided to extend the task to other types of media as well, thus including audio and video information for a more complete and diversified representation of the analyzed event.

1.   INTRODUCTION
   The ever-increasing number of devices for the collection of personal data (smartphones, portable cameras, audio recorders) has led to the generation of a huge amount of data, which can be either stored for personal records or shared among friends, relatives, or social networks. In all cases, being able to arrange such a vast amount of media is of critical importance for indexing, categorization, and retrieval. This makes it possible for any user who attended, or is simply interested in the event, to recreate the event according to their personal experience, namely through summaries, stories, and personalized albums [2][4].
   However, such a large amount of data is often unstructured and heterogeneous. The strong variability (and sometimes similarity) in terms of content and archiving strategies makes it difficult to manually organize all the event-related material in a simple yet effective manner. In this respect, it would be desirable to find a consistent way of presenting the media galleries captured during an event [7]. This task is not trivial, since timing and location information attached to the captured media (mostly timestamps and GPS) could be inaccurate or missing [3].
   This lack of information is even more pronounced when people use devices that do not have a direct connection to the Internet, thus requiring manual setting of the clock, especially after a battery discharge or replacement. If the temporal information is not represented correctly, there is a concrete risk of a misleading interpretation of the media collection, with a high probability of losing part of the semantics of the event due to a bad alignment along the temporal axis. Under such conditions, videos could be of great help, since they contain both audio and visual information that could be extremely relevant in providing additional details about the ongoing event compared to the sole presence of audio and still pictures.
   The SEM task presented in 2014 dealt only with still pictures, and the results provided by the different teams are encouraging. Participating teams tackled the problem in different ways. The authors in [5] proposed an approach based on the extraction of visual features (SIFT) to find the image pairs across the galleries that exhibit strong similarity. Then a non-homogeneous linear equation system is constructed to constrain the time offsets between the galleries based on these matching pairs to determine an approximate solution. Sansone et al. [6] based their implementation on the use of a Markov Random Field to find the best correspondences between the images belonging to two different photo galleries. Zaharieva et al. [8] proposed two multimodal approaches that employ both visual and time information for the synchronization of different image galleries. The first approach relies on the pairwise comparison of images in order to link different galleries, while in the second approach X-means clustering is applied, and the time offsets are estimated by calculating the average time differences within the clusters. Apostolidis et al. [1] also proposed a method relying on the combination of different visual features, and using the images exhibiting the strongest similarity to compute the gallery offsets.

2.   TASK DESCRIPTION
   In our scenario we imagine a number of users attending the same event and taking photos and videos with different non-synchronized devices (smartphones, compact cameras, DSLRs, tablets). Each user contributes to the task with one gallery, which includes an arbitrary number of photos, audio files and videos. Assuming that users would like to merge their photo galleries in a single event-related collec-
ral evolution of the event. Furthermore, considering the high variability in terms of acquisition devices, we cannot expect the clocks of each device to be synchronized, neither in terms of precision, nor in terms of the time zone set by the users. In addition, in some cases the location data could also be unavailable (not all devices have a GPS onboard), further reducing the chances of a correct event reconstruction. These factors may considerably hinder the quality of the alignment; thus different solutions should be envisaged, encompassing the joint analysis of temporal data, position information, and audio-visual similarity.
   The SEM task expects teams to provide the estimated time offset between different galleries of pictures collected by different users and cameras. The goal can be summarised as follows: given a set of media collections (galleries) taken by different users/devices at the same event, find the best (relative) time alignment among them at gallery level, and detect the significant sub-events over the whole event collection.

3.   DATASETS
   For this challenge we make available four different datasets, each posing different difficulties. The first dataset is related to the Tour de France 2014. It consists of images taken during the event and collected from Flickr. The dataset is split into 33 galleries and covers the entire competition. Some images are also provided with GPS information together with the timestamp. A second dataset concerns NAMM 2015, the well-known exhibition held every year in California. The dataset consists of 420 images and 32 videos, split into 19 galleries. Each user gallery contains a variable number of media (ranging from 12 to 49). All images are downloaded from Flickr, while videos are downloaded from YouTube. The Spring Party Salesiani 2015 is a dataset collected by the organizers, and recorded during a students' party held in Trento, Italy. It is composed of videos and pictures captured by the attendees during the event. Also in this case a gallery corresponds to the user's device, and media are complemented with the corresponding timestamps. The last dataset, Salford Test Shoot, includes 403 audio and 58 video files. Time-codes are available for most of the media. All datasets are provided with the corresponding ground truth, extracted by considering the acquisition time of the media and manually verified to check the consistency with respect to the captured event. The datasets related to the Tour de France 2014, NAMM 2015, and Spring Party Salesiani 2015 include material subject to Creative Commons license and are freely available for download¹. The Salford dataset is instead accessible via the ICoSOLE project website².

¹ mmlab.disi.unitn.it/MediaEvalSEM2015
² https://icosole.lab.vrt.be/viewer/home

4.   METRICS AND EVALUATION
   Each submission will be evaluated in terms of: i) time synchronization error, and ii) sub-event detection error.
   Concerning the first one, the goal of the participants is to maximize the number of galleries for which the synchronization error is below a predefined threshold $\Delta E_{max}$, and to minimize the time shift of those galleries. The synchronization error for a gallery $G_i$ with respect to the reference $G_r$ is defined as $\Delta E_{ir} = \Delta T_{ir} - \Delta T^{*}_{ir}$, where $\Delta T_{ir}$ and $\Delta T^{*}_{ir}$ are the delay between $G_i$ and $G_r$ calculated on the participants' submission and ground truth, respectively. The threshold $\Delta E_{max}$ depends on the duration of the sub-events in the dataset, and represents the maximum accepted time lapse within which we consider a gallery as reasonably well-synchronized. We use the above quantities in order to estimate the synchronization precision (Eq. (1)) and accuracy (Eq. (2)):

\[ \mathrm{Precision} = \frac{M}{N-1} = \frac{\mathrm{Card}(\Delta E_{ir} < \Delta E_{max})}{N-1} \qquad (1) \]

\[ \mathrm{Accuracy} = 1 - \frac{\sum_{i=1}^{N-1} \Delta E_{ir}}{(N-1)\,\Delta E_{max}} \qquad (2) \]

   Precision measures the fraction of galleries that have been correctly synchronized: the number of such galleries ($M$) over the total number of galleries ($N-1$, excluding the reference). With the accuracy we instead evaluate the capabilities of the teams in minimizing the average time lapse calculated over the $M$ synchronized galleries, normalized with respect to the maximum accepted time lapse.
   The synchronization task provides a basis for the clustering task. Once the galleries are synchronized, it is possible to cluster the whole event collection to detect sub-events occurring within the entire event. Sub-events are defined in a neutral and unbiased way (e.g., making reference to the calendar/schedule of the event) and coded into the ground truth. We measure the performance of the sub-event clustering over the whole synchronized collection of media. For this, we use the Jaccard index $JI$ and the clustering $F_1$ score (Eq. (3)), where for computing the latter we use $P$ and $R$, which represent the Precision and Recall, respectively.

\[ JI = \frac{TP}{TP + FP + FN}, \qquad F_1 = \frac{2PR}{P + R} \qquad (3) \]

   In the formulation above we declare a true positive (TP) when two images related to the same sub-event are put in the same cluster, and a true negative (TN) when two images belonging to different sub-events are assigned to two different clusters. False positives (FP) occur instead when two images are assigned to the same cluster although belonging to different sub-events. Correspondingly, a false negative (FN) occurs when two images related to the same sub-event are assigned to different clusters.

5.   CONCLUSIONS
   In this paper we have presented the Synchronization of Multi-User Event Media task held at MediaEval 2015. The competing teams will be evaluated considering four datasets collected by the organizers, and made available online together with the corresponding ground truth. For the evaluation both the synchronization and the clustering performances will be evaluated, by measuring the gallery offsets and computing the F1 score, respectively.

Acknowledgments
This work was supported in part by the EC under contract FP7-600826 ForgetIT. We would like to thank Alessio Xompero and Kostantinos Apostolidis for their valuable help in collecting and annotating the images for the datasets used in the task.
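
As an illustration of the measures defined in Section 4, the following minimal Python sketch computes the synchronization precision and accuracy (Eqs. (1) and (2)) and the pair-counting Jaccard index and F1 score (Eq. (3)). It is not the official evaluation tool: the function names, the dictionary-based input format, and the use of the absolute value of the offset difference as the synchronization error are assumptions made here for illustration. Accuracy follows Eq. (2) as printed, i.e. it sums over all N − 1 non-reference galleries. Precision P and recall R are taken with the usual pair-counting definitions, P = TP / (TP + FP) and R = TP / (TP + FN), which is also an assumption.

```python
from itertools import combinations


def sync_scores(est_offsets, true_offsets, max_error):
    """Synchronization precision (Eq. 1) and accuracy (Eq. 2).

    est_offsets / true_offsets map each non-reference gallery id to its
    time offset (in seconds) from the reference gallery.
    Assumption: the error is the absolute offset difference.
    """
    errors = [abs(est_offsets[g] - true_offsets[g]) for g in true_offsets]
    n_minus_1 = len(errors)
    # M = number of galleries whose error is below the threshold
    m = sum(1 for e in errors if e < max_error)
    precision = m / n_minus_1
    # Eq. (2) as printed: sum over all N-1 non-reference galleries
    accuracy = 1.0 - sum(errors) / (n_minus_1 * max_error)
    return precision, accuracy


def clustering_scores(predicted, truth):
    """Pair-counting Jaccard index and F1 score (Eq. 3).

    predicted maps each media item to a cluster label,
    truth maps each media item to its ground-truth sub-event.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(truth), 2):
        same_cluster = predicted[a] == predicted[b]
        same_event = truth[a] == truth[b]
        if same_cluster and same_event:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_event:
            fn += 1
        # remaining pairs are true negatives and are not needed here
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    ji = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return ji, f1


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the task datasets.
    est = {"g1": 10.0, "g2": -120.0, "g3": 190.0}
    ref = {"g1": 0.0, "g2": -100.0, "g3": 100.0}
    print(sync_scores(est, ref, max_error=60.0))

    truth = {"a": "s1", "b": "s1", "c": "s2", "d": "s2"}
    pred = {"a": 0, "b": 0, "c": 0, "d": 1}
    print(clustering_scores(pred, truth))
```

In the toy example two of the three galleries fall below the 60-second threshold, and the clustering yields one true-positive pair, two false-positive pairs, and one false-negative pair. Counting pairs explicitly keeps the sketch close to the definitions of TP, FP and FN given in Section 4, at the cost of quadratic time in the number of media items.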
