<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CERTH at MediaEval 2015 Synchronization of Multi-User Event Media Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantinos Apostolidis</string-name>
          <email>kapost@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasileios Mezaris</string-name>
          <email>bmezaris@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CERTH-ITI</institution>
          ,
          <addr-line>Thermi 57001</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the results of our participation in the Synchronization of Multi-User Event Media Task at the MediaEval 2015 challenge. Using multiple similarity measures, we identify pairs of similar media from different galleries. We use a graph-based approach to temporally synchronize user galleries; subsequently, we use time information, geolocation information and visual concept detection results to cluster all photos into different sub-events. Our method achieves good accuracy on considerably diverse datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        People attending large events collect dozens of photos and
video clips with their smartphones, tablets, and cameras. These
are later exchanged and shared in a number of different ways.
The alignment and presentation of the media galleries of
different users in a consistent way, so as to preserve the
temporal evolution of the event, is not straightforward, considering
that the time information attached to some of the captured
media may be wrong and geolocation information may be
missing. The 2015 MediaEval Synchronization of Multi-User
Event Media (SEM) task tackles exactly this problem [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD OVERVIEW</title>
      <p>The proposed method temporally aligns user galleries that
are created by different digital capture devices, and clusters
the time-aligned photos into event-related clusters. In the
first stage, we assess media similarity by combining
multiple similarity measures and by taking into account the
geolocation metadata of photos. Similar media of the
different galleries are identified and are used for constructing a
graph, whose nodes represent galleries and edges represent
the discovered similarities between media items of different
galleries. Synchronization of the galleries is achieved by
traversing the minimum spanning tree (MST) of the graph.
Finally, we apply clustering techniques to split the media
into different sub-events. Figure 1 illustrates the proposed
method.</p>
    </sec>
    <sec id="sec-3">
      <title>3. MEDIA SIMILARITY ASSESSMENT</title>
      <p>
        To identify similar photos of different galleries, we
combine the information of four similarity measures [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: GC, S, CA, and DCS.
      </p>
      <p>We calculate the aforementioned similarity measures on
the photos of all galleries to be synchronized. We combine
the information of all similarity measures, using the
following procedure: initially, the similarity O(i, j) of photos i
and j is set equal to GC(i, j). Then, if S(i, j) &gt; ts and
S(i, j) &gt; GC(i, j), O(i, j) is updated as O(i, j) = S(i, j).
The same update process is subsequently repeated using
the CA similarity and the DCS similarity (and the respective tc
and td thresholds).</p>
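      <p>This combination rule lends itself to a compact implementation. The following is a minimal sketch, assuming each measure is available as a NumPy pairwise similarity matrix; the matrix and threshold names follow the text, while the function name is illustrative:</p>
      <preformat><![CDATA[
import numpy as np

# Combine the four similarity measures as described in the text: start
# from GC and overwrite with S, CA and DCS in turn, whenever the new
# measure exceeds both its own threshold and the current combined value.
def combine_similarities(GC, S, CA, DCS, ts, tc, td):
    O = GC.copy()
    for M, t in ((S, ts), (CA, tc), (DCS, td)):
        mask = (M > t) & (M > O)
        O[mask] = M[mask]
    return O
]]></preformat>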
      <p>Subsequently, we weight each similarity value so that the
similarity of photos whose capture locations are closer
than a threshold m is emphasized, while the similarity of
photos whose capture location distance lies significantly above
this threshold is zeroed. Similar photos that belong to
different user galleries are treated as potential links between
these galleries.</p>
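      <p>The exact weighting curve is not specified above; a simple function with the stated behavior (full weight below m, decaying to zero well above it) is the clipped linear ramp sketched below, where the cutoff factor is an assumed, illustrative choice:</p>
      <preformat><![CDATA[
import numpy as np

# Hedged sketch of the geolocation weighting: full weight for capture
# location distances up to m, linear decay to zero at cutoff_factor * m.
# The text only states the behavior at the two extremes.
def geo_weight(similarity, distance, m, cutoff_factor=3.0):
    ramp = (cutoff_factor * m - distance) / ((cutoff_factor - 1.0) * m)
    return similarity * np.clip(ramp, 0.0, 1.0)
]]></preformat>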
      <p>To identify similar audio files of different galleries, we
perform cross-correlation of the audio data, downsampled to an
11 kHz sampling rate. For video files, we select one frame for each
second of video and resize it to a width of one pixel. To identify
similar video files of different galleries, we perform cross-correlation
of the horizontally concatenated resized frames.</p>
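      <p>A minimal sketch of the audio case follows, using SciPy's cross-correlation to locate the best-matching lag; the video case is analogous, with the concatenated one-pixel-wide frame strips in place of the audio samples. Loading and downsampling of the signals is assumed to be done already:</p>
      <preformat><![CDATA[
import numpy as np
from scipy.signal import correlate

# Locate the lag that maximizes the cross-correlation of two signals
# sampled at `rate` Hz (11 kHz for the downsampled audio tracks).
def best_lag(signal_a, signal_b, rate=11000):
    xcorr = correlate(signal_a, signal_b, mode="full")
    lag_samples = int(np.argmax(xcorr)) - (len(signal_b) - 1)
    return lag_samples / rate  # temporal offset of b w.r.t. a, in seconds
]]></preformat>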
      <p>The ts, tc and td thresholds are empirically calculated
on the training dataset. The m threshold is calculated by
estimating a Gaussian mixture model of two Gaussian
distributions on the histogram of all photos' pairwise capture
location distances. The Gaussian distribution with the
lower mean (m) presumably signifies photos captured in the
same sub-event.</p>
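      <p>This threshold estimation could look as follows with scikit-learn, fitting the two-component mixture directly to the distance values rather than to an explicit histogram (a minimal sketch; distances are assumed to be in meters):</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a two-component Gaussian mixture to all pairwise capture location
# distances and take the mean of the lower-mean component as m.
def estimate_m(pairwise_distances):
    d = np.asarray(pairwise_distances, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(d)
    return float(gmm.means_.min())
]]></preformat>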
    </sec>
    <sec id="sec-4">
      <title>4. TEMPORAL SYNCHRONIZATION</title>
      <p>Having identified potential links for at least some gallery
pairs, we construct a weighted graph whose nodes represent
the galleries and whose edges represent the links between
galleries. The weight assigned to each edge is calculated as the
sum of similarities of the photos linking the two galleries.
Using this graph, the temporal offset of each gallery will be
computed against the reference gallery.</p>
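      <p>Construction of this gallery graph is straightforward; a minimal sketch with NetworkX follows, assuming the links discovered in the previous section have been grouped per gallery pair:</p>
      <preformat><![CDATA[
import networkx as nx

# Build the gallery graph: one node per gallery, and for every pair of
# galleries with linked photos an edge weighted by the sum of the link
# similarities. `links` maps (gallery_a, gallery_b) to the list of
# similarity values of the photo pairs connecting the two galleries.
def build_gallery_graph(num_galleries, links):
    g = nx.Graph()
    g.add_nodes_from(range(num_galleries))
    for (ga, gb), sims in links.items():
        g.add_edge(ga, gb, weight=sum(sims))
    return g
]]></preformat>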
      <p>
        We compute the temporal offset of each gallery by
traversing the minimum spanning tree (MST) of the galleries graph.
This procedure (MSTt) can be summarized as follows:
starting from the node corresponding to the reference gallery, we
select the edge with the highest weight. We compute the
temporal offset of the node on the other end of this edge as
the median of the capture time differences of the pairs of
similar photos that this edge represents. We add this node
to the set of visited nodes. The selection of the edge with
the highest weight is repeated, considering any member of
the set of visited nodes as a possible starting point, and the
corresponding temporal offset is again computed, until all
nodes are visited. This process is explained in more detail
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
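      <p>A sketch of this traversal is given below, reusing the NetworkX graph from the previous snippet; the per-edge list of capture time differences ("tdiffs") is an assumed edge attribute, stored signed in the edge's (u, v) orientation:</p>
      <preformat><![CDATA[
import statistics

# MSTt-style traversal: starting from the reference gallery, repeatedly
# follow the highest-weight edge from any visited node to an unvisited
# one, and set the new gallery's offset from the median capture time
# difference of the photo pairs behind that edge.
def mst_traverse(graph, reference=0):
    offsets = {reference: 0.0}
    while len(offsets) < graph.number_of_nodes():
        frontier = [(d["weight"], u, v) for u, v, d in graph.edges(data=True)
                    if (u in offsets) != (v in offsets)]
        if not frontier:
            break  # remaining galleries cannot be synchronized
        _, u, v = max(frontier)
        tdiff = statistics.median(graph[u][v]["tdiffs"])  # signed u -> v
        if u in offsets:
            offsets[v] = offsets[u] + tdiff
        else:
            offsets[u] = offsets[v] - tdiff
    return offsets
]]></preformat>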
      <p>The MSTt method calculates the offsets using only the
shortest path from a visited node to any given node. We also
explore a variation of the MSTt process as an alternative way
of computing temporal offsets (MSTx): before traversing
the MST of the graph, we detect fully-connected triplets of
nodes and we average the offset of the shortest path with
that of the alternative path in each triplet, but only if the difference of
the two paths is lower than a maxDiff threshold. Since the
MSTx process utilizes some additional information that the
MSTt method ignores, we expect it to achieve better accuracy
in time synchronization.</p>
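      <p>A sketch of this triplet adjustment follows; the signed per-edge offsets (the medians used above) are assumed to be kept in a dictionary keyed by ordered gallery pair, with off[(b, a)] == -off[(a, b)]:</p>
      <preformat><![CDATA[
import networkx as nx

# MSTx-style refinement: for every fully-connected triplet {a, b, c},
# average the direct offset of edge (a, b) with the offset implied by the
# two-hop path a -> c -> b, provided the two differ by less than maxDiff.
def mstx_adjust(graph, off, max_diff=10.0):
    for a, b in list(graph.edges()):
        for c in nx.common_neighbors(graph, a, b):
            alternative = off[(a, c)] + off[(c, b)]
            if abs(off[(a, b)] - alternative) < max_diff:
                averaged = (off[(a, b)] + alternative) / 2.0
                off[(a, b)], off[(b, a)] = averaged, -averaged
    return off
]]></preformat>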
    </sec>
    <sec id="sec-5">
      <title>5. SUB-EVENT CLUSTERING</title>
      <p>After time synchronization, we cluster all photos into
sub-events. Two different approaches were adopted. In the first
approach (MPC), we apply the following procedure: in the
first stage, we split the photo timeline where consecutive
photos have a temporal distance above the mean of all
temporal distances. In the second stage, geolocation information
is used to further split clusters of photos. In the third
stage, clusters are merged using time and geolocation
information; for the clusters that do not have geolocation
information, the merging is continued by considering visual
similarity. In the second approach (APC), we augment the
DCNN feature vectors with the normalized time information
and cluster the media using Affinity Propagation.</p>
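      <p>The following sketch illustrates the two variants with scikit-learn: for MPC, only the first (temporal splitting) stage is shown, and for APC the clustering of time-augmented features; all function names are illustrative:</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.cluster import AffinityPropagation

# MPC, first stage only: start a new sub-event after every above-average
# gap in the sorted capture times (the geolocation-based splitting and
# the merging stages are omitted here).
def mpc_time_split(capture_times):
    t = np.sort(np.asarray(capture_times, dtype=float))
    if len(t) < 2:
        return np.zeros(len(t), dtype=int)
    gaps = np.diff(t)
    return np.concatenate(([0], np.cumsum(gaps > gaps.mean())))

# APC: append the normalized capture time to each photo's DCNN feature
# vector and cluster with Affinity Propagation.
def apc_cluster(features, capture_times):
    t = np.asarray(capture_times, dtype=float)
    t = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    augmented = np.hstack([features, t.reshape(-1, 1)])
    return AffinityPropagation(random_state=0).fit_predict(augmented)
]]></preformat>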
    </sec>
    <sec id="sec-6">
      <title>6. RESULTS</title>
      <p>We submitted four runs in total, combining the two methods for
temporal synchronization and the two methods for sub-event
clustering. The results of our approach for all datasets and
all four runs are listed in Table 1. From the reported results,
it is clear that our method achieved good accuracy but only
managed to synchronize a small number of galleries,
particularly in the TDF14 dataset. In sub-event clustering,
the MPC method scored a slightly better F-score (column
F1) for two of the datasets. The MSTt and MSTx
methods performed the same because maxDiff was set too low
(maxDiff = 10), which allowed only very small
adjustments, thus degenerating the MSTx method to MSTt.</p>
      <p>[Table 1: Results of all four runs on each dataset.]</p>
    </sec>
    <sec id="sec-7">
      <title>7. CONCLUSIONS</title>
      <p>In this paper, our framework and results at the MediaEval
2015 Synchronization of Multi-User Event Media Task were
presented. Better fine-tuning of the algorithm parameters is
required to achieve consistently good performance on diverse
datasets. As future work, we are considering extending
the algorithm with automatic parameter selection (which could
lead to selecting more links between galleries, thus improving
precision), experimenting with different values of maxDiff,
and applying a more sophisticated method to combine different
similarity measures.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported by the European Commission
under contract FP7-600826 ForgetIT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Apostolidis</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          .
          <article-title>Using photo similarity and weighted graphs for the temporal synchronization of event-centered multi-user photo collections</article-title>
          .
          <source>In Proc. 2nd Workshop on Human Centered Event Understanding from Multimedia (HuEvent'15) at ACM Multimedia (MM'15)</source>
          , Brisbane, Australia, Oct.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Conci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>De Natale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Matton</surname>
          </string-name>
          .
          <article-title>Synchronization of Multi-User Event Media (SEM) at MediaEval 2015: Task Description, Datasets, and Evaluation</article-title>
          .
          <source>In Proc. MediaEval Workshop</source>
          , Wurzen, Germany, Sept.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <article-title>Caffe: Convolutional architecture for fast feature embedding</article-title>
          .
          <source>In Proc. ACM Int. Conf. on Multimedia</source>
          , Nov.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          .
          <article-title>Modeling the shape of the scene: A holistic representation of the spatial envelope</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ):
          <fpage>145</fpage>
          -
          <lpage>175</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          .
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>CoRR, abs/1409.4842</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          .
          <article-title>SIFT match verification by geometric coding for large-scale partial-duplicate web image search</article-title>
          .
          <source>ACM Trans. Multimedia Comput. Commun. Appl.</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>4:1</fpage>
          -
          <lpage>4:18</lpage>
          , Feb.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>