CERTH at MediaEval 2014 Synchronization of Multi-User Event Media Task

Konstantinos Apostolidis, Christina Papagiannopoulou, Vasileios Mezaris
Information Technologies Institute, CERTH, Thessaloniki, Greece

2014

This paper describes the results of the CERTH participation in the Synchronization of Multi-User Event Media Task of MediaEval 2014. We used a near-duplicate image detector to identify very similar photos, which allowed us to temporally align photo galleries; we then used time, geolocation and visual information, including the results of visual concept detection, to cluster all photos into different events.

1. INTRODUCTION

People attending large-scale social events collect dozens of photos and video clips with their smartphones, tablets and cameras. These are later exchanged and shared in a number of different ways. The alignment and presentation of the photo galleries of different users in a consistent way, so as to preserve the temporal evolution of the event, is not straightforward, considering that the time information attached to some of the captured media may be wrong (due to different photo capturing devices not being synchronized) and geolocation information may be missing. The 2014 MediaEval Synchronization of Multi-user Event Media (SEM) task tackles this exact problem [1].

2. SYSTEM OVERVIEW

The main goal of our system is the time alignment of photo galleries that are created by different digital photo capture devices, and the clustering of their photos into event-related clusters. In the first stage, similar photos of the different galleries are identified and are used for constructing a graph, whose nodes represent galleries and whose edges represent discovered links between them. Time alignment of the galleries is achieved by traversing the graph. After that, we apply clustering techniques in order to split our collection into different events. Figure 1 shows the pipeline of our system.

3. TIME SYNCHRONIZATION

Time synchronization makes use of a Near Duplicate Detector (NDD) that extracts SIFT descriptors from the photos, forms a visual vocabulary and encodes the descriptor-based representation of each photo using VLAD encoding. The nearest neighbours that are returned for a query image are refined by checking the geometrical consistency of SIFT keypoints using geometric coding (GC) [4]. We further modified this NDD process to also use color information (HSV histograms), so that near-duplicate candidates that are very similar in color are not discarded even if the GC score is relatively low.
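The color-aware refinement step can be illustrated with a minimal sketch. It assumes that each candidate already carries a GC score and an HSV histogram; the function names, the use of histogram intersection as the color similarity, and both threshold values are hypothetical choices made for illustration, not details given in this paper.

```python
from typing import Sequence


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1] between two histograms after L1 normalisation."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    s1, s2 = sum(h1), sum(h2)
    if s1 == 0 or s2 == 0:
        return 0.0
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))


def keep_candidate(gc_score: float,
                   hsv_query: Sequence[float],
                   hsv_candidate: Sequence[float],
                   gc_threshold: float = 0.5,     # hypothetical value
                   color_threshold: float = 0.9   # hypothetical value
                   ) -> bool:
    """Keep a candidate if geometry agrees, or if its color is very similar."""
    if gc_score >= gc_threshold:
        return True
    # Low GC score: fall back to color so that very similar photos survive.
    return histogram_intersection(hsv_query, hsv_candidate) >= color_threshold
```

In this sketch a candidate with a low GC score is kept only when its HSV histogram almost coincides with that of the query, which mirrors the intent of the modification described above.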

We apply the modified NDD on the union of all galleries. Subsequently, we filter out identified pairs of near duplicates according to the following rules:

Reject pairs when geolocation information is available and the location distance of the two photos is greater than a distance threshold.

Reject pairs when the time difference between the photos is above an extreme time threshold (which indicates that this time difference is unlikely to be due to a time synchronization error alone).

The remaining near-duplicate photos belonging to different galleries are considered as links between those galleries.
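The two rejection rules can be sketched as a simple predicate over a pair of photos. The `Photo` record, the great-circle (haversine) distance, and the default thresholds below are assumptions made for illustration; the paper does not state the threshold values, and the sketch applies the distance rule only when both photos carry geolocation.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Photo:
    gallery: str
    timestamp: float                              # seconds, as reported by the device
    latlon: Optional[Tuple[float, float]] = None  # (latitude, longitude) in degrees


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def is_link(p: Photo, q: Photo,
            max_km: float = 1.0,                  # hypothetical distance threshold
            max_dt: float = 30 * 86400.0          # hypothetical extreme time threshold
            ) -> bool:
    """Decide whether a near-duplicate pair counts as a link between galleries."""
    if p.gallery == q.gallery:
        return False  # links only connect different galleries
    if p.latlon is not None and q.latlon is not None:
        if haversine_km(p.latlon, q.latlon) > max_km:
            return False  # rule 1: photos taken too far apart
    if abs(p.timestamp - q.timestamp) > max_dt:
        return False      # rule 2: gap too large to be a clock error
    return True
```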

It is now straightforward to construct a graph whose nodes represent the galleries, and whose edges represent these links between galleries. Each edge has a weight which is equal to the number of links between the two galleries. Having constructed the graph, we compute the time offset of each gallery by traversing it, as follows. Starting from the node corresponding to the reference gallery, we select the edge with the highest weight. We compute the time offset of the node on the other end of this edge as the median of the time differences of the pairs of near-duplicate photos that this edge represents, and add this node to the set of visited nodes. The selection of the edge with the highest weight is repeated, considering as possible starting point any member of the set of visited nodes, and the corresponding time offset is computed, until all nodes are visited. Alternatively, we can traverse the graph and compute the nodes' time offsets by simultaneously considering the weights of multiple edges.
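The greedy traversal can be sketched as follows. The data layout (`links` maps a gallery pair to the list of timestamp differences of its near-duplicate pairs) and the function name are hypothetical. The sketch also assumes that offsets are chained, i.e. the offset of a newly visited gallery is the median difference along the selected edge plus the offset of the already visited gallery at its other end; the text above does not spell this out.

```python
from statistics import median
from typing import Dict, List, Tuple

# links[(a, b)] lists, for each near-duplicate pair linking galleries a and b,
# the difference t_a - t_b between the two device timestamps.
Links = Dict[Tuple[str, str], List[float]]


def gallery_offsets(links: Links, reference: str) -> Dict[str, float]:
    """Offsets to add to each gallery's timestamps to match the reference clock."""
    # Undirected view of the graph: adj[u][v] holds the differences t_u - t_v.
    adj: Dict[str, Dict[str, List[float]]] = {}
    for (a, b), diffs in links.items():
        adj.setdefault(a, {}).setdefault(b, []).extend(diffs)
        adj.setdefault(b, {}).setdefault(a, []).extend(-d for d in diffs)

    offsets = {reference: 0.0}
    while True:
        best = None  # (edge weight, visited node, unvisited node)
        for u in offsets:
            for v, diffs in adj.get(u, {}).items():
                if v not in offsets and (best is None or len(diffs) > best[0]):
                    best = (len(diffs), u, v)
        if best is None:
            break  # no edge leaves the visited set any more
        _, u, v = best
        # The median of t_u - t_v moves v onto u's clock; adding u's own
        # offset then moves it onto the reference clock (chaining assumption).
        offsets[v] = offsets[u] + median(adj[u][v])
    return offsets
```

Galleries that cannot be reached from the reference gallery are absent from the returned dictionary, so a caller would need to decide how to treat them.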

4. MEDIA CLUSTERING OF EVENTS

Following time synchronization, we cluster all photos into events. Two different approaches are adopted: the first one considers all photo galleries as a single photo collection, exploiting the synchronization results, while the second one first performs a pre-clustering within each gallery separately.

In the first approach, we use the method of [2], resulting in clusters that are time distinct, comprising different events. Subsequently, each of these clusters is split based on the geolocation information. The photos that do not have geolocation information are assigned to the geo-cluster which is most similar according to the color information (e.g. HSV histogram).

In the second approach, we detect time gaps between events of each gallery. Specifically, we find the minimum time difference of dissimilar photos which is greater than the maximum time difference of the near-duplicate photos (based on the similarity matrix of GC). The clusters that are formed are merged according to time and geolocation similarity. For the clusters that do not have geolocation information, the merging is continued by considering the time and low-level feature similarity or the time and the concept detector (CD) confidence similarity scores [3].
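One possible reading of the per-gallery gap detection is sketched below. It assumes that time differences are taken between consecutive photos of a time-sorted gallery, and that a flag derived from the GC similarity matrix says whether each consecutive pair is a near duplicate; both the interpretation and the function names are assumptions, not details confirmed by the text.

```python
from typing import List, Optional, Sequence


def gap_threshold(times: Sequence[float],
                  is_near_dup: Sequence[bool]) -> Optional[float]:
    """Smallest gap between dissimilar neighbours exceeding every near-duplicate gap.

    times: timestamps of one gallery, sorted in ascending order.
    is_near_dup[i]: True if photos i and i + 1 are near duplicates.
    """
    gaps = [b - a for a, b in zip(times, times[1:])]
    if len(is_near_dup) != len(gaps):
        raise ValueError("need one near-duplicate flag per consecutive pair")
    max_nd = max((g for g, nd in zip(gaps, is_near_dup) if nd), default=0.0)
    candidates = [g for g, nd in zip(gaps, is_near_dup) if not nd and g > max_nd]
    return min(candidates) if candidates else None


def split_gallery(times: Sequence[float],
                  is_near_dup: Sequence[bool]) -> List[List[int]]:
    """Split a gallery into clusters of photo indices at gaps >= the threshold."""
    thr = gap_threshold(times, is_near_dup)
    clusters: List[List[int]] = [[0]] if times else []
    for i in range(1, len(times)):
        if thr is not None and times[i] - times[i - 1] >= thr:
            clusters.append([i])      # large gap: start a new cluster
        else:
            clusters[-1].append(i)    # small gap: stay in the current cluster
    return clusters
```

The resulting per-gallery clusters would then be the input to the merging step based on time, geolocation, low-level features or CD scores.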

5. EXPERIMENTS AND RESULTS

We submitted 5 runs in total, combining 3 methods for time synchronization and 3 methods for event clustering:

Run1: aNDD-perGallery-mergeCD: Compute gallery time offsets using our modified NDD. CD scores are used to merge clusters using the second approach of Section 4.

Run2: aNDD-perGallery-mergeHSV: Compute gallery time offsets using our modified NDD. HSV histogram similarity is used to merge clusters using the second approach of Section 4.

Run3: aNDD-concat: Compute gallery time offsets using our modified NDD. Clustering is performed using the first approach of Section 4.

Run4: aNDD-multiT-perGallery-mergeCD: Compute gallery time offsets using our modified NDD and traversal of the graph by simultaneously considering the weights of multiple edges. CD scores are used to merge clusters using the second approach of Section 4.

Run5: NDD-perGallery-mergeCD: Compute gallery time offsets using NDD without HSV color information. CD scores are used to merge certain events using the second approach of Section 4.

The results of our approach for all 5 runs, for the Vancouver testset and the London testset, are listed in Tables 1 and 2, respectively.

6. CONCLUSIONS

This paper presented our framework and results at the MediaEval 2014 Synchronization of Multi-User Event Media Task. Our modified NDD approach gives the best results in time alignment for the Vancouver testset, while the standard NDD yields a slightly better time synchronization for the London testset. In sub-event clustering, the exploitation of consistent timestamps in a gallery and the use of CD confidence scores gives a good performance for the Vancouver testset, whereas HSV histogram similarity seems to give the best clustering results for the London testset.

ACKNOWLEDGMENTS

This work was supported by the EC under contracts FP7-287911 LinkedTV and FP7-600826 ForgetIT.

REFERENCES

[1] Conci, De Natale, and Mezaris. Synchronization of Multi-User Event Media (SEM) at MediaEval 2014: Task Description, Datasets, and Evaluation. In Proc. MediaEval Workshop, 2014.

[2] Cooper, Foote, Girgensohn, and Wilcox. Temporal event clustering for digital photo collections. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 1(3):269–288, 2005.

[3] Papagiannopoulou and Mezaris. Concept-based Image Clustering and Summarization of Event-related Image Collections. In Proc. Int. Workshop on Human Centered Event Understanding from Multimedia (HuEvent14) of ACM Multimedia (MM14), 2014.

[4] Zhou, Li, Lu, and Tian. SIFT match verification by geometric coding for large-scale partial-duplicate web image search. ACM Trans. Multimedia Comput. Commun. Appl., 9(1):4:1–4:18, Feb. 2013.