JRS at Synchronization of Multi-user Event Media Task

Hannes Fassold, Harald Stiegler, Felix Lee, Werner Bailer
JOANNEUM RESEARCH – DIGITAL
Steyrergasse 17, 8010 Graz, Austria
{firstname.lastname}@joanneum.at

ABSTRACT

The event synchronisation task addresses the problem of temporally aligning media (i.e., photo and video) streams ("galleries") from different users and of identifying coherent events in the streams. Our approach uses the visual similarity of image/key frame pairs, based on full matching of SIFT descriptors with geometric verification. Based on the visual similarity and the given time information, a probabilistic algorithm is employed, where in each run a hypothesis is calculated for the set of time offsets with respect to the reference gallery. From the gathered hypotheses, the final set of time offsets is calculated as the medoid of all hypotheses.

1. INTRODUCTION

The event synchronisation task addresses the problem of temporally aligning media streams (referred to as galleries) from different users and identifying coherent events in the streams. This paper describes the work done by the JRS team for the two subtasks of determining the time offsets of galleries and clustering the images and videos into events. Details on the task and the data set can be found in [1].

2. APPROACH

2.1 Determining Gallery Offsets

Our approach utilizes the visual information (the captured images and the key frames extracted from the videos) and the given time stamps in a probabilistic way. The absolute time stamps are not considered reliable in this task; however, their relative distances within the gallery of one user can be exploited.

We denote the galleries as G_0, ..., G_M (with G_0 as the reference gallery), each G_k containing a set of images or key frames I_1, ..., I_{N_k}. For every image, several thousand SIFT descriptors [3] are extracted; a GPU-accelerated implementation is used to speed up descriptor extraction and matching [2]. For a pair of galleries (k, l), for each image I_i ∈ G_k its best matching image I_j ∈ G_l is identified via exhaustive matching of their respective SIFT descriptors. For each match (I_i, I_j), a geometric verification step is applied, yielding a variable number of homographies along with the number of points h_t supporting the respective homography. The visual similarity s_{i,j} for the image pair is then calculated as follows: first, all homographies with h_t < τ are discarded. From the remaining ones, the k highest support values h_t are selected and clipped to the range [h_t^min, h_t^max], and the arithmetic average h_avg and the sum h_sum of the clipped values are calculated. The visual similarity s_{i,j} is obtained as the geometric mean of h_avg and h_sum.
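To make the scoring concrete, the following is a minimal sketch (not the authors' implementation) of the visual similarity computation described above; the function name and the default values for τ, k and the clipping range are illustrative assumptions.

```python
import numpy as np

def visual_similarity(support_counts, tau=15, k=3, h_min=20.0, h_max=200.0):
    """Sketch of s_ij for an image pair (I_i, I_j).

    support_counts: inlier counts h_t, one per homography returned by the
    geometric verification step. tau, k, h_min and h_max are assumed values.
    """
    h = np.asarray(support_counts, dtype=float)
    h = h[h >= tau]                       # discard homographies with h_t < tau
    if h.size == 0:
        return 0.0
    h = np.sort(h)[::-1][:k]              # keep the k highest support values
    h = np.clip(h, h_min, h_max)          # clip to [h_t^min, h_t^max]
    h_avg, h_sum = h.mean(), h.sum()
    return float(np.sqrt(h_avg * h_sum))  # geometric mean of average and sum
```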
Our general approach is probabilistic: a significant number of potential solutions (hypotheses) is calculated, and from these the 'most-inner' one (in a sense explained below) is taken as the final solution. Such a probabilistic approach is more robust against outliers in the data. As a preprocessing step, we calculate a connection magnitude c_{k,l} for each gallery pair (k, l) in order to steer the random picking of gallery pairs towards the more 'stable' ones (e.g., gallery pairs with a high number of matches and a low deviation of the time difference values between the matches). The connection magnitude is calculated as the geometric mean of three quantities: the number of matches identified between the galleries (based on visual similarity), the average visual similarity score of these matches, and the reciprocal of the average deviation of the time differences between the matches.

A potential solution is a vector of time differences D′ = (δ_1, ..., δ_M) between the M galleries and the reference gallery G_0. For generating one potential solution D′, we proceed as follows. First, a random gallery pair (k, l) is picked, with a probability proportional to its connection magnitude c_{k,l}, so that the random picking is steered towards the more stable gallery pairs. In order to probabilistically calculate the time difference δ_{k,l} between the two galleries, we first apply k-means clustering to the time difference values of all matches, where k is typically in the range 3 to 5. Then, we randomly pick one of the cluster centers and set it as δ_{k,l}. Having calculated δ_{k,l}, we can propagate this value recursively and calculate unknown values δ_{k′,l} using the relation

    δ_{k,l} = δ_{k,k′} + δ_{k′,l},    (1)

which is straightforward to show. By iterating this process of randomly selecting a gallery pair and calculating δ_{k,l} a total of M − 1 times, we obtain one potential solution D′.

In order to calculate the final solution D, we generate several thousand potential solutions D′ (each being a vector of time differences) in the way described above and determine D as their medoid. In a certain sense, this is the 'most-inner' solution when interpreting the potential solutions as vectors.
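As a rough illustration of the hypothesis generation and medoid selection, the sketch below uses assumed data structures: conn maps gallery pairs to their connection magnitudes c_{k,l}, and time_diffs maps pairs to the lists of per-match time differences. The retry loop replaces the paper's fixed M − 1 iterations for simplicity, and scipy's kmeans stands in for the 1-D clustering; none of this is the authors' code.

```python
import numpy as np
from scipy.cluster.vq import kmeans

def one_hypothesis(M, conn, time_diffs, rng, n_centers=4):
    """Generate one potential solution D' = (delta_1, ..., delta_M).

    conn: {(k, l): c_kl} connection magnitudes; time_diffs: {(k, l): [...]}
    per-match time differences. Both are assumed inputs.
    """
    delta = {0: 0.0}                          # offsets relative to G_0
    pairs = list(conn)
    p = np.array([conn[q] for q in pairs], float)
    p /= p.sum()                              # pick pairs proportional to c_kl
    while len(delta) <= M:                    # until all M offsets are known
        k, l = pairs[rng.choice(len(pairs), p=p)]
        if (k in delta) == (l in delta):      # need exactly one known endpoint
            continue
        d = np.asarray(time_diffs[(k, l)], float).reshape(-1, 1)
        centers, _ = kmeans(d, n_centers)     # cluster the time differences
        d_kl = float(rng.choice(centers.ravel()))  # random center as delta_kl
        if k in delta:                        # propagate via Eq. (1)
            delta[l] = delta[k] + d_kl
        else:
            delta[k] = delta[l] - d_kl
    return np.array([delta[m] for m in range(1, M + 1)])

def medoid(hypotheses):
    """Final solution D: the hypothesis closest to all others."""
    H = np.stack(hypotheses)
    dist = np.linalg.norm(H[:, None] - H[None, :], axis=-1).sum(axis=1)
    return H[np.argmin(dist)]
```

Collecting a few thousand one_hypothesis vectors and taking their medoid then yields the final offsets D in the sense described above.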
2.2 Clustering Events

For the event clustering, we rely solely on the time information. We first correct the time stamps of each gallery with its calculated offset with respect to the reference gallery. Based on the corrected time information, a one-dimensional k-means clustering is applied, with k ranging between 30 and 100. The value of k is determined based on the size of the data set (the total number of images in all galleries) and a user parameter which specifies the desired granularity of the subevents.

3. EXPERIMENTS AND RESULTS

We submitted two runs, which use the same parameters for determining the time offsets. The clustering differs, with k for run 2 being double the value of run 1; run 2 therefore corresponds to a finer granularity of the subevents compared to run 1. Unfortunately, the official submissions only contained the results for the still image data sets (Tour de France, NAMM), but not for the videos.

Figure 2 shows the results for synchronisation. For both data sets, accuracy is clearly higher than precision. This means that our approach tends to optimise for a globally lower synchronisation error at the cost of higher individual errors for some galleries. While precision is significantly lower for NAMM than for TDF, accuracy has actually increased. One reason for this may be the fact that the local features of bikes and bikers match quite well across many images (which can be seen from the high visual similarity values s_{i,j} for these images), so that visual matching provides a weaker constraint than on visually more diverse data.

Figure 2: Results for synchronisation. [Bar chart of precision and accuracy for TDF14 and NAMM15.]

The results for subevent clustering are shown in Figure 1. One interesting observation is that while the F1 score is on a comparable level for both data sets, precision and recall are quite balanced for NAMM, but biased towards higher precision for TDF. Interestingly, varying the parameter between the two runs does not change this behaviour: for both parameterisations the method tends to oversegment the TDF data. The impact of synchronisation errors on the clustering result seems to be limited, as no direct relation is apparent from the results.

Figure 1: Results for subevent clustering on TDF14 (left) and NAMM15 (right). [Bar charts of precision, recall and F-score for runs 1 and 2.]

4. CONCLUSION

The proposed method performs quite well in minimising the overall synchronisation error, but at the expense of more galleries that exceed the error threshold. For the subevent clustering, a better automatic adaptation of the number of clusters to the data set is needed, in order to avoid oversegmentation such as on the TDF data.

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 610370, "ICoSOLE – Immersive Coverage of Spatially Outspread Live Events" (http://www.icosole.eu/).

5. REFERENCES

[1] Nicola Conci, Francesco De Natale, Vasileios Mezaris, and Mike Matton. Synchronization of Multi-User Event Media at MediaEval 2015: Task Description, Datasets, and Evaluation. In MediaEval 2015 Workshop, Wurzen, Germany, September 14-15 2015.

[2] Hannes Fassold and Jakub Rosner. A real-time GPU implementation of the SIFT algorithm for large-scale video analysis tasks. In Real-Time Image and Video Processing, San Francisco, CA, USA, 2015.

[3] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.