=Paper=
{{Paper
|id=Vol-1263/paper53
|storemode=property
|title=JRS at Event Synchronization Task
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_53.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/NowakTSB14
}}
==JRS at Event Synchronization Task==
Paweł Nowak, Marcus Thaler, Harald Stiegler, Werner Bailer
JOANNEUM RESEARCH – DIGITAL, Steyrergasse 17, 8010 Graz, Austria
werner.bailer@joanneum.at

ABSTRACT

The event synchronisation task addresses the problem of temporally aligning photo streams from different users and identifying coherent events in the streams. In our approach, we first determine the visual similarity of image pairs, based on full matching of SIFT descriptors and on VLAD, and compare the use of the two sets of similarity scores. We then build a non-homogeneous linear equation system constraining the time offsets between the galleries based on these matching pairs and determine an approximate solution. Event clusters are initialised from subsequent and visually similar images, and clusters are merged if their temporal proximity and the maximum similarity of their members are high enough.

1. INTRODUCTION

The event synchronisation task addresses the problem of temporally aligning photo streams from different users and identifying coherent events in the streams. This paper describes the work done by the JRS team for the two subtasks of determining the time offsets of galleries and clustering the images into events. Details on the task and the data set can be found in [1].

2. APPROACH

2.1 Determining Gallery Offsets

In our approach, we first determine the visual similarity of image pairs, based on full matching of SIFT [3] descriptors and on VLAD [2], and compare the use of the two sets of similarity scores.

The computation of the image similarities between the images of each gallery is based on SIFT descriptors. All images of each gallery were first downscaled from HD to SD resolution, and up to 500 SIFT key points and descriptors were then extracted from each image. For similarity calculation based on nearest neighbour matching of SIFT descriptors, each raw SIFT descriptor of the source image is assigned to its nearest neighbour (by Euclidean distance) among the descriptors of the target image. These assignments are validated by the homography supported by the maximum number of consistent descriptor assignments.

For the extraction of image similarities based on the compact feature representation VLAD, the same extracted SIFT key descriptors were used. To compute the VLAD signature of each gallery image we used the VLFeat open source library (http://www.vlfeat.org). Using k-means clustering, we reduced a global visual vocabulary of about 300,000 descriptor clusters to 256 visual words; the descriptors for building the vocabulary were extracted from a news data set of the TOSCA-MP project (http://www.tosca-mp.eu). The similarities between the VLAD signatures, and thus the image similarities within the test sets, were calculated based on the sum of squared errors.

For a pair of images (I_i, I_j), VLAD yields distances d^V_ij, which are transformed into similarities

  s^V_{ij} = \begin{cases} \theta_V - d^V_{ij}, & \text{if } d^V_{ij} < \theta_V \\ 0, & \text{otherwise,} \end{cases}   (1)

where θ_V is a threshold for the maximum distance. The SIFT similarity s^S_ij is determined as

  s^S_{ij} = \begin{cases} \max\left(0, \frac{|P_{ij}|}{\min(|P_i|, |P_j|)} - \theta_S\right), & \text{if } |P_{ij}| \geq p \\ 0, & \text{otherwise,} \end{cases}   (2)

where P_i are the key points of each image, P_ij is the set of matching key points, p is a threshold on the number of matching key points and θ_S is a similarity threshold. We use all similarities above zero to formulate constraints on the time offsets of the galleries. Optionally, the GPS information of the images (if available) can be used, setting the similarity to zero if the deviation in longitude or latitude is above a threshold θ_G (in degrees).
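For illustration, the following is a minimal NumPy sketch of the standard VLAD aggregation [2] underlying the distances d^V_ij. The paper itself relies on VLFeat for this step, so the function names and the precomputed 256-word vocabulary argument below are our assumptions, not the authors' code.

```python
import numpy as np

def vlad_signature(descriptors, vocabulary):
    """Aggregate the SIFT descriptors of one image into a VLAD signature.

    descriptors -- (n, 128) array of SIFT descriptors of one image
    vocabulary  -- (256, 128) array of visual words (k-means centroids)
    """
    descriptors = np.asarray(descriptors, dtype=float)
    # assign each descriptor to its nearest visual word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # accumulate the residuals of the descriptors to their assigned word
    v = np.zeros_like(vocabulary, dtype=float)
    for desc, k in zip(descriptors, nearest):
        v[k] += desc - vocabulary[k]
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v  # L2 normalisation

def vlad_distance(v1, v2):
    """Sum of squared errors between two VLAD signatures (d^V in Eq. 1)."""
    return float(((v1 - v2) ** 2).sum())
```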
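The similarity transforms (1) and (2) are then direct to apply. A minimal sketch, assuming the VLAD distance and the homography-validated match counts have already been computed; the function and argument names are ours:

```python
def vlad_similarity(d_vlad, theta_v):
    """Eq. (1): turn a VLAD distance into a similarity score."""
    return theta_v - d_vlad if d_vlad < theta_v else 0.0

def sift_similarity(n_matches, n_kp_i, n_kp_j, theta_s, p=10):
    """Eq. (2): similarity from homography-validated SIFT matches.

    n_matches      -- |P_ij|, number of validated matching key points
    n_kp_i, n_kp_j -- |P_i|, |P_j|, key points per image (up to 500)
    """
    if n_matches < p:  # too few matches to be considered reliable
        return 0.0
    return max(0.0, n_matches / min(n_kp_i, n_kp_j) - theta_s)
```

For example, with θ_S = 0.07 and p = 10 (run 2 in Table 1), a pair with 60 validated matches and at least 400 key points per image scores 60/400 − 0.07 = 0.08.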
For N galleries G_1, ..., G_N, we can assume without loss of generality that G_1 is the reference gallery. We aim at obtaining a list of time differences D = (δ_2, ..., δ_N), where δ_i is the time offset between galleries G_i and G_1. As the underlying assumption in this task is that the offset between two galleries is constant over time, each pair of matching images adds one constraint of the form δ_p − δ_q = τ_ij, where p, q are the galleries containing images I_i, I_j respectively, and τ_ij is the time offset determined from the time stamps of the matching images. Note that δ_1 is by definition 0. We can then reorganise our constraints into an overdetermined equation system

  \begin{pmatrix} g_2(i) - g_2(j) & \cdots & g_N(i) - g_N(j) \\ \vdots & & \vdots \\ g_2(k) - g_2(l) & \cdots & g_N(k) - g_N(l) \end{pmatrix} \begin{pmatrix} \delta_2 \\ \vdots \\ \delta_N \end{pmatrix} = \begin{pmatrix} \tau_{ij} \\ \vdots \\ \tau_{kl} \end{pmatrix}   (3)

where g_n(i) is a binary function yielding 1 if I_i ∈ G_n and 0 otherwise. In order to deal with outliers, we solve the equation system iteratively and remove up to 10% of the largest outliers. In each iteration, we use the Jacobi method to solve the equation system (a sketch of this procedure is given below, after Section 2.2).

2.2 Clustering Events

We initialise the event time line by grouping subsequent images which have visual similarity (s^V or s^S) above zero. This oversegments the event time line. In a next step, we regroup these events based on visual similarity and (optionally) temporal proximity. The distance between two events i, j is determined as

  d^E_{ij} = \alpha_t \max\left(1, \frac{|\bar{t}_i - \bar{t}_j|}{t_{\min}}\right) \theta - \max_{k \in E_i, l \in E_j} s_{kl}   (4)

where t̄_i is the mean time of the images in event E_i, α_t is a weight for using time information, θ is the similarity threshold used (θ_S or θ_V) and s_kl is the visual similarity between a pair of images of which one belongs to E_i and the other to E_j. Two events are merged if d^E_ij < θ_merge, where θ_merge has been set to θ_V + 0.15.
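A minimal sketch of this merging step, assuming per-image time stamps and a precomputed pairwise similarity matrix; the greedy merge loop and all names are our illustration, not the authors' implementation (t_min defaults to the 120 s of Table 1):

```python
import numpy as np

def merge_events(events, times, sim, theta, theta_merge,
                 alpha_t=1.0, t_min=120.0):
    """Greedily merge events whose distance d^E_ij (Eq. 4) is below theta_merge.

    events -- list of lists of image indices (initial oversegmentation)
    times  -- per-image time stamps in seconds
    sim    -- pairwise image similarity matrix (s^V or s^S)
    """
    events = [list(e) for e in events]
    merged = True
    while merged:
        merged = False
        for a in range(len(events)):
            for b in range(a + 1, len(events)):
                Ea, Eb = events[a], events[b]
                # temporal term: mean-time difference, scaled by t_min
                dt = abs(np.mean([times[i] for i in Ea]) -
                         np.mean([times[i] for i in Eb]))
                # visual term: maximum similarity across the two events
                s_max = max(sim[i, j] for i in Ea for j in Eb)
                d_e = alpha_t * max(1.0, dt / t_min) * theta - s_max
                if d_e < theta_merge:  # merge criterion of Eq. (4)
                    events[a] = Ea + Eb
                    del events[b]
                    merged = True
                    break
            if merged:
                break
    return events
```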
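The offset estimation of Section 2.1 can be sketched as follows: build system (3) from the matching pairs, solve it, and iteratively discard the constraints with the largest residuals. The paper states that the Jacobi method is used; since plain Jacobi converges only under conditions such as diagonal dominance, the sketch below applies it to the normal equations A^T A δ = A^T τ, and drops 10% of the constraints per iteration. Both choices are our assumptions rather than the authors' exact procedure.

```python
import numpy as np

def estimate_offsets(pairs, n_galleries, n_iter=10, drop_frac=0.10,
                     jacobi_steps=200):
    """Estimate offsets (delta_2, ..., delta_N) from constraints delta_p - delta_q = tau.

    pairs -- list of (p, q, tau): gallery indices (1-based, gallery 1 is the
             reference with delta_1 = 0) and the time offset from a matching pair
    """
    def build(pairs):
        A = np.zeros((len(pairs), n_galleries - 1))
        b = np.array([tau for _, _, tau in pairs], dtype=float)
        for r, (p, q, _) in enumerate(pairs):
            if p > 1:
                A[r, p - 2] = 1.0   # g_p(i) contribution
            if q > 1:
                A[r, q - 2] = -1.0  # g_q(j) contribution
        return A, b

    delta = np.zeros(n_galleries - 1)  # warm start across iterations
    for _ in range(n_iter):
        A, b = build(pairs)
        # Jacobi iteration on the normal equations A^T A delta = A^T b
        M, y = A.T @ A, A.T @ b
        D = np.diag(M).copy()
        D[D == 0] = 1.0  # guard against galleries without constraints
        R = M - np.diagflat(np.diag(M))
        for _ in range(jacobi_steps):
            delta = (y - R @ delta) / D
        # remove the largest residuals as outliers (paper: "up to 10%")
        res = np.abs(A @ delta - b)
        keep = res.argsort()[: max(1, int(len(pairs) * (1 - drop_frac)))]
        pairs = [pairs[k] for k in sorted(keep)]
    return delta
```

The inner solve could equally be replaced by np.linalg.lstsq; the outer loop is what implements the outlier removal.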
3. EXPERIMENTS AND RESULTS

We submitted four runs, with the parameters listed in Table 1. One observation from the experiments on the test set is that full matching of SIFT descriptors is better suited for determining gallery offsets, which requires finding the single most similar image in the other gallery. In contrast, event clustering needs a more global notion of similarity, which is well covered by VLAD; thus we used VLAD similarities for event clustering in all runs. The results for synchronisation are shown in Figure 1, and those for clustering in Figure 2.

  run  set  vis. sim.  θ_S   θ_V   θ_G  α_t
  1    1,2  VLAD       n/a   1.70  2.5  1.0
  2    1    SIFT+VLAD  0.07  1.82  2.5  0.0
  2    2    SIFT+VLAD  0.08  1.80  2.5  1.0
  3    1    SIFT+VLAD  0.07  1.82  2.5  0.0
  3    2    SIFT+VLAD  0.08  1.85  2.5  0.0
  4    1    SIFT+VLAD  0.06  1.85  2.5  0.0
  4    2    SIFT+VLAD  0.08  1.80  2.5  0.0

Table 1: Parameters of the runs; t_min = 120 s and p = 10 in all runs.

[Figure 1: Results for synchronisation (precision and accuracy, Vancouver and London data sets, runs 1–4).]

[Figure 2: Results for clustering (Rand index, Jaccard index and F-measure, Vancouver and London data sets, runs 1–4).]

4. DISCUSSION

As already expected from the experiments on the development set, VLAD is not discriminative enough for determining the image pairs for synchronisation; thus the results of run 1 are much worse than the others. Our method only manages to synchronise a fraction of the galleries correctly; however, if a gallery is synchronised, the accuracy is rather high. The results for the London data set are clearly better than those for the Vancouver set. We think that this is not so much related to the similarity to the development set, but rather to the high visual similarity within the Winter Olympics data (e.g., all ice-based competitions have high similarity). For clustering, the differences are less clear: for runs 2 and 3, the Vancouver results are even better than the London ones according to the Jaccard index and F-measure. In general, the Rand index shows a quite different picture than the other two measures.

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 610370, "ICoSOLE – Immersive Coverage of Spatially Outspread Live Events" (http://www.icosole.eu/).

5. REFERENCES

[1] N. Conci, F. De Natale, and V. Mezaris. Synchronization of multi-user event media (SEM) at MediaEval 2014: Task description, datasets, and evaluation. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, 2012.
[3] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.