CERTH at MediaEval 2015 Synchronization of Multi-User Event Media Task

Konstantinos Apostolidis, CERTH-ITI, Thermi 57001, Greece, kapost@iti.gr
Vasileios Mezaris, CERTH-ITI, Thermi 57001, Greece, bmezaris@iti.gr

ABSTRACT
This paper describes the results of our participation in the Synchronization of Multi-User Event Media Task at the MediaEval 2015 challenge. Using multiple similarity measures, we identify pairs of similar media from different galleries. We use a graph-based approach to temporally synchronize user galleries; subsequently, we use time information, geolocation information and visual concept detection results to cluster all photos into different sub-events. Our method achieves good accuracy on considerably diverse datasets.

1. INTRODUCTION
People attending large events collect dozens of photos and video clips with their smartphones, tablets and cameras. These are later exchanged and shared in a number of different ways. The alignment and presentation of the media galleries of different users in a consistent way, so as to preserve the temporal evolution of the event, is not straightforward, considering that the time information attached to some of the captured media may be wrong and geolocation information may be missing. The 2015 MediaEval Synchronization of Multi-User Event Media (SEM) task tackles this exact problem [2].

2. METHOD OVERVIEW
The proposed method temporally aligns user galleries that are created by different digital capture devices, and clusters the time-aligned photos into event-related clusters. In the first stage, we assess media similarity by combining multiple similarity measures and by taking into account the geolocation metadata of photos. Similar media items of the different galleries are identified and used for constructing a graph, whose nodes represent galleries and whose edges represent the discovered similarities between media items of different galleries. Synchronization of the galleries is achieved by traversing the minimum spanning tree (MST) of the graph. Finally, we apply clustering techniques to split the media into different sub-events. Figure 1 illustrates the proposed method.

[Figure 1: System overview]

3. MEDIA SIMILARITY ASSESSMENT
To identify similar photos of different galleries, we combine the information of four similarity measures [1]:

1. Geometric Consistency of Local Features Similarity (GC): We check the geometric consistency of SIFT keypoints for each pair of photos, using geometric coding [6]. The GC similarity can discover near-duplicate photos.
2. Scene Similarity (S): We calculate the pairwise cosine distances between the extracted GIST descriptors [4] of the photos. High S similarity indicates photos captured at similar scenery (indoor, urban, nature).
3. Color Allocation Similarity (CA): We divide each image into three equal, non-overlapping horizontal strips and extract the HSV histogram of each. We calculate the pairwise cosine distances between the concatenations of the HSV histograms. High CA similarity indicates photos with similar colors.
4. DCNN Concept Scores Similarity (DCS): We use the Caffe DCNN [3] and the GoogLeNet pre-trained model [5] to extract concept scores for the photos, and use the Euclidean distance to calculate pairwise distances between the concept score vectors. High DCS similarity indicates semantically similar photos.

We calculate the aforementioned similarity measures on the photos of all galleries to be synchronized, and combine them using the following procedure: initially, the similarity O(i, j) of photos i and j is set equal to GC(i, j). Then, if S(i, j) > ts and S(i, j) > GC(i, j), O(i, j) is updated as O(i, j) = S(i, j). The same update process is subsequently repeated using the CA and DCS similarities (and the respective thresholds tc and td).

Subsequently, we weigh each similarity value so that the similarity of photos whose capture locations are closer than a threshold m is emphasized, while the similarity of photos whose capture locations lie significantly above this threshold is zeroed. Similar photos that belong to different user galleries are treated as potential links between these galleries.

To identify similar audio files of different galleries, we perform cross-correlation of the audio data, downsampled to an 11 kHz sampling rate. For video files, we select one frame for each second of video and resize it to a width of 1 pixel; to identify similar video files of different galleries, we then perform cross-correlation of the horizontally concatenated resized frames.

The ts, tc and td thresholds are empirically calculated over the training dataset. The m threshold is calculated by fitting a mixture of two Gaussian distributions to the histogram of the pairwise capture-location distances of all photos; the Gaussian with the lowest mean (m) presumably corresponds to photos captured in the same sub-event. Minimal sketches of the combination cascade, the location-based weighting, and the video matching are given below.
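To make the combination procedure concrete, the following is a minimal sketch, assuming the four measures have been precomputed as N x N matrices scaled to a common range; the function name and the matrix representation are ours, not part of the original implementation.

```python
import numpy as np

def combine_similarities(GC, S, CA, DCS, ts, tc, td):
    """Cascade of Section 3: start from geometric consistency (GC) and
    let scene (S), color (CA) and DCNN-concept (DCS) similarities
    override the running value O(i, j) whenever they exceed both their
    own threshold and the current O(i, j)."""
    O = GC.copy()
    for M, t in ((S, ts), (CA, tc), (DCS, td)):
        mask = (M > t) & (M > O)  # stronger than O and above its threshold
        O[mask] = M[mask]
    return O
```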
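The location-based weighting and the estimation of m can be sketched similarly. The scikit-learn mixture fit matches the description above, but the boost and cut-off factors are assumptions, since the paper does not quantify "emphasized" or "significantly above".

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_m(pairwise_location_dists):
    """Fit a two-component Gaussian mixture to the pairwise
    capture-location distances; the mean of the lower-mean
    component is taken as the threshold m."""
    d = np.asarray(pairwise_location_dists, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(d)
    return float(gmm.means_.min())

def weigh_by_location(O, dists, m, boost=1.5, cutoff=3.0):
    """Hypothetical weighting: emphasize pairs captured within m of
    each other, zero pairs far beyond it (boost and cutoff are
    assumed values, not taken from the paper)."""
    W = O.copy()
    W[dists < m] *= boost
    W[dists > cutoff * m] = 0.0
    return W
```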
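For the video matching step, a sketch under stated assumptions: frames are grayscale arrays already sampled at one per second, and OpenCV is used only for the resizing; the original implementation may differ.

```python
import numpy as np
import cv2  # assumed here only for frame resizing

def video_signature(frames):
    """Resize each sampled frame to a 1-pixel-wide column and
    concatenate the columns into a single 1-D signature."""
    cols = [cv2.resize(f, (1, f.shape[0])).ravel() for f in frames]
    return np.concatenate(cols).astype(float)

def best_lag(sig_a, sig_b):
    """Cross-correlate two signatures and return the lag (in samples)
    that maximizes their correlation."""
    corr = np.correlate(sig_a - sig_a.mean(), sig_b - sig_b.mean(), "full")
    return int(np.argmax(corr)) - (len(sig_b) - 1)
```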
4. TEMPORAL SYNCHRONIZATION
Having identified potential links for at least some gallery pairs, we construct a weighted graph whose nodes represent the galleries and whose edges represent the links between galleries. The weight assigned to each edge is calculated as the sum of the similarities of the photos linking the two galleries. Using this graph, the temporal offset of each gallery is computed against the reference gallery.

We compute the temporal offset of each gallery by traversing the minimum spanning tree (MST) of the gallery graph. This procedure (MSTt) can be summarized as follows: starting from the node corresponding to the reference gallery, we select the edge with the highest weight. We compute the temporal offset of the node on the other end of this edge as the median of the capture-time differences of the pairs of similar photos that this edge represents, and we add this node to the set of visited nodes. The selection of the edge with the highest weight is then repeated, considering any member of the set of visited nodes as a possible starting point, and the corresponding temporal offset is again computed, until all nodes are visited. This process is explained in more detail in [1]; a sketch of the traversal is given below.

The MSTt method calculates the offsets using only the shortest path from a visited node to any given node. We also explore a variation of the MSTt process as an alternative way of computing temporal offsets (MSTx): before traversing the MST of the graph, we detect fully-connected triplets of nodes and average the offset of the shortest path with that of the alternative path in each triplet, but only if the difference of the two paths is lower than a maxDiff threshold. By utilizing in the MSTx process some additional information that the MSTt method ignores, we expect to achieve better accuracy in time synchronization.
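A minimal sketch of the MSTt traversal follows. The edge data layout (a per-edge weight plus the capture-time differences, stored as t_b - t_a, of the photo pairs the edge represents) is our assumption; only the traversal order and the median rule come from the description above.

```python
import numpy as np

def mst_offsets(galleries, edges, reference):
    """MSTt sketch: grow the visited set from the reference gallery,
    always crossing the highest-weight edge that leaves it, and set
    each newly reached gallery's offset to the visited endpoint's
    offset plus the median capture-time difference on that edge."""
    offsets = {reference: 0.0}
    while len(offsets) < len(galleries):
        best = None
        for (a, b), (w, diffs) in edges.items():
            if (a in offsets) != (b in offsets):  # edge leaves visited set
                if best is None or w > best[0]:
                    best = (w, a, b, diffs)
        if best is None:
            break  # remaining galleries are disconnected
        w, a, b, diffs = best
        if a in offsets:
            src, dst, sign = a, b, 1.0
        else:
            src, dst, sign = b, a, -1.0  # diffs are stored as t_b - t_a
        offsets[dst] = offsets[src] + sign * float(np.median(diffs))
    return offsets
```

Selecting the highest-similarity edge at each step is Prim's algorithm on the similarity-weighted graph, i.e., it yields the MST once similarities are inverted into costs.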
5. SUB-EVENT CLUSTERING
After time synchronization, we cluster all photos into sub-events. Two different approaches were adopted. In the first approach (MPC), we apply the following procedure: at the first stage, we split the photo timeline wherever consecutive photos have a temporal distance above the mean of all temporal distances. At the second stage, geolocation information is used to further split the clusters of photos. During the third stage, clusters are merged using time and geolocation information; for the clusters that do not have geolocation information, the merging is continued by considering visual similarity. In the second approach (APC), we augment the DCNN feature vectors with the normalized time information and cluster the media using Affinity Propagation. Sketches of both approaches are given below.
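The first, time-based stage of MPC reduces to a few lines; the later geolocation- and visual-based split/merge stages are omitted here.

```python
import numpy as np

def mpc_time_split(capture_times):
    """MPC, first stage: start a new cluster wherever the gap between
    consecutive (time-sorted) photos exceeds the mean gap.
    Returns one cluster id per photo, in time-sorted order."""
    t = np.sort(np.asarray(capture_times, dtype=float))
    gaps = np.diff(t)
    return np.concatenate(([0], np.cumsum(gaps > gaps.mean())))
```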
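And a sketch of APC, assuming scikit-learn's AffinityPropagation; the time_weight balancing factor is our assumption, as the paper only states that the normalized time is appended to the feature vectors.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.preprocessing import minmax_scale

def apc_clustering(dcnn_features, capture_times, time_weight=1.0):
    """APC sketch: append the min-max-normalized (synchronized)
    capture time to each photo's DCNN feature vector and cluster
    with Affinity Propagation."""
    t = minmax_scale(np.asarray(capture_times, dtype=float)).reshape(-1, 1)
    X = np.hstack([np.asarray(dcnn_features, dtype=float), time_weight * t])
    return AffinityPropagation(random_state=0).fit_predict(X)
```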
6. RESULTS
We submitted 4 runs in total, combining the 2 methods for temporal synchronization and the 2 methods for sub-event clustering. The results of our approach for all datasets and all four runs are listed in Table 1. From the reported results, it is clear that our method achieved good accuracy but only managed to synchronize a small number of galleries, particularly in the TDF14 dataset. In sub-event clustering, the MPC method scored a slightly better F-score (column F1) for two of the datasets. The MSTt and MSTx methods performed the same because maxDiff was set too low (maxDiff = 10), which allowed only very small adjustments, thus degenerating the MSTx method to MSTt.

Table 1: Proposed method results.

Dataset   Run        Precision  Accuracy  F1
NAMM15    MSTt+APC   0.833      0.908     0.226
          MSTt+MPC   0.833      0.908     0.348
          MSTx+APC   0.833      0.908     0.226
          MSTx+MPC   0.833      0.908     0.348
TDF14     MSTt+APC   0.125      0.845     0.113
          MSTt+MPC   0.125      0.845     0.001
          MSTx+APC   0.125      0.845     0.113
          MSTx+MPC   0.125      0.845     0.001
STS       MSTt+APC   0.424      1.000     0.123
          MSTt+MPC   0.424      1.000     0.164
          MSTx+APC   0.424      1.000     0.123
          MSTx+MPC   0.424      1.000     0.164

7. CONCLUSIONS
In this paper, our framework and results at the MediaEval 2015 Synchronization of Multi-User Event Media Task were presented. Better fine-tuning of the algorithm parameters is required to achieve consistently good performance on diverse datasets. As future work, we are considering extending the algorithm towards automatic parameter selection (which could lead to selecting more links between galleries, thus improving precision), experimenting with different values of maxDiff, and applying a more sophisticated method for combining the different similarity measures.

8. ACKNOWLEDGMENTS
This work was supported by the European Commission under contract FP7-600826 ForgetIT.

9. REFERENCES
[1] K. Apostolidis and V. Mezaris. Using photo similarity and weighted graphs for the temporal synchronization of event-centered multi-user photo collections. In Proc. 2nd Workshop on Human Centered Event Understanding from Multimedia (HuEvent'15) at ACM Multimedia (MM'15), Brisbane, Australia, Oct. 2015.
[2] N. Conci, F. De Natale, V. Mezaris, and M. Matton. Synchronization of Multi-User Event Media (SEM) at MediaEval 2015: Task description, datasets, and evaluation. In Proc. MediaEval Workshop, Wurzen, Germany, Sept. 2015.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. ACM Int. Conf. on Multimedia, Nov. 2014.
[4] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145-175, 2001.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[6] W. Zhou, H. Li, Y. Lu, and Q. Tian. SIFT match verification by geometric coding for large-scale partial-duplicate web image search. ACM Trans. Multimedia Comput. Commun. Appl., 9(1):4:1-4:18, Feb. 2013.