=Paper=
{{Paper
|id=Vol-1263/paper75
|storemode=property
|title=Synchronizing Multi-User Photo Galleries with MRF
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_75.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SansoneBD14
}}
==Synchronizing Multi-User Photo Galleries with MRF==
Synchronizing Multi-User Photo Galleries with MRF Emanuele Sansone, Giulia Boato Minh-Son Dao DISI, University of Trento UIT-HCM Trento, Italy HCMC, Viet-Nam e.sansone@unitn.it, boato@disi.unitn.it sondm@uit.edu.vn ABSTRACT We present a novel solution to the MediaEval 2014 Event Synchronization Task: Synchronization of Multi-User Event Media (SEM). The framework is based on a probabilistic graphical model. Thanks to the simple topology of the graph, the estimation of the true temporal displacement Figure 1: Example of correspondences between two among multiple photo collections can be performed efficiently photo collections. The two stripes represent the two through exact inference. The underlying fitness function is photo collections, where the lower is the reference defined in a flexible way, for which it is possible to integrate and the upper is the gallery to be synchronized; ver- easily new information (e.g., text tags or social network tical lines correspond to pictures and arrows depict data). The flexibility makes the framework suitable and correspondences. adaptable to cope with many real situations. The method is evaluated on two datasets obtaining an overall accuracy of more than 85% in both cases. in the reference gallery (notice that N will be often smaller than M but this is not always true). Given the set of all possible temporal displacements, namely 1. INTRODUCTION {∆Ti : i = 1, . . . , Q}, and the sequence of all correspon- The problem of photo stream synchronization has been dences between pictures given the offset ∆Ti , namely x∆Ti = investigated in little work in the current literature. Never- (x∆T 1 i , . . . , x∆T j i , . . . , x∆Ti ∆Ti N ), where xj identifies the picture theless, it represents an open and attractive research topic, in the reference P associated with image p1j given ∆Ti , the 2 especially if one considers its potential applicability to the synchronization task can be cast into an optimization prob- context of online photo sharing communities. lem for finding the best offset ∆T ∗ . In other words, Indeed, a novel task on this issue has been introduced in MediaEval 2014 [4], where the scenario considered is repre- ∆T ∗ = arg max f (x∆Ti ) (1) ∆Ti sented by a number of users attending the same event and taking photos and videos with different non-synchronized where f : X → R is the function that associates a similarity devices. The goal of the task is twofold. score to each sequence of associations and X = {x∆Ti : i = The synchronization consists of finding the correct tempo- 1, . . . , Q}. ral offset of each photo collection, denoted as P 1 = (p11 , p12 , Now it is possible to define an undirected graphical model . . . , p1N ), with respect to a reference gallery, namely P 2 = through a sequence of observed nodes y = (y1 , . . . , yj , . . . , yN ), (p21 , p22 , . . . , p2M ), where N and M correspond to the lengths where yj refers to the image p1j in P 1 , and a sequence of la- of the two streams. tent variables x = (x1 , . . . , xj , . . . , xN ), whose admissible Once the sequential chronological order of all pictures is values are defined over the set X. The edges of the model restored, the clustering phase is evaluated based on the num- are of two kinds: links between nodes in x and y, that com- ber of sub-events detected as well as on the quality of the pare the similarity across photos of the two galleries, and obtained groupings. links between nodes in the same sequence x, which take into account the temporal structure of the collections. Fig. 2 summarizes the graphical model described so far. 2. PROPOSED APPROACH The joint distribution associated with the graph can there- The proposed framework for synchronization is based on fore be factorized in the following form: a probabilistic graphical model. Each temporal displace- ment can be uniquely identified by a set of nearest-neighbor 1 Y Y picture pairs across the two photo sets. Fig. 1 shows an p(x) = ψ(xj , xj+1 ) φ(xk , yk ) (2) Z j k example of this concept, where each image in the collection to be synchronized can be compared with the nearest image where ψ(xj , xj+1 ) is the potential associated with the link that connects xj and xj+1 and φ(xk , yk ) corresponds to the potential of the edge between xk and yk [3]. The distribution Copyright is held by the author/owner(s). p(x) can be interpreted as a function measuring the quality MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain of the alignment given an offset, and can be exploited as Table 1: Synchronization performance on the two datasets in terms of precision and accuracy. Dataset Precision Accuracy Vancouver 0.35 0.86 Figure 2: Markov network with observed and hidden London 0.25 0.89 nodes. Table 2: Clustering performance on the two datasets the objective f in Eq. (1). For the sake of computational in terms of Rand Index, Jaccard Index and F- simplicity, we define the potential functions belonging to the measure. expontential family, namely: Dataset Run RI JI F1 n h D (x , y ) 1 0.9749 0.1673 0.1433 H k k Vancouver 2 0.9737 0.1382 0.1214 φ(xk , yk ) = exp − α max + DH (k) 3 0.9730 0.1315 0.1162 DS (xk , yk ) DG (xk , yk ) io 1 0.9852 0.1287 0.1140 + β max +γ (3) DS (k) max DG (k) London 2 0.9836 0.0742 0.0691 n DT (xj , xj+1 ) o 3 0.9841 0.0885 0.0813 ψ(xj , xj+1 ) = exp − δ (4) DTmax (j) where DH , DS and DG represent distance metrics between scriptors (CSD) [1] with Local Binary Patterns [5]. The images computed on HSV color histograms, SURF descrip- second configuration consists only of 6 CSD values obtained tors [2] and GPS coordinates, respectively. In particular, by performing PCA reduction on the original CSD descrip- DH is obtained by first dividing each image into 9 blocks, by tor, while the last set up is obtained by adding the temporal computing the Hellinger distance between color histograms information to the second configuration. Table 2 shows the extracted from their respective blocks and by combining the results associated with the three different runs. In general, distances linearly assigning a higher weight to the central the inclusion of the temporal feature doesn’t significantly in- component. DS corresponds to the average of Euclidean crease the performance. But it’s evident that if one is able to distances evaluated over all pairs of matched salient points. carry out a precise synchronization, then temporal informa- Finally, DG is computed by approximating the Earth sur- tion becomes a very reliable feature to perform clustering. face as a sphere. DT is evaluated on timestamp information This is confirmed by the fact that the results are slightly according to the following relation: more than 10% in terms of F-measure and low-level visual features are therefore not sufficient. DT (xj , xj+1 ) = |tyj+1 − txj+1 | + |tyj − txj | (5) Future work will be devoted to increasing the synchroniza- Every distance measure is therefore normalized by its re- tion precision in order to allow the exploitation of the tempo- spective maximum value, and is finally combined linearly as ral component. One possible approach consists of modifying shown in Eqs. (3) and (4). the structure of the graphical model, such that new binary As far as the clustering challenge is concerned, the k- latent variables take into account the possibility of having means algorithm is used to find the natural grouping of im- no associations between photos. ages, see Section 3. 4. REFERENCES 3. RESULTS AND DISCUSSION [1] M. Baştan, H. Çam, U. Güdükbay, and Özgür Ulusoy. The learning of the parameters described in the previous BilVideo-7: An MPEG-7-Compatible Video Indexing section is carried out by performing an optimization of the and Retrieval System. IEEE MultiMedia, 17(3):62–73, joint distribution over the parameter solution space and us- 2009. ing the training dataset made available by [4]. The estimated [2] H. Bay, T. Tuytelaars, and L. Van Gool. Surf: Speeded values for the parameters are α = 2.4249, β = 0.9509, γ = Up Robust Features. In Computer Vision - ECCV 0.9594 and δ = 3.8597. Once the training phase has com- 2006. pleted, one can start to synchronize any pair of galleries. [3] C. M. Bishop et al. Pattern Recognition and Machine Results obtained in the SEM task for synchronization are Learning. Springer New York, 2006. summarized in Table 1. In both datasets, the accuracy of [4] N. Conci, F. D. Natale, and V. Mezaris. synchronization is greater than 85%, which proves the effec- Synchronization of Multi-User Event Media (SEM) at tiveness of the proposed algorithm. Nevertheless, the preci- MediaEval 2014: Task Description, Datasets, and sion obtained is quite poor since on average only one fourth Evaluation. In Proceedings of MediaEval 2014, of the photo collections are correctly synchronized. The low Barcelona, Spain, October 2014. performance is mainly due to forcing associations between [5] T. Ojala, M. Pietikäinen, and T. Mäenpää. Gray Scale images of the two galleries. In many cases, there are pictures and Rotation Invariant Texture Classification with that have no correspondence in the reference set. Local Binary Patterns. In Computer Vision - ECCV Once the galleries are synchronized, the corrected tem- 2000, volume 1842 of Lecture Notes in Computer poral information can be exploited to perform clustering. Science, pages 404–420. Springer Berlin Heidelberg, At this purpose, the k-means algorithm is used over three 2000. different combinations of features. The first configuration consists of the concatenation of Global Color Structure De-