=Paper=
{{Paper
|id=Vol-1263/paper75
|storemode=property
|title=Synchronizing Multi-User Photo Galleries with MRF
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_75.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SansoneBD14
}}
==Synchronizing Multi-User Photo Galleries with MRF==
<pdf width="1500px">https://ceur-ws.org/Vol-1263/mediaeval2014_submission_75.pdf</pdf>
<pre>
         Synchronizing Multi-User Photo Galleries with MRF

              Emanuele Sansone, Giulia Boato                                            Minh-Son Dao
                       DISI, University of Trento                                         UIT-HCM
                              Trento, Italy                                             HCMC, Viet-Nam
          e.sansone@unitn.it, boato@disi.unitn.it                                     sondm@uit.edu.vn


ABSTRACT
We present a novel solution to the MediaEval 2014 Event
Synchronization Task: Synchronization of Multi-User Event
Media (SEM). The framework is based on a probabilistic
graphical model. Thanks to the simple topology of the
graph, the estimation of the true temporal displacement               Figure 1: Example of correspondences between two
among multiple photo collections can be performed efficiently         photo collections. The two stripes represent the two
through exact inference. The underlying fitness function is           photo collections, where the lower is the reference
defined in a flexible way, for which it is possible to integrate      and the upper is the gallery to be synchronized; ver-
easily new information (e.g., text tags or social network             tical lines correspond to pictures and arrows depict
data). The flexibility makes the framework suitable and               correspondences.
adaptable to cope with many real situations. The method
is evaluated on two datasets obtaining an overall accuracy
of more than 85% in both cases.                                       in the reference gallery (notice that N will be often smaller
                                                                      than M but this is not always true).
                                                                        Given the set of all possible temporal displacements, namely
1.   INTRODUCTION                                                     {∆Ti : i = 1, . . . , Q}, and the sequence of all correspon-
    The problem of photo stream synchronization has been              dences between pictures given the offset ∆Ti , namely x∆Ti =
investigated in little work in the current literature. Never-         (x∆T
                                                                        1
                                                                           i
                                                                             , . . . , x∆T
                                                                                        j
                                                                                          i
                                                                                            , . . . , x∆Ti         ∆Ti
                                                                                                       N ), where xj   identifies the picture
theless, it represents an open and attractive research topic,         in the reference P associated with image p1j given ∆Ti , the
                                                                                                2

especially if one considers its potential applicability to the        synchronization task can be cast into an optimization prob-
context of online photo sharing communities.                          lem for finding the best offset ∆T ∗ . In other words,
    Indeed, a novel task on this issue has been introduced in
MediaEval 2014 [4], where the scenario considered is repre-                                ∆T ∗ = arg max f (x∆Ti )                      (1)
                                                                                                         ∆Ti
sented by a number of users attending the same event and
taking photos and videos with different non-synchronized              where f : X → R is the function that associates a similarity
devices. The goal of the task is twofold.                             score to each sequence of associations and X = {x∆Ti : i =
    The synchronization consists of finding the correct tempo-        1, . . . , Q}.
ral offset of each photo collection, denoted as P 1 = (p11 , p12 ,       Now it is possible to define an undirected graphical model
. . . , p1N ), with respect to a reference gallery, namely P 2 =      through a sequence of observed nodes y = (y1 , . . . , yj , . . . , yN ),
(p21 , p22 , . . . , p2M ), where N and M correspond to the lengths   where yj refers to the image p1j in P 1 , and a sequence of la-
of the two streams.                                                   tent variables x = (x1 , . . . , xj , . . . , xN ), whose admissible
    Once the sequential chronological order of all pictures is        values are defined over the set X. The edges of the model
restored, the clustering phase is evaluated based on the num-         are of two kinds: links between nodes in x and y, that com-
ber of sub-events detected as well as on the quality of the           pare the similarity across photos of the two galleries, and
obtained groupings.                                                   links between nodes in the same sequence x, which take into
                                                                      account the temporal structure of the collections. Fig. 2
                                                                      summarizes the graphical model described so far.
2.   PROPOSED APPROACH                                                   The joint distribution associated with the graph can there-
  The proposed framework for synchronization is based on              fore be factorized in the following form:
a probabilistic graphical model. Each temporal displace-
ment can be uniquely identified by a set of nearest-neighbor                                1 Y               Y
picture pairs across the two photo sets. Fig. 1 shows an                           p(x) =       ψ(xj , xj+1 )   φ(xk , yk )              (2)
                                                                                            Z j
                                                                                                                  k
example of this concept, where each image in the collection
to be synchronized can be compared with the nearest image              where ψ(xj , xj+1 ) is the potential associated with the link
                                                                      that connects xj and xj+1 and φ(xk , yk ) corresponds to the
                                                                      potential of the edge between xk and yk [3]. The distribution
Copyright is held by the author/owner(s).                             p(x) can be interpreted as a function measuring the quality
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain        of the alignment given an offset, and can be exploited as
                                                                   Table 1: Synchronization performance on the two
                                                                   datasets in terms of precision and accuracy.
                                                                                Dataset  Precision Accuracy
                                                                              Vancouver     0.35      0.86
Figure 2: Markov network with observed and hidden                               London      0.25      0.89
nodes.
                                                                   Table 2: Clustering performance on the two datasets
the objective f in Eq. (1). For the sake of computational          in terms of Rand Index, Jaccard Index and F-
simplicity, we define the potential functions belonging to the     measure.
expontential family, namely:                                               Dataset   Run     RI     JI      F1
                     n h D (x , y )                                                   1    0.9749 0.1673 0.1433
                             H    k   k                                  Vancouver    2    0.9737 0.1382 0.1214
   φ(xk , yk ) = exp − α      max
                                        +
                            DH      (k)                                               3    0.9730 0.1315 0.1162
                          DS (xk , yk )     DG (xk , yk ) io                          1    0.9852 0.1287 0.1140
                      + β max           +γ                   (3)
                           DS (k)             max
                                             DG    (k)                     London     2    0.9836 0.0742 0.0691
                      n    DT (xj , xj+1 ) o                                          3    0.9841 0.0885 0.0813
   ψ(xj , xj+1 ) = exp − δ                                   (4)
                              DTmax (j)
where DH , DS and DG represent distance metrics between            scriptors (CSD) [1] with Local Binary Patterns [5]. The
images computed on HSV color histograms, SURF descrip-             second configuration consists only of 6 CSD values obtained
tors [2] and GPS coordinates, respectively. In particular,         by performing PCA reduction on the original CSD descrip-
DH is obtained by first dividing each image into 9 blocks, by      tor, while the last set up is obtained by adding the temporal
computing the Hellinger distance between color histograms          information to the second configuration. Table 2 shows the
extracted from their respective blocks and by combining the        results associated with the three different runs. In general,
distances linearly assigning a higher weight to the central        the inclusion of the temporal feature doesn’t significantly in-
component. DS corresponds to the average of Euclidean              crease the performance. But it’s evident that if one is able to
distances evaluated over all pairs of matched salient points.      carry out a precise synchronization, then temporal informa-
Finally, DG is computed by approximating the Earth sur-            tion becomes a very reliable feature to perform clustering.
face as a sphere. DT is evaluated on timestamp information         This is confirmed by the fact that the results are slightly
according to the following relation:                               more than 10% in terms of F-measure and low-level visual
                                                                   features are therefore not sufficient.
         DT (xj , xj+1 ) = |tyj+1 − txj+1 | + |tyj − txj |   (5)      Future work will be devoted to increasing the synchroniza-
Every distance measure is therefore normalized by its re-          tion precision in order to allow the exploitation of the tempo-
spective maximum value, and is finally combined linearly as        ral component. One possible approach consists of modifying
shown in Eqs. (3) and (4).                                         the structure of the graphical model, such that new binary
  As far as the clustering challenge is concerned, the k-          latent variables take into account the possibility of having
means algorithm is used to find the natural grouping of im-        no associations between photos.
ages, see Section 3.
                                                                   4.   REFERENCES
3.   RESULTS AND DISCUSSION                                        [1] M. Baştan, H. Çam, U. Güdükbay, and Özgür Ulusoy.
   The learning of the parameters described in the previous            BilVideo-7: An MPEG-7-Compatible Video Indexing
section is carried out by performing an optimization of the            and Retrieval System. IEEE MultiMedia, 17(3):62–73,
joint distribution over the parameter solution space and us-           2009.
ing the training dataset made available by [4]. The estimated      [2] H. Bay, T. Tuytelaars, and L. Van Gool. Surf: Speeded
values for the parameters are α = 2.4249, β = 0.9509, γ =              Up Robust Features. In Computer Vision - ECCV
0.9594 and δ = 3.8597. Once the training phase has com-                2006.
pleted, one can start to synchronize any pair of galleries.        [3] C. M. Bishop et al. Pattern Recognition and Machine
   Results obtained in the SEM task for synchronization are            Learning. Springer New York, 2006.
summarized in Table 1. In both datasets, the accuracy of           [4] N. Conci, F. D. Natale, and V. Mezaris.
synchronization is greater than 85%, which proves the effec-           Synchronization of Multi-User Event Media (SEM) at
tiveness of the proposed algorithm. Nevertheless, the preci-           MediaEval 2014: Task Description, Datasets, and
sion obtained is quite poor since on average only one fourth           Evaluation. In Proceedings of MediaEval 2014,
of the photo collections are correctly synchronized. The low           Barcelona, Spain, October 2014.
performance is mainly due to forcing associations between          [5] T. Ojala, M. Pietikäinen, and T. Mäenpää. Gray Scale
images of the two galleries. In many cases, there are pictures         and Rotation Invariant Texture Classification with
that have no correspondence in the reference set.                      Local Binary Patterns. In Computer Vision - ECCV
   Once the galleries are synchronized, the corrected tem-             2000, volume 1842 of Lecture Notes in Computer
poral information can be exploited to perform clustering.              Science, pages 404–420. Springer Berlin Heidelberg,
At this purpose, the k-means algorithm is used over three              2000.
different combinations of features. The first configuration
consists of the concatenation of Global Color Structure De-

</pre>