CEUR-WS Vol-1436, Paper 55: https://ceur-ws.org/Vol-1436/Paper55.pdf
      JRS at Synchronization of Multi-user Event Media Task

                           Hannes Fassold, Harald Stiegler, Felix Lee, Werner Bailer
                                              JOANNEUM RESEARCH – DIGITAL
                                              Steyrergasse 17, 8010 Graz, Austria
                                           {firstname.lastname}@joanneum.at



ABSTRACT

The event synchronisation task addresses the problem of aligning media (i.e., photo and video) streams ("galleries") from different users temporally and identifying coherent events in the streams. Our approach uses the visual similarity of image/key frame pairs based on full matching of SIFT descriptors with geometric verification. Based on the visual similarity and the given time information, a probabilistic algorithm is employed, where in each run a hypothesis is calculated for the set of time offsets with respect to the reference gallery. From the gathered hypotheses, the final set of time offsets is calculated as the medoid of all hypotheses.

1.    INTRODUCTION

The event synchronisation task addresses the problem of aligning media streams (referred to as galleries) from different users temporally and identifying coherent events in the streams. This paper describes the work done by the JRS team for the two subtasks of determining the time offsets of galleries and clustering the images and videos into events. Details on the task and the data set can be found in [1].

2.    APPROACH

2.1    Determining Gallery Offsets

Our approach utilizes the visual information (the captured images and the key frames extracted from the videos) and the given time stamps in a probabilistic way. The absolute time stamps are not considered reliable in this task; however, their relative distances within the gallery of one user can be exploited.

We denote the galleries as G_0..M (with G_0 as the reference gallery), each G_k containing a set of images or key frames I_1..N_k. For every image, several thousand SIFT descriptors [3] are extracted. A GPU-accelerated implementation is used to speed up descriptor extraction and matching [2].

For a pair of galleries (k, l), for each image I_i in G_k its best-matching image I_j in G_l is identified via exhaustive matching of their respective SIFT descriptors. For each match (I_i, I_j), a geometric verification step is applied, yielding a variable number of homographies along with the number of points h_t supporting the respective homography. The visual similarity s_i,j for the image pair is calculated as follows. First, all homographies with h_t < τ are discarded. From the remaining ones, the k highest values h_t are selected. The selected values are clipped to a range [h_min, h_max], and the arithmetic average h_avg and the sum h_sum of the clipped values are calculated. The visual similarity s_i,j is obtained as the geometric average of h_avg and h_sum.

Our general approach is a probabilistic method: a significant number of potential solutions (hypotheses) is calculated, and from these hypotheses the "most inner" one (in a sense explained below) is taken as the final solution. Such a probabilistic approach is more robust against outliers in the data. As a preprocessing step, we calculate a connection magnitude c_k,l for each gallery pair (k, l) in order to steer the random picking of gallery pairs towards the more "stable" ones (e.g., gallery pairs with a high number of matches and a low deviation of the time difference values between the matches). The connection magnitude is calculated as the geometric average of the number of identified matches between the galleries (based on visual similarity), the average visual similarity score of the matches, and the reciprocal of the average deviation of the time differences between the matches.

One potential solution is a vector of time differences D' = (δ_1, ..., δ_M) between the M galleries and the reference gallery G_0. For generating one potential solution D', we proceed as follows. First, a random gallery pair (k, l) is picked, with probability proportional to its connection magnitude c_k,l, so that the random picking is steered towards more stable gallery pairs. To probabilistically determine the time difference δ_k,l between the two galleries, we first apply k-means clustering to the time difference values of all matches, where k is typically in the range 3 to 5. Then we randomly pick one of the cluster centers and set it as δ_k,l. Having calculated δ_k,l, we can propagate this value recursively and calculate unknown values δ_k',l via the relation

    δ_k,l = δ_k,k' + δ_k',l,    (1)

which follows directly from the definition of the time differences. By iterating this process of randomly selecting a gallery pair and calculating δ_k,l a total of M − 1 times, we obtain one potential solution D'.

To calculate the final solution D, we generate a set of several thousand potential solutions D' (each being a vector of time differences) in the way described above. From these, we determine the final solution D as the medoid of all potential solutions. In a certain sense, this is the "most inner" solution when interpreting the potential solutions as vectors.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.
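The medoid selection can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; it assumes Euclidean distance between hypothesis vectors, which the paper does not specify.

```python
import numpy as np

def medoid(hypotheses: np.ndarray) -> np.ndarray:
    """Return the medoid of a set of offset hypotheses.

    hypotheses: array of shape (N, M), one row per potential
    solution D' = (delta_1, ..., delta_M).  The medoid is the row
    whose summed distance to all other rows is minimal -- the
    "most inner" vector of the set.
    """
    # Pairwise Euclidean distances between all hypothesis vectors.
    diffs = hypotheses[:, None, :] - hypotheses[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)  # shape (N, N)
    # Row with the smallest total distance to all others.
    return hypotheses[dists.sum(axis=1).argmin()]
```

The quadratic number of pairwise distances is unproblematic for the few thousand hypotheses generated here.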
[Two bar charts: precision, recall and F-score for run 1 and run 2 on each data set.]

Figure 1: Results for subevent clustering TDF14 (left) and NAMM15 (right).
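For reference, the per-image-pair visual similarity described in Section 2.1 can be sketched as below. This is an illustrative reimplementation, not the authors' code; the function name and the default values for τ, k and the clipping range are our assumptions, as the paper does not state the parameter values used.

```python
import numpy as np

def visual_similarity(support_counts, tau=10, k=3, h_min=10, h_max=200):
    """Similarity s_i,j of an image pair from homography support counts h_t.

    Discard homographies supported by fewer than tau points, keep the
    k largest remaining counts, clip them to [h_min, h_max], and combine
    the arithmetic mean and the sum of the clipped values by their
    geometric average.  All parameter defaults are illustrative.
    """
    # Keep counts surviving the threshold, largest first, at most k.
    h = np.sort([c for c in support_counts if c >= tau])[::-1][:k]
    if h.size == 0:
        return 0.0
    h = np.clip(h, h_min, h_max)
    # Geometric average of mean and sum of the clipped values.
    return float(np.sqrt(h.mean() * h.sum()))
```

For example, with support counts (5, 50, 100, 300) the count 5 is discarded and 300 is clipped to 200, giving the geometric average of 350/3 and 350.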


[Bar chart: precision and accuracy for TDF14 and NAMM15.]

Figure 2: Results for synchronisation.

2.2    Clustering Events

For the event clustering, we rely solely on the time information. The time stamps of each gallery are corrected by the offset calculated for that gallery with respect to the reference gallery. Based on the corrected time information, a one-dimensional k-means clustering algorithm is applied, with k ranging between 30 and 100. The value is determined based on the size of the data set (the total number of images in all galleries) and a user parameter which specifies the desired granularity of the subevents.

3.    EXPERIMENTS AND RESULTS

We submitted two runs, which use the same parameters for determining the time offsets. The clustering differs: k for run 2 is double the value used for run 1, so run 2 corresponds to a finer granularity of the subevents compared to run 1. Unfortunately, the official submissions only contained the results for the still image data sets (Tour de France, NAMM), but not for the videos.

Figure 2 shows the results for synchronisation. For both data sets, accuracy is clearly higher than precision. This means that our approach tends to optimise for a globally lower synchronisation error at the cost of higher individual errors for some galleries. While precision is significantly lower for NAMM than for TDF, accuracy has actually increased. One reason for this may be that local features of bikes and bikers match quite well across many images (which can be seen from the high visual similarity values s_i,j for these images), so that visual matching provides a weaker constraint than on visually more diverse data.

The results for subevent clustering are shown in Figure 1. One interesting observation is that while the F1 score is on a comparable level for both data sets, precision and recall are quite balanced for NAMM, but biased towards higher precision for TDF. Interestingly, varying the parameter between the two runs does not change this behaviour. For both parameterisations the method tends to oversegment the TDF data. The impact of synchronisation errors on the clustering result seems to be limited, as no direct relation is apparent from the results.

4.    CONCLUSION

The proposed method performs quite well in minimising the overall synchronisation error, but at the expense of more galleries that exceed the error threshold. For the subevent clustering, a better automatic adaptation of the number of clusters to the data set is needed, in order to avoid oversegmentation such as on the TDF data.

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 610370, "ICoSOLE – Immersive Coverage of Spatially Outspread Live Events" (http://www.icosole.eu/).

5.    REFERENCES

[1] Nicola Conci, Francesco De Natale, Vasileios Mezaris, and Mike Matton. Synchronization of Multi-User Event Media at MediaEval 2015: Task Description, Datasets, and Evaluation. In MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015.

[2] Hannes Fassold and Jakub Rosner. A real-time GPU implementation of the SIFT algorithm for large-scale video analysis tasks. In Real-Time Image and Video Processing, San Francisco, CA, USA, 2015.

[3] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.