CEUR-WS Vol-1436, Paper 55: https://ceur-ws.org/Vol-1436/Paper55.pdf
      JRS at Synchronization of Multi-user Event Media Task

                           Hannes Fassold, Harald Stiegler, Felix Lee, Werner Bailer
                                              JOANNEUM RESEARCH – DIGITAL
                                              Steyrergasse 17, 8010 Graz, Austria
                                           {firstname.lastname}@joanneum.at



ABSTRACT

The event synchronisation task addresses the problem of aligning media (i.e., photo and video) streams ("galleries") from different users temporally and identifying coherent events in the streams. Our approach uses the visual similarity of image/key frame pairs based on full matching of SIFT descriptors with geometric verification. Based on the visual similarity and the given time information, a probabilistic algorithm is employed, where in each run a hypothesis is calculated for the set of time offsets with respect to the reference gallery. From the gathered hypotheses, the final set of time offsets is calculated as the medoid of all hypotheses.

1.    INTRODUCTION

The event synchronisation task addresses the problem of aligning media streams (referred to as galleries) from different users temporally and identifying coherent events in the streams. This paper describes the work done by the JRS team for the two subtasks of determining the time offsets of galleries and clustering the images and videos into events. Details on the task and the data set can be found in [1].

2.    APPROACH

2.1    Determining Gallery Offsets

Our approach utilizes the visual information (the captured images and the key frames extracted from the videos) and the given time stamps in a probabilistic way. The absolute time stamps are not considered reliable in this task; however, their relative distances within the gallery of one user can be exploited.

We denote the galleries as G_0..M (with G_0 as the reference gallery), each G_k containing a set of images or key frames I_1..N_k. For every image, several thousand SIFT descriptors [3] are extracted. A GPU-accelerated implementation is used to speed up descriptor extraction and matching [2].

For a pair of galleries (k, l), for each image I_i in G_k its best-matching image I_j in G_l is identified via exhaustive matching of their respective SIFT descriptors. For each match (I_i, I_j), a geometric verification step is applied, yielding a variable number of homographies along with the number of points h_t supporting the respective homography. The visual similarity s_i,j for the image pair is calculated as follows. First, all homographies with h_t < τ are discarded. From the remaining ones, the k highest values h_t are selected. The selected values are clipped to a range [h_min, h_max], and the arithmetic average h_avg and the sum h_sum of the clipped values are calculated. The visual similarity s_i,j is obtained as the geometric average of h_avg and h_sum.

Our general approach is a probabilistic method: a significant number of potential solutions (hypotheses) is calculated, and from these hypotheses the "most inner" one (in a sense explained below) is taken as the final solution. Such a probabilistic approach is more robust against outliers in the data. As a preprocessing step, we calculate a connection magnitude c_k,l for each gallery pair (k, l) in order to steer the random picking of gallery pairs towards the more "stable" ones (e.g., gallery pairs with a high number of matches and a low deviation of the time difference values between the matches). The connection magnitude is calculated as the geometric average of the number of identified matches between the galleries (based on visual similarity), the average visual similarity score of the matches, and the reciprocal of the average deviation of the time differences between the matches.

One potential solution is a vector of time differences D' = (δ_1, ..., δ_M) between the M galleries and the reference gallery G_0. For generating one potential solution D', we proceed as follows. First, a random gallery pair (k, l) is picked, with probability proportional to its connection magnitude c_k,l, so that the random picking is steered towards more stable gallery pairs. To probabilistically determine the time difference δ_k,l between the two galleries, we first apply k-means clustering to the time difference values of all matches, where k is typically in the range 3 to 5. Then we randomly pick one of the cluster centers and set it as δ_k,l. Having calculated δ_k,l, we can propagate this value recursively and calculate unknown values δ_k',l via the relation

    δ_k,l = δ_k,k' + δ_k',l,    (1)

which follows directly from the definition of the time differences. By iterating this process of randomly selecting a gallery pair and calculating δ_k,l a total of M − 1 times, we obtain one potential solution D'.

To calculate the final solution D, we generate a set of several thousand potential solutions D' (each being a vector of time differences) in the way described above. From these, we determine the final solution D as the medoid of all potential solutions. In a certain sense, this is the "most inner" solution when interpreting the potential solutions as vectors.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.
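The medoid selection can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; it assumes Euclidean distance between hypothesis vectors, which the paper does not specify.

```python
import numpy as np

def medoid(hypotheses: np.ndarray) -> np.ndarray:
    """Return the medoid of a set of offset hypotheses.

    hypotheses: array of shape (N, M), one row per potential
    solution D' = (delta_1, ..., delta_M).  The medoid is the row
    whose summed distance to all other rows is minimal -- the
    "most inner" vector of the set.
    """
    # Pairwise Euclidean distances between all hypothesis vectors.
    diffs = hypotheses[:, None, :] - hypotheses[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)  # shape (N, N)
    # Row with the smallest total distance to all others.
    return hypotheses[dists.sum(axis=1).argmin()]
```

The quadratic number of pairwise distances is unproblematic for the few thousand hypotheses generated here.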
[Two bar charts: precision, recall and F-score for run 1 and run 2 on each data set.]

Figure 1: Results for subevent clustering TDF14 (left) and NAMM15 (right).
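For reference, the per-image-pair visual similarity described in Section 2.1 can be sketched as below. This is an illustrative reimplementation, not the authors' code; the function name and the default values for τ, k and the clipping range are our assumptions, as the paper does not state the parameter values used.

```python
import numpy as np

def visual_similarity(support_counts, tau=10, k=3, h_min=10, h_max=200):
    """Similarity s_i,j of an image pair from homography support counts h_t.

    Discard homographies supported by fewer than tau points, keep the
    k largest remaining counts, clip them to [h_min, h_max], and combine
    the arithmetic mean and the sum of the clipped values by their
    geometric average.  All parameter defaults are illustrative.
    """
    # Keep counts surviving the threshold, largest first, at most k.
    h = np.sort([c for c in support_counts if c >= tau])[::-1][:k]
    if h.size == 0:
        return 0.0
    h = np.clip(h, h_min, h_max)
    # Geometric average of mean and sum of the clipped values.
    return float(np.sqrt(h.mean() * h.sum()))
```

For example, with support counts (5, 50, 100, 300) the count 5 is discarded and 300 is clipped to 200, giving the geometric average of 350/3 and 350.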


[Bar chart: precision and accuracy for TDF14 and NAMM15.]

Figure 2: Results for synchronisation.

2.2    Clustering Events

For the event clustering, we rely solely on the time information. The time stamps of each gallery are corrected by the offset calculated for that gallery with respect to the reference gallery. Based on the corrected time information, a one-dimensional k-means clustering algorithm is applied, with k ranging between 30 and 100. The value is determined based on the size of the data set (the total number of images in all galleries) and a user parameter which specifies the desired granularity of the subevents.

3.    EXPERIMENTS AND RESULTS

We submitted two runs, which use the same parameters for determining the time offsets. The clustering differs: k for run 2 is double the value used for run 1, so run 2 corresponds to a finer granularity of the subevents compared to run 1. Unfortunately, the official submissions only contained the results for the still image data sets (Tour de France, NAMM), but not for the videos.

Figure 2 shows the results for synchronisation. For both data sets, accuracy is clearly higher than precision. This means that our approach tends to optimise for a globally lower synchronisation error at the cost of higher individual errors for some galleries. While precision is significantly lower for NAMM than for TDF, accuracy has actually increased. One reason for this may be that local features of bikes and bikers match quite well across many images (which can be seen from the high visual similarity values s_i,j for these images), so that visual matching provides a weaker constraint than on visually more diverse data.

The results for subevent clustering are shown in Figure 1. One interesting observation is that while the F1 score is on a comparable level for both data sets, precision and recall are quite balanced for NAMM, but biased towards higher precision for TDF. Interestingly, varying the parameter between the two runs does not change this behaviour. For both parameterisations the method tends to oversegment the TDF data. The impact of synchronisation errors on the clustering result seems to be limited, as no direct relation is apparent from the results.

4.    CONCLUSION

The proposed method performs quite well in minimising the overall synchronisation error, but at the expense of more galleries that exceed the error threshold. For the subevent clustering, a better automatic adaptation of the number of clusters to the data set is needed, in order to avoid oversegmentation such as on the TDF data.

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 610370, "ICoSOLE – Immersive Coverage of Spatially Outspread Live Events" (http://www.icosole.eu/).

5.    REFERENCES

[1] Nicola Conci, Francesco De Natale, Vasileios Mezaris, and Mike Matton. Synchronization of Multi-User Event Media at MediaEval 2015: Task Description, Datasets, and Evaluation. In MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015.

[2] Hannes Fassold and Jakub Rosner. A real-time GPU implementation of the SIFT algorithm for large-scale video analysis tasks. In Real-Time Image and Video Processing, San Francisco, CA, USA, 2015.

[3] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.