           EUMSSI team at the MediaEval Person Discovery
                             Challenge

                        Nam Le1,2 , Di Wu1 , Sylvain Meignier3 , Jean-Marc Odobez1,2
                              1 Idiap Research Institute, Martigny, Switzerland
                        2 École Polytechnique Fédérale de Lausanne, Switzerland
                            3 LIUM, University of Maine, Le Mans, France
                      {nle, dwu, odobez}@idiap.ch, sylvain.meignier@univ-lemans.fr

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
We present the results of the EUMSSI team's participation in the Multimodal Person Discovery task at the MediaEval 2015 challenge. The goal is to identify all people who simultaneously appear and speak in a video corpus, which implicitly involves both the audio and the visual stream. We emphasize improving each modality separately and benchmarking the modalities to analyze their pros and cons.

1.   INTRODUCTION
   Nowadays, viewers, journalists, and archivists have access to a vast amount of multimedia data. The need for browsing and retrieval tools for these archives has led researchers to devote effort to developing technologies that create searchable indices [14]. In this view, as humans are very interested in other people while consuming multimedia content, algorithms that index the identities of people and retrieve their respective quotations are indispensable for searching archives. This practical need leads to research problems on how to identify the presence of people in videos and answer 'who appears when?' or 'who speaks when?'.
   In particular, in the MediaEval Person Discovery task, the goal is the following. Given raw TV broadcasts, each shot must be automatically tagged with the name(s) of the people who can be both seen and heard in the shot. The list of people is not known a priori, and their names must be discovered in an unsupervised way from video text overlay or speech transcripts. This situation corresponds to cases where, at the moment a piece of content is created or broadcast, some of the appearing people are relatively unknown but may later become a trending topic on social networks or search engines. In addition, to ensure high-quality indexes, algorithms should also help human annotators double-check these indexes by providing evidence of the claimed identity (especially for people who are not yet famous).

2.   PROPOSED SYSTEM
   The EUMSSI team participated in order to assess the different modules developed by the authors in the past [11, 7, 8, 17, 4]. In this view, starting from the baseline provided by the organizer, the goal was to replace baseline components with the team's components, whenever they had been made compatible and their processing speed was sufficient for the data provided in the challenge, and to test their performance in order to understand their advantages. The system, as illustrated in Fig. 1, consists of two main stages. The first stage detects and clusters speakers, faces, and overlaid person names, including extracting Named Entities (NE). The second stage associates speakers with faces using co-occurrence statistics, and the overlaid person names are propagated to the speakers, or faces, in order to give the identities of the persons in the show.

[Figure 1: Architecture of proposed system]

2.1   Speaker diarization
   The speaker diarization system ("who speaks when?") is based on the LIUM Speaker Diarization system [16], which is freely distributed (www-lium.univ-lemans.fr/en/content/liumspkdiarization). This system achieved the best or second best results in the speaker diarization task of the REPERE French broadcast evaluation campaigns of 2012 and 2013 [6].
   The diarization system is first composed of an acoustic Bayesian Information Criterion (BIC)-based segmentation followed by a BIC-based hierarchical clustering. Each cluster represents a speaker and is modeled with a full-covariance Gaussian. A Viterbi decoding then re-segments the signal using, for each cluster, a GMM with 8 diagonal components learned by EM-ML. Segmentation, clustering, and decoding are performed with 12 MFCC+E features, computed at a 10 ms frame rate. Music and jingle regions are removed using a Viterbi decoding with 8 GMMs (trained on French broadcast news data) for music, jingle, silence, and speech (with wide/narrow band variants for the last two, and clean, noised, or musical background variants for wideband speech).
   In the above steps, the features were used unnormalized in order to preserve information about the background environment, which may help differentiate between speakers. At this point, however, each cluster contains the voice of only one speaker, but several clusters can relate to the same speaker. The background environment contribution must therefore be removed from each GMM cluster, through feature gaussianization. Finally, the system is completed with a clustering method based on the i-vector paradigm and Integer Linear Programming (ILP). This clustering method is fully described in [17] and [4]. The ILP clustering along with i-vector speaker models gives better results than the usual hierarchical agglomerative clustering based on GMMs and cross-likelihood distances [1].
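   To make the merge decision concrete, the following is a minimal sketch of the Delta-BIC criterion underlying BIC-based segmentation and hierarchical clustering (an illustration only, not the LIUM implementation; the function name and the penalty weight lam are assumptions):

import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC between modelling the feature frames x and y (arrays of
    shape (n_frames, n_dims)) jointly vs. separately with full-covariance
    Gaussians; a negative value favours merging the two segments."""
    n_x, d = x.shape
    n_y = y.shape[0]
    n = n_x + n_y
    # log-determinants of the maximum-likelihood covariance estimates
    ld_x = np.linalg.slogdet(np.cov(x, rowvar=False, bias=True))[1]
    ld_y = np.linalg.slogdet(np.cov(y, rowvar=False, bias=True))[1]
    z = np.vstack([x, y])
    ld_z = np.linalg.slogdet(np.cov(z, rowvar=False, bias=True))[1]
    # complexity penalty: a full-covariance Gaussian has d + d(d+1)/2 parameters
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * ld_z - n_x * ld_x - n_y * ld_y) - lam * penalty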
2.2   Face diarization
   Given the video shots, the face diarization process consists of (i) face detection, detecting the faces appearing within each shot, (ii) face tracking, extending detections into continuous tracks within each shot, and (iii) face clustering, grouping all tracks with the same identity into clusters.

Face detection. Detecting faces in broadcast media can be challenging due to the wide range of media content. Faces can appear in widely different situations with varied illumination and noise, such as in the studio, during live coverage, or during political debates. To overcome these challenges, we employ the deformable part-based model (DPM) [5, 12], which can detect faces at multiple poses and under appearance variations. Because the main disadvantage of DPM is its long running time, the face detector is only applied twice per second.
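   The sparse application of the detector can be sketched as follows (a minimal illustration of the stated 2 Hz rate; detect_faces stands in for any slow DPM-style detector and is an assumption, not the authors' code):

import cv2

def detect_at_2fps(video_path, detect_faces):
    """Run a (slow) face detector on roughly two frames per second."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if fps is unknown
    step = max(int(round(fps / 2.0)), 1)      # one detection every ~0.5 s
    detections, idx = [], 0                   # (frame_index, [boxes]) pairs
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            detections.append((idx, detect_faces(frame)))
        idx += 1
    cap.release()
    return detections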
Face tracking. The goal of this step is to create continuous face tracks within each video shot, which requires associating individual detections. Because of the long gaps between detected faces, we exploit long-term connectivity using CRF-based multi-target tracking [10]. This framework relies on the unsupervised learning of time-sensitive association costs for different features. First, similarities between detections are computed based on low-level features (color histogram, position, motion, SURF keypoint descriptors), which can be computed quickly. Then, for each feature type, the corresponding pairwise factor of the CRF is defined as the probability of the similarity measurements between pairs of detections under two distinct hypotheses: that they correspond to the same label or not. By optimizing a graph labeling posterior, we assign the same label to detections belonging to the same face, and different labels to different faces.
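   The role of these pairwise factors can be sketched as a per-feature log-likelihood ratio accumulated into an association cost (a simplified analogue of [10]; the histogram-based likelihood models, the 25 fps gap binning, and all names are illustrative assumptions):

import numpy as np

def pairwise_cost(sims, gap, models):
    """sims: {feature_name: similarity in [0, 1]} for one detection pair;
    gap: temporal distance in frames; models[feature][gap_bin] holds a
    (p_same, p_diff) pair of binned likelihood histograms."""
    gap_bin = min(gap // 25, 9)          # coarse time-sensitive binning
    cost = 0.0
    for name, s in sims.items():
        p_same, p_diff = models[name][gap_bin]
        b = min(int(s * len(p_same)), len(p_same) - 1)
        cost += np.log(p_same[b] + 1e-9) - np.log(p_diff[b] + 1e-9)
    return cost                          # > 0 favours the "same face" label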
Face clustering. Given the face tracks across all video shots, we hierarchically merge face tracks using matching and biometric similarity measures [11]. The matching cluster similarity is calculated from the average of the distances between sparse keypoints of two clusters. Meanwhile, the biometric model-based similarity measures how likely the features densely extracted from one cluster are to belong to the model of the other cluster, as compared to the likelihood given by the statistical model, and vice versa. Face tracks are first clustered using only feature-based matching, yielding clusters with sufficient data to adapt the biometric models. Then, the model-based similarity is combined with the matching similarity to merge clusters until the stopping criteria are met. Similarly to speaker diarization, face diarization produces face segments during which distinct identities appear.
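   The two-stage merging can be sketched as one agglomerative loop run twice with different similarity functions, first matching-only and then the combined score (a simplified analogue of [11]; the list-based cluster representation and the threshold are assumptions):

def agglomerate(clusters, similarity, stop_threshold):
    """Greedily merge the most similar pair of clusters (lists of face
    tracks) until no pair exceeds the stopping threshold."""
    while len(clusters) > 1:
        best_score, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i], clusters[j])
                if best_score is None or s > best_score:
                    best_score, best_pair = s, (i, j)
        if best_score < stop_threshold:   # stopping criterion met
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters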
2.3   Person Naming
Identity candidate retrieval. Overlaid person names (OPNs) can be extracted more reliably using Optical Character Recognition (OCR) techniques [2, 13] than from automatic speech transcripts. Therefore, we only exploit the named entities detected from the OCR output by [3] as potential identity candidates.

Direct one-to-one tagging. As mentioned earlier, our goal is to benchmark the improvements of each modality in the system. Hence, we make the assumption that the temporal clusters produced by the diarization processes are trustworthy. In this work, we use a simple one-to-one naming method provided by [15], which finds the mapping between clusters and named entities that maximizes the co-occurrences between them.
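   One way to realize such a one-to-one mapping is as a bipartite assignment that maximizes the total cluster/name co-occurrence, e.g. with the Hungarian algorithm (a sketch of an analogue, not the exact method of [15]):

import numpy as np
from scipy.optimize import linear_sum_assignment

def name_clusters(cooc):
    """cooc[i, j]: co-occurrence duration between cluster i and name j.
    Returns a one-to-one {cluster: name} mapping for co-occurring pairs."""
    rows, cols = linear_sum_assignment(-cooc)  # maximize total co-occurrence
    return {i: j for i, j in zip(rows, cols) if cooc[i, j] > 0}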
3.   EXPERIMENTS
   We evaluated 3 methods: SpkDia, FaceDia, and SpkFace. In SpkDia, we apply naming based on audio information only (this is equivalent to the assumption that all speakers who are associated with a name are visible and speaking); this is our primary submission for the challenge. Second, in FaceDia, we apply naming based on visual information only, and assume that all visible faces (which are associated with a name) are talking. Third, in SpkFace, we apply naming based on audio information only, but check whether there are visible faces during the speech segments (if not, the segment is discarded). Because our approaches are monomodal and fully unsupervised, we did not use the information provided by the leaderboard to improve performance.
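   The SpkFace filtering rule reduces to a temporal-overlap test between named speech segments and face tracks, sketched below (the segment and track representations are illustrative assumptions):

def spkface_filter(speech_segments, face_tracks):
    """Keep a speech segment only if at least one face track overlaps it.
    Both inputs are lists of (start, end) times in seconds."""
    return [(s0, s1) for s0, s1 in speech_segments
            if any(f0 < s1 and s0 < f1 for f0, f1 in face_tracks)]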
   The results using the challenge performance measures are reported in Tab. 1 for the REPERE test 2 data [9], used as the initial development data, and in Tab. 2 for the testing part of the INA dataset used in the challenge. SpkDia is the most robust and performs the best even without any face information, which might be explained by two points. First, there is usually only one speaker at a time, and not much noise, in the challenge data, whereas face diarization can be difficult due to multiple faces, facial variation, missed detections, etc. Hence, speech clusters tend to be more reliable than face clusters. Second, when a speaker is not visible, it is often the anchor of the show, who is counted as one query, equal to those who appear only for a short duration. Therefore, SpkDia is not penalized much by the visibility of speakers. We can observe this effect further in the last column of Tab. 2, which shows the number of person occurrences with names predicted by each scheme. Using faces to filter out 1/3 of the speech segments does not help to increase precision, because these segments correspond to a small number of repetitive speakers. Also, though face diarization gives only 1/3 of the possible names, these names are precise person-wise. This interesting fact may provide an outlook on combining the two modalities.

       Method     EwMAP     MAP       C      #(2485)
       Baseline    49.98    50.32    58.75      617
       SpkDia      65.31    66.70    72.50     2817
       FaceDia     66.38    67.98    71.67     1691

          Table 1: Results on REPERE test 2 (dev set)

       Method     EwMAP     MAP       C      #(21963)
       Baseline    78.35    78.64    92.71    12066
       FaceDia     83.04    83.33    90.77     7237
       SpkDia*     89.75    90.14    97.05    30583
       SpkFace     89.53    89.90    96.52    20601
       * Primary submission

               Table 2: Results on INA (test set)

4.   FUTURE WORK
   We have presented our system for the MediaEval challenge. The testing results serve as our basis for improving each component. We are working on speeding up the tracking process as well as investigating alternative face representations such as total variability modeling. On the other hand, the current system has not yet taken full advantage of both the audio and the visual streams, which we plan to address in the future.

5.   REFERENCES
 [1] C. Barras, X. Zhu, S. Meignier, and J. Gauvain. Multi-stage speaker diarization of broadcast news. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1505–1512, 2006.
 [2] D. Chen and J.-M. Odobez. Video text recognition using sequential Monte Carlo and error voting methods. Pattern Recognition Letters, 26(9):1386–1403, 2005.
 [3] M. Dinarelli and S. Rosset. Models cascade for tree-structured named entity detection. In IJCNLP, pages 1269–1278, 2011.
 [4] G. Dupuy, S. Meignier, P. Deléglise, and Y. Estève. Recent improvements towards ILP-based clustering for broadcast news speaker diarization. In Odyssey Workshop, 2014.
 [5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
 [6] O. Galibert and J. Kahn. The first official REPERE evaluation. In Interspeech satellite workshop on Speech, Language and Audio in Multimedia (SLAM), Marseille, France, 2013.
 [7] P. Gay, E. Khoury, S. Meignier, J.-M. Odobez, and P. Deléglise. A conditional random field approach for audio-visual people diarization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
 [8] P. Gay, E. Khoury, S. Meignier, J.-M. Odobez, and P. Deléglise. Face identification from overlaid texts using local face recurrent patterns and CRF models. In IEEE International Conference on Image Processing (ICIP), 2014.
 [9] A. Giraudel, M. Carré, V. Mapelli, J. Kahn, O. Galibert, and L. Quintard. The REPERE corpus: a multimodal corpus for person recognition. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA).
[10] A. Heili, A. Lopez-Mendez, and J.-M. Odobez. Exploiting long-term connectivity and visual motion in CRF-based multi-person tracking. IEEE Transactions on Image Processing, 23(7):3040–3056, 2014.
[11] E. Khoury, P. Gay, and J.-M. Odobez. Fusing matching and biometric similarity measures for face diarization in video. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval (ICMR), pages 97–104, 2013.
[12] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In ECCV, pages 720–735. Springer, 2014.
[13] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In IEEE International Conference on Multimedia and Expo (ICME), pages 854–859, 2012.
[14] J. Poignant, H. Bredin, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[15] J. Poignant, H. Bredin, V.-B. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In Interspeech, 2012.
[16] M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin, and S. Meignier. An open-source state-of-the-art toolbox for broadcast news diarization. In Interspeech, Lyon, France, 2013.
[17] M. Rouvier and S. Meignier. A global optimization framework for speaker diarization. In Odyssey Workshop, Singapore, 2012.