EUMSSI Team at the MediaEval Person Discovery Challenge 2016

Nam Le (1,2), Sylvain Meignier (3), Jean-Marc Odobez (1,2)
(1) Idiap Research Institute, Martigny, Switzerland
(2) École Polytechnique Fédérale de Lausanne, Switzerland
(3) LIUM, University of Maine, Le Mans, France
{nle, odobez}@idiap.ch, sylvain.meignier@univ-lemans.fr

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

We present the results of the EUMSSI team's participation in the Multimodal Person Discovery task. The goal is to identify all people who simultaneously appear and speak in a video corpus. In the proposed system, besides improving each individual modality, we emphasize the ranking of the multiple results coming from the audio and visual streams.

1. INTRODUCTION

As the retrieval of information about people in videos is of high interest to users, algorithms that index people's identities and retrieve their respective quotations are indispensable for searching archives. This practical need leads to research problems on how to identify the presence of people in videos. Given raw TV broadcasts, each shot must be automatically tagged with the name(s) of the people who can be both seen and heard in the shot, along with a confidence score. The list of people is not known a priori, and their names must be discovered from the video text overlay or speech transcripts [6]. To this end, a video must be segmented in an unsupervised way into segments that are homogeneous with respect to person identity, via speaker diarization and face diarization, whose outputs are then combined with the extracted names. Our goal is to benchmark our recent improvements in all components and to address the fusion of the multimodal results.
                                                                   at the track level to efficiently filter out false tracks. Further
2. PROPOSED SYSTEM

The proposed system is illustrated in Fig. 1. It consists of four main parts: video optical character recognition (OCR) and named entity recognition (NER), face diarization, speaker diarization, and fusion naming.

[Figure 1: Architecture of our system]
                                                                   eling (TVM). SCFC is a divide-and-conquer strategy. Face
2.1 Video OCR and NER

To detect text segments in videos and exploit them for retrieval, we rely on the approaches described in [2, 1] for text recognition in videos, and on [3, 15] for text recognition and indexing. In brief, given an input video, two main steps are applied: first, the video is preprocessed with motion filtering to reduce noise; then, individual frames are processed to localize and binarize the text regions for recognition. Compared to printed documents, OCR in TV news videos faces several challenges: the low resolution of text regions, sequences of different texts displayed one after another, and the small amount of text available for recognition. To tackle these, multiple image segmentations of the same text region are decoded, and all results are compared and aggregated over time to produce several hypotheses. The best hypothesis is used to extract people's names for identification. To recognize names in the decoded texts, we use the open-source MITIE library (https://github.com/mit-nlp/MITIE), which provides a state-of-the-art NER tool. To improve the raw MITIE results, a heuristic preprocessing step identifies the names of editorial staff based on their roles (cameraman, editor, or writer), since such people do not appear within the video and are thus not useful for identification.
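To make this role-filtering heuristic concrete, below is a minimal Python sketch. The role list, the prefix-window rule, and the function name are illustrative assumptions of ours; in the actual pipeline, the candidate names would come from MITIE's NER output.

# Illustrative role keywords; the real system's list may differ.
EDITORIAL_ROLES = {"cameraman", "editor", "writer"}

def filter_editorial_names(ocr_text, person_names):
    """Drop names that the OCR text attributes to editorial staff.

    ocr_text: one decoded OCR hypothesis; person_names: names found by NER.
    A name is discarded when a role keyword appears just before it.
    """
    low = ocr_text.lower()
    kept = []
    for name in person_names:
        idx = low.find(name.lower())
        # Inspect a short window preceding the name for a role keyword.
        prefix = low[max(0, idx - 20):idx] if idx >= 0 else ""
        if any(role in prefix for role in EDITORIAL_ROLES):
            continue  # editorial staff never appear on screen
        kept.append(name)
    return kept

# The camera operator's name is removed; the reporter's name is kept.
print(filter_editorial_names("Cameraman: John Doe - Jane Roe reporting",
                             ["John Doe", "Jane Roe"]))   # ['Jane Roe']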
2.2 Face diarization

Given the video shots, the face diarization process consists of (i) face detection, (ii) face tracking, and (iii) face clustering.

Detection & tracking. Detecting and associating faces can be challenging due to the wide range of media content, in which faces appear under varied illumination and noise. To overcome these challenges, we use a fast version of the deformable part-based model (DPM) [5, 11, 4] to detect faces over multiple poses and variations. Tracking is performed with the CRF-based multi-target tracking framework of [7], which relies on the unsupervised learning of time-sensitive association costs for different features. Because detection is the bottleneck of the system, the detector is applied only 4 times per second. We also trained an explicit false-alarm classifier at the track level to efficiently filter out false tracks. Further details can be found in [9].

Face clustering. We hierarchically merge face tracks across all shots using matching and biometric similarity measures, similarly to [8], with two improvements: shot-constrained face clustering (SCFC) and total variability modeling (TVM). SCFC is a divide-and-conquer strategy: face clustering is first applied within each group of similar shots; then all the resulting face clusters, which are far fewer in number, are hierarchically merged. TVM is a state-of-the-art biometrics method that can represent faces appearing in widely different contexts and sessions [17, 16]. To compute the similarity between two face clusters, we simply use the average distance over all pairs of faces, taking the cosine distance between their i-vectors as the pairwise distance.
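As a small illustration of this cluster-level distance, the sketch below computes the average pairwise cosine distance between two face clusters. The random vectors are only stand-ins for the i-vectors that TVM would extract [17, 16].

import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two i-vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cluster_distance(cluster_a, cluster_b):
    """Average-link distance between two face clusters: the mean cosine
    distance over all cross-cluster pairs of face i-vectors."""
    return float(np.mean([cosine_distance(u, v)
                          for u in cluster_a for v in cluster_b]))

# Toy usage: each cluster is a list of (stand-in) 100-dimensional i-vectors.
rng = np.random.default_rng(0)
cluster_a = [rng.normal(size=100) for _ in range(3)]
cluster_b = [rng.normal(size=100) for _ in range(4)]
print(cluster_distance(cluster_a, cluster_b))  # near 1.0 for random vectors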
2.3 Speaker diarization

The speaker diarization system is based on the LIUM Speaker Diarization system [14], which is publicly available (www-lium.univ-lemans.fr/en/content/liumspkdiarization). It is provided to all participants as the baseline method.
2.4 Identification and result ranking

After obtaining homogeneous clusters during which distinct identities speak or appear, one needs to assign each name output by the NER module to the correct clusters. However, associating voices with visual person clusters or names raises two major difficulties: the visible person may not be the current speaker, and the speaking person may be dubbed by a narrator in a different language. Although we have introduced a temporal learning method to address the dubbing problem [10], incorporating it into an audio-visual diarization system is still an open question. Because of these association problems, we use a direct naming method [13], which finds the mapping between clusters and names that maximizes their co-occurrence.
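The sketch below shows one simple greedy variant of such a mapping, assuming the co-occurrence of every (cluster, name) pair has already been measured, e.g. as the total time during which the name is displayed while the cluster is active. The exact formulation of direct naming is given in [13]; the function below is only our illustration.

def direct_naming(cooccurrence):
    """Greedily assign each name to at most one cluster.

    cooccurrence: dict mapping (cluster_id, name) -> co-occurrence score.
    Returns {cluster_id: name}; a simple stand-in for the method of [13],
    which seeks the mapping with maximal total co-occurrence.
    """
    mapping, used_names = {}, set()
    for (cluster, name), score in sorted(cooccurrence.items(),
                                         key=lambda kv: kv[1], reverse=True):
        if score > 0 and cluster not in mapping and name not in used_names:
            mapping[cluster] = name
            used_names.add(name)
    return mapping

cooc = {("spk0", "Jane Roe"): 12.4, ("spk1", "Jane Roe"): 1.2,
        ("spk1", "John Doe"): 8.0}
print(direct_naming(cooc))  # {'spk0': 'Jane Roe', 'spk1': 'John Doe'}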
Identification. Names are propagated based on the outputs of face diarization and speaker diarization independently. The direct naming method is applied to the speaker clusters to produce a mapping between names and clusters; all shots that overlap with a named cluster are tagged with the corresponding name, with equal confidence scores. The same direct method is applied to the face clusters to produce a set of named clusters. Unlike in speaker naming, within a shot, a name coming from face naming is ranked according to the talking score of the cluster's segment in that shot. The talking score is predicted using lip motion and temporal modeling with an LSTM [10]. Based on these two results, we propose a strategy to combine them appropriately.

Ranking. Let S = {s_k} be the list of test shots. Within each shot, {(N_i^F, t(N_i^F))} is the set of names returned by face naming together with their talking scores, and {(N_j^A, 1.0)} is the set of names returned by speaker naming, each ranked equally with score 1.0. Names on which the two methods agree are ranked highest. Then, names from face naming are ranked higher than names from speaker naming, because we found face naming to be more reliable in our experiments; alternative strategies that ranked speaker naming equal to or higher than face naming gave inferior results. Our ranking strategy is described in Algorithm 1.

Algorithm 1: Ranking names within shots
 1: for s_k ∈ S do
 2:   Q_{s_k} = ∅
 3:   Face naming(s_k) ⇒ {(N_i^F, t(N_i^F))}
 4:   Speaker naming(s_k) ⇒ {(N_j^A, 1.0)}
 5:   for each N_i^F do
 6:     if ∃ N_j^A such that N_j^A = N_i^F then
 7:       Q_{s_k} = Q_{s_k} ∪ {(N_i^F, t(N_i^F) + 2.0)}
 8:     else
 9:       Q_{s_k} = Q_{s_k} ∪ {(N_i^F, t(N_i^F) + 1.0)}
10:   for each N_j^A do
11:     if ∄ N_i^F such that N_i^F = N_j^A then
12:       Q_{s_k} = Q_{s_k} ∪ {(N_j^A, 1.0)}
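Algorithm 1 translates directly into a few lines of Python. Assuming face naming yields a name-to-talking-score mapping and speaker naming yields a set of names, a shot-level ranking could look as follows (the function and variable names are our own):

def rank_names_in_shot(face_names, speaker_names):
    """Score the candidate names of one shot following Algorithm 1.

    face_names: dict {name: talking_score} from face naming.
    speaker_names: set of names from speaker naming (base score 1.0).
    Returns (name, score) pairs sorted from highest to lowest score.
    """
    scores = {}
    for name, talking in face_names.items():
        # Agreement between the two modalities earns the largest boost.
        scores[name] = talking + (2.0 if name in speaker_names else 1.0)
    for name in speaker_names:
        if name not in face_names:
            scores[name] = 1.0  # speaker-only names rank lowest
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A shot where "Jane Roe" is both seen talking and heard:
print(rank_names_in_shot({"Jane Roe": 0.9, "John Doe": 0.4},
                         {"Jane Roe", "Ann Lee"}))
# [('Jane Roe', 2.9), ('John Doe', 1.4), ('Ann Lee', 1.0)]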
                                                                    In our primary submission (5), the result are greatly boosted
Further fusion. Finally, replacing individual components of our system with the baseline NER [12] or the baseline face diarization (http://pyannote.github.io/) produces complementary results. These results are therefore added to our final submission with lower confidence scores.

3. EVALUATION

Participants are scored on a set of queries. Each query is a person name in the corpus, and each participant has to return all shots in which that person appears and talks. The metric is Mean Average Precision (MAP) over all queries. In Tab. 1, we report our results on the test set as of 24/09/2016 (the ground truth is still being updated through a collaborative annotation process).
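For reference, one common way to compute MAP@k is sketched below; the official task evaluation uses its own scoring script, so this is only illustrative. It also makes concrete why recall matters, as noted in the discussion below: a queried person whose name is never hypothesized contributes an average precision of zero.

def average_precision(ranked_shots, relevant, k):
    """AP@k for one query: ranked_shots is the system's ranked shot list,
    relevant is the set of ground-truth shots for the queried person."""
    hits, total = 0, 0.0
    for rank, shot in enumerate(ranked_shots[:k], start=1):
        if shot in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(results, ground_truth, k):
    """MAP@k over all queries; a queried name with no returned shots
    simply scores 0, which is why missing names hurt so much."""
    aps = [average_precision(results.get(query, []), relevant, k)
           for query, relevant in ground_truth.items()]
    return sum(aps) / len(aps) if aps else 0.0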
Table 1: Benchmarking results of our submissions. Details of each submission are given in the text.

           MAP@1   MAP@10   MAP@100
Sub. (1)    30.3    22.0     21.0
Sub. (2)    58.6    42.9     42.0
Sub. (3)    64.2    53.1     52.1
Sub. (4)    68.3    56.2     54.7
Sub. (5)    79.2    65.2     63.4

Our five submissions (Sub.) are as follows:
  • Sub. (1) and Sub. (2) used our face naming without the talking score, with the baseline OCR-NER (1) or with our OCR-NER (2).
  • Sub. (3) used our face naming with the talking score.
  • Sub. (4) combined the talking-face naming of Sub. (3) with speaker naming.
  • Sub. (5) combined Sub. (4) with other systems using the baseline OCR-NER or the baseline face diarization. This is also our primary submission.

Comparing Sub. (1) and Sub. (2), one can observe that our OCR-NER outperforms the baseline OCR-NER by a large margin. This may be attributed to the high recall of our system: because the metric is averaged over all queries, any missing name can significantly decrease the overall MAP. False names, on the other hand, are less problematic, for two reasons: they may not be associated with any cluster, and they are never queried. In Sub. (3), talking-face detection with the LSTM further improves MAP@1 by 5.6 points. By combining face naming and speaker naming in Sub. (4), we manage to increase the precision further, which shows the potential of research on better audio-visual naming. In our primary submission, Sub. (5), the results are greatly boosted when the other methods are added. From this we note that these methods are complementary to each other; how best to exploit their respective advantages remains an open question for future work.

4. CONCLUSION

We have presented our system for the MediaEval 2016 challenge. It builds on our recent advances in video processing and temporal modeling. Although each modality shows positive performance, the current system does not yet take full advantage of both the audio and visual streams; the test results therefore serve as a basis for further work in this direction.

Acknowledgement. This research was supported by the European Union project EUMSSI (FP7-611057).
5. REFERENCES

[1] D. Chen and J.-M. Odobez. Video text recognition using sequential Monte Carlo and error voting methods. Pattern Recognition Letters, 26(9):1386–1403, 2005.
 [2] D. Chen, J.-M. Odobez, and H. Bourlard. Text
     detection and recognition in images and video frames.
     Pattern Recognition, 37(3):595–608, 2004.
[3] N. Daddaoua, J.-M. Odobez, and A. Vinciarelli. OCR based slide retrieval. In Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pages 945–949. IEEE, 2005.
 [4] C. Dubout and F. Fleuret. Deformable part models
     with individual part scaling. In BMVC, 2013.
 [5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and
     D. Ramanan. Object detection with discriminatively
     trained part-based models. IEEE Transactions on
     Pattern Analysis and Machine Intelligence,
     32(9):1627–1645, 2010.
[6] H. Bredin, C. Barras, and C. Guinaudeau. Multimodal person discovery in broadcast TV at MediaEval 2016. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 2016.
 [7] A. Heili, A. Lopez-Mendez, and J.-M. Odobez.
     Exploiting long-term connectivity and visual motion
     in crf-based multi-person tracking. IEEE Transactions
     on Image Processing, 23(7):3040–3056, 2014.
 [8] E. Khoury, P. Gay, and J.-M. Odobez. Fusing
     Matching and Biometric Similarity Measures for Face
     Diarization in Video. In ACM ICMR, 2013.
 [9] N. Le, A. Heili, D. Wu, and J.-M. Odobez. Temporally
     subsampled detection for accurate and efficient face
     tracking and diarization. In International Conference
     on Pattern Recognition. IEEE, Dec. 2016.
[10] N. Le and J.-M. Odobez. Learning multimodal
     temporal representation for dubbing detection in
     broadcast media. In ACM Multimedia. ACM, Oct.
     2016.
[11] M. Mathias, R. Benenson, M. Pedersoli, and
     L. Van Gool. Face detection without bells and
     whistles. In ECCV, pages 720–735. Springer, 2014.
[12] J. Poignant, L. Besacier, G. Quénot, and F. Thollard.
     From text detection in videos to person identification.
     In 2012 IEEE International Conference on Multimedia
     and Expo (ICME), pages 854–859. IEEE, 2012.
[13] J. Poignant, H. Bredin, V.-B. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In Interspeech, 2012.
[14] M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin,
     and S. Meignier. An open-source state-of-the-art
     toolbox for broadcast news diarization. In Interspeech,
     Lyon (France), 25-29 Aug. 2013.
[15] A. Vinciarelli and J.-M. Odobez. Application of
     information retrieval technologies to presentation
     slides. IEEE Transactions on Multimedia,
     8(5):981–995, 2006.
[16] R. Wallace and M. McLaren. Total variability
     modelling for face verification. Biometrics, IET,
     1(4):188–199, 2012.
[17] R. Wallace, M. McLaren, C. McCool, and S. Marcel. Inter-session variability modelling and joint factor analysis for face authentication. In Biometrics (IJCB), 2011 International Joint Conference on, pages 1–8. IEEE, 2011.