LIMSI at MediaEval 2015: Person Discovery in Broadcast TV Task

Johann Poignant, Hervé Bredin, Claude Barras
LIMSI - CNRS - Rue John Von Neumann, Orsay, France.
firstname.lastname@limsi.fr

ABSTRACT
This paper describes the algorithm tested by the LIMSI team in the MediaEval 2015 Person Discovery in Broadcast TV Task. For this task we used an audio/video diarization process constrained by the names written on screen. These names are used both to identify clusters and to prevent the fusion of two clusters with different co-occurring names. This method obtained 83.1% EwMAP with parameters tuned on the out-of-domain development corpus.

1. INTRODUCTION
We present the approach of the LIMSI team to the Person Discovery in Broadcast TV Task at MediaEval 2015. To address this task we had to return the names of people who can be both seen and heard in a selection of shots from a collection of videos. The list of people is not known a priori, and their names must be discovered in an unsupervised way from the media content, using text overlay or speech transcripts. For further details about the task, dataset and metrics, the reader can refer to the task description [4].
We first detail the baseline fusion system provided to all participants (we are both organizers and participants). Then, we describe the constrained multi-modal clustering. Finally, we compare the results obtained.

2. MULTI-MODAL FUSION
We propose two different approaches to address the task. They rely only on the metadata provided to all participants (see Table 1). Only written names are used as a source of identity. In addition to the speech turn segmentation and the face detection and tracking, the baseline relies on the provided speaker diarization and speaking-face mapping. The constrained clustering relies on the similarity matrices (for speakers and faces) to perform its own clustering.

Table 1: Provided sub-components used by the two fusion approaches.

Components                                  Baseline   Constrained clustering
Speech turns    - Segmentation                 x                 x
                - Similarity                                     x
                - Diarization                  x
Face            - Detection & Tracking         x                 x
                - Similarity                                     x
                - Diarization
Speaking face   - Mapping                      x                 x
Source of names - Written names [3]            x                 x
                - Pronounced names [2, 1]

Figure 1: Baseline fusion system overview.

2.1 Baseline
From the written names and the speaker diarization, we used the "Direct Speech Turn Tagging" method described in [5] to identify speakers: we first tag speech turns with the co-occurring written name. Then, on the remaining unnamed speech turns, we find the one-to-one mapping that maximizes the co-occurrence duration between speaker clusters and written names (see [5] for more details). Finally, we propagate the speaker identities to the co-occurring face tracks based on the speech turn/face track mapping.

2.2 Constrained multi-modal clustering
Figure 2 shows a global overview of our method. We first combine the mono-modal similarity matrices and the speaking-face mapping into a large multi-modal matrix, using weights α and β to give more or less importance to a given modality. In parallel, written names are used to identify co-occurring face tracks and speech turns.

Figure 2: Constrained clustering overview.

Then, we perform an agglomerative clustering on the multi-modal matrix to merge all face tracks and speech turns of the same person into a unique cluster. This process is constrained by preventing the fusion of clusters named differently. The two parameters α and β advance or delay the merging of components of one modality relative to the others during the agglomerative clustering process, while the stopping criterion is chosen to maximize the target metric (here, the EwMAP). A complete description of this method can be found in [6].
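As an illustration of this step, here is a minimal Python sketch of a name-constrained agglomerative clustering. It is not the LIMSI implementation: the average-link criterion, the data layout and every identifier (constrained_clustering, stop_threshold, ...) are assumptions made for the example; the only elements taken from the description above are the cannot-link constraint between clusters carrying different written names and the similarity-threshold stopping criterion.

```python
# Illustrative sketch only: greedy average-link agglomerative clustering over a
# pre-computed multi-modal similarity matrix, with a "never merge differently
# named clusters" constraint. Function and variable names are hypothetical.
import numpy as np

def constrained_clustering(similarity, names, stop_threshold):
    """similarity: (n, n) matrix over speech turns and face tracks.
    names: names[i] is the written name co-occurring with element i, or None.
    stop_threshold: merging stops once the best similarity falls below it."""
    n = similarity.shape[0]
    clusters = [{i} for i in range(n)]                       # singleton clusters
    cluster_names = [{names[i]} - {None} for i in range(n)]  # at most one name each

    def avg_link(a, b):
        # average similarity between the elements of two clusters
        return np.mean([similarity[i, j] for i in a for j in b])

    while len(clusters) > 1:
        best, pair = -np.inf, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # constraint: never merge two clusters carrying different names
                if cluster_names[x] and cluster_names[y] \
                        and cluster_names[x] != cluster_names[y]:
                    continue
                s = avg_link(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if pair is None or best < stop_threshold:
            break                                             # stopping criterion
        x, y = pair
        clusters[x] |= clusters[y]                            # merge y into x
        cluster_names[x] |= cluster_names[y]
        del clusters[y], cluster_names[y]
    return clusters, cluster_names
```

In the full system, the entries of the matrix that link the two modalities would already carry the α and β weighting described above, so that the two parameters advance or delay the merges of one modality relative to the other.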
2.3 Speaking face selection and confidence
The last part is common to both fusions. For each person who speaks and appears in a shot (following the shot segmentation provided to all participants), we compute a confidence score. This score is based on the temporal distance between the speaking face and its closest written name. The confidence is defined as:

    confidence = 1 + d    if the speaking face co-occurs with the written name
                 1 / δ    otherwise

where d is the co-occurrence duration and δ is the duration of the gap between the face track (or speech turn) and the written name.
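As a concrete reading of this formula, the hypothetical helper below computes the score from two (start, end) segments expressed in seconds; the interval representation and the function name are assumptions, not part of the provided metadata.

```python
def confidence(speaking_face, written_name):
    """Confidence of a (speaking face, written name) pair.
    Both arguments are hypothetical (start, end) segments in seconds."""
    start = max(speaking_face[0], written_name[0])
    end = min(speaking_face[1], written_name[1])
    if end >= start:
        d = end - start        # co-occurrence duration
        return 1.0 + d         # co-occurring case
    delta = start - end        # gap between the two segments
    return 1.0 / delta         # non-co-occurring case: decreases with the gap

# Example: a face track over 12.0-15.5 s and a name overlay over 14.0-16.0 s
# overlap for 1.5 s, so confidence((12.0, 15.5), (14.0, 16.0)) == 2.5.
```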
3. RESULTS
In Table 2, we report the EwMAP, the MAP and the Correctness (denoted C) obtained by the baseline and by the constrained clustering tuned on an out-of-domain corpus (first deadline: 01-jul-15) and on an in-domain corpus (second deadline: 08-jul-15).

Table 2: Results on the test set.

Run                             EwMAP(%)   MAP(%)   C(%)
Baseline                          78.35     78.64   92.71
Const. clus. 01-jul-15            83.13     83.46   93.19
Const. clus. 08-jul-15            84.56     84.89   94.11
Oracle propagation mono-show      96.84     96.84   97.25
Oracle propagation cross-show     97.83     97.83   97.83

The baseline does not take into account the similarity between faces and does not benefit from the knowledge of the written names during the diarization process. Our second method, in addition to using these two extra sources of information, optimizes the stopping criterion of the clustering on the target metric (EwMAP), whereas the diarization of the baseline is tuned to maximize the classical DER.
For the first deadline (July 1st) we tuned the parameters α and β and the stopping criterion of the clustering process on the out-of-domain development corpus. For the second deadline (July 8th), we tuned these parameters with the evaluation proposed via the leaderboard (computed every six hours on a subset of the test set). We can see only a small improvement between the two runs, showing that our method generalizes well.
To determine the scope for further progress, we used an oracle capable of recognizing a speaking face as soon as his/her written name is correctly extracted by the OCR module. In the mono-show case, the name must be written in the same video; in the cross-show case, the name can be written in any video of the corpus. Since our own approach only uses mono-show propagation, these oracle experiments show that up to 1% of MAP could be gained with cross-show propagation approaches.
In Table 3 we report the mean precision and recall over all queries. Compared to the baseline, the constraints on the clustering process allow a lower stopping criterion (and therefore bigger clusters, which improves recall) while keeping very high cluster purity (see the precision in Table 3). The high precision of our constrained clustering made the choice of the confidence score (used to rank shots in the computation of the MAP) relatively unimportant. Tuning the three parameters on an in-domain corpus improves recall by 1.3% and decreases precision by 0.8%. In practice, α was reduced for the July 8th run (in-domain tuning), so speech turn clustering was delayed (with respect to face track clustering) between the July 1st (out-of-domain) and July 8th (in-domain) runs.

Table 3: Mean precision and recall.

Run                       Precision(%)   Recall(%)
Baseline                       79.1          74.8
Const. clus. 01-jul-15         98.5          82.9
Const. clus. 08-jul-15         97.7          84.2

4. CONCLUSION AND FUTURE WORK
This paper presented our approach and results at the MediaEval Person Discovery in Broadcast TV task. The process used an audio/video diarization constrained by the names written on screen. This source of identities is used both to identify clusters and to avoid wrong merges during the agglomerative clustering process.
In future work we will improve the distance between speech turns and try other clustering methods as well as cross-show propagation.

Acknowledgment. This work was supported by the French National Agency for Research under grant ANR-12-CHRI-0006-01 (CAMOMILE project).

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

5. REFERENCES
[1] M. Dinarelli and S. Rosset. Models Cascade for Tree-Structured Named Entity Detection. In IJCNLP, 2011.
[2] L. Lamel, S. Courcinous, J. Despres, J. Gauvain, Y. Josse, K. Kilgour, F. Kraft, V.-B. Le, H. Ney, M. Nussbaum-Thom, I. Oparin, T. Schlippe, R. Schlüter, T. Schultz, T. F. da Silva, S. Stüker, M. Sundermeyer, B. Vieru, N. Vu, A. Waibel, and C. Woehrling. Speech Recognition for Machine Translation in Quaero. In IWSLT, 2011.
[3] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In ICME, 2012.
[4] J. Poignant, H. Bredin, and C. Barras. Multimodal Person Discovery in Broadcast TV at MediaEval 2015. In MediaEval, 2015.
[5] J. Poignant, H. Bredin, V. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In INTERSPEECH, 2012.
[6] J. Poignant, G. Fortier, L. Besacier, and G. Quénot. Naming multi-modal clusters to identify persons in TV broadcast. MTAP, 2015.