              SSIG and IRISA at Multimodal Person Discovery

                Cassio E. dos Santos Jr¹, Guillaume Gravier², William Robson Schwartz¹
            ¹Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
                                    ²IRISA & Inria Rennes, CNRS, Rennes, France
                            cass@dcc.ufmg.br, guig@irisa.fr, william@dcc.ufmg.br

ABSTRACT                                                          graph where nodes are speaking faces, with edges denoting
This paper describes our approach and results in the multi-       the voice and/or face similarity. This approach is motivated
modal person discovery in broadcast TV task at MediaEval          by the wish to avoid explicit face and speaker clustering and
2015. We investigate two distinct aspects of multimodal           open new strategies for person discovery. Note that the two
person discovery. One refers to face clusters, which are con-     approaches could be combined but, for practical reasons,
sidered to propagate names associated to faces in one shot        this combination was not considered in the framework of
to other faces that probably belong to the same person. The       the evaluation.
face clustering approach consists in calculating face similar-
ities using partial least squares (PLS) and a simple hierar-      2.   PLS-BASED FACE CLUSTERING
chical approach. The other aspect refers to tag propagation
                                                                     The PLS-based face clustering approach consists in calcu-
in a graph-based approach where nodes are speaking faces
                                                                  lating a similarity measure between face tracks for further
and edges link similar faces/speakers. The advantage of the
                                                                  clustering. Face clusters are then used in a variant of the
graph-based tag propagation is to not rely on face/speaker
                                                                  baseline, as a replacement of the face clusters provided.
clustering, which we believe can be error-prone.
                                                                     PLS is a statistical method consisting of two steps: re-
                                                                  gression and projection [9]. The projection step consists
1.   INTRODUCTION                                                 in calculating a subspace that maximizes the covariance be-
   Multimodal person discovery in video archives consists in      tween predictors and responses. The regression step relies
naming all speaking faces in the collection without prior         on ordinary least squares to estimate responses based on
information, leveraging face recognition, speech recogni-         the projected predictors. We employ the one-shot similarity
tion, speaker recognition and optical character recognition.      metric based on PLS for face verification described in [4],
A description of the task and resources provided within           which presents robust results for face images in the wild
MediaEval is given in [2]. In particular, two key components      compared to conventional distance-based methods. In a nut-
of most systems for multimodal person discovery are (i) face      shell, the similarity sim(A, B) between face tracks A and B
tracking and clustering and (ii) speaker diarization. See [6]     relies on PLS regression trained to return +1 for samples
for a recent overview of existing systems. Given these com-       in A and response −1 for samples in a background set of
ponents, a popular strategy to name speaking faces relies on      images (300 random face images from the LFW dataset [5]).
a mapping of face clusters and speakers from the diarization,     Then, sim(A, B) is calculated as the average of responses
combining this mapping with appearance of named entities          from samples in B evaluated in the learned PLS regression.
in speech transcripts or on screen (e.g., [3, 8]). The baseline   A symmetric version is used in practice, averaging sim(A, B)
system provided by the organizers [7] is a clear instantiation    and sim(B, A).
of this. Person names appearing on screen are first prop-            Based on PLS similarity calculated between all face track
agated onto speaker clusters, finding an optimal mapping          pairs, clustering aims at grouping face tracks from the same
based on co-occurrence. In the next step, one has to find         subject. We employ a hierarchical clustering approach that
for each named speaker if there is a co-occurring face track      consists in merging a pair of face tracks with maximum sim-
that has a probability to correspond to the current speaker       ilarity and with at least one face track that was not merged
higher than a threshold. Each such face track receives the        yet. The merging consists in propagating an identification
name assigned to the speaker cluster.                             label from one face track to the other or generating a new
   We explore two distinct aspects of multimodal person dis-      identification label for the pair if no label was previously as-
covery in this evaluation. On the one hand, we seek to im-        sociated to the face tracks. The algorithm stops when the
prove face clustering using recent advances in face recogni-      maximum similarity is less than a threshold, empirically set
tion based on partial least squares (PLS) regression [4]. We      to 0.5 using the development set.
consider a variant of the baseline system provided, modified         To assess the interest of PLS-based face clustering, we
to better merge the PLS face clusters and speaker diarization     consider a slightly different version of the baseline approach
results. On the other hand, we study tag propagation in a         to merge face clustering and speaker diarization informa-
                                                                  tion. Each name associated to one face track is propagated
                                                                  to all face tracks within the same face cluster. We then
     method       BSLN     SPKR    FACE      UNI      INT
     dev          38.89    63.67   49.12     67.84    44.83
     test         78.35    89.46   67.18     89.74    66.86
     test (PLS)   78.35    89.46   61.90     89.64    61.64

Table 1: EwMAP (in %) using the baseline face clusters on the development
set (top row), on the test set (middle row) and using the PLS-based face
clusters on the test set (bottom row).

     data   propagation   EwMAP    MAP     C
     dev    no prop       44.5     44.7    76.7
     dev    1 step prop   53.6     54.0    75.4
     test   no prop       78.3     79.5    89.7

Table 2: Results with graph-based naming on the development data (test2)
and on the test data.
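
As an illustration of the "no prop" and "1 step prop" settings reported in Tab. 2, the tag propagation of Section 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used for the submission. The function names (build_edges, propagate, best_tag) and the data layout are hypothetical. The 0.1 edge threshold, the averaged edge weight and the score product come from the text; keeping the maximum score when the same name reaches a node several times is our assumption, since the text does not say how such scores are combined.

```python
from typing import Dict, List, Tuple

# (node_a, node_b, face_similarity, voice_similarity); names are hypothetical
Pair = Tuple[str, str, float, float]
Edges = Dict[Tuple[str, str], float]
Tags = Dict[str, Dict[str, float]]  # node -> {name: score}


def build_edges(pairs: List[Pair], threshold: float = 0.1) -> Edges:
    """Connect two nodes only if BOTH similarities exceed the threshold.

    The edge weight is the average of the face and voice similarity.
    """
    edges: Edges = {}
    for a, b, face_sim, voice_sim in pairs:
        if face_sim > threshold and voice_sim > threshold:
            edges[(a, b)] = (face_sim + voice_sim) / 2.0
    return edges


def propagate(tags: Tags, edges: Edges, iterations: int = 1) -> Tags:
    """Propagate every tag over every edge, in both directions.

    The propagated score is the tag score multiplied by the edge weight.
    Assumption: a node keeps the maximum score seen for each name.
    """
    for _ in range(iterations):
        # Work on a copy so that one iteration only uses the previous state.
        updated: Tags = {node: dict(scores) for node, scores in tags.items()}
        for (a, b), weight in edges.items():
            for src, dst in ((a, b), (b, a)):
                for name, score in tags.get(src, {}).items():
                    candidate = score * weight
                    dst_scores = updated.setdefault(dst, {})
                    if candidate > dst_scores.get(name, 0.0):
                        dst_scores[name] = candidate
        tags = updated
    return tags


def best_tag(tags: Tags) -> Dict[str, str]:
    """Each tagged node receives the name with the highest score."""
    return {node: max(scores, key=scores.get)
            for node, scores in tags.items() if scores}


if __name__ == "__main__":
    pairs = [("n1", "n2", 0.75, 0.25), ("n2", "n3", 0.05, 0.9)]
    edges = build_edges(pairs)  # the n2-n3 pair fails the face threshold
    initial = {"n1": {"Alice": 1.0}, "n2": {}, "n3": {}}  # overlay tag, score 1
    print(best_tag(propagate(initial, edges, iterations=0)))  # no propagation
    print(best_tag(propagate(initial, edges, iterations=1)))  # one step
```

With iterations=0 only the nodes named directly from overlays are tagged, which corresponds to the "no prop" rows. Each additional iteration lets a name travel one edge further, with a score attenuated by the edge weight.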
                                                                  diarization. In PLS-based face clustering, we consider the
                                                                  CLBP [1] feature descriptor with radius parameter 5 calcu-
modified baseline approach using only the speaker diariza-        lated in square blocks of size 16 pixels with a stride of 8 pixels.
tion, only the face cluster, and considering the intersection     All faces were cropped from the videos using the face posi-
of the names instead of union.                                    tion provided in the baseline approach and scaled to 128 by
                                                                  128 pixels. Note that we do not provide face clusters based
3.     GRAPH-BASED TAG PROPAGATION                                on PLS for the development set and, therefore, all results in
                                                                  Tab. 1 for the development set consider only the face clusters
   To skirt issues with errors in clustering, which we be-        available in the baseline approach. We also provide the re-
lieve can strongly affect the naming process, we investigate      sults on the test set considering the face clusters provided in
a strategy based on tag propagation within a graph where a        the baseline method, i.e., without PLS-based face clustering.
node corresponds to an occurrence of a speaking face within          Of the single-modality runs, SPKR yields the best EwMAP in
a shot.                                                           Tab. 1 while FACE yields the worst. However, the results from
   The first step is the graph construction process, which        INT and UNI indicate that the two approaches present com-
consists in identifying speaking faces from the face tracks       plementary results, i.e., the intersection of the propagated
detected within each shot¹. This is achieved by selecting         names among the face clusters and speaker diarization shots
face tracks whose probability to correspond to the current        indicates that a small subset of correct names from the face
speech turn is greater than a threshold empirically set to        clusters is not in the speaker names. These aspects
0.6, where the probabilities that a face track corresponds        are observed in the development and test sets, using the face
to a speech turn are those provided. For each selected face       clusters in the baseline or the PLS method face clusters. We
track, we keep a record of the matching speech turn. The se-      also noticed no significant difference in the results between
lected speaking face tracks are the nodes of a graph and are      face clusters provided in the baseline approach and using the
connected with edges bearing two scores, depicting the simi-      PLS-based method, considering the UNI approach. We be-
larity of resp. voice and face (as given in the speech turn and   lieve that this small difference is an effect of the poor quality
face track similarity files). To avoid a fully connected graph    of the face clusters, which might result from combined errors
and keep only relevant relationships, we connect two nodes if     in the face detection and in the face tracking methods.
the similarity between the corresponding face tracks and the         Results for the graph-based tag propagation method are
similarity between the corresponding speech turns are both        given in Tab. 2. On the development data (test2 subset),
above a threshold, empirically set to 0.1 for both modali-        results are provided without tag propagation (no prop) and
ties. Since no relations exist between face tracks and            with a single step of tag propagation. We believe that the
speech turns across shows, a graph is built independently         poor results obtained are attributable to the fact that the
for each show.                                                    graph links only submission shots, which account only for a
   The naming process starts by associating a name to a           small fraction of the total number of shots in the develop-
node whenever possible based on the output of overlaid text       ment data. In contrast, most of the shots in the test data
detection: if an overlay significantly overlaps the face track,   are submission shots. Unsurprisingly, tag propagation im-
the node is tagged with the corresponding name and a score        proves the MAP at the expense of correctness. Submission
of 1. In case of multiple overlapping overlays, the name cor-      on the test set was made without tag propagation (because
responding to the longest co-occurrence is considered. Af-        of unconvincing propagation results at the time) and not up-
ter tagging all nodes, tags are optionally propagated over        dated after the initial submission (July 1st). Interestingly,
a number of iterations. At each iteration, each tag of each       direct naming of speaking face tracks from overlays (i.e., no
node is propagated via the corresponding edges with a prop-       propagation) already provides accurate tagging.
agation score equal to the tag score multiplied by the edge
weight, where edge weights are taken as the average of the
face and voice similarity. After propagation, each node re-       5.   ACKNOWLEDGEMENTS
ceives the tag with the highest score.                               The authors would like to thank the Brazilian National
                                                                  Research Council – CNPq (Grant #477457/2013-4), Brazil-
                                                                  ian National Council for the Improvement of Higher Educa-
4.     RESULTS                                                    tion – CAPES (Grant STIC-AMSUD 001/2013) and the Mi-
   The results from the second submission (July 8th) of the       nas Gerais Research Foundation – FAPEMIG (Grants APQ-
four PLS-based methods and the baseline are presented in          01806-13 and CEX-APQ-03195-13). This work was partially
Tab. 1, where the following abbreviations are employed:           supported by the STIC AmSud program, under the project
PLS-based face clustering considering only speaker diariza-       ’Unsupervised Mining of Multimedia Content’, and by the
tion (SPKR), only face clusters (FACE), union (UNI) and           Inria Associate Team program.
intersection (INT) of names among face clusters and speaker
    ¹ Only submission shots were considered in this work.
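
For illustration, the greedy merging procedure of Section 2 can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation used in the evaluation. The function name greedy_cluster and the dictionary layout are hypothetical, and the similarities are assumed to be the already symmetrized PLS scores. The 0.5 stopping threshold comes from the text; giving every never-merged face track its own label at the end is our assumption. Visiting pairs in decreasing order of similarity and skipping pairs whose two tracks are both labeled is equivalent to repeatedly picking the most similar pair with at least one unmerged track, because a labeled track never becomes unlabeled again.

```python
from itertools import count
from typing import Dict, Iterable, Tuple


def greedy_cluster(sim: Dict[Tuple[str, str], float],
                   tracks: Iterable[str],
                   threshold: float = 0.5) -> Dict[str, int]:
    """Assign an identification label to each face track.

    sim maps an unordered pair of track ids to its symmetric similarity.
    """
    labels: Dict[str, int] = {}
    next_label = count()
    ranked = sorted(sim.items(), key=lambda item: item[1], reverse=True)
    for (a, b), score in ranked:
        if score < threshold:
            break  # maximum remaining similarity is below the threshold
        if a in labels and b in labels:
            continue  # neither track is still unmerged
        if a in labels:
            labels[b] = labels[a]  # propagate the existing label
        elif b in labels:
            labels[a] = labels[b]
        else:
            labels[a] = labels[b] = next(next_label)  # new label for the pair
    # Assumption: tracks that were never merged form singleton clusters.
    for track in tracks:
        if track not in labels:
            labels[track] = next(next_label)
    return labels


if __name__ == "__main__":
    similarities = {
        ("t1", "t2"): 0.9,
        ("t2", "t3"): 0.7,
        ("t1", "t3"): 0.2,
        ("t3", "t4"): 0.4,
    }
    print(greedy_cluster(similarities, ["t1", "t2", "t3", "t4"]))
```

In this toy example, t1, t2 and t3 end up with the same label through two successive merges, while t4 stays alone because its best similarity (0.4) is below the threshold.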
