PUC Minas and IRISA at Multimodal Person Discovery

Gabriel Sargent1, Gabriel Barbosa de Fonseca2, Izabela Lyon Freire2, Ronan Sicre1, Zenilton K. G. Patrocínio Jr2, Silvio Jamil F. Guimarães2 and Guillaume Gravier1
1 IRISA & Inria Rennes, CNRS and Univ. Rennes 1, Rennes, France
2 Computer Science Department - PUC de Minas Gerais, Belo Horizonte, Brazil

ABSTRACT

This paper describes the systems developed by PUC Minas and IRISA for the person discovery task at MediaEval 2016. We adopt a graph-based representation and investigate two tag-propagation approaches to associate overlays co-occurring with some speaking faces to other visually or audio-visually similar speaking faces. Given a video, we first build a graph from the detected speaking faces (nodes) and their audio-visual similarities (edges). Each node is associated with its co-occurring overlays (tags) when they exist. Then, we consider two tag-propagation approaches, respectively based on a random walk strategy and on Kruskal's algorithm.

1. INTRODUCTION

The task of multimodal person discovery in TV broadcast consists in identifying the persons of a video corpus who both speak and are visible at the same time, in an unsupervised way [2]. Most approaches to the task use clustering, either of face tracks or of voice segments (or both), before finding a good match between text in overlays and clusters [6, 4]. While this type of approach worked well in 2015, we believe that the clustering steps involved are error prone. Indeed, errors in the clustering step cannot be undone afterwards in the naming stages. In 2015, IRISA and UFMG proposed a graph-based approach in which each node corresponds to a speaking face and each edge carries the similarity between its endpoints [3]. The similarity can be computed at the visual level, the voice level, or both. Names can be associated with nodes based on co-occurrences of a speaking face and name overlays. However, only a small fraction of the nodes can be tagged by this method. Hence, in 2016, we studied tag-propagation algorithms that take advantage of the graph structure to assign tags to nodes with no overlapping overlays, thus potentially improving recall. Tab. 1 recaps the different configurations submitted.

                     Similarity
  Submission         audio     video    Tag propagation
  primary (p)        binary    CNN      hierarchical
  contrast 1 (c1)    GMM       CNN      random walk
  contrast 2 (c2)    –         CNN      hierarchical
  contrast 3 (c3)    GMM       CNN      hierarchical
  contrast 4 (c4)    –         –        –

Table 1: Components of the systems at the origin of our 5 submissions.

2. GRAPH GENERATION

Each video is modeled by a graph where each node represents a speaking face, and each edge quantifies the visual or audiovisual similarity between two speaking faces. A speaking face is defined as the association of a facetrack (a sequence of faces related to the same person in adjacent video frames) with the speech segment for which the overlap is maximum and at least 60%. The facetracks and speech segments are the ones provided by MediaEval, the latter being extracted from the speaker diarization result, disregarding the arbitrary speaker number.

2.1 Audiovisual similarities

We consider three weighting schemes for the edges in the graphs, resulting from different strategies for combining visual similarity with voice similarity.

The visual similarity $S^V_{ij}$ between two facetracks $i$ and $j$ is calculated as follows. A key face is selected from the central frame of each facetrack, from which a generic image descriptor is computed by applying a very-deep convolutional neural network pre-trained on the ImageNet dataset [8]. Specifically, we extract the last convolutional layer [9] and perform average pooling and "power normalization", i.e., square-root compression followed by L2-normalization. Finally, $S^V_{ij}$ is calculated as the cosine similarity between the descriptors of the two key face images.

Voice similarity can be computed in two ways. A simple binary audio similarity is derived from the speaker diarization provided by MediaEval, where the similarity is 1 if the two segments are labeled with the same speaker in the diarization. Alternately, the audio similarity $S^A_{ij}$ between two segments can be calculated as follows. Each speech segment is modeled with a 16-Gaussian mixture model (GMM) over Mel cepstral features. The distance $D^A_{ij}$ is computed using the Euclidean-based approximation of the KL2 divergence between the two GMMs [1], and turned into a similarity according to $S^A_{ij} = \exp(\log(\alpha)\, D^A_{ij})$, where $\alpha = 0.25$ in the experiments reported here.

Fusion of the visual and voice similarities is done by a weighted average, $S^{AV}_{ij} = \beta S^V_{ij} + (1 - \beta) S^A_{ij}$. We experimentally set $\beta = 0.85$ in the case of the binary voice comparison and $\beta = 0.5$ for the GMM-based comparison.
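To make the edge weighting concrete, the sketch below illustrates (in Python, which the paper does not itself provide) how the three ingredients above fit together: the cosine visual similarity between key-face descriptors, the conversion of the GMM distance into $S^A_{ij}$, and the weighted fusion. All function and constant names are ours, and the descriptors and KL2 distances are assumed to be computed beforehand.

```python
# Illustrative sketch (ours, not the authors' code) of the edge
# weighting of Section 2.1. Assumes the pooled, power-normalized CNN
# descriptors and the Euclidean-based KL2 distances between segment
# GMMs are precomputed; all names are hypothetical.
import numpy as np

ALPHA = 0.25        # decay of the GMM-based audio similarity
BETA_GMM = 0.5      # fusion weight with the GMM-based voice similarity
BETA_BINARY = 0.85  # fusion weight with the binary voice similarity

def visual_similarity(desc_i: np.ndarray, desc_j: np.ndarray) -> float:
    """S^V_ij: cosine similarity between two key-face descriptors."""
    return float(desc_i @ desc_j /
                 (np.linalg.norm(desc_i) * np.linalg.norm(desc_j)))

def audio_similarity(kl2_dist: float, alpha: float = ALPHA) -> float:
    """S^A_ij = exp(log(alpha) * D^A_ij), i.e., alpha ** D^A_ij."""
    return float(np.exp(np.log(alpha) * kl2_dist))

def fused_similarity(s_v: float, s_a: float, beta: float = BETA_GMM) -> float:
    """S^AV_ij: weighted average of visual and voice similarities."""
    return beta * s_v + (1.0 - beta) * s_a
```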
2.2 Tag initialization

Initially, each node in the graph is tagged using the overlay for which the overlap with the facetrack is maximum. We used the overlay detection and name recognition provided (output from the OCR system), which we filtered using the named entity detector NERO [7], keeping only words tagged as "pers" by the named entity recognition. Note that this approach is rather aggressive, as NERO was initially designed for speech transcription in the French language. In practice, many nodes are not tagged, as they do not overlap with a valid overlay (sets T15 and T16, introduced in Section 4, show respectively 25.5% and 6.6% of nodes initially tagged). This is why tag propagation is required.

3. TAG PROPAGATION APPROACHES

Two different approaches are considered for the propagation of the initial tags: a random walk approach and a hierarchical one based on Kruskal's algorithm. In both cases, every node is associated with a particular tag and a confidence score at the end of the propagation phase.

3.1 Random walk tag propagation

In a graph where transition probabilities between nodes are known, the probability of ever reaching node $j$ starting from node $i$ can be calculated using a random walk strategy with absorbing states [10]. Let $n$ be the number of nodes of the graph. We define a symmetrical weight matrix $W = \{W_{ij}\}_{1 \le i,j \le n}$, where $W_{ij}$ is the similarity between nodes $i$ and $j$, and a diagonal degree matrix $D = \{D_{ij}\}_{1 \le i,j \le n}$, where $D_{ii} = \sum_j W_{ij}$. The transition probability matrix $P^0 = \{P^0_{ij}\}_{1 \le i,j \le n}$, where $P^0_{ij}$ is the probability of reaching node $j$ from node $i$ in one step, is given by $P^0 = D^{-1} W$. Tagged nodes are set as absorbing states in $P$, according to

$$P = \begin{pmatrix} I & 0 \\ P_{ul} & P_{uu} \end{pmatrix},$$

where $l$ is the set of tagged nodes, $u$ is the set of untagged nodes, $I$ is an identity matrix of size $|l| \times |l|$, $P_{ul}$ contains the probabilities of untagged nodes ending their walk on tagged nodes, and $P_{uu}$ contains the probabilities of untagged nodes getting to other untagged nodes. We denote by $P^t$ the transition probability matrix after $t$ iterations. The random walk iteration is performed according to $P^{t+1} = (1 - \gamma)\, P^0 P^t + \gamma\, P^0$, where $\gamma$ is a parameter enforcing the consistency of the initial state (here, $\gamma = 0.4$). Once the random walk has converged ($\sum_{i,j} |P^{t+1}_{ij} - P^t_{ij}| < 10^{-9}$), each untagged node is associated to the tagged one on which it has the highest probability to end its walk, i.e., each row index of $P_{ul}$ is matched with the column index of maximal probability. This maximal probability is kept as the confidence score.
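As an illustration of this procedure, the minimal numpy sketch below (ours, not the authors' code) builds $P^0 = D^{-1}W$ from a similarity matrix, makes the tagged nodes absorbing by turning their rows into identity rows, iterates $P^{t+1} = (1-\gamma)P^0P^t + \gamma P^0$ until the stated convergence criterion is met, and reads off the tag assignments and confidences from the absorption probabilities. The interface (`W` as a dense symmetric matrix, `tagged` as a list of node indices) is our own choice.

```python
# Minimal sketch (ours) of the absorbing random walk of Section 3.1.
# `W` is the symmetric similarity matrix over speaking faces and
# `tagged` lists the indices of the initially tagged nodes.
import numpy as np

def random_walk_propagation(W, tagged, gamma=0.4, eps=1e-9):
    n = W.shape[0]
    # One-step transitions P0 = D^{-1} W (assumes strictly positive degrees).
    P0 = W / W.sum(axis=1, keepdims=True)
    # Tagged nodes become absorbing states: identity rows in P0.
    for i in tagged:
        P0[i, :] = 0.0
        P0[i, i] = 1.0
    # Iterate P_{t+1} = (1 - gamma) P0 P_t + gamma P0 until convergence.
    P = P0.copy()
    while True:
        P_next = (1.0 - gamma) * P0 @ P + gamma * P0
        if np.abs(P_next - P).sum() < eps:
            break
        P = P_next
    # Each untagged node takes the tag of the tagged node it is most
    # likely to end its walk on; that probability is the confidence.
    tagged_set = set(tagged)
    assignment = {}
    for i in range(n):
        if i in tagged_set:
            continue
        j = tagged[int(np.argmax(P_next[i, tagged]))]
        assignment[i] = (j, float(P_next[i, j]))
    return assignment
```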
3.2 Hierarchical tag propagation

This method is based on the computation of a minimum spanning tree (MST) from an undirected weighted graph, using Kruskal's algorithm; the MST establishes a hierarchical partition of a set [5]. A connected graph $G$ is given (see Section 2), where edge weights represent distances (functions of the respective similarities $S^{AV}$). To propagate the initial tags, we start from a null graph $H$ on $G$'s nodes, and the following process is repeated until all edges of $G$ are examined: from $G$, the unexamined edge $e$ corresponding to the smallest distance is chosen. If it does not link different trees in $H$, skip it; otherwise, it links trees $T_1$ and $T_2$ (thus forming $T_3$), and $e$ is added to the minimum spanning forest $H$ being created. Three cases are then possible:

I. Neither $T_1$ nor $T_2$ is tagged: $T_3$ will not be tagged.
II. Only $T_1$ is tagged, with confidence score $C_{T_1}$: $T_1$'s tag is assigned to the entire $T_3$ (i.e., to all its unlabelled nodes), with a confidence score $C_{T_3} = C_{T_1} \times (1 - w_e)$, where $w_e$ is the weight of $e$ in $G$.
III. Both $T_1$ and $T_2$ are tagged: one of the two tags (of $T_1$ or of $T_2$) is picked at random and assigned to $T_3$, with a confidence score as in case II.
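The sketch below shows one possible implementation of this propagation using a union-find structure, an assumption on our part since the paper does not specify data structures. It assumes edge distances lie in $[0, 1]$ (e.g., $w_e = 1 - S^{AV}_{ij}$), so that the confidence decay $1 - w_e$ also stays in $[0, 1]$; `edges` and `tags` are hypothetical interfaces of ours.

```python
# Hedged sketch (ours) of the Kruskal-based propagation of Section 3.2.
# `edges` holds (w, i, j) tuples with distances w assumed in [0, 1],
# and `tags` maps initially tagged nodes to (tag, confidence) pairs.
import random

def hierarchical_propagation(n, edges, tags):
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Tag and confidence carried by each current tree, indexed by root;
    # initially every node is its own one-node tree.
    tree_tag = dict(tags)
    for w, i, j in sorted(edges):  # smallest distance first
        ri, rj = find(i), find(j)
        if ri == rj:               # e does not link two different trees: skip
            continue
        ti, tj = tree_tag.pop(ri, None), tree_tag.pop(rj, None)
        parent[rj] = ri            # T1 and T2 merge into T3
        if ti is None and tj is None:
            continue               # case I: T3 stays untagged
        # Cases II and III: keep one tag (at random if both trees are
        # tagged) and decay its confidence by the weight of edge e.
        tag, conf = random.choice([t for t in (ti, tj) if t is not None])
        tree_tag[ri] = (tag, conf * (1.0 - w))
    # Initially tagged nodes keep their own tag; the others inherit the
    # tag of their final tree, when it has one.
    return {i: tags.get(i, tree_tag.get(find(i))) for i in range(n)}
```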
4. RESULTS

Tab. 2 reports the results obtained on the 2015 test data (T15, which also served as development data for 2016) and on the 2016 test data (T16). For T16, the reference annotation dump of 2016/09/14 is used. The ranking of the submissions is shown in Tab. 3.

                     MAP@1          MAP@10         MAP@100
  Submission         T15    T16     T15    T16     T15    T16
  primary (p)        87.9   64.4    82.1   49.3    81.9   47.8
  contrast 1 (c1)    87.9   64.4    79.8   48.4    79.6   46.7
  contrast 2 (c2)    87.9   62.9    81.7   46.2    81.5   44.8
  contrast 3 (c3)    87.9   63.6    80.2   49.3    80.0   47.5
  contrast 4 (c4)    87.9   56.8    79.7   36.1    79.5   35.1
  (p − c4)/c4         0.0   13.4     3.0   36.6     3.0   36.2

Table 2: Mean average precision at different ranks (in %) for the 5 submissions. The last row gives the relative improvement (in %) of the primary run over the no-propagation baseline (c4).

             T15                       T16
  MAP@1      p = c1 = c2 = c3 = c4     p = c1, c3, c2, c4
  MAP@10     p, c2, c3, c1, c4         p, c2, c3, c1, c4
  MAP@100    p, c2, c3, c1, c4         p, c3, c1, c2, c4

Table 3: Ranking (best first) of the submissions.

All tag-propagation approaches improve over the no-propagation baseline (c4), the interest of tag propagation being much clearer on T16. The baseline highlights noticeable differences between T15 and T16: on T15, propagation was almost useless, as most nodes could be tagged in the initial stage; this is not the case on T16, where tag propagation yields a significant gain. The hierarchical tag propagation on graphs combining CNN visual similarity and binary voice similarity (primary) consistently outperforms the other combinations, showing the interest of combining audio and visual similarities. Comparing approaches, c3 usually performs better than c1 (except for T16 at MAP@1), indicating that the hierarchical tag propagation performs better than the random walk, at least with GMM-CNN audiovisual similarities. The comparison of the primary run and c3 shows the weakness of the GMM-based voice comparison with respect to the state-of-the-art approach used for diarization. Finally, the comparison of c3 and c2 gives mixed results: the use of the GMM-based voice comparison decreases performance in most cases, except on T16 at ranks 1 and 100.

5. ACKNOWLEDGEMENTS

Work supported by FAPEMIG/INRIA/MOTIF (CEX-APQ 03195-13), FAPEMIG/PPM (CEX-PPM-6-16) and CAPES (064965/2014-01).

6. REFERENCES

[1] M. Ben, M. Betser, F. Bimbot, and G. Gravier. Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs. In Proceedings of the 8th International Conference on Spoken Language Processing, pages 333–444, October 2004.
[2] H. Bredin, C. Barras, and C. Guinaudeau. Multimodal person discovery in broadcast TV at MediaEval 2016. In Working notes of the MediaEval 2016 Workshop, October 2016.
[3] C. E. dos Santos Jr, G. Gravier, and W. R. Schwartz. SSIG and IRISA at Multimodal Person Discovery. In Working notes of the MediaEval 2015 Workshop, September 2015.
[4] N. Le, D. Wu, S. Meignier, and J.-M. Odobez. EUMSSI team at the MediaEval Person Discovery Challenge. In Working notes of the MediaEval 2015 Workshop, September 2015.
[5] B. Perret, J. Cousty, J. C. R. Ura, and S. J. F. Guimarães. Evaluation of morphological hierarchies for supervised segmentation. In Proceedings of the 12th International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing, pages 39–50. Springer, 2015.
[6] J. Poignant, L. Besacier, and G. Quénot. Unsupervised speaker identification in TV broadcast based on written names. IEEE/ACM Transactions on Audio, Speech and Language Processing, 23(1):57–68, 2015.
[7] C. Raymond. Robust tree-structured named entities recognition from speech. In International Conference on Acoustics, Speech and Signal Processing, 2013.
[8] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[9] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of CNN activations. In Proceedings of the 2016 International Conference on Learning Representations, 2016.
[10] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer, 2002.