<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PUC Minas and IRISA at Multimodal Person Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriel Sargent</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriel Barbosa de Fonseca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Izabela Lyon Freire</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ronan Sicre</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zenilton K. G. Patrocínio Jr</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvio Jamil F. Guimarães</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Gravier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department - PUC de Minas Gerais</institution>
          ,
          <addr-line>Belo Horizonte</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IRISA &amp; Inria Rennes</institution>
          ,
          <addr-line>CNRS and Univ. Rennes 1, Rennes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper describes the systems developed by PUC Minas and IRISA for the person discovery task at MediaEval 2016. We adopt a graph-based representation and investigate two tag-propagation approaches to associate overlays co-occurring with some speaking faces to other visually or audio-visually similar speaking faces. Given a video, we first build a graph from the detected speaking faces (nodes) and their audio-visual similarities (edges). Each node is associated with its co-occurring overlays (tags) when they exist. Then, we consider two tag-propagation approaches, respectively based on a random walk strategy and on Kruskal's algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The task of multimodal person discovery in TV broadcast consists in identifying, in an unsupervised way, the persons of a video corpus who both speak and are visible at the same time [<xref ref-type="bibr" rid="ref2">2</xref>]. Most approaches to the task use clustering, either of face tracks or of voice segments (or both), before finding a good match between text in overlays and clusters [<xref ref-type="bibr" rid="ref4 ref6">6, 4</xref>]. While this type of approach worked well in 2015, we believe that the clustering steps involved are error prone: errors made during clustering cannot be undone afterwards in the naming stages. In 2015, IRISA and UFMG proposed a graph-based approach in which each node corresponds to a speaking face and each edge to the similarity between the two speaking faces it connects [<xref ref-type="bibr" rid="ref3">3</xref>]. The similarity can be computed at the visual level, at the voice level, or at both. Names can be associated with nodes based on co-occurrences of a speaking face and name overlays. However, only a small fraction of the nodes can be tagged by this method. Hence, in 2016, we studied tag propagation algorithms that take advantage of the graph structure to assign tags to nodes with no overlapping overlays, thus potentially improving recall. Tab. 1 recaps the different configurations submitted.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. GRAPH GENERATION</title>
      <p>Each video is modeled by a graph where each node represents a speaking face, and each edge quantifies the visual or audiovisual similarity between two speaking faces. A speaking face is defined as the association of a facetrack (a sequence of faces related to the same person in adjacent video frames) with the speech segment for which the overlap is maximal and at least 60%. The facetracks and speech segments are the ones provided by MediaEval, the latter being extracted from the speaker diarization result, disregarding the arbitrary speaker numbering.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Submitted configurations: similarities used to weight the graph edges and tag propagation approach ("none" indicates the component is not used; c4 is the no-propagation baseline).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Submission</th><th>Audio similarity</th><th>Video similarity</th><th>Tag propagation</th></tr>
          </thead>
          <tbody>
            <tr><td>primary (p)</td><td>binary</td><td>CNN</td><td>hierarchical</td></tr>
            <tr><td>contrast 1 (c1)</td><td>GMM</td><td>CNN</td><td>random walk</td></tr>
            <tr><td>contrast 2 (c2)</td><td>none</td><td>CNN</td><td>hierarchical</td></tr>
            <tr><td>contrast 3 (c3)</td><td>GMM</td><td>CNN</td><td>hierarchical</td></tr>
            <tr><td>contrast 4 (c4)</td><td>none</td><td>none</td><td>none</td></tr>
          </tbody>
        </table>
      </table-wrap>
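<p>The association of facetracks with speech segments can be sketched as follows. This is a minimal illustration, not the authors' code: intervals are (start, end) pairs in seconds, and we assume the 60% threshold is measured relative to the facetrack duration, which the text does not specify.</p>

```python
def overlap(a, b):
    """Temporal intersection length of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def speaking_faces(facetracks, speech_segments, min_ratio=0.6):
    """Pair each facetrack with the speech segment of maximal overlap,
    keeping the pair only if the overlap covers at least min_ratio of
    the facetrack duration (an assumed reading of the 60% threshold)."""
    pairs = []
    for ft in facetracks:
        best = max(speech_segments, key=lambda sp: overlap(ft, sp), default=None)
        if best is None:
            continue
        duration = ft[1] - ft[0]
        if duration > 0 and overlap(ft, best) / duration >= min_ratio:
            pairs.append((ft, best))
    return pairs
```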
    </sec>
    <sec id="sec-3">
      <title>2.1 Audiovisual similarities</title>
      <p>We consider three weighting schemes for the edges in the graphs, resulting from different strategies for combining visual similarity and voice similarity.</p>
      <p>
        The visual similarity S^V_ij between two facetracks i and j is calculated as follows. A key face is selected from the central frame of each facetrack, from which a generic image descriptor is computed by applying a very-deep convolutional neural network pre-trained on the ImageNet dataset [<xref ref-type="bibr" rid="ref8">8</xref>]. Specifically, we extract the last convolutional layer [<xref ref-type="bibr" rid="ref9">9</xref>] and perform average pooling and "power normalization", i.e., square-root compression followed by L2-normalization. Finally, S^V_ij is calculated as the cosine similarity between the descriptors of the two key face images.
      </p>
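<p>A minimal sketch of the descriptor computation and comparison, assuming the (H, W, C) activation map of the last convolutional layer has already been extracted (the CNN itself is omitted here):</p>

```python
import numpy as np

def describe(conv_map):
    """conv_map: (H, W, C) activations from the last convolutional layer.
    Average pooling, then power normalization: signed square root
    followed by L2 normalization."""
    v = conv_map.mean(axis=(0, 1))            # average pooling -> (C,)
    v = np.sign(v) * np.sqrt(np.abs(v))       # square-root compression
    return v / (np.linalg.norm(v) + 1e-12)    # L2 normalization

def visual_similarity(map_i, map_j):
    """Cosine similarity of the two key-face descriptors; since both
    are L2-normalized, this reduces to a dot product."""
    return float(describe(map_i) @ describe(map_j))
```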
      <p>
        Voice similarity can be computed in two ways. A simple binary audio similarity is derived from the speaker diarization provided by MediaEval: the similarity is 1 if the two segments are labeled with the same speaker in the diarization, and 0 otherwise. Alternately, the audio similarity S^A_ij between two segments can be calculated as follows. Each speech segment is modeled with a 16-Gaussian mixture model (GMM) over Mel cepstral features. The distance D^A_ij is computed using the Euclidean-based approximation of the KL2 divergence between the two GMMs [<xref ref-type="bibr" rid="ref1">1</xref>], and turned into a similarity according to S^A_ij = exp(log(α) D^A_ij), where α = 0.25 in the experiments reported here.
      </p>
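<p>The distance-to-similarity mapping, together with a simplified GMM distance, might be sketched as below. The exact Euclidean-based KL2 approximation of [1] may differ; this sketch assumes mean-only adapted GMMs sharing weights and diagonal covariances.</p>

```python
import numpy as np

ALPHA = 0.25  # decay parameter used in the paper

def gmm_distance(weights, means_i, means_j, variances):
    """Weighted, variance-normalized squared Euclidean distance between
    the mean vectors of two GMMs (a simplified stand-in for the
    Euclidean-based KL2 approximation). Arrays are (K,) and (K, D)."""
    diff = (means_i - means_j) ** 2 / variances   # (K, D)
    return float(np.sum(weights[:, None] * diff))

def audio_similarity(distance, alpha=ALPHA):
    """Turn a distance into a similarity: S = exp(log(alpha) * D),
    i.e. alpha**D, so S = 1 at distance 0 and decays with distance."""
    return float(np.exp(np.log(alpha) * distance))
```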
      <p>Fusion of the visual and voice similarities is done by a weighted average, S^AV_ij = λ S^V_ij + (1 - λ) S^A_ij. We experimentally set λ = 0.85 in the case of the binary voice comparison and λ = 0.5 for the GMM-based comparison.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Tag initialization</title>
      <p>
        Initially, each node in the graph is tagged using the overlay for which the overlap with the facetrack is maximal. We used the overlay detection and name recognition provided with the task (output of the OCR system), which we filtered using the named entity detector NERO [<xref ref-type="bibr" rid="ref7">7</xref>], keeping only words tagged as "pers" by the named entity recognition. Note that this approach is rather aggressive, as NERO was initially designed for speech transcription in the French language. In practice, many nodes are not tagged, as they do not overlap with a valid overlay (sets T15 and T16, introduced in Section 4, show respectively 25.5% and 6.6% of nodes initially tagged). This is why tag propagation is required.
      </p>
    </sec>
    <sec id="sec-5">
      <title>3. TAG PROPAGATION APPROACHES</title>
      <p>Two different approaches are considered for the propagation of the initial tags: a random walk approach and a hierarchical one based on Kruskal's algorithm. In both cases, every node is associated with a particular tag and a confidence score at the end of the propagation phase.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Random walk tag propagation</title>
      <p>
        In a graph where transition probabilities between nodes are known, the probability of ever reaching node j starting from node i can be calculated using a random walk strategy with absorbing states [<xref ref-type="bibr" rid="ref10">10</xref>]. Let n be the number of nodes of the graph. We define a symmetric weight matrix W = {W_ij}, 1 ≤ i, j ≤ n, where W_ij is the similarity between nodes i and j, and a diagonal degree matrix D with D_ii = Σ_j W_ij. The transition probability matrix P^0 = {P^0_ij}, where P^0_ij is the probability of reaching node j from node i in one step, is given by P^0 = D^-1 W. Tagged nodes are set as absorbing states in P, which takes the block form

P = [ I     0
      P_ul  P_uu ],

where l is the set of tagged nodes, u is the set of untagged nodes, I is an identity matrix of size |l| × |l|, P_ul contains the probabilities of untagged nodes ending their walk on tagged nodes, and P_uu contains the probabilities of untagged nodes moving to other untagged nodes. We denote by P^t the transition probability matrix after t iterations. The random walk iteration is performed according to P^(t+1) = (1 - β) P^0 P^t + β P^0, where β is a parameter enforcing consistency with the initial state (here, β = 0.4). Once the random walk has converged (Σ_{i,j} |P^(t+1)_ij - P^t_ij| &lt; 10^-9), each untagged node is associated with the tagged node on which it has the highest probability of ending its walk, i.e., each row index of P_ul is matched with the column index with maximal probability. This maximal probability is kept as the confidence score.
      </p>
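<p>The iteration above can be sketched as follows. This is a minimal illustration, not the authors' code: indices are not reordered into the block form, tagged nodes are simply made absorbing in place, and at least one tagged node is assumed.</p>

```python
import numpy as np

def random_walk_tags(W, tags, beta=0.4, tol=1e-9, max_iter=10000):
    """Propagate tags on a similarity graph with an absorbing random walk.
    W: (n, n) symmetric similarity matrix; tags: length-n list with a
    string for initially tagged nodes and None otherwise. Returns the
    completed tag list and per-node confidence scores."""
    W = np.asarray(W, dtype=float)
    n = len(tags)
    P0 = W / W.sum(axis=1, keepdims=True)      # P0 = D^-1 W
    labeled = [i for i in range(n) if tags[i] is not None]
    for i in labeled:                          # tagged nodes become absorbing
        P0[i, :] = 0.0
        P0[i, i] = 1.0
    P = P0.copy()
    for _ in range(max_iter):
        # P_{t+1} = (1 - beta) P0 P_t + beta P0
        P_next = (1.0 - beta) * P0 @ P + beta * P0
        if np.abs(P_next - P).sum() < tol:     # convergence test
            P = P_next
            break
        P = P_next
    out, conf = list(tags), [1.0] * n
    for i in range(n):
        if tags[i] is None:
            j = labeled[int(np.argmax(P[i, labeled]))]
            out[i], conf[i] = tags[j], float(P[i, j])
    return out, conf
```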
    </sec>
    <sec id="sec-7">
      <title>3.2 Hierarchical tag propagation</title>
      <p>
        This method is based on the computation of a minimum spanning tree (MST) from an undirected weighted graph, using Kruskal's algorithm. The MST establishes a hierarchical partition of a set [<xref ref-type="bibr" rid="ref5">5</xref>]. A connected graph G is given (see Section 2), where edge weights represent distances (functions of their respective similarities S^AV). To propagate the initial tags, we start from a null graph H on G's nodes, and the following process is repeated until all edges of G have been examined: from G, the unexamined edge e corresponding to the smallest distance is chosen. If it does not link different trees in H, skip it; otherwise, it links trees T1 and T2 (thus forming T3), and e is added to the minimum spanning forest H being created. Three cases are possible: I. Neither T1 nor T2 is tagged: T3 is not tagged. II. Only T1 is tagged, with confidence score C_T1: T1's tag is assigned to the entire T3 (i.e., to all its unlabelled nodes), with confidence score C_T3 = C_T1 (1 - w_e), where w_e is the weight of e in G. III. Both T1 and T2 are tagged: one of the two tags (of T1 or of T2) is picked at random and assigned to T3, with confidence scores as in case II.
      </p>
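<p>A sketch of this procedure using union-find with per-tree tag state. It assumes distances lie in [0, 1] so that 1 - w_e is a valid confidence factor, and that each merge may overwrite previously propagated tags of the merged tree with the new tree-level confidence, as the text suggests; neither point is fully specified in the paper.</p>

```python
import random

def hierarchical_tags(n, edges, tags, seed=0):
    """Kruskal-style tag propagation. edges: list of (distance, i, j);
    tags: length-n list with a string for initially tagged nodes, else
    None. Returns the completed tags and confidence scores (initially
    tagged nodes keep confidence 1.0, an assumed convention)."""
    rng = random.Random(seed)
    parent = list(range(n))
    # per-tree (tag, confidence), or None for an untagged tree (case I)
    tree_tag = [(t, 1.0) if t is not None else None for t in tags]
    members = [[i] for i in range(n)]          # nodes under each root

    def find(x):                               # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    out = list(tags)
    conf = [1.0 if t is not None else 0.0 for t in tags]
    for w, i, j in sorted(edges):              # smallest distance first
        a, b = find(i), find(j)
        if a == b:                             # same tree: skip the edge
            continue
        ta, tb = tree_tag[a], tree_tag[b]
        if ta is None and tb is None:
            merged = None                      # case I: T3 stays untagged
        else:
            if ta is not None and tb is not None:
                src = rng.choice([ta, tb])     # case III: pick at random
            else:
                src = ta if ta is not None else tb   # case II
            merged = (src[0], src[1] * (1.0 - w))    # C_T3 = C (1 - w_e)
        parent[b] = a
        members[a] += members[b]
        tree_tag[a] = merged
        if merged is not None:                 # tag all unlabelled nodes of T3
            for v in members[a]:
                if tags[v] is None:
                    out[v], conf[v] = merged[0], merged[1]
    return out, conf
```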
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS</title>
      <p>Tab. 2 reports the results obtained on the 2015 and 2016 test data (T15, the development data for 2016, and T16, respectively). For T16, the reference annotation dump of 2016/09/14 is used. The ranks of the submissions are shown in Tab. 3. All tag propagation approaches improve over the no-propagation baseline (c4), the interest of tag propagation being much clearer on T16. The baseline highlights noticeable differences between T15 and T16. In T15, propagation was almost useless, as most nodes could be tagged in the initial stage. This is not the case in T16, where tag propagation yields a significant gain. The hierarchical tag propagation on graphs combining CNN visual similarity and binary voice similarity (primary) consistently outperforms the other combinations, showing the interest of combining audio and visual similarities. Comparing approaches, c3 usually (except for T16, MAP@1) performs better than c1, indicating that the hierarchical tag propagation performs better than the random walk, at least with GMM-CNN audiovisual similarities. The comparison of the primary run and c3 shows the weakness of the GMM-based voice comparison with respect to the state-of-the-art approach used for diarization. Finally, the comparison of c3 and c2 gives mixed results: the use of the GMM-based voice comparison decreases performance in most cases, except on T16 at K = 1 and K = 100.</p>
    </sec>
    <sec id="sec-9">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>Work supported by FAPEMIG/INRIA/MOTIF (CEX-APQ 03195-13), FAPEMIG/PPM (CEX-PPM-6-16) and CAPES (064965/2014-01).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Betser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bimbot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          .
          <article-title>Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs</article-title>
          .
          <source>In Proceedings of the 8th International Conference on Spoken Language Processing</source>
          , pages
          <fpage>333</fpage>
          -
          <lpage>444</lpage>
          ,
          <year>October 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Guinaudeau</surname>
          </string-name>
          .
          <article-title>Multimodal person discovery in broadcast TV at MediaEval 2016</article-title>
          .
          <source>In Working notes of the MediaEval 2016 Workshop</source>
          ,
          <year>October 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>dos Santos Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          .
          <article-title>SSIG and IRISA at Multimodal Person Discovery</article-title>
          .
          <source>In Working notes of the MediaEval 2015 Workshop</source>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Odobez</surname>
          </string-name>
          .
          <article-title>EUMSSI team at the MediaEval Person Discovery Challenge</article-title>
          .
          <source>In Working notes of the MediaEval 2015 Workshop</source>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Perret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cousty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. R.</given-names>
            <surname>Ura</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. J. F.</given-names>
            <surname>Guimarães</surname>
          </string-name>
          .
          <article-title>Evaluation of morphological hierarchies for supervised segmentation</article-title>
          .
          <source>In Proceedings of the 12th International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing</source>
          , pages
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised speaker identification in TV broadcast based on written names</article-title>
          .
          <source>IEEE/ACM Transactions on Audio, Speech and Language Processing</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ):
          <fpage>57</fpage>
          -
          <lpage>68</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raymond</surname>
          </string-name>
          .
          <article-title>Robust tree-structured named entities recognition from speech</article-title>
          .
          <source>In International Conference on Acoustics, Speech and Signal Processing</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tolias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sicre</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          .
          <article-title>Particular object retrieval with integral max-pooling of CNN activations</article-title>
          .
          <source>In Proceedings of the 2016 International Conference on Learning Representations</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Learning from labeled and unlabeled data with label propagation</article-title>
          .
          <source>Technical report, Citeseer</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>