PUC Minas and IRISA at Multimodal Person Discovery

Gabriel Sargent1, Gabriel Barbosa de Fonseca2, Izabela Lyon Freire2, Ronan Sicre1, Zenilton K. G. Patrocínio Jr2, Silvio Jamil F. Guimarães2 and Guillaume Gravier1
1 IRISA & Inria Rennes, CNRS and Univ. Rennes 1, Rennes, France
2 Computer Science Department - PUC de Minas Gerais, Belo Horizonte, Brazil

ABSTRACT

This paper describes the systems developed by PUC Minas and IRISA for the person discovery task at MediaEval 2016. We adopt a graph-based representation and investigate two tag-propagation approaches to associate overlays co-occurring with some speaking faces to other visually or audio-visually similar speaking faces. Given a video, we first build a graph from the detected speaking faces (nodes) and their audio-visual similarities (edges). Each node is associated with its co-occurring overlays (tags) when they exist. Then, we consider two tag-propagation approaches, respectively based on a random walk strategy and on Kruskal's algorithm.

1. INTRODUCTION

The task of multimodal person discovery in TV broadcast consists in identifying the persons of a video corpus who both speak and are visible at the same time, in an unsupervised way [2]. Most approaches to the task use clustering, either of face tracks or of voice segments (or both), before finding a good match between text in overlays and clusters [6, 4]. While this type of approach worked well in 2015, we believe that the clustering steps involved are error prone. Indeed, errors in the clustering step cannot be undone afterwards in the naming stages. In 2015, IRISA and UFMG proposed a graph-based approach in which each node corresponds to a speaking face and each edge carries the similarity between its endpoints [3]. The similarity can be computed at the visual level, the voice level, or both. Names can be associated with nodes based on co-occurrences of a speaking face and name overlays. However, only a small fraction of the nodes can be tagged by this method. Hence, in 2016, we studied tag-propagation algorithms that take advantage of the graph structure to assign tags to nodes with no overlapping overlays, thus potentially improving recall. Tab. 1 recaps the different configurations submitted.

                     Similarity
  Submission         audio     video    Tag propagation
  primary (p)        binary    CNN      hierarchical
  contrast 1 (c1)    GMM       CNN      random walk
  contrast 2 (c2)    –         CNN      hierarchical
  contrast 3 (c3)    GMM       CNN      hierarchical
  contrast 4 (c4)    –         –        –

Table 1: Components of the systems at the origin of our 5 submissions.

2. GRAPH GENERATION

Each video is modeled by a graph where each node represents a speaking face, and each edge quantifies the visual or audiovisual similarity between two speaking faces. A speaking face is defined as the association of a facetrack (a sequence of faces related to the same person in adjacent video frames) with the speech segment for which the overlap is maximum and at least 60%. The facetracks and speech segments are the ones provided by MediaEval, the latter being extracted from the speaker diarization result, disregarding the arbitrary speaker number.

2.1 Audiovisual similarities

We consider three weighting schemes for the edges in the graphs, resulting from different strategies for combining visual similarity with voice similarity.

The visual similarity $S^V_{ij}$ between two facetracks $i$ and $j$ is calculated as follows. A key face is selected from the central frame of each facetrack, from which a generic image descriptor is computed by applying a very-deep convolutional neural network pre-trained on the ImageNet dataset [8]. Specifically, we extract the last convolutional layer [9] and perform average pooling and "power normalization", i.e., square-root compression followed by L2-normalization. Finally, $S^V_{ij}$ is calculated as the cosine similarity between the descriptors of the two key face images.

Voice similarity can be computed in two ways. A simple binary audio similarity is derived from the speaker diarization provided by MediaEval, where the similarity is 1 if the two segments are labeled with the same speaker in the diarization. Alternately, the audio similarity $S^A_{ij}$ between two segments can be calculated as follows. Each speech segment is modeled with a 16-Gaussian mixture model (GMM) over Mel cepstral features. The distance $D^A_{ij}$ is computed using the Euclidean-based approximation of the KL2 divergence between the two GMMs [1], and turned into a similarity according to $S^A_{ij} = \exp(\log(\alpha)\, D^A_{ij})$, where $\alpha = 0.25$ in the experiments reported here.

Fusion of the visual and voice similarities is done by a weighted average, $S^{AV}_{ij} = \beta S^V_{ij} + (1 - \beta) S^A_{ij}$. We experimentally set $\beta = 0.85$ in the case of the binary voice comparison and $\beta = 0.5$ for the GMM-based comparison.
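To make the edge weighting concrete, the sketch below illustrates (in Python, which the paper does not itself provide) how the three ingredients above fit together: the cosine visual similarity between key-face descriptors, the conversion of the GMM distance into $S^A_{ij}$, and the weighted fusion. All function and constant names are ours, and the descriptors and KL2 distances are assumed to be computed beforehand.

```python
# Illustrative sketch (ours, not the authors' code) of the edge
# weighting of Section 2.1. Assumes the pooled, power-normalized CNN
# descriptors and the Euclidean-based KL2 distances between segment
# GMMs are precomputed; all names are hypothetical.
import numpy as np

ALPHA = 0.25        # decay of the GMM-based audio similarity
BETA_GMM = 0.5      # fusion weight with the GMM-based voice similarity
BETA_BINARY = 0.85  # fusion weight with the binary voice similarity

def visual_similarity(desc_i: np.ndarray, desc_j: np.ndarray) -> float:
    """S^V_ij: cosine similarity between two key-face descriptors."""
    return float(desc_i @ desc_j /
                 (np.linalg.norm(desc_i) * np.linalg.norm(desc_j)))

def audio_similarity(kl2_dist: float, alpha: float = ALPHA) -> float:
    """S^A_ij = exp(log(alpha) * D^A_ij), i.e., alpha ** D^A_ij."""
    return float(np.exp(np.log(alpha) * kl2_dist))

def fused_similarity(s_v: float, s_a: float, beta: float = BETA_GMM) -> float:
    """S^AV_ij: weighted average of visual and voice similarities."""
    return beta * s_v + (1.0 - beta) * s_a
```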
2.2 Tag initialization

Initially, each node in the graph is tagged using the overlay for which the overlap with the facetrack is maximum. We used the overlay detection and name recognition provided (output from the OCR system), which we filtered using the named entity detector NERO [7], keeping only words tagged as "pers" by the named entity recognition. Note that this approach is rather aggressive, as NERO was initially designed for speech transcription in the French language. In practice, many nodes are not tagged, as they do not overlap with a valid overlay (sets T15 and T16, introduced in Section 4, show respectively 25.5% and 6.6% of nodes initially tagged). This is why tag propagation is required.

3. TAG PROPAGATION APPROACHES

Two different approaches are considered for the propagation of the initial tags: a random walk approach and a hierarchical one based on Kruskal's algorithm. In both cases, every node is associated with a particular tag and a confidence score at the end of the propagation phase.

3.1 Random walk tag propagation

In a graph where transition probabilities between nodes are known, the probability of ever reaching node $j$ starting from node $i$ can be calculated using a random walk strategy with absorbing states [10]. Let $n$ be the number of nodes of the graph. We define a symmetrical weight matrix $W = \{W_{ij}\}_{1 \le i,j \le n}$, where $W_{ij}$ is the similarity between nodes $i$ and $j$, and a diagonal degree matrix $D = \{D_{ij}\}_{1 \le i,j \le n}$, where $D_{ii} = \sum_j W_{ij}$. The transition probability matrix $P^0 = \{P^0_{ij}\}_{1 \le i,j \le n}$, where $P^0_{ij}$ is the probability of reaching node $j$ from node $i$ in one step, is given by $P^0 = D^{-1} W$. Tagged nodes are set as absorbing states in $P$, according to

$$P = \begin{pmatrix} I & 0 \\ P_{ul} & P_{uu} \end{pmatrix},$$

where $l$ is the set of tagged nodes, $u$ is the set of untagged nodes, $I$ is an identity matrix of size $|l| \times |l|$, $P_{ul}$ contains the probabilities of untagged nodes ending their walk on tagged nodes, and $P_{uu}$ contains the probabilities of untagged nodes getting to other untagged nodes. We denote by $P^t$ the transition probability matrix after $t$ iterations. The random walk iteration is performed according to $P^{t+1} = (1 - \gamma)\, P^0 P^t + \gamma\, P^0$, where $\gamma$ is a parameter enforcing the consistency of the initial state (here, $\gamma = 0.4$). Once the random walk has converged ($\sum_{i,j} |P^{t+1}_{ij} - P^t_{ij}| < 10^{-9}$), each untagged node is associated to the tagged one on which it has the highest probability to end its walk, i.e., each row index of $P_{ul}$ is matched with the column index of maximal probability. This maximal probability is kept as the confidence score.
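As an illustration of this procedure, the minimal numpy sketch below (ours, not the authors' code) builds $P^0 = D^{-1}W$ from a similarity matrix, makes the tagged nodes absorbing by turning their rows into identity rows, iterates $P^{t+1} = (1-\gamma)P^0P^t + \gamma P^0$ until the stated convergence criterion is met, and reads off the tag assignments and confidences from the absorption probabilities. The interface (`W` as a dense symmetric matrix, `tagged` as a list of node indices) is our own choice.

```python
# Minimal sketch (ours) of the absorbing random walk of Section 3.1.
# `W` is the symmetric similarity matrix over speaking faces and
# `tagged` lists the indices of the initially tagged nodes.
import numpy as np

def random_walk_propagation(W, tagged, gamma=0.4, eps=1e-9):
    n = W.shape[0]
    # One-step transitions P0 = D^{-1} W (assumes strictly positive degrees).
    P0 = W / W.sum(axis=1, keepdims=True)
    # Tagged nodes become absorbing states: identity rows in P0.
    for i in tagged:
        P0[i, :] = 0.0
        P0[i, i] = 1.0
    # Iterate P_{t+1} = (1 - gamma) P0 P_t + gamma P0 until convergence.
    P = P0.copy()
    while True:
        P_next = (1.0 - gamma) * P0 @ P + gamma * P0
        if np.abs(P_next - P).sum() < eps:
            break
        P = P_next
    # Each untagged node takes the tag of the tagged node it is most
    # likely to end its walk on; that probability is the confidence.
    tagged_set = set(tagged)
    assignment = {}
    for i in range(n):
        if i in tagged_set:
            continue
        j = tagged[int(np.argmax(P_next[i, tagged]))]
        assignment[i] = (j, float(P_next[i, j]))
    return assignment
```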
3.2 Hierarchical tag propagation

This method is based on the computation of a minimum spanning tree (MST) from an undirected weighted graph, using Kruskal's algorithm; the MST establishes a hierarchical partition of a set [5]. A connected graph $G$ is given (see Section 2), where edge weights represent distances (functions of the respective similarities $S^{AV}$). To propagate the initial tags, we start from a null graph $H$ on $G$'s nodes, and the following process is repeated until all edges of $G$ are examined: from $G$, the unexamined edge $e$ corresponding to the smallest distance is chosen. If it does not link different trees in $H$, skip it; otherwise, it links trees $T_1$ and $T_2$ (thus forming $T_3$), and $e$ is added to the minimum spanning forest $H$ being created. Three cases are then possible:

I. Neither $T_1$ nor $T_2$ is tagged: $T_3$ will not be tagged.
II. Only $T_1$ is tagged, with confidence score $C_{T_1}$: $T_1$'s tag is assigned to the entire $T_3$ (i.e., to all its unlabelled nodes), with a confidence score $C_{T_3} = C_{T_1} \times (1 - w_e)$, where $w_e$ is the weight of $e$ in $G$.
III. Both $T_1$ and $T_2$ are tagged: one of the two tags (of $T_1$ or of $T_2$) is picked at random and assigned to $T_3$, with a confidence score as in case II.
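The sketch below shows one possible implementation of this propagation using a union-find structure, an assumption on our part since the paper does not specify data structures. It assumes edge distances lie in $[0, 1]$ (e.g., $w_e = 1 - S^{AV}_{ij}$), so that the confidence decay $1 - w_e$ also stays in $[0, 1]$; `edges` and `tags` are hypothetical interfaces of ours.

```python
# Hedged sketch (ours) of the Kruskal-based propagation of Section 3.2.
# `edges` holds (w, i, j) tuples with distances w assumed in [0, 1],
# and `tags` maps initially tagged nodes to (tag, confidence) pairs.
import random

def hierarchical_propagation(n, edges, tags):
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Tag and confidence carried by each current tree, indexed by root;
    # initially every node is its own one-node tree.
    tree_tag = dict(tags)
    for w, i, j in sorted(edges):  # smallest distance first
        ri, rj = find(i), find(j)
        if ri == rj:               # e does not link two different trees: skip
            continue
        ti, tj = tree_tag.pop(ri, None), tree_tag.pop(rj, None)
        parent[rj] = ri            # T1 and T2 merge into T3
        if ti is None and tj is None:
            continue               # case I: T3 stays untagged
        # Cases II and III: keep one tag (at random if both trees are
        # tagged) and decay its confidence by the weight of edge e.
        tag, conf = random.choice([t for t in (ti, tj) if t is not None])
        tree_tag[ri] = (tag, conf * (1.0 - w))
    # Initially tagged nodes keep their own tag; the others inherit the
    # tag of their final tree, when it has one.
    return {i: tags.get(i, tree_tag.get(find(i))) for i in range(n)}
```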
4. RESULTS

Tab. 2 reports the results obtained on the 2015 test data (T15, which also served as development data for 2016) and on the 2016 test data (T16). For T16, the reference annotation dump of 2016/09/14 is used. The ranking of the submissions is shown in Tab. 3.

                     MAP@1          MAP@10         MAP@100
  Submission         T15    T16     T15    T16     T15    T16
  primary (p)        87.9   64.4    82.1   49.3    81.9   47.8
  contrast 1 (c1)    87.9   64.4    79.8   48.4    79.6   46.7
  contrast 2 (c2)    87.9   62.9    81.7   46.2    81.5   44.8
  contrast 3 (c3)    87.9   63.6    80.2   49.3    80.0   47.5
  contrast 4 (c4)    87.9   56.8    79.7   36.1    79.5   35.1
  (p − c4)/c4         0.0   13.4     3.0   36.6     3.0   36.2

Table 2: Mean average precision at different ranks (in %) for the 5 submissions. The last row gives the relative improvement (in %) of the primary run over the no-propagation baseline (c4).

             T15                       T16
  MAP@1      p = c1 = c2 = c3 = c4     p = c1, c3, c2, c4
  MAP@10     p, c2, c3, c1, c4         p, c2, c3, c1, c4
  MAP@100    p, c2, c3, c1, c4         p, c3, c1, c2, c4

Table 3: Ranking (best first) of the submissions.

All tag-propagation approaches improve over the no-propagation baseline (c4), the interest of tag propagation being much clearer on T16. The baseline highlights noticeable differences between T15 and T16: on T15, propagation was almost useless, as most nodes could be tagged in the initial stage; this is not the case on T16, where tag propagation yields a significant gain. The hierarchical tag propagation on graphs combining CNN visual similarity and binary voice similarity (primary) consistently outperforms the other combinations, showing the interest of combining audio and visual similarities. Comparing approaches, c3 usually performs better than c1 (except for T16 at MAP@1), indicating that the hierarchical tag propagation performs better than the random walk, at least with GMM-CNN audiovisual similarities. The comparison of the primary run and c3 shows the weakness of the GMM-based voice comparison with respect to the state-of-the-art approach used for diarization. Finally, the comparison of c3 and c2 gives mixed results: the use of the GMM-based voice comparison decreases performance in most cases, except on T16 at ranks 1 and 100.

5. ACKNOWLEDGEMENTS

Work supported by FAPEMIG/INRIA/MOTIF (CEX-APQ 03195-13), FAPEMIG/PPM (CEX-PPM-6-16) and CAPES (064965/2014-01).

6. REFERENCES

[1] M. Ben, M. Betser, F. Bimbot, and G. Gravier. Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs. In Proceedings of the 8th International Conference on Spoken Language Processing, pages 333–444, October 2004.
[2] H. Bredin, C. Barras, and C. Guinaudeau. Multimodal person discovery in broadcast TV at MediaEval 2016. In Working notes of the MediaEval 2016 Workshop, October 2016.
[3] C. E. dos Santos Jr, G. Gravier, and W. R. Schwartz. SSIG and IRISA at Multimodal Person Discovery. In Working notes of the MediaEval 2015 Workshop, September 2015.
[4] N. Le, D. Wu, S. Meignier, and J.-M. Odobez. EUMSSI team at the MediaEval Person Discovery Challenge. In Working notes of the MediaEval 2015 Workshop, September 2015.
[5] B. Perret, J. Cousty, J. C. R. Ura, and S. J. F. Guimarães. Evaluation of morphological hierarchies for supervised segmentation. In Proceedings of the 12th International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing, pages 39–50. Springer, 2015.
[6] J. Poignant, L. Besacier, and G. Quénot. Unsupervised speaker identification in TV broadcast based on written names. IEEE/ACM Transactions on Audio, Speech and Language Processing, 23(1):57–68, 2015.
[7] C. Raymond. Robust tree-structured named entities recognition from speech. In International Conference on Acoustics, Speech and Signal Processing, 2013.
[8] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[9] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of CNN activations. In Proceedings of the 2016 International Conference on Learning Representations, 2016.
[10] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer, 2002.