=Paper=
{{Paper
|id=Vol-1436/Paper68
|storemode=property
|title=SSIG and IRISA at Multimodal Person Discovery
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper68.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SantosGS15
}}
==SSIG and IRISA at Multimodal Person Discovery==
SSIG and IRISA at Multimodal Person Discovery Cassio E. dos Santos Jr1 , Guillaume Gravier2 , William Robson Schwartz1 1 Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil 2 IRISA & Inria Rennes , CNRS, Rennes, France cass@dcc.ufmg.br, guig@irisa.fr, william@dcc.ufmg.br ABSTRACT graph where nodes are speaking faces, with edges denoting This paper describes our approach and results in the multi- the voice and/or face similarity. This approach is motivated modal person discovery in broadcast TV task at MediaEval by the wish to avoid explicit face and speaker clustering and 2015. We investigate two distinct aspects of multimodal open new strategies for person discovery. Note that the two person discovery. One refers to face clusters, which are con- approaches could be combined but, for practical reasons, sidered to propagate names associated to faces in one shot this combination was not considered in the framework of to other faces that probably belong to the same person. The the evaluation. face clustering approach consists in calculating face similar- ities using partial least squares (PLS) and a simple hierar- 2. PLS-BASED FACE CLUSTERING chical approach. The other aspect refers to tag propagation The PLS-based face clustering approach consists in calcu- in a graph-based approach where nodes are speaking faces lating a similarity measure between face tracks for further and edges link similar faces/speakers. The advantage of the clustering. Face clusters are then used in a variant of the graph-based tag propagation is to not rely on face/speaker baseline, as a replacement of the face clusters provided. clustering, which we believe can be errorprone. PLS is a statistical method consisting of two steps: re- gression and projection [9]. The projection step consists 1. INTRODUCTION in calculating a subspace that maximize the covariance be- Multimodal person discovery in video archives consists in tween predictors and responses. The regression step relies naming all speaking faces in the collection without prior on ordinary least squares to estimate responses based on information, leveraging face recognition, speech recogni- the projected predictors. We employ the one-shot similarity tion, speaker recognition and optical character recognition. metric based on PLS for face verification described in [4], A description of the task and resources provided within which presents robust results for face images in the wild MediaEval is given in [2]. In particular, two key components compared to conventional distance-based methods. In a nut- of most systems for multimodal person discovery are (i) face shell, the similarity sim(A, B) between face tracks A and B tracking and clustering and (ii) speaker diarization. See [6] relies on PLS regression trained to return +1 for samples for a recent overview of existing systems. Given these com- in A and response −1 for samples in a background set of ponents, a popular strategy to name speaking faces relies on images (300 random face images from the LFW dataset [5]). a mapping of face clusters and speakers from the diarization, Then, sim(A, B) is calculated as the average of responses combining this mapping with appearance of named entities from samples in B evaluated in the learned PLS regression. in speech transcripts or on screen (e.g., [3, 8]). The baseline A symmetric version is used in practice, averaging sim(A, B) system provided by the organizers [7] is a clear instanciation and sim(B, A). of this. Person names appearing on screen are first prop- Based on PLS similarity calculated between all face track agated onto speaker clusters, finding an optimal mapping pairs, clustering aims at grouping face tracks from the same based on co-occurrence. In the next step, one has to find subject. We employ a hierarchical clustering approach that for each named speaker if there is a co-occurring face track consists in merging a pair of face tracks with maximum sim- that has a probability to correspond to the current speaker ilarity and with at least one face track that was not merged higher than a threshold. Each such face track receives the yet. The merging consists in propagating an identification name assigned to the speaker cluster. label from one face track to the other or generating a new We explore two distinct aspects of multimodal person dis- identification label for the pair if no label was previous as- covery in this evaluation. On the one hand, we seek to im- sociated to the face tracks. The algorithm stops when the prove face clustering using recent advances in face recogni- maximum similarity is less than a threshold, empirically set tion based on partial least square (PLS) regression [4]. We to 0.5 using the development set. consider a variant of the baseline system provided, modified To assess the interest of PLS-based face clustering, we to better merge the PLS face cluster and speaker diarization consider a slightly different version of the baseline approach results. On the other hand, we study tag propagation in a to merge face clustering and speaker diarization informa- tion. Each name associated to one face track is propagated to all face tracks within the same face cluster. We then Copyright is held by the author/owner(s). consider the union of the names from the face tracks and MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany speaker diarization within each shot. We also evaluate the method BSLN SPKR FACE UNI INT EwMAP MAP C dev 38.89 63.67 49.12 67.84 44.83 no prop 44.5 44.7 76.7 dev test 78.35 89.46 67.18 89.74 66.86 1 step prop 53.6 54.0 75.4 test (PLS) 78.35 89.46 61.90 89.64 61.64 test no prop 78.3 79.5 89.7 Table 1: EwMAP (in %) using the baseline face clus- Table 2: Results with graph-based naming on the ters on the development set (top row), on the test development data (test2) and on the test data. set (middle row) and using the PLS-based face clus- ters on the test set (bottom row). diarization. In PLS-based face clustering, we consider the CLBP [1] feature descriptor with radius parameter 5 calcu- modified baseline approach using only the speaker diariza- lated in squared blocks of size 16 pixels and stride of 8 pixels. tion, only the face cluster, and considering the intersection All faces were cropped from the videos using the face posi- of the names instead of union. tion provided in the baseline approach and scaled to 128 by 128 pixels. Note that we do not provide face clusters based 3. GRAPH-BASED TAG PROPAGATION on PLS for the development set and, therefore, all results in Tab. 1 for the development set consider only the face clusters To skirt issues with errors in clustering, which we be- available in the baseline approach. We also provide the re- lieve can strongly affect the naming process, we investigate sults on the test set considering the face clusters provided in a strategy based on tag propagation within a graph where a the baseline method, i.e., without PLS-based face clustering. node corresponds to an occurrence of a speaking face within The SPK approach yields the best EwMAP in Tab. 1 while a shot. the FACE yields the worst results. However, the results from The first step is the graph construction process, which INT and UNI indicate that the two approaches present com- consists in identifying speaking faces from the face tracks plementary results, i.e., the intersection of the propagated detected within each shot1 . This is achieved by selecting names among the face clusters and speaker diarization shots face tracks whose probability to correspond to the current indicates that a small subset of correct names from the face speech turn is greater than a threshold empirically set to clusters that are not in the speaker names, These aspects 0.6, where the probabilities that a face track corresponds are observed in the development and test set, using the face to a speech turn are those provided. For each selected face clusters in the baseline or the PLS method face clusters. We track, we keep a record of the matching speech turn. The se- also noticed no significant difference in the results between lected speaking face tracks are the nodes of a graph and are face clusters provided in the baseline approach and using the connected with edges bearing two scores, depicting the simi- PLS-based method, considering the UNI approach. We be- larity of resp. voice and face (as given in the speech turn and lieve that this small difference is an effect of the poor quality face track similarity files). To avoid a fully connected graph of the face clusters, which might result from combined errors and keep only relevant relationships, we connect two nodes if in the face detection and in the face tracking methods. the similarity between the corresponding face tracks and the Results for the graph-based tag propagation method are similarity between the corresponding speech turns are both given in Tab. 2. On the development data (test2 subset), above a threshold, empirically set to 0.1 for both modali- results are provided without tag propagation (no prop) and ties. Note that having no relations between face tracks and with a singl step of tag propagation. We believe that the speech turns across shows, a graph is built independently poor results obtained are attributable to the fact that the for each show. graph links only submission shots, which account only for a The naming process starts by associating a name to a small fraction of the total number of shots in the develop- node whenever possible based on the output of overlaid text ment data. Contrarily, most of the shots in the test data detection: if an overlay significantly overlaps the face track, are subission shots. With no surprise, tag propagation im- the node is tagged with the corresponding name and a score proves the MAP to the expense of correctness. Submission of 1. In case of multiple overlaping overlays, the name cor- on the test set was made without tag propagation (because responding to the longest co-occurrence is considered. Af- of unconvincing propagation results at the time) and not up- ter tagging all nodes, tags are optionally propagated over dated after the initial submission (July 1st). Interestingly, a number of iterations. At each iteration, each tag of each direct naming of speaking face tracks from overlays (i.e., no node is propagated via the corresponding edges with a prop- propagation) already provides accurate tagging. agation score equal to the tag score multiplied by the edge weight, where edge weights are taken as the average of the face and voice similarity. After propagation, each node re- 5. ACKNOWLEDGEMENTS ceives the tag with the highest score. The authors would like to thank the Brazilian National Research Council – CNPq (Grant #477457/2013-4), Brazil- ian National Council for the Improvement of Higher Educa- 4. RESULTS tion – CAPES (Grant STIC-AMSUD 001/2013) and the Mi- The results from the second submission (July 8th) of the nas Gerais Research Foundation – FAPEMIG (Grants APQ- four PLS-based methods and the baseline are presented in 01806-13 and CEX-APQ-03195-13). This work was partially Tab. 1, where the following abbreviations are employed: supported by the STIC AmSud program, under the project PLS-based face clustering considering only speaker diariza- ’Unsupervised Mining of Multimedia Content’, and by the tion (SPKR), only face clusters (FACE), union (UNI) and Inria Associate Team program. intersection (INT) of names among face clusters and speaker 1 Only submission shots were considered in this work. 6. REFERENCES [1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006. [2] H. Bredin, J. Poignant, and C. Barras. Overview of the multimodal person discovery task at MediaEval 2015. In Working Notes Proc. of MediaEval 2015 Workshop, 2015. [3] H. Bredin, A. Roy, V.-B. Le, and C. Barras. Person Instance Graphs for Mono-, Cross- and Multi-Modal Person Recognition in Multimedia Data. Application to Speaker Identification in TV Broadcast. International Journal of Multimedia Information Retrieval, 2014. [4] H. Guo, W. R. Schwartz, and L. S. Davis. Face verification using large feature sets and one shot similarity. In Intl. Conf. on Biometrics, pages 1–8, 2011. [5] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007. [6] J. Poignant. Identification non-supervisée de personnes dans les flux teéleévisés. PhD thesis, Université de Grenoble, 2013. [7] J. Poignant, H. Bredin, V.-B. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In Annual Conf. of the International Speech Communication Association, 2012. [8] J. Poignant, G. Fortier, L. Besacier, and G. Quénot. Naming multi-modal clusters to identify persons in TV broadcast. Multimedia Tools and Applications, 2015. [9] R. Rosipal and N. Krämer. Overview and recent advances in partial least squares. In Subspace, latent structure and feature selection, pages 34–51. Springer, 2006.