LIG at MediaEval 2015 Multimodal Person Discovery in Broadcast TV Task

Mateusz Budnik, Bahjat Safadi, Laurent Besacier, Georges Quénot
Univ. Grenoble Alpes, LIG, F-38000 Grenoble, France
CNRS, LIG, F-38000 Grenoble, France
firstname.lastname@imag.fr

Ali Khodabakhsh, Cenk Demiroglu
Electrical and Computer Engineering Department, Ozyegin University, Istanbul, Turkey
ali.khodabakhsh@ozu.edu.tr, cenk.demiroglu@ozyegin.edu.tr

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
In this working notes paper, the contribution of the LIG team (a partnership between Univ. Grenoble Alpes and Ozyegin University) to the Multimodal Person Discovery in Broadcast TV task at MediaEval 2015 is presented. The task focused on unsupervised learning techniques. Two different approaches were submitted by the team. In the first one, new features for the face and speech modalities were tested. In the second one, an alternative way to calculate the distance between face tracks and speech segments is presented; it achieved a competitive MAP score and was able to beat the baseline.

1. INTRODUCTION
These working notes present the submissions of the LIG team (a partnership between Univ. Grenoble Alpes and Ozyegin University) to the MediaEval 2015 Multimodal Person Discovery in Broadcast TV task. Along with the algorithms and initial results, a more general discussion of the task is provided as well. A detailed description of the task, the dataset, the evaluation metric and the baseline system can be found in the paper provided by the organizers [4]. All the approaches presented here are unsupervised (following the organizers' guidelines) and were submitted to the main task.

The main goal of the task is to identify people appearing in various TV shows, mostly news or political debates. The task is limited to persons who speak and are visible at the same time (potential people of interest). Additionally, the task is confined to the multimodal data (including face, speech and overlaid text) found in the test set videos and is strictly unsupervised (no manual annotation is available). The main source of names is the optical character recognition (OCR) system used in the baseline [3].

Thanks to the provided baseline system [5], it was possible to concentrate on particular aspects of the task, such as a single modality or the clustering method. Initially, our focus was on creating better face and speech descriptors. In the second approach, however, only the distances between face tracks and speech segments were modified. The output of the baseline OCR system was used as is, while the output of the speech transcription system was not used at all.

2. APPROACH
Our initial approach focused on creating new features for both face and speech. The second approach is based more closely on the baseline system, i.e. no new descriptors were generated and the key element was the distance between speech segments and face tracks.

2.1 What did not work: new features
The first approach explored the use of alternative features for the different modalities. For speech, a Total Variability Space (TVS) system [1] was designed with the following settings, using the segmentation provided by the baseline system. Models were learned on the test data, without any manual annotation available.

• 19 MFCCs and energy + ∆s (no static energy) + feature warping
• 20 ms analysis window with a 10 ms shift
• Energy-based silence filtering
• 1024 GMM components + 400-dimensional TVS
• Cosine similarities between segments within each video are calculated (a minimal sketch of this step follows the list)
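To make the last step concrete, here is a minimal sketch of cosine scoring between the 400-dimensional TVS vectors of the segments of one video. The function name, the use of NumPy and the random example input are illustrative assumptions, not part of the original system.

```python
import numpy as np

def cosine_similarity_matrix(ivectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between the TVS representations
    of the speech segments of a single video.

    ivectors: (n_segments, 400) array, one vector per segment.
    """
    # Length-normalise each vector so a plain dot product
    # equals the cosine of the angle between two vectors.
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    unit = ivectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Illustrative call: 10 segments with 400-dimensional TVS vectors.
scores = cosine_similarity_matrix(np.random.randn(10, 400))
```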
For faces, features extracted from a deep convolutional neural network [2] were used. This was done in the following way, using the test set only:

• Faces were extracted with the approach provided by the organizers and all scaled to a resolution of 100×100 pixels.
• Labels were generated by the OCR and then assigned to co-occurring faces, based on the temporal overlap between the face and the label. The resulting list served as a training set; the number of classes equaled the number of unique names.
• The general structure of the network is based on the smallest architecture presented in [6], but with just 5 convolutional layers and the number of filters at each layer reduced by half. The fully connected layers had 1024 outputs. It was trained for around 15 epochs (a sketch of a network with this shape follows the list).
• After the training, the last layer containing the classes was discarded and the last fully connected hidden layer (1024 outputs) was used for feature extraction.
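As a rough illustration, below is a minimal PyTorch sketch of a network with this overall shape. Only the 5 convolutional layers, the 1024-output fully connected layers, the 100×100 input and the removable class layer come from the description above; the kernel sizes, filter counts and pooling scheme are not specified in the text, so the ones used here (3×3 kernels, halved VGG-style widths, 2×2 max pooling) are assumptions.

```python
import torch.nn as nn

class SmallFaceCNN(nn.Module):
    """VGG-style net with 5 conv layers (illustrative filter counts)."""

    def __init__(self, n_classes: int):
        super().__init__()
        widths = [3, 32, 64, 128, 128, 256]  # assumed: half of VGG widths
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
        self.features = nn.Sequential(*layers)
        # A 100x100 input shrinks to a 3x3 map after five 2x2 poolings.
        self.hidden = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 3 * 3, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True))
        self.classifier = nn.Linear(1024, n_classes)  # dropped after training

    def forward(self, x):
        return self.classifier(self.hidden(self.features(x)))

    def extract(self, x):
        # 1024-dimensional face descriptor from the last hidden layer.
        return self.hidden(self.features(x))
```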
Two individual sets of clusters were generated, one for each modality. Afterwards, both were mapped to the shots; if there was an overlap with the same label, the person was named. Additional submissions involving this approach were made, which included adding the descriptors provided by the baseline (e.g. HOG for face and BIC for speech). However, they did not give better performance than the baseline.

2.2 What did work: modified distance between modalities
In the provided baseline, the written names are first propagated to speaker clusters and the named speakers are then assigned to co-occurring faces. Due to the nature of the test set, an alternative was used where the written names are first propagated to face clusters. These face-name pairs are subsequently assigned to co-occurring speech segments. This approach yielded a more precise but smaller set of named people compared to the baseline. In order to expand it, a fusion with the output of the baseline system was made, where every conflict (e.g. different names for the same shot) was resolved in favor of our proposed approach.

Additionally, another way to calculate the distance between speech segments and face tracks was developed. In the baseline, the distance between a face track and a speech segment is calculated using lip movement detection, the size and position of the face, and so on. Our complementary approach is based on the temporal correlation of tracks from the different modalities. First, overlapping face tracks and speech segments are extracted for each video. Similarity vectors for both modalities are then computed with respect to all the other segments within the same video. The correlation of the similarity vectors is calculated in order to determine which face and voice go together; in other words, a face-speech pair which appears together frequently throughout the video is more likely to belong to the same person (one possible formulation is sketched below). Finally, the output of this approach is fused with the output of the system described in the first paragraph of this subsection (face-name pairs assigned to co-occurring speech segments) to produce a single name for each shot.
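One plausible reading of this correlation step is sketched below: for every temporally overlapping (face track, speech segment) pair, the face track's similarities to the face tracks of all other overlapping pairs are correlated with the speech segment's similarities to the corresponding speech segments. The pair list, the matrix names and the choice of the Pearson coefficient are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def pair_scores(pairs, face_sim, speech_sim):
    """Correlate the similarity profiles of co-occurring modalities.

    pairs:      list of (face_idx, speech_idx) tuples that overlap in time
    face_sim:   (n_faces, n_faces) similarity matrix between face tracks
    speech_sim: (n_speech, n_speech) similarity matrix between segments

    A high score for a pair means that whenever a similar face shows up
    elsewhere in the video, a similar voice is heard as well, suggesting
    that the face and the voice belong to the same person.
    """
    scores = []
    for i, (f, s) in enumerate(pairs):
        others = [pair for j, pair in enumerate(pairs) if j != i]
        f_vec = np.array([face_sim[f, g] for g, _ in others])
        s_vec = np.array([speech_sim[s, t] for _, t in others])
        # Pearson correlation between the two similarity profiles.
        scores.append(np.corrcoef(f_vec, s_vec)[0, 1])
    return np.array(scores)
```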
3. INITIAL RESULTS AND DISCUSSION
The first system (submitted for the first deadline) performed rather poorly, with an EwMAP of 30.48% (MAP = 30.63%). Our second approach, submitted as the main system for the second deadline, was far more successful: with an EwMAP of 85.67% (MAP = 86.03%), it was able to beat the baseline system (EwMAP = 78.35%, MAP = 78.64%). The scores presented here were provided by the organizers and may change slightly before the workshop as more annotation becomes available.

During the preparation for this evaluation there were a number of issues and observations connected both to our approach and to the data. First of all, trying to build biometric models for individual people does not work well for this particular task (at least based on what was tested in the context of this evaluation, e.g. SVMs). In order to comply with the task requirements, the labels can only be generated from the OCR and then assigned to one of the modalities. However, both steps are unsupervised, generating noisy annotation in the process. Additionally, the video test set consists of one type of program (TV news) where, apart from the news anchor, most people appear only once, which may not be enough to create an accurate biometric model. This stands in contrast to the development set, which contains debates and parliament sessions in which some persons re-appear much more frequently.

A more general issue is the class imbalance. While some people, especially the anchors, appear frequently across different videos, most of the others are shown once or twice and are confined to a single video. This makes the use of unsupervised techniques, like clustering, challenging due to widely varying cluster sizes: small clusters can get attached to bigger ones, which is heavily penalized under the MAP metric. This can, at least partially, explain the poor performance of the first approach. Even though the features used in this method are state-of-the-art, they would require more high-quality data (including annotation) and parameter adjustment to create good enough distinctions between the thousands of individual persons appearing in the videos.

4. CONCLUSIONS
During this evaluation, different algorithms were tested in order to identify, in an unsupervised manner, people who speak and are visible in TV broadcasts. One approach concentrated on providing state-of-the-art features for the different modalities, while the other provided an alternative estimation of the distance between the already provided face and speech modalities.

The first approach, even with its limited performance on this particular shared task, seems to have greater potential, and our future work may try to address some of its shortcomings. This includes a focus on a more robust deep learning approach that could deal with noisy or automatically generated training sets.

5. ACKNOWLEDGMENTS
This work was conducted as a part of the CHIST-ERA CAMOMILE project, which was funded by the ANR (Agence Nationale de la Recherche, France).

6. REFERENCES
[1] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2011.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[3] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In ICME, 2012.
[4] J. Poignant, H. Bredin, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2015. In MediaEval 2015 Workshop, September 2015.
[5] J. Poignant, H. Bredin, V. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In INTERSPEECH, 2012.
[6] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.