LIG at MediaEval 2015 Multimodal Person Discovery in Broadcast TV Task

Mateusz Budnik, Bahjat Safadi, Laurent Besacier, Georges Quénot
Univ. Grenoble Alpes, LIG, F-38000 Grenoble, France
CNRS, LIG, F-38000 Grenoble, France
firstname.lastname@imag.fr

Ali Khodabakhsh, Cenk Demiroglu
Electrical and Computer Engineering Department
Ozyegin University, Istanbul, Turkey
ali.khodabakhsh@ozu.edu.tr, cenk.demiroglu@ozyegin.edu.tr

ABSTRACT

In this working notes paper, the contribution of the LIG team (a partnership between Univ. Grenoble Alpes and Ozyegin University) to the Multimodal Person Discovery in Broadcast TV task at MediaEval 2015 is presented. The task focused on unsupervised learning techniques. The team submitted two different approaches. In the first one, new features for the face and speech modalities were tested. In the second one, an alternative way to calculate the distance between face tracks and speech segments is presented; it achieved a competitive MAP score and was able to beat the baseline.

1. INTRODUCTION

These working notes present the submissions of the LIG team (a partnership between Univ. Grenoble Alpes and Ozyegin University) to the MediaEval 2015 Multimodal Person Discovery in Broadcast TV task. Along with the algorithms and initial results, a more general discussion of the task is provided as well. A detailed description of the task, the dataset, the evaluation metric and the baseline system can be found in the paper provided by the organizers [4]. All the approaches presented here are unsupervised (following the organizers' guidelines) and were submitted to the main task.

The main goal of the task is to identify people appearing in various TV shows, mostly news and political debates. The task is limited to persons who speak and are visible at the same time (potential people of interest). Additionally, the task is confined to the multimodal data (including face, speech and overlaid text) found in the test set videos and is strictly unsupervised (no manual annotation is available). The main source of names is the optical character recognition (OCR) system used in the baseline [3].

Thanks to the provided baseline system [5], it was possible to concentrate on specific aspects of the task, such as a particular modality or the clustering method. Initially, our focus was on creating better face and speech descriptors. In the second approach, however, only the distances between face tracks and speech segments were modified. The output of the baseline OCR system was used as is, while the output of the speech transcription system was not used at all.

2. APPROACH

Our initial approach focused on creating new features for both face and speech. The second approach is based more closely on the baseline system, i.e. no new descriptors were generated and the key element was the distance between speech segments and face tracks.

2.1 What did not work: new features

The first approach explored the use of alternative features for the different modalities. For speech, a Total Variability Space (TVS) system [1] was designed using the following settings, with the segmentation provided by the baseline system. Models were learned on the test data, without any manual annotation available:

• 19 MFCCs and energy + deltas (no static energy) + feature warping

• 20 ms window length with a 10 ms shift

• Energy-based silence filtering

• 1024-component GMM + 400-dimensional TVS

• Cosine similarities between segments, calculated within each video (see the sketch below)
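As a concrete illustration of the last step, the within-video scoring can be written in a few lines. This is a minimal sketch, not the actual evaluation code: the `ivectors` array and its layout are assumptions taken from the settings above.

    import numpy as np

    def cosine_similarities(ivectors):
        """Pairwise cosine similarities between the i-vectors of all
        speech segments of one video.

        ivectors: (n_segments, 400) array, one TVS i-vector per
        segment (the dimension follows the settings listed above).
        """
        # Length-normalize so a plain dot product equals the cosine.
        norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
        unit = ivectors / np.maximum(norms, 1e-10)
        return unit @ unit.T

    # Toy usage: 5 segments; random vectors stand in for real i-vectors.
    sims = cosine_similarities(np.random.randn(5, 400))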
For faces, features extracted from a deep convolutional neural network [2] were used. This was done in the following way, using the test set only:

• Faces were extracted with the approach provided by the organizers, all scaled to a resolution of 100×100 pixels.

• Labels were generated by the OCR and then assigned to co-occurring faces, based on the temporal overlap between the face and the label. The resulting list served as a training set; the number of classes equaled the number of unique names.

• The general structure of the network is based on the smallest architecture presented in [6], but with just 5 convolutional layers and the number of filters at each layer reduced by half. The fully connected layers had 1024 outputs. The network was trained for around 15 epochs.

• After training, the last layer containing the classes was discarded and the last fully connected hidden layer (1024 outputs) was used for feature extraction (see the sketch below).
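To make the network description concrete, the following PyTorch sketch builds a net of this overall shape. The exact filter counts, pooling scheme and layer placement are our assumptions derived from the description above (smallest architecture of [6], 5 convolutional layers, filters halved, 1024-unit fully connected layers); the configuration actually used may differ.

    import torch.nn as nn

    class FaceFeatureNet(nn.Module):
        """5 conv layers (roughly half the filters of the smallest VGG
        configuration) + 1024-unit fully connected layers (assumed)."""
        def __init__(self, n_classes):
            super().__init__()
            self.features = nn.Sequential(
                # input: 3 x 100 x 100 face crops
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(256 * 3 * 3, 1024), nn.ReLU(),  # 100 -> 3 after 5 pools
                nn.Linear(1024, 1024), nn.ReLU(),
            )
            # Trained with a classification head over the unique OCR names;
            # the head is discarded at feature-extraction time.
            self.classifier = nn.Linear(1024, n_classes)

        def forward(self, x, extract_features=False):
            h = self.fc(self.features(x))
            return h if extract_features else self.classifier(h)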
Two individual sets of clusters were generated, one for each modality. Afterwards, both were mapped to the shots. If a shot overlapped with a face cluster and a speech cluster carrying the same label, the person was named. Additional submissions involving this approach were made, which included adding descriptors provided by the baseline (e.g. HOG for face and BIC for speech). However, they did not manage to give better performance than the baseline.
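The shot-naming rule amounts to checking that the face-side and speech-side labels of a shot agree. A minimal sketch under an assumed dictionary layout (the task's actual file formats differ):

    def name_shots(face_names, speech_names):
        """face_names / speech_names: shot_id -> set of names propagated
        from the face / speech clusters (illustrative layout)."""
        named = {}
        for shot in face_names.keys() & speech_names.keys():
            common = face_names[shot] & speech_names[shot]
            if common:  # same label present in both modalities
                named[shot] = common
        return named

    # Toy usage: shot "s1" agrees on a name, shot "s2" does not.
    print(name_shots({"s1": {"Alice"}, "s2": {"Bob"}},
                     {"s1": {"Alice"}, "s2": {"Carol"}}))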
2.2 What did work: modified distance between modalities

In the provided baseline, the written names are first propagated to speaker clusters, and the named speakers are then assigned to co-occurring faces. Due to the nature of the test set, an alternative was used in which the written names are first propagated to face clusters. These face-name pairs are subsequently assigned to co-occurring speech segments. This approach yielded a more precise but smaller set of named people compared to the baseline. In order to expand it, a fusion with the output of the baseline system was made, where every conflict (e.g. different names for the same shot) is resolved in favor of our proposed approach (see the sketch below).
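The fusion step can be expressed as a simple override: start from the baseline's shot-to-name mapping and let the face-first propagation win every conflict. Again a sketch with an assumed dictionary layout:

    def fuse(ours, baseline):
        """Merge two shot_id -> name mappings; on conflict the name from
        the face-first propagation ('ours') is kept, as described above."""
        fused = dict(baseline)  # start from the larger baseline output
        fused.update(ours)      # our more precise names take precedence
        return fused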
Additionally, another way to calculate the distance between a speech segment and a face track was developed. In the baseline, the distance between a face track and a speech segment is calculated using lip movement detection, the size and position of the face, and so on. Our complementary approach is based on the temporal correlation of tracks from different modalities.

First, overlapping face tracks and speech segments are extracted for each video. Similarity vectors for both modalities are computed with respect to all the other segments within the same video. The correlation of the similarity vectors is then calculated in order to determine which face and voice go together. In other words, a face-speech pair which appears together frequently throughout the video is more likely to belong to the same person. Finally, the output of this approach is fused with the output of the system described in the first paragraph of this subsection (face-name pairs assigned to co-occurring speech segments) to produce a single name for each shot.
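A sketch of the pair scoring follows. How the two similarity vectors are aligned is our reading of the description (both indexed over the same reference set within one video), not a published specification:

    import numpy as np

    def pair_score(face_sim, speech_sim):
        """Pearson correlation between the similarity vector of a face
        track and that of a co-occurring speech segment; a pair that
        co-occurs consistently across the video scores high."""
        return float(np.corrcoef(face_sim, speech_sim)[0, 1])

    # Toy usage: strongly correlated vectors score close to 1.
    print(pair_score(np.array([0.9, 0.1, 0.8, 0.2]),
                     np.array([0.8, 0.2, 0.9, 0.1])))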
3. INITIAL RESULTS AND DISCUSSION

The first system (submitted for the first deadline) performed rather poorly, with 30.48% EwMAP (MAP = 30.63%). Our second approach, submitted as the main system for the second deadline, was far more successful: with EwMAP = 85.67% (MAP = 86.03%), it beat the baseline system (EwMAP = 78.35%, MAP = 78.64%). The scores presented here were provided by the organizers and may change slightly before the workshop as more annotation becomes available.

During the preparation for this evaluation, a number of issues and observations arose, connected both to our approach and to the data. First of all, trying to build biometric models for individual people does not work well for this particular task (at least based on what was tested in the context of this evaluation, e.g. SVMs). In order to comply with the task requirements, the labels can only be generated from the OCR and then assigned to one of the modalities. However, both steps are unsupervised, generating noisy annotation in the process. Additionally, the video test set consists of one type of program (TV news) where, apart from the news anchor, most people appear only once, which may not be enough to create an accurate biometric model. This stands in contrast to the development set, which contains debates and parliament sessions in which some persons re-appear much more frequently.

A more general issue is class imbalance. While some people, especially the anchors, appear frequently across different videos, most of the others are shown once or twice and are confined to a single video. This makes the use of unsupervised techniques like clustering challenging, due to widely varying cluster sizes: small clusters can get attached to bigger ones, which is heavily penalized under the MAP metric. This can, at least partially, explain the poor performance of the first approach. Even though the features used in this method are state-of-the-art, they would require more high-quality data (including annotation) and parameter adjustment to create good enough distinctions between the thousands of individual persons appearing in the videos.

4. CONCLUSIONS

During this evaluation, different algorithms were tested in order to identify, in an unsupervised way, people who speak and are visible in TV broadcasts. One approach concentrated on providing state-of-the-art features for the different modalities, while the other provided an alternative estimation of the distance between the already provided face and speech modalities.

The first approach, even with its limited performance on this particular shared task, seems to have greater potential, and our future work may try to address some of its shortcomings. This includes a focus on a more robust deep learning approach that could deal with noisy or automatically generated training sets.

5. ACKNOWLEDGMENTS

This work was conducted as a part of the CHIST-ERA CAMOMILE project, which was funded by the ANR (Agence Nationale de la Recherche, France).

6. REFERENCES

[1] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2011.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[3] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. ICME, 2012.
[4] J. Poignant, H. Bredin, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2015. MediaEval 2015 Workshop, September 2015.
[5] J. Poignant, H. Bredin, V. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. INTERSPEECH, 2012.
[6] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.