     UPC System for the 2015 MediaEval Multimodal Person
               Discovery in Broadcast TV task

                          M. India, D. Varas, V. Vilaplana, J.R. Morros, J. Hernando
                                 Universitat Politecnica de Catalunya, Spain

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
This paper describes a system to identify people in broadcast TV shows in a purely unsupervised manner. The system outputs the identity of people who appear, talk and can be identified using information appearing in the show (in our case, text with person names). Three types of monomodal technologies are used: speech diarization, video diarization and text detection / named entity recognition. These technologies are combined using a linear programming approach where some restrictions are imposed.

1.   INTRODUCTION
The goal of the 2015 Multimodal Person Discovery in Broadcast TV task [13] is to identify people appearing and speaking in TV shows in a purely unsupervised manner. This paper describes the UPC contribution, which is based on combining speech diarization, video-based face diarization and text detection plus Named Entity Recognition (NER). We did not make use of the names present in speech transcriptions.

2.   AUDIO SYSTEM
Speaker information was extracted using an Agglomerative Hierarchical Clustering diarization system based on Hidden Markov Models [21, 20, 2, 11]. It uses energy-based speech activity detection, Mel Frequency Cepstral Coefficients as voice features and an initial uniform segmentation.

Speaker clusters are modeled with Gaussian Mixture Models (GMM). Model complexity is selected from the amount of data per cluster and the cluster complexity ratio, which fixes the amount of speech per Gaussian. Hidden Markov Model (HMM) training and cluster realignment by Viterbi decoding are based on maximum likelihood. In the decoding stage, a minimum speaker segment duration of 3 seconds is imposed to avoid overly short segments. For the cluster merging, the most likely pair of clusters is selected in each iteration. This likelihood is calculated using a modified Bayesian information criterion (BIC) [4, 1] metric between clusters.

This system has been used with two different kinds of input for each show. On the one hand, diarization is run on each audio file without any constraint. On the other hand, using a face-tracking system, segments without tracked faces are discarded. The purpose of this second method is to run the diarization only in those parts where we assume that someone in the video must be speaking.

3.   VIDEO SYSTEM
For face tracking, the baseline code was used (tracking by detection using the Kanade-Lucas-Tomasi algorithm [18, 10, 16]). For feature extraction we used the technique in the baseline (HOG [5] features on facial locations [19], concatenated and projected using LDML [8]). While in the baseline a single descriptor was selected for each track, we used several vectors, obtained by uniform temporal sampling of the track faces. We expect this approach to better capture the variations in pose/expression.

We used agglomerative hierarchical clustering. A binary hierarchical tree is created by fusing tracks according to the minimum distance between track vectors. The number of clusters may vary between videos and has to be determined. It is estimated by evaluating the Calinski-Harabasz [3] and Silhouette [14] criteria over the range of 50 to 80 clusters; the number of clusters is the average of the cluster counts at which each criterion attains its maximum.

To improve the diarization, spatio-temporal restrictions were introduced. We assume that a person cannot appear twice in a frame, so tracks that overlap in time should represent different persons and are prevented from merging into the same cluster. Also, as we use a multi-vector representation for each track, vectors in the same track must be part of the same cluster. Restrictions are modeled using a matrix expressing the relationship between the feature vectors. Entries for vectors in different tracks were assigned a value of 1, entries for vectors in the same track were assigned a value 0 < v ≪ 1, and entries for vectors in temporally co-occurring tracks received a very large value v ≫ 1. This matrix is multiplied point-wise with the vector-to-vector distance matrix used for clustering.

4.   TEXT SYSTEM
We used the person names provided in the baseline [6, 12] and our own technology for obtaining person names (in different runs). From the input image, a segmentation is created with a Binary Partition Tree [15] using color and stroke width [7]. A partition is built where each character is a connected component while background regions are merged. Next, regions are filtered by a sequence of binary classifiers that reject non-character components. Components accepted by the classifiers as character candidates are combined into pairs, and pairs are combined into chains. A post-processing stage is applied to find missing components wrongly rejected as false positives in the filtering stage. The Tesseract OCR Engine [17] provides one transcription for each text chain, and the Stanford Named Entity Recognizer [9] is used to automatically detect person names in the text.

5.   FUSION
Our system combines the previous information sources to obtain the final person recognition labelling. Speaker diarization and video diarization are first performed independently. To decide how to fuse this information into a final labelling, the development database was analyzed and the following assumptions were made:

     • The speaker is not always the person shown on screen, so it is important to weigh accurately the temporal overlaps between each speaker and its possible face identity assignments.
     • Some speakers never come into view during the show, and some people are shown on screen but do not speak. Both should be discarded.
     • Text identities are more closely related to who is shown than to who is speaking, so text is better combined with video than with speech.

According to these assumptions, an algorithm was designed based on weighting temporal overlaps between tracks (Figure 1). This algorithm considers two different fusion modalities (Video/Text and Video/Audio) and combines both to obtain a final track file. First, text and video are fused. Their overlapping tracks are selected, and the temporal overlaps of their identities are weighted to set the constraints of an integer linear programming (ILP) system (IBM CPLEX):

$$\max \sum_{i} \sum_{j} \alpha_{ij} \beta_{ij} \tag{1}$$

$$\sum_{j} \alpha_{ij} \le 1 \tag{2}$$

where $\alpha_{ij}$ is the assignment between text identity $i$ and video identity $j$, and $\beta_{ij}$ is the weight of that assignment. Equation 2 establishes that each text identity has at most one face identity assigned. The next step is to combine the speech diarization tracks with the face tracks that have a text identity assigned. The same ILP-based method is used. Finally, using the relation between text, face and speaker identities and the overlapping tracks in the second fusion, the final labelling output was obtained.

               Figure 1: System block diagram

A second algorithm was implemented by changing the order of the fusions. In this case, audio was fused with the video and the result was combined with the text identities. Thus, only the face identities with a speaker assigned were considered.

6.    RESULTS
Five different experiments were performed, which are shown in Table 1. These experiments were run on the training database and evaluated using the mean average precision (MAP) metric. In the experiments we tested several variations: the order of the fusions, the input of the audio diarization and the text system used. In Table 1, System 1 refers to the architecture shown in Figure 1, where the first fusion combines text and video, and System 2 refers to first combining video and audio and later fusing text. facetrack indicates that the audio diarization is performed using only the audio segments where faces are detected. The null case (-) means performing the diarization on the whole audio input. While the first four experiments use the baseline names, in the fifth one the system described in Section 4 was used.

     Exp.   System   Audio Input   NER          MAP
     1      2        facetrack     Baseline     22.6
     2      1        facetrack     Baseline     27.1
     3      2        -             Baseline     33.5
     4      1        -             Baseline     41.6
     5      1        -             UPC system   32.6

                    Table 1: MAP Evaluation

The best performance was achieved in experiment 4 by System 1, without filtering the audio input for the diarization and using the baseline person names. The results clearly show that the system works better when the diarization is run on the whole audio input (41.6 vs. 27.1 for System 1 and 33.5 vs. 22.6 for System 2). Regarding the fusion order, the results indicate that fusing video and text tracks first provides better performance (41.6 vs. 33.5 on the whole audio input and 27.1 vs. 22.6 with facetrack).

The five experiments were run on the test data. Experiments 1-4 were submitted on July 1st and experiment 5 on July 8th. The best set-up on the training data (Exp. 4 in Table 1) was uploaded as our primary submission. After evaluating this primary submission with the final set of annotations, the following results were obtained: EwMAP = 54.1%, MAP = 54.36% and C = 69.71%. Experiment 5 is similar to experiment 4 but uses our own technology to obtain person names. The OCR and NER performed poorly, and thus the results were worse than expected.

7.    CONCLUSIONS
Speaker diarization, face recognition, and text detection with named entity recognition have been combined using an integer linear programming approach. Our idea was to first perform monomodal speech and video diarizations, using as many restrictions as possible to improve the results, and then use ILP to combine these diarizations along with the person name information. Several architectures for this combination and several constraints of the integer linear programming algorithm were considered. The architecture that first fuses video with the text stream and then combines the result with the audio modality provided the best results.

8.    ACKNOWLEDGMENTS
This work has been developed in the framework of the projects TEC2013-43935-R, TEC2012-38939-C03-02 and PCIN-2013-067. It has been financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).
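
The first fusion step of Section 5 (text with video, Equations 1 and 2) can be illustrated with a minimal Python sketch. It is not the authors' implementation, which uses IBM CPLEX. It assumes that the weight of an assignment is the total temporal overlap in seconds between a text identity and a face identity, and that Equation 2 is the only constraint. Under those assumptions the problem decomposes per text identity, so no solver is needed. The track format (label, (start, end)) and the function names overlap, overlap_weights and assign are hypothetical.

```python
from collections import defaultdict


def overlap(a, b):
    """Length of the intersection of two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def overlap_weights(text_tracks, face_tracks):
    """Build beta[i][j]: total co-occurrence time of text identity i
    and face identity j. Assumption: weight = accumulated overlap."""
    beta = defaultdict(lambda: defaultdict(float))
    for name, text_seg in text_tracks:
        for face_id, face_seg in face_tracks:
            w = overlap(text_seg, face_seg)
            if w > 0:
                beta[name][face_id] += w
    return beta


def assign(beta):
    """Maximize sum_ij alpha_ij * beta_ij subject to sum_j alpha_ij <= 1.

    With only this per-row constraint and non-negative weights, the
    optimum picks the highest-weighted face identity for each text
    identity. Text identities with no overlap stay unassigned.
    """
    return {name: max(row, key=row.get) for name, row in beta.items() if row}


# Hypothetical toy tracks: (identity label, (start, end) in seconds).
text_tracks = [
    ("Alice", (0.0, 4.0)),
    ("Bob", (10.0, 13.0)),
    ("Alice", (20.0, 22.0)),
]
face_tracks = [
    ("face_1", (0.0, 5.0)),
    ("face_2", (3.0, 12.0)),
    ("face_1", (19.0, 25.0)),
]

beta = overlap_weights(text_tracks, face_tracks)
assignment = assign(beta)
print(assignment)
```

If further constraints were added, for example allowing each face identity to receive at most one name, the rows would no longer be independent and a general ILP solver would be required.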
