     UPC System for the 2015 MediaEval Multimodal Person
               Discovery in Broadcast TV task

                          M. India, D. Varas, V. Vilaplana, J.R. Morros, J. Hernando
                                 Universitat Politecnica de Catalunya, Spain



ABSTRACT

This paper describes a system to identify people in broadcast TV shows in a purely unsupervised manner. The system outputs the identity of people that appear, talk and can be identified by using information appearing in the show (in our case, text with person names). Three types of monomodal technologies are used: speech diarization, video diarization, and text detection / named entity recognition. These technologies are combined using a linear programming approach where some restrictions are imposed.
1. INTRODUCTION

The goal of the 2015 Multimodal Person Discovery in Broadcast TV task [13] is to identify people appearing and speaking in TV shows in a purely unsupervised manner. This paper describes the UPC contribution, which is based on combining speech diarization, video-based face diarization, and text detection plus Named Entity Recognition (NER). We did not make use of the names present in the speech transcriptions.
2. AUDIO SYSTEM

Speaker information was extracted using an Agglomerative Hierarchical Clustering diarization system based on Hidden Markov Models [21, 20, 2, 11]. It uses energy-based speech activity detection, Mel Frequency Cepstral Coefficient voice features, and an initial uniform segmentation.

Speaker clusters are modeled with Gaussian Mixture Models (GMM). The complexity of each model is selected from the amount of data per cluster and the cluster complexity ratio, which fixes the amount of speech per Gaussian. Hidden Markov Model (HMM) training and cluster realignment by Viterbi decoding are based on maximum likelihood. In the decoding stage, a minimum speaker segment duration of 3 seconds is imposed to deal with too-short segments. For the cluster merging, the most likely pair of clusters is selected in each iteration. This likelihood is calculated using a modified Bayesian Information Criterion (BIC) [4, 1] metric among clusters.

This system has been used with two different kinds of input for each show. On the one hand, diarization is run on each audio file without any constraint. On the other hand, using a face-tracking system, segments without tracked faces are discarded; the purpose of this second method is to run the diarization only in those parts where we assume that someone in the video must be speaking.
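As an illustration of the merging criterion, the following sketch computes a delta-BIC score between two clusters of feature frames. It is a minimal sketch only: it assumes the classic single full-covariance Gaussian formulation of Chen and Gopalakrishnan [4] rather than the GMM-based modified BIC actually used, and all names are ours.

    import numpy as np

    def delta_bic(X1, X2, lam=1.0):
        """Delta-BIC between two clusters of feature frames (one frame
        per row); the pair with the lowest value is the most likely
        merge candidate, and negative values favor merging."""
        X = np.vstack([X1, X2])
        n, n1, n2, d = len(X), len(X1), len(X2), X.shape[1]
        # Log-determinant of a (regularized) full covariance matrix.
        logdet = lambda Y: np.linalg.slogdet(
            np.cov(Y, rowvar=False) + 1e-6 * np.eye(d))[1]
        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
        return 0.5 * (n * logdet(X) - n1 * logdet(X1)
                      - n2 * logdet(X2)) - lam * penalty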
3. VIDEO SYSTEM

For face tracking, the baseline code was used (tracking by detection using the Kanade-Lucas-Tomasi algorithm [18, 10, 16]). For feature extraction we used the technique in the baseline (HOG [5] features on facial locations [19], concatenated and projected using LDML [8]). While in the baseline a single descriptor was selected for each track, we used several vectors, obtained by uniform temporal sampling of the track faces. We expect this approach to better capture the variations in pose and expression.

We used agglomerative hierarchical clustering. A binary hierarchical tree is created by fusing tracks according to the minimum distance between track vectors. The number of clusters may vary between videos and has to be determined: it is estimated by evaluating the Calinski-Harabasz [3] and Silhouette [14] criteria in the range of [50, 80] clusters, and taking the number of resulting clusters to be the average of the maxima of the two criteria.
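A minimal sketch of this model-selection step, assuming scikit-learn's implementations of the two criteria, SciPy's agglomerative clustering, and a matrix X of track descriptors; function and variable names are illustrative, not the system's.

    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.metrics import calinski_harabasz_score, silhouette_score

    def estimate_num_clusters(X, k_min=50, k_max=80):
        """Evaluate both criteria for every candidate k in [k_min, k_max]
        and return the average of the two maximizing values of k."""
        Z = linkage(X, method='average')  # binary hierarchical tree
        ch, sil = {}, {}
        for k in range(k_min, k_max + 1):
            labels = fcluster(Z, t=k, criterion='maxclust')
            ch[k] = calinski_harabasz_score(X, labels)
            sil[k] = silhouette_score(X, labels)
        k_ch = max(ch, key=ch.get)     # k maximizing Calinski-Harabasz
        k_sil = max(sil, key=sil.get)  # k maximizing Silhouette
        return round((k_ch + k_sil) / 2)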
To improve the diarization, spatio-temporal restrictions were introduced. We assume that a person cannot appear twice in the same frame, so tracks with temporal overlap should represent different persons and are prevented from merging into the same cluster. Also, as we use a multi-vector representation for each track, vectors in the same track must end up in the same cluster. The restrictions are modeled using a matrix expressing the relationship between the feature vectors: entries for vectors in different tracks were assigned a value of 1, entries for vectors in the same track were assigned a value 0 < v ≪ 1, and entries for vectors in temporally co-occurring tracks received a very large value v ≫ 1. This matrix is used to point-wise multiply the vector-to-vector distance matrix used for clustering.
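A minimal sketch of how such a constraint matrix could be built and applied, assuming each descriptor carries the id of its source track and each track has a (start, end) time span; all names and the two constants are illustrative.

    import numpy as np

    def constrained_distances(D, track_ids, spans, v_same=1e-3, v_overlap=1e6):
        """Point-wise multiply the descriptor distance matrix D by the
        constraint matrix: ~0 for vectors of the same track (forced to
        merge), 1 for unrelated tracks, and a very large value for
        temporally co-occurring tracks (prevented from merging)."""
        n = len(track_ids)
        C = np.ones((n, n))
        for i in range(n):
            for j in range(n):
                if track_ids[i] == track_ids[j]:
                    C[i, j] = v_same
                else:
                    (s1, e1) = spans[track_ids[i]]
                    (s2, e2) = spans[track_ids[j]]
                    if s1 < e2 and s2 < e1:  # tracks overlap in time
                        C[i, j] = v_overlap
        return D * C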
4. TEXT SYSTEM

We used the person names provided in the baseline [6, 12] and our own technology for obtaining person names (in different runs). From the input image a segmentation is created with a Binary Partition Tree [15] using color and stroke width [7]. A partition is built where each character is a connected component while background regions are merged. Next, regions are filtered by a sequence of binary classifiers that reject non-character components. Components accepted by the classifiers as character candidates are combined into pairs, and pairs are combined into chains. A post-processing stage is applied to recover components wrongly rejected as false positives in the filtering stage. The Tesseract OCR engine [17] provides one transcription for each text chain, and the Stanford Named Entity Recognizer [9] is used to automatically detect person names in the text.
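The last two stages of this pipeline can be sketched as follows, assuming pytesseract as a front end to Tesseract and NLTK's wrapper around a Stanford NER model; the model path, jar path and function names are placeholders, not the configuration actually used.

    import pytesseract
    from PIL import Image
    from nltk.tag import StanfordNERTagger

    # Placeholder paths to a Stanford NER model and its jar.
    tagger = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                               'stanford-ner.jar')

    def person_names(chain_image):
        """OCR one detected text chain and keep the PERSON entities."""
        text = pytesseract.image_to_string(Image.open(chain_image))
        names, current = [], []
        for word, tag in tagger.tag(text.split()):
            if tag == 'PERSON':
                current.append(word)        # extend the current name span
            elif current:
                names.append(' '.join(current))
                current = []
        if current:
            names.append(' '.join(current))
        return names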
Figure 1: System block diagram

5. FUSION

Our system combines the previous information sources to obtain the final person recognition labelling. Speaker diarization and video diarization are performed first, in an independent manner. In order to fuse this information into a final labelling, the development database was analyzed and the following assumptions were made:
• The speaker is not always the person shown on screen, so it is important to accurately weight the temporal overlaps between each speaker and its different possible face identity assignments.

• Some speakers never come into view during the show, and other people are shown on screen but do not speak. Both should be discarded.

• Text identities are more related to who is shown than to who is speaking, so text is better combined with video than with speech.
According to these assumptions, an algorithm was designed based on weighting the temporal overlaps between tracks (Figure 1). This algorithm considers two different fusion modalities (Video/Text and Video/Audio) and combines both to obtain a final track file. Firstly, text and video are fused: their overlapping tracks are selected, and the temporal overlaps of their identities are weighted to set up an ILP problem (solved with IBM CPLEX):
    \max_{\alpha_{ij}} \sum_{i} \sum_{j} \alpha_{ij} \beta_{ij}        (1)

    \sum_{j} \alpha_{ij} \le 1 \quad \forall i                         (2)
(α_ij: assignment of text identity i to video identity j; β_ij: weight of the assignment.) Equation (2) establishes that each text identity may have at most one face identity assigned. The next step is to combine the speech diarization tracks with the face tracks that have a text identity assigned; the same ILP-based method is used. Finally, using the relations between text, face and speaker identities and the overlapping tracks of the second fusion, the final labelling output is obtained.
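A minimal sketch of this first fusion step, assuming the open-source PuLP modeler in place of CPLEX and a precomputed weight matrix beta (for instance, the temporal overlap between the tracks of each text identity and each video identity); names are illustrative.

    from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

    def fuse_text_video(beta):
        """beta[i][j]: overlap weight between text identity i and video
        identity j. Returns the selected (i, j) assignments."""
        I, J = range(len(beta)), range(len(beta[0]))
        prob = LpProblem("text_video_fusion", LpMaximize)
        a = {(i, j): LpVariable(f"a_{i}_{j}", cat=LpBinary)
             for i in I for j in J}
        # Objective (1): maximize the total weighted assignment.
        prob += lpSum(beta[i][j] * a[i, j] for i in I for j in J)
        # Constraint (2): at most one face identity per text identity.
        for i in I:
            prob += lpSum(a[i, j] for j in J) <= 1
        prob.solve()
        return [(i, j) for (i, j), var in a.items() if var.value() == 1]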
A second algorithm was implemented changing the order of the fusions: in this case, audio was first fused with the video, and the result was then combined with the text identities. Thus, only the face identities with a speaker assigned were considered.

    Exp.   System   Audio Input   NER          MAP
     1        2     facetrack     Baseline     22.6
     2        1     facetrack     Baseline     27.1
     3        2         -         Baseline     33.5
     4        1         -         Baseline     41.6
     5        1         -         UPC system   32.6

            Table 1: MAP Evaluation

6. RESULTS

Five different experiments were performed, as shown in Table 1. The experiments were run on the training database and evaluated with the mean average precision (MAP) metric. They test several variations: the order of the fusions, the input of the audio diarization, and the text system used. In Table 1, System 1 refers to the architecture shown in Figure 1, where the first fusion combines text and video, and System 2 refers to first combining video and audio and later fusing the text. "facetrack" indicates that the audio diarization is performed using only the audio segments where faces are detected, while "-" means that the diarization is performed on the whole audio input. The first four experiments use the baseline names; the fifth uses the text system described in Section 4.

The best performance was achieved in experiment 4 by System 1, without filtering the audio input for the diarization and using the baseline person names. There is clear evidence that the system works better when the diarization is run on the whole audio input. Regarding the fusion order, the results indicate that fusing video and text tracks first provides better performance.

The five experiments were also run on the test data. Experiments 1-4 were submitted on July 1st, and the best set-up on the training data (Exp. 4 in Table 1) was uploaded as our primary submission. After evaluating this primary submission with the final set of annotations, the following results were obtained: EwMAP = 54.1%, MAP = 54.36% and C = 69.71%. Experiment 5, submitted on July 8th, is similar to experiment 4 but uses our own technology to obtain the person names; we had low performance with the OCR and NER, and thus its results were worse than expected.

7. CONCLUSIONS

Speaker diarization, face recognition, and text detection with named entity recognition have been combined using an integer linear programming approach. Our idea was to first perform monomodal speech and video diarizations, using as many restrictions as possible to improve their results, and then use ILP to combine these diarizations with the person name information. Several architectures for this combination and several constraints of the integer linear programming algorithm were considered. The architecture that combines the video and audio modalities after the text/video fusion has provided the best results.

8. ACKNOWLEDGMENTS

This work has been developed in the framework of the projects TEC2013-43935-R, TEC2012-38939-C03-02 and PCIN-2013-067. It has been financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).
9. REFERENCES

[1] J. Ajmera and C. Wooters. A robust speaker clustering algorithm. Proc. ASRU, 2003.
[2] X. Anguera, C. Wooters, and J. Hernando. Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2011-2022, 2007.
[3] T. Caliński and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Simulation and Computation, 3(1):1-27, 1974.
[4] S. S. Chen and P. Gopalakrishnan. Clustering via the Bayesian information criterion with applications in speech recognition. Proc. ICASSP, 2:645-648, 1998.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. CVPR, 2005.
[6] M. Dinarelli and S. Rosset. Models cascade for tree-structured named entity detection. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1269-1278, Chiang Mai, Thailand, November 2011. Asian Federation of Natural Language Processing.
[7] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes with stroke width transform. In Proc. CVPR, pages 2963-2970, 2010.
[8] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Face recognition from caption-based supervision. IJCV, 96(1), 2012.
[9] J. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363-370, 2005.
[10] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. IJCAI, pages 674-679, 1981.
[11] J. Luque, X. Anguera, A. Temko, and J. Hernando. Speaker diarization for conference room: the UPC RT07s evaluation system. Multimodal Technologies for Perception of Humans, pages 543-553, 2008.
[12] J. Poignant, L. Besacier, G. Quenot, and F. Thollard. From text detection in videos to person identification. In Multimedia and Expo (ICME), 2012 IEEE International Conference on, pages 854-859, July 2012.
[13] J. Poignant, H. Bredin, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2015. In Proceedings of MediaEval 2015, September 2015.
[14] P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20(1):53-65, Nov. 1987.
[15] P. Salembier and L. Garrido. Binary partition tree as an efficient representation for image processing, segmentation and information retrieval. IEEE TIP, 9(4):561-575, April 2000.
[16] J. Shi and C. Tomasi. Good features to track. In Proc. CVPR, pages 593-600, 1994.
[17] R. Smith. An overview of the Tesseract OCR engine. In Proc. 9th IEEE Intl. Conf. on Document Analysis and Recognition (ICDAR), pages 629-633, 2007.
[18] C. Tomasi and T. Kanade. Detection and tracking of point features. Technical report, Carnegie Mellon University, 1991.
[19] M. Uricar, V. Franc, and V. Hlavac. Facial landmarks detector learned by the structured output SVM. In G. Csurka, M. Kraus, R. Laramee, P. Richard, and J. Braz, editors, Computer Vision, Imaging and Computer Graphics. Theory and Application, volume 359 of Communications in Computer and Information Science, pages 383-398. Springer Berlin Heidelberg, 2013.
[20] M. Zelenak and J. Hernando. The detection of overlapping speech with prosodic features for speaker diarization. Proc. Interspeech, 2011.
[21] M. Zelenak, C. Segura, J. Luque, and J. Hernando. Simultaneous speech detection with spatial features for speaker diarization. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):436-446, 2012.