     UPC System for the 2015 MediaEval Multimodal Person
               Discovery in Broadcast TV task

                          M. India, D. Varas, V. Vilaplana, J.R. Morros, J. Hernando
                                 Universitat Politecnica de Catalunya, Spain



ABSTRACT

This paper describes a system to identify people in broadcast TV shows in a purely unsupervised manner. The system outputs the identity of people that appear, talk and can be identified by using information appearing in the show (in our case, text with person names). Three types of monomodal technologies are used: speech diarization, video diarization, and text detection / named entity recognition. These technologies are combined using a linear programming approach where some restrictions are imposed.
1. INTRODUCTION

The goal of the 2015 Multimodal Person Discovery in Broadcast TV task [13] is to identify people appearing and speaking in TV shows in a purely unsupervised manner. This paper describes the UPC contribution, which is based on combining speech diarization, video-based face diarization, and text detection plus Named Entity Recognition (NER). We did not make use of the names present in the speech transcriptions.
2. AUDIO SYSTEM

Speaker information was extracted using an Agglomerative Hierarchical Clustering diarization system based on Hidden Markov Models [21, 20, 2, 11]. It uses energy-based speech activity detection, Mel Frequency Cepstral Coefficient voice features, and an initial uniform segmentation.

Speaker clusters are modeled with Gaussian Mixture Models (GMM). The complexity of each model is selected from the amount of data per cluster and the cluster complexity ratio, which fixes the amount of speech per Gaussian. Hidden Markov Model (HMM) training and cluster realignment by Viterbi decoding are based on maximum likelihood. In the decoding stage, a minimum speaker segment duration of 3 seconds is imposed to deal with too-short segments. For the cluster merging, the most likely pair of clusters is selected in each iteration. This likelihood is calculated using a modified Bayesian Information Criterion (BIC) [4, 1] metric among clusters.

This system has been used with two different kinds of input for each show. On the one hand, diarization is run on each audio file without any constraint. On the other hand, using a face-tracking system, segments without tracked faces are discarded; the purpose of this second method is to run the diarization only in those parts where we assume that someone in the video must be speaking.
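As an illustration of the merging criterion, the following sketch computes a delta-BIC score between two clusters of feature frames. It is a minimal sketch only: it assumes the classic single full-covariance Gaussian formulation of Chen and Gopalakrishnan [4] rather than the GMM-based modified BIC actually used, and all names are ours.

    import numpy as np

    def delta_bic(X1, X2, lam=1.0):
        """Delta-BIC between two clusters of feature frames (one frame
        per row); the pair with the lowest value is the most likely
        merge candidate, and negative values favor merging."""
        X = np.vstack([X1, X2])
        n, n1, n2, d = len(X), len(X1), len(X2), X.shape[1]
        # Log-determinant of a (regularized) full covariance matrix.
        logdet = lambda Y: np.linalg.slogdet(
            np.cov(Y, rowvar=False) + 1e-6 * np.eye(d))[1]
        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
        return 0.5 * (n * logdet(X) - n1 * logdet(X1)
                      - n2 * logdet(X2)) - lam * penalty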
3. VIDEO SYSTEM

For face tracking, the baseline code was used (tracking by detection using the Kanade-Lucas-Tomasi algorithm [18, 10, 16]). For feature extraction we used the technique in the baseline (HOG [5] features on facial locations [19], concatenated and projected using LDML [8]). While in the baseline a single descriptor was selected for each track, we used several vectors, obtained by uniform temporal sampling of the track faces. We expect this approach to better capture the variations in pose and expression.

We used agglomerative hierarchical clustering. A binary hierarchical tree is created by fusing tracks according to the minimum distance between track vectors. The number of clusters may vary between videos and has to be determined: it is estimated by evaluating the Calinski-Harabasz [3] and Silhouette [14] criteria in the range of [50, 80] clusters, and taking the number of resulting clusters to be the average of the maxima of the two criteria.
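A minimal sketch of this model-selection step, assuming scikit-learn's implementations of the two criteria, SciPy's agglomerative clustering, and a matrix X of track descriptors; function and variable names are illustrative, not the system's.

    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.metrics import calinski_harabasz_score, silhouette_score

    def estimate_num_clusters(X, k_min=50, k_max=80):
        """Evaluate both criteria for every candidate k in [k_min, k_max]
        and return the average of the two maximizing values of k."""
        Z = linkage(X, method='average')  # binary hierarchical tree
        ch, sil = {}, {}
        for k in range(k_min, k_max + 1):
            labels = fcluster(Z, t=k, criterion='maxclust')
            ch[k] = calinski_harabasz_score(X, labels)
            sil[k] = silhouette_score(X, labels)
        k_ch = max(ch, key=ch.get)     # k maximizing Calinski-Harabasz
        k_sil = max(sil, key=sil.get)  # k maximizing Silhouette
        return round((k_ch + k_sil) / 2)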
To improve the diarization, spatio-temporal restrictions were introduced. We assume that a person cannot appear twice in the same frame, so tracks with temporal overlap should represent different persons and are prevented from merging into the same cluster. Also, as we use a multi-vector representation for each track, vectors in the same track must end up in the same cluster. The restrictions are modeled using a matrix expressing the relationship between the feature vectors: entries for vectors in different tracks were assigned a value of 1, entries for vectors in the same track were assigned a value 0 < v ≪ 1, and entries for vectors in temporally co-occurring tracks received a very large value v ≫ 1. This matrix is used to point-wise multiply the vector-to-vector distance matrix used for clustering.
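A minimal sketch of how such a constraint matrix could be built and applied, assuming each descriptor carries the id of its source track and each track has a (start, end) time span; all names and the two constants are illustrative.

    import numpy as np

    def constrained_distances(D, track_ids, spans, v_same=1e-3, v_overlap=1e6):
        """Point-wise multiply the descriptor distance matrix D by the
        constraint matrix: ~0 for vectors of the same track (forced to
        merge), 1 for unrelated tracks, and a very large value for
        temporally co-occurring tracks (prevented from merging)."""
        n = len(track_ids)
        C = np.ones((n, n))
        for i in range(n):
            for j in range(n):
                if track_ids[i] == track_ids[j]:
                    C[i, j] = v_same
                else:
                    (s1, e1) = spans[track_ids[i]]
                    (s2, e2) = spans[track_ids[j]]
                    if s1 < e2 and s2 < e1:  # tracks overlap in time
                        C[i, j] = v_overlap
        return D * C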
4. TEXT SYSTEM

We used the person names provided in the baseline [6, 12] and our own technology for obtaining person names (in different runs). From the input image a segmentation is created with a Binary Partition Tree [15] using color and stroke width [7]. A partition is built where each character is a connected component while background regions are merged. Next, regions are filtered by a sequence of binary classifiers that reject non-character components. Components accepted by the classifiers as character candidates are combined into pairs, and pairs are combined into chains. A post-processing stage is applied to recover components wrongly rejected as false positives in the filtering stage. The Tesseract OCR engine [17] provides one transcription for each text chain, and the Stanford Named Entity Recognizer [9] is used to automatically detect person names in the text.
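The last two stages of this pipeline can be sketched as follows, assuming pytesseract as a front end to Tesseract and NLTK's wrapper around a Stanford NER model; the model path, jar path and function names are placeholders, not the configuration actually used.

    import pytesseract
    from PIL import Image
    from nltk.tag import StanfordNERTagger

    # Placeholder paths to a Stanford NER model and its jar.
    tagger = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                               'stanford-ner.jar')

    def person_names(chain_image):
        """OCR one detected text chain and keep the PERSON entities."""
        text = pytesseract.image_to_string(Image.open(chain_image))
        names, current = [], []
        for word, tag in tagger.tag(text.split()):
            if tag == 'PERSON':
                current.append(word)        # extend the current name span
            elif current:
                names.append(' '.join(current))
                current = []
        if current:
            names.append(' '.join(current))
        return names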
Figure 1: System block diagram

5. FUSION

Our system combines the previous information sources to obtain the final person recognition labelling. Speaker diarization and video diarization are performed first, in an independent manner. In order to fuse this information into a final labelling, the development database was analyzed and the following assumptions were made:
• The speaker is not always the person shown on screen, so it is important to accurately weight the temporal overlaps between each speaker and its different possible face identity assignments.

• Some speakers never come into view during the show, and other people are shown on screen but do not speak. Both should be discarded.

• Text identities are more related to who is shown than to who is speaking, so text is better combined with video than with speech.
According to these assumptions, an algorithm was designed based on weighting the temporal overlaps between tracks (Figure 1). This algorithm considers two different fusion modalities (Video/Text and Video/Audio) and combines both to obtain a final track file. Firstly, text and video are fused: their overlapping tracks are selected, and the temporal overlaps of their identities are weighted to set up an ILP problem (solved with IBM CPLEX):
    \max_{\alpha_{ij}} \sum_{i} \sum_{j} \alpha_{ij} \beta_{ij}        (1)

    \sum_{j} \alpha_{ij} \le 1 \quad \forall i                         (2)
(α_ij: assignment of text identity i to video identity j; β_ij: weight of the assignment.) Equation (2) establishes that each text identity may have at most one face identity assigned. The next step is to combine the speech diarization tracks with the face tracks that have a text identity assigned; the same ILP-based method is used. Finally, using the relations between text, face and speaker identities and the overlapping tracks of the second fusion, the final labelling output is obtained.
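A minimal sketch of this first fusion step, assuming the open-source PuLP modeler in place of CPLEX and a precomputed weight matrix beta (for instance, the temporal overlap between the tracks of each text identity and each video identity); names are illustrative.

    from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

    def fuse_text_video(beta):
        """beta[i][j]: overlap weight between text identity i and video
        identity j. Returns the selected (i, j) assignments."""
        I, J = range(len(beta)), range(len(beta[0]))
        prob = LpProblem("text_video_fusion", LpMaximize)
        a = {(i, j): LpVariable(f"a_{i}_{j}", cat=LpBinary)
             for i in I for j in J}
        # Objective (1): maximize the total weighted assignment.
        prob += lpSum(beta[i][j] * a[i, j] for i in I for j in J)
        # Constraint (2): at most one face identity per text identity.
        for i in I:
            prob += lpSum(a[i, j] for j in J) <= 1
        prob.solve()
        return [(i, j) for (i, j), var in a.items() if var.value() == 1]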
A second algorithm was implemented changing the order of the fusions: in this case, audio was first fused with the video, and the result was then combined with the text identities. Thus, only the face identities with a speaker assigned were considered.

    Exp.   System   Audio Input   NER          MAP
     1        2     facetrack     Baseline     22.6
     2        1     facetrack     Baseline     27.1
     3        2         -         Baseline     33.5
     4        1         -         Baseline     41.6
     5        1         -         UPC system   32.6

            Table 1: MAP Evaluation

6. RESULTS

Five different experiments were performed, as shown in Table 1. The experiments were run on the training database and evaluated with the mean average precision (MAP) metric. They test several variations: the order of the fusions, the input of the audio diarization, and the text system used. In Table 1, System 1 refers to the architecture shown in Figure 1, where the first fusion combines text and video, and System 2 refers to first combining video and audio and later fusing the text. "facetrack" indicates that the audio diarization is performed using only the audio segments where faces are detected, while "-" means that the diarization is performed on the whole audio input. The first four experiments use the baseline names; the fifth uses the text system described in Section 4.

The best performance was achieved in experiment 4 by System 1, without filtering the audio input for the diarization and using the baseline person names. There is clear evidence that the system works better when the diarization is run on the whole audio input. Regarding the fusion order, the results indicate that fusing video and text tracks first provides better performance.

The five experiments were also run on the test data. Experiments 1-4 were submitted on July 1st, and the best set-up on the training data (Exp. 4 in Table 1) was uploaded as our primary submission. After evaluating this primary submission with the final set of annotations, the following results were obtained: EwMAP = 54.1%, MAP = 54.36% and C = 69.71%. Experiment 5, submitted on July 8th, is similar to experiment 4 but uses our own technology to obtain the person names; we had low performance with the OCR and NER, and thus its results were worse than expected.

7. CONCLUSIONS

Speaker diarization, face recognition, and text detection with named entity recognition have been combined using an integer linear programming approach. Our idea was to first perform monomodal speech and video diarizations, using as many restrictions as possible to improve their results, and then use ILP to combine these diarizations with the person name information. Several architectures for this combination and several constraints of the integer linear programming algorithm were considered. The architecture that combines the video and audio modalities after the text/video fusion has provided the best results.

8. ACKNOWLEDGMENTS

This work has been developed in the framework of the projects TEC2013-43935-R, TEC2012-38939-C03-02 and PCIN-2013-067. It has been financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).
9. REFERENCES

[1] J. Ajmera and C. Wooters. A robust speaker clustering algorithm. Proc. ASRU, 2003.
[2] X. Anguera, C. Wooters, and J. Hernando. Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2011-2022, 2007.
[3] T. Caliński and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Simulation and Computation, 3(1):1-27, 1974.
[4] S. S. Chen and P. Gopalakrishnan. Clustering via the Bayesian information criterion with applications in speech recognition. Proc. ICASSP, 2:645-648, 1998.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. CVPR, 2005.
[6] M. Dinarelli and S. Rosset. Models cascade for tree-structured named entity detection. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1269-1278, Chiang Mai, Thailand, November 2011. Asian Federation of Natural Language Processing.
[7] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes with stroke width transform. In Proc. CVPR, pages 2963-2970, 2010.
[8] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Face recognition from caption-based supervision. IJCV, 96(1), 2012.
[9] J. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363-370, 2005.
[10] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. IJCAI, pages 674-679, 1981.
[11] J. Luque, X. Anguera, A. Temko, and J. Hernando. Speaker diarization for conference room: the UPC RT07s evaluation system. Multimodal Technologies for Perception of Humans, pages 543-553, 2008.
[12] J. Poignant, L. Besacier, G. Quenot, and F. Thollard. From text detection in videos to person identification. In Multimedia and Expo (ICME), 2012 IEEE International Conference on, pages 854-859, July 2012.
[13] J. Poignant, H. Bredin, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2015. In Proceedings of MediaEval 2015, September 2015.
[14] P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20(1):53-65, Nov. 1987.
[15] P. Salembier and L. Garrido. Binary partition tree as an efficient representation for image processing, segmentation and information retrieval. IEEE TIP, 9(4):561-575, April 2000.
[16] J. Shi and C. Tomasi. Good features to track. In Proc. CVPR, pages 593-600, 1994.
[17] R. Smith. An overview of the Tesseract OCR engine. In Proc. 9th IEEE Intl. Conf. on Document Analysis and Recognition (ICDAR), pages 629-633, 2007.
[18] C. Tomasi and T. Kanade. Detection and tracking of point features. Technical report, Carnegie Mellon University, 1991.
[19] M. Uricar, V. Franc, and V. Hlavac. Facial landmarks detector learned by the structured output SVM. In G. Csurka, M. Kraus, R. Laramee, P. Richard, and J. Braz, editors, Computer Vision, Imaging and Computer Graphics. Theory and Application, volume 359 of Communications in Computer and Information Science, pages 383-398. Springer Berlin Heidelberg, 2013.
[20] M. Zelenak and J. Hernando. The detection of overlapping speech with prosodic features for speaker diarization. Proc. Interspeech, 2011.
[21] M. Zelenak, C. Segura, J. Luque, and J. Hernando. Simultaneous speech detection with spatial features for speaker diarization. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):436-446, 2012.