     PUC Minas and IRISA at Multimodal Person Discovery

       Gabriel Sargent¹, Gabriel Barbosa de Fonseca², Izabela Lyon Freire², Ronan Sicre¹,
        Zenilton K. G. Patrocínio Jr², Silvio Jamil F. Guimarães² and Guillaume Gravier¹

                   ¹ IRISA & Inria Rennes, CNRS and Univ. Rennes 1, Rennes, France
            ² Computer Science Department, PUC de Minas Gerais, Belo Horizonte, Brazil




ABSTRACT
   This paper describes the systems developed by PUC Minas and IRISA for the person discovery task at MediaEval 2016. We adopt a graph-based representation and investigate two tag-propagation approaches to associate overlays co-occurring with some speaking faces to other visually or audio-visually similar speaking faces.
   Given a video, we first build a graph from the detected speaking faces (nodes) and their audio-visual similarities (edges). Each node is associated with its co-occurring overlays (tags) when they exist. Then, we consider two tag-propagation approaches, based respectively on a random walk strategy and on Kruskal's algorithm.

                                 Similarity
  Submission         audio     video     Tag propagation
  primary (p)        binary    CNN       hierarchical
  contrast 1 (c1)    GMM       CNN       random walk
  contrast 2 (c2)    –         CNN       hierarchical
  contrast 3 (c3)    GMM       CNN       hierarchical
  contrast 4 (c4)    –         –         –

Table 1: Components of the systems at the origin of our 5 submissions.

1.   INTRODUCTION
   The task of multimodal person discovery in TV broadcast consists in identifying the persons of a video corpus who both speak and are visible at the same time, in an unsupervised way [2]. Most approaches to the task use clustering, either of face tracks or of voice segments (or both), before finding a good match between text in overlays and clusters [6, 4]. While this type of approach worked well in 2015, we believe that the clustering steps involved are error prone. Indeed, errors in the clustering step cannot be undone afterwards in the naming stages. In 2015, IRISA and UFMG proposed a graph-based approach in which each node corresponds to a speaking face and each edge to the similarity between its vertices [3]. The similarity can be computed at the visual level, the voice level, or both. Names can be associated with nodes based on co-occurrences of a speaking face and name overlays. However, only a small fraction of the nodes can be tagged by this method. Hence, in 2016, we studied tag propagation algorithms that take advantage of the graph structure to assign tags to nodes with no overlapping overlays, thus potentially improving recall. Tab. 1 recaps the different configurations submitted.

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20–21, 2016, Hilversum, Netherlands.

2.   GRAPH GENERATION
   Each video is modeled by a graph where each node represents a speaking face, and each edge quantifies the visual or audiovisual similarity between two speaking faces. A speaking face is defined as the association of a facetrack (a sequence of faces related to the same person in adjacent video frames) with the speech segment for which the overlap is maximum and at least 60%. The facetracks and speech segments are the ones provided by MediaEval, the latter being extracted from the speaker diarization result, disregarding the arbitrary speaker numbers.

2.1   Audiovisual similarities
   We consider three weighting schemes for the edges in the graphs, resulting from different strategies for combining visual similarity and voice similarity.
   The visual similarity $S^V_{ij}$ between two facetracks $i$ and $j$ is calculated as follows. A key face is selected from the central frame of each facetrack, from which a generic image descriptor is computed by applying a very deep convolutional neural network pre-trained on the ImageNet dataset [8]. Specifically, we extract the last convolutional layer [9] and perform average pooling and "power normalization", i.e., square-root compression followed by L2-normalization. Finally, $S^V_{ij}$ is calculated as the cosine similarity between the descriptors of the two key face images.
   Voice similarity can be computed in two ways. A simple binary audio similarity is derived from the speaker diarization provided by MediaEval, where the similarity is 1 if the two segments are labeled with the same speaker in the diarization. Alternately, the audio similarity $S^A_{ij}$ between two segments can be calculated as follows. Each speech segment is modeled with a 16-Gaussian mixture model (GMM) over Mel cepstral features. The distance $D^A_{ij}$ is computed using the Euclidean-based approximation of the KL2 divergence between the two GMMs [1], and turned into a similarity according to $S^A_{ij} = \exp(\log(\alpha) \, D^A_{ij})$, where $\alpha = 0.25$ in the experiments reported here.
   Fusion of the visual and voice similarities is done by a weighted average, $S^{AV}_{ij} = \beta S^V_{ij} + (1 - \beta) S^A_{ij}$. We experimentally set $\beta = 0.85$ in the case of binary voice comparison and $\beta = 0.5$ for the GMM-based comparison.
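   For concreteness, here is a minimal NumPy sketch of these three similarity computations. The function names are ours, and we assume the CNN activations and the GMM-based KL2 distances have already been computed upstream:

```python
import numpy as np

def visual_descriptor(conv_activations):
    """Pool the last convolutional layer (H x W x C) into one descriptor:
    average pooling, square-root compression, then L2-normalization."""
    d = conv_activations.reshape(-1, conv_activations.shape[-1]).mean(axis=0)
    d = np.sign(d) * np.sqrt(np.abs(d))      # "power normalization"
    return d / np.linalg.norm(d)

def visual_similarity(desc_i, desc_j):
    """Cosine similarity; the descriptors are already L2-normalized."""
    return float(np.dot(desc_i, desc_j))

def audio_similarity(kl2_distance, alpha=0.25):
    """Map a KL2-based distance between two segment GMMs to (0, 1];
    since alpha < 1, the similarity decreases as the distance grows."""
    return float(np.exp(np.log(alpha) * kl2_distance))

def fused_similarity(s_v, s_a, beta):
    """Weighted average of visual and voice similarities
    (beta = 0.85 with the binary voice comparison, 0.5 with the GMM one)."""
    return beta * s_v + (1.0 - beta) * s_a
```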
2.2   Tag initialization
   Initially, each node in the graph is tagged using the overlay for which the overlap with the facetrack is maximum. We used the overlay detection and name recognition provided (output from the OCR system), which we filtered using the named entity detector NERO [7], keeping only words tagged as "pers" by the named entity recognizer. Note that this approach is rather aggressive, as NERO was initially designed for speech transcription in the French language. In practice, many nodes are not tagged, as they do not overlap with a valid overlay (sets T15 and T16, introduced in Section 4, show respectively 25.5% and 6.6% of nodes initially tagged). This is why tag propagation is required.
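   As an illustration of this initialization, the following sketch assumes facetracks and filtered person-name overlays carry (start, end) time spans; the data layout and helper names are ours:

```python
def overlap(a, b):
    """Temporal overlap (in seconds) between intervals a = (start, end)
    and b = (start, end); zero when they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def initial_tag(facetrack_span, overlays):
    """Tag a node with the person name whose overlay maximally overlaps
    the facetrack; return None when no overlay overlaps at all."""
    best = max(overlays, key=lambda o: overlap(facetrack_span, o["span"]),
               default=None)
    if best is None or overlap(facetrack_span, best["span"]) == 0.0:
        return None
    return best["name"]
```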
3.   TAG PROPAGATION APPROACHES
   Two different approaches are considered for the propagation of the initial tags: a random walk approach and a hierarchical one based on Kruskal's algorithm. In both cases, every node is associated with a particular tag and a confidence score at the end of the propagation phase.

3.1   Random walk tag propagation
   In a graph where transition probabilities between nodes are known, the probability of ever reaching node $j$ starting from node $i$ can be calculated using a random walk strategy with absorbing states [10]. Let $n$ be the number of nodes of the graph. We define a symmetric weight matrix $W = \{W_{ij}\}_{1 \leq i,j \leq n}$, where $W_{ij}$ is the similarity between nodes $i$ and $j$, and a diagonal degree matrix $D = \{D_{ij}\}_{1 \leq i,j \leq n}$, where $D_{ii} = \sum_j W_{ij}$. The transition probability matrix $P^0 = \{P^0_{ij}\}_{1 \leq i,j \leq n}$, where $P^0_{ij}$ is the probability of reaching node $j$ from node $i$ in one step, is given by $P^0 = D^{-1} W$. Tagged nodes are set as absorbing states in $P$, according to
$$P = \begin{pmatrix} I & 0 \\ P_{ul} & P_{uu} \end{pmatrix},$$
where $l$ is the set of tagged nodes, $u$ is the set of untagged nodes, $I$ is an identity matrix of size $|l| \times |l|$, $P_{ul}$ contains the probabilities of untagged nodes ending their walk on tagged nodes, and $P_{uu}$ contains the probabilities of untagged nodes getting to other untagged nodes. We denote by $P^t$ the transition probability matrix after $t$ iterations. The random walk iteration is performed according to $P^{t+1} = (1 - \gamma) \, P^0 P^t + \gamma \, P^0$, where $\gamma$ is a parameter enforcing the consistency of the initial state (here, $\gamma = 0.4$). Once the random walk has converged ($\sum_{i,j} |P^{t+1}_{ij} - P^t_{ij}| < 10^{-9}$), each untagged node is associated with the tagged node on which it has the highest probability to end its walk, i.e., each row index of $P_{ul}$ is matched with the column index with maximal probability. This maximal probability is kept as the confidence score.
3.2   Hierarchical tag propagation
   This method is based on the computation of a minimum spanning tree (MST) from an undirected weighted graph, using Kruskal's algorithm. The MST establishes a hierarchical partition of a set [5]. A connected graph $G$ is given (see Section 2), where edge weights represent distances (functions of their respective similarities $S^{AV}$). To propagate the initial tags, we start from a null graph $H$ on $G$'s nodes, and the following process is repeated until all edges of $G$ are examined: from $G$, the unexamined edge $e$ corresponding to the smallest distance is chosen. If it does not link different trees in $H$, skip it; otherwise, it links trees $T_1$ and $T_2$ (thus forming $T_3$), and $e$ is added to the minimum spanning forest $H$ being created. Three cases are possible: I. Neither $T_1$ nor $T_2$ is tagged: $T_3$ will not be tagged. II. Only $T_1$ is tagged, with confidence score $C_{T_1}$: $T_1$'s tag is assigned to the entire $T_3$ (i.e., to all its unlabelled nodes), with a confidence score $C_{T_3} = C_{T_1} \times (1 - w_e)$, where $w_e$ is the weight of $e$ in $G$. III. Both $T_1$ and $T_2$ are tagged: one of the tags (of $T_1$ or of $T_2$) is picked at random and assigned to $T_3$, with confidence scores as in case II.
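   A union-find sketch of this procedure (ours; edge weights are assumed to lie in [0, 1], so that 1 − w_e is a valid confidence factor):

```python
import random

def hierarchical_propagation(n, edges, init_tags):
    """Propagate tags along a Kruskal minimum spanning forest.
    edges: list of (weight, i, j), weights acting as distances in [0, 1];
    init_tags: {node: name}. Returns {node: (name, confidence)}."""
    parent = list(range(n))
    members = {i: [i] for i in range(n)}             # nodes of each tree
    tree_tag = {i: (name, 1.0) for i, name in init_tags.items()}
    labels = dict(tree_tag)                          # node -> (name, conf)

    def find(x):                                     # path-halving union-find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for w, i, j in sorted(edges):                    # smallest distance first
        ri, rj = find(i), find(j)
        if ri == rj:                                 # same tree: skip edge
            continue
        cands = [tree_tag.pop(r) for r in (ri, rj) if r in tree_tag]
        parent[ri] = rj                              # merge T1, T2 into T3
        members[rj] += members.pop(ri)
        if not cands:                                # case I: T3 untagged
            continue
        name, conf = random.choice(cands)            # case II or III
        tag = (name, conf * (1.0 - w))               # C_T3 = C_T x (1 - w_e)
        tree_tag[rj] = tag
        for node in members[rj]:                     # tag unlabelled nodes
            labels.setdefault(node, tag)
    return labels
```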
4.   RESULTS
   Tab. 2 reports the results obtained on the 2015 and 2016 test data (T15, the development data for 2016, and T16, respectively). For T16, the reference annotation dump of 2016/09/14 is used. The ranks of the submissions are shown in Tab. 3. All tag propagation approaches improve over the no-propagation baseline (c4), the benefit of tag propagation being much clearer on T16. The baseline highlights noticeable differences between T15 and T16. In T15, propagation was almost useless, as most nodes could be tagged in the initial stage. This is not the case in T16, where tag propagation yields significant gains. The hierarchical tag propagation on graphs combining CNN visual similarity and binary voice similarity (primary) consistently outperforms the other combinations, showing the interest of combining audio and visual similarities. Comparing approaches, c3 usually performs better than c1 (except for T16 at MAP@1), indicating that the hierarchical tag propagation performs better than the random walk, at least with GMM–CNN audiovisual similarities. The comparison of p and c3 shows the weakness of the GMM-based voice comparison with respect to the state-of-the-art approach used for diarization. Finally, the comparison of c3 and c2 gives mixed results: the use of the GMM-based voice comparison decreases performance in most cases, except on T16 at K = 1, 100.

                        MAP@1         MAP@10        MAP@100
                      T15    T16    T15    T16    T15    T16
  primary (p)         87.9   64.4   82.1   49.3   81.9   47.8
  contrast 1 (c1)     87.9   64.4   79.8   48.4   79.6   46.7
  contrast 2 (c2)     87.9   62.9   81.7   46.2   81.5   44.8
  contrast 3 (c3)     87.9   63.6   80.2   49.3   80.0   47.5
  contrast 4 (c4)     87.9   56.8   79.7   36.1   79.5   35.1
  (p − c4)/c4          0.0   13.4    3.0   36.6    3.0   36.2

Table 2: Mean average precision at different ranks (in %) for the 5 submissions. The last row gives the relative improvement (in %) of the primary run over the no-propagation baseline (c4).

             T15                      T16
  MAP@1      p = c1 = c2 = c3 = c4    p = c1, c3, c2, c4
  MAP@10     p, c2, c3, c1, c4        p, c2, c3, c1, c4
  MAP@100    p, c2, c3, c1, c4        p, c3, c1, c2, c4

Table 3: Ranking (best first) of the submissions.

5.   ACKNOWLEDGEMENTS
   Work supported by FAPEMIG/INRIA/MOTIF (CEX-APQ 03195-13), FAPEMIG/PPM (CEX-PPM-6-16) and CAPES (064965/2014-01).
6.   REFERENCES
 [1] M. Ben, M. Betser, F. Bimbot, and G. Gravier.
     Speaker diarization using bottom-up clustering based
     on a parameter-derived distance between adapted
     GMMs. In Proceedings of the 8th International
     Conference on Spoken Language Processing, pages
     333–444, October 2004.
 [2] H. Bredin, C. Barras, and C. Guinaudeau. Multimodal
     person discovery in broadcast TV at MediaEval 2016.
     In Working notes of the MediaEval 2016 Workshop,
     October 2016.
 [3] C. E. dos Santos Jr, G. Gravier, and W. R. Schwartz.
     SSIG and IRISA at Multimodal Person Discovery. In
     Working notes of the MediaEval 2015 Workshop,
     September 2015.
 [4] N. Le, D. Wu, S. Meignier, and J.-M. Odobez.
     EUMSSI team at the MediaEval Person Discovery
     Challenge. In Working notes of the MediaEval 2015
     Workshop, September 2015.
 [5] B. Perret, J. Cousty, J. C. R. Ura, and S. J. F.
     Guimarães. Evaluation of morphological hierarchies
     for supervised segmentation. In Proceedings of the
     12th International Symposium on Mathematical
     Morphology and Its Applications to Signal and Image
     Processing, pages 39–50. Springer, 2015.
 [6] J. Poignant, L. Besacier, and G. Quénot. Unsupervised
     speaker identification in TV broadcast based on
     written names. IEEE/ACM Transactions on Audio,
     Speech and Language Processing, 23(1):57–68, 2015.
 [7] C. Raymond. Robust tree-structured named entities
     recognition from speech. In International Conference
     on Acoustics, Speech and Signal Processing, 2013.
 [8] K. Simonyan and A. Zisserman. Very deep
     convolutional networks for large-scale image
     recognition. arXiv preprint arXiv:1409.1556, 2014.
 [9] G. Tolias, R. Sicre, and H. Jégou. Particular object
     retrieval with integral max-pooling of CNN
     activations. In Proceedings of the 2016 International
     Conference on Learning Representations, 2016.
[10] X. Zhu and Z. Ghahramani. Learning from labeled
     and unlabeled data with label propagation. Technical
     report, Citeseer, 2002.