<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PUC Minas and IRISA at Multimodal Person Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriel Sargent</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriel Barbosa de Fonseca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Izabela Lyon Freire</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ronan Sicre</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zenilton K. G. Patrocínio Jr</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvio Jamil F. Guimarães</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Gravier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department - PUC de Minas Gerais</institution>
          ,
          <addr-line>Belo Horizonte</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IRISA &amp; Inria Rennes</institution>
          ,
          <addr-line>CNRS and Univ. Rennes 1, Rennes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper describes the systems developed by PUC Minas and IRISA for the person discovery task at MediaEval 2016. We adopt a graph-based representation and investigate two tag-propagation approaches to associate overlays co-occurring with some speaking faces to other visually or audio-visually similar speaking faces. Given a video, we first build a graph from the detected speaking faces (nodes) and their audio-visual similarities (edges). Each node is associated with its co-occurring overlays (tags) when they exist. Then, we consider two tag-propagation approaches, respectively based on a random walk strategy and on Kruskal's algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The task of multimodal person discovery in TV broadcast consists in identifying, in an unsupervised way, the persons of a video corpus who both speak and are visible at the same time [<xref ref-type="bibr" rid="ref2">2</xref>]. Most approaches to the task use clustering, either of face tracks or of voice segments (or both), before finding a good match between text in overlays and clusters [<xref ref-type="bibr" rid="ref4 ref6">6, 4</xref>]. While this type of approach worked well in 2015, we believe that the clustering steps involved are error prone: errors made during clustering cannot be undone afterwards in the naming stages. In 2015, IRISA and UFMG proposed a graph-based approach in which each node corresponds to a speaking face and each edge to the similarity between the two speaking faces it connects [<xref ref-type="bibr" rid="ref3">3</xref>]. The similarity can be computed at the visual level, at the voice level, or at both. Names can be associated with nodes based on co-occurrences of a speaking face and name overlays. However, only a small fraction of the nodes can be tagged by this method. Hence, in 2016, we studied tag propagation algorithms that take advantage of the graph structure to assign tags to nodes with no overlapping overlays, thus potentially improving recall. Tab. 1 recaps the different configurations submitted.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. GRAPH GENERATION</title>
      <p>Each video is modeled by a graph where each node represents a speaking face, and each edge quantifies the visual or audiovisual similarity between two speaking faces. A speaking face is defined as the association of a facetrack (a sequence of faces related to the same person in adjacent video frames) with the speech segment for which the overlap is maximal and at least 60%. The facetracks and speech segments are the ones provided by MediaEval, the latter being extracted from the speaker diarization result, disregarding the arbitrary speaker numbering.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Submitted configurations: similarities used to weight the graph edges and tag propagation approach ("none" indicates the component is not used; c4 is the no-propagation baseline).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Submission</th><th>Audio similarity</th><th>Video similarity</th><th>Tag propagation</th></tr>
          </thead>
          <tbody>
            <tr><td>primary (p)</td><td>binary</td><td>CNN</td><td>hierarchical</td></tr>
            <tr><td>contrast 1 (c1)</td><td>GMM</td><td>CNN</td><td>random walk</td></tr>
            <tr><td>contrast 2 (c2)</td><td>none</td><td>CNN</td><td>hierarchical</td></tr>
            <tr><td>contrast 3 (c3)</td><td>GMM</td><td>CNN</td><td>hierarchical</td></tr>
            <tr><td>contrast 4 (c4)</td><td>none</td><td>none</td><td>none</td></tr>
          </tbody>
        </table>
      </table-wrap>
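<p>The association of facetracks with speech segments can be sketched as follows. This is a minimal illustration, not the authors' code: intervals are (start, end) pairs in seconds, and we assume the 60% threshold is measured relative to the facetrack duration, which the text does not specify.</p>

```python
def overlap(a, b):
    """Temporal intersection length of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def speaking_faces(facetracks, speech_segments, min_ratio=0.6):
    """Pair each facetrack with the speech segment of maximal overlap,
    keeping the pair only if the overlap covers at least min_ratio of
    the facetrack duration (an assumed reading of the 60% threshold)."""
    pairs = []
    for ft in facetracks:
        best = max(speech_segments, key=lambda sp: overlap(ft, sp), default=None)
        if best is None:
            continue
        duration = ft[1] - ft[0]
        if duration > 0 and overlap(ft, best) / duration >= min_ratio:
            pairs.append((ft, best))
    return pairs
```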
    </sec>
    <sec id="sec-3">
      <title>2.1 Audiovisual similarities</title>
      <p>We consider three weighting schemes for the edges in the graphs, resulting from different strategies for combining visual similarity and voice similarity.</p>
      <p>
        The visual similarity S^V_ij between two facetracks i and j is calculated as follows. A key face is selected from the central frame of each facetrack, from which a generic image descriptor is computed by applying a very-deep convolutional neural network pre-trained on the ImageNet dataset [<xref ref-type="bibr" rid="ref8">8</xref>]. Specifically, we extract the last convolutional layer [<xref ref-type="bibr" rid="ref9">9</xref>] and perform average pooling and "power normalization", i.e., square-root compression followed by L2-normalization. Finally, S^V_ij is calculated as the cosine similarity between the descriptors of the two key face images.
      </p>
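<p>A minimal sketch of the descriptor computation and comparison, assuming the (H, W, C) activation map of the last convolutional layer has already been extracted (the CNN itself is omitted here):</p>

```python
import numpy as np

def describe(conv_map):
    """conv_map: (H, W, C) activations from the last convolutional layer.
    Average pooling, then power normalization: signed square root
    followed by L2 normalization."""
    v = conv_map.mean(axis=(0, 1))            # average pooling -> (C,)
    v = np.sign(v) * np.sqrt(np.abs(v))       # square-root compression
    return v / (np.linalg.norm(v) + 1e-12)    # L2 normalization

def visual_similarity(map_i, map_j):
    """Cosine similarity of the two key-face descriptors; since both
    are L2-normalized, this reduces to a dot product."""
    return float(describe(map_i) @ describe(map_j))
```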
      <p>
        Voice similarity can be computed in two ways. A simple binary audio similarity is derived from the speaker diarization provided by MediaEval: the similarity is 1 if the two segments are labeled with the same speaker in the diarization, and 0 otherwise. Alternately, the audio similarity S^A_ij between two segments can be calculated as follows. Each speech segment is modeled with a 16-Gaussian mixture model (GMM) over Mel cepstral features. The distance D^A_ij is computed using the Euclidean-based approximation of the KL2 divergence between the two GMMs [<xref ref-type="bibr" rid="ref1">1</xref>], and turned into a similarity according to S^A_ij = exp(log(α) D^A_ij), where α = 0.25 in the experiments reported here.
      </p>
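<p>The distance-to-similarity mapping, together with a simplified GMM distance, might be sketched as below. The exact Euclidean-based KL2 approximation of [1] may differ; this sketch assumes mean-only adapted GMMs sharing weights and diagonal covariances.</p>

```python
import numpy as np

ALPHA = 0.25  # decay parameter used in the paper

def gmm_distance(weights, means_i, means_j, variances):
    """Weighted, variance-normalized squared Euclidean distance between
    the mean vectors of two GMMs (a simplified stand-in for the
    Euclidean-based KL2 approximation). Arrays are (K,) and (K, D)."""
    diff = (means_i - means_j) ** 2 / variances   # (K, D)
    return float(np.sum(weights[:, None] * diff))

def audio_similarity(distance, alpha=ALPHA):
    """Turn a distance into a similarity: S = exp(log(alpha) * D),
    i.e. alpha**D, so S = 1 at distance 0 and decays with distance."""
    return float(np.exp(np.log(alpha) * distance))
```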
      <p>Fusion of the visual and voice similarities is done by a weighted average, S^AV_ij = λ S^V_ij + (1 - λ) S^A_ij. We experimentally set λ = 0.85 in the case of the binary voice comparison and λ = 0.5 for the GMM-based comparison.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Tag initialization</title>
      <p>
        Initially, each node in the graph is tagged using the overlay for which the overlap with the facetrack is maximal. We used the overlay detection and name recognition provided with the task (output of the OCR system), which we filtered using the named entity detector NERO [<xref ref-type="bibr" rid="ref7">7</xref>], keeping only words tagged as "pers" by the named entity recognition. Note that this approach is rather aggressive, as NERO was initially designed for speech transcription in the French language. In practice, many nodes are not tagged, as they do not overlap with a valid overlay (sets T15 and T16, introduced in Section 4, show respectively 25.5% and 6.6% of nodes initially tagged). This is why tag propagation is required.
      </p>
    </sec>
    <sec id="sec-5">
      <title>3. TAG PROPAGATION APPROACHES</title>
      <p>Two different approaches are considered for the propagation of the initial tags: a random walk approach and a hierarchical one based on Kruskal's algorithm. In both cases, every node is associated with a particular tag and a confidence score at the end of the propagation phase.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Random walk tag propagation</title>
      <p>
        In a graph where transition probabilities between nodes are known, the probability of ever reaching node j starting from node i can be calculated using a random walk strategy with absorbing states [<xref ref-type="bibr" rid="ref10">10</xref>]. Let n be the number of nodes of the graph. We define a symmetric weight matrix W = {W_ij}, 1 ≤ i, j ≤ n, where W_ij is the similarity between nodes i and j, and a diagonal degree matrix D with D_ii = Σ_j W_ij. The transition probability matrix P^0 = {P^0_ij}, where P^0_ij is the probability of reaching node j from node i in one step, is given by P^0 = D^-1 W. Tagged nodes are set as absorbing states in P, which takes the block form

P = [ I     0
      P_ul  P_uu ],

where l is the set of tagged nodes, u is the set of untagged nodes, I is an identity matrix of size |l| × |l|, P_ul contains the probabilities of untagged nodes ending their walk on tagged nodes, and P_uu contains the probabilities of untagged nodes moving to other untagged nodes. We denote by P^t the transition probability matrix after t iterations. The random walk iteration is performed according to P^(t+1) = (1 - β) P^0 P^t + β P^0, where β is a parameter enforcing consistency with the initial state (here, β = 0.4). Once the random walk has converged (Σ_{i,j} |P^(t+1)_ij - P^t_ij| &lt; 10^-9), each untagged node is associated with the tagged node on which it has the highest probability of ending its walk, i.e., each row index of P_ul is matched with the column index with maximal probability. This maximal probability is kept as the confidence score.
      </p>
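<p>The iteration above can be sketched as follows. This is a minimal illustration, not the authors' code: indices are not reordered into the block form, tagged nodes are simply made absorbing in place, and at least one tagged node is assumed.</p>

```python
import numpy as np

def random_walk_tags(W, tags, beta=0.4, tol=1e-9, max_iter=10000):
    """Propagate tags on a similarity graph with an absorbing random walk.
    W: (n, n) symmetric similarity matrix; tags: length-n list with a
    string for initially tagged nodes and None otherwise. Returns the
    completed tag list and per-node confidence scores."""
    W = np.asarray(W, dtype=float)
    n = len(tags)
    P0 = W / W.sum(axis=1, keepdims=True)      # P0 = D^-1 W
    labeled = [i for i in range(n) if tags[i] is not None]
    for i in labeled:                          # tagged nodes become absorbing
        P0[i, :] = 0.0
        P0[i, i] = 1.0
    P = P0.copy()
    for _ in range(max_iter):
        # P_{t+1} = (1 - beta) P0 P_t + beta P0
        P_next = (1.0 - beta) * P0 @ P + beta * P0
        if np.abs(P_next - P).sum() < tol:     # convergence test
            P = P_next
            break
        P = P_next
    out, conf = list(tags), [1.0] * n
    for i in range(n):
        if tags[i] is None:
            j = labeled[int(np.argmax(P[i, labeled]))]
            out[i], conf[i] = tags[j], float(P[i, j])
    return out, conf
```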
    </sec>
    <sec id="sec-7">
      <title>3.2 Hierarchical tag propagation</title>
      <p>
        This method is based on the computation of a minimum spanning tree (MST) from an undirected weighted graph, using Kruskal's algorithm. The MST establishes a hierarchical partition of a set [<xref ref-type="bibr" rid="ref5">5</xref>]. A connected graph G is given (see Section 2), where edge weights represent distances (functions of their respective similarities S^AV). To propagate the initial tags, we start from a null graph H on G's nodes, and the following process is repeated until all edges of G have been examined: from G, the unexamined edge e corresponding to the smallest distance is chosen. If it does not link different trees in H, skip it; otherwise, it links trees T1 and T2 (thus forming T3), and e is added to the minimum spanning forest H being created. Three cases are possible: I. Neither T1 nor T2 is tagged: T3 is not tagged. II. Only T1 is tagged, with confidence score C_T1: T1's tag is assigned to the entire T3 (i.e., to all its unlabelled nodes), with confidence score C_T3 = C_T1 (1 - w_e), where w_e is the weight of e in G. III. Both T1 and T2 are tagged: one of the two tags (of T1 or of T2) is picked at random and assigned to T3, with confidence scores as in case II.
      </p>
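<p>A sketch of this procedure using union-find with per-tree tag state. It assumes distances lie in [0, 1] so that 1 - w_e is a valid confidence factor, and that each merge may overwrite previously propagated tags of the merged tree with the new tree-level confidence, as the text suggests; neither point is fully specified in the paper.</p>

```python
import random

def hierarchical_tags(n, edges, tags, seed=0):
    """Kruskal-style tag propagation. edges: list of (distance, i, j);
    tags: length-n list with a string for initially tagged nodes, else
    None. Returns the completed tags and confidence scores (initially
    tagged nodes keep confidence 1.0, an assumed convention)."""
    rng = random.Random(seed)
    parent = list(range(n))
    # per-tree (tag, confidence), or None for an untagged tree (case I)
    tree_tag = [(t, 1.0) if t is not None else None for t in tags]
    members = [[i] for i in range(n)]          # nodes under each root

    def find(x):                               # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    out = list(tags)
    conf = [1.0 if t is not None else 0.0 for t in tags]
    for w, i, j in sorted(edges):              # smallest distance first
        a, b = find(i), find(j)
        if a == b:                             # same tree: skip the edge
            continue
        ta, tb = tree_tag[a], tree_tag[b]
        if ta is None and tb is None:
            merged = None                      # case I: T3 stays untagged
        else:
            if ta is not None and tb is not None:
                src = rng.choice([ta, tb])     # case III: pick at random
            else:
                src = ta if ta is not None else tb   # case II
            merged = (src[0], src[1] * (1.0 - w))    # C_T3 = C (1 - w_e)
        parent[b] = a
        members[a] += members[b]
        tree_tag[a] = merged
        if merged is not None:                 # tag all unlabelled nodes of T3
            for v in members[a]:
                if tags[v] is None:
                    out[v], conf[v] = merged[0], merged[1]
    return out, conf
```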
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS</title>
      <p>Tab. 2 reports the results obtained on the 2015 and 2016 test data (T15, the development data for 2016, and T16, respectively). For T16, the reference annotation dump of 2016/09/14 is used. The ranks of the submissions are shown in Tab. 3. All tag propagation approaches improve over the no-propagation baseline (c4), the interest of tag propagation being much clearer on T16. The baseline highlights noticeable differences between T15 and T16. In T15, propagation was almost useless, as most nodes could be tagged in the initial stage. This is not the case in T16, where tag propagation yields a significant gain. The hierarchical tag propagation on graphs combining CNN visual similarity and binary voice similarity (primary) consistently outperforms the other combinations, showing the interest of combining audio and visual similarities. Comparing approaches, c3 usually (except for T16, MAP@1) performs better than c1, indicating that the hierarchical tag propagation performs better than the random walk, at least with GMM-CNN audiovisual similarities. The comparison of the primary run and c3 shows the weakness of the GMM-based voice comparison with respect to the state-of-the-art approach used for diarization. Finally, the comparison of c3 and c2 gives mixed results: the use of the GMM-based voice comparison decreases performance in most cases, except on T16 at K = 1 and K = 100.</p>
    </sec>
    <sec id="sec-9">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>Work supported by FAPEMIG/INRIA/MOTIF (CEX-APQ 03195-13), FAPEMIG/PPM (CEX-PPM-6-16) and CAPES (064965/2014-01).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Betser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bimbot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          .
          <article-title>Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs</article-title>
          .
          <source>In Proceedings of the 8th International Conference on Spoken Language Processing</source>
          , pages
          <fpage>333</fpage>
          -
          <lpage>444</lpage>
          ,
          <year>October 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Guinaudeau</surname>
          </string-name>
          .
          <article-title>Multimodal person discovery in broadcast TV at MediaEval 2016</article-title>
          .
          <source>In Working notes of the MediaEval 2016 Workshop</source>
          ,
          <year>October 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>dos Santos Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          .
          <article-title>SSIG and IRISA at Multimodal Person Discovery</article-title>
          .
          <source>In Working notes of the MediaEval 2015 Workshop</source>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Odobez</surname>
          </string-name>
          .
          <article-title>EUMSSI team at the MediaEval Person Discovery Challenge</article-title>
          .
          <source>In Working notes of the MediaEval 2015 Workshop</source>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Perret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cousty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. R.</given-names>
            <surname>Ura</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. J. F.</given-names>
            <surname>Guimarães</surname>
          </string-name>
          .
          <article-title>Evaluation of morphological hierarchies for supervised segmentation</article-title>
          .
          <source>In Proceedings of the 12th International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing</source>
          , pages
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised speaker identification in TV broadcast based on written names</article-title>
          .
          <source>IEEE/ACM Transactions on Audio, Speech and Language Processing</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ):
          <fpage>57</fpage>
          -
          <lpage>68</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raymond</surname>
          </string-name>
          .
          <article-title>Robust tree-structured named entities recognition from speech</article-title>
          .
          <source>In International Conference on Acoustics, Speech and Signal Processing</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tolias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sicre</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          .
          <article-title>Particular object retrieval with integral max-pooling of CNN activations</article-title>
          .
          <source>In Proceedings of the 2016 International Conference on Learning Representations</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Learning from labeled and unlabeled data with label propagation</article-title>
          .
          <source>Technical report, Citeseer</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>