<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SSIG and IRISA at Multimodal Person Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cassio E. dos Santos Jr</string-name>
          <email>cass@dcc.ufmg.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Gravier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William Robson Schwartz</string-name>
          <email>william@dcc.ufmg.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Universidade Federal de Minas Gerais</institution>
          ,
          <addr-line>Belo Horizonte</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IRISA &amp; Inria Rennes</institution>
          ,
          <addr-line>CNRS, Rennes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>This paper describes our approach and results in the multimodal person discovery in broadcast TV task at MediaEval 2015. We investigate two distinct aspects of multimodal person discovery. One refers to face clusters, which are used to propagate names associated with faces in one shot to other faces that probably belong to the same person. The face clustering approach consists in calculating face similarities using partial least squares (PLS) and a simple hierarchical approach. The other aspect refers to tag propagation in a graph-based approach where nodes are speaking faces and edges link similar faces/speakers. The advantage of the graph-based tag propagation is that it does not rely on face/speaker clustering, which we believe can be error-prone.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Multimodal person discovery in video archives consists in
naming all speaking faces in the collection without prior
information, leveraging face recognition, speech
recognition, speaker recognition and optical character recognition.
A description of the task and resources provided within
MediaEval is given in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In particular, two key components
of most systems for multimodal person discovery are (i) face
tracking and clustering and (ii) speaker diarization. See [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
for a recent overview of existing systems. Given these
components, a popular strategy to name speaking faces relies on
a mapping of face clusters and speakers from the diarization,
combining this mapping with appearance of named entities
in speech transcripts or on screen (e.g., [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ]). The baseline
system provided by the organizers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a clear instantiation
of this approach. Person names appearing on screen are first
propagated onto speaker clusters, finding an optimal mapping
based on co-occurrence. In the next step, one has to find,
for each named speaker, whether there is a co-occurring face
track whose probability of corresponding to the current
speaker is higher than a threshold. Each such face track receives the
name assigned to the speaker cluster.
      </p>
      <p>
        We explore two distinct aspects of multimodal person
discovery in this evaluation. On the one hand, we seek to
improve face clustering using recent advances in face
recognition based on partial least squares (PLS) regression [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We
consider a variant of the provided baseline system, modified
to better merge the PLS face cluster and speaker diarization
results. On the other hand, we study tag propagation in a
graph where nodes are speaking faces, with edges denoting
the voice and/or face similarity. This approach is motivated
by the wish to avoid explicit face and speaker clustering and
open new strategies for person discovery. Note that the two
approaches could be combined but, for practical reasons,
this combination was not considered in the framework of
the evaluation.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. PLS-BASED FACE CLUSTERING</title>
      <p>The PLS-based face clustering approach consists in
calculating a similarity measure between face tracks for further
clustering. Face clusters are then used in a variant of the
baseline, as a replacement of the face clusters provided.</p>
      <p>
        PLS is a statistical method consisting of two steps:
regression and projection [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The projection step consists
in calculating a subspace that maximizes the covariance
between predictors and responses. The regression step relies
on ordinary least squares to estimate responses based on
the projected predictors. We employ the one-shot similarity
metric based on PLS for face verification described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
which presents robust results for face images in the wild
compared to conventional distance-based methods. In a
nutshell, the similarity sim(A, B) between face tracks A and B
relies on a PLS regression trained to return +1 for samples
in A and -1 for samples in a background set of
images (300 random face images from the LFW dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]).
Then, sim(A, B) is calculated as the average of the responses
of the samples in B under the learned PLS regression.
A symmetric version is used in practice, averaging sim(A, B)
and sim(B, A).
      </p>
      <p>Based on PLS similarity calculated between all face track
pairs, clustering aims at grouping face tracks from the same
subject. We employ a hierarchical clustering approach that
consists in merging a pair of face tracks with maximum
similarity and with at least one face track that was not merged
yet. The merging consists in propagating an identification
label from one face track to the other, or generating a new
identification label for the pair if no label was previously
associated with either face track. The algorithm stops when the
maximum similarity falls below a threshold, empirically set
to 0.5 using the development set.</p>
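The merging rule above can be sketched as follows; the function name and the pairwise-similarity dictionary are illustrative stand-ins, not the authors' code:

```python
# Sketch of the hierarchical label-propagation clustering described above.
# `sim` maps a tuple of two track ids to their (symmetric) PLS similarity.
def cluster_tracks(track_ids, sim, threshold=0.5):
    label = {}        # track id -> cluster label
    next_label = 0
    # visit pairs from highest to lowest similarity
    for (a, b), s in sorted(sim.items(), key=lambda kv: -kv[1]):
        if s < threshold:
            break                           # stopping criterion
        if a in label and b in label:
            continue                        # both tracks already merged
        if a in label:
            label[b] = label[a]             # propagate the existing label
        elif b in label:
            label[a] = label[b]
        else:                               # neither labeled: new cluster
            label[a] = label[b] = next_label
            next_label += 1
    for t in track_ids:                     # remaining tracks are singletons
        if t not in label:
            label[t] = next_label
            next_label += 1
    return label
```

Note the condition "at least one face track not merged yet" appears as the `continue` when both tracks already carry a label.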
      <p>To assess the interest of PLS-based face clustering, we
consider a slightly different version of the baseline approach
to merge face clustering and speaker diarization
information. Each name associated to one face track is propagated
to all face tracks within the same face cluster. We then
consider the union of the names from the face tracks and
speaker diarization within each shot. We also evaluate the
modified baseline approach using only the speaker
diarization, only the face clusters, and considering the
intersection of the names instead of the union.</p>
      <p>[Table 1 fragment: BSLN, dev 38.89, test 78.35, test (PLS) 78.35.]</p>
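The four naming variants evaluated here amount to simple per-shot set operations; a hypothetical sketch (the inputs are illustrative: the names reaching a shot via face clusters and via speaker diarization):

```python
# Sketch of the per-shot name merging variants (SPKR, FACE, UNI, INT).
def shot_names(face_names, speaker_names, mode):
    f, s = set(face_names), set(speaker_names)
    if mode == "SPKR":
        return s          # speaker diarization only
    if mode == "FACE":
        return f          # face clusters only
    if mode == "UNI":
        return f | s      # union of both sources
    if mode == "INT":
        return f & s      # intersection of both sources
    raise ValueError(mode)
```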
    </sec>
    <sec id="sec-3">
      <title>3. GRAPH-BASED TAG PROPAGATION</title>
      <p>To sidestep issues with errors in clustering, which we
believe can strongly affect the naming process, we investigate
a strategy based on tag propagation within a graph where a
node corresponds to an occurrence of a speaking face within
a shot.</p>
      <p>The first step is the graph construction process, which
consists in identifying speaking faces from the face tracks
detected within each shot. This is achieved by selecting
face tracks whose probability to correspond to the current
speech turn is greater than a threshold empirically set to
0.6, where the probabilities that a face track corresponds
to a speech turn are those provided. For each selected face
track, we keep a record of the matching speech turn. The
selected speaking face tracks are the nodes of a graph and are
connected with edges bearing two scores, depicting the
similarity of resp. voice and face (as given in the speech turn and
face track similarity files). To avoid a fully connected graph
and keep only relevant relationships, we connect two nodes if
the similarity between the corresponding face tracks and the
similarity between the corresponding speech turns are both
above a threshold, empirically set to 0.1 for both
modalities. Note that having no relations between face tracks and
speech turns across shows, a graph is built independently
for each show.</p>
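A sketch of the graph construction under the thresholds above; the input containers (a list of (face track, speech turn) pairs and similarity dicts keyed by frozensets of ids) are illustrative stand-ins for the provided similarity files:

```python
# Sketch of graph construction: nodes are speaking-face occurrences; two
# nodes are linked only if BOTH face and voice similarities exceed thresholds.
def build_graph(speaking_faces, face_sim, voice_sim,
                face_thr=0.1, voice_thr=0.1):
    """speaking_faces: list of (face_track_id, speech_turn_id) pairs.
    face_sim / voice_sim: dicts keyed by frozensets of two ids."""
    nodes = list(range(len(speaking_faces)))
    edges = {}
    for i in nodes:
        fi, si = speaking_faces[i]
        for j in nodes[i + 1:]:
            fj, sj = speaking_faces[j]
            fs = face_sim.get(frozenset((fi, fj)), 0.0)
            vs = voice_sim.get(frozenset((si, sj)), 0.0)
            if fs > face_thr and vs > voice_thr:
                # edge weight used later for propagation: average similarity
                edges[(i, j)] = 0.5 * (fs + vs)
    return nodes, edges
```

Requiring both modalities to clear their thresholds is what keeps the graph sparse, since a missing similarity entry counts as zero.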
      <p>The naming process starts by associating a name to a
node whenever possible based on the output of overlaid text
detection: if an overlay significantly overlaps the face track,
the node is tagged with the corresponding name and a score
of 1. In the case of multiple overlapping overlays, the name
corresponding to the longest co-occurrence is considered.
After tagging all nodes, tags are optionally propagated over
a number of iterations. At each iteration, each tag of each
node is propagated via the corresponding edges with a
propagation score equal to the tag score multiplied by the edge
weight, where edge weights are taken as the average of the
face and voice similarity. After propagation, each node
receives the tag with the highest score.</p>
    </sec>
    <sec id="sec-4">
      <title>4. RESULTS</title>
      <p>
        The results from the second submission (July 8th) of the
four PLS-based methods and the baseline are presented in
Tab. 1, where the following abbreviations are employed:
PLS-based face clustering considering only speaker
diarization (SPKR), only face clusters (FACE), union (UNI) and
intersection (INT) of names among face clusters and speaker
diarization (only submission shots were considered in this
work). In PLS-based face clustering, we consider the
CLBP [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] feature descriptor with radius parameter 5
calculated in square blocks of size 16 pixels with a stride of 8 pixels.
All faces were cropped from the videos using the face
position provided in the baseline approach and scaled to 128 by
128 pixels. Note that we do not provide face clusters based
on PLS for the development set and, therefore, all results in
Tab. 1 for the development set consider only the face clusters
available in the baseline approach. We also provide the
results on the test set considering the face clusters provided in
the baseline method, i.e., without PLS-based face clustering.
      </p>
      <p>The SPKR approach yields the best EwMAP in Tab. 1,
while FACE yields the worst results. However, the results from
INT and UNI indicate that the two approaches are
complementary: the intersection of the propagated
names among the face clusters and speaker diarization shots
shows that a small subset of correct names from the face
clusters is not among the speaker names. These aspects
are observed on both the development and test sets, using the face
clusters from the baseline or those from the PLS method. We
also noticed no significant difference in the results between
the face clusters provided in the baseline approach and those from the
PLS-based method, considering the UNI approach. We
believe that this small difference is an effect of the poor quality
of the face clusters, which might result from combined errors
in the face detection and face tracking methods.</p>
      <p>Results for the graph-based tag propagation method are
given in Tab. 2. On the development data (test2 subset),
results are provided without tag propagation (no prop) and
with a single step of tag propagation. We believe that the
poor results obtained are attributable to the fact that the
graph links only submission shots, which account only for a
small fraction of the total number of shots in the
development data. In contrast, most of the shots in the test
data are submission shots. Unsurprisingly, tag propagation
improves the MAP at the expense of correctness. Submission
on the test set was made without tag propagation (because
of unconvincing propagation results at the time) and not
updated after the initial submission (July 1st). Interestingly,
direct naming of speaking face tracks from overlays (i.e., no
propagation) already provides accurate tagging.
</p>
    </sec>
    <sec id="sec-5">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>The authors would like to thank the Brazilian National
Research Council – CNPq (Grant #477457/2013-4), the
Brazilian National Council for the Improvement of Higher
Education – CAPES (Grant STIC-AMSUD 001/2013) and the
Minas Gerais Research Foundation – FAPEMIG (Grants
APQ01806-13 and CEX-APQ-03195-13). This work was partially
supported by the STIC AmSud program, under the project
'Unsupervised Mining of Multimedia Content', and by the
Inria Associate Team program.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ahonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hadid</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietikainen</surname>
          </string-name>
          .
          <article-title>Face description with local binary patterns: Application to face recognition</article-title>
          .
          <source>IEEE Trans. on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>28</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2037</fpage>
          –
          <lpage>2041</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Overview of the multimodal person discovery task at MediaEval 2015</article-title>
          .
          <source>In Working Notes Proc. of MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-B.</given-names>
            <surname>Le</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Person Instance Graphs for Mono-, Cross- and Multi-Modal Person Recognition in Multimedia Data. Application to Speaker Identification in TV Broadcast</article-title>
          .
          <source>International Journal of Multimedia Information Retrieval</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Face verification using large feature sets and one shot similarity</article-title>
          .
          <source>In Intl. Conf. on Biometrics</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Berg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Learned-Miller</surname>
          </string-name>
          .
          <article-title>Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments</article-title>
          .
          <source>Technical Report 07-49</source>
          , University of Massachusetts, Amherst,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          .
          <article-title>Identification non-supervisée de personnes dans les flux télévisés</article-title>
          .
          <source>PhD thesis</source>
          , Université de Grenoble,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-B.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised speaker identification using overlaid texts in TV broadcast</article-title>
          .
          <source>In Annual Conf. of the International Speech Communication Association</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fortier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Naming multi-modal clusters to identify persons in TV broadcast</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosipal</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Kramer</surname>
          </string-name>
          .
          <article-title>Overview and recent advances in partial least squares</article-title>
          .
          <source>In Subspace, latent structure and feature selection</source>
          , pages
          <fpage>34</fpage>
          –
          <lpage>51</lpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>