<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIMSI at MediaEval 2015: Person Discovery in Broadcast TV Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johann Poignant</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hervé Bredin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claude Barras LIMSI - CNRS - Rue John Von Neumann</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Orsay</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>France. firstname.lastname@limsi.fr</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the algorithm tested by the LIMSI team in the MediaEval 2015 Person Discovery in Broadcast TV Task. For this task we used an audio/video diarization process constrained by names written on screen. These names are used both to identify clusters and to prevent the fusion of two clusters with different co-occurring names. This method obtained an EwMAP of 83.1%, tuned on the out-domain development corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        We present the approach of the LIMSI team to the Person
Discovery in Broadcast TV Task at MediaEval 2015. To
address this task, we had to return the names of people who
are both seen and heard in a selection of shots from a
collection of videos. The list of people is not known a priori;
their names must be discovered in an unsupervised way
from the media content, using text overlay or speech transcripts.
For further details about the task, dataset and metrics, the
reader can refer to the task description [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>We first detail the baseline fusion system provided to all
participants (we are both organizer and participant). Then,
we describe the constrained multi-modal clustering. Finally,
we compare the results obtained.</p>
    </sec>
    <sec id="sec-2">
      <title>2. MULTI-MODAL FUSION</title>
      <p>We propose two different approaches to address the task.
They only rely on metadata provided to all participants (see
Table 1). Only written names are used as source of identity.
In addition to speech turn segmentation and face detection
and tracking, the baseline relies on the provided speaker
diarization and speaking face mapping. The constrained
clustering relies on the similarity matrices (for speaker and
face) to process its own clustering.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Baseline</title>
      <p>
        From the written names and the speaker diarization, we
used the “Direct Speech Turn Tagging” method described
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to identify speakers: we first tagged speech turns with the
co-occurring written name. Then, on the remaining
unnamed speech turns, we found the one-to-one mapping that
maximizes the co-occurrence duration between speaker
clusters and written names (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for more details). Finally,
we propagate the speaker identities to the co-occurring face
tracks based on the speech turns/face tracks mapping.
      </p>
      <table-wrap id="tab1">
        <caption>
          <p>Table 1: Components provided to all participants and those used by each approach (x = used).</p>
        </caption>
        <table>
          <thead>
            <tr><th /><th>Components</th><th>Baseline</th><th>Constrained clustering</th></tr>
          </thead>
          <tbody>
            <tr><td>Speech turns</td><td>Segmentation</td><td>x</td><td>x</td></tr>
            <tr><td /><td>Similarity</td><td /><td>x</td></tr>
            <tr><td /><td>Diarization</td><td>x</td><td /></tr>
            <tr><td>Face</td><td>Detection &amp; Tracking</td><td>x</td><td>x</td></tr>
            <tr><td /><td>Similarity</td><td /><td>x</td></tr>
            <tr><td /><td>Diarization</td><td /><td /></tr>
            <tr><td>Speaking face</td><td>Mapping</td><td>x</td><td>x</td></tr>
            <tr><td>Source of names</td><td>Written names [<xref ref-type="bibr" rid="ref3">3</xref>]</td><td>x</td><td>x</td></tr>
            <tr><td /><td>Pronounced names [<xref ref-type="bibr" rid="ref1 ref2">2, 1</xref>]</td><td /><td /></tr>
          </tbody>
        </table>
      </table-wrap>
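      <p>As an illustration, the two naming steps of the baseline (direct tagging by a co-occurring written name, then a one-to-one mapping maximizing co-occurrence duration) can be sketched on toy interval data. The data model and function names below are hypothetical and are not the actual LIMSI implementation; the brute-force search stands in for a proper assignment solver and assumes there are at least as many names as clusters.</p>

```python
from itertools import permutations

def overlap(a, b):
    """Temporal overlap (seconds) between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def tag_speech_turns(turns, names):
    """Direct tagging: label a speech turn with a written name that
    co-occurs with it in time (first match wins in this toy version)."""
    tagged = {}
    for turn_id, turn in turns.items():
        for name, span in names.items():
            if overlap(turn, span) > 0.0:
                tagged[turn_id] = name
                break
    return tagged

def best_one_to_one(clusters, names):
    """One-to-one speaker-cluster/name mapping maximizing the total
    co-occurrence duration (brute force over permutations, toy scale)."""
    c_ids, n_ids = list(clusters), list(names)
    best, best_score = {}, -1.0
    for perm in permutations(n_ids, len(c_ids)):
        score = sum(overlap(clusters[c], names[n]) for c, n in zip(c_ids, perm))
        if score > best_score:
            best_score, best = score, dict(zip(c_ids, perm))
    return best
```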
    </sec>
    <sec id="sec-4">
      <title>2.2 Constrained multi-modal clustering</title>
      <p>Figure 2 shows a global overview of our method. We
first combined the mono-modal similarity matrices and the
speaking face mapping into a large multi-modal matrix,
using weights α and β to give more or less importance to a
given modality. In parallel, written names are used to
identify co-occurring face tracks and speech turns.</p>
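      <p>One plausible reading of this combination, following the weight labels of Figure 2 (face similarity scaled by α, speech turn similarity by 1-α, and the speech turns/face tracks mapping by β), is a weighted block matrix. This is a sketch under those assumptions, not the paper's exact construction:</p>

```python
import numpy as np

def multimodal_matrix(face_sim, speech_sim, mapping, alpha, beta):
    """Combine mono-modal similarities into one block matrix (sketch):
    face-face block weighted by alpha, speech-speech by (1 - alpha),
    and the face-track/speech-turn mapping block by beta."""
    n_f, n_s = face_sim.shape[0], speech_sim.shape[0]
    m = np.zeros((n_f + n_s, n_f + n_s))
    m[:n_f, :n_f] = alpha * face_sim                 # face tracks block
    m[n_f:, n_f:] = (1.0 - alpha) * speech_sim       # speech turns block
    m[:n_f, n_f:] = beta * mapping                   # face rows, speech cols
    m[n_f:, :n_f] = beta * mapping.T                 # symmetric counterpart
    return m
```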
      <p>Then, we perform an agglomerative clustering on the
multimodal matrix to merge all face tracks and speech turns of
a same person into a unique cluster. This process is
constrained by preventing the fusion of clusters named differently.
The two parameters α and β advance or delay the merging of
components of one modality relative to the others during the
agglomerative clustering process, while the stopping criterion
is chosen to maximize the target metric (here, the EwMAP).</p>
      <fig id="fig2">
        <caption>
          <p>Figure 2: Overview of the constrained multi-modal clustering: the face track similarity matrix (weighted by α), the speech turn similarity matrix (weighted by 1-α) and the speech turns/face tracks mapping (weighted by β) are combined into a multimodal similarity matrix; written names co-occurring with face tracks and speech turns constrain the clustering, which outputs named clusters.</p>
        </caption>
      </fig>
      <p>
        A complete description of this method can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
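      <p>The name constraint on the agglomerative process can be sketched as follows: merge the closest pair of clusters, but never two clusters that carry different written names. This is a minimal average-link toy version over a precomputed similarity table, not the actual system:</p>

```python
def constrained_agglomerative(items, sim, names, threshold):
    """Average-link agglomerative clustering over precomputed similarities,
    refusing to merge clusters tagged with different written names.
    `sim` maps frozenset({a, b}) -> similarity; `names` maps item -> name."""
    clusters = [{i} for i in items]

    def avg_sim(c1, c2):
        pairs = [sim[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    def tags(c):
        return {names[i] for i in c if i in names}

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(tags(clusters[i]) | tags(clusters[j])) > 1:
                    continue  # constraint: differently named clusters never merge
                s = avg_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best is None or best[0] < threshold:
            break  # stopping criterion, tuned on the target metric
        _, i, j = best
        clusters[i] |= clusters.pop(j)
    return clusters
```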
    </sec>
    <sec id="sec-5">
      <title>2.3 Speaking face selection and confidence</title>
      <p>The last part is common to the two fusions. For each
person who speaks and appears in a shot (following the shot
segmentation provided to all participants), we compute a
confidence score. This score is based on the temporal distance
between the speaking face and its closest written name. The
confidence equals:</p>
      <disp-formula>
        <tex-math><![CDATA[\mathrm{confidence} = \begin{cases} 1 + d & \text{if the speaking face co-occurs with the written name} \\ 1/\delta & \text{otherwise} \end{cases}]]></tex-math>
      </disp-formula>
      <p>where d is the co-occurrence duration and δ is the duration
of the gap between the face track (or speech turn) and the
written name.</p>
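      <p>A minimal sketch of this confidence score, assuming the two cases are 1 + d on co-occurrence and 1/δ otherwise (co-occurring faces thus always outrank non-co-occurring ones, since 1 + d ≥ 1 &gt; 1/δ for gaps longer than one second):</p>

```python
def confidence(co_occurs, d=0.0, delta=1.0):
    """Confidence score used to rank shots (sketch): 1 + d when the
    speaking face co-occurs with its written name (d = co-occurrence
    duration), 1 / delta otherwise (delta = gap duration)."""
    return 1.0 + d if co_occurs else 1.0 / delta
```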
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS</title>
      <p>In Table 2, we report the EwMAP, the MAP and the
Correctness (denoted C) obtained by the baseline and by
the constrained clustering tuned on an out-domain corpus
(for the first deadline: 01-jul-15) and on an in-domain corpus
(second deadline: 08-jul-15).</p>
      <p>The baseline does not take into account the similarity
between faces and does not benefit from the knowledge of
written names during the diarization process. In addition
to using these two extra sources of information, our second method
optimizes the stopping criterion of the clustering based on the
target metric (EwMAP), while the diarization of the baseline
is tuned to maximize the classical DER.</p>
      <p>For the first deadline (July 1st), we tuned the parameters
α and β and the stopping criterion of the clustering process
on the out-domain development set. For the second deadline
(July 8th), we tuned these parameters with the evaluation
proposed via the leaderboard (computed every six hours on a
subset of the test set). We see only a small improvement
between the two, showing that our method generalizes well.</p>
      <p>To determine the scope for further progress, we used an
oracle capable of recognizing a speaking face as soon as his/her
written name is correctly extracted by the OCR module. In
the mono-show case, the name must be written in the same
video. In the cross-show case, the name can be written in
any video of the corpus. Since our own approach only uses
mono-show propagation, these oracle experiments show that
up to 1% of MAP could be gained with cross-show
propagation approaches.</p>
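      <p>The mono-show versus cross-show oracles can be sketched as a simple membership test over the OCR-extracted names; the data model below is a hypothetical illustration, not the evaluation code:</p>

```python
def oracle_hits(queries, written, cross_show=False):
    """Oracle propagation (sketch): a (video, person) query counts as
    recognized as soon as the person's name appears among the written
    names of the same video (mono-show) or of any video (cross-show).
    `written` maps video id -> set of OCR-extracted names."""
    corpus_names = set().union(*written.values()) if written else set()
    return [person in (corpus_names if cross_show else written.get(video, set()))
            for video, person in queries]
```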
      <p>In Table 3, we report the mean precision and recall over all
queries. Compared to the baseline, the constraints on the
clustering process allow a lower stopping criterion
(and therefore bigger clusters, which improves the
recall), while keeping very high cluster purity (see the
precision in Table 3). The high precision of our constrained
clustering makes the choice of the confidence score (used to rank
shots in the computation of the MAP) relatively unimportant.
Tuning the three parameters on an in-domain corpus
improves recall by 1.3% and decreases precision by 0.8%. In
practice, the relative weight of the speech turn similarity was
reduced for the July 8th (in-domain) tuning; speech turn
clustering was therefore delayed (with respect to face track
clustering) between July 1st (out-domain) and July 8th
(in-domain tuning).</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION AND FUTURE WORK</title>
      <p>This paper presented our approach and results at the
MediaEval Person Discovery in Broadcast TV task. The
process used an audio/video diarization constrained by names
written on screen. This source of identities is used both to
identify clusters and to avoid wrong merges during the
agglomerative clustering process.</p>
      <p>As future work, we will improve the distance between
speech turns and try other clustering methods as well as
cross-show propagation.</p>
      <p>Acknowledgment. This work was supported by the French
National Agency for Research under grant
ANR-12-CHRI0006-01 (CAMOMILE project).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dinarelli</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosset</surname>
          </string-name>
          .
          <article-title>Models Cascade for Tree-Structured Named Entity Detection</article-title>
          .
          <source>In IJCNLP</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Courcinous</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Despres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Josse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kilgour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kraft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-B.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nussbaum-Thom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Oparin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schlippe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schlüter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schultz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>da Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stüker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundermeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vieru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Waibel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Woehrling</surname>
          </string-name>
          .
          <article-title>Speech Recognition for Machine Translation in Quaero</article-title>
          .
          <source>In IWSLT</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Quénot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thollard</surname>
          </string-name>
          .
          <article-title>From text detection in videos to person identification</article-title>
          .
          <source>In ICME</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Multimodal Person Discovery in Broadcast TV at MediaEval 2015</article-title>
          .
          <source>In MediaEval</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quénot</surname>
          </string-name>
          .
          <article-title>Unsupervised speaker identification using overlaid texts in TV broadcast</article-title>
          .
          <source>In INTERSPEECH</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fortier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quénot</surname>
          </string-name>
          .
          <article-title>Naming multi-modal clusters to identify persons in TV broadcast</article-title>
          .
          <source>MTAP</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>