<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PERCOLATTE: A Multimodal Person Discovery System in TV Broadcast for the MediaEval 2015 Evaluation Campaign</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Meriem Bendris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Delphine Charlet</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gregory Senay</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>MinYoung Kim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benoit Favre</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mickael Rouvier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frederic Bechet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Géraldine Damnati</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aix Marseille Université</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Panasonic Silicon Valley Lab</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the PERCOLATTE participation in the MediaEval 2015 task "Multimodal Person Discovery in Broadcast TV", which requires developing algorithms for unsupervised talking-face identification in broadcast news. The proposed approach relies on two identity propagation strategies, both based on document chaptering and restricted overlaid-name propagation rules. The primary submission shows a 10% improvement in Mean Average Precision over the baseline on the INA corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Identifying people in TV broadcasts has received considerable
attention in the literature over the last decade. Current trends
aim to combine traditional techniques with high-level
information such as prior knowledge of document structure.
Indeed, TV programs often have a regular structure organized
in homogeneous sequences. The REPERE Challenge, which
ended in 2014, aimed at developing multimodal algorithms
for people identification in TV broadcasts. Our
PERCOLATOR system, based on scene understanding features, ranked
first on the main task in 2014 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The MediaEval
"Multimodal Person Discovery in Broadcast TV" task focuses on
unsupervised talking-face identification [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for search engine
applications. One novelty of this task is the metadata made
available by the organizers, enabling broader participation.
      </p>
      <p>This paper describes the PERCOLATTE system
submitted to MediaEval 2015. The system relies on the
enrichment of broadcast news with video structure features
such as shot classification (studio/report) and speaker role
recognition. Two identification strategies were developed:
the primary is based on chapter-restricted identity
propagation to shot clusters, and the secondary is based on speaker
identification and rule-based speaker-face mapping. Figure
1 shows the pipeline of the PERCOLATTE system. Notice
that no face-related processing (detection/identification) is
used in our approach.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TOOLS</title>
      <p>
        The MediaEval 2015 organizers made available different
baseline mono-modal tools. In our system, we used the
provided Overlaid Person Names (OPN) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] system. In
addition, we used the automatic named entities [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the
speaking-face mapping to fix the identification scores.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 List of names</title>
      <p>The French National Audiovisual Institute (INA) collects and
enriches broadcast news with metadata such as summaries,
the identity of journalists, etc. We collected this metadata1 from
December 2004 to December 2009 and automatically extracted
a list of journalists and anchors.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Overlaid anchor name detection</title>
      <p>Anchor names were not detected by the provided OCR
system. We therefore developed an anchor name detector relying
on a Levenshtein-based mapping of the OCR results2 (on 2
rescaled frames) against the list of names described previously.</p>
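      <p>A minimal sketch of such a Levenshtein-based mapping is given below (illustrative Python: the sample names and the 0.3 distance ratio are our assumptions, not the system's actual parameters):</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def match_anchor_name(ocr_text, known_names, max_ratio=0.3):
    """Map a (possibly noisy) OCR string to the closest known name;
    reject the match when the edit distance exceeds `max_ratio`
    times the length of the candidate name."""
    text = ocr_text.lower().strip()
    best_d, best = min((levenshtein(text, name.lower()), name)
                       for name in known_names)
    if best_d > max_ratio * len(best):
        return None
    return best
```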
    </sec>
    <sec id="sec-5">
      <title>2.3 Speaker clustering</title>
      <p>
        The speaker clustering follows the approach described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
First, speech segments are grouped using BIC clustering.
Then, the obtained clusters are modelled with GMMs in order
to compare voices more accurately with the Cross-Likelihood
Ratio (CLR) criterion in a second agglomerative clustering. At
each iteration, Viterbi decoding is performed to re-segment
the speech data into speaker turns given the new clusters.
      </p>
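      <p>The agglomerative stages can be sketched as follows (a toy Python sketch in which Euclidean distance between cluster means stands in for the BIC and CLR criteria; the stopping threshold is illustrative):</p>

```python
def mean_vector(cluster):
    """Mean feature vector over all frames of a cluster
    (a cluster is a list of segments; a segment a list of frame vectors)."""
    frames = [f for seg in cluster for f in seg]
    dim = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(dim)]

def mean_distance(c1, c2):
    """Toy stand-in for the BIC/CLR merging criteria: Euclidean
    distance between the mean vectors of two clusters."""
    m1, m2 = mean_vector(c1), mean_vector(c2)
    return sum((a - b) ** 2 for a, b in zip(m1, m2)) ** 0.5

def agglomerative_clustering(segments, distance, threshold):
    """Bottom-up clustering: repeatedly merge the closest pair of
    clusters while their distance stays below `threshold`."""
    clusters = [[seg] for seg in segments]
    while len(clusters) > 1:
        d, i, j = min((distance(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d >= threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```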
    </sec>
    <sec id="sec-6">
      <title>2.4 Speaker role classification</title>
      <p>
        We used a simplified version of the speaker role
classification approach described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. First, the anchor is identified as the
speaker cluster that speaks the most and most regularly. Then,
a binary reporter/other classification is performed. As no
speech transcript was available, the classification in this work
relies only on an acoustic GMM classifier.
      </p>
    </sec>
    <sec id="sec-7">
      <title>2.5 Speaker identification</title>
      <p>Speaker turns are identified by propagating each OPN to the
speaker turn that maximizes temporal overlap, and to
its cluster within the same chapter.
1Available on http://www.ina.fr
2https://github.com/meriembendris/ADNVideo</p>
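      <p>The overlap-maximizing propagation can be sketched as follows (illustrative Python; the function and variable names are ours):</p>

```python
def overlap(a, b):
    """Length of the temporal intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def propagate_opn(opn, turns):
    """Assign an overlaid person name to the speaker turn with maximal
    temporal overlap; `opn` is (start, end, name), `turns` a list of
    (start, end) intervals. Returns (turn_index, name), or None when
    the OPN overlaps no turn at all."""
    start, end, name = opn
    scores = [overlap((start, end), turn) for turn in turns]
    if max(scores, default=0.0) == 0.0:
        return None
    return scores.index(max(scores)), name
```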
    </sec>
    <sec id="sec-8">
      <title>2.6 Shot boundary detection</title>
      <p>
        Two shot boundary detection systems were used, based on RGB histogram peaks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
and on HSV histogram peaks over a sliding window2. As the
evaluation script needs the provided shot segmentation, a
mapping to the provided shot boundaries was necessary.
      </p>
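      <p>The histogram-peak principle behind both detectors can be sketched as follows (a minimal Python sketch assuming grayscale frames and an illustrative threshold; the actual systems work on RGB and HSV histograms):</p>

```python
def histogram(frame, bins=16):
    """Normalized intensity histogram of a frame given as a flat
    list of pixel values in [0, 1]."""
    h = [0] * bins
    for p in frame:
        h[min(int(p * bins), bins - 1)] += 1
    return [count / len(frame) for count in h]

def shot_boundaries(frames, threshold=0.5):
    """Declare a shot cut before frame i when the L1 distance between
    the histograms of frames i-1 and i exceeds `threshold`."""
    hists = [histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if sum(abs(a - b) for a, b in zip(hists[i - 1], hists[i])) > threshold]
```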
    </sec>
    <sec id="sec-9">
      <title>2.7 Shot similarity and clustering</title>
      <p>
        In order to measure the similarity between shots, three
features were extracted: RGB histograms, HOG features
on resized frames (128x64) and a DNN-based frame
representation (image embeddings). For the DNN-based features, we
used the AlexNet DNN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to extract feature vectors at the
3rd fully-connected layer (1000-dimensional vectors). Then,
shots were grouped using cosine distance and Integer
Linear Programming clustering (described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]).
      </p>
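      <p>The paper clusters shots with an ILP formulation [9]; a greedy cosine-similarity pass, shown below only as an illustrative approximation (the 0.7 similarity threshold is assumed), conveys the idea:</p>

```python
def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def cluster_shots(features, min_sim=0.7):
    """Greedy single-pass grouping: attach each shot to the most similar
    cluster representative when similarity reaches `min_sim`,
    else open a new cluster. Returns one cluster label per shot."""
    reps, labels = [], []
    for f in features:
        sims = [cosine_similarity(f, r) for r in reps]
        if sims and max(sims) >= min_sim:
            labels.append(sims.index(max(sims)))
        else:
            labels.append(len(reps))
            reps.append(list(f))
    return labels
```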
    </sec>
    <sec id="sec-10">
      <title>2.8 Shot classification and chaptering</title>
      <p>The shot classifier is trained on external data (8
broadcast news shows, 4914 shots). Four labels were annotated: studio,
report, mixed and other. First, HOG features on resized
frames (128x64) were extracted for each shot. Then, a
Liblinear3 classifier was trained on three quarters of the data. The
system reached 99.43% accuracy on the remaining
quarter. Finally, successive shots sharing the same label were
grouped into chapters.</p>
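      <p>The chaptering step, grouping successive shots sharing the same label, can be sketched as:</p>

```python
from itertools import groupby

def chapters(shot_labels):
    """Group successive shots sharing the same class label into chapters;
    returns (label, start, end) triples with `end` exclusive."""
    out, i = [], 0
    for label, run in groupby(shot_labels):
        n = len(list(run))
        out.append((label, i, i + n))
        i += n
    return out
```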
    </sec>
    <sec id="sec-11">
      <title>3. TALKING FACE IDENTIFICATION</title>
      <p>Participants were asked to provide identified talking faces
within shots, with confidence scores and evidence
justifying their assertions. Two strategies were developed.</p>
    </sec>
    <sec id="sec-12">
      <title>3.1 The primary strategy</title>
      <p>The primary strategy relies on the fact that report
chapters are independent in broadcast news. The strategy is
based on a restricted OPN propagation to shot clusters within
the same chapter. Precisely, we followed these rules:
propagate an OPN to overlapping shots and to their shot
clusters sharing the same speaker cluster within a chapter;
propagate the anchor name to overlapping "studio" shots
and their shot clusters, without chapter restriction.</p>
      <p>Propagate anchor name if the speaker role is an anchor.</p>
      <p>For each identified talking face, the score was initialized with
the provided OPN score and incrementally increased
following these events: OPN-shot overlap, provided talking-face
score &gt; 0.8, and OPN pronounced around the shot (5s).</p>
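      <p>This scoring scheme can be sketched as follows (illustrative Python; the paper does not specify the size of the increment, so the 0.05 bonus and the cap at 1.0 are assumptions):</p>

```python
def talking_face_score(opn_score, opn_overlaps_shot,
                       provided_tf_score, opn_spoken_nearby,
                       bonus=0.05):
    """Initialize the confidence with the OPN score, then bump it once
    per supporting event; the 0.05 increment and the cap at 1.0 are
    illustrative choices, not values from the paper."""
    score = opn_score
    if opn_overlaps_shot:          # the OPN temporally overlaps the shot
        score += bonus
    if provided_tf_score > 0.8:    # provided talking-face score above 0.8
        score += bonus
    if opn_spoken_nearby:          # name pronounced within 5 s of the shot
        score += bonus
    return min(score, 1.0)
```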
    </sec>
    <sec id="sec-13">
      <title>3.2 The secondary strategy</title>
      <p>The secondary strategy is based on speaker identification
followed by a rule-based speaker-face mapping. This mapping
relies on simple rules based on prior knowledge about
broadcast news. Precisely, we considered a speaker visible when
the name appears on the screen (OPN), on studio shots, and
on report shots when the role is not reporter. In this
strategy, no score function was developed (score = 1).</p>
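      <p>These visibility rules can be sketched as:</p>

```python
def speaker_visible(has_opn, shot_type, speaker_role):
    """Secondary-strategy visibility rules: a speaker is considered
    visible when their name is overlaid (OPN), on studio shots, and
    on report shots unless their role is reporter."""
    if has_opn:
        return True
    if shot_type == "studio":
        return True
    if shot_type == "report" and speaker_role != "reporter":
        return True
    return False
```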
    </sec>
    <sec id="sec-14">
      <title>3.3 The evidence</title>
      <p>To ensure that identities were detected in a fully
unsupervised way, and to help the collaborative annotation of the
test set, participants were asked to select one shot per name
proving his/her identity. For each name, we selected the
provided OPN shot that maximizes the OCR result score.
3http://www.csie.ntu.edu.tw/~cjlin/liblinear/
</p>
    </sec>
    <sec id="sec-15">
      <title>4. RESULTS</title>
      <p>
        Systems were evaluated using the Mean Average
Precision (MAP) metric and the official C and EwMAP metrics
described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Two submission deadlines were fixed: July
1st and 8th. The only difference between our submissions concerns
the shot boundary mapping. Indeed, for July 1st the
mapping was based on shots overlapping by more than 0.5s (a rather crude
strategy), while it was based on an overlap coverage above 50%
for the July 8th submissions. Four runs were submitted:
Primary: primary strategy with DNN- and
HOG-based shot clustering.
      </p>
      <p>Primary DNNOnly: primary strategy with
DNN-based shot clustering.</p>
      <p>Primary RGBOnly: primary strategy with
RGB-based shot clustering.</p>
      <p>Secondary: secondary strategy based on speaker
identification and rule-based speaker-face mapping.</p>
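      <p>The two shot-boundary mapping criteria that differ between the deadlines can be sketched as follows (illustrative Python; function and parameter names are ours):</p>

```python
def map_boundary(detected, provided_shots, min_seconds=0.5, min_coverage=None):
    """Map a detected shot (start, end) onto the provided shot list:
    keep a provided shot when the raw overlap exceeds `min_seconds`
    (the July 1st criterion) or, when `min_coverage` is given, when the
    overlap covers at least that fraction of the provided shot (July 8th)."""
    matches = []
    for idx, (s, e) in enumerate(provided_shots):
        ov = max(0.0, min(detected[1], e) - max(detected[0], s))
        if min_coverage is not None:
            if ov / (e - s) >= min_coverage:
                matches.append(idx)
        elif ov > min_seconds:
            matches.append(idx)
    return matches
```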
      <p>
        Table 1 shows the results of the PERCOLATTE runs. The
secondary strategy, which follows principles similar to the
baseline [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], shows a MAP improvement of 8%. Indeed,
chapter-restricted propagation, combined with simple rule-based
speaker-face mapping based on shot classification and speaker roles,
detects fewer talking faces but with higher precision.
The primary strategy using DNN- and HOG-based shot
clustering obtains the best MAP of 88.45%. This shows the
consistency of the chapter-constrained propagation strategy in
broadcast news. Contrastive runs with different features for
shot clustering did not show significant differences. Anchor
names were detected in 93% of the shows. However, the
primary run without the anchor-specific modules still achieves a
MAP of 88.31%.
      </p>
      <p>Table 1: Results of the baseline and of the Secondary,
Primary DNNOnly, Primary RGBOnly and Primary runs, for the
July 1st and July 8th submission deadlines.</p>
      <p>Acknowledgment. This work has been carried out thanks to the
support of the A*MIDEX project (no. ANR-11-IDEX-0001-02) funded
by the "Investissements d'Avenir" French Government program,
managed by the French National Research Agency (ANR).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          <article-title>. Multi-stage speaker diarization of broadcast news</article-title>
          .
          <source>IEEE Transactions on Audio, Speech and Language Processing</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bechet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bigot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredouille</surname>
          </string-name>
          , G. Linares, G. Senay,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tirilly</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          .
          <article-title>Multimodal understanding for person recognition in video broadcasts</article-title>
          .
          <source>In Interspeech, Singapore</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          <article-title>. Multi-view approach for speaker turn role labeling in TV broadcast news shows</article-title>
          .
          <source>INTERSPEECH</source>
          , pages 1285-1288. ISCA,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dinarelli</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosset</surname>
          </string-name>
          .
          <article-title>Models cascade for tree-structured named entity detection</article-title>
          .
          <source>In Proceedings of 5th International Joint Conference on Natural Language Processing</source>
          , pages 1269-1278, Chiang Mai, Thailand,
          <year>November 2011</year>
          . Asian Federation of Natural Language Processing
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          . In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>25</volume>
          , pages 1097-1105. Curran Associates, Inc.,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , G. Quenot, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thollard</surname>
          </string-name>
          .
          <article-title>From text detection in videos to person identification</article-title>
          . In
          <source>IEEE International Conference on Multimedia and Expo (ICME)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Multimodal person discovery in broadcast TV at MediaEval 2015</article-title>
          . In MediaEval,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-B.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast</article-title>
          . In Interspeech 2012 - Conference of the International Speech Communication Association, Portland, OR, United States,
          <year>2012</year>
          . Poster Session: Speaker Recognition III
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          .
          <article-title>A global optimization framework for speaker diarization</article-title>
          .
          <source>In Speaker Odyssey</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>A shot boundary detection method based on color feature</article-title>
          .
          <source>International Conference on Computer Science and Network Technology (ICCSNT)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>