<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIG at MediaEval 2015 Multimodal Person Discovery in Broadcast TV Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mateusz Budnik, Bahjat Safadi, Laurent Besacier, Georges Quénot</string-name>
          <email>firstname.lastname@imag.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ali Khodabakhsh, Cenk Demiroglu</string-name>
          <email>ali.khodabakhsh@ozu.edu.tr</email>
          <email>cenk.demiroglu@ozyegin.edu.tr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Univ. Grenoble Alpes, LIG / CNRS, LIG</institution>
          ,
          <addr-line>F-38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Electrical and Computer Engineering Department, Ozyegin University</institution>
          ,
          <addr-line>Istanbul</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In these working notes, the contribution of the LIG team (a partnership between Univ. Grenoble Alpes and Ozyegin University) to the Multimodal Person Discovery in Broadcast TV task at MediaEval 2015 is presented. The task focused on unsupervised learning techniques. Two different approaches were submitted by the team. In the first one, new features for the face and speech modalities were tested. In the second one, an alternative way to calculate the distance between face tracks and speech segments is presented. The second approach achieved a competitive MAP score and was able to beat the baseline.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        These working notes present the submissions proposed by
the LIG team (a partnership between Univ. Grenoble Alpes and
Ozyegin University) to the MediaEval 2015 Multimodal
Person Discovery in Broadcast TV task. Along with the
algorithms and initial results, a more general discussion about
the task is provided as well. A detailed description of the
task, the dataset, the evaluation metric and the baseline
system can be found in the paper provided by the organizers
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. All the approaches presented here are unsupervised
(following the organizers' guidelines) and were submitted to the
main task.
      </p>
      <p>
        The main goal of the task is to identify people
appearing in various TV shows, mostly news or political debates.
The task is limited to persons that speak and are visible
at the same time (potential people of interest).
Additionally, the task is confined to the multimodal data (including
face, speech, overlaid text) found in the test set videos and
is strictly unsupervised (no manual annotation available).
The main source of names is given by the optical character
recognition system used in the baseline [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Thanks to the provided baseline system [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], it was possible
to concentrate on some aspects of the task, like a
particular modality or the clustering method. Initially, our focus
was on creating better face and speech descriptors. In the
second approach however, only the distances between face
tracks and speech segments were modified. The output of
the baseline OCR system was used as is, while the output
from the speech transcription system was not used at all.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>Our initial approach focused on creating new features for
both face and speech. The second approach is based more
on the baseline system, i.e. no new descriptors were
generated and the key element was the distance between speech
segments and face tracks.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 What did not work: new features</title>
      <p>
        The first approach explored the use of alternative
features for different modalities. For speech, a Total Variability
Space (TVS) system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was designed using the following
settings, with the segmentation provided by the baseline system.
Models were learned on the test data without any manual
annotation available.
      </p>
      <list list-type="bullet">
        <list-item>
          <p>19 MFCC and energy + Δs (no static energy) + feature warping</p>
        </list-item>
        <list-item>
          <p>20 ms window length with a 10 ms shift</p>
        </list-item>
        <list-item>
          <p>Energy-based silence filtering</p>
        </list-item>
        <list-item>
          <p>1024 GMMs + 400-dimensional TVS</p>
        </list-item>
        <list-item>
          <p>Cosine similarities between segments within each video are calculated (see the sketch below)</p>
        </list-item>
      </list>
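      <p>As an illustration of the last step, a minimal sketch of the within-video cosine similarity computation is given below. It assumes the 400-dimensional TVS i-vectors have already been extracted, one per speech segment; the function name and array layout are illustrative choices, not part of the actual system.</p>
      <preformat><![CDATA[
import numpy as np

def cosine_similarity_matrix(ivectors):
    """Pairwise cosine similarities between the speech segments of one video.

    ivectors: array of shape (n_segments, 400), one i-vector per segment.
    Returns an (n_segments, n_segments) similarity matrix.
    """
    # L2-normalise each i-vector, guarding against zero vectors.
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    normalised = ivectors / np.maximum(norms, 1e-12)
    # Cosine similarity is the dot product of the normalised vectors.
    return normalised @ normalised.T
]]></preformat>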
      <p>
        For faces, features extracted from a deep convolutional
neural network [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] were used. This was done in the following
way using the test set only:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>Face extraction with the approach provided by the organizers; all faces were scaled to a resolution of 100 × 100 pixels.</p>
        </list-item>
        <list-item>
          <p>Labels were generated by the OCR and then assigned to co-occurring faces, based on the temporal overlap between the face track and the label (a sketch of this assignment is given below). The resulting list served as a training set; the number of classes equaled the number of unique names.</p>
        </list-item>
      </list>
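      <p>A minimal sketch of this overlap-based label assignment is shown below; the interval representation and the minimum-overlap threshold are assumptions made for illustration only.</p>
      <preformat><![CDATA[
def temporal_overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_ocr_labels(face_tracks, ocr_names, min_overlap=0.5):
    """Assign OCR-derived names to co-occurring face tracks.

    face_tracks: list of (track_id, start, end)
    ocr_names:   list of (name, start, end) produced by the overlaid-text OCR
    Returns a list of (track_id, name) pairs used as a noisy training set.
    """
    pairs = []
    for track_id, f_start, f_end in face_tracks:
        for name, n_start, n_end in ocr_names:
            if temporal_overlap(f_start, f_end, n_start, n_end) >= min_overlap:
                pairs.append((track_id, name))
    return pairs
]]></preformat>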
      <p>
        The general structure of the net is based on the
smallest architecture presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], but with just 5
convolutional layers and the number of filters at each layer
reduced by half. The fully connected layers had 1024
outputs. It was trained for around 15 epochs.
      </p>
      <p>After the training, the last layer containing the classes
was discarded and the last fully connected hidden layer
(1024 outputs) was then used for feature extraction.</p>
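      <p>The exact layer configuration is not reproduced in these notes; the following PyTorch sketch only illustrates the general idea of a reduced VGG-style network (5 convolutional layers with halved filter counts, a 1024-output fully connected layer) whose last hidden layer is reused as a feature extractor once the classification layer is discarded. Kernel sizes, pooling positions and the exact filter counts are assumptions.</p>
      <preformat><![CDATA[
import torch.nn as nn

class SmallFaceNet(nn.Module):
    """Reduced VGG-style face network, reused as a feature extractor."""

    def __init__(self, num_classes):
        super().__init__()
        # 5 convolutional layers, filter counts halved w.r.t. the smallest net in [6].
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        # Fully connected hidden layer with 1024 outputs.
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(256, 1024), nn.ReLU())
        # Classification layer over the unique OCR names; discarded after training.
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, x):
        return self.classifier(self.fc(self.features(x)))

    def extract_features(self, x):
        # 1024-dimensional face descriptor from the last hidden layer.
        return self.fc(self.features(x))
]]></preformat>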
      <p>Two individual sets of clusters were generated, one for each
modality. Afterwards, both were mapped to the shots. If
there was an overlap with the same label, the person was
named. Additional submissions involving this approach were
made, which included adding descriptors provided by the
baseline (e.g. HOG for face and BIC for speech). However,
they did not manage to give better performance than the
baseline.</p>
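      <p>A simplified sketch of this shot-naming step is given below; the segment representation and the way ties are broken are illustrative assumptions.</p>
      <preformat><![CDATA[
def name_shots(face_segments, speech_segments, shots):
    """Name a shot when face and speech clusters carrying the same label overlap it.

    face_segments / speech_segments: lists of (label, start, end) for named clusters
    shots: list of (shot_id, start, end)
    Returns a dict shot_id -> name for the shots that could be named.
    """
    named = {}
    for shot_id, s_start, s_end in shots:
        face_labels = {lbl for lbl, a, b in face_segments
                       if min(b, s_end) > max(a, s_start)}
        speech_labels = {lbl for lbl, a, b in speech_segments
                         if min(b, s_end) > max(a, s_start)}
        common = face_labels & speech_labels
        if common:
            named[shot_id] = sorted(common)[0]  # keep a single name per shot
    return named
]]></preformat>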
    </sec>
    <sec id="sec-4">
      <title>2.2 What did work: modified distance between modalities</title>
      <p>In the baseline provided, the written names are first
propagated to speaker clusters, and then the named speakers are
assigned to co-occurring faces. Due to the nature of the test
set, an alternative was used in which the written names are first
propagated to face clusters. These face-name pairs are
subsequently assigned to co-occurring speech segments. This
approach yielded a more precise but smaller set of named
people compared to the baseline. In order to expand it, a
fusion with the output of the baseline system was made,
where every conflict (e.g. different names for the same shot)
is resolved in favor of our proposed approach.</p>
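      <p>A simplified sketch of the fusion step is given below, assuming per-shot name hypotheses from both systems are available as dictionaries; this data layout is an illustrative assumption rather than the actual implementation.</p>
      <preformat><![CDATA[
def fuse_hypotheses(baseline_names, face_first_names):
    """Merge per-shot name hypotheses from two systems.

    baseline_names:   dict shot_id -> name (names propagated via speaker clusters)
    face_first_names: dict shot_id -> name (names propagated via face clusters)
    Conflicts are resolved in favour of the face-first hypotheses.
    """
    fused = dict(baseline_names)
    fused.update(face_first_names)  # face-first names override on conflict
    return fused
]]></preformat>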
      <p>Additionally, another way to calculate the distance
between speech and face track was developed. In the baseline
the distance between a face track and a speech segment is
calculated using lip movement detection, size and the
position of the face, and so on. Our complementary approach is
based on the temporal correlation of tracks from different
modalities.</p>
      <p>First, overlapping face tracks and speech segments are
extracted for each video. Similarity vectors for both modalities
are extracted with respect to all the other segments within
the same video. The correlation of the similarity vectors is
calculated in order to determine which face and voice go
together. In other words, a face-speech pair which appears
frequently throughout the video is more likely to belong to
the same person. Finally, the output of this approach is
fused with the output of the system described in the first
paragraph of this subsection (face-name pairs assigned to
co-occurring speech segments) to produce a single name for
each shot.</p>
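      <p>This correlation-based pairing can be sketched as follows, assuming per-segment similarity profiles are already available for both modalities; the matrix layout and the use of Pearson correlation are illustrative assumptions.</p>
      <preformat><![CDATA[
import numpy as np

def correlate_faces_and_voices(face_sims, speech_sims):
    """Score face/voice pairs by correlating their similarity profiles.

    face_sims:   (n_faces,  n_segments) similarity of each face track to the
                 face content of every segment in the video
    speech_sims: (n_voices, n_segments) similarity of each voice to the
                 speech content of every segment in the video
    Returns an (n_faces, n_voices) matrix of Pearson correlations: a face and
    a voice that keep re-appearing together across the video score highly.
    """
    def zscore(m):
        centred = m - m.mean(axis=1, keepdims=True)
        return centred / np.maximum(centred.std(axis=1, keepdims=True), 1e-12)

    return zscore(face_sims) @ zscore(speech_sims).T / face_sims.shape[1]
]]></preformat>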
    </sec>
    <sec id="sec-5">
      <title>3. INITIAL RESULTS AND DISCUSSION</title>
      <p>The first system (submitted for the first deadline)
performed rather poorly, with 30.48 % EwMAP (MAP = 30.63
%). Our second approach, submitted as the main
system by the second deadline, reached EwMAP = 85.67 % (and
MAP = 86.03 %) and was far more successful, being able to
beat the baseline system (EwMAP = 78.35 % and MAP =
78.64 %). The scores presented here were provided by the
organizers and can change slightly before the workshop, due
to more annotation being available.</p>
      <p>During the preparation for this evaluation there were a
number of issues and observations connected to both our
approach and to the data. First of all, trying to build
biometric models for individual people does not work well for
this particular task (at least based on what was tested in
the context of this evaluation, e.g. SVMs). In order to
comply with the task requirements, the labels can only be
generated from the OCR and then be assigned to one of the
modalities. However, both steps are unsupervised,
generating noisy annotation in the process. Additionally, the video
test set consists of one type of program (TV news) where,
apart from the news anchor, most people appear only once
and this may not be enough to create an accurate
biometric model. This stands in contrast to the development set,
which contains debates and parliament sessions where some
persons re-appeared much more frequently.</p>
      <p>A more general issue is also the class imbalance. While
some people, especially the anchors, appear frequently across
different videos, most of the others are shown once or twice
and are confined to a single video. This makes the use of
unsupervised techniques, like clustering, challenging, due to
widely varying cluster sizes: small clusters can get attached
to bigger ones, which is heavily penalized under the MAP
metric. This can, at least partially, explain the poor
performance of the first approach. Even though the features
used in this method are state-of-the-art, they would require
more high quality data (including annotation) and
parameter adjustment to create good enough distinctions between
thousands of individual persons appearing in the videos.</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSIONS</title>
      <p>During this evaluation, different algorithms were tested in
order to identify, in an unsupervised manner, people who speak and
are visible in TV broadcasts. One approach concentrated
on trying to provide state-of-the-art features for different
modalities, while the other provided an alternative
estimation of the distance between already provided modalities of
face and speech.</p>
      <p>The first approach, even with its limited performance on
this particular shared task, seems to have greater potential
and our future work may try to address some of its
shortcomings. This includes a focus on a more robust deep
learning approach that could deal with noisy or automatically
generated training sets.</p>
    </sec>
    <sec id="sec-8">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This work was conducted as part of the CHIST-ERA
CAMOMILE project, which was funded by the ANR (Agence
Nationale de la Recherche, France).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kenny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dumouchel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ouellet</surname>
          </string-name>
          .
          <article-title>Front-end factor analysis for speaker verification</article-title>
          .
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          ,
          <volume>19</volume>
          (
          <issue>4</issue>
          ):
          <fpage>788</fpage>
          –
          <lpage>798</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>1097</fpage>
          –
          <lpage>1105</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thollard</surname>
          </string-name>
          .
          <article-title>From text detection in videos to person identification</article-title>
          .
          <source>ICME</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Multimodal person discovery in broadcast TV at MediaEval 2015</article-title>
          .
          <source>MediaEval 2015 Workshop</source>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised speaker identification using overlaid texts in TV broadcast</article-title>
          .
          <source>INTERSPEECH</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>