<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UPC System for the 2016 MediaEval Multimodal Person Discovery in Broadcast TV task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miquel India</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerard Martí</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carla Cortillas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgos Bouritsas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisa Sayrol</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josep Ramon Morros</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>19</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>The UPC system works by extracting monomodal signal segments (face tracks, speech segments) that overlap with the person names overlaid in the video signal. These segments are assigned directly with the name of the person and used as a reference to compare against the non-overlapping (unassigned) signal segments. This process is performed independently both on the speech and video signals. A simple fusion scheme is used to combine both monomodal annotations into a single one.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        This paper describes the UPC system for the 2016
Multimodal Person Discovery in Broadcast TV task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in the
2016 MediaEval evaluations. The system detects face tracks
(FT), detects speech segments (SS) and also detects the
person names overlaid in the video signal. Both the video and
the speech signals are processed independently. For each
modality, we aim to construct a classifier that can determine
whether a given FT or SS belongs to one of the persons
appearing in the scene with an assigned overlaid name. As
the system is unsupervised, we use the detected person
names to identify the persons appearing in the program.
Thus, we assume that the FTs or SSs that overlap with a
detected person name are true representations of this person.
      </p>
      <p>The signal intervals that overlap with an overlaid person
name are extracted and used for unsupervised enrollment,
defining a model for each detected name. This way, a set of
classes corresponding to the different persons in the detected
names is defined. These intervals are directly labeled by
assigning the identity corresponding to the overlaid name.</p>
      <p>For each modality, a joint identification/verification
algorithm is used to assign each unlabeled signal interval (FT or
SS not overlapping with an overlaid name) to one of the
previous classes. For each unlabeled interval, the signal is
compared against all models and the one with the best likelihood
is selected. An additional 'Unknown' class is implicitly
considered, corresponding to the cases where the face track or
speech segment corresponds to a person that is never named
(i.e. none of the appearances of this person in the video
overlap with a detected name).</p>
      <p>This work has been developed in the framework of the
projects TEC2013-43935-R, TEC2012-38939-C03-02 and
PCIN-2013-067. It has been financed by the Spanish
Ministerio de Economía y Competitividad and the European
Regional Development Fund (ERDF).</p>
      <p>At the end of this process we have two different sets of
annotations, one for speech and one for video. The two
results are fused, as described in Section 5, to obtain the
final annotation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TEXT DETECTION</title>
      <p>
        We have used the two baseline detections with some
additional post-processing. The first one (TB1) was generated by
our team and distributed to all participants. The LOOV [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
text detection tool was used to detect and track text (i.e. define the
temporal intervals where a given text appears).
Detections were filtered by comparing against lists of first names
and last names downloaded from the internet. We also used
lists of neutral particles ('van', 'von', 'del', etc.) and
negative names ('boulevard', etc.). All names were normalized to
contain only alphabetic ASCII characters, without accents
or special characters, and in lower case. For a given
detected text to be considered a name, it had to contain at
least one first name and one last name. The percentage of
positive matches for these two classes was used as a score.
Matches from the neutral class did not penalize the
percentage. Additionally, if the first word in the detected text was
included in the negative list, the text was discarded. To
construct TB1 we had access to the test videos before
the rest of the participants. However, we only used this data for
this purpose and we did not test the rest of
our system before the official release.
      </p>
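      <p>The following sketch illustrates the filtering rules described above. It assumes the name lists are plain-text files; all file names, thresholds and helper names are illustrative rather than the exact UPC implementation.</p>
      <preformat>
# Illustrative sketch (not the released code) of the overlaid-text filtering.
import re
import unicodedata

def normalize(word):
    # Lower-case, strip accents, keep only ASCII letters.
    word = unicodedata.normalize("NFKD", word).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z]", "", word.lower())

def load_list(path):
    with open(path, encoding="utf-8") as f:
        return {normalize(w) for w in f if w.strip()}

first_names = load_list("first_names.txt")     # assumed file of first names
last_names = load_list("last_names.txt")       # assumed file of last names
neutral = {"van", "von", "del", "de", "la"}    # neutral particles
negative = load_list("negative_names.txt")     # e.g. 'boulevard'

def score_detected_text(text):
    """Return (is_name, score) for a detected overlaid text."""
    words = [normalize(w) for w in text.split()]
    words = [w for w in words if w]
    if not words or words[0] in negative:
        return False, 0.0                      # discard negative-list texts
    hits_first = sum(w in first_names for w in words)
    hits_last = sum(w in last_names for w in words)
    if hits_first == 0 or hits_last == 0:
        return False, 0.0                      # need a first and a last name
    scored = [w for w in words if w not in neutral]  # neutral words do not penalize
    score = (hits_first + hits_last) / float(max(len(scored), 1))
    return True, min(score, 1.0)
      </preformat>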
      <p>
The second set of annotations, TB2, was provided by the
organizers [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. These annotations contained a large number of
false positives. We applied the filtering described above to
TB2 and combined the result with TB1, as the two detectors
turned out to be partly complementary.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. VIDEO SYSTEM</title>
      <p>
        For face tracking, the 2015 baseline code [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] was used.
Filtering was applied to remove tracks shorter than a fixed
duration or with a face size that was too small.
      </p>
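      <p>A minimal sketch of this track filtering, assuming a simple track object with timestamps and face bounding boxes; the threshold values shown are placeholders, not the ones used in the actual system.</p>
      <preformat>
# Illustrative track filtering; thresholds and track attributes are assumptions.
MIN_DURATION_S = 1.0      # assumed minimum track duration in seconds
MIN_FACE_SIZE_PX = 40     # assumed minimum face bounding-box side in pixels

def keep_track(track):
    duration = track.end_time - track.start_time
    smallest_side = min(min(f.width, f.height) for f in track.faces)
    return duration &gt;= MIN_DURATION_S and smallest_side &gt;= MIN_FACE_SIZE_PX

face_tracks = [t for t in face_tracks if keep_track(t)]
      </preformat>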
      <p>
        The VGG-face [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] Convolutional Neural Network (CNN)
was used for feature extraction. We extracted the features
from the activation of the last fully connected layer. The
network was trained using a triplet network architecture [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
The features from the detected faces in each track are
extracted using this network, and then averaged to obtain a
feature vector for each track, of size 1024.
      </p>
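      <p>A short sketch of the per-track descriptor computation: the per-face CNN embeddings of a track are averaged into a single 1024-dimensional vector. The function extract_embedding() stands in for a forward pass through the fine-tuned VGG-face network and is an assumption of this sketch.</p>
      <preformat>
# Illustrative per-track feature averaging (extract_embedding is hypothetical).
import numpy as np

def track_descriptor(track, extract_embedding):
    # Stack the per-face embeddings and average them over the track.
    feats = np.stack([extract_embedding(face_crop) for face_crop in track.faces])
    return feats.mean(axis=0)    # shape: (1024,)
      </preformat>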
      <p>A face verification algorithm was used to compare and
classify the tracks. First, the tracks that overlapped
with a detected name were named by assigning that identity.
To reduce wrong assignments, the name was only assigned
if it overlapped with a single track. Then, using the set of
named tracks from the full video corpus, a Gaussian Naive
Bayes (GNB) binary classifier was trained on the
Euclidean distances between pairs of samples from the named
tracks. Then, for each specific video, each unnamed track
was compared with all the named tracks of the video,
computing the Euclidean distance between the respective feature
vectors of the tracks (see Figure 1). This value was classified
by the GNB as either an intra-class distance (both
tracks belong to the same identity) or an inter-class distance
(the tracks are not from the same person). The probability
of the distance being intra-class was used as the confidence
score. The unnamed track was assigned the identity of the
most similar named track. A threshold on the confidence
score (0.75) was used to discard tracks not corresponding to
any named track.</p>
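      <p>The verification step can be sketched as below: a Gaussian Naive Bayes classifier is trained on Euclidean distances between pairs of named tracks (labelled intra-class or inter-class) and then used to score unnamed tracks against the named tracks of each video. The pair construction, data structures and use of scikit-learn are assumptions of this sketch; only the distance-based GNB idea and the 0.75 threshold come from the text.</p>
      <preformat>
# Illustrative sketch of the GNB-based face verification described above.
import numpy as np
from itertools import combinations
from sklearn.naive_bayes import GaussianNB

def train_distance_gnb(named_tracks):
    """named_tracks: list of (identity, 1024-d feature vector) pairs."""
    dists, labels = [], []
    for (id_a, fa), (id_b, fb) in combinations(named_tracks, 2):
        dists.append([np.linalg.norm(fa - fb)])
        labels.append(1 if id_a == id_b else 0)    # 1 = intra-class pair
    gnb = GaussianNB()
    gnb.fit(np.array(dists), np.array(labels))
    return gnb

def assign_track(feat, named_tracks_in_video, gnb, threshold=0.75):
    """Assign an unnamed track to the most similar named track, if confident."""
    best_name, best_conf = None, 0.0
    for name, ref_feat in named_tracks_in_video:
        d = np.linalg.norm(feat - ref_feat)
        p_intra = gnb.predict_proba([[d]])[0, 1]   # P(intra-class | distance)
        if p_intra &gt; best_conf:
            best_name, best_conf = name, p_intra
    if best_conf &lt; threshold:
        return None, best_conf                     # treated as 'Unknown'
    return best_name, best_conf
      </preformat>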
    </sec>
    <sec id="sec-4">
      <title>4. SPEAKER TRACKING</title>
      <p>Speaker information was extracted using an i-vector based
speaker tracking system. Assuming that overlaid text names
temporally overlap with their speaker and face
identities, speaker models were created using the data of those
text tracks. Speaker tracking was performed by evaluating the
cosine distance between model i-vectors and i-vectors
extracted for each frame of the signal.</p>
      <p>
        Speaker modelling was implemented using i-vectors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
An i-vector is a low-rank vector, typically of dimension between 400 and
600, representing a speech utterance. Feature vectors of
the speech signal are modeled by a Gaussian
Mixture Model (GMM) adapted from a Universal Background Model
(UBM). The mean vectors of the adapted GMM are stacked
to build the supervector M, which can be written as:
M = μ + Tω    (1)
where μ is the speaker- and session-independent mean
supervector from the UBM, T is the total variability matrix, and
ω is a hidden variable. The mean of the posterior
distribution of ω is referred to as the i-vector. This posterior
distribution is conditioned on the Baum-Welch statistics of the
given speech utterance. The T matrix is trained using the
Expectation-Maximization (EM) algorithm given the
centralized Baum-Welch statistics from background speech
utterances. More details can be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
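      <p>For completeness, the i-vector is the posterior mean of ω given the zeroth- and first-order Baum-Welch statistics of the utterance. A standard closed form, following the notation of [3] (reproduced here for clarity, not taken from the UPC implementation), is:</p>
      <preformat>
% Posterior mean of the latent variable \omega (the i-vector), following [3].
% N(u): diagonal matrix built from zeroth-order Baum-Welch statistics of
% utterance u; \tilde{F}(u): centralized first-order statistics;
% \Sigma: UBM covariance matrix; T: total variability matrix.
\omega(u) = \bigl(I + T^{\top}\Sigma^{-1} N(u)\, T\bigr)^{-1}
            T^{\top}\Sigma^{-1}\tilde{F}(u)
      </preformat>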
      <p>[Table 1: results for the systems Baseline 1, Baseline 2, Baseline 3, Spk Tracking, Face Tracking, Intersection and Union; the numerical values are not recoverable in this version.]</p>
      <p>
        The speaker tracking system has been implemented as a
speaker identification system with a segmentation-by-classification
method. For the feature extraction, 20 Mel Frequency
Cepstral Coefficients (MFCC) plus their Δ and ΔΔ coefficients
were extracted. Using the Alize toolkit [
        <xref ref-type="bibr" rid="ref1 ref4">4, 1</xref>
], a total
variability matrix was trained per show. I-vectors were
extracted from 3-second segments with a 0.5-second
shift, and the baseline speaker diarization was used to select
speaker turn segments from which to extract the i-vector queries. The
identification was performed by evaluating the cosine distance
of the i-vectors against each query i-vector. The query with the
lowest distance was assigned to the segment. A global
distance threshold was previously trained on the development
database so as to discard assignments with high distances.
      </p>
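      <p>A minimal sketch of this tracking-by-classification step: each sliding-window i-vector is compared with the enrolled query i-vectors using the cosine distance, and the closest query is assigned unless its distance exceeds the global threshold. The i-vector extraction itself (Alize) is not shown, and the threshold value below is a placeholder, not the one trained on the development set.</p>
      <preformat>
# Illustrative cosine-distance scoring of segment i-vectors against queries.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def label_segment(segment_ivec, queries, dist_threshold=0.6):
    """queries: dict mapping a person name to its enrolled query i-vector."""
    name, dist = min(
        ((n, cosine_distance(segment_ivec, q)) for n, q in queries.items()),
        key=lambda item: item[1],
    )
    if dist &lt; dist_threshold:
        return name, dist
    return None, dist            # too far from every query: left unlabeled
      </preformat>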
    </sec>
    <sec id="sec-5">
      <title>5. FUSION SYSTEM AND RESULTS</title>
      <p>Starting from the speaker and face tracking shot
labeling, two fusion methods were implemented. The first
method was the intersection of the shots of both tracking
systems, averaging the confidence of the intersected shots.
In the second method, the union strategy was implemented,
relying on the intersected shots of both modalities and
reducing the confidence of those not intersected. The shots of
both the video and speaker systems were merged, averaging the
confidence score if both systems detected the same identity in
a shot, or reducing the confidence by a factor of 0.5 if only one
of the systems detected a query.</p>
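      <p>The two fusion strategies can be sketched as follows, assuming each modality provides, per shot, a mapping from identity to confidence score; the data structures are assumptions, while the averaging and the 0.5 reduction factor follow the text.</p>
      <preformat>
# Illustrative sketch of the intersection and union fusion strategies.
def fuse_intersection(face_shots, spk_shots):
    """Keep only identities detected by both modalities; average confidences."""
    fused = {}
    for shot in set(face_shots) &amp; set(spk_shots):
        for name in set(face_shots[shot]) &amp; set(spk_shots[shot]):
            fused.setdefault(shot, {})[name] = 0.5 * (
                face_shots[shot][name] + spk_shots[shot][name])
    return fused

def fuse_union(face_shots, spk_shots, penalty=0.5):
    """Keep all identities; penalize those detected by only one modality."""
    fused = {}
    for shot in set(face_shots) | set(spk_shots):
        face = face_shots.get(shot, {})
        spk = spk_shots.get(shot, {})
        for name in set(face) | set(spk):
            if name in face and name in spk:
                conf = 0.5 * (face[name] + spk[name])
            else:
                conf = penalty * (face.get(name, 0.0) + spk.get(name, 0.0))
            fused.setdefault(shot, {})[name] = conf
    return fused
      </preformat>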
      <p>Four different experiments were performed, which are shown
in Table 1. Baseline 1 refers to the fusion between the
baseline speaker diarization and the OCR, Baseline 2 refers to the
fusion between the face detection and the OCR, and Baseline
3 is the intersection of both previous baselines. Initially,
speaker and face tracking have been evaluated separately.
The intersection and the union of both tracking systems were
implemented as fusion strategies.</p>
      <p>As shown in Table 1, both monomodal systems improve
the baseline performances by a large margin. The union
strategy has shown better performance than the
intersection strategy, but this fusion does not show a significant
performance increase over the individual modalities.</p>
      <p>By analysing the results, we believe that failures in text
detection were the main factor impacting the final scores.</p>
    </sec>
    <sec id="sec-6">
      <title>6. CONCLUSIONS</title>
      <p>Speaker and face tracking have been combined using
different fusion strategies. This year, our idea was to focus
only on the overlaid names to develop tracking systems,
instead of merging diarization systems with text
detection. The tracking systems have shown better
performance than the diarization-based baseline systems. For
fusion, the union strategy has shown better results than the
intersection method.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Bonastre</surname>
          </string-name>
          , N. Scheffer, D. Matrouf,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredouille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Preti</surname>
          </string-name>
          , G. Pouchoulin,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fauve</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mason</surname>
          </string-name>
          <article-title>ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition</article-title>
          .
          <source>In Proc. Odyssey: the Speaker and Language Recognition Workshop</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Guinaudeau</surname>
          </string-name>
          .
          <article-title>Multimodal person discovery in broadcast TV at MediaEval 2016</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2016 Workshop</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kenny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dumouchel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ouellet</surname>
          </string-name>
          .
          <article-title>Front-end factor analysis for speaker verification</article-title>
          .
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          , May
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Bonastre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fauve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lévy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S. D.</given-names>
            <surname>Mason</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Parfait</surname>
          </string-name>
          .
          <article-title>ALIZE 3.0 - Open Source Toolkit for State-of-the-Art Speaker Recognition</article-title>
          .
          <source>In Annual Conference of the International Speech Communication Association (Interspeech)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Parkhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Deep face recognition</article-title>
          .
          <source>In Proceedings of the British Machine Vision Conference (BMVC)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , G. Quenot, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thollard</surname>
          </string-name>
          .
          <article-title>From text detection in videos to person identification</article-title>
          .
          <source>In ICME 2012</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Multimodal person discovery in broadcast TV at MediaEval 2015</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>CoRR, abs/1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>