<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GTM-UVigo Systems for Person Discovery Task at MediaEval 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paula Lopez-Otero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rosalía Barros</string-name>
          <email>rbarros@gts.uvigo.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Docio-Fernandez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisardo González-Agulla</string-name>
          <email>eli@gts.uvigo.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Luis Alba-Castro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmen Garcia-Mateo</string-name>
          <email>carmen@gts.uvigo.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AtlantTIC Research Center</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present the systems developed by the GTM-UVigo team for the Multimedia Person Discovery in Broadcast TV task at MediaEval 2015. The systems implement two different strategies for person discovery in audio through speaker diarization (one based on an online clustering strategy with error correction using OCR information, the other based on agglomerative hierarchical clustering), as well as intrashot and intershot strategies for face clustering.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The Person Discovery in Broadcast TV task at
MediaEval 2015 aims at finding the names of people who are
both seen and heard in every shot of a collection
of videos [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This paper describes the audio, video and
multimodal approaches developed by the GTM-UVigo team to
address this task. The code of the GTM-UVigo systems will be
released at https://github.com/gtm-uvigo/Mediaeval_PersonDiscovery.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. AUDIO-BASED PERSON DISCOVERY</title>
      <p>The audio approaches can be divided into three stages: speech
activity detection, division of speech regions into speaker turns
and, lastly, speaker clustering.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Speech Activity Detection</title>
      <p>A Deep Neural Network (DNN) based speech activity
detector (SAD) was used. The acoustic features were 26
log-mel-filterbank outputs, and a window of 31 frames was
used to predict the label of the central frame. The DNN
has the following architecture: an 806-unit input layer
(26 features over 31 frames), 4 hidden layers, each containing
32 tanh activation units, and an output layer consisting of
two softmax units. The output layer generates a posterior
probability for the presence or absence of speech, and the
ratio of both output posteriors is used as a confidence measure
about speech activity over time. This confidence is median
filtered to produce a smoothed estimate of speech presence
and, finally, a frame is classified as speech if this smoothed
value is greater than a threshold.</p>
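      <p>For illustration, the following minimal sketch (in Python) shows how such a decision could be computed from the DNN outputs; the posterior arrays, window length and threshold are assumptions for the example, not the tuned values of the system.</p>
      <preformat>
import numpy as np
from scipy.signal import medfilt

def sad_decisions(p_speech, p_nonspeech, win=11, threshold=0.0):
    """Median-filter the log posterior ratio and threshold it.

    p_speech, p_nonspeech: per-frame softmax outputs of the DNN.
    win and threshold are illustrative, not the tuned values.
    """
    p_speech = np.asarray(p_speech, dtype=float)
    p_nonspeech = np.asarray(p_nonspeech, dtype=float)
    eps = 1e-10
    confidence = np.log(p_speech + eps) - np.log(p_nonspeech + eps)
    smoothed = medfilt(confidence, kernel_size=win)
    return smoothed > threshold  # True = frame classified as speech
      </preformat>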
    </sec>
    <sec id="sec-4">
      <title>2.2 Speaker Segmentation</title>
      <p>
        After performing speech activity detection, the speech
segments are further divided into speaker turns following the
approach described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. First, Mel-frequency cepstral
coefficients (MFCCs) plus energy are extracted from the
waveform. After this, the Bayesian Information Criterion (BIC)
based segmentation approach described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is employed,
performing a coarse segmentation to find candidates followed
by a refinement step. A false alarm rejection strategy is
applied in the latter step so as to reject change-points that are
suspected of being false alarms [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
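      <p>As a reference for the change-point test, the sketch below implements the standard delta-BIC comparison between two adjacent segments that the approach of [<xref ref-type="bibr" rid="ref2">2</xref>] builds on; it is a simplified version with an illustrative penalty weight, not the exact implementation used.</p>
      <preformat>
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC between two adjacent MFCC segments x, y (frames x dims).

    A positive value supports a speaker change-point at the boundary.
    lam is the BIC penalty weight (illustrative).
    """
    z = np.vstack([x, y])
    n1, n2, n = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet(seg):
        # log-determinant of the segment's covariance matrix
        _, val = np.linalg.slogdet(np.cov(seg, rowvar=False))
        return val

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(z)
            - 0.5 * n1 * logdet(x)
            - 0.5 * n2 * logdet(y)
            - lam * penalty)
      </preformat>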
    </sec>
    <sec id="sec-5">
      <title>2.3 Speaker Clustering</title>
      <p>
        Two different approaches for speaker diarization were
assessed, one working in online mode, used in the primary
system, and another working in offline mode. A feature they
have in common is the use of the iVector paradigm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for
speaker turn representation.
      </p>
      <sec id="sec-5-1">
        <title>2.3.1 Online approach</title>
        <p>This clustering strategy consists of comparing the iVectors
of the speaker models with the iVector of a given speaker
turn by computing their dot product; if the maximum dot
product exceeds a predefined threshold, the speaker turn is
assigned to the corresponding speaker model, otherwise it is
considered a new speaker. Every time a new segment is assigned to a
speaker, its model is refined by computing the mean of all
the iVectors assigned to that speaker model.</p>
        <p>
          A novel feature introduced in this online clustering scheme
is the use of written names obtained from OCR [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for
automatic error correction. To that end, the speaker assignment
derived from these written names is considered more reliable than
the clustering assignment, so whenever the clustering and the
written-name approach disagree, the written name prevails
over the clustering decision.
        </p>
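        <p>A minimal sketch of this online assignment rule, including the written-name override, is given below; the data structures and the threshold are illustrative, and the iVectors are assumed to be length-normalized so that the dot product acts as a cosine similarity.</p>
        <preformat>
import numpy as np

def assign_turn(turn_ivec, models, threshold, ocr_name=None):
    """Online clustering of one speaker turn (sketch, illustrative API).

    models: dict mapping speaker id to the list of iVectors assigned
    so far. A written name from OCR, when available, overrides the
    clustering decision, as it is considered more reliable.
    """
    if ocr_name is not None:                # OCR-based error correction
        models.setdefault(ocr_name, []).append(turn_ivec)
        return ocr_name
    # Compare against each speaker model (mean of its assigned iVectors)
    best_id, best_score = None, -np.inf
    for spk, ivecs in models.items():
        score = np.dot(np.mean(ivecs, axis=0), turn_ivec)
        if score > best_score:
            best_id, best_score = spk, score
    if best_score > threshold:
        models[best_id].append(turn_ivec)   # refine the model
        return best_id
    new_id = 'speaker%d' % len(models)      # otherwise, a new speaker
    models[new_id] = [turn_ivec]
    return new_id
        </preformat>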
      </sec>
      <sec id="sec-5-2">
        <title>2.3.2 Offline approach</title>
        <p>
          The proposed offline clustering strategy relies on an
agglomerative hierarchical clustering scheme. First, a
similarity matrix was obtained by computing the dot product
between all the pairwise combinations of the iVectors of each
speaker turn, and this matrix was used to obtain a
dendrogram. The C-score stopping criterion described in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] was
used to select the number of clusters.
        </p>
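        <p>The following sketch illustrates this offline strategy with SciPy's hierarchical clustering; the C-score criterion of [<xref ref-type="bibr" rid="ref8">8</xref>] is not reimplemented here, so the number of clusters is simply passed in as an argument.</p>
        <preformat>
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def offline_clustering(ivectors, n_clusters):
    """Agglomerative clustering of speaker-turn iVectors (sketch).

    ivectors: (n_turns, dim) array, assumed length-normalized so that
    the dot product is a cosine similarity. The number of clusters
    would be chosen with the C-score criterion; here it is given.
    """
    sim = ivectors @ ivectors.T               # pairwise dot products
    dist = np.maximum(1.0 - sim, 0.0)         # similarity to distance
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method='average')
    return fcluster(z, t=n_clusters, criterion='maxclust')
        </preformat>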
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3. VIDEO-BASED PERSON DISCOVERY</title>
      <p>The video-based strategies encompass three different steps:
face detection and tracking, visual speech activity detection
and face clustering.</p>
    </sec>
    <sec id="sec-7">
      <title>3.1 Face Detection and Tracking</title>
      <p>
        Face detection is based on histogram of oriented gradients
(HOG) features and a linear SVM classifier implemented in
the dlib library [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For each detected person, a face tracking and
landmark detection method based on CLNF models is used
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; every time a person stops being visible on screen, a model
that has information about presence, speech intervals and
the highest quality face templates is stored in a database.
To reduce the false alarm rate, face tracks that have a short
time duration and a low quality score are rejected; this score
is calculated with a weighted sum of face symmetry and
sharpness values.
      </p>
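      <p>As an illustration of the detection step, the snippet below runs dlib's HOG-based frontal face detector on a video frame; the tracking, CLNF landmark fitting and quality scoring are separate components and are not sketched here.</p>
      <preformat>
import dlib

# HOG + linear SVM frontal face detector shipped with dlib
detector = dlib.get_frontal_face_detector()

def detect_faces(frame, upsample=1):
    """Return face bounding boxes for one video frame (RGB array)."""
    return [(d.left(), d.top(), d.right(), d.bottom())
            for d in detector(frame, upsample)]
      </preformat>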
    </sec>
    <sec id="sec-8">
      <title>3.2 Visual Speech Activity Detection</title>
      <p>
        The proposed visual speech activity detection method is
based on relative mouth movements: these are generally
small during silence, whereas variations in lip shape are
usually stronger during speech [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Using face landmarks
obtained from the previous step, mouth openness and lip
height variance over time are computed. A variable
threshold based on face size is applied to make the decision
at each frame, and a low-pass filter is used to smooth the results.
      </p>
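      <p>A minimal sketch of this decision rule follows; the window length and the scaling factor of the size-dependent threshold are illustrative values, and a moving average stands in for the low-pass filter.</p>
      <preformat>
import numpy as np

def visual_sad(mouth_open, face_sizes, win=15, alpha=0.02):
    """Per-frame visual speech activity decision (sketch, illustrative).

    mouth_open: per-frame mouth openness from the facial landmarks.
    face_sizes: per-frame face size, used to scale the threshold.
    win and alpha are illustrative, not the tuned values.
    """
    mouth_open = np.asarray(mouth_open, dtype=float)
    # Variance of mouth openness over a sliding window
    var = np.array([np.var(mouth_open[max(0, i - win):i + 1])
                    for i in range(len(mouth_open))])
    # Size-dependent threshold on the movement variance
    decisions = var > alpha * np.asarray(face_sizes, dtype=float)
    # Low-pass (moving-average) smoothing of the binary decisions
    smooth = np.convolve(decisions.astype(float), np.ones(win) / win, 'same')
    return smooth > 0.5
      </preformat>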
    </sec>
    <sec id="sec-9">
      <title>3.3 Face Clustering</title>
      <p>
        The face clustering strategies are built on a face recognition
system: every time a face track is about to be inserted into
the database, a score is computed to decide whether to add it
as a new person or to merge it with an existing one. First, Gabor
features are extracted from the highest-quality templates of
a person, and matching scores are obtained using the
hyper cosine distance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Second, the final score, which is compared
with the merging threshold, is computed as the maximum of
all the matching scores obtained from the two sets of face
images. In the intrashot strategy, only models that appear
within the same shot are compared, aiming at correcting
presence intervals when the tracking method fails. The
intershot strategy allows merging all the appearances of a
person in a video.
      </p>
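      <p>The merging decision can be summarized by the following sketch, where the Gabor-feature matcher is passed in as a function; the names and the threshold are illustrative.</p>
      <preformat>
def should_merge(templates_a, templates_b, match_score, merge_threshold):
    """Decide whether two face models belong to the same person (sketch).

    templates_a, templates_b: the highest-quality face templates of the
    two models; match_score compares one pair of templates (here, a
    Gabor-feature matcher passed in as a function).
    """
    best = max(match_score(a, b)
               for a in templates_a for b in templates_b)
    return best > merge_threshold
      </preformat>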
    </sec>
    <sec id="sec-10">
      <title>4. MULTIMODAL PERSON DISCOVERY</title>
      <p>
        Multimodal person discovery was performed using four
different sources of information: speaker diarization (SD)
using the techniques described in Section 2; face detection
(FD) and visual voice activity detection (VVAD) as
described in Section 3; and written names (WN) extracted
using the strategy described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. First, the set of
evidences is defined as proposed in the baseline fusion strategy
provided by the organizers. Given a shot, a person is
considered to appear in it if the same name is present in SD, FD
and VVAD within the time interval that defines the shot. A
late naming strategy was used to assign names to the
different sources of information [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. For each hypothesized name,
a confidence is computed as proposed in the baseline
strategy, but those hypotheses with confidence lower than 1 are
discarded, as they correspond to situations of non-overlap
between the evidence and the hypothesized name.
      </p>
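      <p>The fusion rule can be sketched as follows; the interval-list representation and function names are illustrative, and discarding hypotheses with confidence lower than 1 corresponds to requiring actual overlap within the shot, which the intersection below enforces.</p>
      <preformat>
def persons_in_shot(shot_start, shot_end, sd_names, face_names):
    """Names hypothesized for one shot (sketch of the fusion rule).

    sd_names and face_names: lists of (name, start, end) intervals from
    speaker diarization and from face tracks gated by the visual speech
    activity detector; names are assigned beforehand with the
    late-naming strategy of [11].
    """
    def names_overlapping(intervals):
        # Keep names whose interval overlaps the shot
        return set(name for name, s, e in intervals
                   if e > shot_start and shot_end > s)
    # A person appears if both modalities place the name inside the shot
    return names_overlapping(sd_names).intersection(names_overlapping(face_names))
      </preformat>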
    </sec>
    <sec id="sec-11">
      <title>5. RESULTS AND DISCUSSION</title>
      <p>Table 2 shows the results achieved by the submitted
systems on both the REPERE (partition ’test2’) and INA datasets;
these systems are combinations of the two proposed speaker
diarization and face clustering strategies.</p>
      <p>The development of the audio-based person discovery
approaches showed us that a lower speaker diarization error
rate does not necessarily lead to a higher EwMAP, as overclustering
results in incorrect person detections. Also, we have to
increase our efforts on TV programmes featuring challenging
acoustic conditions, which are the ones that showed the most
degraded performance. Lastly, we observed that adding
written names obtained from OCR information to the speaker
diarization algorithm improved performance, so this type of
fusion will be studied in more depth.</p>
      <p>Table 2 also shows that the two speaker diarization strategies
are almost equally suitable for this task, as they achieve very
similar results; still, the online strategy performs slightly
better, probably due to the use of the OCR information for
error correction. With respect to the face clustering strategies,
the intrashot method obtained better results, probably because
the intershot combination led to an excessive merging of faces,
making the system miss speakers by erroneously combining them
with others.</p>
      <p>The proposed video-based person discovery approaches
showed us that the intrashot strategy performed better than
the intershot strategy, probably because of the
overclustering issue mentioned above. The most challenging aspects,
which will have to be addressed in the future, were the
variations in pose, scale and illumination, as they made it difficult
to develop a robust face matching strategy.</p>
      <p>The GTM-UVigo team approached this task by developing
audio and face modules and combining them through a simple
decision-level fusion; in future work, audiovisual fusion
at earlier stages of the system will be researched in order to
exploit the full potential of multimodal person discovery.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Results of the primary system on the REPERE (’test2’) and INA datasets.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Dataset</th><th>EwMAP</th><th>MAP</th><th>C</th></tr>
          </thead>
          <tbody>
            <tr><td>REPERE (’test2’)</td><td>75.76%</td><td>77.10%</td><td>78.03%</td></tr>
            <tr><td>INA</td><td>80.34%</td><td>80.61%</td><td>92.42%</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-12">
      <title>6. ACKNOWLEDGEMENTS</title>
      <p>This research was funded by the Spanish Government
(’SpeechTech4All Project’ TEC2012-38939-C03-01), the
Galician Government through the research contract GRC2014/024
(Modalidade: Grupos de Referencia Competitiva 2014) and
’AtlantTIC Project’ CN2012/160, and also by the Spanish
Government and the European Regional Development Fund
(ERDF) under project TACTICA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Baltrusaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Robinson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Morency</surname>
          </string-name>
          .
          <article-title>Constrained local neural fields for robust facial landmark detection in the wild</article-title>
          .
          <source>In IEEE International Conference on Computer Vision Workshops (ICCVW)</source>
          , pages
          <fpage>354</fpage>
          -
          <lpage>361</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cettolo</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Vescovi</surname>
          </string-name>
          .
          <article-title>Efficient audio segmentation algorithms based on the BIC</article-title>
          .
          <source>In Proceedings of ICASSP</source>
          ,
          volume VI
          , pages
          <fpage>537</fpage>
          -
          <lpage>540</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Kenny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dumouchel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ouellet</surname>
          </string-name>
          .
          <article-title>Front-end factor analysis for speaker verification</article-title>
          .
          <source>IEEE Transactions on Audio, Speech and Language Processing</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>González-Agulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Argones-Rúa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alba-Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>González-Jiménez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Anido-Rifón</surname>
          </string-name>
          .
          <article-title>Multimodal biometrics-based student attendance measurement in learning management systems</article-title>
          .
          <source>In IEEE International Symposium on Multimedia (ISM)</source>
          , pages
          <fpage>699</fpage>
          -
          <lpage>704</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>King</surname>
          </string-name>
          .
          <article-title>Dlib-ml: A machine learning toolkit</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>10</volume>
          :
          <fpage>1755</fpage>
          -
          <lpage>1758</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lopez-Otero</surname>
          </string-name>
          .
          <article-title>Improved Strategies for Speaker Segmentation and Emotional State Detection</article-title>
          .
          <source>PhD thesis</source>
          , Universidade de Vigo,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lopez-Otero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Docio-Fernandez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Garcia-Mateo</surname>
          </string-name>
          .
          <article-title>GTM-UVigo system for Albayzin 2014 audio segmentation evaluation</article-title>
          .
          <source>In Iberspeech 2014: VIII Jornadas en Tecnología del Habla and IV SLTech Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lopez-Otero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Docio-Fernandez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Garcia-Mateo</surname>
          </string-name>
          .
          <article-title>A novel method for selecting the number of clusters in a speaker diarization system</article-title>
          .
          <source>In Proceedings of EUSIPCO</source>
          , pages
          <fpage>656</fpage>
          -
          <lpage>660</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Quénot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thollard</surname>
          </string-name>
          .
          <article-title>From text detection in videos to person identification</article-title>
          .
          <source>In Proceedings of IEEE International Conference on Multimedia and Expo (ICME)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Multimodal Person Discovery in Broadcast TV at MediaEval 2015</article-title>
          .
          <source>In Proceedings of the MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quénot</surname>
          </string-name>
          .
          <article-title>Unsupervised speaker identification using overlaid texts in TV broadcast</article-title>
          .
          <source>In Proceedings of Interspeech</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Rivet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Girin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Jutten</surname>
          </string-name>
          .
          <article-title>Visual voice activity detection as a help for speech source separation from convolutive mixtures</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>49</volume>
          (
          <issue>7</issue>
          ):
          <fpage>667</fpage>
          -
          <lpage>677</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>