<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Félicien Vallet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaël Richard</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Slim Essid</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean Carrive</string-name>
          <email>jcarrive@ina.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut National de l'Audiovisuel</institution>
          ,
          <addr-line>4, Avenue de l'Europe, 94366 Bry-sur-Marne Cedex</addr-line>
          ,
          <country country="FR">FRANCE</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TELECOM ParisTech</institution>
          ,
          <addr-line>37, rue Dareau, 75014 Paris</addr-line>
          ,
          <country country="FR">FRANCE</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>TV show structuring consists in breaking down a program into several sequences (interviews, musical performances, film excerpts, etc.) and in retrieving, for each of these segments, high-level knowledge often referred to as semantic content. The focus here is on two particular tasks: the detection of musical performances, i.e. the segments where artists are performing live music, and the identification of the artist for each of these segments. The corpus used in this study is “Le Grand Échiquier”, a French talk show from the 1970s-1980s provided by INA, each show lasting around three hours. In addition to the videos, this corpus contains annotations made by professional documentalists, providing a list of participants and a summary for each show. We use segmentations gathered from partners of the projects K-Space [2] and Infom@gic.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>This paper presents a preliminary study based on only one
show (CPB84052346). The approach relies on the fusion of
data gathered from partners. A major aspect of the work
resides in the use of multimodal features. In [1], Cheng et al.
propose a semantic-event segmentation of wedding ceremony
videos. Similarly, in our case, to build a robust musical
performance segmentation we combine basic musical descriptors
carried by the audio track with video descriptors. This fusion
step helps disambiguate complex situations likely
to contain music, such as film excerpts. Once the musical
performance segmentation is obtained, the next task is to label
each segment with the name of the performing artist.</p>
    </sec>
    <sec id="sec-2">
      <title>2. SEGMENTATIONS AND DESCRIPTORS</title>
      <p>Audio segmentations include general sound classification,
discriminating music, speech and applause, by TELECOM
ParisTech and TUB.</p>
      <p>Video segmentations consist of face detection by
TELECOM ParisTech and film excerpt detection by EADS.
Speech segmentation consists of automatic transcription by
VECSYS.</p>
      <p>Fig. 1: Overall fusion process.</p>
    </sec>
    <sec id="sec-3">
      <title>3. MUSICAL PERFORMANCE SEGMENTATION</title>
      <p>In this section we fuse the various segmentations. An important
issue is the integration of segmentations of different natures.
While some of the data give, at a fixed time step, the
probability of belonging to a class, others are merely rough
boundaries indicating the detection of a concept, such as a film
excerpt.</p>
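      <p>One way to bring boundary-type detections onto the same footing as fixed-step data is to sample them on a common time grid. The following is a minimal sketch, not the procedure used in this study: the function name <italic>rasterize</italic>, the one-second step and the example boundaries are assumptions made for illustration.</p>

```python
def rasterize(segments, duration_s, step_s=1.0):
    """Turn (start, end) boundaries, in seconds, into a 0/1 track.

    The track holds one value per time step of length step_s (assumed
    here to be one second); a step is set to 1 when it falls inside a
    detected segment.
    """
    n_steps = int(duration_s / step_s)
    track = [0] * n_steps
    for start, end in segments:
        first = max(0, int(start / step_s))
        last = min(n_steps, int(end / step_s))
        for i in range(first, last):
            track[i] = 1
    return track


# Hypothetical film excerpt detected between 2 s and 5 s of an 8 s clip.
film_track = rasterize([(2.0, 5.0)], duration_s=8)
print(film_track)  # [0, 0, 1, 1, 1, 0, 0, 0]
```

      <p>A binary track of this kind can then be compared step by step with the probability curves of the other segmentations.</p>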
    </sec>
    <sec id="sec-4">
      <title>3.1. Class probability fusion</title>
      <p>The general sound classification provided by TUB uses a
Gaussian Mixture Model (GMM), while that of TELECOM ParisTech
uses Support Vector Machines (SVM). These segmentations
must be fused in order to produce a common temporal
representation. For this, we average the per-second probabilities of each
class. This provides, for the music and speech classes,
probability curves over the whole show.</p>
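      <p>The averaging step can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: both classifiers are assumed to output one probability per second on the same time grid, and the function name and the numerical values are hypothetical.</p>

```python
def fuse_probabilities(curve_a, curve_b):
    """Average two per-second probability curves for the same class."""
    if len(curve_a) != len(curve_b):
        raise ValueError("both curves must cover the same time steps")
    return [(a + b) / 2.0 for a, b in zip(curve_a, curve_b)]


# Hypothetical music probabilities for three consecutive seconds.
gmm_music = [1.0, 0.5, 0.25]   # GMM-based classifier output (made up)
svm_music = [0.5, 0.0, 0.75]   # SVM-based classifier output (made up)
fused_music = fuse_probabilities(gmm_music, svm_music)
print(fused_music)  # [0.75, 0.25, 0.5]
```

      <p>Applying the same function to the speech class yields the second probability curve.</p>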
    </sec>
    <sec id="sec-5">
      <title>3.2. Heuristic rules</title>
      <p>For the film excerpt and applause detections, averaging does not
apply, since these segmentations provide rough boundaries
rather than class probabilities. Fusing them is nevertheless crucial:
combining the music/non-music segmentation
obtained earlier with the applause and film excerpt detections
helps disambiguate sequences that contain music
but are not live music performances. To rule out
such cases we use the following heuristic hypotheses:</p>
      <list list-type="bullet">
        <list-item><p>a musical segment lasts at least 90 sec;</p></list-item>
        <list-item><p>a musical segment is followed by applause from the audience;</p></list-item>
        <list-item><p>a film excerpt lasts at least 30 sec;</p></list-item>
        <list-item><p>a film excerpt is followed by applause from the audience;</p></list-item>
        <list-item><p>musical segments over film excerpts are ignored.</p></list-item>
      </list>
      <p>The result obtained after the application of these rules is a
refined segmentation that can be used for the artist
identification task.</p>
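      <p>The heuristic hypotheses of this section can be expressed as a simple filter over candidate segments. The sketch below is one possible reading, not the implementation used here: the 90 sec and 30 sec thresholds come from the rules, whereas the 10 sec tolerance used to decide that applause "follows" a segment, the overlap test and all example boundaries are assumptions.</p>

```python
MIN_MUSIC_S = 90      # minimum duration of a musical segment (from the rules)
MIN_FILM_S = 30       # minimum duration of a film excerpt (from the rules)
APPLAUSE_GAP_S = 10   # assumed tolerance between segment end and applause start


def followed_by_applause(segment, applause, max_gap_s=APPLAUSE_GAP_S):
    """True if some applause starts at most max_gap_s after the segment ends."""
    end = segment[1]
    return any(max_gap_s >= a_start - end >= 0 for a_start, _ in applause)


def overlaps(a, b):
    """True if the (start, end) intervals a and b share some time."""
    return b[1] > a[0] and a[1] > b[0]


def refine(music, films, applause):
    """Keep only the music segments that satisfy the heuristic rules."""
    valid_films = [f for f in films
                   if f[1] - f[0] >= MIN_FILM_S
                   and followed_by_applause(f, applause)]
    kept = []
    for m in music:
        long_enough = m[1] - m[0] >= MIN_MUSIC_S
        applauded = followed_by_applause(m, applause)
        over_film = any(overlaps(m, f) for f in valid_films)
        if long_enough and applauded and not over_film:
            kept.append(m)
    return kept


# Hypothetical boundaries, in seconds.
music = [(100, 260), (400, 430), (600, 720)]
films = [(590, 730)]
applause = [(262, 275), (722, 728), (731, 740)]
print(refine(music, films, applause))  # [(100, 260)]
```

      <p>In this made-up example the second music segment is rejected for being too short, and the third because it lies over a validated film excerpt.</p>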
    </sec>
    <sec id="sec-6">
      <title>4. ARTIST IDENTIFICATION</title>
      <p>The aim here is to bridge the gap between low-level
descriptions and semantic content. For this, two tools are of great
use: the annotations by documentalists (particularly the
participants list) and the automatic transcription. The idea is to
retrieve the first and last names of the artists who are expected
to perform on stage. Again, several heuristic rules are used:</p>
      <list list-type="bullet">
        <list-item><p>we suppose that the name of the artist is pronounced before
or after each music segment;</p></list-item>
        <list-item><p>the time window for name retrieval is 90 sec before and
90 sec after each music segment;</p></list-item>
        <list-item><p>we associate the biggest face detected during the musical
segment with the main artist.</p></list-item>
      </list>
    </sec>
    <sec id="sec-7">
      <title>5. RESULTS</title>
    </sec>
    <sec id="sec-8">
      <title>5.1. Musical performance segmentation</title>
      <p>With the heuristic rules, we obtained a rather robust
segmentation of live musical performances for the show CPB84052346.
The show lasts 3h 25min 57sec and contains
42min 31sec of musical performances. In the following table,
scores are given in percent with respect to the total number of
live music segments:</p>
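      <table-wrap>
        <caption>
          <p>Results for musical appearance segmentation for the show CPB84052346.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Accuracy</th><th>False Alarm</th><th>Non-Detection</th></tr>
          </thead>
          <tbody>
            <tr><td>86%</td><td>6%</td><td>8%</td></tr>
          </tbody>
        </table>
      </table-wrap>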
      <p>From a user-oriented point of view, the test show contains
twelve segments of artists performing live. The non-detection
rate comes from two musical segments too short to be
detected, while the false alarm rate can be explained by the
detection of a short documentary that contains a lot of music.</p>
      <p>Results are much more difficult to obtain for artist identification. First
of all, not all names can be retrieved, because some of them were
not in the database used by VECSYS for the automatic
transcription. It may therefore happen that only first names are
detected. Also, for a given musical segment, several names can
be proposed. Figure 2 shows the biggest face detected for the
first eight segments while the table displays the results for the
artist identification:</p>
      <p>Fig. 2: Face retrieval for musical segments.</p>
      <p>Results of artist identification for the show CPB84052346
(Truth being the actual name of the artist).</p>
      <p>Our method provides a good indication of the identity
of the performing artist. Moreover, the assumption that the
biggest face detected during a music segment belongs to
the artist seems justified, despite one error (excerpt 4).</p>
    </sec>
    <sec id="sec-9">
      <title>6. PERSPECTIVES</title>
      <p>This study suggests that it is possible to build an automatic
approach that analyses sentences from the documentalists’
annotations and then attaches semantic knowledge to low-level
segmentations, as in the automatic labeling of musical excerpts.</p>
    </sec>
    <sec id="sec-10">
      <title>7. REFERENCES</title>
      <p>[1] Wen-Huang Cheng et al., “Semantic-event based
analysis and segmentation of wedding ceremony videos,” in
Proceedings of MIR ’07, Augsburg, Germany, 2007.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2] K-Space Network of Excellence, http://www.k-space.eu/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>