<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MusiClef 2013: Soundtrack Selection for Commercials</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Cynthia C. S. Liem Delft University of Technology</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Markus Schedl Johannes Kepler University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Nicola Orio University of Padua</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>MusiClef was one of the "brave new tasks" at MediaEval 2013, with a multimodal approach that combined music, video, and textual information in order to evaluate systems that recommend a music soundtrack given the video of a commercial and information on the product to be advertised.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The MusiClef 2013 "Soundtrack Selection for Commercials" task aims at analyzing music usage in TV commercials and determining music that fits a given commercial video. Usually, this task is carried out by music consultants, who select a song to advertise a particular brand or a given product. The MusiClef benchmarking activity, in contrast, aims to automate this process by taking into account both context- and content-based information about the video, the brand, and the music. The goal of MusiClef 2013, which in keeping with the task's tradition is motivated by a real professional application, can be summarized as follows: given a TV commercial, predict the most suitable music from a list of candidate songs.</p>
      <p>The selection of a suitable soundtrack for a given commercial can be based on a number of characteristics, which have been taken into account while organizing this brave new task. On the one hand, each brand/product has a particular signature that should be underlined by the soundtrack. For this reason, a number of web pages describing either the brands or the products included in the evaluation campaign have been crawled automatically to extract a number of contextual descriptors. On the other hand, the choice of a particular song also depends on the public image of the performer. Again, web pages describing the artists included in the evaluation campaign have been automatically crawled to extract additional contextual descriptors regarding the music. Finally, the choice of a soundtrack also depends on how previous commercials were perceived by the public. Thus, as an additional semantic data source, we provide the comments on the commercials made by the persons who uploaded the videos to the web.</p>
      <p>Content plays an equally important role in the selection of soundtracks. For this reason, a number of descriptors were computed from the audiovisual content of the commercial videos. Clearly, the soundtrack of a commercial video may also contain speech and environmental sounds that are usually not available to music consultants at the time of soundtrack selection. In order to better simulate the selection, we computed the same set of audio descriptors from the original recordings as well. It is important to note that, for obvious copyright reasons, we did not distribute the original content but only the lossy descriptors. Participants were referred to web services run by third parties to access the original multimedia content, both for videos and for songs.</p>
      <p>This has been a challenging task, in which multimodal information sources that do not trivially connect to each other needed to be considered. In particular, participants were asked to provide at least one run based on a combination of multimodal information.</p>
    </sec>
    <sec id="sec-2">
      <title>2. THE DATASETS</title>
      <p>
        Two datasets have been made available to participants. First of all, the development set included YouTube links to 392 commercial videos for which the music has been identified. For each video, the development set contained metadata on the commercial as available from comments on the YouTube page, video features (MPEG-7 Motion Activity and Scalable Color Descriptor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), web pages about the respective brands and music artists, and music features (the well-known MFCCs, BLF as proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], PS09 as proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and beat, key, and harmonic patterns computed using the software available at [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), computed from both the original soundtracks and the corresponding recordings. Moreover, a set of 227 additional commercial videos has been included in the development set, although it was not possible to identify their original soundtracks. For these videos the same information has been made available, except for the music features of the original recordings.
      </p>
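The frame-level audio features listed above (e.g. MFCCs) are typically collapsed into one fixed-length vector per track before any similarity computation. A minimal sketch of such a summary, assuming mean/standard-deviation pooling — the pooling choice is our illustration, not something the task specifies:

```python
import numpy as np

def track_descriptor(frame_features):
    """Collapse a (n_frames, n_dims) frame-level feature matrix,
    e.g. per-frame MFCCs, into a single track-level vector by
    stacking the per-dimension mean and standard deviation.
    Mean/std pooling is an illustrative choice, not the task's
    prescribed aggregation."""
    f = np.asarray(frame_features, dtype=float)
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])
```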
      <p>The test set included 55 additional commercial videos, for which participants had to suggest a suitable soundtrack from 5000 candidate recordings of published music made available from a broadcasting company database. Particular care was taken not to include the original recording of the commercial in the list of candidate songs. Moreover, the 5000 candidate songs were recorded by the same pool of artists as the development set. To prevent the task from becoming a simple audio comparison task, test set videos were provided in muted form. Therefore, for test set videos, no original soundtrack features were provided. For the rest, however, the same information was made available as for videos from the development set. As for the 5000 candidate audio recordings, a 30-second snippet was extracted from each recording, for which the same music features as in the development set were computed.</p>
      <p>Audio similarity has been precomputed by the organizers and made available to participants for both sets, in order to provide a common background for all experiments. Participants were free to carry out further processing, both on the audio/video features and on the computation of similarity.</p>
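A common baseline for pairwise audio similarity over track-level feature vectors is cosine similarity; the similarity function actually precomputed by the organizers is not specified here, so the following is only a hedged sketch:

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity for track-level feature vectors.
    X: array of shape (n_tracks, n_dims). Returns an
    (n_tracks, n_tracks) matrix with entries in [-1, 1].
    Cosine similarity is an illustrative choice, not necessarily
    the measure used by the organizers."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return Xn @ Xn.T
```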
    </sec>
    <sec id="sec-3">
      <title>3. COLLECTING THE DATA</title>
      <p>The process of collecting the data described in the previous sections required a number of steps, which were carried out by the organizers. In the following, we summarize the procedure in order to highlight the main points and to discuss the main decisions that were taken.</p>
      <p>First of all, we selected a number of representative commercials that were available for download on YouTube. We started from a list of annotated commercials proposed at http://admusicdb.com/. Starting from this list, we automatically crawled YouTube in order to get the complete videos (for this content type, only derived features are distributed), the description inserted by the uploader, and the comments by other viewers.</p>
      <p>The audio tracks of the commercials were analyzed by audio fingerprinting software and matched against a reference collection of about 380,000 commercial MP3s, which was available thanks to a collaboration with the Italian broadcaster RTI. Only about 50% of the original soundtracks were successfully identified, so we manually inspected the reasons for the missing identifications. In general, a number of soundtracks were composed purposely for the commercials, while some of them were played live by the testimonials. The remaining soundtracks were simply not present in the reference collection or were stored as different covers in the reference collection. In order to deal with the latter, we collected all the available covers and manually compared their music content with the soundtracks, evaluating the similarity on a three-level scale. Through manual identification we increased the available MP3s to about 60% of the downloaded videos. Participants were informed, for each MP3, of the confidence level of the identification.</p>
      <p>The final step consisted in selecting the files for the actual task: videos and MP3s. Videos were chosen among the ones where no identification was possible, selecting those with a similar length of about 30 seconds. MP3s were selected as a subset of the reference collection, taking particular care that they were performed by the same pool of artists and that they did not contain the original songs. From each MP3 we extracted a 30-second sample that was used for the task.</p>
      <p>In parallel with the content descriptors, we retrieved relevant contextual information. Starting from the complete list of videos, we could select the set of brands and products that have been advertised and the set of artists that were mentioned as the main performers. We crawled the web, submitting three different queries to the Bing search engine: "brand/product", "artist music", and "artist music review". In order to guarantee reproducibility of the results, we downloaded the complete pages besides computing the Lucene index and the term weights (TF·IDF).</p>
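The organizers computed the term weights with Lucene; the TF·IDF weighting itself can be sketched in plain Python. The tokenization and the exact IDF smoothing used by Lucene differ from this simplified illustration:

```python
import math
from collections import Counter

def tfidf_weights(token_docs):
    """Compute TF.IDF weights for a list of tokenized documents.
    Uses raw term frequency and idf = log(N / df); Lucene's actual
    scoring applies different smoothing, so this is only illustrative."""
    n = len(token_docs)
    df = Counter()
    for doc in token_docs:
        df.update(set(doc))  # document frequency per term
    idf = {t: math.log(n / d) for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()}
            for doc in token_docs]
```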
    </sec>
    <sec id="sec-4">
      <title>4. EVALUATION</title>
      <p>Participants could submit one to three runs, with the requirement that at least one run should use multimodal information. For each video in the test set, participants were requested to propose a ranked list of 5 candidate songs.</p>
      <p>Evaluation was carried out using the Amazon Mechanical Turk platform. For every video in the test set, a HIT was designed presenting the muted test set video and all top-5 song (snippet) suggestions for the video, as submitted by the participants. These song suggestions were presented in randomized order. For each HIT, 5 assignments were released. Since both the video and each of the song snippets were no longer than 30 seconds, the load on the side of the workers was kept within reasonable bounds.</p>
      <p>MTurk workers were asked to grade the suitability of each song suggestion on a 4-level Likert scale, ranging from very poor (1 point) to very good (4 points). There was also a fallback `impossible to tell' option, which required a mandatory explanation of why the suitability could not be graded.</p>
      <p>For each run, evaluation results are computed using three different measures. Let $V$ be the full collection of test set videos, and let $s_r(v)$ be the average suitability score for the audio file suggested at rank $r$ for video $v$. Then, the evaluation measures are computed as follows:</p>
      <p>Average suitability score of the first-ranked song: $\frac{1}{|V|}\sum_{i=1}^{|V|} s_1(v_i)$.</p>
      <p>Average suitability score of the full top-5: $\frac{1}{|V|}\sum_{i=1}^{|V|}\frac{1}{5}\sum_{r=1}^{5} s_r(v_i)$.</p>
      <p>Weighted average suitability score of the full top-5, for which we apply a weighted harmonic mean instead of an arithmetic mean: $\frac{1}{|V|}\sum_{i=1}^{|V|}\frac{\sum_{r=1}^{5} s_r(v_i)}{\sum_{r=1}^{5} s_r(v_i)/r}$.</p>
      <p>It should be stressed that this brave new task is highly novel and non-trivial in terms of `ground truth'. This is why we purely use human ratings for the evaluation, and use the different measures above to study both the rating and the ranking aspects of the results.</p>
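A sketch of the three measures in Python, where `ratings[i]` holds the averaged MTurk scores for the five ranked suggestions of one test video. The first two measures follow directly from their definitions; the exact form of the harmonic weighting in the third is reconstructed from the text and should be treated as an assumption:

```python
def first_rank_score(ratings):
    """Average suitability of the first-ranked suggestion over all videos."""
    return sum(s[0] for s in ratings) / len(ratings)

def top5_score(ratings):
    """Arithmetic mean suitability of the full top-5, averaged over videos."""
    return sum(sum(s) / 5.0 for s in ratings) / len(ratings)

def weighted_top5_score(ratings):
    """Weighted harmonic-mean variant (assumed form: per video, the sum
    of scores divided by the sum of rank-discounted scores s_r / r)."""
    def per_video(s):
        return sum(s) / sum(x / r for r, x in enumerate(s, start=1))
    return sum(per_video(s) for s in ratings) / len(ratings)
```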
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>MusiClef has been partially supported by EU-FP7 through
the PHENICX project (no. 601166) and the PROMISE
Network of Excellence (no. 258191); it is also partially
supported by the Austrian Science Funds (FWF): PP22856-N23
and P25655. The work of Cynthia Liem is supported in part
by the Google European Doctoral Fellowship in Multimedia.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Ircam.</surname>
          </string-name>
          Analyse-Synthese: Software. http://anasynth.ircam.fr/home/software. Accessed: Sept.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Manjunath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Salembier</surname>
          </string-name>
          , and T. Sikora, editors.
          <source>Introduction to MPEG-7: Multimedia Content Description Interface</source>
          . John Wiley &amp; Sons, New York, USA,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pohle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schnitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Knees</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Widmer</surname>
          </string-name>
          . On Rhythm and
          <article-title>General Music Similarity</article-title>
          .
          <source>In Proc. of ISMIR</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Seyerlehner</surname>
          </string-name>
          , G. Widmer, and
          <string-name>
            <given-names>T.</given-names>
            <surname>Pohle</surname>
          </string-name>
          .
          <article-title>Fusing Block-Level Features for Music Similarity Estimation</article-title>
          .
          <source>In Proc. of DAFx</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>