     MusiClef 2013: Soundtrack Selection for Commercials

Cynthia C. S. Liem, Delft University of Technology, C.C.S.Liem@tudelft.nl
Nicola Orio, University of Padua, Nicola.Orio@unipd.it
Geoffroy Peeters, UMR STMS IRCAM-CNRS, Geoffroy.Peeters@ircam.fr
Markus Schedl, Johannes Kepler University, Markus.Schedl@jku.at

Copyright is held by the author/owner(s).
MediaEval Workshop, 2013. Barcelona, Spain.

ABSTRACT

MusiClef was one of the "brave new tasks" at MediaEval 2013. It took a multimodal approach that combined music, video, and textual information in order to evaluate systems that recommend a music soundtrack given the video of a commercial and information on the product to be advertised.

1. INTRODUCTION

The MusiClef 2013 "Soundtrack Selection for Commercials" task aims at analyzing music usage in TV commercials and determining music that fits a given commercial video. Usually, this task is carried out by music consultants, who select a song to advertise a particular brand or a given product. The MusiClef benchmarking activity, in contrast, aims at automating this process by taking into account both context- and content-based information about the video, the brand, and the music. The goal of MusiClef 2013, which, in keeping with the task's tradition, is motivated by a real professional application, can be summarized as follows: given a TV commercial, predict the most suitable music from a list of candidate songs.

The selection of a suitable soundtrack for a given commercial can be based on a number of characteristics, which have been taken into account while organizing this brave new task. On the one hand, each brand/product has a particular signature that should be underlined by the soundtrack. For this reason, a number of web pages describing either the brands or the products included in the evaluation campaign have been crawled automatically to extract a number of contextual descriptors. On the other hand, the choice of a particular song also depends on the public image of the performer. Again, web pages describing the artists included in the evaluation campaign have been automatically crawled to extract additional contextual descriptors regarding the music. Finally, the choice of a soundtrack also depends on how previous commercials were perceived by the public. Thus, as an additional semantic data source, we provide the comments on the commercials made by the persons who uploaded the videos on the web.

Content plays an equally important role in the selection of soundtracks. For this reason, a number of descriptors were computed from the audiovisual content of the commercial videos. Clearly, the soundtrack of a commercial video may also contain speech and environmental sounds that are usually not available to music consultants at the time of soundtrack selection. In order to better simulate the selection, we computed the same set of audio descriptors also from the original recordings. It is important to note that, for obvious copyright reasons, we did not distribute the original content but only the lossy descriptors. Participants were referred to web services run by third parties to access the original multimedia content, both for videos and for songs.

This has been a challenging task, in which multimodal information sources that do not trivially connect to each other needed to be considered. In particular, participants were asked to provide at least one run based on the combination of multimodal information.

2. THE DATASETS

Two datasets have been made available to participants. First of all, the development set included YouTube links to 392 commercial videos for which the music has been identified. For each video, the development set contained metadata on the commercial as available from comments on the YouTube page, video features (MPEG-7 Motion Activity and Scalable Color Descriptor [2]), web pages about the respective brands and music artists, and music features (the well-known MFCCs, BLF as proposed in [4], PS209 as proposed in [3], and beat, key, and harmonic pattern descriptors obtained with the software available at [1]) computed both from the original soundtracks and from the corresponding recordings. Moreover, a set of 227 additional commercial videos has been included in the development set, although it was not possible to identify their original soundtracks. For these videos, the same information has been made available, except for the music features of the original recordings.

The test set included 55 additional commercial videos for which participants have to suggest a suitable soundtrack from 5000 candidate recordings of published music made available from a broadcasting company database. Particular care has been taken not to include the original recording of the commercial in the list of candidate songs. Moreover, the 5000 candidate songs were recorded by the same pool of artists as the development set. To prevent the task from becoming a simple audio comparison task, test set videos were provided in muted form. Therefore, for test set videos, no original soundtrack features were provided. For the rest, however, the same information was made available as for videos from the development set. As for the 5000 candidate audio recordings, a 30-second snippet was extracted from each recording, for which the same music features as in the development set were computed.

Audio similarity has been precomputed by the organizers and made available to participants for both sets, in order to provide a common background for all experiments. Participants were free to carry out further processing, both on the audio/video features and on the computation of similarity.
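The precomputed similarity is provided as-is; purely as an illustration of such further processing, the following is a minimal sketch of a baseline similarity a participant might derive from the distributed frame-level MFCC descriptors. The summarization by per-coefficient means and the variable names are assumptions made for this example, not the organizers' precomputation.

    import numpy as np

    def summarize(mfcc_frames):
        # Collapse a (frames x coefficients) MFCC matrix into a single
        # vector by taking the per-coefficient mean: a deliberately
        # simple track-level summary.
        return np.asarray(mfcc_frames).mean(axis=0)

    def cosine(a, b):
        # Cosine similarity between two descriptor vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def audio_similarity(mfcc_a, mfcc_b):
        # Similarity of two recordings from their MFCC frame matrices.
        return cosine(summarize(mfcc_a), summarize(mfcc_b))

    # Toy example with random placeholder descriptors.
    rng = np.random.default_rng(0)
    print(audio_similarity(rng.normal(size=(1200, 13)),
                           rng.normal(size=(1200, 13))))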
3. COLLECTING THE DATA

The process of collecting the data described in the previous sections required a number of steps, which have been carried out by the organizers. In the following, we summarize the procedure in order to highlight the main points and to discuss the main decisions that have been taken.

First of all, we selected a number of representative commercials that were available for download on YouTube. We started from a list of annotated commercials proposed at http://admusicdb.com/. Starting from this list, we automatically crawled YouTube in order to get the complete videos (for this content type, only derived features are distributed), the description inserted by the uploader, and the comments by other viewers.

The audio tracks of the commercials were analyzed with audio fingerprinting software and matched against a reference collection of about 380,000 commercial MP3s, which was available thanks to a collaboration with the Italian broadcaster RTI. Only about 50% of the original soundtracks were successfully identified, so we manually inspected the reasons for the missing identifications. In general, a number of soundtracks were composed purposely for the commercials, while some of them were played live by the testimonials. The remaining soundtracks were simply not present in the reference collection or were stored in the reference collection as different cover versions. In order to deal with the latter, we collected all the available covers and manually compared their musical content with the soundtracks, evaluating the similarity on a three-level scale. Through manual identification, we increased the available MP3s to about 60% of the downloaded videos. Participants were informed, for each MP3, of the confidence level of the identification.

The final step consisted in selecting the files for the actual task: videos and MP3s. Videos were chosen among the ones where no identification was possible, selecting the ones with a similar length of about 30 seconds. MP3s were selected as a subset of the reference collection, taking particular care that they were performed by the same pool of artists and that they did not contain the original songs. For each MP3, we extracted a 30-second sample that was used for the task.

In parallel with the content descriptors, we retrieved relevant contextual information. Starting from the complete list of videos, we could select the set of brands and products that have been advertised and the set of artists that were mentioned as the main performers. We crawled the web by submitting three different queries to the Bing search engine: "brand/product", "artist music", and "artist music review". In order to guarantee reproducibility of the results, we downloaded the complete pages, in addition to computing the Lucene index and the term weights (TF × IDF).
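To make the term weighting concrete, the following is a minimal sketch of a plain TF × IDF computation over tokenized page texts. It does not reproduce Lucene's exact scoring formula, and the toy documents are placeholders, not the crawled pages.

    import math
    from collections import Counter

    def tf_idf(documents):
        # documents: list of token lists, one per crawled page.
        # Returns one {term: weight} dictionary per document.
        n_docs = len(documents)
        # Document frequency: number of documents containing each term.
        df = Counter(term for doc in documents for term in set(doc))
        weights = []
        for doc in documents:
            tf = Counter(doc)
            weights.append({term: count * math.log(n_docs / df[term])
                            for term, count in tf.items()})
        return weights

    # Toy usage with placeholder page texts.
    pages = ["sports car brand new model advertisement".split(),
             "artist music review album rock tour".split(),
             "artist music interview new album".split()]
    print(tf_idf(pages)[1])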
4. EVALUATION

Participants could submit one to three runs, with the requirement that at least one run should use multimodal information. For each video in the test set, participants are requested to propose a ranked list of 5 candidate songs.

Evaluation is carried out using the Amazon Mechanical Turk platform. For every video in the test set, a HIT was designed presenting the muted test set video and all top-5 song (snippet) suggestions for the video, as submitted by the participants. These song suggestions were presented in randomized order. For each HIT, 5 assignments are released. Since both the video and each of the song snippets were not longer than 30 seconds, the load on the side of the workers was kept within reasonable bounds.

MTurk workers are asked to grade the suitability of each song suggestion on a 4-level Likert scale, ranging from very poor (1 point) to very good (4 points). There is also a fallback 'impossible to tell' option, which requires a mandatory explanation of why the suitability could not be graded.

For each run, evaluation results are computed using three different measures. Let $V$ be the full collection of test set videos, and let $s_r(v)$ be the average suitability score for the audio file suggested at rank $r$ for video $v$. Then, the evaluation measures are computed as follows:

• Average suitability score of the first-ranked song:
  $\frac{1}{|V|} \sum_{i=1}^{|V|} s_1(v_i)$

• Average suitability score of the full top-5:
  $\frac{1}{|V|} \sum_{i=1}^{|V|} \frac{1}{5} \sum_{r=1}^{5} s_r(v_i)$

• Weighted average suitability score of the full top-5. Here, we apply a weighted harmonic mean score instead of an arithmetic mean:
  $\frac{1}{|V|} \sum_{i=1}^{|V|} \frac{\sum_{r=1}^{5} s_r(v_i)}{\sum_{r=1}^{5} s_r(v_i)/r}$
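For concreteness, the sketch below computes the three measures from collected grades; the input format ({video id: list of the five average rank-wise grades}) is an assumption made for the example, not the format of the official evaluation script.

    def evaluate(scores):
        # scores: {video_id: [s_1, ..., s_5]}, where s_r is the average
        # MTurk suitability grade (1-4) of the rank-r suggestion.
        n = len(scores)
        # Average suitability of the first-ranked song.
        top1 = sum(s[0] for s in scores.values()) / n
        # Average suitability over the full top-5.
        top5 = sum(sum(s) / 5.0 for s in scores.values()) / n
        # Weighted (harmonic-mean style) top-5 score, as in the third measure.
        weighted = sum(sum(s) / sum(s_r / r for r, s_r in enumerate(s, start=1))
                       for s in scores.values()) / n
        return top1, top5, weighted

    # Toy run with two videos and made-up grades.
    print(evaluate({"video_a": [4, 3, 3, 2, 1],
                    "video_b": [2, 2, 3, 1, 1]}))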
It should be stressed that this brave new task is highly novel and non-trivial in terms of 'ground truth'. This is why we purely use human ratings for the evaluation, and use the different measures above to study both the rating and the ranking aspects of the results.

Acknowledgments

MusiClef has been partially supported by EU-FP7 through the PHENICX project (no. 601166) and the PROMISE Network of Excellence (no. 258191); it is also partially supported by the Austrian Science Fund (FWF): PP22856-N23 and P25655. The work of Cynthia Liem is supported in part by the Google European Doctoral Fellowship in Multimedia.

5. REFERENCES

[1] Ircam. Analyse-Synthèse: Software. http://anasynth.ircam.fr/home/software. Accessed: Sept. 2013.
[2] B. S. Manjunath, P. Salembier, and T. Sikora, editors. Introduction to MPEG-7: Multimedia Content Description Interface. John Wiley & Sons, New York, USA, 2002.
[3] T. Pohle, D. Schnitzer, M. Schedl, P. Knees, and G. Widmer. On Rhythm and General Music Similarity. In Proc. of ISMIR, 2009.
[4] K. Seyerlehner, G. Widmer, and T. Pohle. Fusing Block-Level Features for Music Similarity Estimation. In Proc. of DAFx, 2010.