<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The MediaEval 2013 Brave New Task: Emotion in Music</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>M. Soleymani</string-name>
          <email>m.soleymani@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M.N. Caro</string-name>
          <email>mc947@drexel.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>E.M. Schmidt</string-name>
          <email>eschmidt@drexel.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Y.-H. Yang</string-name>
          <email>yang@citi.sinica.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Academia Sinica</institution>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computing, Imperial College London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Drexel University</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>Music is composed to be emotionally expressive. Emotional associations of music thus provide an especially natural feature for music indexing and recommendation. The Emotion in Music task is a brave new task addressing the emotional characterization of music. To address the difficulties of emotion annotation, we have turned to crowdsourcing using Amazon Mechanical Turk. The dataset consists entirely of Creative Commons music from the Free Music Archive, which, as the name suggests, can be shared freely without restrictions. This paper describes the dataset collection, the annotations, and the evaluation criteria, as well as the required and optional runs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The Emotion in Music task is a brave new task in the
MediaEval 2013 benchmarking initiative for multimedia
evaluation<sup>1</sup>. For tools that navigate today’s
vast digital music libraries, emotional associations provide
an especially natural domain for indexing and
recommendation. Because such a task poses a myriad of challenges,
powerful tools are required for the development of
systems that automate the prediction of emotion in music.
As such, a considerable amount of work has been dedicated
to the development of automatic music emotion recognition
(MER) systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Given the perceptual nature of human
emotion, most existing work on MER has pursued
supervised machine learning approaches, training MER systems
using emotion labels or ratings entered by human subjects
for a number of training clips.
      </p>
      <p>
        The only current evaluation task for MER is the audio mood
classification (AMC) task of the annual Music Information
Retrieval Evaluation eXchange<sup>2</sup> (MIREX) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The audio files
(totaling 600 clips) are available to the participants of the
task, who have agreed not to distribute the files for
commercial purposes. Being the only benchmark in the field
of MER so far, this contest draws many participants every
year. However, AMC describes emotions using five discrete
emotion clusters instead of affect dimensions (e.g., valence
and arousal). The clusters do not have origins in
psychology literature, and some have noted semantic or acoustic
overlap between clusters [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Furthermore, the dataset only
applies a single static rating per audio clip, which belies
the time-varying nature of music.
        <fn id="fn1"><label>1</label><p>http://www.multimediaeval.org</p></fn>
        <fn id="fn2"><label>2</label><p>http://www.music-ir.org/mirex/wiki/</p></fn>
      </p>
      <p>
        Our new benchmarking corpus employs Creative Commons<sup>3</sup>
(CC) licensed music from the Free Music Archive<sup>4</sup> (FMA),
which enables us to redistribute the content. For
annotations, we have turned to crowdsourcing using Amazon
Mechanical Turk (MTurk)<sup>5</sup>, as others have found success using
such tools to label large libraries [
        <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
        ]. In addition, we have
developed a two-stage procedure for filtering out poor-quality
workers, in which workers must first pass a test
demonstrating a thorough understanding of the task and an ability to
produce good-quality work. The final dataset spans 1000
45-second clips, and each clip is annotated by a minimum of
10 workers, making it substantially larger than any existing
music emotion dataset.
      </p>
    </sec>
    <sec id="sec-2">
      <title>TASK DESCRIPTION</title>
      <p>This task comprises two subtasks. In the first subtask, the
dynamic emotion characterization task, the emotional
dimensions, arousal and valence, should be determined for the
given song continuously in time; the temporal resolution is
one second. The second subtask, the static emotion
characterization task, requires participants to deploy multimodal
features to automatically detect arousal and valence for each
song. We developed a dataset of 1000 songs, which is split
into a development set (700 songs) and a test set (300
songs). The resulting affective features can be used in
recommendation and retrieval platforms. There are already examples
of mood-based or emotion-based online radios, e.g.,
Stereomood<sup>6</sup>.</p>
    </sec>
    <sec id="sec-3">
      <title>Run description</title>
      <p>Our task comprises two subtasks.</p>
      <p>Subtask 1, dynamic estimation: participants estimate the valence
and arousal scores continuously in time. For every segment,
which is 1 second long, valence and arousal scores between
-1 and 1 should be estimated. Each team can submit up to 3
runs for this subtask.</p>
      <p>Subtask 2, static estimation: participants estimate the valence
and arousal scores of the whole 45-second excerpt extracted from a song. Each
team can submit 3 runs for this subtask.</p>
      <p>For both subtasks, the main run may use any features
automatically extracted from the audio or from the metadata provided
by the organizers. This is the required run.
Optional runs, or general runs, additionally allow
participants to use external data.
        <fn id="fn3"><label>3</label><p>http://creativecommons.org/</p></fn>
        <fn id="fn4"><label>4</label><p>http://freemusicarchive.org/</p></fn>
        <fn id="fn5"><label>5</label><p>http://mturk.com</p></fn>
        <fn id="fn6"><label>6</label><p>www.stereomood.com</p></fn>
      </p>
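      <p>The constraints stated for the two subtasks can be checked mechanically before a run is submitted. The following is a minimal sketch, not an official validation tool: the function names and the in-memory layout (a list of valence and arousal pairs per clip) are illustrative assumptions, and <italic>n_segments</italic> is a hypothetical parameter because the number of scored one-second segments per clip is not fixed by the description above.</p>
      <preformat>
```python
from typing import List, Tuple

MAX_RUNS_PER_SUBTASK = 3           # stated limit on runs per subtask
SCORE_MIN, SCORE_MAX = -1.0, 1.0   # stated range for valence and arousal


def in_range(score: float) -> bool:
    """True when a valence or arousal score lies in [-1, 1]."""
    return SCORE_MAX >= score >= SCORE_MIN


def validate_static(pair: Tuple[float, float]) -> bool:
    """Check one static estimate: a single (valence, arousal) pair per excerpt."""
    valence, arousal = pair
    return in_range(valence) and in_range(arousal)


def validate_dynamic(clip: List[Tuple[float, float]], n_segments: int) -> bool:
    """Check one clip of a dynamic run: exactly one (valence, arousal) pair
    per 1-second segment, every score inside the allowed range.
    n_segments is a hypothetical parameter, not a value from the task."""
    if len(clip) != n_segments:
        return False
    return all(in_range(v) and in_range(a) for v, a in clip)


def validate_run_count(n_runs: int) -> bool:
    """A team may submit at most MAX_RUNS_PER_SUBTASK runs per subtask."""
    return MAX_RUNS_PER_SUBTASK >= n_runs >= 1
```
      </preformat>
      <p>For example, under these assumptions a dynamic run that scores every second of a 45-second excerpt would be checked with validate_dynamic(clip, 45).</p>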
    </sec>
    <sec id="sec-4">
      <title>DATASET AND GROUND TRUTH</title>
      <p>the annotations of the first 5 seconds. The average W is
0.23 ± 0.16 for arousal and 0.28 ± 0.21 for valence. The
observed agreement was statistically significant for arousal in
60.0% of songs and for valence in 65.8% of songs.</p>
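      <p>The agreement statistic W reported above is presumably Kendall's coefficient of concordance; the surviving text does not name it, so this reading is an assumption. Under that assumption, the sketch below shows how W could be computed for one song, with one row per worker and one column per rated time point. The function names are illustrative, tied values receive average ranks, and no tie correction is applied to W itself.</p>
      <preformat>
```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while len(order) > start:
        end = start
        # Extend the group while the next value ties with the current one.
        while len(order) > end + 1 and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def kendalls_w(ratings: Sequence[Sequence[float]]) -> float:
    """Kendall's W for m raters (rows) scoring the same n items (columns).

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the per-item rank totals from their mean. No tie correction is applied.
    """
    m = len(ratings)
    n = len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    totals = [sum(ranked[r][i] for r in range(m)) for i in range(n)]
    mean_total = m * (n + 1) / 2.0
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))
```
      </preformat>
      <p>W is 1 when all workers rank the time points identically and 0 when the rank totals are all equal, which gives a scale for reading the averages reported above.</p>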
    </sec>
    <sec id="sec-5">
      <title>BASELINE RESULTS</title>
      <p>The following features were extracted from the audio signals:
Mel-frequency cepstral coefficients (MFCCs),
octave-based spectral contrast, statistical spectrum descriptors
(SSDs), which are composed of spectral centroid, spectral flux,
spectral rolloff, and spectral flatness, in that order, and
chromagram. The following features were extracted using the
Echonest API: timbre, pitch, and loudness features.</p>
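      <p>To make the four statistical spectrum descriptors concrete, the sketch below computes them for a single frame using only the Python standard library. It is an illustration under common textbook definitions, not the extraction code used for the baseline: the exact definitions, frame length, windowing, and the rolloff fraction (0.85 here) are assumptions, and the naive DFT is only practical for short frames.</p>
      <preformat>
```python
import cmath
import math
from typing import List, Sequence


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(mag: Sequence[float], sample_rate: float, n: int) -> float:
    """Magnitude-weighted mean frequency in Hz; n is the frame length."""
    total = sum(mag)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(mag)) / total


def spectral_flux(mag: Sequence[float], prev_mag: Sequence[float]) -> float:
    """Euclidean distance between the spectra of consecutive frames."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mag, prev_mag)))


def spectral_rolloff(mag: Sequence[float], sample_rate: float, n: int,
                     fraction: float = 0.85) -> float:
    """Lowest bin frequency at which the cumulative magnitude reaches
    `fraction` of the total magnitude (0.85 is an assumed default)."""
    threshold = fraction * sum(mag)
    running = 0.0
    for k, m in enumerate(mag):
        running += m
        if running >= threshold:
            return k * sample_rate / n
    return (len(mag) - 1) * sample_rate / n


def spectral_flatness(mag: Sequence[float], eps: float = 1e-12) -> float:
    """Geometric mean of the magnitudes divided by their arithmetic mean."""
    geo = math.exp(sum(math.log(m + eps) for m in mag) / len(mag))
    return geo / (sum(mag) / len(mag) + eps)
```
      </preformat>
      <p>A pure tone yields a centroid and rolloff at the tone frequency and a flatness near 0, whereas a flat spectrum yields a flatness near 1.</p>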
      <p>Multivariate linear regression (MLR) was selected for
the baseline system because it is a simple and generalizable
prediction method. The MLR was trained on the
development set and evaluated on the test set. All
annotations, both static and dynamic, were scaled
to the range [−0.5, 0.5]. To evaluate the static results, the Euclidean distance between the
estimated arousal and valence points and their annotations (reported as mean
absolute error, or distance) and R<sup>2</sup> were
calculated. To evaluate the
dynamic results, the mean distance and Kendall’s tau (τ) ranking
correlation were used; the measures reported on the dynamically annotated data
are averaged over all the clips. The random-level
baseline, against which our results are compared, is
calculated by setting the target to the average arousal and
valence scores in the training set. The results that were significantly better
than the random level (Wilcoxon
test, p &lt; 0.01) were static arousal
estimation (distance = 0.10 ± 0.07, R<sup>2</sup> = 0.07) and dynamic arousal
estimation (distance = 0.08 ± 0.05, τ = 0.15 ± 0.22).
For static ratings, the arousal estimates
are far better than the valence estimates, which are on the order
of chance level. Consistent with this, arousal estimation is also
superior to valence estimation on the continuous, dynamic
affect estimation task.</p>
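      <p>The evaluation measures named above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the organizers' evaluation script: the source range passed to <italic>rescale</italic> is a hypothetical parameter, the distance is computed per dimension as a mean absolute error, and Kendall's tau is implemented in its tau-a form, since the text does not say which variant was used. The regression fit itself is omitted.</p>
      <preformat>
```python
from typing import List, Sequence


def rescale(values: Sequence[float], src_lo: float, src_hi: float,
            dst_lo: float = -0.5, dst_hi: float = 0.5) -> List[float]:
    """Linearly map values from [src_lo, src_hi] to [dst_lo, dst_hi]."""
    span = (dst_hi - dst_lo) / (src_hi - src_lo)
    return [dst_lo + (v - src_lo) * span for v in values]


def mean_distance(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute error; in one dimension this is the Euclidean distance."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def r_squared(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, truth))
    ss_tot = sum((t - mean_t) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                score += 1      # concordant pair
            elif product != 0:
                score -= 1      # discordant pair; ties contribute nothing
    return score / (n * (n - 1) / 2)


def mean_baseline(train_targets: Sequence[float], n_test: int) -> List[float]:
    """Random-level baseline: predict the training mean for every test item."""
    mean = sum(train_targets) / len(train_targets)
    return [mean] * n_test
```
      </preformat>
      <p>For the dynamic subtask, these functions would be applied to each clip's per-second series and the resulting values averaged over clips, as described above.</p>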
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Downie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Laurier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bay</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Ehmann</surname>
          </string-name>
          .
          <article-title>The 2007 MIREX audio mood classification task: Lessons learned</article-title>
          .
          <source>In Proc. Int. Soc. Music Info. Retrieval Conf.</source>
          , pages
          <fpage>462</fpage>
          -
          <lpage>467</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Emelle</surname>
          </string-name>
          .
          <article-title>Moodswings: A collaborative game for music mood label collection</article-title>
          .
          <source>In Proc. Int. Soc. Music Info. Retrieval Conf.</source>
          , pages
          <fpage>231</fpage>
          -
          <lpage>236</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Laurier</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Herrera</surname>
          </string-name>
          .
          <article-title>Audio music mood classification using support vector machine</article-title>
          .
          <source>In MIREX task on Audio Mood Classification</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Crowdsourcing for affective annotation of video: Development of a viewer-reported boredom corpus</article-title>
          .
          <source>In Workshop on Crowdsourcing for Search Evaluation, SIGIR 2010</source>
          , Geneva, Switzerland,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Speck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Morton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>A comparative study of collaborative vs. traditional musical mood annotation</article-title>
          .
          <source>In Proc. Int. Soc. Music Info. Retrieval Conf.</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Music Emotion Recognition</article-title>
          . CRC Press, Boca Raton, Florida,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>