<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Aljanaki</string-name>
          <email>a.aljanaki@uu.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi-Hsuan Yang</string-name>
          <email>yang@citi.sinica.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Soleymani</string-name>
          <email>mohammad.soleymani@unige.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Academia Sinica</institution>
          ,
          <addr-line>Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science Dept., University of Geneva</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Information and Computing Sciences, Utrecht University</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>Emotional expression is an important property of music, which makes emotional characteristics especially natural criteria for music indexing and recommendation. The Emotion in Music task addresses automatic music emotion prediction and is held for the second year in 2014. Compared to the previous year, we modified the task by offering a new feature development subtask and by releasing a new evaluation set. We collected the annotations through crowdsourcing, using Amazon Mechanical Turk. The dataset consists of music licensed under Creative Commons from the Free Music Archive, and can therefore be shared freely without restrictions. In this paper we describe the dataset collection, the annotations, and the evaluation criteria, as well as the required and optional runs for the two subtasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Huge music libraries create a demand for tools providing
automatic music classification by various parameters, such
as genre, instrumentation, and emotion. Among these, emotion
is one of the most important classification criteria. This task
presents many challenges, ranging from the inherent ambiguity
of emotion to audio processing difficulties [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As
musical emotion is subjective, most existing work on music
emotion recognition (MER) relies on supervised machine
learning, training MER systems with emotion labels provided
by human annotators. Currently, many researchers collect their
own ground-truth data, which makes direct comparison between
their approaches impossible. A benchmark is necessary to
facilitate such cross-site comparison. The Emotion in Music
task appears for the second time in the MediaEval benchmarking
campaign for multimedia evaluation
(http://www.multimediaeval.org) and is designed to serve this
purpose.
      </p>
      <p>
        The only other current evaluation task for MER is the
audio mood classification (AMC) task of the annual Music
Information Retrieval Evaluation eXchange (MIREX,
http://www.music-ir.org/mirex/wiki/) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In
this task, 600 audio files are provided to the participants,
who have agreed not to distribute the files for commercial
purposes. However, AMC has been criticized for using an
emotional model that is not grounded in psychological
research. Namely, this benchmark uses five discrete emotion
clusters, derived from a cluster analysis of online tags,
instead of the more widely accepted dimensional or categorical
models of emotion. It has been noted that there is semantic
and acoustic overlap between the clusters [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Furthermore, the dataset
provides only a single static rating per audio clip, which
does not reflect the time-varying nature of music.
      </p>
      <p>
        In our corpus we use music licensed under Creative
Commons (CC, http://creativecommons.org/) from the Free Music
Archive (FMA, http://freemusicarchive.org/), which enables us
to redistribute the content. We do not use volunteers or
online tag mining to collect the annotations, but pay
annotators to perform the task via Amazon Mechanical Turk
(MTurk, http://mturk.com), in a similar way as [
        <xref ref-type="bibr" rid="ref2 ref7">2, 7</xref>
        ]. We
filter out poor-quality workers by requiring them first to
pass a test demonstrating a thorough understanding of the task
and an ability to produce good-quality work. The final dataset
spans 1744 clips of 45 seconds, and each clip is annotated by
a minimum of 10 workers, making it substantially larger than
any existing music emotion dataset with continuous
annotations.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. TASK DESCRIPTION</title>
      <p>This year, similar to last year, the task comprises two
subtasks. The first subtask is dynamic emotion characterization
(the main task). The second subtask, feature design, is
introduced for the first time this year: new features, which
either have not been developed before or have not been applied
to MER, should be proposed and used to automatically detect
arousal and valence for the whole song. Systems will be trained
on a development set of 744 songs and evaluated on an
evaluation set of 1000 songs.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Run description</title>
      <p>In Subtask 1, dynamic estimation, the participants will
estimate the valence and arousal scores continuously in time
for every half-second segment, on a scale from -1 to 1. In
Subtask 2, feature design, the participants will develop new
features and predict the valence and arousal scores of whole
45-second excerpts (i.e., statically, with one value per
excerpt). Only one new feature will be evaluated in each run.
Across both subtasks, each team can submit up to 5 runs in
total.</p>
      <p>For the main run (dynamic subtask), any features
automatically extracted from the audio or from the metadata
provided by the organizers are allowed. For the dynamic
emotional analysis we will use the Pearson correlation,
calculated per song and averaged across songs for the final
value. We will also report the root-mean-square error (RMSE).
We will rank the submissions based on the averaged
correlations; whenever the difference based on a one-sided
Wilcoxon test is not significant (p&gt;0.05), we will use the
RMSE to break the tie. The feature design subtask will also be
evaluated by the Pearson correlation averaged across songs,
with up to three runs. The participants can apply any
non-linear transformation to their designed features to
maximize the correlation.</p>
    </sec>
    <sec id="sec-4">
      <title>3. DATASET AND GROUND TRUTH</title>
      <p>
        For the description of the development set we refer to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
This year we collected more data in a similar way, but also
included external sources for metadata: we used the last.fm
API to collect tags for matching songs from FMA. Songs that
were already in last year’s corpus were excluded. We then
chose the 1000 songs with the largest number of tags. Each
song belongs to one or several genres from the following list:
Soul, Blues, Electronic, Rock, Classical, Hip-Hop,
International, Experimental, Folk, Jazz, Country, and Pop. We
excluded songs from the genres Spoken and Old-time historic,
as well as Experimental when it was the only genre a song
belonged to. We also manually checked the music and excluded
files with bad recording quality and files containing speech
or noise rather than music. For each artist, we selected at
most 5 songs for inclusion in the dataset.
      </p>
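      <p>A compact sketch of this selection procedure is given below.
The metadata field names (tags, genres, artist) are our assumptions
about how the FMA and last.fm metadata could be represented; the
genre names are quoted as in the text above.</p>
      <preformat>
def select_songs(candidates, n=1000, per_artist=5):
    """candidates: dicts with 'tags', 'genres' and 'artist' keys
    (hypothetical field names). Keeps the n songs with the most
    last.fm tags, skipping excluded genres and capping each artist."""
    excluded = {'Spoken', 'Old-time historic'}
    picked, artist_count = [], {}
    for song in sorted(candidates, key=lambda s: len(s['tags']),
                       reverse=True):
        genres = set(song['genres'])
        if genres.intersection(excluded) or genres == {'Experimental'}:
            continue  # drop excluded genres and Experimental-only songs
        artist = song['artist']
        if artist_count.get(artist, 0) == per_artist:
            continue  # at most 5 songs per artist
        artist_count[artist] = artist_count.get(artist, 0) + 1
        picked.append(song)
        if len(picked) == n:
            break
    return picked
      </preformat>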
      <p>
        To ensure adequate quality of the ground truth, we
created a procedure to select only those workers who are
motivated and qualified to do the task, following current
state-of-the-art crowdsourcing approaches [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. All the workers had
to pass a qualification test that was later evaluated
manually. It consisted of three stages. Prior to the test,
participants were provided with the definitions of arousal and
valence, and could watch an instruction video. In the first
stage, they listened to two short music clips, each containing
a distinct emotional shift, and annotated arousal and valence
continuously. In the second stage, workers described the
emotional shift, and in the third stage, they described the
song and indicated its genre. We also collected anonymized
personal information from the workers, including gender, age,
and location, and asked them to take a short personality test.
      </p>
      <p>Based on the quality of their musical descriptions and the
correctness of their answers in the qualification task, we
granted qualifications to the workers, after which they could
proceed to the second step (the main task). The main task
involved annotating the songs continuously over time, once for
arousal and once for valence, which in total constituted 334
micro-tasks. Each micro-task involved annotating 3 audio clips
of 45 seconds on the arousal and valence scales, both
dynamically and statically (for the clip as a whole). The
workers also characterized each song in emotional terms and
reported their confidence in their answers, as well as their
familiarity with and liking of the music. Workers were paid
US$0.25 for the qualification HITs and US$0.40 for each main
HIT that they successfully completed. On average, each HIT
took 10 minutes.</p>
      <p>To measure the inter-annotator agreement for the static
annotations, we calculated Krippendorff’s alpha on an ordinal
scale. The values were 0.22 for valence and 0.37 for arousal,
which are in the range of fair agreement. For the dynamic
annotations, we used Kendall’s coefficient of concordance
(Kendall’s W) with correction for tied ranks. Kendall’s W was
calculated for each song separately, after discarding the
annotations of the first 15 seconds. The average W is
0.2 ± 0.13 for arousal and 0.16 ± 0.11 for valence, which
indicates weak agreement.</p>
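      <p>For reference, a minimal implementation of Kendall’s W with
the standard correction for tied ranks is sketched below. It is our
illustrative re-implementation, applied per song to the matrix of
dynamic annotations after the first 15 seconds are discarded.</p>
      <preformat>
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """ratings: (m, n) array of m annotators rating the n half-second
    segments of one song. Returns W between 0 and 1."""
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)  # ties get mean ranks
    rank_sums = ranks.sum(axis=0)
    s = np.sum((rank_sums - rank_sums.mean()) ** 2)
    # tie correction: sum of (t**3 - t) over every group of t tied
    # ranks, accumulated across all annotators
    t = 0.0
    for row in ranks:
        _, counts = np.unique(row, return_counts=True)
        t += np.sum(counts ** 3 - counts)
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * t)
      </preformat>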
    </sec>
    <sec id="sec-5">
      <title>4. BASELINE RESULTS</title>
      <p>
        For the baseline, we used MIRToolbox [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to extract 5
features (spectral flux, harmonic change detection function,
loudness, roughness, and zero-crossing rate) from
non-overlapping segments of 500 ms, with a frame size of
50 ms. We used multilinear regression, as we did last year.
For valence, the correlation averaged across songs was
0.11 ± 0.34 and the RMSE was 0.19 ± 0.11. For arousal, the
correlation was 0.18 ± 0.36 and the RMSE was 0.27 ± 0.12.
Compared to last year (r = 0.16 ± 0.35 for arousal and
r = 0.06 ± 0.3 for valence), the baseline is higher. We also
calculated a naive average baseline by averaging all the
annotations and using this mean as a constant prediction. The
RMSE for this average baseline is 0.18 ± 0.11 for valence and
0.21 ± 0.12 for arousal, which means that in terms of RMSE the
average baseline performs better.
      </p>
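      <p>A minimal sketch of this regression baseline is given below,
assuming the 5 MIRToolbox features have already been extracted into
one row per 500 ms segment; the feature extraction itself (done with
MIRToolbox in Matlab) is not reproduced here.</p>
      <preformat>
import numpy as np

def fit_baseline(X_train, y_train):
    """Ordinary least-squares fit of y = X w + b, with X_train of
    shape (num_segments, 5) and y_train the per-segment arousal or
    valence annotations."""
    A = np.hstack([X_train, np.ones((len(X_train), 1))])  # intercept column
    w, _, _, _ = np.linalg.lstsq(A, y_train, rcond=None)
    return w

def predict_baseline(w, X):
    """Predicted valence or arousal for every 500 ms segment."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return A @ w
      </preformat>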
    </sec>
    <sec id="sec-6">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>We are grateful to Sung-Yen Liu from Academia Sinica
for helping with the task organization. This research was
supported in part by the European Research Area, the CVML
Lab, University of Geneva, and by the FES project
COMMIT/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Downie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Laurier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bay</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Ehmann</surname>
          </string-name>
          .
          <article-title>The 2007 MIREX audio mood classification task: Lessons learned</article-title>
          .
          <source>In Proc. Int. Soc. Music Info. Retrieval Conf.</source>
          , pages
          <fpage>462</fpage>
          -
          <lpage>467</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Emelle</surname>
          </string-name>
          .
          <article-title>Moodswings: A collaborative game for music mood label collection</article-title>
          .
          <source>In Proc. Int. Soc. Music Info. Retrieval Conf.</source>
          , pages
          <fpage>231</fpage>
          -
          <lpage>236</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Lartillot</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Toiviainen</surname>
          </string-name>
          .
          <article-title>A Matlab toolbox for musical feature extraction from audio</article-title>
          .
          <source>In International Conference on Digital Audio Effects, Bordeaux</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Laurier</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Herrera</surname>
          </string-name>
          .
          <article-title>Audio music mood classification using support vector machine</article-title>
          .
          <source>In MIREX task on Audio Mood Classification</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Caro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Sha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>1000 songs for emotional analysis of music</article-title>
          .
          <source>In Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia</source>
          ,
          <source>CrowdMM '13</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Crowdsourcing for affective annotation of video: Development of a viewer-reported boredom corpus</article-title>
          .
          <source>In Workshop on Crowdsourcing for Search Evaluation, SIGIR 2010</source>
          , Geneva, Switzerland,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Speck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Morton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>A comparative study of collaborative vs. traditional musical mood annotation</article-title>
          .
          <source>In Proc. Int. Soc. Music Info. Retrieval Conf.</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <source>Music Emotion Recognition</source>
          . CRC Press, Boca Raton, Florida,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>