=Paper=
{{Paper
|id=Vol-1263/paper33
|storemode=property
|title=Emotion in Music Task at MediaEval 2014
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_33.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AljanakiYS14
}}
==Emotion in Music Task at MediaEval 2014==
Anna Aljanaki (Information and Computing Sciences, Utrecht University, the Netherlands, a.aljanaki@uu.nl), Yi-Hsuan Yang (Academia Sinica, Taipei, Taiwan, yang@citi.sinica.edu.tw), Mohammad Soleymani (Computer Science Department, University of Geneva, Switzerland, mohammad.soleymani@unige.ch)

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

===ABSTRACT===
Emotional expression is an important property of music, and its emotional characteristics are thus especially natural for music indexing and recommendation. The Emotion in Music task addresses automatic music emotion prediction and is held for the second year in 2014. Compared to the previous year, we modified the task by offering a new feature development subtask and releasing a new evaluation set. We employed a crowdsourcing approach to collect the data, using Amazon Mechanical Turk. The dataset consists of music licensed under Creative Commons from the Free Music Archive, which can be shared freely without restrictions. In this paper we describe the dataset collection, the annotations, and the evaluation criteria, as well as the required and optional runs.

===1. INTRODUCTION===
Huge music libraries create a demand for tools providing automatic music classification by various parameters, such as genre, instrumentation, and emotion. Among these, emotion is one of the most important classification criteria. This task presents many challenges, starting from its internal ambiguity and ending with audio processing difficulties [8]. As musical emotion is subjective, most existing work on music emotion recognition (MER) relies on supervised machine learning approaches, training MER systems with emotion labels provided by human annotators. Currently, many researchers collect their own ground-truth data, which makes direct comparison between their approaches impossible. A benchmark is necessary to facilitate cross-site comparison. The Emotion in Music task appears for the second time in the MediaEval benchmarking campaign for multimedia evaluation (http://www.multimediaeval.org) and is designed to serve this purpose.

The only other current evaluation task for MER is the audio mood classification (AMC) task of the annual music information retrieval evaluation exchange (MIREX, http://www.music-ir.org/mirex/wiki/) [1]. In this task, 600 audio files are provided to the participants, who have agreed not to distribute the files for commercial purposes. However, AMC has been criticized for using an emotional model that is not based on psychological research. Namely, this benchmark uses five discrete emotion clusters, derived from cluster analysis of online tags, instead of more widely accepted dimensional or categorical models of emotion. It has been noted that there is semantic and acoustic overlap between the clusters [4]. Furthermore, the dataset provides only a single static rating per audio clip, which belies the time-varying nature of music.

In our corpus we employ music licensed under Creative Commons (CC, http://creativecommons.org/) from the Free Music Archive (FMA, http://freemusicarchive.org/), which enables us to redistribute the content. We do not use volunteers or online tag mining to collect the annotations, but pay annotators to perform the task via Amazon Mechanical Turk (MTurk, http://mturk.com), in a similar way as [2, 7]. We filter out poor-quality workers by requiring them to first pass a test demonstrating a thorough understanding of the task and an ability to produce good-quality work. The final dataset spans 1744 clips of 45 seconds, each annotated by a minimum of 10 workers, which makes it substantially larger than any existing music emotion dataset with continuous annotations.

===2. TASK DESCRIPTION===
This year, similar to last year, the task comprises two subtasks. The first subtask, dynamic emotion characterization, is the main task. The second subtask, feature design, is introduced for the first time this year. New features, which either have not been developed before or have not been applied to MER, should be proposed and applied to automatically detect arousal and valence for the whole song. Systems are trained on a development set of 744 songs and evaluated on an evaluation set of 1000 songs.

====2.1 Run description====
In Subtask 1, dynamic estimation, the participants estimate the valence and arousal scores continuously in time for every segment (half a second long) on a scale from -1 to 1. In Subtask 2, feature design, the participants develop new features and predict the valence and arousal scores of whole 45-second excerpts (i.e., statically, as an average over the excerpt). Only one new feature will be evaluated in each run. For both tasks together, each team can submit up to 5 runs in total.

For the main (dynamic subtask) run, any features automatically extracted from the audio or from the metadata provided by the organizers are allowed. For the dynamic emotional analysis we will use the Pearson correlation calculated per song and averaged for the final value. We will also report the Root-Mean-Square Error (RMSE). We will rank the submissions based on the averaged correlations. Whenever the difference based on a one-sided Wilcoxon test is not significant (p > 0.05), we will use the RMSE to break the tie. The feature design task will also be evaluated based on the Pearson correlation averaged across songs, over three runs. The participants can apply any non-linear transformation to their designed features to maximize the correlation.
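The following is a minimal sketch of the evaluation protocol described above for the dynamic subtask: per-song Pearson correlation (averaged across songs) as the ranking metric, RMSE as the secondary metric, and a one-sided Wilcoxon test on per-song correlations as the tie-break criterion. It is not the official evaluation script; all function and variable names are illustrative.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import pearsonr, wilcoxon


def evaluate_dynamic(predictions, ground_truth):
    """predictions, ground_truth: dicts mapping song_id -> 1-D array of
    per-0.5s valence (or arousal) values on the [-1, 1] scale."""
    correlations, rmses = [], []
    for song_id, truth in ground_truth.items():
        pred = predictions[song_id]
        r, _ = pearsonr(pred, truth)                      # correlation for this song
        correlations.append(r)
        rmses.append(np.sqrt(np.mean((pred - truth) ** 2)))
    # Submissions are ranked by the correlation averaged across songs;
    # the per-song RMSE values are reported alongside it.
    return np.array(correlations), np.array(rmses)


def compare_submissions(corr_a, corr_b):
    """One-sided Wilcoxon test on per-song correlations; if the difference
    is not significant (p > 0.05), RMSE is used to break the tie."""
    _, p = wilcoxon(corr_a, corr_b, alternative='greater')
    return p
</syntaxhighlight>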
===3. DATASET AND GROUND TRUTH===
For the description of the development set we refer to [5]. This year we collected more data in a similar way, but included external sources for metadata. We used the last.fm API to collect tags for matching songs from FMA. The songs that were already in last year's corpus were excluded. We then chose the 1000 songs with the largest number of tags. Each song belongs to one or several genres from the following list: Soul, Blues, Electronic, Rock, Classical, Hip-Hop, International, Experimental, Folk, Jazz, Country, and Pop. We excluded songs from the genres Spoken, Old-time historic, and Experimental (in case the latter was the only genre the song belonged to). We also manually checked the music and excluded files with bad recording quality or files containing speech or noise rather than music. For each artist, we selected at most 5 songs to be included in the dataset.

To assure adequate quality of the ground truth, we created a procedure to select only the workers who are motivated and qualified to do the task, following current state-of-the-art crowdsourcing approaches [6]. All workers had to pass a qualification test that was later evaluated manually. It consisted of three stages. Prior to the test, participants were provided with the definitions of arousal and valence and could watch an instruction video. In the first stage, they listened to two short music clips containing a distinctive emotional shift and annotated arousal and valence continuously. In the second stage, workers described the emotional shift, and in the third stage, they described the song and indicated its genre. We also collected anonymized personal information from the workers, including gender, age, and location, and asked them to take a short personality test.

Based on the quality of the musical descriptions and the correctness of their answers in the qualification task, we granted qualifications to the workers, after which they could proceed to the second step (the main task). The main task involved annotating the songs continuously over time, once for arousal and once for valence, which in total constituted 334 micro-tasks. Each micro-task involved annotating 3 audio clips of 45 seconds on the arousal and valence scales both dynamically and statically (as a whole). The workers also characterized the song in emotional terms and reported the confidence of their answers, as well as their familiarity with and liking of the music. Workers were paid $0.25 USD for the qualification HITs and $0.40 USD for each main HIT that they successfully completed. On average, each HIT took 10 minutes.

To measure the inter-annotator agreement for the static annotations, we calculated Krippendorff's alpha on an ordinal scale. The values were 0.22 for valence and 0.37 for arousal, which are in the range of fair agreement. For the dynamic annotations, we used Kendall's coefficient of concordance (Kendall's W) with corrected tied ranks. Kendall's W was calculated for each song separately after discarding the annotations of the first 15 seconds. The average W is 0.2 ± 0.13 for arousal and 0.16 ± 0.11 for valence, which indicates weak agreement.
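As an illustration of how the dynamic agreement figures above can be computed, the sketch below re-implements Kendall's coefficient of concordance with tie correction for a single song (the annotations of the first 15 seconds would be dropped before calling it). This is an illustrative re-implementation under the standard tie-corrected formula, not the organizers' script.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import rankdata


def kendalls_w(annotations):
    """Kendall's W with tie correction.
    annotations: 2-D array of shape (n_raters, n_timepoints)."""
    m, n = annotations.shape
    # Rank each rater's time series; ties receive their mean rank.
    ranks = np.array([rankdata(row) for row in annotations])
    # Sum of squared deviations of the per-timepoint rank totals from their mean.
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    # Tie correction term: sum of (t^3 - t) over every rater's tied groups.
    t = 0.0
    for row in ranks:
        _, counts = np.unique(row, return_counts=True)
        t += (counts ** 3 - counts).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * t)
</syntaxhighlight>

Stacking the workers' half-second arousal (or valence) curves for one song into a (workers × timepoints) array and passing it to this function gives the per-song W; averaging over songs yields a statistic comparable to the one reported above.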
===4. BASELINE RESULTS===
For the baseline, we used MIRToolbox [3] to extract 5 features (spectral flux, harmonic change detection function, loudness, roughness, and zero-crossing rate) from non-overlapping segments of 500 ms, with a frame size of 50 ms. We used multilinear regression, as we did last year. For valence, the correlation averaged across songs was 0.11 ± 0.34 and the RMSE was 0.19 ± 0.11. For arousal, the correlation was 0.18 ± 0.36 and the RMSE was 0.27 ± 0.12. Compared to last year (for arousal, r = 0.16 ± 0.35; for valence, r = 0.06 ± 0.3), the baseline is higher. We also calculated a random baseline by averaging all the predictions. The RMSE of this random average baseline is 0.18 ± 0.11 for valence and 0.21 ± 0.12 for arousal, which means that in terms of RMSE the random baseline performs better than the feature-based baseline.
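For readers who want to assemble a comparable (not identical) baseline in Python rather than MATLAB, the sketch below maps per-segment audio features to emotion values with multiple linear regression. The feature set is only a stand-in (zero-crossing rate and RMS energy via librosa) for the five MIRToolbox features named above, and the 50 ms sub-framing is simplified to one value per 500 ms segment, so the resulting numbers will differ; file paths and the training arrays are assumed to be prepared by the reader from the development set.

<syntaxhighlight lang="python">
import numpy as np
import librosa
from sklearn.linear_model import LinearRegression


def segment_features(path, sr=44100, seg=0.5):
    """One feature vector per non-overlapping 500 ms segment of the clip."""
    y, sr = librosa.load(path, sr=sr)
    hop = int(seg * sr)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=hop, hop_length=hop)[0]
    rms = librosa.feature.rms(y=y, frame_length=hop, hop_length=hop)[0]
    n = min(len(zcr), len(rms))
    return np.column_stack([zcr[:n], rms[:n]])        # shape: (n_segments, n_features)


# X_train: stacked per-segment features for the development songs;
# y_train: the matching per-segment arousal (or valence) annotations,
# averaged over workers. Both are assumed to be built beforehand.
model = LinearRegression()
# model.fit(X_train, y_train)
# predictions = model.predict(segment_features("some_clip.mp3"))
</syntaxhighlight>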
===5. ACKNOWLEDGMENTS===
We are grateful to Sung-Yen Liu from Academia Sinica for helping with the task organization. This research was supported in part by the European Research Area, the CVML Lab (http://cvml.unige.ch), University of Geneva, and by the FES project COMMIT/.

===6. REFERENCES===
[1] X. Hu, J. S. Downie, C. Laurier, M. Bay, and A. F. Ehmann. The 2007 MIREX audio mood classification task: Lessons learned. In Proc. Int. Soc. Music Info. Retrieval Conf., pages 462–467, 2008.

[2] Y. E. Kim, E. Schmidt, and L. Emelle. Moodswings: A collaborative game for music mood label collection. In Proc. Int. Soc. Music Info. Retrieval Conf., pages 231–236, 2008.

[3] O. Lartillot and P. Toiviainen. A Matlab toolbox for musical feature extraction from audio. In International Conference on Digital Audio Effects, Bordeaux, 2007.

[4] C. Laurier and P. Herrera. Audio music mood classification using support vector machine. In MIREX task on Audio Mood Classification, 2007.

[5] M. Soleymani, M. N. Caro, E. M. Schmidt, C.-Y. Sha, and Y.-H. Yang. 1000 songs for emotional analysis of music. In Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia, CrowdMM '13, pages 1–6, New York, NY, USA, 2013. ACM.

[6] M. Soleymani and M. Larson. Crowdsourcing for affective annotation of video: Development of a viewer-reported boredom corpus. In Workshop on Crowdsourcing for Search Evaluation, SIGIR 2010, Geneva, Switzerland, 2010.

[7] J. A. Speck, E. M. Schmidt, B. G. Morton, and Y. E. Kim. A comparative study of collaborative vs. traditional musical mood annotation. In Proc. Int. Soc. Music Info. Retrieval Conf., 2011.

[8] Y.-H. Yang and H. H. Chen. Music Emotion Recognition. CRC Press, Boca Raton, Florida, 2011.