=Paper=
{{Paper
|id=Vol-1739/MediaEval_2016_paper_10
|storemode=property
|title=Emotion in Music task: Lessons Learned
|pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_10.pdf
|volume=Vol-1739
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AljanakiYS16
}}
==Emotion in Music task: Lessons Learned==
Anna Aljanaki (University of Geneva, Geneva, Switzerland) – aljanaki@gmail.com
Yi-Hsuan Yang (Academia Sinica, Taipei, Taiwan) – yang@citi.sinica.edu.tw
Mohammad Soleymani (University of Geneva, Geneva, Switzerland) – mohammad.soleymani@unige.ch

Copyright held by the authors/owners. MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

The Emotion in Music task was organized within the MediaEval benchmarking campaign during three consecutive years, from 2013 to 2015. In this paper we describe the challenges we faced and the solutions we found. We used crowdsourcing on Amazon Mechanical Turk to annotate a corpus of music pieces with continuous (per-second) emotion annotations. Assuring sufficient quality of the data requires careful attention to the annotation process on Mechanical Turk. Labeling music with emotion continuously proved to be a very difficult task for listeners: both the time delay in their responses and the demand for absolute ratings degraded the quality of the data. We suggest certain transformations to alleviate these problems. Finally, the short length of the annotated segments (0.5-1 s) led task participants to classify music on an equally small time scale, which only allowed them to capture changes in dynamics and timbre, but not musically meaningful harmonic, melodic and other changes, which occur on a larger time scale. LSTM-RNN based methods, which can incorporate a larger context, gave better results than other methods, but the proposed methods still did not show significant improvement over the years, and the task was concluded.

1. INTRODUCTION

The Emotion in Music task was first proposed by M. Soleymani, M. N. Caro, E. M. Schmidt and Y.-H. Yang in 2013 [9]. The task targeted music emotion recognition algorithms for music indexing and recommendation, predominantly for popular music. The most common paradigm for music retrieval by emotion is one in which a single emotion is assigned to an entire piece of music. However, a piece of music exists in time, and assigning just one emotion to it is most of the time incorrect. Therefore, music excerpts were annotated continuously, using a paradigm that was first suggested for studying dynamics and general emotionality in music, the Continuous Response Digital Interface (CRDI) [3]. The CRDI was later adapted by E. Schubert to record emotional responses on valence and arousal scales [7]. In the first year of the task, static and dynamic subtasks existed side by side. However, the static subtask was later dropped as less challenging. The decision to focus on continuous tracking of emotion had both pros and cons. On the bright side, it made the Emotion in Music benchmark very distinct from the existing Mood Classification task at the MIREX benchmark (http://www.music-ir.org/mirex). Continuous emotion recognition is also arguably less researched than static emotion recognition. However, the pragmatic, utilitarian aspect of the task valued in the MediaEval community became less prominent: there are currently far fewer applications of, and less interest in, automatic recognition of emotion varying over time.

2. CROWDSOURCING THE DATA

Music annotation with emotion is a time-consuming task, which often generates very inconsistent responses even with conscientious annotators. This makes it difficult to verify the responses, because many inconsistencies can be attributed to individual perception. We used crowdsourcing on Amazon Mechanical Turk (AMT) to annotate the data: the workers were paid to annotate the music and had to pass a test before being admitted to the task. In the first two years, we did not monitor the quality of the work after the test was passed. We tried to estimate a lower bound on the number of low-quality workers by counting only the people who did not move their mouse at all while annotating. Some of the songs may not contain any emotional change, but at least some initial movement from the start position is necessary before the annotation stabilizes. In 2014, 99 annotators annotated 1000 pieces of music; only two of them did not move their mouse at all, and they annotated only a small number of songs. However, the agreement between the annotators was not very good in either 2013 or 2014 (less than 0.3 in Cronbach's α).
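
As an illustration of the two quality checks described above, the sketch below flags annotators whose continuous trace never leaves its starting value and computes Cronbach's α over the per-second annotations of a single song, treating annotators as items and time points as cases. It is only a minimal sketch under an assumed data layout (one row per annotator, one column per second); the function names are illustrative and not part of the actual task infrastructure.

<syntaxhighlight lang="python">
# Sketch of the two data-quality checks described above. Assumed layout:
# one row per annotator, one column per second of a song, values in [-1, 1].
import numpy as np

def no_movement_annotators(traces: np.ndarray) -> np.ndarray:
    """Indices of annotators whose trace never leaves its starting value."""
    return np.where(np.all(traces == traces[:, :1], axis=1))[0]

def cronbach_alpha(traces: np.ndarray) -> float:
    """Cronbach's alpha, treating annotators as 'items' and seconds as 'cases'."""
    k = traces.shape[0]                          # number of annotators
    item_vars = traces.var(axis=1, ddof=1)       # variance of each annotator's trace
    total_var = traces.sum(axis=0).var(ddof=1)   # variance of the summed trace
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(5, 45)).cumsum(axis=1) * 0.01  # 5 annotators, 45 seconds
    demo[4, :] = demo[4, 0]                                # one "frozen" annotator
    print(no_movement_annotators(demo))                    # -> [4]
    print(round(cronbach_alpha(demo[:4]), 2))              # agreement of the rest
</syntaxhighlight>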
In 2015, we changed the procedure to a more lab-like setting by hiring five annotators to annotate the whole dataset, half of them in the lab and half on AMT [1]. The quality was much better. This could also be attributed to the other changes, such as choosing full-length songs, choosing the best annotators of the previous years, negotiating fair compensation in advance on a Turker forum (http://www.mturkgrind.com/) and introducing a preliminary listening stage.
Despite having highly qualified annotators, the following problems were not resolved:

1. Absolute-scale ratings. The ratings had to be given on an absolute scale while estimating changes in arousal and valence. Though the annotators often agreed on the direction of change, the magnitude of change was often different. We suggest shifting each annotation so that its mean lies where the annotator indicated the mean beforehand (sketched below).

2. Reaction time. Human annotators have a reaction time. The biggest time lag is observed at the beginning of the annotation (around 13 seconds), but a small time lag is also present after every musical change. The beginnings of the annotations had to be deleted as unreliable.

3. Time scale. Some of the annotators would react to every beat and every note, while others would only consider changing their arousal or valence at section boundaries.
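
The following minimal sketch illustrates the two transformations suggested in items 1 and 2: shifting a continuous annotation so that its mean matches the value the annotator indicated beforehand, and trimming the unreliable opening seconds caused by reaction time. The 13-second default and the function names are illustrative assumptions, not the exact procedure applied to the released data sets.

<syntaxhighlight lang="python">
# Sketch of the suggested post-processing. Assumed input: one annotation value
# per second in [-1, 1], plus a single static rating given by the annotator.
import numpy as np

def shift_to_indicated_mean(annotation: np.ndarray, indicated_mean: float) -> np.ndarray:
    """Shift the whole trace so its mean equals the value the annotator
    indicated beforehand, keeping the relative changes intact."""
    return annotation - annotation.mean() + indicated_mean

def trim_reaction_lag(annotation: np.ndarray, lag_seconds: int = 13,
                      rate_hz: float = 1.0) -> np.ndarray:
    """Drop the opening part of the trace, which is dominated by reaction time."""
    return annotation[int(lag_seconds * rate_hz):]

if __name__ == "__main__":
    trace = np.linspace(-0.2, 0.6, 45)   # a toy 45-second arousal trace, mean 0.2
    print(round(shift_to_indicated_mean(trace, 0.0).mean(), 6))  # -> 0.0 (up to rounding)
    print(trim_reaction_lag(trace).shape)                        # -> (32,)
</syntaxhighlight>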
3. SUGGESTED SOLUTIONS

Participants received the data as a sequence of features and annotations at either a half-second or a one-second frame rate. Many participants extracted their own features, but the windows for feature extraction were almost always smaller than 0.5-1 s, and the features were very low-level, mostly relating to loudness and to timbral properties of the sound (energy distribution across the spectrum).
The best solutions in all the years were obtained using Long Short-Term Memory Recurrent Neural Networks. Although the arousal prediction performance improved over the three years (see Table 1), the accuracy obtained when predicting valence did not improve much (Table 2).

Table 1: Winning algorithms on arousal, ordered by Spearman's ρ. BLSTM-RNN – Bi-directional Long Short-Term Memory Recurrent Neural Network.

  Method                   ρ           RMSE
  2013, BLSTM-RNN [10]     .31 ± .37   .08 ± .05
  2014, LSTM [2]           .35 ± .45   .10 ± .05
  2015, BLSTM-RNN [11]     .66 ± .25   .12 ± .06

Table 2: Winning algorithms on valence, ordered by Spearman's ρ.

  Method                   ρ           RMSE
  2013, BLSTM-RNN [10]     .19 ± .43   .08 ± .04
  2014, LSTM [2]           .20 ± .49   .08 ± .05
  2015, BLSTM-RNN [11]     .17 ± .09   .12 ± .54

It is a known issue that modeling valence in music is more challenging, both because of the higher subjectivity associated with valence perception and in part because of the absence of salient valence-related audio features that can be reliably computed by state-of-the-art music signal processing algorithms [12, 5, 4]. The almost twofold improvement in arousal can also be attributed to the improvement in the quality and consistency of the data. In 2015, the situation with valence became even worse, because we invested extra effort into assembling the data set in such a way that valence and arousal would not be correlated (by picking more songs from the upper-left ("angry") and lower-right ("serene") quadrants). Because of the difference between the development and evaluation sets' distributions, the evaluation results were inaccurate in 2015. We trained an LSTM-RNN with the features supplied by the participants and the evaluation set data. Using 20-fold cross-validation, we obtained a more accurate estimate of the state-of-the-art performance on valence. The best result for valence detection on the 2015 test set was achieved using the JUNLP team's features (ρ = .27 ± .13 and RMSE = .19 ± .35) [6]. The JUNLP team used feature reduction to find the features which were most important for valence. However, the result is still much worse than the one obtained for arousal. A very interesting finding was that even though the participants proposed sophisticated procedures for feature dimensionality reduction as well as BLSTM-RNNs, an almost equally good result could be obtained for arousal by using just eight low-level timbral features and linear regression with smoothing.
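
To make the last point concrete, here is a hedged sketch of such a simple baseline together with the per-song evaluation reported in Tables 1 and 2 (mean ± standard deviation of Spearman's ρ and RMSE across songs). The randomly generated "features" merely stand in for real low-level timbral descriptors, and all function names are illustrative rather than taken from the task's evaluation scripts.

<syntaxhighlight lang="python">
# Sketch of a simple frame-level baseline (linear regression + smoothing) and of
# the per-song evaluation: mean +/- std of Spearman's rho and RMSE across songs.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

def smooth(pred, window=5):
    """Moving-average smoothing of a per-frame prediction."""
    kernel = np.ones(window) / window
    return np.convolve(pred, kernel, mode="same")

def evaluate(preds, truths):
    """Per-song Spearman's rho and RMSE, averaged across songs."""
    rhos = [spearmanr(p, t)[0] for p, t in zip(preds, truths)]
    rmses = [np.sqrt(np.mean((p - t) ** 2)) for p, t in zip(preds, truths)]
    return (f"rho = {np.mean(rhos):.2f} +/- {np.std(rhos):.2f}, "
            f"RMSE = {np.mean(rmses):.2f} +/- {np.std(rmses):.2f}")

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # 20 toy "songs", 60 frames each, 8 stand-in timbral features per frame.
    X = [rng.normal(size=(60, 8)) for _ in range(20)]
    w = rng.normal(size=8)
    y = [x @ w + rng.normal(scale=0.3, size=60) for x in X]  # toy arousal traces
    model = LinearRegression().fit(np.vstack(X[:15]), np.concatenate(y[:15]))
    preds = [smooth(model.predict(x)) for x in X[15:]]
    print(evaluate(preds, y[15:]))
</syntaxhighlight>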
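
For comparison, the LSTM-RNN family behind the winning submissions [10, 2, 11] can be sketched generically as a bidirectional LSTM that maps a sequence of frame-level features to per-frame arousal and valence. This is only an architectural sketch under assumed layer sizes and feature dimensionality, not a reconstruction of any team's actual model or training setup.

<syntaxhighlight lang="python">
# Generic per-frame emotion regressor in the spirit of the winning entries
# (feature dimension, layer sizes and loss are assumptions, not a team's model).
import torch
import torch.nn as nn

class FrameEmotionBLSTM(nn.Module):
    def __init__(self, n_features=8, hidden=64):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)   # per-frame (arousal, valence)

    def forward(self, x):
        # x: (batch, frames, n_features) -> (batch, frames, 2)
        out, _ = self.blstm(x)
        return self.head(out)

if __name__ == "__main__":
    model = FrameEmotionBLSTM()
    feats = torch.randn(4, 60, 8)     # 4 clips, 60 half-second frames, 8 features
    target = torch.randn(4, 60, 2)    # toy per-frame arousal/valence annotations
    loss = nn.MSELoss()(model(feats), target)
    loss.backward()                   # gradients for one illustrative step
    print(model(feats).shape)         # -> torch.Size([4, 60, 2])
</syntaxhighlight>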
4. FUTURE OF THE TASK

The major problem that we encountered when organizing the task was assembling good-quality data. The improvement in performance over the years partly depended on that. The problems arising when using the continuous response annotation interface seem to be unsolvable unless either the task or the interface changes.
One of the possible solutions is to change the underlying task. It seems that the algorithms developed by the teams can track musical dynamics rather well. Many expressive means in music are characterized by gradual changes (e.g., diminuendo, crescendo, rallentando). Tracking these changes in tempo and dynamics could be useful as a preliminary step to tracking emotional changes. Changes in timbre can also be tracked in a similar way on a very short time scale.
Another possibility is changing the interface. One of the alternative continuous annotation interfaces suggested by E. Schubert uses a categorical model instead of a dimensional one [8]. Using a categorical model would eliminate the problem with absolute scaling. A more sophisticated interface could also allow the annotation to be modified afterwards by changing the scale (squeezing or expanding it) and removing the time lags.
Finally, one of the major questions about the continuous emotion tracking task is its practical applicability. In most cases, the estimation of the overall emotion of a song, or of its most representative part, is most useful to users. Retrieval by continuous emotion tracking could be useful when a song with a certain emotional trajectory is needed, for instance for production music or soundtracks. Another possible application would be finding the most dramatic (emotionally charged) moment in a song to be used as a snippet. Moreover, as music is often used as a stimulus in the affective computing community to study emotion prediction from brain waves or physiological signals, a model that predicts dynamic emotion in music can be helpful in this research. Departing from such bottom-up needs and requirements, the problem could hopefully be reformulated in a better way.

5. ACKNOWLEDGMENTS

We would like to thank Erik M. Schmidt, Mike N. Caro, Cheng-Ya Sha, Alexander Lansky, Sung-Yen Liu and Eduardo Coutinho for their contributions to the development of this task. We also thank all the task participants and anonymous Turkers for their invaluable contributions.

6. REFERENCES

[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[2] E. Coutinho, F. Weninger, B. Schuller, and K. R. Scherer. The Munich LSTM-RNN approach to the MediaEval 2014 Emotion in Music task. In Working Notes Proceedings of the MediaEval 2014 Workshop, 2014.
[3] D. Gregory. Using computers to measure continuous music responses. Psychomusicology: A Journal of Research in Music Cognition, 8(2):127–134, 1989.
[4] D. Guan, X. Chen, and D. Yang. Music emotion regression based on multi-modal features. In Symposium on Computer Music Multidisciplinary Research, pages 70–77, 2012.
[5] C. Laurier, O. Lartillot, T. Eerola, and P. Toiviainen. Exploring relationships between audio features and emotion in music. In Proceedings of the 7th Triennial Conference of the European Society for the Cognitive Sciences of Music, pages 260–264, 2009.
[6] B. G. Patra, P. Maitra, D. Das, and S. Bandyopadhyay. MediaEval 2015: Music emotion recognition based on feed-forward neural network. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[7] E. Schubert. Continuous response to music using a two dimensional emotion space. In Proceedings of the International Conference on Music Perception and Cognition, pages 263–268, 1996.
[8] E. Schubert, S. Ferguson, N. Farrar, D. Taylor, and G. E. McPherson. Continuous response to music using discrete emotion faces. In Proceedings of the 9th International Symposium on Computer Music Modeling and Retrieval, 2012.
[9] M. Soleymani, M. N. Caro, E. M. Schmidt, and Y.-H. Yang. The MediaEval 2013 Brave New Task: Emotion in Music. In Working Notes Proceedings of the MediaEval 2013 Workshop, 2013.
[10] F. Weninger, F. Eyben, and B. Schuller. The TUM approach to the MediaEval Music Emotion task using generic affective audio features. In Working Notes Proceedings of the MediaEval 2013 Workshop, 2013.
[11] M. Xu, X. Li, H. Xianyu, J. Tian, F. Meng, and W. Chen. Multi-scale approaches to the MediaEval 2015 "Emotion in Music" task. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[12] Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H. H. Chen. A regression approach to music emotion recognition. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):448–457, 2008.