=Paper=
{{Paper
|id=Vol-1739/MediaEval_2016_paper_10
|storemode=property
|title=Emotion in Music task: Lessons Learned
|pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_10.pdf
|volume=Vol-1739
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AljanakiYS16
}}
==Emotion in Music task: Lessons Learned==
Anna Aljanaki (University of Geneva, Geneva, Switzerland) – aljanaki@gmail.com
Yi-Hsuan Yang (Academia Sinica, Taipei, Taiwan) – yang@citi.sinica.edu.tw
Mohammad Soleymani (University of Geneva, Geneva, Switzerland) – mohammad.soleymani@unige.ch

Copyright held by the authors/owners. MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

The Emotion in Music task was organized within the MediaEval benchmarking campaign during three consecutive years, from 2013 to 2015. In this paper we describe the challenges we faced and the solutions we found. We used crowdsourcing on Amazon Mechanical Turk to annotate a corpus of music pieces with continuous (per-second) emotion annotations. Assuring sufficient quality of the data requires careful attention to the annotation process on Mechanical Turk. Labeling music with emotion continuously proved to be a very difficult task for listeners: both the time delay in their responses and the demand for absolute ratings degraded the quality of the data. We suggest certain transformations to alleviate these problems. Finally, the short length of the annotated segments (0.5-1 s) led task participants to classify music on an equally small time scale, which only allowed them to capture changes in dynamics and timbre, but not musically meaningful harmonic, melodic and other changes, which occur on a larger time scale. LSTM-RNN based methods, which can incorporate a larger context, gave better results than other methods, but the proposed methods still did not show significant improvement over the years, and the task was concluded.

1. INTRODUCTION

The Emotion in Music task was first proposed by M. Soleymani, M. N. Caro, E. M. Schmidt and Y.-H. Yang in 2013 [9]. The task targeted music emotion recognition algorithms for music indexing and recommendation, predominantly for popular music. The most common paradigm for music retrieval by emotion is one in which a single emotion is assigned to an entire piece of music. However, a piece of music exists in time, and assigning just one emotion to it is most of the time incorrect. Therefore, music excerpts were annotated continuously, using a paradigm that was first suggested for studying dynamics and general emotionality in music, the Continuous Response Digital Interface (CRDI) [3]. The CRDI was later adapted by E. Schubert to record emotional responses on valence and arousal scales [7]. In the first year of the task, static and dynamic subtasks existed side by side. However, the static subtask was later dropped as less challenging. The decision to focus on continuous tracking of emotion had both pros and cons. On the bright side, it made the Emotion in Music benchmark very distinct from the existing Mood Classification task at the MIREX benchmark (http://www.music-ir.org/mirex). Continuous emotion recognition is also arguably less researched than static emotion recognition. However, the pragmatic, utilitarian aspect of the task valued in the MediaEval community became less prominent: there are currently far fewer applications of, and less interest in, automatic recognition of emotion varying over time.

2. CROWDSOURCING THE DATA

Music annotation with emotion is a time-consuming task, which often generates very inconsistent responses even with conscientious annotators. This makes it difficult to verify the responses, because many inconsistencies can be attributed to individual perception. We used crowdsourcing on Amazon Mechanical Turk (AMT) to annotate the data: the workers were paid to annotate the music and had to pass a test before being admitted to the task. In the first two years, we did not monitor the quality of the work after the test was passed. We tried to estimate a lower bound on the number of low-quality workers by counting only the people who did not move their mouse at all while annotating. Some of the songs may not contain any emotional change, but at least some initial movement from the start position is necessary before the annotation stabilizes. In 2014, 99 annotators annotated 1000 pieces of music; only two of them did not move their mouse at all, and they annotated only a small number of songs. However, the agreement between the annotators was not very good in either 2013 or 2014 (less than 0.3 in Cronbach's α).
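
As an illustration of the two quality checks described above, the sketch below flags annotators whose continuous trace never leaves its starting value and computes Cronbach's α over the per-second annotations of a single song, treating annotators as items and time points as cases. It is only a minimal sketch under an assumed data layout (one row per annotator, one column per second); the function names are illustrative and not part of the actual task infrastructure.

<syntaxhighlight lang="python">
# Sketch of the two data-quality checks described above. Assumed layout:
# one row per annotator, one column per second of a song, values in [-1, 1].
import numpy as np

def no_movement_annotators(traces: np.ndarray) -> np.ndarray:
    """Indices of annotators whose trace never leaves its starting value."""
    return np.where(np.all(traces == traces[:, :1], axis=1))[0]

def cronbach_alpha(traces: np.ndarray) -> float:
    """Cronbach's alpha, treating annotators as 'items' and seconds as 'cases'."""
    k = traces.shape[0]                          # number of annotators
    item_vars = traces.var(axis=1, ddof=1)       # variance of each annotator's trace
    total_var = traces.sum(axis=0).var(ddof=1)   # variance of the summed trace
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(5, 45)).cumsum(axis=1) * 0.01  # 5 annotators, 45 seconds
    demo[4, :] = demo[4, 0]                                # one "frozen" annotator
    print(no_movement_annotators(demo))                    # -> [4]
    print(round(cronbach_alpha(demo[:4]), 2))              # agreement of the rest
</syntaxhighlight>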
In 2015, we changed the procedure to a more lab-like setting by hiring five annotators to annotate the whole dataset, half of them in the lab and half on AMT [1]. The quality was much better. This could also be attributed to the other changes, such as choosing full-length songs, choosing the best annotators of the previous years, negotiating fair compensation in advance on a Turker forum (http://www.mturkgrind.com/) and introducing a preliminary listening stage.
Despite having highly qualified annotators, the following problems were not resolved:

1. Absolute-scale ratings. The ratings had to be given on an absolute scale while estimating changes in arousal and valence. Though the annotators often agreed on the direction of change, the magnitude of change was often different. We suggest shifting each annotation so that its mean lies where the annotator indicated the mean beforehand (sketched below).

2. Reaction time. Human annotators have a reaction time. The biggest time lag is observed at the beginning of the annotation (around 13 seconds), but a small time lag is also present after every musical change. The beginnings of the annotations had to be deleted as unreliable.

3. Time scale. Some of the annotators would react to every beat and every note, while others would only consider changing their arousal or valence at section boundaries.
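
The following minimal sketch illustrates the two transformations suggested in items 1 and 2: shifting a continuous annotation so that its mean matches the value the annotator indicated beforehand, and trimming the unreliable opening seconds caused by reaction time. The 13-second default and the function names are illustrative assumptions, not the exact procedure applied to the released data sets.

<syntaxhighlight lang="python">
# Sketch of the suggested post-processing. Assumed input: one annotation value
# per second in [-1, 1], plus a single static rating given by the annotator.
import numpy as np

def shift_to_indicated_mean(annotation: np.ndarray, indicated_mean: float) -> np.ndarray:
    """Shift the whole trace so its mean equals the value the annotator
    indicated beforehand, keeping the relative changes intact."""
    return annotation - annotation.mean() + indicated_mean

def trim_reaction_lag(annotation: np.ndarray, lag_seconds: int = 13,
                      rate_hz: float = 1.0) -> np.ndarray:
    """Drop the opening part of the trace, which is dominated by reaction time."""
    return annotation[int(lag_seconds * rate_hz):]

if __name__ == "__main__":
    trace = np.linspace(-0.2, 0.6, 45)   # a toy 45-second arousal trace, mean 0.2
    print(round(shift_to_indicated_mean(trace, 0.0).mean(), 6))  # -> 0.0 (up to rounding)
    print(trim_reaction_lag(trace).shape)                        # -> (32,)
</syntaxhighlight>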
3. SUGGESTED SOLUTIONS

Participants received the data as a sequence of features and annotations at either a half-second or a one-second frame rate. Many participants extracted their own features, but the windows for feature extraction were almost always smaller than 0.5-1 s, and the features were very low-level, mostly relating to loudness and to timbral properties of the sound (energy distribution across the spectrum).
The best solutions in all the years were obtained using Long Short-Term Memory Recurrent Neural Networks. Although the arousal prediction performance improved over the three years (see Table 1), the accuracy obtained when predicting valence did not improve much (Table 2).

Table 1: Winning algorithms on arousal, ordered by Spearman's ρ. BLSTM-RNN – Bi-directional Long Short-Term Memory Recurrent Neural Network.

  Method                   ρ           RMSE
  2013, BLSTM-RNN [10]     .31 ± .37   .08 ± .05
  2014, LSTM [2]           .35 ± .45   .10 ± .05
  2015, BLSTM-RNN [11]     .66 ± .25   .12 ± .06

Table 2: Winning algorithms on valence, ordered by Spearman's ρ.

  Method                   ρ           RMSE
  2013, BLSTM-RNN [10]     .19 ± .43   .08 ± .04
  2014, LSTM [2]           .20 ± .49   .08 ± .05
  2015, BLSTM-RNN [11]     .17 ± .09   .12 ± .54

It is a known issue that modeling valence in music is more challenging, both because of the higher subjectivity associated with valence perception and in part because of the absence of salient valence-related audio features that can be reliably computed by state-of-the-art music signal processing algorithms [12, 5, 4]. The almost twofold improvement in arousal can also be attributed to the improvement in the quality and consistency of the data. In 2015, the situation with valence became even worse, because we invested extra effort into assembling the data set in such a way that valence and arousal would not be correlated (by picking more songs from the upper-left ("angry") and lower-right ("serene") quadrants). Because of the difference between the development and evaluation sets' distributions, the evaluation results were inaccurate in 2015. We trained an LSTM-RNN with the features supplied by the participants and the evaluation set data. Using 20-fold cross-validation, we obtained a more accurate estimate of the state-of-the-art performance on valence. The best result for valence detection on the 2015 test set was achieved using the JUNLP team's features (ρ = .27 ± .13 and RMSE = .19 ± .35) [6]. The JUNLP team used feature reduction to find the features which were most important for valence. However, the result is still much worse than the one obtained for arousal. A very interesting finding was that even though the participants proposed sophisticated procedures for feature dimensionality reduction as well as BLSTM-RNNs, an almost equally good result could be obtained for arousal by using just eight low-level timbral features and linear regression with smoothing.
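
To make the last point concrete, here is a hedged sketch of such a simple baseline together with the per-song evaluation reported in Tables 1 and 2 (mean ± standard deviation of Spearman's ρ and RMSE across songs). The randomly generated "features" merely stand in for real low-level timbral descriptors, and all function names are illustrative rather than taken from the task's evaluation scripts.

<syntaxhighlight lang="python">
# Sketch of a simple frame-level baseline (linear regression + smoothing) and of
# the per-song evaluation: mean +/- std of Spearman's rho and RMSE across songs.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

def smooth(pred, window=5):
    """Moving-average smoothing of a per-frame prediction."""
    kernel = np.ones(window) / window
    return np.convolve(pred, kernel, mode="same")

def evaluate(preds, truths):
    """Per-song Spearman's rho and RMSE, averaged across songs."""
    rhos = [spearmanr(p, t)[0] for p, t in zip(preds, truths)]
    rmses = [np.sqrt(np.mean((p - t) ** 2)) for p, t in zip(preds, truths)]
    return (f"rho = {np.mean(rhos):.2f} +/- {np.std(rhos):.2f}, "
            f"RMSE = {np.mean(rmses):.2f} +/- {np.std(rmses):.2f}")

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # 20 toy "songs", 60 frames each, 8 stand-in timbral features per frame.
    X = [rng.normal(size=(60, 8)) for _ in range(20)]
    w = rng.normal(size=8)
    y = [x @ w + rng.normal(scale=0.3, size=60) for x in X]  # toy arousal traces
    model = LinearRegression().fit(np.vstack(X[:15]), np.concatenate(y[:15]))
    preds = [smooth(model.predict(x)) for x in X[15:]]
    print(evaluate(preds, y[15:]))
</syntaxhighlight>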
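
For comparison, the LSTM-RNN family behind the winning submissions [10, 2, 11] can be sketched generically as a bidirectional LSTM that maps a sequence of frame-level features to per-frame arousal and valence. This is only an architectural sketch under assumed layer sizes and feature dimensionality, not a reconstruction of any team's actual model or training setup.

<syntaxhighlight lang="python">
# Generic per-frame emotion regressor in the spirit of the winning entries
# (feature dimension, layer sizes and loss are assumptions, not a team's model).
import torch
import torch.nn as nn

class FrameEmotionBLSTM(nn.Module):
    def __init__(self, n_features=8, hidden=64):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)   # per-frame (arousal, valence)

    def forward(self, x):
        # x: (batch, frames, n_features) -> (batch, frames, 2)
        out, _ = self.blstm(x)
        return self.head(out)

if __name__ == "__main__":
    model = FrameEmotionBLSTM()
    feats = torch.randn(4, 60, 8)     # 4 clips, 60 half-second frames, 8 features
    target = torch.randn(4, 60, 2)    # toy per-frame arousal/valence annotations
    loss = nn.MSELoss()(model(feats), target)
    loss.backward()                   # gradients for one illustrative step
    print(model(feats).shape)         # -> torch.Size([4, 60, 2])
</syntaxhighlight>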
4. FUTURE OF THE TASK

The major problem that we encountered when organizing the task was assembling good-quality data. The improvement in performance over the years partly depended on that. The problems arising when using the continuous response annotation interface seem to be unsolvable unless either the task or the interface changes.
One of the possible solutions is to change the underlying task. It seems that the algorithms developed by the teams can track musical dynamics rather well. Many expressive means in music are characterized by gradual changes (e.g., diminuendo, crescendo, rallentando). Tracking these changes in tempo and dynamics could be useful as a preliminary step to tracking emotional changes. Changes in timbre can also be tracked in a similar way on a very short time scale.
Another possibility is changing the interface. One of the alternative continuous annotation interfaces suggested by E. Schubert uses a categorical model instead of a dimensional one [8]. Using a categorical model would eliminate the problem with absolute scaling. A more sophisticated interface could also allow the annotation to be modified afterwards by changing the scale (squeezing or expanding it) and removing the time lags.
Finally, one of the major questions about the continuous emotion tracking task is its practical applicability. In most cases, the estimation of the overall emotion of a song, or of its most representative part, is most useful to users. Retrieval by continuous emotion tracking could be useful when a song with a certain emotional trajectory is needed, for instance for production music or soundtracks. Another possible application would be finding the most dramatic (emotionally charged) moment in a song to be used as a snippet. Moreover, as music is often used as a stimulus in the affective computing community to study emotion prediction from brain waves or physiological signals, a model that predicts dynamic emotion in music can be helpful in this research. Departing from such bottom-up needs and requirements, the problem could hopefully be reformulated in a better way.

5. ACKNOWLEDGMENTS

We would like to thank Erik M. Schmidt, Mike N. Caro, Cheng-Ya Sha, Alexander Lansky, Sung-Yen Liu and Eduardo Coutinho for their contributions to the development of this task. We also thank all the task participants and anonymous Turkers for their invaluable contributions.

6. REFERENCES

[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[2] E. Coutinho, F. Weninger, B. Schuller, and K. R. Scherer. The Munich LSTM-RNN approach to the MediaEval 2014 Emotion in Music task. In Working Notes Proceedings of the MediaEval 2014 Workshop, 2014.
[3] D. Gregory. Using computers to measure continuous music responses. Psychomusicology: A Journal of Research in Music Cognition, 8(2):127–134, 1989.
[4] D. Guan, X. Chen, and D. Yang. Music emotion regression based on multi-modal features. In Symposium on Computer Music Multidisciplinary Research, pages 70–77, 2012.
[5] C. Laurier, O. Lartillot, T. Eerola, and P. Toiviainen. Exploring relationships between audio features and emotion in music. In Proceedings of the 7th Triennial Conference of the European Society for the Cognitive Sciences of Music, pages 260–264, 2009.
[6] B. G. Patra, P. Maitra, D. Das, and S. Bandyopadhyay. MediaEval 2015: Music emotion recognition based on feed-forward neural network. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[7] E. Schubert. Continuous response to music using a two dimensional emotion space. In Proceedings of the International Conference on Music Perception and Cognition, pages 263–268, 1996.
[8] E. Schubert, S. Ferguson, N. Farrar, D. Taylor, and G. E. McPherson. Continuous response to music using discrete emotion faces. In Proceedings of the 9th International Symposium on Computer Music Modeling and Retrieval, 2012.
[9] M. Soleymani, M. N. Caro, E. M. Schmidt, and Y.-H. Yang. The MediaEval 2013 Brave New Task: Emotion in Music. In Working Notes Proceedings of the MediaEval 2013 Workshop, 2013.
[10] F. Weninger, F. Eyben, and B. Schuller. The TUM approach to the MediaEval Music Emotion task using generic affective audio features. In Working Notes Proceedings of the MediaEval 2013 Workshop, 2013.
[11] M. Xu, X. Li, H. Xianyu, J. Tian, F. Meng, and W. Chen. Multi-scale approaches to the MediaEval 2015 "Emotion in Music" task. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[12] Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H. H. Chen. A regression approach to music emotion recognition. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):448–457, 2008.