=Paper=
{{Paper
|id=Vol-1436/Paper10
|storemode=property
|title=Emotion in Music Task at MediaEval 2015
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper10.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AljanakiYS15
}}
==Emotion in Music Task at MediaEval 2015==
Anna Aljanaki (Information and Computing Sciences, Utrecht University, the Netherlands, a.aljanaki@uu.nl), Yi-Hsuan Yang (Academia Sinica, Taipei, Taiwan, yang@citi.sinica.edu.tw), Mohammad Soleymani (Computer Science Department, University of Geneva, Switzerland, mohammad.soleymani@unige.ch)

This research was supported in part by the Ambizione program of the Swiss National Science Foundation and the FES project COMMIT/. We thank Alexander Lansky from Queens University, Canada, and Yu-Hao Chin from National Central University, Taiwan, for assistance with the song selection and annotations. Copyright is held by the author/owner(s). MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.

===ABSTRACT===
The Emotion in Music task is held for the third consecutive year at the MediaEval benchmarking campaign. The continuing interest in the task shows that the music emotion recognition (MER) problem is truly important to the community, and that there is still much to be discovered about it. Automatic MER methods could greatly improve the accessibility of music collections by providing quick and standardized means of music categorization and indexing. In the Emotion in Music task we provide a benchmark for automatic MER methods. This year, we concentrated on the single subtask that proved to be the most challenging in the previous years: dynamic emotion characterization. We put special emphasis on providing high-quality ground truth data and on maximizing inter-annotator agreement. As a consequence of meeting this higher quality demand, the datasets for both training and evaluation are smaller than in the previous years. The data consist of music licensed under Creative Commons from the Free Music Archive, the medleyDB dataset, and Jamendo. This paper describes the dataset collection, the annotations, and the evaluation criteria of the task.

===1. INTRODUCTION===
Contemporary music listeners rely on online music services such as Spotify, iTunes or Soundcloud to access their favorite music. In order to make their collections accessible, music libraries need to classify music by genre, instrumentation, tempo and mood. Automatic solutions to the auto-tagging problem are invaluable because they make annotation fast, cheap and standardized. Emotion is one of the most important search criteria for music. Automatic MER (music emotion recognition) algorithms rely on ground truth for training. There are many ways in which such a ground truth can be generated [6], using different affective representations or different temporal granularities. Depending on the affective model and temporal resolution, the evaluation criteria can vary. These discrepancies make it very difficult to compare different methods. The Emotion in Music task is designed to develop a benchmark and an evaluation framework for such a comparison.

The task is held for the third year in the MediaEval benchmarking campaign for multimedia evaluation (http://www.multimediaeval.org) [1, 14]. Building on our experience from the last two years, we concentrate on a single dynamic emotion characterization task and on offering high-quality ground truth.

The only other current evaluation task for MER is the audio mood classification (AMC) task of the annual music information retrieval evaluation exchange (MIREX, http://www.music-ir.org/mirex/wiki/) [8]. In this task, 600 audio files are provided to the participants, who have agreed not to distribute the files for commercial purposes. However, AMC has been criticized for using an emotional model that is not based on psychological research. Namely, this benchmark uses five discrete emotion clusters, derived from a cluster analysis of online tags, instead of more widely accepted dimensional or categorical models of emotion. It has been noted that there is semantic and acoustic overlap between the clusters [12]. Furthermore, the dataset applies only a single static rating per audio clip, which belies the time-varying nature of music. Since 2013, another set of 1,438 30-second segments clipped from Korean pop songs has been used in MIREX as well; however, the same five-class taxonomy is adopted for this Korean set.

Since the first edition of the Emotion in Music task in 2013, we have opted for characterizing the per-second emotion of music as numerical values in two dimensions, valence (positive or negative emotion expressed in the music) and arousal (the energy of the music), i.e. the VA model [13, 17], making it easier to depict the temporal dynamics of emotion variation. The VA model has been widely adopted in affective research [2, 6, 9–11, 15, 18–20]. However, the model is not free of criticism, and other alternatives may be considered in the future. For example, the VA model has been criticized for being too reductionist, and it has been argued that other dimensions such as dominance should be added [5]. Moreover, the terms 'valence' and 'arousal' may sometimes be too abstract for people to have a common understanding of their meaning. Such drawbacks of the VA model can further harm the inter-annotator agreement of an annotation task that is already inherently fairly subjective.
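To make the dynamic VA representation concrete, the sketch below shows one way such an annotation could be stored as two aligned time series on a fixed temporal grid. It is only an illustration under our own assumptions (the function and parameter names are ours), not code from the task.

<pre>
import numpy as np

# Illustrative sketch (names are ours, not from the paper): a dynamic VA
# annotation is a pair of time series aligned on a fixed temporal grid.
# The task uses half-second segments with values in [-1, 1].

def make_annotation(valence, arousal, hop_seconds=0.5):
    """Return a dict holding clipped, aligned valence/arousal curves."""
    v = np.clip(np.asarray(valence, dtype=float), -1.0, 1.0)
    a = np.clip(np.asarray(arousal, dtype=float), -1.0, 1.0)
    assert v.shape == a.shape, "both curves must share the temporal grid"
    times = np.arange(len(v)) * hop_seconds  # segment start times in seconds
    return {"time": times, "valence": v, "arousal": a}

# Example: a 10-second clip annotated at 2 Hz (20 segments).
ann = make_annotation(np.zeros(20), np.linspace(-0.5, 0.5, 20))
print(ann["time"][:4])  # [0.  0.5 1.  1.5]
</pre>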
===2. TASK DESCRIPTION===
This year we offer only one task: dynamic emotion characterization. However, in order to permit a thorough comparison between different methods, this year we require the participants to submit two different runs.

* In one run, the participants are required to submit their features, and we use a baseline regression method (linear regression) to estimate dynamic affect. Any features automatically extracted from the audio or from the metadata provided by the organizers are allowed.
* In the second required run, all participants are required to use the baseline features that we provide (see Section 3 for details), so that their machine learning methods can be compared.

Participants are also free to submit any combination of features and machine learning methods, up to a total of five runs.

The participants will estimate the valence and arousal scores continuously in time for every segment (half a second long) on a scale from -1 to 1. The participants have to submit predictions for both valence and arousal, their feature set if it differs from the provided one, and their predictions when using the universal feature set.

We will use the Root-Mean-Square Error (RMSE) as the primary evaluation measure. We will also report the Pearson correlation (r) between the predictions and the ground truth. We will rank the submissions based on the averaged RMSE. Whenever the difference based on a one-sided Wilcoxon test is not significant (p > 0.05), we will use the averaged correlation coefficient to break the tie.
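A minimal sketch of how this evaluation could be computed, assuming that, for each song, predictions and ground truth are aligned arrays of half-second values. The function names and data layout are our assumptions, not the task's official scoring code.

<pre>
import numpy as np
from scipy.stats import pearsonr, wilcoxon

def evaluate_song(pred, truth):
    """RMSE and Pearson's r for one song's dynamic valence or arousal curve."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    rmse = np.sqrt(np.mean((pred - truth) ** 2))
    r, _ = pearsonr(pred, truth)
    return rmse, r

def evaluate_run(preds, truths):
    """Averaged RMSE (primary measure) and averaged r over all songs."""
    rmses, rs = zip(*(evaluate_song(p, t) for p, t in zip(preds, truths)))
    return float(np.mean(rmses)), float(np.mean(rs))

def compare_runs(rmses_a, rmses_b, mean_r_a, mean_r_b):
    """Rank two runs, assuming run A has the lower averaged RMSE: if a
    one-sided Wilcoxon test on the per-song RMSEs is not significant
    (p > 0.05), the averaged correlation breaks the tie."""
    _, p = wilcoxon(rmses_a, rmses_b, alternative="less")
    if p <= 0.05:
        return "A"
    return "A" if mean_r_a > mean_r_b else "B"
</pre>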
===3. DATASETS AND GROUND TRUTH===
Our datasets consist of royalty-free music from several sources: freemusicarchive.org (FMA), jamendo.com, and the medleyDB dataset [3]. The development set consists of 431 clips of 45 seconds, which were selected from last year's data based on inter-annotator agreement criteria. The test set comprises 58 complete music pieces with an average duration of 234 ± 105.7 seconds.

The development set is a subset of the clips from the previous years [1, 14], all of which are from FMA. The subset was selected according to the following procedure (a code sketch of these steps is given below):

# We deleted the annotations whose Pearson correlation with the averaged annotation for the same song was below 0.1. If fewer than 5 annotators remained after the deletion, we discarded the song.
# For the remaining songs and annotations, we calculated Cronbach's α. If it was larger than 0.6, the song was retained.
# The mean (bias) of every dynamic annotation was shifted to match the averaged static annotation for the same song.

This procedure resulted in a reduction from 1,744 songs to 431 songs (the rest did not have sufficiently consistent annotations), each of which was annotated by 5–7 workers from Amazon Mechanical Turk (MTurk). The Cronbach's α is 0.76 ± 0.12 for arousal and 0.73 ± 0.12 for valence.
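The following is a rough sketch of the selection procedure above, assuming each song's dynamic annotations are stored as an annotator-by-time matrix. The data layout, the thresholds passed as defaults, and the choice to re-center the averaged curve (rather than every individual annotation) are our own simplifications, not the organizers' code.

<pre>
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_annotators, n_timepoints) matrix,
    treating each annotator's curve as one 'item'."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[0]
    item_vars = ratings.var(axis=1, ddof=1)
    total_var = ratings.sum(axis=0).var(ddof=1)
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)

def select_song(ratings, static_mean, min_annotators=5, min_alpha=0.6):
    """Apply the three steps to one song; return the cleaned averaged
    curve, or None if the song is discarded."""
    ratings = np.asarray(ratings, dtype=float)
    avg = ratings.mean(axis=0)
    # Step 1: drop annotations that correlate poorly with the average.
    keep = [i for i in range(ratings.shape[0])
            if np.corrcoef(ratings[i], avg)[0, 1] >= 0.1]
    if len(keep) < min_annotators:
        return None
    ratings = ratings[keep]
    # Step 2: keep the song only if the remaining annotations are consistent.
    if cronbach_alpha(ratings) <= min_alpha:
        return None
    # Step 3: shift the curve so its mean (bias) matches the averaged
    # static annotation for the same song.
    curve = ratings.mean(axis=0)
    return curve - curve.mean() + static_mean
</pre>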
The evaluation set consists of 58 complete songs, one half from the medleyDB dataset [3] of royalty-free multitrack recordings and the other half from the jamendo.com music website, which provides music under Creative Commons licenses. We selected songs with some emotional variation in them, from genres corresponding to those in the development set. We used the same annotation interface as in the previous two years: a slider that is continuously moved by the annotator while listening to the music. The position of the slider indicates the magnitude of valence or arousal.

The evaluation data we collected this year differ in several respects. First, we opted for full-length songs in order to cover the whole affective variation. Second, we partially annotated the data in the laboratory. The evaluation set was annotated by six people, two onsite annotators and four conscientious MTurk workers, with 29% of the annotations done in the lab. This way, we can compare the agreement between the onsite workers and the crowdworkers. The annotators listened to the entire song before starting the annotation, to get familiar with the music and to reduce the reaction time lag. The workers were only paid the full fee after their work had been reviewed and found to be of high quality. The Cronbach's α this year is 0.65 ± 0.28 for arousal and 0.29 ± 0.94 for valence. In comparison, the Cronbach's α for two other existing datasets, MoodSwings [16] and AMG1608 [4], is 0.41 and 0.46 for arousal, and 0.25 and 0.31 for valence, respectively. Compared to these datasets, the consistency of our annotations has improved for arousal, but not for valence.

Note that there is a mismatch between the training and test sets in terms of the duration of the clips (45-second segments versus full songs) and the data sources (FMA versus medleyDB and Jamendo). In contrast, in both 2013 and 2014 the training and test sets were of the same length, and both were drawn from FMA [1, 14].

====3.1 Baseline features====
In order to enable comparison between different machine learning algorithms, we provide a baseline universal feature set, extracted with openSMILE [7], consisting of 260 low-level features (mean and standard deviation of 65 low-level acoustic descriptors and their first-order derivatives). In addition to the audio features, we also provide metadata covering the genre labels obtained from FMA and, for some of the songs, folksonomy tags crawled from last.fm.

===4. BASELINE RESULTS===
For the baseline, we used the openSMILE toolbox [7] to extract 260 features from non-overlapping segments of 500 ms, with a frame size of 60 ms and a 10 ms step. We used multiple linear regression (MLR), as in the previous years. The results are shown in the first row of Table 1. Compared to last year's baseline (arousal r = 0.27 ± 0.12, valence r = 0.19 ± 0.11), this year's baseline is worse. We also calculated an average baseline by using the average of all the development set ground truth as the prediction for all songs. In terms of RMSE, this average baseline performs better for valence and at the same level for arousal.

{| class="wikitable"
|-
! rowspan="2" |
! colspan="2" | Arousal
! colspan="2" | Valence
|-
! RMSE !! r !! RMSE !! r
|-
| openSMILE + MLR || 0.27 ± 0.11 || 0.36 ± 0.26 || 0.37 ± 0.18 || 0.01 ± 0.38
|-
| Average baseline || 0.28 ± 0.13 || – || 0.29 ± 0.14 || –
|}
Table 1: Baseline results.
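As a hedged illustration of the baseline pipeline, the sketch below aggregates openSMILE-style low-level descriptor frames into 260-dimensional segment features and fits a multiple linear regression, alongside the average baseline. The array layout of the LLD frames, the use of simple first differences as stand-ins for the delta coefficients, and scikit-learn as the regression implementation are our assumptions, not the organizers' exact setup.

<pre>
import numpy as np
from sklearn.linear_model import LinearRegression

def segment_features(lld_frames, frame_step=0.01, seg_len=0.5):
    """Aggregate an (n_frames, 65) array of low-level descriptors into one
    260-dimensional vector per non-overlapping 500 ms segment: mean and
    standard deviation of the LLDs and of their first differences."""
    frames_per_seg = int(round(seg_len / frame_step))  # 50 frames at a 10 ms step
    segments = []
    for start in range(0, len(lld_frames) - frames_per_seg + 1, frames_per_seg):
        chunk = lld_frames[start:start + frames_per_seg]
        delta = np.diff(chunk, axis=0)
        segments.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0),
                                        delta.mean(axis=0), delta.std(axis=0)]))
    return np.vstack(segments)  # shape: (n_segments, 260)

def fit_mlr(train_features, train_targets):
    """Multiple linear regression for one dimension (valence or arousal).
    train_features: (n_segments, 260); train_targets: (n_segments,)."""
    return LinearRegression().fit(train_features, train_targets)

def average_baseline(train_targets, n_test_segments):
    """Predict the development-set mean for every test segment."""
    return np.full(n_test_segments, float(np.mean(train_targets)))
</pre>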
===5. CONCLUSIONS===
A task has been developed to analyze emotion in music. Annotations were collected using both onsite annotators and crowdsourcing workers. The quest for higher-quality labels has led to a smaller number of training and evaluation samples.

===6. REFERENCES===
[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2014. In MediaEval 2014 Workshop, 2014.
[2] M. Barthet, G. Fazekas, and M. Sandler. Multidisciplinary perspectives on music emotion recognition: Implications for content and context-based models. In Int'l Symp. Computer Music Modelling & Retrieval, pages 492–507, 2012.
[3] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In Proc. ISMIR, 2014.
[4] Y.-A. Chen, J.-C. Wang, Y.-H. Yang, and H. H. Chen. The AMG1608 dataset for music emotion recognition. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pages 693–697, 2015.
[5] G. Collier. Beyond valence and activity in the emotional connotations of music. Psychology of Music, 35(1):110–131, 2007.
[6] T. Eerola. Modelling emotions in music: Advances in conceptual, contextual and validity issues. In AES International Conference, 2014.
[7] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of ACM MM, pages 835–838, 2013.
[8] X. Hu, J. S. Downie, C. Laurier, M. Bay, and A. F. Ehmann. The 2007 MIREX audio mood classification task: Lessons learned. In Proc. Int. Soc. Music Info. Retrieval Conf., pages 462–467, 2008.
[9] A. Huq, J. P. Bello, and R. Rowe. Automated music emotion recognition: A systematic evaluation. Journal of New Music Research, 39(3):227–244, 2010.
[10] Y. E. Kim, E. M. Schmidt, R. Migneco, B. G. Morton, P. Richardson, J. Scott, J. Speck, and D. Turnbull. Music emotion recognition: A state of the art review. In Proc. Int. Soc. Music Info. Retrieval Conf., 2010.
[11] S. Koelstra, C. Mühl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras. DEAP: A database for emotion analysis using physiological signals. IEEE Trans. Affective Computing, 3(1):18–31, 2012.
[12] C. Laurier and P. Herrera. Audio music mood classification using support vector machine. In MIREX Task on Audio Mood Classification, 2007.
[13] J. A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178, 1980.
[14] M. Soleymani, M. N. Caro, E. M. Schmidt, C.-Y. Sha, and Y.-H. Yang. 1000 songs for emotional analysis of music. In Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia, pages 1–6, 2013.
[15] M. Soleymani, M. Larson, T. Pun, and A. Hanjalic. Corpus development for affective video indexing. IEEE Trans. Multimedia, 16(4):1075–1089, 2014.
[16] J. A. Speck, E. M. Schmidt, B. G. Morton, and Y. E. Kim. A comparative study of collaborative vs. traditional musical mood annotation. In Proc. Int. Soc. Music Info. Retrieval Conf., 2011.
[17] R. E. Thayer. The Biopsychology of Mood and Arousal. Oxford University Press, New York, 1989.
[18] J.-C. Wang, Y.-H. Yang, H.-M. Wang, and S.-K. Jeng. Modeling the affective content of music with a Gaussian mixture model. IEEE Transactions on Affective Computing, 6(1):56–68, 2015.
[19] S. Wang and Q. Ji. Video affective content analysis: a survey of state of the art methods. IEEE Trans. Affective Computing, PP(99):1, 2015.
[20] Y.-H. Yang and H.-H. Chen. Machine recognition of music emotion: A review. ACM Trans. Intelligent Systems and Technology, 3(4), 2012.