=Paper=
{{Paper
|id=Vol-1263/paper33
|storemode=property
|title=Emotion in Music Task at MediaEval 2014
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_33.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AljanakiYS14
}}
==Emotion in Music Task at MediaEval 2014==
Anna Aljanaki (Information and Computing Sciences, Utrecht University, the Netherlands, a.aljanaki@uu.nl), Yi-Hsuan Yang (Academia Sinica, Taipei, Taiwan, yang@citi.sinica.edu.tw), Mohammad Soleymani (Computer Science Department, University of Geneva, Switzerland, mohammad.soleymani@unige.ch)

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

===ABSTRACT===
Emotional expression is an important property of music, and its emotional characteristics are thus especially natural for music indexing and recommendation. The Emotion in Music task addresses automatic music emotion prediction and is held for the second year in 2014. Compared to the previous year, we modified the task by offering a new feature development subtask and releasing a new evaluation set. We employed a crowdsourcing approach to collect the data, using Amazon Mechanical Turk. The dataset consists of music licensed under Creative Commons from the Free Music Archive, which can be shared freely without restrictions. In this paper we describe the dataset collection, the annotations, and the evaluation criteria, as well as the required and optional runs.

===1. INTRODUCTION===
Huge music libraries create a demand for tools providing automatic music classification by various parameters, such as genre, instrumentation, and emotion. Among these, emotion is one of the most important classification criteria. This task presents many challenges, starting from its internal ambiguity and ending with audio processing difficulties [8]. As musical emotion is subjective, most existing work on music emotion recognition (MER) relies on supervised machine learning approaches, training MER systems with emotion labels provided by human annotators. Currently, many researchers collect their own ground-truth data, which makes direct comparison between their approaches impossible. A benchmark is necessary to facilitate cross-site comparison. The Emotion in Music task appears for the second time in the MediaEval benchmarking campaign for multimedia evaluation (http://www.multimediaeval.org) and is designed to serve this purpose.

The only other current evaluation task for MER is the audio mood classification (AMC) task of the annual music information retrieval evaluation exchange (MIREX, http://www.music-ir.org/mirex/wiki/) [1]. In this task, 600 audio files are provided to the participants, who have agreed not to distribute the files for commercial purposes. However, AMC has been criticized for using an emotional model that is not based on psychological research. Namely, this benchmark uses five discrete emotion clusters, derived from cluster analysis of online tags, instead of more widely accepted dimensional or categorical models of emotion. It has been noted that there is semantic and acoustic overlap between the clusters [4]. Furthermore, the dataset provides only a single static rating per audio clip, which belies the time-varying nature of music.

In our corpus we employ music licensed under Creative Commons (CC, http://creativecommons.org/) from the Free Music Archive (FMA, http://freemusicarchive.org/), which enables us to redistribute the content. We do not use volunteers or online tag mining to collect the annotations, but pay annotators to perform the task via Amazon Mechanical Turk (MTurk, http://mturk.com), in a similar way as [2, 7]. We filter out poor-quality workers by requiring them to first pass a test demonstrating a thorough understanding of the task and an ability to produce good-quality work. The final dataset spans 1744 clips of 45 seconds, each annotated by a minimum of 10 workers, which makes it substantially larger than any existing music emotion dataset with continuous annotations.

===2. TASK DESCRIPTION===
This year, similar to last year, the task comprises two subtasks. The first subtask, dynamic emotion characterization, is the main task. The second subtask, feature design, is introduced for the first time this year. New features, which either have not been developed before or have not been applied to MER, should be proposed and applied to automatically detect arousal and valence for the whole song. Systems are trained on a development set of 744 songs and evaluated on an evaluation set of 1000 songs.

====2.1 Run description====
In Subtask 1, dynamic estimation, the participants estimate the valence and arousal scores continuously in time for every segment (half a second long) on a scale from -1 to 1. In Subtask 2, feature design, the participants develop new features and predict the valence and arousal scores of whole 45-second excerpts (i.e., statically, as an average over the excerpt). Only one new feature will be evaluated in each run. For both tasks together, each team can submit up to 5 runs in total.

For the main (dynamic subtask) run, any features automatically extracted from the audio or from the metadata provided by the organizers are allowed. For the dynamic emotional analysis we will use the Pearson correlation calculated per song and averaged for the final value. We will also report the Root-Mean-Square Error (RMSE). We will rank the submissions based on the averaged correlations. Whenever the difference based on a one-sided Wilcoxon test is not significant (p > 0.05), we will use the RMSE to break the tie. The feature design task will also be evaluated based on the Pearson correlation averaged across songs, over three runs. The participants can apply any non-linear transformation to their designed features to maximize the correlation.
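The following is a minimal sketch of the evaluation protocol described above for the dynamic subtask: per-song Pearson correlation (averaged across songs) as the ranking metric, RMSE as the secondary metric, and a one-sided Wilcoxon test on per-song correlations as the tie-break criterion. It is not the official evaluation script; all function and variable names are illustrative.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import pearsonr, wilcoxon


def evaluate_dynamic(predictions, ground_truth):
    """predictions, ground_truth: dicts mapping song_id -> 1-D array of
    per-0.5s valence (or arousal) values on the [-1, 1] scale."""
    correlations, rmses = [], []
    for song_id, truth in ground_truth.items():
        pred = predictions[song_id]
        r, _ = pearsonr(pred, truth)                      # correlation for this song
        correlations.append(r)
        rmses.append(np.sqrt(np.mean((pred - truth) ** 2)))
    # Submissions are ranked by the correlation averaged across songs;
    # the per-song RMSE values are reported alongside it.
    return np.array(correlations), np.array(rmses)


def compare_submissions(corr_a, corr_b):
    """One-sided Wilcoxon test on per-song correlations; if the difference
    is not significant (p > 0.05), RMSE is used to break the tie."""
    _, p = wilcoxon(corr_a, corr_b, alternative='greater')
    return p
</syntaxhighlight>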
===3. DATASET AND GROUND TRUTH===
For the description of the development set we refer to [5]. This year we collected more data in a similar way, but included external sources for metadata. We used the last.fm API to collect tags for matching songs from FMA. The songs that were already in last year's corpus were excluded. We then chose the 1000 songs with the largest number of tags. Each song belongs to one or several genres from the following list: Soul, Blues, Electronic, Rock, Classical, Hip-Hop, International, Experimental, Folk, Jazz, Country, and Pop. We excluded songs from the genres Spoken, Old-time historic, and Experimental (in case the latter was the only genre the song belonged to). We also manually checked the music and excluded files with bad recording quality or files containing speech or noise rather than music. For each artist, we selected at most 5 songs to be included in the dataset.

To assure adequate quality of the ground truth, we created a procedure to select only the workers who are motivated and qualified to do the task, following current state-of-the-art crowdsourcing approaches [6]. All workers had to pass a qualification test that was later evaluated manually. It consisted of three stages. Prior to the test, participants were provided with the definitions of arousal and valence and could watch an instruction video. In the first stage, they listened to two short music clips containing a distinctive emotional shift and annotated arousal and valence continuously. In the second stage, workers described the emotional shift, and in the third stage, they described the song and indicated its genre. We also collected anonymized personal information from the workers, including gender, age, and location, and asked them to take a short personality test.

Based on the quality of the musical descriptions and the correctness of their answers in the qualification task, we granted qualifications to the workers, after which they could proceed to the second step (the main task). The main task involved annotating the songs continuously over time, once for arousal and once for valence, which in total constituted 334 micro-tasks. Each micro-task involved annotating 3 audio clips of 45 seconds on the arousal and valence scales both dynamically and statically (as a whole). The workers also characterized the song in emotional terms and reported the confidence of their answers, as well as their familiarity with and liking of the music. Workers were paid $0.25 USD for the qualification HITs and $0.40 USD for each main HIT that they successfully completed. On average, each HIT took 10 minutes.

To measure the inter-annotator agreement for the static annotations, we calculated Krippendorff's alpha on an ordinal scale. The values were 0.22 for valence and 0.37 for arousal, which are in the range of fair agreement. For the dynamic annotations, we used Kendall's coefficient of concordance (Kendall's W) with corrected tied ranks. Kendall's W was calculated for each song separately after discarding the annotations of the first 15 seconds. The average W is 0.2 ± 0.13 for arousal and 0.16 ± 0.11 for valence, which indicates weak agreement.
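As an illustration of how the dynamic agreement figures above can be computed, the sketch below re-implements Kendall's coefficient of concordance with tie correction for a single song (the annotations of the first 15 seconds would be dropped before calling it). This is an illustrative re-implementation under the standard tie-corrected formula, not the organizers' script.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import rankdata


def kendalls_w(annotations):
    """Kendall's W with tie correction.
    annotations: 2-D array of shape (n_raters, n_timepoints)."""
    m, n = annotations.shape
    # Rank each rater's time series; ties receive their mean rank.
    ranks = np.array([rankdata(row) for row in annotations])
    # Sum of squared deviations of the per-timepoint rank totals from their mean.
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    # Tie correction term: sum of (t^3 - t) over every rater's tied groups.
    t = 0.0
    for row in ranks:
        _, counts = np.unique(row, return_counts=True)
        t += (counts ** 3 - counts).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * t)
</syntaxhighlight>

Stacking the workers' half-second arousal (or valence) curves for one song into a (workers × timepoints) array and passing it to this function gives the per-song W; averaging over songs yields a statistic comparable to the one reported above.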
===4. BASELINE RESULTS===
For the baseline, we used MIRToolbox [3] to extract 5 features (spectral flux, harmonic change detection function, loudness, roughness, and zero-crossing rate) from non-overlapping segments of 500 ms, with a frame size of 50 ms. We used multilinear regression, as we did last year. For valence, the correlation averaged across songs was 0.11 ± 0.34 and the RMSE was 0.19 ± 0.11. For arousal, the correlation was 0.18 ± 0.36 and the RMSE was 0.27 ± 0.12. Compared to last year (for arousal, r = 0.16 ± 0.35; for valence, r = 0.06 ± 0.3), the baseline is higher. We also calculated a random baseline by averaging all the predictions. The RMSE of this random average baseline is 0.18 ± 0.11 for valence and 0.21 ± 0.12 for arousal, which means that in terms of RMSE the random baseline performs better than the feature-based baseline.
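For readers who want to assemble a comparable (not identical) baseline in Python rather than MATLAB, the sketch below maps per-segment audio features to emotion values with multiple linear regression. The feature set is only a stand-in (zero-crossing rate and RMS energy via librosa) for the five MIRToolbox features named above, and the 50 ms sub-framing is simplified to one value per 500 ms segment, so the resulting numbers will differ; file paths and the training arrays are assumed to be prepared by the reader from the development set.

<syntaxhighlight lang="python">
import numpy as np
import librosa
from sklearn.linear_model import LinearRegression


def segment_features(path, sr=44100, seg=0.5):
    """One feature vector per non-overlapping 500 ms segment of the clip."""
    y, sr = librosa.load(path, sr=sr)
    hop = int(seg * sr)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=hop, hop_length=hop)[0]
    rms = librosa.feature.rms(y=y, frame_length=hop, hop_length=hop)[0]
    n = min(len(zcr), len(rms))
    return np.column_stack([zcr[:n], rms[:n]])        # shape: (n_segments, n_features)


# X_train: stacked per-segment features for the development songs;
# y_train: the matching per-segment arousal (or valence) annotations,
# averaged over workers. Both are assumed to be built beforehand.
model = LinearRegression()
# model.fit(X_train, y_train)
# predictions = model.predict(segment_features("some_clip.mp3"))
</syntaxhighlight>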
===5. ACKNOWLEDGMENTS===
We are grateful to Sung-Yen Liu from Academia Sinica for helping with the task organization. This research was supported in part by the European Research Area, the CVML Lab (http://cvml.unige.ch), University of Geneva, and by the FES project COMMIT/.

===6. REFERENCES===
[1] X. Hu, J. S. Downie, C. Laurier, M. Bay, and A. F. Ehmann. The 2007 MIREX audio mood classification task: Lessons learned. In Proc. Int. Soc. Music Info. Retrieval Conf., pages 462–467, 2008.

[2] Y. E. Kim, E. Schmidt, and L. Emelle. Moodswings: A collaborative game for music mood label collection. In Proc. Int. Soc. Music Info. Retrieval Conf., pages 231–236, 2008.

[3] O. Lartillot and P. Toiviainen. A Matlab toolbox for musical feature extraction from audio. In International Conference on Digital Audio Effects, Bordeaux, 2007.

[4] C. Laurier and P. Herrera. Audio music mood classification using support vector machine. In MIREX task on Audio Mood Classification, 2007.

[5] M. Soleymani, M. N. Caro, E. M. Schmidt, C.-Y. Sha, and Y.-H. Yang. 1000 songs for emotional analysis of music. In Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia, CrowdMM '13, pages 1–6, New York, NY, USA, 2013. ACM.

[6] M. Soleymani and M. Larson. Crowdsourcing for affective annotation of video: Development of a viewer-reported boredom corpus. In Workshop on Crowdsourcing for Search Evaluation, SIGIR 2010, Geneva, Switzerland, 2010.

[7] J. A. Speck, E. M. Schmidt, B. G. Morton, and Y. E. Kim. A comparative study of collaborative vs. traditional musical mood annotation. In Proc. Int. Soc. Music Info. Retrieval Conf., 2011.

[8] Y.-H. Yang and H. H. Chen. Music Emotion Recognition. CRC Press, Boca Raton, Florida, 2011.