
The MediaEval 2013 Brave New Task: Emotion in Music

M. Soleymani
m.soleymani@imperial.ac.uk
Department of Computing, Imperial College London, UK

M.N. Caro and E.M. Schmidt
{mc947,eschmidt}@drexel.edu
Drexel University, USA

Y.-H. Yang
yang@citi.sinica.edu.tw
Academia Sinica, Taiwan

2013

18 19

Music is composed to be emotionally expressive. Emotional associations of music thus provide an especially natural feature for music indexing and recommendation. The Emotion in Music task is a brave new task addressing the emotional characterization of music. To address the difficulties of emotion annotation, we turned to crowdsourcing using Amazon Mechanical Turk. The dataset consists entirely of Creative Commons music from the Free Music Archive, which, as the name suggests, can be shared freely without restrictions. This paper describes the dataset collection, the annotations, and the evaluation criteria, as well as the required and optional runs.

INTRODUCTION

The Emotion in Music task is a brave new task in the MediaEval 2013 benchmarking initiative for multimedia evaluation¹. In seeking to develop tools for navigating today’s vast digital music libraries, emotional associations provide an especially natural domain for indexing and recommendation. Because such a task poses a myriad of challenges, powerful tools are required for the development of systems that automate the prediction of emotion in music. As such, a considerable amount of work has been dedicated to the development of automatic music emotion recognition (MER) systems [6]. Given the perceptual nature of human emotion, most existing work on MER has pursued supervised machine learning approaches, training MER systems on emotion labels or ratings entered by human subjects for a number of training clips.

The only current evaluation task for MER is the audio mood classification (AMC) task of the annual Music Information Retrieval Evaluation eXchange² (MIREX) [1]. The audio files (totaling 600 clips) are available to the participants of the task, who have agreed not to distribute the files for commercial purposes. Being the only benchmark in the field of MER so far, this contest draws many participants every year. However, AMC describes emotions using five discrete emotion clusters instead of affect dimensions (e.g., valence and arousal). The clusters do not have origins in the psychology literature, and some have noted semantic or acoustic overlap between clusters [3]. Furthermore, the dataset applies only a single static rating per audio clip, which belies the time-varying nature of music.

¹ http://www.multimediaeval.org
² http://www.music-ir.org/mirex/wiki/

Our new benchmarking corpus employs Creative Commons³ (CC) licensed music from the Free Music Archive⁴ (FMA), which enables us to redistribute the content. For annotations we have turned to crowdsourcing using Amazon Mechanical Turk (MTurk)⁵, as others have found success using these tools to label large libraries [2, 5]. In addition, we have developed a two-stage procedure for filtering out poor-quality workers, in which workers must first pass a test demonstrating a thorough understanding of the task and an ability to produce good-quality work. The final dataset comprises 1000 45-second clips, each annotated by a minimum of 10 workers, making it substantially larger than any existing music emotion dataset.

TASK DESCRIPTION

This task comprises two subtasks. In the first subtask, the dynamic emotion characterization task, the emotional dimensions, arousal and valence, should be determined for the given song continuously in time; the temporal resolution is one second. The second subtask, the static emotion characterization task, requires participants to deploy multimodal features to automatically detect arousal and valence for each song. We developed a dataset of 1000 songs, which is split into a development set (700 songs) and a test set (300 songs). These affective features can be used in recommendation and retrieval platforms. There are already examples of mood-based or emotion-based online radios, e.g., Stereomood⁶.

Run description

Our task comprises two subtasks:

Subtask 1, dynamic estimation: Participants estimate the valence and arousal scores continuously in time. For every segment, which is 1 second long, valence and arousal scores between -1 and 1 should be estimated. Each team can submit up to 3 runs for this subtask.

Subtask 2, static estimation: Participants estimate the valence and arousal scores of the whole 45-second excerpt extracted from a song. Each team can submit 3 runs for this subtask.

For both subtasks, the main run may use any features automatically extracted from the audio or from the metadata provided by the organizers. This is the required run. Optional runs, or general runs, may additionally use external data.

³ http://creativecommons.org/
⁴ http://freemusicarchive.org/
⁵ http://mturk.com
⁶ www.stereomood.com
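The constraints stated for the dynamic subtask (1-second segments, scores between -1 and 1, a 45-second excerpt, and at most 3 runs) can be expressed as a small sanity check. This is only a sketch: the in-memory layout (a dictionary from clip identifier to a list of (arousal, valence) pairs) and the function names are hypothetical and do not describe the official submission format.

```python
from typing import Dict, List, Tuple

CLIP_SECONDS = 45  # excerpt length given in the task description
MAX_RUNS = 3       # run limit per subtask given in the task description

Pair = Tuple[float, float]  # (arousal, valence); this ordering is an assumption


def check_dynamic_run(run: Dict[str, List[Pair]]) -> None:
    """Validate a hypothetical dynamic run: clip id -> one pair per 1-second segment."""
    for clip_id, segments in run.items():
        if len(segments) > CLIP_SECONDS:
            raise ValueError(f"{clip_id}: more than {CLIP_SECONDS} segments")
        for second, pair in enumerate(segments):
            for score in pair:
                if not -1.0 <= score <= 1.0:
                    raise ValueError(
                        f"{clip_id}, second {second}: score {score} outside [-1, 1]"
                    )


def check_run_count(runs: List[object]) -> None:
    """Reject more runs than the task allows for one subtask."""
    if len(runs) > MAX_RUNS:
        raise ValueError(f"at most {MAX_RUNS} runs may be submitted per subtask")
```

The check only enforces an upper bound on the number of segments, because the text does not say that every second of the excerpt must receive an estimate.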

DATASET AND GROUND TRUTH

the annotations of the first 5 seconds. The average W is 0.23 ± 0.16 for arousal and 0.28 ± 0.21 for valence. The observed agreement was statistically significant for arousal in 60.0% of songs and for valence in 65.8% of songs.
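If W here denotes Kendall's coefficient of concordance (an assumption, since the surviving text gives only the symbol), the per-song agreement could be computed along the following lines. The sketch treats each annotator as a rater and each time step of a song as an item to be ranked, which is also an assumption; it applies the standard tie correction and does not compute the significance test.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(ratings: Sequence[Sequence[float]]) -> float:
    """Kendall's W for ratings[r][t]: the rating of rater r for item t."""
    m = len(ratings)       # number of raters (annotators)
    n = len(ratings[0])    # number of items (time steps, by assumption)
    rank_rows = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[t] for row in rank_rows) for t in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)

    # Tie correction: sum of (t^3 - t) over every group of tied values per rater.
    ties = 0
    for row in ratings:
        counts: Dict[float, int] = {}
        for v in row:
            counts[v] = counts.get(v, 0) + 1
        ties += sum(c ** 3 - c for c in counts.values())

    denominator = m ** 2 * (n ** 3 - n) - m * ties
    if denominator == 0:
        raise ValueError("W is undefined when every rater gives constant ratings")
    return 12 * s / denominator
```

A value of 1 indicates that all raters order the items identically, and a value of 0 indicates no agreement in the rank sums.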

BASELINE RESULTS

The following features were extracted from the audio signals: Mel-Frequency Cepstrum Coefficients (MFCC), octave-based spectral contrast, Statistical Spectrum Descriptors (SSDs), which are composed of spectral centroid, spectral flux, spectral rolloff, and spectral flatness, in that order, and chromagram. The following features were extracted using the Echonest API: timbre, pitch, and loudness features.
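To make the four spectral descriptors concrete, the sketch below computes them for a single audio frame using common textbook definitions. It is not the extraction code used for the baseline: the naive DFT, the 0.85 rolloff fraction, the use of magnitudes rather than power for flatness, and all function names are assumptions chosen for illustration.

```python
import cmath
import math
from typing import List, Sequence, Tuple

EPS = 1e-12  # guards against log(0) and division by zero


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_descriptors(
    prev_frame: Sequence[float],
    frame: Sequence[float],
    sample_rate: float,
    rolloff_fraction: float = 0.85,  # assumed value; the paper does not state one
) -> Tuple[float, float, float, float]:
    """Return (centroid, flux, rolloff, flatness), in that order, for one frame."""
    n = len(frame)
    mag = magnitude_spectrum(frame)
    prev_mag = magnitude_spectrum(prev_frame)
    freqs = [k * sample_rate / n for k in range(len(mag))]
    total = sum(mag) + EPS

    # Centroid: magnitude-weighted mean frequency.
    centroid = sum(f * m for f, m in zip(freqs, mag)) / total

    # Flux: Euclidean distance between successive magnitude spectra.
    flux = math.sqrt(sum((m - p) ** 2 for m, p in zip(mag, prev_mag)))

    # Rolloff: lowest frequency below which the given fraction of magnitude lies.
    cumulative = 0.0
    rolloff = freqs[-1]
    for f, m in zip(freqs, mag):
        cumulative += m
        if cumulative >= rolloff_fraction * total:
            rolloff = f
            break

    # Flatness: geometric mean divided by arithmetic mean of the magnitudes.
    log_mean = sum(math.log(m + EPS) for m in mag) / len(mag)
    flatness = math.exp(log_mean) / (total / len(mag))

    return centroid, flux, rolloff, flatness
```

For a pure tone the centroid and rolloff both sit at the tone's frequency and the flatness is close to 0, whereas a single impulse has a flat spectrum and a flatness close to 1.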

A Multivariate Linear Regression (MLR) was selected for the baseline system because it is a simple and generalizable prediction method. The MLR was trained on the development set and evaluated on the test set. All annotations, both static and dynamic, were scaled to [−0.5, 0.5].

For the static results, the Euclidean distance between the estimated and annotated arousal and valence points, reported as a mean absolute error (distance), and R² were calculated. For the dynamic results, the mean distance and Kendall’s Tau rank correlation (τ) were used; the measures on the dynamically annotated data are averaged over all clips. The random-level baseline, against which our results are compared, is obtained by setting the target to the average arousal and valence scores of the training set.

The results that are significantly better than the random level (Wilcoxon test, p < 0.01) were the static arousal estimation (Distance = 0.10 ± 0.07, R² = 0.07) and the dynamic arousal estimation (Distance = 0.08 ± 0.05, τ = 0.15 ± 0.22). On the static ratings, the arousal estimates are far better than the valence estimates, which are on the order of chance level. Consistently, arousal estimation results are superior to valence estimation results on the continuous, dynamic affect estimation task.
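The evaluation measures named above can be sketched in a few lines. This is an illustration under stated assumptions rather than the organizers' evaluation script: the distance is taken to be the mean absolute difference along one dimension, Kendall's Tau is computed as the tie-corrected tau-b variant, and all function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def mean_baseline(train_targets: Sequence[float]) -> float:
    """Random-level baseline: the average score on the training set."""
    return sum(train_targets) / len(train_targets)


def mean_absolute_distance(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute difference between estimates and annotations."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def r_squared(pred: Sequence[float], target: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, target))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for constant targets")
    return 1.0 - ss_res / ss_tot


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b; returns NaN when either sequence is constant."""
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denominator = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denominator == 0:
        return float("nan")
    return (concordant - discordant) / denominator


def mean_over_clips(
    metric: Callable[[Sequence[float], Sequence[float]], float],
    preds: List[Sequence[float]],
    targets: List[Sequence[float]],
) -> float:
    """Average a per-clip metric over all clips, as done for the dynamic results."""
    values = [metric(p, t) for p, t in zip(preds, targets)]
    return sum(values) / len(values)
```

Note that a constant prediction such as the mean baseline has no defined rank correlation with the annotations, which is why `kendall_tau` returns NaN in that case; how the original evaluation handled this is not stated in the text.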

REFERENCES

[1] Hu, J. S. Downie, Laurier, Bay, and A. F. Ehmann. The 2007 MIREX audio mood classification task: Lessons learned. In Proc. Int. Soc. Music Info. Retrieval Conf., pages 462-467, 2008.

[2] Y. E. Kim, E. Schmidt, and Emelle. Moodswings: A collaborative game for music mood label collection. In Proc. Int. Soc. Music Info. Retrieval Conf., pages 231-236, 2008.

[3] Laurier and Herrera. Audio music mood classification using support vector machine. In MIREX task on Audio Mood Classification, 2007.

[4] Soleymani and Larson. Crowdsourcing for affective annotation of video: Development of a viewer-reported boredom corpus. In Workshop on Crowdsourcing for Search Evaluation, SIGIR 2010, Geneva, Switzerland, 2010.

[5] J. A. Speck, E. M. Schmidt, B. G. Morton, and Y. E. Kim. A comparative study of collaborative vs. traditional musical mood annotation. In Proc. Int. Soc. Music Info. Retrieval Conf., 2011.

[6] Y.-H. Yang and H. H. Chen. Music Emotion Recognition. CRC Press, Boca Raton, Florida, 2011.