=Paper=
{{Paper
|id=Vol-1436/Paper82
|storemode=property
|title=MediaEval 2015: A Segmentation-based Approach to Continuous Emotion Tracking
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper82.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AljanakiWV15
}}
==MediaEval 2015: A Segmentation-based Approach to Continuous Emotion Tracking==
Anna Aljanaki, Frans Wiering, Remco C. Veltkamp
Information and Computing Sciences, Utrecht University, the Netherlands
a.aljanaki@uu.nl, F.Wiering@uu.nl, R.C.Veltkamp@uu.nl

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT

In this paper we approach the task of continuous music emotion recognition using unsupervised audio segmentation as a preparatory step. The MediaEval task requires predicting the emotion of a song with a high time resolution of 2 Hz. Though this resolution is necessary to find the exact locations of emotional changes, we believe that those changes occur more sparsely. We suggest that using bigger time windows for feature extraction and emotion prediction might make emotion recognition more accurate. We use an unsupervised method, Structure Features [6], to segment the audio of both the development set and the evaluation set. Then we use Gaussian Process regression to predict the emotion of each segment, using features extracted with the Essentia and openSMILE frameworks.

1. INTRODUCTION

This working notes paper describes a submission to the Emotion in Music task in the MediaEval 2015 benchmark. The task requires predicting the emotion of music (arousal or valence) from musical audio continuously (over time) with a resolution of 2 Hz. The organizers provided an annotated development set of 431 excerpts of 45 seconds, and an evaluation set of 58 full-length songs. For more detail we refer to the task overview paper [2].

We approach the task of music emotion recognition by taking it to a higher level, i.e., to segment-level emotion recognition. We use an unsupervised audio segmentation method to segment the music into emotionally homogeneous excerpts; next, we predict the emotion of every segment and then resample the result to 2 Hz. As one of the task requirements, baseline features from the openSMILE framework [4] (260 low-level spectral features) have to be used. We also create our own feature set using Essentia, which also contains high-level features and uses bigger time windows for feature extraction, which becomes possible when predicting the emotion of the music per segment.

2. APPROACH

In this section we describe the main steps of our approach, namely annotation preprocessing, feature extraction, the segmentation method, and the learning algorithm.

2.1 Annotation preprocessing

The development set consists of excerpts of 45 seconds, but the annotations are only provided from the 15th second onwards, to provide a generous habituation time to the annotators. Nevertheless, dynamic emotion annotations can have a time lag of 2-4 seconds because of the annotators' reaction time [5]. To compensate for it, we shift the annotations by 3 seconds (i.e., we use audio from second 12 to 42 to extract the features, and couple it with the annotations from second 15 to 45).

2.2 Feature extraction (Essentia)

We use the open-source framework Essentia [3] to extract a range of high-level (scale, tempo, tonal stability, etc.) and low-level (spectral shape, MFCC, chroma, energy, dissonance, etc.) features, for a total of 40 features. For low-level timbral features we use a half-overlapping window of 100 ms; for high-level features we use a window of 3 seconds. We use the same set of features both for segmentation and for emotion recognition, but for segmentation purposes the features are smoothed with a median sliding window and resampled according to beats detected using the Essentia BeatTracker algorithm.

2.3 Segmentation

We use an unsupervised method to perform the segmentation of both development and evaluation set audio. We chose SF (Structural Features) because it performed best in an evaluation of segmentation methods applied to emotional segmentation, with a recall of 67% of emotional boundaries [1]. Using the SF method to segment the development set (instead of employing labeled emotionally homogeneous segments as the ground truth) is a weak spot of our approach, because it degrades the quality of the ground truth data, which is no longer completely human-annotated after this step. Our method could use any other dataset of music excerpts labeled with valence and arousal, but for the purposes of participating in the MediaEval benchmark we use the standard development set provided to all participants.

The SF method is both homogeneity- and repetition-based. It uses a variant of the lag matrix to obtain structural features. The structural features are differentiated to obtain a novelty curve, on which peak picking is performed. The SF method calculates the self-similarity between samples i and j as follows:

S_{i,j} = Θ(ε_{i,j} − ||x_i − x_j||),    (1)

where Θ(z) is the Heaviside step function, x_i is a feature time series transformed using delay coordinates, ||z|| is the Euclidean norm, and ε is a threshold, which is set adaptively for each cell of the matrix S. From the matrix S, structural features are then obtained using a lag matrix, and computing the difference between successive structural features yields a novelty curve.

By means of the segmentation step we obtain 1304 segments with an average segment length of 10.8±5.7 seconds using Essentia features, and 1017 segments with an average length of 10.7±5.3 seconds using openSMILE features on the development set. For each of the segments, we average the continuous emotion annotation inside the segment to obtain the training data. We also segment the songs from the evaluation set in the same way.
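To make Eq. (1) concrete, the following is a minimal sketch (not the authors' implementation) of a thresholded self-similarity matrix with a per-row adaptive threshold and a simplified novelty curve. The helper names (delay_embed, recurrence_matrix, segment_boundaries), the neighbourhood fraction kappa, and the substitution of column differences for the full lag-matrix structural features of the SF method are assumptions made purely for illustration.

```python
# Minimal sketch (not the authors' code): thresholded self-similarity in the
# spirit of Eq. (1), with a simplified novelty curve. Assumes `features` is a
# (n_frames, n_dims) array of beat-synchronous, median-smoothed features.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.signal import find_peaks


def delay_embed(features, m=2, tau=1):
    """Stack m delayed copies of the feature sequence (delay coordinates)."""
    n = features.shape[0] - (m - 1) * tau
    return np.hstack([features[i * tau:i * tau + n] for i in range(m)])


def recurrence_matrix(features, kappa=0.1, m=2, tau=1):
    """Binary self-similarity matrix S[i, j] = Theta(eps_ij - ||x_i - x_j||).

    The threshold eps is set adaptively per row so that a fraction kappa of
    the frames count as similar (one possible way to realize an adaptive eps).
    """
    x = delay_embed(features, m, tau)
    dist = cdist(x, x, metric='euclidean')
    k = max(1, int(kappa * dist.shape[0]))
    eps = np.sort(dist, axis=1)[:, k][:, None]   # per-row adaptive threshold
    return (dist <= eps).astype(float)           # Heaviside step


def novelty_curve(S):
    """Simplified novelty: distance between successive columns of S.

    (The full SF method instead differentiates smoothed structural features
    obtained from a lag matrix; this stand-in only illustrates the idea.)
    """
    return np.linalg.norm(np.diff(S, axis=1), axis=0)


def segment_boundaries(features, min_gap=8):
    """Pick peaks of the novelty curve as candidate segment boundaries."""
    nov = novelty_curve(recurrence_matrix(features))
    peaks, _ = find_peaks(nov, distance=min_gap)
    return peaks
```

In this simplified form the boundary indices refer to beat-synchronous frames; mapping them back to seconds would use the beat times from the tracker.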
2.4 Learning algorithm

We use Gaussian Process regression to predict the valence and arousal values per segment, using maximum likelihood estimation to find the best set of parameters. We use a squared exponential autocorrelation function (radial basis function):

K(i, j) = exp(−(i − j)² / (2θ²)),    (2)

where θ is a tuned parameter, and i and j are points in feature space.
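As an illustration of this regression step, the sketch below fits a Gaussian Process with a squared exponential (RBF) kernel whose length scale is tuned by maximizing the marginal likelihood, as in Eq. (2). scikit-learn is used only as a stand-in (the paper does not state which implementation was used), and the data here are random placeholders with shapes loosely matching the per-segment setup described above.

```python
# Hedged sketch of segment-level GP regression with an RBF kernel; scikit-learn
# is an assumed stand-in, not necessarily the implementation used in the paper.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 40))        # placeholder: 300 segments x 40 features
y_train = rng.uniform(-1, 1, size=300)      # placeholder: mean arousal per segment

# Squared exponential kernel; the length scale (theta in Eq. (2)) and the noise
# level are fitted by maximizing the log marginal likelihood during fit().
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)

X_test = rng.normal(size=(10, 40))          # feature vectors of new segments
segment_pred = gp.predict(X_test)           # one value per segment; in the paper
                                            # these are then resampled to 2 Hz
```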
3. EVALUATION

Figure 1 shows an example of the output of the algorithm.

[Figure 1: Prediction of arousal for the song "You listen" by Meaxic.]

The task is evaluated based on RMSE and Pearson's correlation coefficient between the ground truth and the prediction, averaged across the 58 songs of the test set. The results are displayed in Table 1.

Framework    Target    RMSE             r
Essentia     Valence   0.3576±0.1952    -0.1214±0.4156
Essentia     Arousal   0.2640±0.1341     0.4050±0.3361
openSMILE    Valence   0.2946±0.1473    -0.0853±0.3863
openSMILE    Arousal   0.2854±0.1242     0.1669±0.3955

Table 1: Evaluation results.

The algorithm based on features from Essentia performs much better for arousal (both in terms of correlation and RMSE), but worse for valence. Both algorithms perform unacceptably badly on valence.

4. CONCLUSION

In this paper we described an approach to music emotion variation detection which uses an intermediary step: music segmentation into fragments of homogeneous emotion. We used Gaussian Process modeling to predict the emotion per segment, and two different frameworks (Essentia and openSMILE) to extract the features, which were used both during segmentation and for emotion recognition. Bringing the problem from the level of a sound fragment (half a second) to the level of a short musical segment (10 seconds on average) has two advantages. Firstly, employing longer segments allows extracting musically meaningful features, such as tonality or tempo. Secondly, averaging features and annotations over longer segments can be beneficial as a smoothing step. The runs produced with the baseline openSMILE low-level spectral features could not benefit from these advantages, which could explain part of the difference in performance on arousal. Both algorithms performed very badly on valence.

5. ACKNOWLEDGEMENTS

This publication was supported by the Dutch national program COMMIT/.

6. REFERENCES

[1] A. Aljanaki, F. Wiering, and R. C. Veltkamp. Emotion based segmentation of musical audio. In Proceedings of the 16th International Society for Music Information Retrieval Conference, 2015.
[2] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[3] D. Bogdanov, N. Wack, E. Gomez, S. Gulati, P. Herrera, and O. Mayor. Essentia: an audio analysis library for music information retrieval. In International Society for Music Information Retrieval Conference, pages 493-498, 2013.
[4] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In ACM Multimedia, pages 835-838, 2013.
[5] E. Schubert. Continuous self-report methods. In Handbook of Music and Emotion: Theory, Research, Applications, pages 223-253. Oxford University Press, 2011.
[6] J. Serra, M. Muller, P. Grosche, and J. L. Arcos. Unsupervised music structure annotation by time series structure features and segment similarity. IEEE Transactions on Multimedia, Special Issue on Music Data Mining, 2014.