Predicting Affect in Music Using Regression Methods on Low Level Features

Rahul Gupta, Shrikanth Narayanan
Signal Analysis and Interpretation Lab (SAIL), University of Southern California, Los Angeles, CA, USA
guptarah@usc.edu, shri@sipi.usc.edu

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
Music has been shown to impact the affective states of the listener. The emotion in music task at the MediaEval 2015 challenge focuses on predicting the affective dimensions of valence and arousal in music using low level features. In particular, this edition of the challenge involves prediction on full length songs given a training set containing shorter 30 second clips. We approach the problem as a regression task and test several regression algorithms. We originally proposed these regression methods on the dataset from the previous edition of the same task (MediaEval 2014), which involved prediction on 30 second clips instead of full length songs. Through evaluation on the 2015 data set, we obtain a point of reference for the model performances on longer song clips. Whereas our models perform relatively well in predicting arousal (root mean square error: .24), we do not obtain good results for valence prediction (root mean square error: .35). We analyze the results and the experimental setup and discuss plausible solutions for better prediction.

1. INTRODUCTION
Music is an important part of media, and considerable research has gone into understanding and indexing the music signal [1, 2]. Music has been shown to impact the affective states of listeners, and in-depth analysis of the relation between music and affect can benefit both the understanding and the design of music. Over the past few years, the emotion in music task at various MediaEval challenges [3, 4, 5] has provided a unified platform for understanding the affective characteristics of music signals. The emotion in music task at MediaEval 2015 [5] provides a training set which is a subset of the 2014 challenge data, with valence and arousal annotations over 30 second clips. This subset was chosen for its better quality annotations, as described in the overview paper [5]. However, this edition is also unique in the sense that predictions have to be made on a test set containing full length songs. This poses the challenge of generalizing models trained over smaller music segments for prediction on longer segments.

In this work, we present results on affect prediction in music using our previous models developed on the 2014 challenge data set. We tested multiple regression models followed by a smoothing operation in last year's challenge [6] and more recently developed a Boosted Ensemble of Single feature Filters (BESiF) algorithm [7] for affect prediction in music. In general, the affective signals evolve smoothly over time and do not undergo abrupt changes. Our models take this factor into account by learning the mapping from features to the affective dimensions while also accounting for the smooth temporal evolution of affect. In the 2015 emotion in music task, our best models obtain root mean square error values of .35 and .24 for valence and arousal prediction, respectively. In the next section, we describe our methodology in detail.

2. METHODOLOGY
The 2015 challenge provides a development set consisting of 30 second clips from 431 songs, annotated at a rate of 2 frames per second. The baseline feature set is extracted using OpenSMILE [8] and contains 260 features. The test set contains 58 full length songs annotated at the same frame rate as the development set. We use three different regression methods to predict the affective dimensions of valence and arousal from the 260 baseline features. We describe these methods below.

2.1 Linear Regression + Smoothing (LR+S)
In this model, we use the 260 features and learn separate linear regression models to predict arousal and valence. After obtaining the decisions, we perform a smoothing operation by low pass filtering the frame-wise arousal and valence values. We use a moving average filter as the low pass filter, with the filter length tuned using three fold inner cross-validation on the train set (arousal filter length = 13; valence filter length = 38). The smoothing operation not only removes high frequency noise, but also takes the local context into account while making the decision for a frame. The decision for a frame is given as an unweighted combination of the frame values in a window centered around that frame, thereby incorporating local context.
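As a minimal illustration of the LR+S pipeline described above, the following Python sketch fits a linear regression model and smooths the frame-wise outputs with a moving average filter. The actual system was implemented separately; the function names (moving_average, lr_plus_s) and the use of scikit-learn/NumPy are our own choices for illustration, and only the filter lengths correspond to the tuned values reported above.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def moving_average(pred, length):
        """Smooth a 1-D frame-wise prediction sequence with a moving average."""
        kernel = np.ones(length) / length
        # mode="same" keeps the output aligned with the input frames (2 fps)
        return np.convolve(pred, kernel, mode="same")

    def lr_plus_s(X_train, y_train, X_test, filter_length):
        """Linear regression followed by moving-average smoothing (LR+S)."""
        model = LinearRegression().fit(X_train, y_train)
        raw = model.predict(X_test)  # frame-wise decisions from the 260 features
        return moving_average(raw, filter_length)

    # Filter lengths tuned via three fold inner cross-validation in the paper:
    # arousal_pred = lr_plus_s(X_train, arousal_train, X_test, filter_length=13)
    # valence_pred = lr_plus_s(X_train, valence_train, X_test, filter_length=38)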
2.2 Least Squares Boosting + Smoothing (LSB+S)
Least squares boosting [9, 10] is another regression algorithm trained using gradient boosting [9]. We use the "fitensemble" function in Matlab to train a least squares boosting model for predicting valence and arousal. The base learners used for least squares boosting are regression trees [11]. The number of regression trees in the ensemble is tuned using 3 fold cross-validation on the train set. After obtaining the frame-wise decisions from the least squares boosting algorithm, we perform a smoothing operation as explained in Section 2.1.
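For readers without Matlab, an analogous least squares boosting model can be sketched with scikit-learn's GradientBoostingRegressor, which likewise boosts regression trees under a squared-error (least squares) loss. This is a stand-in for the "fitensemble" call rather than the exact configuration used in our runs, and n_trees below is a placeholder for the cross-validated value.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def lsb_plus_s(X_train, y_train, X_test, n_trees, filter_length):
        """Least squares boosting on regression trees, then smoothing (LSB+S)."""
        booster = GradientBoostingRegressor(
            loss="squared_error",  # least-squares boosting
            n_estimators=n_trees,  # tuned with 3-fold CV on the train set
        )
        booster.fit(X_train, y_train)
        raw = booster.predict(X_test)
        kernel = np.ones(filter_length) / filter_length
        return np.convolve(raw, kernel, mode="same")  # smoothing as in Section 2.1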
2.3 Boosted Ensemble of Single feature Filters (BESiF)
We proposed another gradient boosting based algorithm on the 2014 emotion in music data set [7]. In this algorithm, the base learners are filters (analogous to the regression trees used in the LSB+S algorithm). The motivation behind this algorithm was to perform a joint learning of regression and smoothing, unlike the previous two methods. The filters not only learn the mapping between the low level features and the affective dimensions, but also perform temporal smoothing. A detailed description of the training algorithm can be found in [7].
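The exact BESiF training procedure is given in [7]. Purely to illustrate the idea of boosting with filter base learners, the sketch below runs a gradient boosting loop for a squared loss in which each weak learner is a least-squares FIR filter fit on a single feature's time series. All names (fit_fir_filter, besif_like_boosting) and hyperparameters (number of rounds, filter length, shrinkage) are hypothetical, and this should not be read as the published algorithm.

    import numpy as np

    def fit_fir_filter(x, r, length):
        """Least-squares FIR filter mapping one feature's time series x to residual r."""
        T = len(x)
        # Lagged design matrix: column j holds x delayed by j frames.
        lags = np.column_stack(
            [np.concatenate([np.zeros(j), x[:T - j]]) for j in range(length)])
        coeffs, *_ = np.linalg.lstsq(lags, r, rcond=None)
        return coeffs, lags @ coeffs

    def besif_like_boosting(X, y, n_rounds=50, filter_length=13, shrinkage=0.1):
        """Gradient boosting with single-feature FIR filters as base learners."""
        pred = np.zeros(len(y))
        ensemble = []  # (feature index, filter coefficients) per round
        for _ in range(n_rounds):
            residual = y - pred  # negative gradient of the squared loss
            best = None
            for f in range(X.shape[1]):  # pick the single best feature this round
                coeffs, fit = fit_fir_filter(X[:, f], residual, filter_length)
                err = np.sum((residual - fit) ** 2)
                if best is None or err < best[0]:
                    best = (err, f, coeffs, fit)
            _, f, coeffs, fit = best
            pred += shrinkage * fit
            ensemble.append((f, coeffs))
        return ensemble, pred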
2.4 Unweighted combination of LR+S, LSB+S and BESiF algorithms
Our final model is an unweighted combination of the previous three models. Unweighted combinations of models have been shown to help prediction if and when the models capture complementary information from the features [12, 13]. In the next section, we present our results and analysis.

3. RESULTS AND DISCUSSION
We show the results from the four models presented above in Table 1.

Table 1: Results on valence and arousal prediction using the proposed regression systems.

    Method                     Valence          Arousal
                               RMSE     ρ       RMSE     ρ
    Baseline [5]               0.37     0.01    0.27     0.36
    LR+S                       0.35     0.01    0.24     0.65
    LSB+S                      0.35     0.05    0.24     0.59
    BESiF                      0.37    -0.04    0.28     0.50
    Unweighted combination     0.35     0.00    0.24     0.64

From the results, we observe that our regression approach fails for valence prediction, with close to no correlation with the ground truth. As this was not the case for at least the LR+S system in the previous edition of the challenge (MediaEval 2014 [4]), we suspect that there are inherent differences between the MediaEval 2014 and 2015 data sets. As previously pointed out, this year's challenge involves prediction over full length songs with training on 30 second clips. This poses a data mismatch problem, particularly with respect to our BESiF algorithm: the filters in the algorithm are optimized over shorter time series, whereas test set prediction is over longer time series.

In the case of arousal, our systems perform relatively well, and the linear regression system performs the best. The BESiF algorithm again fails to perform better than the other algorithms, primarily because of the data mismatch problem: the filters in the BESiF algorithm, when trained on shorter-duration annotation time series, may not capture the dynamics that can exist over longer-duration annotations. The success of linear regression in arousal prediction offers some promise for problems involving such a temporal mismatch between train and test sets. In the next section, we discuss how our current approach could be modified to improve the results.
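For concreteness, the metrics reported in Table 1 and the unweighted combination of Section 2.4 can be computed with a few lines of NumPy. This is an illustrative per-sequence computation; the exact per-song aggregation of RMSE and Pearson's ρ used by the challenge is defined in the overview paper [5], and the variable names below are placeholders.

    import numpy as np

    def evaluate(pred, truth):
        """Frame-level RMSE and Pearson correlation for one annotation sequence."""
        rmse = np.sqrt(np.mean((pred - truth) ** 2))
        rho = np.corrcoef(pred, truth)[0, 1]
        return rmse, rho

    def unweighted_combination(*system_outputs):
        """Average the frame-wise predictions of the individual systems."""
        return np.mean(np.stack(system_outputs), axis=0)

    # e.g., for arousal on one test song:
    # combined = unweighted_combination(lr_s_pred, lsb_s_pred, besif_pred)
    # rmse, rho = evaluate(combined, arousal_truth)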
4. FUTURE WORK
Given that our systems do not perform well for valence prediction, we aim to perform a detailed analysis to understand the reasons behind the poor performance. Despite the presence of features correlated with valence in the train set and our success in the last edition of the challenge, the low performance on valence prediction poses a challenge in the form of understanding prediction over longer song segments. We suspect that providing annotators with short song segments versus longer segments may have an impact on the annotation itself: listening to longer clips may alter affective perceptions and introduce other annotator biases. In particular, we aim to investigate the performance of our BESiF algorithm and modify it for the given problem setting. This may involve including adaptation schemes [14, 15] to model differences in annotation between the train and test sets and any other mismatch that may exist.

Several previous works have also reported differences in performance between arousal and valence prediction using acoustic features similar to the ones used in this work [16, 17, 18]. This is worth investigating, as it may imply that valence prediction involves features not considered in the baseline feature set. In the case of continuous emotion tracking involving human interaction, the video modality has been shown to add complementary information and even outperform audio signals [18, 19, 20]. This poses a very interesting problem for valence prediction in music, as the emotion annotations are made using the music audio only. Whereas videos exist for certain songs, it has not been investigated whether videos can be associated with, and even alter, the perceived affective evolution of a song. Along similar lines, several works propose the use of song lyrics in predicting affect [21, 22]. Hence, the textual content of a song can also be incorporated towards the development of an enhanced multimodal affect prediction system.

5. CONCLUSION
In this work, we use several previously proposed regression methods on the emotion in music task at the MediaEval 2015 challenge. We note that despite our success in the previous edition of the challenge, our methods fail, particularly for valence prediction. Our methods perform relatively well for arousal prediction; however, the trends in performance across models are not as expected. We suspect that there could be several reasons for the unexpected results. Primarily, the difference in clip lengths between the train and test sets could lead to a mismatched model for test set prediction. We also suspect that it may cause differences in the perception of affect in music, leading to differences in affect annotation. Instead of providing answers about the relation between low level features and affective dimensions, our work in this paper opens up more questions regarding the affective evolution of the music signal. With regard to future work, studying the differences in perception of short music clips versus longer clips, the differences between the affective dimensions of valence and arousal with respect to model development, and investigating new algorithmic designs will be our initial steps.

6. REFERENCES
[1] Mira Balaban, Kemal Ebcioglu, and Otto E. Laske. Understanding Music with AI: Perspectives on Music Cognition. 1992.
[2] Michael Tanner and Malcolm Budd. Understanding music. Proceedings of the Aristotelian Society, Supplementary Volumes, pages 215-248, 1985.
[3] Mohammad Soleymani, Michael Caro, Erik Schmidt, and Yi-Hsuan Yang. The MediaEval 2013 brave new task: Emotion in music. In MediaEval 2013 Workshop, Barcelona, Spain, 2013.
[4] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. Emotion in music task at MediaEval 2014. In MediaEval 2014 Workshop, Barcelona, Spain, 2014.
[5] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. Emotion in music task at MediaEval 2015. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[6] Naveen Kumar, Rahul Gupta, Tanaya Guha, Colin Vaz, Maarten Van Segbroeck, Jangwon Kim, and Shrikanth S. Narayanan. Affective feature design and predicting continuous affective dimensions from music. In MediaEval Workshop, Barcelona, Spain, 2014.
[7] Rahul Gupta, Naveen Kumar, and Shrikanth Narayanan. Affect prediction in music using boosted ensemble of filters. In The 2015 European Signal Processing Conference, Nice, France, 2015.
[8] Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the International Conference on Multimedia, pages 1459-1462. ACM, 2010.
[9] Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.
[10] Gene H. Golub and Charles F. Van Loan. An analysis of the total least squares problem. SIAM Journal on Numerical Analysis, 17(6):883-893, 1980.
[11] Jane Elith, John R. Leathwick, and Trevor Hastie. A working guide to boosted regression trees. Journal of Animal Ecology, 77(4):802-813, 2008.
[12] Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, pages 1-15. Springer, 2000.
[13] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[14] George Foster and Roland Kuhn. Mixture-model adaptation for SMT. 2007.
[15] W. A. Ainsworth. Mechanisms of selective feature adaptation. Perception & Psychophysics, 21(4):365-370, 1977.
[16] Mihalis Nicolaou, Hatice Gunes, Maja Pantic, et al. Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Transactions on Affective Computing, 2(2):92-105, 2011.
[17] Angeliki Metallinou, Athanasios Katsamanis, and Shrikanth Narayanan. Tracking continuous emotional trends of participants during affective dyadic interactions using body language and speech information. Image and Vision Computing, 31(2):137-152, 2013.
[18] Rahul Gupta, Nikolaos Malandrakis, Bo Xiao, Tanaya Guha, Maarten Van Segbroeck, Matthew Black, Alexandros Potamianos, and Shrikanth Narayanan. Multimodal prediction of affective dimensions and depression in human-computer interactions. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pages 33-40. ACM, 2014.
[19] Michel Valstar, Björn Schuller, Kirsty Smith, Timur Almaev, Florian Eyben, Jarek Krajewski, Roddy Cowie, and Maja Pantic. AVEC 2014: 3D dimensional affect and depression recognition challenge. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pages 3-10. ACM, 2014.
[20] Vikramjit Mitra, Elizabeth Shriberg, Mitchell McLaren, Andreas Kathol, Colleen Richey, Dimitra Vergyri, and Martin Graciarena. The SRI AVEC-2014 evaluation system. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pages 93-101. ACM, 2014.
[21] S. Omar Ali and Zehra F. Peynircioğlu. Songs and emotions: Are lyrics and melodies equal partners? Psychology of Music, 34(4):511-534, 2006.
[22] Youngmoo E. Kim, Erik M. Schmidt, Raymond Migneco, Brandon G. Morton, Patrick Richardson, Jeffrey Scott, Jacquelin A. Speck, and Douglas Turnbull. Music emotion recognition: A state of the art review. In Proc. ISMIR, pages 255-266. Citeseer, 2010.