Predicting Affect in Music Using Regression Methods on Low Level Features

Rahul Gupta, Shrikanth Narayanan
Signal Analysis and Interpretation Lab (SAIL), University of Southern California, Los Angeles, CA, USA
guptarah@usc.edu, shri@sipi.usc.edu

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
Music has been shown to impact the affective states of the listener. The emotion in music task at the MediaEval 2015 challenge focuses on predicting the affective dimensions of valence and arousal in music using low level features. In particular, this edition of the challenge involves prediction on full length songs given a training set containing shorter 30 second clips. We approach the problem as a regression task and test several regression algorithms. We originally proposed these regression methods on the dataset from the previous edition of the same task (MediaEval 2014), which involved prediction on 30 second clips instead of full length songs. Through evaluation on the 2015 data set, we obtain a point of reference for the model performances on longer song clips. Whereas our models perform relatively well in predicting arousal (root mean square error: .24), we do not obtain good results for valence prediction (root mean square error: .35). We analyze the results and the experimental setup and discuss plausible solutions for better prediction.

1. INTRODUCTION
Music is an important part of media, and considerable research has gone into understanding and indexing the music signal [1, 2]. Music has been shown to impact the affective states of listeners, and in-depth analysis of the relation between music and affect can benefit both the understanding and the design of music. Over the past few years, the emotion in music task at various MediaEval challenges [3, 4, 5] has provided a unified platform for understanding the affective characteristics of music signals. The emotion in music task at MediaEval 2015 [5] provides a training set which is a subset of the 2014 challenge data, with valence and arousal annotations over 30 second clips. This subset was chosen for its better quality annotations, as described in the overview paper [5]. However, this edition is also unique in the sense that predictions have to be made on a test set containing full length songs. This poses the challenge of generalizing models trained over smaller music segments for prediction on longer segments.

In this work, we present results on affect prediction in music using our previous models developed on the 2014 challenge data set. We tested multiple regression models followed by a smoothing operation in last year's challenge [6] and more recently developed a Boosted Ensemble of Single feature Filters (BESiF) algorithm [7] for affect prediction in music. In general, the affective signals evolve smoothly over time and do not undergo abrupt changes. Our models take this factor into account by learning the mapping from features to the affective dimensions while also accounting for the smooth temporal evolution of affect. In the 2015 emotion in music task, our best models obtain root mean square error values of .35 and .24 for valence and arousal prediction, respectively. In the next section, we describe our methodology in detail.

2. METHODOLOGY
The 2015 challenge provides a development set consisting of 30 second clips from 431 songs, annotated at a rate of 2 frames per second. The baseline feature set is extracted using OpenSMILE [8] and contains 260 features. The test set contains 58 full length songs annotated at the same frame rate as the development set. We use three different regression methods to predict the affective dimensions of valence and arousal from the 260 baseline features. We describe these methods below.

2.1 Linear Regression + Smoothing (LR+S)
In this model, we use the 260 features and learn separate linear regression models to predict arousal and valence. After obtaining the decisions, we perform a smoothing operation by low pass filtering the frame-wise arousal and valence values. We use a moving average filter as the low pass filter, with the filter length tuned using three fold inner cross-validation on the train set (arousal filter length = 13; valence filter length = 38). The smoothing operation not only removes high frequency noise, but also takes the local context into account while making the decision for a frame. The decision for a frame is given as an unweighted combination of the frame values in a window centered around that frame, thereby incorporating local context.
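As a minimal illustration of the LR+S pipeline described above, the following Python sketch fits a linear regression model and smooths the frame-wise outputs with a moving average filter. The actual system was implemented separately; the function names (moving_average, lr_plus_s) and the use of scikit-learn/NumPy are our own choices for illustration, and only the filter lengths correspond to the tuned values reported above.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def moving_average(pred, length):
        """Smooth a 1-D frame-wise prediction sequence with a moving average."""
        kernel = np.ones(length) / length
        # mode="same" keeps the output aligned with the input frames (2 fps)
        return np.convolve(pred, kernel, mode="same")

    def lr_plus_s(X_train, y_train, X_test, filter_length):
        """Linear regression followed by moving-average smoothing (LR+S)."""
        model = LinearRegression().fit(X_train, y_train)
        raw = model.predict(X_test)  # frame-wise decisions from the 260 features
        return moving_average(raw, filter_length)

    # Filter lengths tuned via three fold inner cross-validation in the paper:
    # arousal_pred = lr_plus_s(X_train, arousal_train, X_test, filter_length=13)
    # valence_pred = lr_plus_s(X_train, valence_train, X_test, filter_length=38)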
2.2 Least Squares Boosting + Smoothing (LSB+S)
Least squares boosting [9, 10] is another regression algorithm trained using gradient boosting [9]. We use the "fitensemble" function in Matlab to train a least squares boosting model for predicting valence and arousal. The base learners used for least squares boosting are regression trees [11]. The number of regression trees in the ensemble is tuned using 3 fold cross-validation on the train set. After obtaining the frame-wise decisions from the least squares boosting algorithm, we perform a smoothing operation as explained in Section 2.1.
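For readers without Matlab, an analogous least squares boosting model can be sketched with scikit-learn's GradientBoostingRegressor, which likewise boosts regression trees under a squared-error (least squares) loss. This is a stand-in for the "fitensemble" call rather than the exact configuration used in our runs, and n_trees below is a placeholder for the cross-validated value.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def lsb_plus_s(X_train, y_train, X_test, n_trees, filter_length):
        """Least squares boosting on regression trees, then smoothing (LSB+S)."""
        booster = GradientBoostingRegressor(
            loss="squared_error",  # least-squares boosting
            n_estimators=n_trees,  # tuned with 3-fold CV on the train set
        )
        booster.fit(X_train, y_train)
        raw = booster.predict(X_test)
        kernel = np.ones(filter_length) / filter_length
        return np.convolve(raw, kernel, mode="same")  # smoothing as in Section 2.1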
2.3 Boosted Ensemble of Single feature Filters (BESiF)
We proposed another gradient boosting based algorithm on the 2014 emotion in music data set [7]. In this algorithm, the base learners are filters (analogous to the regression trees used in the LSB+S algorithm). The motivation behind this algorithm was to perform a joint learning of regression and smoothing, unlike the previous two methods. The filters not only learn the mapping between the low level features and the affective dimensions, but also perform temporal smoothing. A detailed description of the training algorithm can be found in [7].
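The exact BESiF training procedure is given in [7]. Purely to illustrate the idea of boosting with filter base learners, the sketch below runs a gradient boosting loop for a squared loss in which each weak learner is a least-squares FIR filter fit on a single feature's time series. All names (fit_fir_filter, besif_like_boosting) and hyperparameters (number of rounds, filter length, shrinkage) are hypothetical, and this should not be read as the published algorithm.

    import numpy as np

    def fit_fir_filter(x, r, length):
        """Least-squares FIR filter mapping one feature's time series x to residual r."""
        T = len(x)
        # Lagged design matrix: column j holds x delayed by j frames.
        lags = np.column_stack(
            [np.concatenate([np.zeros(j), x[:T - j]]) for j in range(length)])
        coeffs, *_ = np.linalg.lstsq(lags, r, rcond=None)
        return coeffs, lags @ coeffs

    def besif_like_boosting(X, y, n_rounds=50, filter_length=13, shrinkage=0.1):
        """Gradient boosting with single-feature FIR filters as base learners."""
        pred = np.zeros(len(y))
        ensemble = []  # (feature index, filter coefficients) per round
        for _ in range(n_rounds):
            residual = y - pred  # negative gradient of the squared loss
            best = None
            for f in range(X.shape[1]):  # pick the single best feature this round
                coeffs, fit = fit_fir_filter(X[:, f], residual, filter_length)
                err = np.sum((residual - fit) ** 2)
                if best is None or err < best[0]:
                    best = (err, f, coeffs, fit)
            _, f, coeffs, fit = best
            pred += shrinkage * fit
            ensemble.append((f, coeffs))
        return ensemble, pred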
2.4 Unweighted combination of LR+S, LSB+S and BESiF algorithms
Our final model is an unweighted combination of the previous three models. Unweighted combinations of models have been shown to help prediction if and when the models capture complementary information from the features [12, 13]. In the next section, we present our results and analysis.

3. RESULTS AND DISCUSSION
We show the results from the four models presented above in Table 1.

Table 1: Results on valence and arousal prediction using the proposed regression systems.

    Method                     Valence          Arousal
                               RMSE     ρ       RMSE     ρ
    Baseline [5]               0.37     0.01    0.27     0.36
    LR+S                       0.35     0.01    0.24     0.65
    LSB+S                      0.35     0.05    0.24     0.59
    BESiF                      0.37    -0.04    0.28     0.50
    Unweighted combination     0.35     0.00    0.24     0.64

From the results, we observe that our regression approach fails for valence prediction, with close to no correlation with the ground truth. As this was not the case for at least the LR+S system in the previous edition of the challenge (MediaEval 2014 [4]), we suspect that there are inherent differences between the MediaEval 2014 and 2015 data sets. As previously pointed out, this year's challenge involves prediction over full length songs with training on 30 second clips. This poses a data mismatch problem, particularly with respect to our BESiF algorithm: the filters in the algorithm are optimized over shorter time series, whereas test set prediction is over longer time series.

In the case of arousal, our systems perform relatively well, and the linear regression system performs the best. The BESiF algorithm again fails to perform better than the other algorithms, primarily because of the data mismatch problem: the filters in the BESiF algorithm, when trained on shorter-duration annotation time series, may not capture the dynamics that can exist over longer-duration annotations. The success of linear regression in arousal prediction offers some promise for problems involving such a temporal mismatch between train and test sets. In the next section, we discuss how our current approach could be modified to improve the results.
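For concreteness, the metrics reported in Table 1 and the unweighted combination of Section 2.4 can be computed with a few lines of NumPy. This is an illustrative per-sequence computation; the exact per-song aggregation of RMSE and Pearson's ρ used by the challenge is defined in the overview paper [5], and the variable names below are placeholders.

    import numpy as np

    def evaluate(pred, truth):
        """Frame-level RMSE and Pearson correlation for one annotation sequence."""
        rmse = np.sqrt(np.mean((pred - truth) ** 2))
        rho = np.corrcoef(pred, truth)[0, 1]
        return rmse, rho

    def unweighted_combination(*system_outputs):
        """Average the frame-wise predictions of the individual systems."""
        return np.mean(np.stack(system_outputs), axis=0)

    # e.g., for arousal on one test song:
    # combined = unweighted_combination(lr_s_pred, lsb_s_pred, besif_pred)
    # rmse, rho = evaluate(combined, arousal_truth)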
4. FUTURE WORK
Given that our systems do not perform well for valence prediction, we aim to perform a detailed analysis to understand the reasons behind the poor performance. Despite the presence of features correlated with valence in the train set and our success in the last edition of the challenge, the low performance on valence prediction poses a challenge in the form of understanding prediction over longer song segments. We suspect that providing annotators with short song segments versus longer segments may have an impact on the annotation itself: listening to longer clips may alter affective perceptions and introduce other annotator biases. In particular, we aim to investigate the performance of our BESiF algorithm and modify it for the given problem setting. This may involve including adaptation schemes [14, 15] to model differences in annotation between the train and test sets and any other mismatch that may exist.

Several previous works have also reported differences in performance between arousal and valence prediction using acoustic features similar to the ones used in this work [16, 17, 18]. This is worth investigating, as it may imply that valence prediction involves features not considered in the baseline feature set. In the case of continuous emotion tracking involving human interaction, the video modality has been shown to add complementary information and even outperform audio signals [18, 19, 20]. This poses a very interesting problem for valence prediction in music, as the emotion annotations are made using the music audio only. Whereas videos exist for certain songs, it has not been investigated whether videos can be associated with, and even alter, the perceived affective evolution of a song. Along similar lines, several works propose the use of song lyrics in predicting affect [21, 22]. Hence, the textual content of a song can also be incorporated towards the development of an enhanced multimodal affect prediction system.

5. CONCLUSION
In this work, we use several previously proposed regression methods on the emotion in music task at the MediaEval 2015 challenge. We note that despite our success in the previous edition of the challenge, our methods fail, particularly for valence prediction. Our methods perform relatively well for arousal prediction; however, the trends in performance across models are not as expected. We suspect that there could be several reasons for the unexpected results. Primarily, the difference in clip lengths between the train and test sets could lead to a mismatched model for test set prediction. We also suspect that it may cause differences in the perception of affect in music, leading to differences in affect annotation. Instead of providing answers about the relation between low level features and affective dimensions, our work in this paper opens up more questions regarding the affective evolution of the music signal. With regard to future work, studying the differences in perception of short music clips versus longer clips, the differences between the affective dimensions of valence and arousal with respect to model development, and investigating new algorithmic designs will be our initial steps.

6. REFERENCES
[1] Mira Balaban, Kemal Ebcioglu, and Otto E. Laske. Understanding Music with AI: Perspectives on Music Cognition. 1992.
[2] Michael Tanner and Malcolm Budd. Understanding music. Proceedings of the Aristotelian Society, Supplementary Volumes, pages 215-248, 1985.
[3] Mohammad Soleymani, Michael Caro, Erik Schmidt, and Yi-Hsuan Yang. The MediaEval 2013 brave new task: Emotion in music. In MediaEval 2013 Workshop, Barcelona, Spain, 2013.
[4] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. Emotion in music task at MediaEval 2014. In MediaEval 2014 Workshop, Barcelona, Spain, 2014.
[5] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. Emotion in music task at MediaEval 2015. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[6] Naveen Kumar, Rahul Gupta, Tanaya Guha, Colin Vaz, Maarten Van Segbroeck, Jangwon Kim, and Shrikanth S. Narayanan. Affective feature design and predicting continuous affective dimensions from music. In MediaEval Workshop, Barcelona, Spain, 2014.
[7] Rahul Gupta, Naveen Kumar, and Shrikanth Narayanan. Affect prediction in music using boosted ensemble of filters. In The 2015 European Signal Processing Conference, Nice, France, 2015.
[8] Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the International Conference on Multimedia, pages 1459-1462. ACM, 2010.
[9] Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.
[10] Gene H. Golub and Charles F. Van Loan. An analysis of the total least squares problem. SIAM Journal on Numerical Analysis, 17(6):883-893, 1980.
[11] Jane Elith, John R. Leathwick, and Trevor Hastie. A working guide to boosted regression trees. Journal of Animal Ecology, 77(4):802-813, 2008.
[12] Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, pages 1-15. Springer, 2000.
[13] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[14] George Foster and Roland Kuhn. Mixture-model adaptation for SMT. 2007.
[15] W. A. Ainsworth. Mechanisms of selective feature adaptation. Perception & Psychophysics, 21(4):365-370, 1977.
[16] Mihalis Nicolaou, Hatice Gunes, Maja Pantic, et al. Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Transactions on Affective Computing, 2(2):92-105, 2011.
[17] Angeliki Metallinou, Athanasios Katsamanis, and Shrikanth Narayanan. Tracking continuous emotional trends of participants during affective dyadic interactions using body language and speech information. Image and Vision Computing, 31(2):137-152, 2013.
[18] Rahul Gupta, Nikolaos Malandrakis, Bo Xiao, Tanaya Guha, Maarten Van Segbroeck, Matthew Black, Alexandros Potamianos, and Shrikanth Narayanan. Multimodal prediction of affective dimensions and depression in human-computer interactions. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pages 33-40. ACM, 2014.
[19] Michel Valstar, Björn Schuller, Kirsty Smith, Timur Almaev, Florian Eyben, Jarek Krajewski, Roddy Cowie, and Maja Pantic. AVEC 2014: 3D dimensional affect and depression recognition challenge. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pages 3-10. ACM, 2014.
[20] Vikramjit Mitra, Elizabeth Shriberg, Mitchell McLaren, Andreas Kathol, Colleen Richey, Dimitra Vergyri, and Martin Graciarena. The SRI AVEC-2014 evaluation system. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pages 93-101. ACM, 2014.
[21] S. Omar Ali and Zehra F. Peynircioğlu. Songs and emotions: Are lyrics and melodies equal partners? Psychology of Music, 34(4):511-534, 2006.
[22] Youngmoo E. Kim, Erik M. Schmidt, Raymond Migneco, Brandon G. Morton, Patrick Richardson, Jeffrey Scott, Jacquelin A. Speck, and Douglas Turnbull. Music emotion recognition: A state of the art review. In Proc. ISMIR, pages 255-266. Citeseer, 2010.