=Paper=
{{Paper
|id=Vol-1263/paper28
|storemode=property
|title=Dynamic Music Emotion Recognition Using State-Space Models
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_28.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MarkovM14
}}
==Dynamic Music Emotion Recognition Using State-Space Models==
Konstantin Markov, Human Interface Laboratory, The University of Aizu, Fukushima, Japan (markov@u-aizu.ac.jp); Tomoko Matsui, Department of Statistical Modeling, Institute of Statistical Mathematics, Tokyo, Japan (tmatsui@ism.ac.jp)

===Abstract===
This paper describes the temporal music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval 2014 benchmark evaluation campaign. Arousal-valence (A-V) trajectory prediction is cast as a time series filtering task and is modeled with state-space models. These include the standard linear model (the Kalman filter) as well as a novel non-linear, non-parametric dynamic system based on Gaussian Processes. The music signal was parametrized using standard features extracted with the Marsyas toolkit. Based on the preliminary results obtained from a small random validation set, no clear advantage of any feature or model could be observed.

===1. Introduction===
Gaussian Processes (GPs) [4] are becoming increasingly popular in the machine learning community for their ability to learn highly non-linear mappings between two continuous data spaces. Previously, we successfully applied GPs to static music emotion recognition [3]. Dynamic or continuous emotion estimation is a more difficult task, and there are several approaches to it. The simplest is to assume that emotion is constant over a relatively short period of time and to apply static emotion recognition methods. A better approach is to treat the emotion trajectory as a time-varying process and to track it using time series modeling techniques. In [5], the authors use Kalman filters to model the evolution of emotion in time for each of four data partitions. For evaluation, the KL divergence between the predicted and reference A-V point distributions is measured, assuming "perfect" partitioning of the test samples. Our approach is similar in that we also use data partitioning; however, we apply a model selection method. In addition, we present a novel dynamic music emotion model based on GPs. The task and the database used in this evaluation are described in detail in the Emotion in Music overview paper [1].

===2. State-Space Models===
State-space models (SSMs) are widely used in time series analysis, prediction, and modeling. They consist of a latent state variable x_t ∈ R^e and an observable measurement variable y_t ∈ R^d, which are related as follows:

: x_t = f(x_{t-1}) + v_{t-1}    (1)
: y_t = g(x_t) + w_t    (2)

where f() and g() are unknown functions governing the temporal state dynamics and the state-to-measurement mapping, respectively. The system and observation noises v_t and w_t are assumed to be independent. Probabilistically, an SSM can also be defined by two distributions: p(x_t | x_{t-1}) and p(y_t | x_t). For a sequence of T measurements, the filtering task is to approximate p(x_t | y_{1:t}), while approximating p(x_t | y_{1:T}) is the goal of Rauch-Tung-Striebel (RTS) smoothing.

For continuous music emotion recognition, x_t represents the unknown A-V vector and y_t corresponds to the feature vector(s). SSM learning in our case is simplified, since the state A-V labels are given for the training data and f() and g() can therefore be learned independently.

====2.1 Kalman filter====
The Kalman filter is a linear SSM where f(x) = Ax and g(x) = Bx, with A and B being unknown parameters, and v and w are zero-mean Gaussian noises. Thus, both p(x_t | x_{t-1}) and p(y_t | x_t) become Gaussian, and a simple analytic solution for the filtering and smoothing tasks can be obtained. A minimal sketch of this learning and filtering procedure is given below.
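The sketch below is our illustration, not the authors' implementation: because the A-V state sequence is observed at training time, A and B can be fitted by ordinary least squares before running the textbook Kalman recursion. The helper names and array shapes (`X` for the T×e state trajectory, `Y` for the T×d feature sequence) are assumptions made for the example.

<pre>
import numpy as np

def fit_linear_ssm(X, Y):
    """Least-squares estimates of A, B and the noise covariances Q, R,
    given observed states X (T x e) and measurements Y (T x d)."""
    # Transition x_t = A x_{t-1} + v: regress X[1:] on X[:-1]
    A = np.linalg.lstsq(X[:-1], X[1:], rcond=None)[0].T
    Q = np.atleast_2d(np.cov((X[1:] - X[:-1] @ A.T).T))
    # Measurement y_t = B x_t + w: regress Y on X
    B = np.linalg.lstsq(X, Y, rcond=None)[0].T
    R = np.atleast_2d(np.cov((Y - X @ B.T).T))
    return A, B, Q, R

def kalman_filter(Y, A, B, Q, R, m0, P0):
    """Standard Kalman filter; returns the filtered state means E[x_t | y_1:t]."""
    m, P, means = m0, P0, []
    for y in Y:
        m, P = A @ m, A @ P @ A.T + Q              # predict
        S = B @ P @ B.T + R                        # innovation covariance
        K = P @ B.T @ np.linalg.inv(S)             # Kalman gain
        m, P = m + K @ (y - B @ m), P - K @ B @ P  # update
        means.append(m)
    return np.array(means)
</pre>

For a 2-D A-V state one might start the recursion from, e.g., m0 = np.zeros(2) and P0 = np.eye(2); an RTS smoother would add a backward pass over the stored predictive means and covariances.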
====2.2 Gaussian Process dynamic system====
When f() and g() are modeled by GPs, we obtain a Gaussian Process dynamic system. Such SSMs have been proposed recently, but they lack efficient and commonly adopted algorithms for learning and inference. The availability of A-V values for training, however, makes the learning task easy, since each target dimension of f() and g() can be learned independently using the GP regression training algorithm. For inference, however, there is no straightforward solution. One can always opt for Monte Carlo sampling algorithms, but they are notoriously slow. We used the solution proposed in [2], which is based on analytic moment matching to derive Gaussian approximations to the filtering and smoothing distributions.

===3. Experiments===
The development dataset was randomly split into training and validation sets of 600 and 144 clips, respectively. A full cross-validation scenario was not adopted due to time constraints.

====3.1 Feature extraction====
Features were extracted from the audio signal, which was first downsampled to 22050 Hz. Using the Marsyas toolkit we obtained mel-frequency cepstral coefficients (mfcc), the spfe feature set (zero-crossing rate, spectral flux, centroid, and rolloff), and the spectral crest factor (scf). All feature vectors were calculated from 512-sample frames with no overlap. First-order statistics were then calculated over windows of 1 sec. with 0.5 sec. overlap; thus, for the last 30 seconds of each clip there were 61 feature vectors. In addition to these features, we also used the features from the MediaEval 2014 baseline system [1].

====3.2 Data clustering====
In a way similar to [5], we clustered all training clips into four clusters based on their static A-V values. Separate SSMs were trained on each cluster's data. During testing, the trajectory obtained from the model showing the best match, i.e. the highest likelihood, was taken as the prediction result; a sketch of this model selection step is given below.
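The paper does not specify the clustering algorithm or the exact likelihood computation, so the following is a hedged sketch of our reading of this step: k-means on the per-clip mean A-V values stands in for the unspecified clustering, the `fit_linear_ssm` helper comes from the sketch in Section 2.1, and model selection uses the standard Kalman innovation log-likelihood. All names are illustrative assumptions.

<pre>
import numpy as np
from sklearn.cluster import KMeans

def train_clustered_ssms(clips, n_clusters=4):
    """Cluster clips by their static (per-clip mean) A-V value and train one
    linear SSM per cluster. `clips` is a list of (X, Y) pairs holding each
    clip's A-V trajectory and feature sequence."""
    static_av = np.array([X.mean(axis=0) for X, _ in clips])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(static_av)
    models = []
    for k in range(n_clusters):
        # Concatenate the cluster's clips; clip boundaries are ignored in the
        # transition regression for simplicity.
        Xk = np.vstack([X for (X, _), l in zip(clips, labels) if l == k])
        Yk = np.vstack([Y for (_, Y), l in zip(clips, labels) if l == k])
        models.append(fit_linear_ssm(Xk, Yk))
    return models

def select_and_filter(Y, models, m0, P0):
    """Filter the test features Y with every cluster model and return the
    trajectory of the model with the highest innovation log-likelihood."""
    best_ll, best_traj = -np.inf, None
    for A, B, Q, R in models:
        m, P, ll, traj = m0, P0, 0.0, []
        for y in Y:
            m, P = A @ m, A @ P @ A.T + Q          # predict
            S = B @ P @ B.T + R
            r = y - B @ m                          # innovation
            ll += -0.5 * (r @ np.linalg.solve(S, r)
                          + np.linalg.slogdet(S)[1]
                          + len(y) * np.log(2 * np.pi))
            K = P @ B.T @ np.linalg.inv(S)
            m, P = m + K @ r, P - K @ B @ P        # update
            traj.append(m)
        if ll > best_ll:
            best_ll, best_traj = ll, np.array(traj)
    return best_traj
</pre>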
===4. Results===
In order to see the effect of data clustering, we also evaluated a linear system trained on all 600 clips. Tables 1 and 2 show the average correlation coefficient as well as the average RMS error with respect to the different features, for Arousal and Valence respectively. As can be seen, the clustered multiple models show lower correlation but smaller RMSE. It is possible that the clustering reduced the amount of training data for each model, resulting in less accurate parameter estimation.

{| border="1" cellpadding="4"
|+ Table 1: Kalman filter (KF) and linear RTS smoother AROUSAL results. 144 clips validation set.
|-
! rowspan="2" | Features !! colspan="2" | KF !! colspan="2" | RTS
|-
! Corr.Coef. !! RMSE !! Corr.Coef. !! RMSE
|-
! colspan="5" | Single model
|-
| mfcc || 0.2062 || 0.2894 || 0.1070 || 0.3008
|-
| spfe || 0.1976 || 0.2860 || 0.0998 || 0.3109
|-
| mfcc+spfe || 0.2326 || 0.2378 || 0.0894 || 0.2291
|-
| mfcc+scf || 0.1171 || 0.2288 || 0.1611 || 0.2188
|-
| baseline || 0.2791 || 0.3631 || 0.1898 || 0.4027
|-
! colspan="5" | Multiple models
|-
| mfcc || 0.1698 || 0.1384 || 0.0991 || 0.1284
|-
| spfe || 0.0957 || 0.1874 || 0.0292 || 0.1772
|-
| mfcc+spfe || 0.2022 || 0.1290 || 0.1246 || 0.1277
|-
| mfcc+scf || 0.0059 || 0.1613 || 0.0253 || 0.1615
|-
| baseline || 0.0212 || 0.2276 || 0.0236 || 0.2259
|}

{| border="1" cellpadding="4"
|+ Table 2: Kalman filter (KF) and linear RTS smoother VALENCE results. 144 clips validation set.
|-
! rowspan="2" | Features !! colspan="2" | KF !! colspan="2" | RTS
|-
! Corr.Coef. !! RMSE !! Corr.Coef. !! RMSE
|-
! colspan="5" | Single model
|-
| mfcc || 0.0411 || 0.6262 || 0.0598 || 0.7082
|-
| spfe || 0.0332 || 0.3945 || 0.0464 || 0.4710
|-
| mfcc+spfe || 0.0304 || 0.6208 || 0.0725 || 0.6978
|-
| mfcc+scf || 0.1545 || 0.6692 || 0.1401 || 0.7231
|-
| baseline || 0.0753 || 0.2681 || 0.0779 || 0.2996
|-
! colspan="5" | Multiple models
|-
| mfcc || -0.082 || 0.1847 || -0.042 || 0.1915
|-
| spfe || -0.055 || 0.2353 || -0.060 || 0.2497
|-
| mfcc+spfe || -0.054 || 0.1866 || -0.068 || 0.1914
|-
| mfcc+scf || 0.0149 || 0.1688 || -0.008 || 0.1703
|-
| baseline || -0.080 || 0.2425 || -0.058 || 0.2497
|}

Table 3 shows the results of the GP-based system evaluated with multiple models. A single model was not used due to prohibitive memory requirements. Compared to the corresponding multiple-model results of the linear system, only Valence shows some improvement.

{| border="1" cellpadding="4"
|+ Table 3: GP filter (GP-F) and GP-RTS smoother results. Multiple models. 144 clips validation set.
|-
! rowspan="2" | Features !! colspan="2" | GP-F !! colspan="2" | GP-RTS
|-
! Corr.Coef. !! RMSE !! Corr.Coef. !! RMSE
|-
! colspan="5" | AROUSAL
|-
| mfcc || 0.0436 || 0.3088 || 0.0743 || 0.3207
|-
| spfe || 0.0582 || 0.3048 || 0.0714 || 0.3486
|-
| baseline || -0.0073 || 0.3025 || 0.0393 || 0.3444
|-
! colspan="5" | VALENCE
|-
| mfcc || 0.0217 || 0.2766 || 0.0313 || 0.3083
|-
| spfe || 0.0283 || 0.3297 || -0.003 || 0.3515
|-
| baseline || -0.011 || 0.3891 || -0.020 || 0.4431
|}

Using the official test set of 1000 clips, we were able to evaluate only the Kalman filter based system due to time limitations. Results using the baseline features as well as a couple of the Marsyas feature sets are presented in Table 4.

{| border="1" cellpadding="4"
|+ Table 4: Kalman filter results using the 1000 clips test set.
|-
! Features !! Corr.Coef. !! RMSE
|-
! colspan="3" | AROUSAL
|-
| mfcc+spfe || 0.2735±0.4522 || 0.3733±0.1027
|-
| mfcc+scf || 0.1622±0.5754 || 0.3541±0.0990
|-
| baseline || 0.2063±0.5720 || 0.0804±0.0505
|-
! colspan="3" | VALENCE
|-
| mfcc+spfe || 0.0469±0.4326 || 0.2002±0.0971
|-
| mfcc+scf || 0.0265±0.4378 || 0.1338±0.0806
|-
| baseline || 0.1665±0.5166 || 0.1385±0.0723
|}

===5. Conclusions===
We presented two dynamic music emotion recognition systems based on state-space models: one linear and one based on Gaussian Processes. The preliminary results did not show a clear advantage for any system or feature set, probably due to the small size of the validation set. More detailed experiments involving more data are planned for the future.

===6. References===
[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2014. In MediaEval 2014 Workshop, Barcelona, Spain, Oct 2014.

[2] M. Deisenroth, R. Turner, M. Huber, U. Hanebeck, and C. Rasmussen. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.

[3] K. Markov, M. Iwata, and T. Matsui. Music emotion recognition using Gaussian processes. In Proceedings of the ACM Multimedia 2013 Workshop on Crowdsourcing for Multimedia (CrowdMM). ACM, 2013.

[4] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, 2006.

[5] E. Schmidt and Y. Kim. Prediction of time-varying musical mood distributions using Kalman filtering. In 2010 Ninth International Conference on Machine Learning and Applications (ICMLA), pages 655–660, Dec 2010.