=Paper=
{{Paper
|id=Vol-1436/Paper59
|storemode=property
|title=UNIZA System for the "Emotion in Music" task at MediaEval 2015
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper59.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ChmulikGMJ15
}}
==UNIZA System for the "Emotion in Music" task at MediaEval 2015==
Michal Chmulik, Igor Guoth, Miroslav Malik, Roman Jarina

Department of Telecommunications and Multimedia, Faculty of Electrical Engineering, University of Zilina, Zilina, Slovakia

michal.chmulik@fel.uniza.sk

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

===ABSTRACT===
In this working notes paper, we present the UNIZA system for recognizing the dynamic music emotion dimensions arousal and valence. The system is based on Support Vector Regression with a Radial Basis kernel function. We selected 2 sets of features using stochastic evolutionary optimization algorithms, namely the Genetic Algorithm and Particle Swarm Optimization. The models achieve an average Root Mean Square Error of 0.3605 for the valence dimension and 0.2540 for the arousal dimension.

===1. INTRODUCTION===
The objective of the Emotion in Music task at MediaEval 2015 is to automatically determine temporal dynamics in emotion as a sequence of numerical values in two dimensions: valence and arousal (AV). The task comprises three scenarios: 1) given a set of baseline audio features, the participants are to return AV scores obtained by a machine learning method of their choice; 2) the participants are required to submit their own set of features that they believe are most discriminative for emotion determination; 3) the participants may return AV scores obtained by using any combination of features and machine learning method. For more details about the task and data set, see the overview paper [1].

===2. APPROACH===
The UNIZA system for dynamic emotion recognition is based on Support Vector Regression (SVR) and uses the LIBSVM library. We follow the approach that we have already applied to emotion recognition from speech [2]. The system was developed in the Matlab and C++ environments. We split the development data into 2 approximately equal non-overlapping parts: the first for training the regression models and the second for testing them.

The SVR employs the Radial Basis Function (RBF) kernel. The search for optimal kernel parameters was performed by grid search in cooperation with the Bat Algorithm (BA), a metaheuristic optimization technique [3]. The kernel parameters were optimized individually for both dimensions, and we finally selected the one combination that resulted in the best evaluation accuracy. The same kernel parameter values were used in all scenarios of the task.

For the second and third scenarios, we created 2 sets of features, selected from the baseline feature set using stochastic evolutionary optimization algorithms. For this purpose, we used a hybrid combination of the Genetic Algorithm (GA) and the Particle Swarm Optimization algorithm (PSO) [4]. The GA/PSO hybrid approach works as follows. Both optimization algorithms run in parallel and, at the end of each iteration, the best individuals from both algorithms are selected into the next iteration of the optimization process. The Root Mean Square Error (RMSE) between the predicted and ground truth labels was used as the fitness function for the optimization algorithms. The optimization process ran for 50 iterations and was repeated 50 times. The two best combinations of features were selected for submission. The first set, denoted "optimal_1", consists of 139 features and the second, denoted "optimal_2", consists of 129 features. Both sets share 72 identical features. A detailed description is beyond the page limit of this paper, but the intersection of the sets contains mostly a number of auditory spectrum coefficients and MFCC coefficients, as well as their spectral skewness, slope, flux and delta regression variations.

In the system development stage, we also tested features extracted by the MIRToolbox [5]: a combination of chromagram, onset detection, log-attack time, roughness, tempo, key and tonality. Feature extraction was performed on frames whose duration and overlap depended on the particular feature. As a result, we obtained 51 features; this set is denoted "MIR". We used the same feature format as the baseline (non-overlapping segments of 500 ms) and, besides the mean values and standard deviations, we also used the maximal values.

Table 1 reports the mean evaluation accuracy obtained on the development data using the evaluation metrics RMSE [1] and Pearson's correlation coefficient r. The "default" run represents the first scenario of the task, in which our system was fed with the baseline feature set. The other runs correspond to the third scenario, in which different feature sets were tested with our regression models. Based on these preliminary results, we decided to further process and submit only the feature sets with the highest ranking (i.e. "optimal_1" and "optimal_2").

{| class="wikitable"
|+ Table 1. Results of the system on the development data for different feature sets.
! run !! Arousal RMSE !! Arousal r !! Valence RMSE !! Valence r
|-
| default || 0.0815 || 0.4203 || 0.0724 || 0.4238
|-
| optimal_1 || 0.0806 || 0.4553 || 0.0669 || 0.4603
|-
| optimal_2 || 0.0794 || 0.4681 || 0.0659 || 0.4718
|-
| MIR || 0.0988 || 0.2058 || 0.1044 || 0.2104
|-
| def.+MIR || 0.0887 || 0.3543 || 0.0771 || 0.3709
|}

===3. RESULTS AND DISCUSSION===
Table 2 reports the official results of our system according to the evaluation metrics of the task [1] for the first scenario ("default" run) and the third scenario ("optimal_1", "optimal_2").

{| class="wikitable"
|+ Table 2. Official results of UNIZA team for different runs.
! run !! Valence RMSE !! Valence r !! Arousal RMSE !! Arousal r
|-
| default || 0.3662±0.1747 || -0.0218±0.4011 || 0.2554±0.0995 || 0.5100±0.2248
|-
| optimal_1 || 0.3605±0.1727 || -0.0141±0.4007 || 0.2571±0.0997 || 0.5097±0.2228
|-
| optimal_2 || 0.3613±0.1737 || -0.0161±0.3961 || 0.2540±0.1028 || 0.4930±0.2326
|}

As can be seen, our feature sets did not provide any significant improvement in system performance compared with the baseline feature set, and the differences in RMSE are barely noticeable. Nevertheless, in terms of RMSE the best results were achieved by the "optimal_1" run for the valence dimension and the "optimal_2" run for the arousal dimension. The arousal dimension achieves better results than the valence dimension, as is usual in emotion recognition tasks. Compared with the results on the development data, there is a huge drop in the correlation coefficient r for the valence dimension (from roughly 0.42-0.47 to slightly negative values).

Although our feature sets do not achieve a significantly better score, their dimensionality is greatly reduced (approximately 50% of the baseline), so the computational demands of the system are also greatly reduced. Based on the results, there seems to be great redundancy in the baseline feature set. The system performance may be improved by finer tuning of the kernel parameters individually for each dimension. Applying another regression model could also improve the system accuracy.

In the future, we would like to create a large set of features (baseline + "MIR" and other music-oriented features) and search for the optimal subset giving the best accuracy, which should at least equal that of the baseline/full set. For the search, a state-of-the-art nature-inspired optimization technique will be applied.

===4. CONCLUSION===
We developed an SVR-based system for dynamic music emotion recognition. Regrettably, the feature sets suggested by the evolutionary optimization methods did not significantly improve the accuracy of the system. On the other hand, almost equal results were obtained using only approximately 50% of the baseline features.

===5. REFERENCES===
[1] Aljanaki A., Yang Y.-H., Soleymani M. 2015. Emotion in Music Task at MediaEval 2015. In MediaEval 2015 Workshop, 2015, Wurzen, Germany.

[2] Hric M., Chmulik M., Guoth I., Jarina R. 2015. SVM based speaker emotion recognition in continuous scale. In Proceedings of 25th International Conference Radioelektronika 2015, 2015, Pardubice, Czech Republic, 339-342.

[3] Yang X.-S. 2014. Nature-Inspired Optimization Algorithms. Elsevier, London.

[4] Kennedy J., Eberhart R.C. with Shi Y. 2001. Swarm Intelligence. Morgan Kaufmann Publisher, San Francisco.

[5] Lartillot O., Toiviainen P. 2007. A Matlab Toolbox for Musical Feature Extraction From Audio. In International Conference on Digital Audio Effects, 2007, Bordeaux, France, 237-244.
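
As an illustration of the two evaluation metrics reported in Tables 1 and 2, the following minimal Python sketch computes RMSE and Pearson's correlation coefficient r for one sequence of predicted and ground truth values. It is not the official evaluation script of the task. The function names are hypothetical, and returning 0.0 for a constant input sequence is an assumption made here only to avoid a division by zero.

```python
import math


def rmse(predicted, truth):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(truth) or len(truth) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = ((p - t) ** 2 for p, t in zip(predicted, truth))
    return math.sqrt(sum(squared_errors) / len(truth))


def pearson_r(predicted, truth):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(truth)
    if len(predicted) != n or n == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    mean_p = sum(predicted) / n
    mean_t = sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, truth))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_p == 0 or var_t == 0:
        # Assumption: r is undefined for a constant sequence; report 0.0.
        return 0.0
    return cov / math.sqrt(var_p * var_t)


# Made-up values for illustration only (not data from the task).
truth = [0.1, 0.2, 0.3, 0.4]
predicted = [0.2, 0.3, 0.4, 0.5]
print(rmse(predicted, truth), pearson_r(predicted, truth))
```

A prediction with a constant offset from the ground truth, as in the made-up values above, has a nonzero RMSE but a correlation close to 1, which shows why the two metrics can rank runs differently.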
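
The selection step of the GA/PSO hybrid (the best individuals from both algorithms are carried into the next iteration) could be read as pooling the two populations and keeping the candidates with the lowest fitness. The Python sketch below shows that reading under stated assumptions: feature subsets are encoded as binary masks, duplicates are removed before ranking, and `toy_fitness` is a hypothetical stand-in for the RMSE of a regression model trained on the selected features. None of these details are specified in the paper.

```python
def merge_best(ga_population, pso_population, fitness, keep):
    """Pool both populations and keep the `keep` masks with the lowest fitness."""
    pooled = [tuple(mask) for mask in list(ga_population) + list(pso_population)]
    # Assumption: identical masks are counted once (order of first occurrence kept).
    unique = list(dict.fromkeys(pooled))
    # sorted() is stable, so ties keep their pooled order.
    return sorted(unique, key=fitness)[:keep]


def toy_fitness(mask):
    # Hypothetical stand-in for the RMSE fitness: the number of positions
    # in which the mask differs from an arbitrary reference mask.
    reference = (1, 0, 1, 0)
    return sum(a != b for a, b in zip(mask, reference))


# Each mask marks which of four hypothetical features are selected.
ga = [(1, 1, 1, 0), (0, 0, 0, 0)]
pso = [(1, 0, 1, 0), (1, 1, 1, 0)]
survivors = merge_best(ga, pso, toy_fitness, keep=2)
print(survivors)
```

In a full optimizer, the surviving masks would seed the next iteration of both algorithms, with the fitness evaluated by training and testing the regression model on each candidate subset.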