=Paper= {{Paper |id=Vol-1436/Paper59 |storemode=property |title=UNIZA System for the "Emotion in Music" task at MediaEval 2015 |pdfUrl=https://ceur-ws.org/Vol-1436/Paper59.pdf |volume=Vol-1436 |dblpUrl=https://dblp.org/rec/conf/mediaeval/ChmulikGMJ15 }} ==UNIZA System for the "Emotion in Music" task at MediaEval 2015== https://ceur-ws.org/Vol-1436/Paper59.pdf
UNIZA System for the "Emotion in Music" task at MediaEval 2015

Michal Chmulik, Igor Guoth, Miroslav Malik, Roman Jarina
Department of Telecommunications and Multimedia,
Faculty of Electrical Engineering, University of Zilina
Zilina, Slovakia
michal.chmulik@fel.uniza.sk


ABSTRACT
In this working notes paper, we present the UNIZA system for the
recognition of the dynamic music emotion dimensions arousal and
valence. The system is based on Support Vector Regression with a
Radial Basis kernel function. We selected 2 sets of features using
stochastic evolutionary optimization algorithms, namely the Genetic
Algorithm and the Particle Swarm Algorithm. The models achieve an
average Root Mean Square Error of 0.3605 for the valence dimension
and 0.2540 for the arousal dimension.

1. INTRODUCTION
     The objective of the Emotion in Music task at MediaEval 2015 is
to automatically determine the temporal dynamics of emotion as a
sequence of numerical values in two dimensions: valence and arousal
(AV). The task comprises three scenarios: 1) Given a set of baseline
audio features, the participants return AV scores obtained by a
machine learning method of their choice; 2) The participants submit
their own set of features that they believe are most discriminative
in terms of emotion determination; 3) The participants may return AV
scores obtained by using any combination of features and machine
learning methods. For more details about the task and data set, see
the overview paper [1].

2. APPROACH
     The UNIZA system for dynamic emotion recognition is based on
Support Vector Regression (SVR) and utilizes the LIBSVM libraries. We
follow the approach that we have already applied to emotion
recognition from speech [2]. The system has been developed in the
Matlab and C++ environments. We split the development data into 2
approximately equal non-overlapping parts: the first one for training
the regression models and the other one for testing them.
     The SVR employs the Radial Basis Function (RBF) kernel. The
search for optimal kernel parameters was performed by the grid search
method combined with the Bat Algorithm (BA), a metaheuristic
optimization technique [3]. The kernel parameters were optimized
individually for both dimensions, and finally the one combination
resulting in the best evaluation accuracy was selected. The same
kernel parameter values were used in all scenarios of the task.
     For the second and third scenarios, we created 2 sets of
features selected from the baseline feature set using stochastic
evolutionary optimization algorithms. For this purpose, we used a
hybrid combination of the Genetic Algorithm (GA) and the Particle
Swarm Optimization algorithm (PSO) [4]. The GA/PSO hybrid approach
works as follows. Both optimization algorithms run in parallel, and
at the end of each iteration the best individuals from both
algorithms are selected into the next iteration of the optimization
process. The Root Mean Square Error (RMSE) between the predicted and
ground truth labels was used as the fitness function for the
optimization algorithms. The optimization process ran for 50
iterations and was repeated 50 times. The two best combinations of
features were selected for submission. The first set, denoted
"optimal_1", consists of 139 features and the second set, denoted
"optimal_2", consists of 129 features. Both sets share 72 identical
features. A detailed description is beyond the page limit of this
paper, but the intersection of the sets contains mostly auditory
spectra coefficients and MFCC coefficients, as well as spectral
skewness, slope, flux and delta regression variations.
     In the system development stage, we also tested features
extracted by the MIRToolbox [5]: a combination of chromagram, onset
detection, log-attack time, roughness, tempo, key and tonality. The
features were extracted from frames whose duration and overlap depend
on the particular feature. As a result, we obtained 51 features; this
set is denoted "MIR". We used the same feature format as the baseline
(non-overlapping segments of 500 ms), and besides the mean values and
standard deviations, we also used the maximal values.
     Table 1 lists the mean evaluation accuracy that we obtained on
the development data using the evaluation metrics RMSE [1] and
Pearson's correlation coefficient r. The "default" run represents the
first scenario of the task, in which our system was fed with the
baseline feature set. The other runs correspond to the third
scenario, in which different feature sets were tested with our
regression models. Based on these preliminary results, we decided to
further process and submit only the feature sets with the highest
ranking (i.e. "optimal_1" and "optimal_2").

     Table 1. Results of the system on the development data for
different feature sets.

                      Arousal               Valence
     run           RMSE       r          RMSE       r
     default       0.0815     0.4203     0.0724     0.4238
     optimal_1     0.0806     0.4553     0.0669     0.4603
     optimal_2     0.0794     0.4681     0.0659     0.4718
     MIR           0.0988     0.2058     0.1044     0.2104
     def.+MIR      0.0887     0.3543     0.0771     0.3709
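
The two metrics reported in Table 1, RMSE and Pearson's correlation coefficient r, can be computed as in the following minimal sketch. It is an illustration only and not the authors' Matlab/C++ code; the function names rmse and pearson_r are assumptions made for this example, and the official per-song aggregation used by the task is not reproduced here.

```python
import math


def rmse(predicted, truth):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(truth) or len(truth) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predicted, truth))
    return math.sqrt(squared / len(truth))


def pearson_r(predicted, truth):
    """Pearson's correlation coefficient; NaN if either input is constant."""
    if len(predicted) != len(truth) or len(truth) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(truth)
    mean_p = sum(predicted) / n
    mean_t = sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, truth))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_p == 0 or var_t == 0:
        # Correlation is undefined for a constant sequence.
        return float("nan")
    return cov / math.sqrt(var_p * var_t)
```

Both functions take one predicted and one annotated value per 500 ms segment. A constant prediction yields an undefined correlation, which is why the sketch returns NaN in that case instead of raising an error.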

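
The GA/PSO hybrid feature selection described in Section 2 can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the function name hybrid_ga_pso, the population size, the crossover and mutation scheme, the PSO coefficients (0.7, 1.5, 1.5) and the velocity clamp are all hypothetical choices. In the paper the fitness is the RMSE of an SVR model trained on the selected features; here a toy fitness stands in for it so that the example is self-contained.

```python
import math
import random


def hybrid_ga_pso(fitness, n_features, pop_size=20, iterations=50, seed=0):
    """Minimise fitness(mask) over binary feature masks (tuples of 0/1)."""
    rng = random.Random(seed)
    cache = {}

    def score(mask):
        # Memoise: a real fitness call would train and test a regressor.
        if mask not in cache:
            cache[mask] = fitness(mask)
        return cache[mask]

    survivors = sorted(
        (tuple(rng.randint(0, 1) for _ in range(n_features))
         for _ in range(pop_size)),
        key=score,
    )
    velocity = [[0.0] * n_features for _ in range(pop_size)]
    personal_best = list(survivors)

    for _ in range(iterations):
        global_best = survivors[0]

        # GA branch: uniform crossover of two survivors, then bit-flip mutation.
        ga_children = []
        for _ in range(pop_size):
            a, b = rng.sample(survivors, 2)
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            child = [1 - bit if rng.random() < 1.0 / n_features else bit
                     for bit in child]
            ga_children.append(tuple(child))

        # PSO branch: binary PSO, a bit is set with probability sigmoid(v).
        pso_children = []
        for i, x in enumerate(survivors):
            bits = []
            for j in range(n_features):
                v = (0.7 * velocity[i][j]
                     + 1.5 * rng.random() * (personal_best[i][j] - x[j])
                     + 1.5 * rng.random() * (global_best[j] - x[j]))
                v = max(-4.0, min(4.0, v))  # keep the sigmoid away from 0 and 1
                velocity[i][j] = v
                bits.append(1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0)
            pso_children.append(tuple(bits))

        # Merge step: the best individuals from both branches (plus the
        # previous survivors, i.e. elitism) enter the next iteration.
        pool = survivors + ga_children + pso_children
        survivors = sorted(pool, key=score)[:pop_size]
        for i, mask in enumerate(survivors):
            if score(mask) < score(personal_best[i]):
                personal_best[i] = mask

    return survivors[0], score(survivors[0])


# Toy stand-in for the RMSE fitness: Hamming distance to a hidden mask.
TARGET = (1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1)


def toy_fitness(mask):
    return sum(a != b for a, b in zip(mask, TARGET))
```

Calling hybrid_ga_pso(toy_fitness, n_features=len(TARGET)) returns the best mask found and its fitness. Keeping the previous survivors in the merge pool guarantees that the best fitness never gets worse from one iteration to the next; whether the original system used such elitism is not stated in the paper.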
Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.
3. RESULTS AND DISCUSSION
     Table 2 reports the official results of our system according to
the evaluation metrics of the task [1] for the first scenario
("default" run) and the third scenario of the task ("optimal_1",
"optimal_2").

     Table 2. Official results of the UNIZA team for different runs.

                               Valence
     run           RMSE                r
     default       0.3662±0.1747       -0.0218±0.4011
     optimal_1     0.3605±0.1727       -0.0141±0.4007
     optimal_2     0.3613±0.1737       -0.0161±0.3961

                               Arousal
     run           RMSE                r
     default       0.2554±0.0995        0.5100±0.2248
     optimal_1     0.2571±0.0997        0.5097±0.2228
     optimal_2     0.2540±0.1028        0.4930±0.2326

     As can be seen, our feature sets did not provide any significant
improvement over the baseline feature set, and the differences in
RMSE are barely noticeable. Nevertheless, the lowest RMSE was
obtained by the "optimal_1" run for the valence dimension and by the
"optimal_2" run for the arousal dimension. The arousal dimension
yields better results than the valence dimension, as is usual in
emotion recognition tasks. In comparison with the results on the
development data, there is a large drop in the correlation
coefficient r for the valence dimension, from about 0.42-0.47 to
approximately zero.
     Although our feature sets do not achieve a significantly better
score, their dimension is greatly reduced (to approximately 50% of
the baseline), and thus the computational demands of the system are
also greatly reduced. Based on the results, it seems that there is
considerable redundancy in the baseline feature set. The system
performance may be improved by finer tuning of the kernel parameters
individually for each dimension. The application of another
regression model could also improve the system accuracy.
     In the future, we would like to create a large set of features
(baseline + "MIR" and other music-oriented features) and search for
the optimal subset whose accuracy is at least equal to that of the
baseline/full set. For the search, state-of-the-art nature-inspired
optimization techniques will be applied.

4. CONCLUSION
     We developed an SVR-based system for dynamic music emotion
recognition. Regrettably, the feature sets selected by the
evolutionary optimization methods did not significantly improve the
accuracy of the system. On the other hand, almost equal results were
obtained using only approximately 50% of the baseline features.

5. REFERENCES
[1] Aljanaki A., Yang Y.-H., Soleymani M. 2015. Emotion in Music
    Task at MediaEval 2015. In MediaEval 2015 Workshop, 2015, Wurzen,
    Germany.
[2] Hric M., Chmulik M., Guoth I., Jarina R. 2015. SVM based speaker
    emotion recognition in continuous scale. In Proceedings of 25th
    International Conference Radioelektronika 2015, 2015, Pardubice,
    Czech Republic, 339-342.
[3] Yang X.-S. 2014. Nature-Inspired Optimization Algorithms.
    Elsevier, London.
[4] Kennedy J., Eberhart R.C. with Shi Y. 2001. Swarm Intelligence.
    Morgan Kaufmann Publishers, San Francisco.
[5] Lartillot O., Toiviainen P. 2007. A Matlab Toolbox for Musical
    Feature Extraction From Audio. In International Conference on
    Digital Audio Effects, 2007, Bordeaux, France, 237-244.