<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UNIZA System for the "Emotion in Music" task at MediaEval 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michal Chmulik</string-name>
          <email>michal.chmulik@fel.uniza.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor Guoth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miroslav Malik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Jarina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Telecommunications and Multimedia, Faculty of Electrical Engineering, University of Zilina, Zilina</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this working notes paper, we present the UNIZA system for the recognition of the dynamic music emotional dimensions arousal and valence. The system is based on Support Vector Regression with a Radial Basis Function kernel. We selected two sets of features using stochastic evolutionary optimization algorithms, namely a Genetic Algorithm and Particle Swarm Optimization. The models achieve an average Root Mean Square Error of 0.3605 for the valence dimension and 0.2540 for the arousal dimension.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. APPROACH</title>
      <p>
        The UNIZA system for dynamic emotion recognition is based on Support Vector Regression (SVR) and utilizes the LIBSVM library. We follow the approach that we previously applied to emotion recognition from speech [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The development of our system was carried out in Matlab and C++ environments. We split the development data into two approximately equal, non-overlapping parts: the first for training the regression models, the other for testing them.
      </p>
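      <p>As an illustration of this setup, the following Python sketch trains an RBF-kernel SVR on one half of a data set and reports RMSE on the other half; scikit-learn's SVR wraps LIBSVM. The data, labels, and parameter values below are placeholders, not the authors' actual features or settings.</p>

```python
# Minimal sketch (not the authors' Matlab/C++ code) of the described setup:
# an RBF-kernel Support Vector Regressor trained on one half of the
# development data and tested on the other half.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # placeholder per-segment features
y = np.tanh(X[:, 0] + 0.5 * X[:, 1])  # placeholder arousal/valence labels

# Split into two approximately equal, non-overlapping parts.
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

model = SVR(kernel="rbf", C=1.0, gamma="scale")  # kernel parameters tuned separately
model.fit(X_train, y_train)

rmse = float(np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2)))
print(round(rmse, 3))
```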
      <p>
        The SVR employs the Radial Basis Function (RBF) kernel. The search for optimal kernel parameters was performed by the grid search method in combination with the Bat Algorithm (BA) metaheuristic optimization technique [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The kernel parameters were optimized individually for both dimensions, and the combination resulting in the best evaluation accuracy was finally selected. The same kernel parameter values were used in all scenarios of the task.
      </p>
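      <p>The parameter search can be sketched as a plain grid search over (C, gamma) minimizing held-out RMSE; the Bat Algorithm refinement used by the authors is not reproduced here, and all data and grid values are illustrative.</p>

```python
# Hedged sketch of the kernel-parameter search: a coarse grid search over
# (C, gamma) for an RBF-kernel SVR, scored by RMSE on a held-out split.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(160, 8))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=160)
X_tr, X_te, y_tr, y_te = X[:80], X[80:], y[:80], y[80:]

best_rmse, best_C, best_gamma = np.inf, None, None
for C in [0.1, 1.0, 10.0, 100.0]:      # coarse logarithmic grid
    for gamma in [0.01, 0.1, 1.0]:
        m = SVR(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
        rmse = float(np.sqrt(np.mean((m.predict(X_te) - y_te) ** 2)))
        if rmse < best_rmse:
            best_rmse, best_C, best_gamma = rmse, C, gamma

print(best_C, best_gamma)
```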
      <p>
        For the second and third scenarios, we created two sets of features selected from the baseline feature set using stochastic evolutionary optimization algorithms. For this purpose, we used a hybrid combination of a Genetic Algorithm (GA) and the Particle Swarm Optimization algorithm (PSO) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The GA/PSO hybrid works as follows. Both optimization algorithms run in parallel, and at the end of each iteration the best individuals from both algorithms are carried over into the next iteration of the optimization process. The Root Mean Square Error (RMSE) between the predicted and ground-truth labels was used as the fitness function for the optimization algorithms. The optimization process ran for 50 iterations and was repeated 50 times. The two best combinations of features were selected for the submission. The first set, denoted "optimal_1", consists of 139 features; the second, denoted "optimal_2", consists of 129 features. The two sets share 72 identical features. A detailed description is beyond the page limit of this paper, but their intersection contains mostly auditory spectrum coefficients and MFCC coefficients, as well as their spectral skewness, slope, flux, and delta regression variants.
      </p>
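      <p>A toy sketch of the described GA/PSO hybrid for feature selection follows: two populations of binary feature masks evolve in parallel, and after each iteration the best individual of each algorithm replaces the worst of the other. Held-out RMSE serves as the fitness; a ridge regressor stands in for the authors' SVR to keep the example fast, and all sizes, rates, and data are illustrative.</p>

```python
# Simplified GA/PSO hybrid over binary feature masks, with held-out RMSE
# as the fitness function (a ridge regressor stands in for the SVR).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n, d = 120, 20
X = rng.normal(size=(n, d))
y = X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=n)  # only features 0 and 3 matter
X_tr, X_te, y_tr, y_te = X[:60], X[60:], y[:60], y[60:]

def fitness(mask):
    """Held-out RMSE using only the features selected by the binary mask."""
    if not mask.any():
        return np.inf
    m = Ridge().fit(X_tr[:, mask], y_tr)
    return float(np.sqrt(np.mean((m.predict(X_te[:, mask]) - y_te) ** 2)))

pop, iters = 12, 15
ga = rng.random((pop, d)) < 0.5    # GA population of binary feature masks
pso = rng.random((pop, d)) < 0.5   # PSO population
vel = np.zeros((pop, d))           # PSO velocities

for _ in range(iters):
    # GA step: keep the better half, refill via uniform crossover + mutation
    fga = np.array([fitness(m) for m in ga])
    elite = ga[np.argsort(fga)[: pop // 2]]
    mates = elite[rng.integers(0, len(elite), pop)]
    ga = np.where(rng.random((pop, d)) < 0.5, mates, ga)
    ga ^= rng.random((pop, d)) < 0.02  # bit-flip mutation

    # Binary-PSO step: velocities pull masks toward the swarm's best mask
    fpso = np.array([fitness(m) for m in pso])
    gbest = pso[np.argmin(fpso)].astype(float)
    vel = 0.7 * vel + rng.random((pop, d)) * (gbest - pso.astype(float))
    pso = rng.random((pop, d)) < 1.0 / (1.0 + np.exp(-vel))

    # Exchange: each algorithm's best replaces the other's worst individual
    fga = np.array([fitness(m) for m in ga])
    fpso = np.array([fitness(m) for m in pso])
    ga[np.argmax(fga)] = pso[np.argmin(fpso)].copy()
    pso[np.argmax(fpso)] = ga[np.argmin(fga)].copy()

best_mask = min(list(ga) + list(pso), key=fitness)
print(int(best_mask.sum()), round(fitness(best_mask), 3))
```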
      <p>
        In the system development stage, we also tested features extracted by the MIRToolbox [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: a combination of chromagram, onset detection, log-attack time, roughness, tempo, key, and tonality. Feature extraction was performed on frames whose duration and overlap depend on the particular feature. As a result, we obtained 51 features; this set is denoted "MIR". We used the same feature format as the baseline (non-overlapping segments of 500 ms) and, besides mean values and standard deviations, we also used the maximal values.
      </p>
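      <p>The segment-level statistics described above can be sketched as follows: frame-wise features are pooled over non-overlapping 500 ms segments into per-feature mean, standard deviation, and maximum. The frame rate and data are assumptions for the example.</p>

```python
# Illustrative pooling of frame-wise features into non-overlapping 500 ms
# segment statistics (mean, std, max), matching the baseline feature format.
import numpy as np

frame_rate = 100            # assumed frames per second
seg_len = frame_rate // 2   # 500 ms => 50 frames per segment

rng = np.random.default_rng(3)
frames = rng.normal(size=(1000, 51))  # e.g. 10 s of 51 frame-wise "MIR" features

n_seg = frames.shape[0] // seg_len
segs = frames[: n_seg * seg_len].reshape(n_seg, seg_len, -1)

# Pool each segment into per-feature statistics: mean, std, and max.
pooled = np.concatenate([segs.mean(1), segs.std(1), segs.max(1)], axis=1)
print(pooled.shape)  # (20, 153): 20 segments x (51 features x 3 statistics)
```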
      <p>
        Table 1 reports the mean evaluation accuracy obtained on the development data using the evaluation metrics RMSE [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Pearson's correlation coefficient r. The "default" run represents the first scenario of the task, in which our system was fed the baseline feature set. The other runs correspond to the third scenario, where different feature sets were tested with our regression models. Based on these preliminary results, we decided to further process and submit only the feature sets with the highest ranking (i.e., "optimal_1" and "optimal_2").
      </p>
      <p>[Table 1: development-data RMSE and r per run, including the "MIR" run; the numeric values are not recoverable from the source.]</p>
    </sec>
    <sec id="sec-2">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>
        Table 2 reports the official evaluation accuracy of our system according to the evaluation metrics of the task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for the first scenario ("default" run) and the third scenario of the task ("optimal_1" and "optimal_2" runs).
      </p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Official evaluation results (RMSE and Pearson's correlation coefficient r) for the submitted runs.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th rowspan="2">run</th>
              <th colspan="2">Valence</th>
              <th colspan="2">Arousal</th>
            </tr>
            <tr>
              <th>RMSE</th>
              <th>r</th>
              <th>RMSE</th>
              <th>r</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>default</td>
              <td>0.3662±0.1747</td>
              <td>-0.0218±0.4011</td>
              <td>0.2554±0.0995</td>
              <td>0.5100±0.2248</td>
            </tr>
            <tr>
              <td>optimal_1</td>
              <td>0.3605±0.1727</td>
              <td>-0.0141±0.4007</td>
              <td>0.2571±0.0997</td>
              <td>0.5097±0.2228</td>
            </tr>
            <tr>
              <td>optimal_2</td>
              <td>0.3613±0.1737</td>
              <td>-0.0161±0.3961</td>
              <td>0.2540±0.1028</td>
              <td>0.4930±0.2326</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>As can be seen, our feature sets did not provide any significant improvement in system efficiency compared with the baseline feature set, and the differences in RMSE are barely noticeable. Nevertheless, the best results are achieved by the "optimal_1" run for the valence dimension and the "optimal_2" run for the arousal dimension. The arousal dimension yields better results than the valence dimension, as is usual in emotion recognition tasks. Compared with the results on the development data, a large drop in the correlation coefficient r can be seen for the valence dimension.</p>
      <p>Although our feature sets do not achieve a significantly better score, their feature dimension is greatly reduced (to approximately 50% of the baseline), so the computational demands of the system are also greatly reduced. Based on the results, it seems that there is great redundancy in the baseline feature set. The system efficiency might be improved by finer tuning of the kernel parameters individually for each dimension. The application of other regression models could also improve the system accuracy.</p>
      <p>In the future, we would like to create a large set of features (baseline + "MIR" and other music-oriented features) and search for the optimal subset giving the best classification accuracy, which should be at least equal to the baseline/full-set accuracy. For the search process, state-of-the-art nature-inspired optimization techniques will be applied.</p>
    </sec>
    <sec id="sec-3">
      <title>4. CONCLUSION</title>
      <p>We developed an SVR-based system for dynamic music emotion recognition. Regrettably, our feature sets, selected by means of evolutionary optimization methods, did not bring a significant improvement in the classification accuracy of the system. On the other hand, (almost) equal results were obtained using only approximately 50% of the baseline features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Aljanaki</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            <given-names>Y.-H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soleymani</surname>
            <given-names>M.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Emotion in Music Task at MediaEval 2015</article-title>
          . In MediaEval 2015 Workshop, Wurzen, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Hric</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chmulik</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guoth</surname>
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jarina</surname>
            <given-names>R.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>SVM based speaker emotion recognition in continuous scale</article-title>
          .
          <source>In Proceedings of the 25th International Conference Radioelektronika 2015</source>
          , Pardubice, Czech Republic,
          <fpage>339</fpage>
          -
          <lpage>342</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Yang</surname>
            <given-names>X.-S.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Nature-Inspired Optimization Algorithms</article-title>
          . Elsevier, London.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Kennedy</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eberhart</surname>
            <given-names>R.C.</given-names>
          </string-name>
          with
          <string-name>
            <surname>Shi</surname>
            <given-names>Y.</given-names>
          </string-name>
          <year>2001</year>
          .
          <source>Swarm Intelligence</source>
          . Morgan Kaufmann Publishers, San Francisco.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Lartillot</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toiviainen</surname>
            <given-names>P.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>A Matlab Toolbox for Musical Feature Extraction From Audio</article-title>
          .
          <source>In International Conference on Digital Audio Effects</source>
          , Bordeaux, France,
          <fpage>237</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>