<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MediaEval 2015: Recurrent Neural Network Approach to Emotion in Music Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yu-Hao Chin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jia-Ching Wang</string-name>
          <email>jiacwang@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Engineering National Central University</institution>
          ,
          <country country="TW">Taiwan, R.O.C</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes our work for the “Emotion in Music” task of MediaEval 2015. The goal of the task is to predict the affective content of a song, expressed in terms of valence and arousal values given in a time-continuous fashion. We adopt a deep recurrent neural network (DRNN) to predict the valence and arousal at each moment of a song, and the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm is used to update the weights during back-propagation. A DRNN considers the targets of the previous time segments when predicting the target of the current time segment; such time-aware prediction is believed to achieve better performance than common machine learning models. After comparing it with our own feature set, we finally use the baseline feature set, which was adopted by the winner of last year's task. Internal experiments use 10-fold cross-validation. The system achieves r values of -0.5904 for valence and 0.4195 for arousal, with Root-Mean-Squared Errors (RMSE) of 0.4054 and 0.3804, respectively. On the evaluation dataset, the system achieves r values of -0.0103±0.3420 for valence and 0.3417±0.2501 for arousal, with RMSE values of 0.3359±0.1614 and 0.2555±0.1255, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The “Emotion in Music” task asks participants to construct a system that automatically predicts valence and arousal values for each 500 ms segment of a song. The development set consists of 431 clips, each 30 seconds long. When annotating the valence and arousal values for the clips, the annotators slide a pointer on the monitor, so the annotations are provided in a time-continuous manner. Please refer to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for more details. Because the time-series annotations are correlated over time, we use a machine learning model that accounts for temporal context, namely a deep recurrent neural network (DRNN). The rest of the paper is organized as follows. Section 2 introduces a music information retrieval feature set. Section 3 introduces the recurrent neural network and the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm. Section 4 reports the performance of our system and discusses the experimental results. Section 5 concludes our work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. FEATURE EXTRACTION</title>
      <p>This section describes the feature set used in our work. This feature set was ultimately dropped from our submission because the baseline feature set obtains better performance; we still introduce it here so that the experiments in Section 4 can be described clearly.</p>
      <p>
        We extract features that are often utilized in music emotion research. The MIRtoolbox [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a Matlab toolbox, is used to extract the features from each music clip. The extracted features are the beat spectrum, event density, zero-crossing rate, MFCC, roll-off, brightness, roughness, chromagram, pitch, root-mean-square (RMS) energy, and low energy. These features can be classified into five categories according to their properties, i.e., rhythm, timbre, tonality, pitch, and dynamics. Table 1 lists the class of each feature.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Feature classes and feature names.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Feature class</th>
              <th>Feature name</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Dynamics</td>
              <td>RMS energy, low energy</td>
            </tr>
            <tr>
              <td>Rhythm</td>
              <td>beat spectrum, event density</td>
            </tr>
            <tr>
              <td>Timbre</td>
              <td>zero-crossing rate, roll-off, brightness, MFCC</td>
            </tr>
            <tr>
              <td>Pitch</td>
              <td>pitch</td>
            </tr>
            <tr>
              <td>Tonality</td>
              <td>chromagram</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
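      <p>Two of the simpler descriptors above, the zero-crossing rate and RMS energy, can be sketched as follows. This is only a rough NumPy analogue for illustration; the paper itself uses the Matlab MIRtoolbox, and the function names here are invented for the example.</p>

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def rms_energy(frames):
    """Root-mean-square energy per frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

# toy example: 1 second of a 440 Hz tone sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(x, frame_len=256, hop=128)
zcr = zero_crossing_rate(frames)   # about 2 * 440 / 8000 = 0.11 per frame
rms = rms_energy(frames)           # about 1 / sqrt(2) for a unit sine
```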
    </sec>
    <sec id="sec-3">
      <title>3. APPROACH</title>
      <p>
        We use a deep recurrent neural network to regress the valence and arousal values of a song. Unlike a feed-forward neural network, a deep recurrent neural network has at least one cyclic path of connections [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We set one layer as a recurrent layer; when computing the current values of its nodes, this layer also considers the values of its nodes at the previous time step. Such a model is called an L intermediate layer deep recurrent neural network in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
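      <p>The recurrence described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' Matlab implementation; the layer sizes and weight names are invented for the example.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(X, W_in, W_rec, W_out):
    """Forward pass through a single-recurrent-layer network.

    X: (T, n_in) sequence of input vectors.
    Returns a (T, n_out) array of predictions, one per time step.
    """
    h = np.zeros(W_rec.shape[0])      # hidden state, initialized to zero
    outputs = []
    for t in range(X.shape[0]):
        # current hidden state depends on the input AND the previous state
        h = sigmoid(X[t] @ W_in + h @ W_rec)
        outputs.append(h @ W_out)     # linear output nodes
    return np.array(outputs)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))             # 5 time steps, 8-dim features
W_in = rng.standard_normal((8, 16)) * 0.1
W_rec = rng.standard_normal((16, 16)) * 0.1
W_out = rng.standard_normal((16, 2)) * 0.1  # 2 outputs: valence, arousal
Y = rnn_forward(X, W_in, W_rec, W_out)      # shape (5, 2)
```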
      <p>The weights of a recurrent neural network can be updated using various methods, such as back-propagation through time, real-time recurrent learning, and Kalman-filtering-based weight estimation. This paper adopts back-propagation through time to update the weights. Specifically, the step size of each update is estimated by the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm, which computes the step size systematically rather than determining it as the product of a constant learning rate and the delta values.</p>
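      <p>To illustrate the role L-BFGS plays, the sketch below minimizes a toy least-squares objective with SciPy's L-BFGS implementation. The real system trains a DRNN with the Matlab tool of [4]; the objective and variable names here are assumptions made only for the example.</p>

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective: mean-squared error of a linear model y = X w,
# standing in for the network loss.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def loss(w):
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def grad(w):
    return X.T @ (X @ w - y) / len(y)

# L-BFGS chooses each step size via its own line search, instead of
# multiplying the gradient by a fixed learning rate.
res = minimize(loss, x0=np.zeros(3), jac=grad, method="L-BFGS-B")
```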
      <p>We adopt a multi-task architecture to predict valence and arousal jointly. This architecture has proven effective in various machine learning tasks. In addition, to incorporate contextual information among the segments of a song, we concatenate the features of several consecutive segments into a single input vector of the model. The size of the concatenation is not analyzed in this paper; we empirically set it to three.</p>
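      <p>The concatenation step can be sketched as follows. This is an assumed NumPy illustration; the function name and shapes are invented for the example.</p>

```python
import numpy as np

def concat_context(features, size=3):
    """Concatenate `size` consecutive segment feature vectors into one input.

    features: (n_segments, n_dims) array.
    Returns an array of shape (n_segments - size + 1, size * n_dims).
    """
    return np.stack([features[i : i + size].ravel()
                     for i in range(len(features) - size + 1)])

segs = np.arange(12.0).reshape(6, 2)   # 6 segments, 2-dim features each
X = concat_context(segs, size=3)       # each row now spans 3 segments
```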
    </sec>
    <sec id="sec-4">
      <title>4. RESULTS AND DISCUSSION</title>
      <p>This section consists of three subsections, i.e., experimental
setup, experimental results, and discussion.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1 Experimental Setup</title>
      <p>
        We adopt two feature sets, i.e., the MIR feature set
mentioned in Section 2 and the baseline feature set provided by
the organizers. The features are normalized by z-scores (i.e.
subtracted by mean statistic and divided by the standard
deviation). We train a recurrent neural network model to predict
the valence and arousal values, which is implemented using a
Matlab tool provided by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The number of hidden layers is set
to three, and only the second layer is a recurrent layer. The
number of hidden nodes in each layer is set to 500. A linear
function is applied to each output node, and a sigmoid function
is adopted to be the activation function of each hidden node. The
initialization of weights is implemented using a Xavier's weight
initialization trick [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We train the model in a batch manner.
The batch size is set to 388. The rate of learning of the back
propagation is set to 2. The training process of the model is
stopped after the number of iterations achieves 100. In order to
avoid the over-fitting problem, we add a noise to each target
when training the model. Specifically, we do not pre-train the
model. The experiment of the development set is done using a
10-fold cross validation. The performances of the methods are
evaluated in terms of R-Squared for valence, R-Squared for
arousal, Root-Mean-Squared Error (RMS) for valence, and
Root-Mean-Squared Error for arousal.
      </p>
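      <p>The normalization and evaluation metrics above can be sketched as follows. This is an assumed NumPy illustration; the paper's experiments are run in Matlab, and the toy values are invented for the example.</p>

```python
import numpy as np

def zscore(X):
    """Normalize each feature dimension to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def rmse(y_true, y_pred):
    """Root-mean-squared error between annotations and predictions."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between annotations and predictions."""
    return np.corrcoef(y_true, y_pred)[0, 1]

# toy per-segment valence annotations and predictions
y_true = np.array([0.1, 0.3, -0.2, 0.4])
y_pred = np.array([0.0, 0.2, -0.1, 0.5])
```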
    </sec>
    <sec id="sec-6">
      <title>4.2 Experimental Results</title>
      <p>Table 2 shows the performances obtained on the development set by two approaches. Approach 1: the MIR feature set is extracted from the clips, and an RNN model is adopted to predict the VA values. Approach 2: the baseline feature set, provided by the MediaEval 2015 organizers, is extracted from the clips, and an RNN model is likewise adopted to predict the VA values. Since the MIR feature set performs worse than the baseline feature set, we only submitted Approach 2. Table 3 shows the official results of our submission.</p>
    </sec>
    <sec id="sec-7">
      <title>4.3 Discussion</title>
      <p>Apparently, our system does not obtain satisfactory results in the task. Such results may come from several weaknesses of RNNs: 1) the residual cannot be well back-propagated to the nodes in the first layer; 2) the computation at the current node cannot take into account its states from many time steps in the past; 3) the parameters of the model (e.g., batch size, number of layers, activation function, normalization method, and learning rate) may not be well set.</p>
    </sec>
    <sec id="sec-8">
      <title>5. CONCLUSION</title>
      <p>This paper presents our work on the 2015 MediaEval Emotion in Music task. Our system adopts a recurrent neural network to regress the valence and arousal values. On the development set, the system achieves r values of -0.5904 for valence and 0.4195 for arousal, with Root-Mean-Squared Errors (RMSE) of 0.4054 and 0.3804, respectively. On the evaluation dataset, the system achieves r values of -0.0103±0.3420 for valence and 0.3417±0.2501 for arousal, with RMSE values of 0.3359±0.1614 and 0.2555±0.1255, respectively. Our system does not perform well in the task. The unsatisfactory results may be due to the lack of model tuning. A pre-training process should be involved to improve the performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in music task at mediaeval 2015</article-title>
          . In MediaEval 2015 Workshop,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Lartillot</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Toiviainen</surname>
          </string-name>
          .
          <article-title>MIR in Matlab (II): A toolbox for musical feature extraction from audio</article-title>
          .
          <source>In Proc. Int. Conf. Music Information Retrieval</source>
          ,
          <year>2007</year>
          , pages.
          <fpage>127</fpage>
          -
          <lpage>130</lpage>
          , [Online] http://users.jyu.fi/lartillo/mirtoolbox/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jaeger</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the echo state network approach</article-title>
          . International University Bremen.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Smaragdis</surname>
          </string-name>
          .
          <article-title>Singing-voice separation from monaural recordings using deep recurrent neural networks</article-title>
          .
          <source>In Proc. Int. Conf. Music Information Retrieval</source>
          ,
          <year>2014</year>
          , pages.
          <fpage>477</fpage>
          -
          <lpage>482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Glorot</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Understanding the difficulty of training deep feedforward neural networks</article-title>
          .
          <source>AISTATS</source>
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>