<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Emotion and Themes Recognition in Music Utilising Convolutional and Recurrent Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Applied Music Research Lab, Department of Music, University of Liverpool</institution>
          ,
          <country country="UK">U.K.</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>GLAM - Group on Language, Audio &amp; Music, Imperial College London</institution>
          ,
          <country country="UK">U.K.</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Shahin Amiriparian</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>ZD.B. Chair of Embedded Intelligence for Health Care &amp; Wellbeing, University of Augsburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>Emotion is an inherent aspect of music, and associations to music can be made via both life experience and specific musical techniques applied by the composer. Computational approaches to music recognition are well established in the research community; however, deep learning approaches have been limited and are not yet comparable to conventional approaches. In this study, we present our fusion system of end-to-end convolutional recurrent neural networks (CRNN) and pre-trained convolutional feature extractors for music emotion and theme recognition<sup>1</sup>. We train 9 models and conduct various late fusion experiments. Our best performing model (team name: AugLi) achieves 74.2 % ROC-AUC on the test partition, which is 1.6 percentage points over the baseline system of the MediaEval 2019 Emotion &amp; Themes in Music task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The ability of music to express and induce emotions is a well-known
and demonstrable fact [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. It communicates and induces
similar emotional states in all listeners because musical parameters
(e.g., rhythm, melody, timbre, dynamics) encode affective
information that is implicitly decoded by listeners [
        <xref ref-type="bibr" rid="ref14 ref18">14, 18</xref>
        ]. Furthermore,
both music psychologists and computer scientists have provided
plenty of evidence that listeners construe emotional meaning by
attending to structural aspects of the acoustic signal at various
levels [
        <xref ref-type="bibr" rid="ref10 ref13 ref22">10, 13, 22</xref>
        ]. Recent deep learning solutions demonstrate the
suitability of recurrent neural networks (RNNs), autoencoders, and
convolutional neural networks (CNNs) for the task of audio-based
music emotion recognition (MER) [
        <xref ref-type="bibr" rid="ref17 ref23 ref25">17, 23, 25</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we have
utilised denoising autoencoders and a transfer learning approach
for time-continuous predictions of emotion in music and speech.
Furthermore, we have conducted both psychological and
computational experiments that aimed at clarifying the role of music
structure in the expression and induction of musical emotions [
        <xref ref-type="bibr" rid="ref11 ref15">11, 15</xref>
        ].
In this paper, we introduce our end-to-end architecture for the task
of emotion and theme recognition in music at MediaEval 2019 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>
        Our framework – which is motivated by our previous works with
CRNNs [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ] – is depicted in Figure 1. It consists of two models
whose predictions are fused to obtain the final predictions. These
models capture both shift-invariant, high-level features
(convolutional block) and long(er)-term temporal context (recurrent block)
from the musical inputs [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. The MTG-Jamendo dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
includes 18 486 audio tracks with 56 distinct mood and theme
annotations/tags. All audio files have at least one tag. The dataset provides
60-20-20 % splits for training, validation, and testing. For the full
description of the challenge data, please refer to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p><sup>1</sup> https://github.com/amirip/AugLi-MediaEval</p>
    </sec>
    <sec id="sec-3">
      <title>Convolutional Recurrent Neural Network</title>
      <p>
        The CRNN system (upper part of Figure 1) consists of a VGGish
model (which is trained on the AudioSet dataset [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]) with the
final global average pooling layer replaced by an RNN. Specifically,
we add 2 recurrent layers with 256 units (we tried 128, 256, and
512 units) and a dropout [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] of 0.3 (out of [0.2, 0.3, 0.4]) for each
layer, followed by a 1 024 unit dense layer, batch normalisation [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ],
ReLU activation [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] and a dropout of 0.3. Tagging is performed
by a 52 unit dense layer with sigmoid activation. We initialise the
convolutional feature extractor with the official SoundNet trained
weights [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Sequences of log Mel spectrograms are
generated using the kapre Keras library [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. For this, the input is
resampled to 16 kHz, and 64 Mel filters and an FFT window of 512
samples with a hop size of 256 are used. During training, we sample
a random 20 s chunk of every song and apply random Gaussian
noise with a maximum power of 0.2. For evaluation, we use the
centre 20 s chunk of each song. We apply the RMSprop optimiser [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]
and train the network with a batch size of 32. We first train only
the top RNN and tagging layers for 20 epochs with a learning rate
of 0.001, keeping the weights of the pre-trained VGGish frozen. We
then unfreeze the feature extraction layers and resume training
from the best checkpoint – measured by the validation area under the Receiver
Operating Characteristic curve (ROC-AUC) – with a reduced learning
rate of 0.0001 for another 80 epochs. Finally, the best overall model
is restored and evaluated on the test partition.
      </p>
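      <p>The chunk selection described above can be illustrated with a minimal sketch. It relies only on the values stated in the text (16 kHz audio, 20 s chunks, a hop size of 256). The function names, the handling of tracks shorter than 20 s, and the frame-count approximation are illustrative assumptions rather than a description of the released code. Noise augmentation is not covered.</p>
      <preformat>
```python
import random

SAMPLE_RATE = 16_000   # Hz, as stated in the text
CHUNK_SECONDS = 20     # chunk length used for the CRNN
HOP_SIZE = 256         # spectrogram hop size in samples


def chunk_bounds(num_samples, training, rng=random):
    """Return (start, end) sample indices of a 20 s chunk.

    Training: a uniformly random chunk. Evaluation: the centre chunk.
    Tracks shorter than the chunk are returned whole (assumption).
    """
    chunk = SAMPLE_RATE * CHUNK_SECONDS
    if chunk >= num_samples:
        return 0, num_samples
    last_start = num_samples - chunk
    start = rng.randint(0, last_start) if training else last_start // 2
    return start, start + chunk


def approx_num_frames(num_samples):
    """Approximate number of spectrogram frames (ignores padding)."""
    return num_samples // HOP_SIZE
```
      </preformat>
      <p>For a 60 s track, the evaluation chunk spans samples 320 000 to 640 000, and a 20 s chunk corresponds to roughly 1 250 spectrogram frames.</p>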
    </sec>
    <sec id="sec-4">
      <title>Utilising pre-trained CNNs</title>
      <p>
        The second model (see bottom part of Figure 1) uses our Deep
Spectrum system<sup>2</sup> [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to extract pre-trained CNN features from
Mel spectrograms (128 Mel filters) of the songs. Such features have been
shown to outperform engineered feature sets on a variety of
acoustic tasks [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2–4</xref>
        ]. We use an ImageNet [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] pre-trained VGG16
architecture and forward spectrogram plots of 1 s and 5 s audio chunks through
the network [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. The activations of the penultimate layer then
form our feature vectors. We extract these features for the first 30
seconds (the minimum song duration in the dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) of each
song and use them as sequenced input for training RNNs. For both
feature types, three RNN architectures are trained which differ in
the choice of recurrent cells (LSTM, GRU, or BLSTM), as with the CRNN. We chose an
architecture with 2 recurrent layers of 1 024 units each, followed by
a dense layer with the same number of units before the final densely
connected prediction layer. Batch normalisation is applied
after each of the recurrent layers and the penultimate dense layer.
Finally, a dropout of 0.4 is applied to the activations of the hidden
layers. We train the model using RMSprop with a learning rate of
0.001 and a batch size of 32 for a maximum of 1 000 epochs, but perform
early stopping if the validation ROC-AUC does not increase for
over 50 epochs. In practice, none of our models was trained for more than
200 epochs. As for the CRNN, we restore the best model checkpoint
before evaluating on the test partition.
      </p>
      <p><sup>2</sup> https://github.com/DeepSpectrum/DeepSpectrum</p>
      <p>[Figure 1: overview of the two models; the recurrent blocks are labelled GRU/(B)LSTM.]</p>
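      <p>The stopping rule described above (a patience of 50 epochs on the validation ROC-AUC, a cap of 1 000 epochs, and restoring the best checkpoint) can be sketched as follows. This is a plain-Python illustration under our reading of "does not increase for over 50 epochs"; the function name and the example scores are hypothetical.</p>
      <preformat>
```python
def best_epoch_with_early_stopping(val_scores, patience=50, max_epochs=1000):
    """Return (best_epoch, best_score, stopped_epoch) for a score sequence.

    val_scores: validation ROC-AUC per epoch (hypothetical values).
    Training stops once the score has not improved for more than
    `patience` epochs, or after `max_epochs` epochs.
    """
    best_epoch, best_score = -1, float("-inf")
    stopped_epoch = -1
    for epoch, score in enumerate(val_scores[:max_epochs]):
        stopped_epoch = epoch
        if score > best_score:
            # A real training loop would save a checkpoint here.
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch > patience:
            break
    return best_epoch, best_score, stopped_epoch
```
      </preformat>
      <p>The returned best epoch identifies the checkpoint that would be restored before evaluation on the test partition.</p>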
    </sec>
    <sec id="sec-5">
      <title>Fusion Experiments</title>
      <p>To explore further potential performance improvements, we conduct
late fusion experiments by averaging the prediction scores
returned by our networks for the test partition. From these scores,
we generate corresponding tag decisions with the official challenge
script. In total, we evaluate five different fusion scenarios: fusion
of all systems, fusion of all Deep Spectrum systems, fusion of all CRNN
systems, and fusion of the Deep Spectrum systems trained on 1 s and
5 s feature windows, respectively.</p>
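      <p>Late fusion by score averaging, as described in this section, can be sketched in a few lines. The sketch assumes that every model returns one score per tag for the same tracks in the same order; the function name and the toy scores are hypothetical, and the conversion of scores into tag decisions (done with the official challenge script) is not reproduced.</p>
      <preformat>
```python
def late_fusion(score_sets):
    """Average per-tag prediction scores over several models.

    score_sets: one entry per model; each entry is a list of tracks,
    and each track is a list of per-tag scores.
    """
    num_models = len(score_sets)
    fused = []
    for track_scores in zip(*score_sets):  # same track across all models
        fused.append([sum(tag) / num_models for tag in zip(*track_scores)])
    return fused


# Two hypothetical models, two tracks, two tags.
model_a = [[0.25, 1.0], [0.5, 0.0]]
model_b = [[0.75, 0.5], [0.0, 0.5]]
fused_scores = late_fusion([model_a, model_b])
```
      </preformat>
      <p>Each fusion scenario then corresponds to a different subset of models passed to the averaging step.</p>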
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND ANALYSIS</title>
      <p>
        The results of our experiments are shown in Table 1. Our best CRNN
model with GRU layers reaches 69.5 % ROC-AUC on the test set,
while a bi-directional LSTM trained on 1 s Deep Spectrum features
achieves 71.0 % ROC-AUC. Individually, both fall short of the challenge
baseline of 72.5 %, which may be explained by the
fact that we use a fixed size chunk of each song (20 s for CRNN and
30 s for Deep Spectrum + RNN) instead of the whole song. We made
this choice because training the RNN models on longer sequences
quickly becomes computationally infeasible. Nonetheless, we can
see that fusion leads to an increase in performance. For each type of
system, in-group fusion only leads to marginal performance boosts.
We notice a larger positive effect when combining different system types,
hinting at complementary information found on different scales.
Finally, fusing all 9 systems increases the performance to 74.2 %
ROC-AUC on the test set. This shows that the features extracted
from spectrograms with an ImageNet pre-trained CNN provide
further information not found by training on audio data alone. Our
fusion configuration further achieves a macro average F1 of 17.5 %
and a macro PR-AUC of 11.7 %.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Results are given in macro ROC-AUC [%]. Baseline accuracy on the
test set is 72.5 % ROC-AUC [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
        </caption>
        <table>
          <thead>
            <tr><th>System</th><th>Spectrogram width (s)</th><th>RNN type</th><th>Validation</th><th>Test</th></tr>
          </thead>
          <tbody>
            <tr><td>CRNN</td><td>–</td><td>LSTM</td><td>71.4</td><td>69.4</td></tr>
            <tr><td>CRNN</td><td>–</td><td>GRU</td><td>72.6</td><td>69.5</td></tr>
            <tr><td>CRNN</td><td>–</td><td>BLSTM</td><td>71.9</td><td>68.2</td></tr>
            <tr><td>Deep Spectrum [<xref ref-type="bibr" rid="ref3">3</xref>] + RNN</td><td>1</td><td>LSTM</td><td>70.1</td><td>70.0</td></tr>
            <tr><td>Deep Spectrum + RNN</td><td>1</td><td>GRU</td><td>68.4</td><td>69.8</td></tr>
            <tr><td>Deep Spectrum + RNN</td><td>1</td><td>BLSTM</td><td>69.2</td><td>71.0</td></tr>
            <tr><td>Deep Spectrum + RNN</td><td>5</td><td>LSTM</td><td>69.0</td><td>70.3</td></tr>
            <tr><td>Deep Spectrum + RNN</td><td>5</td><td>GRU</td><td>68.8</td><td>69.9</td></tr>
            <tr><td>Deep Spectrum + RNN</td><td>5</td><td>BLSTM</td><td>68.4</td><td>70.8</td></tr>
            <tr><td>Fusion: All CRNN (3 models)</td><td>–</td><td>–</td><td>–</td><td>70.7</td></tr>
            <tr><td>Fusion: All 1s Deep Spectrum (3 models)</td><td>–</td><td>–</td><td>–</td><td>71.5</td></tr>
            <tr><td>Fusion: All 5s Deep Spectrum (3 models)</td><td>–</td><td>–</td><td>–</td><td>71.6</td></tr>
            <tr><td>Fusion: All Deep Spectrum (6 models)</td><td>–</td><td>–</td><td>–</td><td>72.6</td></tr>
            <tr><td>Fusion: All systems (9 models)</td><td>–</td><td>–</td><td>–</td><td>74.2</td></tr>
          </tbody>
        </table>
      </table-wrap>
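      <p>For readers unfamiliar with the reported metric, macro ROC-AUC can be sketched as the unweighted mean of per-tag ROC-AUC values. The pure-Python version below uses the pairwise-ranking definition of ROC-AUC. The function names and toy data are hypothetical, and skipping tags that lack positive or negative examples is an assumption; the official challenge script may handle such cases differently.</p>
      <preformat>
```python
def roc_auc(labels, scores):
    """ROC-AUC of one tag: chance that a positive outranks a negative.

    Ties count as half. Returns None if only one class is present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_roc_auc(label_matrix, score_matrix):
    """Unweighted mean of per-tag ROC-AUC (tracks x tags matrices)."""
    per_tag = [roc_auc(col_y, col_s)
               for col_y, col_s in zip(zip(*label_matrix), zip(*score_matrix))]
    valid = [a for a in per_tag if a is not None]
    return sum(valid) / len(valid)


# Four hypothetical tracks with two tags each.
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.35, 0.6]]
toy_macro_auc = macro_roc_auc(toy_labels, toy_scores)
```
      </preformat>
      <p>In the toy example, the first tag is ranked perfectly (1.0) and the second reaches 0.75, giving a macro ROC-AUC of 0.875.</p>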
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>We outperformed the competitive challenge baseline of the MediaEval
2019 Emotion &amp; Themes in Music task after fusing the outputs of
our two systems (cf. Table 1). We also demonstrated that the Deep
Spectrum + RNN approach (which makes use of CNNs pre-trained
on ImageNet) yields better results than the CRNN with the
VGGish model. In future work, a systematic comparison between
engineered and data-driven feature sets will be carried out using the
same machine learning models. Its aim will be to determine the
usefulness of data-driven features for emotion and theme
prediction in music. We believe that this research direction can lead
to a better understanding of the relevant cues for emotional
communication in music and to improvements in automated emotion
recognition systems.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Shahin</given-names>
            <surname>Amiriparian</surname>
          </string-name>
          , Alice Baird, Sahib Julka, Alyssa Alcorn, Sandra Ottl, Suncica Petrović, Eloise Ainger, Nicholas Cummins, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Recognition of Echolalic Autistic Child Vocalisations Utilising Convolutional Recurrent Neural Networks</article-title>
          .
          <source>In Proceedings of INTERSPEECH 2018, 19th Annual Conference of the International Speech Communication Association</source>
          . ISCA, Hyderabad, India,
          <fpage>2334</fpage>
          -
          <lpage>2338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Shahin</given-names>
            <surname>Amiriparian</surname>
          </string-name>
          , Nicholas Cummins,
          <string-name>
            <given-names>Maurice</given-names>
            <surname>Gerczuk</surname>
          </string-name>
          , Sergey Pugachevskiy, Sandra Ottl, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>“Are You Playing a Shooter Again?!” Deep Representation Learning for Audio-based Video Game Genre Recognition</article-title>
          .
          <source>IEEE Transactions on Games</source>
          <volume>11</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Shahin</given-names>
            <surname>Amiriparian</surname>
          </string-name>
          , Maurice Gerczuk, Sandra Ottl, Nicholas Cummins,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Freitag</surname>
          </string-name>
          , Sergey Pugachevskiy, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Snore Sound Classification Using Image-based Deep Spectrum Features</article-title>
          .
          <source>In Proceedings of INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association</source>
          . ISCA, Stockholm, Sweden,
          <fpage>3512</fpage>
          -
          <lpage>3516</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Shahin</given-names>
            <surname>Amiriparian</surname>
          </string-name>
          , Maurice Gerczuk, Sandra Ottl, Nicholas Cummins,
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Pugachevskiy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bag-of-DeepFeatures: Noise-Robust Deep Feature Representations for Audio Analysis</article-title>
          .
          <source>In Proceedings of the 31st International Joint Conference on Neural Networks (IJCNN)</source>
          . IEEE, Rio de Janeiro, Brazil,
          <fpage>2419</fpage>
          -
          <lpage>2425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Shahin</given-names>
            <surname>Amiriparian</surname>
          </string-name>
          , Sahib Julka, Nicholas Cummins, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep Convolutional Recurrent Neural Networks for Rare Sound Event Detection</article-title>
          .
          <source>In Proceedings 44. Jahrestagung für Akustik, DAGA</source>
          <year>2018</year>
          . DEGA, Deutsche Gesellschaft für Akustik e.V. (DEGA), Munich, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Yusuf</given-names>
            <surname>Aytar</surname>
          </string-name>
          , Carl Vondrick, and Antonio Torralba.
          <year>2016</year>
          .
          <article-title>SoundNet: Learning sound representations from unlabeled video</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., Barcelona, Spain,
          <fpage>892</fpage>
          -
          <lpage>900</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Alastair Porter,
          <string-name>
            <given-names>Philip</given-names>
            <surname>Tovstogan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Minz</given-names>
            <surname>Won</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo</article-title>
          .
          <source>In MediaEval Benchmarking Initiative for Multimedia Evaluation</source>
          . Sophia Antipolis, France.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Minz Won, Philip Tovstogan,
          <string-name>
            <given-names>Alastair</given-names>
            <surname>Porter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The MTG-Jamendo Dataset for Automatic Music Tagging</article-title>
          .
          <source>In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019)</source>
          . ICML, Long Beach, CA, United States.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Keunwoo</given-names>
            <surname>Choi</surname>
          </string-name>
          , Deokjin Joo, and
          <string-name>
            <given-names>Juho</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras</article-title>
          .
          <source>In Machine Learning for Music Discovery Workshop at 34th International Conference on Machine Learning (ICML)</source>
          . ICML, Sydney, Australia.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Eduardo</given-names>
            <surname>Coutinho</surname>
          </string-name>
          and
          <string-name>
            <given-names>Angelo</given-names>
            <surname>Cangelosi</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>The Use of Spatio-Temporal Connectionist Models in Psychological Studies of Musical Emotions</article-title>
          .
          <source>Music Perception: An Interdisciplinary Journal</source>
          <volume>27</volume>
          ,
          <issue>1</issue>
          (sep
          <year>2009</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Eduardo</given-names>
            <surname>Coutinho</surname>
          </string-name>
          and
          <string-name>
            <given-names>Angelo</given-names>
            <surname>Cangelosi</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Musical Emotions: Predicting Second-by-Second Subjective Feelings of Emotion From Low-Level Psychoacoustic Features and Physiological Measurements</article-title>
          .
          <source>Emotion</source>
          <volume>11</volume>
          ,
          <issue>4</issue>
          (aug
          <year>2011</year>
          ),
          <fpage>921</fpage>
          -
          <lpage>937</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Eduardo</given-names>
            <surname>Coutinho</surname>
          </string-name>
          , Jun Deng, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Transfer learning emotion manifestation across music and speech</article-title>
          .
          <source>In 2014 International Joint Conference on Neural Networks (IJCNN)</source>
          . IEEE,
          <fpage>3592</fpage>
          -
          <lpage>3598</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Eduardo</given-names>
            <surname>Coutinho</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Dibben</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Psychoacoustic cues to emotion in speech prosody and music</article-title>
          .
          <source>Cognition &amp; Emotion</source>
          <volume>27</volume>
          ,
          <issue>4</issue>
          (jun
          <year>2013</year>
          ),
          <fpage>658</fpage>
          -
          <lpage>684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Eduardo</given-names>
            <surname>Coutinho</surname>
          </string-name>
          and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Shared acoustic codes underlie emotional communication in music and speech&#x2014;Evidence from deep transfer learning</article-title>
          .
          <source>PLoS ONE</source>
          <volume>12</volume>
          ,
          <issue>6</issue>
          (
          <year>2017</year>
          ),
          <elocation-id>e0179289</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Eduardo</given-names>
            <surname>Coutinho</surname>
          </string-name>
          , Felix Weninger, Björn Schuller, and
          <string-name>
            <given-names>Klaus R.</given-names>
            <surname>Scherer</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>The Munich LSTM-RNN approach to the MediaEval 2014 “Emotion in Music” Task</article-title>
          .
          <source>In CEUR Workshop Proceedings</source>
          , Martha Larson, Bogdan Ionescu, Xavier Anguera, Maria Eskevich, Pavel Korshunov, Markus Schedl, Mohammad Soleymani, Georgios Petkos, Richard Sutcliffe, Jaeyoung Choi, and
          <string-name>
            <given-names>Gareth J.F.</given-names>
            <surname>Jones</surname>
          </string-name>
          (Eds.), Vol.
          <volume>1263</volume>
          . CEUR, Barcelona, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kai</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          .
          <source>In 2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          . IEEE, Miami, FL,
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Yizhuo</given-names>
            <surname>Dong</surname>
          </string-name>
          , Xinyu Yang,
          <string-name>
            <given-names>Xi</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Juan</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Bidirectional Convolutional Recurrent Sparse Network (BCRSN): An Efficient Model for Music Emotion Recognition</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Alf</given-names>
            <surname>Gabrielsson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Erik</given-names>
            <surname>Lindström</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>The role of structure in the musical expression of emotions</article-title>
          .
          <source>In Handbook of music and emotion: Theory, research, applications</source>
          , Patrik N. Juslin and John Sloboda (Eds.)
          . Oxford University Press, Oxford,
          <fpage>367</fpage>
          -
          <lpage>400</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Shawn</given-names>
            <surname>Hershey</surname>
          </string-name>
          , Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke,
          <string-name>
            <given-names>Aren</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R Channing</given-names>
            <surname>Moore</surname>
          </string-name>
          , Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, and others.
          <year>2017</year>
          .
          <article-title>CNN architectures for large-scale audio classification</article-title>
          .
          <source>In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . IEEE
          ,
          <fpage>131</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Ioffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          .
          <source>arXiv preprint arXiv:1502.03167</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Patrik N.</given-names>
            <surname>Juslin</surname>
          </string-name>
          and John Sloboda (Eds.).
          <year>2011</year>
          .
          <article-title>Handbook of music and emotion: Theory, research, applications</article-title>
          . Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Youngmoo E</given-names>
            <surname>Kim</surname>
          </string-name>
          , Erik M Schmidt,
          <string-name>
            <given-names>Raymond</given-names>
            <surname>Migneco</surname>
          </string-name>
          , Brandon G Morton,
          Patrick Richardson, Jeffrey Scott, Jacquelin A Speck,
          and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Turnbull</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Music emotion recognition: A state of the art review</article-title>
          .
          <source>In Proceedings of ISMIR</source>
          , Vol.
          <volume>86</volume>
          .
          Utrecht, Holland,
          <fpage>937</fpage>
          -
          <lpage>952</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Huaping</given-names>
            <surname>Liu</surname>
          </string-name>
          , Yong Fang, and
          <string-name>
            <given-names>Qinghua</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Music Emotion Recognition Using a Variant of Recurrent Neural Network</article-title>
          .
          <source>In 2018 International Conference on Mathematics, Modeling, Simulation and Statistics Application (MMSSA 2018)</source>
          . Atlantis Press.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Vinod</given-names>
            <surname>Nair</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Rectified linear units improve restricted Boltzmann machines</article-title>
          .
          <source>In Proceedings of the 27th international conference on machine learning (ICML-10)</source>
          . Haifa, Israel,
          <fpage>807</fpage>
          -
          <lpage>814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Richard</given-names>
            <surname>Orjesek</surname>
          </string-name>
          , Roman Jarina, Michal Chmulik, and
          <string-name>
            <given-names>Michal</given-names>
            <surname>Kuba</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>DNN Based Music Emotion Recognition from Raw Audio Signal</article-title>
          .
          <source>In 2019 29th International Conference Radioelektronika (RADIOELEKTRONIKA)</source>
          .
          <source>IEEE</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          .
          <source>CoRR abs/1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Nitish</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Tijmen</given-names>
            <surname>Tieleman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude</article-title>
          .
          <source>COURSERA: Neural networks for machine learning</source>
          <volume>4</volume>
          ,
          <issue>2</issue>
          (
          <year>2012</year>
          ),
          <fpage>26</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>