<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Emotion and Theme Recognition in Music Using Attention-Based Methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Srividya Tirunellai Rajamani</string-name>
          <email>srividya.tirunellai@informatik.uni-augsburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kumar Rajamani</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Björn Schuller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>GLAM - Group on Language, Audio, &amp; Music, Imperial College London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Medical Informatics, University of Lübeck</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Emotion and theme recognition in music plays a vital role in music information retrieval and recommendation systems. Deep learning based techniques have shown great promise in this regard. Realising optimal network configurations with the least number of FLOPS and model parameters is of paramount importance for obtaining efficient, deployable models, especially for resource-constrained hardware. Yet, little research has been done in this direction, especially in the context of music emotion recognition. As part of the MediaEval 2020: Emotions and Themes in Music challenge, we (team name: AUGment) propose a novel integration of attention-based techniques for the task of emotion/mood recognition in music. We demonstrate that using stand-alone self-attention in the later layers of a VGG-ish network matches the baseline PR-AUC with 11 % fewer FLOPS and 22 % fewer parameters. Further, utilising the learnable Attention-based Rectified Linear Unit (AReLU) activation helps to achieve better performance than the baseline. As an additional gain, a late fusion of these two models with the baseline also improved the PR-AUC and ROC-AUC by 1 %.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Automatic detection of mood/theme of music is a challenging and
widely researched topic that aids in music tagging and
recommendation systems. This involves acoustic feature extraction followed
by single or multi-label classification. Conventional approaches
used hand-crafted audio features representing physical or
perceived aspects of sound as input to machine learning algorithms
[
        <xref ref-type="bibr" rid="ref14 ref18 ref21">14, 18, 21</xref>
        ]. Contemporary methods make use of Deep Neural
Networks (DNNs) with hand-crafted or automatically learnt features
from audio [
        <xref ref-type="bibr" rid="ref1 ref10 ref12 ref13 ref24">1, 10, 12, 13, 24</xref>
        ].
      </p>
      <p>
        Attention-based mechanisms have shown great promise and
achieved state-of-the-art results in several tasks such as
Natural language processing (NLP) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], image classification and
segmentation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], computer vision [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], as well as speech analysis
[
        <xref ref-type="bibr" rid="ref17 ref26 ref28 ref5 ref9">5, 9, 17, 26, 28</xref>
        ]. The effectiveness of these mechanisms in the task
of music mood/emotion recognition, however, is less explored. We
investigate the effectiveness of different attention-based
techniques for multi-label music mood classification.
      </p>
    </sec>
    <sec id="sec-2">
      <title>EXPERIMENTAL SETUP</title>
      <p>
        The data used in the MediaEval 2020 task is a subset of the
MTG-Jamendo dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The subset used in the MediaEval 2020 task
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] includes 18 486 full-length audio tracks of varying length with
mood and theme annotations. The dataset comprises 56 distinct
mood/theme tags. All tracks have at least one tag, but many have
more than one, making it a multi-label classification task.
      </p>
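      <p>As an illustration of the label format, the following minimal sketch encodes a track's tags as a multi-hot target vector; the tag names and the helper function are purely illustrative and not taken from the challenge code.</p>
      <preformat>
# Minimal sketch: encoding multi-label mood/theme annotations as multi-hot targets.
# The tag vocabulary below is illustrative only; the actual task defines 56 tags.
import numpy as np

TAGS = ["happy", "sad", "energetic", "relaxing", "dark"]  # illustrative subset of the 56 tags
TAG_TO_INDEX = {tag: i for i, tag in enumerate(TAGS)}

def encode_tags(track_tags):
    """Return a multi-hot vector with a 1 for every tag assigned to the track."""
    target = np.zeros(len(TAGS), dtype=np.float32)
    for tag in track_tags:
        target[TAG_TO_INDEX[tag]] = 1.0
    return target

# A track with more than one tag yields several active entries, which is what
# makes this a multi-label (rather than multi-class) classification problem.
print(encode_tags(["happy", "energetic"]))  # [1. 0. 1. 0. 0.]
      </preformat>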
      <p>
        The Mel-spectrogram is a widely used feature for audio-related
tasks such as boundary detection, tagging [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and latent feature
learning. It has also been shown to be an effective time-frequency
representation of audio for the task of automatic music tagging [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Using
Mel-spectrogram as the input enables the use of image classification
networks like Convolutional Neural Networks (CNNs) or Residual
Neural Networks (ResNets). CNNs, including their variants like
Visual Geometry Group (VGG) networks, have been successfully
used for image recognition [
        <xref ref-type="bibr" rid="ref25 ref29">25, 29</xref>
        ], object detection [
        <xref ref-type="bibr" rid="ref16 ref20">16, 20</xref>
        ], and
image segmentation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. VGG-like architectures that comprise a
stack of convolutional layers followed by a fully connected layer are
further shown to be well-suited for the task of music tagging [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
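      <p>For illustration, the following minimal sketch computes such a log Mel-spectrogram with librosa; the parameter values (sampling rate, number of Mel bands, hop length) are assumptions and not necessarily those used in the challenge pipeline.</p>
      <preformat>
# Minimal sketch: turning an audio file into a (mel bands x time bins) image-like input.
# Parameter values are illustrative assumptions, not the official challenge settings.
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=12000, n_fft=512, hop_length=256, n_mels=96):
    y, sr = librosa.load(path, sr=sr, mono=True)  # load and resample the track
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel).astype(np.float32)  # log compression

# The resulting 2D array can be fed to CNN- or VGG-style image classification networks.
      </preformat>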
      <p>
        We consider the first 1 400 time bins of the Mel-spectrogram
of each track as input, since the central theme or mood is usually
established in the opening of a track. This approach, as opposed
to taking time bins from the center of the track or using random
chunks, additionally ensures that the input contains
non-silent segments. Optionally, trimming silence from the start
would make it even more robust to tracks with a potentially
delayed onset. A VGG-ish architecture [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] with five 2D
convolutional layers followed by a dense connection is used as the
baseline for our experiments. We determine the effectiveness of
various attention mechanisms on this baseline for the task of music
mood/theme detection (the source code is published at
https://github.com/SrividyaTR/MediaEval2020EmotionAndThemeInMusic).
Training is done for a maximum of 100 epochs with early stopping
if the validation ROC-AUC does not increase for over 35 epochs.
      </p>
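      <p>A minimal PyTorch sketch of such a VGG-ish baseline is given below: five 2D convolutional blocks followed by a dense layer and a sigmoid output over the 56 tags. The channel widths, pooling sizes, and the omitted early-stopping loop are assumptions for illustration, not the authors' exact implementation.</p>
      <preformat>
# Minimal sketch of a VGG-ish baseline for multi-label music tagging.
# Channel widths and pooling sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VGGish(nn.Module):
    def __init__(self, n_tags=56):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out), nn.ReLU(),
                                 nn.MaxPool2d(2))
        # Five convolutional blocks over the (1 x mel x time) spectrogram input.
        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 64),
                                      block(64, 64), block(64, 64),
                                      nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(64, n_tags)  # dense connection to the 56 tag scores

    def forward(self, x):
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.classifier(h))  # independent per-tag probabilities

model = VGGish()
criterion = nn.BCELoss()  # multi-label objective
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Training runs for up to 100 epochs and stops early if the validation ROC-AUC
# does not improve for 35 epochs (early-stopping loop omitted here).
      </preformat>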
    </sec>
    <sec id="sec-3">
      <title>METHODS</title>
    </sec>
    <sec id="sec-4">
      <title>Stand-alone self-attention</title>
      <p>
        Self-attention is attention applied to a single context instead of
across multiple contexts (i. e., the query, keys, and values are
extracted from the same context). Stand-alone self-attention replaces
spatial convolutions with a form of self-attention rather than using
attention as an augmentation on top of convolutions. Stand-alone
self-attention especially in later layers of a network is shown to
outperform the baseline on image classification with far fewer
floating point operations (FLOPS) and parameters [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. We
experiment using stand-alone self-attention in later layers of the
baseline VGG-ish network.
      </p>
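      <p>A simplified sketch of such a replacement is given below: the 2D convolution of a later block is swapped for a self-attention layer in which queries, keys, and values are computed from the same feature map. For brevity, the sketch attends globally over all spatial positions and omits the local windows and relative position embeddings of [19]; it is an illustration, not the exact layer used in our submission.</p>
      <preformat>
# Simplified sketch: a self-attention layer that can replace a spatial convolution.
# It attends globally over all positions and omits the local windows and relative
# position embeddings of stand-alone self-attention; illustrative only.
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions derive queries, keys, and values from the same feature map.
        self.query = nn.Conv2d(channels, channels, 1)
        self.key = nn.Conv2d(channels, channels, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** -0.5

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, h*w, c)
        k = self.key(x).flatten(2)                    # (b, c, h*w)
        v = self.value(x).flatten(2).transpose(1, 2)  # (b, h*w, c)
        attn = torch.softmax(torch.bmm(q, k) * self.scale, dim=-1)
        return torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)

# Attention in early layers is memory-hungry, since the attention matrix grows
# quadratically with the spatial resolution of the feature map.
      </preformat>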
      <p>[Table 1: Results of the experiments. Columns: Model, GFLOPS, # Parameters, ROC-AUC, PR-AUC; rows: VGG-ish baseline, Self-attention in Layer3, Self-attention in Layer4, Self-attention in Layer5. VGG-ish baseline: 3.32 GFLOPS, 448 122 parameters, ROC-AUC .725, PR-AUC .107; the values for the self-attention variants are not preserved in this version.]</p>
    </sec>
    <sec id="sec-4a">
      <title>Attention-based Rectified Linear Unit (AReLU)</title>
      <p>
The Attention-based Rectified Linear Unit (AReLU) is a learnable
activation function [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It exploits an element-wise attention
mechanism and amplifies positive elements and suppresses negative ones
through learnt, data-adaptive parameters. The network training
is more resistant to gradient vanishing as the attention module
within AReLU learns element-wise residues of the activated part
of the input. With only two extra learnable parameters (alpha and
beta) per layer, AReLU enables fast network training under small
learning rates. We experiment using AReLU activation in all of the
5 layers of the baseline VGG-ish network and observe improved
performance.
      </p>
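      <p>A minimal PyTorch sketch of an AReLU-style activation, reconstructed from the description above (two learnable scalars per layer, amplified positive part, suppressed negative part), is shown below; the exact formulation of [6] may differ in detail.</p>
      <preformat>
# Minimal sketch of an AReLU-style activation with two learnable parameters per layer.
# Reconstructed from the description above; see [6] for the exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AReLU(nn.Module):
    def __init__(self, alpha=0.9, beta=2.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))  # controls the negative part
        self.beta = nn.Parameter(torch.tensor(beta))    # controls the positive part

    def forward(self, x):
        alpha = torch.clamp(self.alpha, 0.01, 0.99)  # suppression factor for negatives
        beta = 1.0 + torch.sigmoid(self.beta)        # amplification factor in (1, 2)
        # Positive elements are amplified; negative elements keep a small learnable
        # slope instead of being zeroed, which eases gradient flow during training.
        return beta * F.relu(x) - alpha * F.relu(-x)
      </preformat>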
    </sec>
    <sec id="sec-5">
      <title>Fusion Experiments</title>
      <p>We perform late fusion experiments by averaging the prediction
scores of our different models for the test partition. By fusing
the prediction scores from the stand-alone self-attention based
model, the AReLU-activation based model, and the baseline, we further
improve the performance as compared to the baseline.</p>
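      <p>As an illustration, this late fusion amounts to averaging the per-tag prediction scores of the individual models; the file names below are hypothetical placeholders.</p>
      <preformat>
# Minimal sketch: late fusion by averaging per-tag prediction scores on the test set.
# The score files are hypothetical placeholders; each array has shape (n_tracks, 56).
import numpy as np

scores_self_attention = np.load("predictions_self_attention.npy")  # assumed file name
scores_arelu = np.load("predictions_arelu.npy")                    # assumed file name
scores_baseline = np.load("predictions_baseline.npy")              # assumed file name

fusion_two_models = (scores_self_attention + scores_arelu) / 2.0
fusion_with_baseline = (scores_self_attention + scores_arelu + scores_baseline) / 3.0
      </preformat>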
    </sec>
    <sec id="sec-6">
      <title>SUBMISSIONS AND RESULTS</title>
      <p>Figure 1 provides an overview of our approach and the different
attention mechanisms that we utilise for the task of emotion and
theme recognition in music. Overall, we submitted 4 models to
the challenge. The first model is based on self-attention in Layer3
of the VGG-ish baseline and the second is based on using AReLU
activation in all the 5 convolutional layers of the baseline. The next
2 submissions are a late fusion of these 2 models, and a late fusion
of these 2 models with the baseline.</p>
      <p>Table 1 summarises the results of our experiments. Using
stand-alone self-attention instead of 2D convolution in Layer3 of the
VGG-ish network resulted in a PR-AUC comparable to the baseline
with 11 % fewer FLOPS and 22 % fewer parameters. Using AReLU
activation in all of the 5 layers of the VGG-ish network improved
the ROC-AUC as compared to the baseline. A late fusion of these two
models' predictions resulted in about a 1 % increase in both PR-AUC
and ROC-AUC. A fusion of our models with the baseline model
helped in further improving the performance.</p>
      <p>We experimented using self-attention in other convolution
layers of the baseline VGG-ish network, but the best performance with
least trainable parameters was noted when using self-attention
in Layer3. Using self-attention in Layer4 also gave comparable
performance, though with 1.2 % fewer FLOPS and 22 % fewer
parameters. Further, when using self-attention in initial layers (Layer1
or Layer2), the amount of memory required to hold the activations
was significantly large, leading to the observation that it works best
on down-sampled input. We also observed that using a batch-size
of 16 and a learning rate of 0.0001 helped in faster convergence to
the best model. The best model was learnt within 25 epochs in all
our experiments.</p>
      <p>[Figure 1: Overview of the approach. The Mel-spectrogram of a music track is fed to a self-attention based VGG-ish model and to an AReLU activation based VGG-ish model; the predictions of the two models are fused to obtain the final prediction.]</p>
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>
        We demonstrated the effectiveness of a self-attention-based
VGG-like network for multi-label emotion and theme recognition in
music. This network’s computational efficiency is particularly relevant
when executing the model inference on a mobile device or other
resource-constrained computing hardware. We also established the
performance benefits of using AReLU activation for this task. A
potential future work is to evaluate the effectiveness of
incorporating AReLU activation within a self-attention based VGG-like
network instead of performing a late fusion. One should evaluate
the effectiveness of other attention-based techniques like attention
augmented convolution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for this task. Data Augmentation
using mix-up [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] could also be evaluated to analyse the impact on
performance.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Shahin</given-names>
            <surname>Amiriparian</surname>
          </string-name>
          , Maurice Gerczuk, Eduardo Coutinho, Alice Baird, Sandra Ottl, Manuel Milling, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Emotion and Themes Recognition in Music Utilising Convolutional and Recurrent Neural Networks</article-title>
          .
          <source>In Proceedings of the MediaEval 2019 Workshop</source>
          . Sophia Antipolis, France.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Irwan</given-names>
            <surname>Bello</surname>
          </string-name>
          , Barret Zoph, Ashish Vaswani, Jonathon Shlens, and
          <string-name>
            <given-names>Quoc</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Attention Augmented Convolutional Networks</article-title>
          .
          <source>In Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (ICCV). Seoul, Korea,
          <fpage>3285</fpage>
          -
          <lpage>3294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Alastair Porter,
          <string-name>
            <given-names>Philip</given-names>
            <surname>Tovstogan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Minz</given-names>
            <surname>Won</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>MediaEval 2020: Emotion and Theme Recognition in Music Using Jamendo</article-title>
          .
          <source>In Proceedings of the MediaEval 2020 Workshop</source>
          . Online,
          <volume>14</volume>
          -
          <fpage>15</fpage>
          December
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Minz Won, Philip Tovstogan,
          <string-name>
            <given-names>Alastair</given-names>
            <surname>Porter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The MTG-Jamendo Dataset for Automatic Music Tagging</article-title>
          .
          <source>In Proceedings of the Machine Learning for Music Discovery Workshop, 36th International Conference on Machine Learning (ICML)</source>
          . California, United States.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>William</given-names>
            <surname>Chan</surname>
          </string-name>
          , Navdeep Jaitly, Quoc Le, and
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Listen, attend and spell: A neural network for large vocabulary conversational speech recognition</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . Shanghai, China,
          <fpage>4960</fpage>
          -
          <lpage>4964</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Dengsheng</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jun</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kai</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>AReLU: Attention-based Rectified Linear Unit</article-title>
          . (
          <year>2020</year>
          ).
          arXiv:cs.LG/2006.13858
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Liang-Chieh</surname>
            <given-names>Chen</given-names>
          </string-name>
          , George Papandreou, Iasonas Kokkinos, Kevin Murphy, and
          <string-name>
            <given-names>Alan</given-names>
            <surname>Yuille</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          PP (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Keunwoo</given-names>
            <surname>Choi</surname>
          </string-name>
          , György Fazekas, and
          <string-name>
            <surname>Mark</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Sandler</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Automatic Tagging Using Deep Convolutional Neural Networks</article-title>
          .
          <source>In Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR)</source>
          . New York City, United States,
          <fpage>805</fpage>
          -
          <lpage>811</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jan</given-names>
            <surname>Chorowski</surname>
          </string-name>
          , Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio.
          <year>2015</year>
          .
          <article-title>Attention-Based Models for Speech Recognition</article-title>
          .
          <source>In Proceedings of the 29th International Conference on Neural Information Processing Systems</source>
          , Vol.
          <volume>28</volume>
          . Montreal, Canada,
          <fpage>577</fpage>
          -
          <lpage>585</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Eduardo</surname>
            <given-names>Coutinho</given-names>
          </string-name>
          , Felix Weninger, Björn Schuller, and
          <string-name>
            <given-names>Klaus</given-names>
            <surname>Scherer</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>The Munich LSTM-RNN Approach to the MediaEval 2014 Emotion in Music Task</article-title>
          .
          <source>Proceedings of the MediaEval 2014 Workshop</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Sander</given-names>
            <surname>Dieleman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Schrauwen</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Multiscale Approaches To Music Audio Feature Learning.</article-title>
          .
          <source>In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR)</source>
          .
          <source>Curitiba, Brazil</source>
          ,
          <fpage>3</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Sander</given-names>
            <surname>Dieleman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Schrauwen</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>End-to-end learning for music audio</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . Florence, Italy,
          <fpage>6964</fpage>
          -
          <lpage>6968</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Dorfer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Training general-purpose audio tagging networks with noisy labels and iterative self-verification</article-title>
          .
          <source>In "Proceedings of the Detection and Classification of Acoustic Scenes and Events</source>
          <year>2018</year>
          <article-title>Workshop (DCASE)"</article-title>
          . Surrey, UK,
          <fpage>178</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Florian</surname>
            <given-names>Eyben</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klaus R. Scherer</surname>
            , Björn Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Y. Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan, and
            <given-names>Khiet P.</given-names>
          </string-name>
          <string-name>
            <surname>Truong</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          <volume>7</volume>
          ,
          <issue>2</issue>
          (
          <year>2016</year>
          ),
          <fpage>190</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Chaitanya</surname>
            <given-names>Kaul</given-names>
          </string-name>
          , Suresh Manandhar, and
          <string-name>
            <given-names>Nick</given-names>
            <surname>Pears</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Focusnet: An Attention-Based Fully Convolutional Network for Medical Image Segmentation</article-title>
          .
          <source>In Proceedings of IEEE International Symposium on Biomedical Imaging (ISBI)</source>
          . Venice, Italy,
          <fpage>455</fpage>
          -
          <lpage>458</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Tsung-Yi</surname>
            <given-names>Lin</given-names>
          </string-name>
          , Piotr Dollár, Ross Girshick, Kaiming He,
          <string-name>
            <surname>Bharath Hariharan</surname>
            , and
            <given-names>Serge</given-names>
          </string-name>
          <string-name>
            <surname>Belongie</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Feature Pyramid Networks for Object Detection</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision</source>
          and Pattern Recognition. Honolulu, Hawaii, United States,
          <fpage>936</fpage>
          -
          <lpage>944</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Shuo</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Jinlong Jiao,
          <string-name>
            <given-names>Ziping</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Judith</given-names>
            <surname>Dineley</surname>
          </string-name>
          , Nicholas Cummins,
          <string-name>
            <given-names>and Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Hierarchical Component-attention Based Speaker Turn Embedding for Emotion Recognition</article-title>
          .
          <source>In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN)</source>
          . Glasgow, Scotland, UK,
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Lie</given-names>
            <surname>Lu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Automatic mood detection and tracking of music audio signals</article-title>
          .
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          <volume>14</volume>
          (
          <year>2006</year>
          ),
          <fpage>5</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Prajit</surname>
            <given-names>Ramachandran</given-names>
          </string-name>
          , Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and
          <string-name>
            <given-names>Jon</given-names>
            <surname>Shlens</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Stand-Alone Self-Attention in Vision Models</article-title>
          .
          <source>In Proceedings of the 33rd International Conference on Neural Information Processing Systems</source>
          , Vol.
          <volume>32</volume>
          . Vancouver, Canada,
          <fpage>68</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Shaoqing</surname>
            <given-names>Ren</given-names>
          </string-name>
          , Kaiming He,
          <string-name>
            <surname>Ross Girshick</surname>
            , and
            <given-names>Jian</given-names>
          </string-name>
          <string-name>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2015</year>
          .
          Faster R-CNN:
          <article-title>Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>39</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Björn</surname>
            <given-names>Schuller</given-names>
          </string-name>
          , Gerhard Rigoll, and
          <string-name>
            <given-names>Manfred</given-names>
            <surname>Lang</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Hidden Markov model-based speech emotion recognition</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Hong Kong</source>
          , China, II-1 - II-4.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Lukas</surname>
            <given-names>Stappen</given-names>
          </string-name>
          , Georgios Rizos, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>X-AWARE: ConteXt-AWARE Human-Environment Attention Fusion for Driver Gaze Prediction in the Wild</article-title>
          .
          <source>In Proceedings of the International Conference on Multimodal Interaction (ICMI)</source>
          .
          <volume>858</volume>
          -
          <fpage>867</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
          <string-name>
            <surname>Lukasz Kaiser</surname>
            , and
            <given-names>Illia</given-names>
          </string-name>
          <string-name>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is All you Need</article-title>
          .
          <source>In Proceedings of the 31st International Conference on Neural Information Processing Systems</source>
          , Vol.
          <volume>30</volume>
          .
          <string-name>
            <surname>California</surname>
          </string-name>
          , United States,
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Felix</surname>
            <given-names>Weninger</given-names>
          </string-name>
          , Florian Eyben, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>The TUM Approach to the MediaEval Music Emotion Task Using Generic Affective Audio Features</article-title>
          .
          <source>Proceedings of the MediaEval 2013 Workshop</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Saining</surname>
            <given-names>Xie</given-names>
          </string-name>
          , Ross Girshick, Piotr Dollár, Zhuowen Tu, and
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Aggregated Residual Transformations for Deep Neural Networks</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision</source>
          and Pattern
          <string-name>
            <surname>Recognition (CVPR). Honolulu</surname>
          </string-name>
          , Hawaii, United States,
          <fpage>5987</fpage>
          -
          <lpage>5995</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Yeonguk</given-names>
            <surname>Yu</surname>
          </string-name>
          and
          <string-name>
            <surname>Yoon-Joong Kim</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Attention-LSTM-Attention Model for Speech Emotion Recognition and Analysis of IEMOCAP Database</article-title>
          .
          <source>Electronics</source>
          <volume>9</volume>
          (
          <year>2020</year>
          ),
          <fpage>713</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Hongyi</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Moustapha Cisse,
          <string-name>
            <given-names>Yann N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          , and David LopezPaz.
          <year>2018</year>
          .
          <article-title>mixup: Beyond Empirical Risk Minimization</article-title>
          .
          <source>In Proceedings of the 6th International Conference on Learning Representations( ICLR)</source>
          . Vancouver, BC, Canada.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Ziping</surname>
            <given-names>Zhao</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Zhongtian</given-names>
            <surname>Bao</surname>
          </string-name>
          , Zixing Zhang, Nicholas Cummins,
          <string-name>
            <given-names>Haishuai</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Hierarchical Attention Transfer Networks for Depression Assessment from Speech</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          .
          <volume>7159</volume>
          -
          <fpage>7163</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Barret</surname>
            <given-names>Zoph</given-names>
          </string-name>
          , Vijay Vasudevan, Jonathon Shlens, and
          <string-name>
            <given-names>Quoc</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Learning Transferable Architectures for Scalable Image Recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision</source>
          and
          <article-title>Pattern Recognition (CVPR)</article-title>
          .
          <source>Salt Lake City</source>
          , Utah, United States,
          <fpage>8697</fpage>
          -
          <lpage>8710</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>