<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognizing Music Mood and Theme Using Convolutional Neural Networks and Attention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alish Dipani</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaurav Iyer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veeky Baths</string-name>
        </contrib>
        <aff>Upload AI LLC, India</aff>
        <aff>Cognitive Neuroscience Lab, BITS Pilani, K.K. Birla Goa Campus, India</aff>
        <email>alish.dipani@uploadai.com</email>
        <email>@goa.bits-pilani.ac.in</email>
        <email>veeky@goa.bits-pilani.ac.in</email>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>We present the UAI-CNRL submission to the MediaEval 2020 task on Emotion and Theme Recognition in Music. We make use of the ResNet34 architecture, coupled with a self-attention module, to detect moods/themes in music tracks. The autotagging-moodtheme subset of the MTG-Jamendo dataset was used to train the model. We show that the proposed model outperforms the provided VGG-ish and popularity baselines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Music has been shown to induce a variety of emotions such as
happiness, sadness, and anger [
        <xref ref-type="bibr" rid="ref27 ref7 ref8">7, 8, 27</xref>
        ]. This induction of emotions
can be attributed to different intrinsic properties such as tempo,
rhythm variations, intensity, and mode, and to extrinsic properties such as
the association of music with personal events and previous
experiences [
        <xref ref-type="bibr" rid="ref12 ref23">12, 23</xref>
        ]. These emotional responses could also be one of the
important motivators for humans to listen to music [
        <xref ref-type="bibr" rid="ref20 ref21 ref22">20–22</xref>
        ].
      </p>
      <p>
        Automatic tagging and detection of emotions in music is a
difficult task, given the subjectivity of human emotions. The
MTG-Jamendo dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] aims at tackling several such autotagging
tasks by providing royalty-free audio of consistent quality with
several tags for genre, instruments, and mood/theme. The Emotion and
Theme Recognition Task of MediaEval 2020 uses the mood/theme
subset of the MTG-Jamendo dataset. The task is as follows: given an
audio track, automatically detect one or more moods/themes out of the
56 given tags, for example, fun, sad, romantic, or happy [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In this paper, we describe our approach (team name: UAI-CNRL)
for this task, which uses convolutional neural networks to extract
features from the mel-spectrograms of the audio tracks and multi-head
self-attention to predict the moods/themes from the
extracted features. Our approach achieves better performance than
the baselines.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Convolutional neural networks (CNNs) have been successful in
extracting meaningful features for tasks such as image recognition
[
        <xref ref-type="bibr" rid="ref10 ref14">10, 14</xref>
        ] and object detection [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In the field of audio processing,
CNNs have been used for a variety of tasks, such as automatic
tagging.
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>
        We make use of a popular convolutional neural network
architecture, the ResNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], as a feature extractor to obtain compact
representations of our data. We pair this with self-attention [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] in
order to capture long-term temporal attributes of the given data. We
also make use of batch normalization [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and dropout [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] in order
to further regularize the model. We describe the model architecture
in this section. Our code and trained model are available at this
URL§.
      </p>
    </sec>
    <sec id="sec-4">
      <title>ResNet34</title>
      <p>Residual connections make training deep neural networks easier,
since they address the problem of vanishing gradients. We make
use of a standard ResNet34 architecture to take advantage of this
property. This is preceded by two convolutional layers in order to
reshape the data into a form that can be fed into the ResNet. Another
convolutional layer is used after the ResNet feature extractor to
reduce the number of channels.
</p>
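      <p>A minimal PyTorch sketch of this arrangement is given below; the exact channel counts, kernel sizes, and the use of torchvision's ResNet34 are assumptions, since only the overall structure is described here.</p>
      <preformat>import torch
import torch.nn as nn
from torchvision.models import resnet34

class FeatureExtractor(nn.Module):
    """Two reshaping conv layers, a ResNet34 backbone, and a channel-reducing conv."""
    def __init__(self, out_channels=64):
        super().__init__()
        # Two convolutional layers reshape the 1-channel mel-spectrogram into 3 channels.
        self.pre = nn.Sequential(
            nn.Conv2d(1, 3, kernel_size=3, padding=1),
            nn.BatchNorm2d(3),
            nn.ReLU(),
            nn.Conv2d(3, 3, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Standard ResNet34 used as a feature extractor (classification head removed).
        backbone = resnet34(pretrained=False)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # A final convolutional layer reduces the number of channels.
        self.post = nn.Conv2d(512, out_channels, kernel_size=1)

    def forward(self, x):           # x: (batch, 1, n_mels, time)
        x = self.pre(x)             # (batch, 3, n_mels, time)
        x = self.backbone(x)        # (batch, 512, n_mels/32, time/32)
        return self.post(x)         # (batch, out_channels, n_mels/32, time/32)</preformat>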
    </sec>
    <sec id="sec-5">
      <title>Self-Attention</title>
      <p>The MTG-Jamendo dataset consists of tracks of varying lengths, a
majority of which are over 200 seconds. Using self-attention, we
attempt to capture long-range temporal attributes and summarize
the sequence of music representations.</p>
      <p>
        Our model architecture is inspired by the works done in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], which
uses multi-head attention along with positional encoding. Two layers,
each consisting of four attention heads, were used. The input sequence
length and embedding size were left unchanged.
      </p>
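      <p>A minimal sketch of such an attention module, using PyTorch's built-in transformer encoder, is given below; the embedding size, maximum sequence length, and the sinusoidal form of the positional encoding are assumptions, since the text only fixes two layers of four heads each.</p>
      <preformat>import math
import torch
import torch.nn as nn

class AttentionSummarizer(nn.Module):
    """Multi-head self-attention (2 layers x 4 heads) with positional encoding."""
    def __init__(self, embed_dim=512, num_heads=4, num_layers=2, max_len=512):
        super().__init__()
        # Fixed sinusoidal positional encoding, as in Vaswani et al. [28].
        pe = torch.zeros(max_len, embed_dim)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):              # x: (seq_len, batch, embed_dim)
        x = x + self.pe[: x.size(0)].unsqueeze(1)
        return self.encoder(x)         # same shape, with long-range context mixed in</preformat>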
    </sec>
    <sec id="sec-6">
      <title>Data Augmentation</title>
      <p>
        3.3.1 Mixup. Previous submissions to MediaEval 2019 [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] for
this task have shown that Mixup [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] greatly improves the
performance of the model being used. Mixup creates a new training
example by linearly combining two random, existing training
samples, in the feature space as well as in the label space. More formally,
Mixup trains a neural network on convex combinations of pairs of
examples and their labels. This helps the model alleviate unwanted
behaviours, such as memorization, especially since the dataset size
is relatively small.
      </p>
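      <p>A minimal sketch of Mixup for this multi-label setting, following [31], is shown below; the Beta-distribution parameter alpha is an assumption.</p>
      <preformat>import numpy as np
import torch

def mixup(inputs, targets, alpha=0.2):
    """Return convex combinations of a batch with a shuffled copy of itself."""
    lam = np.random.beta(alpha, alpha)          # mixing coefficient
    index = torch.randperm(inputs.size(0))      # random pairing of examples
    mixed_inputs = lam * inputs + (1.0 - lam) * inputs[index]
    mixed_targets = lam * targets + (1.0 - lam) * targets[index]
    return mixed_inputs, mixed_targets</preformat>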
      <p>
        3.3.2 SpecAugment. SpecAugment [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] is an augmentation
technique used for speech recognition, which involves augmenting the
spectrogram itself, instead of the waveform data. SpecAugment
modifies the spectrogram by warping it along the time axis, masking
blocks of frequency channels, and masking blocks of time steps.
This makes the model more robust to partial loss of time and
frequency information in the input.
      </p>
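      <p>The masking part of SpecAugment can be sketched with torchaudio's built-in transforms as below; the mask sizes are assumptions, and time warping is omitted for brevity.</p>
      <preformat>import torchaudio.transforms as T

# Mask a block of mel bands and a block of time frames, as in SpecAugment [19].
freq_mask = T.FrequencyMasking(freq_mask_param=12)
time_mask = T.TimeMasking(time_mask_param=80)

def spec_augment(mel):            # mel: (batch, n_mels, time)
    return time_mask(freq_mask(mel))</preformat>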
      <p>3.3.3 Other Augmentations. Other transformation techniques,
such as random cropping and random scaling, were used to further
augment the given data.</p>
    </sec>
    <sec id="sec-7">
      <title>TRAINING DETAILS</title>
      <p>This section describes the data pre-processing, model
architecture, and other training details.</p>
    </sec>
    <sec id="sec-8">
      <title>Data Preparation</title>
      <p>We use the mel-spectrograms provided in the MTG-Jamendo dataset
for training. Random cropping and scaling are used
to augment and transform the data into a tensor of length 4096
(approximately 87.4 seconds). Additionally, SpecAugment is used
to augment the dataset.</p>
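      <p>The cropping step can be sketched as follows; random_crop is a hypothetical helper, and the handling of tracks shorter than 4096 frames is an assumption, since the text does not specify it.</p>
      <preformat>import torch

TARGET_LEN = 4096   # approximately 87.4 seconds

def random_crop(mel):
    """Randomly crop a (96, time) mel-spectrogram to TARGET_LEN frames."""
    n_frames = mel.size(1)
    if n_frames >= TARGET_LEN:
        start = torch.randint(0, n_frames - TARGET_LEN + 1, (1,)).item()
        return mel[:, start:start + TARGET_LEN]
    # Shorter tracks are tiled up to the target length (an assumption).
    reps = TARGET_LEN // n_frames + 1
    return mel.repeat(1, reps)[:, :TARGET_LEN]</preformat>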
    </sec>
    <sec id="sec-9">
      <title>Architecture and Control Flow</title>
      <list list-type="bullet">
        <list-item>
          <p>The input tensor of shape (1, 96, 4096) is divided length-wise into 16 segments, each of length 256.</p>
        </list-item>
        <list-item>
          <p>Each segment is then processed through two convolutional layers in order to obtain a representation with 3 channels.</p>
        </list-item>
        <list-item>
          <p>The obtained representation is then passed into the ResNet34 feature extractor, followed by a convolutional layer, to obtain an intermediate representation.</p>
        </list-item>
        <list-item>
          <p>The feature maps are then passed through the self-attention module, followed by a series of linear layers, to obtain the final class scores. Dropout is used to regularize the training process.</p>
        </list-item>
        <list-item>
          <p>The model returns the outputs of the self-attention module and the feature maps (after passing them through the linear layers). Both outputs are used to compute the loss and perform backpropagation, but only the outputs of the self-attention module are used to make predictions.</p>
        </list-item>
      </list>
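      <p>Putting these steps together, the control flow can be sketched as below; the extractor and attention modules correspond to the earlier sketches, and the pooling operations, dropout rate, and linear-layer sizes are assumptions.</p>
      <preformat>import torch
import torch.nn as nn

class MoodThemeModel(nn.Module):
    """Sketch of the overall control flow described above."""
    def __init__(self, extractor, attention, embed_dim=512, num_tags=56):
        super().__init__()
        self.extractor = extractor   # assumed to map a (batch, 1, 96, 256) segment to (batch, embed_dim)
        self.attention = attention   # multi-head self-attention over the 16 segment embeddings
        self.head_attn = nn.Sequential(nn.Dropout(0.3), nn.Linear(embed_dim, num_tags))
        self.head_feat = nn.Sequential(nn.Dropout(0.3), nn.Linear(embed_dim, num_tags))

    def forward(self, x):            # x: (batch, 1, 96, 4096)
        segments = x.chunk(16, dim=-1)                              # 16 segments of length 256
        feats = torch.stack([self.extractor(s) for s in segments])  # (16, batch, embed_dim)
        attended = self.attention(feats).mean(0)                    # self-attention path
        pooled = feats.mean(0)                                      # plain feature-map path
        # Both outputs enter the loss; only the attention output is used for prediction.
        return self.head_attn(attended), self.head_feat(pooled)</preformat>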
    </sec>
    <sec id="sec-10">
      <title>Hyperparameters and Other Details</title>
      <p>
        The model was trained with the Adam [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] optimizer, at a learning
rate of 1e-4, for 35 epochs. The values of β<sub>1</sub> and β<sub>2</sub> were set to 0.9
and 0.999, respectively. Binary cross entropy was used as the
loss function.
      </p>
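      <p>The corresponding training setup is sketched below; how the two model outputs are combined in the loss (summed here) is an assumption.</p>
      <preformat>import torch

def train(model, train_loader, epochs=35):
    """Adam (lr 1e-4, betas 0.9/0.999) with binary cross entropy, as described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    criterion = torch.nn.BCEWithLogitsLoss()    # binary cross entropy over the 56 tags
    for _ in range(epochs):
        for mel, tags in train_loader:
            logits_attn, logits_feat = model(mel)
            loss = criterion(logits_attn, tags) + criterion(logits_feat, tags)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()</preformat>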
    </sec>
    <sec id="sec-11">
      <title>RESULTS</title>
      <p>The proposed model produces results that improve on those of the
given VGG-ish and popularity baselines. We obtain an
ROC-AUC-macro of 0.7360 and a PR-AUC-macro of 0.1275. For
comparison, the baseline VGG-ish model achieves an ROC-AUC-macro
of 0.7258 and a PR-AUC-macro of 0.1077. Detailed results
can be found in Table 1.</p>
    </sec>
    <sec id="sec-12">
      <title>FUTURE WORK</title>
      <p>In this section, we discuss other approaches that we considered
for this task. These may serve as pointers for
future work on tasks involving this dataset.</p>
      <p>Our approach can be broken down into two parts: first, the
extraction of features from the audio data, and second, processing
the extracted features to predict the moods/themes. Both these
parts could be potentially improved upon, and we mention a few
ways to do so below.</p>
      <p>With respect to feature extraction:</p>
      <list list-type="bullet">
        <list-item>
          <p>Using a wider range of features to aid the classification task instead of using mel-spectrograms. For example, the LEAF frontend proposed by [<xref ref-type="bibr" rid="ref1">1</xref>] could be used for this approach.</p>
        </list-item>
        <list-item>
          <p>Using a self-supervised approach to extract features, such as wav2vec 2.0 [<xref ref-type="bibr" rid="ref2">2</xref>]. This would also reduce reliance on labelled data.</p>
        </list-item>
        <list-item>
          <p>Using temporal convolutional networks [<xref ref-type="bibr" rid="ref15">15</xref>] to extract features directly from audio instead of using mel-spectrograms.</p>
        </list-item>
      </list>
      <p>With respect to the processing of extracted features:</p>
      <list list-type="bullet">
        <list-item>
          <p>Using dual-path processing inspired by [<xref ref-type="bibr" rid="ref17">17</xref>] in order to capture long-term dependencies while also reducing computational load.</p>
        </list-item>
        <list-item>
          <p>Exploring ways of processing the raw audio data with more powerful models, such as WaveNet [<xref ref-type="bibr" rid="ref26">26</xref>], in order to obtain better insights into the dataset, and into theme recognition in general.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-13">
      <title>ACKNOWLEDGMENTS</title>
      <p>We thank Shell Xu Hu for helpful discussions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Anonymous</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>A Universal Learnable Audio Frontend</article-title>
          . In Submitted to International Conference on Learning Representations. https://openreview.net/forum?id=
          <article-title>jM76BCb6F9m under review</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Alexei</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <surname>Henry Zhou</surname>
            , Abdelrahman Mohamed, and
            <given-names>Michael</given-names>
          </string-name>
          <string-name>
            <surname>Auli</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations</article-title>
          . (
          <year>2020</year>
          ).
          <article-title>arXiv:cs</article-title>
          .CL/
          <year>2006</year>
          .11477
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Alastair Porter,
          <string-name>
            <given-names>Philip</given-names>
            <surname>Tovstogan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Minz</given-names>
            <surname>Won</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Emotion and Theme Recognition in Music Using Jamendo</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2020 Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Minz Won, Philip Tovstogan,
          <string-name>
            <given-names>Alastair</given-names>
            <surname>Porter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The MTG-Jamendo Dataset for Automatic Music Tagging</article-title>
          .
          <source>In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML</source>
          <year>2019</year>
          ). Long Beach, CA, United States. http://hdl.handle.net/10230/42015
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jingjing</given-names>
            <surname>Chen</surname>
          </string-name>
          , Qirong Mao, and Dong Liu.
          <year>2020</year>
          .
          <article-title>Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation</article-title>
          . arXiv preprint arXiv:
          <year>2007</year>
          .
          <volume>13975</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Keunwoo</given-names>
            <surname>Choi</surname>
          </string-name>
          , George Fazekas, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sandler</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Automatic tagging using deep convolutional neural networks</article-title>
          . (
          <year>2016</year>
          ).
          <source>arXiv:cs.SD/1606.00298</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Hauke</given-names>
            <surname>Egermann</surname>
          </string-name>
          , Nathalie Fernando, Lorraine Chuen, and
          <string-name>
            <surname>Stephen McAdams</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Music induces universal emotion-related psychophysiological responses: comparing Canadian listeners to Congolese Pygmies</article-title>
          .
          <source>Frontiers in psychology 5</source>
          (
          <year>2015</year>
          ),
          <fpage>1341</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Fritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Jentschke</surname>
          </string-name>
          , Nathalie Gosselin, Daniela Sammler, Isabelle Peretz, Robert Turner,
          <string-name>
            <surname>Angela D Friederici</surname>
            , and
            <given-names>Stefan</given-names>
          </string-name>
          <string-name>
            <surname>Koelsch</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Universal recognition of three basic emotions in music</article-title>
          .
          <source>Current biology 19</source>
          ,
          <issue>7</issue>
          (
          <year>2009</year>
          ),
          <fpage>573</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Anmol</given-names>
            <surname>Gulati</surname>
          </string-name>
          , James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han,
          <string-name>
            <surname>Shibo</surname>
            <given-names>Wang</given-names>
          </string-name>
          , Zhengdong Zhang, Yonghui Wu, and
          <string-name>
            <given-names>Ruoming</given-names>
            <surname>Pang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Conformer: Convolution-augmented Transformer for Speech Recognition</article-title>
          . (
          <year>2020</year>
          ).
          <article-title>arXiv:eess</article-title>
          .AS/
          <year>2005</year>
          .08100
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Kaiming</surname>
            <given-names>He</given-names>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <article-title>(</article-title>
          <year>2015</year>
          ).
          <source>arXiv:cs.CV/1512.03385</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Ioffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</article-title>
          . (
          <year>2015</year>
          ).
          <source>arXiv:cs.LG/1502.03167</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Stéphanie</surname>
            <given-names>Khalfa</given-names>
          </string-name>
          , Mathieu Roy, Pierre Rainville, Simone Dalla Bella, and
          <string-name>
            <given-names>Isabelle</given-names>
            <surname>Peretz</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Role of tempo entrainment in psychophysiological differentiation of happy and sad music</article-title>
          ?
          <source>International Journal of Psychophysiology 68</source>
          ,
          <issue>1</issue>
          (
          <year>2008</year>
          ),
          <fpage>17</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Diederik</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Kingma</surname>
            and
            <given-names>Jimmy</given-names>
          </string-name>
          <string-name>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          . (
          <year>2017</year>
          ).
          <source>arXiv:cs.LG/1412.6980</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Alex</surname>
            <given-names>Krizhevsky</given-names>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>Commun. ACM 60</source>
          ,
          <issue>6</issue>
          (
          <year>2017</year>
          ),
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Colin</surname>
            <given-names>Lea</given-names>
          </string-name>
          , Rene Vidal, Austin Reiter, and
          <string-name>
            <given-names>Gregory D</given-names>
            <surname>Hager</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Temporal convolutional networks: A unified approach to action segmentation</article-title>
          .
          <source>In European Conference on Computer Vision</source>
          . Springer,
          <fpage>47</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Xin</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Qingcai Chen, Xiangping Wu, Yan Liu, and Yang Liu.
          <year>2017</year>
          .
          <article-title>CNN based music emotion classification</article-title>
          . (
          <year>2017</year>
          ).
          <source>arXiv:cs.MM/1704.05665</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Yi</surname>
            <given-names>Luo</given-names>
          </string-name>
          , Zhuo Chen, and
          <string-name>
            <given-names>Takuya</given-names>
            <surname>Yoshioka</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation</article-title>
          . (
          <year>2020</year>
          ).
          <article-title>arXiv:eess</article-title>
          .AS/
          <year>1910</year>
          .06379
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Arsha</surname>
            <given-names>Nagrani</given-names>
          </string-name>
          , Joon Son Chung, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>VoxCeleb: A Large-Scale Speaker Identification Dataset</article-title>
          .
          <source>Interspeech 2017 (Aug</source>
          <year>2017</year>
          ). https://doi.org/10.21437/interspeech.2017-
          <fpage>950</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Daniel</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>William</given-names>
          </string-name>
          <string-name>
            <surname>Chan</surname>
            , Yu Zhang, Chung-Cheng Chiu, Barret Zoph,
            <given-names>Ekin D.</given-names>
          </string-name>
          <string-name>
            <surname>Cubuk</surname>
          </string-name>
          , and
          <string-name>
            <surname>Quoc</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition</article-title>
          .
          <source>Interspeech 2019 (Sep</source>
          <year>2019</year>
          ). https://doi.org/10.21437/interspeech.2019-
          <fpage>2680</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Reybrouck</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tuomas</given-names>
            <surname>Eerola</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Music and its inductive power: a psychobiological and evolutionary approach to musical emotions</article-title>
          .
          <source>Frontiers in Psychology</source>
          <volume>8</volume>
          (
          <year>2017</year>
          ),
          <fpage>494</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Thomas</surname>
            <given-names>Schäfer</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Sedlmeier</surname>
          </string-name>
          , Christine Städtler, and
          <string-name>
            <given-names>David</given-names>
            <surname>Huron</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>The psychological functions of music listening</article-title>
          .
          <source>Frontiers in psychology 4</source>
          (
          <year>2013</year>
          ),
          <fpage>511</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Roni</surname>
            <given-names>Shifriss</given-names>
          </string-name>
          , Ehud Bodner, and
          <string-name>
            <given-names>Yuval</given-names>
            <surname>Palgi</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>When you're down and troubled: Views on the regulatory power of music</article-title>
          .
          <source>Psychology of Music 43</source>
          ,
          <issue>6</issue>
          (
          <year>2015</year>
          ),
          <fpage>793</fpage>
          -
          <lpage>807</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>John</surname>
            <given-names>A</given-names>
          </string-name>
          <string-name>
            <surname>Sloboda and Patrik N Juslin</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Psychological perspectives on music and emotion</article-title>
          .
          <source>Music and emotion: Theory and research</source>
          (
          <year>2001</year>
          ),
          <fpage>71</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Nitish</surname>
            <given-names>Srivastava</given-names>
          </string-name>
          , Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Dropout: A Simple Way to Prevent Neural Networks from Overfitting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          ,
          <issue>56</issue>
          (
          <year>2014</year>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          . http://jmlr.org/papers/v15/srivastava14a.html
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Manoj</given-names>
            <surname>Sukhavasi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sainath</given-names>
            <surname>Adapa</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Music theme recognition using CNN and self-attention</article-title>
          . (
          <year>2019</year>
          ).
          <article-title>arXiv:cs</article-title>
          .SD/
          <year>1911</year>
          .07041
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Aaron</surname>
            <given-names>van den Oord</given-names>
          </string-name>
          , Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Senior</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Koray</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>WaveNet: A Generative Model for Raw Audio</article-title>
          . (
          <year>2016</year>
          ).
          <source>arXiv:cs.SD/1609.03499</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Västfjäll</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Emotion induction through music: A review of the musical mood induction procedure</article-title>
          .
          <source>Musicae Scientiae</source>
          <volume>5</volume>
          , 1_
          <issue>suppl</issue>
          (
          <year>2001</year>
          ),
          <fpage>173</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          . Attention Is All You Need. (
          <year>2017</year>
          ).
          <source>arXiv:cs.CL/1706.03762</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Minz</surname>
            <given-names>Won</given-names>
          </string-name>
          , Sanghyuk Chun, and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Toward interpretable music tagging with self-attention</article-title>
          . arXiv preprint arXiv:
          <year>1906</year>
          .
          <volume>04972</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Jeroen</given-names>
            <surname>Zegers</surname>
          </string-name>
          and Hugo Van hamme.
          <year>2019</year>
          .
          <article-title>CNN-LSTM models for Multi-Speaker Source Separation using Bayesian Hyper Parameter Optimization</article-title>
          . (
          <year>2019</year>
          ).
          <article-title>arXiv:cs</article-title>
          .LG/
          <year>1912</year>
          .09254
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Hongyi</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Moustapha Cisse,
          <string-name>
            <given-names>Yann N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          , and David Lopez-Paz.
          <year>2018</year>
          .
          <article-title>mixup: Beyond Empirical Risk Minimization</article-title>
          . (
          <year>2018</year>
          ).
          <source>arXiv:cs.LG/1710.09412</source>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Yu</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , James Qin, Daniel S. Park, Wei Han,
          <string-name>
            <surname>Chung-Cheng</surname>
            <given-names>Chiu</given-names>
          </string-name>
          , Ruoming Pang,
          <string-name>
            <surname>Quoc</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
            , and
            <given-names>Yonghui</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition</article-title>
          . (
          <year>2020</year>
          ).
          <article-title>arXiv:eess</article-title>
          .AS/
          <year>2010</year>
          .10504
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>