<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khaled Koutini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shreyan Chowdhury</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Verena Haunschmid</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hamid Eghbal-zadeh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerhard Widmer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler University Linz</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>We present the CP-JKU submission to MediaEval 2019: a Receptive Field (RF)-regularized and Frequency-Aware CNN approach for tagging music with emotion/mood labels. We investigate the impact of the RF of the CNNs on their performance on this dataset. We observe that ResNets with smaller receptive fields, originally adapted for acoustic scene classification, also perform well in the emotion tagging task. We improve the performance of such architectures using techniques such as Frequency Awareness and Shake-Shake regularization, which were used in previous work on general acoustic recognition tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        Content-based emotion recognition in music is a challenging task
in part because of noisy datasets and unavailability of royalty-free
audio of consistent quality. The recently released MTG-Jamendo
dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is aimed at addressing these issues.
      </p>
      <p>
        The Emotion and Theme Recognition Task of MediaEval 2019
uses a subset of this dataset containing relevant emotion tags, and
the task objective is to predict scores and decisions for these tags
from audio (or spectrograms). The details of this specific data subset,
task description, data splits, and evaluation strategy can be found
in the overview paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Convolutional Neural Networks (CNNs) achieve state-of-the-art
results in many tasks such as image classification [
        <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
        ], acoustic
scene classification [
        <xref ref-type="bibr" rid="ref16 ref4">4, 16</xref>
        ] and audio tagging [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These models can
learn their own features and classifiers in an end-to-end fashion,
which as a result reduces the need for task-specific feature
engineering. Although CNNs are capable of learning high-level concepts
given very simple and low-level information, the careful design of
the network architectures in CNNs is a crucial step in achieving
good results.
      </p>
      <p>
        In a recent study [
        <xref ref-type="bibr" rid="ref14 ref16">14, 16</xref>
        ], Koutini et al. showed that the receptive
field (RF) of CNN architectures is a very important factor when it
comes to processing audio signals. Based on these findings, a
regularization technique was proposed that can significantly boost the
performance of CNNs when used with spectrogram features.
Further, in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] a drawback of CNNs in the audio domain is highlighted,
which is caused by the lack of spatial ordering in convolutional
layers. As a solution, Frequency-Aware (FA) Convolutional Layers
were introduced, to be used in CNNs with the commonly-used
spectrogram input.
      </p>
      <p>
        The proposed RF-regularization and FA-CNNs have shown great
promise in several tasks in the field of Computational Auditory
Scene Analysis (CASA), and achieved top ranks in international
challenges [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In this report, we extend the previous work to Music
Information Retrieval (MIR) and demonstrate that these models can
be used to recognize emotion in music, and achieve new
state-of-the-art results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 SETUP</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Data Preparation</title>
      <p>
        We use a sampling rate of 44.1 kHz to extract the input features.
We apply a Short Time Fourier Transform (STFT). The window size
for the STFT is 2048 samples and the overlap between windows
is 75% for submissions 1, 2 and 3, and 25% for submissions 4 and
5. We use perceptually-weighted Mel-scaled spectrograms similar
to [
        <xref ref-type="bibr" rid="ref14 ref16 ref4">4, 14, 16</xref>
        ], which results in an input having 256 Mel bins in the
frequency dimension.
      </p>
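      <p>The framing parameters above can be illustrated with a small sketch that derives the STFT hop size and the resulting number of frames. The helper names (hop_length, num_frames) and the 30-second excerpt length are illustrative assumptions, and the no-padding framing convention is only one of several used by audio libraries; this is not the feature-extraction code used for the submissions.</p>
      <preformat><![CDATA[
```python
SAMPLE_RATE = 44100   # Hz, as stated above
WINDOW_SIZE = 2048    # STFT window length in samples
N_MELS = 256          # Mel bins in the frequency dimension


def hop_length(window_size, overlap):
    """Hop size in samples for a given fractional overlap between windows."""
    return int(round(window_size * (1.0 - overlap)))


def num_frames(num_samples, window_size, hop):
    """Number of full windows that fit without padding (one common convention)."""
    if num_samples < window_size:
        return 0
    return 1 + (num_samples - window_size) // hop


hop_75 = hop_length(WINDOW_SIZE, 0.75)  # submissions 1, 2 and 3
hop_25 = hop_length(WINDOW_SIZE, 0.25)  # submissions 4 and 5

clip_samples = 30 * SAMPLE_RATE  # hypothetical 30-second excerpt
shape_75 = (N_MELS, num_frames(clip_samples, WINDOW_SIZE, hop_75))
shape_25 = (N_MELS, num_frames(clip_samples, WINDOW_SIZE, hop_25))
```
]]></preformat>
      <p>With 75% overlap the hop is 512 samples, and with 25% overlap it is 1536 samples, so the second setting yields an input that is three times shorter along the time axis for the same audio.</p>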
    </sec>
    <sec id="sec-4">
      <title>2.2 Optimization</title>
      <p>
        In a setup similar to [
        <xref ref-type="bibr" rid="ref14 ref16 ref17">14, 16, 17</xref>
        ], we use Adam [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for 200 epochs.
We start with a 10-epoch learning rate warm-up, then train with a
constant learning rate of 1 × 10<sup>−4</sup> for 60 epochs. After that, we use a
linear learning rate scheduler for 50 epochs, decreasing the learning
rate to 1 × 10<sup>−6</sup>. We finally train for 80 more epochs using the final
learning rate.
      </p>
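      <p>The four phases of this schedule (10 + 60 + 50 + 80 = 200 epochs) can be written as a single function of the epoch index. The sketch below is an illustration only: the linear shape of the warm-up is an assumption, since the text does not specify it, and the function name learning_rate is hypothetical.</p>
      <preformat><![CDATA[
```python
WARMUP_EPOCHS = 10
CONSTANT_EPOCHS = 60
DECAY_EPOCHS = 50
FINAL_EPOCHS = 80
BASE_LR = 1e-4
FINAL_LR = 1e-6


def learning_rate(epoch):
    """Learning rate for a zero-based epoch index in [0, 200)."""
    if epoch < WARMUP_EPOCHS:
        # Assumption: linear warm-up; the text does not specify its shape.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    epoch -= WARMUP_EPOCHS
    if epoch < CONSTANT_EPOCHS:
        return BASE_LR
    epoch -= CONSTANT_EPOCHS
    if epoch < DECAY_EPOCHS:
        # Linear interpolation from BASE_LR down to FINAL_LR.
        fraction = (epoch + 1) / DECAY_EPOCHS
        return BASE_LR + fraction * (FINAL_LR - BASE_LR)
    return FINAL_LR


total_epochs = WARMUP_EPOCHS + CONSTANT_EPOCHS + DECAY_EPOCHS + FINAL_EPOCHS
schedule = [learning_rate(e) for e in range(total_epochs)]
```
]]></preformat>
      <p>In a training loop, the value of schedule at the current epoch would be written into the optimizer before each epoch.</p>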
    </sec>
    <sec id="sec-5">
      <title>2.3 Data Augmentation</title>
      <p>
        Mix-up [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] has proven essential in our experiments to boost the
performance and the generalization of our models. These results are
consistent with experience from our previous work [
        <xref ref-type="bibr" rid="ref14 ref16 ref17">14, 16, 17</xref>
        ].
      </p>
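      <p>For readers unfamiliar with Mix-up, the sketch below shows the core operation on a single pair of examples: both the inputs and the multi-hot tag vectors are blended with the same coefficient drawn from a Beta distribution. The function name mixup_pair, the toy feature vectors, and the value alpha = 0.3 are illustrative assumptions; the text does not report the value of alpha used.</p>
      <preformat><![CDATA[
```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=0.3, rng=random):
    """Mix two examples and their multi-hot label vectors.

    alpha is the Beta-distribution parameter; 0.3 is an illustrative
    assumption, not a value reported in the text.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)  # fixed seed so the toy example is repeatable
features_a, tags_a = [0.0, 1.0, 2.0], [1.0, 0.0]
features_b, tags_b = [2.0, 1.0, 0.0], [0.0, 1.0]
mixed_x, mixed_y, lam = mixup_pair(features_a, tags_a, features_b, tags_b, rng=rng)
```
]]></preformat>
      <p>In practice the same blending is applied to whole mini-batches of spectrograms, typically by mixing a batch with a shuffled copy of itself.</p>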
    </sec>
    <sec id="sec-6">
      <title>3 ADAPTING CNNS</title>
      <p>
        Convolutional Neural Networks (CNNs) have shown great
success in many acoustic tasks [
        <xref ref-type="bibr" rid="ref11 ref14 ref15 ref16 ref17 ref18 ref19 ref20 ref4 ref5 ref6 ref9">4–6, 9, 11, 14–20</xref>
        ]. In our
submissions, we build on this success and investigate their performance
on tasks more specific to music. We mainly use adapted versions
of ResNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We adapt the architectures to the task using the
guidelines proposed in Koutini et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]<sup>1</sup>. We use the CNN variants
introduced in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.1 Receptive Field Regularization</title>
      <p>
        Limiting the receptive field (RF) has been shown to have a great
impact on the performance of a CNN in a number of acoustic
recognition and detection tasks [
        <xref ref-type="bibr" rid="ref14 ref16">14, 16</xref>
        ]. We investigated the influence
of the receptive field in this task in a setup similar to [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Figure 1 shows the PR-AUC on both the validation (val) and
testing (test) sets, for ResNet models with different receptive fields
and their SWA (see Section 3.4 below) variants. The results show that a
larger receptive field causes performance drops, in accordance with the
findings of [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Moreover, further experiments showed that the size of
the receptive field over the time dimension has less influence
on performance.
        <fn id="fn1"><label>1</label><p>The source code is published at https://github.com/kkoutini/cpjku_dcase19</p></fn>
      </p>
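      <p>To make the notion of receptive field size concrete, the sketch below computes the RF along one axis for a stack of convolution or pooling layers using the standard recurrence (each layer adds its kernel extent scaled by the product of the preceding strides). The two layer stacks are hypothetical and are not the configurations evaluated here; replacing 3x3 filters with 1x1 filters is shown only as one possible way to shrink the RF.</p>
      <preformat><![CDATA[
```python
def receptive_field(layers):
    """Receptive field along one axis for a stack of conv/pool layers.

    layers: sequence of (kernel_size, stride) pairs, input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks, not the configurations evaluated here.
wide = [(3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]
narrow = [(3, 1), (3, 2), (1, 1), (1, 2), (1, 1)]  # later 3x3 filters swapped for 1x1

rf_wide = receptive_field(wide)
rf_narrow = receptive_field(narrow)
```
]]></preformat>
      <p>Because the time and frequency axes can use different kernel sizes and strides, the same function can be applied to each axis separately to obtain the RF over frequency and over time.</p>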
    </sec>
    <sec id="sec-8">
      <title>3.2 Frequency-Awareness and FA-ResNet</title>
      <p>
        Figure 1 shows that smaller-RF ResNets perform better. As shown
in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], Frequency-Awareness can compensate for the lack of frequency
information caused by the smaller RF. We use the Frequency-Aware
ResNet (FA-ResNet) introduced in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
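      <p>One simple way to give a convolutional layer explicit access to frequency position is to concatenate an extra input channel that encodes where each bin lies on the frequency axis. The sketch below assumes this coordinate-channel formulation, with values running linearly from -1 to 1; the function name add_frequency_channel and the toy input are hypothetical, and [<xref ref-type="bibr" rid="ref17">17</xref>] should be consulted for the exact definition of the Frequency-Aware layers.</p>
      <preformat><![CDATA[
```python
def add_frequency_channel(feature_maps):
    """Append one channel encoding each bin's position on the frequency axis.

    feature_maps: nested lists shaped [channels][freq_bins][time_frames].
    The extra channel runs linearly from -1 (lowest bin) to 1 (highest bin)
    and is constant over time.
    """
    n_freq = len(feature_maps[0])
    n_time = len(feature_maps[0][0])
    if n_freq == 1:
        positions = [0.0]
    else:
        positions = [-1.0 + 2.0 * f / (n_freq - 1) for f in range(n_freq)]
    freq_channel = [[pos] * n_time for pos in positions]
    # Build a new list so the caller's feature maps are left unchanged.
    return feature_maps + [freq_channel]


toy_input = [[[0.5] * 4 for _ in range(5)]]  # 1 channel, 5 bins, 4 frames
augmented = add_frequency_channel(toy_input)
```
]]></preformat>
      <p>A convolution applied to the augmented input can then condition its response on absolute frequency, even when its receptive field covers only a few neighbouring bins.</p>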
    </sec>
    <sec id="sec-9">
      <title>3.3 Shake-Shake Regularization</title>
      <p>
        The Shake-Shake regularization [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] was proposed for improved
stability and robustness. As shown in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
Shake-Shake ResNets do not perform well on the original acoustic scene
classification problem; nevertheless, they performed well in this task.
      </p>
    </sec>
    <sec id="sec-10">
      <title>3.4 Model Averaging</title>
      <p>
        Stochastic Weight Averaging: Similar to [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ], we use
Stochastic Weight Averaging (SWA) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We add the network weights to
the average every 3 epochs. The averaged networks turned out to
outperform each of the single networks.
      </p>
      <p>Snapshot Averaging: When computing the final prediction, we
also average the predictions of 5 snapshots of the networks during
training. Specifically, we average the predictions of the model with
the highest PR-AUC on the validation set with those of the last 4 SWA
models during training.</p>
      <p>Multi-model Averaging: We average different models that have
different architectures, initializations and/or receptive fields over
time.</p>
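      <p>The two kinds of averaging used above operate on different quantities: SWA averages network weights, while snapshot and multi-model averaging average predictions. The sketch below illustrates both with toy numbers. The helper names, the stand-in weights, and the choice to start averaging at the first snapshot are assumptions made for illustration; only the 3-epoch interval is taken from the text.</p>
      <preformat><![CDATA[
```python
def update_average(avg_weights, new_weights, n_averaged):
    """Fold one more weight snapshot into a running mean."""
    return [
        (a * n_averaged + w) / (n_averaged + 1)
        for a, w in zip(avg_weights, new_weights)
    ]


def average_predictions(predictions):
    """Element-wise mean over per-model score vectors."""
    n_models = len(predictions)
    return [sum(scores) / n_models for scores in zip(*predictions)]


SWA_EVERY = 3  # epochs between snapshots, as stated above

swa_weights, n_averaged = None, 0
for epoch in range(1, 10):
    weights = [float(epoch), 2.0 * epoch]  # stand-in for trained weights
    if epoch % SWA_EVERY == 0:
        if swa_weights is None:
            swa_weights = list(weights)
        else:
            swa_weights = update_average(swa_weights, weights, n_averaged)
        n_averaged += 1

# Prediction averaging over three hypothetical models, two tags each.
ensemble_scores = average_predictions([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
```
]]></preformat>
      <p>Averaging weights yields a single network that costs no more at inference time, whereas averaging predictions requires a forward pass through every model in the ensemble.</p>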
    </sec>
    <sec id="sec-11">
      <title>4 SUBMISSIONS AND RESULTS</title>
    </sec>
    <sec id="sec-12">
      <title>4.1 Submitted Models</title>
      <p>Overall, we submitted five models to the challenge: the first three
are variations of the approach described above; the other two were
models tested during our experiments, and were submitted as
additional baselines against which to compare our modified CNNs.</p>
      <p>ShakeFAResNet: We average the predictions of 5 Shake-Shake
regularized FA-ResNets with different initializations. Their frequency
RF is regularized as explained in Section 3.1. However, they have
different RFs over the time dimension.</p>
      <p>FAResNet: Similar to ShakeFAResNet, but without Shake-Shake
regularization.</p>
      <p>
        Avg_ensemble: We average the predictions of all the models
included in both ShakeFAResNet and FAResNet. In addition, we add an
RF-regularized ResNet and DenseNet as introduced in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        ResNet34: In our preliminary experiments, vanilla ResNet-34
outperformed ResNet-18 and ResNet-50 on the validation set, so we
picked this architecture as an additional baseline.
      </p>
      <p>
        CRNN: The CRNN network was motivated by the notion that the global
structure of musical features could affect the perception of certain
aspects of music (like mood), as mentioned by Choi et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We
use an architecture similar to the one used by Choi et al., where the
CNN part acts as the feature extractor and the RNN part acts as
a temporal aggregator. This approach improved performance over
the baseline CNN and ResNet-34.
      </p>
      <p>CP_ResNet (not submitted to the challenge): We also show the
results of a single-model RF-regularized ResNet.</p>
    </sec>
    <sec id="sec-13">
      <title>4.2 Results</title>
    </sec>
    <sec id="sec-14">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work has been supported by the LCM – K2 Center within the
framework of the Austrian COMET-K2 program, and the European
Research Council (ERC) under the EU’s Horizon 2020 research and
innovation programme, under grant agreement No 670035 (project
“Con Espressione”).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Alastair Porter,
          <string-name>
            <given-names>Philip</given-names>
            <surname>Tovstogan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Minz</given-names>
            <surname>Won</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo</article-title>
          . In MediaEval Benchmark Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Minz Won, Philip Tovstogan,
          <string-name>
            <given-names>Alastair</given-names>
            <surname>Porter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The MTG-Jamendo dataset for automatic music tagging</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Keunwoo</given-names>
            <surname>Choi</surname>
          </string-name>
          , György Fazekas,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sandler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Convolutional recurrent neural networks for music classification</article-title>
          .
          <source>In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . IEEE,
          <fpage>2392</fpage>
          -
          <lpage>2396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Dorfer</surname>
          </string-name>
          , Bernhard Lehner, Hamid Eghbal-zadeh, Christoph Heindl, Fabian Paischer, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Acoustic scene classification with fully convolutional neural networks and I-vectors</article-title>
          .
          <source>In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Challenge (DCASE2018)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Dorfer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Training general-purpose audio tagging networks with noisy labels and iterative self-verification</article-title>
          .
          <source>In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)</source>
          .
          <fpage>178</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Hamid</given-names>
            <surname>Eghbal-Zadeh</surname>
          </string-name>
          , Bernhard Lehner, Matthias Dorfer, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>CP-JKU Submissions for DCASE-2016: A Hybrid Approach Using Binaural i-Vectors and Deep Convolutional Neural Networks</article-title>
          .
          <source>In DCASE 2016-challenge on Detection and Classification of Acoustic Scenes and Events. DCASE2016 Challenge.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Gastaldi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Shake-shake regularization</article-title>
          .
          <source>arXiv preprint arXiv:1705.07485</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P. W.</given-names>
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Gemmeke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plakal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Platt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Saurous</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Seybold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Slaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Weiss</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilson</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>CNN Architectures for Large-Scale Audio Classification</article-title>
          .
          <source>In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          (2017-03).
          <fpage>131</fpage>
          -
          <lpage>135</lpage>
          . https://doi.org/10.1109/ICASSP.2017.7952132
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Gao</given-names>
            <surname>Huang</surname>
          </string-name>
          , Zhuang Liu,
          <string-name>
            <given-names>Laurens</given-names>
            <surname>Van Der Maaten</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kilian Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Densely Connected Convolutional Networks</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>4700</fpage>
          -
          <lpage>4708</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Turab</given-names>
            <surname>Iqbal</surname>
          </string-name>
          , Qiuqiang Kong,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Plumbley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Wenwu</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Stacked convolutional neural networks for general-purpose audio tagging</article-title>
          .
          <source>DCASE2018 Challenge.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Izmailov</surname>
          </string-name>
          , Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson.
          <year>2018</year>
          .
          <article-title>Averaging Weights Leads to Wider Optima and Better Generalization</article-title>
          .
          <source>arXiv preprint arXiv:1803.05407</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Koutini</surname>
          </string-name>
          , Hamid Eghbal-zadeh, Matthias Dorfer, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The Receptive Field as a Regularizer in Deep Convolutional Neural Networks for Acoustic Scene Classification</article-title>
          .
          <source>In Proceedings of the European Signal Processing Conference (EUSIPCO), A Coruña, Spain</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Koutini</surname>
          </string-name>
          , Hamid Eghbal-zadeh, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Iterative Knowledge Distillation in R-CNNs for Weakly-Labeled Semi-Supervised Sound Event Detection</article-title>
          .
          <source>In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)</source>
          (2018-11).
          <fpage>173</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Koutini</surname>
          </string-name>
          , Hamid Eghbal-zadeh, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>CP-JKU submissions to DCASE'19: Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs</article-title>
          .
          <source>Technical Report. DCASE2019 Challenge.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Koutini</surname>
          </string-name>
          , Hamid Eghbal-zadeh, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Receptive-field-regularized CNN variants for acoustic scene classification</article-title>
          .
          <source>In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019).</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Donmoon</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Subin</given-names>
            <surname>Lee</surname>
          </string-name>
          , Yoonchang Han, and
          <string-name>
            <given-names>Kyogu</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Ensemble of Convolutional Neural Networks for Weakly-Supervised Sound Event Detection Using Multiple Scale Input</article-title>
          . DCASE2017 Challenge.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Lehner</surname>
          </string-name>
          , Hamid Eghbal-Zadeh, Matthias Dorfer, Filip Korzeniowski, Khaled Koutini, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Classifying Short Acoustic Scenes with I-Vectors and CNNs: Challenges and Optimisations for the 2017 DCASE ASC Task</article-title>
          .
          <source>In DCASE 2017-challenge on Detection and Classification of Acoustic Scenes and Events . DCASE2017 Challenge.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Yuma</given-names>
            <surname>Sakashita</surname>
          </string-name>
          and
          <string-name>
            <given-names>Masaki</given-names>
            <surname>Aono</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Acoustic scene classification by ensemble of spectrograms based on adaptive temporal divisions</article-title>
          .
          <source>DCASE2018 Challenge.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Hongyi</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Moustapha Cissé,
          <string-name>
            <given-names>Yann N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          , and David Lopez-Paz.
          <year>2018</year>
          .
          <article-title>mixup: Beyond Empirical Risk Minimization</article-title>
          .
          <source>In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>