<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Features with SpecAugment and EfficientNet</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tri-Nhan Do</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Tri Nguyen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai-Dang Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>John von Neumann Institute</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh city</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>MediaEval 2020 provided a subset of the MTG-Jamendo dataset, aimed at recognizing mood and theme in music. Team HCMUS proposes several solutions for building efficient classifiers to solve this problem. In addition to the mel-spectrogram features, new features extracted from a WaveNet model are utilized to train the EfficientNet model. As evaluated by the jury, our best result achieved 0.142 in PR-AUC and 0.76 in ROC-AUC. With fast training and lightweight features, our proposed methods have the potential to work well with deeper neural networks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The Emotions and Themes in Music task at MediaEval [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is difficult
and challenging due to the ambiguity of tags in the real world.
Mood depends on human perception, so different people
have different feelings about the same track; moreover, this is a
multi-label classification problem with 56 tags. The dataset is quite
unbalanced in its distribution of mood labels, and each track can
carry multiple labels, so several emotions may coexist in the same song.
      </p>
      <p>
        To solve this task, the authors tried many methods,
varying the models, input features, and loss functions. Our
best result is an ensemble of two different methods: one
using the provided mel-spectrogram features with an EfficientNet model,
and the other using WaveNet features with a MobileNetV2 model
[
        <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Data augmentation is important when training neural network
models. Traditional audio augmentation methods modify the
speed of the waveform or alter the original signal samples with
noise; these methods carry a high computational cost. The
SpecAugment approach [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] instead adjusts the spectrogram by warping it in the
time direction, masking blocks of consecutive frequency channels,
and masking blocks of consecutive time steps. This approach is
simpler and costs less time and resources.
      </p>
      <p>
        The WaveNet model is applicable to many problems in signal
processing, time series forecasting, and music generation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Therefore,
the authors also follow this approach, using a pre-trained
WaveNet model to extract feature vectors from raw audio and then
feeding those features to convolutional neural networks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>We follow several approaches built around two main inputs:
mel-spectrogram features and WaveNet features.</p>
    </sec>
    <sec id="sec-4">
      <title>Data analysis</title>
      <p>As in the figure, the green part shows the tracks with only one
mood/theme label, the yellow part shows tracks with 2 to 3 moods, and
the red part shows tracks with more than 3 moods. The number of
training tracks is 9949, with a total of 17885 mood annotations. On
average, each class has 319 tracks, with a standard deviation
of 202.75. The maximum number of moods on a single track is 8. The
mood/theme that appears most often is happy, with 927 tracks.</p>
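      <p>For concreteness, the following is a minimal sketch of how such statistics can be computed, assuming a hypothetical track_tags mapping from track id to its list of mood/theme tags:</p>
      <preformat>
# Sketch: per-class statistics of a multi-label tag distribution.
# `track_tags` is a hypothetical dict: track id to list of mood/theme tags.
from collections import Counter
import statistics

def tag_statistics(track_tags):
    counts = Counter(tag for tags in track_tags.values() for tag in tags)
    per_class = list(counts.values())  # classes with zero tracks are absent
    return {
        "n_tracks": len(track_tags),                   # e.g. 9949
        "n_annotations": sum(per_class),               # e.g. 17885
        "mean_per_class": statistics.mean(per_class),  # e.g. ~319
        "std_per_class": statistics.stdev(per_class),  # e.g. ~202.75
        "max_tags_per_track": max(len(t) for t in track_tags.values()),
        "most_common": counts.most_common(1)[0],       # e.g. ('happy', 927)
    }
      </preformat>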
      <p>We can see that the data is extremely unbalanced, and some
classes are barely represented at all. Therefore, it is
necessary to find a way to reduce the complexity of the data.</p>
      <p>Data preprocessing. Data balance: To reduce the ambiguity of the
data, the authors reduce each track's annotation from multiple labels
to a single label, keeping the most significant tag of each track while
giving preference to moods with little data, which lowers the standard
deviation of the class distribution. A minimal sketch of this heuristic
follows.</p>
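      <p>The sketch below reflects our reading of the description; the exact selection rule (keeping the rarest tag) is an assumption:</p>
      <preformat>
# Sketch: reduce multi-label annotations to a single label per track,
# preferring the rarest tag (an assumption about the selection rule).
from collections import Counter

def balance_labels(track_tags):
    counts = Counter(tag for tags in track_tags.values() for tag in tags)
    single = {}
    for track, tags in track_tags.items():
        # keep the tag with the fewest tracks overall: rare moods win
        single[track] = min(tags, key=lambda tag: counts[tag])
    return single
      </preformat>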
      <p>
        Features preprocessing. WaveNet feature: Based on the
idea of using WaveNet as a classifier for raw-waveform music audio
[
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ], the authors use a WaveNet-style autoencoder model that
conditions an autoregressive decoder on temporal codes learned
from the raw audio waveform; this model was pretrained on
NSynth, a high-quality dataset of musical notes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
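      <p>A minimal sketch of this feature extraction, assuming Magenta's NSynth fastgen API and an illustrative checkpoint path:</p>
      <preformat>
# Sketch: extract WaveNet (NSynth autoencoder) codes from raw audio.
# Assumes Magenta's NSynth fastgen API and a downloaded wavenet-ckpt;
# the path and sample rate are illustrative.
import librosa
from magenta.models.nsynth.wavenet import fastgen

CHECKPOINT = "wavenet-ckpt/model.ckpt-200000"  # hypothetical local path

def wavenet_features(path, sr=16000, seconds=30):
    audio, _ = librosa.load(path, sr=sr)
    audio = audio[: sr * seconds]
    # encoding has shape (1, time_steps, 16); for 30 s at 16 kHz the hop
    # of 512 samples yields roughly 937 time steps of 16-dim codes
    encoding = fastgen.encode(audio, CHECKPOINT, len(audio))
    return encoding[0].T  # (16, 937) "image" for the CNNs
      </preformat>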
      <p>Based on the dataset's statistics, the minimum audio length
is 30 seconds; because of the limitations of the authors' training
machine, sound samples longer than 400 seconds are trimmed to their
middle 400 seconds. Each sample is then randomly cut to 30 seconds,
and features are extracted from that segment (a sketch is given at the
end of this subsection). This approach is quite arbitrary and discards
input data, so we planned to experiment with re-drawing the random
30-second cut from the 400 seconds of audio after each epoch. The
output for a 30-second clip is 16 channels by 937 time steps.</p>
      <p>[Figure: overview of the two pipelines, mel-spectrograms into EfficientNet-B0 and WaveNet features into MobileNetV2, combined by an ensemble.]</p>
      <p>Mel-spectrogram: each sample has 96 channels, and its
time frames are randomly cropped to 6950 after each epoch.</p>
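      <p>A minimal sketch of the waveform trimming and random 30-second cut described above; parameter names are illustrative:</p>
      <preformat>
# Sketch: trim long tracks to their middle 400 s, then take a random
# 30 s window; re-drawn each epoch to reduce the loss of input data.
# Assumes 16 kHz audio and tracks of at least 30 s (the dataset minimum).
import random

def crop_waveform(audio, sr=16000, max_seconds=400, crop_seconds=30):
    max_len = sr * max_seconds
    if len(audio) > max_len:
        start = (len(audio) - max_len) // 2        # keep the middle part
        audio = audio[start : start + max_len]
    crop_len = sr * crop_seconds
    offset = random.randint(0, len(audio) - crop_len)  # fresh cut per epoch
    return audio[offset : offset + crop_len]
      </preformat>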
    </sec>
    <sec id="sec-5">
      <title>Data augmentation</title>
      <p>SpecAugment: To train models more efficiently, the authors
adopt the SpecAugment method introduced by Google. This method masks
blocks of consecutive time steps and channels in each mel-spectrogram.
The result when using this method improves significantly: PR-AUC-macro
rises from 0.134 to 0.139.</p>
      <p>Each input has a 70% chance of being augmented with
SpecAugment; each mel-spectrogram receives two blocks of time masking
and two blocks of channel masking, as sketched below.</p>
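      <p>A minimal sketch of this masking scheme; the 70% probability and the two masks per axis follow the text, while the maximum mask widths are assumptions:</p>
      <preformat>
# Sketch: SpecAugment-style masking on a (channels, frames) mel-spectrogram.
# The probability and mask counts follow the text; max widths are assumed.
import numpy as np

def spec_augment(mel, p=0.7, n_masks=2, max_channels=8, max_frames=64):
    if np.random.rand() > p:
        return mel
    mel = mel.copy()
    n_ch, n_fr = mel.shape
    for _ in range(n_masks):
        w = np.random.randint(1, max_channels)       # channel mask width
        c0 = np.random.randint(0, n_ch - w)
        mel[c0 : c0 + w, :] = 0.0
        w = np.random.randint(1, max_frames)         # time mask width
        t0 = np.random.randint(0, n_fr - w)
        mel[:, t0 : t0 + w] = 0.0
    return mel
      </preformat>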
    </sec>
    <sec id="sec-6">
      <title>Deep Neural Network model</title>
      <p>Since both mel-spectrogram features and WaveNet features can be
expressed as images, the authors use convolutional models such
as MobileNet and EfficientNet. The mel-spectrogram features are
passed to EfficientNet-B0; the WaveNet features, on the other hand,
are passed to MobileNetV2 and EfficientNet-B7. Because the WaveNet
features are not large enough to fit EfficientNet-B7, the authors
duplicate their channels so that these features can be used, as
sketched below.</p>
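      <p>A minimal sketch of the channel duplication; the exact repeat factor used by the authors is an assumption:</p>
      <preformat>
# Sketch: enlarge the (16, 937) WaveNet feature map by duplicating its
# channels so EfficientNet-B7's input resolution can be met; the repeat
# factor is an assumption, not a value stated by the authors.
import numpy as np

def duplicate_channels(feat, repeat=8):
    # (16, 937) tiled along the channel axis becomes (128, 937)
    return np.tile(feat, (repeat, 1))
      </preformat>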
      <p>
        In addition, we also tested an SVM model, InceptionNet, and ResNet;
to capture long-term temporal characteristics, self-attention
was added as in the AMLAG 2019 method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], but these methods
produced only slight improvements in the results.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Loss function</title>
      <p>
        For the loss function, binary cross-entropy (BCE) loss is applied
to both MobileNetV2 and EfficientNet; a minimal sketch follows. The
authors also tried Focal Loss [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] since the dataset is quite unbalanced; however,
it did not give better results on our dataset after the balancing step.
      </p>
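      <p>A minimal PyTorch sketch of the multi-label BCE setup over the 56 tags; the tensors here are illustrative placeholders for real model outputs and annotations:</p>
      <preformat>
# Sketch: multi-label binary cross-entropy over the 56 tags (PyTorch).
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()   # sigmoid + binary cross-entropy per tag

logits = torch.randn(8, 56, requires_grad=True)  # batch of raw model outputs
targets = torch.zeros(8, 56)                     # multi-hot tag annotations
targets[0, 3] = 1.0                              # e.g. track 0 carries tag 3
loss = criterion(logits, targets)
loss.backward()
      </preformat>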
    </sec>
    <sec id="sec-8">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>Our experiments are run on a server with an Nvidia Quadro
K6000 graphics card. Methods A, B, and D were not submitted to the
challenge. We find that the data balancing method leads to better
results than the original dataset with default labels. Based on
experiments on the validation set, our ensemble predictions are
computed with weights of 0.7 for the mel-spectrogram features and 0.3
for the WaveNet features (sketched below) to obtain the best results.</p>
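      <p>A minimal sketch of this weighted ensemble over the two runs' per-tag probabilities:</p>
      <preformat>
# Sketch: weighted ensemble of the two runs' per-tag probabilities,
# with the 0.7/0.3 weights chosen on the validation set.
import numpy as np

def ensemble(mel_probs, wavenet_probs, w_mel=0.7, w_wav=0.3):
    # both arrays: (n_tracks, 56) sigmoid outputs in [0, 1]
    return w_mel * np.asarray(mel_probs) + w_wav * np.asarray(wavenet_probs)
      </preformat>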
      <sec id="sec-8-1">
        <title>Method A B C (run2)</title>
        <p>D
E (run3)
F (run1)
G (run4)</p>
      </sec>
      <sec id="sec-8-2">
        <title>Features and Model</title>
        <p>Mel-spectrogram EficientNet-B0
Mel-spectrogram EficientNet-B0
with data processing
Mel-spectrogram EficientNet-B0
using augmentation
WaveNet MobileNetV2
WaveNet EficientNet-B7</p>
        <p>Ensemble C and D
Ensemble C and E
PR-AUC-macro
0.127
0.134
0.139
0.102
0.105
0.1413
0.1414</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION AND FUTURE WORKS</title>
      <p>The EfficientNet model was shown to be more efficient than
previous models on the mood and theme classification problem. The
results could be further improved by training the mel-spectrogram
features on larger, more complex EfficientNet variants.</p>
      <p>Although the results when training on WaveNet features are not
higher than with mel-spectrogram features, ensembling the two models
improves the results, which shows that the WaveNet features capture
other aspects of the dataset. Because the WaveNet features were
extracted with a pretrained model, the augmentation methods have not
been fully applied; as future work, further improvements should come
from training WaveNet-style autoencoder models directly on the Jamendo
dataset.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>Research is supported with computing infrastructure by SELAB
and AILAB, University of Science, Vietnam National University
Ho Chi Minh City.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Tovstogan Minz Won Dmitry Bogdanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alastair</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>MediaEval 2020: Emotion and theme recognition in music using Jamendo</article-title>
          .
          <source>In MediaEval 2020 Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jesse</given-names>
            <surname>Engel</surname>
          </string-name>
          , Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Neural audio synthesis of musical notes with wavenet autoencoders</article-title>
          .
          <source>In International Conference on Machine Learning. PMLR</source>
          ,
          <fpage>1068</fpage>
          -
          <lpage>1077</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Tsung-Yi</surname>
            <given-names>Lin</given-names>
          </string-name>
          , Priya Goyal, Ross Girshick, Kaiming He, and
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Dollár</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Focal loss for dense object detection</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          . 2980-
          <fpage>2988</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Aaron</surname>
            <given-names>van den Oord</given-names>
          </string-name>
          , Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Senior</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Koray</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Wavenet: A generative model for raw audio</article-title>
          .
          <source>arXiv preprint arXiv:1609.03499</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Sandeep</given-names>
            <surname>Kumar</surname>
          </string-name>
          <string-name>
            <surname>Pandey</surname>
          </string-name>
          ,
          <source>HS Shekhawat, and SRM Prasanna</source>
          .
          <year>2019</year>
          .
          <article-title>Emotion recognition from raw speech using wavenet</article-title>
          .
          <source>In TENCON 2019-2019 IEEE Region 10 Conference (TENCON)</source>
          . IEEE,
          <fpage>1292</fpage>
          -
          <lpage>1297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Daniel</surname>
            <given-names>S Park</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>William</given-names>
            <surname>Chan</surname>
          </string-name>
          , Yu Zhang, Chung-Cheng Chiu, Barret Zoph,
          <string-name>
            <surname>Ekin D Cubuk</surname>
          </string-name>
          , and
          <string-name>
            <surname>Quoc</surname>
            <given-names>V</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Specaugment: A simple data augmentation method for automatic speech recognition</article-title>
          . arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>08779</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sandler</surname>
          </string-name>
          , Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and
          <string-name>
            <surname>Liang-Chieh Chen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Mobilenetv2: Inverted residuals and linear bottlenecks</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <volume>4510</volume>
          -
          <fpage>4520</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Manoj</given-names>
            <surname>Sukhavasi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sainath</given-names>
            <surname>Adapa</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Music theme recognition using CNN and self-attention</article-title>
          . arXiv preprint arXiv:
          <year>1911</year>
          .
          <volume>07041</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Tan and Quoc V Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Eficientnet: Rethinking model scaling for convolutional neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1905</source>
          .
          <volume>11946</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Xulong</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Yongwei Gao,
          <string-name>
            <given-names>Yi</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Wei</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Music Artist Classification with WaveNet Classifier for Raw Waveform Audio Data</article-title>
          . arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>04371</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>