<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SELAB-HCMUS at MediaEval 2021: Music Theme and Emotion Classification with Co-teaching Training Strategy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Phu-Thinh Pham</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Hieu Huynh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai-Dang Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Triet Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>John von Neumann Institute</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh city</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>MediaEval 2021 offers the third challenge motivating studies on automatically recognizing the emotions and themes conveyed in a music recording. Team SELAB-HCMUS proposes several methods for this problem. In this competition, we applied an efficient training strategy to solve the task. In addition, with short segments as input representations, the model achieves better results and a shorter training time. In the official evaluation, our best result achieves 0.1435 PR-AUC and 0.7599 ROC-AUC.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The third edition of the Emotions and Themes in Music task [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] was
introduced at the MediaEval 2021 workshop. The aim of the task is
to predict the moods and themes of a given raw audio track, with 56 tags
in total. Because a track can be labeled with multiple tags, the task can be
considered a multi-label classification problem.
      </p>
      <p>Recognizing themes and emotions in music tracks is
important and has a wide range of applications, such as music
recommendation systems and music analysis. However, this task is considered
extremely difficult. Determining the emotion
of a song can be quite ambiguous because a song’s emotion or
mood depends heavily on the sentiments of the listener.
Moreover, each audio file is quite long, which can
greatly increase the size and training cost of
deep learning models, even though the emotions and themes in a music
recording can often be determined from short audio segments.</p>
      <p>To tackle these problems, we explored
alternative ways to preprocess the data instead of training on the
whole audio. Furthermore, we used several pre-trained CNN
(Convolutional Neural Network) models to build an ensemble model,
which achieves 0.1435 PR-AUC and 0.7599 ROC-AUC. Along with
these methods, we also applied a model training approach
called co-teaching to improve the efficiency of our models.</p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>We approached this problem in several different ways, which
can be divided into three parts, namely data pre-processing,
model architecture, and co-teaching.</p>
      <p>2.1 Data pre-processing</p>
      <p>2.1.1 Data shortening. At first, we attempted to train on the whole
length of each track; however, this takes too much time,
approximately 20 minutes per epoch. Through data analysis, we observed
that each music track can be regarded as a sequence of repeated
rhythms. Thus, instead of training on the whole track, we take a
random crop of each track. During the training stage, each
mel-spectrogram instance is randomly cropped to a dimension
of 96 × 960. In the validation and testing stages, each sample is divided
into 16 segments, and the final prediction is the average of the
segment scores. This method reduces
the training time from approximately 20 minutes to 15 minutes per
epoch while preserving the models’ accuracy.</p>
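      <p>The cropping scheme above can be sketched as follows. This is a minimal illustration, not our exact implementation: plain Python lists stand in for array types, the 16 evaluation windows are assumed to be evenly spaced (and may overlap), and the helper names random_crop, split_segments, and predict_track are hypothetical.</p>
      <preformat>
```python
import random

N_MELS = 96        # mel bands, from the text
CROP_FRAMES = 960  # crop width in frames, from the text
N_SEGMENTS = 16    # evaluation segments per track, from the text


def random_crop(spec, width=CROP_FRAMES, rng=random):
    """Return a random window of `width` frames from a mel-spectrogram.

    `spec` is a list of N_MELS rows, each a list of per-frame values.
    Assumes the track has at least `width` frames.
    """
    n_frames = len(spec[0])
    start = rng.randint(0, n_frames - width)  # inclusive on both ends
    return [row[start:start + width] for row in spec]


def split_segments(spec, n_segments=N_SEGMENTS, width=CROP_FRAMES):
    """Cut `n_segments` evenly spaced windows (assumption: they may overlap)."""
    n_frames = len(spec[0])
    last_start = n_frames - width
    starts = [round(i * last_start / (n_segments - 1)) for i in range(n_segments)]
    return [[row[s:s + width] for row in spec] for s in starts]


def predict_track(model, spec):
    """Average the per-tag scores of all segments of one track."""
    scores = [model(seg) for seg in split_segments(spec)]
    n_tags = len(scores[0])
    return [sum(s[t] for s in scores) / len(scores) for t in range(n_tags)]
```
      </preformat>
      <p>Here model is any callable that maps one 96 × 960 segment to a list of per-tag scores; random_crop would be called once per training example, and predict_track once per validation or test track.</p>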
      <p>
        2.1.2 Mixup. We adopt the Mixup learning principle [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] by
training on convex combinations of pairs of examples and their labels.
Previous work [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] has shown the importance of this method
for improving the performance and generalization of the models.
      </p>
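      <p>A convex combination of two training examples can be sketched as below. The mixing coefficient is drawn from a Beta(alpha, alpha) distribution as in the Mixup paper; the value alpha = 0.4 and the function name mixup are illustrative assumptions, and flat lists stand in for spectrograms, which would be blended element-wise in the same way.</p>
      <preformat>
```python
import random


def mixup(x1, y1, x2, y2, alpha=0.4, rng=random):
    """Blend two examples and their multi-hot label vectors.

    lam is drawn from Beta(alpha, alpha); alpha=0.4 is an illustrative assumption.
    Returns the mixed input, the mixed (soft) labels, and lam.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```
      </preformat>
      <p>The mixed labels are soft targets, which remain compatible with a binary cross-entropy loss.</p>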
      <p>
        2.1.3 SpecAugment. We use SpecAugment [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a simple data augmentation method originally used for speech recognition, to train
models more effectively. This technique masks blocks of frequency
channels and blocks of consecutive time steps in the mel-spectrogram features.
We do not apply SpecAugment during validation and testing.
      </p>
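      <p>The masking step can be illustrated with the sketch below, which zeroes one random band of mel channels and one random run of frames. The maximum mask widths, the choice of zero as the fill value, and the names mask_spectrogram and augment are assumptions made for illustration only.</p>
      <preformat>
```python
import random


def mask_spectrogram(spec, max_freq_width=8, max_time_width=40, rng=random):
    """Zero one random band of mel channels and one random run of frames.

    Returns a masked copy; the mask widths are illustrative assumptions.
    """
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]       # copy so the input stays untouched

    f = rng.randint(0, max_freq_width)   # number of channels to mask
    f0 = rng.randint(0, n_mels - f)      # first masked channel
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_frames

    t = rng.randint(0, max_time_width)   # number of frames to mask
    t0 = rng.randint(0, n_frames - t)    # first masked frame
    for row in out:
        row[t0:t0 + t] = [0.0] * t
    return out


def augment(spec, training, rng=random):
    """Apply masking only during training, as the text describes."""
    return mask_spectrogram(spec, rng=rng) if training else spec
```
      </preformat>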
      <p>
        2.1.4 Data balancing. Following the study [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we attempt to
reduce the ambiguity in the data by changing the labels from
multi-tag to single-tag. This has been shown to give
better results than the default labels.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.2 Model Architecture</title>
      <p>
        Since mel-spectrogram features can be treated as images, we
can apply CNN models, which are frequently used in image and
video processing, to this audio-based problem. The models used in
our experiments are EfficientNet-B0 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], ReXNet-100 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], MixNet-S
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], ResNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and RegNet [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. After conducting several experiments
and evaluations on these models, we found EfficientNet and ReXNet to be the
most suitable architectures for the task.
      </p>
      <p>We tried to further improve these two models by applying the
co-teaching paradigm, which is explained in Section 2.3.</p>
    </sec>
    <sec id="sec-4">
      <title>2.3 Co-teaching</title>
      <p>
        Co-teaching [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is an effective learning paradigm that trains two
networks simultaneously, allowing them to teach each other. In
each mini-batch, each network ignores a small part of the data and
considers only the useful instances. That is, in each epoch, each
network selects the small-loss instances in the data and uses them to
optimize its peer network. This allows the networks to communicate with
each other to identify the helpful data to be used.
      </p>
      <p>[Figure 1: system flow of the co-teaching setup. Diagram labels: Mel-spectrograms; EfficientNet-B0; ReXNet; Optimization; Epoch 1, Epoch 2, ..., Epoch n.]</p>
      <p>
        Since this approach looks promising, we
applied it to the task. The system consists of two models:
one is EfficientNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the other is ReXNet
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These are the two networks discussed in Section
2.2. The system flow is illustrated in Figure 1. In our
experiments, we set the forget rate, which decides the
amount of data to be ignored, to 0.05.
      </p>
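      <p>The small-loss selection at the heart of co-teaching can be sketched as follows, assuming per-sample losses for one mini-batch are already available from both networks. The function names are hypothetical, and a fixed forget rate of 0.05 is used throughout, which is a simplification of the schedule in the original co-teaching paper.</p>
      <preformat>
```python
def small_loss_indices(losses, forget_rate=0.05):
    """Indices of the (1 - forget_rate) fraction of samples with the smallest loss."""
    n_keep = int((1.0 - forget_rate) * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return order[:n_keep]


def co_teaching_step(losses_a, losses_b, forget_rate=0.05):
    """Pick the samples each network trains on in one mini-batch.

    Network A is updated on the samples B finds easy, and vice versa.
    Returns (indices_for_a, indices_for_b).
    """
    keep_by_a = small_loss_indices(losses_a, forget_rate)
    keep_by_b = small_loss_indices(losses_b, forget_rate)
    return keep_by_b, keep_by_a


def mean_selected(losses, indices):
    """Mean loss over the selected samples, i.e. the quantity to back-propagate."""
    return sum(losses[i] for i in indices) / len(indices)
```
      </preformat>
      <p>In a training loop, network A would back-propagate mean_selected(losses_a, indices_for_a) and network B would back-propagate mean_selected(losses_b, indices_for_b), so that each network learns only from the instances its peer considers reliable.</p>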
    </sec>
    <sec id="sec-5">
      <title>SUBMISSIONS AND RESULTS</title>
    </sec>
    <sec id="sec-6">
      <title>3.1 Experimental setup</title>
      <p>
        To conduct the experiments, we use the Adam [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] optimizer for 100 epochs.
We start training with a learning rate of 1 × 10⁻³. The learning
rate is divided by 10 after 5 consecutive epochs without
improvement. The loss function used during development is Binary
Cross Entropy (BCE). The experiments are carried out on Google
Colab Pro with an NVIDIA Tesla P100 GPU.
      </p>
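      <p>The learning-rate rule and the loss above can be written out as a small sketch. It assumes the monitored quantity is a validation metric where higher is better, which the text does not specify; PlateauSchedule and binary_cross_entropy are hypothetical names, not calls from a particular framework.</p>
      <preformat>
```python
import math


class PlateauSchedule:
    """Divide the learning rate by 10 after `patience` epochs without improvement."""

    def __init__(self, lr=1e-3, factor=0.1, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = None
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's validation metric (higher is better); return the lr."""
        if self.best is None or metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean BCE over the tags of one track; probs are clipped for stability."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)
```
      </preformat>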
    </sec>
    <sec id="sec-7">
      <title>3.2 Submissions</title>
      <p>We submitted 4 runs, corresponding to the 4 models with the highest
PR-AUC, to the MediaEval 2021 organizers.</p>
      <p>Run 1: EfficientNet-B0 trained with the co-teaching
strategy, in parallel with the peer network ReXNet.</p>
      <p>Run 2: ReXNet trained with the co-teaching strategy, in parallel with
the peer network EfficientNet-B0.</p>
      <p>Run 3: ReXNet trained with the proposed data processing.</p>
      <p>Run 4: Ensemble of Run 1 and Run 2, with
weights of 0.65 and 0.35 for Run 1 and Run 2, respectively.</p>
    </sec>
    <sec id="sec-8">
      <title>3.3 Results</title>
      <p>The results of the selected models
are shown in Table 1. The EfficientNet and ReXNet
models trained with the co-teaching method achieve a higher PR-AUC
than the conventionally trained models. To increase the overall
performance, we built an ensemble from these two models.
To maximize the result, we apply linear optimization to find the
optimal coefficient of each model based on the results on the
validation set. Finally, we obtain the best result of 0.1435 PR-AUC,
which is slightly higher than that of any individual model.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Model</th><th>Description</th><th>PR-AUC-macro</th></tr>
          </thead>
          <tbody>
            <tr><td>Ensemble (Run 4)</td><td>Run 1 + 2</td><td>0.1435</td></tr>
            <tr><td>EfficientNet-B0 (Run 1)</td><td>Co-teaching w/ ReXNet</td><td>0.1415</td></tr>
            <tr><td>ReXNet (Run 2)</td><td>Co-teaching w/ EfficientNet-B0</td><td>0.1343</td></tr>
            <tr><td>ReXNet (Run 3)</td><td>Using proposed data processing</td><td>0.1262</td></tr>
            <tr><td>EfficientNet-B0 [<xref ref-type="bibr" rid="ref1">1</xref>]</td><td>Using data augmentation</td><td>0.139</td></tr>
          </tbody>
        </table>
      </table-wrap>
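      <p>The weighted ensemble of Run 1 and Run 2 can be sketched as below. The weights 0.65 and 0.35 come from the submission description; the weight search shown here is a plain grid search used as a simple stand-in, not the linear optimization procedure itself, and the names ensemble and best_weight as well as the metric callable are hypothetical.</p>
      <preformat>
```python
def ensemble(scores_1, scores_2, w1=0.65, w2=0.35):
    """Weighted average of two models' per-tag scores for one track."""
    return [w1 * a + w2 * b for a, b in zip(scores_1, scores_2)]


def best_weight(val_1, val_2, metric, steps=20):
    """Grid-search w1 in [0, 1] (with w2 = 1 - w1) to maximize `metric`.

    `val_1` and `val_2` are lists of per-track score vectors on the validation
    set; `metric` maps the blended list to a number where higher is better.
    """
    best_w, best_value = 0.0, None
    for k in range(steps + 1):
        w = k / steps
        blended = [ensemble(a, b, w, 1.0 - w) for a, b in zip(val_1, val_2)]
        value = metric(blended)
        if best_value is None or value > best_value:
            best_w, best_value = w, value
    return best_w, best_value
```
      </preformat>
      <p>In practice, metric would compute the macro PR-AUC of the blended scores against the validation labels.</p>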
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION AND FUTURE WORKS</title>
      <p>This paper introduces data shortening, a method to process the data before training.
We apply CNNs, an approach commonly
used in image-based classification problems, to the audio-based
classification problem. In addition, because the task involves a
large number of tags, we make use of the co-teaching paradigm,
which is designed to deal with noisy labels. With
all of these efforts, we achieve our highest PR-AUC of 0.1435.</p>
      <p>For future work, we first aim to investigate the factors that
could improve the model’s accuracy. Then, we want to design more
efficient models specifically for the music emotion recognition task.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was funded by Gia Lam Urban Development and
Investment Company Limited, Vingroup and supported by Vingroup
Innovation Foundation (VINIF) under project code VINIF.2019.DA19.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Tri-Nhan</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Minh-Tri</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hai-Dang</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Minh-Triet</given-names>
            <surname>Tran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xuan-Nam</given-names>
            <surname>Cao</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Features with SpecAugment and EfficientNet</article-title>
          . In MediaEval 2020 Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Bo</given-names>
            <surname>Han</surname>
          </string-name>
          , Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and
          <string-name>
            <given-names>Masashi</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Co-teaching: Robust training of deep neural networks with extremely noisy labels</article-title>
          .
          <source>arXiv preprint arXiv:1804.06872</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Dongyoon</given-names>
            <surname>Han</surname>
          </string-name>
          , Sangdoo Yun, Byeongho Heo, and
          <string-name>
            <given-names>YoungJoon</given-names>
            <surname>Yoo</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Rethinking Channel Dimensions for Efficient Model Design</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>732</fpage>
          -
          <lpage>741</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Diederik P</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Koutini</surname>
          </string-name>
          , Shreyan Chowdhury, Verena Haunschmid, Hamid Eghbal-Zadeh, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Emotion and theme recognition in music with Frequency-Aware RF-Regularized CNNs</article-title>
          . In MediaEval 2019 Workshop.
          <source>arXiv preprint arXiv:1911.05833</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Koutini</surname>
          </string-name>
          , Hamid Eghbal-Zadeh,
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Dorfer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The receptive field as a regularizer in deep convolutional neural networks for acoustic scene classification</article-title>
          .
          <source>In 2019 27th European signal processing conference (EUSIPCO)</source>
          . IEEE,
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Daniel S</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>William</given-names>
            <surname>Chan</surname>
          </string-name>
          , Yu Zhang, Chung-Cheng Chiu, Barret Zoph,
          <string-name>
            <given-names>Ekin D</given-names>
            <surname>Cubuk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Quoc V</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>SpecAugment: A simple data augmentation method for automatic speech recognition</article-title>
          .
          <source>arXiv preprint arXiv:1904.08779</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ilija</given-names>
            <surname>Radosavovic</surname>
          </string-name>
          , Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Dollár</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Designing network design spaces</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>10428</fpage>
          -
          <lpage>10436</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Tan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Quoc</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>EfficientNet: Rethinking model scaling for convolutional neural networks</article-title>
          .
          <source>In International Conference on Machine Learning. PMLR</source>
          ,
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Tan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Quoc V</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>MixConv: Mixed depthwise convolutional kernels</article-title>
          .
          <source>arXiv preprint arXiv:1907.09595</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Tovstogan</surname>
          </string-name>
          , Dmitry Bogdanov, and
          <string-name>
            <given-names>Alastair</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>MediaEval 2021: Emotion and Theme Recognition in Music Using Jamendo</article-title>
          . In MediaEval 2021 Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Hongyi</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Moustapha Cisse,
          <string-name>
            <given-names>Yann N</given-names>
            <surname>Dauphin</surname>
          </string-name>
          , and David Lopez-Paz.
          <year>2017</year>
          .
          <article-title>mixup: Beyond empirical risk minimization</article-title>
          .
          <source>arXiv preprint arXiv:1710.09412</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>