          HCMUS at MediaEval 2020: Emotion Classification Using
           Wavenet Features with SpecAugment and EfficientNet
                                              Tri-Nhan Do1,3 , Minh-Tri Nguyen1,3 ,
                                    Hai-Dang Nguyen1,3 , Minh-Triet Tran1,2,3 , Xuan-Nam Cao1,3
                                                              1 University of Science, VNU-HCM
                                                         2 John von Neumann Institute, VNU-HCM
                                              3 Vietnam National University, Ho Chi Minh city, Vietnam

                      {dtnhan,nmtri17}@apcs.vn,nhdang@selab.hcmus.edu.vn,{tmtriet,cxnam}@fit.hcmus.edu.vn

ABSTRACT
MediaEval 2020 provided a subset of the MTG-Jamendo dataset for the task of recognizing moods and themes in music. Team HCMUS proposes several solutions for building efficient classifiers for this problem. In addition to the provided mel-spectrogram features, new features extracted with a WaveNet model are utilized to train the EfficientNet model. As evaluated by the task organizers, our best result achieved 0.142 in PR-AUC and 0.76 in ROC-AUC. With fast training and lightweight features, our proposed methods have the potential to work well with deeper neural networks.
1    INTRODUCTION
The Emotions and Themes in Music task at MediaEval [1] is difficult and challenging due to the ambiguity of tags in the real world. Mood is strongly influenced by human perception, so different people have different feelings about the same music; moreover, this is a multi-label classification problem with 56 tags. The dataset is quite unbalanced in its distribution of mood labels, and each track carries multiple labels, since many emotions can be present in the same song.

To solve this task, the authors tried many methods, varying the models, input features, and loss functions. Our best result is an ensemble of two different methods: one using the provided mel-spectrogram features with an EfficientNet model, and the other using WaveNet features with a MobileNetV2 model [7, 9].
2    RELATED WORK
Data augmentation is important when training neural network models. Traditional audio augmentation methods modify the speed of the waveform or alter the original signal samples with noise, which incurs considerable computational cost. The SpecAugment approach [6] instead adjusts the spectrogram directly, by warping it in the time direction, masking blocks of consecutive frequency channels, and masking blocks of consecutive time steps. This approach is simpler and costs less time and fewer resources.

The WaveNet model is applicable to many problems in signal processing, time series forecasting, and music generation [4]. Therefore, the authors also follow this direction by using a pre-trained WaveNet model to extract feature vectors from raw audio and then using those features as inputs to convolutional neural networks.
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'20, December 14-15 2020, Online

3    APPROACH
We follow several approaches built around two main inputs: mel-spectrogram features and WaveNet features.

3.1    Data analysis
In Figure 1, the green part shows audio with only one mood/theme label, the yellow part shows audio with 2 to 3 labels, and the red part shows audio with more than 3 labels. The training set contains 9949 audio samples with a total of 17885 mood labels. On average, each class has 319 audios, with a standard deviation of 202.75. The maximum number of moods for a single audio is 8. The most frequent mood/theme is happy, with 927 audios.

The data is thus extremely unbalanced, and some classes are barely represented at all. Therefore, a way to reduce the complexity of the data is necessary.
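
As a minimal sketch, the per-class statistics above can be recomputed once the annotations are parsed into one tag list per track; the loading step is dataset-specific and not shown in the paper, so a toy stand-in is used here:

    import statistics
    from collections import Counter

    # Toy stand-in for the parsed mood/theme annotations, one tag list per track.
    track_tags = [["happy"], ["happy", "energetic"], ["sad", "dark", "film", "slow"]]

    n_tracks = len(track_tags)                          # 9949 in the training set
    n_labels = sum(len(tags) for tags in track_tags)    # 17885 labels in total
    per_class = Counter(t for tags in track_tags for t in tags)
    counts = list(per_class.values())

    print("tracks:", n_tracks, "labels:", n_labels)
    print("mean per class:", statistics.mean(counts))   # about 319 reported
    print("std per class:", statistics.pstdev(counts))  # 202.75 reported
    print("most common:", per_class.most_common(1))     # happy (927) in the paper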
Figure 1: Histogram of moods and themes in the training set

3.2    Data preprocessing
3.2.1 Data balance: To reduce the ambiguity of the data, the authors change each audio's labels from multi-label to single-label, keeping only the most significant tag of each audio; this reduces the standard deviation of the class sizes and gives preference to moods with little data. One possible selection rule is sketched below.
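
The paper does not give the exact selection rule; a minimal sketch consistent with preferring moods with little data is to keep each track's globally rarest tag (our interpretation, not the authors' confirmed rule):

    from collections import Counter

    def balance_labels(track_tags):
        # Global tag frequencies over the whole training set.
        freq = Counter(t for tags in track_tags for t in tags)
        # Keep one tag per track, preferring the globally rarest one.
        return [min(tags, key=lambda t: freq[t]) for tags in track_tags]

    print(balance_labels([["happy", "melodic"], ["dark"], ["happy", "dark"]]))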



Figure 2: Overview of submission 1. Mel-spectrograms are fed to EfficientNet-B0 and WaveNet features to MobileNetV2; the outputs of the two models are combined in an ensemble.


3.2.2 Features preprocessing: WaveNet features: Based on the idea of using WaveNet as a classifier for raw-waveform music audio [5, 10], the authors use a WaveNet-style autoencoder that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform; this model was pretrained on NSynth, a high-quality dataset of musical notes [2].

Based on the dataset's statistics, the minimum audio length is 30 seconds. Due to the limitations of the authors' training machine, samples longer than 400 seconds are trimmed to their middle 400-second part. Each sample is then randomly cut to 30 seconds, and features are extracted from that segment. This approach is somewhat arbitrary and discards input data, so we planned to experiment with drawing a new random 30-second cut from the (up to 400-second) audio after each epoch. The extracted features for a 30-second audio are 16 channels by 937 time steps.

Mel-spectrograms: Each sample feature has 96 channels, and its time frames are randomly cropped to 6950 after each epoch; a sketch of this crop follows.
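
A minimal sketch of the per-epoch random time crop, assuming the features are stored as NumPy arrays (the helper below is ours, not the authors' code):

    import numpy as np

    def random_crop(feat, crop_len):
        # Crop a fixed-length window at a random position along the time
        # axis (last axis); re-drawn after each epoch.  Works for WaveNet
        # codes of shape (16, 937) and mel-spectrograms of shape (96, T)
        # cropped to T = 6950.
        total = feat.shape[-1]
        if total <= crop_len:
            return feat
        start = np.random.randint(0, total - crop_len + 1)
        return feat[..., start:start + crop_len]

    mel = np.random.rand(96, 8000)          # toy mel-spectrogram
    print(random_crop(mel, 6950).shape)     # (96, 6950)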
3.3    Data augmentation
SpecAugment: To train models more efficiently, the authors adopt the SpecAugment augmentation method [6] introduced by Google. This method masks blocks of consecutive time steps and frequency channels in each mel-spectrogram. The gain from this method is significant: PR-AUC-macro improves from 0.134 to 0.139.

Each input has a 70% chance of being augmented with SpecAugment; an augmented mel-spectrogram receives two blocks of time masking and two blocks of channel masking, as sketched below.
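
A minimal sketch of this masking policy; the mask widths are our assumptions, since the paper does not report them:

    import numpy as np

    def spec_augment(mel, p=0.7, n_masks=2, max_t=40, max_f=12):
        # With probability p, apply n_masks time masks and n_masks
        # channel (frequency) masks to a (channels, frames) spectrogram.
        if np.random.rand() > p:
            return mel
        mel = mel.copy()
        n_ch, n_fr = mel.shape
        for _ in range(n_masks):            # time masking
            w = np.random.randint(1, max_t + 1)
            t0 = np.random.randint(0, max(1, n_fr - w))
            mel[:, t0:t0 + w] = 0.0
        for _ in range(n_masks):            # channel masking
            w = np.random.randint(1, max_f + 1)
            f0 = np.random.randint(0, max(1, n_ch - w))
            mel[f0:f0 + w, :] = 0.0
        return mel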
3.4    Deep Neural Network model
Since both the mel-spectrogram features and the WaveNet features can be expressed as images, the authors use convolutional models such as MobileNet and EfficientNet. The mel-spectrogram features are passed to EfficientNet-B0, while the WaveNet features are passed to MobileNetV2 and EfficientNet-B7. Because the WaveNet features are not large enough to fit EfficientNet-B7, the authors duplicate the number of channels so that these features can be used; a sketch of this step follows after this section.

In addition, we also tested an SVM model, InceptionNet, and ResNet, and, to capture long-term temporal characteristics, self-attention was added as in the method of AMLAG 2019 [8]; however, these methods produced only a slight improvement in the results.
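
A minimal sketch of the channel duplication, assuming the WaveNet codes are batched as single-channel images (the exact replication scheme is our assumption; the paper only says the channels are duplicated):

    import numpy as np

    def to_three_channels(wavenet_feat):
        # Duplicate a single-channel WaveNet feature map across the channel
        # axis so it matches the 3-channel input expected by ImageNet-style
        # backbones: (batch, 1, 16, 937) -> (batch, 3, 16, 937).
        return np.repeat(wavenet_feat, 3, axis=1)

    x = np.random.randn(8, 1, 16, 937)
    print(to_three_channels(x).shape)       # (8, 3, 16, 937)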
3.5    Loss function
Binary cross-entropy (BCE) loss is applied for both MobileNetV2 and EfficientNet. The authors also tried Focal Loss [3], since the dataset is quite unbalanced; however, it did not give better results on our dataset after the balancing step.
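
For reference, a common formulation of the binary focal loss [3] for multi-label outputs, shown here with the default gamma and alpha from the Focal Loss paper rather than values reported by the authors:

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        # Standard binary focal loss; logits and targets have shape
        # (batch, n_tags), with targets in {0, 1}.
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)               # model probability of the true label
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * bce).mean()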
4    EXPERIMENTS AND RESULTS
Our experiments were run on a server with an Nvidia Quadro K6000 graphics card. Methods A, B, and D were not submitted to the challenge. We find that the data balancing method leads to better results than the original dataset with its default labels. Based on experiments on the validation set, our ensemble combines the two branches with factors of 0.7 for the mel-spectrogram features and 0.3 for the WaveNet features, which gave the best results.

Method      Features and Model                                      PR-AUC-macro
A           Mel-spectrogram, EfficientNet-B0                        0.127
B           Mel-spectrogram, EfficientNet-B0 with data processing   0.134
C (run2)    Mel-spectrogram, EfficientNet-B0 with augmentation      0.139
D           WaveNet, MobileNetV2                                    0.102
E (run3)    WaveNet, EfficientNet-B7                                0.105
F (run1)    Ensemble of C and D                                     0.1413
G (run4)    Ensemble of C and E                                     0.1414

Table 1: Experiment results
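
A minimal sketch of the ensemble step, assuming the factors weight a per-tag average of the two models' predicted probabilities (the paper gives the 0.7/0.3 weights but not the exact combination rule):

    import numpy as np

    def ensemble(p_mel, p_wavenet, w_mel=0.7, w_wave=0.3):
        # Weighted average of (n_samples, n_tags) probability matrices
        # from the mel-spectrogram and WaveNet branches.
        return w_mel * p_mel + w_wave * p_wavenet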
5    CONCLUSION AND FUTURE WORKS
The EfficientNet model was shown to be more efficient than previous models on the mood and theme classification problem. The results can be improved by training the mel-spectrogram features on larger, more complex EfficientNet variants.

Although the results from training on WaveNet features are not higher than those from mel-spectrogram features, ensembling the two models improves the results, which shows that the WaveNet features capture other aspects of the dataset. Because the WaveNet features were extracted with a pretrained model, the augmentation methods could not be fully applied; as future work, further improvements may come from training WaveNet-style autoencoder models directly on the Jamendo dataset.

ACKNOWLEDGMENTS
Research is supported with computing infrastructure by SELAB and AILAB, University of Science, Vietnam National University - Ho Chi Minh City.


REFERENCES
 [1] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won.
     2020. MediaEval 2020: Emotion and theme recognition in music using
     Jamendo. In MediaEval 2020 Workshop.
 [2] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Moham-
     mad Norouzi, Douglas Eck, and Karen Simonyan. 2017. Neural audio
     synthesis of musical notes with wavenet autoencoders. In International
     Conference on Machine Learning. PMLR, 1068–1077.
 [3] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár.
     2017. Focal loss for dense object detection. In Proceedings of the IEEE
     international conference on computer vision. 2980–2988.
 [4] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan,
     Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and
     Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio.
     arXiv preprint arXiv:1609.03499 (2016).
 [5] Sandeep Kumar Pandey, HS Shekhawat, and SRM Prasanna. 2019.
     Emotion recognition from raw speech using wavenet. In TENCON
     2019-2019 IEEE Region 10 Conference (TENCON). IEEE, 1292–1297.
 [6] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret
     Zoph, Ekin D Cubuk, and Quoc V Le. 2019. Specaugment: A simple
     data augmentation method for automatic speech recognition. arXiv
     preprint arXiv:1904.08779 (2019).
 [7] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov,
     and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and
     linear bottlenecks. In Proceedings of the IEEE conference on computer
     vision and pattern recognition. 4510–4520.
 [8] Manoj Sukhavasi and Sainath Adapa. 2019. Music theme recognition
     using CNN and self-attention. arXiv preprint arXiv:1911.07041 (2019).
 [9] Mingxing Tan and Quoc V Le. 2019. Efficientnet: Rethinking
     model scaling for convolutional neural networks. arXiv preprint
     arXiv:1905.11946 (2019).
[10] Xulong Zhang, Yongwei Gao, Yi Yu, and Wei Li. 2020. Music Artist
     Classification with WaveNet Classifier for Raw Waveform Audio Data.
     arXiv preprint arXiv:2004.04371 (2020).