=Paper=
{{Paper
|id=Vol-2882/MediaEval_20_paper_49
|storemode=property
|title=HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Feature with SpecAugment and EfficientNet
|pdfUrl=https://ceur-ws.org/Vol-2882/paper49.pdf
|volume=Vol-2882
|authors=Tri-Nhan Do,Minh-Tri Nguyen,Hai-Dang Nguyen,Minh-Triet Tran,Xuan-Nam Cao
|dblpUrl=https://dblp.org/rec/conf/mediaeval/DoNNTC20
}}
==HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Feature with SpecAugment and EfficientNet==
HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Features with SpecAugment and EfficientNet
Tri-Nhan Do1,3 , Minh-Tri Nguyen1,3 ,
Hai-Dang Nguyen1,3 , Minh-Triet Tran1,2,3 , Xuan-Nam Cao1,3
1 University of Science, VNU-HCM
2 John von Neumann Institute, VNU-HCM
3 Vietnam National University, Ho Chi Minh City, Vietnam
{dtnhan,nmtri17}@apcs.vn,nhdang@selab.hcmus.edu.vn,{tmtriet,cxnam}@fit.hcmus.edu.vn
ABSTRACT
MediaEval 2020 provided a subset of the MTG-Jamendo dataset for the task of recognizing mood and theme in music. Team HCMUS proposes several solutions for building efficient classifiers to solve this problem. In addition to the provided mel-spectrogram features, new features extracted from a WaveNet model are used to train an EfficientNet model. As evaluated by the task organizers, our best result achieved 0.142 in PR-AUC and 0.76 in ROC-AUC. With fast training and lightweight features, our proposed methods have the potential to work well with deeper neural networks.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'20, December 14-15 2020, Online

1 INTRODUCTION
The Emotions and Themes in Music task in MediaEval [1] is difficult and challenging due to the ambiguity of tags in the real world. Mood is influenced by human perception, so different people have different feelings about the same song; moreover, this is a multi-label classification problem with more than 56 tags. The dataset is quite unbalanced in the distribution of mood labels, and each audio track can carry multiple labels, since many emotions can appear in the same song. To solve this task, the authors tried many methods, using different kinds of models, input features, and loss functions. Our best result is an ensemble of two different methods: one using the provided mel-spectrogram features with an EfficientNet model [9], and the other using WaveNet features with a MobileNetV2 model [7].

2 RELATED WORK
Data augmentation is important when training neural network models. Traditional audio augmentation methods modify the speed of the waveform or add noise to the original signal samples, which incurs a high computational cost. The SpecAugment approach [6] instead adjusts the spectrogram directly by warping it in the time direction, masking blocks of consecutive frequency channels, and masking blocks of consecutive time steps. This approach is simpler and costs less time and fewer resources.
The WaveNet model is applicable to many problems in signal processing, time series forecasting, and music generation [4]. Therefore, the authors also follow this direction, using a pre-trained WaveNet model to extract feature vectors from raw audio and then using those features as inputs to convolutional neural networks.

3 APPROACH
We follow several approaches built on two main inputs: mel-spectrogram features and WaveNet features.

3.1 Data analysis
In Figure 1, the green part shows audio with only one mood/theme label, the yellow part shows audio with 2 to 3 moods, and the red part shows audio with more than 3 moods. The number of training audio samples is 9949, with a total of 17885 mood labels. On average, each class has 319 audio samples, with a standard deviation of 202.75. The maximum number of moods for a single audio track is 8. The most frequent mood/theme is happy, with 927 audio tracks.
We can see that the data is extremely unbalanced, and some classes have almost no audio representing them. Therefore, it is necessary to find a way to reduce the complexity of the data.

Figure 1: Histogram of mood and theme of training set

3.2 Data preprocessing
3.2.1 Data balance: To reduce the ambiguity of the data, the authors change each audio's label set from multi-label to single-label, keeping the most significant tag of each audio in a way that reduces the standard deviation and gives preference to moods with little data.

3.2.2 Features preprocessing: WaveNet feature: Based on the idea of using WaveNet as a classifier for raw-waveform music audio [5, 10], the authors use a WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform; this model was pretrained on NSynth, a high-quality dataset of musical notes [2].
Based on the dataset's statistics, the minimum audio length is 30 seconds; because of the limitations of the authors' training machine, sound samples longer than 400 seconds are trimmed to keep only the middle part.
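The trimming and random-cropping preprocessing described here can be sketched as follows. This is a minimal NumPy sketch under assumptions: only the 400-second trim and the 30-second crop come from the paper, while the sampling rate and the function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16000   # assumed sampling rate of the decoded audio
MAX_SECONDS = 400     # samples longer than this are trimmed (from the paper)
CROP_SECONDS = 30     # length of the random crop fed to feature extraction

def trim_to_middle(waveform, sr=SAMPLE_RATE):
    """Keep only the middle MAX_SECONDS of an overly long waveform."""
    max_len = MAX_SECONDS * sr
    if len(waveform) <= max_len:
        return waveform
    start = (len(waveform) - max_len) // 2
    return waveform[start:start + max_len]

def random_crop(waveform, sr=SAMPLE_RATE, rng=None):
    """Randomly cut a CROP_SECONDS segment; re-drawn each epoch."""
    rng = rng or np.random.default_rng()
    crop_len = CROP_SECONDS * sr
    if len(waveform) <= crop_len:
        return waveform
    start = int(rng.integers(0, len(waveform) - crop_len + 1))
    return waveform[start:start + crop_len]
```

Re-drawing the crop each epoch (rather than fixing one 30-second window) is the variant the authors planned to experiment with to avoid permanently discarding input data.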
Figure 2: Overview of submission 1 (mel-spectrogram features are passed to EfficientNet-B0 and WaveNet features to MobileNetV2; the two outputs are combined by an ensemble).

Each sample is then randomly cut to a 30-second segment, and features are extracted from it. This approach is somewhat arbitrary and loses input data, so we planned to experiment with drawing a new random cut from the 400 seconds of audio after each epoch. The output for a 30-second audio clip is a feature map of 16 frames by 937 time steps.
Mel-spectrogram: Each sample feature has 96 channels, and the time frames are randomly cropped to 6950 after each epoch.

3.3 Data augmentation
SpecAugment: To train models more efficiently, the authors include the SpecAugment masking method introduced by Google. This method masks blocks of consecutive time steps and channels in each mel-spectrogram. The result when using this method increases significantly: PR-AUC-macro is improved from 0.134 to 0.139.
Each input has a 70% chance of being augmented with SpecAugment; each augmented mel-spectrogram gets two blocks of time masking and two blocks of channel masking.

3.4 Deep Neural Network model
Since both mel-spectrogram features and WaveNet features can be expressed as images, the authors use convolutional models such as MobileNet and EfficientNet. The mel-spectrogram features are passed to EfficientNet-B0, while the WaveNet features are passed to MobileNetV2 and EfficientNet-B7. Because the WaveNet features are not large enough to fit EfficientNet-B7, the authors duplicate the channels so that these features can be used.
In addition, we also tested an SVM model, InceptionNet, and ResNet; and, to capture long-term temporal characteristics, self-attention was added as in the AMLAG 2019 method [8], but this produced only a slight improvement in the result.

3.5 Loss function
For the loss function, binary cross-entropy loss (BCE) is applied for both MobileNetV2 and EfficientNet. The authors also tried Focal Loss [3], since the dataset is quite unbalanced; however, it did not give better results on our dataset after the balancing step.

4 EXPERIMENTS AND RESULTS
Our experiments were run on a server with an Nvidia Quadro K6000 graphics card. Methods A, B, and D were not submitted to the challenge. We found that the data balancing method leads to better results than the original dataset with default labels. Based on experiments on the validation set, our ensemble weights the mel-spectrogram and WaveNet predictions with factors of 0.7 and 0.3, respectively, to obtain the best results.

Method   | Features and Model                                    | PR-AUC-macro
A        | Mel-spectrogram, EfficientNet-B0                      | 0.127
B        | Mel-spectrogram, EfficientNet-B0 with data processing | 0.134
C (run2) | Mel-spectrogram, EfficientNet-B0 using augmentation   | 0.139
D        | WaveNet, MobileNetV2                                  | 0.102
E (run3) | WaveNet, EfficientNet-B7                              | 0.105
F (run1) | Ensemble of C and D                                   | 0.1413
G (run4) | Ensemble of C and E                                   | 0.1414

Table 1: Experiment results

5 CONCLUSION AND FUTURE WORKS
The EfficientNet model was shown to be more efficient than previous models in the mood and theme classification problem. The results can be improved by training mel-spectrogram features on larger, more complex EfficientNet models.
Although the result when training on WaveNet features is not higher than with mel-spectrogram features, ensembling the two models improves the results, which shows that the WaveNet features capture other aspects of the dataset. Because the WaveNet features were extracted with a pretrained model, the augmentation methods have not been fully applied; as future work, further improvements are possible by training WaveNet-style autoencoder models on the Jamendo dataset.

ACKNOWLEDGMENTS
Research is supported with computing infrastructure by SELAB and AILAB, University of Science, Vietnam National University - Ho Chi Minh City.
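The SpecAugment configuration described in Section 3.3 (a 70% augmentation chance, with two time masks and two channel masks per augmented mel-spectrogram) can be sketched as follows. This is a sketch under assumptions: the maximum mask width and the function name are not from the paper.

```python
import numpy as np

AUG_PROB = 0.7        # 70% of inputs are augmented (Section 3.3)
N_TIME_MASKS = 2      # two blocks of time masking per spectrogram
N_CHANNEL_MASKS = 2   # two blocks of channel masking per spectrogram
MAX_MASK_WIDTH = 32   # assumed maximum width of each masked block

def spec_augment(spec, rng=None):
    """Zero out random blocks of time steps and mel channels.

    `spec` is a (channels, time) array; a masked copy is returned.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    if rng.random() >= AUG_PROB:
        return spec  # 30% of inputs pass through unchanged
    n_ch, n_t = spec.shape
    for _ in range(N_TIME_MASKS):
        w = int(rng.integers(1, MAX_MASK_WIDTH + 1))
        t0 = int(rng.integers(0, max(1, n_t - w)))
        spec[:, t0:t0 + w] = 0.0
    for _ in range(N_CHANNEL_MASKS):
        w = int(rng.integers(1, MAX_MASK_WIDTH + 1))
        c0 = int(rng.integers(0, max(1, n_ch - w)))
        spec[c0:c0 + w, :] = 0.0
    return spec
```

Masking with zeros on a log-mel spectrogram is one common choice; masking with the per-channel mean is another variant of SpecAugment.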
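The weighted ensembling described in Section 4 amounts to a convex combination of the two models' per-tag probabilities. A minimal sketch, where only the 0.7/0.3 weights come from the paper and the array shapes and names are assumed:

```python
import numpy as np

# Ensemble weights from Section 4: 0.7 for the mel-spectrogram model
# and 0.3 for the WaveNet-feature model, chosen on the validation set.
W_MEL, W_WAVENET = 0.7, 0.3

def ensemble_scores(mel_probs, wavenet_probs):
    """Weighted average of the per-tag probabilities of the two models."""
    mel_probs = np.asarray(mel_probs, dtype=float)
    wavenet_probs = np.asarray(wavenet_probs, dtype=float)
    return W_MEL * mel_probs + W_WAVENET * wavenet_probs
```

For example, `ensemble_scores([0.8, 0.2], [0.4, 0.6])` yields `[0.68, 0.32]`.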
REFERENCES
[1] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. 2020.
MediaEval 2020: Emotion and theme recognition in music using Ja-
mendo. In MediaEval 2020 Workshop.
[2] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Moham-
mad Norouzi, Douglas Eck, and Karen Simonyan. 2017. Neural audio
synthesis of musical notes with wavenet autoencoders. In International
Conference on Machine Learning. PMLR, 1068–1077.
[3] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár.
2017. Focal loss for dense object detection. In Proceedings of the IEEE
international conference on computer vision. 2980–2988.
[4] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan,
Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and
Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio.
arXiv preprint arXiv:1609.03499 (2016).
[5] Sandeep Kumar Pandey, HS Shekhawat, and SRM Prasanna. 2019.
Emotion recognition from raw speech using wavenet. In TENCON
2019-2019 IEEE Region 10 Conference (TENCON). IEEE, 1292–1297.
[6] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret
Zoph, Ekin D Cubuk, and Quoc V Le. 2019. Specaugment: A simple
data augmentation method for automatic speech recognition. arXiv
preprint arXiv:1904.08779 (2019).
[7] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov,
and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and
linear bottlenecks. In Proceedings of the IEEE conference on computer
vision and pattern recognition. 4510–4520.
[8] Manoj Sukhavasi and Sainath Adapa. 2019. Music theme recognition
using CNN and self-attention. arXiv preprint arXiv:1911.07041 (2019).
[9] Mingxing Tan and Quoc V Le. 2019. Efficientnet: Rethinking
model scaling for convolutional neural networks. arXiv preprint
arXiv:1905.11946 (2019).
[10] Xulong Zhang, Yongwei Gao, Yi Yu, and Wei Li. 2020. Music Artist
Classification with WaveNet Classifier for Raw Waveform Audio Data.
arXiv preprint arXiv:2004.04371 (2020).