<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Features with SpecAugment and EfficientNet</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tri-Nhan</forename><surname>Do</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Minh-Tri</forename><surname>Nguyen</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hai-Dang</forename><surname>Nguyen</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Minh-Triet</forename><surname>Tran</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">John von Neumann Institute</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Xuan-Nam</forename><surname>Cao</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Features with SpecAugment and EfficientNet</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D4451274E468C6B1D14714AE0FC3CE5D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>MediaEval 2020 provided a subset of the MTG-Jamendo dataset, aimed at recognizing moods and themes in music. Team HCMUS proposes several solutions for building efficient classifiers for this problem. In addition to the provided mel-spectrogram features, new features extracted from a WaveNet model are utilized to train the EfficientNet model. As evaluated by the organizers, our best result achieved 0.142 in PR-AUC and 0.76 in ROC-AUC. With fast training and lightweight features, our proposed methods have the potential to work well with deeper neural networks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The Emotions and Themes in Music task at MediaEval <ref type="bibr" target="#b0">[1]</ref> is difficult and challenging due to the ambiguity of tags in the real world. Mood is shaped by human perception, so different people have different feelings about the same song. Moreover, this is a multi-label classification problem with 56 tags: the dataset is quite unbalanced in the distribution of mood labels, and each track can carry several labels, since many emotions can appear in the same song.</p><p>To solve this task, the authors tried many methods, varying the models, input features, and loss functions. Our best result is an ensemble of two different methods: one using the provided mel-spectrogram features with an EfficientNet model, and the other using WaveNet features with MobileNetV2 or EfficientNet-B7 models <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b8">9]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>Data augmentation is important when training neural network models. Traditional audio augmentation methods modify the speed of the waveform or perturb the original signal samples with noise, which incurs a large computational cost. The SpecAugment approach <ref type="bibr" target="#b5">[6]</ref> instead adjusts the spectrogram directly by warping it in the time direction, masking blocks of consecutive frequency channels, and masking blocks of consecutive time steps. This approach is simpler and costs less time and fewer resources.</p><p>The WaveNet model is applicable to many problems in signal processing, time-series forecasting, and music generation <ref type="bibr" target="#b3">[4]</ref>. The authors therefore also follow this line of work, using a pretrained WaveNet model to extract feature vectors from raw audio and then using those features as inputs to convolutional neural networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data analysis</head><p>As shown in Figure 1, the green part shows audio with only one mood/theme label, the yellow part shows audio with 2 to 3 moods, and the red part shows audio with more than 3 moods. The training set contains 9,949 audio samples with a total of 17,885 mood annotations. On average, each class has 319 audio samples, with a standard deviation of 202.75. The maximum number of moods for a single audio track is 8. The most frequent mood/theme is happy, with 927 tracks.</p><p>We can see that the data is extremely unbalanced, and some classes have no audio that represents them exclusively. Therefore, it is necessary to reduce the complexity of the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Data preprocessing</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">Data balance:</head><p>To reduce the ambiguity of the data, the authors try changing each audio's label from multi-label to single label, keeping the most significant tag of each audio; this reduces the standard deviation across classes and gives preference to moods with little data.</p></div>
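<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal illustration of the single-label reduction described above, the following Python sketch keeps, for each track, its rarest tag over the training set, which matches the stated preference for moods with little data; the function and variable names are illustrative and not taken from the authors' code.</p><code lang="python">
from collections import Counter

def balance_to_single_label(track_tags):
    """Reduce each track's tag list to one tag, preferring rare moods.

    track_tags: dict mapping track_id to a list of mood/theme tags.
    Returns a dict mapping track_id to a single tag.
    """
    # Global tag frequencies over the whole training set.
    counts = Counter(tag for tags in track_tags.values() for tag in tags)
    # Keep the tag with the fewest training examples, which lowers the
    # standard deviation of the per-class counts.
    return {tid: min(tags, key=lambda t: counts[t])
            for tid, tags in track_tags.items()}

# Toy usage: 'melancholic' is rarer than 'happy' here, so it is kept.
demo = {"track_1": ["happy", "melancholic"], "track_2": ["happy"]}
print(balance_to_single_label(demo))  # {'track_1': 'melancholic', 'track_2': 'happy'}
</code></div>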
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">Features preprocessing:</head><p>WaveNet features: Based on the idea of using WaveNet as a classifier for raw-waveform music audio <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b9">10]</ref>, the authors use a WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform; this model was pretrained on NSynth, a high-quality dataset of musical notes <ref type="bibr" target="#b1">[2]</ref>.</p><p>According to the dataset statistics, the minimum audio length is 30 seconds; due to the limitations of the authors' training machine, sound samples longer than 400 seconds are trimmed to their middle part. A random 30-second segment is then cut from each sample, and features are extracted from it. This approach is somewhat arbitrary and discards input data, so we plan to experiment with taking a new random crop from the 400-second audio after each epoch. The encoder output for a 30-second audio clip is 16 channels by 937 time steps.</p><p>Mel-spectrogram: Each sample has 96 channels, and the time frames are randomly cropped to 6,950 after each epoch.</p></div>
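<div xmlns="http://www.tei-c.org/ns/1.0"><p>A sketch of the trimming and cropping scheme above, assuming 16 kHz audio (consistent with the reported 16-channel by 937-step output, since 30 s of 16 kHz audio divided by the NSynth encoder's stride of 512 gives roughly 937 steps); the encoder call itself is only indicated as a hypothetical comment.</p><code lang="python">
import numpy as np

SR = 16_000  # assumed sample rate

def crop_waveform(audio, max_sec=400, crop_sec=30, rng=np.random):
    """Trim audio longer than 400 s to its middle part, then take a random 30 s crop."""
    max_len = max_sec * SR
    if len(audio) > max_len:
        start = (len(audio) - max_len) // 2   # keep the middle 400 s
        audio = audio[start:start + max_len]
    crop_len = crop_sec * SR
    offset = rng.randint(0, len(audio) - crop_len + 1)
    return audio[offset:offset + crop_len]

# The 30 s crop is then encoded by the pretrained WaveNet-style autoencoder,
# e.g. (hypothetical call) codes = encoder.encode(crop), yielding a temporal
# code of about 937 time steps with 16 channels, as described above.
</code></div>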
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Data augmentation</head><p>SpecAugment: To train models more efficiently, the authors apply SpecAugment, an augmentation method introduced by Google. This method masks blocks of consecutive time steps and channels in each mel-spectrogram. Using it improves the result significantly: PR-AUC-macro rises from 0.134 to 0.139.</p><p>Each input has a 70% chance of being augmented with SpecAugment; each augmented mel-spectrogram receives two blocks of time masking and two blocks of channel masking.</p></div>
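<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal NumPy sketch of the masking policy stated above (70% application probability, two time masks and two channel masks per mel-spectrogram); the maximum mask widths are illustrative assumptions, and the time-warping step of SpecAugment is omitted.</p><code lang="python">
import numpy as np

def spec_augment(mel, p=0.7, n_time_masks=2, n_channel_masks=2,
                 max_t=100, max_f=12, rng=np.random):
    """Mask blocks of consecutive time steps and mel channels.

    mel: array of shape (channels, time), e.g. (96, 6950).
    max_t / max_f: assumed maximum mask widths.
    """
    if rng.rand() > p:
        return mel                      # roughly 30% of inputs pass through unchanged
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    for _ in range(n_time_masks):       # blocks of consecutive time steps
        t = rng.randint(1, min(max_t, n_frames) + 1)
        t0 = rng.randint(0, n_frames - t + 1)
        mel[:, t0:t0 + t] = 0.0
    for _ in range(n_channel_masks):    # blocks of consecutive mel channels
        f = rng.randint(1, min(max_f, n_mels) + 1)
        f0 = rng.randint(0, n_mels - f + 1)
        mel[f0:f0 + f, :] = 0.0
    return mel
</code></div>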
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Deep Neural Network model</head><p>Since both mel-spectrogram and WaveNet features can be expressed as images, the authors use convolutional models such as MobileNet and EfficientNet. The mel-spectrogram features are passed to EfficientNet-B0, while the WaveNet features are passed to MobileNetV2 and EfficientNet-B7. Because the WaveNet features are not large enough to fit EfficientNet-B7's input, the authors duplicate the channels so that these features can be used.</p><p>In addition, we also tested an SVM model, InceptionNet, and ResNet, and, to capture long-term temporal characteristics, self-attention was added as in the AMLAG 2019 method <ref type="bibr" target="#b7">[8]</ref>; however, these approaches produced only a slight improvement in the results.</p></div>
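<div xmlns="http://www.tei-c.org/ns/1.0"><p>One plausible reading of the channel duplication is sketched below, with torchvision's EfficientNet-B7 as a stand-in implementation (the paper does not name its framework): the single-channel 16 x 937 WaveNet feature map is repeated to three channels before entering the network.</p><code lang="python">
import torch
from torchvision.models import efficientnet_b7  # assumed implementation choice

# WaveNet features arrive as a single-channel "image": (batch, 1, 16, 937).
features = torch.randn(2, 1, 16, 937)

# EfficientNet-B7 expects 3-channel input, so the channel is duplicated.
x = features.repeat(1, 3, 1, 1)          # shape becomes (batch, 3, 16, 937)

model = efficientnet_b7(num_classes=56)  # one output per mood/theme tag
logits = model(x)                        # shape (batch, 56)
probs = torch.sigmoid(logits)            # independent per-tag probabilities
</code></div>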
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Loss function</head><p>For the loss function, binary cross-entropy (BCE) loss is applied for both MobileNetV2 and EfficientNet. The authors also tried Focal Loss <ref type="bibr" target="#b2">[3]</ref>, since the dataset is quite unbalanced; however, it did not yield better results on our dataset after the balancing step.</p></div>
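<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, a multi-label focal loss in the sense of <ref type="bibr" target="#b2">[3]</ref> can be written on top of BCE; since the paper does not report its hyperparameters, the common gamma = 2 and alpha = 0.25 are assumed.</p><code lang="python">
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-label focal loss (Lin et al.); reduces to weighted BCE at gamma = 0."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)         # probability of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()  # down-weights easy examples
</code></div>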
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPERIMENTS AND RESULTS</head><p>Our experiments are run on a server with an NVIDIA Quadro K6000 graphics card. Methods A, B, and D were not submitted to the challenge. We observe that the data balancing method leads to better results than the original dataset with default labels. Based on experiments on the validation set, our ensemble models are weighted with factors of 0.7 and 0.3 for the mel-spectrogram and WaveNet features, respectively, which gave the best results (Table 1).</p></div>
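<div xmlns="http://www.tei-c.org/ns/1.0"><p>The late-fusion step reduces to a weighted average of the two runs' per-tag probabilities; a short sketch with illustrative array names:</p><code lang="python">
import numpy as np

def ensemble(p_mel, p_wavenet, w_mel=0.7, w_wavenet=0.3):
    """Weighted average of per-tag probabilities from the two models."""
    return w_mel * p_mel + w_wavenet * p_wavenet

# p_mel, p_wavenet: arrays of shape (n_tracks, 56) holding sigmoid outputs
# of the two component runs; the weights 0.7 / 0.3 were tuned on the
# validation set as described above.
</code></div>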
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSION AND FUTURE WORKS</head><p>The EfficientNet model was shown to be more efficient than previous models on the mood and theme classification problem. The results may be further improved by training mel-spectrogram features on larger, more complex EfficientNet variants.</p><p>Although the result when training on WaveNet features is not higher than with mel-spectrogram features, ensembling the two models improves the results, which shows that the WaveNet features capture other aspects of the dataset. Because the WaveNet features were extracted with a pretrained model, the augmentation methods could not be fully applied; as future work, further improvements are expected from training WaveNet-style autoencoder models directly on the Jamendo dataset.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Histogram of mood and theme of the training set.</figDesc><graphic coords="1,343.18,388.44,201.76,136.73" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: Overview of submission 1.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1:</head><label>1</label><figDesc>Experiment results.</figDesc><table><row><cell>Method</cell><cell>Features and Model</cell><cell>PR-AUC-macro</cell></row><row><cell>A</cell><cell>Mel-spectrogram EfficientNet-B0</cell><cell>0.127</cell></row><row><cell>B</cell><cell>Mel-spectrogram EfficientNet-B0 with data processing</cell><cell>0.134</cell></row><row><cell>C (run2)</cell><cell>Mel-spectrogram EfficientNet-B0 using augmentation</cell><cell>0.139</cell></row><row><cell>D</cell><cell>WaveNet MobileNetV2</cell><cell>0.102</cell></row><row><cell>E (run3)</cell><cell>WaveNet EfficientNet-B7</cell><cell>0.105</cell></row><row><cell>F (run1)</cell><cell>Ensemble C and D</cell><cell>0.1413</cell></row><row><cell>G (run4)</cell><cell>Ensemble C and E</cell><cell>0.1414</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This research is supported with computing infrastructure by SELAB and AILAB, University of Science, Vietnam National University - Ho Chi Minh City.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">MediaEval 2020: Emotion and theme recognition in music using Jamendo</title>
		<author>
			<persName><forename type="first">Philip</forename><surname>Tovstogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minz</forename><surname>Won</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dmitry</forename><surname>Bogdanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alastair</forename><surname>Porter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2020 Workshop</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Neural audio synthesis of musical notes with wavenet autoencoders</title>
		<author>
			<persName><forename type="first">Jesse</forename><surname>Engel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cinjon</forename><surname>Resnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sander</forename><surname>Dieleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohammad</forename><surname>Norouzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Douglas</forename><surname>Eck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>Simonyan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning. PMLR</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1068" to="1077" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Focal loss for dense object detection</title>
		<author>
			<persName><forename type="first">Tsung-Yi</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Priya</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ross</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kaiming</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Piotr</forename><surname>Dollár</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE international conference on computer vision</title>
				<meeting>the IEEE international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2980" to="2988" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Wavenet: A generative model for raw audio</title>
		<author>
			<persName><forename type="first">Aaron</forename><surname>Van Den Oord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sander</forename><surname>Dieleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Heiga</forename><surname>Zen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nal</forename><surname>Kalchbrenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Senior</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Koray</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1609.03499</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Emotion recognition from raw speech using wavenet</title>
		<author>
			<persName><forename type="first">Sandeep</forename><surname>Kumar Pandey</surname></persName>
		</author>
		<author>
			<persName><surname>Shekhawat</surname></persName>
		</author>
		<author>
			<persName><surname>Prasanna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">TENCON 2019-2019 IEEE Region 10 Conference (TENCON)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1292" to="1297" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Specaugment: A simple data augmentation method for automatic speech recognition</title>
		<author>
			<persName><forename type="first">William</forename><surname>Daniel S Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yu</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chung-Cheng</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barret</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ekin</forename><forename type="middle">D</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc V</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><surname>Le</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.08779</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Mobilenetv2: Inverted residuals and linear bottlenecks</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Sandler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Menglong</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrey</forename><surname>Zhmoginov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liang-Chieh</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="4510" to="4520" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">Manoj</forename><surname>Sukhavasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sainath</forename><surname>Adapa</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1911.07041</idno>
		<title level="m">Music theme recognition using CNN and self-attention</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Efficientnet: Rethinking model scaling for convolutional neural networks</title>
		<author>
			<persName><forename type="first">Mingxing</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc V</forename><surname>Le</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1905.11946</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Music Artist Classification with WaveNet Classifier for Raw Waveform Audio Data</title>
		<author>
			<persName><forename type="first">Xulong</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yongwei</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yi</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Li</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.04371</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
