MediaEval 2019 Emotion and Theme Recognition task: A VQ-VAE Based Approach

Hsiao-Tzu Hung†1, Yu-Hua Chen†1, Maximilian Mayerl3, Michael Vötter3, Eva Zangerle3, Yi-Hsuan Yang1,2
1 Taiwan AI Labs, 2 Research Center for IT Innovation, Academia Sinica, Taiwan, 3 Universität Innsbruck, Austria
fbiannahung@gmail.com, r08946011@ntu.edu.tw, maximilian.mayerl@uibk.ac.at, michael.voetter@uibk.ac.at, eva.zangerle@uibk.ac.at, affige@gmail.com

† The two authors contributed equally to this work

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’19, 27-29 October 2019, Sophia Antipolis, France

ABSTRACT
In this paper, we, the Taiinn (Taiwan) team, use a pre-trained VQ-VAE as a feature extractor and compare two types of classifiers for audio-based emotion and theme recognition. The VQ-VAE is pre-trained on the Million Song Dataset (MSD). We obtained better ROC-AUC by fixing the pre-trained parameters of the VQ-VAE while training the classifier. In addition, an embedding with a larger shape works better than its one-dimensional counterpart. The code and submitted models can be found at: https://github.com/annahung31/moodtheme-tagging.

1 INTRODUCTION
This paper describes our submission to the MediaEval 2019 Emotion and Theme recognition task [2]. The goal is to automatically assign emotion and theme tags to audio clips, using a data collection from Jamendo, a platform for copyright-free music. The task can be considered a multi-label music auto-tagging problem [6].

Recently, the vector-quantized variational auto-encoder (VQ-VAE) [8] has been shown to be effective for image and audio generation. It learns a quantized representation of its input in an unsupervised way. This motivates us to study the use of VQ-VAE for classification problems such as the one involved in the MediaEval 2019 Emotion and Theme task. While our work remains preliminary, it seems that no previous work has used VQ-VAE for auto-tagging problems.

2 APPROACH

2.1 Third-party dataset
Besides the Jamendo dataset prepared by the task organizers, we also use the Million Song Dataset (MSD) [1] and the MagnaTagATune (MTAT) dataset [4] in our work. The number of samples in the two datasets can be found in Table 1. We use MSD only for pre-training the VQ-VAE model, so we split that dataset into training and validation sets only. As for MTAT, we use it as a second test set (in addition to Jamendo) for testing the VQ-VAE, and hence we split it into training, validation, and test sets. We only consider the top-50 tags (mostly genre and instrument tags [3]) for MTAT.

Table 1: Number of audio samples of third-party datasets in the train, validation and test splits we made

              Train      Validation   Test
MSD [1]       557,315    37,008       0
MTAT [4]      16,776     1,339        2,651

2.2 Input feature
We use librosa [5] to extract 128-dimensional log-mel spectrograms from the audio files. The sampling rate is set to 22,050 Hz, and only the first 1,024 frames of each clip are used, leading to a fixed-size matrix of 128 × 1024 per clip.

2.3 Neural networks

2.3.1 VQ-VAE as feature extractor. We use the VQ-VAE as a feature extractor to get a discrete embedding from mel-spectrograms. The VQ-VAE consists of an encoder and a decoder. The encoder contains 5 convolutional layers, followed by two residual 3 × 3 blocks, all having 256 feature maps. The kernel size and the stride of the first 4 layers are (4,3) and (2,1), and those of the fifth layer are (5,4) and (1,2). The paddings of the five layers are (1,2), (1,4), (1,8), (1,16), and (0,1), respectively. The dilations are the same as the paddings. As a result, the encoder generates an embedding of shape 256 × 4 × 512. The decoder consists of two residual 3 × 3 blocks, followed by 5 transposed convolutional layers. The kernel size, stride and padding are (4,4), (1,2), and (0,1) for the first layer, and (4,3), (2,1), and (0,1) for the second layer. For the remaining three layers, the kernel size, stride and padding are (4,3), (2,1), and (1,1). At the end of the decoder, a tanh activation function is used. We call this the Type-1 VQ-VAE.

To observe how the dimension of the embedding affects tagging performance, we implement an alternative that uses an (8,4) kernel for the fifth layer of the encoder, making the shape of the embedding 256 × 1 × 512. We may view it as a sequence of 256-dimensional feature vectors. We call this one the Type-2 VQ-VAE.

2.3.2 Classifiers. We use two kinds of classifiers. The first one is a GRU classifier with 2 bi-directional gated recurrent units (GRUs). After the first GRU, layer normalization is applied. The output hidden states of the second GRU then go through a fully-connected layer and a sigmoid activation layer to produce the predictions. The second one is a CNN (convolutional neural network) classifier. The model structure of the CNN classifier is basically the same as that proposed in [7], with the number of channels halved.

Figure 1: Schematic architecture of the proposed neural network and training procedure. [Step 1: encoder, embedding space, decoder. Step 2: encoder followed by the CNN classifier or the GRU classifier, each producing an output.]

2.4 Training
The training procedure, as depicted in Figure 1, is composed of two steps. In step-1, we pre-train the VQ-VAE on MSD by minimizing the reconstruction error. In step-2, we cascade the encoder of the VQ-VAE trained in step-1 with a classifier (a GRU- or CNN-based one), and train the network with the binary cross-entropy loss for genre, mood or theme recognition (depending on the dataset). During training, we set the batch size to 12 and the learning rate to 2e-4. The Adam optimizer is used to train the models. The networks are trained for a maximum of 100 epochs with early stopping.

2.5 Methods
We submit the following five runs:
• Run-1: Type-1 VQ-VAE + GRU; updating both VQ-VAE and GRU during step-2 training.
• Run-2: Type-1 VQ-VAE + GRU; fixing VQ-VAE and updating only the GRU during step-2 training.
• Run-3: Type-1 VQ-VAE + CNN; updating both VQ-VAE and CNN during step-2 training.
• Run-4: Type-1 VQ-VAE + CNN; fixing VQ-VAE and updating only the CNN during step-2 training.
• Run-5: Type-2 VQ-VAE + GRU; updating both VQ-VAE and GRU during step-2 training.

3 RESULTS AND ANALYSIS

3.1 Auto-tagging on MTAT
To verify the effectiveness of the VQ-VAE based classification method, we first evaluate the Run-1 method on MTAT for auto-tagging. Specifically, in step-2 training, we update the Type-1 VQ-VAE (pre-trained on MSD) along with the GRU classifier on MTAT and observe the tagging performance. It turns out that the model attains a ROC-AUC of 0.90 when predicting the top-50 tags, which is close to the performance of state-of-the-art models [6].

3.2 Mood & theme classification on Jamendo
The results on the Jamendo dataset are shown in Table 2. We can see that, in terms of ROC-AUC, Run-2 outperforms Run-1, and Run-4 outperforms Run-3. This may indicate that it is better to fix the VQ-VAE when training the classifiers. We can also see that the CNN classifier seems to perform slightly better than the GRU classifier, and that the Type-1 VQ-VAE seems to work better than its Type-2 counterpart. The best ROC-AUC, 0.7207, is obtained by Run-4. Yet, it is still below the 0.7258 of VGG-ish, which represents a strong baseline.

Table 2: Testing (first seven rows) and validation (last five) scores on the MediaEval’19 Jamendo dataset.

              ROC-AUC   PR-AUC   F1(macro)   F1(micro)
Testing
Popularity    0.5000    0.0320   0.0570      0.0030
VGG-ish       0.7258    0.1077   0.1657      0.1771
Run-1         0.7103    0.0984   0.1183      0.1439
Run-2         0.7141    0.1037   0.0901      0.1184
Run-3         0.7147    0.0994   0.1013      0.1233
Run-4         0.7207    0.1077   0.1068      0.1522
Run-5         0.6916    0.0860   0.0884      0.1209
Validation
Run-1         0.6829    0.0717   0.0891      0.1161
Run-2         0.6973    0.0782   0.0838      0.1201
Run-3         0.6928    0.0746   0.0921      0.1227
Run-4         0.6966    0.0770   0.0851      0.1142
Run-5         0.6662    0.0608   0.0746      0.0899

4 SUMMARY AND OUTLOOK
In this paper, we have reported a preliminary attempt to use a pre-trained VQ-VAE model for music auto-tagging problems. From the evaluation results, it seems that either the approach is not that promising for discriminative tasks, or that we have not fully capitalized on its potential. We would like to further develop this approach in the near future, for both discriminative and generative problems in music (e.g., to generate music in the audio domain).

REFERENCES
[1] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. 2011. The million song dataset. In Proc. International Society for Music Information Retrieval Conference (ISMIR).
[2] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. 2019. MediaEval 2019: Emotion and theme recognition in music using Jamendo. In MediaEval 2019 Workshop.
[3] Keunwoo Choi. 2017. List of automatic music tagging research articles that are evaluated against MagnaTagATune Dataset. https://github.com/keunwoochoi/magnatagatune-list. (2017). Online; accessed 29 September 2019.
[4] Edith Law, Kris West, Michael I. Mandel, Mert Bay, and J. Stephen Downie. 2009. Evaluation of algorithms using games: The case of music tagging. In Proc. International Society for Music Information Retrieval Conference (ISMIR).
[5] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. 2015. librosa: Audio and music signal analysis in Python. In Proc. Python in Science Conf. 18–25. [Online] https://librosa.github.io/librosa/.
[6] Juhan Nam, Keunwoo Choi, Jongpil Lee, Szu-Yu Chou, and Yi-Hsuan Yang. 2019. Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from Bach. IEEE Signal Processing Magazine 36, 1 (2019), 41–51.
[7] Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann, and Xavier Serra. 2018. End-to-end learning for music audio tagging at scale. In Proc. International Society for Music Information Retrieval Conference (ISMIR).
[8] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. In Proc. Conference on Neural Information Processing Systems (NIPS).
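
The embedding shapes quoted in Section 2.3.1 can be checked with a few lines of arithmetic. The sketch below applies the standard output-size formula for a strided, dilated convolution to the 128 × 1024 log-mel input of Section 2.2. It is an illustration under assumptions, not the submitted code: the function names are ours, the first axis is taken to be the mel axis, and the dilation of the fifth layer is assumed to be (1,1), since a dilation of 0 is not valid and the text only states that dilation equals padding.

```python
def conv_out(size, kernel, stride, padding, dilation):
    """Output length along one axis of a convolution (floor rounding)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def encoder_shape(fifth_kernel, height=128, width=1024):
    """Propagate a (mel bins, frames) shape through the five encoder layers."""
    paddings = [(1, 2), (1, 4), (1, 8), (1, 16)]
    # Layers 1-4: kernel (4,3), stride (2,1), dilation equal to padding.
    layers = [((4, 3), (2, 1), p, p) for p in paddings]
    # Layer 5: stride (1,2), padding (0,1); dilation (1,1) is our assumption.
    layers.append((fifth_kernel, (1, 2), (0, 1), (1, 1)))
    for kernel, stride, padding, dilation in layers:
        height = conv_out(height, kernel[0], stride[0], padding[0], dilation[0])
        width = conv_out(width, kernel[1], stride[1], padding[1], dilation[1])
    return height, width


type1 = encoder_shape(fifth_kernel=(5, 4))  # Type-1 VQ-VAE
type2 = encoder_shape(fifth_kernel=(8, 4))  # Type-2 VQ-VAE
print("Type-1:", (256,) + type1)
print("Type-2:", (256,) + type2)
```

Under these assumptions the script reproduces the reported shapes, 256 × 4 × 512 for Type-1 and 256 × 1 × 512 for Type-2. The first four layers halve the mel axis each time (128 to 8) while the dilated kernels leave the time axis at 1,024; the fifth layer then reduces the mel axis to 4 or 1 and halves the time axis.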
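
For readers unfamiliar with how a VQ-VAE [8] turns encoder outputs into the discrete embedding mentioned in Section 2.3.1, the following minimal sketch shows the nearest-neighbour codebook lookup at the heart of vector quantization. Everything concrete here is hypothetical: the paper does not state the codebook size, and we use a toy codebook of three 2-dimensional entries instead of 256-dimensional vectors purely for readability.

```python
def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry."""
    indices = []
    for v in vectors:
        # Squared Euclidean distance from v to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


# Toy values (hypothetical): 3 codebook entries, 4 encoder output vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
encoder_out = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7), (0.6, 0.6)]

codes = quantize(encoder_out, codebook)      # discrete indices
quantized = [codebook[i] for i in codes]     # vectors passed on downstream
print(codes)
```

In the setting of this paper, each of the 4 × 512 (Type-1) or 1 × 512 (Type-2) positions would carry one such 256-dimensional vector.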
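
Section 2.4 trains the classifiers with a binary cross-entropy loss on sigmoid outputs, which treats every tag as an independent yes/no decision. The sketch below computes that loss for a single clip directly from the pre-sigmoid scores, using the usual numerically stable rearrangement. The function name, the three-tag example and its values are hypothetical and only meant to make the multi-label setup concrete.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross entropy over tags, computed from pre-sigmoid scores."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical clip with three tags; a clip may carry several tags at once.
logits = [2.0, -1.0, 0.0]   # classifier scores before the sigmoid
targets = [1, 0, 1]         # ground-truth tag indicators
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Because each tag contributes its own term, the same loss applies unchanged whether the tags are genres, moods or themes.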