MediaEval 2018 AcousticBrainz Genre Task: A CNN Baseline Relying on Mel-Features

Hendrik Schreiber
tagtraum industries incorporated, USA
hs@tagtraum.com

ABSTRACT
These working notes describe a relatively simple baseline for the MediaEval 2018 AcousticBrainz Genre Task. As classifier it uses a fully convolutional neural network (CNN) that relies only on the low-level AcousticBrainz Mel-band features as input.

1 INTRODUCTION
We present a baseline approach for the MediaEval 2018 AcousticBrainz Genre Task. The task is defined as follows: based on provided track-level features, participants have to estimate genre labels for four different datasets (AllMusic, Discogs, Last.fm, and tagtraum), featuring four different label namespaces. Subtask 1 asks participants to train separately on each of the datasets and their respective labels and to predict those labels for separate test sets. Subtask 2 allows training on the union of all four training datasets, but still requires predictions for the four test sets in their respective label spaces. For more details about the tasks see [1].

2 APPROACH
Our baseline approach explores how well a convolutional neural network (CNN) performs when trained on a relatively small subset of the available pre-computed features. For this purpose we have chosen to train only on Mel-features. The complete code is available on GitHub (https://github.com/hendriks73/melbaseline).

2.1 Feature Selection
Traditionally, music genre recognition (MGR) has often relied on Mel-based features—in fact, one of the most often cited MGR publications uses Mel-frequency cepstral coefficients (MFCCs) [10]. Mel-based approaches attempt to capture the timbre of a track, thus allowing conjectures about its instrumentation and genre. They do not necessarily take temporal properties into account and therefore often ignore an important aspect of musical expression, which can also be used for genre/style classification, see e.g. [8]. But since we are only interested in establishing a baseline for more sophisticated systems, using just the provided Mel-band features is a reasonable approach. The low-level AcousticBrainz data (https://acousticbrainz.org/) offers nine different Mel-band features (global statistics: min, max, mean, ...) with 40 bands each, resulting in a total of 360 values per track. Because Mel-bands have a spatial relationship to each other, we organize the data into nine different channels, each containing a 40-dimensional vector, resulting in an (N, 40, 9)-dimensional tensor with N being the number of samples. Each of the 40-dimensional feature vectors is scaled so that its maximum is 1.
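To make the feature layout concrete, the following sketch shows one way such an (N, 40, 9) tensor could be assembled from AcousticBrainz low-level JSON files. The statistic keys in STATS and the helper functions are illustrative assumptions, not the submission's actual code.

    import json
    import numpy as np

    # Nine per-band statistics assumed to be read from lowlevel.melbands in the
    # AcousticBrainz JSON; the exact selection used by the submission may differ.
    STATS = ["min", "max", "mean", "median", "var", "dmean", "dmean2", "dvar", "dvar2"]

    def track_to_features(path):
        """Turn one low-level JSON file into a (40, 9) matrix: 40 bands, 9 channels."""
        with open(path) as f:
            melbands = json.load(f)["lowlevel"]["melbands"]
        x = np.stack([np.asarray(melbands[s], dtype=np.float32) for s in STATS], axis=-1)
        # Scale each 40-dimensional vector so that its maximum is 1.
        x /= np.maximum(x.max(axis=0, keepdims=True), 1e-12)
        return x

    def build_tensor(paths):
        """Stack all tracks into the (N, 40, 9) input tensor."""
        return np.stack([track_to_features(p) for p in paths])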
2.2 Neural Network
We choose a fully convolutional network (FCN) architecture, depicted in Figure 1. In essence, the network consists of four similar feature extraction blocks, each formed by a one-dimensional convolutional layer, an ELU activation function [2], a dropout layer [9] with dropout probability 0.2, an average pooling layer (omitted in the last extraction block), and lastly a batch normalization layer [4]. From block to block the number of filters increases from 64 to 512, while the length of the input decreases from 40 to 5 due to average pooling with a pool size of 2. The feature extraction blocks are followed by a classification block consisting of a one-dimensional convolution, an ELU activation function, a batch normalization layer, a global average pooling layer, and sigmoid output units. The sigmoid activation function is used for the output because the task is a multi-label, multi-class problem. Note that the number of output dimensions depends on the number of different labels in the dataset; we therefore refer to it with the placeholder OUT. The total number of parameters in each network is listed in Table 1.

Table 1: Number of network parameters per dataset.

    Dataset     Number of Parameters
    AllMusic                 918,646
    Discogs                  685,479
    Last.fm                  691,683
    tagtraum                 675,656
    Subtask 2              1,258,315

Figure 1: Schematic architecture of the neural network.

    Input 9x40
    Feature extraction:
      Conv1D 64x3, ELU  -> Dropout 0.2 -> AvgPool1D 2 -> BatchNorm
      Conv1D 128x3, ELU -> Dropout 0.2 -> AvgPool1D 2 -> BatchNorm
      Conv1D 256x3, ELU -> Dropout 0.2 -> AvgPool1D 2 -> BatchNorm
      Conv1D 512x3, ELU -> Dropout 0.2 -> BatchNorm
    Classification:
      Conv1D OUTx1, ELU -> BatchNorm -> GlobalAvgPool1D -> Sigmoid (OUT)

2.3 Training
For subtask 1 we train the network using the provided training and validation sets with binary cross-entropy as loss function, Adam [5] with a learning rate of 0.001 as optimizer, and a batch size of 1,000. To avoid overfitting we employ early stopping with a patience of 50 epochs and use the last model that still showed an improvement in its validation loss.
Because the training data is very unbalanced, we experimented with balancing the training data with respect to the main genre labels via oversampling. As this led to worse results, balancing is not part of this submission.
For subtask 2 we gently normalize the provided labels by converting them to lowercase and removing all non-alphanumeric characters. Based on these transformed labels we create a unified training set.
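The following is a minimal Keras sketch of the architecture in Figure 1 together with the training setup from Section 2.3. Layer sizes, dropout, loss, optimizer, learning rate, batch size, and patience are taken from the text; padding, function names, and the epoch limit are assumptions, so this should be read as an approximation of the submission rather than its actual code.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_model(out_dim):
        """FCN as sketched in Figure 1; out_dim corresponds to the placeholder OUT."""
        model = keras.Sequential()
        model.add(keras.Input(shape=(40, 9)))  # 40 Mel bands, 9 statistic channels
        # Four feature extraction blocks; pooling is omitted in the last one.
        for filters, pool in [(64, True), (128, True), (256, True), (512, False)]:
            model.add(layers.Conv1D(filters, 3, padding="same", activation="elu"))
            model.add(layers.Dropout(0.2))
            if pool:
                model.add(layers.AveragePooling1D(2))
            model.add(layers.BatchNormalization())
        # Classification block.
        model.add(layers.Conv1D(out_dim, 1, activation="elu"))
        model.add(layers.BatchNormalization())
        model.add(layers.GlobalAveragePooling1D())
        model.add(layers.Activation("sigmoid"))
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                      loss="binary_crossentropy")
        return model

    # Training as described in Section 2.3 (x_train, y_train, x_val, y_val are placeholders).
    # early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
    #                                            restore_best_weights=True)
    # model = build_model(out_dim=y_train.shape[1])
    # model.fit(x_train, y_train, validation_data=(x_val, y_val),
    #           batch_size=1000, epochs=1000, callbacks=[early_stop])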
2.4 Prediction
The output of the last network layer consists of as many values in the range [0, 1] as there are different labels in the dataset (OUT). If one of these values is greater than a predefined threshold, we assume that the associated label applies to the track. In order to optimize the tradeoff between precision and recall, we choose this threshold individually for each label based on the maximum F-score for predictions on the validation set [6], also known as the plug-in rule approach [3]. In case the threshold is not crossed by any prediction for a given track, we divide all predictions by their thresholds and pick the label corresponding to the largest value.
Since we are using one unified training set for subtask 2, we need to reduce its output to labels that are valid in the context of a specific test dataset. We do so by reverting the applied normalization and dropping all labels not occurring in the test dataset.
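A minimal sketch of this per-label thresholding, assuming validation predictions and binary ground truth are available as NumPy arrays; the function names and the exhaustive search over observed prediction values are illustrative, not taken from the submission code.

    import numpy as np

    def best_thresholds(y_true, y_pred):
        """Per-label threshold maximizing the F-score on the validation set (plug-in rule)."""
        thresholds = np.full(y_true.shape[1], 0.5)
        for j in range(y_true.shape[1]):
            best_f = -1.0
            for t in np.unique(y_pred[:, j]):
                predicted = y_pred[:, j] >= t
                tp = np.sum(predicted & (y_true[:, j] > 0))
                precision = tp / max(predicted.sum(), 1)
                recall = tp / max(y_true[:, j].sum(), 1)
                f = 2 * precision * recall / max(precision + recall, 1e-12)
                if f > best_f:
                    best_f, thresholds[j] = f, t
        return thresholds

    def predict_labels(y_pred, thresholds):
        """Apply the thresholds; if no label crosses its threshold, fall back to the
        label with the largest prediction-to-threshold ratio."""
        labels = y_pred >= thresholds
        for i in np.where(~labels.any(axis=1))[0]:
            labels[i, np.argmax(y_pred[i] / thresholds)] = True
        return labels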
3 RESULTS AND ANALYSIS
We evaluated a single run for both subtasks 1 and 2. Results are listed in Tables 2 and 3. As expected, all results are well below last year's winning submission [6], which used a much larger network and 2,646 features. The achieved scores are, however, competitive with last year's second-ranked submission [7], which used a similar number of features, though very different ones. Somewhat unexpectedly, the network trained for subtask 2 was not able to benefit from the additional training material and generally reaches slightly lower results than the networks trained on the individual datasets for subtask 1.

Table 2: Precision, recall and F-scores for subtask 1 (average per dataset).

                                   AllMusic  tagtraum  Last.fm  Discogs
    Track (all labels)        P      0.292    0.3587   0.3707   0.3659
                              R     0.4669    0.5074   0.4692   0.5436
                              F      0.306    0.3918    0.374   0.3972
    Track (genre labels)      P     0.6013    0.6149   0.5617   0.6937
                              R     0.6777    0.6772   0.6318   0.7522
                              F     0.6072    0.6271   0.5738   0.6902
    Track (subgenre labels)   P     0.2031     0.256    0.228   0.2049
                              R     0.3116    0.4135   0.3284   0.3662
                              F      0.216    0.2922   0.2461    0.236
    Label (all labels)        P     0.1141    0.1753   0.1993   0.1624
                              R     0.1447    0.2213   0.2261   0.2115
                              F     0.1148    0.1824   0.1977   0.1656
    Label (genre labels)      P     0.3213    0.3444   0.3645   0.4523
                              R     0.3384    0.3627   0.3812   0.4519
                              F     0.3239    0.3467   0.3661   0.4466
    Label (subgenre labels)   P     0.1083    0.1555   0.1826   0.1479
                              R     0.1392    0.2047   0.2105   0.1995
                              F     0.1089    0.1632   0.1807   0.1515

Table 3: Precision, recall and F-scores for subtask 2 (average per dataset).

                                   AllMusic  tagtraum  Last.fm  Discogs
    Track (all labels)        P      0.285     0.359   0.3411   0.3563
                              R     0.4713    0.5079   0.4773    0.544
                              F     0.3041     0.385    0.354   0.3877
    Track (genre labels)      P     0.5923     0.594   0.5068   0.6801
                              R     0.6762     0.678   0.6434   0.7484
                              F     0.6011    0.6127   0.5413   0.6802
    Track (subgenre labels)   P     0.2071    0.2486   0.2021   0.1959
                              R     0.3198    0.4131   0.3282     0.37
                              F     0.2205    0.2825   0.2261    0.229
    Label (all labels)        P     0.1127    0.1662   0.1703   0.1526
                              R     0.1466    0.2272   0.2176   0.2138
                              F     0.1141    0.1797   0.1763   0.1619
    Label (genre labels)      P     0.3174    0.3291   0.3304   0.4491
                              R     0.3395     0.363   0.3874   0.4454
                              F     0.3191    0.3398   0.3484   0.4407
    Label (subgenre labels)   P     0.1069    0.1471   0.1541   0.1378
                              R     0.1412    0.2114   0.2005   0.2022
                              F     0.1083     0.161   0.1589   0.1479

4 DISCUSSION AND OUTLOOK
We have shown that a relatively small and simple convolutional neural network (CNN) trained only on global Mel-features can achieve respectable scores in this task. Adding temporal features may improve the results further.

REFERENCES
[1] Dmitry Bogdanov, Alastair Porter, Julián Urbano, and Hendrik Schreiber. 2018. The MediaEval 2018 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources. In MediaEval 2018 Workshop. Sophia Antipolis, France.
[2] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In International Conference on Learning Representations (ICLR). arXiv preprint arXiv:1511.07289.
[3] Krzysztof Dembczynski, Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and Eyke Hüllermeier. 2013. Optimizing the F-measure in Multi-label Classification: Plug-in Rule Approach versus Structured Loss Minimization. In International Conference on Machine Learning. Atlanta, GA, USA, 1130–1138.
[4] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167 (2015).
[5] Diederik P. Kingma and Jimmy Lei Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
[6] Khaled Koutini, Alina Imenina, Matthias Dorfer, Alexander Gruber, and Markus Schedl. 2017. MediaEval 2017 AcousticBrainz Genre Task: Multilayer Perceptron Approach. In MediaEval 2017 Workshop. Dublin, Ireland.
[7] Benjamin Murauer, Maximilian Mayerl, Michael Tschuggnall, Eva Zangerle, Martin Pichl, and Günther Specht. 2017. Hierarchical Multi-label Classification and Voting for Genre Classification. In MediaEval 2017 Workshop. Dublin, Ireland.
[8] Björn Schuller, Florian Eyben, and Gerhard Rigoll. 2008. Tango or Waltz?: Putting Ballroom Dance Style into Tempo Detection. EURASIP Journal on Audio, Speech, and Music Processing 2008 (2008), 12.
[9] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[10] George Tzanetakis and Perry Cook. 2002. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing 10, 5 (2002), 293–302.