
MediaEval 2018 AcousticBrainz Genre Task: A CNN Baseline Relying on Mel-Features

USA hs@tagtraum.com



These working notes describe a relatively simple baseline for the MediaEval 2018 AcousticBrainz Genre Task. As its classifier, it uses a fully convolutional neural network (CNN) that takes only the low-level AcousticBrainz melbands features as input.

INTRODUCTION

We present a baseline approach for the MediaEval 2018 AcousticBrainz Genre Task. The task is defined as follows:

Based on provided track-level features, participants have to estimate genre labels for four different datasets (AllMusic, Discogs, Last.fm, and tagtraum), featuring four different label namespaces. Subtask 1 asks participants to train separately on each of the datasets and their respective labels and predict those labels for separate test sets. Subtask 2 allows training on the union of all four training datasets, but still requires predictions for the four test sets in their respective label spaces. For more details about the tasks see [1].

APPROACH

Our baseline approach explores how well a convolutional neural network (CNN) performs when it is trained on only a relatively small subset of the available pre-computed features. For this purpose we have chosen to train only on Mel-features. The complete code is available on GitHub (footnote 1).

Feature Selection

Traditionally, music genre recognition (MGR) has often relied on Mel-based features; in fact, one of the most often cited MGR publications uses Mel-frequency cepstral coefficients (MFCCs) [10]. Mel-based approaches attempt to capture the timbre of a track, thus allowing conjectures about its instrumentation and genre. They do not necessarily take temporal properties into account and therefore often ignore an important aspect of musical expression, which can also be used for genre/style classification; see, e.g., [8]. But since we are only interested in finding a baseline for more sophisticated systems, using just the provided melbands features is a reasonable approach. Low-level AcousticBrainz (footnote 2) data offers nine different Mel-features (global statistics: min, max, mean, ...) with 40 bands each, resulting in a total of 360 values per track. Because Mel-bands have a spatial relationship to each other, we organize the data into nine different channels, each featuring a 40-dimensional vector, resulting in an (N, 40, 9)-dimensional tensor with N being the number of samples. Each of the 40-dimensional feature vectors is scaled so that its maximum is 1.

Footnotes:
1 https://github.com/hendriks73/melbaseline
2 https://acousticbrainz.org/

Table 1: Number of parameters per network.

Dataset      Number of Parameters
AllMusic     918,646
Discogs      685,479
Last.fm      691,683
tagtraum     675,656
Subtask 2    1,258,315
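The channel layout and scaling described in this section can be sketched in plain Python as follows. This is a minimal illustration, not the code from the repository. The helper names (`scale_to_unit_max`, `track_to_channels`, `build_tensor`) are hypothetical, only three of the nine statistic names appear in the text (the rest are placeholders), and dividing each vector by its maximum is just one reading of "scaled so that its maximum is 1".

```python
# Names of the nine global statistics. Only min, max, and mean are named in
# the text; the other six are placeholders (assumption).
STATS = ["min", "max", "mean", "stat4", "stat5", "stat6", "stat7", "stat8", "stat9"]
N_BANDS = 40


def scale_to_unit_max(vector):
    """Divide a vector by its maximum so that its largest value becomes 1."""
    peak = max(vector)
    if peak <= 0:
        # Degenerate case (assumption): leave non-positive vectors unchanged.
        return list(vector)
    return [value / peak for value in vector]


def track_to_channels(melbands):
    """Arrange nine 40-band statistics as a (40, 9) nested list: bands x channels."""
    scaled = []
    for name in STATS:
        vector = melbands[name]
        if len(vector) != N_BANDS:
            raise ValueError(f"expected {N_BANDS} bands for '{name}', got {len(vector)}")
        scaled.append(scale_to_unit_max(vector))
    return [[channel[band] for channel in scaled] for band in range(N_BANDS)]


def build_tensor(tracks):
    """Stack N tracks into an (N, 40, 9) nested list."""
    return [track_to_channels(track) for track in tracks]


# Toy track with made-up values, used only to show the resulting shape.
track = {
    name: [float(band + 1) * (index + 1) for band in range(N_BANDS)]
    for index, name in enumerate(STATS)
}
tensor = build_tensor([track])
print(len(tensor), len(tensor[0]), len(tensor[0][0]))  # 1 40 9
```

Keeping the band axis as the "spatial" dimension and the statistics as channels is what allows a one-dimensional convolution to slide across neighboring Mel-bands.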

Neural Network

We choose to use the fully convolutional network (FCN) architecture depicted in Figure 1. In essence, the network consists of four similar feature extraction blocks, each formed by a one-dimensional convolutional layer, an ELU activation function [2], a dropout layer [9] with dropout probability 0.2, an average pooling layer (omitted in the last extraction block), and lastly a batch normalization layer [4]. From block to block the number of filters is increased from 64 to 512 as the length of the input decreases from 40 to 5 due to average pooling with a pool size of 2. The feature extraction blocks are followed by a classification block consisting of a one-dimensional convolution, an ELU activation function, a batch normalization layer, a global average pooling layer, and sigmoid output units. The sigmoid activation function is used for the output because the task is a multi-label multi-class problem. Note that the number of output dimensions depends on the number of different labels in the dataset. We therefore refer to it with the placeholder OUT. The total number of parameters in each network is listed in Table 1.
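To make the shape bookkeeping of the four feature extraction blocks concrete, the following sketch traces the (length, channels) shape through the network. It rests on two assumptions that the text does not state: the filter counts double from block to block (64, 128, 256, 512), and the convolutions use length-preserving padding, so that only the pooling layers shorten the sequence. The function name `trace_feature_blocks` is hypothetical.

```python
def trace_feature_blocks(input_length=40, input_channels=9,
                         filters=(64, 128, 256, 512), pool_size=2):
    """Return the (length, channels) shape at the input and after each block."""
    shapes = [(input_length, input_channels)]
    length = input_length
    last_block = len(filters) - 1
    for index, n_filters in enumerate(filters):
        # Convolution (length-preserving padding assumed): only channels change.
        # Average pooling divides the length by pool_size, except in the last block.
        if index != last_block:
            length //= pool_size
        shapes.append((length, n_filters))
    return shapes


for shape in trace_feature_blocks():
    print(shape)
```

Under these assumptions the length shrinks 40, 20, 10, 5 and stays at 5 in the last block, matching the reduction from 40 to 5 described in the text. The global average pooling layer in the classification block then collapses the remaining length, leaving OUT values per track.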

Training

For subtask 1 we train the network using the provided training and validation sets with binary cross-entropy as the loss function, Adam [5] with a learning rate of 0.001 as the optimizer, and a batch size of 1,000. To avoid overfitting we employ early stopping with a patience of 50 epochs and use the last model that still showed an improvement in its validation loss.
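The early stopping rule can be illustrated independently of any deep learning framework. The sketch below takes a list of per-epoch validation losses and reports which epoch's model would be kept and at which epoch training would stop. The function name `early_stopping` and the toy loss values are hypothetical; the actual training loop is not described in this level of detail.

```python
def early_stopping(val_losses, patience=50):
    """Return (best_epoch, last_epoch) for a sequence of validation losses.

    best_epoch is the last epoch that improved the validation loss (the model
    that is kept); last_epoch is the epoch at which training stops.
    """
    best_loss = float("inf")
    best_epoch = None
    last_epoch = None
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            # Improvement: remember this model.
            best_loss, best_epoch = loss, epoch
        elif best_epoch is not None and epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop.
            break
    return best_epoch, last_epoch


# Toy example with a short patience so that stopping is visible.
losses = [1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.76, 0.77]
print(early_stopping(losses, patience=3))  # (4, 7)
```

With a patience of 50, as used here, training continues for 50 epochs beyond the best model before giving up, which trades extra training time for robustness against noisy validation losses.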

Because the training data is very unbalanced, we experimented with balancing the training data with respect to the main genre labels via oversampling. As this led to worse results, balancing is not part of this submission.

For subtask 2 we gently normalize the provided labels by converting them to lowercase and removing all non-alphanumeric characters. Based on these transformed labels we create a unified training set.
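A minimal sketch of this label normalization follows. The function names are hypothetical, and the example labels are made up. Note that Python's `str.isalnum` also keeps non-ASCII letters and digits; whether the original implementation does the same is an assumption. Building the shared label vocabulary as the union of all normalized labels is likewise only one plausible reading of "unified training set".

```python
def normalize_label(label):
    """Lowercase a label and drop every non-alphanumeric character."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def build_unified_vocabulary(labels_by_dataset):
    """Collect the normalized labels of all datasets into one sorted vocabulary."""
    vocabulary = set()
    for labels in labels_by_dataset.values():
        vocabulary.update(normalize_label(label) for label in labels)
    return sorted(vocabulary)


# Made-up labels: differently spelled variants collapse to a single entry.
example = {
    "discogs": ["Hip Hop", "Rock"],
    "lastfm": ["hip-hop", "jazz"],
}
print(build_unified_vocabulary(example))  # ['hiphop', 'jazz', 'rock']
```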

Tables 2 and 3: Precision (P), recall (R), and F-score (F) for the datasets AllMusic, tagtraum, Last.fm, and Discogs, averaged per track and per label, each for all labels, genre labels, and subgenre labels.

The output of the last network layer consists of as many values in the range [0, 1] as we have different labels in the dataset (OUT). If one of these values is greater than a predefined threshold, we assume that the associated label is applicable to the track. In order to optimize the tradeoff between precision and recall, we choose this threshold individually for each label based on the maximum F-score for predictions on the validation set [6], also known as the plug-in rule approach [3]. If no prediction for a given track crosses its threshold, we divide all predictions by their thresholds and pick the label corresponding to the largest value.
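The per-label thresholding and the fallback rule can be sketched as follows. This is an illustration under assumptions: the candidate threshold grid, the function names, and the toy scores are all hypothetical, and thresholds are assumed to be strictly positive so that the division in the fallback is well defined.

```python
def f_score(y_true, y_pred):
    """F1 score for binary ground truth and binary predictions of one label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(scores, y_true, candidates):
    """Pick the candidate threshold with the highest validation F-score."""
    best_t, best_f = candidates[0], -1.0
    for threshold in candidates:
        f = f_score(y_true, [score > threshold for score in scores])
        if f > best_f:
            best_t, best_f = threshold, f
    return best_t


def predict_labels(track_scores, thresholds, labels):
    """Return all labels whose score exceeds their threshold.

    If none does, divide each score by its threshold (assumed > 0) and
    return the single label with the largest ratio.
    """
    chosen = [label for label, score, threshold
              in zip(labels, track_scores, thresholds) if score > threshold]
    if chosen:
        return chosen
    ratios = [score / threshold for score, threshold in zip(track_scores, thresholds)]
    return [labels[ratios.index(max(ratios))]]


# Toy validation scores for one label; 0.5 separates positives from negatives.
print(best_threshold([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0], [0.1, 0.5, 0.85]))  # 0.5
# No score crosses its threshold, so the largest score/threshold ratio wins.
print(predict_labels([0.2, 0.3], [0.5, 0.9], ["rock", "jazz"]))  # ['rock']
```

Dividing by the threshold in the fallback puts labels with different thresholds on a comparable scale, so a label with a low threshold is not penalized for producing generally low scores.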

Since we are using one unified training set for subtask 2, we need to reduce its output to labels that are valid in the context of a specific test dataset. We do so by reverting the applied normalization and dropping all labels not occurring in the test dataset.
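One way to revert the normalization and filter the predictions is a lookup table from normalized labels to the original labels of the target dataset. The sketch below assumes the normalization consists of lowercasing and removing non-alphanumeric characters; the function names and example labels are hypothetical.

```python
def normalize_label(label):
    """Lowercase a label and drop every non-alphanumeric character."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def restrict_to_dataset(predicted_normalized, dataset_labels):
    """Map normalized predictions back to a dataset's own labels.

    Predictions without a counterpart in the dataset are dropped. If several
    original labels share one normalized form, all of them are returned.
    """
    originals = {}
    for label in dataset_labels:
        originals.setdefault(normalize_label(label), []).append(label)
    restored = []
    for normalized in predicted_normalized:
        restored.extend(originals.get(normalized, []))
    return restored


# Made-up example: "jazz" does not occur in the target dataset and is dropped.
print(restrict_to_dataset(["hiphop", "jazz", "rock"], ["Hip Hop", "Rock"]))
# ['Hip Hop', 'Rock']
```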

RESULTS AND ANALYSIS

We evaluated a single run for both subtask 1 and 2. Results are listed in Tables 2 and 3. As expected, all results are well below last year’s winning submission [6], which used a much larger network and 2,646 features. But the achieved scores are competitive with last year’s second-ranked submission [7], which used a similar number of features, though very different ones. Somewhat unexpectedly, the network trained for subtask 2 was not able to benefit from the additional training material and generally reaches slightly lower results than the networks trained on individual datasets for subtask 1.

We have shown that a relatively small and simple convolutional neural network (CNN) trained only on global Mel-features can achieve respectable scores in this task. Adding temporal features may improve the results further.

REFERENCES

[1] Dmitry Bogdanov, Alastair Porter, Julián Urbano, and Hendrik Schreiber. 2018. The MediaEval 2018 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources. In MediaEval 2018 Workshop. Sophia Antipolis, France.

[2] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR). arXiv preprint arXiv:1511.07289.

[3] Krzysztof Dembczynski, Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and Eyke Hüllermeier. 2013. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In International Conference on Machine Learning. Atlanta, GA, USA, 1130-1138.

[4] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167 (2015).

[5] Diederik Kingma and Jimmy Lei Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[6] Khaled Koutini, Alina Imenina, Matthias Dorfer, Alexander Gruber, and Markus Schedl. 2017. MediaEval 2017 AcousticBrainz Genre Task: Multilayer Perceptron Approach. In MediaEval 2017 Workshop. Dublin, Ireland.

[7] Benjamin Murauer, Maximilian Mayerl, Michael Tschuggnall, Eva Zangerle, Martin Pichl, and Günther Specht. 2017. Hierarchical Multilabel Classification and Voting for Genre Classification. In MediaEval 2017 Workshop. Dublin, Ireland.

[8] Björn Schuller, Florian Eyben, and Gerhard Rigoll. 2008. Tango or Waltz?: Putting Ballroom Dance Style into Tempo Detection. EURASIP Journal on Audio, Speech, and Music Processing 2008 (2008), 12.

[9] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929-1958.

[10] George Tzanetakis and Perry Cook. 2002. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing 10, 5 (2002), 293-302.