
MediaEval 2017 AcousticBrainz Genre Task: Multilayer Perceptron Approach

Khaled Koutini

khaled.koutini@jku.at

Alina Imenina

Matthias Dorfer

Alexander Rudolf Gruber

Markus Schedl

markus.schedl@jku.at

Johannes Kepler University Linz, Austria

2017

13 15

This report describes the approach developed by the JKU team for the MediaEval 2017 AcousticBrainz Genre Task. After experimenting with various classifiers on the development dataset, our final approach is based on multilayer perceptron classifiers.

1 INTRODUCTION

We present an approach for recognizing the genre of unknown music recordings, given the data provided in the AcousticBrainz dataset [5]. Details about the data, task, and evaluation are described in [2]. Our work addresses both subtasks of the MediaEval 2017 AcousticBrainz Genre Task. For the single-source classification subtask, a multilayer perceptron is applied to each source. For the multiple-source classification subtask, we use similarity measures between sources to adjust the probability of a recording belonging to a certain genre in each source.

2 APPROACH

Using the script provided by the organizers, we split the ground truth of each source into a training set and a validation set, comprising 80% and 20% of the original data respectively. The split also ensures that recordings from the same recording group never appear in both the training and the validation set, in order to avoid the album effect.
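The split itself was done with the organizers' script, which is not reproduced here. As a rough illustration of what a group-aware 80/20 split involves, the following minimal sketch assigns whole recording groups to one side or the other. The function name `group_split`, the input mapping, and the shuffling strategy are assumptions made for this example only.

```python
import random
from collections import defaultdict


def group_split(recording_to_group, train_fraction=0.8, seed=0):
    """Split recordings so that no recording group spans both sets.

    recording_to_group: dict mapping a recording id to its group id
    (hypothetical input format, chosen for this sketch).
    """
    groups = defaultdict(list)
    for recording, group in recording_to_group.items():
        groups[group].append(recording)

    # Shuffle the groups, not the recordings, so groups stay intact.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    target = train_fraction * len(recording_to_group)
    train, validation = [], []
    for group in group_ids:
        # Fill the training set group by group until it reaches the target size.
        bucket = train if len(train) < target else validation
        bucket.extend(groups[group])
    return train, validation
```

Because whole groups are assigned at once, the resulting proportions are only approximately 80/20 when group sizes vary.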

2.1 Feature Selection

As stated in the overview paper [2], we are given for each recording a set of features extracted using Essentia [3]. Given the large number of provided features, a fine-grained manual inspection of individual features is not feasible. Instead, we pick broad feature groups high in the Essentia feature group hierarchy. Namely, we use all low-level features, all rhythm features except beats_count and beats_position, and all tonal features except chords_key, chords_scale, key_key and key_scale. Overall, this yields 2646 numerical features per recording.
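The selection above can be sketched as a recursive flattening of the nested per-recording feature dictionary. This is an illustrative sketch only: it assumes the features are available as a nested dict with top-level groups named `lowlevel`, `rhythm` and `tonal`, and the helper names are hypothetical. Non-numeric values are skipped, and nested lists are unrolled into individual numbers.

```python
# Features named in the text as excluded, written as "group.feature".
EXCLUDED = frozenset({
    "rhythm.beats_count", "rhythm.beats_position",
    "tonal.chords_key", "tonal.chords_scale",
    "tonal.key_key", "tonal.key_scale",
})
GROUPS = ("lowlevel", "rhythm", "tonal")  # assumed top-level group names


def numeric_leaves(value):
    """Yield every number inside a scalar or an arbitrarily nested list."""
    if isinstance(value, bool):
        return
    if isinstance(value, (int, float)):
        yield float(value)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from numeric_leaves(item)


def flatten(node, prefix, excluded):
    """Collect numeric values below `node` in a fixed (sorted-key) order."""
    values = []
    for key in sorted(node):
        name = f"{prefix}.{key}"
        if name in excluded:
            continue
        child = node[key]
        if isinstance(child, dict):
            values.extend(flatten(child, name, excluded))
        else:
            values.extend(numeric_leaves(child))
    return values


def feature_vector(recording):
    """Build one flat feature vector from the selected groups."""
    vector = []
    for group in GROUPS:
        vector.extend(flatten(recording.get(group, {}), group, EXCLUDED))
    return vector
```

Sorting the keys keeps the feature order identical across recordings, which matters once the vectors are stacked into a matrix.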

2.2 Neural Network

We tried various neural network architectures and compared their performance based on the mean label-wise F-score of batches, using the Lastfm dataset. The best-performing architecture is outlined in Table 1.

2.2.1 Input layer. As stated in Section 2.1, there are 2646 input features. We normalized the input using z-score normalization.
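As a minimal sketch of z-score normalization, the snippet below standardizes each feature column to zero mean and unit variance. It assumes that the statistics are estimated on the training set and reused for other data, which the text does not state explicitly; the function names and the `eps` guard for constant features are likewise choices made for this example.

```python
import numpy as np


def fit_zscore(train_features, eps=1e-8):
    """Estimate per-feature mean and standard deviation on the training data."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    # Guard against division by zero for constant features (assumption).
    return mean, np.maximum(std, eps)


def apply_zscore(features, mean, std):
    """Normalize features with previously estimated statistics."""
    return (features - mean) / std
```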

Table 1: Best-performing neural network architecture.

Input:         2646
First layer:   4000 Dense(ReLU) + Drop-Out(0.5) | 4000 Dense(tanh) + Drop-Out(0.5) | 4000 Dense(sigmoid) + Drop-Out(0.5)
Second layer:  Concat layer
               8000 Dense + Drop-Out(0.6)
               Batch-Normalization layer
               Non-linearity (ReLU)
Output layer:  k-bins sigmoid

2.2.2 First hidden layer. The first hidden layer is a dense layer consisting of 12000 units, where the first 4000 units have a rectified linear [4] activation function, the next 4000 units have a tanh activation function, and the last 4000 units have a sigmoid activation function. As shown in Table 1, each group of units is followed by a dropout layer with a dropout probability of 0.5.

2.2.3 Second hidden layer. The second hidden layer consists of 8000 batch-normalized rectified linear units. As input to this layer, we concatenate the outputs of the three groups of the first layer. The dense part of the second layer has no activation function or bias of its own; as outlined in Table 1, it is followed by dropout with a probability of 0.6, batch normalization, and the ReLU non-linearity.

2.2.4 Output layer. The output layer consists of k units, where k is source-specific and denotes the number of labels (genres or sub-genres) of the source. The activation function of the output layer is sigmoid.
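To make the layer wiring concrete, the following NumPy sketch runs an inference-time forward pass through an architecture of this shape. It is not the authors' implementation: the weights are random placeholders, dropout is treated as the identity (as is usual at inference time), batch normalization uses stored statistics, and all function and parameter names are hypothetical. In the setting described above the sizes would be `n_in=2646`, `n_branch=4000` and `n_hidden=8000`; much smaller sizes are advisable when trying the sketch out.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(n_in, n_branch, n_hidden, n_labels, seed=0):
    """Random placeholder weights; real weights would come from training."""
    rng = np.random.default_rng(seed)

    def weights(rows, cols):
        return rng.normal(0.0, 0.01, size=(rows, cols))

    params = {}
    for name in ("relu", "tanh", "sigmoid"):
        params[f"W_{name}"] = weights(n_in, n_branch)
        params[f"b_{name}"] = np.zeros(n_branch)
    # Second layer: dense without bias, followed by batch normalization.
    params["W_hidden"] = weights(3 * n_branch, n_hidden)
    params["bn_mean"] = np.zeros(n_hidden)
    params["bn_var"] = np.ones(n_hidden)
    params["bn_gamma"] = np.ones(n_hidden)
    params["bn_beta"] = np.zeros(n_hidden)
    params["W_out"] = weights(n_hidden, n_labels)
    params["b_out"] = np.zeros(n_labels)
    return params


def forward(x, p, eps=1e-5):
    """Inference pass for a batch x of shape (batch, n_in)."""
    # First hidden layer: three parallel groups with different activations.
    branches = [
        np.maximum(x @ p["W_relu"] + p["b_relu"], 0.0),
        np.tanh(x @ p["W_tanh"] + p["b_tanh"]),
        sigmoid(x @ p["W_sigmoid"] + p["b_sigmoid"]),
    ]
    # Second hidden layer: concat -> dense (no bias) -> batch norm -> ReLU.
    h = np.concatenate(branches, axis=1) @ p["W_hidden"]
    h = p["bn_gamma"] * (h - p["bn_mean"]) / np.sqrt(p["bn_var"] + eps) + p["bn_beta"]
    h = np.maximum(h, 0.0)
    # Output layer: one sigmoid unit per label.
    return sigmoid(h @ p["W_out"] + p["b_out"])
```

Each output column can be read independently, since sigmoid outputs do not have to sum to one across labels.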

2.2.5 Loss function. We used mean binary cross-entropy as the loss function for the network.
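For reference, mean binary cross-entropy over a batch of multi-label targets can be written as the short NumPy function below. This is a sketch rather than the training code used for the experiments; the clipping constant `eps` is an assumption added for numerical stability.

```python
import numpy as np


def mean_binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Average binary cross-entropy over all recordings and labels.

    y_true: array of 0/1 targets, shape (batch, k)
    y_prob: array of predicted probabilities, same shape
    """
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(losses.mean())
```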

2.3 Adjusting Threshold

The output of our neural network consists of k numerical values for each recording, as stated in Section 2.2.4. Each output lies in the range [0, 1] and represents the probability of the label (genre or sub-genre) corresponding to the respective output neuron. If the probability of a label for a given recording is larger than a predefined threshold, we assign that label to the recording. In our experiments, a threshold of 0.5 for all labels results in high precision but low recall. Since the goal of the task is to optimize precision, recall and F-score, we adjusted the threshold to obtain the best values for these evaluation measures. The best results are obtained either with a static threshold of 0.2 or 0.3 for all labels, or with a dynamic threshold for each label, estimated by maximizing the mean F-score.

The second part of the task [2] consisted of combining information from multiple sources to predict the labels of one source. To achieve this, we calculate the similarity between every label of one source and every label of all other sources. These similarities allow us to adjust the probability of assigning a source label to a recording using the label probabilities produced by models trained on the other sources. As stated in the overview paper [2], the datasets of different sources intersect. We exploit this intersection to estimate the similarity between the labels of different sources.
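The thresholding step described at the start of this section can be sketched as follows. The sketch picks, for each label, the candidate threshold with the highest F-score on held-out data; since the mean label-wise F-score is a sum of per-label terms, this also maximizes the mean. The candidate grid, the use of validation data, and all function names are assumptions made for this example.

```python
import numpy as np


def f_score(y_true, y_pred):
    """F1 score for one label; y_true and y_pred are boolean vectors."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def dynamic_thresholds(y_true, y_prob, candidates=np.arange(0.05, 1.0, 0.05)):
    """Choose one threshold per label by maximizing that label's F-score.

    y_true: boolean array, shape (recordings, k)
    y_prob: predicted probabilities, same shape
    """
    thresholds = np.empty(y_prob.shape[1])
    for label in range(y_prob.shape[1]):
        scores = [f_score(y_true[:, label], y_prob[:, label] > t) for t in candidates]
        thresholds[label] = candidates[int(np.argmax(scores))]
    return thresholds


def assign_labels(y_prob, thresholds):
    """Assign a label when its probability exceeds its threshold.

    `thresholds` may be a scalar (static) or a length-k vector (dynamic).
    """
    return y_prob > thresholds
```

A static run corresponds to calling `assign_labels(y_prob, 0.2)` or `assign_labels(y_prob, 0.3)`, while a dynamic run passes the vector returned by `dynamic_thresholds`.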

Labels are modeled as vectors, where each label is a vector over the recordings that marks those annotated with it in the ground truth. The similarity between labels from different sources is measured as the cosine similarity between these label vectors. Based on that, we compute a similarity matrix $M_{i,j}$ for each pair of sources $i$ and $j$, where element $(a, b)$ holds the similarity between label $a$ of source $i$ and label $b$ of source $j$. We use these pairwise similarity matrices as conversion matrices to project the probabilities produced by a model trained on one source onto the labels of another source.

For a specific recording, the label probabilities $P_i$ of source $i$ form a vector of length $n_i$, produced by a model trained on the training set of source $i$. To also make use of the models trained on other sources, we compute $P_j \cdot M_{j,i}$, which is a vector of the same length $n_i$ that also represents probabilities of the labels of source $i$. This vector, however, is produced by a model trained on source $j$, by projecting the probabilities $P_j$ with the respective conversion matrix. The final label probabilities (task 2) for a specific recording of source $k$ are the weighted average of the label probabilities produced by the model trained on the training set of source $k$ and the projected label probabilities of all other sources (see Equation (1)).

$$Y_k = \frac{P_k + \frac{1}{3} \sum_{i \neq k} P_i \cdot M_{i,k}}{2} \qquad (1)$$
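The multi-source combination can be illustrated with the NumPy sketch below. It assumes that label vectors are binary indicator columns over the recordings shared by two sources, and it averages the projected probabilities over however many other sources are supplied, so the factor 1/3 in Equation (1) arises when there are three other sources. The function names and the dictionary-based layout are hypothetical.

```python
import numpy as np


def label_similarity(labels_a, labels_b, eps=1e-12):
    """Cosine similarity between every label of source a and every label of source b.

    labels_a: indicator matrix, shape (shared recordings, n_a)
    labels_b: indicator matrix, shape (shared recordings, n_b)
    Returns a conversion matrix of shape (n_a, n_b).
    """
    a = np.asarray(labels_a, dtype=float)
    b = np.asarray(labels_b, dtype=float)
    dots = a.T @ b
    norms = np.outer(np.linalg.norm(a, axis=0), np.linalg.norm(b, axis=0))
    return dots / np.maximum(norms, eps)  # eps guards labels with no shared recordings


def combine_sources(probs, sims, target):
    """Combine own and projected probabilities for one recording.

    probs: dict source -> probability vector P_source
    sims:  dict (source, target) -> conversion matrix M_{source,target}
    """
    others = [s for s in probs if s != target]
    # Project each other source's probabilities onto the target labels and average.
    projected = sum(probs[s] @ sims[(s, target)] for s in others) / len(others)
    return (probs[target] + projected) / 2.0
```

The combined vector can then be thresholded in the same way as the single-source probabilities.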

3 RESULTS AND ANALYSIS

Predicting sub-genre labels is harder than predicting genre labels, which might be a result of the fewer training examples for those sub-genres in the dataset.

We submitted 3 runs for the first task [2]: two runs using static thresholds of 0.2 and 0.3, and one run using dynamic thresholds as described in Section 2.3. We also submitted 5 runs for the second task [2]: two runs identical to the static-threshold runs of task 1, and three runs based on probabilities calculated as described in Section 2.4, using static thresholds of 0.2 and 0.3 and dynamic thresholds.

Table 4 summarizes the official F-score results [1] of our best run for each source.

REFERENCES

[1] Bogdanov, Porter, Urbano, and Schreiber. 2017. Official results of the MediaEval 2017 AcousticBrainz Genre Task. https://multimediaeval.github.io/2017-AcousticBrainz-Genre-Task/results/. (2017). [Online; accessed 09-September-2017].

[2] Bogdanov, Porter, Urbano, and Schreiber. 2017. The MediaEval 2017 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources. In Working Notes Proceedings of the MediaEval 2017 Workshop. Dublin, Ireland.

[3] Bogdanov, Wack, Gómez, Gulati, Herrera, Mayor, G. Roma, J. Salamon, J.R. Zapata, and Serra. 2013. Essentia: An Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval (ISMIR'13) Conference. Curitiba, Brazil, 493-498.

[4] Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807-814.

[5] Porter, Bogdanov, Kaye, Tsukanov, and Serra. 2015. AcousticBrainz: a community platform for gathering music information obtained from audio. In Proceedings of the 16th International Society for Music Information Retrieval Conference. Malaga, Spain, 786-792.