MediaEval 2017 AcousticBrainz Genre Task: Multilayer Perceptron Approach

Khaled Koutini, Alina Imenina, Matthias Dorfer, Alexander Rudolf Gruber, Markus Schedl
Johannes Kepler University Linz, Austria
khaled.koutini@jku.at, markus.schedl@jku.at

Copyright held by the owner/author(s).
MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
This report describes the approach developed by the JKU team for the MediaEval 2017 AcousticBrainz Genre Task. After experimenting with various classifiers on the development dataset, our final approach is based on multilayer perceptron classifiers.

1 INTRODUCTION
We present an approach for recognizing the genre of unknown music recordings given the data provided in the AcousticBrainz dataset [5]. Details about the data, task, and evaluation are described in [2].

Our work covers both subtasks of the MediaEval 2017 AcousticBrainz Genre Task. For the single-source classification subtask, a multilayer perceptron is applied to each source. For the multiple-source classification subtask, we use similarity measures between sources to adjust the probability of a recording belonging to a certain genre in each source.

2 APPROACH
We split the ground truth of each source, using the script provided by the organizers, into a training set and a validation set comprising 80% and 20% of the original data, respectively. The split also ensures that no recording from the same recording group appears in both the training and validation sets, in order to avoid the album effect.

2.1 Feature Selection
As stated in the overview paper [2], we are given for each recording a set of features extracted using Essentia [3]. Given the large number of provided features, a fine-grained manual inspection of individual features is not feasible. Instead, we pick broad feature groups high in the Essentia feature hierarchy: all low-level features, the rhythm features except beats_count and beats_position, and the tonal features except chords_key, chords_scale, key_key and key_scale. Overall, this yields 2646 numerical features per recording.

2.2 Neural Network
We tried various neural network architectures and compared their performance based on the mean label-wise F-score of batches, using the Lastfm dataset. The best performing architecture is outlined in Table 1.

Table 1: Model specifications. ReLU: Rectified Linear Unit [4]; k: the number of output units, which is the number of possible labels (unique genres + unique sub-genres). For training, a constant batch size of 500 samples and a learning rate of 0.001 are used.

  Input:         2646
  First layer:   4000 Dense(ReLU)    + Drop-Out(0.5)
                 4000 Dense(tanh)    + Drop-Out(0.5)
                 4000 Dense(sigmoid) + Drop-Out(0.5)
  Second layer:  Concat layer
                 8000 Dense + Drop-Out(0.6)
                 Batch-Normalization layer
                 Non-linearity (ReLU)
  Output layer:  k-bins sigmoid

2.2.1 Input layer. As stated in Section 2.1, there are 2646 input features. We normalize the input using z-score normalization.

2.2.2 First hidden layer. The first hidden layer is a dense layer consisting of 12000 units, where the first 4000 units have a rectified linear [4] activation function, the next 4000 units have a tanh activation function, and the last 4000 units have a sigmoid activation function. As shown in Table 1, each group of units is followed by a dropout layer with a dropout probability of 0.5.

2.2.3 Second hidden layer. The second hidden layer consists of 8000 batch-normalized rectified linear units. As input to this layer, we concatenate the outputs of the 3 groups of the first layer; the dense layer itself has no activation function or bias. We again apply dropout, this time with a probability of 0.6.

2.2.4 Output layer. The output layer consists of k units, where k is source-specific, denoting the number of labels (genres and sub-genres) of the source. The activation function of the output layer is sigmoid.

2.2.5 Loss function. We use mean binary cross-entropy as the loss function for the network.
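To make the architecture concrete, the following is a minimal sketch of the model in Table 1. The paper does not state which framework was used; PyTorch, the optimizer choice, and all names here (e.g., GenreMLP) are our assumptions.

    import torch
    import torch.nn as nn

    class GenreMLP(nn.Module):  # hypothetical name
        def __init__(self, n_features=2646, k=100):  # k is source-specific
            super().__init__()
            # First hidden layer: three parallel groups of 4000 units with
            # different activations, each followed by dropout (Section 2.2.2).
            self.group_relu = nn.Sequential(nn.Linear(n_features, 4000), nn.ReLU(), nn.Dropout(0.5))
            self.group_tanh = nn.Sequential(nn.Linear(n_features, 4000), nn.Tanh(), nn.Dropout(0.5))
            self.group_sigm = nn.Sequential(nn.Linear(n_features, 4000), nn.Sigmoid(), nn.Dropout(0.5))
            # Second hidden layer: dense with no activation or bias, then
            # dropout, batch normalization, and ReLU (Section 2.2.3, Table 1).
            self.second = nn.Sequential(
                nn.Linear(3 * 4000, 8000, bias=False),
                nn.Dropout(0.6),
                nn.BatchNorm1d(8000),
                nn.ReLU(),
            )
            # Output layer: k sigmoid units, one per label (Section 2.2.4).
            self.out = nn.Linear(8000, k)

        def forward(self, x):
            h = torch.cat([self.group_relu(x), self.group_tanh(x), self.group_sigm(x)], dim=1)
            return torch.sigmoid(self.out(self.second(h)))

    # Training setup per Table 1 and Section 2.2.5 (the optimizer is our assumption):
    # model = GenreMLP(k=k)
    # optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # loss = nn.BCELoss()(model(x_batch), y_batch)  # mean binary cross-entropy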
2.3 Adjusting Thresholds
The output of our neural network is k numerical values for each recording, as stated in Section 2.2.4. Each output lies in the range [0, 1] and represents the probability of the label (genre or sub-genre) corresponding to the respective output neuron. If the probability of a label for a given recording is larger than a predefined threshold, we assign that label to the recording. In our experiments, we found that using a threshold of 0.5 for all labels results in high precision but low recall. Since the goal of the task is to optimize precision, recall, and F-score, we adjusted the threshold for each label individually to obtain the best values for these evaluation measures. The best results are obtained either by using a static threshold of 0.2 or 0.3 for all labels, or by using a dynamic threshold for each label, estimated by maximizing the mean F-score.

Table 2 shows the evaluation results for the mentioned threshold setups on the Lastfm validation set, using the evaluation scripts provided by the task organizers [2].

Table 2: Lastfm validation set evaluation results. P: Precision, R: Recall, F: F-score.

                                     Threshold
  Average per                  0.2    0.3    dynamic
  track (all labels)      P    0.54   0.60   0.59
                          R    0.64   0.59   0.59
                          F    0.55   0.55   0.54
  track (genre labels)    P    0.69   0.71   0.70
                          R    0.79   0.76   0.73
                          F    0.71   0.71   0.70
  label (all labels)      P    0.39   0.46   0.44
                          R    0.36   0.32   0.35
                          F    0.35   0.35   0.37
  label (genre labels)    P    0.51   0.58   0.60
                          R    0.58   0.53   0.53
                          F    0.54   0.54   0.56
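As an illustration of the dynamic threshold estimation, the sketch below selects, for each label independently, the threshold that maximizes that label's F-score on the validation set. The candidate grid and the exhaustive search are our assumptions; the paper only states that dynamic thresholds are estimated by maximizing the mean F-score.

    import numpy as np

    def dynamic_thresholds(probs, truth, candidates=np.arange(0.05, 1.0, 0.05)):
        # probs: (n_recordings, k) network outputs on the validation set
        # truth: (n_recordings, k) binary ground-truth label matrix
        # candidates: hypothetical search grid (not specified in the paper)
        thresholds = np.full(probs.shape[1], 0.5)
        for j in range(probs.shape[1]):
            best_f = -1.0
            for t in candidates:
                pred = probs[:, j] >= t
                tp = np.sum(pred & (truth[:, j] == 1))
                fp = np.sum(pred & (truth[:, j] == 0))
                fn = np.sum(~pred & (truth[:, j] == 1))
                p = tp / (tp + fp) if tp + fp > 0 else 0.0
                r = tp / (tp + fn) if tp + fn > 0 else 0.0
                f = 2 * p * r / (p + r) if p + r > 0 else 0.0
                if f > best_f:
                    best_f, thresholds[j] = f, t
        return thresholds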
2.4 Combining Different Sources
The second part of the task [2] consists of combining information from multiple sources to predict the labels of one source. To achieve this, we calculate the similarity between every label of one source and every label of all other sources, in order to adjust the probability of assigning a label of one source to a recording using the label probabilities produced by models trained on the other sources. As stated in the overview paper [2], the datasets of the different sources intersect. We exploit this intersection to estimate the similarity between the labels of different sources.

Labels are modeled as vectors, where each label is represented by the vector of recordings annotated with it in the ground truth. The similarity between labels from different sources is measured as the cosine similarity between these label vectors. Based on that, we compute similarity matrices M_{i,j} between pairs of sources, where the element at position (a, b) holds the similarity of label a of source i and label b of source j. We use these pairwise similarities as conversion matrices to project probabilities produced by a model trained on one source onto the labels of another source.

For a specific recording, the probabilities P_i of the labels of source i form a vector of length n_i, produced by the model trained on the training set of source i. To also make use of the models trained on other sources, we compute P_j · M_{j,i}, which is a vector of the same length n_i and likewise represents probabilities for the labels of source i; however, this vector is produced by the model trained on source j, by projecting the probabilities P_j through the respective conversion matrix. The final label probabilities (task 2) for a specific recording of source k are the weighted average of the label probabilities produced by the model trained on the recording's source training set and the projected label probabilities of all other sources, as given in Equation (1):

  Y_k = (P_k + (1/3) Σ_{i≠k} P_i · M_{i,k}) / 2    (1)
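The following sketch shows one way to implement this projection under our reading of the text: annotation matrices are restricted to the recordings shared by two ground truths, conversion matrices hold the pairwise cosine similarities of the label columns, and Equation (1) averages a source's own predictions with the mean of the three projected predictions. All function and variable names are hypothetical.

    import numpy as np

    def conversion_matrix(ann_i, ann_k):
        # ann_i: (n_shared, n_i) and ann_k: (n_shared, n_k) binary annotation
        # matrices of sources i and k, restricted to the recordings that both
        # ground truths share. Element (a, b) of the result is the cosine
        # similarity of label a of source i and label b of source k.
        u = ann_i / (np.linalg.norm(ann_i, axis=0, keepdims=True) + 1e-12)
        v = ann_k / (np.linalg.norm(ann_k, axis=0, keepdims=True) + 1e-12)
        return u.T @ v  # shape (n_i, n_k)

    def combined_probabilities(P, M, k):
        # P: dict mapping each source name to its model's probability vector
        #    for one recording; M: dict mapping (i, k) to conversion matrices.
        # Implements Equation (1) for target source k.
        projected = [P[i] @ M[(i, k)] for i in P if i != k]
        return (P[k] + sum(projected) / len(projected)) / 2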
3 RESULTS AND ANALYSIS
Table 3 shows the evaluation results on the validation sets of the different sources, using models trained on the respective training sets with dynamic thresholds, as produced by the evaluation scripts provided by the task organizers [2]. We can clearly observe that predicting sub-genre labels is harder than predicting genre labels, which might be a result of the fewer training examples for those sub-genres in the dataset.

Table 3: Validation set evaluation results of the 4 sources using dynamic thresholds (Section 2.3). P: Precision, R: Recall, F: F-score.

                                          Source
  Average per                  Allmusic  Tagtraum  Lastfm  Discogs
  track (all labels)      P    0.53      0.53      0.59    0.50
                          R    0.56      0.59      0.59    0.62
                          F    0.48      0.53      0.54    0.51
  track (genre labels)    P    0.71      0.72      0.70    0.78
                          R    0.74      0.76      0.73    0.80
                          F    0.70      0.72      0.70    0.76
  label (all labels)      P    0.49      0.32      0.44    0.28
                          R    0.36      0.32      0.35    0.29
                          F    0.40      0.30      0.37    0.27
  label (genre labels)    P    0.56      0.55      0.60    0.59
                          R    0.48      0.51      0.53    0.54
                          F    0.51      0.52      0.56    0.56

We submitted 3 runs for the first task [2]: two runs using static thresholds of 0.2 and 0.3, and one run using dynamic thresholds as described in Section 2.3. We also submitted 5 runs for the second task [2]: two runs identical to the static-threshold runs of task 1, and 3 runs based on the probabilities calculated as described in Section 2.4, using static thresholds of 0.2 and 0.3 as well as dynamic thresholds. Table 4 summarizes the official F-score results [1] of our best run for each source.

Table 4: Official results [1] (F-score).

                                          Source
  Average per                  Allmusic  Tagtraum  Lastfm  Discogs
  track (all labels)           0.43      0.53      0.55    0.51
  track (genre labels)         0.67      0.73      0.72    0.77
  label (all labels)           0.20      0.28      0.35    0.25
  label (genre labels)         0.41      0.50      0.55    0.56

REFERENCES
[1] D. Bogdanov, A. Porter, J. Urbano, and H. Schreiber. 2017. Official results of the MediaEval 2017 AcousticBrainz Genre Task. https://multimediaeval.github.io/2017-AcousticBrainz-Genre-Task/results/. [Online; accessed 09-September-2017].
[2] D. Bogdanov, A. Porter, J. Urbano, and H. Schreiber. 2017. The MediaEval 2017 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources. In Working Notes Proceedings of the MediaEval 2017 Workshop. Dublin, Ireland.
[3] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. R. Zapata, and X. Serra. 2013. Essentia: An Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval (ISMIR'13) Conference. Curitiba, Brazil, 493–498.
[4] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.
[5] A. Porter, D. Bogdanov, R. Kaye, R. Tsukanov, and X. Serra. 2015. AcousticBrainz: A Community Platform for Gathering Music Information Obtained from Audio. In Proceedings of the 16th International Society for Music Information Retrieval Conference. Malaga, Spain, 786–792.