MediaEval 2018 AcousticBrainz Genre Task: A baseline combining deep feature embeddings across datasets

Sergio Oramas (1), Dmitry Bogdanov (2), Alastair Porter (2)
(1) Pandora Media Inc., US
(2) Universitat Pompeu Fabra, Spain


ABSTRACT
In this paper we present a baseline approach for the MediaEval 2018 AcousticBrainz Genre Task that takes advantage of stacking multiple feature embeddings learned on individual genre datasets by simple deep learning architectures. Although we employ basic neural networks, the combination of their deep feature embeddings provides a significant gain in performance compared to each individual network.

1    INTRODUCTION
This paper describes our baseline submission to the MediaEval 2018 AcousticBrainz Genre Task [1]. The goal of the task is to automatically classify music tracks by genre based on pre-computed audio content features provided by the organizers. Four genre datasets, coming from different annotation sources with different genre taxonomies, are used in the challenge. For each dataset, training, validation, and testing splits are provided. This makes it possible to build and evaluate classifier models for each genre dataset independently (Subtask 1), as well as to explore combinations of genre sources in order to boost the performance of the models (Subtask 2).

For this baseline, we decided to focus on demonstrating the possibilities of merging different genre ground truth sources using a simple deep learning architecture. To this end, we explore how stacking deep feature embeddings obtained on different datasets can benefit genre recognition systems.

2    RELATED WORK
Submissions to the previous edition of the task have explored late fusion of predictions made by classifier models trained for each genre source individually. In order to predict genres following the taxonomy of a target source, the proposed solutions applied genre mapping between taxonomies, either by computing genre co-occurrences on the intersection of all four training genre datasets [4], or by textual string matching [6].

In our baseline, we propose an alternative early fusion approach, similar to the one proposed in [7] for multimodal genre classification. The approach incorporates knowledge across datasets by stacking deep feature embeddings learned on each dataset individually and using those as an input to predict genres for each test dataset.

3    APPROACH
3.1    Input features
We use all available features extracted from music audio recordings using Essentia [2] and provided for the challenge. As a pre-processing step, we apply one-hot encoding to a few categorical features related to tonality (key_key, key_scale, chords_key, and chords_scale) and standardize all features (zero mean, unit variance). In total, this amounts to 2669 input features.

3.2    Neural network architecture
A simple feedforward network is used to predict the probabilities of each genre given a track. The network consists of an input layer of 2669 units (the size of the feature vector for an input recording), followed by a hidden dense layer of 256 units with ReLU activation, and an output layer whose number of units coincides with the number of genres to be predicted in each dataset. Dropout of 0.5 is applied after the input and the hidden layer. As the targeted genre classification task is multi-label, the output layer uses sigmoid activations and is evaluated with a binary cross-entropy loss.

Mini-batches of 32 items are randomly sampled from the training data to compute the gradient, and the Adam [3] optimizer is used to train the models, with the default suggested learning parameters. The networks are trained for a maximum of 100 epochs with early stopping. Once the networks are trained, we extract the 256-dimensional vectors from the hidden layer for the training, validation, and test sets.

The model architecture is used to train a multi-label genre classifier on each of the four datasets. The models are trained on 80% of the training set and validated after each epoch on the remaining 20%, using the split script with release-group filtering provided by the organizers. Predictions are computed for the validation and test sets.

3.3    Embedding fusion approach
Following the described methodology, one model per dataset is trained, and these models serve for predictions in Subtask 1. Then, the trained models are used as feature extractors. All four models share the same input format, so input feature vectors from one dataset can be used as input to a model trained on another dataset. Thus, for each model we feed all tracks from the training, validation, and test sets of each dataset, and obtain the activations of the hidden layer as a 256-dimensional feature embedding. Therefore, for each track in each dataset we obtain four different feature embeddings, one from each of the four previously trained models.
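
The forward pass of the network in Section 3.2, and its reuse as a feature extractor, can be illustrated with a minimal dependency-free sketch. This is not the authors' implementation: the function names, the weight initialisation, and the placeholder number of genres are assumptions made for illustration, and training (Adam, mini-batches, early stopping) is left out. Only the layer sizes, activations, dropout rate, and loss follow the text.

```python
import math
import random

N_INPUT = 2669   # input feature vector size (Section 3.1)
N_HIDDEN = 256   # hidden layer size (Section 3.2)
DROPOUT = 0.5    # dropout rate (Section 3.2)


def init_dense(n_in, n_out, rng):
    """Create (weights, biases); the uniform initialisation is an assumption."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def dropout(x, rate, rng):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(x, hidden_layer, output_layer, train_rng=None):
    """Return (genre probabilities, hidden-layer embedding) for one track.

    Dropout is active only when `train_rng` is given (training mode).
    """
    if train_rng is not None:
        x = dropout(x, DROPOUT, train_rng)
    embedding = [max(0.0, z) for z in dense(x, hidden_layer)]  # ReLU
    hidden = embedding
    if train_rng is not None:
        hidden = dropout(hidden, DROPOUT, train_rng)
    probs = [sigmoid(z) for z in dense(hidden, output_layer)]
    return probs, embedding


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over the labels of one track."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


rng = random.Random(0)
n_genres = 5  # placeholder; the real value depends on the dataset taxonomy
hidden_layer = init_dense(N_INPUT, N_HIDDEN, rng)
output_layer = init_dense(N_HIDDEN, n_genres, rng)
# Random stand-in for one standardized Essentia feature vector.
track = [rng.gauss(0.0, 1.0) for _ in range(N_INPUT)]
probs, embedding = forward(track, hidden_layer, output_layer)
print(len(probs), len(embedding))  # 5 256
```

Because all four models take the same 2669-dimensional input, the same `forward` call can be applied to a track from any dataset; the second return value is the 256-dimensional embedding that the fusion step consumes. In practice a deep learning framework would replace these list-based helpers.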

Given the four feature embeddings of each track, we apply l2-normalization to each of them and stack them into a single 1024-dimensional feature vector. Following this process, we obtain new feature vectors for every track in the training, validation, and test sets of each dataset. Then, we use these feature vectors as input to a simple network where the input layer is directly connected to the output layer. Dropout of 0.5 is applied after the input layer. The output layer is exactly the same as in the network described above, where sigmoid activation and binary cross-entropy loss are applied. The new network is trained following the same methodology described before, with Adam as the optimizer and mini-batches of 32 randomly sampled items. The network is trained on 80% of the training data and validated on the other 20%. Following this approach, we train a network per dataset, and obtain the genre probability predictions of the validation and test sets for Subtask 2.

3.4    Predictions thresholding
The predictions made by each model contain continuous values, while the task requires binary prediction of genre labels. We therefore apply a plug-in rule approach, thresholding the prediction values in order to maximize the evaluation metrics. We decided to maximize the macro F-score, and applied individual thresholds for each genre label, estimated on the validation data [5].

4    RESULTS AND ANALYSIS
We evaluated a single run for both Subtask 1 and 2. Table 1 presents the ROC AUC metric on the validation sets. Table 2 presents the final results after applying thresholding on the test datasets. As a general pattern, we can clearly see the benefit of the models based on the embedding fusion approach compared to the models trained individually on each dataset. While the individual models (Subtask 1) are hardly usable compared to the random and popularity baselines, the combined models show a significant improvement in performance, being competitive with last year's second-ranked submission [6].

In our experiments, we focused on optimizing the macro F-score; however, choosing this metric for threshold optimization can have a negative effect on micro-averaged metrics. In the case of infrequent subgenre labels and an uninformative classifier, an optimal but undesirable strategy may involve always predicting those labels [5]. Indeed, this was the case for the individual models, but the hybrid models did not have this issue.

Table 1: ROC AUC on validation datasets

               AllMusic   Discogs   Lastfm    Tagtraum
Subtask 1      0.6476     0.7592    0.8276    0.8017
Subtask 2      0.8122     0.8863    0.9064    0.8874

Table 2: Precision, recall and F-scores on test datasets

Average per                      AllMusic   Discogs   Lastfm    Tagtraum

Subtask 1 (individual models)
Recording (all labels)     P     0.0147     0.0591    0.0976    0.0992
                           R     0.5753     0.5263    0.4512    0.5017
                           F     0.0280     0.1035    0.1506    0.1623
Recording (genres)         P     0.2786     0.6305    0.3974    0.2991
                           R     0.6960     0.7289    0.5966    0.6630
                           F     0.3399     0.6382    0.4455    0.4040
Recording (subgenres)      P     0.0114     0.0256    0.0518    0.0636
                           R     0.4861     0.3497    0.3295    0.4164
                           F     0.0219     0.0467    0.0856    0.1083
Label (all labels)         P     0.0225     0.0744    0.0732    0.0951
                           R     0.4943     0.2588    0.2330    0.2412
                           F     0.0324     0.0935    0.0947    0.1141
Label (genres)             P     0.1658     0.3733    0.2321    0.2551
                           R     0.4729     0.4229    0.3484    0.3573
                           F     0.1938     0.3801    0.2546    0.2676
Label (subgenres)          P     0.0184     0.0595    0.0572    0.0764
                           R     0.4949     0.2506    0.2213    0.2276
                           F     0.0278     0.0792    0.0786    0.0962

Subtask 2 (fusion models)
Recording (all labels)     P     0.1340     0.2775    0.2718    0.2972
                           R     0.4809     0.5432    0.4762    0.5127
                           F     0.1880     0.3320    0.3066    0.3451
Recording (genres)         P     0.5689     0.6877    0.5407    0.6061
                           R     0.6905     0.7473    0.6335    0.6885
                           F     0.5880     0.6845    0.5602    0.6243
Recording (subgenres)      P     0.0946     0.1472    0.1570    0.2022
                           R     0.3251     0.3703    0.3368    0.4148
                           F     0.1343     0.1892    0.1911    0.2465
Label (all labels)         P     0.0614     0.1087    0.1108    0.1235
                           R     0.1640     0.2226    0.2168    0.2324
                           F     0.0736     0.1247    0.1314    0.1400
Label (genres)             P     0.2907     0.4404    0.3077    0.2878
                           R     0.3713     0.4713    0.3735    0.3565
                           F     0.3080     0.4393    0.3246    0.3053
Label (subgenres)          P     0.0550     0.0922    0.0909    0.1043
                           R     0.1582     0.2102    0.2009    0.2179
                           F     0.0670     0.1089    0.1119    0.1206
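
The embedding fusion of Section 3.3 and the per-label thresholding of Section 3.4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the l2 step is taken to mean scaling each embedding to unit length, the candidate thresholds are simply the scores observed on the validation data, and all function names are hypothetical. Since the macro F-score is the unweighted mean of the per-label F-scores, choosing each label's threshold independently maximizes it on the validation set.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def fuse_embeddings(embeddings):
    """Concatenate the l2-normalized embeddings of one track (4 x 256 -> 1024)."""
    fused = []
    for emb in embeddings:
        fused.extend(l2_normalize(emb))
    return fused


def f_score(predicted, actual):
    """F1 for one label over a set of tracks; 0.0 when nothing is predicted or present."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_thresholds(val_probs, val_labels):
    """For each label, pick the validation score that maximizes that label's F-score."""
    thresholds = []
    for j in range(len(val_probs[0])):
        scores = [row[j] for row in val_probs]
        actual = [row[j] for row in val_labels]
        best_t, best_f = 0.5, -1.0
        for t in sorted(set(scores)):  # candidate grid: an assumption
            f = f_score([s >= t for s in scores], actual)
            if f > best_f:
                best_t, best_f = t, f
        thresholds.append(best_t)
    return thresholds


def binarize(probs, thresholds):
    """Turn continuous predictions into binary genre labels."""
    return [int(p >= t) for p, t in zip(probs, thresholds)]


# Toy validation data: 4 tracks, 2 labels (made-up numbers).
val_probs = [[0.9, 0.2], [0.8, 0.6], [0.3, 0.7], [0.1, 0.4]]
val_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
thresholds = fit_thresholds(val_probs, val_labels)
print(thresholds)                          # [0.8, 0.4]
print(binarize([0.85, 0.3], thresholds))   # [1, 0]
```

The analysis in Section 4 corresponds to a degenerate outcome of this search: for a rare label with uninformative scores, the lowest candidate threshold can give the best F-score, so the label is predicted for every track.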

5    CONCLUSION
In our baseline approach we focused on Subtask 2 and demonstrated the advantage of fusing feature embeddings learned on individual genre datasets, using a simple feedforward network architecture as an example. We may expect further improvements in performance by means of a more sophisticated network architecture (for example [4]). The code of the baseline is available online at https://github.com/MTG/acousticbrainz-mediaeval-baselines

ACKNOWLEDGMENTS
This research has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 688382 (AudioCommons) and 770376-2 (TROMPA), as well as the Ministry of Economy and Competitiveness of the Spanish Government (Reference: TIN2015-69935-P).
