
MediaEval 2018 AcousticBrainz Genre Task: A baseline combining deep feature embeddings across datasets

Sergio Oramas

Dmitry Bogdanov

dmitry.bogdanov@upf.edu 0

Alastair Porter

alastair.porter@upf.edu 0

Pandora Media Inc.

0 Universitat Pompeu Fabra, Spain

2018


In this paper we present a baseline approach for the MediaEval 2018 AcousticBrainz Genre Task that takes advantage of stacking multiple feature embeddings learned on individual genre datasets by simple deep learning architectures. Although we employ basic neural networks, the combination of their deep feature embeddings provides a significant gain in performance compared to each individual network.

INTRODUCTION

This paper describes our baseline submission to the MediaEval 2018 AcousticBrainz Genre Task [1]. The goal of the task is to automatically classify music tracks by genre based on pre-computed audio content features provided by the organizers. The challenge uses four different genre datasets that come from different annotation sources with different genre taxonomies. For each dataset, training, validation, and testing splits are provided. This makes it possible to build and evaluate classifier models for each genre dataset independently (Subtask 1) as well as to explore combinations of genre sources in order to boost the performance of the models (Subtask 2).

For this baseline, we focus on demonstrating how different genre ground truth sources can be merged using a simple deep learning architecture. To this end, we explore how stacking deep feature embeddings obtained on different datasets can benefit genre recognition systems.

RELATED WORK

Submissions to the previous edition of the task explored late fusion of predictions made by classifier models trained for each genre source individually. In order to predict genres following the taxonomy of a target source, the proposed solutions applied genre mapping between taxonomies, either by computing genre co-occurrences on the intersection of all four training genre datasets [4], or by textual string matching [6].

In our baseline, we propose an alternative early fusion approach, similar to the one proposed in [7] for multimodal genre classification. The approach incorporates knowledge across datasets by stacking deep feature embeddings learned on each dataset individually and using those as an input to predict genres for each test dataset.

APPROACH

Input features

We use all available features extracted from music audio recordings using Essentia [2] and provided for the challenge. As a preprocessing step, we apply one-hot encoding to a few categorical features related to tonality (key_key, key_scale, chords_key, and chords_scale) and standardize all features (zero mean, unit variance). In total, this amounts to 2669 input features.
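
The preprocessing described above can be sketched as follows. This is a minimal illustration on toy data, not the baseline's actual code: the `bpm` descriptor, the two-value category set for `key_scale`, and the helper names are assumptions made for the example.

```python
import statistics

def one_hot(value, categories):
    """Return a 0/1 indicator list marking which category `value` equals."""
    return [1.0 if value == c else 0.0 for c in categories]

def standardize_columns(rows):
    """Scale each column to zero mean and unit variance (population std)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Toy tracks: one numeric descriptor plus the categorical key_scale feature.
KEY_SCALES = ["major", "minor"]  # assumed category set, for illustration only
tracks = [
    {"bpm": 120.0, "key_scale": "major"},  # "bpm" is a hypothetical feature name
    {"bpm": 90.0, "key_scale": "minor"},
    {"bpm": 150.0, "key_scale": "major"},
]
raw = [[t["bpm"]] + one_hot(t["key_scale"], KEY_SCALES) for t in tracks]
features = standardize_columns(raw)
```

In the sketch the one-hot columns are standardized together with the numeric ones, which matches the statement that all features are standardized.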

Neural network architecture

A simple feedforward network is used to predict the probability of each genre given a track. The network consists of an input layer of 2669 units (the size of the feature vector for an input recording), followed by a hidden dense layer of 256 units with ReLU activation, and an output layer whose number of units coincides with the number of genres to be predicted in each dataset. Dropout of 0.5 is applied after the input and the hidden layers. As the targeted genre classification task is multi-label, the output layer uses sigmoid activations and is evaluated with a binary cross-entropy loss.
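
To illustrate why sigmoid outputs with a binary cross-entropy loss suit the multi-label setting, the following sketch computes the loss for one track with three genre labels. The logits and targets are invented numbers, and the function names are ours rather than part of the baseline.

```python
import math

def sigmoid(z):
    """Map a raw output (logit) to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean binary cross-entropy over the genre labels of one track."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(targets)

logits = [2.0, -1.0, 0.0]    # toy output-layer values for three genres
targets = [1.0, 0.0, 1.0]    # the track carries the first and third genre
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(targets, probs)
```

Because each output is squashed independently, the probabilities need not sum to one, so several genres can be predicted for the same track.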

Mini-batches of 32 items are randomly sampled from the training data to compute the gradient, and the Adam [3] optimizer is used to train the models, with the default suggested learning parameters. The networks are trained for a maximum of 100 epochs with early stopping. Once trained, we extract the 256-dimensional vectors from the hidden layer for the training, validation, and test sets.

The same model architecture is used to train a multi-label genre classifier on each of the four datasets. Each model is trained on 80% of the training set and validated after each epoch on the remaining 20%; the split is made with the script provided by the organizers, which applies release-group filtering. Predictions are computed for the validation and test sets.
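
The sketch below shows one way an 80/20 split with group filtering could work. It is not the organizers' script: we assume that release-group filtering means that recordings sharing a release group never end up on both sides of the split, and the input format (a mapping from track id to release-group id) is hypothetical.

```python
import random
from collections import defaultdict

def group_split(track_groups, validation_fraction=0.2, seed=0):
    """Split track ids into train/validation without splitting any group.

    `track_groups` maps a track id to its release-group id (assumed format).
    """
    groups = defaultdict(list)
    for track_id, group_id in track_groups.items():
        groups[group_id].append(track_id)
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    target = validation_fraction * len(track_groups)
    train, validation = [], []
    for group_id in group_ids:
        # Fill validation until it reaches the target size, then fill train.
        bucket = validation if len(validation) < target else train
        bucket.extend(groups[group_id])
    return train, validation

# Toy data: 100 tracks spread over 20 release groups of 5 tracks each.
track_groups = {f"track{i}": f"album{i // 5}" for i in range(100)}
train, validation = group_split(track_groups)
```

Splitting by group rather than by track avoids validating on recordings from releases the model has already seen during training.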

Embedding fusion approach

Following the described methodology, one model per dataset is trained, and these models provide the predictions for Subtask 1. The same models are then used as feature extractors. All four models share the same input format, so input feature vectors from one dataset can be used as input to a model trained on another dataset. Thus, for each model we feed all tracks from the training, validation, and test sets of each dataset, and obtain the activations of the hidden layer as a 256-dimensional feature embedding. Therefore, for each track in each dataset we obtain four different feature embeddings, one from each of the four previously trained models.

Given the four feature embeddings of each track, we apply L2 normalization to each of them and then stack them together into a single 1024-dimensional feature vector. Following this process, we obtain new feature vectors for every track in the training, validation, and test sets of each dataset. Then, we use these feature vectors as input to a simple network in which the input layer is directly connected to the output layer. Dropout of 0.5 is applied after the input layer. The output layer is exactly the same as in the network described above, with sigmoid activations and a binary cross-entropy loss. The new network is trained following the same methodology as before, with Adam as the optimizer and randomly sampled mini-batches of 32 items. The network is trained on 80% of the training data and validated on the other 20%. Following this approach, we train one network per dataset and obtain the genre probability predictions for the validation and test sets for Subtask 2.
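
The fusion step itself, normalizing each embedding and concatenating the results, can be sketched as follows. The placeholder activations and helper names are assumptions for illustration; in the baseline the embeddings come from the hidden layers of the four trained models.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

def fuse_embeddings(embeddings):
    """L2-normalize each embedding, then concatenate them in order."""
    fused = []
    for embedding in embeddings:
        fused.extend(l2_normalize(embedding))
    return fused

EMBEDDING_SIZE = 256
# Placeholder activations standing in for the four models' hidden layers.
per_model = [[float(m + 1)] * EMBEDDING_SIZE for m in range(4)]
fused = fuse_embeddings(per_model)  # 4 x 256 = 1024 values
```

Normalizing before stacking keeps any single model's embedding from dominating the fused vector merely because its activations are larger in scale.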

Predictions thresholding

The predictions made by each model are continuous values, while the task requires binary predictions of genre labels. We therefore apply a plug-in rule approach, thresholding the prediction values in order to maximize the evaluation metric. We chose to maximize the macro F-score and applied an individual threshold for each genre label, estimated on the validation data [5].
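
Since the macro F-score is the mean of the per-label F-scores, it can be maximized by choosing each label's threshold independently. The sketch below does this with a simple search over the observed validation scores; this is a simplified stand-in for the plug-in rule of [5], and the genre names and scores are invented.

```python
def f_score(y_true, y_pred):
    """F1 for one label from binary ground truth and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, scores):
    """Pick the candidate threshold with the highest F-score for one label."""
    best_t, best_f = 0.5, -1.0
    for t in sorted(set(scores)):
        f = f_score(y_true, [s >= t for s in scores])
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Toy validation data for two genre labels (hypothetical names and scores).
val_true = {"rock": [1, 1, 0, 0], "jazz": [0, 1, 0, 0]}
val_scores = {"rock": [0.9, 0.4, 0.3, 0.1], "jazz": [0.2, 0.15, 0.05, 0.1]}
thresholds = {g: best_threshold(val_true[g], val_scores[g])[0] for g in val_true}
```

The thresholds found on the validation data would then be applied unchanged to the test predictions of the same label.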

RESULTS AND ANALYSIS

We evaluated a single run for both Subtask 1 and 2. Table 1 presents the ROC AUC metric on the validation sets. Table 2 presents the final results after applying thresholding on the test datasets. As a general pattern, the models based on the embedding fusion approach clearly outperform the models trained individually on each dataset. While the individual models (Subtask 1) are hardly usable compared to the random and popularity baselines, the combined models show a significant improvement in performance and are competitive with last year's second-ranked submission [6].

In our experiments, we focused on optimizing the macro F-score. However, choosing this metric for threshold optimization can have a negative effect on micro-averaged metrics. In the case of infrequent subgenre labels and an uninformative classifier, an optimal but undesirable strategy may involve always predicting those labels [5]. Indeed, this was the case for the individual models, but the hybrid models did not have this issue.
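
The effect described above can be reproduced on toy data. In the sketch below, a rare label held by 2 of 100 tracks is either never or always predicted, while a frequent label is assumed to be predicted perfectly; all label names and counts are invented for illustration.

```python
def counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives) for one label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, fp, fn

def f_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f(truth, preds):
    """Macro F averages per-label F-scores; micro F pools the counts first."""
    per_label = [counts(truth[g], preds[g]) for g in truth]
    macro = sum(f_from_counts(*c) for c in per_label) / len(per_label)
    micro = f_from_counts(*(sum(col) for col in zip(*per_label)))
    return macro, micro

n = 100
truth = {"common": [1] * 50 + [0] * 50, "rare": [1] * 2 + [0] * 98}
common_pred = truth["common"]  # pretend the frequent genre is predicted perfectly
never = {"common": common_pred, "rare": [0] * n}
always = {"common": common_pred, "rare": [1] * n}
macro_never, micro_never = macro_micro_f(truth, never)
macro_always, micro_always = macro_micro_f(truth, always)
```

Always predicting the rare label raises the macro F-score slightly, because that label's F-score moves from zero to a small positive value, but the 98 false positives it introduces pull the micro F-score down sharply.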

DISCUSSION AND OUTLOOK

In our baseline approach, we focused on Subtask 2 and demonstrated the advantage of fusing feature embeddings learned on individual genre datasets, using a simple feedforward network architecture as an example. Further improvements in performance may be expected from a more sophisticated network architecture (for example [4]). The code of the baseline is available online at https://github.com/MTG/acousticbrainz-mediaeval-baselines

ACKNOWLEDGMENTS

This research has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements No 688382 (AudioCommons) and 770376-2 (TROMPA), as well as the Ministry of Economy and Competitiveness of the Spanish Government (Reference: TIN2015-69935-P).

[Table: results per dataset (AllMusic, Discogs, Lastfm, Tagtraum) for Subtask 1 (individual models) and Subtask 2 (fusion models); values missing from the source.]

REFERENCES

[1] Dmitry Bogdanov, Alastair Porter, Julián Urbano, and Hendrik Schreiber. 2018. The MediaEval 2018 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources. In MediaEval 2018 Workshop. Sophia Antipolis, France.

[2] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J.R. Zapata, and X. Serra. 2013. Essentia: An Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval (ISMIR'13) Conference. Curitiba, Brazil, 493-498.

[3] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[4] Khaled Koutini, Alina Imenina, Matthias Dorfer, Alexander Rudolf Gruber, and Markus Schedl. 2017. MediaEval 2017 AcousticBrainz Genre Task: Multilayer Perceptron Approach. In MediaEval 2017 Workshop. Dublin, Ireland.

[5] Zachary C Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. 2014. Optimal thresholding of classifiers to maximize F1 measure. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 225-239.

[6] Benjamin Murauer, Maximilian Mayerl, Michael Tschuggnall, Eva Zangerle, Martin Pichl, and Günther Specht. 2017. Hierarchical Multilabel Classification and Voting for Genre Classification. In MediaEval 2017 Workshop. Dublin, Ireland.

[7] Sergio Oramas, Francesco Barbieri, Oriol Nieto, and Xavier Serra. 2018. Multimodal Deep Learning for Music Genre Classification. Transactions of the International Society for Music Information Retrieval 1, 1 (2018).