MediaEval 2018 AcousticBrainz Genre Task: A baseline combining deep feature embeddings across datasets

Sergio Oramas¹, Dmitry Bogdanov², Alastair Porter²
¹Pandora Media Inc., US
²Universitat Pompeu Fabra, Spain
soramas@pandora.com, dmitry.bogdanov@upf.edu, alastair.porter@upf.edu

ABSTRACT

In this paper we present a baseline approach for the MediaEval 2018 AcousticBrainz Genre Task that takes advantage of stacking multiple feature embeddings learned on individual genre datasets by simple deep learning architectures. Although we employ basic neural networks, the combination of their deep feature embeddings provides a significant gain in performance compared to each individual network.

1 INTRODUCTION

This paper describes our baseline submission to the MediaEval 2018 AcousticBrainz Genre Task [1]. The goal of the task is to automatically classify music tracks by genre based on pre-computed audio content features provided by the organizers. Four genre datasets coming from different annotation sources with different genre taxonomies are used in the challenge. For each dataset, training, validation, and testing splits are provided. This allows us to build and evaluate classifier models for each genre dataset independently (Subtask 1) as well as to explore combinations of genre sources in order to boost the performance of the models (Subtask 2).

For this baseline, we decided to focus on demonstrating the possibilities of merging different genre ground-truth sources using a simple deep learning architecture. To this end, we explore how stacking deep feature embeddings obtained on different datasets can benefit genre recognition systems.

2 RELATED WORK

Submissions to the previous edition of the task have explored late fusion of predictions made by classifier models trained for each genre source individually. In order to predict genres following the taxonomy of a target source, the proposed solutions applied genre mapping between taxonomies, either by computing genre co-occurrences on the intersection of all four training genre datasets [4] or by textual string matching [6].

In our baseline, we propose an alternative early fusion approach, similar to the one proposed in [7] for multimodal genre classification. The approach incorporates knowledge across datasets by stacking deep feature embeddings learned on each dataset individually and using those as an input to predict genres for each test dataset.

3 APPROACH

3.1 Input features

We use all available features extracted from music audio recordings using Essentia [2] and provided for the challenge. As a preprocessing step, we apply one-hot encoding to a few categorical features related to tonality (key_key, key_scale, chords_key, and chords_scale) and standardize all features (zero mean, unit variance). In total, this amounts to 2669 input features.

3.2 Neural network architecture

A simple feedforward network is used to predict the probabilities of each genre given a track. The network consists of an input layer of 2669 units (the size of the feature vector for an input recording), followed by a hidden dense layer of 256 units with ReLU activation, and an output layer whose number of units coincides with the number of genres to be predicted in each dataset. Dropout of 0.5 is applied after the input and the hidden layer. As the targeted genre classification task is multi-label, the output layer uses sigmoid activations and is evaluated with a binary cross-entropy loss.

Mini-batches of 32 items are randomly sampled from the training data to compute the gradient, and the Adam [3] optimizer is used to train the models with the default suggested learning parameters. The networks are trained for a maximum of 100 epochs with early stopping. Once trained, we extract the 256-dimensional vectors from the hidden layer for the training, validation, and test sets.

This model architecture is used to train a multi-label genre classifier on each of the four datasets. The models are trained on 80% of the training set and validated after each epoch on the other 20%, using the split script with release-group filtering provided by the organizers. Predictions are computed for the validation and test sets.
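As a rough illustration of the preprocessing in Section 3.1, the sketch below one-hot encodes the four categorical tonality descriptors and standardizes the remaining features. The use of scikit-learn and the assumption that the flattened Essentia features sit in a pandas DataFrame (train_df, val_df) are ours; the paper only specifies the encoding and standardization themselves.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Categorical tonality descriptors named in the paper; every other
# (numeric) descriptor is standardized to zero mean and unit variance.
CATEGORICAL = ["key_key", "key_scale", "chords_key", "chords_scale"]

def build_preprocessor(df: pd.DataFrame) -> ColumnTransformer:
    numeric = [c for c in df.columns if c not in CATEGORICAL]
    return ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("scale", StandardScaler(), numeric),
    ])

# Hypothetical usage, with train_df / val_df holding the flattened features:
# preprocessor = build_preprocessor(train_df)
# X_train = preprocessor.fit_transform(train_df)  # ~2669 columns after encoding
# X_val = preprocessor.transform(val_df)
```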
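The per-dataset network of Section 3.2 can be sketched in Keras as follows. Layer sizes, dropout, loss, optimizer, batch size, and epoch limit come from the text above; the early-stopping patience and the layer name "embedding" (used later to read out the hidden activations) are our own choices, not taken from the released code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_inputs: int = 2669, n_labels: int = 100) -> keras.Model:
    # n_labels is the number of genre labels in the given dataset
    model = keras.Sequential([
        keras.Input(shape=(n_inputs,)),
        layers.Dropout(0.5),                     # dropout after the input layer
        layers.Dense(256, activation="relu", name="embedding"),
        layers.Dropout(0.5),                     # dropout after the hidden layer
        layers.Dense(n_labels, activation="sigmoid"),  # multi-label output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# model = build_model(n_labels=y_train.shape[1])
# model.fit(X_train, y_train, batch_size=32, epochs=100,
#           validation_data=(X_val, y_val),
#           callbacks=[keras.callbacks.EarlyStopping(
#               patience=5,                      # patience is an assumption
#               restore_best_weights=True)])
```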
3.3 Embedding fusion approach

Following the described methodology, one model per dataset is trained, and these models serve for the predictions in Subtask 1. Then, the given models are used as feature extractors. All four models share the same input format, so input feature vectors from one dataset can be used as input to a model trained on another dataset. Thus, for each model we feed all tracks from the training, validation and test sets of each dataset, and obtain the activations of the hidden layer as a 256-dimensional feature embedding. Therefore, for each track in each dataset we obtain four different feature embeddings, coming from each of the four previously trained models.

Given the four feature embeddings of each track, we apply l2-normalization to each of them and then stack them together into a single 1024-dimensional feature vector. Following this process, we obtain new feature vectors for every track in the training, validation and test sets of each dataset. Then, we use these feature vectors as input to a simple network where the input layer is directly connected to the output layer. Dropout of 0.5 is applied after the input layer. The output layer is exactly the same as in the network described above, with sigmoid activations and a binary cross-entropy loss. The new network is trained following the same methodology described before, with Adam as the optimizer and mini-batches of 32 items randomly sampled. The network is trained on 80% of the training data and validated on the other 20%. Following this approach, we train a network per dataset, and obtain the genre probability predictions of the validation and test sets for Subtask 2.
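A hedged sketch of this fusion step: read the 256-unit hidden-layer activations out of each of the four trained models, l2-normalize each embedding, concatenate them into a 1024-dimensional vector, and feed that to a single-layer classifier as described above. The "embedding" layer name and the dictionary of per-dataset models are assumptions carried over from the previous sketch.

```python
import numpy as np
from tensorflow import keras

def extract_embeddings(model: keras.Model, X: np.ndarray) -> np.ndarray:
    # Re-wire the trained model so that its output is the 256-unit hidden layer
    extractor = keras.Model(inputs=model.inputs,
                            outputs=model.get_layer("embedding").output)
    emb = extractor.predict(X)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, 1e-12)        # l2-normalize each row

def fuse(models: dict, X: np.ndarray) -> np.ndarray:
    # Stack the four 256-d embeddings into a single 1024-d feature vector
    return np.hstack([extract_embeddings(m, X)
                      for _, m in sorted(models.items())])

def build_fusion_model(n_labels: int) -> keras.Model:
    # Subtask 2 classifier: fused input connected directly to the sigmoid output
    model = keras.Sequential([
        keras.Input(shape=(1024,)),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(n_labels, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Hypothetical usage:
# X_fused = fuse({"allmusic": m1, "discogs": m2,
#                 "lastfm": m3, "tagtraum": m4}, X_train)
```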
3.4 Predictions thresholding

The predictions made by each model contain continuous values, while the task requires binary prediction of genre labels. We therefore apply a plug-in rule approach, thresholding the prediction values in order to maximize the evaluation metrics. We decided to maximize the macro F-score and applied individual thresholds for each genre label, which we estimated on the validation data [5].
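A minimal sketch of this per-label thresholding, assuming a simple grid search over candidate thresholds (the plug-in rule of [5] can be implemented more efficiently; the 0.01 to 0.99 grid is our assumption, not taken from the paper):

```python
import numpy as np
from sklearn.metrics import f1_score

def fit_thresholds(y_true: np.ndarray, y_prob: np.ndarray,
                   grid=None) -> np.ndarray:
    # For each label, pick the threshold maximizing F1 on the validation set
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)       # assumed candidate grid
    thresholds = np.empty(y_true.shape[1])
    for j in range(y_true.shape[1]):
        scores = [f1_score(y_true[:, j], y_prob[:, j] >= t, zero_division=0)
                  for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds

# th = fit_thresholds(y_val, model.predict(X_val))
# y_test_bin = (model.predict(X_test) >= th).astype(int)  # final binary labels
```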
4 RESULTS AND ANALYSIS

We evaluated a single run for both Subtask 1 and Subtask 2. Table 1 presents the ROC AUC metric on the validation sets. Table 2 presents the final results after applying thresholding on the test datasets. As a general pattern, we can clearly see the benefit of the models based on the embedding fusion approach compared to the models trained individually on each dataset. While the individual models (Subtask 1) are hardly usable compared to the random and popularity baselines, the combined models obtained a significant improvement in performance, being competitive with last year's second-ranked submission [6].

In our experiments, we focused on optimizing the macro F-score; however, choosing this metric for threshold optimization can have a negative effect on micro-averaged metrics. In the case of infrequent subgenre labels and an uninformative classifier, an optimal, but undesirable, strategy may involve always predicting those labels [5]. Indeed, this was the case for the individual models, but the hybrid models did not have this issue.

Table 1: ROC AUC on validation datasets

            AllMusic  Discogs  Lastfm  Tagtraum
Subtask 1   0.6476    0.7592   0.8276  0.8017
Subtask 2   0.8122    0.8863   0.9064  0.8874

Table 2: Precision, recall and F-scores on test datasets

Average per                    Dataset
                            AllMusic  Discogs  Lastfm  Tagtraum

Subtask 1 (individual models)
Recording (all labels)   P   0.0147    0.0591   0.0976  0.0992
                         R   0.5753    0.5263   0.4512  0.5017
                         F   0.0280    0.1035   0.1506  0.1623
Recording (genres)       P   0.2786    0.6305   0.3974  0.2991
                         R   0.6960    0.7289   0.5966  0.6630
                         F   0.3399    0.6382   0.4455  0.4040
Recording (subgenres)    P   0.0114    0.0256   0.0518  0.0636
                         R   0.4861    0.3497   0.3295  0.4164
                         F   0.0219    0.0467   0.0856  0.1083
Label (all labels)       P   0.0225    0.0744   0.0732  0.0951
                         R   0.4943    0.2588   0.2330  0.2412
                         F   0.0324    0.0935   0.0947  0.1141
Label (genres)           P   0.1658    0.3733   0.2321  0.2551
                         R   0.4729    0.4229   0.3484  0.3573
                         F   0.1938    0.3801   0.2546  0.2676
Label (subgenres)        P   0.0184    0.0595   0.0572  0.0764
                         R   0.4949    0.2506   0.2213  0.2276
                         F   0.0278    0.0792   0.0786  0.0962

Subtask 2 (fusion models)
Recording (all labels)   P   0.1340    0.2775   0.2718  0.2972
                         R   0.4809    0.5432   0.4762  0.5127
                         F   0.1880    0.3320   0.3066  0.3451
Recording (genres)       P   0.5689    0.6877   0.5407  0.6061
                         R   0.6905    0.7473   0.6335  0.6885
                         F   0.5880    0.6845   0.5602  0.6243
Recording (subgenres)    P   0.0946    0.1472   0.1570  0.2022
                         R   0.3251    0.3703   0.3368  0.4148
                         F   0.1343    0.1892   0.1911  0.2465
Label (all labels)       P   0.0614    0.1087   0.1108  0.1235
                         R   0.1640    0.2226   0.2168  0.2324
                         F   0.0736    0.1247   0.1314  0.1400
Label (genres)           P   0.2907    0.4404   0.3077  0.2878
                         R   0.3713    0.4713   0.3735  0.3565
                         F   0.3080    0.4393   0.3246  0.3053
Label (subgenres)        P   0.0550    0.0922   0.0909  0.1043
                         R   0.1582    0.2102   0.2009  0.2179
                         F   0.0670    0.1089   0.1119  0.1206

5 DISCUSSION AND OUTLOOK

In our baseline approach we focused on Subtask 2 and demonstrated the advantage of fusing feature embeddings learned on individual genre datasets, using a simple feedforward network architecture as an example. We may expect further improvements in performance by means of a more sophisticated network architecture (for example, [4]). The code of the baseline is available online at https://github.com/MTG/acousticbrainz-mediaeval-baselines.

ACKNOWLEDGMENTS

This research has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 688382 (AudioCommons) and 770376-2 (TROMPA), as well as the Ministry of Economy and Competitiveness of the Spanish Government (Reference: TIN2015-69935-P).

REFERENCES

[1] Dmitry Bogdanov, Alastair Porter, Julián Urbano, and Hendrik Schreiber. 2018. The MediaEval 2018 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources. In MediaEval 2018 Workshop. Sophia Antipolis, France.
[2] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J.R. Zapata, and X. Serra. 2013. Essentia: An Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval (ISMIR'13) Conference. Curitiba, Brazil, 493-498.
[3] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
[4] Khaled Koutini, Alina Imenina, Matthias Dorfer, Alexander Rudolf Gruber, and Markus Schedl. 2017. MediaEval 2017 AcousticBrainz Genre Task: Multilayer Perceptron Approach. In MediaEval 2017 Workshop. Dublin, Ireland.
[5] Zachary C. Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. 2014. Optimal Thresholding of Classifiers to Maximize F1 Measure. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 225-239.
[6] Benjamin Murauer, Maximilian Mayerl, Michael Tschuggnall, Eva Zangerle, Martin Pichl, and Günther Specht. 2017. Hierarchical Multilabel Classification and Voting for Genre Classification. In MediaEval 2017 Workshop. Dublin, Ireland.
[7] Sergio Oramas, Francesco Barbieri, Oriol Nieto, and Xavier Serra. 2018. Multimodal Deep Learning for Music Genre Classification. Transactions of the International Society for Music Information Retrieval 1, 1 (2018).