MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo

Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, Minz Won
Universitat Pompeu Fabra, Spain
name.surname@upf.edu

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT
This paper provides an overview of the Emotion and Theme Recognition in Music task organized as part of the MediaEval 2019 Benchmarking Initiative for Multimedia Evaluation. The goal of this task is to automatically recognize the emotions and themes conveyed in a music recording by means of audio analysis. We provide a large dataset of audio and labels that the participants can use to train and evaluate their systems. We also provide a baseline solution that utilizes a VGG-ish architecture. This overview paper presents the task challenges, the employed ground-truth information and dataset, and the evaluation methodology.

1 INTRODUCTION
Emotion and theme recognition is a popular task in music information retrieval that is relevant for music search and recommendation systems. We invite participants to try their skills at recognizing the moods and themes conveyed by audio tracks.

The last emotion recognition task at MediaEval [1] was held in 2014, and there has been a decline of interest since then. We bring the task back with openly available, good-quality audio data and labels from Jamendo.1 Jamendo includes both mood and theme annotations in its database.

1 https://jamendo.com

While there is a difference between emotions and moods, for this task we use the mood annotations as a proxy for understanding the emotions conveyed by the music. Themes are more ambiguous, but they usually describe well the concept or meaning that the artist is trying to convey with the music, or set the appropriate context in which the music is meant to be listened to.

Target audience: researchers in the areas of music information retrieval, music psychology, and machine learning, and, more generally, music and technology enthusiasts.

2 TASK DESCRIPTION
This task involves the prediction of moods and themes conveyed by a music track, given an audio signal. Moods are often feelings conveyed by the music (e.g., happy, sad, dark, melancholy), and themes are associations with events or contexts where the music is suited to be played (e.g., epic, melodic, christmas, love, film, space). We do not make a distinction between moods and themes for the purpose of this task. Each track is tagged with at least one tag that serves as ground truth.

Participants are expected to train a model that takes raw audio as an input and outputs the predicted tags. To solve the task, participants can use any audio input representation they desire, be it traditional handcrafted audio features, spectrograms, or raw audio inputs for deep learning approaches. We also provide a handcrafted feature set extracted by the Essentia [2] audio analysis library as a reference. We allow the use of third-party datasets for model development and training, but participants should mention this explicitly if they do so.

We provide a dataset that is split into training, validation, and testing subsets, with mood and theme labels properly balanced between subsets. The generated outputs for the test dataset will be evaluated according to typical performance metrics.

3 DATA
The dataset used for this task is the autotagging-moodtheme subset of the MTG-Jamendo Dataset [3], built using audio data from Jamendo and made available under Creative Commons licenses. In contrast to other open music archives, Jamendo bases its business on royalty-free music for commercial use, including music streaming for venues. It ensures a basic technical quality assessment of its collection, and the audio quality is therefore significantly more consistent with that of commercial music streaming services.

This subset includes 18,486 audio tracks with mood and theme annotations. There are 56 distinct tags in the dataset. All tracks have at least one tag, but many have more than one. The top 40 tags are shown in Figure 1.

[Figure 1: All tags (bar chart of the number of tracks per tag).]

As part of the pre-processing of the dataset, some tags were merged to consolidate variant spellings and tags with the same meaning (e.g., "dreamy" to "dream", "emotion" to "emotional"). The exact mapping is available in the dataset repository.2 In addition, tracks shorter than 30 seconds were removed, and tags used by fewer than 50 unique artists were discarded. Some tags were also discarded while generating the training, validation, and testing splits to ensure the absence of an artist and album effect [5], resulting in 56 tags after all pre-processing steps.

2 https://github.com/MTG/mtg-jamendo-dataset

We provide audio files in 320kbps MP3 format (152 GB) as well as compressed .npy files with pre-computed mel-spectrograms (68 GB). Scripts and instructions to download the data are provided in the dataset repository.
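As an illustration of how one might work with the provided inputs, the snippet below loads a single pre-computed mel-spectrogram and crops or zero-pads it to a fixed length before feeding it to a model. The file path, the (bands × frames) array layout, and the excerpt length are assumptions made for the example, not a specification of the dataset; see the dataset repository for the actual download scripts and file organization.

```python
import numpy as np

# Hypothetical path to one of the pre-computed mel-spectrogram .npy files;
# the actual directory layout is described in the dataset repository.
melspec_path = "melspecs/00/1000.npy"

# We assume a 2D array of shape (n_mel_bands, n_frames).
melspec = np.load(melspec_path)
print("mel-spectrogram shape:", melspec.shape)

# A common preparation step for fixed-size models: crop or zero-pad the
# time axis so that every training example has the same number of frames.
n_frames = 1366  # illustrative excerpt length, not prescribed by the task
if melspec.shape[1] >= n_frames:
    excerpt = melspec[:, :n_frames]
else:
    pad = n_frames - melspec.shape[1]
    excerpt = np.pad(melspec, ((0, 0), (0, pad)))
```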
3.1 Training, validation and test data
The MTG-Jamendo Dataset provides multiple random data splits for training, validation, and testing (60-20-20%). For this challenge we use split-0. Participants are expected to develop their systems using the provided training and validation splits.

The validation set should be used for tuning the hyperparameters of the models and for regularization against overfitting by early stopping. These optimizations should not be done using the test set, which should only be used to estimate the performance of the final submissions.

We place no restrictions on the use of third-party datasets for the development of the systems. In this case, we ask that participants also provide a baseline system trained using only data from the official training/validation set. Similarly, if one wants to append the validation set to the training data to build a model using more data for the final submission, a baseline using only the training set for training should be provided.

4 SUBMISSIONS AND EVALUATION
Participants should generate predictions for the test split and submit them to the task organizers, together with self-computed metrics. We provide evaluation scripts in the GitHub repository.3

3 https://github.com/MTG/mtg-jamendo-dataset/tree/master/scripts/mediaeval2019

To gain a better understanding of the behavior of the proposed systems, we ask participants to submit both prediction scores (probabilities or activation values) and binary classification decisions for each tag for each track in the test set. We provide a script that calculates activation thresholds and generates decisions from predictions by maximizing the macro F-score; see the documentation in the evaluation scripts directory of the dataset repository for instructions on how to do this.

We will use the following metrics, both types commonly used in the evaluation of auto-tagging systems (a sketch of how they can be computed is given after the list):
• Macro ROC-AUC and PR-AUC on tag prediction scores.
• Micro- and macro-averaged precision, recall, and F-score for binary decisions.
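For concreteness, the snippet below sketches how these metrics can be computed with scikit-learn from a matrix of per-tag prediction scores and the corresponding binary decisions. The randomly generated arrays and the simple per-tag threshold search are illustrative assumptions; the official implementation is the evaluation script provided in the dataset repository.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_recall_fscore_support, roc_auc_score)

# Illustrative stand-ins: ground-truth tag assignments and prediction scores,
# both of shape (n_tracks, n_tags), as produced for the test split.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 56))
y_score = np.clip(0.6 * y_true + 0.5 * rng.random((1000, 56)), 0.0, 1.0)

# Score-based metrics: macro-averaged ROC-AUC and PR-AUC (average precision).
roc_auc = roc_auc_score(y_true, y_score, average="macro")
pr_auc = average_precision_score(y_true, y_score, average="macro")

# Per-tag decision thresholds chosen by maximizing the F-score
# (in practice this search is done on the validation split).
thresholds = np.array([
    max(np.linspace(0.05, 0.95, 19),
        key=lambda t: f1_score(y_true[:, k],
                               (y_score[:, k] >= t).astype(int),
                               zero_division=0))
    for k in range(y_true.shape[1])
])
y_pred = (y_score >= thresholds).astype(int)

# Decision-based metrics: micro- and macro-averaged precision, recall, F-score.
print(f"ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}")
for avg in ("macro", "micro"):
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.3f}  recall={r:.3f}  F-score={f:.3f}")
```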
Participants should report the obtained metric scores on the validation split, and on the test split if they have run such a test on their own. Participants should also report whether they used the whole development dataset or only a part of it for each submission.

We will generate rankings of the submissions by ROC-AUC, PR-AUC, and micro- and macro-averaged F-score. For leaderboard purposes we will use PR-AUC as the main metric; however, we encourage comprehensive evaluation of the systems using all of these metrics, with the goal of generating more valuable insights on the proposed models when reporting evaluation results in the working notes.

A maximum of five evaluation runs per participating team is allowed.

Note that we rely on the fairness of submissions and do not hide the ground truth for the test split: it is publicly available for benchmarking as a part of the MTG-Jamendo Dataset outside this challenge. For transparency and reproducibility, we encourage participants to publicly release their code under an open-source/free software license on GitHub or another platform.

5 BASELINES

5.1 VGG-ish baseline approach
We used a widely adopted VGG-ish architecture [4] as our main baseline. It consists of five 2D convolutional layers followed by a dense connection. The implementation is available in the MTG-Jamendo Dataset repository. We trained the model for 1000 epochs and used the validation set to choose the best model. We found optimal decision thresholds for the activation values individually for each tag by maximizing the macro F-score. The evaluation results on the test set are presented in Table 1.
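To make this description concrete, here is a minimal PyTorch sketch of this family of architectures: five 2D convolutional blocks followed by a dense output layer with one sigmoid unit per tag. The filter counts, kernel sizes, the assumed (1 × 96 × 1366) mel-spectrogram input, and the class name are illustrative choices rather than the exact baseline; the actual implementation is the one distributed in the MTG-Jamendo Dataset repository.

```python
import torch
import torch.nn as nn

class VGGishTagger(nn.Module):
    """Sketch of a VGG-style tagger: five conv blocks plus a dense layer.

    Layer sizes and the assumed input of (batch, 1, n_mels, n_frames) are
    illustrative; see the MTG-Jamendo Dataset repository for the baseline.
    """

    def __init__(self, n_tags: int = 56, n_channels: int = 64):
        super().__init__()
        layers = []
        in_ch = 1
        for _ in range(5):  # five 2D convolutional blocks
            layers += [
                nn.Conv2d(in_ch, n_channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(n_channels),
                nn.ReLU(),
                nn.MaxPool2d(2),
            ]
            in_ch = n_channels
        self.features = nn.Sequential(*layers)
        # Global pooling makes the dense layer independent of the input length.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.dense = nn.Linear(n_channels, n_tags)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.pool(x).flatten(1)
        # One independent sigmoid score per tag (multi-label setup).
        return torch.sigmoid(self.dense(x))

# Example forward pass on a dummy mel-spectrogram batch.
model = VGGishTagger()
scores = model(torch.randn(8, 1, 96, 1366))
print(scores.shape)  # torch.Size([8, 56])
```

Training such a model for this task would typically minimize a binary cross-entropy loss over the 56 tags, with early stopping on the validation split as described in Section 3.1.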
5.2 Popularity baseline
The popularity baseline always predicts the most frequent tag in the training set (Table 1). For the training set of split-0 this is "happy".

Table 1: Baseline results

Metric             VGG-ish   Popularity
ROC-AUC            0.725     0.500
PR-AUC             0.107     0.031
Precision (macro)  0.138     0.001
Recall (macro)     0.308     0.017
F-score (macro)    0.165     0.002
Precision (micro)  0.116     0.079
Recall (micro)     0.373     0.044
F-score (micro)    0.177     0.057

6 CONCLUSIONS
By bringing Emotion and Theme Recognition in Music to MediaEval, we hope to benefit from the contributions and expertise of the broader machine learning and multimedia retrieval communities. We refer the reader to the MediaEval 2019 proceedings for further details on the methods and results of the teams participating in the task.

ACKNOWLEDGMENTS
We are thankful to Jamendo for providing us with the data and labels. This work has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068. This work was also funded by the predoctoral grant MDM-2015-0502-17-2 from the Spanish Ministry of Economy and Competitiveness, linked to the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502).

REFERENCES
[1] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. 2014. Emotion in Music Task at MediaEval 2014. In MediaEval.
[2] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. R. Zapata, and X. Serra. 2013. Essentia: An Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval Conference. Curitiba, Brazil.
[3] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging. In Proceedings of the Machine Learning for Music Discovery Workshop, 36th International Conference on Machine Learning (ICML 2019), 9-15 June 2019, Long Beach, California, USA. http://mtg.upf.edu/node/3957
[4] Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Automatic Tagging Using Deep Convolutional Neural Networks. arXiv preprint arXiv:1606.00298 (2016).
[5] Arthur Flexer and Dominik Schnitzer. 2009. Album and Artist Effects for Audio Similarity at the Scale of the Web. In Sound and Music Computing Conference.