MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo

Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, Minz Won
Universitat Pompeu Fabra, Spain
name.surname@upf.edu

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT
This paper provides an overview of the Emotion and Theme Recognition in Music task organized as part of the MediaEval 2019 Benchmarking Initiative for Multimedia Evaluation. The goal of this task is to automatically recognize the emotions and themes conveyed in a music recording by means of audio analysis. We provide a large dataset of audio and labels that the participants can use to train and evaluate their systems. We also provide a baseline solution that utilizes a VGG-ish architecture. This overview paper presents the task challenges, the employed ground-truth information and dataset, and the evaluation methodology.

1 INTRODUCTION
Emotion and theme recognition is a popular task in music information retrieval that is relevant for music search and recommendation systems. We invite participants to try their skills at recognizing the moods and themes conveyed by audio tracks.

The last emotion recognition task at MediaEval [1] was held in 2014, and there has been a decline of interest since then. We bring the task back with openly available, good-quality audio data and labels from Jamendo.1 Jamendo includes both mood and theme annotations in its database.

1 https://jamendo.com

While there is a difference between emotions and moods, for this task we use the mood annotations as a proxy for understanding the emotions conveyed by the music. Themes are more ambiguous, but they usually describe well the concept or meaning that the artist is trying to convey with the music, or set the appropriate context in which the music is meant to be listened to.

Target audience: researchers in the areas of music information retrieval, music psychology, and machine learning, and, more generally, music and technology enthusiasts.

2 TASK DESCRIPTION
This task involves the prediction of moods and themes conveyed by a music track, given an audio signal. Moods are often feelings conveyed by the music (e.g., happy, sad, dark, melancholy), and themes are associations with events or contexts where the music is suited to be played (e.g., epic, melodic, christmas, love, film, space). We do not make a distinction between moods and themes for the purpose of this task. Each track is tagged with at least one tag that serves as ground truth.

Participants are expected to train a model that takes raw audio as an input and outputs the predicted tags. To solve the task, participants can use any audio input representation they desire, be it traditional handcrafted audio features, spectrograms, or raw audio inputs for deep learning approaches. We also provide a handcrafted feature set extracted by the Essentia [2] audio analysis library as a reference. We allow the use of third-party datasets for model development and training, but participants should mention this explicitly if they do so.

We provide a dataset that is split into training, validation, and testing subsets, with mood and theme labels properly balanced between subsets. The generated outputs for the test dataset will be evaluated according to typical performance metrics.

3 DATA
The dataset used for this task is the autotagging-moodtheme subset of the MTG-Jamendo Dataset [3], built using audio data from Jamendo and made available under Creative Commons licenses. In contrast to other open music archives, Jamendo bases its business on royalty-free music for commercial use, including music streaming for venues. It ensures a basic technical quality assessment of its collection, and the audio quality is therefore significantly more consistent with that of commercial music streaming services.

This subset includes 18,486 audio tracks with mood and theme annotations. There are 56 distinct tags in the dataset. All tracks have at least one tag, but many have more than one. The top 40 tags are shown in Figure 1.

[Figure 1: All tags (bar chart of the number of tracks per tag).]

As part of the pre-processing of the dataset, some tags were merged to consolidate variant spellings and tags with the same meaning (e.g., "dreamy" to "dream", "emotion" to "emotional"). The exact mapping is available in the dataset repository.2 In addition, tracks shorter than 30 seconds were removed, and tags used by fewer than 50 unique artists were discarded. Some tags were also discarded while generating the training, validation, and testing splits to ensure the absence of an artist and album effect [5], resulting in 56 tags after all pre-processing steps.

2 https://github.com/MTG/mtg-jamendo-dataset

We provide audio files in 320kbps MP3 format (152 GB) as well as compressed .npy files with pre-computed mel-spectrograms (68 GB). Scripts and instructions to download the data are provided in the dataset repository.
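As an illustration of how one might work with the provided inputs, the snippet below loads a single pre-computed mel-spectrogram and crops or zero-pads it to a fixed length before feeding it to a model. The file path, the (bands × frames) array layout, and the excerpt length are assumptions made for the example, not a specification of the dataset; see the dataset repository for the actual download scripts and file organization.

```python
import numpy as np

# Hypothetical path to one of the pre-computed mel-spectrogram .npy files;
# the actual directory layout is described in the dataset repository.
melspec_path = "melspecs/00/1000.npy"

# We assume a 2D array of shape (n_mel_bands, n_frames).
melspec = np.load(melspec_path)
print("mel-spectrogram shape:", melspec.shape)

# A common preparation step for fixed-size models: crop or zero-pad the
# time axis so that every training example has the same number of frames.
n_frames = 1366  # illustrative excerpt length, not prescribed by the task
if melspec.shape[1] >= n_frames:
    excerpt = melspec[:, :n_frames]
else:
    pad = n_frames - melspec.shape[1]
    excerpt = np.pad(melspec, ((0, 0), (0, pad)))
```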
3.1 Training, validation and test data
The MTG-Jamendo Dataset provides multiple random data splits for training, validation, and testing (60-20-20%). For this challenge we use split-0. Participants are expected to develop their systems using the provided training and validation splits.

The validation set should be used for tuning the hyperparameters of the models and for regularization against overfitting by early stopping. These optimizations should not be done using the test set, which should only be used to estimate the performance of the final submissions.

We place no restrictions on the use of third-party datasets for the development of the systems. In this case, we ask that participants also provide a baseline system trained using only data from the official training/validation set. Similarly, if one wants to append the validation set to the training data to build a model using more data for the final submission, a baseline using only the training set for training should be provided.

4 SUBMISSIONS AND EVALUATION
Participants should generate predictions for the test split and submit them to the task organizers, together with self-computed metrics. We provide evaluation scripts in the GitHub repository.3

3 https://github.com/MTG/mtg-jamendo-dataset/tree/master/scripts/mediaeval2019

To gain a better understanding of the behavior of the proposed systems, we ask participants to submit both prediction scores (probabilities or activation values) and binary classification decisions for each tag for each track in the test set. We provide a script that calculates activation thresholds and generates decisions from predictions by maximizing the macro F-score; see the documentation in the evaluation scripts directory of the dataset repository for instructions on how to do this.

We will use the following metrics, both types commonly used in the evaluation of auto-tagging systems (a sketch of how they can be computed is given after the list):
• Macro ROC-AUC and PR-AUC on tag prediction scores.
• Micro- and macro-averaged precision, recall, and F-score for binary decisions.
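For concreteness, the snippet below sketches how these metrics can be computed with scikit-learn from a matrix of per-tag prediction scores and the corresponding binary decisions. The randomly generated arrays and the simple per-tag threshold search are illustrative assumptions; the official implementation is the evaluation script provided in the dataset repository.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_recall_fscore_support, roc_auc_score)

# Illustrative stand-ins: ground-truth tag assignments and prediction scores,
# both of shape (n_tracks, n_tags), as produced for the test split.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 56))
y_score = np.clip(0.6 * y_true + 0.5 * rng.random((1000, 56)), 0.0, 1.0)

# Score-based metrics: macro-averaged ROC-AUC and PR-AUC (average precision).
roc_auc = roc_auc_score(y_true, y_score, average="macro")
pr_auc = average_precision_score(y_true, y_score, average="macro")

# Per-tag decision thresholds chosen by maximizing the F-score
# (in practice this search is done on the validation split).
thresholds = np.array([
    max(np.linspace(0.05, 0.95, 19),
        key=lambda t: f1_score(y_true[:, k],
                               (y_score[:, k] >= t).astype(int),
                               zero_division=0))
    for k in range(y_true.shape[1])
])
y_pred = (y_score >= thresholds).astype(int)

# Decision-based metrics: micro- and macro-averaged precision, recall, F-score.
print(f"ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}")
for avg in ("macro", "micro"):
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.3f}  recall={r:.3f}  F-score={f:.3f}")
```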
Participants should report the obtained metric scores on the validation split, and on the test split if they have run such a test on their own. Participants should also report whether they used the whole development dataset or only a part of it for each submission.

We will generate rankings of the submissions by ROC-AUC, PR-AUC, and micro- and macro-averaged F-score. For leaderboard purposes we will use PR-AUC as the main metric; however, we encourage comprehensive evaluation of the systems using all of these metrics, with the goal of generating more valuable insights on the proposed models when reporting evaluation results in the working notes.

A maximum of five evaluation runs per participating team is allowed.

Note that we rely on the fairness of submissions and do not hide the ground truth for the test split: it is publicly available for benchmarking as a part of the MTG-Jamendo Dataset outside this challenge. For transparency and reproducibility, we encourage participants to publicly release their code under an open-source/free software license on GitHub or another platform.

5 BASELINES

5.1 VGG-ish baseline approach
We used a widely adopted VGG-ish architecture [4] as our main baseline. It consists of five 2D convolutional layers followed by a dense connection. The implementation is available in the MTG-Jamendo Dataset repository. We trained the model for 1000 epochs and used the validation set to choose the best model. We found optimal decision thresholds for the activation values individually for each tag by maximizing the macro F-score. The evaluation results on the test set are presented in Table 1.
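To make this description concrete, here is a minimal PyTorch sketch of this family of architectures: five 2D convolutional blocks followed by a dense output layer with one sigmoid unit per tag. The filter counts, kernel sizes, the assumed (1 × 96 × 1366) mel-spectrogram input, and the class name are illustrative choices rather than the exact baseline; the actual implementation is the one distributed in the MTG-Jamendo Dataset repository.

```python
import torch
import torch.nn as nn

class VGGishTagger(nn.Module):
    """Sketch of a VGG-style tagger: five conv blocks plus a dense layer.

    Layer sizes and the assumed input of (batch, 1, n_mels, n_frames) are
    illustrative; see the MTG-Jamendo Dataset repository for the baseline.
    """

    def __init__(self, n_tags: int = 56, n_channels: int = 64):
        super().__init__()
        layers = []
        in_ch = 1
        for _ in range(5):  # five 2D convolutional blocks
            layers += [
                nn.Conv2d(in_ch, n_channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(n_channels),
                nn.ReLU(),
                nn.MaxPool2d(2),
            ]
            in_ch = n_channels
        self.features = nn.Sequential(*layers)
        # Global pooling makes the dense layer independent of the input length.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.dense = nn.Linear(n_channels, n_tags)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.pool(x).flatten(1)
        # One independent sigmoid score per tag (multi-label setup).
        return torch.sigmoid(self.dense(x))

# Example forward pass on a dummy mel-spectrogram batch.
model = VGGishTagger()
scores = model(torch.randn(8, 1, 96, 1366))
print(scores.shape)  # torch.Size([8, 56])
```

Training such a model for this task would typically minimize a binary cross-entropy loss over the 56 tags, with early stopping on the validation split as described in Section 3.1.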
5.2 Popularity baseline
The popularity baseline always predicts the most frequent tag in the training set (Table 1). For the training set of split-0 this is "happy".

Table 1: Baseline results

Metric             VGG-ish   Popularity
ROC-AUC            0.725     0.500
PR-AUC             0.107     0.031
Precision (macro)  0.138     0.001
Recall (macro)     0.308     0.017
F-score (macro)    0.165     0.002
Precision (micro)  0.116     0.079
Recall (micro)     0.373     0.044
F-score (micro)    0.177     0.057

6 CONCLUSIONS
By bringing Emotion and Theme Recognition in Music to MediaEval, we hope to benefit from the contributions and expertise of the broader machine learning and multimedia retrieval communities. We refer the reader to the MediaEval 2019 proceedings for further details on the methods and results of the teams participating in the task.

ACKNOWLEDGMENTS
We are thankful to Jamendo for providing us with the data and labels. This work has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068. This work was also funded by the predoctoral grant MDM-2015-0502-17-2 from the Spanish Ministry of Economy and Competitiveness, linked to the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502).

REFERENCES
[1] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. 2014. Emotion in Music Task at MediaEval 2014. In MediaEval.
[2] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. R. Zapata, and X. Serra. 2013. Essentia: An Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval Conference. Curitiba, Brazil.
[3] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging. In Proceedings of the Machine Learning for Music Discovery Workshop, 36th International Conference on Machine Learning (ICML 2019), 9-15 June 2019, Long Beach, California, USA. http://mtg.upf.edu/node/3957
[4] Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Automatic Tagging Using Deep Convolutional Neural Networks. arXiv preprint arXiv:1606.00298 (2016).
[5] Arthur Flexer and Dominik Schnitzer. 2009. Album and Artist Effects for Audio Similarity at the Scale of the Web. In Sound and Music Computing Conference.