                   Emotion in Music Task at MediaEval 2015

Anna Aljanaki
Information and Computing Sciences, Utrecht University, the Netherlands
a.aljanaki@uu.nl

Yi-Hsuan Yang
Academia Sinica, Taipei, Taiwan
yang@citi.sinica.edu.tw

Mohammad Soleymani
Computer Science Dept., University of Geneva, Switzerland
mohammad.soleymani@unige.ch∗



ABSTRACT

The Emotion in Music task is held for the third consecutive year at the MediaEval benchmarking campaign. The unceasing interest in the task shows that the music emotion recognition (MER) problem is truly important to the community, and there is a lot remaining to be discovered about it. Automatic MER methods could greatly improve the accessibility of music collections by providing quick and standardized means of music categorization and indexing. In the Emotion in Music task we provide a benchmark for automatic MER methods. This year, we concentrated on a single task, which proved to be the most challenging in the previous years: dynamic emotion characterization. We put special emphasis on providing high-quality ground truth data and maximizing inter-annotator agreement. As a consequence of meeting this higher quality demand, the datasets for both training and evaluation are smaller than in the previous years. The data consists of music licensed under Creative Commons from the Free Music Archive, the medleyDB dataset and Jamendo. This paper describes the dataset collection, annotations, and evaluation criteria of the task.

1.   INTRODUCTION

   Contemporary music listeners rely on online music services such as Spotify, iTunes or Soundcloud to access their favorite music. In order to make their collections accessible, music libraries need to classify music by genre, instrumentation, tempo and mood. Automatic solutions to the auto-tagging problem are invaluable because they make annotation fast, cheap and standardized. Emotion is one of the most important search criteria for music. Automatic MER (music emotion recognition) algorithms rely on ground truth for training. There are many ways in which such a ground truth can be generated [6], using different affective representations or different temporal granularities. Depending on the affective model or temporal resolution, the evaluation criteria can vary. These discrepancies make it very difficult to compare different methods. The Emotion in Music task is designed to develop a benchmark and an evaluation framework for such a comparison.
   The task is held for the third year in the MediaEval benchmarking campaign for multimedia evaluation (http://www.multimediaeval.org) [1, 14]. Building on our experience in the last two years, we concentrate on a single dynamic emotion characterization task and on offering high-quality ground truth.
   The only other current evaluation task for MER is the audio mood classification (AMC) task of the annual music information retrieval evaluation exchange (MIREX, http://www.music-ir.org/mirex/wiki/) [8]. In this task, 600 audio files are provided to the participants, who have agreed not to distribute the files for commercial purposes. However, AMC has been criticized for using an emotional model that is not based on psychological research. Namely, this benchmark uses five discrete emotion clusters, derived from cluster analysis of online tags, instead of the more widely accepted dimensional or categorical models of emotion. It has also been noted that there is semantic and acoustic overlap between the clusters [12]. Furthermore, the dataset only applies a single static rating per audio clip, which belies the time-varying nature of music. Since 2013, another set of 1,438 segments of 30 seconds clipped from Korean pop songs has been used in MIREX as well. However, the same five-class taxonomy is adopted for this Korean set.
   Since the first edition of the Emotion in Music task in 2013, we have opted for characterizing the per-second emotion of music as numerical values in two dimensions: valence (positive or negative emotions expressed in the music) and arousal (energy of the music) (VA) [13, 17], making it easier to depict the temporal dynamics of emotion variation. The VA model has been widely adopted in affective research [2, 6, 9–11, 15, 18–20]. However, the model is not free of criticism, and other alternatives may be considered in the future. For example, the VA model has been criticized for being too reductionist, and it has been argued that other dimensions such as dominance should be added [5]. Moreover, the terms ‘valence’ and ‘arousal’ may sometimes be too abstract for people to have a common understanding of their meaning. Such drawbacks of the VA model can further harm the inter-annotator agreement for an annotation task that is already inherently fairly subjective.

∗ This research was supported in part by the Ambizione program of the Swiss National Science Foundation and the FES project COMMIT/. We thank Alexander Lansky from Queens University, Canada, and Yu-Hao Chin from National Central University, Taiwan, for assistance with the song selection and annotations.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany

2.   TASK DESCRIPTION

   This year we offer only one task: dynamic emotion characterization. However, in order to permit a thorough comparison between different methods, this year we require the participants to submit two different runs.
   • In one run, the participants are required to submit their features, and we use a baseline regression method (linear regression) to estimate the dynamic affect. Any features automatically extracted from the audio or from the metadata provided by the organizers are allowed.

   • In the second required run, all participants are required to use the baseline features that we provide (see Section 3 for details), so that their machine learning methods can be compared. Participants are also free to submit any combination of features and machine learning methods, up to a total of five runs.

   The participants will estimate the valence and arousal scores continuously in time for every segment (half a second long) on a scale from –1 to 1. The participants have to submit predictions of both valence and arousal, their feature set (if different from the basic provided one), and their predictions when using the universal feature set. We will use the Root-Mean-Square Error (RMSE) as the primary evaluation measure. We will also report the Pearson correlation (r) between the predictions and the ground truth. We will rank the submissions based on the averaged RMSE. Whenever the difference based on a one-sided Wilcoxon test is not significant (p > 0.05), we will use the averaged correlation coefficient to break the tie.
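To make the evaluation procedure concrete, the following is a minimal sketch of the two measures, assuming per-song arrays of frame-level predictions and ground truth; the function and variable names are illustrative, and this is not the official scoring script.

```python
# Sketch of the evaluation measures: per-song RMSE and Pearson's r over the
# 0.5-second frames of one dimension (valence or arousal), then averaged.
import numpy as np
from scipy.stats import pearsonr

def per_song_scores(predictions, ground_truth):
    """predictions, ground_truth: dicts mapping a song id to a 1-D array of frame values."""
    rmse, corr = [], []
    for song_id, truth in ground_truth.items():
        pred = np.asarray(predictions[song_id], dtype=float)
        truth = np.asarray(truth, dtype=float)
        rmse.append(np.sqrt(np.mean((pred - truth) ** 2)))
        corr.append(pearsonr(pred, truth)[0])
    return np.array(rmse), np.array(corr)

# Ranking uses np.mean(rmse); when a one-sided Wilcoxon test over the per-song
# RMSEs of two submissions is not significant (p > 0.05), the averaged Pearson r
# breaks the tie, e.g. scipy.stats.wilcoxon(rmse_a, rmse_b, alternative="less").
```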
3.   DATASETS AND GROUND TRUTH

   Our datasets consist of royalty-free music from several sources: freemusicarchive.org (FMA), jamendo.com, and the medleyDB dataset [3]. The development set consists of 431 clips of 45 seconds, which were selected from last year’s data based on inter-annotator agreement criteria. The test set comprises 58 complete music pieces with an average duration of 234 ± 105.7 seconds.
   The development set is a subset of the clips from the previous two years [1, 14], all of which are from FMA. The subset was selected according to the following procedure (steps 1 and 2 are sketched in code after this list):
    1. We deleted the annotations whose Pearson correlation with the averaged annotation for the same song was below 0.1. If fewer than 5 annotators remained after the deletion, we discarded the song.
    2. For the remaining songs and remaining annotations, we calculated Cronbach’s α. If it was larger than 0.6, the song was retained.
    3. The mean (bias) of every dynamic annotation was shifted to match the averaged static annotation for the same song.
This procedure resulted in a reduction from 1,744 songs to 431 songs (the rest did not have sufficiently consistent annotations), each of which was annotated by 5–7 workers from Amazon Mechanical Turk (MTurk). The Cronbach’s α is 0.76 ± 0.12 for arousal, and 0.73 ± 0.12 for valence.
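The sketch below illustrates selection steps 1 and 2 on a single song; the array layout and function names are assumptions made for illustration and do not reproduce the organizers’ actual preprocessing code.

```python
# Illustrative sketch of the annotation-filtering criteria for one song.
# `annotations` is assumed to be an array of shape (n_annotators, n_frames)
# holding the dynamic ratings of one dimension (valence or arousal).
import numpy as np

def filter_annotations(annotations, min_corr=0.1, min_annotators=5):
    """Drop annotators whose correlation with the averaged annotation is below 0.1."""
    mean_curve = annotations.mean(axis=0)
    kept = [a for a in annotations
            if np.corrcoef(a, mean_curve)[0, 1] >= min_corr]
    return np.array(kept) if len(kept) >= min_annotators else None  # None = song discarded

def cronbach_alpha(annotations):
    """Cronbach's alpha across annotators (items = annotators, observations = frames)."""
    k = annotations.shape[0]
    item_vars = annotations.var(axis=1, ddof=1).sum()
    total_var = annotations.sum(axis=0).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# A song is retained for the development set when filter_annotations() keeps at
# least 5 annotators and cronbach_alpha() on the kept annotations exceeds 0.6.
```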
   The evaluation set consists of 58 complete songs, one half from the medleyDB dataset [3] of royalty-free multitrack recordings and the other half from the jamendo.com music website, which provides music under a Creative Commons license. We selected songs with some emotional variation in them, from genres corresponding to the ones in the development set. We used the same annotation interface as in the previous two years: a slider that is continuously moved by an annotator while listening to the music. The position of the slider indicates the magnitude of valence or arousal.
   The evaluation data we collected this year differs in several respects. First, we opted for full-length songs to cover the whole affective variation. Second, we partially annotated the data in the laboratory. The evaluation set was annotated by 6 people: two onsite annotators and 4 conscientious MTurk workers, with 29% of the annotations done in the lab. This way, we can compare the agreement between the onsite workers and the crowdworkers. The annotators listened to the entire song before starting the annotation, to get familiar with the music and to reduce the reaction-time lag. The workers were only paid the full fee after their work was reviewed and appeared to be of high quality. The Cronbach’s α this year is 0.65 ± 0.28 for arousal, and 0.29 ± 0.94 for valence. In comparison, the Cronbach’s α for two other existing datasets, MoodSwings [16] and AMG1608 [4], is 0.41 and 0.46 for arousal, and 0.25 and 0.31 for valence, respectively. Compared to these datasets, the consistency of our annotations has improved for arousal, but not for valence.
   Note that there is a mismatch between the training and test sets in terms of the duration of the clips (45-second segments versus full songs) and the data sources (FMA versus medleyDB and jamendo). In contrast, in both 2013 and 2014 the training and test sets were of the same length and both were drawn from FMA [1, 14].
3.1    Baseline features

   In order to enable comparison between different machine learning algorithms, we provide a baseline universal feature set, extracted with openSMILE [7], consisting of 260 low-level features (the mean and standard deviation of 65 low-level acoustic descriptors and of their first-order derivatives). In addition to the audio features, we also provide meta-data covering the genre labels obtained from FMA and, for some of the songs, folksonomy tags crawled from last.fm.
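As a rough illustration of how such a 260-dimensional segment vector is assembled, the sketch below aggregates frame-level descriptors into per-segment statistics. It assumes the 65 low-level descriptors have already been extracted (e.g. with openSMILE); the exact openSMILE configuration is not reproduced here.

```python
# Sketch of the segment-level feature layout: mean and standard deviation of
# 65 low-level descriptors (LLDs) and of their first-order differences.
import numpy as np

def segment_features(lld_frames):
    """lld_frames: array of shape (n_frames, 65) with the frame-level LLDs of one segment."""
    deltas = np.diff(lld_frames, axis=0)                 # first-order differences over frames
    return np.concatenate([lld_frames.mean(axis=0), lld_frames.std(axis=0),
                           deltas.mean(axis=0), deltas.std(axis=0)])  # 4 * 65 = 260 values
```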
4.    BASELINE RESULTS

   For the baseline, we used the openSMILE toolbox [7] to extract the 260 features from non-overlapping segments of 500 ms, with a frame size of 60 ms and a 10 ms step. We used multiple linear regression (MLR), following the previous years. The results are shown in the first row of Table 1. Compared to last year (for arousal, r = 0.27 ± 0.12; for valence, r = 0.19 ± 0.11), the baseline is worse. We also calculated an average baseline by using the average of all the development set ground truth as the prediction for all the songs. In terms of RMSE, this average baseline performs better for valence and at the same level for arousal.

Table 1: Baseline results.

                        Arousal                   Valence
                    RMSE        r             RMSE        r
openSMILE + MLR     0.27±0.11   0.36±0.26     0.37±0.18   0.01±0.38
Average baseline    0.28±0.13   –             0.29±0.14   –
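The two baselines in Table 1 can be reproduced in spirit with standard tooling; the following is a hedged sketch in which scikit-learn’s LinearRegression stands in for MLR, and the organizers’ exact training setup may differ.

```python
# Sketch of the two baselines: multiple linear regression on the 260 openSMILE
# features, and a constant "average" baseline that predicts the development-set
# mean of the ground truth for every segment.
import numpy as np
from sklearn.linear_model import LinearRegression

def train_mlr(features, targets):
    """features: (n_segments, 260) array; targets: (n_segments,) valence or arousal values."""
    return LinearRegression().fit(features, targets)

def average_baseline(targets):
    """Returns a predictor that outputs the development-set mean for every segment."""
    mean_value = float(np.mean(targets))
    return lambda features: np.full(len(features), mean_value)
```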
5.    CONCLUSIONS

   A task has been developed to analyze emotion in music. Annotations were collected using both onsite annotators and crowdsourcing workers. The quest for higher-quality labels has led to a lower number of training and evaluation samples.
6.   REFERENCES
 [1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion
     in music task at MediaEval 2014. In MediaEval 2014
     Workshop, 2014.
 [2] M. Barthet, G. Fazekas, and M. Sandler.
     Multidisciplinary perspectives on music emotion
     recognition: Implications for content and
     context-based models. In Int’l Symp. Computer Music
     Modelling & Retrieval, pages 492–507, 2012.
 [3] R. Bittner, J. Salamon, M. Tierney, M. Mauch,
     C. Cannam, and J. P. Bello. MedleyDB: A multitrack
     dataset for annotation-intensive MIR research. In Proc.
     ISMIR, 2014.
 [4] Y.-A. Chen, J.-C. Wang, Y.-H. Yang, and H. H. Chen.
     The AMG1608 dataset for music emotion recognition.
     In Proc. IEEE Int. Conf. Acoust., Speech, Signal
     Process., pages 693–697, 2015.
 [5] G. Collier. Beyond valence and activity in the
     emotional connotations of music. Psychology of Music,
     35(1):110–131, 2007.
 [6] T. Eerola. Modelling emotions in music: Advances in
     conceptual, contextual and validity issues. In AES
     International Conference, 2014.
 [7] F. Eyben, F. Weninger, F. Gross, and B. Schuller.
     Recent developments in openSMILE, the Munich
     Open-source Multimedia Feature Extractor. In
     Proceedings of ACM MM, pages 835–838, 2013.
 [8] X. Hu, J. S. Downie, C. Laurier, M. Bay, and A. F.
     Ehmann. The 2007 MIREX audio mood classification
     task: Lessons learned. In Proc. Int. Soc. Music Info.
     Retrieval Conf., pages 462–467, 2008.
 [9] A. Huq, J. P. Bello, and R. Rowe. Automated music
     emotion recognition: A systematic evaluation. Journal
     of New Music Research, 39(3):227–244, 2010.
[10] Y. E. Kim, E. M. Schmidt, R. Migneco, B. G. Morton,
     P. Richardson, J. Scott, J. Speck, and D. Turnbull.
     Music emotion recognition: A state of the art review.
     In Proc. Int. Soc. Music Info. Retrieval Conf., 2010.
[11] S. Koelstra, C. Mühl, M. Soleymani, J.-S. Lee,
     A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and
     I. Patras. DEAP: A database for emotion analysis;
     using physiological signals. IEEE Trans. Affective
     Computing, 3(1):18–31, 2012.
[12] C. Laurier and P. Herrera. Audio music mood
     classification using support vector machine. In MIREX
     Task on Audio Mood Classification, 2007.
[13] J. A. Russell. A circumplex model of affect. J.
     Personality & Social Psychology, 39(6):1161–1178, 1980.
[14] M. Soleymani, M. N. Caro, E. M. Schmidt, C.-Y. Sha,
     and Y.-H. Yang. 1000 songs for emotional analysis of
     music. In Proceedings of the 2nd ACM International
     Workshop on Crowdsourcing for Multimedia, pages
     1–6, 2013.
[15] M. Soleymani, M. Larson, T. Pun, and A. Hanjalic.
     Corpus development for affective video indexing.
     IEEE Trans. Multimedia, 16(4):1075–1089, 2014.
[16] J. A. Speck, E. M. Schmidt, B. G. Morton, and Y. E.
     Kim. A comparative study of collaborative vs.
     traditional musical mood annotation. In Proc. Int.
     Soc. Music Info. Retrieval Conf., 2011.
[17] R. E. Thayer. The Biopsychology of Mood and
     Arousal. Oxford University Press, New York, 1989.
[18] J.-C. Wang, Y.-H. Yang, H.-M. Wang, and S.-K. Jeng.
     Modeling the affective content of music with a
     Gaussian mixture model. IEEE Trans. Affective
     Computing, 6(1):56–68, 2015.
[19] S. Wang and Q. Ji. Video affective content analysis: a
     survey of state of the art methods. IEEE Trans.
     Affective Computing, PP(99):1, 2015.
[20] Y.-H. Yang and H.-H. Chen. Machine recognition of
     music emotion: A review. ACM Trans. Intel. Systems
     & Technology, 3(4), 2012.