MediaEval 2015: A Segmentation-based Approach to Continuous Emotion Tracking

Anna Aljanaki, Frans Wiering, Remco C. Veltkamp
Information and Computing Sciences, Utrecht University, the Netherlands
a.aljanaki@uu.nl, F.Wiering@uu.nl, R.C.Veltkamp@uu.nl

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany


ABSTRACT

   In this paper we approach the task of continuous music emotion recognition using unsupervised audio segmentation as a preparatory step. The MediaEval task requires predicting the emotion of a song with a high time resolution of 2 Hz. Though this resolution is necessary to find the exact locations of emotional changes, we believe that those changes occur much more sparsely. We suggest that using bigger time windows for feature extraction and emotion prediction might make emotion recognition more accurate. We use the unsupervised Structural Features (SF) method [6] to segment the audio of both the development set and the evaluation set. We then use Gaussian Process regression to predict the emotion of each segment using features extracted with the Essentia and openSMILE frameworks.
1.   INTRODUCTION

   This working notes paper describes a submission to the Emotion in Music task of the MediaEval 2015 benchmark. The task requires predicting the emotion of music (arousal and valence) from musical audio continuously (over time) with a resolution of 2 Hz. The organizers provided an annotated development set of 431 excerpts of 45 seconds and an evaluation set of 58 full-length songs. For more detail we refer to the task overview paper [2].
   We approach music emotion recognition at a coarser time resolution, i.e., at the segment level. We use an unsupervised audio segmentation method to split the music into emotionally homogeneous excerpts; next, we predict the emotion of every segment and then resample the result to 2 Hz. As one of the task requirements, baseline features from the openSMILE framework [4] (260 low-level spectral features) have to be used. We also create our own feature set using Essentia, which additionally contains high-level features and uses bigger time windows for feature extraction, which becomes possible when predicting the emotion of the music per segment.
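   As a concrete illustration of the last step, the sketch below (our own Python illustration, not part of the submitted system; the segment boundaries and values are hypothetical) expands per-segment predictions back into a 2 Hz time series:

    import numpy as np

    def segments_to_2hz(boundaries_sec, segment_values, duration_sec):
        # Expand one prediction per segment into a 0.5 s (2 Hz) grid.
        # boundaries_sec: segment start times in seconds, the first one at 0.0.
        times = np.arange(0.0, duration_sec, 0.5)
        idx = np.searchsorted(boundaries_sec, times, side="right") - 1
        return times, np.asarray(segment_values)[idx]

    # e.g. segments_to_2hz([0.0, 11.2, 24.8], [0.3, -0.1, 0.5], duration_sec=40.0)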
2.   APPROACH

   In this section we describe the main steps of our approach, namely annotation preprocessing, feature extraction, the segmentation method and the learning algorithm.
2.1   Annotation preprocessing

   The development set consists of excerpts of 45 seconds, but the annotations are only provided from the 15th second onwards, to give the annotators a generous habituation time. Nevertheless, dynamic emotion annotations can lag by 2-4 seconds because of the annotators' reaction time [5]. To compensate for this, we shift the annotations by 3 seconds (i.e., we extract features from the audio between seconds 12 and 42 and couple them with the annotations from seconds 15 to 45).
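   A minimal sketch of this alignment (our own Python illustration; the 2 Hz annotation grid and the variable names are assumptions, not the task code):

    import numpy as np

    LAG = 3.0   # assumed compensation for annotator reaction time, in seconds

    def align_audio_and_annotations(song_duration=45.0, ann_start=15.0, hz=2.0):
        # Pair each 0.5 s annotation frame (15-45 s) with the audio window
        # LAG seconds earlier (12-42 s), from which the features are computed.
        ann_times = np.arange(ann_start, song_duration, 1.0 / hz)
        audio_times = ann_times - LAG
        return list(zip(audio_times, ann_times))

    # the first pair is (12.0, 15.0): features at 12 s predict the label at 15 s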
2.2   Feature extraction (Essentia)

   We use the open-source framework Essentia [3] to extract a range of high-level (scale, tempo, tonal stability, etc.) and low-level (spectral shape, MFCCs, chroma, energy, dissonance, etc.) features, 40 features in total. For the low-level timbral features we use a half-overlapping window of 100 ms; for the high-level features we use a window of 3 seconds.
   We use the same set of features both for segmentation and for emotion recognition, but for segmentation purposes the features are smoothed with a median sliding window and resampled according to beats detected with the Essentia BeatTracker algorithm.
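   A minimal sketch of such an extraction with Essentia's Python bindings (our own illustration; restricting the low-level features to MFCCs and using BeatTrackerMultiFeature for the beat positions are simplifications, not the exact configuration used in the paper):

    import numpy as np
    import essentia.standard as es

    SR = 44100
    audio = es.MonoLoader(filename='song.mp3', sampleRate=SR)()

    frame_size, hop = int(0.1 * SR), int(0.05 * SR)       # 100 ms window, half-overlapping
    window = es.Windowing(type='hann')
    spectrum = es.Spectrum()
    mfcc = es.MFCC(inputSize=frame_size // 2 + 1)

    low_level = []
    for frame in es.FrameGenerator(audio, frameSize=frame_size, hopSize=hop):
        _, coeffs = mfcc(spectrum(window(frame)))
        low_level.append(coeffs)
    low_level = np.array(low_level)                       # (n_frames, 13)

    beats, _ = es.BeatTrackerMultiFeature()(audio)        # beat positions in seconds
    # for segmentation, the frame-level features are then median-smoothed and
    # averaged between consecutive beats (beat-synchronous resampling)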
2.3   Segmentation

   We use an unsupervised method to perform the segmentation of both the development and the evaluation set audio. We chose SF (Structural Features) because it performed best in an evaluation of segmentation methods applied to emotional segmentation, with a recall of 67% of emotional boundaries [1]. Using the SF method to segment the development set (instead of employing labeled emotionally homogeneous segments as the ground truth) is a weak spot of our approach, because it degrades the quality of the ground truth data, which is no longer completely human-annotated after this step. Our method could use any other dataset of music excerpts labeled with valence and arousal, but for the purposes of participating in the MediaEval benchmark we use the standard development set provided to all participants.
   The SF method is both homogeneity and repetition based. It uses a variant of a lag matrix to obtain structural features. The structural features are differentiated to obtain a novelty curve, on which peak picking is performed. The SF method calculates the self-similarity between samples i and j as follows:

   S_{i,j} = Θ(ε_{i,j} − ‖x_i − x_j‖),                    (1)

where Θ(z) is the Heaviside step function, x_i is the feature time series transformed using delay coordinates, ‖·‖ is the Euclidean norm, and ε_{i,j} is a threshold, which is set adaptively for each cell of the matrix S. From the matrix S the structural features are then obtained using a lag matrix, and computing the difference between successive structural features yields a novelty curve.
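   For illustration, a simplified numpy sketch of Eq. (1) and of the novelty curve (our own code, not the SF reference implementation; the fixed nearest-neighbour fraction used as the adaptive threshold and the non-circular lag matrix are simplifications):

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.signal import find_peaks

    def recurrence_matrix(X, m=3, tau=1, kappa=0.1):
        # Eq. (1): S[i, j] = 1 if delay-embedded frames i and j are closer than
        # an adaptively chosen threshold epsilon, else 0.
        n = X.shape[0] - (m - 1) * tau
        E = np.hstack([X[i * tau:i * tau + n] for i in range(m)])  # delay coordinates
        D = cdist(E, E)                                  # Euclidean distances ||x_i - x_j||
        k = max(1, int(kappa * n))
        eps = np.sort(D, axis=1)[:, k][:, None]          # per-row adaptive threshold
        return (D < eps).astype(float)                   # Heaviside step

    def novelty_curve(S, max_lag=32):
        # Structural features from a (truncated) lag matrix; their first-order
        # difference gives the novelty curve on which peaks mark segment boundaries.
        n = S.shape[0]
        L = np.zeros((n, max_lag))
        for lag in range(max_lag):
            L[:n - lag, lag] = np.diag(S, k=lag)
        return np.r_[0.0, np.linalg.norm(np.diff(L, axis=0), axis=1)]

    # boundaries = find_peaks(novelty_curve(recurrence_matrix(features)))[0]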
   By means of the segmentation step we obtain 1304 segments with an average segment length of 10.8±5.7 seconds using the Essentia features, and 1017 segments with an average length of 10.7±5.3 seconds using the openSMILE features on the development set. For each of the segments, we average the continuous emotion annotations inside the segment to obtain the training data.
   We also segment the songs from the evaluation set in the same way.
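   A minimal sketch of that averaging step (our own illustration; ann_times and ann_values stand for the 2 Hz annotation grid, boundaries_sec for the detected segment starts):

    import numpy as np

    def segment_targets(ann_times, ann_values, boundaries_sec):
        # Average the 2 Hz annotations falling inside each segment, giving one
        # valence/arousal training target per segment.
        edges = np.r_[boundaries_sec, np.inf]
        return [float(np.mean(ann_values[(ann_times >= a) & (ann_times < b)]))
                for a, b in zip(edges[:-1], edges[1:])]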
2.4   Learning algorithm

   We use Gaussian Process regression to predict the valence and arousal values per segment, estimating the best set of parameters by maximum likelihood. We use a squared exponential covariance function (radial basis function):

   K(i, j) = exp(−‖i − j‖² / (2θ²)),                    (2)

where θ is a tuned parameter, and i and j are points in feature space.
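   A minimal sketch of such a model with scikit-learn (our own illustration, not the submitted implementation; the random per-segment data and the added white-noise term are assumptions):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    # Hypothetical per-segment training data: 40 averaged features, one arousal target each.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 40))
    y_train = rng.uniform(-1.0, 1.0, size=200)

    # RBF kernel as in Eq. (2); fit() maximises the marginal likelihood over theta.
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_train, y_train)

    y_segment = gp.predict(rng.normal(size=(5, 40)))      # one value per unseen segment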
3.   EVALUATION

   Figure 1 shows an example of the output of the algorithm.

   [Figure 1: Prediction of arousal for the song "You listen" by Meaxic.]

   The task is evaluated using the RMSE and Pearson's correlation coefficient between the ground truth and the prediction, averaged across the 58 songs of the test set. The results are displayed in Table 1.

   Framework    Target    RMSE             r
   Essentia     Valence   0.3576±0.1952   -0.1214±0.4156
   Essentia     Arousal   0.2640±0.1341    0.4050±0.3361
   openSMILE    Valence   0.2946±0.1473   -0.0853±0.3863
   openSMILE    Arousal   0.2854±0.1242    0.1669±0.3955

   Table 1: Evaluation results.

   The algorithm based on the features from Essentia performs much better for arousal (both in terms of correlation and RMSE), but worse for valence. Both feature sets perform unacceptably poorly on valence.
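   For reference, a small sketch of how these scores can be computed (our own code; per_song_truth and per_song_pred stand for lists of 2 Hz arrays, one pair per test song):

    import numpy as np
    from scipy.stats import pearsonr

    def evaluate(per_song_truth, per_song_pred):
        # Per-song RMSE and Pearson r, then mean +/- standard deviation over songs.
        rmse = [np.sqrt(np.mean((t - p) ** 2)) for t, p in zip(per_song_truth, per_song_pred)]
        corr = [pearsonr(t, p)[0] for t, p in zip(per_song_truth, per_song_pred)]
        return (np.mean(rmse), np.std(rmse)), (np.mean(corr), np.std(corr))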
4.   CONCLUSION

   In this paper we described an approach to music emotion variation detection that uses an intermediary step: segmentation of the music into fragments of homogeneous emotion. We used Gaussian Process regression to predict the emotion per segment, and two different frameworks (Essentia and openSMILE) to extract the features, which were used both during the segmentation and for emotion recognition. Moving the problem from the level of a short sound fragment (half a second) to the level of a musical segment (10 seconds on average) has two advantages. Firstly, employing longer segments makes it possible to extract musically meaningful features, such as tonality or tempo. Secondly, averaging features and annotations over longer segments can act as a beneficial smoothing step. The runs produced with the baseline openSMILE low-level spectral features could not benefit from these advantages, which could explain part of the difference in performance on arousal. Both feature sets performed poorly on valence.

5.   ACKNOWLEDGEMENTS

   This publication was supported by the Dutch national program COMMIT/.

6.   REFERENCES

[1] A. Aljanaki, F. Wiering, and R. C. Veltkamp. Emotion based segmentation of musical audio. In Proceedings of the 16th International Society for Music Information Retrieval Conference, 2015.
[2] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[3] D. Bogdanov, N. Wack, E. Gomez, S. Gulati, P. Herrera, and O. Mayor. Essentia: an audio analysis library for music information retrieval. In International Society for Music Information Retrieval Conference, pages 493–498, 2013.
[4] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In ACM Multimedia, pages 835–838, 2013.
[5] E. Schubert. Continuous self-report methods. In Handbook of Music and Emotion: Theory, Research, Applications, pages 223–253. Oxford University Press, 2011.
[6] J. Serra, M. Muller, P. Grosche, and J. L. Arcos. Unsupervised music structure annotation by time series structure features and segment similarity. IEEE Transactions on Multimedia, Special Issue on Music Data Mining, 2014.