<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MediaEval 2015: A Segmentation-based Approach to Continuous Emotion Tracking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Aljanaki</string-name>
          <email>a.aljanaki@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frans Wiering</string-name>
          <email>F.Wiering@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Remco C. Veltkamp</string-name>
          <email>R.C.Veltkamp@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information and Computing Sciences, Utrecht University</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper we approach the task of continuous music emotion recognition using unsupervised audio segmentation as a preparatory step. The MediaEval task requires predicting the emotion of a song with a high time resolution of 2Hz. Though this resolution is necessary to find the exact locations of emotional changes, we believe that those changes occur more sparsely. We suggest that using bigger time windows for feature extraction and emotion prediction might make emotion recognition more accurate. We use an unsupervised method, Structural Features [6], to segment the audio from both the development set and the evaluation set. Then we use Gaussian Process regression to predict the emotion of each segment using features extracted with the Essentia and openSMILE frameworks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        This working notes paper describes a submission to the
Emotion in Music task in the MediaEval 2015 benchmark.
The task requires predicting the emotion of the music (arousal
or valence) based on musical audio continuously (over time)
with a resolution of 2Hz. The organizers provided an
annotated development set of 431 excerpts of 45 seconds, and
an evaluation set of 58 full-length songs. For more detail we
refer to the task overview paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        We approach the task of music emotion recognition by
taking it to a higher level, i.e., to segment-level
emotion recognition. We use an unsupervised audio
segmentation method to segment the music into emotionally
homogeneous excerpts; next, we predict the emotion for every
segment and then resample the result to 2Hz. As one of the
task requirements, baseline features from the openSMILE
framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (260 low-level spectral features) have to be
used. We also create our own feature set using
Essentia, which also contains high-level features and uses bigger
time windows for feature extraction, which becomes possible
when predicting the emotion of the music per segment.
      </p>
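      <p>As an illustration of this pipeline, the following minimal sketch (the function and variable names are hypothetical, not taken from our implementation) shows how per-segment predictions can be expanded back to the 2Hz grid required by the task.</p>
      <preformat>
# Minimal sketch (hypothetical names): expand per-segment emotion
# predictions back to the 2Hz grid required by the task.
import numpy as np

def to_2hz(segment_bounds, segment_values, duration_s):
    """segment_bounds: list of (start_s, end_s); segment_values: one
    prediction per segment; returns one value per 0.5-second frame."""
    times = np.arange(0.0, duration_s, 0.5)   # 2Hz time stamps
    output = np.zeros_like(times)
    for (start, end), value in zip(segment_bounds, segment_values):
        inside = np.logical_and(times >= start, end > times)
        output[inside] = value
    return times, output
      </preformat>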
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>In this section we describe the main steps of our
approach, namely annotation preprocessing, feature
extraction, the segmentation method, and the learning algorithm.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Annotation preprocessing</title>
      <p>
        The development set consists of excerpts of 45 seconds,
but the annotations are only provided from the 15th second
onwards, to give the annotators a generous habituation time.
In addition, dynamic emotion annotations can
have a time lag of 2-4 seconds because of the annotators'
reaction time [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. To compensate for this, we shift the
annotations by 3 seconds (i.e., we use audio from 12 to 42 seconds
to extract the features, and couple it with the annotations
from 15 to 45 seconds).
      </p>
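      <p>The shift can be expressed as a simple helper; the following sketch is illustrative only, and the names used are hypothetical.</p>
      <preformat>
# Minimal sketch (hypothetical names): pair features extracted from the
# 12-42 s audio span with the 15-45 s annotations, i.e., apply a 3 s
# shift that compensates for the annotators' reaction time.
LAG_S = 3.0   # assumed reaction-time lag

def shifted_audio_span(annotation_start_s=15.0, annotation_end_s=45.0):
    """Return the audio interval whose features are coupled with the
    annotations in [annotation_start_s, annotation_end_s]."""
    return annotation_start_s - LAG_S, annotation_end_s - LAG_S

audio_start, audio_end = shifted_audio_span()   # (12.0, 42.0)
      </preformat>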
    </sec>
    <sec id="sec-4">
      <title>2.2 Feature extraction (Essentia)</title>
      <p>
        We use the open-source framework Essentia [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to extract
a range of high-level (scale, tempo, tonal stability, etc.) and
low-level (spectral shape, MFCC, chroma, energy, dissonance, etc.)
features, for a total of 40 features. For low-level timbral
features we use a half-overlapping window of 100 ms; for
high-level features we use a window of 3 seconds.
      </p>
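      <p>The sketch below illustrates frame-wise low-level feature extraction with the Essentia Python bindings; only MFCCs are shown, and the parameter values are examples rather than the exact configuration of our 40-feature set.</p>
      <preformat>
# Illustrative sketch of frame-wise feature extraction with Essentia
# (MFCCs only; parameters are examples, not our exact configuration).
import essentia.standard as es
import numpy as np

audio = es.MonoLoader(filename='song.mp3', sampleRate=44100)()

frame_size = 4410    # 100 ms at 44.1 kHz
hop_size = 2205      # half-overlapping windows

window = es.Windowing(type='hann')
spectrum = es.Spectrum()
mfcc = es.MFCC(inputSize=frame_size // 2 + 1)

mfccs = []
for frame in es.FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size):
    _, coeffs = mfcc(spectrum(window(frame)))
    mfccs.append(coeffs)
mfccs = np.array(mfccs)   # (n_frames, 13) low-level timbral features
      </preformat>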
      <p>We use the same set of features both for segmentation and
for emotion recognition, but for segmentation purposes the
features are smoothed with a median sliding window and
resampled according to beats detected using the Essentia
BeatTracker algorithm.</p>
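      <p>The following sketch illustrates this preprocessing; BeatTrackerDegara stands in for the Essentia beat tracker, and the smoothing kernel size is an assumption rather than our exact setting.</p>
      <preformat>
# Illustrative sketch: median-smooth a feature time series and resample
# it at detected beat positions. BeatTrackerDegara and the kernel size
# are assumptions, not necessarily the exact variant/setting we used.
import essentia.standard as es
import numpy as np
from scipy.signal import medfilt

def beat_synchronous(features, hop_s, audio, kernel=9):
    """features: (n_frames, n_dims) array sampled every hop_s seconds."""
    smoothed = np.apply_along_axis(medfilt, 0, features, kernel)
    beats = es.BeatTrackerDegara()(audio)            # beat times in seconds
    idx = np.clip((beats / hop_s).astype(int), 0, len(features) - 1)
    return smoothed[idx]                             # one feature row per beat
      </preformat>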
    </sec>
    <sec id="sec-5">
      <title>2.3 Segmentation</title>
      <p>
        We use an unsupervised method to perform the
segmentation of both development and evaluation set audio. We chose
SF (Structural Features) because it performed best in an
evaluation of segmentation methods when applied to
emotional segmentation, with recall of 67% of emotional
boundaries [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Using the SF method to segment the development
set (instead of employing labeled emotionally homogeneous
segments as the ground truth) is a weak spot of our
approach, because it degrades the quality of the ground truth
data, which is no longer completely human-annotated after this
step. Our method could use any other dataset of
music excerpts labeled with valence and arousal, but for the
purposes of participating in the MediaEval benchmark we are
using the standard development set provided to all the
participants.
      </p>
      <p>The SF method is both homogeneity and repetition based.
It uses a variant of a lag matrix to obtain structural features.
The SF are differentiated to obtain a novelty curve, on which
peak picking is performed. The SF method calculates
self-similarity between samples i and j as follows:
S(i, j) = Θ(ε − ||x_i − x_j||),   (1)
where Θ(z) is the Heaviside step function, x_i is a feature time
series transformed using delay coordinates, ||z|| is the
Euclidean norm, and ε is a threshold, which is set adaptively for
each cell of the matrix S. From the matrix S, structural
features are then obtained using a lag matrix, and computing
the difference between successive structural features yields
a novelty curve.</p>
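      <p>A simplified numpy sketch of Equation (1) and the novelty curve is given below; the delay-coordinate embedding, the per-cell adaptive threshold and the lag matrix of the full SF method [6] are reduced here to a fixed global threshold.</p>
      <preformat>
# Simplified sketch of Eq. (1) and the novelty curve. The full SF
# method [6] uses delay-coordinate embedding, a per-cell adaptive
# threshold and a lag matrix; here the threshold is a fixed percentile
# and the embedding is omitted.
import numpy as np

def self_similarity(x, percentile=10):
    """x: (n_frames, n_dims) feature time series."""
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    eps = np.percentile(dist, percentile)     # crude global threshold
    return (eps - dist >= 0).astype(float)    # Heaviside step function

def novelty_curve(S):
    """Magnitude of the difference between successive columns of S."""
    diff = np.diff(S, axis=1)
    return np.sqrt((diff ** 2).sum(axis=0))
      </preformat>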
      <p>By means of the segmentation step we obtain 1304
segments with an average segment length of 10.8 ± 5.7 seconds
using Essentia features, and 1017 segments with an average
length of 10.7 ± 5.3 seconds using openSMILE features on the
development set. For each of the segments, we average the
continuous emotion annotation inside the segment to obtain
the training data.</p>
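      <p>Averaging the annotations per segment is straightforward; the sketch below uses hypothetical names for illustration.</p>
      <preformat>
# Minimal sketch (hypothetical names): average the 2Hz annotations that
# fall inside each segment to obtain one training target per segment.
import numpy as np

def segment_targets(annotation_times, annotations, segment_bounds):
    """annotations: arousal or valence values at annotation_times;
    segment_bounds: list of (start_s, end_s) per segment."""
    targets = []
    for start, end in segment_bounds:
        inside = np.logical_and(annotation_times >= start, end > annotation_times)
        targets.append(annotations[inside].mean())
    return np.array(targets)
      </preformat>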
      <p>We also segment the songs from the evaluation set in the
same way.</p>
    </sec>
    <sec id="sec-6">
      <title>2.4 Learning algorithm</title>
      <p>We use Gaussian Process regression to predict the
valence and arousal values per segment, using maximum
likelihood estimation of the best set of parameters. We use a
squared exponential autocorrelation function (radial basis
function):
K(i, j) = exp(−(i − j)² / (2ℓ²)),   (2)
where ℓ is the tuned length-scale parameter, and i and j are
points in feature space.</p>
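      <p>As an illustration, the sketch below uses scikit-learn's Gaussian Process regressor with an RBF kernel, which tunes the length scale by maximizing the marginal likelihood; it stands in for our implementation and its exact parametrization.</p>
      <preformat>
# Sketch of segment-level Gaussian Process regression with a squared
# exponential (RBF) kernel, as in Eq. (2). scikit-learn stands in for
# our actual implementation; the kernel hyperparameters are tuned by
# maximizing the marginal likelihood during fit().
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_and_predict(train_features, train_targets, test_features):
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(train_features, train_targets)    # one target value per segment
    return gp.predict(test_features)         # per-segment prediction
      </preformat>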
    </sec>
    <sec id="sec-7">
      <title>3. EVALUATION</title>
      <p>Figure 1 shows an example of the output of the algorithm.</p>
      <p>The task is evaluated based on RMSE and Pearson's
correlation coefficient between the ground truth and the
prediction, averaged across the 58 songs of the test set. The results
are displayed in Table 1.</p>
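      <p>The evaluation protocol can be summarized by the following sketch: both metrics are computed per song on the 2Hz sequences and then averaged over the songs of the test set.</p>
      <preformat>
# Minimal sketch of the evaluation: RMSE and Pearson's r per song,
# averaged over all songs of the test set.
import numpy as np
from scipy.stats import pearsonr

def evaluate(ground_truth_per_song, prediction_per_song):
    rmses, corrs = [], []
    for truth, pred in zip(ground_truth_per_song, prediction_per_song):
        rmses.append(np.sqrt(np.mean((truth - pred) ** 2)))
        corrs.append(pearsonr(truth, pred)[0])
    return np.mean(rmses), np.mean(corrs)
      </preformat>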
      <p>The algorithm based on features from Essentia performs
much better for arousal (both in terms of correlation and
RMSE), but worse for valence. Both algorithms perform
unacceptably badly on valence.</p>
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSION</title>
      <p>In this paper we described an approach to music
emotion variation detection which uses, as an intermediary step,
segmentation of the music into fragments of homogeneous emotion.
We used Gaussian Process modeling to predict the
emotion per segment, and two different frameworks (Essentia
and openSMILE) to extract the features, which were used
both during the segmentation and for emotion recognition.
Bringing the problem from the level of a sound fragment (half
a second) to the level of a short musical segment (10 seconds
on average) has two advantages. Firstly, employing longer
segments allows us to extract musically meaningful features,
such as tonality or tempo. Secondly, averaging features and
annotations over longer segments could be beneficial as a
smoothing step. The runs produced with the baseline
openSMILE low-level spectral features could not benefit from these
advantages, which could explain part of the difference in
performance on arousal. Both algorithms performed very badly
on valence.</p>
    </sec>
    <sec id="sec-9">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>This publication was supported by the Dutch national
program COMMIT/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wiering</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Veltkamp</surname>
          </string-name>
          .
          <article-title>Emotion based segmentation of musical audio</article-title>
          .
          <source>In Proceedings of the 16th International Society for Music Information Retrieval Conference</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in Music task at MediaEval 2015</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Herrera</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Mayor</surname>
          </string-name>
          .
          <article-title>Essentia: an audio analysis library for music information retrieval</article-title>
          .
          <source>In International Society for Music Information Retrieval Conference</source>
          , pages
          <fpage>493</fpage>
          -
          <lpage>498</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Recent developments in openSMILE, the Munich open-source multimedia feature extractor</article-title>
          .
          <source>In ACM Multimedia</source>
          , pages
          <fpage>835</fpage>
          -
          <lpage>838</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Schubert</surname>
          </string-name>
          . Handbook of Music and Emotion: Theory, Research, Applications, chapter
          <source>Continuous self-report methods.</source>
          , pages
          <fpage>223</fpage>
          -
          <lpage>253</lpage>
          . Oxford University Press,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Serrà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Grosche</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Arcos</surname>
          </string-name>
          .
          <article-title>Unsupervised music structure annotation by time series structure features and segment similarity</article-title>
          .
          <source>IEEE Transactions on Multimedia, Special Issue on Music Data Mining</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>