<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Affective Feature Design and Predicting Continuous Affective Dimensions from Music</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Naveen Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rahul Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanaya Guha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Colin Vaz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten Van Segbroeck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jangwon Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shrikanth S. Narayanan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Signal Analysis and Interpretation Lab (SAIL) University of Southern California</institution>
          ,
          <addr-line>Los Angeles</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper presents affective features designed for music and develops a method to predict dynamic emotion ratings along the arousal and valence dimensions. We learn a model to predict continuous-time emotion ratings based on a combination of global and local features. This allows us to exploit information from both scales to make a more robust prediction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2014 Emotion in Music challenge consists
of two tasks: designing affective features, and predicting
continuous emotion dimensions (arousal and valence) from
music [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These tasks are important to our understanding of
audio-based emotion prediction [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] which finds
applications in interaction modeling and the automatic assessment
of people's socioemotional states.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. AFFECTIVE FEATURE DESIGN</title>
      <p>Designing features that can correlate well with human
affective dimensions is critical. Below, we describe three new
features designed to capture affect from music.</p>
      <p>Compressibility features (comp): We hypothesize that
stronger emotions such as high arousal or high valence are evoked
by a complex interplay of various musical components, and that,
therefore, the complexity of a music signal may be correlated with
affect. We measure the complexity of a music signal by its
compressibility, i.e., how much the signal can be compressed.
Intuitively, the more a given signal can be compressed, the lower
its complexity. This quantity is related to a theoretical measure
of data complexity (Kolmogorov complexity), which is in general
non-computable. In practice, Kolmogorov complexity is often
approximated by the length of the compressed data.</p>
      <p>We first convert each mp3 music file to raw audio format
and compress it using a lossless audio codec (FLAC). The length of
the compressed file forms the global compressibility feature
(global comp). We also use this idea to create a dynamic feature
(dynamic comp), where each 0.5 sec segment of the music file is
compressed in the same manner to yield a frame-level
compressibility value.</p>
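      <p>A minimal sketch of these two features is given below. It assumes
the mp3 has already been decoded to a mono waveform array (e.g., with
ffmpeg); the soundfile library is used here for FLAC encoding, which is
one possible choice rather than necessarily the exact tool used in this
work.</p>
      <preformat>
import io
import numpy as np
import soundfile as sf  # lossless FLAC encoding


def flac_size(x, sr=44100):
    """Length in bytes of the FLAC-compressed signal x."""
    buf = io.BytesIO()
    sf.write(buf, x, sr, format='FLAC')
    return buf.getbuffer().nbytes


def compressibility_features(x, sr=44100, win_sec=0.5):
    # Global feature: compressed length of the entire clip.
    global_comp = flac_size(x, sr)
    # Dynamic feature: compressed length of every 0.5 s segment.
    hop = int(win_sec * sr)
    dynamic_comp = np.array([flac_size(x[i:i + hop], sr)
                             for i in range(0, len(x) - hop + 1, hop)])
    return global_comp, dynamic_comp
      </preformat>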
      <p>Median Spectral Band Energy (MSBE): This feature is
motivated by the observation that high arousal and high valence
songs often involve numerous instruments playing in tandem to
create the perception of a rich sound. This gives rise to a large
spectral bandwidth. We propose to use the median spectral energy
across bands as a robust metric to capture this effect. This
feature is extracted at a global level, one value for each song.</p>
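      <p>A possible implementation is sketched below, treating the rows of
a magnitude spectrogram as the spectral bands; the exact band definition
used here may differ.</p>
      <preformat>
import numpy as np
from scipy.signal import stft


def msbe(x, sr=44100, n_fft=2048):
    """Median Spectral Band Energy: one value per song."""
    _, _, Z = stft(x, fs=sr, nperseg=n_fft)
    band_energy = np.sum(np.abs(Z) ** 2, axis=1)  # energy per frequency band
    return np.median(band_energy)
      </preformat>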
      <p>Spectral Centre of Mass (SCOM): We also found the spectral
centre of mass to be typically low for low-arousal songs. This is
again consistent with the observation that high arousal songs are
usually more wideband. SCOM is also a static feature, yielding a
single value per song.</p>
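      <p>A minimal sketch of SCOM as the centre of mass of the long-term
magnitude spectrum follows; the exact weighting used here may differ.</p>
      <preformat>
import numpy as np


def scom(x, sr=44100):
    """Spectral Centre of Mass of the whole signal, in Hz."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return np.sum(freqs * spectrum) / np.sum(spectrum)
      </preformat>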
      <p>Table 1 below presents correlations between the proposed
features and their corresponding static or dynamic emotion
ratings.</p>
    </sec>
    <sec id="sec-3">
      <title>PREDICTING CONTINUOUS EMOTION</title>
    </sec>
    <sec id="sec-4">
      <title>RATINGS</title>
      <p>To predict continuous arousal/valence emotion ratings, we
incorporate information from both the local and the global scales.
This is done by extracting features at the global scale (over the
entire clip) and at the local scale (every 0.5 s). Each of the
resulting prediction systems is discussed in detail below.</p>
      <p>Frame Level Prediction: We use dynamic features extracted
at an interval of 0.5 s to directly predict continuous arousal and
valence ratings for each song. The dynamic feature sets (openSMILE
and dynamic compressibility) are extracted for this purpose.</p>
      <p>We train separate linear regression models for arousal and
valence over all frames in the training set, and use them to make
independent predictions for each frame in the test set. Finally,
the resulting predictions are smoothed over time using a moving
average filter, to incorporate the smoothness expected of human
annotations.</p>
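      <p>A minimal sketch of this frame-level scheme is shown below. The
variable names are illustrative; the 13-frame window corresponds to the
6.5 s arousal smoothing window described later.</p>
      <preformat>
import numpy as np
from sklearn.linear_model import LinearRegression


def predict_frames(X_train, y_train, X_test, window_frames=13):
    """Per-frame linear regression followed by moving-average smoothing."""
    model = LinearRegression().fit(X_train, y_train)
    raw = model.predict(X_test)
    # Moving-average filter mimics the smoothness of human annotations.
    kernel = np.ones(window_frames) / window_frames
    return np.convolve(raw, kernel, mode='same')
      </preformat>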
      <p>Predicting dynamic ratings using global features: In
addition to frame level predictions, we hypothesize that the
dynamic ratings of a song are also affected by its global factors.
Hence, we also try to predict the dynamic ratings using features
extracted over the entire 45 second clip of each song. To predict
the dynamic ratings using static features, we first parametrize
the ratings using a Haar transform. The Haar coefficients of each
song's ratings are then used as alternate labels for our models.
This particular choice of label space was motivated by the smooth
and sometimes piecewise constant nature of the annotated ratings.
In addition, it can also be seen from Fig. 3 that only a few of
the coefficients are strongly correlated with the emotion ratings,
allowing for a sparse and robust representation.</p>
      <p>[Figure 3: Haar coefficients for the arousal and valence ratings.]</p>
      <p>We compute 64 Haar coefficients to encode the emotion
dynamics over the length of each song. We learn a Partial Least
Squares Regression (PLSR) model to predict each of the Haar
coefficients (for both arousal and valence) as a label, using
global features such as static compressibility, SCOM and MSBE. The
predicted Haar coefficients are finally used to reconstruct the
dynamic emotion ratings via an inverse Haar transform. This method
incorporates the temporal smoothness constraint within the
algorithm by performing prediction in a label space where the
ratings have an inherent sparse representation. More importantly,
this system captures those aspects of emotion dynamics that are
governed by global characteristics of the song.</p>
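      <p>The sketch below illustrates this global system, assuming each
rating curve is resampled to 64 samples before a full Haar
decomposition; the exact coefficient selection and PLSR settings used
here may differ.</p>
      <preformat>
import numpy as np
import pywt
from scipy.signal import resample
from sklearn.cross_decomposition import PLSRegression

N = 64  # number of Haar coefficients per song


def ratings_to_haar(ratings):
    """Resample a rating curve to N samples and return its Haar coefficients."""
    r = resample(ratings, N)
    coeffs = pywt.wavedec(r, 'haar', level=int(np.log2(N)))
    lengths = [len(c) for c in coeffs]
    return np.concatenate(coeffs), lengths


def haar_to_ratings(flat, lengths, out_len):
    """Invert the Haar transform and resample back to the original length."""
    coeffs, i = [], 0
    for n in lengths:
        coeffs.append(flat[i:i + n])
        i += n
    return resample(pywt.waverec(coeffs, 'haar'), out_len)


# Training: X_global holds one row of global features (comp, SCOM, MSBE)
# per song; H holds the corresponding 64 Haar coefficients as labels.
# pls = PLSRegression(n_components=3).fit(X_global, H)
# Prediction: reconstruct a dynamic rating curve from predicted coefficients
# (x_song is a single 1 x n_features row of global features).
# y_dyn = haar_to_ratings(pls.predict(x_song)[0], lengths, out_len=90)
      </preformat>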
      <p>Correlation and RMSE between the predicted and annotated
emotion ratings, averaged over songs, are reported as the
evaluation metrics for all systems.</p>
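      <p>For reference, a minimal sketch of these per-song metrics,
averaged over songs (variable names are illustrative):</p>
      <preformat>
import numpy as np
from scipy.stats import pearsonr


def evaluate(pred_per_song, true_per_song):
    """Average Pearson correlation and RMSE across songs."""
    corrs = [pearsonr(p, t)[0] for p, t in zip(pred_per_song, true_per_song)]
    rmses = [np.sqrt(np.mean((np.asarray(p) - np.asarray(t)) ** 2))
             for p, t in zip(pred_per_song, true_per_song)]
    return np.mean(corrs), np.mean(rmses)
      </preformat>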
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION</title>
      <p>From our experiments, we observe that dynamic emotion
ratings in a song depend not only on local characteristics of
the music, but also on overall global features of a song. We
also note that it helps to take into account context from
adjacent frames. This is evident from the improved prediction
results obtained by smoothing predictions using a moving
average lter.
10
10
20 30 40 50
Haar coefficients for arousal ratings
60
70
20 30 40 50
Haar coefficients for valence ratings
60
70</p>
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND EVALUATION</title>
      <p>We submitted 3 runs for each of our system to the
challenge. Systems 1 and 2 used frame level prediction using</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in music task at MediaEval 2014</article-title>
          . In
          <source>MediaEval 2014 Workshop</source>
          , Barcelona, Spain, October 16-17
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Robust unsupervised arousal rating: A rule-based framework with knowledge-inspired vocal features</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Metallinou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wollmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katsamanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Context-sensitive learning for enhanced audiovisual emotion classification</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>184</fpage>
          -
          <lpage>198</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>