<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The TUM Approach to the MediaEval Music Emotion Task Using Generic Affective Audio Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felix Weninger, Florian Eyben</string-name>
          <email>eyben@tum.de</email>
          <email>weninger@tum.de</email>
          <email>{weninger,eyben}@tum.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Björn Schuller</string-name>
          <email>schuller@IEEE.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing, Imperial College London</institution>
          ,
          <addr-line>London SW7 2AZ</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Machine Intelligence &amp; Signal Processing Group, MMK, Technische Universität München</institution>
          ,
          <addr-line>80290 Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>include mean, moments, quartiles, 1- and 99-percentiles, as well as This paper describes the TUM approach for the MediaEval contour related measurements such as (relative) rise and fall times, Emotion in Music task which consists of non-prototypical amplitudes and standard deviations of local maxima ('peaks'), and music retrieved from the web, annotated by crowdsourcing. linear and quadratic regression coefficients. An exhaustive list of We use Support Vector Machines and BLSTM recurrent the LLDs and functionals along with a detailed analysis of feature neural networks for static and dynamic arousal and valence relevance for music mood regression is found in [3]. Extraction regression. A generic set of acoustic features is used that of acoustic features is done with our open-source toolkit openShas been proven e ective for a ect prediction across multiple MILE [4] which can be used 'out-of-the-box' to extract the Comdomains. In the result, the best models explain 64 and 48 % ParE set, so that our features can be reproduced by the interested of the annotations' variance for arousal and valence in the reader. Prior to feature extraction, songs are normalized to -3 dB static case, and an average Kendall's tau with the songs' maximum amplitude using 'sox'. This is done to remove noise in emotion contour of .18 and .12 is achieved in the dynamic energy-related features and improve generalization. case. As regressors, we use Support Vector Regression (SVR) for songlevel regression and bidirectional Long Short-Term Memory recurrent neural networks (BLSTM-RNNs) for dynamic regression. 1. INTRODUCTION Both use the same input features, normalized to the range [ 1; +1] The 2013 MediaEval 'Emotion in Music' task is to provide con- for SVR and standardized to zero mean and unit variance (on the tinuous valued arousal and valence estimates both for whole songs training data) for BLSTM-RNNs. Separate SVR models are trained (static) and sequences of one second long segments (dynamic). For for arousal and valence regression while BLSTM-RNNs learn both details on the task, we refer to the paper describing the task [1]. In arousal and valence prediction in a multi-task learning fashion. For the following we describe our approach. BLSTM-RNNs, the regression targets are standardized as well. In addition, we investigate adding delta regression coefficients of the arousal and valence targets as additional regression tasks, in or2. METHOD der to improve modeling of the dynamic emotion profile. The Our approach is based on supra-segmental features calculated complexity constant for SVR training was varied from 10 4 to by applying statistical functionals, such as mean and moments, to 10 1. BLSTM-RNNs with two hidden layers (128 LSTM units per the contours of frame-wise low-level descriptors (LLDs), such as layer and direction) are used; thus, the first layer performs inforMFCCs or energy, over either fixed length segments (one second, mation reduction to a 128-dimensional feature set. The segments corresponding to the annotated intervals in the corpus) or whole of each song are processed in order, forming sequences. Gradisongs. In particular, we use the set of affective features developed ent descent with 25 sequences per weight update is used for trainas baseline for the 2013 Computational Paralinguistics Evaluation ing. An early stopping strategy is used, using a held out valida(ComParE) campaign [2]. It has been shown in [3] that this set tion set in each fold. Training is stopped after a maximum of 100 provides robust cross-domain assessment of emotion (continuous iterations or after 20 iterations without improving the validation arousal and valence) in speech, music, and acoustic events. Despite set error (sum of squared errors). To alleviate over-fitting to the its rather 'brute-force' nature, it has been shown to outperform a high dimensional input feature set, Gaussian noise with zero mean more hand-crafted set of musically motivated features for the task and standard deviation 0.6 is added to the input activations, and of music mood regression. sequences are presented in random order during training. SVR The ComParE feature set contains 6 373 features. LLDs include models are trained with Weka [5] using Sequential Minimal Optiauditory weighted frequency bands, their sum (corresponding to mization (SMO). BLSTM-RNNs are trained with our open-source loudness), spectral measures such as centroid, roll-of point, skew- CUda RecuRrent Neural Network Toolkit (CURRENNT)1 for furness, sharpness, and spectral flux. Furthermore, voicing related ther reproducibility. All hyper-parameters not mentioned in the LLDs such as fundamental frequency (corresponding to 'main melody') above are left at the toolkits' defaults. and harmonics-to-noise ratio (corresponding to 'percussiveness') are added. Delta regression coefficients (weighted discrete derivatives) are added to capture time dynamics. Statistical functionals 3. RESULTS</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1https://sourceforge.net/p/currennt
(a) Song level, SVR
(b) Segment level, BLSTM
(c) Song level, BLSTM (average segment
level predictions)</p>
    </sec>
    <sec id="sec-2">
      <title>Tasks</title>
      <p>A+V
A+V+</p>
    </sec>
    <sec id="sec-3">
      <title>Arousal</title>
      <p>R2 MLE
.081
.080</p>
    </sec>
    <sec id="sec-4">
      <title>Valence</title>
      <p>R2 MLE
.087
.088
set. Evaluation measures are computed on the entire development
set (not by averaging across folds). The fold subdivision follows a
simple modulo based scheme (song ID modulo 10), and is thus
easily reproducible and song independent (in the case of regression on
segments). We report the official challenge metrics, determination
coefficient (R2) for whole song regression and average Kendall’s
per song ( ) for segment regression, along with mean linear
error (MLE). MLE is calculated after scaling the annotations to the
range [ 0:5; +0:5]. On segment level, we also report R2 (across
all segments) to assess the overall regression performance without
taking into account the modeling of the emotional profile of a song.</p>
      <p>In short, we observe that (a) SVR performance is very sensitive
to the complexity parameter; (b) R2 on segment level is very high
compared to , indicating the difficulty of estimating the
dynamics of the annotation contour within a song instead of the overall
emotion; (c) adding deltas to the regression targets improves for
arousal, but not valence prediction; (d) best song level results in
terms of R2 are obtained by averaging BLSTM predictions,
outperforming SVR by a large margin for valence (.499 vs. .419). In
the following the configurations for our test set runs are
summarized.</p>
      <p>Static task (song level):
1. SVR: SVR with C = 10 3, trained on the entire
development set
2. BLSTM-PA-Song: BLSTM-RNNs trained on the 10
training folds of the development set; segment level
predictions averaged within songs and across networks
3. BLSTM-WA-Song: BLSTM-RNN trained on the 10
training folds of the development set by weight
averaging; segment level predictions averaged within songs
Dynamic task (segment level):
1. BLSTM-PA-Seg: BLSTM-RNNs trained on the 10
training folds of the development set; predictions averaged
across networks
(a) Song level (`Static task')</p>
    </sec>
    <sec id="sec-5">
      <title>Run name</title>
      <p>SVR .646</p>
    </sec>
    <sec id="sec-6">
      <title>BLSTM-PA-Song .642</title>
    </sec>
    <sec id="sec-7">
      <title>BLSTM-WA-Song .643</title>
    </sec>
    <sec id="sec-8">
      <title>Arousal</title>
      <p>R2 MLE
.083
.085
.085
(b) Segment level (`Dynamic task')</p>
    </sec>
    <sec id="sec-9">
      <title>Run name</title>
    </sec>
    <sec id="sec-10">
      <title>BLSTM-PA-Seg</title>
    </sec>
    <sec id="sec-11">
      <title>BLSTM-WA-Seg .180 .174 Arousal</title>
      <p>2. BLSTM-WA-Seg: BLSTM-RNNs trained on the 10
training folds of the development set by weight averaging
To deliver BLSTM predictions on the test set, we either average
the predictions of the 10 networks trained on the development set
(PA), or average their weights and run additional training iterations
on the entire development set (WA).</p>
      <p>Table 2 shows that BLSTM-RNNs outperform SVR on the song
level for valence while being on par for arousal. This is consistent
with the development set results. On the segment level, the WA
strategy delivers slightly worse results in terms of than PA while
using a 10 times smaller model.
4.</p>
      <p>CONCLUSION</p>
      <p>We have presented the TUM approach to the 2013 MediaEval
Emotion in Music task. Best results on the static (song level) task
were obtained by averaging time-varying predictions of a
BLSTMRNN. BLSTM-RNNs also delivered consistent improvements over
the baseline in the dynamic task.
5.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Caro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          , C.-Y. Sha, and Y.
          <string-name>
            <surname>-H. Yang</surname>
          </string-name>
          , “
          <article-title>1000 songs for emotional analysis of music,” in Proc. of CrowdMM (held in conjunction with ACM MM)</article-title>
          . Barcelona, Spain: ACM,
          <year>2013</year>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Steidl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batliner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vinciarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Scherer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ringeval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chetouani</surname>
          </string-name>
          et al.,
          <source>“The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals</source>
          , Conflict, Emotion, Autism,”
          <source>in Proc. of INTERSPEECH. Lyon</source>
          , France: ISCA,
          <year>2013</year>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mortillaro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Scherer</surname>
          </string-name>
          , “
          <article-title>On the Acoustics of Emotion in Audio: What Speech, Music and Sound have in Common,” Frontiers in Emotion Science</article-title>
          , vol.
          <volume>4</volume>
          , no.
          <source>Article ID 292</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          , May
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Groß</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          , “
          <article-title>Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor,”</article-title>
          <source>in Proc. of ACM MM. Barcelona</source>
          , Spain: ACM,
          <year>October 2013</year>
          , 4 pages, to appear.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          , E. Frank,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          , “
          <article-title>The WEKA data mining software: an update</article-title>
          ,
          <source>” ACM SIGKDD Explorations Newsletter</source>
          , vol.
          <volume>11</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>