<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Beatsens' Solution for MediaEval 2014 Emotion in Music Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wanyi Yang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kang Cai</string-name>
          <email>caikang@pku.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bin Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ying Wang</string-name>
          <email>ywangbf@cse.ust.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoou Chen</string-name>
          <email>chenxiaoou@pku.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deshun Yang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Horner</string-name>
          <email>horner@cse.ust.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Hong Kong University of Science and Technology</institution>
          ,
          <addr-line>Hong Kong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science and Technology, Peking University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>In this paper, we describe the Beatsens team's solution for the Emotion in Music task of the MediaEval 2014 benchmarking campaign. We extracted and designed several sets of features and used a continuous conditional random field (CCRF) for the dynamic emotion characterization task. The best runs achieve Pearson correlations of 0.23 ± 0.56 for valence and 0.12 ± 0.55 for arousal, and RMSE values of 0.12 ± 0.06 and 0.09 ± 0.05, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The Emotion in Music task aims to estimate valence and arousal values for 500 ms music segments. In this task, labelers provided valence-arousal (v-a) labels using a sliding bar while they listened to the music, which made the label of each music segment strongly dependent on those of the preceding segments. More details concerning the dataset collection can be found in [<xref ref-type="bibr" rid="ref1">1</xref>]. Therefore, in our solution, we treat the labeling process as a continuous conditional random field (CCRF) process, in which the v-a values depend not only on the acoustic content of the music segments but also on their preceding segments. The final results also show the advantages of CCRF modeling.</p>
      <p>In this paper, we first introduce our solution for feature extraction and modeling. Then, we present the results for various feature combinations and model parameters.</p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
      <p>In this section, we introduce the feature design and the model of our system. The basic logic of our system is that we first estimate each segment's label from its audio features, assuming that music segments are independent instances. Then, we drop the independence assumption and further optimize the labels by modeling music emotion labeling as a continuous conditional random field process. We describe our solution in detail as follows.</p>
      <p>This work has been supported by the Natural Science Foundation of China (Multimodal Music Emotion Recognition technology research, No. 61170167) and Hong Kong Research Grants Council grants (HKUST613112).</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <title>Features extracted by MIRToolbox</title>
        </caption>
        <table>
          <thead>
            <tr><th>Parts</th><th>Features</th><th>Dim.</th></tr>
          </thead>
          <tbody>
            <tr><td>Dynamics</td><td>RMS energy, Slope, Attack, Low energy</td><td></td></tr>
            <tr><td>Rhythm</td><td>Tempo, Fluctuation peak, Fluctuation centroid</td><td></td></tr>
            <tr><td>Spectral</td><td>Spectrum centroid, Brightness, Spread, Skewness, Kurtosis, Rolloff 95, Rolloff 85, Spectral entropy, Flatness, Roughness, Irregularity, Zero crossing rate, Spectral flux, MFCC, DMFCC</td><td>78</td></tr>
            <tr><td>Harmony</td><td>Chromagram peak, Chromagram centroid, Key clarity, Key mode, HCDF</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>2.1 Feature Extraction</title>
      <p>First, we converted the music from MP3 to WAV format. Second, we segmented each excerpt (the 15 s to 45 s period) into 60 clips of 500 ms each. We then extracted features from each 500 ms clip using MIRToolbox<sup>1</sup>. Both the mean and the standard deviation of each feature were calculated. There were 54 features in total; Table 1 lists them in detail.</p>
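      <p>The segmentation step described above can be sketched as follows. This is only an illustration under stated assumptions: the function names clip_boundaries, cut_clips and mean_and_std are hypothetical, the audio is assumed to be already decoded into a plain sequence of samples, and the actual feature extraction (done with MIRToolbox in this work) is not reproduced.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def clip_boundaries(start_s=15.0, end_s=45.0, clip_s=0.5):
    """Return (start, end) times in seconds of consecutive fixed-length clips."""
    n_clips = int(round((end_s - start_s) / clip_s))
    return [(start_s + i * clip_s, start_s + (i + 1) * clip_s)
            for i in range(n_clips)]


def cut_clips(samples, sample_rate, boundaries):
    """Slice a sequence of audio samples into clips given (start, end) times."""
    return [samples[int(round(s * sample_rate)):int(round(e * sample_rate))]
            for s, e in boundaries]


def mean_and_std(values):
    """Mean and population standard deviation of a frame-level feature track."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5


# The 15 s to 45 s excerpt yields 60 clips of 500 ms each.
bounds = clip_boundaries()
print(len(bounds), bounds[0], bounds[-1])  # 60 (15.0, 15.5) (44.5, 45.0)
```
]]></preformat>
      <p>Each clip would then be passed to the feature extractor, and mean_and_std would summarise every frame-level feature track of that clip.</p>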
    </sec>
    <sec id="sec-4">
      <title>2.2 CCRF for dynamic task</title>
      <p>As labelers used a slide bar when labeling, emotion values change continuously rather than abruptly, so it is better to define the labeling model as a function over all the emotion values in one song. We adopted the CCRF model with SVR as the base regressor to model continuous emotions in dimensional space.</p>
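      <p>In CCRF, we denote <inline-formula><tex-math>\{x_1, x_2, \ldots, x_n\}</tex-math></inline-formula> as a set of labels predicted by SVR, and <inline-formula><tex-math>\{y_1, y_2, \ldots, y_n\}</tex-math></inline-formula> as a set of final labels that we want to predict, with <inline-formula><tex-math>x \in \mathbb{R}^m</tex-math></inline-formula> and <inline-formula><tex-math>y \in \mathbb{R}</tex-math></inline-formula>. CCRF is defined as a conditional probability distribution over all emotion values. It can represent both the content information and the relation information between emotion values, which is useful for dynamic emotion evaluation [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
      <p><sup>1</sup> Version 1.5: https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox</p>
      <table-wrap id="tab4">
        <label>Table 4</label>
        <caption>
          <title>Official results on the test data</title>
        </caption>
        <table>
          <thead>
            <tr><th>V</th><th>RMSE</th></tr>
          </thead>
          <tbody>
            <tr><td>0.220 ± 0.571</td><td>0.117 ± 0.056</td></tr>
            <tr><td>0.178 ± 0.562</td><td>0.107 ± 0.055</td></tr>
            <tr><td>0.224 ± 0.552</td><td>0.122 ± 0.058</td></tr>
            <tr><td>0.231 ± 0.564</td><td>0.122 ± 0.057</td></tr>
            <tr><td>0.230 ± 0.548</td><td>0.121 ± 0.057</td></tr>
          </tbody>
        </table>
      </table-wrap>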
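      <p>To make the role of the CCRF in this section more concrete, the following is a minimal sketch of inference in a strongly simplified chain model; it is not the implementation used for the submitted runs. It assumes a single base prediction per clip (the SVR output), a neighbour similarity of 1 between adjacent clips only, and fixed weights alpha and beta that would normally be learned. Under these assumptions the conditional distribution is Gaussian, and the most likely label sequence solves a tridiagonal linear system. The function name ccrf_chain_inference and its parameters are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def ccrf_chain_inference(x, alpha, beta):
    """Most likely labels y for a chain model with one base predictor.

    Maximises  -alpha * sum_i (y_i - x_i)^2 - beta * sum_i (y_i - y_{i+1})^2,
    i.e. solves (alpha * I + beta * L) y = alpha * x, where L is the graph
    Laplacian of the chain. Assumes alpha > 0 and beta >= 0.
    """
    n = len(x)
    if n == 1:
        return [x[0]]
    # Tridiagonal coefficients: sub-diagonal a, diagonal b, super-diagonal c.
    a = [0.0] + [-beta] * (n - 1)
    c = [-beta] * (n - 1) + [0.0]
    b = [alpha + beta * (1 if i in (0, n - 1) else 2) for i in range(n)]
    d = [alpha * xi for xi in x]
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    y = [0.0] * n
    y[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):
        y[i] = (d[i] - c[i] * y[i + 1]) / b[i]
    return y


# Illustrative SVR outputs for five consecutive clips (made-up numbers).
svr_out = [0.1, 0.5, 0.0, 0.4, 0.2]
smoothed = ccrf_chain_inference(svr_out, alpha=1.0, beta=2.0)
print([round(v, 3) for v in smoothed])
```
]]></preformat>
      <p>A larger beta pulls neighbouring labels closer together, which mirrors the observation that slide-bar annotations change gradually; with beta set to 0 the SVR outputs are returned unchanged.</p>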
    </sec>
    <sec id="sec-5">
      <title>3. EXPERIMENTS AND RESULTS</title>
      <p>With the selected attributes, we modeled the data using Support Vector Regression (SVR) and K-Nearest Neighbor (KNN) regression and evaluated both on the training set with 4-fold cross-validation. In all of the results SVR outperformed KNN, so SVR was adopted in our runs.</p>
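      <p>The evaluation protocol above can be illustrated with a small, dependency-free sketch of 4-fold cross-validation around a KNN regressor. It is an assumption-laden example rather than the experimental code: the helper names are hypothetical, folds are formed over contiguous items (in practice they would presumably be formed over songs), and any other regressor, such as an SVR from an external library, could be passed in place of knn_predict.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def k_fold_indices(n_items, k=4):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points closest to the query."""
    order = sorted(range(len(train_x)),
                   key=lambda i: squared_distance(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cross_validated_rmse(xs, ys, predict, k=4):
    """Mean RMSE over k folds; predict(train_x, train_y, query) is a regressor."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        sq = [(predict(tx, ty, xs[i]) - ys[i]) ** 2 for i in fold]
        scores.append((sum(sq) / len(sq)) ** 0.5)
    return sum(scores) / len(scores)
```
]]></preformat>
      <p>Calling cross_validated_rmse once per candidate model on the same folds gives directly comparable scores, which is the kind of comparison that led to choosing SVR here.</p>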
      <p>For CCRF, we set n = 61 for the training of the five runs, which is the number of clips in one song, and q = 744, i.e., the number of songs in the development set.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Experiments of Run1 and Run2</title>
      <p>The 54 features are divided into four parts: dynamics, spectrum, rhythm, and harmony [<xref ref-type="bibr" rid="ref3">3</xref>]. We compared the four perceptual dimensions and their combinations; the results showed that Spectral+Dynamic+Rhythm performs best. This feature set was adopted in Run1.</p>
      <p>With the features of Run1, we evaluated SVR with three kernels (radial basis function, linear, and polynomial) and a series of values for the cost parameter C. The results showed that the linear kernel gives the best result and that C = 2 3 performs best.</p>
      <p>Because 500 ms is too short for some features, their extraction failed. We therefore extended the clip length to 1 s and extracted the features again. Finally, we concatenated the new 1 s clip features with the original 500 ms clip features to obtain the features of Run2.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Experiments of Run3, Run4 and Run5</title>
      <p>In addition, we found that Mel-frequency cepstral coefficients (MFCC) are among the most important spectral features. As 0.5 s is too short to convey the emotion completely, we conducted extensive experiments with MFCC, varying the clip length and the frame length.</p>
      <p>Experiment a: We separately extracted MFCC from clips of 0.5 s, 1 s, 2 s, 4 s, and 8 s, so that longer clips could convey more information than a single 0.5 s clip. The results are shown in Table 2. Comparing the six single features, the 0.5 s clip performs best, and this method is adopted in Run3.</p>
      <p>For the combination, we took the regression labels of the six features as the input of CCRF; the final result slightly outperforms the single 0.5 s clip. This method is adopted in Run4.</p>
      <p>Experiment b: Considering that frame length is an important parameter, we set different frame lengths (11.6 ms, 23.2 ms, 46.4 ms) and extracted MFCC for each. Table 3 shows that the results for different frame lengths remain basically unchanged and that their combination (COMB) performs best. This method is adopted in Run5.</p>
      <p>The results obtained on the test dataset are shown in Table 4. We report the official challenge metrics, Pearson correlation (ρ) and root-mean-squared error (RMSE), for dynamic regression. We can conclude that a feature set as simple as MFCC performs much better than larger feature sets. The combination of MFCC over various clip lengths performs best, achieving sufficiently good performance on a new dataset.</p>
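      <p>For reference, the two metrics named above can be computed as in the following sketch. The function names are hypothetical, and the aggregation is an assumption: the reported "mean ± deviation" pairs are taken here to be the mean and standard deviation of per-song scores, which is how mean_std would be applied.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def pearson(pred, truth):
    """Pearson correlation between two equally long sequences."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in truth))
    if sp == 0.0 or st == 0.0:
        return float("nan")  # undefined when either sequence is constant
    return cov / (sp * st)


def rmse(pred, truth):
    """Root-mean-squared error between predictions and ground truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))


def mean_std(values):
    """Mean and population standard deviation, e.g. over per-song scores."""
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
```
]]></preformat>
      <p>A per-song evaluation would call pearson and rmse once for each song's sequence of clip-level predictions and then summarise the resulting lists with mean_std.</p>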
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSION</title>
      <p>We have presented the Beatsens team's solution to the MediaEval 2014 Emotion in Music task. The best result on valence estimation was obtained by Run4, and the best result on arousal estimation was obtained by Run1; both used CCRF modeling. Further work will be conducted on feature selection and optimization of CCRF.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in Music task at MediaEval 2014</article-title>
          . In
          <source>MediaEval 2014 Workshop</source>
          , Barcelona, Spain, October 16-17,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Baltrusaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Banda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Robinson</surname>
          </string-name>
          .
          <article-title>Dimensional affect recognition using continuous conditional random fields</article-title>
          .
          <source>In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dixon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Pearce</surname>
          </string-name>
          .
          <article-title>Evaluation of musical features for emotion classification</article-title>
          .
          <source>In ISMIR</source>
          , pages
          <fpage>523</fpage>
          –
          <lpage>528</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>