<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>PKU-AIPL's Solution for MediaEval 2015 Emotion in Music Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kang Cai</string-name>
          <email>caikang@pku.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wanyi Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yao Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deshun Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoou Chen</string-name>
          <email>chenxiaoou@pku.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science and Technology, Peking University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>In this paper, we describe the PKU-AIPL team's solution for the Emotion in Music task of the MediaEval 2015 benchmarking campaign. We designed and extracted several sets of features and used a continuous conditional random field (CCRF) for the dynamic emotion characterization task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
<p>In the Emotion in Music task, labelers provided valence-arousal (v-a) labels
using a sliding bar while they listened to the music, which
makes the label of each music segment strongly dependent
on the labels of its preceding segments. In our solution, we first estimate
each segment's label from its audio features, treating
music segments as independent instances. Then, we drop
the independence assumption and further optimize the
labels by modeling music emotion labeling as a continuous
conditional random field process.</p>
<p>The rest of this paper is organized as follows. Section 2
describes our system in detail. Section 3 presents and
analyzes the performance of our solution.</p>
    </sec>
    <sec id="sec-2">
<title>2. SYSTEM DESCRIPTION</title>
<p>In this section, we introduce our system in detail. The
prediction procedure consists of three steps. First, we
select a set of features that adequately represents the music
audio signal. Second, we apply a regression model that performs
well on datasets on the order of ten thousand items, and optimize its
predictions using the relationship between
consecutive clips in a piece of music. Finally, since people react with a
delay when tagging music emotion, we
investigate the appropriate length of that delay. The three
steps of our solution are described below.</p>
    </sec>
    <sec id="sec-3">
<title>2.1 Feature Extraction</title>
<p>We preprocess the original audio files of the development
data as follows: First, we transformed the music from mp3
format to wav format. Second, we segmented each piece of music (the 15 s to
45 s span) into 60 clips, each 500 ms long. Then
we extracted features from each 500 ms clip.</p>
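<p>As a minimal sketch of this segmentation step (assuming the wav file has already been decoded into a sample array; the helper name segment_clips and its arguments are ours, not part of the original pipeline):</p>

```python
import numpy as np

def segment_clips(y, sr, start_s=15.0, end_s=45.0, clip_ms=500):
    """Cut the 15 s to 45 s span of a track into consecutive 500 ms clips."""
    clip_len = int(sr * clip_ms / 1000)          # samples per clip
    span = y[int(start_s * sr):int(end_s * sr)]  # the annotated 30 s window
    n = len(span) // clip_len                    # 60 clips for a full window
    return [span[i * clip_len:(i + 1) * clip_len] for i in range(n)]
```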
<p>This work has been supported by the Natural Science
Foundation of China (Multimodal Music Emotion Recognition
technology research, No. 61170167).</p>
      <sec id="sec-3-1">
<title>2.1.1 Mel-Frequency Cepstrum Coefficients</title>
<p>We divide the song signals into 50%-overlapping frames
of 1024 samples (about 23 ms). On each frame we compute 13
Mel-Frequency Cepstrum Coefficients (MFCCs), with the 0th
component included, as a 13-D feature vector, as
well as the delta-MFCCs.</p>
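<p>The framing can be sketched as follows (a simplified illustration; the MFCC computation itself is typically delegated to a signal-processing library and is not reproduced here):</p>

```python
import numpy as np

def frame_signal(y, frame_len=1024, hop=512):
    """Split a clip into 50%-overlapping 1024-sample frames (~23 ms at 44.1 kHz)."""
    n = 1 + (len(y) - frame_len) // hop
    return np.stack([y[i * hop:i * hop + frame_len] for i in range(n)])
```

<p>Each row then feeds the MFCC computation, yielding one 13-D vector (plus deltas) per frame.</p>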
      </sec>
      <sec id="sec-3-2">
<title>2.1.2 Some General Short-term Features</title>
<p>As for the MFCCs, we divide the song signals into
50%-overlapping frames of 1024 samples (about 23 ms).
Then we compute Short-Time Energy, Spectral Centroid,
Spectral Entropy, Spectral Flux, Spectral Roll-Off and Zero
Crossing Rate on each frame as a 6-D feature vector.</p>
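<p>Two of the six descriptors can be sketched per frame as follows (a minimal illustration with our own helper name; a Hann window is applied before the FFT, a detail the text does not specify):</p>

```python
import numpy as np

def frame_features(frame, sr=44100):
    """Spectral centroid and zero-crossing rate of one analysis frame."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)  # "center of mass" in Hz
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0    # sign changes per sample
    return centroid, zcr
```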
      </sec>
      <sec id="sec-3-3">
<title>2.1.3 Edge Orientation Histogram on Mel Spectrogram</title>
<p>The spectrogram is a nearly complete representation of
music, and it also lets us
investigate the relationship between the audio signal and emotion from
a visual angle [7]. We find a strong relationship
between the edge orientations in spectrograms and music
emotions, so we extract an edge orientation histogram (EOH)
feature from the audio spectrogram [8].</p>
<p>The procedure of our algorithm can be described
as follows: Convert the audio signal to a spectrogram with a
Mel time-frequency representation. The gradients at
point (x, y) of the Mel spectrogram S are found by
convolving Sobel masks with S. We then obtain the edge orientation
at each point of the spectrogram by dividing the gradient strength in the Y
direction by that in the X direction. Finally, we quantize the
edge orientations into a fixed number of bins, which form the
edge orientation histogram on the Mel spectrogram.</p>
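<p>The steps above can be sketched as follows (a minimal illustration on an already-computed Mel spectrogram; we take the arctangent of the Y/X gradient ratio and weight the histogram by gradient magnitude, details the text leaves open):</p>

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def conv2_valid(img, k):
    """3x3 'valid' correlation used for the Sobel responses."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * img[i:h - 2 + i, j:w - 2 + j]
    return out

def eoh(mel_spec, n_bins=8):
    """Edge orientation histogram over a Mel spectrogram."""
    gx = conv2_valid(mel_spec, SOBEL_X)   # gradient strength along X
    gy = conv2_valid(mel_spec, SOBEL_Y)   # gradient strength along Y
    theta = np.arctan2(gy, gx)            # orientation at each point
    hist, _ = np.histogram(theta, bins=n_bins, range=(-np.pi, np.pi),
                           weights=np.hypot(gx, gy))
    return hist / (hist.sum() + 1e-12)
```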
      </sec>
      <sec id="sec-3-4">
<title>2.1.4 Feature Processing</title>
<p>An efficient and effective way to summarize the features
of all the windows in a piece of music is to compute their
means and variances. However, the windows of a piece
of music form a time series, and the temporal connections
between them cannot be revealed by
means and variances alone. We therefore seek a way to
capture these connections over time.</p>
<p>In this system, we build an Auto-Regressive (AR) and
Moving Average (MA) model to capture the relationships
between windows over time. First, we
collect the features of all windows and order them in
time, so that each feature dimension forms an
independent time series. Then we obtain new parameters by
modeling these time series with the AR and MA models.
These parameters, together with the means and variances, form
the new 121-dimensional features, of which the means
account for 32 (19 + 13) dimensions, the variances for 32 (19 + 13)
dimensions, the AR model for 19 dimensions and the MA model for 38
dimensions. We then combine these features with the EOH-MEL
and OPENSMILE features to form the full 393-dimensional
feature set.</p>
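<p>The AR part of this modeling can be sketched as a least-squares fit per feature dimension (a simplified illustration; the model order and the MA fit are omitted, and the helper name is ours):</p>

```python
import numpy as np

def ar_coeffs(x, p=2):
    """Fit x[t] ~ a1*x[t-1] + ... + ap*x[t-p]; the a_k become features."""
    X = np.column_stack([x[p - i - 1:-i - 1] for i in range(p)])  # lagged copies
    y = x[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a
```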
<p>We evaluate these features by splitting the development set
into a development part and a test part, making sure
that no samples from the same song appear in both. The following
experiments on the development set use the same protocol.</p>
      </sec>
    </sec>
    <sec id="sec-4">
<title>2.2 CCRF for the Dynamic Task</title>
<p>Since the emotion labels of adjacent clips in the
same piece of music are continuous in time, we
model them as an interrelated sequence, using a
continuous conditional random field (CCRF). A conditional
random field is a probabilistic graphical model that
can express long-range dependencies and
overlapping features; it mitigates label bias, since
all features are globally
normalized and a globally optimal solution can be obtained.
Notably, in contrast to hidden Markov models (HMMs), CRFs
do not need the independence and Markov
assumptions that HMMs require.</p>
<p>We adopted the CCRF model with SVR as the base
regressor to model continuous emotions in the dimensional space.
We denote by {x1, x2, ..., xn} the set of labels predicted by
SVR, and by {y1, y2, ..., yn} the set of final labels that we
want to predict, with xi ∈ R and yi ∈ R. The CCRF is defined as a
conditional probability distribution over all emotion values.
It can represent both the content information and the
relational information between emotion values, which is useful for
dynamic emotion evaluation [2].</p>
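<p>With quadratic node and edge potentials on a chain, a CCRF's MAP estimate has a closed form, which conveys the idea (a minimal sketch, not the authors' exact parameterization; the actual model learns its parameters, whereas alpha and beta here are illustrative constants):</p>

```python
import numpy as np

def ccrf_smooth(x, alpha=1.0, beta=5.0):
    """MAP estimate of a chain CCRF: node potentials tie each output y_i
    to the SVR prediction x_i, edge potentials tie neighbouring outputs."""
    n = len(x)
    # graph Laplacian of the chain linking consecutive clips
    L = np.diag(np.r_[1.0, 2.0 * np.ones(n - 2), 1.0])
    L -= np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return np.linalg.solve(alpha * np.eye(n) + beta * L, alpha * x)
```

<p>Larger beta values enforce smoother emotion curves over the clip sequence.</p>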
    </sec>
    <sec id="sec-5">
<title>2.3 Lagging Time</title>
<p>When people tag emotion scores for music, especially
for time-continuous clips, they need
time to perceive and process the sound before
tagging by hand. We therefore assume that music clips
do not correspond to the scores directly, but with a certain
lag. Based on this assumption, we vary the lagging time on the
development set to find the best value. The
experimental results are shown in Table 2: the best
lagging time is about 500 ms for tagging V scores and
about 1500 ms for tagging A scores. This is, however,
inferred under our particular choice of
features and regression model, and needs more
experiments to confirm.</p>
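<p>In practice, such a lag can be applied by shifting the label sequence against the clip sequence (a hypothetical helper, not the original code; with 500 ms clips, a lag of 1 clip would correspond to V and 3 clips to A):</p>

```python
def apply_lag(features, labels, lag_clips):
    """Pair clip i's features with the label annotated lag_clips later,
    assuming annotators react with a fixed delay."""
    if lag_clips == 0:
        return features, labels
    return features[:-lag_clips], labels[lag_clips:]
```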
    </sec>
    <sec id="sec-6">
<title>3. RESULTS AND CONCLUSION</title>
<p>For the CCRF, we set n = 61 for training in the five runs,
which is the number of clips in one song, and q = 431,
the number of songs in the development set.</p>
<p>Run 1 uses the given features extracted by OPENSMILE
and the regression model of our choice, SVR+CCRF. Run 2
uses the features of our choice, a fusion of the various features above,
and the given regression model, Multiple Linear Regression
(MLR). Run 3 uses both the features and the regression
model of our choice. We submitted these three runs; the
results on the test dataset are shown in Table 3. We
report the official challenge metrics, Pearson correlation
and Root-Mean-Squared Error (RMSE), for dynamic
regression.</p>
<p>The results show that Run 3, which uses both the features
and the regression model of our choice, performs best. This means
that our features and regression model perform better than
the features extracted by OPENSMILE with MLR. The RMSE
of both valence (V) and arousal (A) prediction is in an
acceptable range. However, we notice that the V predictions
get a low correlation, even close to 0, which looks strange
compared with the high correlation of the A predictions. A possible
reason is that V prediction is harder than A prediction; the
fact that the RMSE of the V predictions is lower than that of
the A predictions also supports this.
</p>
      <p>[3] Juslin, P.N., Sloboda, J.A.: Music and Emotion: Theory and Research. Oxford University Press (2001)</p>
      <p>[4] Lu, L., Liu, D., Zhang, H.J.: Automatic mood detection and tracking of music audio signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1), 5-18 (2006)</p>
      <p>[5] Fornari, J., Eerola, T.: The pursuit of happiness in music: Retrieving valence with high-level musical descriptors. In: Computer Music Modeling and Retrieval (2008)</p>
      <p>[6] Korhonen, M.D., Clausi, D., Jernigan, M.: Modeling emotional content of music using system identification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(3), 588-599 (2005)</p>
      <p>[7] Dennis, J., Tran, H.D., Li, H.: Spectrogram image feature for sound event classification in mismatched conditions. IEEE Signal Processing Letters, 18(2), 130-133 (2011)</p>
      <p>[8] Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 679-698 (1986)</p>
      <p>[9] Thayer, R.E.: The Biopsychology of Mood and Arousal. Oxford University Press (1989)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Aljanaki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soleymani</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Emotion in Music Task at MediaEval 2014</article-title>
          . In: MediaEval 2014 Workshop (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Baltrusaitis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banda</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robinson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
<article-title>Dimensional affect recognition using continuous conditional random fields</article-title>
          .
          <source>In: IEEE International Conference and Workshops</source>
          ,
          1-8
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>