                      PKU-AIPL’ Solution for MediaEval 2015
                            Emotion in Music Task∗

                     Kang Cai, Wanyi Yang, Yao Cheng, Deshun Yang, Xiaoou Chen
                                       Institute of Computer Science and Technology,
                                               Peking University, Beijing, China
               {caikang, yangwanyi, chengyao, yangdeshun, chenxiaoou}@pku.edu.cn


∗This work has been supported by the Natural Science Foundation of China (Multimodal Music Emotion Recognition technology research, No. 61170167).

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany

ABSTRACT
  In this paper, we describe the PKU-AIPL team's solution to the Emotion in Music task of the MediaEval 2015 benchmarking campaign. We extracted and designed several sets of features and used a continuous conditional random field (CCRF) for the dynamic emotion characterization task.

1.    INTRODUCTION
  In the Emotion in Music task, labelers provided valence-arousal (V-A) labels using a sliding bar while they listened to the music, which makes the label of each music segment strongly dependent on those of its previous segments. In our solution, we first estimate each segment's label based on the audio features, assuming that music segments are independent instances. Then, we break the independence assumption and further optimize the labels by modeling music emotion labeling as a continuous conditional random field process.
  The rest of this paper is organized as follows. Section 2 describes our system in detail. Section 3 presents the performance of our solution and analyzes it.

2.    SYSTEM DESCRIPTION
  In this section, we introduce our system in detail. The prediction procedure contains the following three steps. First, we select a set of features that represent the music audio signal adequately. Second, we apply a regression model that performs well on a dataset on the order of ten thousand items, and we optimize the predicted results according to the relationship between consecutive clips in a piece of music. Finally, considering the delayed reaction when people tag music emotion labels, we investigate the proper length of this delay. The three steps of our solution are described below.

2.1    Feature Extraction
  We preprocess the original audio files of the development data as follows. First, we transform the music from mp3 format to wav format. Second, we segment the music (the 15s to 45s period) into 60 clips, each with a 500ms duration. Then we extract features from each 500ms clip.
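
  A minimal sketch of this segmentation step, assuming the mp3 files have already been converted to wav with an external tool; the file path is a placeholder:

# Split a decoded waveform into non-overlapping 500 ms clips.
import numpy as np
import soundfile as sf

signal, sr = sf.read("song_0001.wav")        # placeholder path; PCM data + sample rate
if signal.ndim > 1:                          # mix down to mono if the file is stereo
    signal = signal.mean(axis=1)

clip_len = int(0.5 * sr)                     # number of samples in 500 ms
n_clips = len(signal) // clip_len            # 60 clips for a 30 s excerpt
clips = [signal[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]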

2.1.1    Mel-Frequency Cepstrum Coefficients
  We divide the song signals into 50%-overlapping frames of 1024 samples (about 23ms). On each frame we compute 13 Mel-Frequency Cepstrum Coefficients (MFCCs), with the 0th component included, as a 13-D feature vector, as well as the delta-MFCCs.
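
  A sketch of this extraction using librosa (the file path and sample rate are assumptions; any frame-level MFCC implementation with the same frame settings would do):

import numpy as np
import librosa

# 13 MFCCs (0th coefficient included) plus delta-MFCCs on 50%-overlapping
# 1024-sample frames (hop of 512 samples).
y, sr = librosa.load("song_0001.wav", sr=44100)            # placeholder path/rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=1024, hop_length=512)    # shape (13, n_frames)
delta = librosa.feature.delta(mfcc)                        # delta-MFCCs, same shape
frame_features = np.vstack([mfcc, delta])                  # 26-D vector per frame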

2.1.2    Some General Short-term Features
  As with the MFCCs, we divide the song signals into 50%-overlapping frames of 1024 samples (about 23ms). We then compute Short Time Energy, Spectral Centroid, Spectral Entropy, Spectral Flux, Spectral Roll Off and Zero Cross Rate on each frame as a 6-D feature vector.
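
  A sketch of these descriptors under the same frame settings; librosa covers several of them directly, while spectral entropy and spectral flux (not built in) are computed here from the magnitude spectrogram as plausible stand-ins for the paper's definitions:

import numpy as np
import librosa

y, sr = librosa.load("song_0001.wav", sr=44100)            # placeholder
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))    # magnitude spectrogram

energy = librosa.feature.rms(y=y, frame_length=1024, hop_length=512)
centroid = librosa.feature.spectral_centroid(S=S, sr=sr)
rolloff = librosa.feature.spectral_rolloff(S=S, sr=sr)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=1024, hop_length=512)

p = S / (S.sum(axis=0, keepdims=True) + 1e-12)             # per-frame spectral distribution
entropy = -(p * np.log2(p + 1e-12)).sum(axis=0)            # spectral entropy per frame
flux = np.sqrt((np.diff(S, axis=1) ** 2).sum(axis=0))      # spectral flux between frames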

2.1.3    Edge orientation histogram on Mel Spectrogram
  The spectrogram is a nearly complete representation of music and, furthermore, it provides a way for us to investigate the relationship between the audio signal and emotion from a visual angle [7]. We find that there is a strong relationship between the edge orientations in spectrograms and music emotions, so we put forward a method that extracts an edge orientation histogram (EOH) feature on the audio spectrogram [8].
  The procedure of our proposed algorithm can be described as follows. Convert the audio signal to a spectrogram with a Mel time-frequency representation. The gradients at each point (x, y) of the Mel spectrogram S can be found by convolving Sobel masks with S. We then obtain the edge orientation of each point of the spectrogram by dividing the gradient strength in the Y dimension by that in the X dimension. Finally, we quantize the edge orientations into a certain number of bins, which form the edge orientation histogram on the Mel spectrogram.
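
  A sketch of this procedure; the number of orientation bins and the dB scaling are assumptions rather than settings stated in the paper, and the arctangent of the Y/X gradient ratio is used to turn the ratio into an orientation:

import numpy as np
import librosa
from scipy import ndimage

y, sr = librosa.load("song_0001.wav", sr=44100)             # placeholder
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512)
S = librosa.power_to_db(mel)                                # Mel spectrogram as an "image"

gx = ndimage.sobel(S, axis=1)                               # Sobel gradient along time (X)
gy = ndimage.sobel(S, axis=0)                               # Sobel gradient along frequency (Y)
orientation = np.arctan2(gy, gx)                            # edge orientation at each point
strength = np.hypot(gx, gy)                                 # edge strength at each point

eoh, _ = np.histogram(orientation, bins=8,                  # 8 orientation bins (assumed)
                      range=(-np.pi, np.pi), weights=strength)
eoh = eoh / (eoh.sum() + 1e-12)                             # normalized EOH feature vector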

Table 1: Development data results on various features. Fusion stands for the fusion of MFCCs, Short-term Features, EOH-MEL and OPENSMILE.

                               V                  A
  Features                 R2      MSE       R2      MSE
  MFCCs+DMFCCs           0.4719   0.0662   0.4682   0.0621
  Short-term Features    0.3828   0.0770   0.3787   0.0718
  EOH-MEL                0.2705   0.0916   0.2088   0.0917
  OPENSMILE              0.4873   0.0639   0.4514   0.0642
  Fusion                 0.5159   0.0606   0.4803   0.0608

2.1.4    Feature processing
  An efficient and effective way to summarize the features of all the windows in a piece of music is to calculate their means and variances. However, the windows of a piece of music form a time series, and the inner connection between those windows cannot be revealed through means and variances alone. We therefore seek a proper way to reflect this connection in terms of time.
  In this system, we build an Auto-Regressive (AR) and Moving Average (MA) model to capture the relationship between windows over time. First of all, we collect the features of all windows and order them by time, so that each dimension of the features forms an independent time series. Then we obtain new parameters by modeling those time series with the AR and MA models. These parameters, together with the means and variances, form the new features of 121 dimensions, of which means account for 32 (19 + 13) dimensions, variances for 32 (19 + 13) dimensions, the AR model for 19 dimensions and the MA model for 38 dimensions. We then combine these features with the EOH-MEL and OPENSMILE features to form a total feature vector of 393 dimensions.
  We evaluate these features on the development set by splitting it into a development part and a test part, while making sure that no samples from the same song appear in both parts. The following experiments on the development set use the same protocol.
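
  A minimal sketch of this per-song aggregation, fitting one low-order AR and one low-order MA model per feature dimension with statsmodels; the model orders below are placeholders, since the paper does not state which orders yield its 19 AR and 38 MA dimensions:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def summarize_song(frames):
    """frames: (n_frames, n_dims) array of frame-level features for one song."""
    stats = []
    for series in frames.T:                               # one time series per dimension
        ar = ARIMA(series, order=(1, 0, 0)).fit()         # AR(1): params = [const, ar.L1, sigma2]
        ma = ARIMA(series, order=(0, 0, 2)).fit()         # MA(2): params = [const, ma.L1, ma.L2, sigma2]
        stats.extend([series.mean(), series.var(),
                      np.asarray(ar.params)[1],           # AR coefficient
                      *np.asarray(ma.params)[1:3]])       # MA coefficients
    return np.array(stats)                                # mean, var, AR and MA terms per dimension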

Table 2: Performance of the predicting model with various lagging times.

                         V                  A
  Lagging time       R2      MSE       R2      MSE
  0ms              0.4867   0.0644   0.4507   0.0641
  500ms            0.4873   0.0639   0.4514   0.0642
  1000ms           0.4853   0.0642   0.4585   0.0633
  1500ms           0.4801   0.0648   0.4625   0.0629
  2000ms           0.4689   0.0662   0.4587   0.0633

2.2    CCRF for the dynamic task
  Considering that the emotion labels of adjacent clips in the same piece of music are time-continuous, we try to model them as an interrelated sequence. The model we employ is the continuous conditional random field (CCRF). A conditional random field is a probabilistic graphical model that can express long-range dependencies and overlapping features, alleviates the label bias problem, allows all features to be globally normalized, and admits a globally optimal solution. Notably, in contrast to hidden Markov models (HMMs), CRFs do not need the independence and Markov assumptions that HMMs require.
  We adopted the CCRF model with SVR as the base regressor to model continuous emotions in the dimensional space. We denote {x1, x2, ..., xn} as the set of labels predicted by SVR and {y1, y2, ..., yn} as the set of final labels that we want to predict, with xi ∈ Rm and yi ∈ R. The CCRF is defined as a conditional probability distribution over all emotion values. It can represent both the content information and the relational information between emotion values, which is useful for dynamic emotion evaluation [2].
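
  For reference, a common way to write this definition, following the formulation in [2] (the exact vertex and edge feature functions used in our system are not spelled out here), is

P(\mathbf{y} \mid \mathbf{X}) =
  \frac{\exp\big(\Psi(\mathbf{y},\mathbf{X})\big)}
       {\int \exp\big(\Psi(\mathbf{y},\mathbf{X})\big)\, d\mathbf{y}},
\qquad
\Psi(\mathbf{y},\mathbf{X}) =
  \sum_{i}\sum_{k} \alpha_{k}\, f_{k}(y_{i},\mathbf{X})
  + \sum_{i,j}\sum_{k} \beta_{k}\, g_{k}(y_{i},y_{j},\mathbf{X}),

where the vertex features f_k typically tie each y_i to a base prediction, e.g. f_k(y_i, X) = -(y_i - X_{i,k})^2, and the edge features g_k encourage smoothness between neighbouring clips, e.g. g_k(y_i, y_j, X) = -\tfrac{1}{2} S^{(k)}_{i,j} (y_i - y_j)^2.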

2.3    Lagging time
  When people tag the emotion scores of music, especially for time-continuous clips, they need some response time for receiving and processing the sound before tagging by hand. We therefore make the assumption that music clips do not correspond to the scores directly, but with a certain lag. Based on this assumption, we vary the lagging time on the development set to find the best value. The experimental results are shown in Table 2: the lagging time for tagging V scores is about 500ms and that for tagging A scores is about 1500ms. These values are, however, inferred under the experimental conditions of the particular features and regression model of our choice, and need more experiments to confirm.
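
  A minimal sketch of how such a lag can be applied before training, expressed in 500ms clips (function and variable names are illustrative):

import numpy as np

def align_with_lag(features, labels, lag_clips):
    """Pair each clip's features with the label entered lag_clips later.
    features: (n_clips, n_dims); labels: (n_clips,); lag_clips: lag in 500 ms clips."""
    if lag_clips == 0:
        return features, labels
    return features[:-lag_clips], labels[lag_clips:]

# e.g. a 500 ms lag for valence and a 1500 ms lag for arousal:
# Xv, yv = align_with_lag(X, valence, 1)
# Xa, ya = align_with_lag(X, arousal, 3)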

Table 3: Official results on the test data

                        V                                     A
  Run          RMSE             ρ                  RMSE             ρ
   1      0.3433±0.1940   0.0016±0.4319       0.2410±0.1066   0.5243±0.3034
   2      0.3669±0.1664   0.0086±0.3693       0.2567±0.0997   0.5025±0.2206
   3      0.3348±0.1868   0.0181±0.4350       0.2382±0.1052   0.5403±0.2694


3.    RESULTS AND CONCLUSION
  For the CCRF, we set n = 61 for the training of the five runs, i.e., the number of clips in one song, and q = 431, i.e., the number of songs in the development set.
  Run 1 uses the given features extracted by OPENSMILE and the regression model of our choice, SVR+CCRF. Run 2 uses the features of our choice, the fusion of the various features described above, and the given regression model, Multiple Linear Regression (MLR). Run 3 uses both the features and the regression model of our choice. We submitted these three runs, and the results obtained on the test dataset are shown in Table 3. We report the official challenge metrics, Pearson correlation (ρ) and Root-Mean-Squared Error (RMSE), for dynamic regression.
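
  A sketch of how these metrics can be computed per song and then averaged (mean ± standard deviation, as in Table 3); the averaging used by the official scoring script may differ in detail:

import numpy as np
from scipy.stats import pearsonr

def per_song_metrics(y_true_songs, y_pred_songs):
    """Each argument is a list of per-song arrays of clip-level values."""
    rmse = [np.sqrt(np.mean((t - p) ** 2)) for t, p in zip(y_true_songs, y_pred_songs)]
    rho = [pearsonr(t, p)[0] for t, p in zip(y_true_songs, y_pred_songs)]
    return (np.mean(rmse), np.std(rmse)), (np.mean(rho), np.std(rho))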

  The results show that Run 3, which uses both the features and the regression model of our choice, performs best. This indicates that our features and regression model perform better than the features extracted by OPENSMILE and MLR. The RMSE of both the valence (V) and arousal (A) predictions is in an acceptable range. However, we notice that the V predictions get a low ρ, close to 0, which looks strange compared with the high ρ of the A predictions. A possible reason is that V prediction is harder than A prediction; the fact that the RMSE of the V predictions is higher than that of the A predictions also supports this.

4.    REFERENCES
[1] Aljanaki, A., Yang, Y., Soleymani, M.: Emotion in
    Music Task at MediaEval 2014. In: MediaEval 2014
    Workshop (2014)
[2] Baltrusaitis, T., Banda, N., Robinson, P.: Dimensional
    affect recognition using continuous conditional random
    fields. In: IEEE International Conference and
    Workshops, 1–8 (2013)
[3] Juslin, P.N., Sloboda, J.A.: Music and emotion:
    Theory and research. Oxford University Press (2001)
[4] Lu, L., Liu, D., Zhang, H.J.: Automatic mood
    detection and tracking of music audio signals. In: IEEE
    Transactions on Audio, Speech, and Language
    Processing, 14(1), 5–18 (2006)
[5] Fornari, J., Eerola, T.: The pursuit of happiness in
    music: Retrieving valence with high-level musical
    descriptors. In: Computer Music Modeling and
    Retrieval (2008)
[6] Korhonen, M.D., Clausi, D., Jernigan, M.: Modeling
    emotional content of music using system identification.
    In: IEEE Transactions on Systems, Man, and
    Cybernetics, Part B: Cybernetics, 36(3), 588–599
    (2005)
[7] Dennis, J., Tran, H.D., Li, H.: Spectrogram image
    feature for sound event classification in mismatched
    conditions. In: Signal Processing Letters, IEEE, 18(2),
    130–133 (2011)
[8] Canny, J.: A computational approach to edge
    detection. In: IEEE Transactions on Pattern Analysis
    and Machine Intelligence, 679–698 (1986)
[9] Thayer, R.E.: The biopsychology of mood and arousal.
    Oxford University Press (1989)