<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>PKU-AIPL's Solution for MediaEval 2015 Emotion in Music Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kang Cai</string-name>
          <email>caikang@pku.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wanyi Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yao Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deshun Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoou Chen</string-name>
          <email>chenxiaoou@pku.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science and Technology, Peking University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>In this paper, we describe the PKU-AIPL team's solution for the Emotion in Music task of the MediaEval 2015 benchmarking campaign. We designed and extracted several sets of features and used a continuous conditional random field (CCRF) for the dynamic emotion characterization task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
<p>In the Emotion in Music task, labelers provided valence-arousal (v-a) labels
using a sliding bar while they listened to the music, which
makes the label of each music segment strongly dependent
on the labels of its preceding segments. In our solution, we first estimate
each segment's label from its audio features, treating
music segments as independent instances. Then, we drop
the independence assumption and further optimize the
labels by modeling music emotion labeling as a continuous
conditional random field process.</p>
<p>The rest of this paper is organized as follows. Section 2
describes our system in detail. Section 3 presents and
analyzes the performance of our solution.</p>
    </sec>
    <sec id="sec-2">
<title>2. SYSTEM DESCRIPTION</title>
<p>In this section, we introduce our system in detail. The
prediction procedure consists of three steps. First, we
select a set of features that adequately represents the music
audio signal. Second, we apply a regression model that performs
well on datasets on the order of ten thousand items, and optimize its
predictions using the relationship between
consecutive clips in a piece of music. Finally, since people react with a
delay when tagging music emotion, we
investigate the appropriate length of that delay. The three
steps of our solution are described below.</p>
    </sec>
    <sec id="sec-3">
<title>2.1 Feature Extraction</title>
<p>We preprocess the original audio files of the development
data as follows: First, we transformed the music from mp3
format to wav format. Second, we segmented each piece of music (the 15 s to
45 s span) into 60 clips, each 500 ms long. Then
we extracted features from each 500 ms clip.</p>
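<p>As a minimal sketch of this segmentation step (assuming the wav file has already been decoded into a sample array; the helper name segment_clips and its arguments are ours, not part of the original pipeline):</p>

```python
import numpy as np

def segment_clips(y, sr, start_s=15.0, end_s=45.0, clip_ms=500):
    """Cut the 15 s to 45 s span of a track into consecutive 500 ms clips."""
    clip_len = int(sr * clip_ms / 1000)          # samples per clip
    span = y[int(start_s * sr):int(end_s * sr)]  # the annotated 30 s window
    n = len(span) // clip_len                    # 60 clips for a full window
    return [span[i * clip_len:(i + 1) * clip_len] for i in range(n)]
```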
<p>This work has been supported by the Natural Science
Foundation of China (Multimodal Music Emotion Recognition
technology research, No. 61170167).</p>
      <sec id="sec-3-1">
<title>2.1.1 Mel-Frequency Cepstrum Coefficients</title>
<p>We divide the song signals into 50%-overlapping frames
of 1024 samples (about 23 ms). On each frame we compute 13
Mel-Frequency Cepstrum Coefficients (MFCCs), with the 0th
component included, as a 13-D feature vector, as
well as the delta-MFCCs.</p>
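<p>The framing can be sketched as follows (a simplified illustration; the MFCC computation itself is typically delegated to a signal-processing library and is not reproduced here):</p>

```python
import numpy as np

def frame_signal(y, frame_len=1024, hop=512):
    """Split a clip into 50%-overlapping 1024-sample frames (~23 ms at 44.1 kHz)."""
    n = 1 + (len(y) - frame_len) // hop
    return np.stack([y[i * hop:i * hop + frame_len] for i in range(n)])
```

<p>Each row then feeds the MFCC computation, yielding one 13-D vector (plus deltas) per frame.</p>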
      </sec>
      <sec id="sec-3-2">
<title>2.1.2 Some General Short-term Features</title>
<p>As for the MFCCs, we divide the song signals into
50%-overlapping frames of 1024 samples (about 23 ms).
Then we compute Short-Time Energy, Spectral Centroid,
Spectral Entropy, Spectral Flux, Spectral Roll-Off and Zero
Crossing Rate on each frame as a 6-D feature vector.</p>
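<p>Two of the six descriptors can be sketched per frame as follows (a minimal illustration with our own helper name; a Hann window is applied before the FFT, a detail the text does not specify):</p>

```python
import numpy as np

def frame_features(frame, sr=44100):
    """Spectral centroid and zero-crossing rate of one analysis frame."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)  # "center of mass" in Hz
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0    # sign changes per sample
    return centroid, zcr
```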
      </sec>
      <sec id="sec-3-3">
<title>2.1.3 Edge Orientation Histogram on Mel Spectrogram</title>
<p>The spectrogram is a nearly complete representation of
music, and it also lets us
investigate the relationship between the audio signal and emotion from
a visual angle [7]. We find a strong relationship
between the edge orientations in spectrograms and music
emotions, so we extract an edge orientation histogram (EOH)
feature from the audio spectrogram [8].</p>
<p>The procedure of our algorithm can be described
as follows: Convert the audio signal to a spectrogram with a
Mel time-frequency representation. The gradients at
point (x, y) of the Mel spectrogram S are found by
convolving Sobel masks with S. We then obtain the edge orientation
at each point of the spectrogram by dividing the gradient strength in the Y
direction by that in the X direction. Finally, we quantize the
edge orientations into a fixed number of bins, which form the
edge orientation histogram on the Mel spectrogram.</p>
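<p>The steps above can be sketched as follows (a minimal illustration on an already-computed Mel spectrogram; we take the arctangent of the Y/X gradient ratio and weight the histogram by gradient magnitude, details the text leaves open):</p>

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def conv2_valid(img, k):
    """3x3 'valid' correlation used for the Sobel responses."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * img[i:h - 2 + i, j:w - 2 + j]
    return out

def eoh(mel_spec, n_bins=8):
    """Edge orientation histogram over a Mel spectrogram."""
    gx = conv2_valid(mel_spec, SOBEL_X)   # gradient strength along X
    gy = conv2_valid(mel_spec, SOBEL_Y)   # gradient strength along Y
    theta = np.arctan2(gy, gx)            # orientation at each point
    hist, _ = np.histogram(theta, bins=n_bins, range=(-np.pi, np.pi),
                           weights=np.hypot(gx, gy))
    return hist / (hist.sum() + 1e-12)
```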
      </sec>
      <sec id="sec-3-4">
<title>2.1.4 Feature Processing</title>
<p>An efficient and effective way to summarize the features
of all the windows in a piece of music is to compute their
means and variances. However, the windows of a piece
of music form a time series, and the temporal connections
between them cannot be revealed by
means and variances alone. We therefore seek a way to
capture these connections over time.</p>
<p>In this system, we build an Auto-Regressive (AR) and
Moving Average (MA) model to capture the relationships
between windows over time. First, we
collect the features of all windows and order them in
time, so that each feature dimension forms an
independent time series. Then we obtain new parameters by
modeling these time series with the AR and MA models.
These parameters, together with the means and variances, form
the new 121-dimensional features, of which the means
account for 32 (19 + 13) dimensions, the variances for 32 (19 + 13)
dimensions, the AR model for 19 dimensions and the MA model for 38
dimensions. We then combine these features with the EOH-MEL
and OPENSMILE features to form the full 393-dimensional
feature set.</p>
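<p>The AR part of this modeling can be sketched as a least-squares fit per feature dimension (a simplified illustration; the model order and the MA fit are omitted, and the helper name is ours):</p>

```python
import numpy as np

def ar_coeffs(x, p=2):
    """Fit x[t] ~ a1*x[t-1] + ... + ap*x[t-p]; the a_k become features."""
    X = np.column_stack([x[p - i - 1:-i - 1] for i in range(p)])  # lagged copies
    y = x[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a
```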
<p>We evaluate these features by splitting the development set
into a development part and a test part, making sure
that no samples from the same song appear in both. The following
experiments on the development set use the same protocol.</p>
      </sec>
    </sec>
    <sec id="sec-4">
<title>2.2 CCRF for the Dynamic Task</title>
<p>Since the emotion labels of adjacent clips in the
same piece of music are continuous in time, we
model them as an interrelated sequence, using a
continuous conditional random field (CCRF). A conditional
random field is a probabilistic graphical model that
can express long-range dependencies and
overlapping features; it mitigates label bias, since
all features are globally
normalized and a globally optimal solution can be obtained.
Notably, in contrast to hidden Markov models (HMMs), CRFs
do not need the independence and Markov
assumptions that HMMs require.</p>
<p>We adopted the CCRF model with SVR as the base
regressor to model continuous emotions in the dimensional space.
We denote by {x1, x2, ..., xn} the set of labels predicted by
SVR, and by {y1, y2, ..., yn} the set of final labels that we
want to predict, with xi ∈ R and yi ∈ R. The CCRF is defined as a
conditional probability distribution over all emotion values.
It can represent both the content information and the
relational information between emotion values, which is useful for
dynamic emotion evaluation [2].</p>
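<p>With quadratic node and edge potentials on a chain, a CCRF's MAP estimate has a closed form, which conveys the idea (a minimal sketch, not the authors' exact parameterization; the actual model learns its parameters, whereas alpha and beta here are illustrative constants):</p>

```python
import numpy as np

def ccrf_smooth(x, alpha=1.0, beta=5.0):
    """MAP estimate of a chain CCRF: node potentials tie each output y_i
    to the SVR prediction x_i, edge potentials tie neighbouring outputs."""
    n = len(x)
    # graph Laplacian of the chain linking consecutive clips
    L = np.diag(np.r_[1.0, 2.0 * np.ones(n - 2), 1.0])
    L -= np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return np.linalg.solve(alpha * np.eye(n) + beta * L, alpha * x)
```

<p>Larger beta values enforce smoother emotion curves over the clip sequence.</p>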
    </sec>
    <sec id="sec-5">
<title>2.3 Lagging Time</title>
<p>When people tag emotion scores for music, especially
for time-continuous clips, they need
time to perceive and process the sound before
tagging by hand. We therefore assume that music clips
do not correspond to the scores directly, but with a certain
lag. Based on this assumption, we vary the lagging time on the
development set to find the best value. The
experimental results are shown in Table 2: the best
lagging time is about 500 ms for tagging V scores and
about 1500 ms for tagging A scores. This is, however,
inferred under our particular choice of
features and regression model, and needs more
experiments to confirm.</p>
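<p>In practice, such a lag can be applied by shifting the label sequence against the clip sequence (a hypothetical helper, not the original code; with 500 ms clips, a lag of 1 clip would correspond to V and 3 clips to A):</p>

```python
def apply_lag(features, labels, lag_clips):
    """Pair clip i's features with the label annotated lag_clips later,
    assuming annotators react with a fixed delay."""
    if lag_clips == 0:
        return features, labels
    return features[:-lag_clips], labels[lag_clips:]
```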
    </sec>
    <sec id="sec-6">
<title>3. RESULTS AND CONCLUSION</title>
<p>For the CCRF, we set n = 61 for training in the five runs,
which is the number of clips in one song, and q = 431,
the number of songs in the development set.</p>
<p>Run 1 uses the given features extracted by OPENSMILE
and the regression model of our choice, SVR+CCRF. Run 2
uses the features of our choice, a fusion of the various features above,
and the given regression model, Multiple Linear Regression
(MLR). Run 3 uses both the features and the regression
model of our choice. We submitted these three runs; the
results on the test dataset are shown in Table 3. We
report the official challenge metrics, Pearson correlation
and Root-Mean-Squared Error (RMSE), for dynamic
regression.</p>
<p>The results show that Run 3, which uses both the features
and the regression model of our choice, performs best. This means
that our features and regression model perform better than
the features extracted by OPENSMILE with MLR. The RMSE
of both valence (V) and arousal (A) prediction is in an
acceptable range. However, we notice that the V predictions
get a low correlation, even close to 0, which looks strange
compared with the high correlation of the A predictions. A possible
reason is that V prediction is harder than A prediction; the
fact that the RMSE of the V predictions is lower than that of
the A predictions also supports this.
</p>
      <p>[3] Juslin, P.N., Sloboda, J.A.: Music and Emotion: Theory and Research. Oxford University Press (2001)</p>
      <p>[4] Lu, L., Liu, D., Zhang, H.J.: Automatic mood detection and tracking of music audio signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1), 5-18 (2006)</p>
      <p>[5] Fornari, J., Eerola, T.: The pursuit of happiness in music: Retrieving valence with high-level musical descriptors. In: Computer Music Modeling and Retrieval (2008)</p>
      <p>[6] Korhonen, M.D., Clausi, D., Jernigan, M.: Modeling emotional content of music using system identification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(3), 588-599 (2005)</p>
      <p>[7] Dennis, J., Tran, H.D., Li, H.: Spectrogram image feature for sound event classification in mismatched conditions. IEEE Signal Processing Letters, 18(2), 130-133 (2011)</p>
      <p>[8] Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 679-698 (1986)</p>
      <p>[9] Thayer, R.E.: The Biopsychology of Mood and Arousal. Oxford University Press (1989)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Aljanaki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soleymani</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Emotion in Music Task at MediaEval 2014</article-title>
          . In: MediaEval 2014 Workshop (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Baltrusaitis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banda</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robinson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
<article-title>Dimensional affect recognition using continuous conditional random fields</article-title>
          .
          <source>In: IEEE International Conference and Workshops</source>
          ,
          1-8
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>