=Paper=
{{Paper
|id=Vol-1436/Paper57
|storemode=property
|title=PKU-AIPL' Solution for MediaEval 2015 Emotion in Music Task
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper57.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/CaiYCYC15
}}
==PKU-AIPL' Solution for MediaEval 2015 Emotion in Music Task==
Kang Cai, Wanyi Yang, Yao Cheng, Deshun Yang, Xiaoou Chen
Institute of Computer Science and Technology, Peking University, Beijing, China
{caikang, yangwanyi, chengyao, yangdeshun, chenxiaoou}@pku.edu.cn

This work has been supported by the Natural Science Foundation of China (Multimodal Music Emotion Recognition technology research, No. 61170167). Copyright is held by the author/owner(s). MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.

===ABSTRACT===
In this paper, we describe the PKU-AIPL team solution for the Emotion in Music task of the MediaEval 2015 benchmarking campaign. We extracted and designed several sets of features and used a continuous conditional random field (CCRF) for the dynamic emotion characterization task.

===1. INTRODUCTION===
In the Emotion in Music task, labelers provided valence-arousal (v-a) labels with a sliding bar while they listened to the music, which made the label of each music segment strongly dependent on those of its preceding segments. In our solution, we first estimate each segment's label from its audio features, treating the music segments as independent instances. We then drop the independence assumption and further optimize the labels by modeling music emotion labeling as a continuous conditional random field process.

The rest of this paper is organized as follows. Section 2 describes our system in detail. Section 3 presents the performance of our solution and analyzes it.

===2. SYSTEM DESCRIPTION===
In this section, we introduce our system in detail. The prediction procedure consists of three steps. First, we select a set of features that represent the music audio signal adequately. Second, we apply a regression model that performs well at the scale of about ten thousand items and optimize the predictions according to the relationship between consecutive clips of a piece of music. Finally, considering the delayed reaction of people when they tag music emotion labels, we investigate the proper length of this delay. The three steps of our solution are described below.

====2.1 Feature Extraction====
We preprocess the original audio files of the development data as follows. First, we convert the music from mp3 format to wav format. Second, we segment the annotated portion of each song (from 15 s to 45 s) into 60 clips, each 500 ms long. Then we extract features from each 500 ms clip.

=====2.1.1 Mel-Frequency Cepstrum Coefficients=====
We divide the song signals into 50%-overlapping frames of 1024 samples (about 23 ms). On each frame we compute 13 Mel-Frequency Cepstrum Coefficients (MFCCs), with the 0th component included, as a 13-D feature vector, as well as their deltas (delta-MFCCs).

=====2.1.2 Some General Short-term Features=====
As for the MFCCs, we divide the song signals into 50%-overlapping frames of 1024 samples (about 23 ms). On each frame we then compute Short Time Energy, Spectral Centroid, Spectral Entropy, Spectral Flux, Spectral Roll-Off and Zero Crossing Rate as a 6-D feature vector.
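The paper does not include extraction code; the sketch below shows one way the frame-level features of Sects. 2.1.1 and 2.1.2 could be reproduced with librosa, assuming 44.1 kHz mono audio (so that 1024-sample frames are about 23 ms) and a 512-sample hop for the 50% overlap. The spectral entropy and spectral flux computations are our own illustrative choices, since librosa has no built-in helpers for them, and may differ in detail from the authors' implementation.

<pre>
import numpy as np
import librosa

def frame_features(path, sr=44100, n_fft=1024, hop=512):
    """Frame-level features: 13 MFCCs (0th included), their deltas,
    and six general short-term descriptors (Sects. 2.1.1-2.1.2)."""
    y, _ = librosa.load(path, sr=sr, mono=True)

    # 13 MFCCs with the 0th coefficient, plus delta-MFCCs
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    dmfcc = librosa.feature.delta(mfcc)

    # Short-term descriptors computed on the same 50%-overlapping frames
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    energy = np.sum(S ** 2, axis=0, keepdims=True)   # short-time energy (spectral domain)
    centroid = librosa.feature.spectral_centroid(S=S, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(S=S, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)
    p = S / (S.sum(axis=0, keepdims=True) + 1e-10)
    entropy = -np.sum(p * np.log2(p + 1e-10), axis=0, keepdims=True)
    flux = np.sum(np.diff(S, axis=1, prepend=S[:, :1]) ** 2, axis=0, keepdims=True)

    short_term = np.vstack([energy, centroid, entropy, flux, rolloff, zcr])
    return np.vstack([mfcc, dmfcc, short_term])       # (13 + 13 + 6) x n_frames
</pre>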
=====2.1.3 Edge Orientation Histogram on Mel Spectrogram=====
The spectrogram is a nearly complete representation of music, and it also lets us investigate the relationship between the audio signal and emotion from a visual angle [7]. We find that a strong relationship exists between the edge orientations in spectrograms and music emotions, so we extract an edge orientation histogram (EOH) feature from the audio spectrogram [8].

The procedure of our proposed algorithm is as follows. First, convert the audio signal to a spectrogram with a Mel time-frequency representation. The gradients at point (x, y) of the Mel spectrogram S are obtained by convolving Sobel masks with S. We then get the edge orientation of each point of the spectrogram by dividing the gradient strength along the Y dimension by that along the X dimension. Finally, we index the edge orientations into a fixed number of bins, which form the edge orientation histogram on the Mel spectrogram (EOH-MEL).
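The EOH-MEL extraction is described only in prose; the following minimal sketch follows those steps using librosa for the Mel spectrogram and SciPy's Sobel filters for the gradients. The Mel resolution (n_mels), the number of orientation bins (n_bins) and the weighting of the histogram by edge strength are illustrative assumptions, as the paper does not state the exact values.

<pre>
import numpy as np
import librosa
from scipy.ndimage import sobel

def eoh_mel(path, sr=44100, n_mels=64, n_bins=8):
    """Edge orientation histogram on a log-Mel spectrogram (Sect. 2.1.3)."""
    y, _ = librosa.load(path, sr=sr, mono=True)

    # Mel time-frequency representation of the audio signal
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                       hop_length=512, n_mels=n_mels)
    S = librosa.power_to_db(S)

    # Gradients obtained by convolving Sobel masks with the spectrogram
    gy = sobel(S, axis=0)              # strength along the Y (frequency) dimension
    gx = sobel(S, axis=1)              # strength along the X (time) dimension

    # Edge orientation of each point: angle whose tangent is gy / gx
    orientation = np.arctan2(gy, gx)
    strength = np.hypot(gx, gy)

    # Index orientations into a fixed number of bins (here weighted by edge strength)
    hist, _ = np.histogram(orientation, bins=n_bins,
                           range=(-np.pi, np.pi), weights=strength)
    return hist / (hist.sum() + 1e-10)  # normalized EOH-MEL descriptor
</pre>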
=====2.1.4 Feature Processing=====
A simple and efficient way to summarize the frame-level features of a piece of music is to compute their means and variances. However, the frames of a piece of music form a time series, and the inner connection between frames cannot be revealed by means and variances alone. We therefore look for a way to capture this temporal connection.

In this system, we build an Auto-Regressive (AR) and Moving Average (MA) model to describe the relationship between frames over time. First, we take the features of all frames in temporal order, so that each feature dimension forms an independent time series. Then we obtain new parameters by fitting the AR and MA models to these time series. Together with the means and variances, these parameters form new features of 121 dimensions: 32 (19 + 13) dimensions of means, 32 (19 + 13) dimensions of variances, 19 dimensions of AR coefficients and 38 dimensions of MA coefficients. We then combine these features with the EOH-MEL and OPENSMILE features to obtain a total feature vector of 393 dimensions.

We evaluate these features by splitting the development set into a training part and a held-out part, making sure that no samples from the same song appear in both parts. The following experiments on the development set use the same protocol. The results are shown in Table 1.

{| class="wikitable"
|+ Table 1: Development data results for various features. "Fusion" stands for the fusion of MFCCs, Short-term Features, EOH-MEL and OPENSMILE features.
|-
! rowspan="2" | Features !! colspan="2" | V !! colspan="2" | A
|-
! R² !! MSE !! R² !! MSE
|-
| MFCCs+DMFCCs || 0.4719 || 0.0662 || 0.4682 || 0.0621
|-
| Short-term Features || 0.3828 || 0.0770 || 0.3787 || 0.0718
|-
| EOH-MEL || 0.2705 || 0.0916 || 0.2088 || 0.0917
|-
| OPENSMILE || 0.4873 || 0.0639 || 0.4514 || 0.0642
|-
| Fusion || 0.5159 || 0.0606 || 0.4803 || 0.0608
|}

====2.2 CCRF for the Dynamic Task====
Since the emotion labels of adjacent clips in the same piece of music are temporally continuous, we model them as an interrelated sequence. The model we employ is the continuous conditional random field (CCRF). A conditional random field (CRF) is a probabilistic graphical model that can express long-range dependence and overlapping features, alleviates the label bias problem, normalizes all features globally, and allows a globally optimal solution to be obtained. Notably, in contrast to hidden Markov models (HMMs), CRFs do not need the independence and Markov assumptions that HMMs require.

We adopt the CCRF model with SVR as the base regressor to model continuous emotions in dimensional space. We denote by {x₁, x₂, …, xₙ} the labels predicted by SVR and by {y₁, y₂, …, yₙ} the final labels we want to predict, with xᵢ ∈ Rᵐ and yᵢ ∈ R. The CCRF is defined as a conditional probability distribution over all emotion values; it can represent both the content information and the relational information between emotion values, which is useful for dynamic emotion evaluation [2].
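Below is a minimal sketch of the prediction step described above, assuming per-clip SVR outputs for the consecutive 500 ms clips of one song and a chain-structured similarity between neighbouring clips. With quadratic node and edge potentials the CCRF posterior is Gaussian, so the MAP labels can be obtained by solving one linear system [2]; the node weight alpha and edge weight beta are learned from the development data in the actual system and are only placeholders here.

<pre>
import numpy as np
from sklearn.svm import SVR

def fit_base_svr(X_train, y_train):
    """Per-dimension base regressor on clip-level features (one model for V, one for A)."""
    return SVR(kernel="rbf", C=1.0, epsilon=0.01).fit(X_train, y_train)

def ccrf_smooth(x, alpha=1.0, beta=5.0):
    """CCRF MAP inference over one song's clip sequence.

    x     : SVR predictions for the n consecutive clips of a song
    alpha : node weight, tying each y_i to its SVR output x_i
    beta  : edge weight, tying neighbouring y_i and y_{i+1} together
    The quadratic potentials make the CCRF distribution Gaussian, with mean
    solving (alpha*I + beta*L) y = alpha*x, where L is the Laplacian of the
    clip chain [2].
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    S = np.zeros((n, n))               # similarity: consecutive clips are neighbours
    idx = np.arange(n - 1)
    S[idx, idx + 1] = S[idx + 1, idx] = 1.0
    L = np.diag(S.sum(axis=1)) - S     # graph Laplacian of the chain
    return np.linalg.solve(alpha * np.eye(n) + beta * L, alpha * x)

# Example usage for one song of n = 61 clips (names are illustrative):
#   svr = fit_base_svr(X_dev, y_dev_valence)
#   x_song = svr.predict(X_song)       # one SVR prediction per 500 ms clip
#   y_song = ccrf_smooth(x_song)       # temporally smoothed final labels
</pre>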
====2.3 Lagging Time====
When people tag emotion scores for music, especially for time-continuous clips, they need time to perceive and process the sound before tagging by hand. We therefore assume that music clips do not align with their scores directly but with a certain lag. Based on this assumption, we vary the lagging time on the development set to find the best value. The experimental results are shown in Table 2: the lagging time for tagging V scores is about 500 ms and for tagging A scores about 1500 ms. These values, however, are obtained under our particular choice of features and regression model and need further experiments to confirm.

{| class="wikitable"
|+ Table 2: Performance of the prediction model for various lagging times
|-
! rowspan="2" | Lagging time !! colspan="2" | V !! colspan="2" | A
|-
! R² !! MSE !! R² !! MSE
|-
| 0 ms || 0.4867 || 0.0644 || 0.4507 || 0.0641
|-
| 500 ms || 0.4873 || 0.0639 || 0.4514 || 0.0642
|-
| 1000 ms || 0.4853 || 0.0642 || 0.4585 || 0.0633
|-
| 1500 ms || 0.4801 || 0.0648 || 0.4625 || 0.0629
|-
| 2000 ms || 0.4689 || 0.0662 || 0.4587 || 0.0633
|}

===3. RESULTS AND CONCLUSION===
For the CCRF we set n = 61, i.e., the number of clips in one song, and q = 431, i.e., the number of songs in the development set, when training our runs.

Run 1 uses the provided features extracted by OPENSMILE and the regression model of our choice, SVR+CCRF. Run 2 uses the features of our choice, the fusion of the various feature sets, and the provided regression model, Multiple Linear Regression (MLR). Run 3 uses both the features and the regression model of our choice. We submitted these three runs; the results obtained on the test dataset are shown in Table 3. We report the official challenge metrics for dynamic regression, Pearson correlation (ρ) and root mean squared error (RMSE).

{| class="wikitable"
|+ Table 3: Official results on the test data
|-
! rowspan="2" | Run !! colspan="2" | V !! colspan="2" | A
|-
! RMSE !! ρ !! RMSE !! ρ
|-
| 1 || 0.3433±0.1940 || 0.0016±0.4319 || 0.2410±0.1066 || 0.5243±0.3034
|-
| 2 || 0.3669±0.1664 || 0.0086±0.3693 || 0.2567±0.0997 || 0.5025±0.2206
|-
| 3 || 0.3348±0.1868 || 0.0181±0.4350 || 0.2382±0.1052 || 0.5403±0.2694
|}

The results show that Run 3, which uses both the features and the regression model of our choice, performs best. This indicates that our features and regression model perform better than the OPENSMILE features with MLR. The RMSE values for valence (V) and arousal (A) prediction are both in an acceptable range. However, we notice that the V predictions get a ρ close to 0, which looks strange compared with the high ρ of the A predictions. A possible reason is that predicting V is harder than predicting A; the fact that the RMSE for V is higher than that for A also supports this.

===4. REFERENCES===
[1] Aljanaki, A., Yang, Y., Soleymani, M.: Emotion in Music Task at MediaEval 2014. In: MediaEval 2014 Workshop (2014)

[2] Baltrusaitis, T., Banda, N., Robinson, P.: Dimensional affect recognition using continuous conditional random fields. In: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 1–8 (2013)

[3] Juslin, P.N., Sloboda, J.A.: Music and Emotion: Theory and Research. Oxford University Press (2001)

[4] Lu, L., Liu, D., Zhang, H.J.: Automatic mood detection and tracking of music audio signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1), 5–18 (2006)

[5] Fornari, J., Eerola, T.: The pursuit of happiness in music: Retrieving valence with high-level musical descriptors. In: Computer Music Modeling and Retrieval (2008)

[6] Korhonen, M.D., Clausi, D., Jernigan, M.: Modeling emotional content of music using system identification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(3), 588–599 (2005)

[7] Dennis, J., Tran, H.D., Li, H.: Spectrogram image feature for sound event classification in mismatched conditions. IEEE Signal Processing Letters, 18(2), 130–133 (2011)

[8] Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 679–698 (1986)

[9] Thayer, R.E.: The Biopsychology of Mood and Arousal. Oxford University Press (1989)