
Beatsens' Solution for MediaEval 2014 Emotion in Music Task

Wanyi Yang (1), Kang Cai (1, caikang@pku.edu.cn), Bin Wu (2), Ying Wang (2, ywangbf@cse.ust.hk), Xiaoou Chen (1, chenxiaoou@pku.edu.cn), Deshun Yang (1), Andrew Horner (2, horner@cse.ust.hk)

(1) Institute of Computer Science and Technology, Peking University, Beijing, China
(2) Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China

Table 1: Features extracted by MIRToolbox (Parts, Features, Dim.)

Dynamics: RMS energy, Slope, Attack, Low energy
Rhythm: Tempo, Fluctuation peak, Fluctuation centroid
Spectral: Spectrum centroid, Brightness, Spread, Skewness, Kurtosis, Rolloff 95, Rolloff 85, Spectral Entropy, Flatness, Roughness, Irregularity, Zero crossing rate, Spectral flux, MFCC, DMFCC (Dim. 78)
Harmony: Chromagram peak, Chromagram centroid, Key clarity, Key mode, HCDF

MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

In this paper, we describe the Beatsens team's solution for the Emotion in Music task of the MediaEval 2014 benchmarking campaign. We extracted and designed several sets of features and used a continuous conditional random field (CCRF) for the dynamic emotion characterization task. The best runs achieve Pearson correlations of 0.23 ± 0.56 for valence and 0.12 ± 0.55 for arousal, and RMSEs of 0.12 ± 0.06 and 0.09 ± 0.05, respectively.

1. INTRODUCTION

The Emotion in Music task aims to estimate valence and arousal values for 500 ms music segments. In this task, labelers provided valence-arousal (v-a) labels using a sliding bar while they listened to the music, which made the label of each music segment strongly dependent on those of the preceding segments. More details concerning the dataset collection can be found in [1]. Therefore, in our solution, we treat the labeling process as a continuous conditional random field (CCRF) process, in which the v-a values depend not only on the acoustic content of each music segment but also on the preceding segments. The final results also show the advantages of CCRF modeling.

In this paper, we first introduce our solution in feature extraction and modeling. Then, we present the results in terms of both various feature combinations and model parameters.

2. SYSTEM DESCRIPTION

In this section, we introduce the feature design and the model of our system. The basic logic is that we first estimate each segment's label from its audio features, assuming music segments are independent instances. Then, we drop the independence assumption and further optimize the labels by modeling music emotion labeling as a continuous conditional random field process. We describe our solution in detail as follows.

This work has been supported by the Natural Science Foundation of China (Multimodal Music Emotion Recognition technology research, No. 61170167) and Hong Kong Research Grants Council grants (HKUST613112).


2.1 Feature Extraction

First, we converted the music from mp3 format to wav format. Second, we segmented the music (the 15 s to 45 s period) into 60 clips of 500 ms each. Then we extracted features from each 500 ms clip using MIRToolbox (see footnote 1). Both the mean and the standard deviation of each feature were calculated. There were 54 features in total. Table 1 shows the features in detail.
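The segmentation and per-clip summary described above can be sketched in a few lines. This is only an illustration and not the authors' MIRToolbox pipeline: the function names `clip_bounds` and `summarize` are hypothetical, and the use of the population standard deviation is an assumption, since the paper does not say which estimator was used.

```python
from statistics import mean, pstdev

def clip_bounds(start_s=15.0, end_s=45.0, clip_s=0.5):
    """Return (start, end) times in seconds of consecutive clips."""
    n_clips = int(round((end_s - start_s) / clip_s))
    return [(start_s + i * clip_s, start_s + (i + 1) * clip_s)
            for i in range(n_clips)]

def summarize(frame_values):
    """Mean and standard deviation of one frame-level feature within a clip."""
    # Assumption: population standard deviation (divide by N).
    return mean(frame_values), pstdev(frame_values)

bounds = clip_bounds()
print(len(bounds), bounds[0], bounds[-1])
```

With the default arguments the 30 s excerpt yields 60 clip windows, matching the count given in the text.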

2.2 CCRF for dynamic task

Because labelers used a slider when labeling, emotion values change continuously rather than abruptly, so it is better to define the labeling model as a function over all the emotion values in one song. We adopted the CCRF model with SVR as the base regressor to model continuous emotions in dimensional space.

In CCRF, we denote $\{x_1, x_2, \ldots, x_n\}$ as the set of labels predicted by SVR and $\{y_1, y_2, \ldots, y_n\}$ as the set of final labels that we want to predict, with $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}$. CCRF is defined as a conditional probability distribution over all emotion values. It can represent both the content information and the relation information between emotion values, which is useful for dynamic emotion evaluation [2].

Footnote 1: Version 1.5, https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox

Table 4: Official results on the test data

          Run1            Run2            Run3            Run4            Run5
V  ρ      0.220 ± 0.571   0.178 ± 0.562   0.224 ± 0.552   0.231 ± 0.564   0.230 ± 0.548
V  RMSE   0.117 ± 0.056   0.107 ± 0.055   0.122 ± 0.058   0.122 ± 0.057   0.121 ± 0.057
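The CCRF refinement of Section 2.2 can be illustrated with a minimal sketch. The paper does not give its equations, so the code below assumes the Gaussian CCRF form of [2] in its simplest chain version: each final label is tied to the base predictions with weights alpha and to its two temporal neighbours with a single weight beta (neighbour similarity fixed to 1). Under that assumption the most likely sequence solves (sum(alpha) I + beta L) y = X alpha, where L is the Laplacian of the chain. The function name `ccrf_chain_mean` and the weight values are hypothetical; in practice the weights would be learned from the development set.

```python
def ccrf_chain_mean(X, alpha, beta):
    """Most likely labels of a chain-structured Gaussian CCRF (sketch).

    X[i][k]  : k-th base prediction (e.g. from SVR) for segment i
    alpha[k] : weight (> 0) tying y_i to base prediction k
    beta     : weight (>= 0) tying y_i to y_{i-1} and y_{i+1}
    """
    n = len(X)
    a_sum = sum(alpha)
    rhs = [sum(a * x for a, x in zip(alpha, row)) for row in X]
    if n == 1:
        return [rhs[0] / a_sum]
    # Tridiagonal system: end nodes have one neighbour, inner nodes two.
    diag = [a_sum + beta * (1 if i in (0, n - 1) else 2) for i in range(n)]
    off = -beta
    # Thomas algorithm, forward elimination.
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    # Back substitution.
    y = [0.0] * n
    y[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        y[i] = d[i] - c[i] * y[i + 1]
    return y

# Toy example: one jittery base prediction per segment, smoothed by the chain.
base = [[0.0], [1.0], [0.0], [1.0], [0.0]]
smoothed = ccrf_chain_mean(base, alpha=[1.0], beta=1.0)
print(smoothed)
```

With beta set to 0 the output equals the weighted base prediction, and larger beta pulls neighbouring segments closer together, which is the behaviour expected of labels produced with a slider.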

3. EXPERIMENTS AND RESULTS

With the selected attributes, we modeled the data using Support Vector Regression (SVR) and K-Nearest Neighbor (KNN) and evaluated both on the training set with 4-fold cross validation. All of the results show that SVR outperforms KNN, so SVR is adopted in our runs.
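As an illustration of the 4-fold protocol, the sketch below splits item indices into four folds. It assumes the folds are formed over songs, in order and without shuffling; the paper does not state how the folds were built, and the helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_items, k=4):
    """Yield (train, validation) index lists for k nearly equal folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, validation
        start += size

# 744 is the number of development-set songs reported in the paper.
folds = list(k_fold_indices(744, k=4))
print([len(validation) for _, validation in folds])
```

Splitting by song rather than by clip keeps all segments of one song in the same fold, which avoids leaking strongly correlated neighbouring clips between training and validation data.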

For CCRF, we set n = 61, the number of clips in one song, and q = 744, the number of songs in the development set, when training the five runs.

3.1 Experiments of Run1 and Run2

The 54 features are divided into four parts: dynamics, spectrum, rhythm, and harmony [3]. We compared the four perceptual dimensions and their combinations; the results showed that Spectral+Dynamic+Rhythm performs the best. This feature set is adopted in Run1.

With the features of Run1, we evaluated SVR with three kernels (radial basis function, linear, and polynomial) and a series of values for the cost parameter C. The results showed that the linear kernel gives the best result and that C = 2 3 performs best.

Because 500 ms is too short for information extraction, some features failed to be extracted. Thus, we further extended the clip length to 1 s and extracted the features again. Finally, we concatenated the new 1 s clip features with the original 500 ms clip features to obtain the features of Run2.
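The concatenation used for Run2 amounts to joining two feature vectors per segment. The sketch below assumes one 1 s feature vector is available for every 500 ms segment; how the longer window is aligned with each segment is not described in the paper, and the name `concat_clip_features` is hypothetical.

```python
def concat_clip_features(feats_500ms, feats_1s):
    """Join, per segment, the 500 ms features with the 1 s features."""
    if len(feats_500ms) != len(feats_1s):
        raise ValueError("both feature lists must cover the same segments")
    return [list(a) + list(b) for a, b in zip(feats_500ms, feats_1s)]

# Toy example: 2 segments with 3 features per clip length.
short_feats = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
long_feats = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
run2_features = concat_clip_features(short_feats, long_feats)
print(run2_features[0])
```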

3.2 Experiments of Run3, Run4 and Run5

In addition, we found that the Mel-frequency cepstral coefficients (MFCC) are among the most important spectral features. As 0.5 s is too short to convey the emotion completely, we ran extensive experiments with MFCC using various clip lengths and frame lengths.

Experiment a: We separately extracted MFCC from 0.5 s, 1 s, 2 s, 4 s, and 8 s clips, so that longer clips convey more information than a single 0.5 s clip. The results are shown in Table 2. Comparing the six single features, the 0.5 s clip performs best, and this method is adopted in Run3.

For the combination, we took the regression labels of the six features as the input of CCRF. The final result slightly outperforms the single 0.5 s clip, and this method is adopted in Run4.

Experiment b: Since frame length is an important parameter, we set different frame lengths (11.6 ms, 23.2 ms, and 46.4 ms) and extracted MFCC for each. Table 3 shows that the results for different frame lengths remain basically unchanged and that their combination (COMB) performs the best. This method is adopted in Run5.
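The three frame lengths correspond to power-of-two frame sizes if the audio is sampled at 44.1 kHz. The sampling rate is an assumption here, since the paper does not state it, and the helper names are hypothetical. The second helper counts whole non-overlapping frames, which is also an assumption; with overlapping frames the counts would be higher.

```python
SAMPLE_RATE_HZ = 44100  # assumption: CD-quality audio, not stated in the paper

def frame_length_ms(n_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration in milliseconds of an analysis frame of n_samples samples."""
    return 1000.0 * n_samples / sample_rate_hz

def frames_per_clip(clip_s, n_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Whole non-overlapping frames that fit in a clip of clip_s seconds."""
    return int(clip_s * sample_rate_hz) // n_samples

for n in (512, 1024, 2048):
    print(n, round(frame_length_ms(n), 1), frames_per_clip(0.5, n))
```

Under these assumptions a 0.5 s clip holds only a few dozen frames at the shortest setting and about ten at the longest, which gives a sense of how little data each clip-level mean and standard deviation is computed from.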

The results obtained on the test dataset are shown in Table 4. We report the official challenge metrics for dynamic regression, Pearson correlation (ρ) and root-mean-square error (RMSE). We can conclude that a feature set as simple as MFCC can perform better than larger feature sets. The combination of MFCC features from various clip lengths performs the best, achieving sufficiently good performance on a new dataset.
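For reference, the two metrics can be computed as in the sketch below. It assumes that each metric is computed per song over that song's segment sequence and then summarized as mean ± standard deviation across songs, which is how the "value ± value" entries are read here; this is not the official evaluation script, and all function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

def pearson(pred, truth):
    """Pearson correlation between two equal-length sequences."""
    mp, mt = mean(pred), mean(truth)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in truth)
    # Undefined (division by zero) if either sequence is constant.
    return cov / sqrt(var_p * var_t)

def rmse(pred, truth):
    """Root-mean-square error between two equal-length sequences."""
    return sqrt(mean([(p - t) ** 2 for p, t in zip(pred, truth)]))

def summarize_over_songs(per_song_scores):
    """Mean and (population) standard deviation across songs."""
    return mean(per_song_scores), pstdev(per_song_scores)

# Toy example with two songs of three segments each.
preds = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.4]]
truths = [[0.2, 0.4, 0.6], [0.1, 0.3, 0.2]]
rho_mean, rho_std = summarize_over_songs(
    [pearson(p, t) for p, t in zip(preds, truths)])
print(round(rho_mean, 3), round(rho_std, 3))
```

A large standard deviation relative to the mean, as in the reported correlations, indicates that per-song correlation varies widely from song to song.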

4. CONCLUSION

We have presented the Beatsens team's solution to the MediaEval 2014 Emotion in Music task. The best result on valence estimation was obtained by Run4, and the best result on arousal estimation was obtained by Run1; both used CCRF modeling. Further work will be conducted on feature selection and optimization of CCRF.

REFERENCES

[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in music task at MediaEval 2014. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.

[2] T. Baltrusaitis, N. Banda, and P. Robinson. Dimensional affect recognition using continuous conditional random fields. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1-8. IEEE, 2013.

[3] Y. Song, S. Dixon, and M. Pearce. Evaluation of musical features for emotion classification. In ISMIR, pages 523-528, 2012.