<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MIRUtrecht participation in MediaEval 2013: Emotion in Music task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Aljanaki</string-name>
          <email>A.Aljanaki@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frans Wiering</string-name>
          <email>F.Wiering@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Remco C. Veltkamp</string-name>
          <email>R.C.Veltkamp@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Utrecht University</institution>
          ,
          <addr-line>Princetonplein 5, Utrecht 3584CC</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This working notes paper describes the system proposed by the MIRUtrecht team for static emotion recognition from audio (the Emotion in Music task) in the MediaEval 2013 evaluation campaign. We approach the problem with a scheme comprising data filtering, feature extraction, attribute selection and multivariate regression. The system is based on state-of-the-art research in the field and achieved a performance, in terms of R2 (i.e. the proportion of variance explained by the model), of 0.64 for arousal and 0.36 for valence.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1.1 Related Work</title>
      <p>
        A regression approach to modeling valence and arousal has
already been taken by many researchers (see the review by
Yang [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ]), with notable attempts by MacDorman et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (using
kernel ISOMAP or PCA for dimensionality reduction and
multiple linear regression for predictions) and Yang et al. [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ] (using PCA for correlation reduction, RReliefF for feature
selection, and Support Vector Regression for predictions). In [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ],
the prediction accuracy in terms of R2 reaches 58.3% for arousal
and 28.1% for valence.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Data Filtering</title>
      <p>
        In the dataset provided by MediaEval, the valence and
arousal dimensions appear to be highly correlated (Pearson’s r = 0.56, see
also Figure 1). This is not an unusual situation (in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
these dimensions correlate with Pearson’s r = 0.33, in [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ], r =
0.34). The upper left (angry) quadrant contains more data points
than the opposite lower right (calm) quadrant. When inspecting
individual data points in the angry quadrant, we discovered some
audio files containing speech or noise and decided to filter them
out. This was done after extracting features (as described in
section 2.2). An InterquartileRange filter in Weka [
        <xref ref-type="bibr" rid="ref5a">6</xref>
        ] was used to
detect those outliers, using both the extracted features and the
valence-arousal annotations. For each feature, an audio file x is
considered to be an outlier if its value lies below Q1 – k·IQR or
above Q3 + k·IQR, where k is the outlier factor of the filter, Q1 is
the first quartile, i.e. the middle number between the smallest
value and the median of the dataset, Q3 is the third quartile, i.e. the
middle number between the largest value and the median of the
dataset, and IQR = Q3 – Q1.
      </p>
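      <p>For illustration, the following is a minimal sketch of the interquartile-range outlier criterion described above, written in Python with NumPy rather than Weka; the outlier factor k and the feature_matrix variable are assumptions made for the example, not part of the original system.</p>
      <preformat><![CDATA[
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR.

    values : 1-D array with one feature (or annotation) value per song.
    k      : outlier factor; assumed here, Weka's filter has its own setting.
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical usage: one row per song, one column per feature or annotation.
feature_matrix = np.random.rand(700, 46)          # placeholder data
outlier_mask = np.zeros(feature_matrix.shape[0], dtype=bool)
for column in feature_matrix.T:                   # check every column
    outlier_mask |= iqr_outliers(column, k=3.0)
candidates = np.flatnonzero(outlier_mask)         # songs to inspect manually
]]></preformat>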
      <p>In total, 13 items flagged by the filter were deleted from the
dataset: files containing speech, noise or environmental sounds,
and 4 files containing contemporary classical music. Figure 1
shows a scatterplot of the dataset, with the outliers marked as red
crosses.</p>
      <p>This research is supported by the FES project COMMIT/.</p>
    </sec>
    <sec id="sec-3a">
      <title>2.2 Feature Extraction</title>
      <p>[Table 1. Extracted features and the tools used to compute them; of the original layout only the loudness (Psysound) and mode (Sonic Annotator) entries survive.]</p>
      <p>
As we were predicting the emotion of a long (45-second) audio
file, both the average values and the standard deviations of
the features were calculated, where applicable. From Psysound,
the dynamic loudness (using the loudness model of Chalupper
and Fastl) was employed. Sonic Annotator was used to extract an
alternative estimation of mode. In MIRToolbox, the mode of the
piece is calculated as a key strength difference between the best
major and the best minor key. In Sonic Annotator, modulation
boundaries are detected, a certain key is predicted for each
segment, and mode is estimated according to the amount of time
the music is in major or minor mode. In total we extracted 44
features.</p>
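      <p>As a sketch of the mode estimate obtained from Sonic Annotator as described above: the key detection itself is done by the tool, so the per-segment key labels and durations are assumed as input here; the function name and the [-1, 1] scaling are illustrative choices, not the original implementation.</p>
      <preformat><![CDATA[
def mode_from_segments(segments):
    """Estimate the overall mode from key-labelled segments.

    segments : list of (duration_seconds, is_major) pairs, one pair per
               segment between detected modulation boundaries (assumed input).
    Returns a value in [-1, 1]: positive means mostly major, negative minor.
    """
    total = sum(duration for duration, _ in segments)
    major = sum(duration for duration, is_major in segments if is_major)
    return (2.0 * major / total) - 1.0 if total else 0.0

# Hypothetical example: 30 s in C major followed by 15 s in A minor.
print(mode_from_segments([(30.0, True), (15.0, False)]))  # about 0.33, mostly major
]]></preformat>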
    </sec>
    <sec id="sec-4">
      <title>2.3 Feature Selection</title>
      <p>The features we extracted are not necessarily all of equal
importance to our task, and the feature set might contain
redundant data. To select important features, we applied the
ReliefF feature selection algorithm in Weka. Table 2 shows the
top 10 most important features for valence and for arousal
according to ReliefF, where the merit is the quality of the attribute,
estimated using the probability that the predicted values of two
neighbouring instances differ.</p>
      <p>As we can see, the most important features for both valence and
arousal are loudness, spectral flux (the average distance
between successive frames), roughness (the average dissonance
over all pairs of spectral peaks), and HCDF
(the harmonic change detection function, i.e. the flux of the tonal
centroid).</p>
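      <p>For reference, the following is a simplified sketch of the regression variant of the Relief estimator (RReliefF), which underlies the merit described above; it uses uniform neighbour weights and iterates over all instances, so the resulting numbers will not match Weka's implementation exactly, and the function name and parameters are illustrative.</p>
      <preformat><![CDATA[
import numpy as np

def rrelieff_merit(X, y, n_neighbors=10):
    """Simplified RReliefF attribute merits for a regression target.

    X : (n_samples, n_features) feature matrix; y : target values
    (e.g. arousal or valence annotations).
    """
    n, m = X.shape
    # Normalise so that feature and target differences fall into [0, 1].
    Xn = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)
    yn = (y - y.min()) / (np.ptp(y) + 1e-12)

    n_dc = 0.0                 # weight of "different prediction"
    n_da = np.zeros(m)         # weight of "different attribute value"
    n_dc_da = np.zeros(m)      # weight of both occurring together

    for i in range(n):
        dist = np.abs(Xn - Xn[i]).sum(axis=1)
        dist[i] = np.inf
        for j in np.argsort(dist)[:n_neighbors]:
            d_pred = abs(yn[i] - yn[j])
            d_attr = np.abs(Xn[i] - Xn[j])
            n_dc += d_pred
            n_da += d_attr
            n_dc_da += d_pred * d_attr

    total = n * n_neighbors
    return n_dc_da / n_dc - (n_da - n_dc_da) / (total - n_dc)

# Hypothetical usage: rank features by merit and keep the top-ranked ones.
# order = np.argsort(rrelieff_merit(X, y))[::-1]
]]></preformat>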
      <p>Trying to maximize the R2 value of the model predictions, we
selected the top 26 attributes for arousal and the top 27 for valence.</p>
      <p>[Table 2. Top 10 features for arousal and for valence according to ReliefF, in descending order of merit. Arousal: loudness, spectral flux, MFCC4, attack time, attack slope, brightness, MFCC9, roughness, key clarity. Valence: roughness, spectral flux, zero crossing rate, loudness, MFCC8, roughness (std), MFCC5, MFCC6, HCDF, brightness. The merit column and one arousal entry did not survive the original layout intact.]</p>
    </sec>
    <sec id="sec-5">
      <title>2.4 Model Fitting</title>
      <p>With the selected attributes, we modeled the data using multiple
regression, Support Vector Regression, M5Rules, Multilayer
Perceptron, and other regression techniques available in Weka,
and evaluated them on the training set with 10-fold cross-validation.
The two systems that performed best were submitted
for evaluation and are described below.</p>
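      <p>The models were fitted in Weka; as an illustration of the same protocol outside Weka, the sketch below runs multiple linear regression and Support Vector Regression under 10-fold cross-validation with scikit-learn (M5Rules and the Multilayer Perceptron configuration used in Weka have no direct equivalent here, so this is only an approximation of the experiment).</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import KFold, cross_val_score

def evaluate_models(X, y):
    """Return the mean 10-fold cross-validated R2 for each candidate model."""
    models = {
        "multiple regression": LinearRegression(),
        "SVR": SVR(kernel="rbf"),
    }
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    return {name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
            for name, model in models.items()}

# Hypothetical usage with the selected attributes for one dimension:
# scores = evaluate_models(X_arousal_selected, y_arousal)
]]></preformat>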
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS AND EVALUATION</title>
      <p>The submitted systems were evaluated on 300 test items. Table 3
shows the results of the runs for multiple regression and for
M5Rules, which are identical. Three metrics are provided: R2 is the
metric showing the goodness of fit of the model and is often
described as the proportion of variance explained by the model,
MAE is the Mean Absolute Error, and AE-STD is the standard
deviation of the absolute errors.</p>
      <p>[Table 3. Evaluation results of the two submitted runs (M5Rules and multiple regression; identical results), reporting R2, MAE and AE-STD for arousal and valence. R2 was 0.64 for arousal and 0.36 for valence; the MAE and AE-STD values did not survive the original layout.]</p>
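      <p>For clarity, the three reported metrics can be computed as in the sketch below; AE-STD is taken here to be the standard deviation of the absolute errors, following the description above, and the function name is illustrative.</p>
      <preformat><![CDATA[
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """R2, MAE and AE-STD for one dimension (arousal or valence)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    abs_err = np.abs(y_true - y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"R2": 1.0 - ss_res / ss_tot,        # proportion of variance explained
            "MAE": abs_err.mean(),              # mean absolute error
            "AE-STD": abs_err.std()}            # spread of the absolute errors
]]></preformat>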
      <p>
        From the evaluation results we can conclude that a technique as
simple as multiple regression performs as well as more
sophisticated models, achieving reasonably good performance
on a new dataset. The prediction accuracy for valence is, as one
would expect from other attempts to model it [
        <xref ref-type="bibr" rid="ref3 ref6 ref7">3,7,8</xref>
        ], lower than
that for arousal, though it is higher than in previous research,
which might be due to the high degree of correlation
between valence and arousal in this particular dataset.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Cabrera</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <year>1999</year>
          . PSYSOUND:
          <article-title>A computer program for psychoacoustical analysis</article-title>
          ,
          <source>in Proc. Australian Acoust. Soc. Conf.</source>
          ,
          <year>1999</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>54</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Lartillot</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toiviainen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>2007</year>
          .
          <article-title>A Matlab Toolbox for Musical Feature Extraction From Audio</article-title>
          , International Conference on Digital Audio Effects, Bordeaux,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>MacDorman</surname>
            ,
            <given-names>K. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ough</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>C.-C.</given-names>
          </string-name>
          ,
          <year>2007</year>
          .
          <article-title>Automatic emotion prediction of song excerpts: Index construction, algorithm design, and empirical comparison</article-title>
          .
          <source>J. New Music Res</source>
          .
          <volume>36</volume>
          ,
          <issue>4</issue>
          ,
          <fpage>281</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Soleymani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sha</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>1000 Songs for Emotional Analysis of Music</article-title>
          .
          <source>In Proceedings of the ACM Multimedia 2013 Workshop on Crowdsourcing for Multimedia. ACM</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Sonic</given-names>
            <surname>Annotator</surname>
          </string-name>
          . http://www.omras2.org/SonicAnnotator [6]
          <string-name>
            <surname>Weka</surname>
          </string-name>
          : Data mining software http://www.cs.waikato.ac.nz/ml/weka/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Yi-Hsuan</given-names>
          </string-name>
          and
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Homer H.</given-names>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Machine Recognition of Music Emotion: A Review</article-title>
          .
          <source>ACM Trans. Intell. Syst. Technol. 3</source>
          ,
          <issue>3</issue>
          , Article 40 (May
          <year>2012</year>
          ),
          <volume>30</volume>
          pages.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Yi-Hsuan</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Yu-Ching</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>Ya-Fan</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H. H.</given-names>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>A Regression Approach to Music Emotion Recognition</article-title>
          .
          <source>Trans. Audio, Speech and Lang. Proc. 16</source>
          ,
          <issue>2</issue>
          (
          <year>February 2008</year>
          ),
          <fpage>448</fpage>
          -
          <lpage>457</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>