MIRUtrecht Participation in MediaEval 2013: Emotion in Music Task
Anna Aljanaki, Frans Wiering, Remco C. Veltkamp
Utrecht University, Princetonplein 5, Utrecht 3584CC
{A.Aljanaki@uu.nl, F.Wiering@uu.nl, R.C.Veltkamp@uu.nl}

ABSTRACT
This working notes paper describes the system proposed by the MIRUtrecht team for static emotion recognition from audio (the Emotion in Music task) in the MediaEval 2013 evaluation contest. We approach the problem by proposing a scheme comprising data filtering, feature extraction, attribute selection and multivariate regression. The system is based on state-of-the-art research in the field and achieved a performance of 0.64 for arousal and 0.36 for valence (in terms of R2, i.e. the proportion of variance explained by the model).

1. INTRODUCTION
The objective of the static Emotion in Music task in the MediaEval 2013 evaluation contest is to predict emotion from musical audio. The training dataset consists of 700 music audio files of 45 seconds, belonging to eight different genres, which were annotated with the valence-arousal emotional model by Mechanical Turk workers. In this paper we describe a computational model built on the training set and evaluated on a test set, which consisted of 300 audio files annotated in the same way. More details concerning the dataset collection can be found in [4].
The valence-arousal model makes it possible to avoid verbalization
problems during data collection and is easily amenable to
computational modeling. Two possibilities exist for modeling
data using this model. The first possibility is to classify music
into one of four quadrants, which correspond to emotions of
(from the upper right clockwise) happiness, relaxation,
depression and anger. The second possibility is to build a
regression model separately for valence and arousal. The latter
approach is employed in this paper.
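Purely as an illustration of the first option (the submitted system uses the second, regression-based option), the following Python sketch maps a (valence, arousal) annotation to one of the four quadrant emotions, assuming both dimensions are centred at zero; it is not part of the original system.

```python
# Illustrative only: map a (valence, arousal) point to the quadrant emotions named above,
# assuming both dimensions are centred at 0. The paper instead regresses each dimension.

def quadrant_label(valence: float, arousal: float) -> str:
    """Return the quadrant emotion for a (valence, arousal) point."""
    if arousal >= 0:
        return "happiness" if valence >= 0 else "anger"
    return "relaxation" if valence >= 0 else "depression"

if __name__ == "__main__":
    print(quadrant_label(0.4, 0.7))    # upper right -> happiness
    print(quadrant_label(-0.4, 0.7))   # upper left  -> anger
    print(quadrant_label(0.4, -0.7))   # lower right -> relaxation
    print(quadrant_label(-0.4, -0.7))  # lower left  -> depression
```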

1.1 Related Work
A regression approach to modeling valence and arousal has already been undertaken by many researchers (see the review by Yang [7]), with notable attempts by MacDorman et al. [3] (using kernel Isomap or PCA for dimensionality reduction and multiple linear regression for predictions) and Yang et al. [8] (using PCA for correlation reduction, RReliefF for feature selection and Support Vector Regression for predictions). In [8], the prediction accuracy in terms of R2 reaches 58.3% for arousal and 28.1% for valence.

Copyright is held by the authors. MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.
This research is supported by the FES project COMMIT/.

2. SYSTEM DESCRIPTION

2.1 Data Filtering
In the dataset provided by MediaEval, the valence and arousal dimensions appear to be highly correlated (Pearson's r = 0.56, see also Figure 1). This is not an unusual situation (in [3] these dimensions correlate with Pearson's r = 0.33, and in [8] with r = 0.34). The upper left (angry) quadrant contains more data points than the opposite lower right (calm) quadrant. When looking at separate data points in the angry quadrant, we discovered some audio files containing speech or noise and decided to filter them out. This was done after extracting features (as described in Section 2.2). An InterquartileRange filter in Weka [6] was used to detect those outliers, using both the extracted features and the valence-arousal annotations. For each feature, an audio file x is considered to be an outlier if it satisfies the following criterion:

x > Q3 + 6*IQR  or  x < Q1 - 6*IQR,

where Q1 is the first quartile, i.e. the middle number between the smallest value and the median of the data set, Q3 is the third quartile, i.e. the middle number between the median and the largest value of the data set, and IQR = Q3 - Q1.

In total, 13 items were deleted from the dataset based on the suggestions of the filter, including, in addition to files containing speech, noise and environmental sounds, 4 files containing contemporary classical music. Figure 1 shows a scatterplot of the dataset, with outliers marked as red crosses.

Figure 1. Training dataset plotted on the valence-arousal plane. Each point is an audio file; red crosses are outliers.
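As a hedged illustration of this criterion (the authors used Weka's InterquartileRange filter; this is not their code), the NumPy sketch below applies the same rule with an extreme-value factor of 6 to a matrix whose rows are audio files and whose columns are features plus the two annotations. The matrix layout and variable names are assumptions made for the example.

```python
# Minimal sketch of the per-feature IQR outlier rule described above (factor 6),
# analogous in spirit to Weka's InterquartileRange filter, not a reimplementation of it.
import numpy as np

def iqr_outliers(X: np.ndarray, factor: float = 6.0) -> np.ndarray:
    """Return a boolean mask of rows that are outliers on at least one column."""
    q1 = np.percentile(X, 25, axis=0)            # first quartile per feature
    q3 = np.percentile(X, 75, axis=0)            # third quartile per feature
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return np.any((X < low) | (X > high), axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 46))               # assumed shape: 700 files, 44 features + valence/arousal
    X[3, 10] = 50.0                              # inject one extreme value
    print(np.where(iqr_outliers(X))[0])          # -> [3]
```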
2.2 Feature Extraction
We used three toolboxes to extract features, namely the MIRToolbox for Matlab [2], the PsySound module for Matlab [1] and the Queen Mary University Vamp plugins for Sonic Annotator [5]. Most of the features were extracted using MIRToolbox (see Table 1).

Table 1. Extracted Features
MIRToolbox: rms, attack time, attack slope, spectral features (centroid, brightness, spread, skewness, kurtosis, flux), tempo, rolloff85, rolloff95, entropy, flatness, roughness, mfcc1-13, zero crossing rate, low energy, key clarity, mode, HCDF, inharmonicity, irregularity
PsySound: loudness
Sonic Annotator: mode

As we were predicting the emotion of long (45-second) audio files, both the average values of the features and their standard deviations were calculated, where applicable. From PsySound, the dynamic loudness (using the loudness model of Chalupper and Fastl) was employed. Sonic Annotator was used to extract an alternative estimate of mode. In MIRToolbox, the mode of a piece is calculated as the key strength difference between the best major and the best minor key. In Sonic Annotator, modulation boundaries are detected, a key is predicted for each segment, and mode is estimated according to the amount of time the music spends in major or minor mode. In total we extracted 44 features.
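The aggregation step described above can be sketched as follows, assuming the frame-level feature trajectories of one 45-second clip have already been extracted (with MIRToolbox, PsySound or Sonic Annotator) into arrays. The dictionary layout and feature names are placeholders for the example, not the exact toolbox outputs.

```python
# Sketch of collapsing per-frame feature values into clip-level mean/std statistics,
# as described in the text. Frame extraction itself is assumed to have happened already.
import numpy as np

def pool_clip_features(frames: dict[str, np.ndarray]) -> dict[str, float]:
    """Summarise each frame-level feature trajectory by its mean and standard deviation."""
    pooled = {}
    for name, values in frames.items():
        pooled[f"{name}_mean"] = float(np.mean(values))
        pooled[f"{name}_std"] = float(np.std(values))
    return pooled

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Placeholder trajectories: roughly 900 frames for a 45-second clip (assumed frame rate).
    frames = {"rms": rng.random(900), "spectral_centroid": rng.random(900)}
    print(pool_clip_features(frames))
```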
2.3 Feature Selection
The features we extracted are not necessarily all of equal importance to our task, and the feature set might contain redundant data. To select important features, we applied the ReliefF feature selection algorithm in WEKA. Table 2 shows the ten most important features for arousal and for valence according to ReliefF, where merit is the quality of the attribute, estimated using the probability that the predicted values of two neighbouring instances differ.

Table 2. Feature importance
Rank   Arousal feature   Merit    Valence feature      Merit
1      loudness          0.016    roughness            0.011
2      spectral flux     0.013    spectral flux        0.08
3      HCDF              0.09     zero crossing rate   0.07
4      MFCC4             0.07     loudness             0.06
5      attack time       0.06     MFCC8                0.06
6      attack slope      0.06     std roughness        0.06
7      brightness        0.05     MFCC5                0.06
8      MFCC9             0.05     MFCC6                0.05
9      roughness         0.04     HCDF                 0.05
10     key clarity       0.04     brightness           0.05

As we can see, the most important features for both valence and arousal are loudness, spectral flux (the average distance between successive frames), roughness (the average dissonance over all possible pairs of spectral peaks) and HCDF (the harmonic change detection function, i.e. the flux of the tonal centroid).

Trying to maximize the R2 value of the model predictions, we selected the top 26 attributes for arousal and the top 27 for valence.
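The paper performs this step with ReliefF in Weka. Since scikit-learn does not ship ReliefF, the sketch below uses mutual_info_regression as an openly named stand-in (not the same algorithm) to illustrate the rank-and-keep-top-k pattern for a continuous target; the array shapes and feature names are assumptions.

```python
# Stand-in for the attribute-ranking step: score features against a continuous target
# and keep the top k. The paper uses ReliefF; mutual information is used here only
# because it is readily available in scikit-learn.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def top_k_features(X: np.ndarray, y: np.ndarray, names: list[str], k: int = 10) -> list[tuple[str, float]]:
    """Return the k highest-scoring (feature name, score) pairs for target y."""
    scores = mutual_info_regression(X, y, random_state=0)
    order = np.argsort(scores)[::-1][:k]
    return [(names[i], float(scores[i])) for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(687, 44))                        # 687 files after filtering, 44 features
    y = X[:, 0] * 0.8 + rng.normal(scale=0.3, size=687)   # toy arousal target driven by feature 0
    names = [f"f{i}" for i in range(44)]
    print(top_k_features(X, y, names, k=5))
```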
2.4 Model Fitting
With the selected attributes, we modeled the data using multiple regression, Support Vector Regression, M5Rules, Multilayer Perceptron and other regression techniques available in WEKA, and evaluated them on the training set with 10-fold cross-validation. The two systems that performed best were submitted for evaluation and are described below.

3. RESULTS AND EVALUATION
The submitted systems were evaluated on 300 test items. Table 3 shows the results of the runs for multiple regression and for M5Rules, which are equal. Three metrics are provided: R2 shows the goodness of fit of the model and is often described as the proportion of variance explained by the model, MAE is the mean absolute error, and AE-STD is its standard deviation.

Table 3. Evaluation (M5Rules & multiple regression)
Metric   Arousal   Valence
R2       0.64      0.36
MAE      0.08      0.10
AE-STD   0.06      0.07

From the evaluation results we can conclude that a technique as simple as multiple regression performs as well as more sophisticated models, achieving sufficiently good performance on a new dataset. The prediction accuracy for valence is, as one would expect from other attempts to model it [3, 7, 8], lower than that for arousal, though it is higher than in previous research, which might be the outcome of the high degree of correlation between valence and arousal in this particular dataset.
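A minimal sketch of the model comparison and of the reported metrics, assuming the pooled features and annotations are available as NumPy arrays: scikit-learn's LinearRegression and SVR stand in for two of the WEKA models (multiple regression and Support Vector Regression), and each is scored with 10-fold cross-validated R2 and MAE as in Sections 2.4 and 3. The data below are synthetic, not the MediaEval set.

```python
# Sketch of the fit-and-evaluate step: compare regressors with 10-fold cross-validation
# and report mean R2 and MAE for one target dimension (arousal or valence).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import cross_validate

def evaluate(model, X: np.ndarray, y: np.ndarray) -> dict[str, float]:
    """Return mean 10-fold cross-validated R2 and MAE for one model and one target."""
    cv = cross_validate(model, X, y, cv=10,
                        scoring=("r2", "neg_mean_absolute_error"))
    return {"R2": float(np.mean(cv["test_r2"])),
            "MAE": float(-np.mean(cv["test_neg_mean_absolute_error"]))}

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.normal(size=(687, 26))                         # e.g. the 26 attributes kept for arousal
    y = X @ rng.normal(size=26) * 0.1 + rng.normal(scale=0.05, size=687)
    for name, model in [("multiple regression", LinearRegression()), ("SVR", SVR())]:
        print(name, evaluate(model, X, y))
```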
4. REFERENCES
[1] Cabrera, D. 1999. PSYSOUND: A computer program for psychoacoustical analysis. In Proc. Australian Acoust. Soc. Conf., 47-54.
[2] Lartillot, O. and Toiviainen, P. 2007. A Matlab Toolbox for Musical Feature Extraction from Audio. In Proc. International Conference on Digital Audio Effects, Bordeaux, 2007.
[3] MacDorman, K. F., Ough, S., and Ho, C.-C. 2007. Automatic emotion prediction of song excerpts: Index construction, algorithm design, and empirical comparison. J. New Music Res. 36, 4, 281-299.
[4] Soleymani, M., Caro, M., Schmidt, E. M., Sha, C., and Yang, Y. 2013. 1000 Songs for Emotional Analysis of Music. In Proceedings of the ACM Multimedia 2013 Workshop on Crowdsourcing for Multimedia. ACM, 2013.
[5] Sonic Annotator. http://www.omras2.org/SonicAnnotator
[6] Weka: Data mining software. http://www.cs.waikato.ac.nz/ml/weka/
[7] Yang, Y.-H. and Chen, H. H. 2012. Machine Recognition of Music Emotion: A Review. ACM Trans. Intell. Syst. Technol. 3, 3, Article 40 (May 2012), 30 pages.
[8] Yang, Y.-H., Lin, Y.-C., Su, Y.-F., and Chen, H. H. 2008. A Regression Approach to Music Emotion Recognition. Trans. Audio, Speech and Lang. Proc. 16, 2 (February 2008), 448-457.