MIRUtrecht Participation in MediaEval 2013: Emotion in Music Task
Anna Aljanaki, Frans Wiering, Remco C. Veltkamp
Utrecht University, Princetonplein 5, Utrecht 3584CC
{A.Aljanaki@uu.nl, F.Wiering@uu.nl, R.C.Veltkamp@uu.nl}

ABSTRACT
This working notes paper describes the system proposed by the MIRUtrecht team for static emotion recognition from audio (the Emotion in Music task) in the MediaEval 2013 evaluation contest. We approach the problem by proposing a scheme comprising data filtering, feature extraction, attribute selection and multivariate regression. The system is based on state-of-the-art research in the field and achieved a performance of 0.64 for arousal and 0.36 for valence (in terms of R2, i.e. the proportion of variance explained by the model).

1. INTRODUCTION
The objective of the static Emotion in Music task in the MediaEval 2013 evaluation contest is to predict emotion from musical audio. The training dataset consists of 700 music audio files of 45 seconds, belonging to eight different genres, which were annotated with the valence-arousal emotional model by Mechanical Turk workers. In this paper we describe a computational model built on the training set and evaluated on a test set, which consisted of 300 audio files annotated in the same way. More details concerning the dataset collection can be found in [4].
The valence-arousal model makes it possible to avoid verbalization
problems during data collection and is easily amenable to
computational modeling. Two possibilities exist for modeling
data using this model. The first possibility is to classify music
into one of four quadrants, which correspond to emotions of
(from the upper right clockwise) happiness, relaxation,
depression and anger. The second possibility is to build a
regression model separately for valence and arousal. The latter
approach is employed in this paper.
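Purely as an illustration of the first option (the submitted system uses the second, regression-based option), the following Python sketch maps a (valence, arousal) annotation to one of the four quadrant emotions, assuming both dimensions are centred at zero; it is not part of the original system.

```python
# Illustrative only: map a (valence, arousal) point to the quadrant emotions named above,
# assuming both dimensions are centred at 0. The paper instead regresses each dimension.

def quadrant_label(valence: float, arousal: float) -> str:
    """Return the quadrant emotion for a (valence, arousal) point."""
    if arousal >= 0:
        return "happiness" if valence >= 0 else "anger"
    return "relaxation" if valence >= 0 else "depression"

if __name__ == "__main__":
    print(quadrant_label(0.4, 0.7))    # upper right -> happiness
    print(quadrant_label(-0.4, 0.7))   # upper left  -> anger
    print(quadrant_label(0.4, -0.7))   # lower right -> relaxation
    print(quadrant_label(-0.4, -0.7))  # lower left  -> depression
```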

1.1 Related Work
A regression approach to modeling valence and arousal has already been undertaken by many researchers (see the review by Yang [7]), with notable attempts by MacDorman et al. [3] (using kernel Isomap or PCA for dimensionality reduction and multiple linear regression for predictions) and Yang et al. [8] (using PCA for correlation reduction, RReliefF for feature selection and Support Vector Regression for predictions). In [8], the prediction accuracy in terms of R2 reaches 58.3% for arousal and 28.1% for valence.

Copyright is held by the authors. MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.
This research is supported by the FES project COMMIT/.

2. SYSTEM DESCRIPTION

2.1 Data Filtering
In the dataset provided by MediaEval, the valence and arousal dimensions appear to be highly correlated (Pearson's r = 0.56, see also Figure 1). This is not an unusual situation (in [3] these dimensions correlate with Pearson's r = 0.33, and in [8] with r = 0.34). The upper left (angry) quadrant contains more data points than the opposite lower right (calm) quadrant. When looking at separate data points in the angry quadrant, we discovered some audio files containing speech or noise and decided to filter them out. This was done after extracting features (as described in Section 2.2). An InterquartileRange filter in Weka [6] was used to detect those outliers, using both the extracted features and the valence-arousal annotations. For each feature, an audio file x is considered to be an outlier if it satisfies the following criterion:

x > Q3 + 6*IQR  or  x < Q1 - 6*IQR,

where Q1 is the first quartile, i.e. the middle number between the smallest value and the median of the data set, Q3 is the third quartile, i.e. the middle number between the median and the largest value of the data set, and IQR = Q3 - Q1.

In total, 13 items were deleted from the dataset based on the suggestions of the filter, including, in addition to files containing speech, noise and environmental sounds, 4 files containing contemporary classical music. Figure 1 shows a scatterplot of the dataset, with outliers marked as red crosses.

Figure 1. Training dataset plotted on the valence-arousal plane. Each point is an audio file; red crosses are outliers.
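As a hedged illustration of this criterion (the authors used Weka's InterquartileRange filter; this is not their code), the NumPy sketch below applies the same rule with an extreme-value factor of 6 to a matrix whose rows are audio files and whose columns are features plus the two annotations. The matrix layout and variable names are assumptions made for the example.

```python
# Minimal sketch of the per-feature IQR outlier rule described above (factor 6),
# analogous in spirit to Weka's InterquartileRange filter, not a reimplementation of it.
import numpy as np

def iqr_outliers(X: np.ndarray, factor: float = 6.0) -> np.ndarray:
    """Return a boolean mask of rows that are outliers on at least one column."""
    q1 = np.percentile(X, 25, axis=0)            # first quartile per feature
    q3 = np.percentile(X, 75, axis=0)            # third quartile per feature
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return np.any((X < low) | (X > high), axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 46))               # assumed shape: 700 files, 44 features + valence/arousal
    X[3, 10] = 50.0                              # inject one extreme value
    print(np.where(iqr_outliers(X))[0])          # -> [3]
```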
2.2 Feature Extraction
We used three toolboxes to extract features, namely the MIRToolbox for Matlab [2], the PsySound module for Matlab [1] and the Queen Mary University Vamp plugins for Sonic Annotator [5]. Most of the features were extracted using MIRToolbox (see Table 1).

Table 1. Extracted Features
MIRToolbox: rms, attack time, attack slope, spectral features (centroid, brightness, spread, skewness, kurtosis, flux), tempo, rolloff85, rolloff95, entropy, flatness, roughness, mfcc1-13, zero crossing rate, low energy, key clarity, mode, HCDF, inharmonicity, irregularity
PsySound: loudness
Sonic Annotator: mode

As we were predicting the emotion of long (45-second) audio files, both the average values of the features and their standard deviations were calculated, where applicable. From PsySound, the dynamic loudness (using the loudness model of Chalupper and Fastl) was employed. Sonic Annotator was used to extract an alternative estimate of mode. In MIRToolbox, the mode of a piece is calculated as the key strength difference between the best major and the best minor key. In Sonic Annotator, modulation boundaries are detected, a key is predicted for each segment, and mode is estimated according to the amount of time the music spends in major or minor mode. In total we extracted 44 features.
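The aggregation step described above can be sketched as follows, assuming the frame-level feature trajectories of one 45-second clip have already been extracted (with MIRToolbox, PsySound or Sonic Annotator) into arrays. The dictionary layout and feature names are placeholders for the example, not the exact toolbox outputs.

```python
# Sketch of collapsing per-frame feature values into clip-level mean/std statistics,
# as described in the text. Frame extraction itself is assumed to have happened already.
import numpy as np

def pool_clip_features(frames: dict[str, np.ndarray]) -> dict[str, float]:
    """Summarise each frame-level feature trajectory by its mean and standard deviation."""
    pooled = {}
    for name, values in frames.items():
        pooled[f"{name}_mean"] = float(np.mean(values))
        pooled[f"{name}_std"] = float(np.std(values))
    return pooled

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Placeholder trajectories: roughly 900 frames for a 45-second clip (assumed frame rate).
    frames = {"rms": rng.random(900), "spectral_centroid": rng.random(900)}
    print(pool_clip_features(frames))
```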
2.3 Feature Selection
The features we extracted are not necessarily all of equal importance to our task, and the feature set might contain redundant data. To select important features, we applied the ReliefF feature selection algorithm in WEKA. Table 2 shows the ten most important features for arousal and for valence according to ReliefF, where merit is the quality of the attribute, estimated using the probability that the predicted values of two neighbouring instances differ.

Table 2. Feature importance
Rank   Arousal feature   Merit    Valence feature      Merit
1      loudness          0.016    roughness            0.011
2      spectral flux     0.013    spectral flux        0.08
3      HCDF              0.09     zero crossing rate   0.07
4      MFCC4             0.07     loudness             0.06
5      attack time       0.06     MFCC8                0.06
6      attack slope      0.06     std roughness        0.06
7      brightness        0.05     MFCC5                0.06
8      MFCC9             0.05     MFCC6                0.05
9      roughness         0.04     HCDF                 0.05
10     key clarity       0.04     brightness           0.05

As we can see, the most important features for both valence and arousal are loudness, spectral flux (the average distance between successive frames), roughness (the average dissonance over all possible pairs of spectral peaks) and HCDF (the harmonic change detection function, i.e. the flux of the tonal centroid).

Trying to maximize the R2 value of the model predictions, we selected the top 26 attributes for arousal and the top 27 for valence.
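The paper performs this step with ReliefF in Weka. Since scikit-learn does not ship ReliefF, the sketch below uses mutual_info_regression as an openly named stand-in (not the same algorithm) to illustrate the rank-and-keep-top-k pattern for a continuous target; the array shapes and feature names are assumptions.

```python
# Stand-in for the attribute-ranking step: score features against a continuous target
# and keep the top k. The paper uses ReliefF; mutual information is used here only
# because it is readily available in scikit-learn.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def top_k_features(X: np.ndarray, y: np.ndarray, names: list[str], k: int = 10) -> list[tuple[str, float]]:
    """Return the k highest-scoring (feature name, score) pairs for target y."""
    scores = mutual_info_regression(X, y, random_state=0)
    order = np.argsort(scores)[::-1][:k]
    return [(names[i], float(scores[i])) for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(687, 44))                        # 687 files after filtering, 44 features
    y = X[:, 0] * 0.8 + rng.normal(scale=0.3, size=687)   # toy arousal target driven by feature 0
    names = [f"f{i}" for i in range(44)]
    print(top_k_features(X, y, names, k=5))
```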
2.4 Model Fitting
With the selected attributes, we modeled the data using multiple regression, Support Vector Regression, M5Rules, Multilayer Perceptron and other regression techniques available in WEKA, and evaluated them on the training set with 10-fold cross-validation. The two systems that performed best were submitted for evaluation and are described below.

3. RESULTS AND EVALUATION
The submitted systems were evaluated on 300 test items. Table 3 shows the results of the runs for multiple regression and for M5Rules, which are equal. Three metrics are provided: R2 shows the goodness of fit of the model and is often described as the proportion of variance explained by the model, MAE is the mean absolute error, and AE-STD is its standard deviation.

Table 3. Evaluation (M5Rules & multiple regression)
Metric   Arousal   Valence
R2       0.64      0.36
MAE      0.08      0.10
AE-STD   0.06      0.07

From the evaluation results we can conclude that a technique as simple as multiple regression performs as well as more sophisticated models, achieving sufficiently good performance on a new dataset. The prediction accuracy for valence is, as one would expect from other attempts to model it [3, 7, 8], lower than that for arousal, though it is higher than in previous research, which might be the outcome of the high degree of correlation between valence and arousal in this particular dataset.
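A minimal sketch of the model comparison and of the reported metrics, assuming the pooled features and annotations are available as NumPy arrays: scikit-learn's LinearRegression and SVR stand in for two of the WEKA models (multiple regression and Support Vector Regression), and each is scored with 10-fold cross-validated R2 and MAE as in Sections 2.4 and 3. The data below are synthetic, not the MediaEval set.

```python
# Sketch of the fit-and-evaluate step: compare regressors with 10-fold cross-validation
# and report mean R2 and MAE for one target dimension (arousal or valence).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import cross_validate

def evaluate(model, X: np.ndarray, y: np.ndarray) -> dict[str, float]:
    """Return mean 10-fold cross-validated R2 and MAE for one model and one target."""
    cv = cross_validate(model, X, y, cv=10,
                        scoring=("r2", "neg_mean_absolute_error"))
    return {"R2": float(np.mean(cv["test_r2"])),
            "MAE": float(-np.mean(cv["test_neg_mean_absolute_error"]))}

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.normal(size=(687, 26))                         # e.g. the 26 attributes kept for arousal
    y = X @ rng.normal(size=26) * 0.1 + rng.normal(scale=0.05, size=687)
    for name, model in [("multiple regression", LinearRegression()), ("SVR", SVR())]:
        print(name, evaluate(model, X, y))
```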
4. REFERENCES
[1] Cabrera, D. 1999. PSYSOUND: A computer program for psychoacoustical analysis. In Proc. Australian Acoust. Soc. Conf., 47-54.
[2] Lartillot, O. and Toiviainen, P. 2007. A Matlab Toolbox for Musical Feature Extraction from Audio. In Proc. International Conference on Digital Audio Effects, Bordeaux, 2007.
[3] MacDorman, K. F., Ough, S., and Ho, C.-C. 2007. Automatic emotion prediction of song excerpts: Index construction, algorithm design, and empirical comparison. J. New Music Res. 36, 4, 281-299.
[4] Soleymani, M., Caro, M., Schmidt, E. M., Sha, C., and Yang, Y. 2013. 1000 Songs for Emotional Analysis of Music. In Proceedings of the ACM Multimedia 2013 Workshop on Crowdsourcing for Multimedia. ACM, 2013.
[5] Sonic Annotator. http://www.omras2.org/SonicAnnotator
[6] Weka: Data mining software. http://www.cs.waikato.ac.nz/ml/weka/
[7] Yang, Y.-H. and Chen, H. H. 2012. Machine Recognition of Music Emotion: A Review. ACM Trans. Intell. Syst. Technol. 3, 3, Article 40 (May 2012), 30 pages.
[8] Yang, Y.-H., Lin, Y.-C., Su, Y.-F., and Chen, H. H. 2008. A Regression Approach to Music Emotion Recognition. Trans. Audio, Speech and Lang. Proc. 16, 2 (February 2008), 448-457.