<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MIRUtrecht participation in MediaEval 2013: Emotion in Music task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Aljanaki</string-name>
          <email>A.Aljanaki@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frans Wiering</string-name>
          <email>F.Wiering@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Remco C. Veltkamp</string-name>
          <email>R.C.Veltkamp@uu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Utrecht University</institution>
          ,
          <addr-line>Princetonplein 5, Utrecht 3584CC</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This working notes paper describes the system proposed by the MIRUtrecht team for static emotion recognition from audio (the Emotion in Music task) in the MediaEval 2013 evaluation campaign. We approach the problem with a scheme comprising data filtering, feature extraction, attribute selection and multivariate regression. The system is based on state-of-the-art research in the field and achieved a performance, in terms of R2 (i.e. the proportion of variance explained by the model), of 0.64 for arousal and 0.36 for valence.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1.1 Related Work</title>
      <p>
        A regression approach to modeling valence and arousal has
already been taken by many researchers (see the review by
Yang [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ]), with notable attempts by MacDorman et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (using
kernel ISOMAP or PCA for dimensionality reduction and
multiple linear regression for predictions) and Yang et al. [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ] (using PCA for correlation reduction, RReliefF for feature
selection, and Support Vector Regression for predictions). In [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ],
the prediction accuracy in terms of R2 reaches 58.3% for arousal
and 28.1% for valence.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Data Filtering</title>
      <p>
        In the dataset provided by MediaEval, the valence and
arousal dimensions appear to be highly correlated (Pearson’s r = 0.56, see
also Figure 1). This is not an unusual situation (in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
these dimensions correlate with Pearson’s r = 0.33, in [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ], r =
0.34). The upper left (angry) quadrant contains more data points
than the opposite lower right (calm) quadrant. When inspecting
individual data points in the angry quadrant, we discovered some
audio files containing speech or noise and decided to filter them
out. This was done after extracting features (as described in
section 2.2). An InterquartileRange filter in Weka [
        <xref ref-type="bibr" rid="ref5a">6</xref>
        ] was used to
detect those outliers, using both the extracted features and the
valence-arousal annotations. For each feature, an audio file x is
considered to be an outlier if its value lies below Q1 – k·IQR or
above Q3 + k·IQR, where k is the outlier factor of the filter, Q1 is
the first quartile, i.e. the middle number between the smallest
value and the median of the dataset, Q3 is the third quartile, i.e. the
middle number between the largest value and the median of the
dataset, and IQR = Q3 – Q1.
      </p>
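      <p>For illustration, the following is a minimal sketch of the interquartile-range outlier criterion described above, written in Python with NumPy rather than Weka; the outlier factor k and the feature_matrix variable are assumptions made for the example, not part of the original system.</p>
      <preformat><![CDATA[
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR.

    values : 1-D array with one feature (or annotation) value per song.
    k      : outlier factor; assumed here, Weka's filter has its own setting.
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical usage: one row per song, one column per feature or annotation.
feature_matrix = np.random.rand(700, 46)          # placeholder data
outlier_mask = np.zeros(feature_matrix.shape[0], dtype=bool)
for column in feature_matrix.T:                   # check every column
    outlier_mask |= iqr_outliers(column, k=3.0)
candidates = np.flatnonzero(outlier_mask)         # songs to inspect manually
]]></preformat>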
      <p>In total, 13 items flagged by the filter were deleted from the
dataset: files containing speech, noise or environmental sounds,
and 4 files containing contemporary classical music. Figure 1
shows a scatterplot of the dataset, with the outliers marked as red
crosses.</p>
      <p>This research is supported by the FES project COMMIT/.</p>
    </sec>
    <sec id="sec-3a">
      <title>2.2 Feature Extraction</title>
      <p>[Table 1. Extracted features and the tools used to compute them; of the original layout only the loudness (Psysound) and mode (Sonic Annotator) entries survive.]</p>
      <p>
As we were predicting the emotion of a long (45-second) audio
file, both the average values and the standard deviations of
the features were calculated, where applicable. From Psysound,
the dynamic loudness (using the loudness model of Chalupper
and Fastl) was employed. Sonic Annotator was used to extract an
alternative estimation of mode. In MIRToolbox, the mode of the
piece is calculated as a key strength difference between the best
major and the best minor key. In Sonic Annotator, modulation
boundaries are detected, a certain key is predicted for each
segment, and mode is estimated according to the amount of time
the music is in major or minor mode. In total we extracted 44
features.</p>
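      <p>As a sketch of the mode estimate obtained from Sonic Annotator as described above: the key detection itself is done by the tool, so the per-segment key labels and durations are assumed as input here; the function name and the [-1, 1] scaling are illustrative choices, not the original implementation.</p>
      <preformat><![CDATA[
def mode_from_segments(segments):
    """Estimate the overall mode from key-labelled segments.

    segments : list of (duration_seconds, is_major) pairs, one pair per
               segment between detected modulation boundaries (assumed input).
    Returns a value in [-1, 1]: positive means mostly major, negative minor.
    """
    total = sum(duration for duration, _ in segments)
    major = sum(duration for duration, is_major in segments if is_major)
    return (2.0 * major / total) - 1.0 if total else 0.0

# Hypothetical example: 30 s in C major followed by 15 s in A minor.
print(mode_from_segments([(30.0, True), (15.0, False)]))  # about 0.33, mostly major
]]></preformat>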
    </sec>
    <sec id="sec-4">
      <title>2.3 Feature Selection</title>
      <p>The features we extracted are not necessarily all of equal
importance to our task, and the feature set might contain
redundant data. To select important features, we applied the
ReliefF feature selection algorithm in Weka. Table 2 shows the
top 10 most important features for valence and for arousal
according to ReliefF, where the merit is the quality of the attribute,
estimated using the probability that the predicted values of two
neighbouring instances differ.</p>
      <p>As we can see, the most important features for both valence and
arousal are loudness, spectral flux (the average distance
between successive frames), roughness (the average dissonance
over all pairs of spectral peaks), and HCDF
(the harmonic change detection function, i.e. the flux of the tonal
centroid).</p>
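      <p>For reference, the following is a simplified sketch of the regression variant of the Relief estimator (RReliefF), which underlies the merit described above; it uses uniform neighbour weights and iterates over all instances, so the resulting numbers will not match Weka's implementation exactly, and the function name and parameters are illustrative.</p>
      <preformat><![CDATA[
import numpy as np

def rrelieff_merit(X, y, n_neighbors=10):
    """Simplified RReliefF attribute merits for a regression target.

    X : (n_samples, n_features) feature matrix; y : target values
    (e.g. arousal or valence annotations).
    """
    n, m = X.shape
    # Normalise so that feature and target differences fall into [0, 1].
    Xn = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)
    yn = (y - y.min()) / (np.ptp(y) + 1e-12)

    n_dc = 0.0                 # weight of "different prediction"
    n_da = np.zeros(m)         # weight of "different attribute value"
    n_dc_da = np.zeros(m)      # weight of both occurring together

    for i in range(n):
        dist = np.abs(Xn - Xn[i]).sum(axis=1)
        dist[i] = np.inf
        for j in np.argsort(dist)[:n_neighbors]:
            d_pred = abs(yn[i] - yn[j])
            d_attr = np.abs(Xn[i] - Xn[j])
            n_dc += d_pred
            n_da += d_attr
            n_dc_da += d_pred * d_attr

    total = n * n_neighbors
    return n_dc_da / n_dc - (n_da - n_dc_da) / (total - n_dc)

# Hypothetical usage: rank features by merit and keep the top-ranked ones.
# order = np.argsort(rrelieff_merit(X, y))[::-1]
]]></preformat>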
      <p>Trying to maximize the R2 value of the model predictions, we
selected the top 26 attributes for arousal and the top 27 for valence.</p>
      <p>[Table 2. Top 10 features for arousal and for valence according to ReliefF, in descending order of merit. Arousal: loudness, spectral flux, MFCC4, attack time, attack slope, brightness, MFCC9, roughness, key clarity. Valence: roughness, spectral flux, zero crossing rate, loudness, MFCC8, roughness (std), MFCC5, MFCC6, HCDF, brightness. The merit column and one arousal entry did not survive the original layout intact.]</p>
    </sec>
    <sec id="sec-5">
      <title>2.4 Model Fitting</title>
      <p>With the selected attributes, we modeled the data using multiple
regression, Support Vector Regression, M5Rules, Multilayer
Perceptron, and other regression techniques available in Weka,
and evaluated them on the training set with 10-fold cross-validation.
The two systems that performed best were submitted
for evaluation and are described below.</p>
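      <p>The models were fitted in Weka; as an illustration of the same protocol outside Weka, the sketch below runs multiple linear regression and Support Vector Regression under 10-fold cross-validation with scikit-learn (M5Rules and the Multilayer Perceptron configuration used in Weka have no direct equivalent here, so this is only an approximation of the experiment).</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import KFold, cross_val_score

def evaluate_models(X, y):
    """Return the mean 10-fold cross-validated R2 for each candidate model."""
    models = {
        "multiple regression": LinearRegression(),
        "SVR": SVR(kernel="rbf"),
    }
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    return {name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
            for name, model in models.items()}

# Hypothetical usage with the selected attributes for one dimension:
# scores = evaluate_models(X_arousal_selected, y_arousal)
]]></preformat>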
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS AND EVALUATION</title>
      <p>The submitted systems were evaluated on 300 test items. Table 3
shows the results of the runs for multiple regression and for
M5Rules, which are identical. Three metrics are provided: R2 is the
metric showing the goodness of fit of the model and is often
described as the proportion of variance explained by the model,
MAE is the Mean Absolute Error, and AE-STD is the standard
deviation of the absolute errors.</p>
      <p>[Table 3. Evaluation results of the two submitted runs (M5Rules and multiple regression; identical results), reporting R2, MAE and AE-STD for arousal and valence. R2 was 0.64 for arousal and 0.36 for valence; the MAE and AE-STD values did not survive the original layout.]</p>
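      <p>For clarity, the three reported metrics can be computed as in the sketch below; AE-STD is taken here to be the standard deviation of the absolute errors, following the description above, and the function name is illustrative.</p>
      <preformat><![CDATA[
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """R2, MAE and AE-STD for one dimension (arousal or valence)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    abs_err = np.abs(y_true - y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"R2": 1.0 - ss_res / ss_tot,        # proportion of variance explained
            "MAE": abs_err.mean(),              # mean absolute error
            "AE-STD": abs_err.std()}            # spread of the absolute errors
]]></preformat>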
      <p>
        From the evaluation results we can conclude that a technique as
simple as multiple regression performs as well as more
sophisticated models, achieving reasonably good performance
on a new dataset. The prediction accuracy for valence is, as one
would expect from other attempts to model it [
        <xref ref-type="bibr" rid="ref3 ref6 ref7">3,7,8</xref>
        ], lower than
that for arousal, though it is higher than in previous research,
which might be due to the high degree of correlation
between valence and arousal in this particular dataset.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Cabrera</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <year>1999</year>
          . PSYSOUND:
          <article-title>A computer program for psychoacoustical analysis</article-title>
          ,
          <source>in Proc. Australian Acoust. Soc. Conf.</source>
          ,
          <year>1999</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>54</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Lartillot</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toiviainen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>2007</year>
          .
          <article-title>A Matlab Toolbox for Musical Feature Extraction From Audio</article-title>
          , International Conference on Digital Audio Effects, Bordeaux,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>MacDorman</surname>
            ,
            <given-names>K. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ough</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>C.-C.</given-names>
          </string-name>
          ,
          <year>2007</year>
          .
          <article-title>Automatic emotion prediction of song excerpts: Index construction, algorithm design, and empirical comparison</article-title>
          .
          <source>J. New Music Res</source>
          .
          <volume>36</volume>
          ,
          <issue>4</issue>
          ,
          <fpage>281</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Soleymani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sha</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>1000 Songs for Emotional Analysis of Music</article-title>
          .
          <source>In Proceedings of the ACM Multimedia 2013 Workshop on Crowdsourcing for Multimedia. ACM</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Sonic</given-names>
            <surname>Annotator</surname>
          </string-name>
          . http://www.omras2.org/SonicAnnotator [6]
          <string-name>
            <surname>Weka</surname>
          </string-name>
          : Data mining software http://www.cs.waikato.ac.nz/ml/weka/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Yi-Hsuan</given-names>
          </string-name>
          and
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Homer H.</given-names>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Machine Recognition of Music Emotion: A Review</article-title>
          .
          <source>ACM Trans. Intell. Syst. Technol. 3</source>
          ,
          <issue>3</issue>
          , Article 40 (May
          <year>2012</year>
          ),
          <volume>30</volume>
          pages.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Yi-Hsuan</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Yu-Ching</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>Ya-Fan</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H. H.</given-names>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>A Regression Approach to Music Emotion Recognition</article-title>
          .
          <source>Trans. Audio, Speech and Lang. Proc. 16</source>
          ,
          <issue>2</issue>
          (
          <year>February 2008</year>
          ),
          <fpage>448</fpage>
          -
          <lpage>457</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>