<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Aljanaki</string-name>
          <email>a.aljanaki@uu.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Soleymani</string-name>
          <email>mohammad.soleymani@unige.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Dept., University of Geneva</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information and Computing Sciences, Utrecht University</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The Emotion in Music task is held for the third consecutive year at the MediaEval benchmarking campaign. The unceasing interest in the task shows that the music emotion recognition (MER) problem is truly important to the community, and that much about it remains to be discovered. Automatic MER methods could greatly improve the accessibility of music collections by providing quick and standardized means of music categorization and indexing. In the Emotion in Music task we provide a benchmark for automatic MER methods. This year, we concentrated on a single task, which proved to be the most challenging in the previous years: dynamic emotion characterization. We put special emphasis on providing high-quality ground truth data and maximizing inter-annotator agreement. As a consequence of meeting this higher quality demand, both the training and evaluation datasets are smaller than in the previous years. The dataset consists of Creative Commons-licensed music from the Free Music Archive, the medleyDB dataset, and Jamendo. This paper describes the dataset collection, annotations, and evaluation criteria of the task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Contemporary music listeners rely on online music services such as Spotify, iTunes or Soundcloud to access their favorite music. In order to make their collections accessible, music libraries need to classify music by genre, instrumentation, tempo and mood. Automatic solutions to the auto-tagging problem are invaluable because they make annotation fast, cheap and standardized. Emotion is one of the most important search criteria for music. Automatic music emotion recognition (MER) algorithms rely on ground truth for training. There are many ways in which such a ground truth can be generated [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], using different affective representations or different temporal granularities. Depending on the affective model or temporal resolution, the evaluation criteria can vary. These discrepancies make it very difficult to compare different methods. The Emotion in Music task is designed to develop a benchmark and an evaluation framework for such a comparison.
      </p>
      <p>
        The task is held for the third year in the MediaEval benchmarking campaign for multimedia evaluation (http://www.multimediaeval.org) [
        <xref ref-type="bibr" rid="ref1 ref14">1, 14</xref>
        ]. Building on our experience of the last two years, we concentrate on a single dynamic emotion characterization task and on offering high-quality ground truth.
      </p>
      <p>
        The only other current evaluation task for MER is the audio mood classification (AMC) task of the annual music information retrieval evaluation exchange (MIREX, http://www.music-ir.org/mirex/wiki/) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In this task, 600 audio files are provided to the participants, who have agreed not to distribute the files for commercial purposes. However, AMC has been criticized for using an emotional model that is not based on psychological research. Namely, this benchmark uses five discrete emotion clusters, derived from cluster analysis of online tags, instead of the more widely accepted dimensional or categorical models of emotion. It has been noted that there is semantic and acoustic overlap between the clusters [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Furthermore, the dataset provides only a single static rating per audio clip, which belies the time-varying nature of music. Since 2013, another set of 1,438 segments of 30 seconds clipped from Korean pop songs has been used in MIREX as well. However, the same five-class taxonomy is adopted for this Korean set.
      </p>
      <p>
        Since the first edition of the Emotion in Music task in 2013, we have opted for characterizing the per-second emotion of music as numerical values in two dimensions: valence (positive or negative emotions expressed in music) and arousal (energy of the music) (VA) [
        <xref ref-type="bibr" rid="ref13 ref17">13, 17</xref>
        ], making it easier to depict the temporal dynamics of emotion variation. The VA model has been widely adopted in affective research [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ,
        <xref ref-type="bibr" rid="ref6">6</xref>
        ,
        <xref ref-type="bibr" rid="ref9 ref10 ref11">9-11</xref>
        ,
        <xref ref-type="bibr" rid="ref15">15</xref>
        ,
        <xref ref-type="bibr" rid="ref18 ref19 ref20">18-20</xref>
        ]. However, the model is not free of criticism, and other alternatives may be considered in the future. For example, the VA model has been criticized for being too reductionist, and it has been argued that other dimensions such as dominance should be added [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Moreover, the terms 'valence' and 'arousal' may sometimes be too abstract for people to have a common understanding of their meaning. Such drawbacks of the VA model can further harm the inter-annotator agreement of annotations for a task that is already inherently fairly subjective.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. TASK DESCRIPTION</title>
      <p>This year we offer only one task: dynamic emotion characterization. However, in order to permit a thorough comparison between different methods, we require the participants to submit two different runs.</p>
      <p>In one run, the participants are required to submit their own features, and we use a baseline regression method (linear regression) to estimate dynamic affect. Any features automatically extracted from the audio or from the metadata provided by the organizers are allowed. In the second required run, all participants are required to use the baseline features that we provide (see Section 3 for details), so that their machine learning methods can be compared. Participants are also free to submit any combination of features and machine learning methods, up to a total of five runs.</p>
      <p>The participants will estimate the valence and arousal scores continuously in time for every segment (half a second long) on a scale from -1 to 1. The participants have to submit predictions of both valence and arousal, their feature set if different from the provided one, and their predictions when using the universal feature set. We will use the Root-Mean-Square Error (RMSE) as the primary evaluation measure. We will also report the Pearson correlation (r) between the prediction and the ground truth. We will rank the submissions based on the averaged RMSE. Whenever the difference based on the one-sided Wilcoxon test is not significant (p &gt; 0.05), we will use the averaged correlation coefficient to break the tie.</p>
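      <p>As an illustration of the evaluation protocol above (a minimal sketch, not the official MediaEval scoring script; the per-song arrays and function names are assumptions for the example), the per-song RMSE and Pearson r and the tie-breaking rule could be computed as follows in Python.</p>
      <preformat><![CDATA[
# Sketch of the evaluation protocol: per-song RMSE and Pearson r, averaged
# over songs, with the one-sided Wilcoxon test deciding whether the averaged
# correlation coefficient should break a tie between two runs.
import numpy as np
from scipy.stats import pearsonr, wilcoxon

def per_song_scores(predictions, ground_truth):
    """predictions, ground_truth: lists of aligned 1-D arrays (one per song,
    one value per 0.5 s segment). Returns arrays of per-song RMSE and r."""
    rmse, corr = [], []
    for pred, truth in zip(predictions, ground_truth):
        pred, truth = np.asarray(pred, float), np.asarray(truth, float)
        rmse.append(np.sqrt(np.mean((pred - truth) ** 2)))
        corr.append(pearsonr(pred, truth)[0])
    return np.array(rmse), np.array(corr)

def rank_two_runs(rmse_a, corr_a, rmse_b, corr_b, alpha=0.05):
    """Lower averaged RMSE wins; if the one-sided Wilcoxon test over the
    per-song RMSEs is not significant (p > alpha), the higher averaged
    Pearson r breaks the tie."""
    a_better = rmse_a.mean() <= rmse_b.mean()
    lower, higher = (rmse_a, rmse_b) if a_better else (rmse_b, rmse_a)
    p = wilcoxon(lower, higher, alternative='less').pvalue
    if p <= alpha:
        return 'A' if a_better else 'B'
    return 'A' if corr_a.mean() >= corr_b.mean() else 'B'
]]></preformat>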
    </sec>
    <sec id="sec-3">
      <title>3. DATASETS AND GROUND TRUTH</title>
      <p>
        Our datasets consist of royalty-free music from several
sources: freemusicarchive.org (FMA), jamendo.com, and
the medleyDB dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The development set consists of
431 clips of 45 seconds, which were selected from last year's
data based on inter-annotator agreement criteria. The test
set comprises 58 complete music pieces with an average
duration of 234 ± 105.7 seconds.
      </p>
      <p>
        The development set is a subset of clips from last year's data [
        <xref ref-type="bibr" rid="ref1 ref14">1, 14</xref>
        ], all of which are from FMA. The subset was selected according to the procedure described below (a code sketch follows the list):
1. We deleted the annotations whose Pearson correlation with the averaged annotation for the same song was below 0.1. If fewer than 5 annotators remained after the deletion, we discarded the song.
2. For the remaining songs and annotations, we calculated Cronbach's α. If it was greater than 0.6, the song was retained.
3. The mean (bias) of every dynamic annotation was shifted to match the averaged static annotation for the same song.
      </p>
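      <p>A minimal sketch of this selection procedure is given below (illustrative only; the thresholds follow the text above, while the array layout and function names are assumptions).</p>
      <preformat><![CDATA[
# Sketch of the development-set selection: drop annotators whose dynamic
# annotation correlates poorly with the song average, require at least five
# remaining annotators and Cronbach's alpha above 0.6, then shift each
# annotation so its mean matches the averaged static rating for the song.
import numpy as np

def cronbach_alpha(ratings):
    """ratings: (n_annotators, n_timepoints) array of dynamic annotations."""
    r = np.asarray(ratings, float)
    k = r.shape[0]
    item_var = r.var(axis=1, ddof=1).sum()
    total_var = r.sum(axis=0).var(ddof=1)
    return k / (k - 1) * (1.0 - item_var / total_var)

def select_song(ratings, static_mean, min_corr=0.1, min_raters=5, min_alpha=0.6):
    """Return the cleaned annotation matrix, or None if the song is discarded."""
    r = np.asarray(ratings, float)
    mean_curve = r.mean(axis=0)
    keep = np.array([np.corrcoef(row, mean_curve)[0, 1] >= min_corr for row in r])
    r = r[keep]
    if r.shape[0] < min_raters or cronbach_alpha(r) <= min_alpha:
        return None
    # replace each annotator's bias with the averaged static annotation
    return r - r.mean(axis=1, keepdims=True) + static_mean
]]></preformat>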
      <p>This procedure resulted in a reduction from 1,744 songs to 431 songs (the rest did not have sufficiently consistent annotations), each of which was annotated by 5-7 workers from Amazon Mechanical Turk (MTurk). Cronbach's α is 0.76 ± 0.12 for arousal and 0.73 ± 0.12 for valence.</p>
      <p>
        The evaluation set consists of 58 complete songs, one half from the medleyDB dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] of royalty-free multitrack recordings and the other half from the jamendo.com music website, which provides music under Creative Commons licenses. We selected songs with some emotional variation in them, from genres corresponding to the ones in the development set. We used the same annotation interface as in the previous two years: a slider that is continuously moved by an annotator while listening to music. The position of the slider indicates the magnitude of valence or arousal.
      </p>
      <sec id="sec-3-1">
        <title>Arousal</title>
        <p>RMSE
r</p>
      </sec>
      <sec id="sec-3-2">
        <title>Valence</title>
        <p>RMSE
r
op+enMSMLIRLE 0.27 0.11 0.36 0.26 0.37 0.18 0.01 0.38
Abavseerlaingee 0.28 0.13 { 0.29 0.14 {</p>
        <p>
          The evaluation data we collected this year is different in several respects. First, we opted for full-length songs to cover the whole affective variation. Second, we partially annotated the data in the laboratory. The evaluation set was annotated by 6 people: two onsite annotators and four conscientious MTurk workers, with 29% of the annotations done in the lab. This way, we can compare the agreement between the onsite workers and the crowdworkers. The annotators listened to the entire song before starting with the annotation, to get familiar with the music and to reduce the reaction time lag. The workers were only paid the full fee after their work was reviewed and appeared to be of high quality. Cronbach's α this year is 0.65 ± 0.28 for arousal and 0.29 ± 0.94 for valence. In comparison, Cronbach's α for two other existing datasets, MoodSwings [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and AMG1608 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], is 0.41 and 0.46 for arousal, and 0.25 and 0.31 for valence, respectively. Compared to these datasets, the consistency of our annotations has improved for arousal, but not for valence.
        </p>
        <p>
          There is a mismatch between the training and test sets in terms of the duration of the clips (45-second segments versus full songs) and the data sources (FMA versus medleyDB and Jamendo). In contrast, in both 2013 and 2014 the training and test sets were of the same length and both were drawn from FMA [
          <xref ref-type="bibr" rid="ref1 ref14">1, 14</xref>
          ].
        </p>
        <p>
          In order to enable comparison between different machine learning algorithms, we provide a baseline universal feature set, extracted with openSMILE [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], consisting of 260 low-level features (the mean and standard deviation of 65 low-level acoustic descriptors and of their first-order derivatives). In addition to the audio features, we also provide metadata covering the genre labels obtained from FMA and, for some of the songs, folksonomy tags crawled from last.fm.
        </p>
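      <p>To make the composition of the 260-dimensional baseline feature set concrete, the following sketch aggregates frame-level descriptors into one segment-level vector (illustrative; it assumes the openSMILE low-level descriptors have already been extracted into a frame-by-descriptor matrix).</p>
      <preformat><![CDATA[
# 65 low-level descriptors (LLDs) and their first-order derivatives,
# summarized by mean and standard deviation per segment: 65 x 2 x 2 = 260.
import numpy as np

def segment_features(lld_frames):
    """lld_frames: (n_frames, 65) LLDs for one 500 ms segment."""
    lld = np.asarray(lld_frames, float)
    delta = np.diff(lld, axis=0)                  # first-order derivatives
    parts = [lld.mean(axis=0), lld.std(axis=0),
             delta.mean(axis=0), delta.std(axis=0)]
    return np.concatenate(parts)                  # shape: (260,)
]]></preformat>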
    </sec>
    <sec id="sec-4">
      <title>4. BASELINE RESULTS</title>
      <p>
        For the baseline, we used the openSMILE toolbox [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to extract 260 features from non-overlapping segments of 500 ms, with a frame size of 60 ms and a 10 ms step. We used multiple linear regression (MLR), following the previous years. The results are shown in the first row of Table 1. Compared to last year (for arousal, r = 0.27 ± 0.12; for valence, r = 0.19 ± 0.11), the baseline is worse. We also calculated an average baseline by using the average of all the development set ground truth as the prediction for all the songs. In terms of RMSE, this average baseline performs better for valence and at the same level for arousal.
      </p>
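      <p>A minimal sketch of the multiple linear regression baseline and the average baseline described above (illustrative; the feature matrices and names are assumptions, and the official baseline may differ in implementation details):</p>
      <preformat><![CDATA[
# Multiple linear regression over 260-dimensional segment features, plus the
# "average" baseline that predicts the development-set mean for every segment.
import numpy as np
from sklearn.linear_model import LinearRegression

def rmse(pred, truth):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2)))

# X_train: (n_segments, 260) features; y_train: arousal or valence in [-1, 1]
def evaluate_baselines(X_train, y_train, X_test, y_test):
    mlr = LinearRegression().fit(X_train, y_train)
    mlr_rmse = rmse(mlr.predict(X_test), y_test)
    avg_pred = np.full_like(np.asarray(y_test, float), np.mean(y_train))
    avg_rmse = rmse(avg_pred, y_test)
    return mlr_rmse, avg_rmse
]]></preformat>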
    </sec>
    <sec id="sec-5">
      <title>5. CONCLUSIONS</title>
      <p>A task has been developed to analyze emotion in music. Annotations were collected using both onsite annotators and crowdsourcing workers. The quest for higher-quality labels has led to a lower number of training and evaluation samples.</p>
    </sec>
  </body>
  <back>
    <ack>
      <p>This research was supported in part by the Ambizione program of the Swiss National Science Foundation and by the FES project COMMIT/. We thank Alexander Lansky from Queen's University, Canada, and Yu-Hao Chin from National Central University, Taiwan, for assistance with the song selection and annotations.</p>
    </ack>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in music task at MediaEval 2014</article-title>
          . In MediaEval 2014 Workshop,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Barthet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fazekas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          .
          <article-title>Multidisciplinary perspectives on music emotion recognition: Implications for content and context-based models</article-title>
          .
          <source>In Int'l Symp. Computer Music Modelling &amp; Retrieval</source>
          , pages
          <fpage>492</fpage>
          -
          <lpage>507</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bittner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Salamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tierney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cannam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bello</surname>
          </string-name>
          .
          <article-title>MedleyDB: A multitrack dataset for annotation-intensive MIR research</article-title>
          .
          <source>In Proc. ISMIR</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.-A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>The AMG1608 dataset for music emotion recognition</article-title>
          .
          <source>In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.</source>
          , pages
          <fpage>693</fpage>
          -
          <lpage>697</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Collier</surname>
          </string-name>
          .
          <article-title>Beyond valence and activity in the emotional connotations of music</article-title>
          .
          <source>Psychology of Music</source>
          ,
          <volume>35</volume>
          (
          <issue>1</issue>
          ):
          <fpage>110</fpage>
          -
          <lpage>131</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Eerola</surname>
          </string-name>
          .
          <article-title>Modelling emotions in music: Advances in conceptual, contextual and validity issues</article-title>
          .
          <source>In AES International Conference</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Recent developments in openSMILE, the Munich Open-source Multimedia Feature Extractor</article-title>
          .
          <source>In Proceedings of ACM MM</source>
          , pages
          <fpage>835</fpage>
          -
          <lpage>838</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Downie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Laurier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bay</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Ehmann</surname>
          </string-name>
          .
          <article-title>The 2007 MIREX audio mood classification task: Lessons learned</article-title>
          .
          <source>In Proc. Int. Soc. Music Info. Retrieval Conf.</source>
          , pages
          <fpage>462</fpage>
          -
          <lpage>467</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Huq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bello</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rowe</surname>
          </string-name>
          .
          <article-title>Automated music emotion recognition: A systematic evaluation</article-title>
          .
          <source>Journal of New Music Research</source>
          ,
          <volume>39</volume>
          (
          <issue>3</issue>
          ):
          <fpage>227</fpage>
          -
          <lpage>244</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Migneco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Morton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Speck</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Turnbull</surname>
          </string-name>
          .
          <article-title>Music emotion recognition: A state of the art review</article-title>
          .
          <source>In Proc. Int. Soc. Music Info. Retrieval Conf</source>
          .,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Koelstra</surname>
          </string-name>
          , C. Muhl, M. Soleymani,
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yazdani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ebrahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nijholt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Patras</surname>
          </string-name>
          .
          <article-title>DEAP: A database for emotion analysis; using physiological signals</article-title>
          .
          <source>IEEE Trans. Affective Computing</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>18</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Laurier</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Herrera</surname>
          </string-name>
          .
          <article-title>Audio music mood classification using support vector machine</article-title>
          .
          <source>In MIREX Task on Audio Mood Classification</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Russell</surname>
          </string-name>
          .
          <article-title>A circumplex model of affect</article-title>
          .
          <source>J. Personality &amp; Social Psychology</source>
          ,
          <volume>39</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1161</fpage>
          -
          <lpage>1178</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Caro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Sha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>1000 songs for emotional analysis of music</article-title>
          .
          <source>In Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pun</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          .
          <article-title>Corpus development for affective video indexing</article-title>
          .
          <source>IEEE Trans. Multimedia</source>
          ,
          <volume>16</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1075</fpage>
          -
          <lpage>1089</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Speck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Morton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>A comparative study of collaborative vs. traditional musical mood annotation</article-title>
          .
          <source>In Proc. Int. Soc. Music Info. Retrieval Conf</source>
          .,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Thayer</surname>
          </string-name>
          .
          <source>The Biopsychology of Mood and Arousal</source>
          . Oxford University Press, New York,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-M.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-K.</given-names>
            <surname>Jeng</surname>
          </string-name>
          .
          <article-title>Modeling the affective content of music with a Gaussian mixture model</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>56</fpage>
          -
          <lpage>68</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ji</surname>
          </string-name>
          .
          <article-title>Video affective content analysis: a survey of state-of-the-art methods</article-title>
          .
          <source>IEEE Trans. Affective Computing</source>
          ,
          <source>PP(99):1</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.-H.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Machine recognition of music emotion: A review</article-title>
          .
          <source>ACM Trans. Intel. Systems &amp; Technology</source>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>