<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MediaEval 2015: Music Emotion Recognition based on Feed-Forward Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Braja Gopal Patra</string-name>
          <email>brajagopal.cse@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Promita Maitra</string-name>
          <email>promita.maitra@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dipankar Das</string-name>
          <email>dipankar.dipnil2005@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sivaji Bandyopadhyay</string-name>
          <email>sivaji_cse_ju@yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science &amp; Engineering, Jadavpur University Kolkata</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we describe our music emotion recognition system, JU_NLP, which predicts the dynamic valence and arousal values of a song continuously, from the 15-second mark to its end, at intervals of 0.5 seconds. We adopted feed-forward neural networks with 10 hidden neurons to build the regression models. We used a correlation-based method to select suitable features from among all the features provided by the organizers, and then applied the feed-forward neural networks to these features to predict the dynamic arousal and valence values.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The world has experienced rapid growth of the Internet over the
past ten years, which has also expedited the purchasing
and sharing of digital music on the Web. Such a large
collection of digital music needs automated processes for
organization, management, search, playlist generation, etc. People
are more interested in creating music libraries that allow them to
access songs according to their moods rather than by title,
artist, and/or genre [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. People are also interested in creating
music libraries based on several other psychological factors, for
example, which songs they like or dislike (and in what
circumstances), the time of day, their state of mind, etc. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Thus, the classification of music by emotion is considered
one of the most important tasks in the music industry.
      </p>
      <p>
        The Emotion in Music task at MediaEval addresses the problem
of automatically predicting the emotion of music in time frames of 0.5
seconds, since significant emotional changes can be observed
over the course of a full-length song. The organizers
provided annotated music clips for the Music Emotion
Retrieval (MER) task. The music clips were annotated via
crowdsourcing using Amazon’s Mechanical Turk (www.mturk.com/) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
They followed the dimensional representation of emotion
because it is easier to describe emotions by positioning the content
relative to a reference point [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The Valence-Arousal
(VA) representation was selected as the annotation scheme.
      </p>
    </sec>
    <sec id="sec-1b">
      <title>2. FEED-FORWARD NEURAL NETWORK AND CORRELATION</title>
      <p>
        Feed-forward neural networks (also called back-propagation
networks or multilayer perceptrons) are among the most widely used
models in several major application areas. Figure 1 illustrates a
one-hidden-layer feed-forward neural network with inputs x1,
x2, ..., xn and output ỹ. Each arrow in the figure symbolizes a
parameter in the network. The network is divided into multiple
layers, namely the input layer, hidden layer, and output layer. The
input layer consists of just the inputs to the network. It is followed
by a hidden layer, which consists of any number of neurons, or
hidden units, placed in parallel. Each neuron performs a weighted
summation of its inputs, which is then passed through a nonlinear
activation function σ, also called the neuron function.
      </p>
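      <p>The forward pass just described (weighted summation followed by a nonlinear activation) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation; the weights, input, and sizes here are placeholder values.</p>

```python
import math
import random

random.seed(0)

n_inputs, n_hidden = 4, 10   # 10 hidden neurons, as in the paper's models

# Randomly initialised parameters (placeholder values, for illustration only).
W1 = [[random.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
W2 = [random.gauss(0, 1) for _ in range(n_hidden)]
b2 = 0.0

def forward(x):
    """One-hidden-layer feed-forward pass: weighted sums, then nonlinearity."""
    # Each hidden neuron: weighted summation of the inputs, passed through
    # the activation function sigma (here tanh).
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Linear output unit, suitable for regression on valence/arousal scores.
    return sum(w * hi for w, hi in zip(W2, h)) + b2

y_hat = forward([0.1, -0.4, 0.7, 0.2])
```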
      <p>On the other hand, correlation is used to reduce the feature
dimensionality. If we treat all the features and the class in a uniform
manner, the feature-class correlations and feature-feature
inter-correlations may be calculated as follows.</p>
      <p>Merit<sub>s</sub> = k · r<sub>cf</sub> / √(k + k(k − 1) · r<sub>ff</sub>)     (1)</p>
      <p>
        where Merit<sub>s</sub> is the heuristic merit of a feature subset s
containing k features, r<sub>cf</sub> is the average
feature–class correlation, and r<sub>ff</sub> is the average
feature–feature inter-correlation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        From the above equation, we can estimate how predictive
one attribute is of another. A collection of instances is
considered pure if each instance has the same value for a
second attribute; the collection is impure
(to some degree) if the instances differ with respect to the value of that
attribute. To calculate the merit of a feature subset using the
above equation, the feature-feature inter-correlations (the ability of
one feature to predict another, and vice versa) must be measured as
well [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
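      <p>As a sanity check on the merit heuristic above, a tiny Python sketch (our own illustration; the correlation values are hypothetical):</p>

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """Heuristic merit of a k-feature subset (Hall, 1999):
    k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# A subset whose features correlate well with the class but little with each
# other scores higher than a redundant (highly inter-correlated) subset.
good = cfs_merit(k=10, r_cf=0.5, r_ff=0.1)
bad = cfs_merit(k=10, r_cf=0.5, r_ff=0.9)
assert good > bad
```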
    </sec>
    <sec id="sec-2">
      <title>3. APPROACH</title>
      <p>Subtask 1: In this subtask, a fixed feature set was
provided by the organizers, and we had to implement models
of our choice to identify the valence and arousal of the clips
in 0.5-second time intervals. In this work, we employed
two feed-forward neural network based regression models
to map the feature values to the arousal and valence scores.
Both feed-forward neural networks use the same set of
feature values but are trained on the arousal or
valence scores, respectively. Each feed-forward neural network has
10 neurons in its hidden layer. We divided the
whole training set into 5 parts to reduce the computation time, i.e.,
we trained our system on around 5000 instances at a time.
We then tuned our system using one portion for training and
another portion for testing, and calculated the Mean Square Error
(MSE) for each of the training sets. Finally, we tested the whole
test dataset using the five trained modules and obtained five sets of results.
These five sets of results were combined using an average and an inverse
weighted average technique. In the average technique, we simply
took the average of all five results, whereas the inverse
weighted average is calculated as in the equation below:</p>
      <p>Output<sub>weighted</sub> = Σ<sub>i</sub> (O<sub>i</sub> / E<sub>i</sub>) / Σ<sub>i</sub> (1 / E<sub>i</sub>)     (2)</p>
      <p>where O<sub>i</sub> is the ith output and E<sub>i</sub> is the MSE of the ith module.
From the equation, we can see that we give the least priority to the
result derived from the module having the maximum
MSE.</p>
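      <p>The two combination schemes can be sketched as follows (a minimal illustration, not the authors' code; the module outputs and MSEs are hypothetical values):</p>

```python
def inverse_weighted_average(outputs, mses):
    """Combine module outputs weighted by 1/MSE (equation 2), so the module
    with the largest MSE contributes the least."""
    numerator = sum(o / e for o, e in zip(outputs, mses))
    denominator = sum(1.0 / e for e in mses)
    return numerator / denominator

outputs = [0.30, 0.28, 0.35, 0.31, 0.29]   # hypothetical arousal predictions
mses = [0.02, 0.05, 0.01, 0.03, 0.04]      # hypothetical per-module MSEs

simple = sum(outputs) / len(outputs)            # plain average of the 5 results
weighted = inverse_weighted_average(outputs, mses)
```

With equal MSEs the weighted combination reduces to the plain average, which is a quick way to check the formula.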
      <p>Finally, the Root-Mean-Square Error (RMSE) is used to
evaluate the MER systems. We also report the Pearson’s
correlation (r) between the predictions and the ground truth. The final
RMSE and ‘r’ for the above two systems (the baseline
feature set with our model, combined by average and by weighted average)
are given in Table 1.</p>
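      <p>Both evaluation measures are standard; for concreteness, minimal Python definitions (our own sketch, applied to hypothetical prediction values):</p>

```python
import math

def rmse(pred, truth):
    """Root-mean-square error between predictions and ground truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))

def pearson_r(pred, truth):
    """Pearson's correlation coefficient r."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in truth))
    return cov / (sp * st)

pred = [0.10, 0.22, 0.41, 0.48]    # hypothetical dynamic arousal predictions
truth = [0.15, 0.25, 0.35, 0.55]   # hypothetical ground-truth annotations
error = rmse(pred, truth)
r = pearson_r(pred, truth)
```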
      <p>Subtask 2: In this subtask, a fixed regression model was
provided by the organizers, and the MER systems were to be developed using
features of our choice. From the literature, we
found that most of the important features were already provided by the
organizers as baseline features. We therefore focused on identifying the
important features rather than finding extra ones, and
used the correlation-based feature reduction technique of
equation 1 to reduce the baseline feature set given by the
organizers.</p>
      <p>Using this correlation formula, we found 70 and 114 important
features for arousal and valence, respectively.
Later, we applied a more restricted correlation and found 24 and 70
important features for arousal and valence,
respectively. We also selected 28 important features for valence.</p>
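      <p>One simple way to realise such a correlation-based reduction is to rank the features by the absolute value of their correlation with the target and keep the top k. The sketch below illustrates the general idea only, not the exact procedure or thresholds used here; the feature columns are hypothetical.</p>

```python
import math

def corr(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(columns, target, k):
    """Indices of the k features most correlated (in absolute value) with the target."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: abs(corr(columns[j], target)),
                    reverse=True)
    return ranked[:k]

columns = [
    [0.5, 0.1, 0.9, 0.2],  # hypothetical feature, weakly related to the target
    [1.0, 2.0, 3.0, 4.0],  # feature identical to the target
    [1.0, 1.0, 2.0, 1.0],  # another weakly related feature
]
target = [1.0, 2.0, 3.0, 4.0]
best = select_features(columns, target, k=2)  # the identical feature ranks first
```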
      <p>Subtask 3: In this subtask, we applied the feed-forward
neural network to the features derived using correlation in order
to build the MER system. We built five systems for the different sets
of arousal and valence features described in Subtask 2. The RMSE
and ‘r’ values for these five models are shown in Table 1.</p>
    </sec>
    <sec id="sec-3">
      <title>4. CONCLUSION</title>
      <p>We used feed-forward neural networks to develop a
regression-based system that finds the dynamic arousal and valence values
for analyzing emotion in music. The correlation method was
used to reduce the feature dimensionality in order to find suitable
features for both arousal and valence. The best model yields
minimum RMSEs of 0.2622 and 0.2913 for arousal and valence,
respectively, using the 70 best features, while its ‘r’ value for
arousal was high compared to the other arousal systems. In the
future, we want to explore deep neural networks for
music emotion recognition.</p>
    </sec>
    <sec id="sec-4">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>The first author is supported by Visvesvaraya Ph.D.
Fellowship funded by Department of Electronics and Information
Technology (DeitY), Government of India. The authors are also
thankful to the organizers A. Aljanaki, Y. Yang and M. Soleymani
for their support and help.
</p>
      <p>[Table 1: RMSE and Pearson’s r for arousal and valence, for the systems BaF + OM (average), BaF + OM (weighted average), OF(24) + OM, OF(70) + OM, OF(114) + OM, and OF(28) + OM. The numeric cell values are garbled in the source and are not recoverable.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>B. G.</given-names> <surname>Patra</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Das</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>Bandyopadhyay</surname></string-name>.
          <article-title>Unsupervised Approach to Hindi Music Mood Classification</article-title>.
          In Mining Intelligence and Knowledge Exploration, R. Prasath and T. Kathirvalavakumar (Eds.):
          <source>LNAI 8284</source>
          , pp.
          <fpage>62</fpage>
          -
          <lpage>69</lpage>
          , Springer International Publishing,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>B. G.</given-names> <surname>Patra</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Das</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>Bandyopadhyay</surname></string-name>.
          <year>2013</year>
          .
          <article-title>Automatic Music Mood Classification of Hindi Songs</article-title>
          .
          <source>In Proceedings of the 3rd Workshop on Sentiment Analysis where AI meets Psychology (SAAIP-2013)</source>
          , Nagoya, Japan, pp.
          <fpage>24</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Caro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>1000 Songs for Emotional Analysis of Music</article-title>
          .
          <source>In Proceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          , ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Duncan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Fox</surname>
          </string-name>
          .
          <article-title>Computer-aided music distribution: The future of selection, retrieval and transmission</article-title>
          ,
          <source>First Monday</source>
          ,
          <volume>10</volume>
          (
          <issue>4</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Rumelhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Learning representations by back-propagating errors</article-title>
          .
          <source>Cognitive modeling</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in Music Task at MediaEval 2015</article-title>
          . In MediaEval 2015 Workshop, September 14-15,
          <year>2015</year>
          , Wurzen, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <source>Mathematica Neural Networks: Train and Analyze Neural Networks to Fit Your Data</source>
          , Wolfram Research Inc., First Edition, September
          <year>2005</year>
          , Champaign, Illinois, USA.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hall</surname>
          </string-name>
          .
          <article-title>Correlation-based feature selection for machine learning</article-title>
          .
          <source>PhD dissertation</source>
          , The University of Waikato,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>