MediaEval 2015: Music Emotion Recognition based on Feed-Forward Neural Network

Braja Gopal Patra, Promita Maitra, Dipankar Das and Sivaji Bandyopadhyay
Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
{brajagopal.cse, promita.maitra, dipankar.dipnil2005}@gmail.com, sivaji_cse_ju@yahoo.com

ABSTRACT
In this paper, we describe our music emotion recognition system, named JU_NLP, which predicts the dynamic valence and arousal values of a song continuously, from the 15th second of the song to its end, at intervals of 0.5 seconds. We adopted feed-forward neural networks with 10 hidden neurons to build the regression models. We used a correlation-based method to select suitable features from the full feature set provided by the organizers, and then applied the feed-forward neural networks to these features to predict the dynamic arousal and valence values.

1. INTRODUCTION
The internet has grown rapidly all over the world over the past ten years, and this growth has also accelerated the purchase and sharing of digital music on the Web. Such a large collection of digital music needs automated processes for organization, management, search, playlist generation, etc. People are more interested in building music libraries that let them access songs according to their moods rather than by title, artist and/or genre [1, 4]. They are also interested in organizing music libraries around other psychological factors, for example which songs they like or dislike (and in what circumstances), the time of day, their state of mind, etc. [4]. Thus, emotion-based classification of music is considered one of the most important tasks in the music industry.

The Emotion in Music Task at MediaEval addresses the problem of automatic emotion prediction of music in time frames of 0.5 seconds, since significant emotional changes can be observed over the course of a full-length song. The organizers provided annotated music clips for the Music Emotion Recognition (MER) task. The music clips were annotated via crowdsourcing using Amazon's Mechanical Turk (MTurk, www.mturk.com) [6]. The organizers adopted a dimensional representation of emotion because it is easier to describe emotions by positioning them relative to a reference point [3]. The Valence-Arousal (V-A) representation was selected as the annotation scheme.

2. FEED-FORWARD NEURAL NETWORK AND CORRELATION
Feed-forward neural networks (also called back-propagation networks or multilayer perceptrons) are among the most widely used models in several major application areas. Figure 1 illustrates a one-hidden-layer feed-forward neural network with inputs x1, x2, ..., xn and output ỹ. Each arrow in the figure symbolizes a parameter of the network. The network is divided into multiple layers, namely the input layer, the hidden layer and the output layer. The input layer consists of just the inputs to the network. It is followed by a hidden layer, which consists of any number of neurons, or hidden units, placed in parallel. Each neuron performs a weighted summation of the inputs, which is then passed through a nonlinear activation function σ, also called the neuron function.

Figure 1: A feed-forward neural network with one hidden layer and one output.

Mathematically, the functionality of a hidden neuron is described by $\sigma\left(\sum_{j=1}^{n} w_j x_j + b_j\right)$, where the weights $\{w_j, b_j\}$ are symbolized by the arrows feeding into the neuron. The network output is formed by another weighted summation of the outputs of the neurons in the hidden layer [7]; this summation on the output is called the output layer. The gradient descent learning principle is used to update the weights as the errors are back-propagated through each layer by the well-known back-propagation algorithm [5].
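To make the above description concrete, the sketch below implements a one-hidden-layer feed-forward regression network of the kind shown in Figure 1, trained by gradient descent with back-propagated errors. It is a minimal illustration only: the tanh activation, the NumPy implementation, the learning rate and the weight initialization are our own assumptions, not details reported in the paper.

```python
import numpy as np

def sigma(z):
    """Nonlinear activation (neuron function); tanh is one common choice."""
    return np.tanh(z)

def sigma_prime(z):
    """Derivative of the activation, needed for back-propagation."""
    return 1.0 - np.tanh(z) ** 2

class OneHiddenLayerNet:
    """Feed-forward network: inputs -> hidden layer of n_hidden neurons -> one linear output."""

    def __init__(self, n_inputs, n_hidden=10, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, size=(n_hidden, n_inputs))  # weights w_j into each hidden neuron
        self.b1 = np.zeros(n_hidden)                              # hidden biases
        self.W2 = rng.normal(0, 0.1, size=n_hidden)               # output-layer weights
        self.b2 = 0.0

    def forward(self, x):
        z = self.W1 @ x + self.b1        # weighted summation of the inputs
        h = sigma(z)                     # nonlinear activation of each hidden neuron
        y_hat = self.W2 @ h + self.b2    # weighted summation of hidden outputs (output layer)
        return z, h, y_hat

    def train_step(self, x, y, lr=0.01):
        """One gradient-descent update with back-propagated errors (squared-error loss)."""
        z, h, y_hat = self.forward(x)
        err = y_hat - y
        # Back-propagate the output error through the output and hidden layers.
        grad_W2 = err * h
        grad_b2 = err
        delta_hidden = err * self.W2 * sigma_prime(z)
        grad_W1 = np.outer(delta_hidden, x)
        grad_b1 = delta_hidden
        # Gradient-descent weight updates.
        self.W2 -= lr * grad_W2
        self.b2 -= lr * grad_b2
        self.W1 -= lr * grad_W1
        self.b1 -= lr * grad_b1
        return 0.5 * err ** 2
```

In the setting described in Section 3, two such networks with 10 hidden neurons each would be trained on the same feature vectors, one against the arousal annotations and one against the valence annotations.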
Correlation, on the other hand, is used to reduce the feature dimension. If we treat all the features and the class in a uniform manner, the feature-class correlation and the feature-feature inter-correlations can be combined as follows:

$$\mathit{merit}_s = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (1)$$

where $\mathit{merit}_s$ is the heuristic merit of a feature subset $s$ containing $k$ features, $\overline{r_{cf}}$ is the average feature-class correlation, and $\overline{r_{ff}}$ is the average feature-feature inter-correlation [8].

From the above equation, we can calculate how predictive one attribute is with respect to another. A collection of instances is considered pure if all instances agree on the value of a second attribute; the collection is impure (to some degree) if the instances differ with respect to the value of that second attribute. To calculate the merit of a feature subset using the above equation, the feature-feature inter-correlations (the ability of one feature to predict another, and vice versa) must be measured as well [8].
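The merit score of Equation (1) can be computed directly from pairwise correlations. The sketch below illustrates this, together with a simple greedy forward search over feature subsets; the use of absolute Pearson correlations and the greedy search strategy are our assumptions for illustration and are not specified in the paper.

```python
import numpy as np

def subset_merit(X, y, subset):
    """Merit of a feature subset per Equation (1):
    merit_s = k * mean|r_cf| / sqrt(k + k*(k-1) * mean|r_ff|).

    X      : (n_instances, n_features) feature matrix
    y      : (n_instances,) target (e.g., arousal or valence score)
    subset : indices of the k features in the candidate subset
    """
    k = len(subset)
    # Average feature-class correlation (absolute Pearson correlation with the target).
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    # Average feature-feature inter-correlation over all pairs in the subset.
    if k > 1:
        pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y, max_features):
    """Greedy forward selection: repeatedly add the feature that most improves the merit."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        scored = [(subset_merit(X, y, selected + [j]), j) for j in remaining]
        best_merit, best_j = max(scored)
        if selected and best_merit <= subset_merit(X, y, selected):
            break  # no remaining candidate improves the current subset
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

Hall [8] pairs this merit score with a heuristic subset search; the greedy loop above is one of the simplest such searches.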
3. APPROACH
Subtask 1: In this subtask, a fixed feature set was provided by the organizers, and we had to implement models of our choice to identify the valence and arousal of the clips in 0.5-second time intervals. In this work, we employed two feed-forward neural network based regression models to map the feature values to arousal and valence scores. Both networks use the same set of feature values but are trained on the arousal and valence scores, respectively. Each of the feed-forward neural networks has 10 neurons in its hidden layer. We divided the whole training set into 5 parts to reduce the computation time, i.e. we trained our system on around 5000 instances at a time. We then tuned our system using a single portion for training and the other portions for testing, and calculated the Mean Square Error (MSE) for each of the training sets. Finally, we tested the whole test dataset using the five trained modules and obtained five sets of results. These five sets of results are combined using an average and an inverse weighted average technique. In the average technique, we simply take the average of all five results, whereas the inverse weighted average is calculated as

$$\mathit{Output}_{weighted} = \frac{\sum_{i} y_i/\delta_i}{\sum_{i} 1/\delta_i} \qquad (2)$$

where $y_i$ is the $i$th output and $\delta_i$ is the MSE of the $i$th module. From the equation, we can see that less priority is given to the result derived from the module with the largest MSE.

Finally, the Root-Mean-Square Error (RMSE) is used to evaluate the MER systems. We also report the Pearson correlation (r) between the predictions and the ground truth. The final RMSE and r values for these two systems (the baseline-feature systems with our model, using the average and the weighted average, respectively) are given in Table 1.
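Equation (2) amounts to weighting each module's prediction by the inverse of its training MSE. A minimal sketch of both combination schemes follows; the function names and the illustrative numbers at the end are ours and do not correspond to the paper's data.

```python
import numpy as np

def combine_average(predictions):
    """Plain average of the outputs y_i of the trained modules."""
    return np.mean(predictions, axis=0)

def combine_inverse_weighted(predictions, mse):
    """Inverse weighted average per Equation (2):
    output = sum_i(y_i / delta_i) / sum_i(1 / delta_i),
    where delta_i is the MSE of the i-th module, so modules with a larger
    training MSE contribute less to the combined prediction.
    """
    predictions = np.asarray(predictions, dtype=float)  # shape: (n_modules, n_time_steps)
    delta = np.asarray(mse, dtype=float)                # shape: (n_modules,)
    weights = 1.0 / delta
    return (weights[:, None] * predictions).sum(axis=0) / weights.sum()

# Illustrative example: five modules predicting arousal for three 0.5-second windows.
preds = [[0.31, 0.28, 0.35],
         [0.29, 0.30, 0.33],
         [0.40, 0.37, 0.41],
         [0.30, 0.27, 0.36],
         [0.33, 0.31, 0.34]]
module_mse = [0.020, 0.025, 0.060, 0.022, 0.030]
print(combine_average(preds))
print(combine_inverse_weighted(preds, module_mse))
```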
Subtask 2: In this subtask, a fixed regression model was provided by the organizers, and the MER systems were to be developed using features of our choice. From the literature, we found that most of the important features were already provided by the organizers as baseline features, so we focused on identifying the most important of these features rather than adding new ones. Thus, we used the correlation-based feature reduction technique of Equation (1) to reduce the baseline feature set given by the organizers. Using this correlation formula, we found 70 and 114 important features for arousal and valence, respectively. Later, applying a more restricted correlation, we found 24 and 70 important features for arousal and valence, respectively. We also selected 28 important features for valence.

Subtask 3: In this subtask, we applied the feed-forward neural network to the features derived using correlation in order to build the MER system. We built five systems for the different sets of arousal and valence features described in Subtask 2. The RMSE and r values for these five models are shown in Table 1.

Table 1: Regression results with different models (BaF: Baseline Feature, OM: Our Model, OF: Our Feature, X: Not Available)

                              Arousal                              Valence
Model                         RMSE    Range    r        Range      RMSE    Range    r        Range
BaF+OM (Average)              0.2689  ±0.1073  0.4678   ±0.2307    0.3538  ±0.1612  -0.0082  ±0.3671
BaF+OM (Weighted Average)     0.2702  ±0.1062  0.4671   ±0.2282    0.3646  ±0.1627  -0.0074  ±0.3543
OF (24) + OM                  0.2829  ±0.1011  0.2787   ±0.2531    X       X        X        X
OF (70) + OM                  0.2622  ±0.0899  0.3929   ±0.2489    0.2913  ±0.1452  -0.0037  ±0.0281
OF (114) + OM                 X       X        X        X          0.3799  ±0.1666  -0.0376  ±0.3312
OF (28) + OM                  X       X        X        X          0.4300  ±0.1801  -0.0180  ±0.3113

4. CONCLUSION
We used feed-forward neural networks to develop a regression-based system that predicts the dynamic arousal and valence values for analyzing emotion in music. The correlation method is used to reduce the feature dimension in order to find suitable features for both arousal and valence. The best model yields a minimum RMSE of 0.2622 and 0.2913 for arousal and valence, respectively, using the 70 best features, although the r value for arousal was higher for the baseline-feature systems. In future work, we want to explore deep neural networks for music emotion recognition.

5. ACKNOWLEDGMENTS
The first author is supported by a Visvesvaraya Ph.D. Fellowship funded by the Department of Electronics and Information Technology (DeitY), Government of India. The authors are also thankful to the organizers, A. Aljanaki, Y. Yang and M. Soleymani, for their support and help.

6. REFERENCES
[1] B. G. Patra, D. Das, and S. Bandyopadhyay. Unsupervised Approach to Hindi Music Mood Classification. In Mining Intelligence and Knowledge Exploration, R. Prasath and T. Kathirvalavakumar (Eds.), LNAI 8284, pp. 62-69. Springer International Publishing, 2013.
[2] B. G. Patra, D. Das, and S. Bandyopadhyay. Automatic Music Mood Classification of Hindi Songs. In Proceedings of the 3rd Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2013), Nagoya, Japan, pp. 24-28, 2013.
[3] M. Soleymani, M. N. Caro, E. M. Schmidt, C. Sha, and Y. Yang. 1000 Songs for Emotional Analysis of Music. In Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia, pp. 1-6. ACM, 2013.
[4] N. Duncan and M. Fox. Computer-aided music distribution: The future of selection, retrieval and transmission. First Monday, 10(4), 2005.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3), 1988.
[6] A. Aljanaki, Y. Yang, and M. Soleymani. Emotion in Music Task at MediaEval 2015. In MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.
[7] Mathematica Neural Networks: Train and Analyze Neural Networks to Fit Your Data. Wolfram Research, Inc., First Edition, September 2005, Champaign, Illinois, USA.
[8] M. A. Hall. Correlation-based Feature Selection for Machine Learning. PhD dissertation, The University of Waikato, 1999.