MediaEval 2015: Music Emotion Recognition based on Feed-Forward Neural Network

Braja Gopal Patra, Promita Maitra, Dipankar Das and Sivaji Bandyopadhyay
Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
{brajagopal.cse, promita.maitra, dipankar.dipnil2005}@gmail.com, sivaji_cse_ju@yahoo.com

ABSTRACT
In this paper, we describe our music emotion recognition system, named JU_NLP, which predicts the dynamic valence and arousal values of a song continuously, from the 15th second of the song to its end, at intervals of 0.5 seconds. We adopted feed-forward neural networks with 10 hidden neurons to build the regression models. We used a correlation-based method to select suitable features from the full feature set provided by the organizers, and then applied the feed-forward neural networks to these features to predict the dynamic arousal and valence values.

1. INTRODUCTION
The internet has grown rapidly all over the world over the past ten years, and this growth has also accelerated the purchase and sharing of digital music on the Web. Such a large collection of digital music needs automated processes for organization, management, search, playlist generation, etc. People are more interested in building music libraries that let them access songs according to their moods rather than by title, artist and/or genre [1, 4]. They are also interested in organizing music libraries around other psychological factors, for example which songs they like or dislike (and in what circumstances), the time of day, their state of mind, etc. [4]. Thus, emotion-based classification of music is considered one of the most important tasks in the music industry.

The Emotion in Music Task at MediaEval addresses the problem of automatic emotion prediction of music in time frames of 0.5 seconds, since significant emotional changes can be observed over the course of a full-length song. The organizers provided annotated music clips for the Music Emotion Recognition (MER) task. The music clips were annotated via crowdsourcing using Amazon's Mechanical Turk (MTurk, www.mturk.com) [6]. The organizers adopted a dimensional representation of emotion because it is easier to describe emotions by positioning them relative to a reference point [3]. The Valence-Arousal (V-A) representation was selected as the annotation scheme.

2. FEED-FORWARD NEURAL NETWORK AND CORRELATION
Feed-forward neural networks (also called back-propagation networks or multilayer perceptrons) are among the most widely used models in several major application areas. Figure 1 illustrates a one-hidden-layer feed-forward neural network with inputs x1, x2, ..., xn and output ỹ. Each arrow in the figure symbolizes a parameter of the network. The network is divided into multiple layers, namely the input layer, the hidden layer and the output layer. The input layer consists of just the inputs to the network. It is followed by a hidden layer, which consists of any number of neurons, or hidden units, placed in parallel. Each neuron performs a weighted summation of the inputs, which is then passed through a nonlinear activation function σ, also called the neuron function.

Figure 1: A feed-forward neural network with one hidden layer and one output.

Mathematically, the functionality of a hidden neuron is described by $\sigma\left(\sum_{j=1}^{n} w_j x_j + b_j\right)$, where the weights $\{w_j, b_j\}$ are symbolized by the arrows feeding into the neuron. The network output is formed by another weighted summation of the outputs of the neurons in the hidden layer [7]; this summation on the output is called the output layer. The gradient descent learning principle is used to update the weights as the errors are back-propagated through each layer by the well-known back-propagation algorithm [5].
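To make the above description concrete, the sketch below implements a one-hidden-layer feed-forward regression network of the kind shown in Figure 1, trained by gradient descent with back-propagated errors. It is a minimal illustration only: the tanh activation, the NumPy implementation, the learning rate and the weight initialization are our own assumptions, not details reported in the paper.

```python
import numpy as np

def sigma(z):
    """Nonlinear activation (neuron function); tanh is one common choice."""
    return np.tanh(z)

def sigma_prime(z):
    """Derivative of the activation, needed for back-propagation."""
    return 1.0 - np.tanh(z) ** 2

class OneHiddenLayerNet:
    """Feed-forward network: inputs -> hidden layer of n_hidden neurons -> one linear output."""

    def __init__(self, n_inputs, n_hidden=10, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, size=(n_hidden, n_inputs))  # weights w_j into each hidden neuron
        self.b1 = np.zeros(n_hidden)                              # hidden biases
        self.W2 = rng.normal(0, 0.1, size=n_hidden)               # output-layer weights
        self.b2 = 0.0

    def forward(self, x):
        z = self.W1 @ x + self.b1        # weighted summation of the inputs
        h = sigma(z)                     # nonlinear activation of each hidden neuron
        y_hat = self.W2 @ h + self.b2    # weighted summation of hidden outputs (output layer)
        return z, h, y_hat

    def train_step(self, x, y, lr=0.01):
        """One gradient-descent update with back-propagated errors (squared-error loss)."""
        z, h, y_hat = self.forward(x)
        err = y_hat - y
        # Back-propagate the output error through the output and hidden layers.
        grad_W2 = err * h
        grad_b2 = err
        delta_hidden = err * self.W2 * sigma_prime(z)
        grad_W1 = np.outer(delta_hidden, x)
        grad_b1 = delta_hidden
        # Gradient-descent weight updates.
        self.W2 -= lr * grad_W2
        self.b2 -= lr * grad_b2
        self.W1 -= lr * grad_W1
        self.b1 -= lr * grad_b1
        return 0.5 * err ** 2
```

In the setting described in Section 3, two such networks with 10 hidden neurons each would be trained on the same feature vectors, one against the arousal annotations and one against the valence annotations.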
Correlation, on the other hand, is used to reduce the feature dimension. If we treat all the features and the class in a uniform manner, the feature-class correlation and the feature-feature inter-correlations can be combined as follows:

$$\mathit{merit}_s = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (1)$$

where $\mathit{merit}_s$ is the heuristic merit of a feature subset $s$ containing $k$ features, $\overline{r_{cf}}$ is the average feature-class correlation, and $\overline{r_{ff}}$ is the average feature-feature inter-correlation [8].

From the above equation, we can calculate how predictive one attribute is with respect to another. A collection of instances is considered pure if all instances agree on the value of a second attribute; the collection is impure (to some degree) if the instances differ with respect to the value of that second attribute. To calculate the merit of a feature subset using the above equation, the feature-feature inter-correlations (the ability of one feature to predict another, and vice versa) must be measured as well [8].
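The merit score of Equation (1) can be computed directly from pairwise correlations. The sketch below illustrates this, together with a simple greedy forward search over feature subsets; the use of absolute Pearson correlations and the greedy search strategy are our assumptions for illustration and are not specified in the paper.

```python
import numpy as np

def subset_merit(X, y, subset):
    """Merit of a feature subset per Equation (1):
    merit_s = k * mean|r_cf| / sqrt(k + k*(k-1) * mean|r_ff|).

    X      : (n_instances, n_features) feature matrix
    y      : (n_instances,) target (e.g., arousal or valence score)
    subset : indices of the k features in the candidate subset
    """
    k = len(subset)
    # Average feature-class correlation (absolute Pearson correlation with the target).
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    # Average feature-feature inter-correlation over all pairs in the subset.
    if k > 1:
        pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y, max_features):
    """Greedy forward selection: repeatedly add the feature that most improves the merit."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        scored = [(subset_merit(X, y, selected + [j]), j) for j in remaining]
        best_merit, best_j = max(scored)
        if selected and best_merit <= subset_merit(X, y, selected):
            break  # no remaining candidate improves the current subset
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

Hall [8] pairs this merit score with a heuristic subset search; the greedy loop above is one of the simplest such searches.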
3. APPROACH
Subtask 1: In this subtask, a fixed feature set was provided by the organizers, and we had to implement models of our choice to identify the valence and arousal of the clips in 0.5-second time intervals. In this work, we employed two feed-forward neural network based regression models to map the feature values to arousal and valence scores. Both networks use the same set of feature values but are trained on the arousal and valence scores, respectively. Each of the feed-forward neural networks has 10 neurons in its hidden layer. We divided the whole training set into 5 parts to reduce the computation time, i.e. we trained our system on around 5000 instances at a time. We then tuned our system using a single portion for training and the other portions for testing, and calculated the Mean Square Error (MSE) for each of the training sets. Finally, we tested the whole test dataset using the five trained modules and obtained five sets of results. These five sets of results are combined using an average and an inverse weighted average technique. In the average technique, we simply take the average of all five results, whereas the inverse weighted average is calculated as

$$\mathit{Output}_{weighted} = \frac{\sum_{i} y_i/\delta_i}{\sum_{i} 1/\delta_i} \qquad (2)$$

where $y_i$ is the $i$th output and $\delta_i$ is the MSE of the $i$th module. From the equation, we can see that less priority is given to the result derived from the module with the largest MSE.

Finally, the Root-Mean-Square Error (RMSE) is used to evaluate the MER systems. We also report the Pearson correlation (r) between the predictions and the ground truth. The final RMSE and r values for these two systems (the baseline-feature systems with our model, using the average and the weighted average, respectively) are given in Table 1.
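Equation (2) amounts to weighting each module's prediction by the inverse of its training MSE. A minimal sketch of both combination schemes follows; the function names and the illustrative numbers at the end are ours and do not correspond to the paper's data.

```python
import numpy as np

def combine_average(predictions):
    """Plain average of the outputs y_i of the trained modules."""
    return np.mean(predictions, axis=0)

def combine_inverse_weighted(predictions, mse):
    """Inverse weighted average per Equation (2):
    output = sum_i(y_i / delta_i) / sum_i(1 / delta_i),
    where delta_i is the MSE of the i-th module, so modules with a larger
    training MSE contribute less to the combined prediction.
    """
    predictions = np.asarray(predictions, dtype=float)  # shape: (n_modules, n_time_steps)
    delta = np.asarray(mse, dtype=float)                # shape: (n_modules,)
    weights = 1.0 / delta
    return (weights[:, None] * predictions).sum(axis=0) / weights.sum()

# Illustrative example: five modules predicting arousal for three 0.5-second windows.
preds = [[0.31, 0.28, 0.35],
         [0.29, 0.30, 0.33],
         [0.40, 0.37, 0.41],
         [0.30, 0.27, 0.36],
         [0.33, 0.31, 0.34]]
module_mse = [0.020, 0.025, 0.060, 0.022, 0.030]
print(combine_average(preds))
print(combine_inverse_weighted(preds, module_mse))
```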
Subtask 2: In this subtask, a fixed regression model was provided by the organizers, and the MER systems were to be developed using features of our choice. From the literature, we found that most of the important features were already provided by the organizers as baseline features, so we focused on identifying the most important of these features rather than adding new ones. Thus, we used the correlation-based feature reduction technique of Equation (1) to reduce the baseline feature set given by the organizers. Using this correlation formula, we found 70 and 114 important features for arousal and valence, respectively. Later, applying a more restricted correlation, we found 24 and 70 important features for arousal and valence, respectively. We also selected 28 important features for valence.

Subtask 3: In this subtask, we applied the feed-forward neural network to the features derived using correlation in order to build the MER system. We built five systems for the different sets of arousal and valence features described in Subtask 2. The RMSE and r values for these five models are shown in Table 1.

Table 1: Regression results with different models (BaF: Baseline Feature, OM: Our Model, OF: Our Feature, X: Not Available)

                              Arousal                              Valence
Model                         RMSE    Range    r        Range      RMSE    Range    r        Range
BaF+OM (Average)              0.2689  ±0.1073  0.4678   ±0.2307    0.3538  ±0.1612  -0.0082  ±0.3671
BaF+OM (Weighted Average)     0.2702  ±0.1062  0.4671   ±0.2282    0.3646  ±0.1627  -0.0074  ±0.3543
OF (24) + OM                  0.2829  ±0.1011  0.2787   ±0.2531    X       X        X        X
OF (70) + OM                  0.2622  ±0.0899  0.3929   ±0.2489    0.2913  ±0.1452  -0.0037  ±0.0281
OF (114) + OM                 X       X        X        X          0.3799  ±0.1666  -0.0376  ±0.3312
OF (28) + OM                  X       X        X        X          0.4300  ±0.1801  -0.0180  ±0.3113

4. CONCLUSION
We used feed-forward neural networks to develop a regression-based system that predicts the dynamic arousal and valence values for analyzing emotion in music. The correlation method is used to reduce the feature dimension in order to find suitable features for both arousal and valence. The best model yields a minimum RMSE of 0.2622 and 0.2913 for arousal and valence, respectively, using the 70 best features, although the r value for arousal was higher for the baseline-feature systems. In future work, we want to explore deep neural networks for music emotion recognition.

5. ACKNOWLEDGMENTS
The first author is supported by a Visvesvaraya Ph.D. Fellowship funded by the Department of Electronics and Information Technology (DeitY), Government of India. The authors are also thankful to the organizers, A. Aljanaki, Y. Yang and M. Soleymani, for their support and help.

6. REFERENCES
[1] B. G. Patra, D. Das, and S. Bandyopadhyay. Unsupervised Approach to Hindi Music Mood Classification. In Mining Intelligence and Knowledge Exploration, R. Prasath and T. Kathirvalavakumar (Eds.), LNAI 8284, pp. 62-69. Springer International Publishing, 2013.
[2] B. G. Patra, D. Das, and S. Bandyopadhyay. Automatic Music Mood Classification of Hindi Songs. In Proceedings of the 3rd Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2013), Nagoya, Japan, pp. 24-28, 2013.
[3] M. Soleymani, M. N. Caro, E. M. Schmidt, C. Sha, and Y. Yang. 1000 Songs for Emotional Analysis of Music. In Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia, pp. 1-6. ACM, 2013.
[4] N. Duncan and M. Fox. Computer-aided music distribution: The future of selection, retrieval and transmission. First Monday, 10(4), 2005.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3), 1988.
[6] A. Aljanaki, Y. Yang, and M. Soleymani. Emotion in Music Task at MediaEval 2015. In MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.
[7] Mathematica Neural Networks: Train and Analyze Neural Networks to Fit Your Data. Wolfram Research, Inc., First Edition, September 2005, Champaign, Illinois, USA.
[8] M. A. Hall. Correlation-based Feature Selection for Machine Learning. PhD dissertation, The University of Waikato, 1999.