=Paper=
{{Paper
|id=Vol-2600/paper18
|storemode=property
|title=Novel Approach to Music Genre Classification using Clustering Augmented Learning Method (CALM)
|pdfUrl=https://ceur-ws.org/Vol-2600/paper18.pdf
|volume=Vol-2600
|authors=Soumya Suvra Ghosal,Indranil Sarkar
|dblpUrl=https://dblp.org/rec/conf/aaaiss/GhosalS20
}}
==Novel Approach to Music Genre Classification using Clustering Augmented Learning Method (CALM)==
Soumya Suvra Ghosal and Indranil Sarkar
National Institute of Technology Durgapur, India
{soumyasuvraghosal@gmail.com, indranil.sarkar.nitdgp@gmail.com}

===Abstract===
This paper proposes an automatic music genre classification system using a deep learning model. The proposed model leverages Convolutional Neural Networks (CNNs) to extract local features and an LSTM Sequence-to-Sequence Autoencoder to learn representations of time series data that take their temporal dynamics into account. The paper also introduces the Clustering Augmented Learning Method (CALM) classifier, which is based on the concept of simultaneous heterogeneous clustering and classification and learns deep representations of the features obtained from the LSTM autoencoder. Computational experiments on the GTZAN dataset resulted in an overall test accuracy of 95.4% with a precision of 91.87%.

===Introduction===
With the increasing amount of music available online, there is a growing demand for the systematic organization of audio files, which has increased interest in music classification. Detecting groups of music of a similar genre is the core task of recommendation systems and playlist generators. Building a robust music classifier using machine learning techniques is therefore essential to automate the tagging of unlabeled music and to improve the user experience of media players and music libraries. In recent years, convolutional neural networks (CNNs) have brought revolutionary changes to the computer vision community. Meanwhile, CNNs have been widely used for music information retrieval, especially music genre classification. Recently, it has become increasingly popular to combine CNNs with recurrent neural networks (RNNs) to process audio signals, which introduces time-sequential information into the model. In convolutional recurrent networks (C-RNNs), the CNN component extracts features while the RNN summarizes temporal structure. The inputs of C-RNNs are soundtrack spectrograms and the outputs are probabilities of each genre at each timestep. Inspired by previous literature, we propose to leverage this idea by augmenting an LSTM autoencoder with a CNN and using a clustering-based classifier to predict the genre of music.

===Previous Works===
Music genre classification has been actively studied since the early days of music information retrieval. Tzanetakis and Cook [Tzanetakis and Cook2002] used a k-nearest neighbor classifier and Gaussian Mixture Models with a comprehensive set of features for music classification. Those features can be summarized into three categories: rhythm, pitch, and temporal structure. Zhouyu Fu [Fu et al.2010] proposed a Naive Bayes (NB) classifier framework, namely NB Nearest Neighbor (NBNN) and NB Support Vector Machine (NBSVM), for music genre classification. [Deshpande and Singh2001] compared k-nearest neighbors, Gaussian Mixtures, and SVMs to classify music into three genres: rock, piano, and jazz. In recent years, using audio spectrograms has become mainstream for music genre classification. Spectrograms encode the time and frequency information of a given piece of music as a whole; they can be treated as images and used to train convolutional neural networks ([Wyse2017]). [Li, Chan, and A2010] developed a CNN to predict the music genre using raw Mel Frequency Cepstral Coefficients (MFCCs) as input.

In this paper, we aim to combine convolutional networks with LSTM autoencoders to extract both spatial and temporal features of the audio signal. Instead of baseline classifiers, we propose a clustering-based classification model.
In the proposed classification approach, we cluster the data based on their inherent characteristics, and in the process of learning the best clustering solution we optimize the hyperparameters of the classification model, thereby substantially improving the learning process. We used the mel-spectrogram as the only feature and compared the proposed model with traditional classifiers and previous literature.

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Dataset and Representation===
Dataset: In this paper, we use the GTZAN dataset. It contains 10 music genres; each genre has 100 audio clips in .au format. The genres are blues, classical, country, disco, hip-hop, pop, jazz, reggae, rock, and metal. Each audio clip is 30 seconds long and stored as a 22050 Hz mono 16-bit file. The dataset incorporates samples from a variety of sources such as CDs, radio, and microphone recordings. The training, testing, and validation sets are randomly partitioned in an 8:1:1 proportion.

Features: A popular representation of sound is the spectrogram, which captures both time and frequency information. In this study, we used the mel spectrogram as the only input to train our neural model. A mel spectrogram is a spectrogram transformed to have frequencies on the mel scale, which is logarithmic and more naturally represents how humans perceive different sound frequencies. To convert raw audio to a mel spectrogram, one applies Short-Time Fourier Transforms (STFT) across sliding windows of audio, around 20 ms wide. In this case, the music features are extracted using the LibROSA library in Python with 128 mel filters, a frame length of 2048 samples, and a hop size of 1024, yielding a spectrogram of size 647 × 128.
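This feature extraction step can be reproduced with a few lines of LibROSA. The snippet below is a minimal sketch using the parameters reported above (128 mel filters, frame length 2048, hop size 1024); the file path is a placeholder.

```python
import librosa
import numpy as np

# Load one 30-second GTZAN clip (path is a placeholder); GTZAN audio is 22050 Hz mono.
y, sr = librosa.load("gtzan/blues/blues.00000.au", sr=22050, mono=True)

# Mel spectrogram with the parameters reported in the paper:
# 128 mel filters, frame (FFT) length of 2048 samples, hop size of 1024.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                   hop_length=1024, n_mels=128)

# Log-scaled power is the usual network input; transpose to (time, mel bins),
# which gives roughly 647 x 128 for a 30-second clip.
S_db = librosa.power_to_db(S, ref=np.max).T
print(S_db.shape)  # approximately (647, 128)
```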
===Proposed architecture and methodology===
[Figure 1: Model Architecture]

Architecture: The model consists of a four-layer convolutional neural network (CNN), followed by an LSTM Sequence-to-Sequence Autoencoder (AE), and finally the proposed CALM classifier. The convolutional layers not only make the network independent of any handcrafted features, they also extract meaningful and useful features from the song. The output of the CNN is a sequence in which every timestep depends strongly on both its immediate predecessors and the long-term structure of the entire song. To capture both transient and overall characteristics, we use an LSTM Sequence-to-Sequence Autoencoder, and for classification we propose CALM, which is explained in the following sections. The assumption underlying this model is that temporal patterns can be aggregated better with LSTM autoencoders than with CNNs, while relying on CNNs on the input side for local feature extraction.

The CNN architecture consists of 4 convolutional layers of 64 feature maps with 3×3 convolution kernels and max-pooling layers of dimensions (2×2)-(3×3)-(4×4)-(4×4). In all convolutions, we pad zeros on each side of the input to keep the size fixed. Dropout (0.5) is applied to all convolutional layers to increase generalization. The CNN output has a feature map size of N×1×15 (number of feature maps × frequency × time). For extracting temporal patterns we use an LSTM-based architecture: LSTM layers with {256, 64, 16} units form the encoder LSTMenc, and LSTM layers with {64, 256} units form the decoder LSTMdec.

Methodology: To start, features are extracted from the spectrogram using the convolutional layers. The output of the CNN is fed to an LSTM Sequence-to-Sequence Autoencoder, which collects key information about the temporal properties of the input sequence in its hidden state. The final hidden state of the LSTMenc is then passed through some fully connected layers, the output of which is used to initialize the hidden state of the LSTMdec. The function of the LSTMdec is to reconstruct the input sequence based on the information contained in its initial hidden state. The network is trained to minimize the root mean squared error between the input sequence and the reconstruction. Once training is complete, the activation of the fully connected encoded layer is used as the representation of the audio sequence and is fed as input to the Clustering Augmented Learning Method classifier. This system showed 98% accuracy at the end of training.

The configuration of the proposed model is as follows (a code sketch of this pipeline follows the list):
A) Convolutional Neural Network (CNN), used to extract local features from the input: CNN1 → Maxpool (filter size 2×2) → CNN2 → Maxpool (filter size 3×3) → CNN3 → Maxpool (filter size 4×4) → CNN4 → Maxpool (filter size 4×4) → LSTM AE. All convolution kernels are 3×3.
B) LSTM Autoencoder, used to aggregate temporal features: LSTM Layer1 → LSTM Layer2 → LSTM Layer3 (embedded layer) → LSTM Layer4 → LSTM Layer5, with {256, 64, 16, 64, 256} units respectively.
C) Classification model: FC1 → Leaky ReLU → FC2 → Leaky ReLU → FC3 → Softmax, with dimensions 128, 32, and 10 respectively.
D) Optimizer: ADAM with learning rate 0.001, momentum rate 0.9, and weight decay (L2 regularization) of 1e-4.
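The paper does not include code, so the following PyTorch sketch is only an illustration of the pipeline described above: a four-layer CNN front end feeding a stacked-LSTM sequence-to-sequence autoencoder with {256, 64, 16} encoder and {64, 256} decoder units. The ReLU activations, the exact output shapes, and the way the decoder is conditioned (here it reads the embedding sequence directly, rather than being initialized from the encoder's final hidden state through extra layers as the paper describes) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Four conv layers of 64 feature maps, 3x3 kernels, zero padding,
    max pooling (2x2)-(3x3)-(4x4)-(4x4), dropout 0.5, as described above."""
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 1
        for pool in (2, 3, 4, 4):
            layers += [nn.Conv2d(in_ch, 64, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(pool),
                       nn.Dropout(0.5)]
            in_ch = 64
        self.net = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, 1, n_mels=128, time)
        f = self.net(x)                       # frequency axis pooled down to 1
        return f.squeeze(2).permute(0, 2, 1)  # (batch, time', 64) for the LSTMs

class Seq2SeqAE(nn.Module):
    """Stacked-LSTM autoencoder; the 16-unit layer is the embedded layer whose
    activations are later fed to the CALM classifier."""
    def __init__(self, feat=64):
        super().__init__()
        self.enc1 = nn.LSTM(feat, 256, batch_first=True)
        self.enc2 = nn.LSTM(256, 64, batch_first=True)
        self.enc3 = nn.LSTM(64, 16, batch_first=True)   # embedding layer
        self.dec1 = nn.LSTM(16, 64, batch_first=True)
        self.dec2 = nn.LSTM(64, 256, batch_first=True)
        self.out = nn.Linear(256, feat)                 # reconstruct CNN features

    def forward(self, x):
        h, _ = self.enc1(x)
        h, _ = self.enc2(h)
        z, _ = self.enc3(h)
        h, _ = self.dec1(z)
        h, _ = self.dec2(h)
        return self.out(h), z[:, -1]          # reconstruction, final 16-dim code

frontend, ae = ConvFrontEnd(), Seq2SeqAE()
feats = frontend(torch.randn(4, 1, 128, 647))            # batch of spectrograms
recon, code = ae(feats)
rmse = torch.sqrt(nn.functional.mse_loss(recon, feats))  # reconstruction loss
```

With the pooling sizes above, the time axis of the CNN output comes out shorter than the N×1×15 reported in the paper; the exact value depends on padding details not given in the text.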
===Clustering Augmented Learning Method (CALM)===
[Figure 2: Architecture of the Clustering Augmented Learning Method (CALM) Classifier]

Input augmentation: As in [Ghosal et al.2019], we consider a matrix of input data D and a set of cluster centers C. Since there are 10 music genres in this case study, we keep |C| at 10. In this paper, we use clustering to augment the input data x ∈ D for better learning. To augment the input data, we add a new set of features representing whether or not an input example belongs to a cluster. To distinguish input examples, we introduce an additional index h ∈ {1, ..., |D|} representing the number of an input example (x_1 is the first input example of D). We also define a vector c_h composed of c_{hl}, l ∈ C, for each example x_h ∈ D. It is a one-hot representation containing zeros except at the index of the cluster the example belongs to (e.g. c_1 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] means that the first input example x_1 belongs to the 4th of the 10 clusters). Finally, we augment the input examples by concatenating the vector x_h with the vector c_h for each h ∈ {1, ..., |D|}.

Cluster centers: To determine the cluster centers, CALM consists of a clustering model and a Feed-Forward Neural Network (FNN) with a softmax output that classifies the music genres. For the clustering model, we propose to use a Random Forest classifier to determine the cluster centers. After the FNN is trained with a state-of-the-art solver on the data belonging to a single cluster l ∈ {1, ..., |C|}, a Random Forest classifier is used to find the best cluster center. Hence we repeat |C| instances of training the FNN to find the |C| centers. For any instance l of the model, we use the one-hot encoded vector of l as the label for all input samples in that cluster; in simple words, while predicting the center of the 4th cluster (for example) we use [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] as the label for all input samples, since |C| is 10.

We propose that the input sample with the lowest error in predicting its cluster label is taken as the center of that cluster in the subsequent iteration of the proposed approach. In this manner, the center is the input sample that is the most fitting representative of its cluster. As a result, the clustering process aggregates data with similar characteristics, resulting in better learning by the FNN classification model.

Clustering problem: We have a distance/dissimilarity measure d_{il} between input examples i ∈ D and cluster centers l ∈ C. The clustering problem aims to assign each input example to a cluster such that the total distance between the elements of a cluster and its center is minimized. In this paper we also propose a novel dissimilarity measure based on the weights of the trained FNN classifier. It uses the average of the weights linked to each neuron of the input layer. Assuming that the original input (without the new clustering features) has d dimensions (x_h = [x_h^1, ..., x_h^d], h ∈ {1, ..., |D|}) and that w_{jn} is the weight linking node n of the input layer to node j ∈ {1, ..., n_1} of the following layer, the distance measure is formulated as:

$$d_{il} = \operatorname*{avg}_{n \in \{1,\dots,d\}} \sum_{j \in \{1,\dots,n_1\}} w_{jn} \, \lvert x_i^n - x_l^n \rvert$$

Thus the distance measure compares two examples based on how important the contribution of each input feature is to the resulting prediction. Therefore, the resulting clusters contain examples with a similar potential to improve the classification results.
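To make the augmentation and the weight-based dissimilarity concrete, here is a small NumPy sketch. The embedding dimension, the random data, and the weight matrix W are illustrative assumptions; in CALM, W would come from the trained FNN's first layer. The function implements the formula verbatim; with signed weights one would likely use |w_jn| in practice, which the paper does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d, n_clusters, n1 = 1000, 16, 10, 128

X = rng.normal(size=(n_samples, d))               # stand-in autoencoder embeddings
labels = rng.integers(0, n_clusters, n_samples)   # current cluster assignments

# Input augmentation: concatenate each x_h with its one-hot cluster vector c_h.
C_onehot = np.eye(n_clusters)[labels]             # (n_samples, |C|)
X_aug = np.concatenate([X, C_onehot], axis=1)     # (n_samples, d + |C|)

# Weight-based dissimilarity: w_jn links input node n to node j of the next layer.
W = rng.normal(size=(n1, d))                      # stand-in for trained FNN weights

def dissimilarity(x_i, x_l, W):
    """d_il = avg over n of (sum over j of w_jn) * |x_i^n - x_l^n|."""
    return (W.sum(axis=0) * np.abs(x_i - x_l)).mean()

print(dissimilarity(X[0], X[1], W))
```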
===Proposed Algorithm===
We propose an approach (Algorithm 1) in which we iteratively train the FNN classifier, use its weights to cluster the input data (thus changing the input vector), train the FNN classifier again on the new input data, and so on until a stopping criterion is attained. The stopping criterion is triggered if the cluster assignment remains the same for 10 consecutive iterations, i.e., the clustering problem has converged. A runnable sketch of this loop follows the algorithm.

Algorithm 1: Clustering-Augmented Learning Method
Step 0: The data obtained after extracting local features with the CNN and temporal information with the LSTM sequence-to-sequence autoencoder acts as input to CALM.
Step 1: Initialize the cluster centers u_1, u_2, ..., u_{|C|} randomly. Cluster the output data obtained from the LSTM autoencoder and augment each data sample with its one-hot encoded cluster label.
Step 2: Train the FNN and the clustering model.
  foreach l ∈ {1, ..., |C|} do:
    Train the FNN model on the data belonging to cluster l to learn classification.
    For supervised training of the Random Forest classifier, use the one-hot encoded representation of cluster l as labels. Running the clustering model gives the cluster center u_l.
Step 3: Clustering. Update the dissimilarity matrix using the trained weights W*.
  If the stopping criterion is attained, stop; else go to Step 2.
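The following sketch summarizes the iterative loop of Algorithm 1 under stated simplifications: the per-cluster FNN is replaced by a least-squares stub, the Random-Forest center selection is replaced by picking the member with the lowest prediction error (which is how the paper defines the center), and the dissimilarity uses an average of the per-cluster weight matrices. All of these stand-ins are hypothetical, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_fnn_stub(members, target):
    """Stand-in for FNN training: least-squares weights mapping members to the
    one-hot target (the real model is an FNN with a softmax output)."""
    T = np.tile(target, (len(members), 1))
    W, *_ = np.linalg.lstsq(members, T, rcond=None)
    return W                                  # (d, |C|) weight matrix

def dissimilarity(x_i, x_l, W):
    # d_il = avg over features n of the weight-sum times |x_i^n - x_l^n|.
    return (np.abs(W).sum(axis=1) * np.abs(x_i - x_l)).mean()

def calm_fit(X, n_clusters=10, patience=10, max_iters=100):
    labels = rng.integers(0, n_clusters, len(X))          # Step 1: random start
    stable = 0
    for _ in range(max_iters):
        Ws, centers = [], []
        for l in range(n_clusters):                       # Step 2: per cluster
            members = X[labels == l]
            if len(members) == 0:                         # keep empty clusters alive
                members = X[rng.choice(len(X), 1)]
            target = np.eye(n_clusters)[l]
            W = train_fnn_stub(members, target)
            errs = ((members @ W - target) ** 2).sum(axis=1)
            centers.append(members[errs.argmin()])        # lowest-error member
            Ws.append(W)
        W_avg = np.mean(Ws, axis=0)
        d = np.array([[dissimilarity(x, c, W_avg) for c in centers] for x in X])
        new_labels = d.argmin(axis=1)                     # Step 3: re-cluster
        stable = stable + 1 if np.array_equal(new_labels, labels) else 0
        labels = new_labels
        if stable >= patience:                            # unchanged 10 iterations
            break
    return labels, np.array(centers)

labels, centers = calm_fit(rng.normal(size=(200, 16)), n_clusters=10)
```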
===Results and Discussions===
The proposed model is trained with ADAM [Kingma and Ba2014] for 150 epochs, with early stopping [Prechelt1998] if there is no improvement for 25 epochs. The performance of all networks is evaluated using Precision, Recall, and Accuracy, defined as:

$$\text{Precision} = \frac{N_C}{N_C + N_F}, \qquad \text{Recall} = \frac{N_C}{N_C + N_M}, \qquad \text{Accuracy} = \frac{\mathit{total}_c}{\mathit{total}_m}$$

where N_C is the number of accurately predicted music tracks, N_F is the number of falsely predicted music tracks, N_M is the number of missed music tracks, total_c is the number of all accurately predicted music tracks, and total_m is the total number of music tracks.

Table 1: Performance of models
| Model | Train Accuracy | Test Accuracy |
|---|---|---|
| CNN + LSTM AE + CALM (Proposed Model) | 0.98 | 0.954 |
| CNN + LSTM AE + Logistic Regression | 0.914 | 0.873 |
| Logistic Regression | 1.0 | 0.77 |
| k-Nearest Neighbours | 1.0 | 0.36 |
| Multilayer Perceptron | 0.9725 | 0.83 |
| Support Vector Machine | 1.0 | 0.28 |
| Random Forest | 1.0 | 0.76 |

To further interpret the results, we plotted the confusion matrix (Table 3) of the proposed model. Looking closely at the confusion matrix, we see that the proposed model correctly classified 85% of rock audio as rock, labeling the remainder mainly as country or blues. Conversely, it incorrectly classified some country tracks, as well as a small fraction of blues and reggae tracks, as rock music.

Comparison with baseline classifiers: We trained five traditional classification models on the dataset as baselines: k-nearest neighbors, logistic regression, random forest, multilayer perceptron, and linear support vector machine, using Mel Frequency Cepstral Coefficients (MFCCs) flattened into a 1-D array. Apart from the baseline classifiers, we also experimented with stacking a Logistic Regression classifier on the features obtained from the convolutional network and LSTM autoencoder to test the performance of the CALM classifier. As is evident from Table 1, CALM outperforms the Logistic Regression classifier when augmented with the convolutional network and LSTM autoencoder. Moreover, Fig. 4 shows how the intra-cluster variance decreases over approximately 75 iterations and then stabilizes; to measure intra-cluster variance, we used Euclidean distance in this case study. Similarly, it is evident from Fig. 3 that the testing loss starts decreasing after 80 epochs and, gradually, as the clustering solution converges, the accuracy begins to improve. This observation supports our initial assumption that clustering data based on inherent characteristics improves the learning process of the FNN.

[Figure 3: Training and Testing Loss]
[Figure 4: Plot of Intra-cluster Variance vs. Iterations]

For a fair comparison, all models are trained and tested on the same dataset as the proposed model. The hyperparameters are tuned by grid search to ensure that the best model configuration is adopted. In Table 2 we also compare our model with relevant literature; it is evident that the proposed architecture performs strongly.

Table 2: Comparison with Literature
| Model | Accuracy |
|---|---|
| Proposed Model | 0.954 |
| Liu et al. [Liu et al.2019] | 0.939 |
| Multi-DNN [Dai et al.2015] | 0.934 |
| CVAF [Nanni et al.2017] | 0.909 |
| Hybrid Model [Karunakaran and Arya2018] | 0.883 |
| NNet2 [Zhang et al.2016] | 0.874 |
| Bergstra et al. [Matityaho and Furst2006] | 0.825 |
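As a sanity check on the metric definitions above, the per-genre margins of Table 3 (below) can be recomputed from its cells; the NumPy snippet below does so (the published margins match up to minor discrepancies in the original table).

```python
import numpy as np

# Confusion matrix from Table 3: rows = actual, columns = predicted, genre order
# blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock.
cm = np.array([
    [95, 0, 0, 0, 0, 1, 0, 0, 1, 3],
    [0, 95, 0, 0, 0, 1, 0, 0, 1, 1],
    [1, 1, 82, 1, 0, 0, 0, 0, 0, 6],
    [0, 0, 3, 75, 1, 0, 1, 0, 1, 3],
    [0, 0, 0, 0, 72, 0, 0, 3, 6, 4],
    [1, 2, 0, 0, 0, 77, 0, 0, 1, 1],
    [0, 0, 0, 3, 0, 0, 64, 0, 0, 3],
    [0, 0, 0, 0, 1, 0, 1, 75, 2, 1],
    [1, 0, 0, 0, 1, 0, 0, 3, 76, 3],
    [2, 0, 7, 3, 0, 1, 1, 0, 1, 85],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)     # per-genre N_C / (N_C + N_M)
precision = correct / cm.sum(axis=0)  # per-genre N_C / (N_C + N_F)

print(f"rock recall: {recall[-1]:.0%}")  # 85%, as in Table 3
print(f"mean recall: {recall.mean():.2%}, mean precision: {precision.mean():.2%}")
```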
===Conclusion===
In this paper, we present a specially designed network for accurately recognizing music genres. The proposed model aims to take full advantage of the low-level information in the mel-spectrogram for making the classification decision. We have shown the effectiveness of our model by comparing it with state-of-the-art methods, including both handcrafted-feature approaches and deep learning models. In this work, we use the GTZAN dataset, which is a common benchmark. Our proposed model achieved a test accuracy of 95.4%, outperforming all the other compared models. In the future, we will try to improve the model by devising new distance metrics to compute the similarity between genres.

Table 3: Confusion Matrix (rows: actual genre; columns: predicted genre)
| Actual \ Predicted | Blues | Classical | Country | Disco | HipHop | Jazz | Metal | Pop | Reggae | Rock | Recall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Blues | 95 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 3 | 95% |
| Classical | 0 | 95 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 97.9% |
| Country | 1 | 1 | 82 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 90.1% |
| Disco | 0 | 0 | 3 | 75 | 1 | 0 | 1 | 0 | 1 | 3 | 89.3% |
| HipHop | 0 | 0 | 0 | 0 | 72 | 0 | 0 | 3 | 6 | 4 | 84.7% |
| Jazz | 1 | 2 | 0 | 0 | 0 | 77 | 0 | 0 | 1 | 1 | 93.9% |
| Metal | 0 | 0 | 0 | 3 | 0 | 0 | 64 | 0 | 0 | 3 | 91.4% |
| Pop | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 75 | 2 | 1 | 94.9% |
| Reggae | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 76 | 3 | 90.4% |
| Rock | 2 | 0 | 7 | 3 | 0 | 1 | 1 | 0 | 1 | 85 | 85% |
| Precision | 95% | 96.9% | 89.1% | 91.5% | 96% | 96.3% | 95.5% | 92.6% | 86.4% | 79.4% | |

Average recall: 91.24%; average precision: 91.87%.

===References===
[Dai et al.2015] Dai, J.; Liu, W.; Dong, L.; and Yang, H. 2015. Multilingual deep neural network for music genre classification. In Sixteenth Annual Conference of the International Speech Communication Association.
[Deshpande and Singh2001] Deshpande, H., and Singh, R. 2001. Classification of music signals in the visual domain.
[Fu et al.2010] Fu, Z.; Lu, G.; Ting, K. M.; and Zhang, D. 2010. Learning naive Bayes classifiers for music classification and retrieval. In 2010 International Conference on Pattern Recognition.
[Ghosal et al.2019] Ghosal, S. S.; Bani, A.; Amrouss, A.; and El Hallaoui, I. 2019. A deep learning approach to predict parking occupancy using cluster augmented learning method. In 2019 International Conference on Data Mining Workshops (ICDMW), 581-586.
[Karunakaran and Arya2018] Karunakaran, N., and Arya, A. 2018. A scalable hybrid classifier for music genre classification using machine learning concepts and Spark. In International Conference on Intelligent Autonomous Systems (ICoIAS), 128-135. IEEE.
[Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Li, Chan, and A2010] Li, T. L.; Chan, A. B.; and A, C. 2010. Automatic musical pattern feature extraction using convolutional neural network. In Data Mining and Applications. IEEE.
[Liu et al.2019] Liu, C.; Feng, L.; Wang, H.; and Liu, S. 2019. Bottom-up broadcast neural network for music genre classification. arXiv preprint arXiv:1901.08928v1.
[Matityaho and Furst2006] Matityaho, B., and Furst, M. 2006. Aggregate features and AdaBoost for music classification. Machine Learning 473-484.
[Nanni et al.2017] Nanni, L.; Costa, Y. M.; Lucio, D. R.; Silla Jr., C. N.; and Brahnam, S. 2017. Combining visual and acoustic features for audio classification tasks. Pattern Recognition Letters 49-56.
[Prechelt1998] Prechelt, L. 1998. Early stopping-but when? In Neural Networks: Tricks of the Trade. Springer. 55-69.
[Tzanetakis and Cook2002] Tzanetakis, G., and Cook, P. 2002. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(3):293-302.
[Wyse2017] Wyse, L. 2017. Audio spectrogram representations for processing with convolutional neural networks. arXiv preprint arXiv:1706.09559.
[Zhang et al.2016] Zhang, W.; Lei, W.; Xu, X.; and Xing, X. 2016. Improved music genre classification with convolutional neural networks. INTERSPEECH 3304-3308.