Novel Approach to Music Genre Classification using Clustering Augmented Learning Method (CALM)

Soumya Suvra Ghosal and Indranil Sarkar
National Institute of Technology Durgapur, India
{soumyasuvraghosal@gmail.com, indranil.sarkar.nitdgp@gmail.com}

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper proposes an automatic music genre classification system using a deep learning model. The proposed model leverages Convolutional Neural Networks (CNNs) to extract local features and an LSTM Sequence-to-Sequence Autoencoder to learn representations of time series data that take their temporal dynamics into account. The paper also introduces the Clustering Augmented Learning Method (CALM) classifier, which is based on the concept of simultaneous heterogeneous clustering and classification, to learn deep representations from the features obtained from the LSTM autoencoder. Computational experiments on the GTZAN dataset resulted in an overall test accuracy of 95.4% with a precision of 91.87%.
Introduction

With the increasing amount of music available online, there is a growing demand for the systematic organization of audio files, which has increased interest in music classification. Detecting groups of music of a similar genre is the core task of recommendation systems and playlist generators. Building a robust music classifier using machine learning techniques is therefore essential to automate the tagging of unlabeled music and to improve the user experience of media players and music libraries. In recent years, convolutional neural networks (CNNs) have brought revolutionary changes to the computer vision community. Meanwhile, CNNs have been widely used for music information retrieval, especially music genre classification. Recently, it has become increasingly popular to combine CNNs with recurrent networks (RNNs) to process audio signals, which introduces time-sequential information into the model. In convolutional recurrent networks (C-RNNs), the CNN component is used to extract features while the RNN summarizes temporal features. The inputs of C-RNNs are soundtrack spectrograms and the outputs are probabilities of each genre at each timestep. Inspired by previous literature, we propose to leverage this idea by augmenting an LSTM autoencoder with a CNN and using a clustering-based classifier to predict the genre of music.
Previous Works

Music genre classification has been actively studied since the early days of the field. Tzanetakis and Cook [Tzanetakis and Cook2002] used a k-nearest neighbor classifier and Gaussian Mixture Models with a comprehensive set of features for music classification. Those features can be summarized into three categories: rhythm, pitch, and temporal structure. Zhouyu Fu [Fu et al.2010] proposed a Naive Bayes (NB) classifier framework, namely NB Nearest Neighbor (NBNN) and NB Support Vector Machine (NBSVM), for music genre classification. [Deshpande and Singh2001] compared k-nearest neighbors, Gaussian Mixtures, and SVMs to classify music into three genres: rock, piano, and jazz. In recent years, using the audio spectrogram has become mainstream for music genre classification. Spectrograms encode the time and frequency information of a given piece of music as a whole. Spectrograms can be treated as images and used to train convolutional neural networks (CNNs) [Wyse2017]. [Li, Chan, and A2010] developed a CNN to predict the music genre using raw Mel Frequency Cepstral Coefficients (MFCCs) as input.

In this paper, we aim to combine convolutional nets with LSTM autoencoders to extract both spatial and temporal features of the audio signal. Instead of baseline classifiers, we propose a clustering-based classification model. In the proposed classification approach we cluster the data based on their inherent characteristics, and in the process of learning the best clustering solution we optimize the hyperparameters of the classification model, thereby substantially improving the learning process. We use the mel spectrogram as the only feature and compare the proposed model with traditional classifiers and previous literature.

Dataset and Representation
Dataset. In this paper, we use the GTZAN dataset. It contains 10 music genres, and each genre has 100 audio clips in .au format. The genres are blues, classical, country, disco, hip-hop, pop, jazz, reggae, rock, and metal. Each audio clip is 30 seconds long, sampled at 22050 Hz, mono, 16-bit. The dataset incorporates samples from a variety of sources such as CDs, radios, and microphone recordings. The training, testing, and validation sets are randomly partitioned in the proportion 8:1:1.
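For concreteness, the split can be sketched as follows. The directory layout and the use of scikit-learn (including the stratified split) are illustrative assumptions, not the authors' pipeline.

import glob
from sklearn.model_selection import train_test_split

# Assumed GTZAN layout: genres/<genre>/<clip>.au (10 genres x 100 clips).
files = sorted(glob.glob("genres/*/*.au"))
labels = [f.split("/")[-2] for f in files]  # genre = parent folder name

# 8:1:1 -> 80% train, then split the remaining 20% evenly into val/test.
train_f, rest_f, train_y, rest_y = train_test_split(
    files, labels, test_size=0.2, stratify=labels, random_state=42)
val_f, test_f, val_y, test_y = train_test_split(
    rest_f, rest_y, test_size=0.5, stratify=rest_y, random_state=42)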
Features. A popular representation of sound is the spectrogram, which captures both time and frequency information. In this study, we use the mel spectrogram as the only input to train our neural model. A mel spectrogram is a spectrogram transformed to have frequencies on the mel scale, which is logarithmic and more naturally represents how humans perceive different sound frequencies. To convert raw audio to a mel spectrogram, one applies Short-Time Fourier Transforms (STFT) across sliding windows of audio, around 20 ms wide. In this case, the music features are extracted using the LibROSA library in Python with 128 mel filters, a frame length of 2048 samples, and a hop size of 1024, giving a spectrogram of size 647 × 128.
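This extraction step can be reproduced roughly as below. The LibROSA calls follow the stated parameters; the file path and the log-power scaling are assumptions.

import librosa
import numpy as np

def audio_to_mel(path, sr=22050, n_fft=2048, hop_length=1024, n_mels=128):
    # Load a 30 s GTZAN clip and compute its mel spectrogram.
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log-magnitude scale
    return mel_db.T  # time x mel, approx. 647 x 128 for a 30 s clip

spec = audio_to_mel("genres/blues/blues.00000.au")
print(spec.shape)  # approx. (647, 128)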
Proposed Architecture and Methodology

Figure 1: Model Architecture
Architecture

The model consists of a four-layer convolutional neural network (CNN), followed by an LSTM Sequence-to-Sequence Autoencoder (AE), and finally the proposed CALM classifier. The convolutional layers not only free the network from any handcrafted features but are also used to extract meaningful and useful features from the song. The output of the CNN is a sequence in which every timestep strongly depends on both its immediate predecessors and the long-term structure of the entire song. To capture both transient and overall characteristics, we use an LSTM Sequence-to-Sequence Autoencoder, and for classification we propose CALM, which is explained in the following sections. The assumption underlying this model is that the temporal pattern can be aggregated better with LSTM autoencoders than with CNNs, while relying on CNNs on the input side for local feature extraction.

The CNN architecture consists of 4 convolutional layers of 64 feature maps each, with 3-by-3 convolution kernels and max-pooling layers of dimensions (2×2)-(3×3)-(4×4)-(4×4). In all convolutions, we pad zeros on each side of the input to keep the size fixed. Dropout (0.5) is applied to all convolutional layers to increase generalization. The CNN output has a feature map size of N×1×15 (number of feature maps × frequency × time). For extracting temporal patterns we use an LSTM-based architecture, with LSTM layers of {256, 64, 16} units as the encoder LSTMenc and LSTM layers of {64, 256} units as the decoder LSTMdec.
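A minimal Keras sketch of this architecture is given below, assuming a 647 × 128 mel-spectrogram input. The layer sizes follow the text; for brevity the decoder is seeded with a RepeatVector rather than the paper's fully connected initialization of the decoder hidden state, and all glue code is an assumption.

from tensorflow.keras import layers, Model

def build_cnn(n_time=647, n_mels=128):
    inp = layers.Input(shape=(n_time, n_mels, 1))
    x = inp
    # Four 3x3 conv layers (64 maps, zero padding) with the stated pooling.
    for pool in [(2, 2), (3, 3), (4, 4), (4, 4)]:
        x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool)(x)
        x = layers.Dropout(0.5)(x)
    # Flatten (time, freq, maps) into a (time, features) sequence.
    seq = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    return Model(inp, seq)

def build_lstm_autoencoder(n_steps, n_feat):
    seq_in = layers.Input(shape=(n_steps, n_feat))
    e = layers.LSTM(256, return_sequences=True)(seq_in)
    e = layers.LSTM(64, return_sequences=True)(e)
    code = layers.LSTM(16)(e)                 # embedding consumed by CALM
    d = layers.RepeatVector(n_steps)(code)
    d = layers.LSTM(64, return_sequences=True)(d)
    d = layers.LSTM(256, return_sequences=True)(d)
    seq_out = layers.TimeDistributed(layers.Dense(n_feat))(d)
    ae = Model(seq_in, seq_out)
    ae.compile(optimizer="adam", loss="mse")  # trained to reconstruct input
    return ae, Model(seq_in, code)            # autoencoder and encoder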
Methodology. To start, features are extracted from the spectrogram using the convolutional layers. The output of the CNN is fed to an LSTM sequence-to-sequence autoencoder, which collects key information about the temporal properties of the input sequence in its hidden state. The final hidden state of LSTMenc is then passed through fully connected layers, the output of which is used to initialize the hidden state of LSTMdec. The function of LSTMdec is to reconstruct the input sequence based on the information contained in its initial hidden state. The network is trained to minimize the root mean squared error between the input sequence and the reconstruction. Once training is complete, the activation of the fully connected encoded layer is used as the representation of the audio sequence and is fed as input to the Clustering Augmented Learning Method classifier. This system achieved 98% accuracy at the end of training.

Clustering Augmented Learning Method (CALM)

Proposed Approach

Input augmentation. As in [Ghosal et al.2019], we consider a matrix of input data D and a set of cluster centers C. Since there are 10 music genres in this case study, we keep |C| = 10. In this paper, we use clustering to augment the input data x ∈ D for better learning. To augment the input data, we add a new set of features representing whether or not an input example belongs to a cluster. To distinguish input examples, we introduce an additional index h ∈ {1, . . . , |D|} representing the number of an input example (x1 is the first input example of D). We also define, for each example xh ∈ D, a vector ch composed of chl, l ∈ C. It is a one-hot representation containing zeros except at the index of the cluster the example belongs to (e.g. c1 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] means that the first input example x1 belongs to the 4th cluster out of 10 clusters). Finally, we augment the input examples by concatenating the vector xh with the vector ch for each h ∈ {1, . . . , |D|}.
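The augmentation step is a simple concatenation, sketched here in numpy; the array names are assumptions.

import numpy as np

def augment_with_clusters(X, assignments, n_clusters=10):
    # X: (|D|, d) embeddings; assignments: (|D|,) cluster index per example.
    C = np.zeros((X.shape[0], n_clusters))
    C[np.arange(X.shape[0]), assignments] = 1.0  # one-hot cluster labels c_h
    return np.hstack([X, C])                     # augmented inputs [x_h ; c_h]

X = np.random.rand(1000, 16)               # e.g. 16-dim LSTM-AE embeddings
labels = np.random.randint(0, 10, 1000)    # current cluster assignments
X_aug = augment_with_clusters(X, labels)   # shape (1000, 26)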
Cluster centers. To determine the cluster centers, CALM consists of a clustering model and a Feed-Forward Neural Network (FNN) with a softmax output to classify the music genres. For the clustering model, we propose to use a Random Forest classifier to determine the cluster centers. After the FNN is trained with a state-of-the-art solver on the data belonging to a single cluster l ∈ {1, . . . , |C|}, a Random Forest classifier is used to find the best cluster center. Hence we repeat |C| instances of training the FNN to find the |C| centers. For any instance l of the model, we use the one-hot encoded vector of l as the label for all input samples in that cluster. In simple words, while predicting the center of the 4th cluster (for example) we use [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] as the label for all input samples, since |C| is 10.

Figure 2: Architecture of Clustering Augmented Learning Method (CALM) Classifier

We propose that the input sample with the lowest error in predicting its cluster label is considered the center of that cluster in the subsequent iteration of the proposed approach. In this manner, the center is the input sample that is the most fitting representative of its cluster. As a result, the clustering process aggregates data with similar characteristics, resulting in better learning by the FNN classification model.
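One plausible reading of this center-update rule, sketched with scikit-learn: fit a Random Forest on the augmented samples with their one-hot cluster labels, then pick, per cluster, the member whose label is predicted with the lowest error. The helper name and details are assumptions, not the authors' implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def update_centers(X_aug, assignments, n_clusters=10):
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_aug, assignments)           # supervised on current cluster labels
    proba = rf.predict_proba(X_aug)      # columns ordered as rf.classes_
    centers = []
    for l in range(n_clusters):
        members = np.where(assignments == l)[0]
        col = np.where(rf.classes_ == l)[0][0]
        # Lowest label error = highest predicted probability of own cluster.
        best = members[np.argmax(proba[members, col])]
        centers.append(X_aug[best])
    return np.array(centers)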
Clustering Problem

We have a distance/dissimilarity measure d_il between input examples i ∈ D and cluster centers l ∈ C. The clustering problem aims to assign each input example to a cluster such that the total distance between the elements of a cluster and its center is minimized.

In this paper we also propose a novel dissimilarity measure based on the weights of the trained FNN classifier. It uses the average of the weights linked to each neuron of the input layer. Assuming that the original input (without the new clustering features) has d dimensions (xh = [x1h, . . . , xdh], h ∈ {1, . . . , |D|}) and that the weight linking node n of the input layer to node j ∈ {1, . . . , n1} of the following layer is wjn, the dissimilarity measure is formulated as follows:

    d_il = Σ_{n ∈ {1,...,d}} avg_{j ∈ {1,...,n1}} w_jn · |x_i^n − x_l^n|

Thus the distance measure computes the distance between two examples based on how important the contribution of each input feature is to the resulting prediction. Therefore, the resulting clusters contain examples with similar potential to improve the classification results.
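A direct numpy transcription of this measure is given below. W1 denotes the (n1 × d) weight matrix between the input layer and the first hidden layer of the trained FNN; the variable names are assumptions.

import numpy as np

def dissimilarity(x_i, x_l, W1):
    # d_il = sum over features n of (avg over hidden nodes j of w_jn)
    #        times |x_i^n - x_l^n|.
    feature_weight = W1.mean(axis=0)  # average weight per input feature
    return float(np.sum(feature_weight * np.abs(x_i - x_l)))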
                                                                           are defined as :
Proposed Algorithm

We propose an approach (Algorithm 1) in which we iteratively train the FNN classifier, use its weights to cluster the input data (thereby changing the input vector), train the FNN classifier again on the new input data, and so on until a stopping criterion is attained. The stopping criterion is triggered if the cluster assignment remains the same for 10 consecutive iterations, i.e., the clustering problem converges.

The configuration of the proposed model is as follows:

A) Convolutional Neural Network (CNN): used to extract local features from the input.
   CNN1 → Maxpool (filter size 2×2) → CNN2 → Maxpool (filter size 3×3) → CNN3 → Maxpool (filter size 4×4) → CNN4 → Maxpool (filter size 4×4) → LSTM AE. The dimension of every convolution kernel is 3×3.

B) LSTM Autoencoder: used to aggregate temporal features.
   LSTM Layer1 → LSTM Layer2 → LSTM Layer3 (embedded layer) → LSTM Layer4 → LSTM Layer5, with dimensions {256, 64, 16, 64, 256} respectively.

C) Classification Model: FC1 → Leaky ReLU → FC2 → Leaky ReLU → FC3 → Softmax. Dimension of FC1: 128; FC2: 32; FC3: 10.

D) Optimizer: ADAM with learning rate 0.001, momentum rate 0.9, and weight decay (L2 regularization) of 1e-4.

Algorithm 1: Clustering-Augmented Learning Method

Step 0: The data obtained after extracting local features using the CNN and temporal information using the LSTM sequence-to-sequence autoencoder acts as the input to CALM.

Step 1: Initialize the cluster centers u1, u2, ..., u|C| randomly. Cluster the output data obtained from the LSTM autoencoder and augment each data sample with its one-hot encoded cluster label.

Step 2: Train the FNN and the clustering model.
foreach l ∈ {1, ..., |C|} do
    Train the FNN model on the data belonging to cluster l to learn classification.
    For supervised training of the Random Forest classifier we use the one-hot encoded representation of the clusters as labels. Running the clustering model gives the cluster center ul.

Step 3: Clustering.
Update the dissimilarity matrix using W*.
if the stopping criterion is attained then Stop.
else go to Step 2.
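A loose end-to-end sketch of Algorithm 1 follows. It simplifies the paper's per-cluster FNN training to a single scikit-learn MLP and reuses the helper sketches above (augment_with_clusters, update_centers, dissimilarity); all of this is an interpretation under those assumptions, not the authors' code.

import numpy as np
from sklearn.neural_network import MLPClassifier

def calm_fit(X, y, n_clusters=10, patience=10, max_iter=150, seed=0):
    rng = np.random.default_rng(seed)
    assignments = rng.integers(0, n_clusters, len(X))  # Step 1: random init
    d_orig = X.shape[1]          # the distance uses only the original d features
    stable = 0
    for _ in range(max_iter):
        X_aug = augment_with_clusters(X, assignments, n_clusters)
        fnn = MLPClassifier(hidden_layer_sizes=(128, 32), max_iter=300)
        fnn.fit(X_aug, y)                              # Step 2: train FNN
        W1 = fnn.coefs_[0].T                           # first-layer weights
        centers = update_centers(X_aug, assignments, n_clusters)
        # Step 3: reassign each sample to its nearest center under d_il.
        d = np.array([[dissimilarity(x[:d_orig], c[:d_orig], W1[:, :d_orig])
                       for c in centers] for x in X_aug])
        new_assignments = d.argmin(axis=1)
        stable = stable + 1 if np.array_equal(new_assignments, assignments) else 0
        assignments = new_assignments
        if stable >= patience:                         # stopping criterion
            break
    return fnn, centers, assignments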
Results and Discussions

The proposed model is trained with ADAM [Kingma and Ba2014] for 150 epochs, with early stopping [Prechelt1998] if there is no improvement within 25 epochs. The performance of all networks is evaluated using Precision, Recall, and Accuracy, which are defined as:

    Precision = NC / (NC + NF)
    Recall = NC / (NC + NM)
    Accuracy = totalc / totalm

where NC is the number of accurately predicted music tracks, NF is the number of falsely predicted music tracks, NM is the number of missed music tracks, totalc is the number of all accurately predicted music tracks, and totalm is the number of all music tracks.
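These measures are simple count ratios; a direct transcription:

def precision(n_correct, n_false):
    return n_correct / (n_correct + n_false)

def recall(n_correct, n_missed):
    return n_correct / (n_correct + n_missed)

def accuracy(total_correct, total_tracks):
    return total_correct / total_tracks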
To further interpret the results, we plotted the confusion matrix (Table 3) of the proposed model. Looking closely at the confusion matrix, we see that our proposed model managed to correctly classify 85% of rock audio as rock, labeling the rest mainly as country or blues. Additionally, it incorrectly classified some country, as well as a small fraction of blues and reggae, as rock music.
Comparison with Baseline Classifiers. We trained five traditional classification models on the dataset as baseline classifiers: k-nearest neighbors, logistic regression, random forest, a multilayer perceptron, and a linear support vector machine, using Mel Frequency Cepstral Coefficients (MFCCs) flattened into a 1-D array. Apart from the baseline classifiers, we also experimented with stacking a logistic regression classifier on the features obtained from the convolutional net and LSTM autoencoder to test the performance of the CALM classifier. As evident from Table 1, CALM outperforms the logistic regression classifier when augmented with convolutional nets and the LSTM autoencoder. Moreover, Fig. 4 shows how the intra-cluster variance decreases after approximately 75 iterations and then stabilizes; to measure intra-cluster variance, we used the Euclidean distance in this case study. Similarly, it is evident from Fig. 3 that the testing loss starts decreasing after 80 epochs and, gradually, as the clustering solution converges, the accuracy begins to improve. This observation bolsters our initial assumption that clustering data based on inherent characteristics improves the learning process of the FNN.

Table 1: Performance of models

Model                                    Train Accuracy   Test Accuracy
CNN + LSTM AE + CALM (Proposed Model)    0.98             0.954
CNN + LSTM AE + Logistic Regression      0.914            0.873
Logistic Regression                      1.0              0.77
k-Nearest Neighbours                     1.0              0.36
Multilayer Perceptron                    0.9725           0.83
Support Vector Machine                   1.0              0.28
Random Forest                            1.0              0.76
For a fair comparison, all models are trained and tested on the same dataset as the proposed model. The hyperparameters are tuned by a grid search to ensure that the best model configuration is adopted. In Table 2 we also compare our model with the relevant literature, and it is evident that the proposed architecture performs strongly.

Figure 4: Plot of Intra-cluster Variance vs Iterations

Table 2: Comparison with Literature

Model                                        Accuracy
Proposed Model                               0.954
Liu et al. [Liu et al.2019]                  0.939
Multi-DNN [Dai et al.2015]                   0.934
CVAF [Nanni et al.2017]                      0.909
Hybrid Model [Karunakaran and Arya2018]      0.883
NNet2 [Zhang et al.2016]                     0.874
Bergstra et al. [Matityaho and Furst2006]    0.825



Figure 3: Training and Testing Loss

Conclusion

In this paper, we present a specially designed network for accurately recognizing the music genre. The proposed model aims to take full advantage of the low-level information in the mel spectrogram when making the classification decision. We have shown the effectiveness of our model by comparing it with state-of-the-art methods, including both handcrafted-feature approaches and deep learning models. In this work, we use the GTZAN dataset, which is a common benchmark dataset. Our proposed model achieved an impressive test accuracy of 95.4%, outperforming all other models. In the future, we will try to improve the model by devising new distance metric methods to compute the similarity between genres.
Table 3: Confusion Matrix (rows: actual; columns: predicted)

            Blues  Classical  Country  Disco  HipHop  Jazz   Metal  Pop    Reggae  Rock    Recall
Blues         95       0         0       0      0       1      0      0      1       3      95%
Classical      0      95         0       0      0       1      0      0      1       1      97.9%
Country        1       1        82       1      0       0      0      0      0       6      90.1%
Disco          0       0         3      75      1       0      1      0      1       3      89.3%
Hiphop         0       0         0       0     72       0      0      3      6       4      84.7%
Jazz           1       2         0       0      0      77      0      0      1       1      93.9%
Metal          0       0         0       3      0       0     64      0      0       3      91.4%
Pop            0       0         0       0      1       0      1     75      2       1      94.9%
Reggae         1       0         0       0      1       0      0      3     76       3      90.4%
Rock           2       0         7       3      0       1      1      0      1      85      85%
Precision     95%    96.9%     89.1%   91.5%   96%    96.3%  95.5%  92.6%  86.4%   79.4%

Average recall: 91.24%. Average precision: 91.87%.


References

[Dai et al.2015] Dai, J.; Liu, W.; Dong, L.; and Yang, H. 2015. Multilingual deep neural network for music genre classification. In Sixteenth Annual Conference of the International Speech Communication Association.

[Deshpande and Singh2001] Deshpande, H., and Singh, R. 2001. Classification of music signals in the visual domain.

[Fu et al.2010] Fu, Z.; Lu, G.; Ting, K. M.; and Zhang, D. 2010. Learning naive Bayes classifiers for music classification and retrieval. In 2010 International Conference on Pattern Recognition.

[Ghosal et al.2019] Ghosal, S. S.; Bani, A.; Amrouss, A.; and El Hallaoui, I. 2019. A deep learning approach to predict parking occupancy using cluster augmented learning method. In 2019 International Conference on Data Mining Workshops (ICDMW), 581–586.

[Karunakaran and Arya2018] Karunakaran, N., and Arya, A. 2018. A scalable hybrid classifier for music genre classification using machine learning concepts and Spark. In International Conference on Intelligent Autonomous Systems (ICoIAS), 128–135. IEEE.

[Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Li, Chan, and A2010] Li, T. L.; Chan, A. B.; and A, C. 2010. Automatic musical pattern feature extraction using convolutional neural network. In 2015 Data Mining and Applications. IEEE.

[Liu et al.2019] Liu, C.; Feng, L.; Wang, H.; and Liu, S. 2019. Bottom-up broadcast neural network for music genre classification. arXiv preprint arXiv:1901.08928v1.

[Matityaho and Furst2006] Matityaho, B., and Furst, M. 2006. Aggregate features and AdaBoost for music classification. Machine Learning 473–484.

[Nanni et al.2017] Nanni, L.; Costa, Y. M.; Lucio, D. R.; Silla Jr, C. N.; and Brahnam, S. 2017. Combining visual and acoustic features for audio classification tasks. Pattern Recognition Letters 49–56.

[Prechelt1998] Prechelt, L. 1998. Early stopping - but when? In Neural Networks: Tricks of the Trade. Springer. 55–69.

[Tzanetakis and Cook2002] Tzanetakis, G., and Cook, P. 2002. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(3):293–302.

[Wyse2017] Wyse, L. 2017. Audio spectrogram representations for processing with convolutional neural networks. arXiv preprint arXiv:1706.09559.

[Zhang et al.2016] Zhang, W.; Lei, W.; Xu, X.; and Xing, X. 2016. Improved music genre classification with convolutional neural networks. INTERSPEECH 3304–3308.