Comparing Shallow versus Deep Neural Network Architectures for Automatic Music Genre Classification

Alexander Schindler
Austrian Institute of Technology
Digital Safety and Security
Vienna, Austria
alexander.schindler@ait.ac.at

Thomas Lidy, Andreas Rauber
Vienna University of Technology
Institute of Software Technology
Vienna, Austria
lidy,rauber@ifs.tuwien.ac.at

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: W. Aigner, G. Schmiedl, K. Blumenstein, M. Zeppelzauer (eds.): Proceedings of the 9th Forum Media Technology 2016, St. Pölten, Austria, 24-11-2016, published at http://ceur-ws.org

Abstract

In this paper we investigate performance differences of different neural network architectures on the task of automatic music genre classification. Comparative evaluations on four well-known datasets of different sizes were performed, including the application of two audio data augmentation methods. The results show that shallow network architectures are better suited for small datasets than deeper models, which could be relevant for experiments and applications which rely on small datasets. A noticeable advantage was observed through the application of data augmentation using deep models. A final comparison with previous evaluations on the same datasets shows that the presented neural network based approaches already outperform state-of-the-art handcrafted music features.

1 Introduction

Music classification is a well researched topic in Music Information Retrieval (MIR) [FLTZ11]. Generally, its aim is to assign one or multiple labels to a sequence or an entire audio file, which is commonly accomplished in two major steps. First, semantically meaningful audio content descriptors are extracted from the sampled audio signal. Second, a machine learning algorithm is applied, which attempts to discriminate between the classes by finding separating boundaries in the multidimensional feature spaces. Especially the first step requires extensive knowledge and skills in various specific research areas such as audio signal processing, acoustics and/or music theory. Recently, many approaches to MIR problems have been inspired by the remarkable success of Deep Neural Networks (DNN) in the domain of computer vision [KSH12], where deep learning based approaches have already become the de facto standard. The major advantage of DNNs is their feature learning capability, which alleviates the domain knowledge and time intensive task of crafting audio features by hand. Predictions are also made directly on the modeled input representations, which is commonly raw input data such as images, text or audio spectrograms. Recent accomplishments in applying Convolutional Neural Networks (CNN) to audio classification tasks have shown promising results by outperforming conventional approaches in different evaluation campaigns such as the Detection and Classification of Acoustic Scenes and Events (DCASE) [LS16a] and the Music Information Retrieval Evaluation eXchange (MIREX) [LS16b].

An often mentioned paradigm concerning neural networks is that deeper networks are better at modeling non-linear relationships of given tasks [SLJ+15]. So far, preceding MIR experiments and approaches reported in the literature have not explicitly demonstrated an advantage of deep over shallow network architectures of a magnitude similar to the results reported from the computer vision domain. This may be related to the absence of datasets as large as those available in the related visual research areas. A special focus of this paper is thus set on the performance of neural networks on small datasets, since data availability is still a problem in MIR, but also because many tasks involve the processing of small collections.
In this paper we present a performance evaluation of shallow and deep neural network architectures. These models and the applied method are detailed in Section 2. The evaluation is performed on well-known music genre classification datasets from the domain of Music Information Retrieval. These datasets and the evaluation procedure are described in Section 3. Finally, we draw conclusions from the results in Section 5 and give an outlook on future work.

2 Method

The parallel architectures of the neural networks used in the evaluation are based on the idea of using a time and a frequency pipeline described in [PLS16], which was successfully applied in two evaluation campaigns [LS16a, LS16b]. The system is based on a parallel CNN architecture where separate CNN layers are optimized for processing and recognizing musical relations in the frequency domain and for capturing temporal relations (see Figure 1).

Figure 1: Shallow CNN architecture. Input: 80×80 Mel-spectrogram segment. Frequency pipeline: Conv (16 filters, 10×23, Leaky ReLU) followed by Max-Pooling 1×20. Temporal pipeline: Conv (16 filters, 21×20, Leaky ReLU) followed by Max-Pooling 20×1. The pipelines are merged, followed by Dropout 0.1, a fully connected layer with 200 units and a Softmax output.

The Shallow Architecture: In our adaptation of the CNN architecture described in [PLS16] we use two similar pipelines of CNN layers with 16 filter kernels each, followed by a Max-Pooling layer (see Figure 1). The left pipeline aims at capturing frequency relations using filter kernel sizes of 10×23 and Max-Pooling sizes of 1×20. The resulting 16 vertical rectangular shaped feature map responses of shape 80×4 are intended to capture spectral characteristics of a segment and to reduce the temporal complexity to 4 discrete intervals. The right pipeline uses a filter of size 21×20 and Max-Pooling sizes of 20×1. This results in horizontal rectangular shaped feature maps of shape 4×80, which capture temporal changes in intensity levels of four discrete spectral intervals. The 16 feature maps of each pipeline are flattened to a shape of 1×5120 and merged by concatenation into the shape of 1×10240, which serves as input to a fully connected layer with 200 units and a dropout of 10%.

Figure 2: Deep CNN architecture. Input: 80×80 Mel-spectrogram segment, processed by two parallel pipelines of four convolution/pooling layers each. Frequency pipeline: Conv 16 (10×23), Conv 32 (5×11), Conv 64 (3×5), Conv 128 (2×4), each with Leaky ReLU and Max-Pooling 2×2 (1×5 after the fourth layer). Temporal pipeline: Conv 16 (21×10), Conv 32 (10×5), Conv 64 (5×3), Conv 128 (4×2), each with Leaky ReLU and Max-Pooling 2×2 (5×1 after the fourth layer). The pipelines are merged, followed by Dropout 0.25, a fully connected layer with 200 units (Leaky ReLU) and a Softmax output.

The Deep Architecture: This architecture follows the same principles as the shallow approach. It uses a parallel arrangement of rectangular shaped filters and Max-Pooling windows to capture frequency and temporal relationships at once. But, instead of using the information of the large feature map responses directly, this architecture applies additional CNN and pooling layer pairs (see Figure 2). Thus, more units can be applied to train on the subsequently smaller input feature maps. The first level of the parallel layers is similar to the original approach. It uses filter kernel sizes of 10×23 and 21×10 to capture frequency and temporal relationships. To retain these characteristics, the sizes of the convolutional filter kernels as well as the feature maps are successively halved by the second and third layers. The filter and Max-Pooling sizes of the fourth layer are slightly adapted to have the same rectangular shapes, with one part being rotated by 90°. As in the shallow architecture, the identical sizes of the final feature maps of the parallel model paths balance their influence on the following fully connected layer with 200 units and a 25% dropout rate.
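To make the layer listing of Figure 1 concrete, the following is a minimal sketch of how such a shallow parallel architecture could be assembled with the Keras functional API, which the authors state they used (Section 2.1). The Keras 2-style imports, the channels-last input layout, the helper name shallow_cnn, and placing the L1 regularizer only on the convolution kernels are our assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch (not the original code) of the shallow parallel CNN
# from Figure 1, written with Keras 2-style functional API calls.
from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                          Dropout, Dense, LeakyReLU, concatenate)
from keras.models import Model
from keras.optimizers import Adam
from keras.regularizers import l1

def shallow_cnn(n_classes, input_shape=(80, 80, 1)):
    # 80x80 log-Mel segment, assuming a channels-last data format
    spec = Input(shape=input_shape)

    # frequency pipeline: tall filters, pooling collapses the time axis
    f = Conv2D(16, (10, 23), padding='same',
               kernel_regularizer=l1(0.0001))(spec)
    f = LeakyReLU(alpha=0.3)(f)
    f = MaxPooling2D((1, 20))(f)        # -> 16 feature maps of 80x4
    f = Flatten()(f)                    # -> 5120 values

    # temporal pipeline: wide filters, pooling collapses the frequency axis
    t = Conv2D(16, (21, 20), padding='same',
               kernel_regularizer=l1(0.0001))(spec)
    t = LeakyReLU(alpha=0.3)(t)
    t = MaxPooling2D((20, 1))(t)        # -> 16 feature maps of 4x80
    t = Flatten()(t)                    # -> 5120 values

    merged = concatenate([f, t])        # -> 10240 values
    merged = Dropout(0.1)(merged)
    merged = Dense(200)(merged)
    merged = LeakyReLU(alpha=0.3)(merged)
    out = Dense(n_classes, activation='softmax')(merged)

    model = Model(spec, out)
    # training setup as reported in Section 2.1
    model.compile(optimizer=Adam(lr=0.00005, beta_1=0.9, beta_2=0.999,
                                 epsilon=1e-8),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```

The deep variant of Figure 2 would follow the same pattern, extending each pipeline with the additional convolution/pooling pairs listed above before the merge and using a dropout of 0.25.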
2.1 Training and Predicting Results

In each training epoch, multiple training examples, sampled from the segment-wise log-transformed Mel-spectrogram analysis of all files in the training set, are presented to both pipelines of the neural network. Each of the parallel pipelines of the architectures uses the same 80×80 log-transformed Mel-spectrogram segments as input. These segments have been calculated from a fast Fourier transformed spectrogram using a window size of 1024 samples and an overlap of 50%, covering 0.93 seconds of audio, which is subsequently transformed to the Mel scale and the logarithmic scale. For each song of a dataset, 15 segments have been randomly chosen.

All trainable layers used the Leaky ReLU activation function [MHN13], which is an extension to the ReLU (Rectified Linear Unit) that does not completely cut off activation for negative values, but allows negative values close to zero to pass through. It is defined by adding a coefficient α such that f(x) = αx for x < 0, while keeping f(x) = x for x ≥ 0 as for the ReLU. In our architectures, we apply Leaky ReLU activation with α = 0.3. L1 weight regularization with a penalty of 0.0001 was applied to all trainable parameters. All networks were trained towards the categorical cross-entropy objective using stochastic Adam optimization [KB14] with β1 = 0.9, β2 = 0.999, ε = 1e-08 and a learning rate of 0.00005.

The system is implemented in Python, using librosa [MRL+15] for audio processing and the Mel-log-transforms, and the Theano-based library Keras for deep learning.
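As an illustration of the input preparation described above, a rough librosa-based sketch is given below. The function name, the use of librosa.power_to_db (modern librosa), and the assumption of 44.1 kHz audio with a 512-sample hop (so that an 80-frame segment covers roughly 0.93 s) are ours; the paper does not publish this code.

```python
# Rough sketch of the segment extraction of Section 2.1 (assumptions:
# 44.1 kHz audio, 80 Mel bands, hop of 512 samples, 80-frame segments).
import numpy as np
import librosa

def random_segments(audio_path, n_segments=15, n_mels=80, frames=80,
                    n_fft=1024, hop_length=512, sr=44100, seed=None):
    y, sr = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length,
                                         n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)    # log-transformed Mel spectrogram
    rng = np.random.RandomState(seed)
    starts = rng.randint(0, log_mel.shape[1] - frames, size=n_segments)
    # 15 random 80x80 excerpts per track, as used for training
    return np.stack([log_mel[:, s:s + frames] for s in starts])
```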
Data       Tracks   Classes   Train (wo. au.)   Train (w. au.)   Test
GTZAN       1,000        10           11,250           47,250      3,750
ISMIR G.    1,458         6           16,380           68,796      5,490
Latin       3,227        10           36,240          152,208     12,165
MSD        49,900        15          564,165                —    185,685

Table 1: Overview of the evaluation datasets, their number of classes and their corresponding number of training and test data instances without (wo. au.) and with (w. au.) data augmentation.

2.1.1 Data Augmentation

To increase the number of training instances we experiment with two different audio data augmentation methods. The deformations were applied directly to the audio signal, preceding any further feature calculation procedure described in Section 2.1. The following two methods were applied using the MUDA framework for musical data augmentation [MHB15]:

Time Stretching: slowing down or speeding up the original audio sample while keeping the pitch information unchanged. Time stretching was applied using the multiplication factors 0.5 and 0.2 for slowing down and 1.2 and 1.5 for increasing the tempo.

Pitch Shifting: raising or lowering the pitch of an audio sample while keeping the tempo unchanged. The applied pitch shifting lowered and raised the pitch by 2 and 5 semitones.

For each deformation, three segments have been randomly chosen from the audio content. The combinations of the two deformations with four different factors each thus resulted in 48 additional data instances per audio file.
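A minimal stand-in for these deformations is sketched below using librosa.effects instead of the MUDA pipeline actually employed in the paper. The helper name and the cross-product reading of "combinations of the two deformations" (4 × 4 = 16 deformed versions, times three segments = 48 instances per file) are our interpretation of the description above.

```python
# Sketch of the two deformation families from Section 2.1.1, using
# librosa.effects as a stand-in for the MUDA framework (not the original code).
import librosa

STRETCH_RATES = [0.5, 0.2, 1.2, 1.5]   # tempo factors from Section 2.1.1
PITCH_STEPS = [-5, -2, 2, 5]           # semitone shifts from Section 2.1.1

def deformed_versions(y, sr):
    """Yield the 4 x 4 = 16 deformed signals for one audio file; sampling
    three random segments from each yields the 48 extra training instances."""
    for rate in STRETCH_RATES:
        stretched = librosa.effects.time_stretch(y, rate=rate)
        for steps in PITCH_STEPS:
            yield librosa.effects.pitch_shift(stretched, sr=sr, n_steps=steps)
```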
3 Evaluation

As our system analyzes and predicts multiple audio segments per input file, there are several ways to perform the final prediction of an input instance:

Raw Probability: The raw accuracy of predicting the segments as separate instances, ignoring their file dependencies.

Maximum Probability: The output probabilities of the Softmax layer for the corresponding number of classes of the dataset are summed up for all segments belonging to the same input file. The predicted class is determined by the maximum probability among the classes from the summed probabilities.

Majority Vote: Here, the predictions are made for each segment processed from the audio file as input instance to the network. The class of an audio segment is determined by the maximum probability as output by the Softmax layer for this segment instance. Then, a majority vote is taken on all predicted classes from all segments of the same input file. Majority vote determines the class that occurs most often.

We used stratified 4-fold cross validation. Multi-level stratification was applied, paying special attention to the multiple segments used per file. It was ensured that the files were distributed according to their genre distributions and that no segments of a training file were provided in the corresponding test split.

The experiments were grouped according to the four different datasets. For each dataset, the performances of the shallow and the deep architecture were evaluated, followed by the experiments including data augmentation. The architectures were further evaluated according to their performance after different numbers of training epochs. The networks were trained and evaluated after 100 and 200 epochs without early stopping. Preceding experiments showed that test accuracy could improve despite rising validation loss, though on the smaller sets no significant improvement was recognizable after 200 epochs. For the experiments with data augmentation, the augmented data was only used to train the networks (see Table 1). For testing the networks, the original segments without deformations were used.
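The two file-level aggregation schemes can be written down compactly. The sketch below is our own helper, assuming probs holds one Softmax vector per segment of a single file (shape: number of segments × number of classes); it is not taken from the authors' code.

```python
# Sketch of the file-level aggregation schemes described in Section 3.
import numpy as np

def maximum_probability(probs):
    # sum the per-segment class probabilities, then pick the largest class
    return int(np.argmax(probs.sum(axis=0)))

def majority_vote(probs):
    # classify each segment individually, then take the most frequent class
    segment_classes = np.argmax(probs, axis=1)
    return int(np.bincount(segment_classes).argmax())
```

The raw probability score, in contrast, simply compares np.argmax(probs, axis=1) of every segment against the file label individually, without any aggregation per file.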
3.1 Data Sets

For the evaluation, four datasets have been used. We have chosen these datasets due to their increasing number of tracks and because they are well known and have been extensively evaluated in the automatic genre classification task. This should also provide comparability with experiments reported in the literature.

GTZAN: This dataset was compiled by George Tzanetakis [Tza02] in 2000-2001 and consists of 1000 audio tracks equally distributed over the 10 music genres: blues, classical, country, disco, hiphop, pop, jazz, metal, reggae, and rock.

ISMIR Genre: This dataset has been assembled for training and development in the ISMIR 2004 Genre Classification contest [CGG+06]. It contains 1458 full length audio recordings from Magnatune.com distributed across the 6 genre classes: Classical, Electronic, JazzBlues, MetalPunk, RockPop, and World.

Latin Music Database (LMD): [SKK08] contains 3227 songs, categorized into the 10 Latin music genres Axé, Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja and Tango.

Million Song Dataset (MSD): [BMEWL11] is a collection of one million music pieces that enables methods for large-scale applications. It comes as a collection of metadata such as the song names, artists and albums, together with a set of features extracted with the Echo Nest services, such as loudness, tempo, and MFCC-like features. We used the CD2C genre assignments [Sch15] as ground truth, which are an adaptation of the MSD genre label assignments presented in [SMR12]. For the experiments, a sub-set of approximately 50,000 tracks was sub-sampled.

4 Results

The results of the experiments are provided in Table 2.

D             Model         raw            max            maj           ep
GTZAN         shallow       66.56 (0.69)   78.10 (0.89)   77.80 (0.52)  100
              deep          65.69 (1.23)   78.60 (1.97)   78.00 (2.87)  100
              shallow       67.49 (0.39)   80.80 (1.67)   80.20 (1.68)  200
              deep          66.19 (0.55)   80.60 (2.93)   80.30 (2.87)  200
              shallow aug   66.77 (0.78)   78.90 (2.64)   77.10 (1.19)  100
              deep aug      68.31 (2.68)   81.80 (2.95)   82.20 (2.30)  100
ISMIR Genre   shallow       75.66 (1.30)   85.46 (1.87)   84.77 (1.43)  100
              deep          74.53 (0.52)   84.08 (1.13)   83.95 (0.97)  100
              shallow       75.43 (0.65)   84.91 (1.96)   85.18 (1.27)  200
              deep          74.51 (1.71)   85.12 (0.76)   85.18 (1.23)  200
              shallow aug   76.61 (1.04)   86.90 (0.23)   86.00 (0.54)  100
              deep aug      77.20 (1.14)   87.17 (1.17)   86.75 (1.41)  100
Latin         shallow       79.80 (0.95)   92.44 (0.86)   92.10 (0.97)  100
              deep          81.13 (0.64)   94.42 (1.04)   94.30 (0.81)  100
              shallow       80.64 (0.83)   93.46 (1.13)   92.68 (0.88)  200
              deep          81.06 (0.51)   95.14 (0.40)   94.83 (0.53)  200
              shallow aug   78.09 (0.68)   92.78 (0.88)   92.03 (0.81)  100
              deep aug      83.22 (0.83)   96.03 (0.56)   95.60 (0.58)  100
MSD           shallow       58.20 (0.49)   63.89 (0.81)   63.11 (0.74)  100
              deep          60.60 (0.28)   67.16 (0.64)   66.41 (0.52)  100

Table 2: Experimental results for the evaluation datasets (D) at different numbers of training epochs (ep): mean accuracies and standard deviations of the 4-fold cross-evaluation runs calculated using raw prediction scores (raw), the file based maximum probability (max) and the majority vote approach (maj).

For each dataset, all combinations of experimental results were tested for significant differences using a Wilcoxon signed-rank test. None of the presented results showed a significant difference at p < 0.05. Thus, we tested at the next higher level, p < 0.1. The following observations on the datasets were made:

GTZAN: Training the models with 200 epochs instead of only 100 epochs significantly improved the raw and max accuracies of the shallow models. An additional test with 500 training epochs showed no further increase in accuracy for any of the three prediction methods. Training longer had no effect on the deep model due to early over-fitting. No significant differences were observed between shallow and deep models, except for the raw prediction values of the shallow model (200 epochs) exceeding those of the deep model (200 epochs). While the improvements through data augmentation on deep models compared to the un-augmented, longer trained deep models are not significant, considerable improvements of 4.2% were observed for models trained for the same number of epochs. An interesting observation is the negative effect of data augmentation on the shallow models, where longer training outperformed augmentation.

ISMIR Genre: Training more epochs only had a significant positive effect on the max and maj values of the deep model, but none for the shallow ones. The deep models showed no significant advantage over the shallow architectures, which also showed higher raw prediction values even for the shorter trained models. Data augmentation improved the predictions of both architectures, with significant improvements for the raw values. Especially the deep models significantly profited from data augmentation, with max values increased by 3.08% for models trained for the same number of epochs and 2.05% for the longer trained models. The improvements of deep over shallow models using augmented data were only significant for the raw values.

Latin: Training more epochs only had a positive effect on the raw and max values of the shallow model, but not on the deep architecture. On this dataset, the deep model significantly outperformed the shallow architecture, including the shallow model trained using data augmentation. Data augmentation significantly improved the performance of the deep models by 1.61% for the max values. Similar to the GTZAN dataset, data augmentation showed a degrading effect on the shallow model, which showed significantly higher accuracy values when trained for more epochs.

MSD: An advantage of deep over shallow models was observed, although it was not significant. Experiments using data augmentation and longer training were omitted due to the already large data variance provided by the MSD, which exceeds the preceding datasets in size by factors of 15 to 50.
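The paper does not state which paired values enter the Wilcoxon signed-rank test; a minimal sketch, assuming the fold-wise accuracies of two configurations are compared with SciPy, might look like this (all numbers hypothetical).

```python
# Sketch of the pairwise significance testing mentioned in Section 4,
# assuming per-fold accuracies as the paired values (values hypothetical).
from scipy.stats import wilcoxon

shallow_acc = [75.66, 76.4, 74.9, 75.1]   # hypothetical fold-wise accuracies
deep_acc = [74.53, 75.0, 74.1, 74.5]      # hypothetical fold-wise accuracies

stat, p_value = wilcoxon(shallow_acc, deep_acc)
print(stat, p_value)   # compare p_value against 0.05, then against 0.1
```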
5 Conclusions and Future Work

In this paper we evaluated shallow and deep CNN architectures with respect to their performance on different dataset sizes in music genre classification tasks. Our observations showed that for smaller datasets shallow models seem to be more appropriate, since deeper models showed no significant improvement. Deeper models performed slightly better in the presence of larger datasets, but a clear conclusion that deeper models are generally better could not be drawn. Data augmentation using time stretching and pitch shifting significantly improved the performance of deep models. For shallow models, on the contrary, it showed a negative effect on the small datasets. Thus, deeper models should be considered when applying data augmentation. Comparing the presented results with previously reported evaluations on the same datasets [SR12] shows that the CNN based approaches already outperform handcrafted music features such as the Rhythm Patterns (RP) family [LSC+10] (highest values: GTZAN 73.2%, ISMIR Genre 80.9%, Latin 87.3%) or the Temporal Echonest Features presented in the referred study [SR12] (highest values: GTZAN 66.9%, ISMIR Genre 81.3%, Latin 89.0%).

Future work will focus on further data augmentation methods to improve the performance of neural networks on small datasets and on the Million Song Dataset, as well as on different network architectures.

References

[BMEWL11] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In ISMIR, volume 2, page 10, 2011.

[CGG+06] Pedro Cano, Emilia Gómez, Fabien Gouyon, Perfecto Herrera, Markus Koppenberger, Beesuan Ong, Xavier Serra, Sebastian Streich, and Nicolas Wack. ISMIR 2004 audio description contest. Technical report, 2006.

[FLTZ11] Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang. A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia, 13(2):303–319, 2011.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[LS16a] Thomas Lidy and Alexander Schindler. CQT-based convolutional neural networks for audio scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pages 60–64, September 2016.

[LS16b] Thomas Lidy and Alexander Schindler. Parallel convolutional neural networks for music genre and mood classification. Technical report, Music Information Retrieval Evaluation eXchange (MIREX 2016), August 2016.

[LSC+10] Thomas Lidy, Carlos N. Silla, Olmo Cornelis, Fabien Gouyon, Andreas Rauber, Celso A. A. Kaestner, and Alessandro L. Koerich. On the suitability of state-of-the-art music information retrieval methods for analyzing, categorizing, structuring and accessing non-western and ethnic music collections. Signal Processing, 90(4):1032–1048, 2010.

[MHB15] Brian McFee, Eric J. Humphrey, and Juan P. Bello. A software framework for musical data augmentation. In International Society for Music Information Retrieval Conference (ISMIR), 2015.

[MHN13] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, volume 28, 2013.

[MRL+15] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, 2015.

[PLS16] Jordi Pons, Thomas Lidy, and Xavier Serra. Experimenting with musically motivated convolutional neural networks. In Proceedings of the 14th International Workshop on Content-based Multimedia Indexing (CBMI 2016), Bucharest, Romania, June 2016.

[Sch15] Hendrik Schreiber. Improving genre annotations for the million song dataset. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 2015.

[SKK08] C. N. Silla Jr., Celso A. A. Kaestner, and Alessandro L. Koerich. The Latin Music Database. In Proceedings of the 9th International Conference on Music Information Retrieval, pages 451–456, 2008.

[SLJ+15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[SMR12] Alexander Schindler, Rudolf Mayer, and Andreas Rauber. Facilitating comprehensive benchmarking experiments on the million song dataset. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR 2012), pages 469–474, Porto, Portugal, October 8-12, 2012.

[SR12] Alexander Schindler and Andreas Rauber. Capturing the temporal domain in Echonest features for improved classification effectiveness. In Adaptive Multimedia Retrieval, Lecture Notes in Computer Science, Copenhagen, Denmark, October 24-25, 2012. Springer.

[Tza02] G. Tzanetakis. Manipulation, analysis and retrieval systems for audio signals. PhD thesis, 2002.