        Comparing Shallow versus Deep Neural Network
     Architectures for Automatic Music Genre Classification

                         Alexander Schindler
                   Austrian Institute of Technology
                     Digital Safety and Security
                           Vienna, Austria
                    alexander.schindler@ait.ac.at

                    Thomas Lidy, Andreas Rauber
                   Vienna University of Technology
                   Institute of Software Technology
                           Vienna, Austria
                    lidy,rauber@ifs.tuwien.ac.at



                         Abstract

In this paper we investigate performance differences of neural network architectures on the task of automatic music genre classification. Comparative evaluations on four well-known datasets of different sizes were performed, including the application of two audio data augmentation methods. The results show that shallow network architectures are better suited for small datasets than deeper models, which could be relevant for experiments and applications relying on small datasets. A noticeable advantage was observed through the application of data augmentation with deep models. A final comparison with previous evaluations on the same datasets shows that the presented neural network based approaches already outperform state-of-the-art handcrafted music features.

1    Introduction

Music classification is a well researched topic in Music Information Retrieval (MIR) [FLTZ11]. Generally, its aim is to assign one or multiple labels to a sequence or an entire audio file, which is commonly accomplished in two major steps. First, semantically meaningful audio content descriptors are extracted from the sampled audio signal. Second, a machine learning algorithm is applied, which attempts to discriminate between the classes by finding separating boundaries in the multidimensional feature spaces. Especially the first step requires extensive knowledge and skills in various specific research areas such as audio signal processing, acoustics and/or music theory. Recently, many approaches to MIR problems have been inspired by the remarkable success of Deep Neural Networks (DNN) in the domain of computer vision [KSH12], where deep learning based approaches have already become the de facto standard. The major advantage of DNNs is their feature learning capability, which alleviates the domain knowledge and time intensive task of crafting audio features by hand. Predictions are also made directly on the modeled input representations, which are commonly raw input data such as images, text or audio spectrograms. Recent accomplishments in applying Convolutional Neural Networks (CNN) to audio classification tasks have shown promising results, outperforming conventional approaches in different evaluation campaigns such as the Detection and Classification of Acoustic Scenes and Events (DCASE) [LS16a] and the Music Information Retrieval Evaluation eXchange (MIREX) [LS16b].

An often mentioned paradigm concerning neural networks is that deeper networks are better at modeling non-linear relationships of given tasks [SLJ+15]. So far, preceding MIR experiments and approaches reported in the literature have not explicitly demonstrated an advantage of deep over shallow network architectures of a magnitude similar to results reported from the computer vision domain. This may be related to the absence of datasets as large as those available in the vision-related research areas. A special focus of this paper is thus set on the performance of neural networks on small datasets, since data availability is still a problem in MIR, but also because many tasks involve the processing of small collections. In this paper we present a performance evaluation of shallow and deep neural network architectures. These models and the applied method are detailed in Section 2. The evaluation is performed on well-known music genre classification datasets in the domain of Music Information Retrieval. These datasets and the evaluation procedure are described in Section 3. Finally, we draw conclusions from the results in Section 5 and give an outlook on future work.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: W. Aigner, G. Schmiedl, K. Blumenstein, M. Zeppelzauer (eds.): Proceedings of the 9th Forum Media Technology 2016, St. Pölten, Austria, 24-11-2016, published at http://ceur-ws.org



Figure 1: Shallow CNN architecture. (Diagram: an 80×80 input feeds two parallel pipelines, Conv-16-10×23-LeakyReLU with Max-Pooling 1×20 and Conv-16-21×20-LeakyReLU with Max-Pooling 20×1; the pipelines are merged, followed by Dropout 0.1, a fully connected layer with 200 units and a Softmax output.)

Figure 2: Deep CNN architecture. (Diagram: an 80×80 input feeds two parallel pipelines of four Conv-LeakyReLU/Max-Pooling pairs. Layer 1: Conv-16-10×23 / Conv-16-21×10 with Max-Pooling 2×2; Layer 2: Conv-32-5×11 / Conv-32-10×5 with Max-Pooling 2×2; Layer 3: Conv-64-3×5 / Conv-64-5×3 with Max-Pooling 2×2; Layer 4: Conv-128-2×4 / Conv-128-4×2 with Max-Pooling 1×5 / 5×1. The pipelines are merged, followed by Dropout 0.25, a fully connected layer with 200 units (LeakyReLU) and a Softmax output.)

2    Method

The parallel architectures of the neural networks used in the evaluation are based on the idea of using a time and a frequency pipeline described in [PLS16], which was successfully applied in two evaluation campaigns [LS16a, LS16b]. The system is based on a parallel CNN architecture in which separate CNN layers are optimized for processing and recognizing musical relations in the frequency domain and for capturing temporal relations (see Figure 1).
The Shallow Architecture: In our adaptation of the CNN architecture described in [PLS16] we use two similar pipelines of CNN layers with 16 filter kernels each, followed by a Max-Pooling layer (see Figure 1). The left pipeline aims at capturing frequency relations using filter kernels of size 10×23 and Max-Pooling of size 1×20. The resulting 16 vertically rectangular feature map responses of shape 80×4 are intended to capture the spectral characteristics of a segment and to reduce the temporal complexity to 4 discrete intervals. The right pipeline uses filters of size 21×20 and Max-Pooling of size 20×1. This results in horizontally rectangular feature maps of shape 4×80, which capture temporal changes in intensity levels over four discrete spectral intervals. The 16 feature maps of each pipeline are flattened to a shape of 1×5120 and merged by concatenation into a shape of 1×10240, which serves as input to a fully connected layer with 200 units and a dropout rate of 10%.
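For illustration, the shallow model can be sketched in a few lines of Keras (the library used for the implementation, see Section 2.1). This is a hypothetical re-implementation: 'same' convolution padding and a channels-last input layout are assumptions made to reproduce the feature map shapes stated above, and n_classes depends on the dataset.

    # Minimal sketch of the shallow architecture; 'same' padding and a
    # channels-last input layout are assumptions, not stated in the paper.
    from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                              concatenate, Dropout, Dense, LeakyReLU)
    from keras.models import Model

    n_classes = 10                            # dataset dependent, e.g. GTZAN

    inp = Input(shape=(80, 80, 1))            # one 80x80 log-Mel segment

    # frequency pipeline: 10x23 kernels, pooled to 80x4x16 = 5120 values
    f = Conv2D(16, (10, 23), padding='same')(inp)
    f = LeakyReLU(alpha=0.3)(f)
    f = MaxPooling2D(pool_size=(1, 20))(f)

    # temporal pipeline: 21x20 kernels, pooled to 4x80x16 = 5120 values
    t = Conv2D(16, (21, 20), padding='same')(inp)
    t = LeakyReLU(alpha=0.3)(t)
    t = MaxPooling2D(pool_size=(20, 1))(t)

    # concatenation to 1x10240, dropout, fully connected layer, softmax
    m = concatenate([Flatten()(f), Flatten()(t)])
    m = Dropout(0.1)(m)
    m = LeakyReLU(alpha=0.3)(Dense(200)(m))
    out = Dense(n_classes, activation='softmax')(m)
    shallow_model = Model(inp, out)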
The Deep Architecture: This architecture follows the same principles as the shallow approach. It uses a parallel arrangement of rectangular filters and Max-Pooling windows to capture frequency and temporal relationships at once. But, instead of using the information of the large feature map responses directly, this architecture applies additional pairs of CNN and pooling layers (see Figure 2). Thus, more units can be trained on the subsequently smaller input feature maps. The first level of the parallel layers is similar to the original approach, using filter kernels of sizes 10×23 and 21×10 to capture frequency and temporal relationships. To retain these characteristics, the sizes of the convolutional filter kernels as well as the feature maps are subsequently halved by the second and third layers. The filter and Max-Pooling sizes of the fourth layer are slightly adapted such that both parts have the same rectangular shapes, with one rotated by 90°. As in the shallow architecture, the identical sizes of the final feature maps of the parallel model paths balance their influence on the following fully connected layer with 200 units and a 25% dropout rate.
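A corresponding sketch of the deep model, under the same assumptions as above, stacks four convolution/pooling pairs per pipeline; with the sizes from Figure 2, both pipelines end in feature maps of 2,560 values:

    # Compact sketch of the deep architecture (hypothetical re-implementation;
    # 'same' padding and channels-last input assumed, as in the shallow sketch)
    from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                              concatenate, Dropout, Dense, LeakyReLU)
    from keras.models import Model

    # (filters, kernel size, pooling size) per layer pair, from Figure 2
    FREQ = [(16, (10, 23), (2, 2)), (32, (5, 11), (2, 2)),
            (64, (3, 5), (2, 2)), (128, (2, 4), (1, 5))]
    TIME = [(16, (21, 10), (2, 2)), (32, (10, 5), (2, 2)),
            (64, (5, 3), (2, 2)), (128, (4, 2), (5, 1))]

    def pipeline(x, spec):
        # four Conv/LeakyReLU/Max-Pooling pairs; both paths end in 2560 values
        for n_filters, kernel, pool in spec:
            x = Conv2D(n_filters, kernel, padding='same')(x)
            x = LeakyReLU(alpha=0.3)(x)
            x = MaxPooling2D(pool_size=pool)(x)
        return Flatten()(x)

    n_classes = 10
    inp = Input(shape=(80, 80, 1))
    m = concatenate([pipeline(inp, FREQ), pipeline(inp, TIME)])
    m = Dropout(0.25)(m)
    m = LeakyReLU(alpha=0.3)(Dense(200)(m))
    out = Dense(n_classes, activation='softmax')(m)
    deep_model = Model(inp, out)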
2.1   Training and Predicting Results

In each training epoch, multiple training examples, sampled from the segment-wise log-transformed Mel-spectrogram analysis of all files in the training set, are presented to both pipelines of the neural network. Each of the parallel pipelines uses the same 80×80 log-transformed Mel-spectrogram segments as input. These segments were calculated from 0.93 seconds of audio via a fast Fourier transform with a window size of 1024 samples and an overlap of 50%, subsequently transformed to the Mel scale and log scale. For each song of a dataset, 15 segments were randomly chosen.
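The segmentation step could be sketched with librosa as follows; the 44.1 kHz sample rate and the hop size of 512 samples (50% of the 1024-sample window) are assumptions under which 80 frames span roughly 0.93 seconds:

    # Sketch of the segment extraction (hypothetical helper function)
    import numpy as np
    import librosa

    def extract_segments(path, n_segments=15, n_frames=80, n_mels=80):
        # assumes 44.1 kHz audio: 80 frames x 512-sample hop ~ 0.93 seconds
        y, sr = librosa.load(path, sr=44100)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                             hop_length=512, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)    # log-transform the Mel bands
        # randomly pick n_segments windows of n_frames consecutive frames
        starts = np.random.randint(0, log_mel.shape[1] - n_frames, n_segments)
        return np.stack([log_mel[:, s:s + n_frames] for s in starts])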
All trainable layers used the Leaky ReLU activation function [MHN13], an extension of the ReLU (Rectified Linear Unit) that does not completely cut off activation for negative values but lets negative values close to zero pass through. It is defined as f(x) = αx for x < 0, while keeping f(x) = x for x ≥ 0 as in the ReLU. In our architectures, we apply Leaky ReLU activation with α = 0.3. L1 weight regularization with a penalty of 0.0001 was applied to all trainable parameters. All networks were trained towards a categorical cross-entropy objective using stochastic Adam optimization [KB14] with β1 = 0.9, β2 = 0.999, ε = 1e-08 and a learning rate of 0.00005.

The system is implemented in Python, using librosa [MRL+15] for audio processing and Mel-log-transforms and the Theano-based library Keras for deep learning.



2.1.1   Data Augmentation

To increase the number of training instances, we experiment with two different audio data augmentation methods. The deformations were applied directly to the audio signal, preceding the feature calculation procedure described in Section 2.1. The following two methods were applied using the MUDA framework for musical data augmentation [MHB15]:

Time Stretching: slowing down or speeding up the original audio sample while keeping the pitch information unchanged. Time stretching was applied with the multiplication factors 0.5 and 0.2 for slowing down and 1.2 and 1.5 for increasing the tempo.

Pitch Shifting: raising or lowering the pitch of an audio sample while keeping the tempo unchanged. The applied pitch shifting lowered and raised the pitch by 2 and 5 semitones.

For each deformation, three segments were randomly chosen from the audio content. The combinations of the two deformations with four different factors each thus resulted in 48 additional data instances per audio file.
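A hypothetical sketch of this augmentation step with MUDA is given below; the deformer names follow the MUDA documentation, and the file names are placeholders:

    # Hypothetical sketch of the MUDA-based augmentation described above
    import muda

    # 4 time-stretch and 4 pitch-shift factors; MUDA's pipeline enumerates
    # the 16 combinations, from which segments are then sampled as described
    pipeline = muda.Pipeline(steps=[
        ('stretch', muda.deformers.TimeStretch(rate=[0.5, 0.2, 1.2, 1.5])),
        ('shift', muda.deformers.PitchShift(n_semitones=[-5, -2, 2, 5]))])

    jam = muda.load_jam_audio('track.jams', 'track.mp3')   # placeholder files
    for i, aug in enumerate(pipeline.transform(jam)):
        muda.save('track_aug_{}.ogg'.format(i),
                  'track_aug_{}.jams'.format(i), aug)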
                                                                       chosen these datasets due to their increasing number of
3    Evaluation

As our system analyzes and predicts multiple audio segments per input file, there are several ways to perform the final prediction of an input instance:

Raw Probability: The raw accuracy of predicting the segments as separate instances, ignoring their file dependencies.

Maximum Probability: The output probabilities of the Softmax layer for the corresponding number of classes of the dataset are summed up over all segments belonging to the same input file. The predicted class is determined by the maximum among the summed class probabilities.

Majority Vote: Here, predictions are made for each segment processed from the audio file as an input instance to the network. The class of an audio segment is determined by the maximum probability output by the Softmax layer for this segment instance. Then, a majority vote is taken over the predicted classes of all segments of the same input file; the class that occurs most often is chosen.
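The two file-based decision rules can be summarized in a few lines of NumPy, operating on an array seg_probs of shape (number of segments, number of classes) that holds the Softmax outputs of all segments of one file (raw probability requires no aggregation):

    # Sketch of the two file-level decision rules
    import numpy as np

    def maximum_probability(seg_probs):
        # sum the Softmax outputs of all segments of a file, take the maximum
        return np.argmax(seg_probs.sum(axis=0))

    def majority_vote(seg_probs):
        # classify each segment on its own, return the most frequent class
        votes = np.argmax(seg_probs, axis=1)
        return np.bincount(votes).argmax()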
                                                                       large-scale applications. It comes as a collection of meta-
   We used stratified 4-fold cross validation. Multi-level stratification was applied, paying special attention to the multiple segments used per file. It was ensured that the files were distributed according to their genre distributions and that no segment of a training file was provided in the corresponding test split.
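A sketch of this splitting scheme using scikit-learn; the arrays are hypothetical placeholders illustrating the file-to-segment expansion:

    # Sketch of the multi-level stratification: folds are formed on whole
    # files (stratified by genre) and then expanded to the files' segments
    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    n_files, n_segments = 1000, 15                  # e.g. GTZAN
    file_genres = np.repeat(np.arange(10), 100)     # one genre label per file
    segment_file = np.repeat(np.arange(n_files), n_segments)

    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    for train_files, test_files in skf.split(np.zeros(n_files), file_genres):
        # no segment of a training file can appear in the test split
        train_idx = np.flatnonzero(np.in1d(segment_file, train_files))
        test_idx = np.flatnonzero(np.in1d(segment_file, test_files))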
   The experiments were grouped according to the four different datasets. For each dataset, the performances of the shallow and deep architectures were evaluated, followed by the experiments including data augmentation. The architectures were further evaluated according to their performance after different numbers of training epochs. The networks were trained and evaluated after 100 and 200 epochs without early stopping. Preceding experiments showed that test accuracy could improve despite rising validation loss, though on the smaller sets no significant improvement was recognizable after 200 epochs. For the experiments with data augmentation, the augmented data was only used to train the networks (see Table 1). For testing the networks, the original segments without deformations were used.

3.1   Data Sets

                                     Train
    Dataset     Tracks   cls     wo. au.     w. au.      Test
    GTZAN        1,000    10      11,250     47,250     3,750
    ISMIR G.     1,458     6      16,380     68,796     5,490
    Latin        3,227    10      36,240    152,208    12,165
    MSD         49,900    15     564,165          —   185,685

Table 1: Overview of the evaluation datasets, their number of classes (cls) and their corresponding number of training and test data instances without (wo. au.) and with (w. au.) data augmentation.

For the evaluation, four data sets have been used. We have chosen these datasets due to their increasing numbers of tracks and because they are well known and extensively evaluated in the automatic genre classification task. This should also provide comparability with experiments reported in the literature.

GTZAN: This data set was compiled by George Tzanetakis [Tza02] in 2000-2001 and consists of 1000 audio tracks equally distributed over the 10 music genres blues, classical, country, disco, hiphop, pop, jazz, metal, reggae, and rock.

ISMIR Genre: This data set was assembled for training and development in the ISMIR 2004 Genre Classification contest [CGG+06]. It contains 1458 full length audio recordings from Magnatune.com distributed across the 6 genre classes Classical, Electronic, JazzBlues, MetalPunk, RockPop, and World.

Latin Music Database (LMD): [SKK08] contains 3227 songs, categorized into the 10 Latin music genres Axé, Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja and Tango.

Million Song Dataset (MSD): [BMEWL11] a collection of one million music pieces, which enables methods for large-scale applications. It comes as a collection of metadata such as song names, artists and albums, together with a set of features extracted with The Echo Nest services, such as loudness, tempo, and MFCC-like features. We used the CD2C genre assignments [Sch15] as ground truth, which are an adaptation of the MSD genre label assignments presented in [SMR12]. For the experiments, a subset of approximately 50,000 tracks was sub-sampled.



4    Results

The results of the experiments are provided in Table 2. For each dataset, all combinations of experimental results were tested for significant difference using a Wilcoxon signed-rank test. None of the presented results showed a significant difference at p < 0.05. Thus, we tested at the next higher level, p < 0.1.

    D            Model         raw            max            maj            ep
    GTZAN        shallow       66.56 (0.69)   78.10 (0.89)   77.80 (0.52)   100
                 deep          65.69 (1.23)   78.60 (1.97)   78.00 (2.87)   100
                 shallow       67.49 (0.39)   80.80 (1.67)   80.20 (1.68)   200
                 deep          66.19 (0.55)   80.60 (2.93)   80.30 (2.87)   200
                 shallow aug   66.77 (0.78)   78.90 (2.64)   77.10 (1.19)   100
                 deep aug      68.31 (2.68)   81.80 (2.95)   82.20 (2.30)   100
    ISMIR Genre  shallow       75.66 (1.30)   85.46 (1.87)   84.77 (1.43)   100
                 deep          74.53 (0.52)   84.08 (1.13)   83.95 (0.97)   100
                 shallow       75.43 (0.65)   84.91 (1.96)   85.18 (1.27)   200
                 deep          74.51 (1.71)   85.12 (0.76)   85.18 (1.23)   200
                 shallow aug   76.61 (1.04)   86.90 (0.23)   86.00 (0.54)   100
                 deep aug      77.20 (1.14)   87.17 (1.17)   86.75 (1.41)   100
    Latin        shallow       79.80 (0.95)   92.44 (0.86)   92.10 (0.97)   100
                 deep          81.13 (0.64)   94.42 (1.04)   94.30 (0.81)   100
                 shallow       80.64 (0.83)   93.46 (1.13)   92.68 (0.88)   200
                 deep          81.06 (0.51)   95.14 (0.40)   94.83 (0.53)   200
                 shallow aug   78.09 (0.68)   92.78 (0.88)   92.03 (0.81)   100
                 deep aug      83.22 (0.83)   96.03 (0.56)   95.60 (0.58)   100
    MSD          shallow       58.20 (0.49)   63.89 (0.81)   63.11 (0.74)   100
                 deep          60.60 (0.28)   67.16 (0.64)   66.41 (0.52)   100

Table 2: Experimental results for the evaluation datasets (D) at different numbers of training epochs (ep): mean accuracies and standard deviations of the 4-fold cross-validation runs, calculated using raw prediction scores (raw), the file-based maximum probability approach (max) and the majority vote approach (maj).
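Such a comparison can be reproduced with SciPy's implementation of the test; the per-fold accuracies below are hypothetical placeholders:

    # Sketch of the pairwise significance test over the four fold accuracies
    from scipy.stats import wilcoxon

    shallow_folds = [78.1, 79.0, 77.4, 78.9]   # hypothetical per-fold values
    deep_folds = [78.6, 80.1, 76.9, 79.5]
    stat, p = wilcoxon(shallow_folds, deep_folds)
    # note: with only four paired folds, the achievable p-values are coarse
    significant = p < 0.1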
The following observations on the datasets were made:

GTZAN: Training the models for 200 instead of only 100 epochs significantly improved the raw and max accuracies of the shallow models. An additional test with 500 training epochs showed no further increase in accuracy for any of the three prediction methods. Training longer had no effect on the deep model due to early over-fitting. No significant differences were observed between shallow and deep models, except for the raw prediction values of the shallow model (200 epochs) exceeding those of the deep model (200 epochs). While the improvements through data augmentation on deep models compared to the un-augmented, longer-trained deep models are not significant, considerable improvements of 4.2% were observed for models trained for the same number of epochs. An interesting observation is the negative effect of data augmentation on the shallow models, where longer training outperformed augmentation.
ISMIR Genre: Training more epochs only had a significant positive effect on the max and maj values of the deep model, but none on those of the shallow one. The deep models showed no significant advantage over the shallow architectures, which even showed higher raw prediction values for the shorter-trained models. Data augmentation improved the predictions of both architectures, with significant improvements for the raw values. Especially the deep models significantly profited from data augmentation, with max values increased by 3.08% for models trained for the same number of epochs and by 2.05% for the longer-trained models. The improvements of deep over shallow models using augmented data were only significant for the raw values.

Latin: Training more epochs only had a positive effect on the raw and max values of the shallow model, but not on the deep architecture. On this dataset, the deep model significantly outperformed the shallow architecture, including the shallow model trained using data augmentation. Data augmentation significantly improved the performance of the deep models by 1.61% for the max values. Similar to the GTZAN dataset, data augmentation had a degrading effect on the shallow model, which showed significantly higher accuracy values when trained for more epochs instead.

MSD: A non-significant advantage of deep over shallow models was observed. Experiments using data augmentation and longer training were omitted due to the already large variance provided by the MSD, which exceeds the preceding datasets by factors of 15 to 50.

5    Conclusions and Future Work

In this paper we evaluated shallow and deep CNN architectures towards their performance on different dataset sizes in music genre classification tasks. Our observations showed that for smaller datasets shallow models seem to be more appropriate, since deeper models showed no significant improvement. Deeper models performed slightly better in the presence of larger datasets, but a clear conclusion that deeper models are generally better could not be drawn. Data augmentation using time stretching and pitch shifting significantly improved the performance of the deep models. For the shallow models, on the contrary, it showed a negative effect on the small datasets. Thus, deeper models should be considered when applying data augmentation. Comparing the presented results with previously reported evaluations on the same datasets [SR12] shows that the CNN based approaches already outperform state-of-the-art handcrafted music features such as the Rhythm Patterns (RP) family [LSC+10] (highest values: GTZAN 73.2%, ISMIR Genre 80.9%, Latin 87.3%) or the Temporal Echonest Features presented in the referenced study [SR12] (highest values: GTZAN 66.9%, ISMIR Genre 81.3%, Latin 89.0%).

   Future work will focus on further data augmentation methods to improve the performance of neural networks on small datasets and on the Million Song Dataset, as well as on different network architectures.



References

[BMEWL11] Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In ISMIR, volume 2, page 10, 2011.

[CGG+06] Pedro Cano, Emilia Gómez, Fabien Gouyon, Perfecto Herrera, Markus Koppenberger, Beesuan Ong, Xavier Serra, Sebastian Streich, and Nicolas Wack. ISMIR 2004 audio description contest. Technical report, 2006.

[FLTZ11] Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang. A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia, 13(2):303–319, 2011.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[LS16a] Thomas Lidy and Alexander Schindler. CQT-based convolutional neural networks for audio scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pages 60–64, September 2016.

[LS16b] Thomas Lidy and Alexander Schindler. Parallel convolutional neural networks for music genre and mood classification. Technical report, Music Information Retrieval Evaluation eXchange (MIREX 2016), August 2016.

[LSC+10] Thomas Lidy, Carlos N. Silla, Olmo Cornelis, Fabien Gouyon, Andreas Rauber, Celso A. A. Kaestner, and Alessandro L. Koerich. On the suitability of state-of-the-art music information retrieval methods for analyzing, categorizing, structuring and accessing non-western and ethnic music collections. Signal Processing, 90(4):1032–1048, 2010.

[MHB15] Brian McFee, Eric J Humphrey, and Juan P Bello. A software framework for musical data augmentation. In International Society for Music Information Retrieval Conference (ISMIR), 2015.

[MHN13] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML 2013, volume 28, 2013.

[MRL+15] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, 2015.

[PLS16] Jordi Pons, Thomas Lidy, and Xavier Serra. Experimenting with musically motivated convolutional neural networks. In Proceedings of the 14th International Workshop on Content-based Multimedia Indexing (CBMI 2016), Bucharest, Romania, June 2016.

[Sch15] Hendrik Schreiber. Improving genre annotations for the million song dataset. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 2015.

[SKK08] Carlos N. Silla Jr., Celso A. A. Kaestner, and Alessandro L. Koerich. The Latin Music Database. In Proceedings of the 9th International Conference on Music Information Retrieval, pages 451–456, 2008.

[SLJ+15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[SMR12] Alexander Schindler, Rudolf Mayer, and Andreas Rauber. Facilitating comprehensive benchmarking experiments on the million song dataset. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR 2012), pages 469–474, Porto, Portugal, October 8-12 2012.

[SR12] Alexander Schindler and Andreas Rauber. Capturing the temporal domain in echonest features for improved classification effectiveness. In Adaptive Multimedia Retrieval, Lecture Notes in Computer Science, Copenhagen, Denmark, October 24-25 2012. Springer.

[Tza02] G. Tzanetakis. Manipulation, analysis and retrieval systems for audio signals. PhD thesis, 2002.
