        Comparing Shallow versus Deep Neural Network
     Architectures for Automatic Music Genre Classification

                        Alexander Schindler                                  Thomas Lidy, Andreas Rauber
                  Austrian Institute of Technology                          Vienna University of Technology
                    Digital Safety and Security                             Institute of Software Technology
                          Vienna, Austria                                             Vienna, Austria
                   alexander.schindler@ait.ac.at                               lidy,rauber@ifs.tuwien.ac.at

                         Abstract

     In this paper we investigate performance differences between neural network
     architectures on the task of automatic music genre classification. Comparative
     evaluations on four well known datasets of different sizes were performed, including
     the application of two audio data augmentation methods. The results show that shallow
     network architectures are better suited for small datasets than deeper models, which
     could be relevant for experiments and applications which rely on small datasets. A
     noticeable advantage was observed through the application of data augmentation using
     deep models. A final comparison with previous evaluations on the same datasets shows
     that the presented neural network based approaches already outperform state-of-the-art
     handcrafted music features.

1    Introduction

Music classification is a well researched topic in Music Information Retrieval (MIR) [FLTZ11].
Generally, its aim is to assign one or multiple labels to a sequence or an entire audio file,
which is commonly accomplished in two major steps. First, semantically meaningful audio content
descriptors are extracted from the sampled audio signal. Second, a machine learning algorithm is
applied, which attempts to discriminate between the classes by finding separating boundaries in
the multidimensional feature-spaces. Especially the first step requires extensive knowledge and
skills in various specific research areas such as audio signal processing, acoustics and/or
music theory. Recently many approaches to MIR problems have been inspired by the remarkable
success of Deep Neural Networks (DNN) in the domain of computer vision [KSH12], where deep
learning based approaches have already become the de facto standard. The major advantage of
DNNs is their feature learning capability, which alleviates the domain knowledge and time
intensive task of crafting audio features by hand. Predictions are also made directly on the
modeled input representations, which are commonly raw input data such as images, text or audio
spectrograms. Recent accomplishments in applying Convolutional Neural Networks (CNN) to audio
classification tasks have shown promising results by outperforming conventional approaches in
different evaluation campaigns such as the Detection and Classification of Acoustic Scenes and
Events (DCASE) [LS16a] and the Music Information Retrieval Evaluation EXchange (MIREX) [LS16b].

An often mentioned paradigm concerning neural networks is that deeper networks are better at
modeling non-linear relationships of given tasks [SLJ+15]. So far, preceding MIR experiments and
approaches reported in literature have not explicitly demonstrated the advantage of deep over
shallow network architectures in a magnitude similar to results reported from the computer
vision domain. This may be related to the absence of datasets as large as those available in
the visual related research areas. A special focus of this paper is thus set on the performance
of neural networks on small datasets, since data availability is still a problem in MIR, but
also because many tasks involve the processing of small collections. In this paper we present a
performance evaluation of shallow and deep neural network architectures. These models and the
applied method will be detailed in Section 2. The evaluation will be performed on well known
music genre
2    Method

The parallel architectures of the neural networks used in the evaluation are based on the idea
of using a time and a frequency pipeline described in [PLS16], which was successfully applied in
two evaluation campaigns [LS16a, LS16b]. The system is based on a parallel CNN architecture
where separate CNN Layers are optimized for processing and recognizing music relations in the
frequency domain and to capture temporal relations (see Figure 1).

    Input 80 x 80

              Left pipeline                       Right pipeline
    Layer 1   Conv - 16 - 10 x 23 - LeakyReLU     Conv - 16 - 21 x 20 - LeakyReLU
              Max-Pooling 1 x 20                  Max-Pooling 20 x 1

    Dropout 0.1
    Fully connected 200
    Softmax

                 Figure 1: Shallow CNN architecture

    Input 80 x 80

              Left pipeline                       Right pipeline
    Layer 1   Conv - 16 - 10 x 23 - LeakyReLU     Conv - 16 - 21 x 10 - LeakyReLU
              Max-Pooling 2 x 2                   Max-Pooling 2 x 2
    Layer 2   Conv - 32 - 5 x 11 - LeakyReLU      Conv - 32 - 10 x 5 - LeakyReLU
              Max-Pooling 2 x 2                   Max-Pooling 2 x 2
    Layer 3   Conv - 64 - 3 x 5 - LeakyReLU       Conv - 64 - 5 x 3 - LeakyReLU
              Max-Pooling 2 x 2                   Max-Pooling 2 x 2
    Layer 4   Conv - 128 - 2 x 4 - LeakyReLU      Conv - 128 - 4 x 2 - LeakyReLU
              Max-Pooling 1 x 5                   Max-Pooling 5 x 1

    merge
    Dropout 0.25
    Fully connected 200 - LeakyReLU
    Softmax

                 Figure 2: Deep CNN architecture
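
The feature map sizes implied by the layer stacks in Figures 1 and 2 can be checked with a little shape arithmetic. The following is only a sketch and not the authors' code. It assumes stride-1 convolutions with "same" padding (so the kernel size does not change the map size) and non-overlapping max pooling; the function names are hypothetical. The merged size for the deep model is derived under these assumptions and is not a figure reported in the paper.

```python
def conv_pool_shape(shape, pool):
    """Map size after a 'same'-padded convolution and non-overlapping max pooling."""
    rows, cols = shape
    return (rows // pool[0], cols // pool[1])


def pipeline_shape(input_shape, layers):
    """layers is a list of (n_filters, pool_size); returns (n_filters, rows, cols)."""
    shape = input_shape
    filters = 0
    for filters, pool in layers:
        shape = conv_pool_shape(shape, pool)
    return (filters,) + shape


def flat_size(feature_maps):
    """Number of units after flattening all feature maps of one pipeline."""
    filters, rows, cols = feature_maps
    return filters * rows * cols


INPUT = (80, 80)  # Mel bands x time frames

# Shallow model (Figure 1): one conv/pool pair per pipeline.
shallow_left = pipeline_shape(INPUT, [(16, (1, 20))])
shallow_right = pipeline_shape(INPUT, [(16, (20, 1))])
shallow_merged = flat_size(shallow_left) + flat_size(shallow_right)

# Deep model (Figure 2): four conv/pool pairs per pipeline.
deep_left = pipeline_shape(INPUT, [(16, (2, 2)), (32, (2, 2)), (64, (2, 2)), (128, (1, 5))])
deep_right = pipeline_shape(INPUT, [(16, (2, 2)), (32, (2, 2)), (64, (2, 2)), (128, (5, 1))])
deep_merged = flat_size(deep_left) + flat_size(deep_right)

print("shallow:", shallow_left, shallow_right, shallow_merged)
print("deep:   ", deep_left, deep_right, deep_merged)
```

Under these assumptions both pipelines of a model contribute the same number of units to the merged vector.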

The Shallow Architecture: In our adaption of the CNN architecture described in [PLS16] we use
two similar pipelines of CNN Layers with 16 filter kernels each followed by a Max Pooling layer
(see Figure 1). The left pipeline aims at capturing frequency relations using filter kernel
sizes of 10×23 and Max Pooling sizes of 1×20. The resulting 16 vertical rectangular shaped
feature map responses of shape 80×4 are intended to capture spectral characteristics of a
segment and to reduce the temporal complexity to 4 discrete intervals. The right pipeline uses a
filter of size 21×20 and Max Pooling sizes of 20×1. This results in horizontal rectangular
shaped feature maps of shape 4×80, which capture temporal changes in intensity levels of four
discrete spectral intervals. The 16 feature maps of each pipeline are flattened to a shape of
1×5120 and merged by concatenation into the shape of 1×10240, which serves as input to a
fully connected layer with 200 units and a dropout of 10%.

The Deep Architecture: This architecture follows the same principles as the shallow approach.
It uses a parallel arrangement of rectangular shaped filters and Max-Pooling windows to capture
frequency and temporal relationships at once. But, instead of using the information of the large
feature map responses, this architecture applies additional CNN and pooling layer pairs (see
Figure 2). Thus, more units can be applied to train on the subsequent smaller input feature
maps. The first level of the parallel layers is similar to the original approach. It uses filter
kernel sizes of 10×23 and 21×10 to capture frequency and temporal relationships. To retain
these characteristics the sizes of the convolutional filter kernels as well as the feature maps
are subsequently divided in halves by the second and third layers. The filter and Max Pooling
sizes of the fourth layer are slightly adapted to have the same rectangular shapes with one part
being rotated by 90°. As in the shallow architecture, the same sizes of the final feature maps
of the parallel model paths balance their influences on the following fully connected layer with
200 units and a 25% dropout rate.

2.1   Training and Predicting Results

In each training epoch, multiple training examples sampled from the segment-wise log-transformed
Mel-spectrogram analysis of all files in the training set are presented to both pipelines of the
neural network. Each of the parallel pipelines of the architectures uses the same 80 × 80
log-transformed Mel-spectrogram segments as input. These segments have been calculated from a
fast Fourier transformed spectrogram of 0.93 seconds of audio, using a window size of 1024
samples and an overlap of 50%, which was subsequently transformed into Mel scale and Log scale.
For each song of a dataset 15 segments have been randomly chosen.
   All trainable layers used the Leaky ReLU activation function [MHN13], which is an extension
to the ReLU (Rectified Linear Unit) that does not completely cut off activation for negative
values, but allows for negative values close to zero to pass through. It is defined by adding a
coefficient α in f(x) = αx, for x < 0, while keeping f(x) = x, for x ≥ 0 as for the ReLU. In
our architectures, we apply Leaky ReLU activation with α = 0.3. L1 weight regularization with a
penalty of 0.0001 was applied to all trainable parameters. All networks were trained towards a
categorical-crossentropy objective using the stochastic Adam optimization [KB14] with
beta1 = 0.9, beta2 = 0.999, epsilon = 1e-08 and a learning rate of 0.00005.
   The system is implemented in Python, using librosa

[MRL+15] for audio processing and Mel-log-transforms and the Theano-based library Keras for
Deep Learning.

2.1.1   Data Augmentation

To increase the number of training instances we experiment with two different audio data
augmentation methods. The deformations were applied directly to the audio signal preceding any
further feature calculation procedure described in Section 2.1. The following two methods were
applied using the MUDA framework for musical data augmentation [MHB15]:

Time Stretching: slowing down or speeding up the original audio sample while keeping the same
pitch information. Time stretching was applied using the multiplication factors 0.5, 0.2 for
slowing down and 1.2, 1.5 for increasing the tempo.
Pitch Shifting: raising or lowering the pitch of an audio sample while keeping the tempo
unchanged. The applied pitch shifting lowered and raised the pitch by 2 and 5 semitones.

For each deformation three segments have been randomly chosen from the audio content. The
combinations of the two deformations with four different factors each thus resulted in 48
additional data instances per audio file.

3    Evaluation

As our system analyzes and predicts multiple audio segments per input file, there are several
ways to perform the final prediction of an input instance:

Raw Probability: The raw accuracy of predicting the segments as separated instances ignoring
their file

Maximum Probability: The output probabilities of the Softmax layer for the corresponding number
of classes of the datasets are summed up for all segments belonging to the same input file. The
predicted class is determined by the maximum probability among the classes from the summed
probabilities.
Majority Vote: Here, the predictions are made for each segment processed from the audio file as
input instance to the network. The class of an audio segment is determined by the maximum
probability as output by the Softmax layer for this segment instance. Then, a majority vote is
taken on all predicted classes from all segments of the same input file. Majority vote
determines the class that occurs most often.

   We used stratified 4-fold cross validation. Multi-level stratification was applied paying
special attention to the multiple segments used per file. It was ensured that the files were
distributed according to their genre distributions and that no segments of a training file were
provided in the corresponding test split.
   The experiments were grouped according to the four different datasets. For each dataset the
performances for the shallow and deep architecture were evaluated, followed by the experiments
including data augmentation. The architectures were further evaluated according to their
performance after a different number of training epochs. The networks were trained and evaluated
after 100 and 200 epochs without early stopping. Preceding experiments showed that test accuracy
could improve despite rising validation loss, though on smaller sets no significant improvement
was recognizable after 200 epochs. For the experiments with data augmentation, the augmented
data was only used to train the networks (see Table 1). For testing the network the original
segments without deformations were used.

3.1   Data Sets

For the evaluation four data sets have been used. We have chosen these datasets due to their
increasing number of tracks and because they are well known and extensively evaluated in the
automatic genre classification task. This should also provide comparability with experiments
reported in literature.

GTZAN: This data set was compiled by George Tzanetakis [Tza02] in 2000-2001 and consists of 1000
audio tracks equally distributed over the 10 music genres: blues, classical, country, disco,
hiphop, pop, jazz, metal, reggae, and rock.
ISMIR Genre: This data set has been assembled for training and development in the ISMIR 2004
Genre Classification contest [CGG+06]. It contains 1458 full length audio recordings from
Magnatune.com distributed across the 6 genre classes: Classical, Electronic, JazzBlues,
MetalPunk, RockPop, World.
Latin Music Database (LMD): [SKK08] contains 3227 songs, categorized into the 10 Latin music
genres Axé, Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja and Tango.
Million Song Dataset (MSD): [BMEWL11] a collection of one million music pieces, enables methods
for large-scale applications. It comes as a collection of meta-data such as the song names,
artists and albums, together with a set of features extracted with the Echo Nest services, such
as loudness, tempo, and MFCC-like features. We used the CD2C genre assignments as ground truth
[Sch15] which are an adaptation of the MSD genre label assignments presented in [SMR12]. For the
experiments a sub-set of approximately 50,000 tracks was sub-sampled.

    Data        Tracks   cls   Train (wo. au.)   Train (w. au.)      Test
    GTZAN        1,000    10            11,250           47,250     3,750
    ISMIR G.     1,458     6            16,380           68,796     5,490
    Latin        3,227    10            36,240          152,208    12,165
    MSD         49,900    15           564,165                -   185,685

Table 1: Overview of the evaluation datasets, their number of classes (cls) and their
corresponding number of test and training data instances without (wo. au.) and with (w. au.)
data augmentation.
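
The training instance counts in Table 1 can be reproduced from the numbers given in Sections 2.1 and 2.1.1. The sketch below is an interpretation rather than the authors' code. It assumes that every one of the 4 × 4 combinations of time stretching and pitch shifting factors contributes 3 segments, which matches the stated 48 additional instances per file. The training file counts are inferred by dividing the un-augmented counts in Table 1 by 15, and all names are hypothetical.

```python
SEGMENTS_PER_FILE = 15           # segments sampled from each original track
FACTORS_PER_DEFORMATION = 4      # four stretch factors, four pitch shifts
N_DEFORMATIONS = 2               # time stretching and pitch shifting
SEGMENTS_PER_COMBINATION = 3     # segments drawn from each deformed version


def augmented_segments_per_file():
    """Additional instances per file, assuming all factor combinations are used."""
    combinations = FACTORS_PER_DEFORMATION ** N_DEFORMATIONS
    return combinations * SEGMENTS_PER_COMBINATION


def train_instances(n_train_files, augmented=False):
    """Number of training instances for a given number of training files."""
    per_file = SEGMENTS_PER_FILE
    if augmented:
        per_file += augmented_segments_per_file()
    return n_train_files * per_file


# Training files per fold, inferred from Table 1 (instances without augmentation / 15).
train_files = {"GTZAN": 750, "ISMIR G.": 1092, "Latin": 2416}

for name, n_files in train_files.items():
    print(name, train_instances(n_files), train_instances(n_files, augmented=True))
```

With these assumptions each training file yields 15 + 48 = 63 instances, so augmentation multiplies the size of the training set by 4.2.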

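The two file-level prediction rules described in Section 3 (Maximum Probability and Majority Vote) can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used in the experiments. The function name and the toy probabilities are hypothetical, and ties are resolved in favour of the class encountered first, which the paper does not specify.

```python
from collections import Counter, defaultdict


def aggregate_predictions(segment_probs, file_ids):
    """Combine per-segment softmax outputs into one prediction per file.

    segment_probs: list of probability vectors, one per segment
    file_ids:      list of the same length naming the file of each segment
    Returns two dicts (file id -> class index): sum rule and majority vote.
    """
    by_file = defaultdict(list)
    for file_id, probs in zip(file_ids, segment_probs):
        by_file[file_id].append(probs)

    max_probability, majority_vote = {}, {}
    for file_id, rows in by_file.items():
        # Maximum Probability: sum the class probabilities over all segments.
        summed = [sum(column) for column in zip(*rows)]
        max_probability[file_id] = summed.index(max(summed))
        # Majority Vote: take the arg-max class per segment, then count votes.
        votes = Counter(row.index(max(row)) for row in rows)
        majority_vote[file_id] = votes.most_common(1)[0][0]
    return max_probability, majority_vote


# Toy example with two classes; the two rules disagree for file "a".
probs = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55], [0.20, 0.80]]
files = ["a", "a", "a", "b"]
max_probability, majority_vote = aggregate_predictions(probs, files)
print(max_probability)
print(majority_vote)
```

In the toy data one very confident segment outweighs two weakly confident ones under the sum rule, while the majority vote follows the two weaker segments. This is the kind of case in which the max and maj columns of a results table can differ.
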
4    Results

The results of the experiments are provided in Table 2. For each dataset all combinations of
experimental results were tested for significant difference using a Wilcoxon signed-rank test.
None of the presented results showed a significant difference for p < 0.05. Thus, we tested at
the next higher level p < 0.1. The following observations on the datasets were made:

GTZAN: Training the models with 200 epochs instead of only 100 epochs significantly improved the
raw and max accuracies for the shallow models. An additional test on training 500 epochs showed
no further increase in accuracy for any of the three prediction methods. Training longer had no
effect on the deep model due to early over-fitting. No significant differences were observed
between shallow and deep models except for the raw prediction values of the shallow model (200
epochs) exceeding those of the deep model (200 epochs). While the improvements through data
augmentation on deep models compared to the un-augmented longer trained deep models are not
significant, considerable improvements of 4.2% were observed for models trained for the same
number of epochs. An interesting observation is the negative effect of data augmentation on the
shallow models where longer training outperformed augmentation.

    D             Model         raw            max            maj            ep
    GTZAN         shallow       66.56 (0.69)   78.10 (0.89)   77.80 (0.52)   100
                  deep          65.69 (1.23)   78.60 (1.97)   78.00 (2.87)   100
                  shallow       67.49 (0.39)   80.80 (1.67)   80.20 (1.68)   200
                  deep          66.19 (0.55)   80.60 (2.93)   80.30 (2.87)   200
                  shallow aug   66.77 (0.78)   78.90 (2.64)   77.10 (1.19)   100
                  deep aug      68.31 (2.68)   81.80 (2.95)   82.20 (2.30)   100
    ISMIR Genre   shallow       75.66 (1.30)   85.46 (1.87)   84.77 (1.43)   100
                  deep          74.53 (0.52)   84.08 (1.13)   83.95 (0.97)   100
                  shallow       75.43 (0.65)   84.91 (1.96)   85.18 (1.27)   200
                  deep          74.51 (1.71)   85.12 (0.76)   85.18 (1.23)   200
                  shallow aug   76.61 (1.04)   86.90 (0.23)   86.00 (0.54)   100
                  deep aug      77.20 (1.14)   87.17 (1.17)   86.75 (1.41)   100
    Latin         shallow       79.80 (0.95)   92.44 (0.86)   92.10 (0.97)   100
                  deep          81.13 (0.64)   94.42 (1.04)   94.30 (0.81)   100
                  shallow       80.64 (0.83)   93.46 (1.13)   92.68 (0.88)   200
                  deep          81.06 (0.51)   95.14 (0.40)   94.83 (0.53)   200
                  shallow aug   78.09 (0.68)   92.78 (0.88)   92.03 (0.81)   100
                  deep aug      83.22 (0.83)   96.03 (0.56)   95.60 (0.58)   100
    MSD           shallow       58.20 (0.49)   63.89 (0.81)   63.11 (0.74)   100
                  deep          60.60 (0.28)   67.16 (0.64)   66.41 (0.52)   100

Table 2: Experimental results for the evaluation datasets (D) at different number of training
epochs (ep): Mean accuracies and standard deviations of the 4-fold cross-evaluation runs
calculated using raw prediction scores (raw) and the file based maximum probability (max) and
majority vote approach (maj).
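
The augmentation gains quoted in this section can be recomputed from the mean accuracies in Table 2. The snippet below is only a worked check with hypothetical names. It treats the gains as absolute differences between mean accuracies, i.e. percentage points, which is an interpretation that matches the quoted figures.

```python
# Mean accuracies (%) of the deep models, copied from Table 2.
# Keys are (dataset, model, epochs).
TABLE2 = {
    ("GTZAN", "deep", 100): {"max": 78.60, "maj": 78.00},
    ("GTZAN", "deep aug", 100): {"max": 81.80, "maj": 82.20},
    ("ISMIR Genre", "deep", 100): {"max": 84.08, "maj": 83.95},
    ("ISMIR Genre", "deep", 200): {"max": 85.12, "maj": 85.18},
    ("ISMIR Genre", "deep aug", 100): {"max": 87.17, "maj": 86.75},
    ("Latin", "deep", 100): {"max": 94.42, "maj": 94.30},
    ("Latin", "deep aug", 100): {"max": 96.03, "maj": 95.60},
}


def augmentation_gain(dataset, metric, baseline_epochs=100):
    """Absolute accuracy difference between the augmented and the plain deep model."""
    baseline = TABLE2[(dataset, "deep", baseline_epochs)][metric]
    augmented = TABLE2[(dataset, "deep aug", 100)][metric]
    return round(augmented - baseline, 2)


print("GTZAN maj, same epochs:      ", augmentation_gain("GTZAN", "maj"))
print("ISMIR Genre max, same epochs:", augmentation_gain("ISMIR Genre", "max"))
print("ISMIR Genre max, vs. 200 ep: ", augmentation_gain("ISMIR Genre", "max", 200))
print("Latin max, same epochs:      ", augmentation_gain("Latin", "max"))
```

The rounded table values give 3.09 for the ISMIR Genre max comparison at equal epochs, while the text quotes 3.08. The other three differences match the quoted figures.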

ISMIR Genre: Training more epochs only had a significant positive effect on the max and maj
values of the deep model but none for the shallow ones. The deep models showed no significant
advantage over the shallow architectures, which also showed higher raw prediction values even on
shorter trained models. Data augmentation improved the predictions of both architectures with
significant improvements for the raw values. Especially the deep models significantly profited
from data augmentation with max values increased by 3.08% for models trained for the same number
of epochs and 2.05% for the longer trained models. The improvements from deep over shallow
models using augmented data were only significant for the raw values.

Latin: Training more epochs only had a positive effect for the raw and max values of the shallow
model, but not for the deep architecture. On this dataset, the deep model significantly
outperformed the shallow architecture including the shallow model trained using data
augmentation. Data augmentation significantly improved the performance of the deep models by
1.61% for the max values. Similar to the GTZAN dataset, data augmentation showed a degrading
effect on the shallow model, which showed significantly higher accuracy values by training for
more epochs.

MSD: A non-significant advantage of deep over shallow models was observed. Experiments using
data augmentation and longer training were omitted due to the already large variance provided by
the MSD which multiplies the preceding datasets by factors from 15 to 50.

5    Conclusions and Future Work

In this paper we evaluated shallow and deep CNN architectures towards their performance on
different dataset sizes in music genre classification tasks. Our observations showed that for
smaller datasets shallow models seem to be more appropriate since deeper models showed no
significant improvement. Deeper models performed slightly better in the presence of larger
datasets, but a clear conclusion that deeper models are generally better could not be drawn.
Data augmentation using time stretching and pitch shifting significantly improved the
performance of deep models. For shallow models, on the contrary, it showed a negative effect on
the small datasets. Thus, deeper models should be considered when applying data augmentation.
Comparing the presented results with previously reported evaluations on the same datasets [SR12]
shows that the CNN based approaches already outperform handcrafted music features such as the
Rhythm Patterns (RP) family [LSC+10] (highest values: GTZAN 73.2%, ISMIR Genre 80.9%, Latin
87.3%) or the Temporal Echonest Features presented in the referred study [SR12] (highest values:
GTZAN 66.9%, ISMIR Genre 81.3%, Latin 89.0%).
   Future work will focus on further data augmentation methods to improve the performance of
neural networks on small datasets and the Million Song Dataset as well as on different network
architectures.

