        Comparing Shallow versus Deep Neural Network
     Architectures for Automatic Music Genre Classification

                         Alexander Schindler
                   Austrian Institute of Technology
                     Digital Safety and Security
                           Vienna, Austria
                    alexander.schindler@ait.ac.at

                    Thomas Lidy, Andreas Rauber
                   Vienna University of Technology
                   Institute of Software Technology
                           Vienna, Austria
                    lidy,rauber@ifs.tuwien.ac.at



                         Abstract

In this paper we investigate performance differences of neural network architectures on the task of automatic music genre classification. Comparative evaluations on four well-known datasets of different sizes were performed, including the application of two audio data augmentation methods. The results show that shallow network architectures are better suited for small datasets than deeper models, which could be relevant for experiments and applications relying on small datasets. A noticeable advantage was observed through the application of data augmentation with deep models. A final comparison with previous evaluations on the same datasets shows that the presented neural network based approaches already outperform state-of-the-art handcrafted music features.

1    Introduction

Music classification is a well researched topic in Music Information Retrieval (MIR) [FLTZ11]. Generally, its aim is to assign one or multiple labels to a sequence or an entire audio file, which is commonly accomplished in two major steps. First, semantically meaningful audio content descriptors are extracted from the sampled audio signal. Second, a machine learning algorithm is applied, which attempts to discriminate between the classes by finding separating boundaries in the multidimensional feature spaces. Especially the first step requires extensive knowledge and skills in various specific research areas such as audio signal processing, acoustics and/or music theory. Recently, many approaches to MIR problems have been inspired by the remarkable success of Deep Neural Networks (DNN) in the domain of computer vision [KSH12], where deep learning based approaches have already become the de facto standard. The major advantage of DNNs is their feature learning capability, which alleviates the domain knowledge and time intensive task of crafting audio features by hand. Predictions are also made directly on the modeled input representations, which are commonly raw input data such as images, text or audio spectrograms. Recent accomplishments in applying Convolutional Neural Networks (CNN) to audio classification tasks have shown promising results, outperforming conventional approaches in different evaluation campaigns such as the Detection and Classification of Acoustic Scenes and Events (DCASE) [LS16a] and the Music Information Retrieval Evaluation eXchange (MIREX) [LS16b].

An often mentioned paradigm concerning neural networks is that deeper networks are better at modeling non-linear relationships of given tasks [SLJ+15]. So far, preceding MIR experiments and approaches reported in the literature have not explicitly demonstrated an advantage of deep over shallow network architectures of a magnitude similar to results reported from the computer vision domain. This may be related to the absence of datasets as large as those available in the vision-related research areas. A special focus of this paper is thus set on the performance of neural networks on small datasets, since data availability is still a problem in MIR, but also because many tasks involve the processing of small collections. In this paper we present a performance evaluation of shallow and deep neural network architectures. These models and the applied method are detailed in Section 2. The evaluation is performed on well-known music genre classification datasets in the domain of Music Information Retrieval. These datasets and the evaluation procedure are described in Section 3. Finally, we draw conclusions from the results in Section 5 and give an outlook on future work.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: W. Aigner, G. Schmiedl, K. Blumenstein, M. Zeppelzauer (eds.): Proceedings of the 9th Forum Media Technology 2016, St. Pölten, Austria, 24-11-2016, published at http://ceur-ws.org



Figure 1: Shallow CNN architecture. (Diagram: an 80×80 input feeds two parallel pipelines, Conv-16-10×23-LeakyReLU with Max-Pooling 1×20 and Conv-16-21×20-LeakyReLU with Max-Pooling 20×1; the pipelines are merged, followed by Dropout 0.1, a fully connected layer with 200 units and a Softmax output.)

Figure 2: Deep CNN architecture. (Diagram: an 80×80 input feeds two parallel pipelines of four Conv-LeakyReLU/Max-Pooling pairs. Layer 1: Conv-16-10×23 / Conv-16-21×10 with Max-Pooling 2×2; Layer 2: Conv-32-5×11 / Conv-32-10×5 with Max-Pooling 2×2; Layer 3: Conv-64-3×5 / Conv-64-5×3 with Max-Pooling 2×2; Layer 4: Conv-128-2×4 / Conv-128-4×2 with Max-Pooling 1×5 / 5×1. The pipelines are merged, followed by Dropout 0.25, a fully connected layer with 200 units (LeakyReLU) and a Softmax output.)

2    Method

The parallel architectures of the neural networks used in the evaluation are based on the idea of using a time and a frequency pipeline described in [PLS16], which was successfully applied in two evaluation campaigns [LS16a, LS16b]. The system is based on a parallel CNN architecture in which separate CNN layers are optimized for processing and recognizing musical relations in the frequency domain and for capturing temporal relations (see Figure 1).
The Shallow Architecture: In our adaptation of the CNN architecture described in [PLS16] we use two similar pipelines of CNN layers with 16 filter kernels each, followed by a Max-Pooling layer (see Figure 1). The left pipeline aims at capturing frequency relations using filter kernels of size 10×23 and Max-Pooling of size 1×20. The resulting 16 vertically rectangular feature map responses of shape 80×4 are intended to capture the spectral characteristics of a segment and to reduce the temporal complexity to 4 discrete intervals. The right pipeline uses filters of size 21×20 and Max-Pooling of size 20×1. This results in horizontally rectangular feature maps of shape 4×80, which capture temporal changes in intensity levels over four discrete spectral intervals. The 16 feature maps of each pipeline are flattened to a shape of 1×5120 and merged by concatenation into a shape of 1×10240, which serves as input to a fully connected layer with 200 units and a dropout rate of 10%.
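For illustration, the shallow model can be sketched in a few lines of Keras (the library used for the implementation, see Section 2.1). This is a hypothetical re-implementation: 'same' convolution padding and a channels-last input layout are assumptions made to reproduce the feature map shapes stated above, and n_classes depends on the dataset.

    # Minimal sketch of the shallow architecture; 'same' padding and a
    # channels-last input layout are assumptions, not stated in the paper.
    from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                              concatenate, Dropout, Dense, LeakyReLU)
    from keras.models import Model

    n_classes = 10                            # dataset dependent, e.g. GTZAN

    inp = Input(shape=(80, 80, 1))            # one 80x80 log-Mel segment

    # frequency pipeline: 10x23 kernels, pooled to 80x4x16 = 5120 values
    f = Conv2D(16, (10, 23), padding='same')(inp)
    f = LeakyReLU(alpha=0.3)(f)
    f = MaxPooling2D(pool_size=(1, 20))(f)

    # temporal pipeline: 21x20 kernels, pooled to 4x80x16 = 5120 values
    t = Conv2D(16, (21, 20), padding='same')(inp)
    t = LeakyReLU(alpha=0.3)(t)
    t = MaxPooling2D(pool_size=(20, 1))(t)

    # concatenation to 1x10240, dropout, fully connected layer, softmax
    m = concatenate([Flatten()(f), Flatten()(t)])
    m = Dropout(0.1)(m)
    m = LeakyReLU(alpha=0.3)(Dense(200)(m))
    out = Dense(n_classes, activation='softmax')(m)
    shallow_model = Model(inp, out)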
The Deep Architecture: This architecture follows the same principles as the shallow approach. It uses a parallel arrangement of rectangular filters and Max-Pooling windows to capture frequency and temporal relationships at once. But, instead of using the information of the large feature map responses directly, this architecture applies additional pairs of CNN and pooling layers (see Figure 2). Thus, more units can be trained on the subsequently smaller input feature maps. The first level of the parallel layers is similar to the original approach, using filter kernels of sizes 10×23 and 21×10 to capture frequency and temporal relationships. To retain these characteristics, the sizes of the convolutional filter kernels as well as the feature maps are subsequently halved by the second and third layers. The filter and Max-Pooling sizes of the fourth layer are slightly adapted such that both parts have the same rectangular shapes, with one rotated by 90°. As in the shallow architecture, the identical sizes of the final feature maps of the parallel model paths balance their influence on the following fully connected layer with 200 units and a 25% dropout rate.
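A corresponding sketch of the deep model, under the same assumptions as above, stacks four convolution/pooling pairs per pipeline; with the sizes from Figure 2, both pipelines end in feature maps of 2,560 values:

    # Compact sketch of the deep architecture (hypothetical re-implementation;
    # 'same' padding and channels-last input assumed, as in the shallow sketch)
    from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                              concatenate, Dropout, Dense, LeakyReLU)
    from keras.models import Model

    # (filters, kernel size, pooling size) per layer pair, from Figure 2
    FREQ = [(16, (10, 23), (2, 2)), (32, (5, 11), (2, 2)),
            (64, (3, 5), (2, 2)), (128, (2, 4), (1, 5))]
    TIME = [(16, (21, 10), (2, 2)), (32, (10, 5), (2, 2)),
            (64, (5, 3), (2, 2)), (128, (4, 2), (5, 1))]

    def pipeline(x, spec):
        # four Conv/LeakyReLU/Max-Pooling pairs; both paths end in 2560 values
        for n_filters, kernel, pool in spec:
            x = Conv2D(n_filters, kernel, padding='same')(x)
            x = LeakyReLU(alpha=0.3)(x)
            x = MaxPooling2D(pool_size=pool)(x)
        return Flatten()(x)

    n_classes = 10
    inp = Input(shape=(80, 80, 1))
    m = concatenate([pipeline(inp, FREQ), pipeline(inp, TIME)])
    m = Dropout(0.25)(m)
    m = LeakyReLU(alpha=0.3)(Dense(200)(m))
    out = Dense(n_classes, activation='softmax')(m)
    deep_model = Model(inp, out)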
2.1   Training and Predicting Results

In each training epoch, multiple training examples, sampled from the segment-wise log-transformed Mel-spectrogram analysis of all files in the training set, are presented to both pipelines of the neural network. Each of the parallel pipelines uses the same 80×80 log-transformed Mel-spectrogram segments as input. These segments were calculated from 0.93 seconds of audio via a fast Fourier transform with a window size of 1024 samples and an overlap of 50%, subsequently transformed to the Mel scale and log scale. For each song of a dataset, 15 segments were randomly chosen.
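The segmentation step could be sketched with librosa as follows; the 44.1 kHz sample rate and the hop size of 512 samples (50% of the 1024-sample window) are assumptions under which 80 frames span roughly 0.93 seconds:

    # Sketch of the segment extraction (hypothetical helper function)
    import numpy as np
    import librosa

    def extract_segments(path, n_segments=15, n_frames=80, n_mels=80):
        # assumes 44.1 kHz audio: 80 frames x 512-sample hop ~ 0.93 seconds
        y, sr = librosa.load(path, sr=44100)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                             hop_length=512, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)    # log-transform the Mel bands
        # randomly pick n_segments windows of n_frames consecutive frames
        starts = np.random.randint(0, log_mel.shape[1] - n_frames, n_segments)
        return np.stack([log_mel[:, s:s + n_frames] for s in starts])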
All trainable layers used the Leaky ReLU activation function [MHN13], an extension of the ReLU (Rectified Linear Unit) that does not completely cut off activation for negative values but lets negative values close to zero pass through. It is defined as f(x) = αx for x < 0, while keeping f(x) = x for x ≥ 0 as in the ReLU. In our architectures, we apply Leaky ReLU activation with α = 0.3. L1 weight regularization with a penalty of 0.0001 was applied to all trainable parameters. All networks were trained towards a categorical cross-entropy objective using stochastic Adam optimization [KB14] with β1 = 0.9, β2 = 0.999, ε = 1e-08 and a learning rate of 0.00005.

The system is implemented in Python, using librosa [MRL+15] for audio processing and Mel-log-transforms and the Theano-based library Keras for deep learning.



2.1.1   Data Augmentation

To increase the number of training instances, we experiment with two different audio data augmentation methods. The deformations were applied directly to the audio signal, preceding the feature calculation procedure described in Section 2.1. The following two methods were applied using the MUDA framework for musical data augmentation [MHB15]:

Time Stretching: slowing down or speeding up the original audio sample while keeping the pitch information unchanged. Time stretching was applied with the multiplication factors 0.5 and 0.2 for slowing down and 1.2 and 1.5 for increasing the tempo.

Pitch Shifting: raising or lowering the pitch of an audio sample while keeping the tempo unchanged. The applied pitch shifting lowered and raised the pitch by 2 and 5 semitones.

For each deformation, three segments were randomly chosen from the audio content. The combinations of the two deformations with four different factors each thus resulted in 48 additional data instances per audio file.
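A hypothetical sketch of this augmentation step with MUDA is given below; the deformer names follow the MUDA documentation, and the file names are placeholders:

    # Hypothetical sketch of the MUDA-based augmentation described above
    import muda

    # 4 time-stretch and 4 pitch-shift factors; MUDA's pipeline enumerates
    # the 16 combinations, from which segments are then sampled as described
    pipeline = muda.Pipeline(steps=[
        ('stretch', muda.deformers.TimeStretch(rate=[0.5, 0.2, 1.2, 1.5])),
        ('shift', muda.deformers.PitchShift(n_semitones=[-5, -2, 2, 5]))])

    jam = muda.load_jam_audio('track.jams', 'track.mp3')   # placeholder files
    for i, aug in enumerate(pipeline.transform(jam)):
        muda.save('track_aug_{}.ogg'.format(i),
                  'track_aug_{}.jams'.format(i), aug)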
                                                                       chosen these datasets due to their increasing number of
3    Evaluation

As our system analyzes and predicts multiple audio segments per input file, there are several ways to perform the final prediction of an input instance:

Raw Probability: The raw accuracy of predicting the segments as separate instances, ignoring their file dependencies.

Maximum Probability: The output probabilities of the Softmax layer for the corresponding number of classes of the dataset are summed up over all segments belonging to the same input file. The predicted class is determined by the maximum among the summed class probabilities.

Majority Vote: Here, predictions are made for each segment processed from the audio file as an input instance to the network. The class of an audio segment is determined by the maximum probability output by the Softmax layer for this segment instance. Then, a majority vote is taken over the predicted classes of all segments of the same input file; the class that occurs most often is chosen.
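The two file-based decision rules can be summarized in a few lines of NumPy, operating on an array seg_probs of shape (number of segments, number of classes) that holds the Softmax outputs of all segments of one file (raw probability requires no aggregation):

    # Sketch of the two file-level decision rules
    import numpy as np

    def maximum_probability(seg_probs):
        # sum the Softmax outputs of all segments of a file, take the maximum
        return np.argmax(seg_probs.sum(axis=0))

    def majority_vote(seg_probs):
        # classify each segment on its own, return the most frequent class
        votes = np.argmax(seg_probs, axis=1)
        return np.bincount(votes).argmax()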
                                                                       large-scale applications. It comes as a collection of meta-
   We used stratified 4-fold cross validation. Multi-level stratification was applied, paying special attention to the multiple segments used per file. It was ensured that the files were distributed according to their genre distributions and that no segment of a training file was provided in the corresponding test split.
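A sketch of this splitting scheme using scikit-learn; the arrays are hypothetical placeholders illustrating the file-to-segment expansion:

    # Sketch of the multi-level stratification: folds are formed on whole
    # files (stratified by genre) and then expanded to the files' segments
    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    n_files, n_segments = 1000, 15                  # e.g. GTZAN
    file_genres = np.repeat(np.arange(10), 100)     # one genre label per file
    segment_file = np.repeat(np.arange(n_files), n_segments)

    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    for train_files, test_files in skf.split(np.zeros(n_files), file_genres):
        # no segment of a training file can appear in the test split
        train_idx = np.flatnonzero(np.in1d(segment_file, train_files))
        test_idx = np.flatnonzero(np.in1d(segment_file, test_files))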
   The experiments were grouped according to the four different datasets. For each dataset, the performances of the shallow and deep architectures were evaluated, followed by the experiments including data augmentation. The architectures were further evaluated according to their performance after different numbers of training epochs. The networks were trained and evaluated after 100 and 200 epochs without early stopping. Preceding experiments showed that test accuracy could improve despite rising validation loss, though on the smaller sets no significant improvement was recognizable after 200 epochs. For the experiments with data augmentation, the augmented data was only used to train the networks (see Table 1). For testing the networks, the original segments without deformations were used.

3.1   Data Sets

                                     Train
    Dataset     Tracks   cls     wo. au.     w. au.      Test
    GTZAN        1,000    10      11,250     47,250     3,750
    ISMIR G.     1,458     6      16,380     68,796     5,490
    Latin        3,227    10      36,240    152,208    12,165
    MSD         49,900    15     564,165          —   185,685

Table 1: Overview of the evaluation datasets, their number of classes (cls) and their corresponding number of training and test data instances without (wo. au.) and with (w. au.) data augmentation.

For the evaluation, four data sets have been used. We have chosen these datasets due to their increasing numbers of tracks and because they are well known and extensively evaluated in the automatic genre classification task. This should also provide comparability with experiments reported in the literature.

GTZAN: This data set was compiled by George Tzanetakis [Tza02] in 2000-2001 and consists of 1000 audio tracks equally distributed over the 10 music genres blues, classical, country, disco, hiphop, pop, jazz, metal, reggae, and rock.

ISMIR Genre: This data set was assembled for training and development in the ISMIR 2004 Genre Classification contest [CGG+06]. It contains 1458 full length audio recordings from Magnatune.com distributed across the 6 genre classes Classical, Electronic, JazzBlues, MetalPunk, RockPop, and World.

Latin Music Database (LMD): [SKK08] contains 3227 songs, categorized into the 10 Latin music genres Axé, Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja and Tango.

Million Song Dataset (MSD): [BMEWL11] a collection of one million music pieces, which enables methods for large-scale applications. It comes as a collection of metadata such as song names, artists and albums, together with a set of features extracted with The Echo Nest services, such as loudness, tempo, and MFCC-like features. We used the CD2C genre assignments [Sch15] as ground truth, which are an adaptation of the MSD genre label assignments presented in [SMR12]. For the experiments, a subset of approximately 50,000 tracks was sub-sampled.



4    Results

The results of the experiments are provided in Table 2. For each dataset, all combinations of experimental results were tested for significant difference using a Wilcoxon signed-rank test. None of the presented results showed a significant difference at p < 0.05. Thus, we tested at the next higher level, p < 0.1.

    D            Model         raw            max            maj            ep
    GTZAN        shallow       66.56 (0.69)   78.10 (0.89)   77.80 (0.52)   100
                 deep          65.69 (1.23)   78.60 (1.97)   78.00 (2.87)   100
                 shallow       67.49 (0.39)   80.80 (1.67)   80.20 (1.68)   200
                 deep          66.19 (0.55)   80.60 (2.93)   80.30 (2.87)   200
                 shallow aug   66.77 (0.78)   78.90 (2.64)   77.10 (1.19)   100
                 deep aug      68.31 (2.68)   81.80 (2.95)   82.20 (2.30)   100
    ISMIR Genre  shallow       75.66 (1.30)   85.46 (1.87)   84.77 (1.43)   100
                 deep          74.53 (0.52)   84.08 (1.13)   83.95 (0.97)   100
                 shallow       75.43 (0.65)   84.91 (1.96)   85.18 (1.27)   200
                 deep          74.51 (1.71)   85.12 (0.76)   85.18 (1.23)   200
                 shallow aug   76.61 (1.04)   86.90 (0.23)   86.00 (0.54)   100
                 deep aug      77.20 (1.14)   87.17 (1.17)   86.75 (1.41)   100
    Latin        shallow       79.80 (0.95)   92.44 (0.86)   92.10 (0.97)   100
                 deep          81.13 (0.64)   94.42 (1.04)   94.30 (0.81)   100
                 shallow       80.64 (0.83)   93.46 (1.13)   92.68 (0.88)   200
                 deep          81.06 (0.51)   95.14 (0.40)   94.83 (0.53)   200
                 shallow aug   78.09 (0.68)   92.78 (0.88)   92.03 (0.81)   100
                 deep aug      83.22 (0.83)   96.03 (0.56)   95.60 (0.58)   100
    MSD          shallow       58.20 (0.49)   63.89 (0.81)   63.11 (0.74)   100
                 deep          60.60 (0.28)   67.16 (0.64)   66.41 (0.52)   100

Table 2: Experimental results for the evaluation datasets (D) at different numbers of training epochs (ep): mean accuracies and standard deviations of the 4-fold cross-validation runs, calculated using raw prediction scores (raw), the file-based maximum probability approach (max) and the majority vote approach (maj).
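Such a comparison can be reproduced with SciPy's implementation of the test; the per-fold accuracies below are hypothetical placeholders:

    # Sketch of the pairwise significance test over the four fold accuracies
    from scipy.stats import wilcoxon

    shallow_folds = [78.1, 79.0, 77.4, 78.9]   # hypothetical per-fold values
    deep_folds = [78.6, 80.1, 76.9, 79.5]
    stat, p = wilcoxon(shallow_folds, deep_folds)
    # note: with only four paired folds, the achievable p-values are coarse
    significant = p < 0.1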
The following observations on the datasets were made:

GTZAN: Training the models for 200 instead of only 100 epochs significantly improved the raw and max accuracies of the shallow models. An additional test with 500 training epochs showed no further increase in accuracy for any of the three prediction methods. Training longer had no effect on the deep model due to early over-fitting. No significant differences were observed between shallow and deep models, except for the raw prediction values of the shallow model (200 epochs) exceeding those of the deep model (200 epochs). While the improvements through data augmentation on deep models compared to the un-augmented, longer-trained deep models are not significant, considerable improvements of 4.2% were observed for models trained for the same number of epochs. An interesting observation is the negative effect of data augmentation on the shallow models, where longer training outperformed augmentation.
ISMIR Genre: Training more epochs only had a significant positive effect on the max and maj values of the deep model, but none on those of the shallow one. The deep models showed no significant advantage over the shallow architectures, which even showed higher raw prediction values for the shorter-trained models. Data augmentation improved the predictions of both architectures, with significant improvements for the raw values. Especially the deep models significantly profited from data augmentation, with max values increased by 3.08% for models trained for the same number of epochs and by 2.05% for the longer-trained models. The improvements of deep over shallow models using augmented data were only significant for the raw values.

Latin: Training more epochs only had a positive effect on the raw and max values of the shallow model, but not on the deep architecture. On this dataset, the deep model significantly outperformed the shallow architecture, including the shallow model trained using data augmentation. Data augmentation significantly improved the performance of the deep models by 1.61% for the max values. Similar to the GTZAN dataset, data augmentation had a degrading effect on the shallow model, which showed significantly higher accuracy values when trained for more epochs instead.

MSD: A non-significant advantage of deep over shallow models was observed. Experiments using data augmentation and longer training were omitted due to the already large variance provided by the MSD, which exceeds the preceding datasets by factors of 15 to 50.

5    Conclusions and Future Work

In this paper we evaluated shallow and deep CNN architectures towards their performance on different dataset sizes in music genre classification tasks. Our observations showed that for smaller datasets shallow models seem to be more appropriate, since deeper models showed no significant improvement. Deeper models performed slightly better in the presence of larger datasets, but a clear conclusion that deeper models are generally better could not be drawn. Data augmentation using time stretching and pitch shifting significantly improved the performance of the deep models. For the shallow models, on the contrary, it showed a negative effect on the small datasets. Thus, deeper models should be considered when applying data augmentation. Comparing the presented results with previously reported evaluations on the same datasets [SR12] shows that the CNN based approaches already outperform state-of-the-art handcrafted music features such as the Rhythm Patterns (RP) family [LSC+10] (highest values: GTZAN 73.2%, ISMIR Genre 80.9%, Latin 87.3%) or the Temporal Echonest Features presented in the referenced study [SR12] (highest values: GTZAN 66.9%, ISMIR Genre 81.3%, Latin 89.0%).

   Future work will focus on further data augmentation methods to improve the performance of neural networks on small datasets and on the Million Song Dataset, as well as on different network architectures.



References

[BMEWL11] Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In ISMIR, volume 2, page 10, 2011.

[CGG+06] Pedro Cano, Emilia Gómez, Fabien Gouyon, Perfecto Herrera, Markus Koppenberger, Beesuan Ong, Xavier Serra, Sebastian Streich, and Nicolas Wack. ISMIR 2004 audio description contest. Technical report, 2006.

[FLTZ11] Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang. A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia, 13(2):303–319, 2011.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[LS16a] Thomas Lidy and Alexander Schindler. CQT-based convolutional neural networks for audio scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pages 60–64, September 2016.

[LS16b] Thomas Lidy and Alexander Schindler. Parallel convolutional neural networks for music genre and mood classification. Technical report, Music Information Retrieval Evaluation eXchange (MIREX 2016), August 2016.

[LSC+10] Thomas Lidy, Carlos N. Silla, Olmo Cornelis, Fabien Gouyon, Andreas Rauber, Celso A. A. Kaestner, and Alessandro L. Koerich. On the suitability of state-of-the-art music information retrieval methods for analyzing, categorizing, structuring and accessing non-western and ethnic music collections. Signal Processing, 90(4):1032–1048, 2010.

[MHB15] Brian McFee, Eric J Humphrey, and Juan P Bello. A software framework for musical data augmentation. In International Society for Music Information Retrieval Conference (ISMIR), 2015.

[MHN13] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML 2013, volume 28, 2013.

[MRL+15] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, 2015.

[PLS16] Jordi Pons, Thomas Lidy, and Xavier Serra. Experimenting with musically motivated convolutional neural networks. In Proceedings of the 14th International Workshop on Content-based Multimedia Indexing (CBMI 2016), Bucharest, Romania, June 2016.

[Sch15] Hendrik Schreiber. Improving genre annotations for the million song dataset. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 2015.

[SKK08] Carlos N. Silla Jr., Celso A. A. Kaestner, and Alessandro L. Koerich. The Latin Music Database. In Proceedings of the 9th International Conference on Music Information Retrieval, pages 451–456, 2008.

[SLJ+15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[SMR12] Alexander Schindler, Rudolf Mayer, and Andreas Rauber. Facilitating comprehensive benchmarking experiments on the million song dataset. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR 2012), pages 469–474, Porto, Portugal, October 8-12 2012.

[SR12] Alexander Schindler and Andreas Rauber. Capturing the temporal domain in echonest features for improved classification effectiveness. In Adaptive Multimedia Retrieval, Lecture Notes in Computer Science, Copenhagen, Denmark, October 24-25 2012. Springer.

[Tza02] G. Tzanetakis. Manipulation, analysis and retrieval systems for audio signals. PhD thesis, 2002.
