<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>St. Pölten, Austria</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Comparing Shallow versus Deep Neural Network Architectures for Automatic Music Genre Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Schindler</string-name>
          <email>alexander.schindler@ait.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Lidy</string-name>
          <email>lidy@ifs.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Rauber</string-name>
          <email>rauber@ifs.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Austrian Institute of Technology, Digital Safety and Security</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vienna University of Technology, Institute of Software Technology</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>2</volume>
      <fpage>4</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>In this paper we investigate performance differences of different neural network architectures on the task of automatic music genre classification. Comparative evaluations on four well-known datasets of different sizes were performed, including the application of two audio data augmentation methods. The results show that shallow network architectures are better suited to small datasets than deeper models, which is relevant for experiments and applications that rely on small datasets. A noticeable advantage was observed when applying data augmentation to deep models. A final comparison with previous evaluations on the same datasets shows that the presented neural network based approaches already outperform state-of-the-art handcrafted music features.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Music classification is a well researched topic in Music
Information Retrieval (MIR) [FLTZ11]. Generally, its aim is to
assign one or multiple labels to a sequence or an entire audio
file, which is commonly accomplished in two major steps.
First, semantically meaningful audio content descriptors are
extracted from the sampled audio signal. Second, a machine
learning algorithm is applied, which attempts to
discriminate between the classes by finding separating boundaries
in the multidimensional feature-spaces. Especially the first
step requires extensive knowledge and skills in various
specific research areas such as audio signal processing, acoustics
and/or music theory. Recently many approaches to MIR
problems have been inspired by the remarkable success of
Deep Neural Networks (DNN) in the domains of computer
vision [KSH12], where deep learning based approaches have
already become the de facto standard. The major
advantage of DNNs is their feature-learning capability, which
alleviates the domain-knowledge-heavy and time-intensive task
of crafting audio features by hand. Predictions are also
made directly on the modeled input representations, which
is commonly raw input data such as images, text or
audio spectrograms. Recent accomplishments in applying
Convolutional Neural Networks (CNN) to audio
classification tasks have shown promising results by outperforming
conventional approaches in different evaluation campaigns
such as the Detection and Classification of Acoustic Scenes
and Events (DCASE) [LS16a] and the Music Information
Retrieval Evaluation EXchange (MIREX) [LS16b].</p>
      <p>An often mentioned paradigm concerning neural
networks is that deeper networks are better in modeling
non-linear relationships of given tasks [SLJ+15]. So far
preceding MIR experiments and approaches reported in
literature have not explicitly demonstrated the advantage
of deep over shallow network architectures in a magnitude
similar to results reported from the computer vision
domain. This may be related to the absence of datasets
as large as those available in vision-related
research areas. A special focus of this paper is thus set on
the performance of neural networks on small datasets, since
data availability is still a problem in MIR, but also because
many tasks involve the processing of small collections. In
this paper we present a performance evaluation of shallow
and deep neural network architectures. These models and
the applied method will be detailed in Section 2. The
evaluation will be performed on well known music genre
classification datasets in the domain of Music Information
Retrieval. These datasets and the evaluation procedure will
be described in Section 3. Finally we draw conclusions from
the results in Section 5 and give an outlook to future work.
</p>
      <p>The parallel architectures of the neural networks used
in the evaluation are based on the idea of using a time
and a frequency pipeline described in [PLS16], which
was successfully applied in two evaluation campaigns
[LS16a, LS16b]. The system is based on a parallel CNN
architecture where separate CNN Layers are optimized for
processing and recognizing music relations in the frequency
domain and to capture temporal relations (see Figure 1).
The Shallow Architecture: In our adaption of the
CNN architecture described in [PLS16] we use two similar
pipelines of CNN Layers with 16 filter kernels each followed
by a Max Pooling layer (see Figure 1). The left pipeline
aims at capturing frequency relations using filter kernel sizes
of 10×23 and Max-Pooling sizes of 1×20. The resulting 16
vertical rectangular shaped feature map responses of shape
80×4 are intended to capture spectral characteristics of a
segment and to reduce the temporal complexity to 4 discrete
intervals. The right pipeline uses a filter of size 21×20 and
Max-Pooling sizes of 20×1. This results in horizontal
rectangular shaped feature maps of shape 4×80. This captures
temporal changes in intensity levels of four discrete spectral
intervals. The 16 feature maps of each pipeline are flattened
to a shape of 1×5120 and merged by concatenation into
the shape of 1×10240, which serves as input to a 200-unit
fully connected layer with a dropout of 10%.</p>
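      <p>The shape arithmetic of the shallow model's two pipelines can be sketched as follows (an illustrative Python fragment assuming "same" convolution padding and non-overlapping pooling; function and variable names are ours, not the authors' implementation):</p>
      <preformat>
```python
# Shape arithmetic of the shallow architecture for one 80x80
# (frequency x time) Mel-spectrogram segment. With "same" padding the
# convolutions keep the 80x80 shape, so only Max-Pooling reduces it.

def pooled_shape(freq, time, pool_freq, pool_time):
    """Shape after non-overlapping max pooling."""
    return freq // pool_freq, time // pool_time

N_FILTERS = 16
FREQ, TIME = 80, 80

# Left pipeline: 10x23 conv, then 1x20 pooling -> 80x4 feature maps.
freq_map = pooled_shape(FREQ, TIME, pool_freq=1, pool_time=20)
# Right pipeline: 21x20 conv, then 20x1 pooling -> 4x80 feature maps.
time_map = pooled_shape(FREQ, TIME, pool_freq=20, pool_time=1)

left = N_FILTERS * freq_map[0] * freq_map[1]   # 16 * 80 * 4 = 5120
right = N_FILTERS * time_map[0] * time_map[1]  # 16 * 4 * 80 = 5120
merged = left + right                          # 10240 inputs to the
print(freq_map, time_map, merged)              # 200-unit dense layer
```
      </preformat>
      <p>The equal flattened sizes (5120 each) are what balances the two pipelines' influence on the subsequent fully connected layer.</p>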
      <p>The Deep Architecture: This architecture follows the
same principles of the shallow approach. It uses a parallel
arrangement of rectangular shaped filters and Max-Pooling
windows to capture frequency and temporal relationships
at once. But, instead of using the information of the large
feature map responses, this architecture applies additional
CNN and pooling layer pairs (see Figure 2). Thus, more
units can be applied to train on the subsequent smaller
input feature maps. The first level of the parallel layers are
similar to the original approach. They use filter kernel sizes
of 10×23 and 21×10 to capture frequency and temporal
relationships. To retain these characteristics, the sizes of the
convolutional filter kernels as well as the feature maps are
subsequently halved by the second and third
layers. The filter and Max-Pooling sizes of the fourth layer
are slightly adapted to have the same rectangular shapes
with one part being rotated by 90°. As in the shallow
architecture, the equal sizes of the final feature maps of the
parallel model paths balance their influences on the following
fully connected layer with 200 units and a 25% dropout rate.
In each training epoch, multiple training examples sampled
from the segment-wise log-transformed Mel-spectrogram
analysis of all files in the training set are presented to both
pipelines of the neural network. Each of the parallel pipelines
uses the same 80×80 log-transformed Mel-spectrogram
segments as input. These segments were calculated from a fast
Fourier transformed spectrogram using a window size of 1024
samples and an overlap of 50%, covering 0.93 seconds of audio,
subsequently transformed into Mel scale and log scale. For each
song of a dataset, 15 segments were randomly chosen.</p>
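      <p>The per-song segment sampling can be sketched in numpy as follows (an illustrative fragment; the function name and the 44.1 kHz sampling-rate assumption are ours):</p>
      <preformat>
```python
import numpy as np

def sample_segments(mel_spec, n_segments=15, seg_frames=80, seed=0):
    """Randomly sample fixed-size segments from a log-Mel spectrogram.

    mel_spec: array of shape (n_mel_bands, n_frames), e.g. 80 Mel bands
    from an FFT with a 1024-sample window and 50% overlap (hop 512);
    80 frames at hop 512 and 44.1 kHz cover roughly 0.93 seconds.
    Returns an array of shape (n_segments, n_mel_bands, seg_frames).
    """
    rng = np.random.default_rng(seed)
    n_frames = mel_spec.shape[1]
    starts = rng.integers(0, n_frames - seg_frames, size=n_segments)
    return np.stack([mel_spec[:, s:s + seg_frames] for s in starts])

# A 30-second track at 44.1 kHz with hop 512 gives roughly 2584 frames.
spec = np.log1p(np.random.rand(80, 2584))
segments = sample_segments(spec)
print(segments.shape)  # (15, 80, 80)
```
      </preformat>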
      <p>All trainable layers used the Leaky ReLU activation
function [MHN13], which is an extension to the ReLU
(Rectifier Linear Unit) that does not completely cut off
activation for negative values, but allows for negative
values close to zero to pass through. It is defined by adding
a coefficient α such that f(x) = αx for x &lt; 0, while keeping
f(x) = x for x ≥ 0 as for the ReLU. In our architectures,
we apply Leaky ReLU activation with α = 0.3. L1 weight
regularization with a penalty of 0.0001 was applied to all
trainable parameters. All networks were trained towards a
categorical cross-entropy objective using the stochastic
Adam optimizer [KB14] with β1 = 0.9, β2 = 0.999,
ε = 1e-08 and a learning rate of 0.00005.</p>
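      <p>The Leaky ReLU activation is small enough to state directly (a numpy sketch of the definition above, with α = 0.3):</p>
      <preformat>
```python
import numpy as np

def leaky_relu(x, alpha=0.3):
    """Leaky ReLU: identity for x >= 0, alpha * x for negative x."""
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # negative inputs are scaled by 0.3, not cut off
```
      </preformat>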
      <p>The system is implemented in Python, using librosa
[MRL+15] for audio processing and Mel-log-transforms
and the Theano-based library Keras for deep learning.</p>
      <p>Data Augmentation:
To increase the number of training instances we experiment
with two different audio data augmentation methods.
The deformations were applied directly to the audio
signal preceding any further feature calculation procedure
described in Section 2.1. The following two methods were
applied using the MUDA framework for musical data
augmentation [MHB15]:
Time Stretching: slowing down or speeding up the
original audio sample while keeping the same pitch
information. Time stretching was applied using the
multiplication factors 0.5, 0.2 for slowing down and 1.2,
1.5 for increasing the tempo.</p>
      <p>Pitch Shifting: raising or lowering the pitch of an audio
sample while keeping the tempo unchanged. The applied
pitch shifting lowered and raised the pitch by 2 and 5
semitones.</p>
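      <p>The bookkeeping for combining these two deformations can be sketched as follows (illustrative Python; the variable names are ours, and the pairing of every stretch factor with every shift factor follows the instance count given in the text):</p>
      <preformat>
```python
from itertools import product

time_stretch_factors = [0.5, 0.2, 1.2, 1.5]  # tempo multipliers
pitch_shift_semitones = [-5, -2, 2, 5]       # lowered/raised pitch
segments_per_deformation = 3

# Every (stretch, shift) pair is one deformation; three segments are
# drawn from each deformed file.
combos = list(product(time_stretch_factors, pitch_shift_semitones))
extra_instances = len(combos) * segments_per_deformation
print(len(combos), extra_instances)  # 16 combinations, 48 instances
```
      </preformat>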
      <p>For each deformation three segments have been randomly
chosen from the audio content. The combinations of the
two deformations with four different factors each resulted
thus in 48 additional data instances per audio file.
</p>
      <p>Evaluation:
As our system analyzes and predicts multiple audio
segments per input file, there are several ways to perform
the final prediction of an input instance:
Raw Probability: The raw accuracy of predicting
the segments as separated instances ignoring their file
dependencies.</p>
      <p>Maximum Probability: The output probabilities of the
Softmax layer for the corresponding number of classes of
the datasets are summed up for all segments belonging
to the same input file. The predicted class is determined
by the maximum probability among the classes from the
summed probabilities.</p>
      <p>Majority Vote: Here, the predictions are made for each
segment processed from the audio file as input instance to
the network. The class of an audio segment is determined
by the maximum probability as output by the Softmax layer
for this segment instance. Then, a majority vote is taken on
all predicted classes from all segments of the same input file.
Majority vote determines the class that occurs most often.</p>
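      <p>The three aggregation strategies can be sketched over a matrix of per-segment class probabilities (an illustrative numpy fragment; the function name is ours):</p>
      <preformat>
```python
import numpy as np

def aggregate(segment_probs):
    """Combine per-segment softmax outputs into file-level predictions.

    segment_probs: shape (n_segments, n_classes).
    Returns (raw per-segment classes, maximum-probability class,
    majority-vote class).
    """
    raw = np.argmax(segment_probs, axis=1)     # Raw: each segment alone
    summed = segment_probs.sum(axis=0)         # Maximum Probability:
    max_prob = int(np.argmax(summed))          # argmax of summed probs
    majority = int(np.bincount(raw).argmax())  # Majority Vote on raw
    return raw, max_prob, majority

probs = np.array([[0.6, 0.4],
                  [0.2, 0.8],
                  [0.45, 0.55]])
raw, max_prob, majority = aggregate(probs)
print(raw.tolist(), max_prob, majority)  # [0, 1, 1] 1 1
```
      </preformat>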
      <p>We used stratified 4-fold cross validation. Multi-level
stratification was applied, paying special attention to the
multiple segments used per file. It was ensured that the
files were distributed according to their genre distributions
and that no segments of a training file were provided in
the corresponding test split.</p>
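      <p>The file-level stratification can be sketched as follows (an illustrative fragment; the round-robin assignment is our simplification of the multi-level stratification, and all names are ours). Assigning folds per file rather than per segment guarantees that no segment of a training file leaks into the test split:</p>
      <preformat>
```python
import random
from collections import defaultdict

def stratified_file_folds(file_genres, n_folds=4, seed=0):
    """Assign each file (not each segment) to a fold, stratified by
    genre, so that every segment of a file stays in the same fold and
    genres are spread evenly across folds."""
    rnd = random.Random(seed)
    by_genre = defaultdict(list)
    for name, genre in file_genres.items():
        by_genre[genre].append(name)
    folds = {}
    for genre in sorted(by_genre):
        names = sorted(by_genre[genre])
        rnd.shuffle(names)
        for i, name in enumerate(names):
            folds[name] = i % n_folds  # round-robin balances genres
    return folds

# 16 files, two genres: each of the 4 folds gets 2 files per genre.
file_genres = {f"song{i}": ("rock" if i % 2 else "jazz") for i in range(16)}
folds = stratified_file_folds(file_genres)
```
      </preformat>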
      <p>The experiments were grouped according to the four
different datasets. For each dataset the performances for
the shallow and deep architecture were evaluated, followed
by the experiments including data augmentation. The
architectures were further evaluated according to their performance
after a different number of training epochs. The networks
were trained and evaluated after 100 and 200 epochs
without early stopping. Preceding experiments showed
that test accuracy could improve despite rising validation
loss, though on smaller sets no significant improvement was
recognizable after 200 epochs. For the experiments with
data augmentation, the augmented data was only used to
train the networks (see Table 3.1). For testing the network
the original segments without deformations were used.
For the evaluation, four datasets have been used. We have
chosen these datasets due to their increasing number of
tracks and because they are well known and extensively
evaluated in the automatic genre classification task. This
should also provide comparability with experiments
reported in literature.</p>
      <p>GTZAN: This data set was compiled by George
Tzanetakis [Tza02] in 2000-2001 and consists of 1000 audio tracks
equally distributed over the 10 music genres: blues, classical,
country, disco, hiphop, pop, jazz, metal, reggae, and rock.
ISMIR Genre: This data set has been assembled for
training and development in the ISMIR 2004 Genre
Classification contest [CGG+06]. It contains 1458 full
length audio recordings from Magnatune.com distributed
across the 6 genre classes: Classical, Electronic, JazzBlues,
MetalPunk, RockPop, World.</p>
      <p>Latin Music Database (LMD): [SKK08] contains
3227 songs, categorized into the 10 Latin music genres Axé,
Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa,
Sertaneja and Tango.</p>
      <p>Million Song Dataset (MSD): [BMEWL11] a
collection of one million music pieces, enables methods for
large-scale applications. It comes as a collection of
metadata such as the song names, artists and albums, together
with a set of features extracted with the Echo Nest
services, such as loudness, tempo, and MFCC-like features.
We used the CD2C genre assignments as ground truth
[Sch15] which are an adaptation of the MSD genre label
assignments presented in [SMR12]. For the experiments
a subset of approximately 50,000 tracks was sampled.</p>
      <p>Results:
The results of the experiments are provided in Table 4.
For each dataset all combinations of experimental results
were tested for significant difference using a Wilcoxon
signed-rank test. None of the presented results showed
a significant difference for p &lt; 0.05. Thus, we tested at the
next higher level p &lt; 0.1. The following observations on
the datasets were made:
GTZAN: Training the models with 200 epochs instead
of only 100 epochs significantly improved the raw and max
accuracies for the shallow models. An additional test on
training 500 epochs showed no further increase in accuracy
for any of the three prediction methods. Training longer had
no effect on the deep model due to early over-fitting. No
significant differences were observed between shallow and
deep models except for the raw prediction values of the
shallow model (200 epochs) exceeding those of the deep model
(200 epochs). While the improvements through data
augmentation on deep models compared to the un-augmented
longer trained deep models are not significant, considerable
improvements of 4.2% were observed for models trained
for the same number of epochs. An interesting observation
is the negative effect of data augmentation on the shallow
models where longer training outperformed augmentation.
ISMIR Genre: Training more epochs only had a
significant positive effect on the max and maj values of the
deep model but none for the shallow ones. The deep models
showed no significant advantage over the shallow
architectures which also showed higher raw prediction values even
on shorter trained models. Data augmentation improved
the predictions of both architectures with significant
improvements for the raw values. Especially the deep models
significantly profited from data augmentation with max
values increased by 3.08% for models trained for the same
number of epochs and 2.05% for the longer trained models.
The improvements from deep over shallow models using
augmented data were only significant for the raw values.
Latin: Training more epochs only had a positive effect for
the raw and max values of the shallow model, but not for
the deep architecture. On this dataset, the deep model
significantly outperformed the shallow architecture including
the shallow model trained using data augmentation. Data
augmentation significantly improved the
performance of the deep models by 1.61% for the max values.
Similar to the GTZAN dataset, data augmentation showed
a degrading effect on the shallow model which showed
significantly higher accuracy values by training for more epochs.
MSD: A non-significant advantage of deep over shallow
models was observed. Experiments using data
augmentation and longer training were omitted due to the already
large variance provided by the MSD, which exceeds the
preceding datasets in size by factors of 15 to 50.</p>
      <p>[Table 4: raw, max and majority-vote classification accuracies (standard deviations in parentheses) of the shallow and deep models, with and without data augmentation, on the GTZAN, ISMIR Genre and Latin datasets.]</p>
      <p>
In this paper we evaluated shallow and deep CNN
architectures towards their performance on different dataset
sizes in music genre classification tasks. Our observations
showed that for smaller datasets shallow models seem to be
more appropriate since deeper models showed no significant
improvement. Deeper models performed slightly better in
the presence of larger datasets, but a clear conclusion that
deeper models are generally better could not be drawn.
Data augmentation using time stretching and pitch shifting
significantly improved the performance of deep models. For
shallow models on the contrary it showed a negative effect
on the small datasets. Thus, deeper models should be
considered when applying data augmentation. Comparing
the presented results with previously reported evaluations
on the same datasets [SR12] shows, that the CNN based
approaches already outperform handcrafted music features
such as the Rhythm Patterns (RP) family [LSC+10]
(highest values: GTZAN 73.2%, ISMIR Genre 80.9%, Latin
87.3%) or the Temporal Echonest Features [SR12] presented
in the referred study (highest values: GTZAN 66.9%,
ISMIR Genre 81.3%, Latin 89.0%).</p>
      <p>Future work will focus on further data augmentation
methods to improve the performance of neural networks
on small datasets and the Million Song Dataset as well
as on different network architectures.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Thierry</given-names>
            <surname>Bertin-Mahieux</surname>
          </string-name>
          , Daniel PW Ellis, Brian Whitman, and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Lamere</surname>
          </string-name>
          .
          <article-title>The million song dataset</article-title>
          .
          <source>In ISMIR</source>
          , volume
          <volume>2</volume>
          , page 10,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Cano</surname>
          </string-name>
          , Emilia Gómez, Fabien Gouyon, Perfecto Herrera, Markus Koppenberger, Beesuan Ong, Xavier Serra,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Streich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Wack</surname>
          </string-name>
          .
          <article-title>ISMIR 2004 audio description contest</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Zhouyu</given-names>
            <surname>Fu</surname>
          </string-name>
          , Guojun Lu, Kai Ming Ting, and
          <string-name>
            <given-names>Dengsheng</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>A survey of audio-based music classification and annotation</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>13</volume>
          (
          <issue>2</issue>
          ):
          <fpage>303</fpage>
          -
          <lpage>319</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>CoRR</source>
          , abs/1412.6980,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Lidy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Schindler</surname>
          </string-name>
          .
          <article-title>CQT-based convolutional neural networks for audio scene classification</article-title>
          .
          <source>In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)</source>
          , pages
          <fpage>60</fpage>
          -
          <lpage>64</lpage>
          ,
          <year>September 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Lidy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Schindler</surname>
          </string-name>
          .
          <article-title>Parallel convolutional neural networks for music genre and mood classification</article-title>
          .
          <source>Technical report, Music Information Retrieval Evaluation eXchange (MIREX 2016)</source>
          ,
          <year>August 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>Signal Processing</source>
          ,
          <volume>90</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1032</fpage>
          -
          <lpage>1048</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Brian</given-names>
            <surname>McFee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eric J</given-names>
            <surname>Humphrey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Juan P</given-names>
            <surname>Bello</surname>
          </string-name>
          .
          <article-title>A software framework for musical data augmentation</article-title>
          .
          <source>In International Society for Music Information Retrieval Conference (ISMIR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Andrew L.</given-names>
            <surname>Maas</surname>
          </string-name>
          , Awni Y. Hannun, and
          <string-name>
            <given-names>Andrew Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>Rectifier nonlinearities improve neural network acoustic models</article-title>
          .
          <source>In ICML</source>
          ,
          <volume>28</volume>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Brian</given-names>
            <surname>McFee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Colin</given-names>
            <surname>Raffel</surname>
          </string-name>
          , Dawen Liang, Daniel PW Ellis,
          <string-name>
            <given-names>Matt</given-names>
            <surname>McVicar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eric</given-names>
            <surname>Battenberg</surname>
          </string-name>
          , and Oriol Nieto.
          <article-title>librosa: Audio and music signal analysis in python</article-title>
          .
          <source>In Proceedings of the 14th Python in Science Conference</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Pons</surname>
          </string-name>
          , Thomas Lidy, and Xavier Serra.
          <article-title>Experimenting with musically motivated convolutional neural networks</article-title>
          .
          <source>In Proceedings of the 14th International Workshop on Content-based Multimedia Indexing (CBMI 2016)</source>
          , Bucharest, Romania,
          <year>June 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Hendrik</given-names>
            <surname>Schreiber</surname>
          </string-name>
          .
          <article-title>Improving genre annotations for the million song dataset</article-title>
          .
          <source>In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015)</source>
          , Malaga, Spain,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Carlos N.</given-names>
            <surname>Silla Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Celso A. A.</given-names>
            <surname>Kaestner</surname>
          </string-name>
          , and Alessandro L Koerich.
          <article-title>The Latin Music Database</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Music Information Retrieval</source>
          , pages
          <fpage>451</fpage>
          -
          <lpage>456</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Wei Liu, Yangqing Jia,
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Sermanet</surname>
          </string-name>
          , Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          .
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Schindler</surname>
          </string-name>
          , Rudolf Mayer, and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Rauber</surname>
          </string-name>
          .
          <article-title>Facilitating comprehensive benchmarking experiments on the million song dataset</article-title>
          .
          <source>In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR 2012)</source>
          , pages
          <fpage>469</fpage>
          -
          <lpage>474</lpage>
          , Porto, Portugal, October 8-12
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Schindler</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Rauber</surname>
          </string-name>
          .
          <article-title>Capturing the temporal domain in echonest features for improved classification effectiveness</article-title>
          .
          <source>In Adaptive Multimedia Retrieval, Lecture Notes in Computer Science</source>
          , Copenhagen, Denmark, October
          24-25,
          <year>2012</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Tzanetakis</surname>
          </string-name>
          .
          <article-title>Manipulation, analysis and retrieval systems for audio signals</article-title>
          .
          <source>PhD thesis</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>