<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MediaEval 2018 AcousticBrainz Genre Task: A CNN Baseline Relying on Mel-Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hendrik Schreiber</string-name>
          <email>hs@tagtraum.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>tagtraum industries incorporated</institution>
          <country>USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>These working notes describe a relatively simple baseline for the MediaEval 2018 AcousticBrainz Genre Task. As its classifier, it uses a fully convolutional neural network (CNN) that takes only the low-level AcousticBrainz melband features as input.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>We present a baseline approach for the MediaEval 2018
AcousticBrainz Genre Task. The task is defined as follows:</p>
      <p>
        Based on provided track-level features, participants have to
estimate genre labels for four different datasets (AllMusic, Discogs,
Last.fm, and tagtraum), featuring four different label namespaces.
Subtask 1 asks participants to train separately on each of the datasets
and their respective labels and predict those labels for separate test
sets. Subtask 2 allows training on the union of all four training
datasets, but still requires predictions for the four test sets in their
respective label spaces. For more details about the task, see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>Our baseline approach explores how well a convolutional neural
network (CNN) performs when trained on a relatively small
subset of the available pre-computed features. For this purpose, we
have chosen to train only on Mel-features. The complete code is
available on GitHub<sup>1</sup>.</p>
    </sec>
    <sec id="sec-3">
      <title>Feature Selection</title>
      <p>
        Traditionally, music genre recognition (MGR) has often relied on
Mel-based features; in fact, one of the most frequently cited MGR
publications uses Mel-frequency cepstral coefficients (MFCCs) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Mel-based approaches attempt to capture the timbre of a track, thus
allowing conjectures about its instrumentation and genre. They do
not necessarily take temporal properties into account and therefore
often ignore an important aspect of musical expression that can
also be used for genre/style classification; see, e.g., [
        <xref ref-type="bibr" rid="ref8">8</xref>
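        ]. But since
we are only interested in finding a baseline for more sophisticated
systems, using just the provided melbands features is a reasonable
approach. Lowlevel AcousticBrainz<sup>2</sup> data offers nine different
Mel-features (global statistics: min, max, mean, ...) with 40 bands each,
resulting in a total of 360 values per track. Because Mel-bands have
a spatial relationship to each other, we organize the data into nine
different channels, each featuring a 40-dimensional vector,
resulting in an (N, 40, 9)-dimensional tensor with N being the number of
samples. Each of the 40-dimensional feature vectors is scaled so
that its maximum is 1.
        <fn id="fn1">
          <label>1</label>
          <p>https://github.com/hendriks73/melbaseline</p>
        </fn>
        <fn id="fn2">
          <label>2</label>
          <p>https://acousticbrainz.org/</p>
        </fn>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Total number of parameters in each network.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th />
                <th>AllMusic</th>
                <th>Discogs</th>
                <th>Last.fm</th>
                <th>tagtraum</th>
                <th>Subtask 2</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Number of Parameters</td>
                <td>918,646</td>
                <td>685,479</td>
                <td>691,683</td>
                <td>675,656</td>
                <td>1,258,315</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>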
      </p>
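      <p>The arrangement of the features can be illustrated with a minimal NumPy sketch. It assumes that the nine statistics of a track arrive as nine 40-band vectors, that all values are positive, and that scaling means dividing each vector by its own maximum. The function name and the random demo data are hypothetical and not taken from the published code.</p>
      <preformat><![CDATA[
```python
import numpy as np

N_STATS = 9   # nine global statistics per track (min, max, mean, ...)
N_BANDS = 40  # Mel bands per statistic


def build_feature_tensor(stats_per_track):
    """Turn (N, 9, 40) statistics into a scaled (N, 40, 9) tensor."""
    x = np.asarray(stats_per_track, dtype=np.float64)
    if x.ndim != 3 or x.shape[1:] != (N_STATS, N_BANDS):
        raise ValueError("expected shape (N, 9, 40)")
    # Scale every 40-dimensional vector so that its maximum is 1.
    # Assumption: a vector whose maximum is 0 is left unchanged.
    peak = x.max(axis=2, keepdims=True)
    safe_peak = np.where(peak == 0.0, 1.0, peak)
    x = x / safe_peak
    # Bands become the spatial axis, statistics become channels.
    return np.transpose(x, (0, 2, 1))


# Random stand-in data for four tracks (not real AcousticBrainz values).
rng = np.random.default_rng(0)
demo = rng.uniform(0.1, 5.0, size=(4, N_STATS, N_BANDS))
features = build_feature_tensor(demo)
print(features.shape)
```
]]></preformat>
      <p>Keeping the bands on the second axis lets a one-dimensional convolution slide across neighboring Mel bands while treating the nine statistics as input channels.</p>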
    </sec>
    <sec id="sec-4">
      <title>Neural Network</title>
      <p>
        We choose to use the fully convolutional network (FCN) architecture
depicted in Figure 1. In essence, the network consists of four similar
feature extraction blocks, each formed by a one-dimensional
convolutional layer, an ELU activation function [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a dropout layer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
with dropout probability 0.2, an average pooling layer (omitted in
the last extraction block), and lastly a batch normalization layer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
From block to block, the number of filters increases from 64 to
512, while the length of the input decreases from 40 to 5 due to average
pooling with a pool size of 2. The feature extraction blocks are
followed by a classification block consisting of a one-dimensional
convolution, an ELU activation function, a batch normalization
layer, a global average pooling layer, and sigmoid output units.
Sigmoid activation is used for the output because the task
is a multi-label, multi-class problem. Note that the number of output
dimensions depends on the number of different labels in the dataset.
We therefore refer to it with the placeholder OUT. The total number
of parameters in each network is listed in Table 1.
      </p>
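      <p>The reduction of the input length from 40 to 5 can be checked with a few lines of plain Python. This is only a sketch of the shape arithmetic: it assumes that the filter count doubles from block to block (64, 128, 256, 512; the text states only the first and last value) and that the convolutions preserve the length, so that only pooling shortens the sequence.</p>
      <preformat><![CDATA[
```python
# Assumption: filter counts double per block; only 64 and 512 are given.
FILTERS = (64, 128, 256, 512)
POOL_SIZE = 2


def trace_shapes(input_length=40, input_channels=9):
    """Return (length, channels) before and after each extraction block."""
    shapes = [(input_length, input_channels)]
    length = input_length
    last = len(FILTERS) - 1
    for i, n_filters in enumerate(FILTERS):
        # Assumption: length-preserving convolutions ("same" padding).
        if i != last:
            length //= POOL_SIZE  # pooling is omitted in the last block
        shapes.append((length, n_filters))
    return shapes


shapes = trace_shapes()
for block, (length, channels) in enumerate(shapes):
    print(block, length, channels)
```
]]></preformat>
      <p>Three pooling steps with a pool size of 2 yield lengths of 20, 10, and 5, which matches the stated final length when the fourth block skips pooling.</p>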
    </sec>
    <sec id="sec-5">
      <title>Training</title>
      <p>
        For subtask 1, we train the network using the provided training and
validation sets with binary cross-entropy as the loss function, Adam [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
with a learning rate of 0.001 as the optimizer, and a batch size of 1,000.
To avoid overfitting, we employ early stopping with a patience of
50 epochs and keep the last model that still showed an improvement
in its validation loss.
      </p>
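      <p>The early stopping rule can be sketched independently of any deep learning framework. The function below is a hypothetical helper, not part of the published code; it assumes that an improvement means a strictly lower validation loss and it operates on a list of per-epoch losses.</p>
      <preformat><![CDATA[
```python
def early_stopping(val_losses, patience=50):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without an
    improvement; the model from `best_epoch` is the one that is kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # The losses ran out before the patience was exhausted.
    return best_epoch, len(val_losses) - 1


# Toy run with a patience of 3 instead of 50 to keep the example short.
losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]
best, stopped = early_stopping(losses, patience=3)
print(best, stopped)
```
]]></preformat>
      <p>In the toy run, the lower loss in the final epoch is never reached, which illustrates the trade-off that the patience value controls.</p>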
      <p>Because the training data is very unbalanced, we experimented
with balancing the training data with respect to the main genre
labels via oversampling. As this led to worse results, balancing is
not part of this submission.</p>
      <p>For subtask 2 we gently normalize the provided labels by
converting them to lowercase and removing all non-alphanumeric
characters. Based on these transformed labels we create a unified
training set.</p>
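      <p>A minimal sketch of this label normalization follows. It assumes that alphanumeric refers to ASCII letters and digits, and the example spellings are hypothetical rather than taken from the datasets. The helper that merges label sets only illustrates how differently spelled labels collapse into one vocabulary.</p>
      <preformat><![CDATA[
```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
_NON_ALNUM = re.compile(r"[^a-z0-9]")


def normalize_label(label):
    """Lowercase a label and remove all non-alphanumeric characters."""
    return _NON_ALNUM.sub("", label.lower())


def unified_vocabulary(label_sets):
    """Return the sorted union of normalized labels over several datasets."""
    merged = set()
    for labels in label_sets:
        merged.update(normalize_label(label) for label in labels)
    return sorted(merged)


# Hypothetical label spellings from two datasets.
vocabulary = unified_vocabulary([["Hip-Hop", "Rock"], ["hip hop", "Jazz"]])
print(vocabulary)
```
]]></preformat>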
      <p>Tables 2 and 3: precision (P), recall (R), and F-score (F) for the AllMusic, tagtraum, Last.fm, and Discogs datasets, each averaged per track and per label over all labels, genre labels, and subgenre labels.</p>
      <sec id="sec-5-18">
        <p>
The output of the last network layer consists of as many values
in the range [0, 1] as there are different labels in the dataset (OUT).
If one of these values is greater than a predefined threshold, we
assume that the associated label applies to the track. To
optimize the trade-off between precision and recall, we choose
this threshold individually for each label based on the maximum
F-score for predictions on the validation set [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], also known as
the plug-in rule approach [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. If no prediction for a given track exceeds its
threshold, we divide all predictions by their respective
thresholds and pick the label corresponding to the largest value.
        </p>
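        <p>The thresholding step can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the published implementation: thresholds are searched over a fixed candidate grid (the text does not say how candidates are generated), all candidates are positive, and the function names and toy arrays are hypothetical.</p>
        <preformat><![CDATA[
```python
import numpy as np


def f_score(y_true, y_pred):
    """F1 score for two boolean vectors; 0 if nothing is true or predicted."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_thresholds(val_true, val_scores, candidates):
    """Pick, per label, the candidate threshold with the highest F-score."""
    n_labels = val_true.shape[1]
    thresholds = np.empty(n_labels)
    for j in range(n_labels):
        scores = [f_score(val_true[:, j], val_scores[:, j] > t)
                  for t in candidates]
        thresholds[j] = candidates[int(np.argmax(scores))]
    return thresholds


def predict_labels(scores, thresholds):
    """Apply per-label thresholds with a fallback for tracks without labels."""
    pred = scores > thresholds
    # Fallback: pick the label with the largest score/threshold ratio.
    rows = np.flatnonzero(~pred.any(axis=1))
    best = np.argmax(scores / thresholds, axis=1)
    pred[rows, best[rows]] = True
    return pred


# Toy validation data: four tracks, two labels.
val_true = np.array([[1, 0], [1, 0], [0, 1], [0, 0]], dtype=bool)
val_scores = np.array([[0.9, 0.2], [0.6, 0.1], [0.3, 0.4], [0.2, 0.05]])
thresholds = fit_thresholds(val_true, val_scores, np.array([0.25, 0.5, 0.75]))

test_scores = np.array([[0.7, 0.1], [0.4, 0.1], [0.1, 0.3], [0.2, 0.2]])
predictions = predict_labels(test_scores, thresholds)
print(thresholds)
print(predictions)
```
]]></preformat>
        <p>Dividing by the thresholds in the fallback puts all labels on a comparable scale, so a label with a low threshold is not penalized for producing small raw scores.</p>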
        <p>Since we are using one unified training set for subtask 2, we need
to reduce its output to labels that are valid in the context of a specific
test dataset. We do so by reverting the applied normalization and
dropping all labels not occurring in the test dataset.</p>
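        <p>One way to revert the normalization is a reverse lookup built from the label set of the target dataset, sketched below. The sketch assumes the same normalization as in training (lowercase, ASCII letters and digits only) and that every original label sharing a normalized form is restored. The function names and labels are hypothetical.</p>
        <preformat><![CDATA[
```python
import re


def normalize_label(label):
    """Lowercase a label and remove all non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def to_dataset_labels(predicted, dataset_labels):
    """Map normalized predictions back to one dataset's original labels."""
    reverse = {}
    for original in dataset_labels:
        reverse.setdefault(normalize_label(original), []).append(original)
    restored = []
    for label in predicted:
        # Labels unknown to this dataset are dropped.
        restored.extend(reverse.get(label, []))
    return restored


# Hypothetical label set of one dataset and predictions of the unified model.
restored = to_dataset_labels(["hiphop", "jazz", "rock"], ["Hip-Hop", "Rock"])
print(restored)
```
]]></preformat>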
      </sec>
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND ANALYSIS</title>
      <p>
        We evaluated a single run for both subtask 1 and 2. Results are listed
in Tables 2 and 3. As expected, all results are well below last year’s
winning submission [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which used a much larger network and
2,646 features. But the achieved scores are competitive with last
year’s second ranked submission [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which used a similar number
of features, though very different ones. Somewhat unexpectedly, the
network trained for subtask 2 was not able to benefit from the
additional training material and generally achieves slightly lower scores
than the networks trained on individual datasets for subtask 1.
      </p>
      <p>
We have shown that a relatively small and simple
convolutional neural network (CNN) trained only on global Mel-features
can achieve respectable scores in this task. Adding temporal
features may improve the results further.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Alastair Porter,
          <string-name>
            <given-names>Julián</given-names>
            <surname>Urbano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hendrik</given-names>
            <surname>Schreiber</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The MediaEval 2018 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources</article-title>
          . In MediaEval 2018 Workshop. Sophia Antipolis, France.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Djork-Arné</given-names>
            <surname>Clevert</surname>
          </string-name>
          , Thomas Unterthiner, and
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Fast and accurate deep network learning by exponential linear units (ELUs)</article-title>
          .
          <source>In International Conference on Learning Representations (ICLR)</source>
          .
          <source>arXiv preprint arXiv:1511.07289</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Krzysztof</given-names>
            <surname>Dembczynski</surname>
          </string-name>
          , Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and
          <string-name>
            <given-names>Eyke</given-names>
            <surname>Hüllermeier</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          . Atlanta, GA, USA,
          <fpage>1130</fpage>
          -
          <lpage>1138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Ioffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</article-title>
          .
          <source>arXiv preprint arXiv:1502.03167</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and Jimmy Lei Ba.
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Koutini</surname>
          </string-name>
          , Alina Imenina, Matthias Dorfer,
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Gruber</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Markus</given-names>
            <surname>Schedl</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>MediaEval 2017 AcousticBrainz Genre Task: Multilayer Perceptron Approach</article-title>
          . In MediaEval 2017 Workshop. Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Murauer</surname>
          </string-name>
          , Maximilian Mayerl, Michael Tschuggnall, Eva Zangerle,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Pichl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Günther</given-names>
            <surname>Specht</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Hierarchical Multilabel Classification and Voting for Genre Classification</article-title>
          . In MediaEval 2017 Workshop. Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          , Florian Eyben, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Rigoll</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Tango or Waltz?: Putting Ballroom Dance Style into Tempo Detection</article-title>
          .
          <source>EURASIP Journal on Audio, Speech, and Music Processing</source>
          <year>2008</year>
          (
          <year>2008</year>
          ),
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Nitish</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>George</given-names>
            <surname>Tzanetakis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Perry</given-names>
            <surname>Cook</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Musical Genre Classification of Audio Signals</article-title>
          .
          <source>IEEE Transactions on Speech and Audio Processing</source>
          <volume>10</volume>
          ,
          <issue>5</issue>
          (
          <year>2002</year>
          ),
          <fpage>293</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>