<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MediaEval 2017 AcousticBrainz Genre Task: Multilayer Perceptron Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khaled Koutini</string-name>
          <email>khaled.koutini@jku.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alina Imenina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Dorfer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Rudolf Gruber</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schedl</string-name>
          <email>markus.schedl@jku.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler University Linz</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This report describes the approach developed by the JKU team for the MediaEval 2017 AcousticBrainz Genre Task. After experimenting with various classifiers on the development dataset, our final approach is based on multilayer perceptron classifiers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        We present an approach for recognizing the genre of unknown music
recordings given the data provided in the AcousticBrainz dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Details about data, task, and evaluation are described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Our work addresses both subtasks of the MediaEval 2017
AcousticBrainz Genre Task. For the single-source classification
subtask, a multilayer perceptron is applied to each source. For the
multiple-source classification subtask, we use similarity measures
between sources to adjust the probability of a recording belonging
to a certain genre in each source.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>Using the script provided by the organizers, we split the ground
truth of each source into a training set and a validation set, comprising
80% and 20% of the original data, respectively. The split
also ensures that recordings from the same recording group never
appear in both the training and the validation set, in order to avoid
the album effect.</p>
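      <p>As an illustration only, a group-aware split of this kind can be sketched in Python as follows. This is not the organizers' script; the function name group_split, the mapping from recording IDs to group IDs, and the greedy filling strategy are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict


def group_split(recording_to_group, validation_fraction=0.2, seed=0):
    """Split recording IDs so that each group lands entirely in one set."""
    groups = defaultdict(list)
    for recording_id, group_id in recording_to_group.items():
        groups[group_id].append(recording_id)

    # Shuffle whole groups, not individual recordings.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    target = validation_fraction * len(recording_to_group)
    train, validation = [], []
    for group_id in group_ids:
        # Fill the validation set first, one whole group at a time.
        if len(validation) < target:
            validation.extend(groups[group_id])
        else:
            train.extend(groups[group_id])
    return train, validation


# Hypothetical toy data: 100 recordings spread over 25 groups.
toy = {f"rec{i}": f"group{i % 25}" for i in range(100)}
train_ids, validation_ids = group_split(toy)
```
]]></preformat>
      <p>Because groups are assigned as a whole, the validation share only approximates 20% when group sizes vary.</p>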
    </sec>
    <sec id="sec-3">
      <title>Feature Selection</title>
      <p>
        As stated in the overview paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we are given for each
recording a set of features extracted using Essentia [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Given the large
number of provided features, a fine-grained manual inspection of
individual features is not feasible. Instead, we pick broad feature
groups high in the Essentia feature hierarchy. Namely, we
use all low-level features, all rhythm features except beats_count
and beats_position, and all tonal features except chords_key, chords_scale,
key_key, and key_scale. Overall, this yields 2646 numerical features
per recording.
      </p>
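      <p>The selection described above could look roughly like the following Python sketch. It assumes that each recording is a nested dictionary with top-level groups named lowlevel, rhythm, and tonal; the helper names and the toy document are hypothetical, and the sketch makes no claim to reproduce the exact count of 2646 features.</p>
      <preformat><![CDATA[
```python
# Feature names excluded in the text, written as "group.name".
EXCLUDED = {
    "rhythm.beats_count", "rhythm.beats_position",
    "tonal.chords_key", "tonal.chords_scale",
    "tonal.key_key", "tonal.key_scale",
}
GROUPS = ("lowlevel", "rhythm", "tonal")


def flatten(node, prefix):
    """Yield (dotted_name, value) pairs for every numeric leaf under node."""
    if isinstance(node, dict):
        for key in sorted(node):
            yield from flatten(node[key], f"{prefix}.{key}")
    elif isinstance(node, (list, tuple)):
        for index, item in enumerate(node):
            yield from flatten(item, f"{prefix}.{index}")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield prefix, float(node)
    # Non-numeric leaves such as strings are skipped.


def select_features(document):
    """Collect numeric features from the chosen groups, minus the exclusions."""
    features = []
    for group in GROUPS:
        for key in sorted(document.get(group, {})):
            name = f"{group}.{key}"
            if name in EXCLUDED:
                continue
            features.extend(flatten(document[group][key], name))
    return features


# Hypothetical, heavily shortened document.
toy_document = {
    "lowlevel": {"dynamic_complexity": 3.5, "mfcc": {"mean": [1.0, 2.0]}},
    "rhythm": {"bpm": 120.0, "beats_count": 400, "beats_position": [0.5, 1.0]},
    "tonal": {"key_key": "A", "key_scale": "minor", "tuning_frequency": 440.0},
    "metadata": {"tags": {"title": "ignored"}},
}
selected = select_features(toy_document)
```
]]></preformat>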
    </sec>
    <sec id="sec-4">
      <title>Neural Network</title>
      <p>We tried various neural network architectures and compared their
performance based on the mean label-wise F-score of batches, using
the Lastfm dataset. The best performing architecture is outlined in
Table 1.</p>
      <p>2.2.1 Input layer. As stated in Section 2.1, there are 2646 input
features. We normalized the input using z-score normalization.</p>
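      <p>A minimal sketch of z-score normalization in plain Python is given below. Fitting the statistics on the training set only and reusing them for other data is an assumption of this example, as is the handling of constant columns; the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import statistics


def fit_zscore(rows):
    """Return per-column means and standard deviations for a list of rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(column) for column in columns]
    # Population standard deviation; constant columns get 1.0 to avoid
    # a division by zero.
    stds = [statistics.pstdev(column) or 1.0 for column in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Subtract the mean and divide by the standard deviation, per column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data with two features; the second one is constant in training.
train_rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
validation_rows = [[2.0, 12.0]]

means, stds = fit_zscore(train_rows)
normalized_train = apply_zscore(train_rows, means, stds)
normalized_validation = apply_zscore(validation_rows, means, stds)
```
]]></preformat>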
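      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Best performing network architecture.</p>
        </caption>
        <table>
          <tbody>
            <tr>
              <td>Input</td>
              <td colspan="3">2646</td>
            </tr>
            <tr>
              <td>First layer</td>
              <td>4000 Dense(ReLU) + Drop-Out(0.5)</td>
              <td>4000 Dense(tanh) + Drop-Out(0.5)</td>
              <td>4000 Dense(sigmoid) + Drop-Out(0.5)</td>
            </tr>
            <tr>
              <td rowspan="4">Second layer</td>
              <td colspan="3">Concat layer</td>
            </tr>
            <tr>
              <td colspan="3">8000 Dense + Drop-Out(0.6)</td>
            </tr>
            <tr>
              <td colspan="3">Batch-Normalization layer</td>
            </tr>
            <tr>
              <td colspan="3">Non-linearity (ReLU)</td>
            </tr>
            <tr>
              <td>Output layer</td>
              <td colspan="3">k-bins sigmoid</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        2.2.2 First hidden layer. The first hidden layer is a dense layer
consisting of 12000 units, where the first 4000 units have a rectified
linear [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] activation function, the next 4000 units have a tanh
activation function, and the last 4000 units have a sigmoid activation
function. As shown in Table 1, each group of units is followed by a
dropout layer with a dropout probability of 0.5.
      </p>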
      <p>2.2.3 Second hidden layer. The second hidden layer consists of
8000 batch-normalized rectified linear units. Its input is the
concatenation of the outputs of the three groups of the first layer.
The dense part of the second layer has neither an activation function
nor a bias. As outlined in Table 1, its output passes through dropout
with a probability of 0.6, then batch normalization, and finally the
ReLU non-linearity.</p>
      <p>2.2.4 Output layer. The output layer consists of k units, where
k is source specific and denotes the number of labels of the source
(genre or sub-genre). The activation function of the output layer is
sigmoid.</p>
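      <p>To make the data flow through the layers concrete, the following plain-Python sketch runs an inference pass through a network of this shape with toy dimensions. It is an illustration under assumptions, not our training code: the parameter names, the random initialization, the omission of a learned scale and shift in batch normalization, and the treatment of dropout as inactive at inference time are all choices made for the example.</p>
      <preformat><![CDATA[
```python
import math
import random


def dense(x, weights, bias=None):
    """Affine map with one output per weight row and an optional bias."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return out if bias is None else [o + b for o, b in zip(out, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def tanh(v):
    return [math.tanh(a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def batch_norm(v, mean, var, eps=1e-5):
    """Inference-time batch normalization using stored statistics."""
    return [(a - m) / math.sqrt(s + eps) for a, m, s in zip(v, mean, var)]


def forward(x, p):
    """Inference pass; dropout is left out because it only acts in training."""
    # First hidden layer: three parallel dense groups, concatenated.
    h1 = (relu(dense(x, p["w_relu"], p["b_relu"]))
          + tanh(dense(x, p["w_tanh"], p["b_tanh"]))
          + sigmoid(dense(x, p["w_sig"], p["b_sig"])))
    # Second hidden layer: dense without bias, batch normalization, ReLU.
    h2 = relu(batch_norm(dense(h1, p["w2"]), p["bn_mean"], p["bn_var"]))
    # Output layer: one sigmoid unit per label.
    return sigmoid(dense(h2, p["w_out"], p["b_out"]))


def init_params(n_in, n_group, n_hidden, k, seed=0):
    """Random toy parameters; the initialization scheme is an assumption."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w_relu": matrix(n_group, n_in), "b_relu": [0.0] * n_group,
        "w_tanh": matrix(n_group, n_in), "b_tanh": [0.0] * n_group,
        "w_sig": matrix(n_group, n_in), "b_sig": [0.0] * n_group,
        "w2": matrix(n_hidden, 3 * n_group),
        "bn_mean": [0.0] * n_hidden, "bn_var": [1.0] * n_hidden,
        "w_out": matrix(k, n_hidden), "b_out": [0.0] * k,
    }


# Toy sizes; the text uses 2646 inputs, 3 x 4000 units, 8000 units, k labels.
params = init_params(n_in=6, n_group=4, n_hidden=8, k=3)
probabilities = forward([0.5, -1.0, 0.0, 2.0, 1.5, -0.5], params)
```
]]></preformat>
      <p>Each entry of probabilities is an independent per-label score, so several labels can be active for the same recording.</p>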
      <p>2.2.5 Loss function. We used mean binary cross-entropy as the loss
function for the network.</p>
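      <p>For reference, mean binary cross-entropy over a batch of multi-label targets can be written in a few lines of Python. The clipping constant eps and the averaging over both recordings and labels are assumptions of this sketch.</p>
      <preformat><![CDATA[
```python
import math


def mean_binary_cross_entropy(targets, predictions, eps=1e-7):
    """Average binary cross-entropy over all recordings and labels."""
    total, count = 0.0, 0
    for target_row, prediction_row in zip(targets, predictions):
        for t, p in zip(target_row, prediction_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


# Toy batch: two recordings, three labels each.
toy_targets = [[1, 0, 0], [0, 1, 1]]
toy_predictions = [[0.9, 0.2, 0.1], [0.3, 0.6, 0.8]]
toy_loss = mean_binary_cross_entropy(toy_targets, toy_predictions)
```
]]></preformat>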
    </sec>
    <sec id="sec-5">
      <title>Adjusting Threshold</title>
      <p>
        The output of our neural network consists of k numerical values for each
recording, as stated in Section 2.2.4. Each output lies in the range
[0, 1] and represents the probability of the label (genre or sub-genre)
corresponding to the respective output neuron. If the probability of
a label for a given recording is larger than a predefined threshold,
we assign that label to the recording. In our experiments,
a threshold of 0.5 for all labels resulted
in high precision but low recall. Since the goal of the task is to
optimize precision, recall, and F-score, we adjusted the threshold for
each label individually to obtain the best value for these evaluation
measures. The best results are obtained either with a static threshold of
0.2 or 0.3 for all labels, or with a dynamic threshold for
each label, estimated by maximizing the mean F-score.
      </p>
      <p>
        The second part of the task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] consists of combining information
from multiple sources to predict the labels of one source. To achieve this,
we calculate the similarity between every label of one source and
every label of all other sources. These similarities let us adjust the
probability of assigning a source label to a recording using the label
probabilities predicted by models trained on the other sources. As stated
in the overview paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the datasets of different sources intersect.
We exploit this intersection to estimate the similarity between the
labels of different sources.
      </p>
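      <p>The dynamic threshold selection described above can be sketched in Python as a per-label grid search on held-out data. The grid of candidate thresholds, the independent search for each label, and the function names are assumptions of this example; the text only states that thresholds are estimated by maximizing the mean F-score.</p>
      <preformat><![CDATA[
```python
def f_score(truth, predicted):
    """F-score of one label from parallel lists of true and predicted flags."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_thresholds(truth, probabilities, grid=None):
    """Pick, for each label column, the grid threshold with the best F-score."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    thresholds = []
    for label in range(len(truth[0])):
        y = [row[label] for row in truth]
        p = [row[label] for row in probabilities]
        # A label is assigned when its probability exceeds the threshold.
        # On ties, max() keeps the lowest threshold in the grid.
        thresholds.append(
            max(grid, key=lambda th: f_score(y, [v > th for v in p])))
    return thresholds


# Toy validation data: four recordings, two labels.
toy_truth = [[1, 0], [1, 1], [0, 1], [0, 0]]
toy_probabilities = [[0.40, 0.10], [0.35, 0.80], [0.10, 0.70], [0.05, 0.20]]
toy_thresholds = best_thresholds(toy_truth, toy_probabilities)
```
]]></preformat>
      <p>In this toy example, a fixed threshold of 0.5 would miss both positive recordings of the first label, which mirrors the low recall observed with that setting.</p>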
      <p>Labels are modeled as vectors, where each label is represented by a
vector indicating the recordings annotated with it in the ground truth. The
similarity between labels from different sources is measured as the cosine
similarity between these label vectors. Based on that, we compute a
similarity matrix M<sub>i,j</sub> for each pair of sources i and j, where
element (a, b) holds the similarity between label a of source i and label b
of source j. We use these pairwise similarities as conversion
matrices to project the probabilities produced by a model trained on
one source onto the labels of another source. For a specific recording,
the probabilities P<sub>i</sub> of the labels of source i form a vector of
length n<sub>i</sub>. This vector is produced by a model trained on the
training set of source i. To also make use of the models trained on other
sources, we compute P<sub>j</sub> · M<sub>j,i</sub>, which is a vector of
the same length n<sub>i</sub> and also represents the probabilities of the
labels of source i. However, this vector is produced
by a model trained on source j, by projecting the probabilities P<sub>j</sub>
using the respective conversion matrix. The final label probabilities
(task 2) for a specific recording of source k are a weighted average of the
label probabilities produced by the model trained on the training set of the
recording’s source and the projected label probabilities of all
other sources (see Equation (1)).</p>
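      <p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math><![CDATA[Y_k = \frac{P_k + \frac{1}{3} \sum_{i \neq k} P_i \cdot M_{i,k}}{2}]]></tex-math>
        </disp-formula>
      </p>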
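      <p>The label similarity, the projection, and the combination in Equation (1) can be illustrated with the following Python sketch. It assumes binary label vectors over the recordings shared by two sources; the function names and the toy labels are hypothetical, and for brevity the example combines the target source with a single other source, so the mean is taken over one projection instead of three.</p>
      <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if one is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(labels_i, labels_j):
    """Entry [a][b] is the similarity of label a (source i) and label b (source j)."""
    return [[cosine(u, v) for v in labels_j] for u in labels_i]


def project(p_i, m_ik):
    """Row vector times matrix: map source-i probabilities onto source-k labels."""
    n_k = len(m_ik[0])
    return [sum(p * row[b] for p, row in zip(p_i, m_ik)) for b in range(n_k)]


def combine(p_k, projections):
    """Average the own prediction with the mean of the projected predictions."""
    mean_projection = [sum(col) / len(projections) for col in zip(*projections)]
    return [(own + other) / 2 for own, other in zip(p_k, mean_projection)]


# Toy ground truth over four shared recordings (1 = annotated with the label).
labels_i = [[1, 1, 0, 0], [0, 0, 1, 1]]  # source i: two labels
labels_k = [[1, 1, 0, 0], [0, 1, 1, 0]]  # source k: two labels

m_ik = similarity_matrix(labels_i, labels_k)
projected = project([0.8, 0.2], m_ik)  # source-i model output for one recording
final = combine([0.6, 0.3], [projected])  # source-k model output plus projection
```
]]></preformat>
      <p>Since the rows of a cosine similarity matrix are not normalized, projected values are not guaranteed to stay within [0, 1]; this sketch applies no rescaling.</p>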
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND ANALYSIS</title>
      <p>Predicting sub-genre labels is harder than predicting genre labels,
which might be a result of the fewer training examples for those
sub-genres in the dataset.</p>
      <p>
        We submitted three runs for the first task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: two runs using static
thresholds of 0.2 and 0.3, and one run using dynamic thresholds as
described in Section 2.3. We also submitted five runs for the second
task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: two runs identical to the static-threshold runs of task 1, and
three runs based on probabilities calculated as described in Section 2.4,
using static thresholds of 0.2 and 0.3 and dynamic thresholds.
      </p>
      <p>
        Table 4 summarizes the official F-score results [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] of our best run
for each source.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Urbano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H</given-names>
            <surname>Schreiber</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Official results of the MediaEval 2017 AcousticBrainz Genre Task</article-title>
          . https://multimediaeval.github.io/2017-AcousticBrainz-Genre-Task/results/. (
          <year>2017</year>
          ). [Online; accessed 09-September-2017].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Urbano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H</given-names>
            <surname>Schreiber</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The MediaEval 2017 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2017 Workshop</source>
          . Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Mayor</surname>
          </string-name>
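          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Roma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Salamon</surname>
          </string-name>
          ,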
          <string-name>
            <given-names>J.R.</given-names>
            <surname>Zapata</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Essentia: An Audio Analysis Library for Music Information Retrieval</article-title>
          .
          <source>In International Society for Music Information Retrieval (ISMIR'13) Conference</source>
          . Curitiba, Brazil,
          <fpage>493</fpage>
          -
          <lpage>498</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Vinod</given-names>
            <surname>Nair</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey E</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Rectified linear units improve restricted Boltzmann machines</article-title>
          .
          <source>In Proceedings of the 27th international conference on machine learning (ICML-10)</source>
          .
          <fpage>807</fpage>
          -
          <lpage>814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tsukanov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>AcousticBrainz: a community platform for gathering music information obtained from audio</article-title>
          .
          <source>In Proceedings of the 16th International Society for Music Information Retrieval Conference</source>
          . Malaga, Spain,
          <fpage>786</fpage>
          -
          <lpage>792</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>