<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MediaEval 2018 AcousticBrainz Genre Task: A baseline combining deep feature embeddings across datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergio Oramas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitry Bogdanov</string-name>
          <email>dmitry.bogdanov@upf.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alastair Porter</string-name>
          <email>alastair.porter@upf.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Pandora Media Inc.</institution>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Pompeu Fabra</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>In this paper we present a baseline approach for the MediaEval 2018 AcousticBrainz Genre Task that takes advantage of stacking multiple feature embeddings learned on individual genre datasets by simple deep learning architectures. Although we employ basic neural networks, the combination of their deep feature embeddings provides a significant gain in performance compared to each individual network.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        This paper describes our baseline submission to the MediaEval 2018
AcousticBrainz Genre Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The goal of the task is to
automatically classify music tracks by genre based on pre-computed audio
content features provided by the organizers. The challenge uses four
genre datasets that come from different annotation sources and follow
different genre taxonomies. For each dataset, training, validation, and
testing splits are provided. This makes it possible to build and evaluate
classifier models for each genre dataset independently (Subtask 1), as well
as to explore combinations of genre sources in order to boost the
performance of the models (Subtask 2).
      </p>
      <p>For this baseline, we focus on demonstrating how different genre
ground truth sources can be merged using a simple deep learning
architecture. To this end, we explore how stacking deep feature embeddings
obtained on different datasets can benefit genre recognition systems.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Submissions to the previous edition of the task explored late
fusion of predictions made by classifier models trained on each genre
source individually. To predict genres following the taxonomy of a
target source, the proposed solutions applied genre mapping between
taxonomies, either by computing genre co-occurrences on the intersection
of all four training genre datasets [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or by textual
string matching [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In our baseline, we propose an alternative early fusion approach,
similar to the one proposed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for multimodal genre
classification. The approach incorporates knowledge across datasets by
stacking deep feature embeddings learned on each dataset
individually and using those as an input to predict genres for each test
dataset.
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
    </sec>
    <sec id="sec-4">
      <title>Input features</title>
      <p>
        We use all available features extracted from music audio
recordings using Essentia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and provided for the challenge. As a
preprocessing step, we apply one-hot encoding for a few categorical
features related to tonality (key_key, key_scale, chords_key, and
chords_scale) and standardize all features (zero mean, unit
variance). In total, this amounts to 2669 input features.
      </p>
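      <p>The preprocessing step can be sketched as follows. This is a minimal illustration in dependency-free Python, not the code of the baseline: the toy tracks, the bpm field, and the helper names are assumptions made for the example, and only key_scale is taken from the feature list above. The standardization statistics are estimated on the rows passed to fit_standardizer.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


def fit_standardizer(rows):
    """Compute per-column mean and (population) standard deviation."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        # Constant columns get a divisor of 1.0 to avoid division by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def standardize(row, means, stds):
    """Shift and scale one row to zero mean and unit variance per column."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


# Hypothetical toy tracks: one numeric descriptor plus key_scale.
SCALES = ["major", "minor"]
tracks = [
    {"bpm": 100.0, "key_scale": "major"},
    {"bpm": 120.0, "key_scale": "minor"},
    {"bpm": 140.0, "key_scale": "minor"},
]
rows = [[t["bpm"]] + one_hot(t["key_scale"], SCALES) for t in tracks]
means, stds = fit_standardizer(rows)
scaled = [standardize(r, means, stds) for r in rows]
```
]]></preformat>
      <p>In this sketch the one-hot columns are standardized together with the numeric ones, following the statement that all features are standardized.</p>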
    </sec>
    <sec id="sec-5">
      <title>Neural network architecture</title>
      <p>A simple feedforward network is used to predict the probability
of each genre given a track. The network consists of an input layer
of 2669 units (the size of the feature vector for an input recording),
followed by a hidden dense layer of 256 units with ReLU activation,
and an output layer whose number of units equals the number of genres
to be predicted in each dataset. Dropout of 0.5 is applied after the
input and the hidden layer. As the targeted genre classification task
is multi-label, the output layer uses sigmoid activations and is
evaluated with a binary cross-entropy loss.</p>
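      <p>The forward pass of such a network can be sketched as follows. This is an illustrative, dependency-free Python version with toy layer sizes and random weights, not the implementation used for the baseline; all function names are assumptions. Dropout is omitted because it is only active during training.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def dense(x, weights, biases):
    """Affine layer: one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice here)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, params):
    """Return (hidden activations, per-genre probabilities)."""
    (w1, b1), (w2, b2) = params
    hidden = relu(dense(x, w1, b1))
    probs = sigmoid(dense(hidden, w2, b2))
    return hidden, probs


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over independent genre labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


rng = random.Random(0)
# Toy sizes; the text above uses 2669 inputs and 256 hidden units.
N_INPUT, N_HIDDEN, N_GENRES = 8, 4, 3
params = (init_layer(N_INPUT, N_HIDDEN, rng),
          init_layer(N_HIDDEN, N_GENRES, rng))
x = [rng.gauss(0.0, 1.0) for _ in range(N_INPUT)]
hidden, probs = forward(x, params)
loss = binary_cross_entropy(probs, [1, 0, 1])
```
]]></preformat>
      <p>Because each output unit has its own sigmoid, a track can receive several genre labels at once, which is what makes the setup multi-label.</p>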
      <p>
        Mini-batches of 32 items are randomly sampled from the training
data to compute the gradient, and the Adam [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] optimizer is used
to train the models, with the default suggested learning parameters.
The networks are trained for a maximum of 100 epochs with early
stopping. Once trained, we extract the 256-dimensional vectors
from the hidden layer for the training, validation, and test sets.
      </p>
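      <p>The training schedule can be sketched as a loop over shuffled mini-batches with early stopping on the validation loss. The sketch below is an assumption-laden illustration: the patience value is not stated in the text, the per-epoch shuffling is one possible reading of random sampling, and run_epoch and validate are hypothetical callbacks standing in for the actual Adam updates and validation pass.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def minibatches(n_items, batch_size, rng):
    """Yield shuffled index batches covering every item once per epoch."""
    order = list(range(n_items))
    rng.shuffle(order)
    for start in range(0, n_items, batch_size):
        yield order[start:start + batch_size]


def train(run_epoch, validate, max_epochs=100, patience=5):
    """Stop when validation loss has not improved for `patience` epochs.

    `patience` is an assumed value; the text does not state it.
    Returns (best_epoch, best_loss), with epochs counted from 1.
    """
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        run_epoch()
        loss = validate()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss


# Toy run: 70 training items in batches of 32, and a scripted
# validation curve that bottoms out at the third epoch.
rng = random.Random(0)
batch_sizes = []


def run_epoch():
    # A real epoch would compute gradients and apply the optimizer here.
    batch_sizes[:] = [len(b) for b in minibatches(70, 32, rng)]


curve = iter([0.9, 0.7, 0.6, 0.65, 0.7, 0.72, 0.8, 0.9, 1.0, 1.1])
best_epoch, best_loss = train(run_epoch, lambda: next(curve), patience=3)
```
]]></preformat>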
      <p>This architecture is used to train a multi-label genre classifier
on each of the four datasets. Each model is trained on 80% of the
training set and validated after each epoch on the remaining 20%; the
split is produced with the script with release-group filtering provided
by the organizers. Predictions are computed for the validation and test
sets.</p>
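      <p>A group-aware 80/20 split can be sketched as follows. This is not the organizers' script. It assumes that release-group filtering means that all recordings of one release group end up on the same side of the split, and the track and release-group identifiers are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def grouped_split(track_groups, train_fraction=0.8, seed=0):
    """Split track ids so that no release group spans both subsets.

    `track_groups` maps a track id to its release-group id. The fraction
    applies to groups, so the track-level ratio is only approximate when
    groups differ in size.
    """
    groups = sorted(set(track_groups.values()))
    random.Random(seed).shuffle(groups)
    cut = int(round(train_fraction * len(groups)))
    train_groups = set(groups[:cut])
    train = [t for t, g in track_groups.items() if g in train_groups]
    valid = [t for t, g in track_groups.items() if g not in train_groups]
    return train, valid


# Hypothetical ids: 20 tracks spread evenly over 10 release groups.
track_groups = {"track%02d" % i: "rg%d" % (i // 2) for i in range(20)}
train_ids, valid_ids = grouped_split(track_groups)
```
]]></preformat>
      <p>Keeping a release group on one side avoids validating on recordings that are near-duplicates of training items.</p>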
    </sec>
    <sec id="sec-6">
      <title>Embedding fusion approach</title>
      <p>Following the described methodology, one model per dataset is
trained, and these models provide the predictions for Subtask 1. The
same models are then used as feature extractors. All four models
share the same input format, so input feature vectors from one
dataset can be used as input to a model trained on another dataset.
Thus, we feed all tracks from the training, validation, and test sets
of each dataset to each model and take the activations of the hidden
layer as a 256-dimensional feature embedding. As a result, for each
track in each dataset we obtain four different feature embeddings,
one from each of the four previously trained models.</p>
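      <p>Using the trained models as feature extractors can be sketched as follows. The weights below are hand-picked toy values standing in for the four trained hidden layers (3 inputs and 2 hidden units instead of 2669 and 256), and the function names are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def hidden_embedding(x, weights, biases):
    """ReLU hidden-layer activations used as the feature embedding."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def embed_with_all_models(x, models):
    """Map each model name to the embedding it produces for one track."""
    return {name: hidden_embedding(x, w, b)
            for name, (w, b) in models.items()}


# Toy stand-ins for the hidden layers of the four trained models.
models = {
    "allmusic": ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]),
    "discogs": ([[0.5, 0.5, 0.0], [0.0, -1.0, 0.0]], [0.0, 0.0]),
    "lastfm": ([[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]], [0.0, -1.0]),
    "tagtraum": ([[-1.0, 0.0, 0.0], [0.0, 0.0, 2.0]], [1.0, 0.0]),
}
# The same input vector is fed to every model, whichever dataset
# the track originally belongs to.
track = [2.0, 1.0, -1.0]
embeddings = embed_with_all_models(track, models)
```
]]></preformat>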
      <p>Given the four feature embeddings of each track, we apply L2
normalization to each of them and then stack them into a single
1024-dimensional feature vector. Following this process, we obtain
new feature vectors for every track in the training, validation, and
test sets of each dataset. We then use these feature vectors as input
to a simple network in which the input layer is directly connected
to the output layer. Dropout of 0.5 is applied after the input layer.
The output layer is exactly the same as in the network described
above, with sigmoid activations and binary cross-entropy loss. The new
network is trained following the same methodology as before, with Adam
as the optimizer and randomly sampled mini-batches of 32 items. The
network is trained on 80% of the training data and validated on the
remaining 20%. Following this approach, we train one network per
dataset and obtain the genre probability predictions on the validation
and test sets for Subtask 2.</p>
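      <p>The fusion step can be sketched as follows: each embedding is L2-normalized, the four results are concatenated in a fixed order, and a single dense layer with sigmoid outputs produces the genre probabilities. The embeddings, weights, and biases are toy values (2 dimensions per embedding instead of 256), and the small epsilon guarding against all-zero embeddings is an assumption of this sketch.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length (zero vectors stay zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def stack_embeddings(embeddings, order):
    """Concatenate L2-normalized embeddings in a fixed dataset order."""
    stacked = []
    for name in order:
        stacked.extend(l2_normalize(embeddings[name]))
    return stacked


def fusion_predict(x, weights, biases):
    """Single dense layer with sigmoid outputs and no hidden layer."""
    return [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
            for row, b in zip(weights, biases)]


ORDER = ["allmusic", "discogs", "lastfm", "tagtraum"]
# Toy 2-dimensional embeddings; the stacked vector has 8 dimensions
# here, versus 4 x 256 = 1024 in the text above.
embeddings = {
    "allmusic": [3.0, 4.0],
    "discogs": [0.0, 2.0],
    "lastfm": [1.0, 0.0],
    "tagtraum": [0.0, 0.0],
}
fused = stack_embeddings(embeddings, ORDER)
# Two hypothetical genre outputs with zero weights and different biases.
probs = fusion_predict(fused, [[0.0] * 8, [0.0] * 8], [0.0, 2.0])
```
]]></preformat>
      <p>Normalizing each embedding before stacking keeps any one source model from dominating the fused vector simply because its activations are larger.</p>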
    </sec>
    <sec id="sec-7">
      <title>Predictions thresholding</title>
      <p>
        The predictions made by each model are continuous values,
while the task requires binary predictions of genre labels. We
therefore apply a plug-in rule approach, thresholding the prediction
values in order to maximize the evaluation metrics. We chose to
maximize the macro F-score and applied an individual threshold for
each genre label, estimated on the validation data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
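      <p>Per-label threshold selection can be sketched as follows. Since the macro F-score is the mean of the per-label F-scores, each label's threshold can be tuned independently on the validation data. The sketch uses an exhaustive search over the observed scores, which is one simple way to do this and not necessarily the procedure used for the baseline; the validation labels and scores are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def f1_score(y_true, y_pred):
    """F1 for one binary label; 0.0 when there is nothing to count."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores):
    """Pick the score cut-off with the highest F1 for one genre label."""
    best_t, best_f = 0.5, -1.0
    # Only the observed scores can change the set of predicted positives.
    for t in sorted(set(scores)):
        f = f1_score(y_true, [s >= t for s in scores])
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Hypothetical validation labels and scores for one genre label.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.40, 0.10, 0.35, 0.30, 0.05, 0.20]
threshold, f1 = best_threshold(y_true, scores)
```
]]></preformat>
      <p>In this toy case the selected threshold is 0.20, well below 0.5, because recovering all three positives at the cost of one false positive gives the highest F1 (6/7).</p>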
    </sec>
    <sec id="sec-8">
      <title>RESULTS AND ANALYSIS</title>
      <p>
        We evaluated a single run for each of Subtask 1 and Subtask 2. Table 1
presents the ROC AUC metric on the validation sets. Table 2 presents the
final results after applying thresholding on the test datasets. As a
general pattern, we can clearly see the benefit of the models based on
the embedding fusion approach compared to the models trained
individually on each dataset. While the individual models (Subtask 1)
are hardly usable compared to the random and popularity baselines,
the combined models show a significant improvement in performance and
are competitive with last year’s second-ranked submission [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In our experiments, we focused on optimizing the macro F-score.
However, choosing this metric for threshold optimization can have a
negative effect on micro-averaged metrics. In the case of infrequent
subgenre labels and an uninformative classifier, an optimal but
undesirable strategy may involve always predicting those labels [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Indeed, this was the case for the individual models, but the hybrid
models did not have this issue.
      </p>
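      <p>The degenerate strategy can be illustrated with a small worked example. The counts below are hypothetical. For a rare label, predicting it for every track yields a small but non-zero F-score, which beats the zero obtained by never predicting it, so F-score maximization can favor the former even though almost all of those predictions are false positives.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical rare subgenre: 10 positives among 1000 tracks.
n_tracks, n_positive = 1000, 10

# Strategy A: always predict the label.
f1_always = f1_from_counts(tp=n_positive, fp=n_tracks - n_positive, fn=0)

# Strategy B: never predict the label.
f1_never = f1_from_counts(tp=0, fp=0, fn=n_positive)

# Precision of strategy A; its false positives are pooled into
# micro-averaged precision across all labels.
precision_always = n_positive / n_tracks
```
]]></preformat>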
    </sec>
    <sec id="sec-9">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>
        In our baseline approach, we focused on Subtask 2 and demonstrated
the advantage of fusing feature embeddings learned on individual
genre datasets, using a simple feedforward network architecture as an
example. We may expect further improvements in performance
from a more sophisticated network architecture (for
example [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). The code of the baseline is available online at
https://github.com/MTG/acousticbrainz-mediaeval-baselines.
      </p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research has received funding from the European Union’s
Horizon 2020 research and innovation programme under grant
agreements No 688382 (AudioCommons) and 770376-2 (TROMPA),
as well as the Ministry of Economy and Competitiveness of the
Spanish Government (Reference: TIN2015-69935-P).</p>
      <p>[Tables 1 and 2 are not reproduced here. Their entries are reported per dataset (AllMusic, Discogs, Lastfm, Tagtraum) for Subtask 1 (individual models) and Subtask 2 (fusion models).]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Alastair Porter,
          <string-name>
            <given-names>Julián</given-names>
            <surname>Urbano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hendrik</given-names>
            <surname>Schreiber</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The MediaEval 2018 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources</article-title>
          . In MediaEval 2018 Workshop. Sophia Antipolis, France.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Mayor</surname>
          </string-name>
          , G. Roma, J. Salamon,
          <string-name>
            <given-names>J.R.</given-names>
            <surname>Zapata</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Essentia: An Audio Analysis Library for Music Information Retrieval</article-title>
          .
          <source>In International Society for Music Information Retrieval (ISMIR'13) Conference</source>
          . Curitiba, Brazil,
          <fpage>493</fpage>
          -
          <lpage>498</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Koutini</surname>
          </string-name>
          , Alina Imenina, Matthias Dorfer, Alexander Rudolf Gruber, and
          <string-name>
            <given-names>Markus</given-names>
            <surname>Schedl</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>MediaEval 2017 AcousticBrainz Genre Task: Multilayer Perceptron Approach</article-title>
          . In MediaEval 2017 Workshop. Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Zachary C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Charles</given-names>
            <surname>Elkan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Balakrishnan</given-names>
            <surname>Naryanaswamy</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Optimal thresholding of classifiers to maximize F1 measure</article-title>
          .
          <source>In Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          . Springer,
          <fpage>225</fpage>
          -
          <lpage>239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Murauer</surname>
          </string-name>
          , Maximilian Mayerl, Michael Tschuggnall, Eva Zangerle,
          Martin Pichl, and Günther Specht
          .
          <year>2017</year>
          .
          <article-title>Hierarchical Multilabel Classification and Voting for Genre Classification</article-title>
          . In MediaEval 2017 Workshop. Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sergio</given-names>
            <surname>Oramas</surname>
          </string-name>
          , Francesco Barbieri, Oriol Nieto, and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Multimodal Deep Learning for Music Genre Classification</article-title>
          .
          <source>Transactions of the International Society for Music Information Retrieval 1</source>
          ,
          <issue>1</issue>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>