<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognizing Song Mood and Theme: Clustering-based Ensembles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maximilian Mayerl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Vötter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Peintner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Günther Specht</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Zangerle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universität Innsbruck</institution>
        </aff>
      </contrib-group>
      <author-notes>
        <fn fn-type="equal">
          <p>Authors contributed equally to this work.</p>
        </fn>
      </author-notes>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The Emotions and Themes in Music task at MediaEval 2021 has the goal of correctly assigning mood and theme labels to pieces of music. In this paper, we describe our (team UIBK-DBIS) approach to solving this task. Last year, we devised an ensemble-based method in which we trained multiple neural network models on different partitions of the target labels. This year, we build upon this approach and attempt to automatically generate label partitions based on clustering techniques. This approach achieves a PR-AUC of 0.109 on the test set for the task, which is slightly better than the baseline.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The goal of the Emotions and Themes in Music task at MediaEval
2021 is to detect the moods and themes present in a song based
on descriptors of the song’s audio properties. In total, there are 56
different mood and theme labels that can be assigned to a song, and
each song can have more than one label. The dataset used for this
task was created by Bogdanov et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and is publicly available.
Further details about the task itself can be found in the overview
paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our approach to this year’s edition of the task is based
on the approach we submitted last year [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The basic idea is to
train multiple models for distinct subsets of the target labels and
then combine the results. Last year, we formed the label subsets by
simply splitting the set of labels into equally-sized subsets as well
as by manually dividing the labels into mood and theme labels. This
year, we propose to use clustering techniques to generate better
label subsets, forming clusters of either similar or dissimilar labels.
For clustering similar labels, we use the popular k-means algorithm,
and for clustering dissimilar labels, we propose a simple algorithm
that can generate such clusters. The code for our implementation
is available on GitHub (https://github.com/dbis-uibk/mediaeval2021).
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>
        For MediaEval 2020 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we proposed an ensemble approach using
multiple neural network models trained for handling subsets of
the target labels. Our results for this approach showed that using
such ensemble models can improve the F1 score over using a neural
network model trained for handling all labels. Building on those
results, we make the following changes and additions for this year’s
edition of the task: (i) Instead of a CRNN architecture, which we
used last year, we use a VGG model (taken from the baseline
provided with the task dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) and a ResNet-18 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. (ii) Instead
of partitioning the target labels linearly or manually, we employ
clustering techniques (see Section 2.2) to find partitions of similar
or dissimilar labels.
      </p>
      <p>As every model in the ensemble handles a disjoint subset of target
labels, the final prediction results are obtained by concatenating
and reordering the label predictions of all models.</p>
    </sec>
    <sec id="sec-3">
      <title>Data Preprocessing</title>
      <p>
        Since the neural network models we use in our approach require
mel-spectrograms of equal length for all songs, we extract a
mel-spectrogram of length 1366 from the center of each song. This follows the
approach by Mayerl et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for the 2019 edition of the task.
      </p>
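      <p>A minimal sketch of this preprocessing step using librosa; the length of 1366 frames follows the text, while the number of mel bands and the hop length are assumptions made for illustration:</p>
      <preformat>
import librosa
import numpy as np

TARGET_FRAMES = 1366  # fixed spectrogram length used for all songs

def center_mel_spectrogram(path, n_mels=96, hop_length=512):
    """Extract a mel-spectrogram of TARGET_FRAMES frames from the song's center."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    mel = librosa.power_to_db(mel)
    start = max((mel.shape[1] - TARGET_FRAMES) // 2, 0)
    crop = mel[:, start:start + TARGET_FRAMES]
    pad = TARGET_FRAMES - crop.shape[1]
    if pad > 0:  # zero-pad songs shorter than TARGET_FRAMES frames
        crop = np.pad(crop, ((0, 0), (0, pad)))
    return crop
      </preformat>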
    </sec>
    <sec id="sec-4">
      <title>Clustering</title>
      <p>
        To generate partitions of target labels for our ensemble, we first map
labels into a space for clustering, such that each label is represented
by one vector. To find the vector for a given label, we take all songs
in the training set to which that label is assigned and compute the
centroid of the feature vectors for those songs. For this step, we
used 22 high-level features extracted with Essentia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] instead of
mel-spectrograms. We then computed label partitions by using two
different clustering techniques on the resulting vector space.
      </p>
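      <p>A sketch of this label-embedding step; the binary song-label matrix and the 22-dimensional feature matrix are assumed input shapes, not taken from our released code:</p>
      <preformat>
import numpy as np

def label_centroids(features, song_labels):
    """Represent each label by the centroid of its songs' feature vectors.

    features:    (num_songs, 22) matrix of Essentia high-level features.
    song_labels: (num_songs, num_labels) binary label-assignment matrix.
    Returns a (num_labels, 22) matrix with one centroid per label.
    """
    num_labels = song_labels.shape[1]
    centroids = np.empty((num_labels, features.shape[1]))
    for label in range(num_labels):
        members = features[song_labels[:, label] == 1]
        centroids[label] = members.mean(axis=0)
    return centroids
      </preformat>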
      <p>To find partitions such that each partition contains similar labels,
we used the well-known k-means algorithm. As k-means requires
manually setting the number of desired clusters, we used the
popular elbow method to determine the best number of clusters, which
we found to be four. To find partitions such that each partition
contains dissimilar labels, we propose a simple clustering algorithm,
which we call dk-means (dissimilar k-means). This algorithm is a
variation of the k-means algorithm and works as follows (a sketch in
code is given after the steps):
(1) Randomly choose k points (in our case, corresponding to
labels) as seeds. This gives us k clusters, each containing
one point.
(2) For each cluster:
(a) Compute the centroid of the points in the cluster.
(b) Among all the points not yet assigned to a cluster, find
the point that has the highest Euclidean distance to
this centroid. Add that point to the cluster.</p>
      <p>(3) Repeat (2) until all points are assigned.</p>
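      <p>The following is a sketch of dk-means as described above (illustrative only; variable and function names are ours):</p>
      <preformat>
import numpy as np

def dk_means(points, k, seed=None):
    """Greedy 'dissimilar k-means': grow k clusters by repeatedly adding,
    for each cluster in turn, the unassigned point farthest from the
    cluster's current centroid."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # (1) randomly choose k points as cluster seeds
    seeds = rng.choice(n, size=k, replace=False)
    clusters = [[s] for s in seeds.tolist()]
    unassigned = set(range(n)) - set(seeds.tolist())
    # (2)-(3) round-robin over the clusters until all points are assigned
    while unassigned:
        for cluster in clusters:
            if not unassigned:
                break
            centroid = points[cluster].mean(axis=0)   # (2a)
            candidates = list(unassigned)
            dists = np.linalg.norm(points[candidates] - centroid, axis=1)
            farthest = candidates[int(np.argmax(dists))]
            cluster.append(farthest)                  # (2b) add farthest point
            unassigned.remove(farthest)
    return clusters
      </preformat>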
      <p>A visualization of the clusters produced by these methods is
given in Figure 1. For this visualization, the centroids corresponding
to each label have been projected to a 2-dimensional space using
principal component analysis.</p>
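      <p>This projection can be reproduced with scikit-learn and matplotlib, e.g. (a sketch assuming the label-centroid matrix and cluster assignments from the previous steps):</p>
      <preformat>
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_label_space(centroids, cluster_ids):
    """Project label centroids to 2-D with PCA and color them by cluster."""
    coords = PCA(n_components=2).fit_transform(centroids)
    plt.scatter(coords[:, 0], coords[:, 1], c=cluster_ids, cmap="tab10")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
      </preformat>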
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
      <p>The evaluation results are given in Table 1. The table also includes
results for two baseline approaches as well as a run using linear
label splits and VGG models, which is included for comparison. The
evaluation was done using four evaluation metrics, as defined by
the task. The baseline approaches consist of a single model trained
on all target labels, i.e., they do not use an ensemble. Comparing the
results of the submitted approaches to the baselines shows that the
submitted approaches generally perform worse than or equal to
the baseline. Looking at the results for the approaches using VGG
models, we observe a clear performance improvement when using
k-means clustering compared to linear splits. Both the ROC-AUC
as well as the PR-AUC increase, from 0.684 to 0.705 and from 0.097
to 0.109 respectively, while both F1 scores stay the same. The same
is not true when using dk-means clustering, where the performance
remains almost the same compared to linear splits across all four
metrics. From this, we conclude that partitioning target labels such
that similar labels are handled by the same model in the ensemble
is beneficial and results in better performance when using VGG
models, at least for the given dataset. The approaches using
ResNet-18 show a different behavior. Here, linear splits clearly outperform
both splits using k-means and dk-means clustering. This indicates
that, with the given dataset, partitioning target labels based on
similarity or dissimilarity does not improve performance. Lastly, we
can compare approaches using k-means clustering with approaches
using dk-means clustering. Here, we can observe a decrease in
performance when using dk-means compared to k-means, for both
VGG and ResNet-18. This implies that applying models to clusters
of dissimilar labels is less beneficial than applying them to clusters
of similar labels, at least for the given dataset.</p>
    </sec>
    <sec id="sec-6">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>In this paper, we presented our approach for the Emotions and
Themes in Music task at MediaEval 2021. While our approach only
slightly outperformed the baselines, we were still able to show
potential benefits in building ensemble models based on partitions
of target labels using clustering techniques. For models using
VGG-based classifiers, we observed an increase in performance when
determining label partitions using k-means clustering.</p>
      <p>For future work, one interesting avenue would be to combine the
various approaches we have developed for this task over the past
few years. In 2019, we introduced a random sampling approach to
augment the provided dataset and generate more representative
samples for each song. Last year, we further built on this by
generating a more balanced dataset by drawing more samples for target
labels that are underrepresented. As we did not employ either of
these techniques for this year’s submissions, it would be interesting
to see what results could be accomplished by incorporating them
into the new, clustering-based approach. Comparing the
performance of our ensemble with the baselines implies that training on
disjoint subsets of labels leads to a decrease in performance. Hence,
it would be interesting to see if we can increase the performance
by using overlapping label sets in our ensemble.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Nicolas Wack, Emilia Gómez Gutiérrez, Sankalp Gulati, Perfecto Herrera Boyer, Oscar Mayor, Gerard Roma Trepat, Justin Salamon, José Ricardo Zapata González, Xavier Serra, and others.
          <year>2013</year>
          .
          <article-title>Essentia: An audio analysis library for music information retrieval</article-title>
          .
          <source>In Proceedings of the 14th Conference of the International Society for Music Information Retrieval (ISMIR)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Minz Won, Philip Tovstogan,
          <string-name>
            <given-names>Alastair</given-names>
            <surname>Porter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The MTG-Jamendo Dataset for Automatic Music Tagging</article-title>
          .
          <source>In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019)</source>
          , Long Beach, CA, United States. http://hdl.handle.net/10230/42015
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Maximilian</given-names>
            <surname>Mayerl</surname>
          </string-name>
          , Michael Vötter,
          <string-name>
            <given-names>Hsiao-Tzu</given-names>
            <surname>Hung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bo-Yu</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yi-Hsuan</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Eva</given-names>
            <surname>Zangerle</surname>
          </string-name>
          .
          <article-title>Recognizing Song Mood and Theme Using Convolutional Recurrent Neural Networks</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2019 Workshop</source>
          , Sophia Antipolis, France,
          <fpage>27</fpage>
          -
          <lpage>30</lpage>
          October
          <year>2019</year>
          . CEUR-WS.org. http://ceur-ws.org/Vol-2670/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Tovstogan</surname>
          </string-name>
          , Dmitry Bogdanov, and
          <string-name>
            <given-names>Alastair</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>MediaEval 2021: Emotion and Theme Recognition in Music Using Jamendo</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2021 Workshop</source>
          , Online,
          <fpage>13</fpage>
          -
          <lpage>15</lpage>
          December
          <year>2021</year>
          . CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Vötter</surname>
          </string-name>
          , Maximilian Mayerl, Günther Specht, and
          <string-name>
            <given-names>Eva</given-names>
            <surname>Zangerle</surname>
          </string-name>
          .
          <article-title>Recognizing Song Mood and Theme: Leveraging Ensembles of Tag Groups</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2020 Workshop</source>
          , Online,
          <fpage>14</fpage>
          -
          <lpage>15</lpage>
          December
          <year>2020</year>
          . CEUR-WS.org. http://ceur-ws.org/Vol-2882/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>