<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognizing Song Mood and Theme: Leveraging Ensembles of Tag Groups</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Vötter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian Mayerl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Günther Specht</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Zangerle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universität Innsbruck</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this year's MediaEval Emotions and Themes in Music task, the goal was to assign emotion and theme tags to songs. In this paper, we describe our (Team UIBK-DBIS) approach to solving this task. We extend the neural network model approach of our last year's submission, based on a Convolutional Recurrent Neural Network (CRNN), by building an ensemble model that utilizes spectral features. Our approaches achieve a ROC-AUC score between 0.626 and 0.707 on the provided test set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <label>1</label>
      <title>INTRODUCTION</title>
      <p>
        The Emotions and Themes in Music task of the MediaEval 2020
workshop requires detecting a song’s mood and theme based on
audio descriptors of the song. The prediction is done in a
multi-label fashion, with a total of 56 mood and theme tags available
in the dataset collected from Jamendo. The dataset as well as the
split (split-0) used for the task were created by Bogdanov et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ];
details can be found in the overview paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Our neural network ensemble approach is an extension of our
Convolutional Recurrent Neural Network (CRNN) approach [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
used to solve last year’s task. We confirm last year’s finding that
augmenting samples by choosing multiple random windows from
the provided mel-spectrograms improves the results of both CNN
and CRNN models. Further, we show that building an ensemble
model of tag groups achieves improved F1 results. The underlying
code is available on GitHub<sup>1</sup>.
      </p>
    </sec>
    <sec id="sec-2">
      <label>2</label>
      <title>APPROACH</title>
      <p>
        Our models for MediaEval 2019 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have shown varying prediction
results depending on the target tag, possibly because mood and
theme are two different concepts with some overlap or correlation.
Hence, we hypothesize that a single model trained to predict tags
for both has to learn different concepts at the same time, potentially
degrading accuracy. Inspired by ensemble models, we propose to
use multiple independent models. As it was infeasible to create an
ensemble in a fully fledged one-versus-rest fashion due to
computational constraints, we split the emotion and theme tags into tag
groups. This allows training one model per group. For the final
predictions as required for the task, these models have to be combined
into an ensemble.
      </p>
      <p><sup>1</sup>https://github.com/dbis-uibk/MediaEval2020</p>
    </sec>
    <sec id="sec-3">
      <label>2.1</label>
      <title>Data Preprocessing</title>
      <p>
        We used two different ways of pre-processing the provided
mel-spectrograms for our model, both of which are based on the
windowing approach introduced by Mayerl et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Both use a window
size of 1366. One pre-processing strategy uses the center window
approach, where one sample is taken from the center of each song.
The other pre-processing strategy is used to augment data for
underrepresented tags and uses windows taken from random
positions within the song. As the most frequent tag occurs around 14
times more often than the least frequent ones, we first extracted 14
random windows per song. Based on the tag counts in the dataset,
we then assigned each song to one of four different categories.
To assign each song to a category, we defined decision boundaries
at 1/2, 1/3, and 1/4 of the count of the most frequently occurring tag. The
first group contains songs whose tag with the highest overall count
(among the tags assigned to the song) has a count ≥ 1/2
of the maximum value, the second group contains the remaining songs
with a tag count ≥ 1/3 of the maximum value, and so on. For these groups,
we kept 1, 2, 3, or 4 randomly selected samples, respectively, of each
song contained in the train or validation set. This procedure results
in a train and validation set with a better balance between tags. In
the following, the strategy using center windows is referred to as
raw, and the other strategy is referred to as augmented.
      </p>
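      <p>The category assignment and the two windowing strategies described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the submission: all function names, the use of frame indices as window positions, and the example tag counts are hypothetical and were chosen only to exercise the four groups.</p>
      <preformat>
```python
import random

WINDOW = 1366          # window size in frames, as stated in the text
N_CANDIDATES = 14      # random windows extracted per song before selection
BOUNDARIES = (1 / 2, 1 / 3, 1 / 4)  # fractions of the highest tag count


def windows_to_keep(song_tags, tag_counts):
    """Return how many random windows (1 to 4) are kept for one song."""
    max_count = max(tag_counts.values())
    # A song is judged by its tag with the highest overall count.
    top_count = max(tag_counts[tag] for tag in song_tags)
    for group, boundary in enumerate(BOUNDARIES, start=1):
        if top_count >= boundary * max_count:
            return group
    return len(BOUNDARIES) + 1


def center_window_start(n_frames, window=WINDOW):
    """Start frame of the single center window (raw strategy)."""
    return max((n_frames - window) // 2, 0)


def augmented_window_starts(n_frames, keep, rng, window=WINDOW):
    """Draw 14 random start frames and keep `keep` of them (augmented)."""
    last_start = max(n_frames - window, 0)
    candidates = [rng.randint(0, last_start) for _ in range(N_CANDIDATES)]
    return rng.sample(candidates, keep)


# Hypothetical tag counts, chosen only to exercise all four groups.
tag_counts = {"happy": 1400, "sad": 600, "epic": 400, "rare": 100}
kept = {tag: windows_to_keep([tag], tag_counts) for tag in tag_counts}
```
      </preformat>
      <p>With these hypothetical counts, a song tagged only with the most frequent tag keeps one window, while a song tagged only with the rarest tag keeps four, which is what shifts the balance toward underrepresented tags.</p>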
    </sec>
    <sec id="sec-4">
      <label>2.2</label>
      <title>Ensemble Model</title>
      <p>
        The motivation for building an ensemble model stems from the
fact that the given tags cover two partly overlapping concepts,
namely mood (emotions) and theme (topic). Further, we have seen
substantial differences when comparing the per-tag scores of our
last year’s submissions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. To build an ensemble, we split the given
tags into different groups and train one base model (CNN or CRNN)
per group. We propose the following three splitting strategies:
        <list list-type="bullet">
          <list-item><p>linear: Splits the tags into two equally sized consecutive groups based on lexicographical order.</p></list-item>
          <list-item><p>performance: Splits the tags into two equally sized groups ordered by the best scores (F1 and PR-AUC).</p></list-item>
          <list-item><p>manually: Uses two or three tag groups that were manually assigned. The split into the two groups mood and theme was determined by majority vote among four different human judges. As a tie breaker (necessary for the tags love and upbeat), we used a coin flip. The three-group split was created by assigning all tags with a kappa score of one to the respective mood or theme group and all others to an uncertain group (for the exact groups, see the README on GitHub).</p></list-item>
        </list>
      </p>
      <p>Each model hence predicts a disjoint subset of target tags based
on the given mel-spectrogram input. To obtain the final predictions,
we merge those predictions to get the overall ensemble prediction.
      </p>
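      <p>As an illustration of the linear splitting strategy and of merging disjoint per-group predictions, consider the following sketch. The function names and the four example tags are hypothetical, and the representation of each base model's output as a mapping from tag to probability is an assumption made for readability; the actual implementation may differ.</p>
      <preformat>
```python
def linear_split(tags, n_groups=2):
    """Split lexicographically sorted tags into consecutive groups."""
    ordered = sorted(tags)
    size = -(-len(ordered) // n_groups)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def merge_predictions(group_predictions, all_tags):
    """Merge disjoint per-group predictions into one vector.

    `group_predictions` holds one dict (tag to probability) per base
    model; the result follows the order given by `all_tags`.
    """
    merged = {}
    for prediction in group_predictions:
        overlap = set(merged).intersection(prediction)
        if overlap:
            raise ValueError(f"tag groups are not disjoint: {sorted(overlap)}")
        merged.update(prediction)
    return [merged[tag] for tag in all_tags]


# Hypothetical example with four tags instead of the 56 used in the task.
all_tags = ["sad", "happy", "epic", "love"]
groups = linear_split(all_tags)
ensemble_output = merge_predictions(
    [{"epic": 0.1, "happy": 0.9}, {"love": 0.4, "sad": 0.2}],
    all_tags,
)
```
      </preformat>
      <p>Because the groups are disjoint, merging reduces to collecting each tag's prediction from the one model responsible for it; no averaging or voting is involved in this sketch.</p>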
    </sec>
    <sec id="sec-5">
      <label>2.3</label>
      <title>CNN Model</title>
      <p>
        We use a CNN model as originally introduced by LeCun et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Our CNN model uses the mel-spectrogram as input. Similar to
the CRNN model of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we use a padding layer right after the
input layer with a width of 1440, where the input width of 1366
is zero-padded left and right with the same size. This padding
layer is followed by five blocks, each containing two successive
2D-convolution layers with ELU activation, followed by a max-pooling
layer and a dropout layer. The convolution layers use a kernel size
of 3x3, while the max-pooling layer uses a pooling size of 2x2. The
dropout rate is 0.1, and the number of filters per block is 32, 32,
64, 64, and 64, respectively. After that, we use a dense layer with a
width of 256 and ELU activation, followed by a dropout layer with a
dropout rate of 0.2 and another dense layer using sigmoid
activation and a width of 56 to fit the expected output shape. We
train our model using the RMSprop optimizer with categorical
cross-entropy loss. This results in a network that is able to predict
probabilities per tag. Binary tags are predicted by setting a threshold
per tag, which is optimized by finding the “elbow” in the ROC curve
computed on the validation set, as already done in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
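      <p>The widths involved in the padding and pooling steps can be checked with a few lines of arithmetic. The sketch assumes that the convolution layers preserve the width and that each 2x2 pooling halves it with rounding down; neither detail is stated in the text, so the numbers below hold only under these assumptions.</p>
      <preformat>
```python
INPUT_WIDTH = 1366    # window size of the mel-spectrogram input
PADDED_WIDTH = 1440   # width after the zero-padding layer
N_BLOCKS = 5          # convolution blocks, each ending in 2x2 max pooling
POOL = 2

# The padding is symmetric, so each side receives half of the difference.
pad_each_side = (PADDED_WIDTH - INPUT_WIDTH) // 2


def widths_through_blocks(width, n_blocks=N_BLOCKS, pool=POOL):
    """Width after each block, assuming only pooling changes the width."""
    widths = [width]
    for _ in range(n_blocks):
        widths.append(widths[-1] // pool)
    return widths


padded_widths = widths_through_blocks(PADDED_WIDTH)
unpadded_widths = widths_through_blocks(INPUT_WIDTH)
```
      </preformat>
      <p>Under these assumptions, each side is padded with 37 zero columns and the padded input shrinks as 1440, 720, 360, 180, 90, 45 without any rounding, since 1440 is divisible by 32. The unpadded width of 1366 would instead lose columns to rounding at several pooling steps and end at 42.</p>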
    </sec>
    <sec id="sec-6">
      <label>2.4</label>
      <title>CRNN Model</title>
      <p>
        This model is a slight adaptation of the model used in last year’s
submission of Mayerl et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The model consists of
convolutional layers followed by recurrent and dense layers, based on the
architecture introduced by Choi et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In total, there are four
convolutional blocks, each consisting of a 2D-convolution layer,
followed by a batch normalization layer, ELU activation, a max-pooling
layer, and a dropout layer. After these blocks, two GRU
layers are used, followed by a dropout layer with a dropout rate
of 0.3 and a dense layer with a size of 56 to fit the required output
shape. We train this model using the Adam [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] optimizer with a
categorical cross-entropy loss, in contrast to the binary cross-entropy
used last year. Again, the threshold used for binary tag predictions
is set using the ROC curve method described in Section 2.3.
      </p>
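      <p>The per-tag threshold selection can be sketched as follows. The text does not define the “elbow” of the ROC curve precisely; this sketch assumes it is the ROC point closest to the top-left corner (false positive rate 0, true positive rate 1), which is one common reading but not necessarily the one used for the submission. The function names and the validation data are hypothetical.</p>
      <preformat>
```python
import math


def roc_points(y_true, scores):
    """Return (threshold, fpr, tpr) per distinct score, highest first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are needed to compute a ROC curve")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for y, p in zip(y_true, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(y_true, predicted) if p and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def elbow_threshold(y_true, scores):
    """Threshold whose ROC point is closest to the corner (0, 1)."""
    points = roc_points(y_true, scores)
    best = min(points, key=lambda p: math.hypot(p[1], 1.0 - p[2]))
    return best[0]


def binarize(probabilities, thresholds):
    """Apply one threshold per tag to a vector of predicted probabilities."""
    return [int(p >= t) for p, t in zip(probabilities, thresholds)]


# Hypothetical validation labels and scores for a single tag.
threshold = elbow_threshold([0, 0, 1, 1, 1], [0.1, 0.4, 0.35, 0.8, 0.7])
```
      </preformat>
      <p>In a multi-label setting, this selection would be repeated once per tag on the validation set, and the resulting 56 thresholds would then be applied to the predicted probabilities as in the binarize helper.</p>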
    </sec>
    <sec id="sec-7">
      <label>2.5</label>
      <title>Submissions</title>
      <p>Based on the setup described above, we submitted the following
five runs, where each model is trained for 20 epochs:</p>
      <list list-type="bullet">
        <list-item><p>Run #1: CRNN ensemble model using the manual mood and theme split, trained on augmented data.</p></list-item>
        <list-item><p>Run #2: CRNN ensemble model using the manual mood and theme split, trained on raw data (center windowing).</p></list-item>
        <list-item><p>Run #3: CRNN ensemble model using the manual mood, theme, and uncertain split, trained on augmented data.</p></list-item>
        <list-item><p>Run #4: CRNN ensemble model using two splits based on the F1-score, trained on augmented data.</p></list-item>
        <list-item><p>Run #5: CRNN model trained on augmented data.</p></list-item>
      </list>
    </sec>
    <sec id="sec-8">
      <label>3</label>
      <title>RESULTS AND ANALYSIS</title>
      <p>In addition to the results for the submitted runs described in
Section 2.5, we include further results in Table 1. All models contained
in Table 1 are trained for 20 epochs. The name of each approach
encodes the dataset and model used. The first letter specifies the
dataset: a stands for the augmented dataset (Section 2.1), while
r means that the raw dataset without augmentation was used. This
is followed by the type of model, CNN (Section 2.3) or CRNN
(Section 2.4). If the model name is prefixed with e, the
given model was used in an ensemble (Section 2.2). For
ensemble models, the remainder of the approach name specifies the
splitting strategy (Section 2.2) and the number of splits (last token of
the approach name).</p>
      <p>
        From Table 1, we can see that using the augmented dataset leads
to slightly improved results over the raw dataset. Hence, we were
able to reproduce last year’s findings of Mayerl et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Further,
we see that the CRNN model outperforms the CNN model across all
setups. Additionally, the results show that for most of the metrics,
the ensemble method using two or three manual splits of the tags
outperforms splits based on performance or a simple linear split.
The submitted CRNN model trained on the augmented dataset with
tags manually split into mood and theme shows the best results for
F1. Additional analyses showed that this performance gain stems
from a better precision score, as this approach also has the best
precision score (not shown in the table) as well as a good recall
score. In contrast, the CRNN model (without using an ensemble
strategy) shows the best results for ROC-AUC (on the raw dataset)
and PR-AUC (on the augmented dataset). Further analysis is needed
to find the cause of that difference.
      </p>
    </sec>
    <sec id="sec-9">
      <label>4</label>
      <title>SUMMARY AND OUTLOOK</title>
      <p>Our proposed approach of using an ensemble design is inspired
by the varying performance of last year’s models when looking at
the per-tag scores. We showed that for traditional scoring metrics
such as F1, the ensemble models are able to outperform a plain
CRNN approach. Nevertheless, we could not outperform the
VGGish baseline. Potential future work includes changing the models
contained in the ensemble. For example, using different types of
models for different groups of tags would be an option. Further, the
ensemble may be extended by training multiple models per group
or even multiple models on different groups with, e.g., a majority
vote for the final predictions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Alastair Porter,
          <string-name>
            <given-names>Philip</given-names>
            <surname>Tovstogan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Minz</given-names>
            <surname>Won</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>MediaEval 2020: Emotion and Theme Recognition in Music Using Jamendo</article-title>
          .
          <source>In Proc. of the MediaEval 2020 Workshop</source>
          . Online, 14-15 December 2020.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Minz Won, Philip Tovstogan,
          <string-name>
            <given-names>Alastair</given-names>
            <surname>Porter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The MTG-Jamendo Dataset for Automatic Music Tagging</article-title>
          .
          <source>In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019)</source>
          . Long Beach, CA, United States.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Keunwoo</given-names>
            <surname>Choi</surname>
          </string-name>
          , György Fazekas,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sandler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Convolutional recurrent neural networks for music classification</article-title>
          .
          <source>In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . IEEE,
          <fpage>2392</fpage>
          -
          <lpage>2396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Diederik</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>International Conference on Learning Representations</source>
          (12
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Yann</given-names>
            <surname>LeCun</surname>
          </string-name>
          , Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and
          <string-name>
            <given-names>Lawrence D</given-names>
            <surname>Jackel</surname>
          </string-name>
          .
          <year>1989</year>
          .
          <article-title>Backpropagation applied to handwritten zip code recognition</article-title>
          .
          <source>Neural computation 1</source>
          ,
          <issue>4</issue>
          (
          <year>1989</year>
          ),
          <fpage>541</fpage>
          -
          <lpage>551</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Maximilian</given-names>
            <surname>Mayerl</surname>
          </string-name>
          , Michael Vötter,
          <string-name>
            <given-names>Hsiao-Tzu</given-names>
            <surname>Hung</surname>
          </string-name>
          , Boyu Chen, Yi-Hsuan Yang, and Eva Zangerle
          .
          <year>2019</year>
          .
          <article-title>Recognizing Song Mood and Theme Using Convolutional Recurrent Neural Networks</article-title>
          .
          <source>In Proc. of the MediaEval 2019 Workshop</source>
          . Sophia Antipolis, France, 27-30 October 2019.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>