<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MediaEval 2019 Emotion and Theme Recognition task: A VQ-VAE Based Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hsiao-Tzu Hung</string-name>
          <email>fbiannahung@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu-Hua Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian Mayerl</string-name>
          <email>maximilian.mayerl@uibk.ac.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Vötter</string-name>
          <email>michael.voetter@uibk.ac.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Zangerle</string-name>
          <email>eva.zangerle@uibk.ac.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi-Hsuan Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Center for IT Innovation</institution>
          ,
          <addr-line>Academia Sinica</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Taiwan AI Labs</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universität Innsbruck</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>In this paper, we, the Taiinn (Taiwan) team, use a pre-trained VQ-VAE as a feature extractor and compare two types of classifier for audio-based emotion and theme recognition. The VQ-VAE is pre-trained on the Million Song Dataset (MSD). We obtained a better ROC-AUC by fixing the pre-trained parameters of the VQ-VAE while training the classifier. In addition, an embedding with a larger shape works better than its one-dimensional counterpart. The code and submitted models can be found at: https://github.com/annahung31/moodtheme-tagging.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>This paper describes our submission to the MediaEval 2019 Emotion
and Theme Recognition task [2]. The goal is to automatically assign
emotion and theme tags to audio clips, using a data collection
from Jamendo, a platform for copyright-free music. The task can be
considered a multi-label music auto-tagging problem [6].</p>
      <p>Lately, the vector-quantized variational auto-encoder (VQ-VAE) [8]
has been shown to be effective for image and audio generation. It learns
a quantized representation of its input in an unsupervised way.
This motivates us to study the use of VQ-VAE for classification
problems such as the one involved in the MediaEval 2019 Emotion
and Theme task. While our work remains preliminary, to our knowledge no
previous work has used VQ-VAE for auto-tagging problems.</p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH: Third-party dataset</title>
      <p>Besides the Jamendo dataset prepared by the task organizers, we
also use the Million Song Dataset (MSD) [1] and the MagnaTagATune
(MTAT) dataset [4] in our work. The number of samples in the two
datasets can be found in Table 1. We use MSD only for pre-training
the VQ-VAE model, so we split the dataset only into training and
validation sets. As for MTAT, we use it as a second test set (in
addition to Jamendo) for testing VQ-VAE, and hence we split it into
training, validation, and test sets. We consider only the top-50 tags
(mostly genre and instrument tags [3]) for MTAT.</p>
      <p>†The two authors contributed equally to this work.</p>
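      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Number of samples in the two third-party datasets.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Dataset</th><th>Train</th><th>Validation</th><th>Test</th></tr>
          </thead>
          <tbody>
            <tr><td>MSD [1]</td><td /><td /><td>0</td></tr>
            <tr><td>MTAT [4]</td><td /><td /><td>2,651</td></tr>
          </tbody>
        </table>
      </table-wrap>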
    </sec>
    <sec id="sec-3">
      <title>Input feature</title>
      <p>We use librosa [5] to extract 128-dimensional log-mel spectrograms
from the audio files. The sampling rate is set to 22,050 Hz, and
only the first 1,024 frames of each clip are used, leading to a fixed-size
matrix of 128 × 1024 per clip.</p>
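      <p>As a rough check on how much audio this fixed-size input covers, the following sketch converts 1,024 frames into seconds. The hop length of 512 samples is an assumption (it is the librosa default for mel spectrograms); the text above does not state the value that was used.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Sketch: how many seconds of audio do 1,024 mel-spectrogram frames cover?
SAMPLE_RATE = 22050   # Hz, as stated in the text
N_MELS = 128          # mel bands, as stated in the text
N_FRAMES = 1024       # frames kept per clip, as stated in the text
HOP_LENGTH = 512      # ASSUMPTION: librosa default, not given in the paper


def clip_duration_seconds(n_frames, hop_length, sample_rate):
    """Approximate duration covered by n_frames hops of hop_length samples."""
    return n_frames * hop_length / sample_rate


input_shape = (N_MELS, N_FRAMES)
duration = clip_duration_seconds(N_FRAMES, HOP_LENGTH, SAMPLE_RATE)
print(input_shape, round(duration, 2))
```
]]></preformat>
      <p>Under this assumed hop length, each clip contributes a little under 24 seconds of audio; a different hop length would scale the duration proportionally.</p>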
    </sec>
    <sec id="sec-4">
      <title>Neural networks</title>
      <p>
        2.3.1 VQ-VAE as feature extractor. We use VQ-VAE as a feature
extractor to get a discrete embedding from mel-spectrograms. The
VQ-VAE consists of an encoder and a decoder. The encoder
contains 5 convolutional layers, followed by two residual 3 × 3 blocks,
all having 256 feature maps. The kernel size and the stride of the first
4 layers are (4,3) and (2,1), and those of the fifth layer are (5,4) and (1,2). The
padding of the five layers is (1,2), (1,4), (1,8), (1,16), and (0,1), respectively. The dilation
is the same as the padding. As a result, the encoder generates an
embedding of shape 256 × 4 × 512. The decoder consists of two
residual 3 × 3 blocks, followed by 5 transposed convolutional layers.
The kernel size, stride, and padding of the first layer are (4,4), (1,2), and
(0,1), and those of the second layer are (4,3), (2,1), and (0,1). For the remaining
three layers, the kernel size, stride, and padding are (4,3), (2,1), and (1,1).
At the end of the decoder, a tanh activation function is used.
We call this the Type-1 VQ-VAE.
      </p>
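      <p>To illustrate what a discrete embedding means here, the sketch below shows the nearest-neighbour codebook lookup at the core of a VQ-VAE [8]: each encoder output vector is replaced by the index of its closest codeword. The codebook, its size, and the 2-dimensional vectors are hypothetical values chosen for readability; the encoder described above emits 256-dimensional vectors, and the codebook size used in the paper is not stated.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def quantize(vector, codebook):
    """Return (index, codeword) of the nearest codebook entry (squared L2)."""
    best_index = min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vector, codebook[k])),
    )
    return best_index, codebook[best_index]


# Hypothetical 2-dimensional codebook with three entries (illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
# Hypothetical encoder outputs, one vector per position in the embedding.
encoder_outputs = [[0.9, 1.2], [-0.8, 0.4], [0.1, -0.2]]
indices = [quantize(z, codebook)[0] for z in encoder_outputs]
print(indices)
```
]]></preformat>
      <p>The list of indices is the discrete representation; the classifier downstream would consume the corresponding codewords.</p>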
      <p>
        To observe how the dimension of the embedding affects
tagging performance, we implement an alternative that uses
an (8,4) kernel for the fifth layer of the encoder, making the shape
of the embedding 256 × 1 × 512. We may view it as a sequence of
256-dimensional feature vectors. We call this one the Type-2 VQ-VAE.
      </p>
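      <p>The embedding shapes quoted above can be checked with the standard output-size arithmetic for convolutions and transposed convolutions. The sketch below traces the (frequency, time) size of a 128 × 1024 input through the encoder for both VQ-VAE types and through the Type-1 decoder. Two readings are assumptions: the fifth encoder layer is given a dilation of 1 (a dilation of 0 would not be meaningful), and the padding of the second decoder layer is read as (0,1). The helper names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Sketch: trace feature-map sizes; tuples are (frequency, time).

def conv_out(size, kernel, stride, padding, dilation=1):
    """Output length of a strided, dilated convolution along one axis."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def deconv_out(size, kernel, stride, padding):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


def encoder_shape(fifth_kernel, shape=(128, 1024)):
    f, t = shape
    # First four layers: kernel (4,3), stride (2,1), padding = dilation = (1, d).
    for d in (2, 4, 8, 16):
        f = conv_out(f, kernel=4, stride=2, padding=1, dilation=1)
        t = conv_out(t, kernel=3, stride=1, padding=d, dilation=d)
    # Fifth layer: stride (1,2), padding (0,1).
    # ASSUMPTION: dilation 1 on both axes for this layer.
    kf, kt = fifth_kernel
    f = conv_out(f, kernel=kf, stride=1, padding=0)
    t = conv_out(t, kernel=kt, stride=2, padding=1)
    return f, t


def type1_decoder_shape(shape):
    f, t = shape
    layers = [((4, 4), (1, 2), (0, 1)),       # first layer
              ((4, 3), (2, 1), (0, 1))]       # second layer (padding assumed)
    layers += [((4, 3), (2, 1), (1, 1))] * 3  # remaining three layers
    for (kf, kt), (sf, st), (pf, pt) in layers:
        f = deconv_out(f, kf, sf, pf)
        t = deconv_out(t, kt, st, pt)
    return f, t


type1 = encoder_shape(fifth_kernel=(5, 4))
type2 = encoder_shape(fifth_kernel=(8, 4))
print(type1, type2, type1_decoder_shape(type1))
```
]]></preformat>
      <p>Under these assumptions the encoder yields 4 × 512 (Type-1) and 1 × 512 (Type-2) maps per channel, and the Type-1 decoder restores the 128 × 1024 input size, which agrees with the shapes given in the text.</p>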
      <p>2.3.2 Classifiers. We use two kinds of classifier. The
first one is a GRU classifier with 2 bi-directional gated recurrent
units (GRUs). After the first GRU, layer normalization is applied.
The output hidden states of the second GRU then go through a
fully-connected layer and a sigmoid activation layer to produce the predictions.
The second one is a CNN (convolutional neural network) classifier.
The model structure of the CNN classifier is basically the same as
that proposed in [7], with the number of channels halved.</p>
      <p>[Figure 1: The two-step training procedure. Step 1: Encoder, Embedding Space, Decoder. Step 2: Encoder, followed by a CNN classifier or a GRU classifier.]</p>
      <p>The training procedure, as depicted in Figure 1, is composed of two
steps. In step-1, we pre-train the VQ-VAE on MSD by minimizing the
reconstruction error. In step-2, we cascade the encoder of the VQ-VAE
trained in step-1 with a classifier (a GRU-based or a CNN-based
one), and train the network with a binary cross-entropy loss for genre,
mood, or theme recognition (depending on the dataset). During
training, we set the batch size to 12 and the learning rate to
2e-4. The Adam optimizer is used to train the models. The networks
are trained for a maximum of 100 epochs with early stopping.</p>
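      <p>For reference, the binary cross-entropy loss used in step-2 treats every tag as an independent yes/no decision, which is what makes it suitable for multi-label tagging. The following is a minimal plain-Python sketch of the loss for one clip; the predicted probabilities and labels are hypothetical, and an actual training loop would use the loss function of a deep learning framework instead.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy over all tags of one clip."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Hypothetical sigmoid outputs for four tags and their 0/1 ground truth.
predicted = [0.9, 0.2, 0.6, 0.1]
ground_truth = [1, 0, 1, 0]
loss = binary_cross_entropy(predicted, ground_truth)
print(round(loss, 4))
```
]]></preformat>
      <p>The loss shrinks toward zero as each predicted probability approaches its label, independently of the other tags.</p>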
    </sec>
    <sec id="sec-5">
      <title>Methods</title>
      <sec id="sec-5-1">
        <title>We submit the following five runs:</title>
        <list list-type="bullet">
          <list-item><p>Run-1: type-1 VQ-VAE + GRU; updating both VQ-VAE and GRU during step-2 training.</p></list-item>
          <list-item><p>Run-2: type-1 VQ-VAE + GRU; fixing VQ-VAE and updating only the GRU during step-2 training.</p></list-item>
          <list-item><p>Run-3: type-1 VQ-VAE + CNN; updating both VQ-VAE and CNN during step-2 training.</p></list-item>
          <list-item><p>Run-4: type-1 VQ-VAE + CNN; fixing VQ-VAE and updating only the CNN during step-2 training.</p></list-item>
          <list-item><p>Run-5: type-2 VQ-VAE + GRU; updating both VQ-VAE and GRU during step-2 training.</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND ANALYSIS: Auto-tagging on MTAT</title>
      <p>To verify the effectiveness of the VQ-VAE based classification method,
we first evaluate the Run-1 method on MTAT for auto-tagging.
Specifically, in step-2 training, we update the type-1 VQ-VAE
(pre-trained on MSD) along with the GRU classifier on MTAT and
observe the tagging performance. The model attains a
ROC-AUC of 0.90 when predicting the top-50 tags, which is close to the
performance of state-of-the-art models [6].</p>
      <p>Mood &amp; theme classification on Jamendo.
The results on the Jamendo dataset are shown in Table 2. We can
see that, in terms of ROC-AUC, Run-2 outperforms Run-1, and
Run-4 outperforms Run-3. This may indicate that it is better to
fix the VQ-VAE when training the classifiers. We can also see that
the CNN classifier seems to perform slightly better than the GRU
classifier, and that the type-1 VQ-VAE works better than the
type-2 counterpart. The best ROC-AUC, 0.7207, is obtained by Run-4.
Yet, it is worse than that of VGG-ish, which represents a strong baseline.</p>
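      <p>For readers unfamiliar with the metric, ROC-AUC for one tag is the probability that a randomly chosen positive clip receives a higher score than a randomly chosen negative clip. The sketch below computes it by direct pairwise comparison and averages over tags. Macro-averaging over tags is an assumption here, and the scores and labels are hypothetical; they are not results from the paper.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def roc_auc(scores, labels):
    """ROC-AUC of one tag via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_roc_auc(score_matrix, label_matrix):
    """Average the per-tag ROC-AUC; rows are clips, columns are tags."""
    n_tags = len(score_matrix[0])
    per_tag = []
    for j in range(n_tags):
        col_scores = [row[j] for row in score_matrix]
        col_labels = [row[j] for row in label_matrix]
        per_tag.append(roc_auc(col_scores, col_labels))
    return sum(per_tag) / n_tags


# Hypothetical predictions for four clips and two tags.
scores = [[0.9, 0.3], [0.8, 0.6], [0.4, 0.7], [0.2, 0.1]]
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
auc = macro_roc_auc(scores, labels)
print(auc)
```
]]></preformat>
      <p>A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for reading the figures reported above.</p>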
    </sec>
    <sec id="sec-7">
      <title>SUMMARY AND OUTLOOK</title>
      <p>In this paper, we have reported a preliminary attempt to use a
pre-trained VQ-VAE model for music auto-tagging problems. From
the evaluation results, it seems that either the approach is not that
promising for discriminative tasks, or that we have not fully
capitalized on its potential. We would like to further develop this approach
in the near future, for both discriminative and generative problems
in music (e.g., to generate music in the audio domain).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Thierry</given-names>
            <surname>Bertin-Mahieux</surname>
          </string-name>
          , Daniel P. W. Ellis, Brian Whitman, and Paul Lamere.
          <year>2011</year>
          .
          <article-title>The million song dataset</article-title>
          .
          <source>In Proc. International Society for Music Information Retrieval Conference (ISMIR).</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          , Alastair Porter,
          <string-name>
            <given-names>Philip</given-names>
            <surname>Tovstogan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Minz</given-names>
            <surname>Won</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>MediaEval 2019: Emotion and theme recognition in music using Jamendo</article-title>
          .
          <source>In MediaEval 2019 Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Keunwoo</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>List of automatic music tagging research articles that are evaluated against MagnaTagATune Dataset</article-title>
          . https://github.com/keunwoochoi/magnatagatune-list. (
          <year>2017</year>
          ).
          <source>Online; accessed 29 September</source>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Edith</given-names>
            <surname>Law</surname>
          </string-name>
          , Kris West,
          <string-name>
            <given-names>Michael I.</given-names>
            <surname>Mandel</surname>
          </string-name>
          , Mert Bay, and
          <string-name>
            <given-names>J. Stephen</given-names>
            <surname>Downie</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Evaluation of algorithms using games: The case of music tagging</article-title>
          .
          <source>In Proc. International Society for Music Information Retrieval Conference (ISMIR).</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Brian</given-names>
            <surname>McFee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Colin</given-names>
            <surname>Raffel</surname>
          </string-name>
          , Dawen Liang, Daniel P. W. Ellis,
          <string-name>
            <given-names>Matt</given-names>
            <surname>McVicar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eric</given-names>
            <surname>Battenberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Nieto</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>librosa: Audio and music signal analysis in Python</article-title>
          .
          <source>In Proc. Python in Science Conf</source>
          .
          <fpage>18</fpage>
          -
          <lpage>25</lpage>
          . [Online] https://librosa.github.io/librosa/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Juhan</given-names>
            <surname>Nam</surname>
          </string-name>
          , Keunwoo Choi,
          <string-name>
            <given-names>Jongpil</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Szu-Yu</given-names>
            <surname>Chou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yi-Hsuan</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from Bach</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          <volume>36</volume>
          ,
          <issue>1</issue>
          (
          <year>2019</year>
          ),
          <fpage>41</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Pons</surname>
          </string-name>
          , Oriol Nieto, Matthew Prockup,
          <string-name>
            <given-names>Erik M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andreas F.</given-names>
            <surname>Ehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>End-to-end learning for music audio tagging at scale</article-title>
          .
          <source>In Proc. International Society for Music Information Retrieval Conference (ISMIR).</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Aaron</given-names>
            <surname>van den Oord</surname>
          </string-name>
          , Oriol Vinyals, and
          <string-name>
            <given-names>Koray</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Neural discrete representation learning</article-title>
          .
          <source>In Proc. Conference on Neural Information Processing Systems (NIPS).</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>