<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">MediaEval 2019 Emotion and Theme Recognition task: A VQ-VAE Based Approach</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hsiao-Tzu</forename><surname>Hung</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Taiwan AI Labs</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yu-Hua</forename><surname>Chen</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Taiwan AI Labs</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maximilian</forename><surname>Mayerl</surname></persName>
							<email>maximilian.mayerl@uibk.ac.at</email>
							<affiliation key="aff2">
								<orgName type="institution">Universität Innsbruck</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Michael</forename><surname>Vötter</surname></persName>
							<email>michael.voetter@uibk.ac.at</email>
							<affiliation key="aff2">
								<orgName type="institution">Universität Innsbruck</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Eva</forename><surname>Zangerle</surname></persName>
							<email>eva.zangerle@uibk.ac.at</email>
							<affiliation key="aff2">
								<orgName type="institution">Universität Innsbruck</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yi-Hsuan</forename><surname>Yang</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Taiwan AI Labs</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Research Center for IT Innovation</orgName>
								<orgName type="institution">Academia Sinica</orgName>
								<address>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">MediaEval 2019 Emotion and Theme Recognition task: A VQ-VAE Based Approach</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">6A9E44D21947CD797BEC65437833A9E4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we, the Taiinn (Taiwan) team, use a pre-trained VQ-VAE as a feature extractor and compare two types of classifiers for audio-based emotion and theme recognition. The VQ-VAE is pre-trained on the Million Song Dataset (MSD). We obtain a better ROC-AUC by fixing the pre-trained parameters of the VQ-VAE while training the classifier. In addition, an embedding with a larger shape (256 × 4 × 512) works better than its one-dimensional counterpart (256 × 1 × 512). The code and submitted models can be found at: https://github.com/annahung31/moodtheme-tagging.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>This paper describes our submission to the MediaEval 2019 Emotion and Theme Recognition task <ref type="bibr" target="#b1">[2]</ref>. The goal is to automatically assign emotion and theme tags to audio clips, using a data collection from Jamendo, a platform for copyright-free music. The task can be considered a multi-label music auto-tagging problem <ref type="bibr" target="#b5">[6]</ref>.</p><p>Lately, the vector-quantized variational auto-encoder (VQ-VAE) <ref type="bibr" target="#b7">[8]</ref> has been shown to be effective for image and audio generation. It learns a quantized representation of its input in an unsupervised way. This motivates us to study the use of VQ-VAE for classification problems such as the one involved in the MediaEval 2019 Emotion and Theme Recognition task. While our work remains preliminary, to our knowledge no previous work has used VQ-VAE for auto-tagging problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">APPROACH</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Third-party dataset</head><p>Besides the Jamendo dataset prepared by the task organizers, we also use the Million Song Dataset (MSD) <ref type="bibr" target="#b0">[1]</ref> and the MagnaTagATune (MTAT) dataset <ref type="bibr" target="#b3">[4]</ref> in our work. The number of samples in the two datasets can be found in Table <ref type="table" target="#tab_0">1</ref>. We use MSD only for pre-training the VQ-VAE model, so we split that dataset into training and validation sets only. As for MTAT, we use it as a second test set (in addition to Jamendo) for testing VQ-VAE, and hence we split it into training, validation, and test sets. We only consider the top-50 tags (mostly genre and instrument tags <ref type="bibr" target="#b2">[3]</ref>) for MTAT.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Input feature</head><p>We use librosa <ref type="bibr" target="#b4">[5]</ref> to extract 128-dimensional log-mel spectrograms from the audio files. The sampling rate is set to 22,050 Hz, and only the first 1,024 frames of each clip are taken, leading to a fixed-size matrix of 128 × 1024 per clip.</p></div>
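<div xmlns="http://www.tei-c.org/ns/1.0"><p>The feature extraction described above could be sketched roughly as follows. This is an illustrative sketch and not the authors' code: the helper names crop_or_pad and log_mel are hypothetical, the mel-spectrogram settings other than the sampling rate and the number of mel bands are left at librosa's defaults, and right-padding clips shorter than 1,024 frames is an assumption, since the text only states that the first 1,024 frames are taken.</p>

```python
def crop_or_pad(rows, n_frames=1024, pad_value=0.0):
    """Keep the first n_frames columns of every mel band.

    Right-padding short clips with pad_value is an assumption; the text
    only states that the first 1,024 frames are taken.
    """
    fixed = []
    for row in rows:
        kept = list(row[:n_frames])
        # A negative repeat count yields an empty list, so long rows are untouched.
        kept.extend([pad_value] * (n_frames - len(kept)))
        fixed.append(kept)
    return fixed


def log_mel(path, sr=22050, n_mels=128, n_frames=1024):
    """Return an n_mels x n_frames log-mel matrix as nested lists."""
    import librosa  # imported lazily so crop_or_pad works without librosa

    y, _ = librosa.load(path, sr=sr)  # resample to 22,050 Hz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return crop_or_pad(librosa.power_to_db(mel).tolist(), n_frames)
```

<p>Calling log_mel on an audio file path would then return 128 rows of 1,024 values each, matching the 128 × 1024 input size stated above.</p></div>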
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Neural networks</head><p>2.3.1 VQ-VAE as feature extractor. We use VQ-VAE as a feature extractor to get a discrete embedding from mel-spectrograms. The VQ-VAE consists of an encoder and a decoder. The encoder contains 5 convolutional layers, followed by two residual 3 × 3 blocks, all having 256 feature maps. The kernel size and the stride of the first 4 layers are (4,3) and (2,1), and those of the fifth layer are (5,4) and (1,2). The paddings of the five layers are (1,2), (1,4), (1,8), (1,16), and (0,1), respectively. The dilations are the same as the paddings. As a result, the encoder generates an embedding of shape 256 × 4 × 512. The decoder consists of two residual 3 × 3 blocks, followed by 5 transposed convolutional layers. The kernel size, stride, and padding are (4,4), (1,2), and (0,1) for the first layer, and (4,3), (2,1), and (0,1) for the second layer. For the remaining three layers, they are (4,3), (2,1), and (1,1). At the end of the decoder, a tanh activation function is used. We call this the Type-1 VQ-VAE.</p><p>To observe how the dimension of the embedding affects the tagging performance, we implement an alternative that uses an (8,4) kernel for the fifth layer of the encoder, making the shape of the embedding 256 × 1 × 512. We may view it as a sequence of 256-dimensional feature vectors. We call this one the Type-2 VQ-VAE.</p></div>
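<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a sanity check on the embedding shapes quoted above, the standard convolution output-size arithmetic can be traced through the five encoder layers. The sketch below assumes PyTorch-style convolution arithmetic, a 128 × 1024 input, and a dilation of 1 for the fifth layer (a dilation of 0, which "same as padding" would imply along the first axis, is not a valid setting). The function names are hypothetical.</p>

```python
def conv_out(size, kernel, stride, padding, dilation):
    """Output length of one convolution axis (PyTorch-style arithmetic)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def encoder_shape(height=128, width=1024, last_kernel=(5, 4), channels=256):
    """Trace the feature-map shape through the five encoder layers."""
    paddings = [(1, 2), (1, 4), (1, 8), (1, 16)]
    for pad_h, pad_w in paddings:  # layers 1-4: kernel (4,3), stride (2,1)
        # Dilation equals padding for these layers, as stated in the text.
        height = conv_out(height, 4, 2, pad_h, pad_h)
        width = conv_out(width, 3, 1, pad_w, pad_w)
    # Layer 5: stride (1,2), padding (0,1); dilation 1 is an assumption,
    # because a dilation of 0 is not a valid setting.
    height = conv_out(height, last_kernel[0], 1, 0, 1)
    width = conv_out(width, last_kernel[1], 2, 1, 1)
    return (channels, height, width)


print(encoder_shape(last_kernel=(5, 4)))  # Type-1 fifth-layer kernel
print(encoder_shape(last_kernel=(8, 4)))  # Type-2 fifth-layer kernel
```

<p>Under these assumptions the two calls print (256, 4, 512) and (256, 1, 512), which agrees with the Type-1 and Type-2 embedding shapes given in the text.</p></div>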
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.2">Classifiers.</head><p>We train two kinds of classifiers. The first one is a GRU classifier with 2 bi-directional gated recurrent units (GRUs). After the first GRU, layer normalization is applied. The output hidden states of the second GRU then go through a fully-connected layer and a sigmoid activation layer to obtain the predictions. The second one is a CNN (convolutional neural network) classifier. The model structure of the CNN classifier is basically the same as that proposed in <ref type="bibr" target="#b6">[7]</ref>, with the number of channels halved.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Training</head><p>The training procedure, as depicted in Figure <ref type="figure" target="#fig_0">1</ref>, is composed of two steps. In step-1, we pre-train the VQ-VAE on MSD by minimizing the reconstruction error. In step-2, we cascade the encoder of the VQ-VAE trained in step-1 with a classifier (a GRU-based or a CNN-based one), and train the network with a binary cross-entropy loss for genre, mood, or theme recognition (depending on the dataset). During training, we set the batch size to 12 and the learning rate to 2e-4. The Adam optimizer is used to train the models. The networks are trained for a maximum of 100 epochs with early stopping.</p></div>
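<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the step-2 objective concrete, the following is a minimal, dependency-free sketch of a binary cross-entropy loss over the tags of a single clip, assuming the classifier ends in a sigmoid as described for the GRU classifier. The function names and the clamping constant eps are illustrative and are not taken from the authors' code.</p>

```python
import math


def sigmoid(x):
    """Map a raw classifier output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy over all tags of one clip."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


# Three tags: the first is present in the clip, the other two are not.
logits = [2.0, -1.0, 0.0]
targets = [1, 0, 0]
print(binary_cross_entropy([sigmoid(z) for z in logits], targets))
```

<p>Because each tag contributes its own independent term, the loss suits the multi-label setting in which a clip may carry several tags at once. For instance, any tag predicted at probability 0.5 contributes ln 2 (about 0.693) regardless of its target.</p></div>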
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5">Methods</head><p>We submit the following five runs:</p><p>• Run-1: type-1 VQ-VAE + GRU; updating both VQ-VAE and GRU during step-2 training.</p><p>• Run-2: type-1 VQ-VAE + GRU; fixing VQ-VAE and updating only the GRU during step-2 training.</p><p>• Run-3: type-1 VQ-VAE + CNN; updating both VQ-VAE and CNN during step-2 training.</p><p>• Run-4: type-1 VQ-VAE + CNN; fixing VQ-VAE and updating only the CNN during step-2 training.</p><p>• Run-5: type-2 VQ-VAE + GRU; updating both VQ-VAE and GRU during step-2 training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RESULTS AND ANALYSIS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Auto-tagging on MTAT</head><p>To verify the effectiveness of the VQ-VAE based classification method, we first evaluate the Run-1 method on MTAT for auto-tagging. Specifically, in step-2 training, we update the type-1 VQ-VAE (pre-trained on MSD) along with the GRU classifier on MTAT and observe the tagging performance. The model attains a ROC-AUC of 0.90 when predicting the top-50 tags, which is close to the performance of state-of-the-art models <ref type="bibr" target="#b5">[6]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Mood &amp; theme classification on Jamendo</head><p>The results on the Jamendo dataset are shown in Table <ref type="table" target="#tab_1">2</ref>. In terms of ROC-AUC, Run-2 outperforms Run-1, and Run-4 outperforms Run-3. This may indicate that it is better to fix the VQ-VAE when training the classifiers. The CNN classifier also seems to perform slightly better than the GRU classifier (Run-3 vs. Run-1 and Run-4 vs. Run-2), and the type-1 VQ-VAE seems to work better than its type-2 counterpart (Run-1 vs. Run-5). The best ROC-AUC, 0.7207, is obtained by Run-4. Yet, it is still lower than the 0.7258 of VGG-ish, which represents a strong baseline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">SUMMARY AND OUTLOOK</head><p>In this paper, we have reported a preliminary attempt to use a pre-trained VQ-VAE model for music auto-tagging problems. From the evaluation results, it seems that either the approach is not that promising for discriminative tasks, or that we have not fully capitalized on its potential. We would like to further develop this approach in the near future, for both discriminative and generative problems in music (e.g., to generate music in the audio domain).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Schematic architecture of the proposed neural network and training procedure.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1:</head><label>1</label><figDesc>Number of audio samples of third-party datasets in the train, validation and test splits we made</figDesc><table><row><cell></cell><cell>Train</cell><cell>Validation</cell><cell>Test</cell></row><row><cell>MSD [1]</cell><cell>557,315</cell><cell>37,008</cell><cell>0</cell></row><row><cell>MTAT [4]</cell><cell>16,776</cell><cell>1,339</cell><cell>2,651</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2:</head><label>2</label><figDesc>Testing (first seven rows) and validation (last five) scores on the MediaEval'19 Jamendo dataset.</figDesc><table><row><cell></cell><cell>ROC-AUC</cell><cell>PR-AUC</cell><cell>F1(macro)</cell><cell>F1(micro)</cell></row><row><cell>Popularity</cell><cell>0.5000</cell><cell>0.0320</cell><cell>0.0570</cell><cell>0.0030</cell></row><row><cell>VGG-ish</cell><cell>0.7258</cell><cell>0.1077</cell><cell>0.1657</cell><cell>0.1771</cell></row><row><cell>Run-1</cell><cell>0.7103</cell><cell>0.0984</cell><cell>0.1183</cell><cell>0.1439</cell></row><row><cell>Run-2</cell><cell>0.7141</cell><cell>0.1037</cell><cell>0.0901</cell><cell>0.1184</cell></row><row><cell>Run-3</cell><cell>0.7147</cell><cell>0.0994</cell><cell>0.1013</cell><cell>0.1233</cell></row><row><cell>Run-4</cell><cell>0.7207</cell><cell>0.1077</cell><cell>0.1068</cell><cell>0.1522</cell></row><row><cell>Run-5</cell><cell>0.6916</cell><cell>0.0860</cell><cell>0.0884</cell><cell>0.1209</cell></row><row><cell>Run-1</cell><cell>0.6829</cell><cell>0.0717</cell><cell>0.0891</cell><cell>0.1161</cell></row><row><cell>Run-2</cell><cell>0.6973</cell><cell>0.0782</cell><cell>0.0838</cell><cell>0.1201</cell></row><row><cell>Run-3</cell><cell>0.6928</cell><cell>0.0746</cell><cell>0.0921</cell><cell>0.1227</cell></row><row><cell>Run-4</cell><cell>0.6966</cell><cell>0.0770</cell><cell>0.0851</cell><cell>0.1142</cell></row><row><cell>Run-5</cell><cell>0.6662</cell><cell>0.0608</cell><cell>0.0746</cell><cell>0.0899</cell></row></table></figure>
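<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers less familiar with the ROC-AUC column of Table 2, the metric can be sketched in plain Python as the probability that a clip carrying a tag is scored higher than a clip without it. Averaging this value over tags (macro averaging) is an assumption here, as the text does not state the averaging scheme, and the function names are hypothetical.</p>

```python
def roc_auc(scores, labels):
    """ROC-AUC of one tag: P(score of a positive > score of a negative).

    Ties count as one half. Returns None when a tag has no positive or
    no negative clip, since the value is undefined in that case.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_roc_auc(score_rows, label_rows):
    """Average the per-tag ROC-AUC over tags (columns); assumes macro averaging."""
    per_tag = []
    for tag in range(len(label_rows[0])):
        auc = roc_auc([r[tag] for r in score_rows], [r[tag] for r in label_rows])
        if auc is not None:  # skip tags for which the value is undefined
            per_tag.append(auc)
    return sum(per_tag) / len(per_tag)
```

<p>Under this definition, a predictor that gives every clip the same score for a tag obtains exactly 0.5, which would be consistent with the 0.5000 reported for the Popularity baseline.</p></div>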
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The million song dataset</title>
		<author>
			<persName><forename type="first">Thierry</forename><surname>Bertin-Mahieux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">P W</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Brian</forename><surname>Whitman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><surname>Lamere</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. International Society for Music Information Retrieval Conference (ISMIR)</title>
				<meeting>International Society for Music Information Retrieval Conference (ISMIR)</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">MediaEval 2019: Emotion and theme recognition in music using Jamendo</title>
		<author>
			<persName><forename type="first">Dmitry</forename><surname>Bogdanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alastair</forename><surname>Porter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><surname>Tovstogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minz</forename><surname>Won</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2019 Workshop</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">Keunwoo</forename><surname>Choi</surname></persName>
		</author>
		<ptr target="https://github.com/keunwoochoi/magnatagatune-list" />
		<title level="m">List of automatic music tagging research articles that are evaluated against MagnaTagATune Dataset</title>
				<imprint>
			<date type="published" when="2017">2017 (accessed 29 September 2019)</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Evaluation of algorithms using games: The case of music tagging</title>
		<author>
			<persName><forename type="first">Edith</forename><surname>Law</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kris</forename><surname>West</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><forename type="middle">I</forename><surname>Mandel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mert</forename><surname>Bay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Stephen</forename><surname>Downie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. International Society for Music Information Retrieval Conference (ISMIR)</title>
				<meeting>International Society for Music Information Retrieval Conference (ISMIR)</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">librosa: Audio and music signal analysis in Python</title>
		<author>
			<persName><forename type="first">Brian</forename><surname>McFee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Colin</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dawen</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">P W</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matt</forename><surname>McVicar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eric</forename><surname>Battenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Nieto</surname></persName>
		</author>
		<ptr target="https://librosa.github.io/librosa/" />
	</analytic>
	<monogr>
		<title level="m">Proc. Python in Science Conf</title>
				<meeting>Python in Science Conf</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="18" to="25" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from Bach</title>
		<author>
			<persName><forename type="first">Juhan</forename><surname>Nam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Keunwoo</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jongpil</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Szu-Yu</forename><surname>Chou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yi-Hsuan</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Magazine</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="41" to="51" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">End-to-end learning for music audio tagging at scale</title>
		<author>
			<persName><forename type="first">Jordi</forename><surname>Pons</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Nieto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthew</forename><surname>Prockup</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erik</forename><forename type="middle">M</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andreas</forename><forename type="middle">F</forename><surname>Ehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xavier</forename><surname>Serra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. International Society for Music Information Retrieval Conference (ISMIR)</title>
				<meeting>International Society for Music Information Retrieval Conference (ISMIR)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Neural discrete representation learning</title>
		<author>
			<persName><forename type="first">Aaron</forename><surname>Van Den Oord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Koray</forename><surname>Kavukcuoglu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Conference on Neural Information Processing Systems (NIPS)</title>
				<meeting>Conference on Neural Information Processing Systems (NIPS)</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
