<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Features with SpecAugment and EfficientNet</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tri-Nhan</forename><surname>Do</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Minh-Tri</forename><surname>Nguyen</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hai-Dang</forename><surname>Nguyen</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Minh-Triet</forename><surname>Tran</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">John von Neumann Institute</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Xuan-Nam</forename><surname>Cao</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Features with SpecAugment and EfficientNet</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D4451274E468C6B1D14714AE0FC3CE5D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>MediaEval 2020 provided a subset of the MTG-Jamendo dataset, aimed at recognizing moods and themes in music. Team HCMUS proposes several solutions for building efficient classifiers for this problem. In addition to the provided mel-spectrogram features, new features extracted from a WaveNet model are utilized to train the EfficientNet model. As evaluated by the organizers, our best result achieved 0.142 in PR-AUC and 0.76 in ROC-AUC. With fast training and lightweight features, our proposed methods have the potential to work well with deeper neural networks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The Emotions and Themes in Music task at MediaEval <ref type="bibr" target="#b0">[1]</ref> is difficult and challenging due to the ambiguity of tags in the real world. Mood is shaped by human perception, so different people have different feelings about the same song. Moreover, this is a multi-label classification problem with 56 tags: the dataset is quite unbalanced in the distribution of mood labels, and each track can carry several labels, since many emotions can appear in the same song.</p><p>To solve this task, the authors tried many methods, varying the models, input features, and loss functions. Our best result is an ensemble of two different methods: one using the provided mel-spectrogram features with an EfficientNet model, and the other using WaveNet features with MobileNetV2 or EfficientNet-B7 models <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b8">9]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>Data augmentation is important when training neural network models. Traditional audio augmentation methods modify the speed of the waveform or perturb the original signal samples with noise, which incurs a large computational cost. The SpecAugment approach <ref type="bibr" target="#b5">[6]</ref> instead adjusts the spectrogram directly by warping it in the time direction, masking blocks of consecutive frequency channels, and masking blocks of consecutive time steps. This approach is simpler and costs less time and fewer resources.</p><p>The WaveNet model is applicable to many problems in signal processing, time-series forecasting, and music generation <ref type="bibr" target="#b3">[4]</ref>. The authors therefore also follow this line of work, using a pretrained WaveNet model to extract feature vectors from raw audio and then using those features as inputs to convolutional neural networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data analysis</head><p>As shown in Figure 1, the green part shows audio with only one mood/theme label, the yellow part shows audio with 2 to 3 moods, and the red part shows audio with more than 3 moods. The training set contains 9,949 audio samples with a total of 17,885 mood annotations. On average, each class has 319 audio samples, with a standard deviation of 202.75. The maximum number of moods for a single audio track is 8. The most frequent mood/theme is happy, with 927 tracks.</p><p>We can see that the data is extremely unbalanced, and some classes have no audio that represents them exclusively. Therefore, it is necessary to reduce the complexity of the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Data preprocessing</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">Data balance:</head><p>To reduce the ambiguity of the data, the authors try changing each audio's label from multi-label to single label, keeping the most significant tag of each audio; this reduces the standard deviation across classes and gives preference to moods with little data.</p></div>
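<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal illustration of the single-label reduction described above, the following Python sketch keeps, for each track, its rarest tag over the training set, which matches the stated preference for moods with little data; the function and variable names are illustrative and not taken from the authors' code.</p><code lang="python">
from collections import Counter

def balance_to_single_label(track_tags):
    """Reduce each track's tag list to one tag, preferring rare moods.

    track_tags: dict mapping track_id to a list of mood/theme tags.
    Returns a dict mapping track_id to a single tag.
    """
    # Global tag frequencies over the whole training set.
    counts = Counter(tag for tags in track_tags.values() for tag in tags)
    # Keep the tag with the fewest training examples, which lowers the
    # standard deviation of the per-class counts.
    return {tid: min(tags, key=lambda t: counts[t])
            for tid, tags in track_tags.items()}

# Toy usage: 'melancholic' is rarer than 'happy' here, so it is kept.
demo = {"track_1": ["happy", "melancholic"], "track_2": ["happy"]}
print(balance_to_single_label(demo))  # {'track_1': 'melancholic', 'track_2': 'happy'}
</code></div>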
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">Features preprocessing:</head><p>WaveNet features: Based on the idea of using WaveNet as a classifier for raw-waveform music audio <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b9">10]</ref>, the authors use a WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform; this model was pretrained on NSynth, a high-quality dataset of musical notes <ref type="bibr" target="#b1">[2]</ref>.</p><p>According to the dataset statistics, the minimum audio length is 30 seconds; due to the limitations of the authors' training machine, sound samples longer than 400 seconds are trimmed to their middle part. A random 30-second segment is then cut from each sample, and features are extracted from it. This approach is somewhat arbitrary and discards input data, so we plan to experiment with taking a new random crop from the 400-second audio after each epoch. The encoder output for a 30-second audio clip is 16 channels by 937 time steps.</p><p>Mel-spectrogram: Each sample has 96 channels, and the time frames are randomly cropped to 6,950 after each epoch.</p></div>
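<div xmlns="http://www.tei-c.org/ns/1.0"><p>A sketch of the trimming and cropping scheme above, assuming 16 kHz audio (consistent with the reported 16-channel by 937-step output, since 30 s of 16 kHz audio divided by the NSynth encoder's stride of 512 gives roughly 937 steps); the encoder call itself is only indicated as a hypothetical comment.</p><code lang="python">
import numpy as np

SR = 16_000  # assumed sample rate

def crop_waveform(audio, max_sec=400, crop_sec=30, rng=np.random):
    """Trim audio longer than 400 s to its middle part, then take a random 30 s crop."""
    max_len = max_sec * SR
    if len(audio) > max_len:
        start = (len(audio) - max_len) // 2   # keep the middle 400 s
        audio = audio[start:start + max_len]
    crop_len = crop_sec * SR
    offset = rng.randint(0, len(audio) - crop_len + 1)
    return audio[offset:offset + crop_len]

# The 30 s crop is then encoded by the pretrained WaveNet-style autoencoder,
# e.g. (hypothetical call) codes = encoder.encode(crop), yielding a temporal
# code of about 937 time steps with 16 channels, as described above.
</code></div>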
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Data augmentation</head><p>SpecAugment: To train models more efficiently, the authors apply SpecAugment, an augmentation method introduced by Google. This method masks blocks of consecutive time steps and channels in each mel-spectrogram. Using it improves the result significantly: PR-AUC-macro rises from 0.134 to 0.139.</p><p>Each input has a 70% chance of being augmented with SpecAugment; each augmented mel-spectrogram receives two blocks of time masking and two blocks of channel masking.</p></div>
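<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal NumPy sketch of the masking policy stated above (70% application probability, two time masks and two channel masks per mel-spectrogram); the maximum mask widths are illustrative assumptions, and the time-warping step of SpecAugment is omitted.</p><code lang="python">
import numpy as np

def spec_augment(mel, p=0.7, n_time_masks=2, n_channel_masks=2,
                 max_t=100, max_f=12, rng=np.random):
    """Mask blocks of consecutive time steps and mel channels.

    mel: array of shape (channels, time), e.g. (96, 6950).
    max_t / max_f: assumed maximum mask widths.
    """
    if rng.rand() > p:
        return mel                      # roughly 30% of inputs pass through unchanged
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    for _ in range(n_time_masks):       # blocks of consecutive time steps
        t = rng.randint(1, min(max_t, n_frames) + 1)
        t0 = rng.randint(0, n_frames - t + 1)
        mel[:, t0:t0 + t] = 0.0
    for _ in range(n_channel_masks):    # blocks of consecutive mel channels
        f = rng.randint(1, min(max_f, n_mels) + 1)
        f0 = rng.randint(0, n_mels - f + 1)
        mel[f0:f0 + f, :] = 0.0
    return mel
</code></div>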
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Deep Neural Network model</head><p>Since both mel-spectrogram and WaveNet features can be expressed as images, the authors use convolutional models such as MobileNet and EfficientNet. The mel-spectrogram features are passed to EfficientNet-B0, while the WaveNet features are passed to MobileNetV2 and EfficientNet-B7. Because the WaveNet features are not large enough to fit EfficientNet-B7's input, the authors duplicate the channels so that these features can be used.</p><p>In addition, we also tested an SVM model, InceptionNet, and ResNet, and, to capture long-term temporal characteristics, self-attention was added as in the AMLAG 2019 method <ref type="bibr" target="#b7">[8]</ref>; however, these approaches produced only a slight improvement in the results.</p></div>
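<div xmlns="http://www.tei-c.org/ns/1.0"><p>One plausible reading of the channel duplication is sketched below, with torchvision's EfficientNet-B7 as a stand-in implementation (the paper does not name its framework): the single-channel 16 x 937 WaveNet feature map is repeated to three channels before entering the network.</p><code lang="python">
import torch
from torchvision.models import efficientnet_b7  # assumed implementation choice

# WaveNet features arrive as a single-channel "image": (batch, 1, 16, 937).
features = torch.randn(2, 1, 16, 937)

# EfficientNet-B7 expects 3-channel input, so the channel is duplicated.
x = features.repeat(1, 3, 1, 1)          # shape becomes (batch, 3, 16, 937)

model = efficientnet_b7(num_classes=56)  # one output per mood/theme tag
logits = model(x)                        # shape (batch, 56)
probs = torch.sigmoid(logits)            # independent per-tag probabilities
</code></div>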
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Loss function</head><p>For the loss function, binary cross-entropy (BCE) loss is applied for both MobileNetV2 and EfficientNet. The authors also tried Focal Loss <ref type="bibr" target="#b2">[3]</ref>, since the dataset is quite unbalanced; however, it did not yield better results on our dataset after the balancing step.</p></div>
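<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, a multi-label focal loss in the sense of <ref type="bibr" target="#b2">[3]</ref> can be written on top of BCE; since the paper does not report its hyperparameters, the common gamma = 2 and alpha = 0.25 are assumed.</p><code lang="python">
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-label focal loss (Lin et al.); reduces to weighted BCE at gamma = 0."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)         # probability of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()  # down-weights easy examples
</code></div>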
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPERIMENTS AND RESULTS</head><p>Our experiments are run on a server with an NVIDIA Quadro K6000 graphics card. Methods A, B, and D were not submitted to the challenge. We observe that the data balancing method leads to better results than the original dataset with default labels. Based on experiments on the validation set, our ensemble models are weighted with factors of 0.7 and 0.3 for the mel-spectrogram and WaveNet features, respectively, which gave the best results (Table 1).</p></div>
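<div xmlns="http://www.tei-c.org/ns/1.0"><p>The late-fusion step reduces to a weighted average of the two runs' per-tag probabilities; a short sketch with illustrative array names:</p><code lang="python">
import numpy as np

def ensemble(p_mel, p_wavenet, w_mel=0.7, w_wavenet=0.3):
    """Weighted average of per-tag probabilities from the two models."""
    return w_mel * p_mel + w_wavenet * p_wavenet

# p_mel, p_wavenet: arrays of shape (n_tracks, 56) holding sigmoid outputs
# of the two component runs; the weights 0.7 / 0.3 were tuned on the
# validation set as described above.
</code></div>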
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSION AND FUTURE WORKS</head><p>The EfficientNet model was shown to be more efficient than previous models on the mood and theme classification problem. The results may be further improved by training mel-spectrogram features on larger, more complex EfficientNet variants.</p><p>Although the result when training on WaveNet features is not higher than with mel-spectrogram features, ensembling the two models improves the results, which shows that the WaveNet features capture other aspects of the dataset. Because the WaveNet features were extracted with a pretrained model, the augmentation methods could not be fully applied; as future work, further improvements are expected from training WaveNet-style autoencoder models directly on the Jamendo dataset.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Histogram of mood and theme of the training set.</figDesc><graphic coords="1,343.18,388.44,201.76,136.73" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: Overview of submission 1.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1:</head><label>1</label><figDesc>Experiment results.</figDesc><table><row><cell>Method</cell><cell>Features and Model</cell><cell>PR-AUC-macro</cell></row><row><cell>A</cell><cell>Mel-spectrogram EfficientNet-B0</cell><cell>0.127</cell></row><row><cell>B</cell><cell>Mel-spectrogram EfficientNet-B0 with data processing</cell><cell>0.134</cell></row><row><cell>C (run2)</cell><cell>Mel-spectrogram EfficientNet-B0 using augmentation</cell><cell>0.139</cell></row><row><cell>D</cell><cell>WaveNet MobileNetV2</cell><cell>0.102</cell></row><row><cell>E (run3)</cell><cell>WaveNet EfficientNet-B7</cell><cell>0.105</cell></row><row><cell>F (run1)</cell><cell>Ensemble C and D</cell><cell>0.1413</cell></row><row><cell>G (run4)</cell><cell>Ensemble C and E</cell><cell>0.1414</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This research is supported with computing infrastructure by SELAB and AILAB, University of Science, Vietnam National University - Ho Chi Minh City.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">MediaEval 2020: Emotion and theme recognition in music using Jamendo</title>
		<author>
			<persName><forename type="first">Philip</forename><surname>Tovstogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minz</forename><surname>Won</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dmitry</forename><surname>Bogdanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alastair</forename><surname>Porter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2020 Workshop</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Neural audio synthesis of musical notes with wavenet autoencoders</title>
		<author>
			<persName><forename type="first">Jesse</forename><surname>Engel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cinjon</forename><surname>Resnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sander</forename><surname>Dieleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohammad</forename><surname>Norouzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Douglas</forename><surname>Eck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>Simonyan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning. PMLR</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1068" to="1077" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Focal loss for dense object detection</title>
		<author>
			<persName><forename type="first">Tsung-Yi</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Priya</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ross</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kaiming</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Piotr</forename><surname>Dollár</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE international conference on computer vision</title>
				<meeting>the IEEE international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2980" to="2988" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Wavenet: A generative model for raw audio</title>
		<author>
			<persName><forename type="first">Aaron</forename><surname>Van Den Oord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sander</forename><surname>Dieleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Heiga</forename><surname>Zen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nal</forename><surname>Kalchbrenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Senior</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Koray</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1609.03499</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Emotion recognition from raw speech using wavenet</title>
		<author>
			<persName><forename type="first">Sandeep</forename><surname>Kumar Pandey</surname></persName>
		</author>
		<author>
			<persName><surname>Shekhawat</surname></persName>
		</author>
		<author>
			<persName><surname>Prasanna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">TENCON 2019-2019 IEEE Region 10 Conference (TENCON)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1292" to="1297" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Specaugment: A simple data augmentation method for automatic speech recognition</title>
		<author>
			<persName><forename type="first">William</forename><surname>Daniel S Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yu</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chung-Cheng</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barret</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ekin</forename><forename type="middle">D</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc V</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><surname>Le</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.08779</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Mobilenetv2: Inverted residuals and linear bottlenecks</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Sandler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Menglong</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrey</forename><surname>Zhmoginov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liang-Chieh</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="4510" to="4520" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">Manoj</forename><surname>Sukhavasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sainath</forename><surname>Adapa</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1911.07041</idno>
		<title level="m">Music theme recognition using CNN and self-attention</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Efficientnet: Rethinking model scaling for convolutional neural networks</title>
		<author>
			<persName><forename type="first">Mingxing</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc V</forename><surname>Le</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1905.11946</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Music Artist Classification with WaveNet Classifier for Raw Waveform Audio Data</title>
		<author>
			<persName><forename type="first">Xulong</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yongwei</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yi</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Li</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.04371</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
