<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Recognizing Music Mood and Theme Using Convolutional Neural Networks and Attention</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Alish</forename><surname>Dipani</surname></persName>
							<email>alish.dipani@uploadai.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Upload AI LLC</orgName>
								<address>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory">Cognitive Neuroscience Lab</orgName>
								<orgName type="institution" key="instit1">BITS Pilani</orgName>
								<orgName type="institution" key="instit2">K.K.Birla Goa Campus</orgName>
								<address>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gaurav</forename><surname>Iyer</surname></persName>
							<affiliation key="aff1">
								<orgName type="laboratory">Cognitive Neuroscience Lab</orgName>
								<orgName type="institution" key="instit1">BITS Pilani</orgName>
								<orgName type="institution" key="instit2">K.K.Birla Goa Campus</orgName>
								<address>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Veeky</forename><surname>Baths</surname></persName>
							<affiliation key="aff1">
								<orgName type="laboratory">Cognitive Neuroscience Lab</orgName>
								<orgName type="institution" key="instit1">BITS Pilani</orgName>
								<orgName type="institution" key="instit2">K.K.Birla Goa Campus</orgName>
								<address>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Recognizing Music Mood and Theme Using Convolutional Neural Networks and Attention</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CD83E513B34E919FA226BFA061414C07</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:13+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present the UAI-CNRL submission to the MediaEval 2020 task on Emotion and Theme Recognition in Music. We make use of a ResNet34 architecture, coupled with a self-attention module, to detect moods/themes in music tracks. The autotagging-moodtheme subset of the MTG-Jamendo dataset was used to train the model. We show that the proposed model outperforms the provided VGG-ish and popularity baselines.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Music has been shown to induce a variety of emotions such as happiness, sadness, and anger <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b26">27]</ref>. This induction of emotions can be attributed to intrinsic properties such as tempo, rhythm variations, intensity, and mode, as well as extrinsic properties such as the association of music with personal events and previous experiences <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b22">23]</ref>. These emotional responses could also be one of the important motivators for humans to listen to music <ref type="bibr" target="#b19">[20]</ref><ref type="bibr" target="#b20">[21]</ref><ref type="bibr" target="#b21">[22]</ref>.</p><p>Automatic tagging and detection of emotions in music is a difficult task given the subjectivity of human emotions. The MTG-Jamendo dataset <ref type="bibr" target="#b3">[4]</ref> aims at tackling several such autotagging tasks by providing royalty-free audio tracks of consistent quality, with tags for genre, instruments and mood/theme. The Emotion and Theme Recognition Task of MediaEval 2020 uses the mood/theme subset of the MTG-Jamendo dataset. The task is as follows: given an audio track, automatically detect one or more moods/themes out of 56 given tags, for example, fun, sad, romantic, and happy <ref type="bibr" target="#b2">[3]</ref>.</p><p>In this paper, we describe our approach (team name: UAI-CNRL) for this task: we use convolutional neural networks to extract features from the mel-spectrograms of the audio tracks, and multi-head self-attention to predict the mood/theme from the extracted features. Our approach achieves better performance than the baselines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>Convolutional neural networks (CNNs) have been successful in extracting meaningful features for tasks such as image recognition <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b13">14]</ref> and object detection <ref type="bibr" target="#b9">[10]</ref>. In the field of audio processing, CNNs have been used for a variety of tasks, such as automatic tagging <ref type="bibr" target="#b5">[6]</ref>, source separation <ref type="bibr" target="#b29">[30]</ref>, music emotion classification <ref type="bibr" target="#b15">[16]</ref> and speaker identification <ref type="bibr" target="#b17">[18]</ref>.</p><p>Transformer networks, which use self-attention layers <ref type="bibr" target="#b27">[28]</ref>, have been successful in tackling language tasks involving long-range dependencies. They have also been used in the field of audio processing for many tasks, such as automatic tagging <ref type="bibr" target="#b28">[29]</ref>, source separation <ref type="bibr" target="#b4">[5]</ref>, and speech recognition <ref type="bibr" target="#b1">[2]</ref>.</p><p>A combination of these methods has been demonstrated to achieve state-of-the-art performance <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b31">32]</ref>. Inspired by these, we use convolution layers to extract features from mel-spectrograms and self-attention layers to process those features to predict the moods/themes.</p></div>
<note xmlns="http://www.tei-c.org/ns/1.0" place="foot">† Authors contributed equally. § https://github.com/alishdipani/Multimediaeval2020-emotions-and-themes-inmusic</note>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">APPROACH</head><p>We make use of a popular convolutional neural network architecture, the ResNet <ref type="bibr" target="#b9">[10]</ref>, as a feature extractor to obtain compact representations of our data. We pair this with self-attention <ref type="bibr" target="#b27">[28]</ref> in order to capture long-term temporal attributes of the given data. We also make use of batch normalization <ref type="bibr" target="#b10">[11]</ref> and dropout <ref type="bibr" target="#b23">[24]</ref> to further regularize the model. We describe the model architecture in this section. Our code and trained model are available at this URL§.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">ResNet34</head><p>Residual connections make training deep neural networks easier, since they address the problem of vanishing gradients. We make use of a standard ResNet34 architecture to take advantage of this property. This is preceded by two convolutional layers in order to reshape the data into a form that can be fed into the ResNet. Another convolutional layer is used after the ResNet feature extractor to reduce the number of channels.</p></div>
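<div xmlns="http://www.tei-c.org/ns/1.0"><p>For illustration, the following PyTorch sketch shows one way to realize the feature extractor described above; it is not taken from our released code, and the adapter kernel sizes and the number of output channels are assumptions. It maps a single-channel mel-spectrogram segment to three channels, runs it through a standard (randomly initialized) ResNet34 trunk, and reduces the channel count with a 1x1 convolution.</p><p>
import torch
import torch.nn as nn
from torchvision.models import resnet34

class ResNetFeatureExtractor(nn.Module):
    """Sketch of the feature extractor: conv adapter, ResNet34 trunk, channel reduction."""
    def __init__(self, out_channels=64):
        super().__init__()
        # Two convolutions mapping the 1-channel spectrogram segment to 3 channels,
        # matching the input expected by the torchvision ResNet34.
        self.adapter = nn.Sequential(
            nn.Conv2d(1, 3, kernel_size=3, padding=1),
            nn.BatchNorm2d(3),
            nn.ReLU(),
            nn.Conv2d(3, 3, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        trunk = resnet34()  # randomly initialized ResNet34
        # Keep only the convolutional body (drop the average-pool and fc head).
        self.resnet = nn.Sequential(*list(trunk.children())[:-2])
        # 1x1 convolution to reduce the 512 ResNet channels to a compact representation.
        self.reduce = nn.Conv2d(512, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, 1, n_mels, frames), e.g. one 96 x 256 segment per item.
        return self.reduce(self.resnet(self.adapter(x)))

# Example: ResNetFeatureExtractor()(torch.randn(2, 1, 96, 256)) has shape (2, 64, 3, 8).
</p></div>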
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Self-Attention</head><p>The MTG-Jamendo dataset consists of tracks of varying lengths, a majority of which are over 200 seconds long. Using self-attention, we attempt to capture long-range temporal attributes and summarize the sequence of music representations. Our model architecture is inspired by the work in <ref type="bibr" target="#b24">[25]</ref>, which uses multi-head attention along with positional encoding. Two layers, each consisting of four attention heads, were used. The input sequence length and embedding size were unchanged.</p></div>
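<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of this attention stage, using PyTorch's built-in transformer encoder with two layers of four heads each, a learned positional encoding, and mean pooling before the tag classifier. The embedding size, the feed-forward width and the pooling choice here are illustrative assumptions rather than values reported above.</p><p>
import torch
import torch.nn as nn

class AttentionSummarizer(nn.Module):
    """Sketch: multi-head self-attention over a sequence of segment embeddings."""
    def __init__(self, embed_dim=256, seq_len=16, n_heads=4, n_layers=2, n_classes=56):
        super().__init__()
        # Learned positional encoding added to the segment embeddings.
        self.pos = nn.Parameter(torch.zeros(1, seq_len, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, dim_feedforward=4 * embed_dim,
            dropout=0.1, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim), one embedding per audio segment.
        h = self.encoder(x + self.pos)
        # Mean-pool over the sequence before the tag classifier (pooling choice assumed).
        return self.classifier(h.mean(dim=1))

# Example: AttentionSummarizer()(torch.randn(8, 16, 256)) has shape (8, 56).
</p></div>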
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Data Augmentation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Mixup.</head><p>Previous submissions to MediaEval 2019 <ref type="bibr" target="#b24">[25]</ref> for this task have shown that Mixup <ref type="bibr" target="#b30">[31]</ref> greatly improves the performance of the model being used. Mixup creates a new training example by linearly combining two random existing training samples, in the feature space as well as in the label space. More formally, Mixup trains a neural network on convex combinations of pairs of examples and their labels. This helps the model alleviate unwanted behaviours such as memorization, which is especially important since the dataset is relatively small.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2">SpecAugment.</head><p>SpecAugment <ref type="bibr" target="#b18">[19]</ref> is an augmentation technique originally proposed for speech recognition, which augments the spectrogram itself instead of the waveform. SpecAugment modifies the spectrogram by warping it along the time axis, masking blocks of frequency channels, and masking blocks of time steps. This makes the model more robust to missing time segments and missing frequency information in the input.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3">Other Augmentations.</head><p>Other transformation techniques, such as random cropping and random scaling, were used to further augment the given data.</p></div>
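<div xmlns="http://www.tei-c.org/ns/1.0"><p>The augmentations above can be sketched as follows. This is an illustrative PyTorch/torchaudio version: the Beta parameter and mask sizes are assumptions, and torchaudio's masking transforms cover the frequency- and time-masking parts of SpecAugment but not time warping.</p><p>
import torch
import torchaudio.transforms as T

def mixup(x, y, alpha=0.2):
    """Mixup: convex combination of two random examples and of their label vectors."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    return lam * x + (1.0 - lam) * x[perm], lam * y + (1.0 - lam) * y[perm]

# SpecAugment-style masking on mel-spectrograms (time warping omitted).
freq_mask = T.FrequencyMasking(freq_mask_param=12)
time_mask = T.TimeMasking(time_mask_param=80)

def augment_batch(spec, labels):
    # spec: (batch, 1, n_mels, frames); labels: (batch, n_tags) multi-hot vectors.
    spec = time_mask(freq_mask(spec))
    return mixup(spec, labels)
</p></div>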
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">TRAINING DETAILS</head><p>This section describes the data pre-processing, the model architecture, and other training details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Data Preparation</head><p>We use the mel-spectrograms provided in the MTG-Jamendo dataset for the purpose of training. Random cropping and scaling are used to augment and transform the data into a tensor of length 4096 (approximately 87.4 seconds). Additionally, SpecAugment is used to augment the dataset.</p></div>
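<div xmlns="http://www.tei-c.org/ns/1.0"><p>A sketch of the cropping step, assuming the provided mel-spectrograms are loaded as (n_mels, frames) tensors; zero-padding tracks shorter than the target length is an assumption rather than a detail stated above.</p><p>
import torch
import torch.nn.functional as F

def random_crop(mel, target_len=4096):
    """Randomly crop (or zero-pad) a mel-spectrogram to a fixed number of frames."""
    n_mels, n_frames = mel.shape
    if n_frames >= target_len:
        start = torch.randint(0, n_frames - target_len + 1, (1,)).item()
        return mel[:, start:start + target_len]
    # Pad short tracks on the right with zeros (assumption).
    return F.pad(mel, (0, target_len - n_frames))

# Example: random_crop(torch.randn(96, 7000)) has shape (96, 4096).
</p></div>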
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Architecture and Control Flow</head><p>• The input tensor of shape (1, 96, 4096) is divided into 16 segments length-wise, each new segment being of length 256.</p><p>• Each segment is then processed through 2 convolutional layers in order to obtain a representation with 3 channels.</p><p>• The obtained representation is then passed into the ResNet34 feature extractor, followed by a convolutional layer to obtain an intermediate representation.</p><p>• The feature maps are then passed through the self-attention module, followed by a series of linear layers to obtain the final class scores. Dropout is used to regularize the training process.</p><p>• The model returns the outputs of the self-attention module and the feature maps (after passing them through the linear layers). Both outputs are used to compute the loss and perform backpropagation, but only the outputs of the self-attention module are used to make predictions.</p></div>
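<div xmlns="http://www.tei-c.org/ns/1.0"><p>Putting the pieces together, the control flow above can be sketched as follows, reusing the illustrative modules from the earlier sketches; the flattened feature size, the projection to the attention embedding size, and the exact auxiliary head are assumptions. The model returns both the attention output and an auxiliary output on the segment features, mirroring the two-output scheme described above.</p><p>
import torch
import torch.nn as nn

class MoodThemeModel(nn.Module):
    """Sketch of the control flow: split into segments, CNN features, self-attention."""
    def __init__(self, feature_extractor, attention,
                 feat_dim=1536, embed_dim=256, n_classes=56):
        super().__init__()
        self.feature_extractor = feature_extractor   # e.g. ResNetFeatureExtractor()
        self.attention = attention                   # e.g. AttentionSummarizer()
        # Projection from the flattened CNN features to the attention embedding size
        # (1536 = 64 channels x 3 x 8 for the earlier extractor sketch; an assumption).
        self.project = nn.Linear(feat_dim, embed_dim)
        # Auxiliary head on the averaged segment features (the second output).
        self.aux_head = nn.Sequential(nn.Dropout(0.3), nn.Linear(embed_dim, n_classes))

    def forward(self, x):
        # x: (batch, 1, 96, 4096), split length-wise into 16 segments of 256 frames.
        segments = x.chunk(16, dim=-1)
        feats = [self.project(self.feature_extractor(s).flatten(1)) for s in segments]
        feats = torch.stack(feats, dim=1)              # (batch, 16, embed_dim)
        main_logits = self.attention(feats)            # used to make predictions
        aux_logits = self.aux_head(feats.mean(dim=1))  # used only in the loss
        return main_logits, aux_logits
</p></div>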
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Hyperparameters and Other Details</head><p>The model was trained with the Adam <ref type="bibr" target="#b12">[13]</ref> optimizer, at a learning rate of 1e-4, for 35 epochs. The values of 𝛽 1 and 𝛽 2 were set to 0.9 and 0.999 respectively. Binary cross entropy loss was used as the loss function. </p></div>
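<div xmlns="http://www.tei-c.org/ns/1.0"><p>For completeness, a minimal training-step sketch with the stated optimizer settings; the use of the logits form of binary cross-entropy, the equal weighting of the two loss terms, and the data loader are assumptions.</p><p>
import torch
import torch.nn as nn

# Reuses the illustrative modules from the previous sketches; train_loader is assumed
# to yield (mel, labels) batches with mel of shape (batch, 1, 96, 4096).
model = MoodThemeModel(ResNetFeatureExtractor(), AttentionSummarizer())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy over the 56 tags

for epoch in range(35):
    for mel, labels in train_loader:
        main_logits, aux_logits = model(mel)
        # Both outputs enter the loss; equal weighting is an assumption.
        loss = criterion(main_logits, labels) + criterion(aux_logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
</p></div>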
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">RESULTS</head><p>The proposed model produces results that improve on those of the given VGG-ish and popularity baselines. We obtain an ROC-AUC-macro of 0.7360 and a PR-AUC-macro of 0.1275. For comparison, the baseline VGG-ish model produces an ROC-AUC-macro of 0.7258 and a PR-AUC-macro of 0.1077. Detailed results can be found in Table <ref type="table" target="#tab_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">FUTURE WORK</head><p>In this section, we discuss other approaches that we considered for this task. These may serve as pointers for future work on tasks involving this dataset. Our approach can be broken down into two parts: first, the extraction of features from the audio data, and second, the processing of the extracted features to predict the moods/themes. Both parts could potentially be improved upon, and we mention a few ways to do so below.</p><p>With respect to feature extraction:</p><p>• Using a wider range of features to aid the classification task instead of using mel-spectrograms. For example, the LEAF frontend proposed by <ref type="bibr" target="#b0">[1]</ref> could be used for this purpose. • Using a self-supervised approach to extract features, such as wav2vec 2.0 <ref type="bibr" target="#b1">[2]</ref>. This would also reduce reliance on labelled data. • Using temporal convolutional networks <ref type="bibr" target="#b14">[15]</ref> to extract features directly from audio instead of using mel-spectrograms.</p><p>With respect to the processing of extracted features:</p><p>• Using dual-path processing inspired by <ref type="bibr" target="#b16">[17]</ref> in order to capture long-term dependencies while also reducing computational load. • Exploring ways of processing the raw audio data with more powerful models, such as WaveNet <ref type="bibr" target="#b25">[26]</ref>, in order to obtain better insights into the dataset and into theme recognition in general.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Results</figDesc><table><row><cell>Metric</cell><cell>Ours</cell><cell cols="2">VGG-ish[3] popularity[3]</cell></row><row><cell cols="2">ROC-AUC-macro 0.7360</cell><cell>0.7258</cell><cell>0.5000</cell></row><row><cell>PR-AUC-macro</cell><cell>0.1275</cell><cell>0.1077</cell><cell>0.03192</cell></row><row><cell>precision-macro</cell><cell>0.1639</cell><cell>0.1382</cell><cell>0.0014</cell></row><row><cell>recall-macro</cell><cell>0.3487</cell><cell>0.3086</cell><cell>0.0179</cell></row><row><cell>F-score-macro</cell><cell>0.1884</cell><cell>0.1657</cell><cell>0.0026</cell></row><row><cell cols="2">ROC-AUC-micro 0.7865</cell><cell>0.7750</cell><cell>0.5139</cell></row><row><cell>PR-AUC-micro</cell><cell>0.1369</cell><cell>0.1409</cell><cell>0.0341</cell></row><row><cell>precision-micro</cell><cell>0.1105</cell><cell>0.1161</cell><cell>0.0799</cell></row><row><cell>recall-micro</cell><cell>0.4032</cell><cell>0.3735</cell><cell>0.0447</cell></row><row><cell>F-score-micro</cell><cell>0.1735</cell><cell>0.1771</cell><cell>0.0573</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>We thank Shell Xu Hu for helpful discussions.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A Universal Learnable Audio Frontend</title>
		<author>
			<persName><surname>Anonymous</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=jM76BCb6F9m" />
	</analytic>
	<monogr>
		<title level="m">Submitted to International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations</title>
		<author>
			<persName><forename type="first">Alexei</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Henry</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abdelrahman</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Auli</surname></persName>
		</author>
		<idno>arXiv:cs.CL/2006.11477</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Emotion and Theme Recognition in Music Using Jamendo</title>
		<author>
			<persName><forename type="first">Dmitry</forename><surname>Bogdanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alastair</forename><surname>Porter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><surname>Tovstogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minz</forename><surname>Won</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2020 Workshop</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The MTG-Jamendo Dataset for Automatic Music Tagging</title>
		<author>
			<persName><forename type="first">Dmitry</forename><surname>Bogdanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minz</forename><surname>Won</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><surname>Tovstogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alastair</forename><surname>Porter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xavier</forename><surname>Serra</surname></persName>
		</author>
		<ptr target="http://hdl.handle.net/10230/42015" />
	</analytic>
	<monogr>
		<title level="m">Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019)</title>
				<meeting><address><addrLine>Long Beach, CA, United States</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation</title>
		<author>
			<persName><forename type="first">Jingjing</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qirong</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dong</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2007.13975</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Automatic tagging using deep convolutional neural networks</title>
		<author>
			<persName><forename type="first">Keunwoo</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">George</forename><surname>Fazekas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Sandler</surname></persName>
		</author>
		<idno>arXiv:cs.SD/1606.00298</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Music induces universal emotion-related psychophysiological responses: comparing Canadian listeners to Congolese Pygmies</title>
		<author>
			<persName><forename type="first">Hauke</forename><surname>Egermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nathalie</forename><surname>Fernando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lorraine</forename><surname>Chuen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Mcadams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Frontiers in psychology</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page">1341</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Universal recognition of three basic emotions in music</title>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Fritz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Jentschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nathalie</forename><surname>Gosselin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniela</forename><surname>Sammler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Isabelle</forename><surname>Peretz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Turner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Angela</forename><forename type="middle">D</forename><surname>Friederici</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Koelsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Current biology</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="573" to="576" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Conformer: Convolution-augmented Transformer for Speech Recognition</title>
		<author>
			<persName><forename type="first">Anmol</forename><surname>Gulati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chung-Cheng</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Niki</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yu</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jiahui</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shibo</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhengdong</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yonghui</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruoming</forename><surname>Pang</surname></persName>
		</author>
		<idno>arXiv:eess.AS/2005.08100</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Deep Residual Learning for Image Recognition</title>
		<author>
			<persName><forename type="first">Kaiming</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiangyu</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shaoqing</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jian</forename><surname>Sun</surname></persName>
		</author>
		<idno>arXiv:cs.CV/1512.03385</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</title>
		<author>
			<persName><forename type="first">Sergey</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christian</forename><surname>Szegedy</surname></persName>
		</author>
		<idno>arXiv:cs.LG/1502.03167</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Role of tempo entrainment in psychophysiological differentiation of happy and sad music?</title>
		<author>
			<persName><forename type="first">Stéphanie</forename><surname>Khalfa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mathieu</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pierre</forename><surname>Rainville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Simone</forename><surname>Dalla Bella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Isabelle</forename><surname>Peretz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Psychophysiology</title>
		<imprint>
			<biblScope unit="volume">68</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="17" to="26" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">Diederik</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jimmy</forename><surname>Ba</surname></persName>
		</author>
		<idno>arXiv:cs.LG/1412.6980</idno>
		<title level="m">Adam: A Method for Stochastic Optimization</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Imagenet classification with deep convolutional neural networks</title>
		<author>
			<persName><forename type="first">Alex</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="84" to="90" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Temporal convolutional networks: A unified approach to action segmentation</title>
		<author>
			<persName><forename type="first">Colin</forename><surname>Lea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rene</forename><surname>Vidal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Austin</forename><surname>Reiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gregory</forename><forename type="middle">D</forename><surname>Hager</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Computer Vision</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="47" to="54" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">CNN based music emotion classification</title>
		<author>
			<persName><forename type="first">Xin</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qingcai</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiangping</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yan</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yang</forename><surname>Liu</surname></persName>
		</author>
		<idno>arXiv:cs.MM/1704.05665</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation</title>
		<author>
			<persName><forename type="first">Yi</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhuo</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Takuya</forename><surname>Yoshioka</surname></persName>
		</author>
		<idno>arXiv:eess.AS/1910.06379</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">VoxCeleb: A Large-Scale Speaker Identification Dataset</title>
		<author>
			<persName><forename type="first">Arsha</forename><surname>Nagrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joon</forename><forename type="middle">Son</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="DOI">10.21437/interspeech.2017-950</idno>
		<ptr target="https://doi.org/10.21437/interspeech.2017-950" />
	</analytic>
	<monogr>
		<title level="j">Interspeech</title>
		<imprint>
			<date type="published" when="2017-08">Aug 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition</title>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">S</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">William</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yu</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chung-Cheng</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barret</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ekin</forename><forename type="middle">D</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="DOI">10.21437/interspeech.2019-2680</idno>
		<ptr target="https://doi.org/10.21437/interspeech.2019-2680" />
	</analytic>
	<monogr>
		<title level="j">Interspeech</title>
		<imprint>
			<date type="published" when="2019-09">Sep 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Music and its inductive power: a psychobiological and evolutionary approach to musical emotions</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Reybrouck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tuomas</forename><surname>Eerola</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Frontiers in Psychology</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page">494</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">The psychological functions of music listening</title>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><surname>Sedlmeier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christine</forename><surname>Städtler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Huron</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Frontiers in psychology</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page">511</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">When you&apos;re down and troubled: Views on the regulatory power of music</title>
		<author>
			<persName><forename type="first">Roni</forename><surname>Shifriss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ehud</forename><surname>Bodner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuval</forename><surname>Palgi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychology of Music</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="793" to="807" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Psychological perspectives on music and emotion</title>
		<author>
			<persName><forename type="first">John</forename><forename type="middle">A</forename><surname>Sloboda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Patrik</forename><forename type="middle">N</forename><surname>Juslin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Music and emotion: Theory and research</title>
				<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="71" to="104" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</title>
		<author>
			<persName><forename type="first">Nitish</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruslan</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v15/srivastava14a.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1929" to="1958" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">Manoj</forename><surname>Sukhavasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sainath</forename><surname>Adapa</surname></persName>
		</author>
		<idno>arXiv:cs.SD/1911.07041</idno>
		<title level="m">Music theme recognition using CNN and self-attention</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">Aaron</forename><surname>Van Den Oord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sander</forename><surname>Dieleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Heiga</forename><surname>Zen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nal</forename><surname>Kalchbrenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Senior</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Koray</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<idno>arXiv:cs.SD/1609.03499</idno>
		<title level="m">WaveNet: A Generative Model for Raw Audio</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Emotion induction through music: A review of the musical mood induction procedure</title>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Västfjäll</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Musicae Scientiae</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="173" to="211" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">Attention Is All You Need</title>
		<author>
			<persName><forename type="first">Ashish</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noam</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Niki</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jakob</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Llion</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aidan</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lukasz</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Illia</forename><surname>Polosukhin</surname></persName>
		</author>
		<idno>arXiv:cs.CL/1706.03762</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">Toward interpretable music tagging with self-attention</title>
		<author>
			<persName><forename type="first">Minz</forename><surname>Won</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanghyuk</forename><surname>Chun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xavier</forename><surname>Serra</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.04972</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">CNN-LSTM models for Multi-Speaker Source Separation using Bayesian Hyper Parameter Optimization</title>
		<author>
			<persName><forename type="first">Jeroen</forename><surname>Zegers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hugo</forename><surname>Van Hamme</surname></persName>
		</author>
		<idno>arXiv:cs.LG/1912.09254</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">mixup: Beyond Empirical Risk Minimization</title>
		<author>
			<persName><forename type="first">Hongyi</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Moustapha</forename><surname>Cisse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yann</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Lopez-Paz</surname></persName>
		</author>
		<idno>arXiv:cs.LG/1710.09412</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition</title>
		<author>
			<persName><forename type="first">Yu</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">S</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chung-Cheng</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruoming</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yonghui</forename><surname>Wu</surname></persName>
		</author>
		<idno>arXiv:eess.AS/2010.10504</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
