<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention and LSTM Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Ricardo</forename><surname>Kleinlein</surname></persName>
							<email>ricardo.kleinlein@upm.es</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Center for Information Processing and Telecommunications</orgName>
								<orgName type="department" key="dep2">E.T.S.I. de Telecomunicación</orgName>
								<orgName type="laboratory">Speech Technology Group</orgName>
								<orgName type="institution">Universidad Politécnica de Madrid</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Cristina</forename><surname>Luna-Jiménez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Center for Information Processing and Telecommunications</orgName>
								<orgName type="department" key="dep2">E.T.S.I. de Telecomunicación</orgName>
								<orgName type="laboratory">Speech Technology Group</orgName>
								<orgName type="institution">Universidad Politécnica de Madrid</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Zoraida</forename><surname>Callejas</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Languages and Computer Systems</orgName>
								<orgName type="institution">University of Granada</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fernando</forename><surname>Fernández-Martínez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Center for Information Processing and Telecommunications</orgName>
								<orgName type="department" key="dep2">E.T.S.I. de Telecomunicación</orgName>
								<orgName type="laboratory">Speech Technology Group</orgName>
								<orgName type="institution">Universidad Politécnica de Madrid</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention and LSTM Models</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">46DBD2821D88EB52C3180D857D58714A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper reports on the GTH-UPM team experience in the Predicting Media Memorability task at MediaEval 2020. Teams were asked to predict both short-term and long-term memorability scores, understood as a measure of how likely a video is to endure in a viewer's memory. Our proposed system relies on a late fusion of the scores predicted by three sequential models, each trained on a different modality: video captions, aural embeddings and visual optical flow-based vectors. Whereas the single-modality models show a low or zero Spearman correlation coefficient, their combination considerably boosts performance over the development data, up to 0.2 in the short-term memorability prediction subtask and 0.19 in the long-term subtask. However, performance over test data drops to 0.016 and -0.041, respectively.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The improvement in computational capabilities is progressively allowing researchers to tackle problems long thought to be out of reach due to the subjective nature of the phenomena involved. One good instance is memorability prediction. The seminal work of Isola et al. set the ground for later work on the computational modelling of image memorability <ref type="bibr" target="#b11">[11]</ref>. Since 2018 the Predicting Media Memorability Challenge, hosted within the MediaEval workshop, has extended the original problem to encompass memorability prediction over multimedia sources of information <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>. In the current edition the goal of the task remains the same as in previous years, yet the video clips now resemble the kind of short videos commonly found in social media. Further information can be found in the challenge description paper <ref type="bibr" target="#b7">[7]</ref>.</p><p>Several multimodal late fusion strategies have been proposed for the image and video memorability prediction problem <ref type="bibr" target="#b4">[5]</ref>. Additionally, attention mechanisms have been successfully applied to problems in which data come naturally in sequential form <ref type="bibr" target="#b16">[16]</ref>. In particular, self-attention layers have been shown to boost performance in the computational modelling of media memorability <ref type="bibr" target="#b6">[6]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">APPROACH AND EXPERIMENTS</head><p>Every video sample in the dataset presents the following sources of information: between 2 and 5 text captions that roughly describe the content of the video, the video's audio signal and its visual frames. As stated before, multimodal systems are able to learn modality-wise data representations and combine their predictive power in order to make a final, unique memorability prediction. We hypothesize that a late fusion scheme will benefit from incorporating a self-attention mechanism that learns to focus on what is particularly relevant to a given sample's prediction.</p><p>We propose a system based on the late fusion, by a Support Vector Regressor (SVR), of the predictions made by three single-modality models whose architecture is depicted in Figure <ref type="figure" target="#fig_0">2</ref>. In all cases the biLSTM encoders have 75 units, with all the learners sharing the same architecture but trained independently. The prediction is the output of the final sigmoid layer. A dropout rate of 0.3 is applied to the learned layers. The training pipeline is the same for every single-modality learner: the batch size is set to 128, with an initial learning rate of 0.001 and the Adam optimizer <ref type="bibr" target="#b12">[12]</ref>. Figure <ref type="figure">1</ref> shows the general prediction pipeline built from these models. Results reported in this paper are obtained following a 5-fold cross-validation procedure over the 1000 videos of the development data. Training is stopped after 5 epochs with no improvement in the Spearman correlation coefficient, computed over the fold's validation data. Experimental results are summarized in Table <ref type="table" target="#tab_0">1</ref>. Next we describe in greater detail the feature extraction carried out for every modality.</p></div>
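The single-modality learner described above can be sketched as follows. This is a minimal PyTorch rendering of the stated hyper-parameters (a 75-unit biLSTM encoder, dropout of 0.3, a final sigmoid, Adam with learning rate 0.001, batch size 128); the layer names and the input feature dimensionality are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class SingleModalityLearner(nn.Module):
    """Sketch of one modality-wise learner: biLSTM encoder (75 units per
    direction), dropout 0.3, and a sigmoid output in [0, 1]."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, 75, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.3)
        self.head = nn.Linear(2 * 75, 1)  # concatenated forward/backward states

    def forward(self, x):                 # x: (batch, time, feat_dim)
        _, (h_n, _) = self.encoder(x)     # h_n: (2, batch, 75)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)
        return torch.sigmoid(self.head(self.dropout(h))).squeeze(-1)

model = SingleModalityLearner(feat_dim=128)   # e.g. 128-d audio embeddings
optim = torch.optim.Adam(model.parameters(), lr=0.001)
scores = model(torch.randn(128, 10, 128))     # one batch of 128 sequences
```

The same architecture would be instantiated three times, once per modality, each with its own `feat_dim`.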
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Text captions</head><p>We merge all the captions of a sample into a single one in a Bag-of-Words fashion. Afterwards, we extract the lemma of every word in the text using NLTK's WordNet-based lemmatizer <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b14">14]</ref>. Finally, the input to the text modality consists of the sequence of 300-dimensional fastText word embeddings corresponding to every word in the sample's BOW text <ref type="bibr" target="#b1">[2]</ref>. At training time, random noise with 𝜇 = 0 and 𝜎 = 0.15 is added to the input embeddings in order to improve learning robustness.</p></div>
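A toy sketch of the caption pipeline above. The fastText vectors and the NLTK lemmatizer are replaced by a hypothetical lookup table so the snippet stays self-contained; only the 300-dimensional embedding size and the noise parameters (μ = 0, σ = 0.15) follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the fastText 300-d embedding table.
EMB = {w: rng.standard_normal(300) for w in ["dog", "run", "park"]}

def caption_features(captions, train=True):
    bow = " ".join(captions).split()                   # merge captions, BOW style
    seq = np.stack([EMB[w] for w in bow if w in EMB])  # (seq_len, 300)
    if train:                                          # noise for robustness
        seq = seq + rng.normal(0.0, 0.15, size=seq.shape)
    return seq

feats = caption_features(["dog run park", "dog run"])  # (5, 300) sequence
```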
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Audio signal</head><p>Based on previous experience, we hypothesize that event detection-oriented embeddings provide a robust basis to study multimedia perceptual variables such as attention or memorability <ref type="bibr" target="#b13">[13]</ref>. Therefore we compute aural embeddings using the default VGGish configuration, which is pretrained on AudioSet, a large audio event-detection database <ref type="bibr" target="#b8">[8,</ref><ref type="bibr" target="#b9">9]</ref>. In this way, every video's audio signal is defined by a sequence of 128-dimensional embeddings, each spanning 960 ms of audio with no overlap between them.</p></div>
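The windowing implied above can be sketched as follows. The VGGish network itself is stubbed with a random projection, so only the non-overlapping 960 ms windows and the 128-dimensional output follow the text; the 16 kHz sample rate is an assumption based on VGGish's standard input.

```python
import numpy as np

SR = 16_000               # assumed VGGish input sample rate
WIN = int(0.96 * SR)      # 960 ms -> 15360 samples per window

def audio_embeddings(waveform: np.ndarray) -> np.ndarray:
    """One 128-d vector per non-overlapping 960 ms window (VGGish stubbed)."""
    n = len(waveform) // WIN             # drop the trailing partial window
    frames = waveform[: n * WIN].reshape(n, WIN)
    proj = np.random.default_rng(0).standard_normal((WIN, 128))
    return frames @ proj                 # (n, 128), one row per window

emb = audio_embeddings(np.zeros(SR * 5))   # a 5 s clip -> 5 windows
```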
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Video image</head><p>Videos in the dataset are no longer than a few seconds, characterized by an event that happens quickly and constitutes the most relevant part of the clip. Because of that, videos are expected to display rapid changes in pixel values between consecutive frames as visual events take place. In order to capture the degree of visual change along a clip, we compute optical flow feature maps for its frames, extracted at 3 FPS, using a LiteFlowNet model <ref type="bibr" target="#b10">[10]</ref>. We further reduce the dimensionality of the optical flow features by projecting them onto a 128-dimensional subspace computed by PCA <ref type="bibr" target="#b15">[15]</ref>. A sample is thus represented by a temporally-ordered sequence of 128-dimensional features that retains most of the information in the optical flow feature maps.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Proposed video memorability prediction pipeline. The system is the same when dealing with both short- and long-term memorability scores, but single-modality learners are trained independently for every time interval and modality.</figDesc></figure>
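The dimensionality reduction step can be sketched with a plain SVD-based PCA. The flow-map size is an arbitrary placeholder; only the 128-dimensional target subspace comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder: 600 frames of flattened (u, v) flow maps of size 32x32.
flows = rng.standard_normal((600, 2 * 32 * 32))

# PCA via SVD: center, take the top 128 principal directions, project.
mean = flows.mean(axis=0)
_, _, vt = np.linalg.svd(flows - mean, full_matrices=False)
components = vt[:128]                    # (128, feature_dim)

reduced = (flows - mean) @ components.T  # (600, 128) per-frame sequence
```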
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Ensemble of modality-wise models</head><p>We independently train single-modality models on the features described in the sections above. Thereafter, a memorability prediction is computed for every sample in the dataset. The combination of the three memorability scores is the input to an SVR that makes a final prediction reflecting the knowledge extracted from the different modalities. As can be seen from Table <ref type="table" target="#tab_0">1</ref>, the individual learners are not able to fully characterize a video sample and learn its relationship with the memorability score. However, the ensemble of the three achieves a Spearman correlation coefficient of 0.2 in the short-term problem and 0.19 in the long-term one over development data. Nevertheless, we notice that performance on the test data drops significantly, with much lower scores on both subtasks.</p></div>
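The fusion step above can be sketched with scikit-learn's SVR. The per-modality scores and targets here are synthetic, and the default RBF kernel is an assumption, since the text does not specify the kernel used.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-ins: one score per modality (captions, audio, flow)
# for each of 1000 samples, plus placeholder ground-truth targets.
modality_scores = rng.uniform(0, 1, size=(1000, 3))
target = modality_scores.mean(axis=1)

svr = SVR()                              # default RBF kernel (assumption)
svr.fit(modality_scores, target)         # late fusion over the 3-d score vectors
fused = svr.predict(modality_scores)     # final memorability predictions
```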
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">DISCUSSION AND OUTLOOK</head><p>Despite the individual learners showing very low or even zero coefficient values, an SVR based on their posteriors seems to weakly capture the relationship between media content and its memorability score, with similar correlation values obtained in both the short-term and long-term subtasks. This might be partially caused by the limited amount of data available, which is likely hampering the learning process and causing the SVR to learn the development dataset's score distribution. The distribution of the predictions suggests that the system might be learning to approximate every sample by the mean memorability score, rather than exploiting the knowledge extracted from the computed features. Future work includes extending the amount of training data with similar datasets. It is also left for future studies to explore different data encodings, with special emphasis on smaller, more compact data representations that might be better suited for cases where large datasets are not available.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: Architecture of the single-modality learners.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1:</head><label>1</label><figDesc>Spearman correlation coefficient scores computed for every validation fold in the dataset, as well as the overall average and official test results. Columns 1-5 and AVG report Spearman scores over the development set folds; the Spearman, Pearson and MSE columns report official Test Set results. Both short- and long-term scores are shown for every predictive model studied.</figDesc><table><row><cell>Time range</cell><cell>Model</cell><cell>1</cell><cell>2</cell><cell>3</cell><cell>4</cell><cell>5</cell><cell>AVG</cell><cell>Spearman</cell><cell>Pearson</cell><cell>MSE</cell></row><row><cell rows="4">Short-term</cell><cell>Word2Vec Captions</cell><cell>0.00</cell><cell>0.05</cell><cell>0.13</cell><cell>-0.03</cell><cell>-0.06</cell><cell>0.02</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>Audioset embeddings</cell><cell>-0.06</cell><cell>-0.04</cell><cell>0.07</cell><cell>0.02</cell><cell>0.01</cell><cell>0.00</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>Optical Flow + PCA(128)</cell><cell>0.11</cell><cell>0.01</cell><cell>0.07</cell><cell>-0.10</cell><cell>0.08</cell><cell>0.03</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>Prediction ensemble + SVR</cell><cell>0.22</cell><cell>0.20</cell><cell>0.20</cell><cell>0.23</cell><cell>0.17</cell><cell>0.20</cell><cell>0.016</cell><cell>0.011</cell><cell>0.01</cell></row><row><cell rows="4">Long-term</cell><cell>Word2Vec Captions</cell><cell>0.08</cell><cell>0.06</cell><cell>0.06</cell><cell>0.12</cell><cell>0.13</cell><cell>0.09</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>Audioset embeddings</cell><cell>0.07</cell><cell>0.05</cell><cell>-0.10</cell><cell>0.12</cell><cell>0.17</cell><cell>0.06</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>Optical Flow + PCA(128)</cell><cell>-0.02</cell><cell>0.13</cell><cell>-0.05</cell><cell>0.10</cell><cell>0.19</cell><cell>0.07</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>Prediction ensemble + SVR</cell><cell>0.19</cell><cell>0.19</cell><cell>0.19</cell><cell>0.23</cell><cell>0.18</cell><cell>0.19</cell><cell>-0.041</cell><cell>-0.028</cell><cell>0.05</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>The work leading to these results has been supported by the Spanish Ministry of Economy, Industry and Competitiveness through CAVIAR (MINECO, TEC2017-84593-C2-1-R) and AMIC (MINECO, TIN2017-85854-C4-4-R) projects (AEI/FEDER, UE). Ricardo Kleinlein's research was supported by the Spanish Ministry of Education (FPI grant PRE2018-083225).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Natural Language Processing with Python</title>
		<author>
			<persName><forename type="first">Steven</forename><surname>Bird</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ewan</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edward</forename><surname>Loper</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>O&apos;Reilly Media</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">Piotr</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edouard</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Armand</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1607.04606</idno>
		<title level="m">Enriching Word Vectors with Subword Information</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<author>
			<persName><forename type="first">Romain</forename><surname>Cohendet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claire-Hélène</forename><surname>Demarty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ngoc</forename><surname>Duong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mats</forename><surname>Sjöberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bogdan</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thanh-Toan</forename><surname>Do</surname></persName>
		</author>
		<idno>arXiv:cs.CV/1807.01052</idno>
	</analytic>
	<monogr>
		<title level="m">Predicting Media Memorability Task</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">The Predicting Media Memorability Task at MediaEval</title>
		<author>
			<persName><forename type="first">Mihai-Gabriel</forename><surname>Constantin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bogdan</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claire-Hélène</forename><surname>Demarty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ngoc</forename><surname>Duong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xavier</forename><surname>Alameda-Pineda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mats</forename><surname>Sjöberg</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Using Aesthetics and Action Recognition-Based Networks for the Prediction of Media Memorability</title>
		<author>
			<persName><forename type="first">Mihai</forename><forename type="middle">Gabriel</forename><surname>Constantin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chen</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gabriela</forename><surname>Dinu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frédéric</forename><surname>Dufaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giuseppe</forename><surname>Valenzise</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bogdan</forename><surname>Ionescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="http://ceur-ws.org/Vol-2670/MediaEval_19_paper_60.pdf" />
		<title level="m">CEUR Workshop Proceedings</title>
				<editor>
			<persName><forename type="first">Martha</forename><forename type="middle">A</forename><surname>Larson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Steven</forename><forename type="middle">Alexander</forename><surname>Hicks</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Mihai</forename><surname>Gabriel Constantin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Benjamin</forename><surname>Bischke</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Alastair</forename><surname>Porter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Peijian</forename><surname>Zhao</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Mathias</forename><surname>Lux</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Laura</forename><surname>Cabrera Quiros</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Jordan</forename><surname>Calandre</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Gareth</forename><surname>Jones</surname></persName>
		</editor>
		<meeting><address><addrLine>Sophia Antipolis, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-10-30">27-30 October 2019</date>
			<biblScope unit="volume">2670</biblScope>
		</imprint>
	</monogr>
	<note>Workshop</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Jiri</forename><surname>Fajtl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vasileios</forename><surname>Argyriou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dorothy</forename><surname>Monekosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Remagnino</surname></persName>
		</author>
		<idno>arXiv:cs.AI/1804.03115</idno>
		<title level="m">AMNet: Memorability Estimation with Attention</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a Video Memorable?</title>
		<author>
			<persName><forename type="first">Alba</forename><surname>García Seco de Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rukiye</forename><surname>Savran Kiziltepe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jon</forename><surname>Chamberlain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mihai</forename><surname>Gabriel Constantin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claire-Hélène</forename><surname>Demarty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Faiyaz</forename><surname>Doctor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bogdan</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alan</forename><forename type="middle">F</forename><surname>Smeaton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2020 Workshop</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Audio Set: An ontology and human-labeled dataset for audio events</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Gemmeke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P W</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Freedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Plakal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ritter</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICASSP.2017.7952261</idno>
		<ptr target="https://doi.org/10.1109/ICASSP.2017.7952261" />
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="776" to="780" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">Shawn</forename><surname>Hershey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sourish</forename><surname>Chaudhuri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">P W</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jort</forename><forename type="middle">F</forename><surname>Gemmeke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aren</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">Channing</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manoj</forename><surname>Plakal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Devin</forename><surname>Platt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rif</forename><forename type="middle">A</forename><surname>Saurous</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bryan</forename><surname>Seybold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Malcolm</forename><surname>Slaney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ron</forename><forename type="middle">J</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kevin</forename><surname>Wilson</surname></persName>
		</author>
		<idno>arXiv:cs.SD/1609.09430</idno>
		<title level="m">CNN Architectures for Large-Scale Audio Classification</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation</title>
		<author>
			<persName><forename type="first">Tak-Wai</forename><surname>Hui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaoou</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chen</forename><forename type="middle">Change</forename><surname>Loy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">What makes a photograph memorable?</title>
		<author>
			<persName><forename type="first">Phillip</forename><surname>Isola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jianxiong</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Devi</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antonio</forename><surname>Torralba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aude</forename><surname>Oliva</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="1469" to="1482" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">Diederik</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jimmy</forename><surname>Ba</surname></persName>
		</author>
		<idno>arXiv:cs.LG/1412.6980</idno>
		<title level="m">Adam: A Method for Stochastic Optimization</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Predicting Group-Level Skin Attention to Short Movies from Audio-Based LSTM-Mixture of Experts Models</title>
		<author>
			<persName><forename type="first">Ricardo</forename><surname>Kleinlein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cristina</forename><forename type="middle">Luna</forename><surname>Jiménez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Juan</forename><forename type="middle">Manuel</forename><surname>Montero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zoraida</forename><surname>Callejas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fernando</forename><surname>Fernández-Martínez</surname></persName>
		</author>
		<idno type="DOI">10.21437/Interspeech.2019-2799</idno>
		<ptr target="https://doi.org/10.21437/Interspeech.2019-2799" />
	</analytic>
	<monogr>
		<title level="m">Proc. Interspeech</title>
				<meeting>Interspeech</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="61" to="65" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">WordNet: A Lexical Database for English</title>
		<author>
			<persName><forename type="first">George</forename><forename type="middle">A</forename><surname>Miller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">COMMUNICATIONS OF THE ACM</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="39" to="41" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">LIII. On lines and planes of closest fit to systems of points in space</title>
		<author>
			<persName><forename type="first">Karl</forename><surname>Pearson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="559" to="572" />
			<date type="published" when="1901">1901</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Attention Is All You Need</title>
		<author>
			<persName><forename type="first">Ashish</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noam</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Niki</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jakob</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Llion</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aidan</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lukasz</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Illia</forename><surname>Polosukhin</surname></persName>
		</author>
		<idno>arXiv:cs.CL/1706.03762</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
