<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Predicting Media Memorability with Audio, Video, and Text representations</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alison</forename><surname>Reboud</surname></persName>
							<email>alison.reboud@eurecom.fr</email>
							<affiliation key="aff0">
								<orgName type="institution">EURECOM</orgName>
								<address>
									<settlement>Sophia Antipolis</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ismail</forename><surname>Harrando</surname></persName>
							<email>ismail.harrando@eurecom.fr</email>
							<affiliation key="aff0">
								<orgName type="institution">EURECOM</orgName>
								<address>
									<settlement>Sophia Antipolis</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jorma</forename><surname>Laaksonen</surname></persName>
							<email>jorma.laaksonen@aalto.fi</email>
							<affiliation key="aff1">
								<orgName type="institution">Aalto University</orgName>
								<address>
									<settlement>Espoo</settlement>
									<country key="FI">Finland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Raphaël</forename><surname>Troncy</surname></persName>
							<email>raphael.troncy@eurecom.fr</email>
							<affiliation key="aff0">
								<orgName type="institution">EURECOM</orgName>
								<address>
									<settlement>Sophia Antipolis</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Predicting Media Memorability with Audio, Video, and Text representations</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B8965D8FFB933E52CFDB1C6213DC4A32</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes a multimodal approach proposed by the MeMAD team for the MediaEval 2020 "Predicting Media Memorability" task. Our best approach is a weighted average method combining predictions made separately from visual, audio, textual and visiolinguistic representations of the videos. Our best model achieves Spearman scores of 0.101 and 0.078 for the short term and long term prediction tasks, respectively.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Considering video memorability as a useful tool for digital content retrieval as well as for sorting and recommending an ever-growing number of videos, the Predicting Media Memorability task aims at fostering research in the field by asking its participants to automatically predict both a short term and a long term memorability score for a given set of annotated videos. The full description of this task is provided in <ref type="bibr" target="#b4">[5]</ref>. Last year's best approaches for both the long term <ref type="bibr" target="#b9">[10]</ref> and short term <ref type="bibr" target="#b1">[2]</ref> tasks rely on multimodal features. Our method is inspired by last year's best approaches but also acknowledges the specifics of the 2020 edition's dataset. More specifically, because the TRECVid videos contain more actions than last year's set of videos, our model uses video features as well as image features extracted from multiple frames. In addition, because sound was included in the videos this year, our model also includes audio features. Finally, a key contribution of our approach is to test the relevance of visiolinguistic representations for the Media Memorability task. Our final model<ref type="foot" target="#foot_0">1</ref> is a multimodal weighted average combining visual and audio deep features extracted from the videos, textual features from the provided captions, and visiolinguistic features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">APPROACH</head><p>We trained separate models for the short and long term predictions, originally using a 6-fold cross-validation of the training set, which means that we typically had 492 samples for training and 98 samples for testing each model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Audio-Visual Approach</head><p>Our audio-visual memorability prediction scores are produced by a feed-forward neural network that takes a concatenation of video and audio features as input, has one hidden layer, and has a single unit in the output layer. The best performance was obtained with 2575-dimensional features consisting of the concatenation of 2048-dimensional I3D <ref type="bibr" target="#b2">[3]</ref> video features and 527-dimensional audio features. Our audio features encode the occurrence probabilities of the 527 classes of the Google AudioSet Ontology <ref type="bibr" target="#b5">[6]</ref> in each video clip. The hidden layer uses ReLU activations and dropout during the training phase, while the output unit is sigmoidal. The network was trained with the Adam optimizer. The features, the number of training epochs and the number of units in the hidden layer were selected with the 6-fold cross-validation. For short term memorability prediction, the optimal number of epochs was 750 and the optimal hidden layer size was 80 units, whereas for long term prediction these figures were 260 and 160, respectively.</p><p>We also experimented with other types of features and their combinations. These include the ResNet <ref type="bibr" target="#b6">[7]</ref> features extracted only from the middle frames of the clips, as this approach worked very well last year. The contents of this year's videos are, however, such that genuine video features, I3D and C3D <ref type="bibr" target="#b12">[13]</ref>, work better than still image features. When I3D and AudioSet features are used, C3D features do not bring any additional advantage.</p></div>
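<div xmlns="http://www.tei-c.org/ns/1.0"><p>For clarity, a minimal sketch of this audio-visual regressor is given below, assuming a PyTorch implementation. The input and hidden layer sizes follow the short term configuration described above, while the dropout rate, learning rate and mean-squared-error loss are illustrative choices that the description above does not fix.</p><p><code>
# Sketch of the audio-visual regressor (assumed PyTorch implementation).
# 2048-d I3D video features + 527-d AudioSet probabilities = 2575-d input;
# hidden size (80) and epochs (750) are the short term values reported above.
import torch
import torch.nn as nn

class AVRegressor(nn.Module):
    def __init__(self, in_dim=2048 + 527, hidden=80, dropout=0.5):  # dropout rate assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),   # active only during training
            nn.Linear(hidden, 1),
            nn.Sigmoid(),          # memorability scores lie in [0, 1]
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train(model, features, targets, epochs=750, lr=1e-3):  # lr and loss are assumptions
    # features: (N, 2575) tensor of concatenated I3D + AudioSet features
    # targets:  (N,) tensor of short term memorability scores
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), targets)
        loss.backward()
        opt.step()
    return model
</code></p></div>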
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Textual Approach</head><p>Our textual approach leverages the video descriptions provided by the organizers. First, all the provided descriptions are concatenated by video identifier to get one string per video. To generate the textual representation of the video content, we used the following methods:</p><p>• Computing TF-IDF, removing stopwords and rare words (fewer than 4 occurrences) and accounting for frequent 2-grams. • Averaging GloVe embeddings over all non-stopword words, using the pre-trained 300d version <ref type="bibr" target="#b8">[9]</ref>. • Averaging BERT <ref type="bibr" target="#b3">[4]</ref> token representations (keeping all the words in the descriptions, up to 250 words per sentence). • Using Sentence-BERT <ref type="bibr" target="#b10">[11]</ref> sentence representations; we use the distilled version that is fine-tuned for the STS (Semantic Textual Similarity) benchmark (https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens). For each representation, we experimented with multiple regression models and fine-tuned the hyper-parameters of each model using the 6-fold cross-validation on the training set. For our submission, we used the averaged GloVe embeddings with a Support Vector Regressor with an RBF kernel and a regularization parameter C = 1e-5.</p><p>We also attempted to enhance the provided descriptions with additional captions automatically generated using the DeepCaption<ref type="foot" target="#foot_1">3</ref> software. We did not see an improvement in the results, which is probably due to the nature of the clips provided for this year's edition (DeepCaption is trained on static stock images from the MS COCO and TGIF datasets).</p></div>
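<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the submitted textual pipeline follows: GloVe vectors are averaged over the non-stopword tokens of each video's concatenated descriptions and fed to a Support Vector Regressor with an RBF kernel and C = 1e-5. The tokenisation and the way the pre-trained embeddings and stopword list are loaded are illustrative assumptions.</p><p><code>
# Sketch of the textual approach: averaged 300-d GloVe embeddings + SVR (RBF, C = 1e-5).
import numpy as np
from sklearn.svm import SVR

def embed_caption(text, glove, stopwords, dim=300):
    # glove: dict mapping a token to its 300-d vector; stopwords: set of tokens to ignore
    vectors = [glove[w] for w in text.lower().split()
               if w in glove and w not in stopwords]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def fit_text_model(captions, scores, glove, stopwords):
    # captions: one concatenated description string per video; scores: aligned targets
    X = np.stack([embed_caption(c, glove, stopwords) for c in captions])
    return SVR(kernel="rbf", C=1e-5).fit(X, scores)
</code></p></div>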
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Visiolinguistic Approach</head><p>ViLBERT <ref type="bibr" target="#b7">[8]</ref> is a task-agnostic extension of BERT that aims to learn the associations and links between the visual and linguistic properties of a concept. It has a two-stream architecture, first modelling each modality (i.e. visual and textual) separately, and then fusing them through a set of attention-based interactions (co-attention). ViLBERT is pre-trained on the Conceptual Captions dataset (3.3M image-caption pairs) <ref type="bibr" target="#b11">[12]</ref> with masked multi-modal learning and multi-modal alignment prediction objectives. We used a frozen pre-trained model which was fine-tuned twice, first on the task of Visual Question Answering (VQA) <ref type="bibr" target="#b0">[1]</ref> and then on the 2019 MediaEval Memorability task and dataset.</p><p>The 1024-dimensional features extracted for the two modalities can be combined in different ways. In our experiments, multiplying the textual and visual feature vectors performed best for short term memorability prediction, whereas using the visual feature vectors alone worked better for long term memorability prediction. Averaging the features extracted from 6 frames performed better than using only the middle frame. We experimented with the same set of regression models as for the textual approach. In our submission, we used a Support Vector Regressor with a regularization parameter C = 1e-5 and an RBF or polynomial kernel for short and long term score prediction, respectively.</p></div>
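<div xmlns="http://www.tei-c.org/ns/1.0"><p>The fusion step can be summarised with the sketch below; ViLBERT feature extraction itself is not shown, and the per-frame 1024-dimensional visual and textual vectors are assumed to be pre-computed. The element-wise product (short term) versus visual-only (long term) choice, the averaging over 6 frames, and the SVR kernels and C value follow the submitted configuration.</p><p><code>
# Sketch of the visiolinguistic fusion and regression described above.
import numpy as np
from sklearn.svm import SVR

def fuse(visual_frames, textual_frames, mode="short"):
    # visual_frames, textual_frames: arrays of shape (6, 1024), one row per sampled frame
    visual = visual_frames.mean(axis=0)    # averaging over 6 frames worked best
    textual = textual_frames.mean(axis=0)
    return visual * textual if mode == "short" else visual

def fit_vilbert_model(visual_list, textual_list, scores, mode="short"):
    # visual_list, textual_list: per-video frame feature arrays; scores: aligned targets
    X = np.stack([fuse(v, t, mode) for v, t in zip(visual_list, textual_list)])
    kernel = "rbf" if mode == "short" else "poly"
    return SVR(kernel=kernel, C=1e-5).fit(X, scores)
</code></p></div>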
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RESULTS AND ANALYSIS</head><p>We prepared 5 different runs following the task description, defined as follows:</p><formula xml:id="formula_0">• run1 = Audio-Visual Score • run2 = Visiolinguistic Score • run3 = Textual Score • run4 = 0.5 * run1 + 0.2 * run2 + 0.3 * run3 • run5 = run4 with LT scores for LT task</formula><p>For the Long Term task, all models except run5 use exclusively short term scores. For runs 4 and 5, we normalise the scores obtained from runs 1, 2 and 3 before combining them.</p><p>Table <ref type="table" target="#tab_0">1</ref> provides the Spearman score obtained for each run when performing a 6-fold cross-validation on the training set. Note that our models use only the training set, as the annotations of the later-provided development set did not yield better results. We hypothesize that this is due to the smaller number of annotations available per video: many videos had a score of 1, for instance, which we do not observe on the training set. We present in Table <ref type="table" target="#tab_1">2</ref> the final results obtained on the test set using models trained on the full training set composed of 590 videos. We observe that the weighted average method which uses short term scores works best for both short and long term prediction, obtaining results which are approximately double the mean Spearman score obtained across the teams. Our best results (Spearman scores) on the test set are, however, significantly worse than the ones we obtained on average over the 6 folds of the training set, suggesting that the test set is quite different from the training set. The results for Long Term prediction are always worse than the ones for Short Term prediction. Finally, both our scores and the mean score across teams are below the ones obtained for the 2018 and 2019 videos.</p></div>
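<div xmlns="http://www.tei-c.org/ns/1.0"><p>The combination used in run4 can be sketched as follows. We assume min-max normalisation of each run's predictions, which the description above does not fix, before applying the 0.5/0.2/0.3 weights.</p><p><code>
# Sketch of run4: normalise each run's test-set predictions, then combine them with
# weights 0.5 (audio-visual), 0.2 (visiolinguistic) and 0.3 (textual).
import numpy as np

def normalise(scores):
    # min-max normalisation assumed; the exact scheme is not specified above
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span else np.zeros_like(scores)

def weighted_average(run1, run2, run3, weights=(0.5, 0.2, 0.3)):
    runs = [normalise(r) for r in (run1, run2, run3)]
    return sum(w * r for w, r in zip(weights, runs))
</code></p></div>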
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">DISCUSSION AND OUTLOOK</head><p>This paper describes a multimodal weighted average method proposed for the 2020 Predicting Media Memorability task of MediaEval. One of the key contributions of this paper is to have shown that, in our experiments during model construction and testing, video features performed better than image, audio and text features. Similarly to last year, predictions trained on short term scores correlated better with the long term scores than predictions made by training directly on long term scores. Finally, considering the difference in results between the training and test sets, it would be interesting to further investigate the differences between these datasets in terms of content (video, audio and text) and annotation. We conclude that generalizing this type of task to different video genres and characteristics remains a scientific challenge.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Average Spearman score obtained with a 6-fold cross-validation of the training set</figDesc><table><row><cell>Method</cell><cell>Short Term</cell><cell>Long Term</cell></row><row><cell>run1</cell><cell>0.2899</cell><cell>0.179</cell></row><row><cell>run2</cell><cell>0.214</cell><cell>0.1309</cell></row><row><cell>run3</cell><cell>0.2506</cell><cell>0.1372</cell></row><row><cell>run4</cell><cell>0.3104</cell><cell>0.2038</cell></row><row><cell>run5</cell><cell>0.067</cell><cell>0.1700</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Results on the Test set for Short Term (ST) and Long Term (LT) memorability</figDesc><table><row><cell>Method</cell><cell>Spearman ST</cell><cell>Pearson ST</cell><cell>Spearman LT</cell><cell>Pearson LT</cell></row><row><cell>run1</cell><cell>0.099</cell><cell>0.09</cell><cell>0.077</cell><cell>0.0855</cell></row><row><cell>run2</cell><cell>0.098</cell><cell>0.085</cell><cell>-0.017</cell><cell>0.011</cell></row><row><cell>run3</cell><cell>0.073</cell><cell>0.091</cell><cell>0.019</cell><cell>0.049</cell></row><row><cell>run4</cell><cell>0.101</cell><cell>0.09</cell><cell>0.078</cell><cell>0.085</cell></row><row><cell>run5</cell><cell>0.101</cell><cell>0.09</cell><cell>0.067</cell><cell>0.066</cell></row><row><cell>AvgTeams</cell><cell>0.058</cell><cell>0.066</cell><cell>0.036</cell><cell>0.043</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/MeMAD-project/media-memorability</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://github.com/aalto-cbir/DeepCaption</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work has been partially supported by the European Union's Horizon 2020 research and innovation programme via the project MeMAD (GA 780069).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">VQA: Visual Question Answering</title>
		<author>
			<persName><forename type="first">Stanislaw</forename><surname>Antol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aishwarya</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jiasen</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Margaret</forename><surname>Mitchell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dhruv</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">Lawrence</forename><surname>Zitnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Devi</forename><surname>Parikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Computer Vision (ICCV)</title>
				<meeting><address><addrLine>Santiago, Chile</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Predicting media memorability using ensemble models</title>
		<author>
			<persName><forename type="first">David</forename><surname>Azcona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Enric</forename><surname>Moreu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Feiyan</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomás</forename><forename type="middle">E</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alan</forename><forename type="middle">F</forename><surname>Smeaton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2019: Multimedia Benchmark Workshop</title>
				<meeting><address><addrLine>Sophia Antipolis, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset</title>
		<author>
			<persName><forename type="first">João</forename><surname>Carreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Zisserman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="4724" to="4733" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">Jacob</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming-Wei</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kenton</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kristina</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). ACL</title>
				<meeting><address><addrLine>Minneapolis, Minnesota, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a Video Memorable?</title>
		<author>
			<persName><forename type="first">Alba</forename><surname>García Seco de Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rukiye</forename><surname>Savran Kiziltepe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jon</forename><surname>Chamberlain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mihai</forename><surname>Gabriel Constantin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claire-Hélène</forename><surname>Demarty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Faiyaz</forename><surname>Doctor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bogdan</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alan</forename><forename type="middle">F</forename><surname>Smeaton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2020 Workshop</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Audio Set: An ontology and human-labeled dataset for audio events</title>
		<author>
			<persName><forename type="first">Jort</forename><forename type="middle">F</forename><surname>Gemmeke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">P W</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dylan</forename><surname>Freedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aren</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wade</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">Channing</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manoj</forename><surname>Plakal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marvin</forename><surname>Ritter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<meeting><address><addrLine>New Orleans, Louisiana, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="776" to="780" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">Kaiming</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiangyu</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shaoqing</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jian</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting><address><addrLine>Las Vegas, Nevada, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks</title>
		<author>
			<persName><forename type="first">Jiasen</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dhruv</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Devi</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">33rd Conference on Neural Information Processing Systems (NeurIPS)</title>
				<meeting><address><addrLine>Vancouver, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">GloVe: Global Vectors for Word Representation</title>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL</title>
				<meeting><address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Combining Textual and Visual Modeling for Predicting Media Memorability</title>
		<author>
			<persName><forename type="first">Alison</forename><surname>Reboud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ismail</forename><surname>Harrando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jorma</forename><surname>Laaksonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Danny</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Raphaël</forename><surname>Troncy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Héctor</forename><surname>Laria Mantecón</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2019: Multimedia Benchmark Workshop</title>
				<meeting><address><addrLine>Sophia Antipolis, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</title>
		<author>
			<persName><forename type="first">Nils</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Iryna</forename><surname>Gurevych</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL</title>
				<meeting><address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3982" to="3992" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning</title>
		<author>
			<persName><forename type="first">Piyush</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nan</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Goodman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Radu</forename><surname>Soricut</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">56th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting><address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<publisher>ACL</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2556" to="2565" />
		</imprint>
	</monogr>
	<note>: Long Papers)</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Learning Spatiotemporal Features with 3D Convolutional Networks</title>
		<author>
			<persName><forename type="first">Du</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lubomir</forename><forename type="middle">D</forename><surname>Bourdev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rob</forename><surname>Fergus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lorenzo</forename><surname>Torresani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manohar</forename><surname>Paluri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Computer Vision (ICCV)</title>
				<meeting><address><addrLine>Santiago, Chile</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="4489" to="4497" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
