<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">LanViKD: Cross-Modal Language-Vision Knowledge Distillation for Egocentric Action Recognition</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yizheng</forename><surname>Sun</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">The University of Manchester</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hao</forename><surname>Li</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">The University of Manchester</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chenghua</forename><surname>Lin</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">The University of Manchester</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Riza</forename><surname>Batista-Navarro</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">The University of Manchester</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">LanViKD: Cross-Modal Language-Vision Knowledge Distillation for Egocentric Action Recognition</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">54646CD8431BFE0C363D34B3F13E3C1B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:29+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Egocentric Action Recognition</term>
					<term>Language-Vision Multi-modality</term>
					<term>Knowledge Distillation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Understanding human actions through the analysis of egocentric videos is a desirable capability of intelligent agents, and is a research area that has gained popularity recently. Thus far, most approaches to egocentric (video) action recognition (EAR), i.e., the task of classifying a given video clip according to a predefined set of natural-language descriptions (actions), represent the target action classes (labels) using one-hot encoding, thus ignoring any relationships or similarities between the actions. The goal of this work is to augment the generalisation capability of vision models by leveraging the pre-existing knowledge encoded within pre-trained language models. Specifically, we propose a language-vision knowledge distillation framework to distil a pre-trained language model's knowledge about actions (expressed in text) into a vision model. Instead of using the one-hot encoding representation of a label, we employ the probability distribution across all action classes, given by a language model, as a teaching signal. Our experiments demonstrate that our framework obtains improved performance and generalisation capability on EAR based on the EPIC-Kitchens, Something-Something V2 and Something-Else benchmarks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Egocentric vision is a subfield of computer vision that analyses first-person viewpoint vision data captured by a wearable camera. Núñez-Marcos et al. <ref type="bibr">[1]</ref> highlight that, compared with third-person view (exocentric) videos, egocentric videos usually involve rich hand-object interactions. Our framework leverages the observation that different egocentric actions often involve the same objects (e.g., both "Taking cutting board" and "Cutting onion" involve a cutting board) and captures such correlations using pre-trained large language models.</p><p>Early work has demonstrated that, in addition to the RGB modality, leveraging multiple modalities such as audio, optical flow, and the bounding box and category of an object helps improve a model's capability to understand egocentric videos <ref type="bibr" target="#b2">[2,</ref><ref type="bibr" target="#b3">3,</ref><ref type="bibr" target="#b4">4]</ref>. Such efforts have explored the potential of multi-modal knowledge distillation, where the teacher and student models receive different input modalities <ref type="bibr" target="#b5">[5,</ref><ref type="bibr" target="#b6">6,</ref><ref type="bibr" target="#b7">7,</ref><ref type="bibr" target="#b8">8]</ref>. Their results show that using the teacher's knowledge from certain modalities for training improves the student's performance on a different modality during inference. It is, however, unrealistic to assume that multiple modalities are always available. (Footnote: HAII5.0: Embracing Human-Aware AI in Industry 5.0, at ECAI 2024, 19 October 2024, Santiago de Compostela, Spain. Corresponding author: yizheng.sun@manchester.ac.uk (Y. Sun). ORCID: 0009-0004-2600-1236 (Y. Sun); 0000-0002-9923-4346 (H. Li); 0000-0003-3454-2468 (C. Lin); 0000-0001-6693-7531 (R. Batista-Navarro).) 
In contrast, the language modality is usually available because most existing EAR datasets are annotated according to target actions expressed in natural language <ref type="bibr" target="#b9">[9,</ref><ref type="bibr" target="#b10">10,</ref><ref type="bibr" target="#b11">11]</ref>. Additionally, the rapid growth and impressive performance of pre-trained Language Models (LMs) on natural language processing (NLP) and computer vision (CV) tasks have been notable <ref type="bibr" target="#b12">[12,</ref><ref type="bibr" target="#b13">13,</ref><ref type="bibr" target="#b14">14,</ref><ref type="bibr" target="#b15">15]</ref>. Pre-trained LMs bring broader knowledge of human actions that can enrich the language modality.</p><p>Extensive research has delved into exploring the potential of learning vision representations through supervision embedded in natural language <ref type="bibr" target="#b16">[16,</ref><ref type="bibr" target="#b17">17,</ref><ref type="bibr" target="#b18">18,</ref><ref type="bibr" target="#b19">19]</ref>. Consequently, it is natural to investigate whether LMs can be employed for video action recognition. Siddharth et al. <ref type="bibr" target="#b20">[20]</ref> utilised language models to generate textual descriptions of videos, enabling their vision model to comprehend and identify actions more effectively through textual cues. Sun et al. <ref type="bibr" target="#b21">[21]</ref> jointly trained video and language modalities, enabling tasks like action recognition to benefit from textual context. While previous studies demonstrated the advantages of integrating the language modality into video learning, they typically fuse the video and language modalities together instead of utilising a pre-trained language model's latent knowledge directly. Several considerations motivate leveraging this pre-existing knowledge in modelling. 
Firstly, language models (LMs) have showcased exceptional capabilities in few-shot and zero-shot transfer learning <ref type="bibr" target="#b22">[22]</ref>. Consequently, LMs can be employed effectively with relatively small datasets, as their role is solely to assist an existing vision model. Secondly, LM-based methods for video require little or even no training, and can be conveniently applied as plug-in modules <ref type="bibr" target="#b23">[23]</ref>. In this study, we take a different route and propose a cross-modal language-vision knowledge distillation framework for EAR.</p><p>Figure <ref type="figure">1</ref> depicts our framework. The conventional training approach employs one-hot encoding to represent target actions, treating "Taking cutting board" and "Cutting onion" as distinct target classes. Consequently, a vision model perceives these two videos as unrelated, since the one-hot encoding scheme does not account for correlations between action classes. However, this perspective fails to reflect the inherent relationships within the video data, leading to a lack of generalisation. This differs from a human standpoint: humans would recognise that both videos share relevant visual features associated with the cutting board object. Conversely, a language model perceives textual action labels such as "Taking cutting board" and "Cutting onion" as related, given their shared usage of the word "cut", which better aligns with the video content. To address this discrepancy, our framework leverages a language model as the teacher to capture and incorporate this contextual relevance information into the EAR training process, helping to improve vision models' general understanding of videos. Furthermore, our framework also follows a multi-task learning approach for capturing correlations between the vision and language representations. 
We demonstrate that utilising a pre-trained language model as the teacher can improve a vision model's performance and generalisation capability on the EAR task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Contributions (i)</head><p>We provide a cross-modal language-vision knowledge distillation framework for EAR. Our framework is highly flexible, and is not constrained in terms of the vision and language models involved. (ii) We demonstrate through experiments that a pre-trained language model's pre-existing knowledge is beneficial for a vision model's understanding of egocentric vision. (iii) Our experiments show that our framework's performance in terms of accuracy improves upon a baseline approach by up to 2.6%. This superior performance is achieved without adding any additional computation for inference.</p><p>(Figure 1 caption.) In EAR data, samples include action labels described in natural language along with corresponding video clips. These video clips often exhibit relevant visual features corresponding to different action labels. However, traditional training methods commonly utilise one-hot encoding for action labels, which does not adequately capture this correlation and lacks generalisation. In contrast, our framework applies a language model to textual action labels to better understand the relationships among them, thereby aligning more closely with the inherent information in video data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Natural language supervision for vision learning focusses on learning visual representations from semantic information contained in natural language. Various methods have been introduced to learn visual representations from text paired with images <ref type="bibr" target="#b16">[16,</ref><ref type="bibr" target="#b18">18,</ref><ref type="bibr" target="#b24">24,</ref><ref type="bibr" target="#b19">19]</ref>. Notably, a close work to ours is that of Gomez-Bigorda et al. <ref type="bibr" target="#b17">[17]</ref>, which projects given textual information into topic classes using Latent Dirichlet Allocation (LDA). They then use the probability distribution over topic classes as a supervisory signal to train a CNN with cross-entropy loss. In our case, we use pre-trained language models to generate the probability distribution and employ standard practice in knowledge distillation to train a transformer-based vision model. Furthermore, most of the aforementioned works focus on pre-training visual representations, while our framework is directly applied to downstream tasks such as egocentric action recognition.</p><p>Multi-modal knowledge distillation. In the context of multi-modal knowledge distillation, several methods have been introduced in a cross-modal fashion <ref type="bibr" target="#b5">[5,</ref><ref type="bibr" target="#b6">6]</ref>, where the student and the teacher each receive a different modality. Alternatively, some efforts have explored the distillation of knowledge between more than two modalities <ref type="bibr" target="#b7">[7,</ref><ref type="bibr" target="#b25">25,</ref><ref type="bibr" target="#b8">8,</ref><ref type="bibr" target="#b26">26]</ref>, utilising vision- and audio-based data such as raw RGB, optical flow and sound waves. 
In contrast, we focus on knowledge distillation from a teacher model receiving the language modality to a student model receiving the RGB modality. Compared with vision- and audio-based modalities, the strength of using language as a teaching modality comes from modern pre-trained language models, whose pre-existing knowledge contains strong generalisation and understanding capability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Egocentric action recognition (EAR).</head><p>One line of work has focussed on model architecture design to model the interplay between spatial and temporal information within RGB video frames <ref type="bibr" target="#b27">[27,</ref><ref type="bibr" target="#b28">28,</ref><ref type="bibr" target="#b29">29]</ref>. Concurrently, another strand of research demonstrated that using object bounding boxes and categories to model hand-object interaction significantly improves EAR performance <ref type="bibr" target="#b30">[30,</ref><ref type="bibr" target="#b4">4]</ref>. Recent work showed that utilising multiple modalities demonstrates promising performance <ref type="bibr" target="#b2">[2,</ref><ref type="bibr" target="#b3">3,</ref><ref type="bibr" target="#b8">8]</ref>. They utilised vision and audio-based modalities and have used a shared model architecture for different modalities. Notably, the language modality poses unique challenges due to its distinct data format, making direct application of existing methods impractical. Thus, we propose a novel framework aimed at harnessing the language modality specifically for EAR tasks.</p><p>Multi-task learning was originally introduced by Caruana <ref type="bibr" target="#b31">[31]</ref>, where a shared model generates output predictions for multiple tasks on the same input. Recent research highlighted the strong performance of multi-task learning in computer vision tasks <ref type="bibr" target="#b32">[32,</ref><ref type="bibr" target="#b33">33,</ref><ref type="bibr" target="#b34">34]</ref>. In our study, we extend this concept to our knowledge distillation framework by incorporating a regression head. This head projects vision latent representations from a student onto pre-trained language latent representations provided by a teacher.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>This section provides a formal definition of the EAR task and delineates the procedural aspects of our framework, which we refer to as LanViKD. Figure <ref type="figure" target="#fig_1">2</ref> presents an overview of the architecture of LanViKD, which comprises two primary stages: Stage 1 entails the preparation of a language model designated as the teacher model, while Stage 2 involves performing cross-modal knowledge distillation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Egocentric Action Recognition Formulation</head><p>Following Radevski et al. <ref type="bibr" target="#b8">[8]</ref>, we formally define the EAR task as follows. An RGB video clip is a tensor 𝑐 ∈ ℝ^(𝑇×𝐶×𝑊×𝐻), where 𝑇 is the number of sampled RGB frames, and 𝐶, 𝑊 and 𝐻 represent the number of channels, the width and the height. An egocentric action recognition dataset 𝔻 = {(𝑐_1, 𝑤_1, 𝑦_1), ..., (𝑐_𝑁, 𝑤_𝑁, 𝑦_𝑁)} contains 𝑁 video clips 𝑐_𝑖, together with textual narrations 𝑤_𝑖 describing the actions in the clips, and one-hot encodings 𝑦_𝑖 of the narrations. The goal of EAR is to predict the action class ŷ_𝑖 for a given video clip 𝑐_𝑖, or alternatively the verb and noun (v̂_𝑖, n̂_𝑖) constituting the action in a video. The traditional training target for EAR is the one-hot encoding of actions expressed in text <ref type="bibr" target="#b35">[35,</ref><ref type="bibr" target="#b29">29,</ref><ref type="bibr" target="#b36">36]</ref>. However, as shown in Figure <ref type="figure">1</ref>, some action classes such as "taking cutting board" and "cut carrot" share common features with respect to the "cutting board" object in their corresponding RGB video frames. One-hot encoding ignores this relationship between different action classes. The goal of our work is to utilise this relationship information for EAR training by distilling the knowledge of a pre-trained language model into an RGB video model.</p></div>
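To make the contrast between one-hot targets and soft teacher targets concrete, here is a minimal, purely illustrative Python sketch; the class names and probability values are hypothetical, not taken from the datasets used in this paper.

```python
# Three hypothetical noun classes: 0 = "onion", 1 = "cutting board", 2 = "carrot".
num_classes = 3

# Conventional one-hot target for "taking cutting board": class 1 only.
y_onehot = [0.0] * num_classes
y_onehot[1] = 1.0

# A language teacher's soft distribution can keep probability mass on the
# related class "carrot" (whose action text "cut carrot" shares the word
# "cut"), encoding a similarity that one-hot targets discard.
y_soft = [0.05, 0.75, 0.20]

assert sum(y_onehot) == 1.0
assert abs(sum(y_soft) - 1.0) < 1e-9
assert max(range(num_classes), key=y_soft.__getitem__) == 1  # argmax unchanged
```

Both targets assign the highest mass to the correct class; only the soft target carries the inter-class relationship.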
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Language Teacher Model Preparation</head><p>As shown in Figure <ref type="figure" target="#fig_1">2</ref>, given an EAR dataset 𝔻 = {(𝑐 1 , 𝑤 1 , 𝑦 1 ), ..., (𝑐 𝑁 , 𝑤 𝑁 , 𝑦 𝑁 )}, we employ a pre-trained language model capable of processing sequences of text tokens to generate latent representations. Subsequently, we freeze the parameters of the language model and proceed to train a linear projection layer (or two separate linear projections in scenarios involving verb-noun compositional actions) atop the language model. This trained projection layer is tasked with classifying a textual action description 𝑤 𝑖 into its corresponding one-hot encoding index 𝑦 𝑖 (or verb and noun indices, as previously specified). Following training, the linear projection facilitates the generation of a soft probability distribution across all action classes given a textual action description as input. This soft distribution contains valuable semantic information, differing from conventional one-hot encoding. For instance, consider the actions "taking cutting board", which is associated with the noun label "cutting board" encoded as 1, and "cut carrot", labelled with the noun "carrot" encoded as 2. When inputting "taking cutting board" into the language model for noun index classification, it assigns the highest probability to 1 while also allocating a considerable probability to 2. This is due to the shared term "cut" in both textual actions, despite their distinct noun classes. Moreover, this semantic relationship is echoed in the video data, wherein both actions involve the object "cutting board". While one-hot indices categorise these videos into separate, unrelated classes, the probability distribution reflects their semantic connection, aligning more closely with the visual modality.</p></div>
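The teacher's soft distribution described above can be sketched in a few lines of Python. This is an illustrative sketch only: the logits stand in for the output of the frozen language model plus its trained linear head, and the values are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau = 1 recovers the standard softmax."""
    m = max(l / tau for l in logits)                   # stabilise the exponentials
    exps = [math.exp(l / tau - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits produced by the frozen LM plus its trained linear head
# for the text "taking cutting board", over three noun classes
# [0 = "onion", 1 = "cutting board", 2 = "carrot"]; values are illustrative.
logits = [0.2, 3.1, 1.8]
soft_targets = softmax(logits)

# The correct noun class keeps the highest probability, while the related
# class ("carrot", from "cut carrot") retains a non-trivial share --
# the semantic signal that a one-hot target would discard.
assert max(range(3), key=soft_targets.__getitem__) == 1
assert soft_targets[2] > soft_targets[0]
```

During knowledge distillation these soft targets replace (part of) the one-hot supervision.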
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Cross-modal Language-Vision Knowledge Distillation</head><p>Once the language teacher model is prepared, we opt for a vision model to serve as the student model, taking RGB video frames as its input. Similar to the teacher model, we apply linear projection(s) atop the student model. The parameters of the teacher model are then fixed, and we proceed with knowledge distillation, as originally proposed by Hinton et al. <ref type="bibr" target="#b37">[37]</ref>.</p><p>Training Objective. As described above, given a dataset 𝔻 = {(𝑐_1, 𝑤_1, 𝑦_1), ..., (𝑐_𝑁, 𝑤_𝑁, 𝑦_𝑁)}, we minimise the KL-divergence between the class probabilities predicted by the teacher and by the student:</p><formula xml:id="formula_0">ℒ_KL = (1/𝑁) ∑_{𝑖=1}^{𝑁} ŷ^𝑡_𝑖 ⋅ (log ŷ^𝑡_𝑖 − log ŷ^𝑠_𝑖).</formula><p>Following standard practice <ref type="bibr" target="#b37">[37,</ref><ref type="bibr" target="#b8">8]</ref>, we use a temperature parameter 𝜏 to control the entropy of the class probabilities predicted by the teacher, ŷ^𝑡_𝑖 = 𝜎(ŷ^𝑡_𝑖/𝜏), and the student, ŷ^𝑠_𝑖 = 𝜎(ŷ^𝑠_𝑖/𝜏), where 𝜎 is the softmax operator. We then scale the KL-divergence loss according to the temperature parameter, ℒ_KL = ℒ_KL ⋅ 𝜏². Additionally, we also minimise the standard cross-entropy objective on the class probabilities predicted by the student:</p><formula xml:id="formula_1">ℒ_CE = −(1/𝑁) ∑_{𝑖=1}^{𝑁} 𝑦_𝑖 ⋅ log 𝜎(ŷ^𝑠_𝑖).</formula><p>In the case of compositional actions containing verbs and nouns, the training objective becomes the sum of the corresponding loss terms with respect to the verb and the noun, where ℒ_KL = ½(ℒ^𝑛_KL + ℒ^𝑣_KL) and ℒ_CE = ½(ℒ^𝑛_CE + ℒ^𝑣_CE). Furthermore, we apply a multi-task learning approach in LanViKD by adding an extra linear projection layer on top of the student model to generate 𝑜^𝑠_𝑖. We take the output of the last hidden layer of the teacher, ℎ^𝑡_𝑖, which is the latent representation of the input text given by the original pre-trained language model. 
We minimise the smooth L1 objective ℒ_L1 to regress 𝑜^𝑠_𝑖 towards ℎ^𝑡_𝑖:</p><formula xml:id="formula_2">ℒ_L1 = 0.5(𝑜^𝑠_𝑖 − ℎ^𝑡_𝑖)²/𝛽 if |𝑜^𝑠_𝑖 − ℎ^𝑡_𝑖| &lt; 𝛽, and ℒ_L1 = |𝑜^𝑠_𝑖 − ℎ^𝑡_𝑖| − 0.5𝛽 otherwise,</formula><p>where 𝛽 determines the threshold for switching between the L2 and L1 regimes; we use 𝛽 = 1 in our experiments. We compute the final loss as ℒ = 𝜆 ⋅ ℒ_KL + (1 − 𝜆) ⋅ ℒ_CE + 𝜇 ⋅ ℒ_L1. We note that the weights of ℒ_KL and ℒ_CE sum to 1 because both are based on the same output linear layer; in contrast, we use a separate loss weight for ℒ_L1 because it is based on the linear layer of a separate task. At inference, the language teacher model is dispensable: the student vision model operates solely on RGB video frames as its input.</p></div>
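As a concrete illustration, the combined objective ℒ = 𝜆 ⋅ ℒ_KL + (1 − 𝜆) ⋅ ℒ_CE + 𝜇 ⋅ ℒ_L1 can be sketched for a single sample in plain Python. This is a minimal sketch, not the paper's implementation: the logits, feature vectors and hyperparameter defaults below are illustrative, and the average over the 𝑁 training samples is omitted.

```python
import math

def softmax(x, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(v / tau for v in x)
    e = [math.exp(v / tau - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def lanvikd_loss(t_logits, s_logits, y_onehot, s_feat, t_feat,
                 tau=3.0, lam=0.4, mu=1.0, beta=1.0):
    """Single-sample sketch of the distillation objective."""
    p_t = softmax(t_logits, tau)   # teacher soft targets
    p_s = softmax(s_logits, tau)   # student soft predictions
    # KL divergence between teacher and student, scaled by tau^2
    l_kl = sum(pt * (math.log(pt) - math.log(ps))
               for pt, ps in zip(p_t, p_s)) * tau ** 2
    # Standard cross-entropy against the one-hot target (tau = 1)
    q_s = softmax(s_logits)
    l_ce = -sum(y * math.log(q) for y, q in zip(y_onehot, q_s))
    # Smooth L1 regression of the student's projected features onto the
    # teacher LM's last hidden state (here averaged over feature dimensions)
    l_l1 = 0.0
    for o, h in zip(s_feat, t_feat):
        d = abs(o - h)
        l_l1 += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    l_l1 /= len(s_feat)
    return lam * l_kl + (1 - lam) * l_ce + mu * l_l1

# A student that matches the teacher exactly: the KL and smooth-L1 terms
# vanish, leaving only the weighted cross-entropy.
loss = lanvikd_loss([1.0, 2.0], [1.0, 2.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5])
assert 0.0 < loss < 0.5
```

Under the paper's compositional setting, the same computation would be applied separately to the verb and noun heads and averaged.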
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head><p>In our experiments, our primary objective is to assess the potential benefits of integrating knowledge from a language model into a vision model for the EAR task. Specifically, we ask the following questions: (i) How does LanViKD perform on regular EAR data samples, i.e., where training and testing samples contain overlapping environments and/or objects?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(ii)</head><p>To what extent can a student model, trained using LanViKD, effectively generalise to unseen environments and/or objects not encountered during training? (iii) How does the incorporation of a language model's teaching signal alongside the standard one-hot target affect the training of a student model, and what is the optimal balance between the two? (iv) How does using the language modality compare to using the audio modality in cross-modal knowledge distillation with the RGB modality? We choose to compare language with audio because, unlike optical flow and object bounding boxes/categories, which need to be computed from RGB data using external algorithms or models <ref type="bibr" target="#b38">[38,</ref><ref type="bibr" target="#b39">39]</ref>, audio and language are both raw data sources that are readily available in EAR datasets.</p><p>To address questions (i) and (ii), we conduct experiments across various datasets, encompassing those with overlapping environments and objects between training and validation, as well as those featuring unseen or under-represented elements during validation. For question (iii), we perform experiments with different 𝜆 settings, regulating the ratio of the language model's teaching signal to the traditional one-hot target within the training objective. To address question (iv), we compare our findings with those of Radevski et al. <ref type="bibr" target="#b8">[8]</ref>, who conducted similar knowledge distillation from the audio modality to the RGB video modality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Datasets.</head><p>Our experiments are conducted on three datasets: Epic-Kitchens-100 <ref type="bibr" target="#b9">[9]</ref>, Something-Something V2 <ref type="bibr" target="#b10">[10]</ref> and Something-Else <ref type="bibr" target="#b40">[40]</ref>.</p><p>Epic-Kitchens-100 (EK-100) is a large-scale dataset of egocentric videos. It contains 100 hours of non-scripted videos recorded by 37 participants in kitchen environments. The actions depicted in the videos include narrations in the form of English phrases. The training targets are verbs and nouns expressing the actions (e.g. "cutting onion" is an action narration, whose training targets are "cut" and "onion"). There are 300 unique noun classes and 97 unique verb classes. An action is considered to be correctly predicted if both the verb and the noun are correct.</p><p>The Something-Something V2 (SSV2) dataset is a large collection of (mostly egocentric) videos that show people performing 174 pre-defined basic actions with everyday objects (e.g. putting something on a surface, moving something up) <ref type="bibr" target="#b10">[10]</ref>. Notably, videos in SSV2 initially feature annotations with specific object names, which are then replaced with the word "something" for training targets (e.g., "putting box on a surface" becomes "putting something on a surface").</p><p>Something-Else (SthElse) is an alternative data re-split of the original SSV2 <ref type="bibr" target="#b40">[40]</ref>. SthElse splits SSV2 in such a way that the training and validation sets contain distinct objects. Therefore, SthElse focusses on using unseen objects during training to measure the generalisation capability of a model.</p><p>In a similar vein, we also incorporate the EK-100 Unseen and Tail split. The unseen split is a subset of the EK-100 validation set, which contains videos that are recorded by two participants who did not appear in the training set. 
The unseen split is specifically designed to measure models' generalisation to environments unseen during training. The tail split is a subset containing action classes that have few training samples. Notably, the EK-100 regular split encompasses all samples excluding the unseen split.</p><p>Language Backbone. In this study, our language model of choice is MiniLM, featuring 12 layers and a hidden size of 384 <ref type="bibr" target="#b41">[41]</ref>. The rationale behind choosing MiniLM stems from its compact architecture and computational efficiency. Despite its smaller size, MiniLM maintains competitive performance relative to its teacher model, UniLM <ref type="bibr" target="#b42">[42]</ref>. For the EK-100 dataset, we utilised the original textual action annotations, consisting of English phrases describing actions, as input to MiniLM. Similarly, for the SthElse dataset, we employed the original annotations, which include object names, as inputs to MiniLM.</p><p>Vision Backbone. Following Radevski et al. <ref type="bibr" target="#b8">[8]</ref>, we chose the Swin Transformer Tiny version (Swin-T) model as the vision model in LanViKD. Each video clip is represented as a sequence of RGB frames, where each frame is represented by a 3 × 224 × 224 tensor. Swin-T takes a video clip as input and produces a 768-dimensional tensor as the latent representation of the video.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Implementation details.</head><p>For teacher models, we train the linear head for 10 epochs across all datasets. For the student models, we train for 50 epochs on Epic-Kitchens, 40 epochs on SSV2 and 30 epochs on Something-Else. As per Radevski et al. <ref type="bibr" target="#b8">[8]</ref>, we employ the AdamW optimiser <ref type="bibr" target="#b43">[43]</ref>, setting the peak learning rate at 1e-4. The learning rate increases linearly for the first 3 epochs and then decreases linearly to 0. A weight decay of 5e-2 is utilised, along with gradient clipping limiting the maximum norm to 5. Across all experiments, 𝜏 remains fixed at a value of 3. For EK-100, during training, we select a random starting frame and sample 32 frames with a fixed stride of 2. In inference, frames are sampled in the same manner to cover the central section of the video. For SSV2 and SthElse, 16 frames are sampled to cover the entire video during both training and inference. Standard data augmentation techniques are applied to RGB frames, including random cropping, colour jitter, and random horizontal flips (exclusive to EK-100). Consistency is maintained within each video clip by applying the same augmentation methods to every frame. A single temporal crop is employed for inference.</p><p>Direct Comparison. In the study by Radevski et al. <ref type="bibr" target="#b8">[8]</ref>, the Swin-T model was trained on the EK-100, SSV2 and SthElse datasets. A key distinction between their approach and ours is that while they incorporated multiple modalities, including RGB, optical flow, and audio, they did not include the language modality. In contrast, our work leverages only the language modality as the teacher modality. To ensure a direct and fair comparison, we adhered to the same experimental settings as Radevski et al. 
<ref type="bibr" target="#b8">[8]</ref>, including the use of the same backbone model, data augmentation techniques, and frame sampling methods.</p><p>Evaluation Metrics. We report two widely used metrics, Accuracy@1 (ACC@1) and Accuracy@5 (ACC@5), on the test set <ref type="bibr" target="#b44">[44]</ref>. ACC@1 quantifies the proportion of samples for which the single highest-ranked predicted class matches the ground-truth label, while ACC@5 counts a prediction as correct if the ground-truth label appears among the top-5 ranked classes, thereby offering a broader evaluation of a model's performance.</p><p>Generalisation capability on under-represented and unseen environments and unseen objects. Table <ref type="table">2</ref> shows the performance on the EK-100 unseen and tail splits, which contain environments that are, respectively, unseen and under-represented during training. It also shows the performance on SthElse, which contains videos involving objects that are unseen during training. These validation sets aim at evaluating a vision model's generalisation capability.</p><p>Our observations indicate that distilling knowledge from a language model into a vision model generally enhances the latter's generalisation capability, by up to 1.3% on the EK-100 unseen split and 2.6% on the SthElse dataset. Specifically, for the EK-100 unseen split, LanViKD outperforms the baseline across all three metrics (Noun, Verb, and Action) without the addition of the regression head. Furthermore, incorporating the regression head leads to an additional 1.3% improvement specifically on the Action metric. 
For the EK-100 tail split, LanViKD demonstrates results competitive with the baseline when the regression head is absent. With the regression head, although LanViKD exhibits a slight performance decrease in the Verb metric compared to the baseline, it achieves a 0.5% improvement in the primary metric, Action. Similarly, for the SthElse dataset, LanViKD surpasses the baseline by 2.6% in ACC@1 without the regression head; the addition of the regression head marginally diminishes performance, by 0.4%, compared to its absence. Moreover, Figure <ref type="figure" target="#fig_3">3</ref> shows per-class ACC@1 improvements for the top 10 most frequent nouns and verbs in the EK-100 unseen split, alongside the top 20 most frequent actions in SthElse.</p><p>(Table <ref type="table">3</ref> caption.) The impact of varying teaching weights on EK-100. 𝜆 represents the weight assigned to the KL-divergence loss. A higher value of 𝜆 indicates that the training objective for the student model prioritises the teaching signals from the language model to a greater extent, while reducing emphasis on the one-hot targets.</p></div>
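The ACC@1 and ACC@5 metrics used in these comparisons can be sketched as follows; the score rows and labels below are toy values for illustration, not results from the paper.

```python
def accuracy_at_k(score_rows, labels, k):
    """Fraction of samples whose ground-truth label is among the k
    highest-scoring classes (ACC@1 for k=1, ACC@5 for k=5)."""
    hits = 0
    for scores, label in zip(score_rows, labels):
        top_k = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

# Two toy samples over four classes: sample 0 is top-1 correct,
# sample 1 is only correct within the top 3.
scores = [[0.1, 0.6, 0.2, 0.1],
          [0.4, 0.3, 0.2, 0.1]]
labels = [1, 2]
assert accuracy_at_k(scores, labels, 1) == 0.5
assert accuracy_at_k(scores, labels, 3) == 1.0
```

Note that for EK-100's compositional labels, an action counts as correct only when both the verb and the noun predictions are correct.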
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Method Teaching Parameters SthElse</head><p>Acc@1 Acc@5 LanViKD 𝜆 = 0.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Comparison of utilising audio modality and language modality on EK-100 dataset. The baseline is introduced by Radevski et al. <ref type="bibr" target="#b8">[8]</ref>, which distils knowledge from audio modality to RGB modality during training. In contrast, our approach distils knowledge from language modality to RGB modality during training. Both approaches use only RGB video frames for inference.</p><p>EK-100 unseen split dataset, alongside the top 20 frequent actions identified in SthElse.</p><p>Teacher's influence on the student. To investigate the influence of the teacher language model on the student model's performance, we set the parameter 𝜆 to 0.4 and 0.8 for the EK-100 and SthElse datasets, respectively. Specifically, this adjustment increases the teaching signal's weight in the training objective from 40% to 80%, while maintaining the regression head. Tables <ref type="table" target="#tab_3">3 and 4</ref> present a comparative analysis of the model's performance with 𝜆 set at 0.4 and 0.8. The results indicate that increasing 𝜆 to 0.8 leads to a slight improvement on the unseen split of the EK-100 dataset. However, this increase is associated with a significant performance decline on the tail split of the EK-100 dataset and across the SthElse dataset.</p><p>Comparison with knowledge distillation on audio modality. We are interested in comparing the utilisation of audio modality for knowledge distillation, as opposed to optical flow (OF) and objects' bounding box and category (OBJ) modalities. Unlike OF and OBJ, which are derived from RGB modality through external algorithms or deep learning models <ref type="bibr" target="#b38">[38,</ref><ref type="bibr" target="#b39">39]</ref>, audio and text modalities represent raw data from the datasets. 
This distinction is crucial: the computation of OF and OBJ may introduce hidden external model knowledge into training, making it uncertain whether all the knowledge distilled into the student comes solely from the teacher. Radevski et al. <ref type="bibr" target="#b8">[8]</ref> trained an audio model on EK-100 audio data alongside an RGB model, then combined the two as a teacher ensemble to train a Swin-T vision student that receives only RGB video frames. Similarly, our approach trains a vision student that also receives only RGB frames, but leverages knowledge from a language teacher instead: whereas Radevski et al. <ref type="bibr" target="#b8">[8]</ref> utilised the audio and RGB modalities for training, we employ the language and RGB modalities. Both approaches use only the RGB modality for inference.</p><p>It is important to note that the audio modality is available only for the EK-100 dataset. Table <ref type="table">5</ref> compares knowledge distillation using audio and RGB against language and RGB. Our findings indicate that training with language and RGB yields superior performance, surpassing training with audio and RGB by up to 1.7% on the EK-100 regular split, while achieving competitive results on the unseen split.</p></div>
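The 𝜆-weighted training objective discussed above, a KL-divergence teaching signal from the language teacher combined with one-hot cross-entropy, can be sketched as follows. This is a minimal illustration in plain Python, not the paper's implementation: the softmax temperature and the regression head term are omitted, and the function names are our own.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, target, lam=0.4):
    """lam-weighted KL(teacher || student) plus (1 - lam)-weighted cross-entropy.

    A higher lam shifts the student's objective towards the language
    teacher's soft predictions and away from the one-hot labels.
    """
    p_t = softmax(teacher_logits)  # teacher's soft class distribution
    p_s = softmax(student_logits)  # student's predicted distribution
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    ce = -math.log(p_s[target])    # cross-entropy against the one-hot label
    return lam * kl + (1.0 - lam) * ce
```

With lam = 0 this reduces to ordinary one-hot training (the baseline); with lam = 1 the student learns only from the teacher's distribution.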
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Future Work</head><p>In this work, we propose LanViKD, a knowledge distillation framework for the language and vision (RGB) modalities. Our experiments demonstrate improved performance compared to the baseline model, which is trained solely on one-hot labels using only the RGB modality. Additionally, we conduct a comparative analysis between incorporating the audio modality and the language modality for knowledge distillation. Our findings indicate the superiority of the language modality as a teacher for enhancing the learning of the vision-based student.</p><p>In future work, we will investigate integrating the language modality with additional modalities such as audio, depth, and thermography. We plan to develop an approach for aligning multiple modalities and to create a comprehensive teacher model with broader knowledge for distillation, potentially leading to further performance improvements.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Figure 1: In EAR data, samples include action labels described in natural language along with corresponding video clips. These video clips often exhibit relevant visual features corresponding to different action labels. However, traditional training methods commonly utilise one-hot encoding for action labels, which does not adequately capture this correlation and lacks generalisation. In contrast, our framework applies a language model to textual action labels to better understand the relationships among them, thereby aligning more closely with the inherent information in video data.</figDesc><graphic coords="3,306.99,89.73,120.89,139.76" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Overview of our LanViKD framework architecture. LanViKD consists of two main stages. During the first stage, we prepare a pre-trained language model to serve as the teacher by training a linear projection head atop it. In the second stage, we employ a vision model as the student and perform knowledge distillation.</figDesc><graphic coords="5,154.71,195.30,79.00,70.14" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>EK-100 unseen split verbs</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Per-class ACC@1 improvement over baselines of the 10 most frequent nouns and verbs within the EK-100 unseen split dataset, as well as the 20 most frequent actions in SthElse.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>𝑐 𝑁 , 𝑤 𝑁 , 𝑦 𝑁 )}, the teacher model takes 𝑤 𝑖 (action expressed in text) as input and predicts the class probability distribution ŷ 𝑡 𝑖 = [𝑦 𝑡 𝑖,1 , ..., 𝑦 𝑡 𝑖,𝐶 ]. Similarly, the student model takes 𝑐 𝑖 (RGB video frames) as input and predicts ŷ 𝑠 𝑖 = [𝑦 𝑠 𝑖,1 , ..., 𝑦 𝑠 𝑖,𝐶 ]. We minimise the KL-divergence between ŷ 𝑡 𝑖 and ŷ 𝑠 𝑖 as</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Teacher's influence for SthElse. 𝜆 represents the weight assigned to KL-divergence loss. +2.1 62.3 −0.1 39.6 +1.7 39.7 −2.1 51.9 +0.1 27.2 −0.3 LanViKD(𝜆 = 0.8, 𝜇 = 50) RGB&amp;Language RGB 52.4 +0.9 62.6 +0.2 39.2 +1.3 39.5 −2.3 52.9 +1.1 27.6 +0.1</figDesc><table><row><cell>Method</cell><cell>Training</cell><cell>Inference</cell><cell cols="3">EK-100 Regular</cell><cell cols="3">EK-100 Unseen</cell></row><row><cell></cell><cell></cell><cell></cell><cell>Noun</cell><cell>Verb</cell><cell>Action</cell><cell>Noun</cell><cell>Verb</cell><cell>Action</cell></row><row><cell>Baseline</cell><cell>RGB&amp;Audio</cell><cell>RGB</cell><cell>51.5</cell><cell>62.4</cell><cell>37.9</cell><cell>41.8</cell><cell>51.8</cell><cell>27.5</cell></row><row><cell cols="2">LanViKD(𝜆 = 0.4, 𝜇 = 50) RGB&amp;Language</cell><cell>RGB</cell><cell>53.6</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
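The tables above weight the regression head by 𝜇 (e.g. 𝜇 = 50). The head's exact formulation is not given in this excerpt; a common form in cross-modal distillation, sketched below purely as an assumption, regresses the student's features onto the teacher's features with a 𝜇-weighted mean-squared-error term added to the classification objective.

```python
def feature_regression_loss(student_feats, teacher_feats):
    """Mean-squared error between (projected) student and teacher feature vectors.

    Assumed formulation for illustration; the paper's regression head may differ.
    """
    assert len(student_feats) == len(teacher_feats)
    n = len(student_feats)
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / n

def total_loss(classification_loss, student_feats, teacher_feats, mu=50.0):
    """Classification/distillation loss plus the mu-weighted regression term."""
    return classification_loss + mu * feature_regression_loss(student_feats, teacher_feats)
```

Setting 𝜇 = 0 disables the regression head, matching the "without regression head" rows in the ablations.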
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We would like to acknowledge the use of the Computational Shared Facility at The University of Manchester. The computational resource used in this work is supported by the CSF (aka Danzek), which is a High Performance Computing (HPC) cluster at the University of Manchester, managed by IT Services for the use of University academics, post-doctoral assistants and post-graduates to conduct academic research.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head></head><p>Method Teaching Parameters EK-100 Regular SSV2 Noun Verb Action ACC@1 ACC@5 Baseline 𝜆 = 0, 𝜇 = 0 51.5 61.4   </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Analysis</head><p>Across all our experiments, we adopt baseline results from Radevski et al. <ref type="bibr" target="#b8">[8]</ref>, adhering to identical experimental settings. However, since they did not report results for the EK-100 tail split, we replicated the baseline experiment on that split to serve as our own baseline.</p><p>Performance on regular environments and objects. Table <ref type="table">1</ref> shows the performance metrics obtained on both the EK-100 regular split and the SSV2 dataset.</p><p>For EK-100, all results are based on ACC@1. For SSV2, we report both ACC@1 and ACC@5. We observe that distilling knowledge from a language model into a vision model generally improves the vision model's performance on the EK-100 regular split by up to 2%, while maintaining competitive results on the SSV2 dataset compared to the baselines. Specifically, on the EK-100 dataset, integrating the regression head into LanViKD yields superior performance in classifying nouns, whereas removing it improves the classification of verbs. Both configurations show similar gains in classifying actions, achieving approximately a 2% increase over the baseline in ACC@1, the primary metric for EK-100. Conversely, on the SSV2 dataset, LanViKD's performance decreases by 1.9% compared to the baseline without the regression head, while incorporating the regression head yields performance competitive with the baseline.</p></div>			</div>
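The ACC@1 and ACC@5 figures reported in these tables follow the standard top-k accuracy definition: a sample counts as correct if its ground-truth class is among the k highest-scored classes. A minimal sketch (our own illustrative helper, not the paper's evaluation code):

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose ground-truth class is among the k highest-scored classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this sample.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```

ACC@1 corresponds to k = 1 and ACC@5 to k = 5; ACC@5 is therefore always greater than or equal to ACC@1 on the same predictions.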
			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Egocentric vision-based action recognition: A survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Núñez-Marcos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Azkune</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Arganda-Carreras</surname></persName>
		</author>
		<idno type="DOI">10.1016/J.NEUCOM.2021.11.081</idno>
		<ptr target="https://doi.org/10.1016/j.neucom.2021.11.081" />
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">472</biblScope>
			<biblScope unit="page" from="175" to="197" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Omnivore: A single model for many visual modalities</title>
		<author>
			<persName><forename type="first">R</forename><surname>Girdhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ravi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Misra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, IEEE</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="16081" to="16091" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">M&amp;m mix: A multimodal multiview transformer ensemble</title>
		<author>
			<persName><forename type="first">X</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arnab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nagrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schmid</surname></persName>
		</author>
		<idno>CoRR abs/2206.09852</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Object-region video transformers</title>
		<author>
			<persName><forename type="first">R</forename><surname>Herzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ben-Avraham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mangalam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chechik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Globerson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, IEEE</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3138" to="3149" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Cross modal distillation for supervision transfer</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Malik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, IEEE Computer Society</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2827" to="2836" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Soundnet: Learning sound representations from unlabeled video</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Aytar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Vondrick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Torralba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NIPS</title>
		<imprint>
			<biblScope unit="page" from="892" to="900" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Multimodal knowledge expansion</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV, IEEE</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="834" to="843" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Multimodal distillation for egocentric action recognition</title>
		<author>
			<persName><forename type="first">G</forename><surname>Radevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Grujicic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Blaschko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Moens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tuytelaars</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV, IEEE</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="5190" to="5201" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100</title>
		<author>
			<persName><forename type="first">D</forename><surname>Damen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Doughty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Farinella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Furnari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kazakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moltisanti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Munro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Perrett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Price</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wray</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. J. Comput. Vis</title>
		<imprint>
			<biblScope unit="volume">130</biblScope>
			<biblScope unit="page" from="33" to="55" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The &quot;something something&quot; video database for learning and evaluating visual common sense</title>
		<author>
			<persName><forename type="first">R</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Kahou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Michalski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Materzynska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Westphal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Haenel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Fründ</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Yianilos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mueller-Freitag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hoppe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Thurau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bax</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Memisevic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV, IEEE Computer Society</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="5843" to="5851" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">The kinetics human action video dataset</title>
		<author>
			<persName><forename type="first">W</forename><surname>Kay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hillier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vijayanarasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Viola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Green</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Back</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Natsev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suleyman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno>CoRR abs/1705.06950</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page">67</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">BERT: pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NAACL-HLT (1), Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghazvininejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACL, Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7871" to="7880" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Vilt: Vision-and-language transformer without convolution or region supervision</title>
		<author>
			<persName><forename type="first">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Son</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Learning Research</title>
				<meeting>Machine Learning Research<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">139</biblScope>
			<biblScope unit="page" from="5583" to="5594" />
		</imprint>
	</monogr>
	<note>ICML</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Contrastive learning of medical visual representations from paired images and text</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Miura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">P</forename><surname>Langlotz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Learning Research</title>
				<meeting>Machine Learning Research<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">182</biblScope>
			<biblScope unit="page" from="2" to="25" />
		</imprint>
	</monogr>
	<note>MLHC</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Self-supervised learning of visual features through embedding images into text topic spaces</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gomez-Bigorda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rusiñol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Karatzas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">V</forename><surname>Jawahar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, IEEE Computer Society</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2017" to="2026" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Learning visual features from large weakly supervised data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jabri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Vasilache</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV (7</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">9911. 2016</date>
			<biblScope unit="page" from="67" to="84" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Learning Research</title>
				<meeting>Machine Learning Research<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">139</biblScope>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
	<note>ICML</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Seeing what you&apos;re told: Sentence-guided activity recognition in video</title>
		<author>
			<persName><forename type="first">N</forename><surname>Siddharth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barbu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Siskind</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, IEEE Computer Society</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="732" to="739" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Videobert: A joint model for video and language representation learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Myers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Vondrick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Murphy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schmid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV, IEEE</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="7463" to="7472" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>NeurIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Schick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dwivedi-Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dessì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Raileanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lomeli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Cancedda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<title level="m">Toolformer: Language models can teach themselves to use tools</title>
				<imprint>
			<publisher>NeurIPS</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Virtex: Learning visual representations from textual annotations</title>
		<author>
			<persName><forename type="first">K</forename><surname>Desai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Johnson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, Computer Vision Foundation / IEEE</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="11162" to="11173" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">DMCL: distillation multiple choice learning for multimodal action recognition</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">C</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Bargal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ablavsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Morerio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Murino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sclaroff</surname></persName>
		</author>
		<idno>CoRR abs/1912.10982</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Modality distillation with multiple stream networks for action recognition</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">C</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Morerio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Murino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV (8)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">11212</biblScope>
			<biblScope unit="page" from="106" to="121" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Vivit: A video vision transformer</title>
		<author>
			<persName><forename type="first">A</forename><surname>Arnab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lucic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schmid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV, IEEE</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="6816" to="6826" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Video swin transformer</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, IEEE</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3192" to="3201" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Choi</surname></persName>
		</author>
		<title level="m">CAST: cross-attention in space and time for video action recognition</title>
				<imprint>
			<publisher>NeurIPS</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">Interactive fusion of multi-level features for compositional activity recognition</title>
		<author>
			<persName><forename type="first">R</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Shu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<idno>CoRR abs/2012.05689</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Multitask learning</title>
		<author>
			<persName><forename type="first">R</forename><surname>Caruana</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mach. Learn</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="41" to="75" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Multi-task self-training for learning general representations</title>
		<author>
			<persName><forename type="first">G</forename><surname>Ghiasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV, IEEE</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8836" to="8845" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Attentive single-tasking of multiple tasks</title>
		<author>
			<persName><forename type="first">K</forename><surname>Maninis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Radosavovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kokkinos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, Computer Vision Foundation / IEEE</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1851" to="1860" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Cross-stitch networks for multi-task learning</title>
		<author>
			<persName><forename type="first">I</forename><surname>Misra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shrivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hebert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, IEEE Computer Society</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="3994" to="4003" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Sener</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chatterjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yao</surname></persName>
		</author>
		<idno>CoRR abs/2106.03152</idno>
		<title level="m">Technical report: Temporal aggregate representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Movinets: Mobile video networks for efficient video recognition</title>
		<author>
			<persName><forename type="first">D</forename><surname>Kondratyuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Gong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, Computer Vision Foundation / IEEE</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="16020" to="16030" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<title level="m" type="main">Distilling the knowledge in a neural network</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno>CoRR abs/1503.02531</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<title level="m" type="main">An iterative image registration technique with an application to stereo vision</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">D</forename><surname>Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kanade</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1981">1981</date>
			<publisher>IJCAI, William Kaufmann</publisher>
			<biblScope unit="page" from="674" to="679" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">You only look once: Unified, real-time object detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Redmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Divvala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">B</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR, IEEE Computer Society</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="779" to="788" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Something-else: Compositional action recognition with spatial-temporal interaction networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Materzynska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Herzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<idno type="DOI">10.1109/cvpr42600.2020.00113</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
		<title level="m">Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers</title>
				<imprint>
			<publisher>NeurIPS</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Unified language model pre-training for natural language understanding and generation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="13042" to="13054" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Loshchilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<title level="m">Decoupled weight decay regularization</title>
				<imprint>
			<publisher>OpenReview</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note>ICLR (Poster)</note>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">The effects of ratee prototypicality on rater observation and accuracy</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Favero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Ilgen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Applied Social Psychology</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="932" to="946" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
