<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>LanViKD: Cross-Modal Language-Vision Knowledge Distillation for Egocentric Action Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yizheng Sun</string-name>
          <email>yizheng.sun@manchester.ac.uk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hao Li</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>ChengHua Lin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riza Batista-Navarro</string-name>
        </contrib>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>Understanding human actions through the analysis of egocentric videos is a desirable capability of intelligent agents, and is a research area that has gained popularity recently. Thus far, most approaches to egocentric (video) action recognition (EAR), i.e., the task of classifying a given video clip according to a predefined set of natural-language descriptions (actions), represent the target action classes (labels) using one-hot encoding, thus ignoring any relationships or similarities between some of the actions. The goal of this work is to augment the generalisation capability of vision models through leveraging the pre-existing knowledge encoded within pre-trained language models. Specifically, we propose a language-vision knowledge distillation framework to distil a pre-trained language model’s knowledge about actions (expressed in text) into a vision model. Instead of using the one-hot encoding representation of a label, we employ the probability distribution across all action classes, given by a language model, as a teaching signal. Our experiments demonstrate that our framework obtains improved performance and generalisation capability on EAR based on the EPIC-Kitchens, Something-Something V2 and Something-Else datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Egocentric Action Recognition</kwd>
        <kwd>Language-Vision Multi-modality</kwd>
        <kwd>Knowledge Distillation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Egocentric vision is a subfield of computer vision that analyses first-person viewpoint vision data
captured by a wearable camera. Núñez-Marcos et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] highlight that, compared with
third-person view (exocentric) videos, egocentric videos usually involve rich hand-object interactions.
Our framework leverages the observation that different egocentric actions often involve the
same objects (e.g., both “Taking cutting board” and “Cutting onion” involve a cutting board)
and captures such correlations using pre-trained large language models.
      </p>
      <p>
        Early work has demonstrated that, in addition to the RGB modality, leveraging multiple
modalities such as audio, optical flow, and the bounding box and category of an object help
improve a model’s capability to understand egocentric videos [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. Such efforts have
explored the potential of multi-modal knowledge distillation, where the teacher and student
models receive different input modalities [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5, 6, 7, 8</xref>
        ]. Their results show that using the teacher’s
knowledge from certain modalities during training improves the student’s performance on a different
modality during inference. It is, however, unrealistic to assume that multiple modalities are
always available. In contrast, the language modality is usually available because most existing
EAR datasets are annotated according to target actions expressed in natural language [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ].
Additionally, the rapid growth and impressive performance of pre-trained Language Models
(LMs) on natural language processing (NLP) and computer vision (CV) tasks have been notable
[
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12, 13, 14, 15</xref>
        ]. Pre-trained LMs bring broad knowledge of human actions that can enrich
the language modality.
      </p>
      <p>
        Extensive research has delved into exploring the potential of learning vision representations
through supervision embedded in natural language [
        <xref ref-type="bibr" rid="ref16 ref17 ref18 ref19">16, 17, 18, 19</xref>
        ]. Consequently, it is natural
to investigate whether LMs can be employed for video action recognition. Siddharth et al.
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] utilised language models to generate textual descriptions of videos, enabling their vision
model to comprehend and identify actions more effectively through textual cues. Sun et al.
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] jointly trained video and language modalities, enabling tasks like action recognition to
benefit from textual context. While previous studies demonstrated the advantages of integrating
the language modality into video learning, they typically fuse video and language modalities
together instead of utilising a pre-trained language model’s latent knowledge directly. Several
considerations drive the advancement of leveraging pre-existing knowledge in modelling.
Firstly, language models (LMs) have showcased exceptional capabilities in few-shot and
zero-shot transfer learning [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Consequently, LMs can be employed effectively with relatively
small datasets, as their role is solely to assist the existing models during inference. Secondly,
LM-based methods for video need little or even no training; through plug-in modules, they
can be utilised in a convenient manner [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. In this study, we take a different route and propose
a cross-modal language-vision knowledge distillation framework for EAR.
      </p>
      <p>Figure 1 depicts our framework. The conventional training approach employs one-hot
encoding to represent target actions, treating “Taking cutting board” and “Cutting onion” as
distinct target classes. Consequently, a vision model perceives these two videos as unrelated due
to the lack of consideration for the correlation between the action classes in the one-hot encoding
scheme. However, this perspective fails to reflect the inherent relationships within the video data,
leading to a lack of generalisation. This is different from a human standpoint, as humans would
recognise that both videos share relevant visual features associated with the cutting board object.
Conversely, a language model perceives textual action labels such as “Taking cutting board”
and “Cutting onion” as relevant, given their shared usage of the word “cut”, which better aligns
with the video content. To address this discrepancy, our framework leverages a language model
as the teacher to capture and incorporate this contextual relevance information into the EAR
training process to help improve vision models’ general understanding of videos. Furthermore,
our framework also follows a multi-task learning approach for capturing correlations between
the vision and language representations. We demonstrate that utilising a pre-trained language
model as teacher can improve a vision model’s performance and generalisation capability on
the EAR task.</p>
      <p>Contributions. (i) We provide a cross-modal language-vision knowledge distillation
framework for EAR. Our framework is highly flexible, and is not constrained in terms of the vision
and language models involved. (ii) We demonstrate through experiments that a pre-trained
language model’s pre-existing knowledge is beneficial for a vision model’s understanding of
egocentric vision. (iii) Our experiments show that our framework’s performance in terms
of accuracy improves upon a baseline approach by up to 2.6%. This superior performance is
achieved without adding any additional computation for inference.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <sec id="sec-3-1">
        <title>Natural language supervision for vision learning</title>
        <p>
          Natural language supervision for vision learning focusses on learning visual representations
from semantic information contained in natural language. Various methods have been
introduced to learn visual representations from text paired with images [
          <xref ref-type="bibr" rid="ref16 ref18 ref19 ref24">16, 18, 24, 19</xref>
          ]. Notably,
a close work to ours is that of Gomez-Bigorda et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], which projects given textual
information into topic classes using Latent Dirichlet Allocation (LDA). They then use the probability
distribution of topic classes as a supervisory signal to train a CNN with cross-entropy loss.
In our case, we use pre-trained language models to generate the probability distribution and
employ standard practice in knowledge distillation to train a transformer-based vision model.
Furthermore, most of the aforementioned works are for pre-training visual representations, while
our framework is directly applied to downstream tasks such as egocentric action recognition.
Multi-modal knowledge distillation. In the context of multi-modal knowledge distillation,
several methods have been introduced in a cross-modal fashion [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ], where a student and
a teacher each receive a different modality. Alternatively, some efforts explored the
distillation of knowledge between more than two modalities [
          <xref ref-type="bibr" rid="ref25 ref26 ref7 ref8">7, 25, 8, 26</xref>
          ], utilising
vision- and audio-based data such as raw RGB, optical flow and sound waves. In contrast, we
focus on knowledge distillation from a teacher model receiving language modality to a student
model receiving RGB modality. Compared with vision and audio-based modalities, the strength
of using language as a teaching modality comes from modern pre-trained language models,
whose pre-existing knowledge contains strong generalisation and understanding capabilities.
Egocentric action recognition (EAR). One line of work has focussed on model architecture
design to model the interplay between spatial and temporal information within RGB video
frames [
          <xref ref-type="bibr" rid="ref27 ref28 ref29">27, 28, 29</xref>
          ]. Concurrently, another strand of research demonstrated that using object
bounding boxes and categories to model hand-object interaction significantly improves EAR
performance [
          <xref ref-type="bibr" rid="ref30 ref4">30, 4</xref>
          ]. Recent work showed that utilising multiple modalities demonstrates
promising performance [
          <xref ref-type="bibr" rid="ref2 ref3 ref8">2, 3, 8</xref>
          ]. These works utilised vision- and audio-based modalities and
a shared model architecture across different modalities. Notably, the language modality poses
unique challenges due to its distinct data format, making direct application of existing methods
impractical. Thus, we propose a novel framework aimed at harnessing the language modality
specifically for EAR tasks.
        </p>
        <p>
          Multi-task learning was originally introduced by Caruana [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], where a shared model
generates output predictions for multiple tasks on the same input. Recent research highlighted
the strong performance of multi-task learning in computer vision tasks [
          <xref ref-type="bibr" rid="ref32 ref33 ref34">32, 33, 34</xref>
          ]. In our study,
we extend this concept to our knowledge distillation framework by incorporating a regression
head. This head projects vision latent representations from a student onto pre-trained language
latent representations provided by a teacher.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>This section provides a formal definition of the EAR task and delineates the procedural aspects of
our framework, which we refer to as LanViKD. Figure 2 presents an overview of the architecture
of LanViKD, which is comprised of two primary stages: Stage 1 entails the preparation of a
language model designated as the teacher model, while Stage 2 involves performing cross-modal
knowledge distillation.</p>
      <sec id="sec-4-1">
        <title>3.1. Egocentric Action Recognition Formulation</title>
        <p>
          Following Radevski et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], we formally define the EAR task as follows. An RGB video clip is
in the format 𝑣 ∈ ℝ^(𝑇 × 𝐶 × 𝐻 × 𝑊), where 𝑇 is the number of sampled RGB frames, and 𝐶, 𝐻 and 𝑊
represent the number of channels, height and width. An egocentric action recognition dataset
𝐷 = {(𝑣_1, 𝑡_1, 𝑦_1), ..., (𝑣_𝑁, 𝑡_𝑁, 𝑦_𝑁)} contains 𝑁 video clips 𝑣_𝑖, together with textual narrations 𝑡_𝑖
describing actions in the clips, and one-hot encodings 𝑦_𝑖 ∈ ℝ^𝐾 of the narrations over 𝐾 action
classes. The goal of EAR is to predict ŷ ∈ ℝ^𝐾 as the action class for a given video clip 𝑣_𝑖, or
alternatively, (ŷ_𝑣, ŷ_𝑛) ∈ ℝ^(𝐾_𝑣 + 𝐾_𝑛) as the verb and noun constituting the action in a video.
The traditional training target for EAR is the one-hot encoding of actions expressed in text [
          <xref ref-type="bibr" rid="ref29 ref35 ref36">35, 29, 36</xref>
          ]. However, as shown in
Figure 1, some action classes such as “taking cutting board” and “cut carrot” share common features
with respect to the “cutting board” object in their corresponding RGB video frames. One-hot
encoding ignores this relationship between different action classes. The goal of our work
is to utilise this relationship information for EAR training by distilling the knowledge of a
pre-trained language model into an RGB video model.
        </p>
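        <p>The conventional one-hot target described above can be sketched as follows. The class counts are those of EK-100 (97 verbs, 300 nouns); the specific indices are hypothetical and purely illustrative:</p>
```python
def one_hot(index, num_classes):
    """Conventional EAR target: a one-hot vector over num_classes action classes."""
    v = [0.0] * num_classes
    v[index] = 1.0
    return v

# Compositional target for a verb-noun action (hypothetical class indices):
verb_target = one_hot(12, 97)    # 97 unique verb classes in EK-100
noun_target = one_hot(45, 300)   # 300 unique noun classes in EK-100
```
        <p>Every class other than the target receives exactly zero weight, which is precisely the property that discards inter-class relationships.</p>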
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Language Teacher Model Preparation</title>
        <p>As shown in Figure 2, given an EAR dataset  = {( 1,  1,  1), ..., (  ,   ,   )}, we employ a
pre-trained language model capable of processing sequences of text tokens to generate latent
representations. Subsequently, we freeze the parameters of the language model and proceed
to train a linear projection layer (or two separate linear projections in scenarios involving
verb-noun compositional actions) atop the language model. This trained projection layer is
tasked with classifying a textual action description   into its corresponding one-hot encoding
index   (or verb and noun indices, as previously specified). Following training, the linear
projection facilitates the generation of a soft probability distribution across all action classes
given a textual action description as input. This soft distribution contains valuable semantic
information, differing from conventional one-hot encoding. For instance, consider the actions
“taking cutting board”, which is associated with the noun label “cutting board” encoded as 1,
and “cut carrot”, labelled with the noun “carrot” encoded as 2. When inputting “taking cutting
board” into the language model for noun index classification, it assigns the highest probability
to 1 while also allocating a considerable probability to 2. This is due to the shared term “cut” in
both textual actions, despite their distinct noun classes. Moreover, this semantic relationship is
echoed in the video data, wherein both actions involve the object “cutting board”. While one-hot
indices categorise these videos into separate, unrelated classes, the probability distribution
reflects their semantic connection, aligning more closely with the visual modality.</p>
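        <p>A minimal sketch of this teacher-preparation stage, assuming a generic frozen encoder module standing in for the actual MiniLM backbone (the class name, wiring and toy encoder below are illustrative assumptions, not the authors’ code):</p>
```python
import torch
import torch.nn as nn

class LanguageTeacher(nn.Module):
    """Frozen pre-trained LM encoder plus a trainable linear classification head."""
    def __init__(self, lm_encoder, hidden_size, num_classes):
        super().__init__()
        self.lm = lm_encoder
        for p in self.lm.parameters():
            p.requires_grad = False          # Stage 1: freeze the language model
        self.head = nn.Linear(hidden_size, num_classes)  # only this layer is trained

    def forward(self, text_features):
        h_t = self.lm(text_features)         # latent text representation h_t
        logits = self.head(h_t)
        return logits, h_t

# Toy stand-in for a pre-trained encoder (384-d hidden size, as in MiniLM)
encoder = nn.Linear(384, 384)
teacher = LanguageTeacher(encoder, hidden_size=384, num_classes=300)
logits, h_t = teacher(torch.randn(2, 384))
soft_targets = torch.softmax(logits, dim=-1)  # soft distribution over action classes
```
        <p>After training the head, <monospace>soft_targets</monospace> replaces the one-hot vector as the teaching signal, and <monospace>h_t</monospace> is reused by the regression head described later.</p>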
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Cross-modal Language-Vision knowledge distillation</title>
        <p>
          Once the language teacher model is prepared, we opt for a vision model to serve as the student
model, taking RGB video frames as its input. Similar to the teacher model, we apply linear
projection(s) atop the student model. The parameters of the teacher model are then fixed, and
we proceed with knowledge distillation, as originally proposed by Hinton et al. [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Training Objective</title>
        <p>As described above, given a dataset 𝐷 = {(𝑣_1, 𝑡_1, 𝑦_1), ..., (𝑣_𝑁, 𝑡_𝑁, 𝑦_𝑁)},
the teacher model takes 𝑡_𝑖 (the action expressed in text) as input and predicts the class probability
distribution ŷ_𝑡 = [𝑝_𝑡,1, ..., 𝑝_𝑡,𝐾]. Similarly, the student model takes 𝑣_𝑖 (RGB video frames) as
input and predicts ŷ_𝑠 = [𝑝_𝑠,1, ..., 𝑝_𝑠,𝐾]. We minimise the KL-divergence between ŷ_𝑡 and ŷ_𝑠 as
ℒ_KD = (1/𝑁) ∑_𝑖 ŷ_𝑡 ⋅ (log ŷ_𝑡 − log ŷ_𝑠). Following standard practice [
          <xref ref-type="bibr" rid="ref37 ref8">37, 8</xref>
          ], we use a temperature
parameter 𝜏 to control the entropy of the class probabilities predicted by the teacher, ŷ_𝑡 = 𝜎(𝑧_𝑡 / 𝜏),
and the student, ŷ_𝑠 = 𝜎(𝑧_𝑠 / 𝜏), where 𝜎 is the softmax operator and 𝑧_𝑡, 𝑧_𝑠 are the teacher and
student logits. We then scale the KL-divergence loss by the temperature as ℒ_KD = ℒ_KD ⋅ 𝜏².
Additionally, we also minimise the standard cross-entropy objective of the class probabilities
predicted by the student, ℒ_CE = −(1/𝑁) ∑_𝑖 𝑦_𝑖 ⋅ log(ŷ_𝑠). In the case of compositional actions
containing verbs and nouns, each training objective becomes the average of the corresponding
loss terms with respect to the verb and the noun: ℒ_KD = ½(ℒ_KD^verb + ℒ_KD^noun) and
ℒ_CE = ½(ℒ_CE^verb + ℒ_CE^noun).</p>
        <p>Furthermore, we apply a multi-task learning approach in LanViKD by adding an extra linear
projection layer on top of the student model to generate 𝑟_𝑠. We take the output of the last
hidden layer of the teacher, ℎ_𝑡, which is the latent representation of the input text given by
the original pre-trained language model. We minimise the smooth L1 objective ℒ_s1 to regress
𝑟_𝑠 towards ℎ_𝑡:
ℒ_s1 = 0.5(𝑟_𝑠 − ℎ_𝑡)² / 𝛽 if |𝑟_𝑠 − ℎ_𝑡| &lt; 𝛽, and |𝑟_𝑠 − ℎ_𝑡| − 0.5𝛽 otherwise,
where 𝛽 determines the threshold for switching between the L1 and L2 loss, with a value of 1
used in our experiments. We compute the final loss as
ℒ = 𝛼 ⋅ ℒ_KD + (1 − 𝛼) ⋅ ℒ_CE + 𝛾 ⋅ ℒ_s1.
We note that the weights of ℒ_KD and ℒ_CE sum to 1 because they are based on the same output
linear layer. Instead, we use a separate loss weight 𝛾 for ℒ_s1 because it is based on the linear
layer of a separate task. During inference, the language teacher model is dispensable: the
student vision model operates solely on RGB video frames as its input.</p>
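        <p>The combined objective can be sketched in plain Python on a single sample (in practice the terms are averaged over a batch and the smooth L1 is applied elementwise over the representation vectors; the names alpha, gamma, beta and tau follow the symbols used here, and scalar inputs are a simplification):</p>
```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, tau):
    """KL divergence between temperature-softened distributions, scaled by tau**2."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s))
    return kl * tau ** 2

def ce_loss(onehot, student_logits):
    """Standard cross-entropy against the one-hot target."""
    p_s = softmax(student_logits)
    return -sum(y * math.log(ps) for y, ps in zip(onehot, p_s))

def smooth_l1(r_s, h_t, beta=1.0):
    """Smooth L1: quadratic when the error is below beta, linear otherwise."""
    diff = abs(r_s - h_t)
    if diff >= beta:
        return diff - 0.5 * beta
    return 0.5 * diff ** 2 / beta

def lanvikd_loss(teacher_logits, student_logits, onehot, r_s, h_t,
                 tau=3.0, alpha=0.4, gamma=1.0, beta=1.0):
    """Final loss: alpha * KD + (1 - alpha) * CE + gamma * smooth L1."""
    return (alpha * kd_loss(teacher_logits, student_logits, tau)
            + (1 - alpha) * ce_loss(onehot, student_logits)
            + gamma * smooth_l1(r_s, h_t, beta))
```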
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental Setup</title>
      <p>
        In our experiments, our primary objective is to assess the potential benefits of integrating
knowledge from a language model into a vision model for the EAR task. Specifically, we ask
the following questions: (i) What is the performance of utilising LanViKD on regular EAR data
samples, i.e. training and testing samples containing overlapping environments and/or objects.
(ii) To what extent can a student model, trained using LanViKD, efectively generalise to unseen
environments and/or objects not encountered during training? (iii) How does the incorporation
of a language model’s teaching signal alongside the standard one-hot target afect the training
of a student model, and what is the optimal balance between the two? (iv) How does using the
language modality compare to using the audio modality in cross-modal knowledge distillation
with the RGB modality? We choose to compare language with audio because, unlike optical
flow and object bounding boxes/categories, which need to be computed using external algorithms
or models for RGB data [
        <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
        ]; audio and language are both raw data sources that are readily
available in EAR datasets.
      </p>
      <p>
        To address questions (i) and (ii), we conduct experiments across various datasets,
encompassing those with overlapping environments and objects for both training and validation, as
well as those featuring unseen or under-represented elements during validation. For question
(iii), we perform experiments with different 𝛼 settings, regulating the ratio of the language
model’s teaching signal to the traditional one-hot target within the training objective. To
address question (iv), we compare our findings with those of Radevski et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], who conducted
similar knowledge distillation from audio modality to RGB video modality.
      </p>
      <p>
        Datasets. Our experiments are conducted on three datasets: Epic-Kitchens-100 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
Something-Something V2 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Something-Else [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ].
      </p>
      <p>Epic-Kitchens-100 (EK-100) is a large-scale dataset of egocentric videos. It contains 100
hours of non-scripted videos recorded by 37 participants in kitchen environments. The actions
depicted in the videos include narrations in the form of English phrases. The training targets
are verbs and nouns expressing the actions (e.g. “cutting onion” is an action narration, whose
training targets are “cut” and “onion”). There are 300 unique noun classes and 97 unique verb
classes. An action is considered to be correctly predicted if both the verb and the noun are
correct.</p>
      <p>
        The Something-Something V2 (SSV2) dataset is a large collection of (mostly egocentric) videos
that show people performing 174 pre-defined basic actions with everyday objects (e.g. putting
something on a surface, moving something up) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Notably, videos in SSV2 initially feature
annotations with specific object names, which are then replaced with the word “something” for
training targets (e.g., “putting box on a surface” becomes “putting something on a surface”).
      </p>
      <p>
        Something-Else (SthElse) is an alternative data re-split of the original SSV2 [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ]. SthElse splits
SSV2 in such a way that the training and validation sets contain distinct objects. Therefore,
SthElse focusses on using unseen objects during training to measure the generalisation capability
of a model.
      </p>
      <p>In a similar vein, we also incorporate the EK-100 Unseen and Tail split. The unseen split is a
subset of the EK-100 validation set, which contains videos that are recorded by two participants
who did not appear in the training set. The unseen split is specifically designed to measure the
ability of models on unseen environments during training. The tail split is a subset containing
action classes that have few training samples. Notably, the EK-100 regular split encompasses
all samples excluding the unseen split.</p>
      <p>
        Language Backbone. In this study, our language model of choice is MiniLM, featuring 12
layers and a hidden size of 384 [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ]. The rationale behind choosing MiniLM stems from its
compact architecture and computational efficiency. Despite its smaller size, MiniLM maintains
competitive performance over its teacher model, UniLM [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ]. For the EK-100 dataset, we utilised
the original textual action annotations, consisting of English phrases describing actions, as input
to MiniLM. Similarly, for the SthElse dataset, we employed the original annotations, which
include object names as inputs to MiniLM.
      </p>
      <p>
        Vision Backbone. Following Radevski et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we chose the Swin Transformer Tiny version
(Swin-T) model as the vision model in LanViKD. Each video clip is represented as a sequence of
RGB frames, where each frame is represented by a 3 × 224 × 224 tensor. Swin-T takes a video
clip as input and produces a 768-dimension tensor as the latent representation of the video.
Implementation details. For teacher models, we train the linear head for 10 epochs across
all datasets. As for the student models, we train them for 50 epochs on Epic-Kitchens, 40 epochs
on SSV2 and 30 epochs on Something-Else. As per Radevski et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we employ the AdamW
optimiser [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ], setting the peak learning rate at 1e−4. Initially, the learning rate linearly
increases for the first 3 epochs and then linearly decreases to 0. A weight decay of 5e−2 is
utilised, along with gradient clipping, limiting the maximum norm to 5. Across all experiments,
the temperature 𝜏 remains fixed at a value of 3. For EK-100, during training, we select a random starting frame
and sample 32 frames with a fixed stride of 2. In inference, frames are sampled in the same
manner to cover the central section of the video. For SSV2 and SthElse, 16 frames are sampled
to cover the entire video during both training and inference. Standard data augmentation
techniques are applied to RGB frames, including random cropping, color jitter, and random
horizontal flips (exclusive to EK-100). Consistency is maintained within each video clip by
applying the same augmentation methods to every frame. A single temporal crop is employed
for inference.
      </p>
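        <p>The stride-based frame sampling described above can be sketched as follows (a simplified illustration of the stated recipe, not the authors’ implementation; the function name and the clamping behaviour for short videos are assumptions):</p>
```python
def sample_frames(num_video_frames, num_samples=32, stride=2, start=None):
    """Sample num_samples frame indices with a fixed stride.
    With start=None the window is centred (the inference setting); during
    training a random start would be drawn instead."""
    span = num_samples * stride
    if start is None:                       # inference: centre the sampling window
        start = max(0, (num_video_frames - span) // 2)
    # Clamp to the last frame so short videos repeat their final frame
    return [min(start + i * stride, num_video_frames - 1)
            for i in range(num_samples)]

idx = sample_frames(200)                    # centred 32-frame clip, stride 2
```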
      <p>
        Direct Comparison. In the study by Radevski et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the Swin-T model was trained on
the EK-100, SSV2 and SthElse datasets. A key distinction between their approach and ours
is that while they incorporated multiple modalities, including RGB, optical flow, and audio,
they did not include the language modality. In contrast, our work leverages only the language
modality as the teacher modality. To ensure a direct and fair comparison, we adhered to the
same experimental settings as Radevski et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], including the use of the backbone model, data
augmentation techniques, and frame sampling methods.
      </p>
      <p>
        Evaluation Metrics. We calculate two widely used metrics, Accuracy@1 (ACC@1) and
Accuracy@5 (ACC@5), on the test set, which play pivotal roles in assessing the effectiveness
of such systems [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ]. By measuring the correctness of predictions within the top-ranked
results, both ACC@1 and ACC@5 provide valuable insight into a model’s ability to deliver
relevant predictions. ACC@1 quantifies the proportion of samples for which the single
highest-ranked prediction is correct, while ACC@5 expands the assessment to the top-5 ranked
results, thereby offering a broader evaluation of performance.
      </p>
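        <p>ACC@k as described can be computed with a minimal top-k check (a generic illustration, not tied to any particular evaluation toolkit):</p>
```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

scores = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]]
labels = [1, 2]
acc1 = topk_accuracy(scores, labels, 1)   # 0.5: only the first sample is top-1 correct
```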
    </sec>
    <sec id="sec-6">
      <title>5. Results and Analysis</title>
      <p>
        Across all our experiments, we adopt baseline results derived from Radevski et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], adhering
to identical experimental settings. However, for the EK-100 dataset, since they did not include
results for the EK-100 tail split, we replicated the baseline experiment to serve as our own
baseline.
      </p>
      <sec id="sec-6-1">
        <title>Performance on regular environments and objects</title>
        <p>Table 1 shows the performance
metrics obtained from experiments conducted on both the EK-100 regular split and the SSV2 dataset.
For EK-100, all results are based on ACC@1. For SSV2, we report both ACC@1 and ACC@5
accuracy.</p>
        <p>We observe that incorporating knowledge distillation from a language model into a vision
model generally enhances the performance of the vision model on the EK-100 regular split
by up to 2%, while maintaining competitive results on the SSV2 dataset compared to the
baselines. Specifically, in relation to the EK-100 dataset, integrating the regression head for
LanViKD demonstrates superior performance in classifying nouns, whereas its removal results
in improved classification of verbs. Furthermore, both scenarios show similar improvements
in classifying actions, achieving approximately a 2% increase in ACC@1 over the baseline,
which serves as the primary metric for EK-100. Conversely, for the SSV2 dataset, LanViKD’s
performance decreases by 1.9% compared to the baseline without the regression head. Moreover,
incorporating the regression head yields performance that is competitive with the baseline.
(Figure: per-class ACC@1 differences; (a) EK-100 unseen split nouns, (b) EK-100 unseen split verbs.)</p>
      </sec>
      <sec id="sec-6-2">
        <title>Generalisation capability on unrepresented and unseen environments and unseen objects</title>
        <p>Table 2 shows the performance on EK-100 unseen and tail splits, which contain unseen
and unrepresented environments during training, respectively. It also shows the performance
on SthElse, which contains videos involving objects that are unseen during training. These
validation sets aim at evaluating a vision model’s generalisation capability.</p>
        <p>Our observations indicate that distilling knowledge from a language model into a vision
model generally enhances the generalisation capability of the latter by up to 1.3% on the EK-100
unseen split and 2.6% on the SthElse dataset. Specifically, for the EK-100 unseen split, LanViKD
outperforms the baseline across all three metrics (Noun, Verb, and Action) without the addition
of the regression head. Furthermore, incorporating the regression head leads to an additional
1.3% improvement in performance specifically on the metrics for Action. For the EK-100 tail
split, LanViKD demonstrates competitive results with the baseline when the regression head is
absent. However, with the regression head, although LanViKD exhibits a slight performance
decrease in the Verb metric compared to the baseline, it achieves a 0.5% enhancement in the
primary metric, Action. Similarly, for the SthElse dataset, LanViKD surpasses the baseline by
2.6% in ACC@1 without the regression head. However, the addition of the regression head
marginally diminishes performance by 0.4% compared to its absence. Moreover, Figure 3 shows
per-class ACC@1 improvement in relation to the top 10 frequent nouns and verbs within the
EK-100 unseen split, alongside the top 20 frequent actions identified in SthElse.</p>
        <p>Teacher’s influence on the student. To investigate the influence of the teacher language
model on the student model’s performance, we set the teaching-signal weight to 0.4 and 0.8 for the EK-100
and SthElse datasets, respectively. Specifically, this adjustment increases the teaching signal’s
weight in the training objective from 40% to 80%, while maintaining the regression head.</p>
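The weighted objective described above can be sketched as a blend of the one-hot cross-entropy and a divergence towards the teacher's class distribution. The function names and the exact KL form below are illustrative assumptions, not LanViKD's verbatim implementation:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of class logits.
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def blended_kd_loss(student_logits, teacher_probs, one_hot, weight=0.4):
    """Combine the one-hot cross-entropy with a KL term towards the
    teacher's class distribution; `weight` is the teaching-signal weight
    (0.4 or 0.8 in the experiments above). Hypothetical sketch."""
    p = softmax(student_logits)
    eps = 1e-12
    ce = -np.sum(one_hot * np.log(p + eps))  # hard-label term
    kl = np.sum(teacher_probs * np.log((teacher_probs + eps) / (p + eps)))  # soft teacher term
    return (1.0 - weight) * ce + weight * kl
```

With the weight at 0 this reduces to the baseline's one-hot objective; at 1 training follows the teacher's distribution alone, which is why overly large weights can hurt on splits where the teacher's prior is less informative.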
        <p>Tables 3 and 4 present a comparative analysis of the model’s performance with the teaching-signal weight set at 0.4
and 0.8. The results indicate that increasing this weight to 0.8 leads to a slight improvement on the unseen
split of the EK-100 dataset. However, this increase is associated with a significant performance
decline on the tail split of the EK-100 dataset and across the SthElse dataset.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Comparison with knowledge distillation on the audio modality</title>
        <p>
          We are interested in comparing the utilisation of the audio modality for knowledge distillation, as opposed to the optical flow
(OF) and objects’ bounding box and category (OBJ) modalities. Unlike OF and OBJ, which are
derived from RGB modality through external algorithms or deep learning models [
          <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
          ], audio
and text modalities represent raw data from the datasets. This distinction is crucial, as the
computation of OF and OBJ may introduce hidden external model knowledge into training,
making it uncertain whether all the knowledge distilled into a student is solely from the teacher.
In the study by Radevski et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], they trained an audio model on EK-100 audio data alongside
an RGB model. Subsequently, they combined these models as a teacher ensemble to train a
Swin-T vision student model, which only received RGB video frames. Similarly, our approach
leverages knowledge from a language teacher to train a vision student that likewise receives
only RGB frames; while Radevski et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] utilised audio and RGB
modalities for training, we employ language and RGB modalities. Both approaches exclusively
use the RGB modality for inference.
        </p>
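The teacher-ensemble setup of Radevski et al. and our single language teacher differ mainly in how the teaching distribution is formed. A minimal sketch, assuming simple averaging of teacher outputs (their exact combination scheme may differ):

```python
import numpy as np

def teaching_signal(teacher_prob_lists):
    """Average the class distributions of one or more teachers
    (e.g. an audio model plus an RGB model) into a single teaching
    signal for the RGB-only student. Averaging is an assumption here,
    not the exact scheme of Radevski et al."""
    probs = np.mean(np.asarray(teacher_prob_lists, dtype=float), axis=0)
    return probs / probs.sum()  # renormalise against rounding drift
```

A single language teacher is simply the one-element case, `teaching_signal([lang_probs])`; in both setups the teachers are discarded at inference, and the student alone processes RGB frames.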
        <p>It is important to note that the audio modality is exclusive to the EK-100 dataset. Table 5
presents a comparison between knowledge distillation using audio and RGB, and language and
RGB modalities. Our findings indicate that training with language and RGB yields superior
performance, surpassing training with audio and RGB by up to 1.7% on the EK-100 regular split,
while also achieving competitive results on the unseen split.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion and Future Work</title>
      <p>In this work, we propose a knowledge distillation framework, LanViKD, for language and vision
(RGB) modalities. Our experiments demonstrate enhancement in performance compared to
the baseline model, which is solely trained on one-hot labels utilising only the RGB modality.
Additionally, we conduct a comparative analysis between the incorporation of audio modality
and language modality for knowledge distillation. Our findings indicate the superiority of the
language modality as a teacher for enhancing the learning of the vision-based student.</p>
      <p>In our future work, we will investigate the integration of the language modality with additional
modalities such as audio, depth, and thermography. We plan to find an approach for aligning
multiple modalities and create a comprehensive teacher model with broader knowledge for
knowledge distillation, potentially leading to further performance improvement.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We would like to acknowledge the use of the Computational Shared Facility at The University of
Manchester. The computational resource used in this work is supported by the CSF (aka Danzek),
which is a High Performance Computing (HPC) cluster at the University of Manchester, managed
by IT Services for the use of University academics, post-doctoral assistants and post-graduates
to conduct academic research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Núñez-Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Azkune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Arganda-Carreras</surname>
          </string-name>
          ,
          <article-title>Egocentric vision-based action recognition: A survey</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>472</volume>
          (
          <year>2022</year>
          )
          <fpage>175</fpage>
          -
          <lpage>197</lpage>
          . doi:10.1016/j.neucom.2021.11.081.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girdhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <article-title>Omnivore: A single model for many visual modalities</article-title>
          , in: CVPR, IEEE,
          <year>2022</year>
          , pp.
          <fpage>16081</fpage>
          -
          <lpage>16091</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arnab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nagrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>M&amp;M mix: A multimodal multiview transformer ensemble</article-title>
          ,
          <source>CoRR abs/2206.09852</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ben-Avraham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mangalam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chechik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Globerson</surname>
          </string-name>
          ,
          <article-title>Object-region video transformers</article-title>
          , in: CVPR, IEEE,
          <year>2022</year>
          , pp.
          <fpage>3138</fpage>
          -
          <lpage>3149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hofman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <article-title>Cross modal distillation for supervision transfer</article-title>
          , in: CVPR, IEEE Computer Society,
          <year>2016</year>
          , pp.
          <fpage>2827</fpage>
          -
          <lpage>2836</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Aytar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vondrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <article-title>Soundnet: Learning sound representations from unlabeled video</article-title>
          ,
          <source>in: NIPS</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>892</fpage>
          -
          <lpage>900</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Multimodal knowledge expansion</article-title>
          , in: ICCV, IEEE,
          <year>2021</year>
          , pp.
          <fpage>834</fpage>
          -
          <lpage>843</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Radevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grujicic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Blaschko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          ,
          <article-title>Multimodal distillation for egocentric action recognition</article-title>
          , in: ICCV, IEEE,
          <year>2023</year>
          , pp.
          <fpage>5190</fpage>
          -
          <lpage>5201</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Damen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Doughty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Farinella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Furnari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kazakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moltisanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Munro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Perrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wray</surname>
          </string-name>
          ,
          <article-title>Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100</article-title>
          ,
          <source>Int. J. Comput. Vis.</source>
          <volume>130</volume>
          (
          <year>2022</year>
          )
          <fpage>33</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Kahou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michalski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Materzynska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Westphal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Haenel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fründ</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yianilos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mueller-Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Thurau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Bax</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Memisevic</surname>
          </string-name>
          ,
          <article-title>The ”something something” video database for learning and evaluating visual common sense</article-title>
          , in: ICCV, IEEE Computer Society,
          <year>2017</year>
          , pp.
          <fpage>5843</fpage>
          -
          <lpage>5851</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hillier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Back</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Natsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suleyman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>The kinetics human action video dataset</article-title>
          ,
          <source>CoRR abs/1705.06950</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>140:1</fpage>
          -
          <lpage>140:67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: NAACL-HLT (1), Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          , in: ACL, Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>7871</fpage>
          -
          <lpage>7880</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Son</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Vilt: Vision-and-language transformer without convolution or region supervision</article-title>
          , in: ICML, volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5583</fpage>
          -
          <lpage>5594</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Miura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Langlotz</surname>
          </string-name>
          ,
          <article-title>Contrastive learning of medical visual representations from paired images and text</article-title>
          , in: MLHC, volume
          <volume>182</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gomez-Bigorda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rusiñol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          ,
          <article-title>Self-supervised learning of visual features through embedding images into text topic spaces</article-title>
          , in: CVPR, IEEE Computer Society,
          <year>2017</year>
          , pp.
          <fpage>2017</fpage>
          -
          <lpage>2026</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jabri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasilache</surname>
          </string-name>
          ,
          <article-title>Learning visual features from large weakly supervised data</article-title>
          ,
          <source>in: ECCV (7)</source>
          , volume
          <volume>9911</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2016</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          , in: ICML, volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Siddharth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barbu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Siskind</surname>
          </string-name>
          ,
          <article-title>Seeing what you're told: Sentence-guided activity recognition in video</article-title>
          , in: CVPR, IEEE Computer Society,
          <year>2014</year>
          , pp.
          <fpage>732</fpage>
          -
          <lpage>739</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vondrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>Videobert: A joint model for video and language representation learning</article-title>
          , in: ICCV, IEEE,
          <year>2019</year>
          , pp.
          <fpage>7463</fpage>
          -
          <lpage>7472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sigler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: NeurIPS,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dwivedi-Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dessì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raileanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cancedda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <article-title>Toolformer: Language models can teach themselves to use tools</article-title>
          , in: NeurIPS,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <article-title>VirTex: Learning visual representations from textual annotations</article-title>
          , in: CVPR, Computer Vision Foundation / IEEE,
          <year>2021</year>
          , pp.
          <fpage>11162</fpage>
          -
          <lpage>11173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Bargal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ablavsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Morerio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sclaroff</surname>
          </string-name>
          ,
          <article-title>DMCL: distillation multiple choice learning for multimodal action recognition</article-title>
          , CoRR abs/1912.10982 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Morerio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murino</surname>
          </string-name>
          ,
          <article-title>Modality distillation with multiple stream networks for action recognition</article-title>
          ,
          <source>in: ECCV (8)</source>
          , volume
          <volume>11212</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2018</year>
          , pp.
          <fpage>106</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arnab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>ViViT: A video vision transformer</article-title>
          , in: ICCV, IEEE,
          <year>2021</year>
          , pp.
          <fpage>6816</fpage>
          -
          <lpage>6826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Video swin transformer</article-title>
          , in: CVPR, IEEE,
          <year>2022</year>
          , pp.
          <fpage>3192</fpage>
          -
          <lpage>3201</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>CAST: cross-attention in space and time for video action recognition</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>R.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Interactive fusion of multi-level features for compositional activity recognition</article-title>
          , CoRR abs/2012.05689 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <article-title>Multitask learning</article-title>
          ,
          <source>Mach. Learn</source>
          .
          <volume>28</volume>
          (
          <year>1997</year>
          )
          <fpage>41</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghiasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Multi-task self-training for learning general representations</article-title>
          , in: ICCV, IEEE,
          <year>2021</year>
          , pp.
          <fpage>8836</fpage>
          -
          <lpage>8845</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>K.</given-names>
            <surname>Maninis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Radosavovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kokkinos</surname>
          </string-name>
          ,
          <article-title>Attentive single-tasking of multiple tasks</article-title>
          , in: CVPR, Computer Vision Foundation / IEEE,
          <year>2019</year>
          , pp.
          <fpage>1851</fpage>
          -
          <lpage>1860</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hebert</surname>
          </string-name>
          ,
          <article-title>Cross-stitch networks for multi-task learning</article-title>
          ,
          <source>in: CVPR, IEEE Computer Society</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>3994</fpage>
          -
          <lpage>4003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>Technical report: Temporal aggregate representations</article-title>
          , CoRR abs/2106.03152 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kondratyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <article-title>MoViNets: Mobile video networks for efficient video recognition</article-title>
          , in: CVPR, Computer Vision Foundation / IEEE,
          <year>2021</year>
          , pp.
          <fpage>16020</fpage>
          -
          <lpage>16030</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowledge in a neural network</article-title>
          , CoRR abs/1503.02531 (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Lucas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kanade</surname>
          </string-name>
          ,
          <article-title>An iterative image registration technique with an application to stereo vision</article-title>
          , in: IJCAI, William Kaufmann,
          <year>1981</year>
          , pp.
          <fpage>674</fpage>
          -
          <lpage>679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>You only look once: Unified, real-time object detection</article-title>
          , in: CVPR, IEEE Computer Society,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>J.</given-names>
            <surname>Materzynska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <article-title>Something-Else: Compositional action recognition with spatial-temporal interaction networks</article-title>
          ,
          <source>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2020</year>
          ). doi:10.1109/cvpr42600.2020.00113.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers</article-title>
          , in: NeurIPS,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hon</surname>
          </string-name>
          ,
          <article-title>Unified language model pre-training for natural language understanding and generation</article-title>
          , in: NeurIPS,
          <year>2019</year>
          , pp.
          <fpage>13042</fpage>
          -
          <lpage>13054</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          , in: ICLR (Poster), OpenReview.net,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Favero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Ilgen</surname>
          </string-name>
          ,
          <article-title>The effects of ratee prototypicality on rater observation and accuracy</article-title>
          ,
          <source>Journal of Applied Social Psychology</source>
          <volume>19</volume>
          (
          <year>1989</year>
          )
          <fpage>932</fpage>
          -
          <lpage>946</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>