<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>International Scientific Workshop on Applied Information Technologies and Artificial Intelligence Systems</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A systemic analysis of Vid-LLM architectures for the integration of multimodal video understanding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arsirii</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hodovychenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Svitlana Antoshchuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Odesa Polytechnic National University</institution>
          ,
          <addr-line>Shevchenko Avenue 1, 65044 Odesa</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>Recent advances in Large Language Models (LLMs) enable Video-Language Models (Vid-LLMs) for complex spatiotemporal video understanding. A systemic analysis of modern Vid-LLM architectures is presented, highlighting three main categories based on input processing strategies: Analyzer + LLM (relying on symbolic outputs), Embedder + LLM (using visual representations), and Hybrid frameworks as a combination of the first two. We analyzed their design principles, functional roles, and applications (captioning, QA, localization, agents). Challenges in long-context modeling, video tokenization, grounded reasoning, and integration with external tools are discussed. In conclusion, future research directions for improving Vid-LLM scalability, interpretability, and robustness are substantiated.</p>
      </abstract>
      <kwd-group>
        <kwd>Vid-LLMs</kwd>
        <kwd>Video Comprehension</kwd>
        <kwd>Multimodal Architectures</kwd>
        <kwd>Spatio-temporal Reasoning</kwd>
        <kwd>Video Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The swift expansion of video content across digital platforms, such as social media, entertainment,
surveillance, and autonomous systems, has generated an increased demand for intelligent systems
capable of autonomously analyzing and comprehending complex visual data. Video
comprehension, encompassing the identification of objects, actions, events, and the inference of
high-level semantics over time, represents a core challenge in the fields of computer vision and
artificial intelligence [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Conventional approaches, such as manual feature engineering and early
neural networks, laid the groundwork for significant progress; however, they have proven
inadequate in fully conveying the complexity and diversity inherent in real-world video
footage [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Over the past decade, deep learning has substantially improved models' ability to handle spatio-temporal
data. Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and, more
recently, Transformer-based architectures have achieved success in action recognition, video
classification, captioning, and temporal localization. Self-supervised learning has accelerated this
advancement even further by making it possible to train strong video encoders without extensive
human annotation. However, these models are frequently specialized for particular tasks and lack the
generalization and reasoning skills needed to interpret videos that are more abstract and span
multiple phases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Concurrently, Large Language Models (LLMs) such as GPT-4 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], PaLM [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and LLaMA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
have attained pioneering results in natural language processing tasks. These models demonstrate
emergent capabilities, such as few-shot learning, instruction adherence, and advanced reasoning,
by utilizing extensive text corpora during the pretraining phase. Recent initiatives have
commenced to investigate the integration of large language models with video data, leading to the
emergence of a novel category of video-language models (Vid-LLMs) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These systems integrate
vision encoders and language models to enable multimodal video comprehension, capable of
answering questions about videos, generating descriptions, identifying temporal events, and
performing commonsense reasoning based on visual input.
      </p>
      <p>Vid-LLMs offer a unified interface for executing various video comprehension tasks through
prompting or in-context learning, generally without requiring extensive retraining. Their
versatility and applicability across various domains render them valuable for purposes including
robotics, surveillance, education, and content moderation.</p>
      <p>The objective of this study is to perform a systematic analysis and classification of Video-Language
Model (Vid-LLM) architectures to identify and assess the principal integration strategies between the
visual and language modalities that characterize the capacity of Vid-LLMs for intricate multimodal
video comprehension. The following goals are addressed:
1. Systematize the most advanced methods for developing Vid-LLMs by categorizing them
according to architectural paradigms (Analyzer + LLM, Embedder + LLM, and Hybrid), and
evaluate their fundamental design principles.
2. Assess the capabilities of Vid-LLMs across a broad spectrum of applications, including query
answering, temporal localization, and agentic reasoning.
3. Identify the technical limitations and assessment criteria associated with video tokenization,
long-context modeling, and interpretability.
4. Develop a strategic roadmap and delineate future research directions aimed at creating
more robust, scalable, and efficient systems for visual environment integration.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Trends and milestones in video comprehension methods</title>
      <p>The field of video comprehension has undergone substantial development in the last two decades,
driven by advancements in computer vision, machine learning, and, more recently, multimodal
artificial intelligence. The increasing volume of video data across diverse domains, including
entertainment, social media, surveillance, and autonomous systems, has heightened the necessity
for efficient and scalable methods for its interpretation and analysis. This paper analyzes the
historical development of video comprehension techniques.</p>
      <sec id="sec-2-1">
        <title>2.1. Early techniques</title>
        <p>
          The initial phase of video comprehension involved manually crafted feature extraction techniques
and conventional machine learning algorithms. Spatial characteristics have traditionally been
obtained using descriptors such as Scale-Invariant Feature Transform (SIFT) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Speeded-Up
Robust Features (SURF) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and Histogram of Oriented Gradients (HOG) [10], which aid in
identifying and representing key visual patterns within discrete frames. Techniques including
optical flow, background removal, and Improved Dense Trajectories (IDT) [11] were employed to
characterize motion and temporal dynamics.
        </p>
        <p>Temporal dependencies in video sequences have frequently been examined through statistical
models, notably Hidden Markov Models (HMMs) [12], which enabled the recognition of sequential
patterns. Conventional machine learning models, such as Support Vector Machines (SVMs) [13],
Decision Trees [14], and Random Forests [15], have been extensively utilized for classification and
recognition tasks. Furthermore, unsupervised methods including cluster analysis and
dimensionality reduction techniques such as Principal Component Analysis (PCA) were employed
to categorize video segments and decrease computational complexity.</p>
        <p>These techniques provided valuable insights into video analysis; however, their applicability in
other contexts was limited, and they faced challenges in scaling, particularly when dealing with
complex, high-dimensional, or extended-duration videos. This prompted the exploration of more
dependable methods, ultimately resulting in the adoption of deep learning-based techniques.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. First-generation neural video models</title>
        <p>Initial neural video models represented a substantial transition from conventional handmade
methods by using deep learning architectures, especially convolutional and recurrent neural
networks. Early models like DeepVideo [16] used 3D Convolutional Neural Networks (CNNs) [17]
to derive visual features from video frames; however, they failed to surpass handmade features
owing to insufficient motion representation. To overcome this problem, two-stream networks were
developed, integrating RGB frame data with motion information (e.g., optical flow) to more
effectively capture temporal dynamics.</p>
        <p>Recurrent Neural Networks (RNNs) [18], particularly Long Short-Term Memory (LSTM) [19]
networks, were used to analyze sequential data and improve the representation of long-range
temporal relationships. Temporal Segment Networks (TSN) [20] consolidated data from poorly
sampled segments to facilitate efficient analysis of long-form videos. Additional advances,
including Fisher Vectors [21] and Bi-linear pooling [22], were used to enhance video-level
representations.</p>
        <p>The advent of 3D CNNs, including C3D and Inflated 3D ConvNets (I3D) [23], facilitated the
integrated modeling of spatial and temporal data via volumetric convolutions.</p>
        <p>The models demonstrated impressive performance on benchmarks such as UCF-101 [24] and
HMDB51 [25], resulting in the adaptation of well-known 2D architectures (e.g., ResNet, SENet) into
3D formats (e.g., R3D, MFNet, STC) [26]. To enhance computational efficiency, decomposed
convolution methods (e.g., S3D, ECO, P3D) [27] divide 3D operations into separable 2D and 1D
convolutions.</p>
        <p>Subsequent progress involved the development of long-range temporal modeling techniques
(e.g., LTC, T3D, Non-local Networks, V4D) [28] and the introduction of efficient architectures such
as SlowFast and X3D [29]. The incorporation of Vision Transformers (ViT) [30] has spurred the
development of models including TimeSformer [31], ViViT [32], and MViT [33]. These models
substitute convolutional operations with attention mechanisms, thereby providing enhanced
scalability and improved temporal reasoning abilities for intricate video understanding
applications.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Unsupervised pretraining for video understanding</title>
        <p>Self-supervised pretraining for video is a major step forward in video understanding because it lets
models learn rich and generalizable representations from large amounts of unlabeled video
data. This approach eases transfer between tasks and reduces the need for annotations that are
specific to each task. VideoBERT [34] was a groundbreaking and well-known model that used
hierarchical k-means clustering to tokenize video features and masked modeling to learn
bidirectional representations. The model could then be fine-tuned to improve performance on
downstream tasks such as action recognition and video captioning.</p>
        <p>Subsequently, various approaches employed the pretraining-finetuning paradigm, incorporating
innovations in architecture and training objectives. Models such as ActBERT, Spatio-temporalMAE,
OmniMAE, VideoMAE, and MotionMAE have investigated masked video modeling and multimodal
learning [35]. Others, including MaskFeat and CLIP-ViP, concentrated on contrastive learning and
vision-language alignment [36]. These models incorporated mechanisms for reconstructing or
predicting obscured video segments, aligning visual and textual modalities, or generating latent
feature representations that encode both temporal and semantic information.</p>
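        <p>As an illustration of the masked video modeling objective discussed above, the following sketch (with hypothetical helper names, not code from any cited model) shows VideoMAE-style "tube" masking, in which the same spatial patch positions are hidden in every frame:</p>

```python
import random

def tube_mask(num_frames, tokens_per_frame, mask_ratio=0.9, seed=0):
    """Sketch of VideoMAE-style "tube" masking: the same spatial patch
    positions are masked across every frame, so a masked patch cannot be
    trivially reconstructed by copying it from a neighboring frame."""
    rng = random.Random(seed)
    num_masked = int(tokens_per_frame * mask_ratio)
    masked_positions = set(rng.sample(range(tokens_per_frame), num_masked))
    # A token is identified by (frame index, spatial position); True = masked.
    return [[pos in masked_positions for pos in range(tokens_per_frame)]
            for _ in range(num_frames)]

mask = tube_mask(num_frames=8, tokens_per_frame=196, mask_ratio=0.9)
visible_per_frame = sum(1 for m in mask[0] if not m)
# With a 0.9 ratio, 196 - 176 = 20 patches stay visible in each frame.
```

        <p>The reconstruction target is then the pixel (or feature) content of the masked tokens, which forces the encoder to model temporal dynamics rather than per-frame appearance alone.</p>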
        <p>Self-supervised models have markedly enhanced performance on standard video benchmarks
and exhibited robust generalization to tasks such as video classification, summarization, captioning,
and question answering [37]. They also endorsed cross-modal learning, whereby coupled video and
language data enabled the development of video-language models proficient in multimodal
reasoning. This phase established the foundation for the integration of pretrained visual models
with large language models in subsequent systems.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. LLM-based approaches to video comprehension</title>
        <p>The incorporation of Large Language Models (LLMs) into video comprehension represents the
most recent and significant advancement in the domain (Fig. 1). Large Language Models, like
ChatGPT and GPT-4, pretrained on extensive text corpora, exhibit robust in-context learning,
instruction adherence, and reasoning ability. Their application to video comprehension represents
a paradigm change by framing intricate video interpretation difficulties as language modeling
challenges, often without necessitating considerable task-specific fine-tuning.</p>
        <p>Large language models (LLMs) are capable of processing textual representations obtained from
video content or engaging with visual information via multimodal encoders. Utilizing these
capabilities, systems like Visual-ChatGPT and other Vid-LLMs have been created to execute
open-ended video reasoning, generate captions, respond to video-related inquiries, and invoke external
vision APIs or tools based on prompts. This facilitates a dynamic and flexible comprehension of
visual scenes through natural language. Instruction tuning and prompt engineering are essential
for adapting large language models to perform various video-related tasks. These models
demonstrate emergent capabilities, enabling them to perform multi-granularity reasoning –
abstract, temporal, and spatio-temporal – through the integration of visual and commonsense
knowledge. In contrast to earlier models designed for specific tasks, LLM-based video
understanding systems exhibit the ability to generalize across various tasks through unified
interfaces and few-shot or zero-shot learning [38].</p>
        <p>Combining LLMs with video analysis opens the door to more scalable and human-like video
comprehension systems that can handle multimodal problems in the real world in fields like
robotics, education, entertainment, and surveillance. This development marks a move toward
instruction-driven, general-purpose video intelligence.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem statement</title>
      <p>Let a video be defined as a sequence of visual frames over time:</p>
      <p>V = {f_1, f_2, ..., f_T},  f_t ∈ R^{H×W×C}, (1)
where each frame f_t is an RGB image of spatial resolution H × W with C = 3 channels, and T
denotes the temporal length of the video.</p>
      <p>Let an optional multimodal query Q be a sequence of language tokens:</p>
      <p>Q = {q_1, q_2, ..., q_L},  q_i ∈ 𝒱, (2)
where 𝒱 is the vocabulary space of the language model, and L is the length of the query or
prompt.</p>
      <p>The goal of video comprehension with Large Language Models (LLMs) is to define a function:
F_Θ : (V, Q) → A, (3)
where F_Θ is a parameterized model (e.g., a Vid-LLM) that maps the video and optional query to a
structured output A, such as:
1. A natural language sequence: A ∈ 𝒱*.
2. A classification label: A ∈ Y.
3. Continuous-valued predictions (e.g., timestamps, coordinates): A ∈ R^n.</p>
      <p>To learn this mapping, the system typically consists of:
1. A video encoder φ:
φ : V → v = {v_1, v_2, ..., v_T},  v_t ∈ R^{d_v}. (4)
2. A language encoder (optional) ψ:
ψ : Q → q = {q_1, q_2, ..., q_L},  q_i ∈ R^{d_q}. (5)
3. A multimodal fusion function that aligns and integrates video and language features into a
unified space interpretable by the LLM.
4. An LLM core M: an autoregressive transformer that models the conditional probability
distribution:
M(x_{1:i−1}) = p(x_i | x_{1:i−1}), (6)
where each token x_i belongs to the vocabulary 𝒱 (possibly extended with visual tokens) and
x_{1:i−1} may include the fused visual-linguistic context.</p>
      <p>The training objective is to minimize a task-specific loss L, for instance:
θ* = argmin_θ E_{(V, Q, A*)∼D} [ L(F_θ(V, Q), A*) ], (7)
where A* is the ground-truth target from dataset D, and θ includes parameters from the
encoders, the fusion module, and (optionally) the LLM.</p>
      <p>This formulation encapsulates various subtasks, including:
1. Captioning: A ∈ 𝒱*.
2. Localization: A = (t_s, t_e),  t_s, t_e ∈ [1, T].
3. Tracking: A = {(x_t, y_t)}_{t=1}^{T}, etc.</p>
      <p>The primary challenge lies in bridging the modality gap between continuous spatio-temporal
visual signals and discrete symbolic reasoning in LLMs, while maintaining scalability,
generalization, and data efficiency.</p>
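      <p>The mapping F_Θ : (V, Q) → A can be summarized as a minimal sketch; every component below is a toy placeholder for illustration (the function names and default lambdas are assumptions, not from the formulation itself):</p>

```python
from typing import List, Optional, Tuple, Union

# A: a caption, a class label, or a (t_s, t_e) localization pair.
Answer = Union[str, int, Tuple[float, float]]

def vid_llm_forward(
    frames: List[list],                 # V = {f_1, ..., f_T}
    query: Optional[List[str]] = None,  # Q = {q_1, ..., q_L} (optional)
    encode_video=None,                  # phi: frames -> per-frame embeddings
    encode_query=None,                  # psi: tokens -> token embeddings
    fuse=None,                          # aligns both streams for the LLM
    llm=None,                           # autoregressive core M
) -> Answer:
    """Schematic F_theta(V, Q) -> A; toy defaults stand in for each stage."""
    encode_video = encode_video or (lambda fs: [[float(len(f))] for f in fs])
    encode_query = encode_query or (lambda qs: [[float(len(q))] for q in qs or []])
    fuse = fuse or (lambda v, q: v + q)  # naive token concatenation
    llm = llm or (lambda ctx: f"<answer from {len(ctx)} fused tokens>")
    v = encode_video(frames)   # {v_1, ..., v_T}
    q = encode_query(query)    # {q_1, ..., q_L}
    return llm(fuse(v, q))

print(vid_llm_forward([[0] * 4] * 3, ["what", "happens", "?"]))
# 3 frame tokens + 3 query tokens -> "<answer from 6 fused tokens>"
```

      <p>The three architecture families in the next section differ precisely in what replaces these placeholders: a symbolic analyzer, a dense embedder, or both.</p>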
    </sec>
    <sec id="sec-4">
      <title>4. Vid-LLMs classification</title>
      <p>Based on the strategy used to process input video data, we divide Vid-LLMs into three main
categories.</p>
      <sec id="sec-4-1">
        <title>4.1. Video Analyzer + LLM</title>
        <p>The Video Analyzer + LLM architecture exemplifies a modular strategy for video comprehension,
where the video content is initially handled by a specialized video analyzer module that extracts
interpretable intermediate representations in textual format (Fig. 2). The outputs generally
encompass video captions, detailed temporal captions, object tracking data, audio transcriptions
(through ASR), or subtitle text (through OCR) [39]. The resulting textual descriptions are
subsequently provided as input to a Large Language Model (LLM), which conducts high-level
reasoning, question answering, or task-specific inference based on the structured input [40].</p>
        <p>This design effectively reformulates video comprehension as a text-based reasoning task,
allowing LLMs to operate without the need for direct visual or spatio-temporal input processing.
As a result, it leverages the zero-shot and in-context learning capabilities of pretrained LLMs while
avoiding the computational overhead and training complexity of end-to-end multimodal models.</p>
        <p>Two functional variants of this architecture are commonly employed:
1. LLM as Summarizer: in this arrangement, the LLM passively receives the output from the
video analyzer and produces natural language summaries, captions, or responses. The
information flow is unidirectional (i.e., Video → Analyzer → LLM), and the LLM does not
influence the video processing pipeline. Notable examples include LaViLa, VAST, LLoVi,
Video ReCap, Grounding-Prompter, and AntGPT [41].
2. LLM as Manager: this variant assigns the LLM an active role, in which it issues
directives, orchestrates various analytical tools, and participates in iterative exchanges
to accomplish complex tasks. The LLM operates as a sophisticated coordinator of perceptual
modules. Notable systems in this area include ViperGPT, HuggingGPT, VideoAgent,
SCHEMA, VideoTree, GPTSee, and AssistGPT [42].</p>
        <p>This architecture is especially appealing due to its training-free and modular structure,
facilitating swift prototyping and deployment of video comprehension systems with readily
available LLMs and vision tools. It constitutes a fundamental pattern in the architecture of
numerous contemporary Vid-LLM systems.</p>
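        <p>A minimal sketch of the summarizer variant follows; the tool set, prompt layout, and stand-in LLM are hypothetical illustrations, not the interface of any cited system:</p>

```python
def analyzer_plus_llm(video_path, question, analyzers, llm):
    """'LLM as Summarizer' pipeline: unidirectional Video -> Analyzer -> LLM.
    `analyzers` maps tool names (captioner, ASR, OCR, tracker, ...) to
    functions returning text; the LLM sees only their symbolic output,
    never the pixels themselves."""
    context_lines = [f"[{name}] {tool(video_path)}"
                     for name, tool in analyzers.items()]
    prompt = "\n".join(context_lines) + f"\nQuestion: {question}\nAnswer:"
    return llm(prompt)

# Toy stand-ins for the perception tools and the LLM:
analyzers = {
    "caption": lambda v: "a person opens a door and walks outside",
    "asr":     lambda v: "see you later!",
}
echo_llm = lambda prompt: prompt.splitlines()[-2]  # pretend-LLM: echoes the question line
print(analyzer_plus_llm("clip.mp4", "What does the person do?", analyzers, echo_llm))
```

        <p>The manager variant would differ only in the control flow: the LLM would choose which entries of <code>analyzers</code> to invoke, possibly over several rounds, instead of receiving all outputs at once.</p>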
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Video Embedder + LLM</title>
        <p>The Video Embedder combined with the LLM architecture presents a more cohesive approach to
video comprehension by embedding video content into a continuous feature space through a visual
encoder (or embedder), and directly supplying these dense representations to a Large Language
Model (LLM) (Fig. 3). Unlike the Video Analyzer + LLM framework, which depends on textual
intermediate outputs, this design permits the LLM to process raw visual data in embedded form,
facilitating more detailed multimodal reasoning [43].</p>
        <p>In this design, the video embedder, which may include a 3D CNN, a Transformer-based encoder,
or a vision-language pretrained model, analyzes the video input V to produce a series of latent
embeddings {v_1, v_2, ..., v_T}, which are then aligned with the token space of the LLM. A
modality bridging mechanism, such as a linear projection, adapter module, or cross-attention layer, is often
used to transform visual embeddings into a format compatible with the input space of the LLM
[44].</p>
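        <p>The simplest of these bridging mechanisms, a linear projection, can be sketched as follows (the weights here are illustrative constants, not trained parameters):</p>

```python
def linear_projection(visual_embeddings, weight):
    """Map d_v-dimensional visual embeddings into the LLM's d_model-dimensional
    token space with a single matrix W, the simplest modality bridge; in a
    real system W is learned jointly with (or instead of) the encoder."""
    d_v, d_model = len(weight), len(weight[0])
    projected = []
    for v in visual_embeddings:  # each v in R^{d_v}
        assert len(v) == d_v
        projected.append([sum(v[i] * weight[i][j] for i in range(d_v))
                          for j in range(d_model)])
    return projected  # tokens in R^{d_model}, ready to prepend to text tokens

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]  # d_v = 2 -> d_model = 3
print(linear_projection([[0.5, 2.0]], W))  # [[0.5, 2.0, 0.0]]
```

        <p>Adapter modules and cross-attention layers serve the same purpose with more capacity: they condition the projection on the LLM's own hidden states rather than applying one fixed map.</p>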
        <p>This method facilitates comprehensive training and enables the LLM to directly engage with
visual representations, positioning it effectively for tasks that necessitate temporal alignment,
spatial grounding, or multimodal reasoning across video and language inputs [45].</p>
        <p>
          Training these models necessitates extensive multimodal data and meticulous design of the
fusion mechanisms to guarantee stable integration of visual and textual features. Prominent
instances of this architecture encompass:
1. BLIP-2, MiniGPT4, and LLaVA (adapted for video with visual embedding extensions) [
          <xref ref-type="bibr" rid="ref10">46</xref>
          ].
2. Video-ChatGPT, Video-LLaVA, Video-Chat, and MM-VID (which incorporate visual
embedding adapters) [
          <xref ref-type="bibr" rid="ref11">47</xref>
          ].
3. SEED, Video-LLaMA, and mPLUG-Owl (leveraging pretrained vision-language encoders and
LLMs for multimodal interaction) [
          <xref ref-type="bibr" rid="ref12">48</xref>
          ].
        </p>
        <p>
          The Video Embedder + LLM architecture provides a cohesive multimodal interface between
vision and language, facilitating enhanced interaction across modalities. Nonetheless, it often
requires task-specific adjustment and is sensitive to the quality of visual embeddings and their
alignment with linguistic representations. This methodology signifies progress in achieving
coherent video-language integration, facilitating a diverse array of downstream tasks like video
captioning, question answering, and temporal localization [
          <xref ref-type="bibr" rid="ref13">49</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. (Analyzer + Embedder) + LLM</title>
        <p>
          The design of the (Analyzer + Embedder) + LLM integrates the optimal elements of both textual
and visual feature pathways by using a video analyzer in conjunction with a video embedder. The
outcomes of both are further processed using a Large Language Model (LLM) (Fig. 3). This
combined design seeks to use the advantages of both organized symbolic information and intricate
visual imagery to enhance video comprehension and adaptability [
          <xref ref-type="bibr" rid="ref14">50</xref>
          ].
        </p>
        <p>
          In this configuration, the video analyzer produces results that are comprehensible and often
interpretable by people. These outputs include action labels, subtitles, and temporal annotations
that encapsulate the video material coherently. The video embedder converts the video into a
sequence of visual embeddings that capture intricate spatial and temporal details. Token union,
dual-stream focus, and multimodal adapters are examples of modality fusion methods used to
integrate both symbolic and visual streams into the LLM [
          <xref ref-type="bibr" rid="ref15">51</xref>
          ].
        </p>
        <p>
          This design facilitates the LLM’s execution of complex multimodal cognitive tasks by
integrating advanced symbolic concepts with fundamental visual attributes. It is effective in
scenarios when either symbolic or embedded information alone is insufficient, such as when
simultaneous visual grounding and semantic summarization are required. Examples of systems
using this design include:
1. Video-LLaVA, Video-Chat, and MM-ReAct (which combine dense vision features with
analyzer-generated text) [
          <xref ref-type="bibr" rid="ref16">52</xref>
          ].
2. GPT4Tools, MM-ReAct, and MM-Vid (that enable dynamic tool use and feature fusion based
on LLM-directed instructions) [
          <xref ref-type="bibr" rid="ref17">53</xref>
          ].
3. Video-ChatGPT, which leverages both vision encoders and captioning modules for
multimodal dialogue and reasoning [
          <xref ref-type="bibr" rid="ref18">54</xref>
          ].
        </p>
        <p>
          This hybrid architecture improves flexibility and interpretability, and it allows
LLM-driven control to dynamically orchestrate visual and symbolic cues. However, it makes system
design more complicated, since the analyzer, embedder, and LLM inputs need to be carefully
synchronized and aligned [
          <xref ref-type="bibr" rid="ref19">55</xref>
          ].
        </p>
        <p>The (Analyzer + Embedder) + LLM model is a promising step toward creating video-language
models that can perform a wide range of tasks by combining different types of data and tools in real time.</p>
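        <p>A schematic of the hybrid "token union" fusion described above can be sketched as follows; every component, including the identity "adapter", is an illustrative stand-in rather than the design of any cited system:</p>

```python
def hybrid_forward(video_path, question, analyzer, embedder, project, llm):
    """(Analyzer + Embedder) + LLM sketch: symbolic text from the analyzer
    and projected dense features from the embedder are concatenated
    ("token union") into one context for the LLM."""
    text_context = analyzer(video_path)            # e.g. captions / action labels
    visual_tokens = project(embedder(video_path))  # dense features -> LLM space
    fused = ([("text", t) for t in text_context.split()] +
             [("visual", tok) for tok in visual_tokens])
    return llm(fused, question)

toy = hybrid_forward(
    "clip.mp4", "what happens?",
    analyzer=lambda v: "dog catches frisbee",
    embedder=lambda v: [[0.1, 0.2], [0.3, 0.4]],
    project=lambda vs: vs,  # identity "adapter" for illustration
    llm=lambda fused, q: f"{len(fused)} fused tokens for: {q}",
)
print(toy)  # 3 text tokens + 2 visual tokens -> "5 fused tokens for: what happens?"
```

        <p>Dual-stream attention replaces the flat concatenation with separate attention paths over the two streams, at the cost of the extra synchronization noted above.</p>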
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Vid-LLMs applications</title>
      <p>Integrating Large Language Models (LLMs) into video comprehension systems has opened up new
possibilities for a wide range of real-world uses. These models, especially when combined with visual
encoders or analytic modules, are highly flexible and can be applied to tasks that require multimodal
reasoning, semantic comprehension, and human-like interaction with video content. This section
discusses the areas where Video-Language Models (Vid-LLMs) have had the greatest impact or show
the most promise.</p>
      <p>
        1. Summarizing and captioning videos involves distilling content and providing textual
representation for accessibility and comprehension. The automatic generation of natural
language descriptions for video content represents a key application of Vid-LLMs. These
systems can generate concise summaries or detailed captions by analyzing dynamic scenes
and relating them to linguistic semantics. These models can generate captions that are
contextually and temporally aware, as well as highly meaningful, due to the reasoning
capabilities of large language models (LLMs). They can capture purpose, emotion, and
narrative structure, alongside object or action recognition. This capability is essential for
activities such as content generation, material categorization, and support for individuals
with disabilities [
        <xref ref-type="bibr" rid="ref20">56</xref>
        ].
2. Video Question Answering (Video QA). Vid-LLMs let users pose natural language questions
about video material, making it possible to interpret videos interactively. The
model takes in both visual and textual inputs and produces accurate answers. This task requires not
only object or event identification, but also spatio-temporal reasoning, comprehension of
causality, and commonsense knowledge. Vid-LLMs show considerable promise in this area,
even in few-shot or zero-shot settings, thanks to the instruction-following and
prompting capabilities inherited from LLMs [
        <xref ref-type="bibr" rid="ref21">57</xref>
        ].
3. Temporal and Spatial Localization. A significant application area pertains to identifying
particular moments, actions, or objects within video streams. This encompasses temporal
action localization, moment retrieval, and referring object grounding. In these contexts,
Vid-LLMs can effectively associate language-based prompts (e.g., “when does the person start
running?”) with specific temporal segments and spatial areas of interest within the video.
This functionality is crucial for applications such as video indexing, surveillance, sports
analysis, and the comprehension of instructional content [
        <xref ref-type="bibr" rid="ref21">57</xref>
        ].
4. Multimodal Video Dialogues. The rise of multimodal chat systems has led to the integration
of Vid-LLMs in interactive dialogue interfaces that encompass video comprehension. These
systems facilitate natural conversations that include follow-up questions, temporal
references, and iterative reasoning related to video content. This approach is especially
beneficial in the realms of educational technology, customer support automation, and
interactive storytelling, as comprehending video context is essential for producing relevant
and coherent dialogue [
        <xref ref-type="bibr" rid="ref22">58</xref>
        ].
5. Tool Use and Video-Based Agents. Recent Vid-LLM designs extend their usefulness by
acting as autonomous agents that can interpret video input, reason, and invoke
external tools or APIs as needed. These agents can work with long videos, form hypotheses,
verify them using tools such as object detectors, trackers, and summarizers, and produce
multi-step answers. Their agentic behavior makes them well suited to demanding
decision-making applications such as robotics, autonomous monitoring, and the
interpretation of scientific footage.
6. Cross-modal retrieval and recommendation. Vid-LLMs are increasingly used in
systems that support video-to-text and text-to-video search, where it is vital for the
semantics of the two kinds of material to align. By embedding both types of data into a
shared latent space, these systems make it easier to identify relevant content from
natural language descriptions, and vice versa. This capability is highly useful for
searching video databases, recommending content, and managing digital assets.
7. Monitoring, security, and compliance. In high-stakes situations like surveillance or forensic
analysis, Vid-LLMs could help find events, spot strange behavior, or check for conformity
with regulatory standards. Their ability to assess and express visual information in plain
English enables transparent and verifiable decision-making, particularly advantageous in
legal, security, and auditing contexts [
        <xref ref-type="bibr" rid="ref23">59</xref>
        ].
      </p>
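      <p>The shared-latent-space retrieval mentioned in the list above reduces, at inference time, to a nearest-neighbor search by cosine similarity; the embeddings below are made-up two-dimensional examples, not outputs of any real encoder:</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_embedding, video_embeddings):
    """Cross-modal retrieval sketch: text and videos live in one shared
    latent space, and the best match maximizes cosine similarity."""
    ranked = sorted(video_embeddings.items(),
                    key=lambda kv: cosine(query_embedding, kv[1]),
                    reverse=True)
    return ranked[0][0]

videos = {
    "cooking.mp4": [0.9, 0.1],
    "soccer.mp4":  [0.1, 0.9],
}
print(retrieve([0.2, 0.8], videos))  # -> "soccer.mp4"
```

      <p>Real systems index millions of such embeddings with approximate nearest-neighbor structures rather than the exhaustive sort shown here.</p>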
      <p>The application range of Vid-LLMs thus spans descriptive, analytical, interactive, and
operational domains, enabled by their ability to merge vision and language within a cohesive,
human-centered framework. As the discipline advances, we foresee wider use of these models in
domains requiring explainability, multi-turn interaction, and dynamic handling of multimodal
inputs. Furthermore, forthcoming advances in fine-tuning methodologies, long-context modeling,
and the incorporation of domain-specific tools will likely broaden their applicability
significantly.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future research opportunities</title>
      <p>As Large Language Models (LLMs) become more common in video understanding systems, a
number of important research directions and open problems emerge that are likely to shape the
next generation of multimodal intelligence. Current Vid-LLMs have shown great promise in video
reasoning, captioning, and interactivity, but significant challenges remain in scalability, accuracy,
interpretability, and generalization across domains.</p>
      <p>
        1. Long-context video modeling. One of the most pressing problems is how to analyze
long-form videos effectively. When working with long temporal sequences, current models are
generally constrained by memory and compute limitations. Future research should
investigate more effective temporal compression methods, hierarchical modeling, and
sparse attention mechanisms that support representing and retrieving relevant segments
over extended durations while preserving essential context [39].
2. Video tokenization and representation learning. Unlike images or text, video data lacks a
widely adopted and efficient scheme for decomposition into tokens. Better methods are
needed to convert raw video into symbolic or discrete representations that are both
expressive and computationally tractable. Advances in video-language tokenizers, discrete
visual vocabularies, and multimodal pretraining objectives will be key to integrating video
more effectively with LLM architectures [
        <xref ref-type="bibr" rid="ref24">60</xref>
        ].
3. Grounded and explainable reasoning. Future Vid-LLMs must be capable of grounded
reasoning, in which predictions are explicitly linked to specific visual or temporal evidence
in the input video. This is necessary for real-world applications to be trustworthy and
transparent. Developing methods that produce explainable and verifiable
outputs – for instance, via textual rationales, highlighted video frames, or traceable
inference paths – will be vital for adoption in sensitive domains such as healthcare, legal
analysis, and autonomous systems [61].
4. Dynamic interaction and tool integration. The trend toward LLM-driven agents is expected to
continue, with Vid-LLMs drawing on external tools for perception, tracking, summarization,
and retrieval. Future systems could gain stronger planning and reasoning abilities through
better memory, self-correction, and adaptive tool-invocation mechanisms. Research into
multi-agent collaboration, where different expert models cooperate via LLM coordination, is
also a promising direction [62].
      </p>
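The temporal compression direction discussed in item 1 can be illustrated with a minimal sketch: per-frame features (random stand-ins for a real vision encoder's output) are mean-pooled over contiguous segments into a fixed token budget before reaching the LLM. All names and sizes here are illustrative assumptions, not a specific published method.

```python
import numpy as np

def compress_frames(frame_feats: np.ndarray, budget: int) -> np.ndarray:
    """Mean-pool T per-frame feature vectors into `budget` contiguous
    segments, yielding a fixed-size token sequence regardless of video length.

    frame_feats: (T, D) array of per-frame features.
    Returns: (budget, D) array of segment-level tokens.
    """
    # np.array_split tolerates T not being divisible by budget.
    segments = np.array_split(frame_feats, budget, axis=0)
    return np.stack([seg.mean(axis=0) for seg in segments])

# A 10-minute video sampled at 2 fps -> 1200 frames; compress to 64 tokens.
feats = np.random.default_rng(1).normal(size=(1200, 768))
tokens = compress_frames(feats, budget=64)
```

Hierarchical variants apply the same pooling recursively (frames to shots, shots to scenes), trading temporal resolution for a context length the LLM can afford.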
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In recent years, the integration of video analysis with large-scale language modeling has
significantly transformed the field of multimodal artificial intelligence. Video-Language Models
(Vid-LLMs) are advanced systems capable of performing complex reasoning, engaging in
interactive dialogue, and demonstrating semantic understanding of video content. Their
functionality is achieved by integrating Large Language Models (LLMs) with vision encoders and
perceptual tools.</p>
      <p>This survey has provided a comprehensive account of how Vid-LLMs have evolved, how
they are architected, how they are applied, and what challenges they face. We organized current
methods into three main architectural paradigms – Video Analyzer + LLM, Video Embedder + LLM,
and (Analyzer + Embedder) + LLM – characterizing their operating principles, design choices, and
underlying models. We also examined a broad range of real-world uses, including captioning,
question answering, temporal localization, multimodal conversation, retrieval, and agentic
reasoning.</p>
      <p>Even though modern Vid-LLMs are already powerful, they remain at an early stage of
development. Their performance is still limited by difficulties with video tokenization,
long-context modeling, multimodal alignment, and explainability, and the research community
has substantial work ahead on scalability, domain robustness, and standardized evaluation.</p>
      <p>Integrating LLMs into video understanding represents a significant advancement in the
development of general-purpose, instruction-driven, and human-aligned multimodal AI systems.
With advancements in modeling architectures, pretraining techniques, and system-level design,
Vid-LLMs are poised to play a significant role in the development of intelligent systems capable of
seamlessly interacting with their visual environments in the future.</p>
      <p>This study lays the groundwork for understanding the current landscape of Vid-LLMs and offers
a structured approach for future research initiatives aimed at improving the capabilities, efficiency,
and reliability of video-language models.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <sec id="sec-8-1">
        <p>[10] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of
the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
CVPR’05, IEEE, New York, NY, 2005, pp. 886–893. doi:10.1109/CVPR.2005.177.
[11] H. Wang, C. Schmid, Action recognition with improved trajectories, in: Proceedings of the
2013 IEEE International Conference on Computer Vision, ICCV ’2013, IEEE, New York, NY,
2013, pp. 3551–3558. doi:10.1109/ICCV.2013.441.
[12] X. Liu, T. Cheng, Video-based face recognition using adaptive hidden Markov models, in:
Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, CVPR ’2003, IEEE, New York, NY, 2003, pp. I–I. doi:10.1109/CVPR.2003.1211373.
[13] H. Sidenbladh, Detecting human motion with support vector machines, in: Proceedings of the
Pattern Recognition, International Conference on, ICPR ’2004, IEEE, New York, NY, 2004,
pp. 188–191. doi:10.1109/ICPR.2004.1334092.
[14] A. Mittal, S. Gupta, Automatic content-based retrieval and semantic classification of video
content, Int. J. Digit. Libr. 6 (2006) 30–38. doi:10.1007/s00799-005-0119-y.
[15] A. B. Chan, N. Vasconcelos, Modeling, clustering, and segmenting video with mixtures of
dynamic textures, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008) 909–926.
doi:10.1109/TPAMI.2007.70738.
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video
classification with convolutional neural networks, in: Proceedings of the 2014 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR ’2014, IEEE, New York, NY,
2014, pp. 1725–1732. doi:10.1109/CVPR.2014.223.
[17] S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition,
IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013) 221–231. doi:10.1109/TPAMI.2012.59.
[18] F. Salem, Recurrent neural networks: from simple to gated architectures, 2022.
doi:10.1007/978-3-030-89929-5.
[19] B. Lindemann, T. Müller, H. Vietz, N. Jazdi, M. Weyrich, A survey on long short-term memory
networks for time series prediction, Procedia CIRP 99 (2021) 650–655.
doi:10.1016/j.procir.2021.03.088.
[20] G. Yang, Y. Yang, Z. Lu, J. Yang, D. Liu, C. Zhou, Z. Fan, STA-TSN: spatial–temporal attention
temporal segment network for action recognition in video, PLoS ONE 17 (2022) 1–19.
doi:10.1371/journal.pone.0265115.
[21] M. Sekma, M. Mejdoub, C. Ben Amar, Human action recognition based on multi-layer Fisher
vector encoding method, Pattern Recognit. Lett. 65 (2015) 37–43.
doi:10.1016/j.patrec.2015.06.029.
[22] A. Diba, V. Sharma, L. Van Gool, Deep temporal linear encoding networks, in: Proceedings of
the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’2017, IEEE,
New York, NY, 2017, pp. 1541–1550. doi:10.1109/CVPR.2017.168.
[23] J. Carreira, A. Zisserman, Quo vadis, action recognition? a new model and the Kinetics dataset,
in: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition,
CVPR ’2017, IEEE, New York, NY, 2017, pp. 4724–4733. doi:10.1109/CVPR.2017.502.
[24] A. Gabriel, S. Cosar, N. Bellotto, P. Baxter, A dataset for action recognition in the wild, in:
Annual Conference Towards Autonomous Robotic Systems, Springer, London, UK, 2019, pp.
362–374. doi:10.1007/978-3-030-23807-0_30.
[25] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: a large video database for
human motion recognition, in: 2011 International Conference on Computer Vision (ICCV),
2011, pp. 2556–2563. doi:10.1109/ICCV.2011.6126543.
[26] K. Hara, H. Kataoka, Y. Satoh, Learning spatio-temporal features with 3D residual networks
for action recognition, arXiv preprint arXiv:1708.07632 (2017). doi:10.48550/arXiv.1708.07632.
[27] Q. Yu, Y. Li, J. Mei, Y. Zhou, A. L. Yuille, CAKES: channel-wise automatic kernel shrinking for
efficient 3D networks, arXiv preprint arXiv:2003.12798 (2020). doi:10.48550/arXiv.2003.12798.
[28] J. Lao, W. Hong, X. Guo, Y. Zhang, J. Wang, J. Chen, W. Chu, Simultaneously short- and
long-term temporal modeling for semi-supervised video semantic segmentation, in: Proceedings of
the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR ’2023,
IEEE, New York, NY, 2023, pp. 14763–14772. doi:10.1109/CVPR52729.2023.01418.
[29] C. Feichtenhofer, X3D: expanding architectures for efficient video recognition, arXiv preprint
arXiv:2004.04730 (2020). doi:10.48550/arXiv.2004.04730.
[30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M.
Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth
16×16 words: transformers for image recognition at scale, arXiv preprint arXiv:2010.11929
(2021). doi:10.48550/arXiv.2010.11929.
[31] G. Bertasius, H. Wang, L. Torresani, Is space–time attention all you need for video
understanding?, arXiv preprint arXiv:2102.05095 (2021). doi:10.48550/arXiv.2102.05095.
[32] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, C. Schmid, ViViT: a video vision
transformer, arXiv preprint arXiv:2103.15691 (2021). doi:10.48550/arXiv.2103.15691.
[33] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, C. Feichtenhofer, Multiscale vision
transformers, arXiv preprint arXiv:2104.11227 (2021). doi:10.48550/arXiv.2104.11227.
[34] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: what we know about how
BERT works, Trans. Assoc. Comput. Linguist. 8 (2020) 842–866. doi:10.1162/tacl_a_00349.
[35] L. Zhu, Y. Yang, ActBERT: learning global–local video–text representations, in: Proceedings of
the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR ’2020,
IEEE, New York, NY, 2020, pp. 8743–8752. doi:10.1109/CVPR42600.2020.00877.
[36] C. Wei, H. Fan, S. Xie, C.-Y. Wu, A. Yuille, C. Feichtenhofer, Masked feature prediction for
self-supervised visual pre-training, arXiv preprint arXiv:2112.09133 (2023).
doi:10.48550/arXiv.2112.09133.
[37] T. Zhang, C. Xu, G. Zhu, S. Liu, H. Lu, A generic framework for video annotation via
semi-supervised learning, IEEE Trans. Multimedia 14 (2012) 1206–1219.
doi:10.1109/TMM.2012.2191944.
[38] L. Song, G. Yin, B. Liu, Y. Zhang, N. Yu, FSFT-Net: face transfer video generation with
few-shot views, in: Proceedings of the 2021 IEEE International Conference on Image Processing,
ICIP ’2021, IEEE, New York, NY, 2021, pp. 3582–3586. doi:10.1109/ICIP42928.2021.9506512.
[39] K. Chen, M. Hu, An automatic video tag extraction method based on large language model text
content parsing, in: 2024 7th International Conference on Data Science and Information
Technology (DSIT), 2024, pp. 1–4. doi:10.1109/DSIT61374.2024.10881284.
[40] W. Wen, Y. Wang, N. Birkbeck, B. Adsumilli, An ensemble approach to short-form video
quality assessment using multimodal LLM, in: Proceedings of the 2025 IEEE International
Conference on Acoustics, Speech and Signal Processing, ICASSP ’2025, IEEE, New York, NY,
2025, pp. 1–5. doi:10.1109/ICASSP49660.2025.10888524.
[41] Q. Zhao, S. Wang, C. Zhang, C. Fu, M. Q. Do, N. Agarwal, K. Lee, C. Sun, AntGPT: can large
language models help long-term action anticipation from videos?, arXiv preprint
arXiv:2307.16368 (2024). doi:10.48550/arXiv.2307.16368.
[42] D. Surís, S. Menon, C. Vondrick, ViperGPT: visual inference via Python execution for
reasoning, arXiv preprint arXiv:2303.08128 (2023). doi:10.48550/arXiv.2303.08128.
[43] C. Hori, M. Kambara, K. Sugiura, K. Ota, S. Khurana, S. Jain, R. Corcodel, D. Jha, D. Romeres, J.
Le Roux, Interactive robot action replanning using multimodal LLM trained from human
demonstration videos, in: Proceedings of the 2025 IEEE International Conference on Acoustics,
Speech and Signal Processing, ICASSP ’2025, IEEE, New York, NY, 2025, pp. 1–5.
doi:10.1109/ICASSP49660.2025.10887717.
[44] R. Liu, C. Li, Y. Ge, T. H. Li, Y. Shan, G. Li, BT-Adapter: video conversation is feasible without
video instruction tuning, in: Proceedings of the 2024 IEEE/CVF Conference on Computer Vision
and Pattern Recognition, CVPR ’2024, IEEE, New York, NY, 2024, pp. 13658–13667.
doi:10.1109/CVPR52733.2024.01296.
[45] M. Bain, A. Nagrani, G. Varol, A. Zisserman, Frozen in time: a joint video and image encoder
for end-to-end retrieval, in: Proceedings of the 2021 IEEE/CVF International Conference on
Computer Vision, ICCV ’2021, IEEE, New York, NY, 2021, pp. 1708–1718.
doi:10.1109/ICCV48922.2021.00175.
[…] Conference on Advances in Information Technology (ICAIT), vol. 1, 2024, pp. 1–5.
doi:10.1109/ICAIT61638.2024.10690747.
[61] C. Cui, Y. Ma, X. Cao, W. Ye, Y. Zhou, K. Liang, J. Chen, J. Lu, Z. Yang, K.-D. Liao, T. Gao, E. Li,
K. Tang, Z. Cao, T. Zhou, A. Liu, X. Yan, S. Mei, J. Cao, Z. Wang, C. Zheng, A survey on
multimodal large language models for autonomous driving, in: Proceedings of the 2024
IEEE/CVF Winter Conference on Applications of Computer Vision Workshops,
WACVW ’2024, IEEE, New York, NY, 2024, pp. 958–979.
doi:10.1109/WACVW60836.2024.00106.
[62] Y. Chen, J. Arkin, Y. Zhang, N. Roy, C. Fan, Scalable multi-robot collaboration with large
language models: centralized or decentralized systems?, in: Proceedings of the 2024 IEEE
International Conference on Robotics and Automation, ICRA ’2024, IEEE, New York, NY, 2024,
pp. 4311–4317. doi:10.1109/ICRA57147.2024.10610676.
[63] H. Qi, L. Dai, W. Chen, Z. Jia, X. Lu, Performance characterization of large language models on
high-speed interconnects, in: Proceedings of the 2023 IEEE Symposium on High-Performance
Interconnects, HOTI ’2023, IEEE, New York, NY, 2023, pp. 53–60.
doi:10.1109/HOTI59126.2023.00022.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M. H.</given-names>
            <surname>Tiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          , S. Hoi, InstructBLIP: towards
          <article-title>general-purpose vision-language models with instruction tuning</article-title>
          ,
          <source>arXiv preprint arXiv:2305.06500</source>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2305.06500.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Argaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Heilbron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Deilamsalehy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <article-title>Scaling up video summarization pretraining with large language models</article-title>
          ,
          <source>in: Proceedings of the 2024 IEEE/CVF Conference on Computer Vision</source>
          and Pattern Recognition, CVPR '
          <year>2024</year>
          , IEEE, New York, NY,
          <year>2024</year>
          , pp.
          <fpage>8332</fpage>
          -
          <lpage>8341</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR52733.
          <year>2024</year>
          .
          <volume>00796</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , P. Luo,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Video understanding with large language models: a survey</article-title>
          ,
          <source>arXiv preprint arXiv:2312.17432</source>
          (
          <year>2025</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2312.17432.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] OpenAI, GPT-4
          <source>technical report, arXiv preprint arXiv:2303.08774</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Qingnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiaodong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Meixin</surname>
          </string-name>
          , W. Xiangqing,
          <article-title>Video content understanding based on spatio-temporal feature extraction and pruning network</article-title>
          ,
          <source>in: Proceedings of the 2023 4th International Symposium on Computer Engineering and Intelligent Communications, ISCEIC '</source>
          <year>2023</year>
          , IEEE, New York, NY,
          <year>2023</year>
          , pp.
          <fpage>493</fpage>
          -
          <lpage>497</lpage>
          . doi:
          <volume>10</volume>
          .1109/ISCEIC59030.
          <year>2023</year>
          .
          <volume>10271098</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , M.
          <article-title>-</article-title>
          <string-name>
            <surname>A. Lachaux</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Rozière</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Hambro</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Azhar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave, G. Lample,
          <article-title>LLaMA: open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <article-title>Surveillance video-and-language understanding: from small to large multimodal models</article-title>
          ,
          <source>IEEE Trans. Circuits Syst. Video Technol</source>
          .
          <volume>35</volume>
          (
          <year>2025</year>
          )
          <fpage>300</fpage>
          -
          <lpage>314</lpage>
          . doi:
          <volume>10</volume>
          .1109/TCSVT.
          <year>2024</year>
          .
          <volume>3462433</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lindeberg</surname>
          </string-name>
          ,
          <article-title>Scale invariant feature transform</article-title>
          ,
          <source>Scholarpedia</source>
          <volume>7</volume>
          (
          <year>2012</year>
          )
          <article-title>10491</article-title>
          . doi:
          <volume>10</volume>
          .4249/scholarpedia.10491.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. Van Gool</surname>
          </string-name>
          ,
          <article-title>Speeded-up robust features (SURF)</article-title>
          ,
          <source>Comput. Vis. Image Underst</source>
          .
          <volume>110</volume>
          (
          <year>2008</year>
          )
          <fpage>346</fpage>
          -
          <lpage>359</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.cviu.
          <year>2007</year>
          .
          <volume>09</volume>
          .014.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ataallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Abdelrahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sleiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoseiny</surname>
          </string-name>
          ,
          MiniGPT4-Video:
          <article-title>advancing multimodal LLMs for video understanding with interleaved visual-textual tokens</article-title>
          ,
          <source>arXiv preprint arXiv:2404.03413</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2404.03413.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <article-title>VideoChat: chat-centric video understanding</article-title>
          ,
          <source>arXiv preprint arXiv:2305.06355</source>
          (
          <year>2024</year>
          ). doi: 10.48550/arXiv.2305.06355.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bing</surname>
          </string-name>
          ,
          <article-title>Video-LLaMA: an instruction-tuned audio-visual language model for video understanding</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          Association for Computing Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>543</fpage>
          -
          <lpage>553</lpage>
          . doi: 10.18653/v1/2023.emnlp-demo.49.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yilmazer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karakose</surname>
          </string-name>
          ,
          <article-title>LLM-based video analytics test scenario generation in smart cities</article-title>
          ,
          <source>in: 2025 29th International Conference on Information Technology (IT)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . doi: 10.1109/IT64745.2025.10930297.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>VerilogReader: LLM-aided hardware test generation</article-title>
          ,
          <source>in: Proceedings of the 2024 IEEE LLM Aided Design Workshop</source>
          , LAD '24
          , IEEE, New York, NY,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi: 10.1109/LAD62341.2024.10691801.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Exploring the potential of ChatGPT in automated code refinement: an empirical study</article-title>
          ,
          <source>in: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering</source>
          , ICSE '24, Association for Computing Machinery, New York, NY,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . doi: 10.1145/3597503.3623306.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>S.</given-names>
            <surname>Munasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Thushara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Rasheed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>PG-Video-LLaVA: pixel grounding large video-language models</article-title>
          ,
          <source>arXiv preprint arXiv:2311.13435</source>
          (
          <year>2023</year>
          ). doi: 10.48550/arXiv.2311.13435.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Azarnasab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>MM-VID: advancing video understanding with GPT-4V(ision)</article-title>
          ,
          <source>arXiv preprint arXiv:2310.19773</source>
          (
          <year>2023</year>
          ). doi: 10.48550/arXiv.2310.19773.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Q.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>VAMOS: versatile action models for video understanding</article-title>
          ,
          <source>arXiv preprint arXiv:2311.13627</source>
          (
          <year>2024</year>
          ). doi: 10.48550/arXiv.2311.13627.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>VTimeLLM: empower LLM to grasp video moments</article-title>
          ,
          <source>in: Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          , CVPR '24
          , IEEE, New York, NY,
          <year>2024</year>
          , pp.
          <fpage>14271</fpage>
          -
          <lpage>14280</lpage>
          . doi: 10.1109/CVPR52733.2024.01353.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shechtman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Localizing moments in video with natural language</article-title>
          ,
          <source>in: Proceedings of the 2017 IEEE International Conference on Computer Vision</source>
          , ICCV '17
          , IEEE, New York, NY,
          <year>2017</year>
          , pp.
          <fpage>5804</fpage>
          -
          <lpage>5813</lpage>
          . doi: 10.1109/ICCV.2017.618.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>DocVideoQA: towards comprehensive understanding of document-centric videos through question answering</article-title>
          ,
          <source>in: Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          , ICASSP '25
          , IEEE, New York, NY,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi: 10.1109/ICASSP49660.2025.10887668.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Alayrac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Laptev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lacoste-Julien</surname>
          </string-name>
          ,
          <article-title>Unsupervised learning from narrated instruction videos</article-title>
          ,
          <source>in: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , CVPR '16
          , IEEE, New York, NY,
          <year>2016</year>
          , pp.
          <fpage>4575</fpage>
          -
          <lpage>4583</lpage>
          . doi: 10.1109/CVPR.2016.495.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Research on logical understanding of video surveillance systems based on knowledge graphs</article-title>
          ,
          <source>in: 2023 4th International Conference on Computer, Big Data and Artificial Intelligence (ICCBD+AI)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>617</fpage>
          -
          <lpage>622</lpage>
          . doi: 10.1109/ICCBD-AI62252.2023.00113.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>A.</given-names>
            <surname>Senthilselvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Prawin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Harshit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Santhosh</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Senthil</given-names>
            <surname>Pandi</surname>
          </string-name>
          ,
          <article-title>Abstractive summarization of YouTube videos using Lamini-FLAN-T5 LLM</article-title>
          , in: 2024 Second International
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>