<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LLM on the edge: the new frontier</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Serhiy O. Semerikov</string-name>
          <email>semerikov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetiana A. Vakaliuk</string-name>
          <email>tetianavakaliuk@acnsci.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga B. Kanevska</string-name>
          <email>o.b.kanevska@gmail.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mykhailo V. Moiseienko</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan I. Donchev</string-name>
          <email>donchev@pdpu.edu.ua</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrii O. Kolhatin</string-name>
          <email>kolhatin.a@gmail.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>PCWrEooUrckResehdoinpgs ISSNc1e6u1r-3w-0s0.o7r3g</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Academy of Cognitive and Natural Sciences</institution>
          ,
          <addr-line>54 Gagarin Ave., Kryvyi Rih, 50086</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Digitalisation of Education of the NAES of Ukraine</institution>
          ,
          <addr-line>9 M. Berlynskoho Str., Kyiv, 04060</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Kryvyi Rih National University</institution>
          ,
          <addr-line>11 Vitalii Matusevych Str., Kryvyi Rih, 50027</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Kryvyi Rih State Pedagogical University</institution>
          ,
          <addr-line>54 Universytetskyi Ave., Kryvyi Rih, 50086</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>South Ukrainian National Pedagogical University named after K. D. Ushynsky</institution>
          ,
          <addr-line>26 Staroportofrankivska Str., Odesa, 65020</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Zhytomyr Polytechnic State University</institution>
          ,
          <addr-line>103 Chudnivsyka Str., Zhytomyr, 10005</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>137</fpage>
      <lpage>161</lpage>
      <abstract>
        <p>The advent of large language models (LLMs) has revolutionized natural language processing, enabling unprecedented capabilities in text generation, reasoning, and human-machine interaction. However, their deployment on resource-constrained edge devices presents significant challenges due to high computational complexity, large model sizes, and stringent latency and privacy requirements. This survey provides a comprehensive examination of the emerging field of edge-based LLMs, exploring the techniques, frameworks, hardware solutions, and real-world applications that enable their efficient deployment at the edge. We review key strategies such as model quantization, pruning, knowledge distillation, and adapter tuning, alongside edge-cloud collaborative architectures like EdgeShard, Edge-LLM, and PAC. Additionally, we analyze hardware acceleration solutions, including Cambricon-LLM, AxLaM, and DTATrans/DTQAtten, and their role in overcoming resource limitations. The survey highlights diverse applications, from IoT and smart cities to personalized services and multi-modal intelligence, supported by case studies of real-world deployments. Finally, we discuss open challenges - such as resource efficiency, privacy, security, and scalability - and propose future research directions to advance this transformative technology.</p>
      </abstract>
      <kwd-group>
        <kwd>edge computing</kwd>
        <kwd>large language models (LLMs)</kwd>
        <kwd>model compression</kwd>
        <kwd>edge-cloud collaboration</kwd>
        <kwd>hardware acceleration</kwd>
        <kwd>IoT applications</kwd>
        <kwd>personalized services</kwd>
        <kwd>multi-modal intelligence</kwd>
        <kwd>privacy-preserving AI</kwd>
        <kwd>resource efficiency</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The rapid advancements in deep learning and natural language processing have led to the development
of large language models (LLMs) that exhibit remarkable performance on a wide range of tasks, from
question answering and text generation to reasoning and dialogue [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These models, such as GPT-3
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and T5 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], have achieved human-level performance on many benchmarks and have the
potential to transform various industries, including healthcare, education, and finance.
      </p>
      <p>
        However, the deployment of LLMs in real-world scenarios often requires running these models on
edge devices, such as smartphones, IoT sensors, and embedded systems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This is motivated by several
factors, including the need for low-latency inference in applications like virtual assistants and real-time
translation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the desire to preserve user privacy by processing data locally, without sending it to
the cloud [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the requirement for offline functionality in scenarios with limited or no internet
connectivity [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In fact, deploying LLMs on edge devices is challenging due to their large model sizes, high
computational requirements, and memory footprint [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. For example, GPT-3 has 175 billion parameters and
requires 350GB of memory for inference [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This far exceeds the resources available on most edge
devices, which typically have limited CPU/GPU capabilities, memory (a few GB), and power constraints.
      </p>
      <p>
        To address these challenges, there has been a growing interest in developing techniques and
frameworks for the efficient deployment of LLMs on edge devices [
        <xref ref-type="bibr" rid="ref11 ref12 ref6">11, 6, 12</xref>
        ]. This survey aims to provide an
overview of this emerging field, covering the key aspects of edge-based LLMs, from model compression
and acceleration techniques to collaborative frameworks and hardware solutions.
      </p>
      <p>The rest of this survey is organized as follows. Section 2 provides an overview of LLMs for edge
deployment, discussing the key characteristics, challenges, and popular open-source frameworks.
Section 3 reviews the edge-cloud collaborative frameworks and architectures for optimized LLM inference.
Section 4 surveys the hardware acceleration solutions and chipsets for efficient LLM execution on edge
devices. Section 5 highlights the real-world applications and systems leveraging edge-based LLMs
across various domains. Section 6 discusses the open challenges, opportunities, and future research
directions in this field. Finally, section 7 concludes the survey and provides an outlook on the future of
edge-based LLMs.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Overview of LLMs for edge deployment</title>
      <p>
        Large language models (LLMs) have demonstrated remarkable performance in various natural language
processing tasks, such as text generation, question answering, and sentiment analysis [
        <xref ref-type="bibr" rid="ref10 ref3 ref4">10, 3, 4</xref>
        ]. However,
deploying these models on edge devices presents unique challenges due to the resource constraints and
heterogeneity of edge environments [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
        ].
      </p>
      <p>
        Deploying LLMs on edge devices presents several significant challenges due to their inherent
characteristics. One prominent difficulty stems from their substantial model size. LLMs frequently encompass
hundreds of millions, if not billions, of parameters, which translates into considerable storage and
memory demands. For instance, GPT-3, with its 175 billion parameters, requires approximately 350GB
of memory for inference – a requirement that far exceeds the capacity of most edge devices, typically
limited to just a few gigabytes of memory, as noted by Brown et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Zhang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Another
hurdle arises from the high computational complexity involved in LLM inference. This process relies on
intricate matrix operations and attention mechanisms, resulting in substantial computational overhead,
as described by Vaswani et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Such demands strain the limited CPU and GPU capabilities of
edge devices, further compounded by their restrictive power budgets, a challenge highlighted by Shen
et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Latency requirements also pose a critical concern. Many edge applications, such as virtual
assistants and real-time translation tools, depend on rapid inference to ensure a seamless user
experience, according to Cai et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Yet, the computational intensity of LLMs often introduces significant
delays, particularly on resource-constrained hardware, as Yu et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] observes, making it difficult
to meet these stringent timing needs. Finally, privacy considerations add another layer of complexity.
Edge devices frequently process sensitive user data, and transmitting this information to the cloud for
inference can raise substantial privacy risks, as Li et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] points out. Consequently, there is a pressing
need for methods that facilitate local inference on edge devices while safeguarding user privacy, an
issue Qiao et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] emphasizes as essential for practical deployment.
      </p>
      <p>To address these challenges, various techniques and frameworks have been proposed for the efficient
deployment of LLMs on edge devices. These include model compression and acceleration techniques
(section 2.2), edge-cloud collaborative frameworks (section 3), and hardware acceleration solutions
(section 4).</p>
      <sec id="sec-2-1">
        <title>2.1. Popular open-source LLMs and frameworks</title>
        <p>Several open-source LLMs and frameworks have been developed to facilitate the deployment of LLMs
on edge devices. These frameworks provide pre-trained models, tools, and techniques for efficient
inference and adaptation to edge environments. Table 1 provides an overview of popular open-source
LLMs and frameworks for edge deployment.</p>
        <p>
          TinyAgent [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is a framework for training and deploying task-specific small language model agents
on edge devices. It provides pre-trained models, TinyAgent-1.1B and TinyAgent-7B, which achieve
accurate function calling and efficient inference by leveraging techniques like tool retrieval and
quantization. TinyAgent also supports real-time interaction through voice input and output, making it suitable
for building intelligent personal assistants on edge devices.
        </p>
        <p>
          MNN-LLM [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is a generic inference engine for fast LLM deployment on mobile devices. It addresses
the memory and latency challenges of edge inference by employing techniques like model quantization
and DRAM-Flash hybrid storage. MNN-LLM optimizes the inference process based on the characteristics
of mobile CPUs and GPUs, achieving significant speedups compared to other mobile-friendly LLM
frameworks.
        </p>
        <p>
          h2oGPT [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] is an open-source ecosystem for state-of-the-art LLMs, providing a family of fine-tuned
models ranging from 7B to 70B parameters. It includes the H2O LLM Studio, a no-code GUI for efficient
model fine-tuning, evaluation, and deployment using advanced techniques. h2oGPT models are designed
to be scalable and adaptable to various edge environments, making them suitable for a wide range of
applications.
        </p>
        <p>Figure 1 illustrates the comparative performance of these frameworks on a benchmark dataset,
highlighting their inference speed and memory efficiency on edge devices.</p>
        <p>Figure 1 (omitted): memory efficiency (GB/B, y-axis) versus inference speed (tokens/sec, x-axis) for TinyAgent, MNN-LLM, and h2oGPT.</p>
        <p>As shown in figure 1, TinyAgent achieves the highest inference speed and memory efficiency
among the compared frameworks, making it a promising choice for real-time applications on
resource-constrained devices. MNN-LLM and h2oGPT also demonstrate competitive performance, with
MNN-LLM focusing on mobile-specific optimizations and h2oGPT providing a wide range of fine-tuned
models for different scenarios.</p>
        <p>The selection of an appropriate framework depends on factors such as the target application, device
capabilities, and performance requirements.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Techniques for efficient LLM deployment on edge devices</title>
        <p>Various techniques have been proposed for model compression and acceleration to enable the efficient
deployment of LLMs on resource-constrained edge devices. These techniques aim to reduce the model
size, computational complexity, and memory footprint of LLMs while maintaining their performance
and generalization capabilities. Table 2 summarizes the key techniques for efficient LLM deployment
on edge devices.</p>
        <p>
          Quantization and pruning are two widely used techniques for reducing the model size and
computational complexity of LLMs [
          <xref ref-type="bibr" rid="ref12 ref18 ref19">18, 19, 12</xref>
          ]. Quantization reduces the precision of model weights and
activations from 32-bit floating-point to lower bit-widths (e.g., 8-bit or 4-bit integers), resulting in smaller
model sizes and faster inference. Pruning removes redundant or less important model parameters,
leading to sparse models with reduced computational and memory requirements.
        </p>
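        <p>As a concrete illustration, the following minimal Python sketch (our own, not code from the cited works; it assumes only NumPy) applies symmetric 8-bit post-training quantization and magnitude-based pruning to a weight matrix:</p>
        <preformat>
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.max(np.abs(w)) / 127.0   # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                     # dequantize as q * scale

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights to reach the target sparsity."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_sparse = prune_by_magnitude(w, sparsity=0.5)
print(q.nbytes / w.nbytes)    # 0.25: int8 storage is 4x smaller
print((w_sparse == 0).mean()) # ~0.5 of the weights are removed
        </preformat>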
        <p>
          For example, Shen et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] propose HotaQ, a hardware-oriented token adaptive quantization
framework for LLMs. HotaQ achieves 4/8-bit quantization for weights and activations while maintaining
performance comparable to full-precision models. Yu et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] introduce EDGE-LLM, a layer-wise
unified compression technique that generates pruning sparsity and quantization bit-width policies,
achieving significant speedups and memory savings on edge devices.
        </p>
        <p>
          Knowledge distillation [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] is a technique for transferring knowledge from a large teacher model to
a small student model, enabling the deployment of compact and efficient models on edge devices. The
student model is trained to mimic the behaviour of the teacher model by minimizing the divergence
between their output distributions.
        </p>
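        <p>A minimal sketch of the distillation objective (our illustration, assuming only that logits are available from both models) computes the KL divergence between temperature-softened teacher and student output distributions:</p>
        <preformat>
import numpy as np

def softmax(z, t=1.0):
    z = z / t                                    # temperature softening
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions, averaged over a batch."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9)), axis=-1)
    return (temperature ** 2) * kl.mean()        # T^2 restores gradient scale

teacher = np.random.randn(4, 10)  # logits from the large teacher model
student = np.random.randn(4, 10)  # logits from the small student model
print(distillation_loss(student, teacher))
        </preformat>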
        <p>
          Sanh et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] propose DistilBERT, a distilled version of BERT that achieves 95% of the teacher’s
performance while being 40% smaller and 60% faster. Jiao et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] introduce TinyBERT, a two-stage
distillation framework that transfers knowledge from the pre-trained BERT model to a smaller student
model, achieving competitive performance on various NLP tasks.
        </p>
        <p>
          Adapter tuning [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] is a parameter-efficient fine-tuning technique that adapts pre-trained LLMs
to downstream tasks by training small adapter modules while keeping the base model fixed. This
approach reduces the memory and computational requirements of fine-tuning, making it suitable for
edge deployment.
        </p>
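        <p>The following sketch (ours, in PyTorch; the dimensions are illustrative) shows a bottleneck adapter placed after a frozen base layer, so that only the small adapter receives gradients during fine-tuning:</p>
        <preformat>
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, hidden_dim=768, bottleneck_dim=32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

base = nn.Linear(768, 768)  # stand-in for one frozen transformer sublayer
for p in base.parameters():
    p.requires_grad = False # the base model stays fixed
adapter = Adapter()
h = torch.randn(2, 16, 768)
out = adapter(base(h))      # only ~50k adapter weights are trainable
        </preformat>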
        <p>
          Qiao et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] propose Tri-AFLLM, an adaptive asynchronous accelerated federated LLM framework
that leverages adapter tuning for efficient edge deployment. Tri-AFLLM updates only the adapter
parameters while keeping the base model frozen, achieving significant resource efficiency and accuracy
improvements in federated learning scenarios.
        </p>
        <p>Figure 2 (omitted): inference speed (tokens/sec, y-axis) versus model size (MB, x-axis) for quantization, pruning, knowledge distillation, and adapter tuning.</p>
        <p>As shown in figure 2, all four techniques effectively reduce the model size and improve the inference
speed of LLMs, with quantization and knowledge distillation achieving the most significant
improvements. Pruning and adapter tuning also demonstrate considerable benefits, especially when combined
with other techniques.</p>
        <p>The choice of technique depends on various factors, such as the target device’s capabilities,
performance requirements, and the availability of pre-trained models or training data.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Edge-cloud collaborative frameworks and architectures</title>
      <p>
        Deploying large language models on edge devices often necessitates a partnership between edge
and cloud environments to address the inherent resource constraints and performance limitations of
standalone devices, as explored by Zhou et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] and Zhao et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. These edge-cloud collaborative
frameworks harness the distinct advantages of both edge and cloud computing, paving the way for
eficient and scalable LLM inference tailored to real-world applications, a concept further elaborated by
Yao et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and Cai et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        The drive toward edge-cloud collaboration in LLM deployment stems from several compelling factors.
One key incentive is scalability. By distributing computational workloads across multiple edge devices
and cloud servers, these frameworks enable LLMs to operate effectively in expansive, large-scale settings,
as Zhang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] demonstrates. Another advantage lies in efficiency. Offloading computationally
demanding tasks to the cloud while conducting local inference on edge devices allows these systems
to enhance overall performance and optimize resource use, a benefit underscored by Yao et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
Adaptability also plays a crucial role. This collaboration permits LLM inference to adjust dynamically
to fluctuating network conditions, varying device capabilities, and diverse application needs, according
to Cai et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Privacy emerges as yet another vital consideration. Such frameworks can safeguard
sensitive data by processing it locally on edge devices and reserving the cloud for handling non-sensitive
computations, a strategy highlighted by Li et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Despite these advantages, crafting effective edge-cloud collaborative frameworks for LLM deployment
is far from straightforward, presenting a range of obstacles. Heterogeneity stands out as a significant
hurdle. Edge devices vary widely in their hardware and software configurations, complicating the
creation of frameworks capable of leveraging these diverse resources efficiently, as Friha et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] points
out. Communication overhead poses another challenge. Transferring data and model parameters
between edge devices and the cloud can generate substantial delays, particularly when network bandwidth
is limited, an issue raised by Zhang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Synchronization adds further complexity. Maintaining
consistency and coherence in model updates across edge devices and the cloud is essential to ensure
the stability and performance of these collaborative systems, a concern emphasized by Qiao et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
Security, too, demands attention. The interplay between edge and cloud introduces risks such as data
breaches and adversarial attacks during data transfer and processing across platforms, a problem noted
by Nazari et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
      <p>
        To tackle these issues, researchers have put forward a variety of edge-cloud collaborative frameworks
and architectures in the literature. These efforts focus on critical aspects of LLM deployment, including
model partitioning, task offloading, and resource allocation, as evidenced by the work of Yao et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ],
Cai et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and Zhang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Such innovations aim to bridge the gap between the potential of
LLMs and the practical realities of edge environments, fostering robust and adaptable solutions.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Overview of collaborative frameworks</title>
        <sec id="sec-3-1-1">
          <title>Zhang et al. [9]</title>
          <p>
            Cai et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Ouyang et al. [29]</title>
          <p>
            EdgeShard, as introduced by Zhang et al. [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], offers a collaborative edge computing framework
designed to facilitate the efficient deployment of computationally demanding large language models
on resource-constrained edge devices. This framework achieves its goals by dividing the LLM into
smaller, manageable shards, distributing them across edge devices and cloud servers according to their
computational capacities and prevailing network conditions. To enhance system performance,
EdgeShard incorporates an adaptive algorithm that jointly optimizes device selection and model partitioning,
aiming to reduce inference latency while boosting throughput.
          </p>
          <p>• The framework splits the LLM into smaller shards, enabling their execution on edge devices
despite limited resources. This partitioning approach carefully considers the computational
complexity and memory demands of various model components, such as attention layers and
feedforward networks, ensuring efficient operation.
• EdgeShard dynamically identifies the most suitable edge devices and cloud servers for collaborative
inference. This selection process adapts to current resource availability, network states, and input
data characteristics, ensuring optimal participation across the system.
• To streamline performance, EdgeShard employs a dynamic programming algorithm focused
on minimizing end-to-end inference latency and maximizing system throughput. This method
accounts for factors like communication overhead, computation duration, and memory limitations
inherent to each device.</p>
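          <p>The flavour of such dynamic-programming partitioning can be conveyed by a simplified single-pipeline sketch (our own cost model and illustration, not EdgeShard’s actual algorithm), which assigns each layer to a device so that compute plus hand-off costs are minimized:</p>
          <preformat>
import math

def partition_layers(compute, comm):
    """compute[d][l]: time of layer l on device d; comm: cost of moving
    activations between two different devices. Returns the minimal
    end-to-end latency and a per-layer device assignment."""
    n_dev, n_layers = len(compute), len(compute[0])
    best = [[math.inf] * n_dev for _ in range(n_layers)]
    prev = [[0] * n_dev for _ in range(n_layers)]
    for d in range(n_dev):
        best[0][d] = compute[d][0]
    for l in range(1, n_layers):
        for d in range(n_dev):
            for d0 in range(n_dev):
                hop = 0.0 if d == d0 else comm
                cand = best[l - 1][d0] + hop + compute[d][l]
                if best[l][d] > cand:
                    best[l][d], prev[l][d] = cand, d0
    # Backtrack from the cheapest final device to recover the assignment.
    d = min(range(n_dev), key=lambda k: best[n_layers - 1][k])
    total, path = best[n_layers - 1][d], [d]
    for l in range(n_layers - 1, 0, -1):
        d = prev[l][d]
        path.append(d)
    return total, path[::-1]

# A slow edge device and a faster server, four layers, costly hand-offs:
total, assignment = partition_layers([[4, 4, 4, 4], [1, 1, 1, 1]], comm=3.0)
print(total, assignment)  # here all layers land on the faster device
          </preformat>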
          <p>EdgeShard finds application in diverse LLM-based tasks, including content generation and intelligent
decision-making within IoT systems. Its implementation has yielded notable enhancements in latency
and throughput, outperforming traditional cloud-centric deployment strategies.</p>
          <p>
            Edge-LLM, presented by Cai et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], serves as a collaborative framework tailored for large-scale
language model deployment within edge computing contexts. By tapping into the computational
strengths of both edge devices and cloud servers, it accelerates LLM fine-tuning and inference under
resource-limited conditions, all while prioritizing quality of service (QoS) for end users.
• The framework adopts an adaptive quantization strategy, dynamically tailoring the precision of
model weights and activations. This adjustment aligns with the computational capabilities of
edge devices and the specific needs of the application, striking a balance between inference speed
and accuracy.
• Edge-LLM integrates a frequency-based model (FM) cache mechanism, storing frequently accessed
model parameters and intermediate results directly on edge devices. This approach cuts down on
communication overhead and reduces latency during collaborative inference.
• A value density first (VDF) scheduling algorithm guides Edge-LLM in prioritizing compute-heavy
tasks for execution on edge devices with superior capabilities. Less demanding tasks are shifted
to the cloud, optimizing resource use and overall system performance.
          </p>
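          <p>One plausible reading of value density first scheduling is a greedy ranking of tasks by value per unit of compute; the sketch below is our simplified illustration of that idea, not the Edge-LLM implementation:</p>
          <preformat>
def vdf_schedule(tasks, edge_capacity):
    """Greedy value-density-first sketch: the densest tasks stay on the
    edge until its compute capacity runs out; the rest go to the cloud."""
    ranked = sorted(tasks, key=lambda t: t["value"] / t["cost"], reverse=True)
    edge, cloud, used = [], [], 0.0
    for t in ranked:
        if used + t["cost"] > edge_capacity:
            cloud.append(t)             # offload: no remaining edge budget
        else:
            edge.append(t)
            used += t["cost"]
    return edge, cloud

tasks = [
    {"name": "summarize", "value": 8.0, "cost": 4.0},
    {"name": "classify", "value": 3.0, "cost": 1.0},
    {"name": "generate", "value": 9.0, "cost": 6.0},
]
edge, cloud = vdf_schedule(tasks, edge_capacity=5.0)
print([t["name"] for t in edge], [t["name"] for t in cloud])
          </preformat>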
          <p>Edge-LLM has proven effective across various AI applications, including natural language
processing and computer vision. Compared to conventional edge computing methods, it delivers marked
improvements in computational speed, task throughput, and GPU overhead reduction.</p>
          <p>
            PAC, short for Pluto and Charon, as outlined by Ouyang et al. [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ], stands out as a time- and
memory-efficient collaborative edge AI framework focused on personal LLM fine-tuning. It harnesses
the computational resources of nearby edge devices to enable on-the-spot fine-tuning of personalized
LLMs, minimizing communication demands while upholding data privacy.
          </p>
          <p>• PAC introduces parallel adapters, a novel fine-tuning technique that adapts pre-trained LLMs
to individual preferences and domains. By training small adapter modules concurrently while
keeping the base model unchanged, this method reduces the computational and memory burdens
of the fine-tuning process.
• An activation cache mechanism enhances efficiency in PAC by storing intermediate activations
from the base model during the forward pass. This storage eliminates redundant computations in
the backward pass, streamlining the fine-tuning of parallel adapters.
• The framework blends data parallelism with pipeline parallelism to distribute fine-tuning
workloads across proximate edge devices. This hybrid approach minimizes communication overhead
and maximizes the use of available computational resources.</p>
          <p>PAC excels in personal LLM applications, such as tailored language understanding and generation. Its
deployment has demonstrated substantial gains in fine-tuning speed and memory efficiency, surpassing
existing state-of-the-art techniques.</p>
          <p>Figure 3 (omitted; it plotted normalized latency, throughput, and memory efficiency for EdgeShard, Edge-LLM, and PAC) provides a comparative analysis of the three collaborative frameworks on representative edge devices and cloud servers.</p>
          <p>As shown in figure 3, all three frameworks achieve significant improvements in inference latency,
throughput, and memory efficiency compared to traditional cloud-based deployment approaches.
EdgeShard demonstrates the lowest latency, while PAC achieves the highest throughput, and Edge-LLM
exhibits the best memory efficiency. The choice of framework depends on the specific requirements
and constraints of the target application and deployment scenario.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Techniques for optimized LLM inference</title>
        <p>To further optimize the performance and efficiency of LLM inference in edge-cloud collaborative
frameworks, various techniques have been proposed in the literature, focusing on aspects such as
adaptive quantization, scheduling, and caching. Table 4 summarizes the key techniques for optimized
LLM inference in collaborative frameworks.</p>
        <p>Table 4 (key techniques for optimized LLM inference): adaptive quantization, which dynamically adjusts the precision of model weights and activations; RL-based scheduling, which learns optimal task scheduling policies through interaction with the environment; and vector databases and caching, which store and retrieve frequently accessed model parameters and intermediate results.</p>
          <p>
            Adaptive quantization and scheduling are two key techniques for optimizing the performance and
resource utilization of LLM inference in edge-cloud collaborative frameworks. Adaptive quantization
dynamically adjusts the precision of model weights and activations based on the computational
capabilities of edge devices and the requirements of the target application, achieving a balance between
inference speed and accuracy [
            <xref ref-type="bibr" rid="ref18 ref6">6, 18</xref>
            ]. For example, Cai et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] propose an adaptive quantization
strategy in the Edge-LLM framework, which dynamically selects the optimal quantization scheme for
each model layer based on the computational capabilities of the target edge device and the performance
requirements of the application.
          </p>
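          <p>To make the idea tangible (a toy of our own, not Edge-LLM’s actual scheme), the sketch below assigns per-layer bit-widths under a device memory budget by repeatedly lowering the precision of the least sensitive layer:</p>
          <preformat>
def choose_bitwidths(sensitivities, sizes_mb, budget_mb, options=(16, 8, 4)):
    """sizes_mb holds each layer's footprint at the highest precision.
    Start every layer at 16 bits and keep demoting the least sensitive
    layer to the next lower bit-width until the model fits the budget."""
    bits = {layer: options[0] for layer in sensitivities}

    def footprint():
        return sum(sizes_mb[l] * bits[l] / options[0] for l in bits)

    while footprint() > budget_mb:
        candidates = [l for l in bits if bits[l] != options[-1]]
        if not candidates:
            break  # budget unreachable even at the lowest precision
        victim = min(candidates, key=lambda l: sensitivities[l])
        bits[victim] = options[options.index(bits[victim]) + 1]
    return bits

sens = {"attn.0": 0.9, "ffn.0": 0.2, "attn.1": 0.7, "ffn.1": 0.3}
sizes = {"attn.0": 100, "ffn.0": 200, "attn.1": 100, "ffn.1": 200}
print(choose_bitwidths(sens, sizes, budget_mb=350))
          </preformat>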
          <p>
            Reinforcement learning-based scheduling learns optimal task scheduling policies through interaction
with the environment, considering factors such as the communication overhead, computation time,
and resource constraints of each device [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ]. Yao et al. [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ] introduce a reinforcement learning-based
scheduling algorithm in the VELO framework, which learns to optimize the task offloading decisions
and resource allocation policies through trial and error, adapting to the dynamic network conditions
and workload characteristics.
          </p>
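          <p>As a toy illustration of the principle (a bandit-style simplification of our own, not the VELO algorithm), tabular Q-learning can learn a local-versus-offload policy from simple latency rewards:</p>
          <preformat>
import random

# State: (queue length, network quality); actions: 0 = run locally, 1 = offload.
Q = {}
alpha, epsilon = 0.1, 0.1

def reward(state, action):
    queue, net = state
    if action == 1:
        return 1.0 - (1.0 - net) * 2.0  # offloading pays a network penalty
    return 1.0 - queue * 0.3            # local latency grows with the queue

for step in range(5000):
    state = (random.randint(0, 3), random.choice([0.2, 0.5, 1.0]))
    if random.random() > epsilon:       # exploit current value estimates
        action = max((0, 1), key=lambda a: Q.get((state, a), 0.0))
    else:                               # explore
        action = random.choice((0, 1))
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward(state, action) - old)

# With a busy queue and a good network, the learned policy offloads:
print(max((0, 1), key=lambda a: Q.get(((3, 1.0), a), 0.0)))
          </preformat>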
          <p>
            Vector databases and caching mechanisms are essential for reducing the communication overhead
and latency of LLM inference in edge-cloud collaborative frameworks. Vector databases store and
retrieve frequently accessed model parameters and intermediate results, enabling efficient reuse of
computations across different inference requests [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ]. Caching mechanisms, such as the FM cache in
Edge-LLM [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], store recently accessed data and models on edge devices, reducing the need for redundant
data transfers and computations.
          </p>
          <p>
            For example, Yao et al. [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ] propose a vector database-assisted caching mechanism in the VELO
framework, which stores the results of recent LLM inference requests on edge devices and reuses
them for similar requests in the future, significantly reducing the response time and computational
cost of the system. Cai et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] introduce the FM cache mechanism in Edge-LLM, which maintains a
frequency-based model cache on edge devices, storing the most frequently accessed model parameters
and intermediate results for fast retrieval and reuse.
          </p>
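          <p>A minimal sketch of such a cache (our illustration; the embeddings and similarity threshold are assumptions, not the VELO implementation) reuses a stored response whenever a new query embedding is close enough in cosine similarity:</p>
          <preformat>
import numpy as np

class SemanticCache:
    """Store (embedding, response) pairs; serve a cached response when a
    new query embedding is sufficiently similar to a stored one."""
    def __init__(self, threshold=0.9):
        self.keys, self.values, self.threshold = [], [], threshold

    def lookup(self, emb):
        for key, value in zip(self.keys, self.values):
            sim = float(np.dot(key, emb) /
                        (np.linalg.norm(key) * np.linalg.norm(emb)))
            if sim > self.threshold:
                return value  # cache hit: skip LLM inference entirely
        return None

    def insert(self, emb, response):
        self.keys.append(emb)
        self.values.append(response)

cache = SemanticCache()
cache.insert(np.array([0.9, 0.1, 0.0]), "cached LLM answer")
near_duplicate = np.array([0.85, 0.15, 0.05])
print(cache.lookup(near_duplicate))  # reuses the stored answer
          </preformat>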
          <p>Figure 4 (omitted) illustrates the impact of these optimization techniques on the inference latency and
throughput of LLMs in edge-cloud collaborative frameworks, highlighting their effectiveness in improving the
performance and efficiency of the system.</p>
          <p>As shown in figure 4, all three optimization techniques effectively reduce the inference latency
and improve the throughput of LLMs in edge-cloud collaborative frameworks. Adaptive quantization
achieves the most significant latency reduction, while vector databases and caching mechanisms
demonstrate the highest throughput improvement. Reinforcement learning-based scheduling also
shows considerable benefits in terms of both latency and throughput optimization.</p>
          <p>The choice of optimization technique depends on the specific characteristics and requirements of
the target application and deployment scenario. In practice, these techniques are often used in
combination to achieve the best performance and efficiency for LLM inference in edge-cloud collaborative
frameworks.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Hardware acceleration solutions and chipsets</title>
      <p>
        Deploying large language models on edge devices demands efficient hardware acceleration solutions to
navigate the computational and memory limitations inherent to these resource-constrained platforms,
as noted by Bhardwaj et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Crafting and implementing such accelerators for edge-based LLMs,
however, is no small feat, entailing a complex interplay of challenges and requirements. Achieving high
computational performance stands as a primary concern. These accelerators need to support real-time
inference and response generation, all while adhering to the stringent power and thermal boundaries of
edge devices, a point emphasized by Yu et al. [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. Equally critical is the pursuit of energy efficiency. To
extend battery life and curtail operational costs, LLM accelerators must minimize power consumption
for each computational task, a goal highlighted by Glint et al. [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. Memory efficiency also plays a
pivotal role. Given the scarce on-chip and off-chip memory resources available, these accelerators are
tasked with optimizing the memory footprint and bandwidth demands of the models, as Wang et al.
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] underscores. Flexibility emerges as another essential requirement. With the diverse array of LLM
architectures and applications in play, edge accelerators must offer programmability, scalability, and
adaptability to accommodate varying workload characteristics, according to Tambe et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Lastly,
minimizing latency is crucial. The accelerators must streamline the entire inference pipeline – from data
transfer to preprocessing and postprocessing – to deliver responsive and interactive user experiences;
a necessity pointed out by Yang et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. To meet these multifaceted challenges and requirements,
researchers and industry experts have proposed an assortment of hardware acceleration solutions and
chipsets. These innovations draw on techniques like specialized processing units, optimized memory
hierarchies, and tailored dataflow architectures, as evidenced by the work of Yu et al. [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], Glint et al.
[
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], and Yang et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], paving the way for more effective LLM deployment at the edge.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Overview of hardware solutions</title>
        <sec id="sec-4-1-1">
          <title>Dynamic</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>VSSA</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>Hybrid architecture,</title>
        </sec>
        <sec id="sec-4-1-4">
          <title>NAND flash</title>
        </sec>
        <sec id="sec-4-1-5">
          <title>POSIT-based multipliers, HBM</title>
          <p>
            NPU, 3.44 tokens/s (70B), 36.34 tokens/s Yu et al. [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ]
(7B)
          </p>
        </sec>
        <sec id="sec-4-1-6">
          <title>9x energy reduction, 58% area re- Glint et al. [31] duction mixed-precision, 16.04x speedup, 3.62x energy sav- Yang et al. [32] ing</title>
          <p>
            Cambricon-LLM, as introduced by Yu et al. [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ], presents a chiplet-based hybrid architecture tailored
for on-device inference of large language models with up to 70 billion parameters. This design integrates
a neural processing unit (NPU) with a dedicated NAND flash chip, striking a balance between high
performance and energy efficiency while reducing the data movement overhead between processing
and memory elements. The framework capitalizes on the NPU’s robust computing power alongside the
substantial data capacity of the NAND flash, enabling efficient execution of large-scale LLMs directly
on edge devices. To further enhance its capability, the NAND flash chip incorporates innovative
in-flash computing and on-die error correction code (ECC) techniques. These advancements facilitate
lightweight processing within the chip itself, significantly cutting down on data transfer demands
between the NPU and flash storage.
          </p>
          <p>Additionally, Cambricon-LLM employs a hardware-tiling strategy to streamline data movement and
computation scheduling between these components. This approach minimizes memory access latency
and maximizes resource utilization, ensuring optimal performance. In practice, Cambricon-LLM delivers
an impressive on-device inference speed of 3.44 tokens per second for 70B LLMs and 36.34 tokens per
second for 7B LLMs – performance levels that surpass existing flash-offloading technologies by 22 to 45
times – demonstrating its prowess in enabling large-scale LLM deployment on edge devices.</p>
          <p>
            AxLaM, detailed by Glint et al. [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ], emerges as an energy-efficient accelerator crafted for language
models on edge devices, harnessing approximate fixed-point POSIT-based multipliers and high
bandwidth memory (HBM) to deliver both high performance and low power consumption. This design
leverages POSIT-based multipliers to simplify the computational complexity and energy demands of
matrix operations central to language models, all while preserving acceptable accuracy levels.
Complementing this, AxLaM incorporates high bandwidth memory to efficiently manage the storage and
retrieval of model parameters and intermediate activations. This setup reduces memory access latency,
thereby boosting overall performance. The accelerator also features a dataflow architecture
meticulously tuned to the specific needs of language model workloads. By optimizing the flow of data, AxLaM
maximizes the efficiency of its processing units and minimizes unnecessary data movement overhead.
When benchmarked against the state-of-the-art Simba accelerator, AxLaM achieves a remarkable 9-fold
reduction in energy use and a 58% decrease in area, underscoring its suitability for deployment in
resource-constrained edge environments.
          </p>
          <p>
            DTATrans, described by Yang et al. [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ], and DTQAtten, outlined by Yang et al. [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ], represent
hardware-software co-designed solutions aimed at efficient transformer-based LLM inference on edge
devices. These approaches leverage dynamic mixed-precision quantization and a variable-speed systolic
array (VSSA) architecture to achieve exceptional performance and energy efficiency. Central to their
design is a dynamic mixed-precision quantization scheme that adjusts the precision of model weights
and activations based on their significance and the computational capacity of the target device, striking
an effective balance between accuracy and efficiency. The accelerators employ a variable-speed systolic
array architecture, which dynamically tunes the processing speed and parallelism of matrix operations
to match workload characteristics and available resources, thereby optimizing both performance and
energy use. Furthermore, DTATrans and DTQAtten benefit from a tight integration of hardware
and software, encompassing the compiler, runtime, and programming model. This co-design ensures
efficient mapping and scheduling of LLM workloads onto the hardware, enhancing overall effectiveness.
In terms of results, DTATrans delivers a 16.04-fold speedup and a 3.62-fold energy saving over the
earlier Eyeriss accelerator, while DTQAtten achieves a 3.62-fold speedup and a 4.22-fold improvement
in energy efficiency compared to the state-of-the-art SpAtten accelerator, highlighting their significant
contributions to edge-based LLM inference.
          </p>
          <p>Figure 5 (omitted, y-axis: normalized performance) compares the performance and energy efficiency of the three hardware acceleration solutions, highlighting their effectiveness in enabling efficient LLM deployment on edge devices.</p>
          <p>As shown in figure 5, all three hardware acceleration solutions achieve significant improvements
in speed and energy efficiency compared to baseline edge devices without specialized accelerators.
Cambricon-LLM demonstrates the highest speedup, while AxLaM and DTATrans/DTQAtten exhibit
better energy efficiency. The choice of hardware solution depends on the specific requirements and
constraints of the target application and deployment scenario, such as the model size, latency, and
power budget.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Emerging trends and future directions</title>
        <p>
          The domain of hardware acceleration for deploying large language models on edge devices is advancing
at a brisk pace, unveiling a host of emerging trends and future research avenues aimed at enhancing the
performance, efficiency, and scalability of these solutions. One notable direction involves the melding
of diverse specialized accelerators – such as neural processing units, graphics processing units, and
field-programmable gate arrays – into a single chip or package. This heterogeneous integration, as
explored in [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], facilitates the efficient handling of varied workload types and adapts seamlessly to the
dynamic demands of edge LLM applications. Another promising development centres on in-memory
computing, where innovative memory technologies like non-volatile memory and computational RAM
are harnessed to perform computations directly within the memory itself. This approach, detailed in
[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], significantly cuts down on data movement overhead between processing units and the memory
hierarchy, boosting overall efficiency.
        </p>
        <p>
          Equally compelling is the focus on exploiting sparsity within LLM structures. Researchers are delving
into sparsity-aware hardware architectures and dataflows that capitalize on the inherently sparse nature
of these models, thereby reducing both computational and memory demands during inference, as
discussed by Bhardwaj et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Alongside this, the concept of adaptive precision is gaining traction.
This involves crafting accelerators capable of dynamically adjusting computational precision based on
workload specifics and desired accuracy levels, a strategy outlined in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] that minimizes energy use
while maximizing system performance. Furthermore, the synergy of hardware and software through
co-design is proving vital. By jointly optimizing the hardware architecture, software stack, and algorithms,
this approach – highlighted in [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] – tailors solutions to the unique constraints and characteristics of
target devices and applications, ensuring a more cohesive deployment process.
        </p>
        <p>These evolving trends and forward-looking directions underscore the importance of a comprehensive
strategy for hardware acceleration in edge LLM deployment. They emphasize the intricate interplay
among hardware, software, and application layers, suggesting that a holistic perspective is essential for
progress. By embracing these advancements, future edge devices stand poised to support increasingly
sophisticated and efficient LLM-based applications, opening the door to novel use cases and enriched
user experiences.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Applications and real-world systems</title>
      <p>
        The deployment of large language models on edge devices has enabled a wide range of applications
and real-world systems across various domains, such as IoT, personalized services, and multi-modal
interaction [
        <xref ref-type="bibr" rid="ref11 ref35 ref36">11, 35, 36</xref>
        ]. These applications leverage the capabilities of edge-based LLMs to provide
intelligent, responsive, and context-aware services to users while preserving privacy and reducing
latency.
      </p>
      <sec id="sec-5-1">
        <title>5.1. IoT and smart city applications</title>
        <p>
          The integration of LLMs with IoT devices and smart city infrastructure has enabled the development
of intelligent and adaptive systems that can process and analyze real-time data streams, providing
actionable insights and optimized decision-making [
          <xref ref-type="bibr" rid="ref24 ref37">24, 37</xref>
          ]. Table 6 summarizes representative IoT and
smart city applications of edge-based LLMs.
        </p>
        <p>
          Edge-based LLMs have been applied to traffic forecasting and management systems to enable accurate
and timely prediction of traffic conditions, optimizing the use of transportation infrastructure and
reducing congestion [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]. These systems leverage the spatio-temporal modelling capabilities of LLMs to
capture the complex dependencies between traffic flow, weather conditions, and road network topology,
adapting to the dynamic and heterogeneous nature of urban environments.
        </p>
        <p>
          For example, Rong et al. [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] propose a lightweight spatio-temporal generative LLM (LSGLLM) for
large-scale traffic flow forecasting, which is deployed on edge devices and collaborates with cloud
servers to process and analyze real-time traffic data efficiently. The LSGLLM model achieves superior
performance compared to traditional baselines, demonstrating the effectiveness of edge-based LLMs in
traffic management applications.
        </p>
        <p>
          Edge-based LLMs have also been employed in anomaly detection and predictive maintenance systems
for IoT and industrial applications, enabling the early identification of potential faults and the
optimization of maintenance schedules [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]. These systems utilize the unsupervised learning capabilities of
LLMs to model the normal behaviour of IoT devices and industrial equipment, detecting deviations and
anomalies in real-time and triggering appropriate actions.
        </p>
        <p>
          For instance, Zhang and Shi [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] propose a blockchain-based edge intelligence framework that
integrates large AI models and IoT devices for anomaly detection and predictive maintenance in smart
city applications. The framework leverages the distributed and secure nature of blockchain technology
to enable the collaborative training and inference of LLMs across multiple edge devices, ensuring the
privacy and integrity of the data and the models.
        </p>
        <p>Figure 6 illustrates the architecture of a typical IoT and smart city application of edge-based LLMs,
highlighting the key components and the data flow between the edge devices and the cloud servers.</p>
        <sec id="sec-5-1-1">
          <title>Feedback and control</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>IoT sensors and devices</title>
        </sec>
        <sec id="sec-5-1-3">
          <title>Raw data</title>
        </sec>
        <sec id="sec-5-1-4">
          <title>Edge devices with LLMs</title>
        </sec>
        <sec id="sec-5-1-5">
          <title>Processed data</title>
        </sec>
        <sec id="sec-5-1-6">
          <title>Cloud servers with LLMs</title>
        </sec>
        <sec id="sec-5-1-7">
          <title>IoT and smart city applications</title>
        </sec>
        <sec id="sec-5-1-8">
          <title>Insights and actions</title>
          <p>As shown in figure 6, IoT sensors and devices collect raw data from the environment and send it to
the edge devices for real-time processing and analysis using LLMs. The edge devices extract relevant
features and patterns from the data and transmit the processed information to the cloud servers for
further analysis and decision-making. The cloud servers utilize more powerful LLMs to generate insights
and actionable recommendations, which are then fed back to the IoT and smart city applications for
implementation. The applications, in turn, provide feedback and control signals to the sensors and
devices, closing the loop and enabling adaptive and intelligent behaviour.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Personalized services and human-machine interaction</title>
        <p>
          Edge-based LLMs have also been applied to personalized services and human-machine interaction
applications, enabling the development of intelligent and context-aware systems that can understand
and respond to user needs and preferences in real-time [
          <xref ref-type="bibr" rid="ref35 ref36">35, 36</xref>
          ]. Table 7 summarizes representative
personalized services and human-machine interaction applications of edge-based LLMs.
        </p>
        <p>Edge-based LLMs have been leveraged to develop intelligent personal assistants that can understand
and respond to user queries and commands in natural language, providing personalized and context-aware
services [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. These assistants integrate multi-modal interaction capabilities, such as speech
recognition, computer vision, and natural language processing, to enable seamless and intuitive
human-machine communication.</p>
        <p>Table 7 (representative personalized services): intelligent personal assistants, using multi-modal interaction and context awareness for an enhanced user experience; personalized recommendation and content generation, using user profiling and collaborative filtering for improved service quality.</p>
          <p>
            For example, Shen et al. [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] propose an edge-based autonomous AI system that leverages the
capabilities of LLMs to provide high-quality, low-latency, and privacy-preserving personal assistant
services. The system utilizes a combination of on-device LLMs and cloud-based models to process user
queries and generate appropriate responses, adapting to the user’s context and preferences.
          </p>
          <p>
            Edge-based LLMs have also been applied to personalized recommendation and content generation
systems, enabling the delivery of tailored and engaging experiences to users based on their interests
and behaviour [
            <xref ref-type="bibr" rid="ref39">39</xref>
            ]. These systems leverage the language understanding and generation capabilities of
LLMs to create user profiles, analyze user feedback, and generate personalized recommendations and
content.
          </p>
          <p>
            For instance, Piccialli et al. [
            <xref ref-type="bibr" rid="ref39">39</xref>
            ] propose a federated and edge learning framework for LLMs that
enables the collaborative training and inference of recommendation models across multiple edge devices
while preserving user privacy. The framework utilizes techniques such as differential privacy and secure
multi-party computation to ensure the confidentiality of user data and the integrity of the models.
          </p>
          <p>Figure 7 illustrates the workflow of a typical personalized service or human-machine interaction
application of edge-based LLMs, highlighting the key steps and the interaction between the user and
the system.</p>
        <sec id="sec-5-2-3">
          <title>Recommendation or content</title>
        </sec>
        <sec id="sec-5-2-4">
          <title>User</title>
        </sec>
        <sec id="sec-5-2-5">
          <title>Multi-modal input Edge-based LLMs</title>
          <p>As shown in figure 7, the user initiates the interaction by providing a query or command to the system,
which can be in the form of text, speech, image, or other modalities. The edge-based LLMs process the
input and generate a personalized response, which is then presented to the user as a recommendation
or a piece of content. The user can provide feedback on the output, which is used by the system to
refine the user profile and improve the quality of future recommendations and interactions.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Multi-modal edge intelligence</title>
        <p>
          The integration of edge-based LLMs with other AI domains, such as computer vision, speech recognition,
and robotics, has enabled the development of multi-modal edge intelligence applications that can
process and analyze heterogeneous data streams in real-time [
          <xref ref-type="bibr" rid="ref40 ref41">40, 41</xref>
          ]. These applications leverage the
complementary strengths of different AI technologies to provide more comprehensive and accurate
insights and actions, enhancing the capabilities of edge devices and systems. Table 8 summarizes
representative multi-modal edge intelligence applications of edge-based LLMs.
        </p>
        <sec id="sec-5-3-1">
          <title>Robotics and au- Multi-modal perception, nat- Enhanced robot capabilities tonomous systems ural language interaction</title>
        </sec>
        <sec id="sec-5-3-2">
          <title>Computer vision Cross-modal learning, real- Improved scene understanding Xu et al. [43] and video analytics time processing</title>
          <p>
            Edge-based LLMs have been integrated with robotics and autonomous systems to enable more
natural and intuitive human-robot interaction, as well as more robust and adaptive robot behaviour [
            <xref ref-type="bibr" rid="ref42">42</xref>
            ].
These systems leverage the language understanding and generation capabilities of LLMs to process and
respond to human commands and queries while also utilizing the perception and action capabilities of
robots to perform tasks in the physical world.
          </p>
          <p>
            For example, Kawaharazuka et al. [
            <xref ref-type="bibr" rid="ref42">42</xref>
            ] propose a framework for applying pre-trained vision-language
models to various recognition behaviours in robotic applications, enabling robots to understand and
respond to visual and linguistic cues in real-world environments. The framework utilizes techniques
such as zero-shot learning and prompt engineering to adapt the pre-trained models to specific tasks
and domains without requiring extensive fine-tuning or data collection.
          </p>
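          <p>A hedged sketch of prompt-based zero-shot recognition follows; embed_text stands in for a pre-trained vision-language encoder and is simulated here with random vectors, so only the prompt-and-score logic reflects the technique described in [42].</p>
          <preformat>
# Sketch of zero-shot recognition via prompt engineering. The encoder is
# simulated; a real system would use a pre-trained vision-language model.
import numpy as np

rng = np.random.default_rng(42)

def embed_text(prompt):
    # Assumption: returns a unit-norm embedding from a frozen encoder.
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def zero_shot_classify(image_embedding, class_names):
    prompts = [f"a photo of a {name}" for name in class_names]  # prompt template
    scores = [float(image_embedding @ embed_text(p)) for p in prompts]
    return class_names[int(np.argmax(scores))]  # no fine-tuning required

image_embedding = embed_text("simulated image")
print(zero_shot_classify(image_embedding, ["mug", "door handle", "chair"]))
          </preformat>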
          <p>
            Edge-based LLMs have also been combined with computer vision and video analytics techniques to
enable more accurate and efficient scene understanding and object recognition in real-time [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ]. These
applications leverage the cross-modal learning capabilities of LLMs to process and analyze visual and
textual data streams simultaneously, extracting relevant features and generating semantic descriptions
of the observed scenes.
          </p>
          <p>
            For instance, Xu et al. [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ] propose a benchmark suite for evaluating the performance of multi-modal
deep neural networks (DNNs) in edge computing environments, focusing on the hardware and software
implications of deploying these models on resource-constrained devices. The benchmark includes a set
of representative computer vision and video analytics tasks, such as object detection, image captioning,
and video summarization, and provides insights into the trade-offs between accuracy, latency, and
energy efficiency of different multi-modal DNN architectures and optimization techniques.
          </p>
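          <p>The toy harness below illustrates the latency side of such benchmarking; both models are stand-ins, and energy measurement via hardware counters, which the benchmark in [43] also covers, is omitted.</p>
          <preformat>
# Toy benchmarking harness: per-input latency for competing model variants.
# Models and inputs are simulated stand-ins for edge DNN workloads.
import time
import statistics

def benchmark(model_fn, inputs, repeats=3):
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            model_fn(x)
        latencies.append((time.perf_counter() - start) / len(inputs))
    return {"mean_ms": 1000 * statistics.mean(latencies),
            "stdev_ms": 1000 * statistics.stdev(latencies)}

def fast_model(x):
    return sum(x)          # stand-in for a compressed DNN

def slow_model(x):
    return sorted(x * 50)  # stand-in for the uncompressed model

inputs = [list(range(100))] * 20
print("fast:", benchmark(fast_model, inputs))
print("slow:", benchmark(slow_model, inputs))
          </preformat>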
          <p>Figure 8 illustrates the architecture of a typical multi-modal edge intelligence application of
edge-based LLMs, highlighting the integration of different AI technologies and the flow of data and control
between them.</p>
          <p>As shown in figure 8, multi-modal sensors, such as cameras, microphones, and tactile sensors, collect
data from the environment and send it to the respective AI modules for processing. The computer vision
module extracts visual features and detects objects of interest, while the speech recognition module
transcribes and interprets spoken commands and queries. The robotics and control module processes
the sensor data and generates appropriate actions and behaviours for the robot. The edge-based LLMs
integrate the outputs of the individual AI modules and generate a cohesive and semantically meaningful
representation of the scene, which is then used to guide the robot’s actions and interactions. The
multi-modal output, such as natural language descriptions, visual explanations, and motor commands,
is fed back to the AI modules for further processing and refinement, creating a closed-loop system that
can adapt to dynamic and unstructured environments.</p>
        <!-- Figure 8 (architecture diagram) labels: Multi-modal sensors, Computer vision, Speech recognition, Robotics and control, Edge-based LLMs. -->
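        <p>The following sketch condenses this fusion step; the module outputs and the action plan are stubbed, since the survey describes the architecture rather than a concrete API.</p>
        <preformat>
# Minimal sketch of the fusion step in figure 8: vision, speech, and control
# outputs are serialized into one textual context for an edge LLM. All
# module outputs below are stubs, not a cited interface.
def fuse_and_act(vision_objects, transcript, robot_state, edge_llm):
    context = (f"objects={vision_objects}; user said='{transcript}'; "
               f"state={robot_state}")
    # The LLM produces a semantically grounded action from heterogeneous inputs.
    return edge_llm.generate("Next robot action given " + context)

class StubLLM:
    def generate(self, prompt):
        return "move_to(cup); grasp()"  # placeholder action plan

print(fuse_and_act(["cup", "table"], "bring me the cup",
                   {"arm": "idle"}, StubLLM()))
        </preformat>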
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Case studies and real-world deployments</title>
        <p>To demonstrate the practical impact and potential of edge-based LLMs, this section presents several case
studies and real-world deployments of these technologies across different domains and applications.
Table 9 summarizes the key features and outcomes of each case study.</p>
        <sec id="sec-5-4-1">
          <title>Autonomous edge AI systems [11] Personal assistants</title>
        </sec>
        <sec id="sec-5-4-2">
          <title>LLM-powered smartphones [36]</title>
        </sec>
        <sec id="sec-5-4-3">
          <title>Personalized digital avatars [35]</title>
        </sec>
        <sec id="sec-5-4-4">
          <title>Multi-modal interac- Enhanced user experition, privacy-preserving ence</title>
        </sec>
        <sec id="sec-5-4-5">
          <title>Mobile devices On-device inference, en- Improved functionalergy eficiency ity and performance</title>
        </sec>
        <sec id="sec-5-4-6">
          <title>Human-computer in- Realistic appearance Engaging and immerteraction and voice, low latency sive interactions</title>
          <p>
            Shen et al. [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] present an autonomous edge AI system that uses LLMs to provide intelligent and
personalized services to users, such as voice assistants, recommendation systems, and content generation.
The system employs a hierarchical architecture that combines on-device LLMs for low-latency inference
and privacy-preserving data processing with cloud-based models for more complex and computationally
intensive tasks. The system also incorporates techniques such as federated learning and differential
privacy to enable collaborative model training and adaptation across multiple edge devices while
ensuring the security and confidentiality of user data.
          </p>
          <p>The autonomous edge AI system has been deployed in various real-world scenarios, such as smart
homes, connected vehicles, and personal robotics, demonstrating significant improvements in user
experience, service quality, and operational efficiency compared to traditional cloud-based solutions. The
system has also been shown to reduce the energy consumption and network bandwidth requirements
of edge devices, making it suitable for resource-constrained environments.</p>
          <p>
            Wu et al. [
            <xref ref-type="bibr" rid="ref36">36</xref>
            ] investigate the integration of LLMs into smartphones to enable advanced functionality
and improved performance for mobile users. The authors develop a framework for optimizing the
deployment of LLMs on mobile devices, considering factors such as model compression, quantization,
and hardware acceleration. The framework also includes a runtime system that dynamically adapts the
inference process based on the available resources and the user’s context, ensuring optimal performance
and energy efficiency.
          </p>
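          <p>A simplified sketch of such runtime adaptation follows; the variant table and the battery and memory thresholds are our assumptions, not values from [36].</p>
          <preformat>
# Hedged sketch of runtime model-variant selection on a smartphone.
# The variant table and thresholds are illustrative assumptions.
def select_variant(battery_pct, free_mem_mb):
    variants = [                      # (min_battery_pct, min_mem_mb, variant)
        (50, 3000, "fp16-full"),
        (20, 1500, "int8-quantized"),
        (0,  500,  "int4-distilled"),
    ]
    for min_batt, min_mem, name in variants:
        if battery_pct >= min_batt and free_mem_mb >= min_mem:
            return name               # largest variant the device can afford
    return "cloud-offload"            # nothing fits on-device

print(select_variant(battery_pct=35, free_mem_mb=2000))  # int8-quantized
          </preformat>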
          <p>LLM-powered smartphones have been evaluated in terms of their natural language processing
capabilities, such as text classification, language translation, and question answering, as well as their
impact on the user experience and battery life. The results show that the optimized LLMs can achieve
comparable accuracy to cloud-based models while significantly reducing the devices’ latency and energy
consumption. The LLM-powered smartphones have also been shown to enable new applications and
services, such as on-device virtual assistants, real-time language translation, and personalized content
recommendations.</p>
          <p>
            Basit and Shafique [
            <xref ref-type="bibr" rid="ref35">35</xref>
            ] propose a multi-modal LLM-based framework for creating personalized digital
avatars that can engage in natural and expressive interactions with users. The framework integrates
LLMs for natural language processing, deep learning models for speech synthesis and recognition, and
computer vision techniques for generating realistic facial expressions and gestures. The digital avatars
are designed to run on edge devices, such as smartphones and smart speakers, providing low-latency
and privacy-preserving interactions.
          </p>
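          <p>The shape of this pipeline can be summarized in a few lines of Python; every stage below is a stub, as the cited framework's components are not publicly specified here.</p>
          <preformat>
# Illustrative avatar pipeline in the spirit of [35]: speech recognition,
# LLM response generation, and speech/face synthesis chained on-device.
# Each stage is a stub standing in for a real model.
def avatar_turn(audio, asr, llm, tts, face):
    text = asr(audio)                       # speech -> text
    reply = llm("Respond warmly: " + text)  # personalized response
    return tts(reply), face(reply)          # voice plus facial animation

def asr(audio):
    return "what's the weather?"

def llm(prompt):
    return "Sunny and mild today!"

def tts(text):
    return f"[waveform for '{text}']"

def face(text):
    return "[lip-sync keyframes]"

print(avatar_turn(b"...", asr, llm, tts, face))
          </preformat>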
          <p>Personalized digital avatars have been evaluated in terms of their naturalness, expressiveness, and
user engagement using both objective metrics and subjective user studies. The results show that the
avatars can generate highly realistic and context-appropriate responses while also adapting to the user’s
preferences and emotions. The digital avatars have been deployed in various applications, such as
virtual customer service agents, personal tutors, and social companions, demonstrating their potential
to enhance the user experience and create more engaging and immersive interactions.</p>
          <p>These case studies and real-world deployments highlight the diversity and impact of edge-based
LLMs across different domains and applications. They also demonstrate the practical feasibility and
benefits of deploying these technologies on resource-constrained devices, paving the way for more
intelligent, responsive, and user-centric edge computing systems.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Challenges, opportunities, and future directions</title>
      <p>Despite the significant advancements and promising applications of edge-based LLMs, several challenges
and opportunities remain that need to be addressed to fully realize their potential.</p>
      <sec id="sec-6-1">
        <title>6.1. Resource constraints and efficiency optimization</title>
        <p>
          One of the main challenges in deploying LLMs on edge devices is the limited computational resources
and energy budget of these devices [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. To address this challenge, there is a need for more efficient and
adaptive optimization techniques that can dynamically adjust the model architecture, hyperparameters,
and deployment strategy based on the available resources and the target task. This includes the
development of more advanced compression methods, such as quantization-aware training [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], network
pruning [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], and knowledge distillation [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], as well as the exploration of novel architectures and
learning paradigms, such as mixture-of-experts [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] and meta-learning [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ].
        </p>
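        <p>To make one of these techniques concrete, the toy example below applies unstructured magnitude pruning; production pipelines such as [14] prune structured groups and retrain afterwards, both of which are omitted here.</p>
        <preformat>
# Toy unstructured magnitude pruning: zero out the smallest weights.
# Real pipelines prune structured groups and fine-tune afterwards.
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k]              # k-th smallest magnitude
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

w = np.random.default_rng(0).normal(size=(4, 4))
pruned, mask = magnitude_prune(w, sparsity=0.75)
print(f"kept {int(mask.sum())} of {mask.size} weights")
        </preformat>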
        <p>
          Another important direction is the design of more efficient hardware accelerators and platforms
for edge-based LLMs, considering the unique characteristics and requirements of these models [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
This includes the development of specialized processing units, such as tensor processing units (TPUs)
[
          <xref ref-type="bibr" rid="ref46">46</xref>
          ] and neural processing units (NPUs) [
          <xref ref-type="bibr" rid="ref47">47</xref>
          ], as well as the optimization of memory hierarchies and
interconnects for fast and low-power data movement. Additionally, the co-design of hardware and
software components, such as compilers, runtime systems, and frameworks, can enable more seamless
and efficient deployment of LLMs on edge devices [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Privacy, security, and trustworthiness</title>
        <p>
          Another critical challenge in edge-based LLMs is ensuring the privacy, security, and trustworthiness
of these systems, especially when dealing with sensitive user data and interactions [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. To address
this challenge, there is a need for more robust and scalable privacy-preserving techniques, such as
federated learning [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ], differential privacy [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ], and homomorphic encryption [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ], that can enable
the collaborative training and inference of LLMs across multiple edge devices, without compromising
the confidentiality and integrity of the data and the models.
        </p>
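        <p>A toy pairwise-masking construction illustrates the intuition behind secure aggregation: the server can sum client updates without observing any single one. Real protocols add key agreement and dropout handling that this sketch omits.</p>
        <preformat>
# Toy pairwise-masking secure aggregation. Masks cancel in the sum, so the
# aggregate is exact while individual updates stay hidden. Key agreement
# and dropout recovery from real protocols are omitted.
import numpy as np

def masked_updates(updates, rng=None):
    rng = rng or np.random.default_rng(1)
    n = len(updates)
    masked = [u.copy() for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.normal(size=updates[0].shape)
            masked[i] += mask          # mask cancels pairwise...
            masked[j] -= mask          # ...so the total is unchanged
    return masked

updates = [np.full(4, float(i)) for i in range(3)]
summed = np.sum(masked_updates(updates), axis=0)
print(np.allclose(summed, np.sum(updates, axis=0)))  # True
        </preformat>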
        <p>
          Moreover, the development of more secure and resilient architectures and protocols for edge-based
LLMs is crucial to prevent unauthorized access, tampering, and attacks on these systems [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. This
includes the use of trusted execution environments (TEEs) [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ], blockchain-based authentication
and access control [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], and anomaly detection and mitigation techniques [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. Additionally, the
incorporation of explainable and interpretable AI techniques [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ] can enhance the transparency and
accountability of edge-based LLMs, enabling users to understand and trust the decisions and actions of
these systems.
        </p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Domain-specific adaptation and customization</title>
        <p>
          Another important challenge and opportunity in edge-based LLMs is the adaptation and customization
of these models to specific domains and applications [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. While pre-trained LLMs can provide a good
starting point for many tasks, they often require fine-tuning and domain-specific optimization to achieve
optimal performance and user experience. This includes the incorporation of domain knowledge and
constraints into the model architecture and training process, as well as the development of more efficient
and effective transfer learning techniques [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ].
        </p>
        <p>
          Moreover, the design of more modular and composable LLMs that can be easily adapted and extended
to new tasks and domains is an important research direction [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ]. This includes the development of
plug-and-play modules, such as adapters [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and prefix-tuning [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ], that can be seamlessly integrated into
existing LLMs to enable fast and efficient adaptation to new requirements and preferences. Additionally,
the exploration of more interactive and collaborative learning paradigms, such as active learning [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ]
and human-in-the-loop learning [
          <xref ref-type="bibr" rid="ref56">56</xref>
          ], can enable the continuous improvement and customization of
edge-based LLMs based on user feedback and real-world usage patterns.
        </p>
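        <p>A minimal bottleneck adapter, in the sense of [21], can be written as follows; the dimensions and the zero-initialized up-projection are common choices but remain assumptions of this sketch.</p>
        <preformat>
# Minimal bottleneck adapter: a small trainable projection with a residual
# connection, inserted around a frozen layer. Dimensions are assumptions.
import numpy as np

class Adapter:
    def __init__(self, hidden_dim, bottleneck_dim, rng=None):
        rng = rng or np.random.default_rng(0)
        self.down = rng.normal(0, 0.02, size=(hidden_dim, bottleneck_dim))
        self.up = np.zeros((bottleneck_dim, hidden_dim))  # identity at init

    def __call__(self, hidden_states):
        h = np.maximum(hidden_states @ self.down, 0.0)    # down-project + ReLU
        return hidden_states + h @ self.up                # residual connection

adapter = Adapter(hidden_dim=768, bottleneck_dim=16)
x = np.random.default_rng(1).normal(size=(4, 768))
print(np.allclose(adapter(x), x))  # True: starts as identity, cheap to tune
        </preformat>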
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Scalability and interoperability in heterogeneous environments</title>
        <p>
          Another challenge and opportunity in edge-based LLMs is ensuring their scalability and interoperability
in heterogeneous and dynamic environments [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Edge computing systems often involve a large number
of diverse devices, protocols, and platforms, which can hinder the seamless deployment and coordination
of LLMs across these systems. To address this challenge, there is a need for more flexible and adaptive
middleware and frameworks that can abstract away the underlying heterogeneity and enable the
efficient and reliable communication and synchronization of LLMs across different edge devices and
networks.
        </p>
        <p>
          Moreover, the development of standardized interfaces and protocols for edge-based LLMs is crucial
to enable their interoperability and compatibility with existing tools and services [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. This includes
the design of common architectures, such as the Open Neural Network Exchange (ONNX) [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ], and
the adoption of open standards, such as the Edge Computing Reference Architecture (ECRA) [
          <xref ref-type="bibr" rid="ref58">58</xref>
          ], to
facilitate the integration and deployment of LLMs in edge computing environments. Additionally, the
exploration of more decentralized and self-organizing architectures, such as peer-to-peer networks [
          <xref ref-type="bibr" rid="ref59">59</xref>
          ]
and multi-agent systems [
          <xref ref-type="bibr" rid="ref60">60</xref>
          ], can enable the scalable and resilient coordination of edge-based LLMs in
large-scale and dynamic environments.
        </p>
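        <p>As a small worked example of ONNX-based interoperability [57], the snippet below exports a placeholder module with torch.onnx.export; the tiny network stands in for a compressed LLM block and is our assumption, not a model from the cited works.</p>
        <preformat>
# Exporting a placeholder module to ONNX for portable edge deployment.
# Requires torch; the tiny network stands in for a compressed LLM block.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 32))
model.eval()
example_input = torch.randn(1, 64)

torch.onnx.export(
    model, example_input, "edge_block.onnx",
    input_names=["hidden"], output_names=["logits"],
    dynamic_axes={"hidden": {0: "batch"}},   # allow variable batch size
)
print("exported edge_block.onnx")
        </preformat>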
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Emerging applications and future research directions</title>
        <p>
          Finally, there are many emerging applications and future research directions that can further advance
the field of edge-based LLMs and unlock new opportunities for innovation and impact. Some of these
directions include:
1. The integration of edge-based LLMs with other AI technologies, such as reinforcement learning
[
          <xref ref-type="bibr" rid="ref61">61</xref>
          ], graph neural networks [
          <xref ref-type="bibr" rid="ref62">62</xref>
          ], and knowledge graphs [
          <xref ref-type="bibr" rid="ref63">63</xref>
          ], to enable more intelligent and
context-aware decision making and reasoning in edge computing environments.
2. The exploration of edge-based LLMs for multi-modal and cross-lingual applications, such as
image captioning [
          <xref ref-type="bibr" rid="ref64">64</xref>
          ], video summarization [
          <xref ref-type="bibr" rid="ref65">65</xref>
          ], and machine translation [
          <xref ref-type="bibr" rid="ref66">66</xref>
          ], to enable more
natural and expressive interactions between humans and edge devices.
3. The development of edge-based LLMs for mission-critical and safety-critical applications, such
as autonomous driving [
          <xref ref-type="bibr" rid="ref67">67</xref>
          ], industrial control systems [
          <xref ref-type="bibr" rid="ref68">68</xref>
          ], and healthcare monitoring [
          <xref ref-type="bibr" rid="ref69">69</xref>
          ], to
ensure the reliability, security, and performance of these systems in real-world environments.
4. The investigation of edge-based LLMs for sustainable and green computing, considering the
energy efficiency, carbon footprint, and environmental impact of these systems [
          <xref ref-type="bibr" rid="ref70">70</xref>
          ], and exploring
techniques such as energy harvesting [
          <xref ref-type="bibr" rid="ref71">71</xref>
          ], workload consolidation [
          <xref ref-type="bibr" rid="ref72">72</xref>
          ], and renewable energy
integration [
          <xref ref-type="bibr" rid="ref73">73</xref>
          ] to reduce their ecological footprint.
5. The study of the social, economic, and ethical implications of edge-based LLMs, including aspects
such as fairness, accountability, transparency, and explainability [
          <xref ref-type="bibr" rid="ref74">74</xref>
          ], and the development of
responsible AI principles and guidelines [
          <xref ref-type="bibr" rid="ref75">75</xref>
          ] to ensure the beneficial and trustworthy deployment
of these technologies in society.
        </p>
        <p>These emerging applications and future research directions highlight the vast potential and impact
of edge-based LLMs in shaping the future of intelligent and sustainable computing systems.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This survey has explored the dynamic and rapidly advancing field of edge-based large language models
(LLMs), a frontier that bridges the power of advanced AI with the constraints of resource-limited edge
devices. We have examined the techniques – such as quantization, pruning, and knowledge distillation –
that enable efficient LLM deployment, alongside frameworks like TinyAgent, MNN-LLM, and h2oGPT
that facilitate practical implementation. Edge-cloud collaborative architectures, including EdgeShard,
Edge-LLM, and PAC, demonstrate how hybrid systems can overcome computational bottlenecks, while
hardware innovations like Cambricon-LLM, AxLaM, and DTATrans/DTQAtten push the boundaries of
performance and energy efficiency. These advancements have unlocked a diverse array of applications,
from IoT-driven smart cities and personalized assistants to multi-modal edge intelligence, as illustrated
by real-world deployments such as LLM-powered smartphones and autonomous AI systems.</p>
      <p>The significance of edge-based LLMs extends beyond technical achievements; they represent a
paradigm shift in how intelligent computing is delivered. By enabling low-latency, privacy-preserving,
and context-aware AI at the edge, these models democratize access to cutting-edge capabilities,
empowering users in resource-constrained environments – whether in remote areas with limited connectivity
or urban settings demanding real-time responsiveness. The survey ties together these threads –
techniques, frameworks, hardware, and applications – into a cohesive narrative: edge-based LLMs are not
merely an adaptation of existing technology but a foundational step toward ubiquitous, sustainable
intelligence. For instance, model compression and hardware acceleration address resource constraints, while
collaborative frameworks and privacy-preserving techniques ensure scalability and trust, collectively
paving the way for innovative applications that redefine human-machine interaction.</p>
      <p>Looking forward, the future of edge-based LLMs is both promising and demanding. Continued
innovation is needed to address persistent challenges, such as optimizing resource efficiency, enhancing
privacy and security, and achieving seamless scalability across heterogeneous environments.
Specific research directions include developing adaptive, modular LLM architectures for domain-specific
customization, integrating multi-modal reasoning for richer interactions, and exploring sustainable
computing paradigms to minimize environmental impact. These efforts will unlock transformative
possibilities – imagine autonomous systems reasoning in real-time, personalized services adapting
instantly to user needs, or green edge AI reducing the carbon footprint of intelligent devices.
      <p>This survey serves as both a comprehensive resource and a call to action for researchers and
practitioners at the confluence of LLMs and edge computing. The advancements chronicled here are not
endpoints but stepping stones toward a future where edge-based LLMs become integral to everyday
life – secure, eficient, and universally accessible. We conclude with a bold vision: edge-based LLMs
have the potential to reshape the landscape of computing, bringing intelligence closer to users than
ever before and fostering a world where AI is not just powerful but personal, pervasive, and profoundly
impactful.</p>
      <p>Declaration on Generative AI: During the preparation of this work, the authors used Scopus AI to generate a literature
review, Claude 3 Opus to draft content, Grok 3 for abstract drafting, improved writing style and citation management, and
Grammarly for grammar and spelling check. After using these tools, the authors reviewed and edited the content as needed
and took full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Slobodianiuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Semerikov</surname>
          </string-name>
          ,
          <article-title>Advances in neural text generation: A systematic review (2022-2024)</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          <volume>3917</volume>
          (
          <year>2025</year>
          )
          <fpage>332</fpage>
          -
          <lpage>361</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. O.</given-names>
            <surname>Liashenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Semerikov</surname>
          </string-name>
          ,
          <article-title>Bibliometric analysis and experimental assessment of chatbot training approaches</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          <volume>3917</volume>
          (
          <year>2025</year>
          )
          <fpage>199</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <source>Proceedings of the</source>
          <year>2019</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis</article-title>
          , MN, USA, June 2-7,
          <year>2019</year>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi:10.18653/V1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>5485</fpage>
          -
          <lpage>5551</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Friha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Amine</given-names>
            <surname>Ferrag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kantarci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cakmak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ozgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ghoualmi-Zine</surname>
          </string-name>
          ,
          <article-title>LLM-Based Edge Intelligence: A Comprehensive Survey on Architectures, Applications, Security and Trustworthiness</article-title>
          ,
          <source>IEEE Open Journal of the Communications Society</source>
          <volume>5</volume>
          (
          <year>2024</year>
          )
          <fpage>5799</fpage>
          -
          <lpage>5856</lpage>
          . doi:10.1109/OJCOMS.2024.3456549.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <article-title>Edge-LLM: A Collaborative Framework for Large Language Model Serving in Edge Computing</article-title>
          , in: R. N.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C. K.</given-names>
          </string-name>
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sheng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>K. K.</given-names>
          </string-name>
          <string-name>
            <surname>Fletcher</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ardagna</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Beheshti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Russo</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Atukorala</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>P. S.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Ludwig</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Reif-Marganiec</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sailer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Bena</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Watanabe</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Tu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Wei (Eds.),
          <source>Proceedings of the IEEE International Conference on Web Services</source>
          , ICWS, Institute of Electrical and Electronics Engineers Inc.,
          <year>2024</year>
          , pp.
          <fpage>799</fpage>
          -
          <lpage>809</lpage>
          . doi:10.1109/ICWS62655.2024.00099.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Governing Open Vocabulary Data Leaks Using an Edge LLM through Programming by Example</article-title>
          ,
          <source>Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies</source>
          <volume>8</volume>
          (
          <year>2024</year>
          )
          179. doi:10.1145/3699760.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhardwaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Pandit</surname>
          </string-name>
          ,
          <article-title>A Survey on the Integration and Optimization of Large Language Models in Edge Computing Environments</article-title>
          , in: 2024 16th International Conference on Computer and Automation Engineering, ICCAE 2024, Institute of Electrical and Electronics Engineers Inc.,
          <year>2024</year>
          , pp.
          <fpage>168</fpage>
          -
          <lpage>172</lpage>
          . doi:10.1109/ICCAE59995.2024.10569285.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          , S. Jiang,
          <article-title>EdgeShard: Efficient LLM Inference via Collaborative Edge Computing</article-title>
          ,
          <source>IEEE Internet of Things Journal</source>
          (
          <year>2024</year>
          ). doi:10.1109/JIOT.2024.3524255.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language Models are Few-Shot Learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems</source>
          <year>2020</year>
          ,
          <article-title>NeurIPS 2020</article-title>
          , December 6-12, 2020
          , virtual,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Letaief</surname>
          </string-name>
          ,
          <article-title>Large Language Models Empowered Autonomous Edge AI for Connected Intelligence</article-title>
          ,
          <source>IEEE Communications Magazine</source>
          <volume>62</volume>
          (
          <year>2024</year>
          )
          <fpage>140</fpage>
          -
          <lpage>146</lpage>
          . doi:10.1109/MCOM.001.2300550.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bommu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Unified Compression and Adaptive Layer Voting</article-title>
          ,
          in:
          <source>Proceedings of the 61st ACM/IEEE Design Automation Conference</source>
          , DAC '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>327</fpage>
          . doi:10.1145/3649329.3658473.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is All you Need</article-title>
          , in: I. Guyon, U. von Luxburg, S. Bengio,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V. N.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9</source>
          ,
          <year>2017</year>
          , Long Beach, CA, USA,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          , W. Jin,
          <article-title>Tri-AFLLM: Resource-Efficient Adaptive Asynchronous Accelerated Federated LLMs</article-title>
          ,
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          (
          <year>2024</year>
          ). doi:10.1109/TCSVT.2024.3519790.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Erdogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tabrizi</surname>
          </string-name>
          , S. Moon,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hooper</surname>
          </string-name>
          , G. Anumanchipalli,
          <string-name>
            <given-names>K.</given-names>
            <surname>Keutzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gholami</surname>
          </string-name>
          ,
          <article-title>TinyAgent: Function Calling at the Edge</article-title>
          , in:
          <string-name>
            <given-names>D. I. H.</given-names>
            <surname>Farias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hope</surname>
          </string-name>
          ,
          M. Li (Eds.),
          <source>EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of System Demonstrations, Association for Computational Linguistics (ACL)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices</article-title>
          ,
          in:
          <source>Proceedings of the 6th ACM International Conference on Multimedia in Asia Workshops</source>
          , MMAsia '24 Workshops, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>11</fpage>
          . doi:10.1145/3700410.3702126.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Candel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McKinney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pfeifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jeblick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Conde</surname>
          </string-name>
          ,
          <article-title>H2O Open Ecosystem for State-of-the-art Large Language Models</article-title>
          , in: Y. Feng
          , E. Lefever (Eds.),
          <source>EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings of the System Demonstrations</source>
          , Association for Computational Linguistics (ACL)
          ,
          <year>2023</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leeser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models</article-title>
          ,
          <source>IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems</source>
          (
          <year>2024</year>
          ). doi:10.1109/TCAD.2024.3487781.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tambe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C. Hooper,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Whatmough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zuckerman</surname>
          </string-name>
          ,
          M. C. D. Santos, E. J. Loscalzo, D. Giri, K. Shepard, L. Carloni, A. Rush, D. Brooks, G.-Y. Wei,
          <article-title>22.9 A 12nm 18.1TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision Predication and Fine-Grained Power Management</article-title>
          , in:
          <source>Digest of Technical Papers - IEEE International Solid-State Circuits Conference</source>
          , volume 2023-February, Institute of Electrical and Electronics Engineers Inc.,
          <year>2023</year>
          , pp.
          <fpage>342</fpage>
          -
          <lpage>344</lpage>
          . doi:10.1109/ISSCC42615.2023.10067817.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , T. Wolf,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          ,
          <source>CoRR abs/1910.01108</source>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1910.01108. arXiv:1910.01108.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giurgiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jastrzebski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Morrone</surname>
          </string-name>
          , Q. de Laroussilhe,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gesmundo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Attariyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <article-title>Parameter-Efficient Transfer Learning for NLP</article-title>
          , in: K. Chaudhuri, R. Salakhutdinov (Eds.),
          <source>Proceedings of the 36th International Conference on Machine Learning</source>
          , ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume
          <volume>97</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2019</year>
          , pp.
          <fpage>2790</fpage>
          -
          <lpage>2799</lpage>
          . URL: http://proceedings.mlr.press/v97/houlsby19a.html.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the Knowledge in a Neural Network</article-title>
          ,
          <source>CoRR abs/1503.02531</source>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1503.02531. arXiv:1503.02531.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>TinyBERT: Distilling BERT for natural language understanding</article-title>
          , in: T. Cohn,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2020</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>4163</fpage>
          -
          <lpage>4174</lpage>
          . doi:10.18653/v1/2020.findings-emnlp.372.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>GenG: An LLM-Based Generic Time Series Data Generation Approach for Edge Intelligence via Cross-Domain Collaboration</article-title>
          , in:
          <source>IEEE INFOCOM 2024 - IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2024</source>
          , Institute of Electrical and Electronics Engineers Inc.,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:10.1109/INFOCOMWKSHPS61880.2024.10620716.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>Edge and Terminal Cooperation Enabled LLM Deployment Optimization in Wireless Network</article-title>
          , in:
          <source>International Conference on Communications in China, ICCC Workshops 2024</source>
          , Institute of Electrical and Electronics Engineers Inc.,
          <year>2024</year>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>225</lpage>
          . doi:10.1109/ICCCWorkshops62562.2024.10693742.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment</article-title>
          , in: L.-W. Ku, A. Martins
          , V. Srikumar (Eds.),
          <source>Proceedings of the Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics (ACL),
          <year>2024</year>
          , pp.
          <fpage>3433</fpage>
          -
          <lpage>3446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>VELO: A Vector Database-Assisted Cloud-Edge Collaborative LLM QoS Optimization Framework</article-title>
          , in: R. N. Chang, C. K. Chang, Z. Jiang, J. Yang, Z. Jin, M. Sheng, J. Fan, K. K. Fletcher, Q. He, Q. He, C. Ardagna, J. Yang, J. Yin, Z. Wang, A. Beheshti, S. Russo, N. Atukorala, J. Wu, P. S. Yu, H. Ludwig, S. Reif-Marganiec, E. Zhang, A. Sailer, N. Bena, K. Li, Y. Watanabe, T. Zhao, S. Wang, Z. Tu, Y. Wang, K. Wei (Eds.),
          <source>Proceedings of the IEEE International Conference on Web Services, ICWS</source>
          , Institute of Electrical and Electronics Engineers Inc.,
          <year>2024</year>
          , pp.
          <fpage>865</fpage>
          -
          <lpage>876</lpage>
          . doi:10.1109/ICWS62655.2024.00105.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nazari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Makrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Puri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Patwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sayadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rafatirad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-N.</given-names>
            <surname>Chuah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Homayoun</surname>
          </string-name>
          ,
          <article-title>LLM-FIN: Large Language Models Fingerprinting Attack on Edge Devices</article-title>
          ,
          in:
          <source>Proceedings - International Symposium on Quality Electronic Design, ISQED</source>
          , IEEE Computer Society,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:10.1109/ISQED60706.2024.10528736.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-tuning</article-title>
          ,
          in:
          <source>Proceedings of the 53rd International Conference on Parallel Processing, ICPP '24</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>762</fpage>
          -
          <lpage>771</lpage>
          . doi:10.1145/3673038.3673043.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          , T. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , T. Zhi,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          , T. Chen,
          <article-title>Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM</article-title>
          , in:
          <source>Proceedings of the Annual International Symposium on Microarchitecture, MICRO</source>
          , IEEE Computer Society
          ,
          <year>2024</year>
          , pp.
          <fpage>1474</fpage>
          -
          <lpage>1488</lpage>
          . doi:10.1109/MICRO61859.2024.00108.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>T.</given-names>
            <surname>Glint</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Ronak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kasture</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Momin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mekie</surname>
          </string-name>
          ,
          <article-title>AxLaM: Energy-efficient accelerator design for language models for edge computing</article-title>
          ,
          <source>Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences</source>
          <volume>383</volume>
          (
          <year>2025</year>
          )
          20230395. doi:10.1098/rsta.2023.0395.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          , L. Jiang,
          <article-title>DTATrans: Leveraging Dynamic Token-Based Quantization With Accuracy Compensation Mechanism for Efficient Transformer Architecture</article-title>
          ,
          <source>IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems</source>
          <volume>42</volume>
          (
          <year>2023</year>
          )
          <fpage>509</fpage>
          -
          <lpage>520</lpage>
          . doi:10.1109/TCAD.2022.3181541.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          , L. Jiang,
          <article-title>DTQAtten: Leveraging Dynamic Token-based Quantization for Efficient Attention Architecture</article-title>
          , in: C. Bolchini, I. Verbauwhede, I. Vatajelu (Eds.),
          <source>Proceedings of the 2022 Design, Automation and Test in Europe Conference and Exhibition, DATE 2022</source>
          , Institute of Electrical and Electronics Engineers Inc.,
          <year>2022</year>
          , pp.
          <fpage>700</fpage>
          -
          <lpage>705</lpage>
          . doi:10.23919/DATE54114.2022.9774692.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Panda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kanerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raychowdhury</surname>
          </string-name>
          ,
          <article-title>Special Session: Neuro-Symbolic Architecture Meets Large Language Models: A Memory-Centric Perspective</article-title>
          , in:
          <source>Proceedings - 2024 International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS 2024</source>
          , Institute of Electrical and Electronics Engineers Inc.,
          <year>2024</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          . doi:10.1109/CODES-ISSS60120.2024.00012.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Basit</surname>
          </string-name>
          , M. Shafique,
          <article-title>TinyDigiClones: A Multi-Modal LLM-Based Framework for Edge-optimized Personalized Avatars</article-title>
          ,
          in:
          <source>Proceedings of the International Joint Conference on Neural Networks</source>
          , Institute of Electrical and Electronics Engineers Inc.,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . doi:10.1109/IJCNN60899.2024.10649909.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          , T. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A First Look at LLM-powered Smartphones</article-title>
          ,
          in:
          <source>Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops, ASEW '24</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>217</lpage>
          . doi:10.1145/3691621.3694952.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Shi,
          <article-title>Blockchain-based Edge Intelligence Enabled by AI Large Models for Future Internet of Things</article-title>
          ,
          in:
          <source>2024 IEEE 12th International Conference on Information and Communication Networks, ICICN 2024</source>
          , Institute of Electrical and Electronics Engineers Inc.,
          <year>2024</year>
          , pp.
          <fpage>368</fpage>
          -
          <lpage>374</lpage>
          . doi:10.1109/ICICN62625.2024.10761527.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Large-Scale Traffic Flow Forecast with Lightweight LLM in Edge Intelligence</article-title>
          ,
          <source>IEEE Internet of Things Magazine</source>
          <volume>8</volume>
          (
          <year>2025</year>
          )
          <fpage>12</fpage>
          -
          <lpage>18</lpage>
          . doi:10.1109/IOTM.001.2400047.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>F.</given-names>
            <surname>Piccialli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chiaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellandi</surname>
          </string-name>
          , E. Damiani,
          <article-title>Federated and edge learning for large language models</article-title>
          ,
          <source>Information Fusion</source>
          <volume>117</volume>
          (
          <year>2025</year>
          )
          102840. doi:10.1016/j.inffus.2024.102840.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Bader</surname>
          </string-name>
          , Z. Han,
          <article-title>Distributed Foundation Models for MultiModal Learning in 6G Wireless Networks</article-title>
          ,
          <source>IEEE Wireless Communications</source>
          <volume>31</volume>
          (
          <year>2024</year>
          )
          <fpage>20</fpage>
          -
          <lpage>30</lpage>
          . doi:10.1109/MWC.009.2300501.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lipson</surname>
          </string-name>
          ,
          <article-title>Reconfigurable Robot Identification from Motion Data</article-title>
          ,
          in:
          <source>IEEE International Conference on Intelligent Robots and Systems</source>
          , Institute of Electrical and Electronics Engineers Inc.,
          <year>2024</year>
          , pp.
          <fpage>14133</fpage>
          -
          <lpage>14140</lpage>
          . doi:10.1109/IROS58592.2024.10801809.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kawaharazuka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Obinata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanazawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Okada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Inaba</surname>
          </string-name>
          ,
          <article-title>Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors</article-title>
          , in:
          <source>IEEE-RAS International Conference on Humanoid Robots</source>
          , IEEE Computer Society,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . doi:10.1109/Humanoids57100.2023.10375211.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hou</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          , K.-T. Cheng, M. Guo,
          <article-title>MMBench: Benchmarking End-to-End Multi-modal DNNs and Understanding Their Hardware-Software Implications</article-title>
          , in:
          <source>Proceedings - 2023 IEEE International Symposium on Workload Characterization, IISWC 2023</source>
          , Institute of Electrical and Electronics Engineers Inc.,
          <year>2023</year>
          , pp.
          <fpage>154</fpage>
          -
          <lpage>166</lpage>
          . doi:10.1109/IISWC59245.2023.00014.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mirhoseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Maziarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer</article-title>
          ,
          in:
          <source>5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings</source>
          , OpenReview.net,
          <year>2017</year>
          . URL: https://openreview.net/forum?id=B1ckMDqlg.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Model-agnostic meta-learning for fast adaptation of deep networks</article-title>
          ,
          in:
          <source>Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17</source>
          , JMLR.org,
          <year>2017</year>
          , pp.
          <fpage>1126</fpage>
          -
          <lpage>1135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Jouppi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patterson</surname>
          </string-name>
          , G. Agrawal,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bajwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhatia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Boden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Borchers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Boyle</surname>
          </string-name>
          , P.-l. Cantin,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Coriell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Daley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gelb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. V.</given-names>
            <surname>Ghaemmaghami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gottipati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gulland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hagmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hogberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hundt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hurt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ibarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaffey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaworski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Khaitan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Killebrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Laudon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Law</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lundin</surname>
          </string-name>
          , G. MacKean,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maggiore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nagarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Narayanaswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Norrie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Omernick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Penukonda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Phelps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Samadiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Severn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sizikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Snelham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Souter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Steinberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Thorson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Toma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tuttle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Walter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <article-title>In-Datacenter Performance Analysis of a Tensor Processing Unit</article-title>
          ,
          <source>SIGARCH Comput. Archit. News</source>
          <volume>45</volume>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi:10.1145/3140659.3080246.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>System Virtualization for Neural Processing Units</article-title>
          , in:
          <source>Proceedings of the 19th Workshop on Hot Topics in Operating Systems, HOTOS '23</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>86</lpage>
          . doi:10.1145/3593856.3595912.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>The Algorithmic Foundations of Differential Privacy</article-title>
          ,
          <source>Foundations and Trends in Theoretical Computer Science</source>
          <volume>9</volume>
          (
          <year>2014</year>
          )
          <fpage>211</fpage>
          -
          <lpage>407</lpage>
          . doi:10.1561/0400000042.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Youm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Privacy Set: Privacy-Authority-Aware Compiler for Homomorphic Encryption on Edge-Cloud System</article-title>
          ,
          <source>IEEE Internet Things J.</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <fpage>35167</fpage>
          -
          <lpage>35184</lpage>
          . doi:10.1109/JIOT.2024.3437356.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Achemlal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bouabdallah</surname>
          </string-name>
          ,
          <article-title>Trusted Execution Environment: What It is, and What It is Not</article-title>
          , in:
          <source>2015 IEEE Trustcom/BigDataSE/ISPA</source>
          , volume
          <volume>1</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          . doi:10.1109/Trustcom.2015.357.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Integrating Explainable AI with Federated Learning for Next-Generation IoT: A comprehensive review and prospective insights</article-title>
          ,
          <source>Computer Science Review</source>
          <volume>56</volume>
          (
          <year>2025</year>
          )
          <fpage>100697</fpage>
          . doi:10.1016/J.COSREV.2024.100697.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>A.</given-names>
            <surname>Petrella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Miozzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dini</surname>
          </string-name>
          ,
          <article-title>Mobile Traffic Prediction at the Edge Through Distributed and Deep Transfer Learning</article-title>
          ,
          <source>IEEE Access</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>191288</fpage>
          -
          <lpage>191303</lpage>
          . doi:10.1109/ACCESS.2024.3518483.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kessler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karout</surname>
          </string-name>
          ,
          <article-title>Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition</article-title>
          , in:
          <source>ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>7102</fpage>
          -
          <lpage>7106</lpage>
          . doi:10.1109/ICASSP43922.2022.9746223.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>L.</given-names>
            <surname>Falissard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
<surname>Affeldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nadif</surname>
          </string-name>
          ,
          <article-title>Attentive Perturbation: Extending Prefix Tuning to Large Language Models Inner Representations</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Nicosia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>La Malfa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>La Malfa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Pardalos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Umeton</surname>
          </string-name>
          (Eds.),
          <source>Machine Learning, Optimization, and Data Science - 9th International Conference, LOD 2023, Grasmere, UK, September 22-26, 2023, Revised Selected Papers, Part I</source>
          , volume
          <volume>14505</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2023</year>
          , pp.
          <fpage>488</fpage>
          -
          <lpage>496</lpage>
          . doi:10.1007/978-3-031-53969-5_36.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Hide and Seek in Noise Labels: Noise-Robust Collaborative Active Learning with LLMs-Powered Assistance</article-title>
          , in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024</source>
          , Association for Computational Linguistics,
          <year>2024</year>
          , pp.
          <fpage>10977</fpage>
          -
          <lpage>11011</lpage>
          . doi:10.18653/V1/2024.ACL-LONG.592.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>LLM-TSFD: An industrial time series human-in-the-loop fault diagnosis method based on a large language model</article-title>
          ,
          <source>Expert Syst. Appl.</source>
          <volume>264</volume>
          (
          <year>2025</year>
          ). doi:10.1016/j.eswa.2024.125861.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>M.</given-names>
            <surname>Garofalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Colosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Catalfamo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villari</surname>
          </string-name>
          ,
          <article-title>Web-Centric Federated Learning over the Cloud-Edge Continuum Leveraging ONNX and WASM</article-title>
          ,
          in:
          <source>IEEE Symposium on Computers and Communications, ISCC 2024, Paris, France, June 26-29, 2024</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . doi:10.1109/ISCC61673.2024.10733614.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>I. D.</given-names>
            <surname>Martinez-Casanueva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Lentisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <article-title>An Initial Approach to a Multiaccess Edge Computing Reference Architecture Implementation Using Kubernetes</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J. D.</given-names>
            <surname>Barroso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          (Eds.),
          <source>Broadband Communications, Networks, and Systems - 11th EAI International Conference, BROADNETS 2020, Qingdao, China, December 11-12, 2020, Proceedings</source>
          , volume
          <volume>355</volume>
          of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, Springer,
          <year>2020</year>
          , pp.
          <fpage>185</fpage>
          -
          <lpage>193</lpage>
          . doi:10.1007/978-3-030-68737-3_13.
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Lua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Crowcroft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <article-title>A survey and comparison of peer-to-peer overlay network schemes</article-title>
          ,
          <source>IEEE Communications Surveys &amp; Tutorials</source>
          <volume>7</volume>
          (
          <year>2005</year>
          )
          <fpage>72</fpage>
          -
          <lpage>93</lpage>
          . doi:10.1109/COMST.2005.1610546.
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Task Offloading with LLM-Enhanced Multi-Agent Reinforcement Learning in UAV-Assisted Edge Computing</article-title>
          ,
          <source>Sensors</source>
          <volume>25</volume>
          (
          <year>2025</year>
          )
          <fpage>175</fpage>
          . doi:10.3390/s25010175.
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Niyato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Brinton</surname>
          </string-name>
          ,
          <article-title>Serving Long-Context LLMs at the Mobile Edge: Test-Time Reinforcement Learning-based Model Caching and Inference Offloading</article-title>
          ,
          <source>CoRR abs/2501.14205</source>
          (
          <year>2025</year>
          ). doi:10.48550/ARXIV.2501.14205. arXiv:2501.14205.
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          [62]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Ishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ishiguro</surname>
          </string-name>
          ,
          <article-title>HAM-GNN: A hierarchical attention-based multi-dimensional edge graph neural network for dialogue act classification</article-title>
          ,
          <source>Expert Syst. Appl.</source>
          <volume>261</volume>
          (
          <year>2025</year>
          )
          <fpage>125459</fpage>
          . doi:10.1016/J.ESWA.2024.125459.
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          [63]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>A design method for edge-cloud collaborative product service system: a dynamic event-state knowledge graph-based approach with real case study</article-title>
          ,
          <source>International Journal of Production Research</source>
          <volume>62</volume>
          (
          <year>2024</year>
          )
          <fpage>2584</fpage>
          -
          <lpage>2605</lpage>
          . doi:10.1080/00207543.2023.2219345.
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          [64]
          <string-name>
            <given-names>N.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Efficient Image Captioning for Edge Devices</article-title>
          , in:
          <string-name>
            <given-names>B.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Neville</surname>
          </string-name>
          (Eds.),
          <source>Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023</source>
          , AAAI Press,
          <year>2023</year>
          , pp.
          <fpage>2608</fpage>
          -
          <lpage>2616</lpage>
          . doi:10.1609/AAAI.V37I2.25359.
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          [65]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <article-title>Latency-Aware Adaptive Video Summarization for Mobile Edge Clouds</article-title>
          ,
          <source>IEEE Trans. Multim</source>
          .
          <volume>22</volume>
          (
          <year>2020</year>
          )
          <fpage>1193</fpage>
          -
          <lpage>1207</lpage>
          . doi:10.1109/TMM.2019.2939753.
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          [66]
          <string-name>
            <given-names>R.</given-names>
            <surname>Liashenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Semerikov</surname>
          </string-name>
          ,
          <article-title>The Determination and Visualisation of Key Concepts Related to the Training of Chatbots</article-title>
          , in:
          <string-name>
            <given-names>E.</given-names>
            <surname>Faure</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tryus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vartiainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Danchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bazilo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zaspa</surname>
          </string-name>
          (Eds.),
          <source>Information Technology for Education, Science, and Technics</source>
          , volume
          <volume>222</volume>
          <source>of Lecture Notes on Data Engineering and Communications Technologies</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>126</lpage>
          . doi:10.1007/978-3-031-71804-5_8.
        </mixed-citation>
      </ref>
      <ref id="ref67">
        <mixed-citation>
          [67]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mukovoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vakaliuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Semerikov</surname>
          </string-name>
          ,
          <article-title>Road Sign Recognition Using Convolutional Neural Networks</article-title>
          ,
          in:
          <source>Information Technology for Education, Science, and Technics</source>
          , volume
          <volume>222</volume>
          <source>of Lecture Notes on Data Engineering and Communications Technologies</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>172</fpage>
          -
          <lpage>188</lpage>
          . doi:10.1007/978-3-031-71804-5_12.
        </mixed-citation>
      </ref>
      <ref id="ref68">
        <mixed-citation>
          [68]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fakih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dharmaji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Moghaddas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Quiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ogundare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Al Faruque</surname>
          </string-name>
          ,
          <article-title>LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems</article-title>
          ,
          in:
          <source>Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP '24</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>192</fpage>
          -
          <lpage>203</lpage>
          . doi:10.1145/3639477.3639743.
        </mixed-citation>
      </ref>
      <ref id="ref69">
        <mixed-citation>
          [69]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>MindGuard: Towards Accessible and Stigma-free Mental Health First Aid via Edge LLM</article-title>
          ,
          <source>CoRR abs/2409.10064</source>
          (
          <year>2024</year>
          ). doi:10.48550/ARXIV.2409.10064. arXiv:2409.10064.
        </mixed-citation>
      </ref>
      <ref id="ref70">
        <mixed-citation>
          [70]
          <string-name>
            <given-names>E.</given-names>
            <surname>Strubell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <article-title>Energy and Policy Considerations for Modern Deep Learning Research</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>34</volume>
          (
          <year>2020</year>
          )
          <fpage>13693</fpage>
          -
          <lpage>13696</lpage>
          . doi:10.1609/aaai.v34i09.7123.
        </mixed-citation>
      </ref>
      <ref id="ref71">
        <mixed-citation>
          [71]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khoshsirat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Perin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <article-title>Decentralized LLM inference over edge networks with energy harvesting</article-title>
          ,
          <source>CoRR abs/2408.15907</source>
          (
          <year>2024</year>
          ). doi:10.48550/ARXIV.2408.15907. arXiv:2408.15907.
        </mixed-citation>
      </ref>
      <ref id="ref72">
        <mixed-citation>
          [72]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mohiuddin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almogren</surname>
          </string-name>
          ,
          <article-title>Workload aware VM consolidation method in edge/cloud computing for IoT applications</article-title>
          ,
          <source>J. Parallel Distributed Comput</source>
          .
          <volume>123</volume>
          (
          <year>2019</year>
          )
          <fpage>204</fpage>
          -
          <lpage>214</lpage>
          . doi:10.1016/J.JPDC.2018.09.011.
        </mixed-citation>
      </ref>
      <ref id="ref73">
        <mixed-citation>
          [73]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Immersive Multimedia Service Caching in Edge Cloud with Renewable Energy</article-title>
          ,
          <source>ACM Trans. Multim. Comput. Commun. Appl</source>
          .
          <volume>20</volume>
          (
          <year>2024</year>
          )
          <fpage>173:1</fpage>
          -
          <lpage>173:23</lpage>
          . doi:10.1145/3643818.
        </mixed-citation>
      </ref>
      <ref id="ref74">
        <mixed-citation>
          [74]
          <string-name>
            <given-names>D. O.</given-names>
            <surname>Hanchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Semerikov</surname>
          </string-name>
          ,
          <article-title>Implementing MLOps practices for effective machine learning model deployment: A meta synthesis</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          <volume>3918</volume>
          (
          <year>2024</year>
          )
          <fpage>329</fpage>
          -
          <lpage>337</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref75">
        <mixed-citation>
          [75]
          <string-name>
            <given-names>D. O.</given-names>
            <surname>Hanchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Semerikov</surname>
          </string-name>
          ,
          <article-title>Automating machine learning: A meta-synthesis of MLOps tools, frameworks and architectures</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          <volume>3917</volume>
          (
          <year>2025</year>
          )
          <fpage>362</fpage>
          -
          <lpage>414</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>