<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SIGIR Workshop on eCommerce</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The Bleeding Edge: Large-Scale Production Inference of LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yubin Kim</string-name>
          <email>yubin@vody.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arthur Maciejewicz</string-name>
          <email>arthur@vody.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brandon Beveridge</string-name>
          <email>brandon@vody.com</email>
        </contrib>
        <aff>Vody, New York</aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>17</volume>
      <issue>2025</issue>
      <abstract>
        <p>Large Language Models (LLMs) have demonstrated strong generalization capabilities across a range of natural language tasks and are increasingly being integrated into production systems. However, scalable, cost-efficient, and maintainable deployment of LLMs remains underexplored in academic literature. This paper presents our experiences building a production-grade architecture for asynchronous, large-scale batch inference of LLMs at Vody, a generative AI startup focused on product data enrichment for e-commerce. We describe the key architectural decisions that enabled us to process tens of millions of products through multiple LLMs within hours. We also share pragmatic lessons learned about tooling reliability, debugging strategies, and infrastructure design. Our goal is to provide actionable guidance for teams facing similar challenges in deploying LLMs at scale using open-source tooling.</p>
      </abstract>
      <kwd-group>
        <kwd>LLM</kwd>
        <kwd>efficient inference</kwd>
        <kwd>system design</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) have attracted significant attention in both academia and industry due
to their strong performance and ability to generalize across many tasks. While naive LLM inference
can be computationally expensive, efficient inference of LLMs is a thriving area
of academic research [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. In addition, a robust open-source community is translating this research
into common libraries and tooling [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], so that scalable inference of LLMs is becoming accessible to
those without deep research expertise.
      </p>
      <p>However, there is very little in the literature that discusses practical architecture design choices for
productionizing LLMs using commonly available tools. Ganiev et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] discuss a system architecture
for a BERT-based model. Mailach et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Parnin et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] survey developers about what they consider
challenges in designing an LLM-based application, but do not include design recommendations or a
discussion of open-source tools.</p>
      <p>Vody is an early-stage generative AI startup that builds multimodal LLMs for e-commerce. Our
models enrich product catalog data in ways targeted to improve the performance of downstream search
&amp; discovery applications. Some examples of product data enrichment tasks include copywriting (e.g.
generating product titles, descriptions), product attribute labeling (e.g. color, style, material), and
open-ended keyword generation (e.g. related search queries). Access to our offerings is provided
through a software-as-a-service (SaaS) API backed by an asynchronous batch processing system designed
to process tens of millions of items through multiple task-specific models in a matter of hours.</p>
      <p>This paper chronicles our journey in designing an architecture for efficient, scalable batch inference
of LLMs, detailing our decision-making, the pitfalls we encountered, and the key lessons we learned
along the way. While model training and evaluation are also valuable problems to examine, in this
work, we focus on the understudied topic of scalable deployment.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Architecture</title>
      <p>The design of our system architecture was driven by the following desiderata:
1. Consistency across different cloud providers: our offering is deployed on both Google Cloud Platform
and Amazon Web Services, thus we desire to minimize the use of cloud-specific services such
that our infrastructure remains consistent across both clouds.
2. Support for client-specific data workflows: each of our clients has different data formats and
subscribes to a different subset of our product offerings. Furthermore, our models are fine-tuned to be
client-specific. Thus, our system must support easily configurable custom workflows containing
different data processing and model inference steps.
3. Horizontally scalable batch inference: our clients include large retailers that have tens of millions of
distinct items in their catalogs which must be re-processed on a monthly basis. In addition, there are
tens of thousands of daily updates to the product catalog which must be processed. This requires
a design that can quickly scale up based on the request load.
4. Efficient serving of multiple LLMs: as mentioned above, a client may be subscribed to multiple
different product offerings. In addition, a single product offering may be executed through multiple
task-specific models, including output quality guardrail models that ensure that generated
content adheres to client-specific brand guidelines. This means a single client tenant must serve
and run inference on multiple different LLMs.
5. Future-proofing against new open-source model releases: one of our key value propositions is
continued technology updates that incorporate the latest advances in open-source LLMs. Thus, it is
imperative that our architecture can flexibly accommodate different types of base models and is
easy to update and maintain.</p>
      <p>Our architecture is designed to meet the above requirements for large-scale batch inference of LLMs
with an emphasis on scalability, modularity, and operational efficiency. We implemented a system
that separates data processing from model serving, standardizes API communication, and leverages
containerization for deployment flexibility.</p>
      <sec id="sec-2-1">
        <title>2.1. System overview</title>
        <p>The overall system architecture follows a microservices design pattern with distinct components
handling different aspects of the inference pipeline:
• REST API layer: Serves as the entry point to our system, handling request validation,
authentication, and routing.
• Orchestration layer: Manages the flow of data through the system using Airflow directed acyclic
graphs (DAGs), which coordinate batch processing tasks.
• Processing layer: CPU-optimized containers backed by Kubernetes that handle data preparation,
pre/post-processing, and business logic.
• Inference layer: GPU-optimized containers backed by Kubernetes dedicated to model serving
with minimal dependencies beyond inference requirements.
• Storage layer: Persistent storage for models, inference results, and logging data.</p>
        <p>To meet Desideratum 1, we use open-source tools that are cross-cloud compatible where possible.
Figure 1 presents an overview of our system architecture and how data flows through the system
based on an API request by the user. Users can interact with our system in two ways. First, they can
upload product data to our API to initiate a request for data enrichment, which is handled in a batch
asynchronous fashion and shown with the red arrows in the Figure:
1. User submits product data to the Django REST API
2. API writes product data to PostgreSQL queue and metadata tables
3. Kubernetes worker retrieves the N latest products from the queue and schedules Airflow jobs on
CPU worker nodes
4. Worker calls the vLLM/lmdeploy service via the OpenAI interface for LLM enrichment (steps 3
and 4 are sketched below)
Afterwards, the user can initiate a request to retrieve the enriched product data,
shown with the blue arrows in the Figure:
1. User retrieves processed data from the API
2. API reads processed data from PostgreSQL database</p>
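        <p>To make steps 3 and 4 concrete, the sketch below shows the general shape of such a queue-draining
worker. The table name, DAG id, endpoint, and credentials are hypothetical placeholders, not our
production values.</p>
        <preformat>
"""Sketch of the batch worker (steps 3-4): drain the queue, then trigger a DAG run."""
import psycopg2
import requests

BATCH_SIZE = 500  # illustrative batch size

def drain_queue_and_schedule():
    # Step 3: pull the N latest queued products from PostgreSQL.
    conn = psycopg2.connect("dbname=enrichment user=worker")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT product_id FROM product_queue "
            "ORDER BY enqueued_at DESC LIMIT %s",
            (BATCH_SIZE,),
        )
        product_ids = [row[0] for row in cur.fetchall()]

    if not product_ids:
        return  # graceful exit: an empty queue is not a failure (see Section 3.1)

    # Schedule an Airflow DAG run for this batch via Airflow's stable REST API;
    # the DAG's tasks then call the inference service (step 4).
    requests.post(
        "http://airflow-webserver:8080/api/v1/dags/enrich_products/dagRuns",
        json={"conf": {"product_ids": product_ids}},
        auth=("worker", "********"),  # credentials elided
        timeout=30,
    )
</preformat>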
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Airflow-based workflow management</title>
        <p>We orchestrate and schedule data processing and model inference steps through Airflow DAGs defined
in Python. Defining these workflows programmatically gives us fine-grained control over execution
logic, simplifies versioning and reuse, and enables us to easily customize and swap pipelines across
client tenants based on their specific requirements, meeting Desideratum 2.</p>
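        <p>For illustration, the sketch below shows how such a per-client DAG can be assembled in Python.
The client name, step names, and builder helper are hypothetical; real pipelines contain more steps
and configuration.</p>
        <preformat>
"""Sketch: assemble a client-specific enrichment DAG from shared building blocks."""
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_step(**context):
    """Placeholder body; real steps do pre/post-processing or call the LLM service."""

def build_enrichment_dag(client, steps):
    # One DAG per client tenant, with the step list driven by their subscription.
    dag = DAG(
        dag_id=f"enrich_{client}",
        start_date=datetime(2025, 1, 1),
        schedule=None,  # triggered on demand by the queue worker
        catchup=False,
    )
    previous = None
    for step in steps:
        op = PythonOperator(task_id=step, python_callable=run_step, dag=dag)
        if previous is not None:
            previous.set_downstream(op)  # chain the steps sequentially
        previous = op
    return dag

dag = build_enrichment_dag("client_a", ["normalize", "attribute_labeling", "guardrail"])
</preformat>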
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Kubernetes configuration and resource allocation</title>
        <p>For horizontal scalability (Desideratum 3), our computation is backed by Kubernetes. A key architectural
decision was to separate CPU-intensive data processing from GPU-dependent model serving in our
Kubernetes deployment (a sketch of this split follows the list). This separation addresses several challenges:
• Container Size Management: CUDA dependencies make model-serving containers significantly
larger (often 10–20 GB versus 1–2 GB for processing containers). Maintaining this separation
keeps most containers lightweight, decreasing node spin-up time.
• Resource Utilization: GPU resources are allocated only where needed, maximizing cost
efficiency and computational throughput.
• Dependency Isolation: We encountered numerous dependency conflicts, particularly with
different versions of the transformers library required by different models. Isolation prevents
these conflicts.</p>
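        <p>The sketch below shows what this split can look like with the Kubernetes Python client; the image
names, resource figures, and node label are placeholders rather than our actual configuration.</p>
        <preformat>
"""Sketch: keep CPU processing pods separate from GPU inference pods (values illustrative)."""
from kubernetes import client

# Lightweight processing container: no CUDA stack, so the image stays small (1-2 GB).
processing_container = client.V1Container(
    name="processing",
    image="registry.example.com/processing:latest",
    resources=client.V1ResourceRequirements(requests={"cpu": "4", "memory": "8Gi"}),
)

# Heavy model-serving container: CUDA included, GPU requested explicitly.
inference_container = client.V1Container(
    name="inference",
    image="registry.example.com/vllm-serving:latest",
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)

# Pin inference pods to GPU nodes so GPUs are allocated only where needed.
inference_pod = client.V1PodSpec(
    containers=[inference_container],
    node_selector={"gpu-type": "l40s"},  # hypothetical node label
)
</preformat>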
        <p>For horizontal scaling, we implemented auto-scaling policies based on queue depth and processing
latency metrics. The system can dynamically adjust the number of processing nodes while maintaining
a core set of inference nodes that remain warmed up with models loaded in memory.</p>
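        <p>The scaling rule itself can stay simple; the sketch below captures its spirit, with all thresholds
and bounds as made-up illustrative values.</p>
        <preformat>
"""Sketch of the queue-depth-based scaling rule (all constants illustrative)."""
PRODUCTS_PER_NODE = 5_000  # rough per-node throughput per scaling interval
MIN_NODES = 2              # warm core of nodes kept up with models loaded
MAX_NODES = 50

def desired_processing_nodes(queue_depth):
    """Scale worker count with queue depth, clamped to a fixed range."""
    wanted = -(-queue_depth // PRODUCTS_PER_NODE)  # ceiling division
    return min(MAX_NODES, max(MIN_NODES, wanted))
</preformat>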
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Model management</title>
        <p>
          Our model management approach focuses on flexibility and parameter-efficient fine-tuning,
specifically LoRA [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], in order to allow us to easily serve multiple task-specific models (Desideratum 4).
LoRA fine-tuning significantly reduces storage requirements and deployment complexity, as only the
base model needs to be loaded into GPU memory, with small LoRA adapters (~10–100 MB) swapped
dynamically.
        </p>
        <p>
          Our system is flexible with respect to the type of base model. At various points in time, we have used
the following open-source models: LLaMA 2 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], LLaMA 3 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], Qwen-VL 1.5 [11], Qwen-VL 2.5 [12], and
Mistral [13]. Base models and fine-tuned LoRA adapters are versioned and stored in cloud storage
(S3/GCS), providing a single source of truth across all deployment environments; a sketch of this layout follows.
        </p>
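        <p>The sketch below shows how versioned weights might be resolved from object storage with boto3;
the bucket name and key layout are hypothetical.</p>
        <preformat>
"""Sketch: fetch a pinned base model and its LoRA adapter from S3 (layout illustrative)."""
import boto3

BUCKET = "example-model-registry"  # placeholder bucket

def fetch_model(base_model, adapter, version, dest="/models"):
    """Download a versioned base model archive and its small (~10-100 MB) adapter."""
    s3 = boto3.client("s3")
    # The same keys are used in every environment: one source of truth.
    s3.download_file(BUCKET, f"base/{base_model}/{version}/weights.tar",
                     f"{dest}/base.tar")
    s3.download_file(BUCKET, f"adapters/{adapter}/{version}/adapter.safetensors",
                     f"{dest}/adapter.safetensors")
</preformat>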
        <p>To address cold-start latency, particularly for large models, we implemented a warm-up strategy
where frequently used models remain loaded in memory. In future iterations, we plan to implement
stateful storage attachment to Kubernetes nodes to accelerate model initialization times.</p>
        <p>We primarily use models in the 7–8 billion parameter range, for two key reasons: a) we have
empirically found it to be the sweet spot in the trade-off between accuracy and efficiency for our data
enrichment tasks; b) we encountered significant operational challenges while trying to secure access to
GPUs with larger VRAM capacities, such as NVIDIA A100 GPUs.</p>
        <p>In both clouds, A100 spot instances were often entirely unavailable due to high demand. Even
with quota approvals for long-term reservations, due to limited physical supply, acquiring an instance
required running an automated script polling at five-minute intervals—often over multiple hours.
Furthermore, reserving A100s on a sustained basis proved cost-prohibitive. To support horizontal
scalability while maintaining reasonable availability and cost, we transitioned to using NVIDIA L40S
GPUs, which were more reliably accessible. This hardware constraint was a key factor in our decision to
use 7–8 billion parameter models, which can run efficiently within the memory limits of L40S instances.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Inference server implementation</title>
        <p>
          Our inference server architecture prioritizes cross-compatibility with different base models
and leverages cloud storage and dynamic model loading. While our preferred inference server is
vLLM [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], our architecture supports multiple different types of inference servers for broader compatibility
with different base models (Desideratum 5).
        </p>
        <p>
          • We standardized on the OpenAI API definition for inference server communication, which
provides a well-documented interface that simplifies integration with various types of models
and clients.
• We implemented dynamic adapter loading using vLLM’s LoRA support, allowing us to load
different adapters at runtime based on request parameters without maintaining separate copies
of the full model weights (see the sketch after this list).
• For the Qwen-VL series of models, we implemented lmdeploy [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which better supports the
specialized architecture of these models.
        </p>
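        <p>Putting the first two points together, the sketch below shows how a request can select a
task-specific adapter through the standard OpenAI client, assuming a server launched with LoRA
serving enabled (for vLLM, via --enable-lora with registered --lora-modules); the adapter name and
service URL are placeholders.</p>
        <preformat>
"""Sketch: select a LoRA adapter per request through the OpenAI-compatible API."""
from openai import OpenAI

# Point the standard client at the in-cluster inference service (placeholder URL).
client = OpenAI(base_url="http://inference-service:8000/v1", api_key="unused")

response = client.chat.completions.create(
    # With vLLM, a registered adapter is addressed as a model name, so swapping
    # tasks is a request parameter change rather than a redeploy.
    model="product-title-lora",
    messages=[{"role": "user", "content": "Write a title for this product: ..."}],
)
print(response.choices[0].message.content)
</preformat>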
        <p>Note that swapping between inference server implementations has hidden pitfalls. While they expose
the same interface, making it easy to swap them out, inference server implementations like vLLM
and lmdeploy can have mechanistic differences that make configuration non-portable between them.
Similarly, configuration parameters that map to the same underlying mechanism can differ in name
and description, like vLLM’s --max-model-len and lmdeploy’s --session-len. This tends to manifest
in performance and reliability pitfalls, making tuning a separate job for each backend.</p>
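        <p>One mitigation we find useful is keeping the per-backend launch arguments in one explicit map so
such differences stay visible; the sketch below is illustrative, with the model path and context length
as placeholder values.</p>
        <preformat>
"""Sketch: the same context-length setting is spelled differently per backend."""
import subprocess

MODEL = "/models/base"  # placeholder path
CONTEXT_LEN = "8192"    # one mechanism, two flag names

LAUNCH_ARGS = {
    # vLLM calls the context window --max-model-len ...
    "vllm": ["vllm", "serve", MODEL, "--max-model-len", CONTEXT_LEN],
    # ... while lmdeploy calls it --session-len; other knobs diverge similarly,
    # so each backend must be tuned independently.
    "lmdeploy": ["lmdeploy", "serve", "api_server", MODEL, "--session-len", CONTEXT_LEN],
}

def launch(backend):
    subprocess.run(LAUNCH_ARGS[backend], check=True)
</preformat>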
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Lessons Learned and Recommendations</title>
      <p>In this section, we present key recommendations based on our (often painful) lessons learned during
the process of designing and implementing an architecture for large-scale LLM inference.</p>
      <p>One of the most important lessons we learned is that the open-source ecosystem around LLMs
remains highly volatile. The underlying technologies are evolving rapidly, and many open-source
projects, often developed in academic or experimental contexts, can be unstable or contain critical
bugs. With that in mind, we offer the following guidance to teams getting ready to embark on similar
journeys:</p>
      <sec id="sec-3-1">
        <title>3.1. Make debugging easy</title>
        <p>Debugging stochastic systems, i.e. systems that rely on LLM output, is inherently difficult. This,
combined with bleeding-edge code bases, means that you will be debugging, a lot. Our most important
recommendation is to make significant, up-front investments into developer experience and lean into
system design choices that prioritize ease of debugging and error detection:
• Minimize iteration time. For effective debugging, it is critical to minimize the time between
making a change and being able to test said change. We use docker-compose to run Airflow
and our API server locally, while maintaining connections to remote inference servers through
Kubernetes port forwarding. This hybrid approach allows us to securely expose production-equivalent
inference servers to local DAG runs without the resource-intensive task of running AI
models on developer machines, which reduced our iteration time from 30 minutes to less than 1
minute.
• Invest in collated logging. At scale, collated logs simplify maintenance. Expected, low-priority,
or unactionable issues can be triaged appropriately without requiring redundant investigations.
Aggregations automatically organize repeated issues into bugs, reducing management toil. Causal
relationships can be established through cross-analysis with other parts of your production stack.</p>
        <p>We use Sentry for logging (Figure 2).
• Carefully curate metrics and alerting. Ensure metrics and alerting are high quality. For
example, to prevent alert exhaustion, ensure graceful exits (example: an empty work queue) are not
logged as failures for a given workflow.
• Make replicating state easy. Multi-step stochastic systems are often difficult to debug
end-to-end. Log everything required to replicate a step in isolation, including intermediate data output
and random seed settings, if applicable. We use a combination of home-grown scripts and tools
such as Weights and Biases to log and reproduce individual steps in a DAG (sketched after this list).
• Fail fast. To fail fast, prefer to halt when handling errors, rather than continuing with the program.
At scale, this makes it easier to identify the root cause of issues, rather than investigating a cascade
of misleading comorbidities.
• Less is more. Our initial architecture design had several more frameworks and components,
e.g. we used the LangChain framework, and we were using Celery workers to manage parts of data
ingestion. However, as we iterated on the design, we quickly realized that more components
meant more debugging complexity. In addition, LLM-ecosystem frameworks such as LangChain
became a source of additional bugs. We thus removed unnecessary frameworks and components
and simplified our architecture as much as possible: we removed the use of LangChain as we found
that standardizing on the OpenAI API was sufficient for our needs, and we removed Celery workers
in favor of ingesting the data lazily as needed.</p>
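        <p>As a small illustration of the state-replication point above, the sketch below wraps a step so that
its seed and intermediate data are logged for later replay; the project name and logged fields are
illustrative, not our actual tooling.</p>
        <preformat>
"""Sketch: log everything needed to re-run one DAG step in isolation (fields illustrative)."""
import json
import random

import wandb

def run_step_replicably(step_name, step_fn, inputs, seed=0):
    """Wrap a DAG step so its exact inputs and seed are captured for replay."""
    run = wandb.init(project="dag-step-replay", name=step_name, config={"seed": seed})
    random.seed(seed)                        # pin stochastic behavior
    run.log({"inputs": json.dumps(inputs)})  # intermediate data needed for replay
    outputs = step_fn(inputs)
    run.log({"outputs": json.dumps(outputs)})
    run.finish()
    return outputs
</preformat>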
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Follow the herd</title>
        <p>Given the fast-moving nature of the nascent LLM tooling ecosystem, some reliance on immature
technology is inevitable. Thus, wherever possible, we strongly recommend choosing established,
well-documented technologies. For example, we use Airflow for workflow orchestration and Django for
our API serving layer—both of which offer strong community support, mature documentation, and
proven reliability in production environments, making them far easier to troubleshoot and integrate
at scale. An important corollary is that GenAI coding assistants are excellent at generating code for
mature frameworks compared to new libraries, which can substantially reduce developer effort.</p>
        <p>In addition, even within the LLM ecosystem, some paths are better trodden than others: we favor
using LLaMA models when reasonable because of the robust community and tooling that have grown
around the LLaMA project through its broad adoption.</p>
        <p>There are also pairs of technologies that work better together than other options. For example,
Google’s open-source LLM Gemma has officially published guides for integration with vLLM
(https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-vllm). Similarly,
we have found that the Qwen series of models is best supported by lmdeploy as its inference service.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Distrust your tools</title>
        <p>The rapid pace of development in the LLM ecosystem means that frameworks and tools are still maturing
and can exhibit inconsistent or unreliable behavior. As a result, we have learned to approach these
tools with a healthy degree of skepticism. In this section, we highlight one illustrative example that
may be particularly relevant for teams navigating similar environments:</p>
        <p>nvidia-smi is a command-line tool provided by NVIDIA to monitor GPU usage. However, we
came to realize that the tool’s utilization metric is a simplified proxy for GPU saturation. Inference
optimizations like batching can still improve throughput even when utilization is reported to be high
(Figure 3). As a proxy for relative GPU saturation, we find that power consumption is more useful,
and overall we recommend instrumenting your pipeline to identify your own bottlenecks with your own
metrics.</p>
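        <p>One way to sample the reported utilization alongside power draw is sketched below.</p>
        <preformat>
"""Sketch: sample nvidia-smi utilization together with power draw."""
import subprocess

def gpu_sample():
    """Return (utilization %, power draw W) pairs, one per GPU."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu,power.draw",
        "--format=csv,noheader,nounits",
    ], text=True)
    rows = [line.split(", ") for line in out.strip().splitlines()]
    # Utilization saturates early under batching; power draw keeps discriminating,
    # which is why we treat it as the better proxy for saturation.
    return [(float(util), float(power)) for util, power in rows]
</preformat>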
        <p>Throughout the development of our system, we encountered bugs across the entire stack, including in
the LLM models themselves, as well as in model serving frameworks. In the course of addressing these
issues, we actively contributed bug reports and patches to open-source projects such as LangChain and
LMDeploy, helping to improve the broader ecosystem, and we encourage other teams to do the same.</p>
        <p>Figure 3: (a) vLLM GPU utilization with batch size 2; (b) vLLM GPU utilization with batch size 100.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Efficient large-scale inference of LLMs in production is still a developing discipline. Through our work at
Vody, we have found that achieving scale, reliability, and maintainability requires careful system design
choices that prioritize flexibility and ease of debugging. Our architecture, based on well-established
tools such as Airflow, Kubernetes, and Django, has allowed us to meet demanding client requirements
while maintaining a small, agile engineering team. However, many open-source projects in the LLM
ecosystem remain immature, and unexpected failures are common. We emphasize the importance of
simplifying system components, following established patterns, and maintaining a critical perspective
on tooling. By sharing our architecture and key lessons, we hope to contribute to a growing body of
knowledge on operationalizing LLMs and support other teams on similar paths.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Writefull for grammar and spelling checking.
ChatGPT and Claude were used to paraphrase and reword, improve writing style, and assist with
drafting the abstract and other content. After using these tools/services, the author(s) reviewed and
edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] R. Zhen, J. Li, Y. Ji, Z. Yang, T. Liu, Q. Xia, X. Duan, Z. Wang, B. Huai, M. Zhang, Taming the Titans: A Survey of Efficient LLM Inference Serving, 2025. URL: http://arxiv.org/abs/2504.19720. doi:10.48550/arXiv.2504.19720. arXiv:2504.19720 [cs].</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Z. Zhou, X. Ning, K. Hong, T. Fu, J. Xu, S. Li, Y. Lou, L. Wang, Z. Yuan, X. Li, S. Yan, G. Dai, X.-P. Zhang, Y. Dong, Y. Wang, A Survey on Efficient Inference for Large Language Models, 2024. URL: http://arxiv.org/abs/2404.14294. doi:10.48550/arXiv.2404.14294. arXiv:2404.14294 [cs].</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] LMDeploy Contributors, LMDeploy: A toolkit for compressing, deploying, and serving LLM, https://github.com/InternLM/lmdeploy, 2023.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, I. Stoica, Efficient Memory Management for Large Language Model Serving with PagedAttention, 2023. URL: http://arxiv.org/abs/2309.06180. doi:10.48550/arXiv.2309.06180. arXiv:2309.06180 [cs].</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Ganiev, C. Chapin, A. De Andrade, C. Liu, An Architecture for Accelerated Large-Scale Inference of Transformer-Based Language Models, in: Y.-b. Kim, Y. Li, O. Rambow (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers, Association for Computational Linguistics, Online, 2021, pp. 163–169. URL: https://aclanthology.org/2021.naacl-industry.21/. doi:10.18653/v1/2021.naacl-industry.21.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Mailach, S. Simon, J. Dorn, N. Siegmund, Themes of Building LLM-based Applications for Production: A Practitioner’s View, 2025. URL: http://arxiv.org/abs/2411.08574. doi:10.48550/arXiv.2411.08574. arXiv:2411.08574 [cs].</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] C. Parnin, G. Soares, R. Pandita, S. Gulwani, J. Rich, A. Z. Henley, Building Your Own Product Copilot: Challenges, Opportunities, and Needs, 2023. URL: http://arxiv.org/abs/2312.14231. doi:10.48550/arXiv.2312.14231. arXiv:2312.14231 [cs].</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: Proceedings of the Tenth International Conference on Learning Representations (ICLR), OpenReview.net, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. URL: http://arxiv.org/abs/2307.09288. doi:10.48550/arXiv.2307.09288. arXiv:2307.09288 [cs].</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., The Llama 3 Herd of Models, 2024. URL: http://arxiv.org/abs/2407.21783. doi:10.48550/arXiv.2407.21783. arXiv:2407.21783 [cs].</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>