<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>AI-Facilitated Software Project Generation from Natural Language Using Curated Code Snippets⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Artem Khovrat</string-name>
          <email>artem.khovrat@nure.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Serhii Shchoholiev</string-name>
          <email>serhii.shchoholiev@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariya Shirokopetleva</string-name>
          <email>marija.shirokopetleva@nure.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volodymyr Kobziev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volodymyr Strukov</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kay.ai</institution>
          ,
          <addr-line>240 Kent Avenue, Brooklyn, New York, 11249</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>14, Nauky, Ave., Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>V.N. Karazin Kharkiv National University</institution>
          ,
          <addr-line>4, Svobody, Sq., Kharkiv, 61022</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>This paper addresses the challenge of generating software projects efficiently from natural language descriptions. Instead of relying on unconstrained code generation, which often produces inconsistent or unreliable results, this research explores an approach based on reusing curated, pre-stored code snippets. The study focuses on the critical step of mapping project descriptions to relevant code assets, evaluating how effectively an AI model can predict snippet relevance through systematic experimentation. The methodology combines deterministic filtering with semantic reasoning, utilizing a NoSQL database for snippet storage and large language models for relevance prediction. Multiple models, including GPT-4o-mini, o3-mini, and GPT-4o, are evaluated on a benchmark of 100 synthetic project descriptions against 100 curated code snippets. The research investigates three prompt engineering strategies: zero-shot, few-shot, and chain-of-thought approaches. Results demonstrate that natural language input can be reliably aligned with reusable code components, with chain-of-thought prompting achieving 43.1% accuracy compared to 30.3% for zero-shot approaches. GPT-4o-mini emerges as the optimal model, balancing performance with cost-effectiveness at approximately 7.33× lower cost than premium alternatives. The findings support the feasibility of snippet-augmented project generation as a pathway toward faster and more consistent software development. This study highlights the potential of combining AI-powered interpretation with structured code reuse, offering an alternative to purely generative approaches that maintains quality while accelerating development cycles. The approach provides a foundation for enterprise-scale deployment and integration into existing coding environments.</p>
      </abstract>
      <kwd-group>
        <kwd>code generation</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>code snippet</kwd>
        <kwd>natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In enterprise software development, the reuse of existing code is a well-established practice aimed
at enhancing productivity and maintaining consistency across projects. However, despite the
availability of extensive codebases, developers often resort to "vibe coding" — a rapid,
heuristic-driven approach to coding that prioritizes speed over reliability. This method frequently leads to
the introduction of defects and technical debt, undermining long-term maintainability. A study by
Tornhill and Borg [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] highlights the significant impact of code quality on development efficiency,
revealing that low-quality code contains 15 times more defects than high-quality code, and
resolving issues in such code takes, on average, 124% more time. This underscores the necessity for
a more structured approach to code reuse that balances speed with reliability.
      </p>
      <p>
        The challenge lies in effectively identifying and integrating relevant, high-quality code snippets
from vast repositories. Manual selection is time-consuming and error-prone, while existing
automated systems often lack the sophistication to understand the context of a developer's natural
language input, leading to irrelevant or suboptimal code suggestions. This paper introduces an
AI-powered system designed to enhance software project generation by predicting the relevance of
pre-stored code snippets to a given natural language project description [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. By focusing on the
critical task of snippet relevance prediction, this approach aims to streamline the development
process, enabling developers to leverage existing code assets efficiently.
      </p>
      <p>The key contributions of this research are the development of a relevance prediction algorithm
that accurately predicts the relevance of code snippets based on natural language project
descriptions, the evaluation of the algorithm’s performance through comprehensive experiments
demonstrating its capability to enhance the software development process, and the provision of a
foundation for integrating curated code databases and relevance prediction into advanced coding
agents, such as Cursor, to further automate and improve the software development lifecycle.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        The practice of code reuse has long been recognized as a key strategy to improve software
development productivity and maintainability. Studies have shown that systematic reuse of code
components, templates, and libraries reduces defects and development time, particularly in
enterprise environments with complex, proprietary codebases. Research by Borg et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
investigates the returns of highly maintainable code, revealing that maintaining high code quality
can lead to significant reductions in maintenance costs and defect risks. Their study emphasizes the
importance of proactive code quality management in sustaining long-term software health.
      </p>
      <p>
        Recent advances in artificial intelligence have enabled the development of AI-powered coding
agents capable of generating code from natural language descriptions. Systems such as GitHub
Copilot and Cursor leverage large language models to assist developers by producing code snippets
or scaffolds. While these agents can accelerate coding, their outputs often lack precision and
consistency, particularly when dealing with enterprise-specific or legacy code. The effectiveness of
such systems is therefore closely tied to their ability to retrieve relevant code snippets and adapt
them appropriately to the context of a given project [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Relevance prediction and code retrieval have emerged as critical components in improving the
utility of AI-assisted coding systems. Approaches in this domain typically involve embedding code
snippets and project descriptions into a shared semantic space to measure similarity, allowing
models to recommend the most contextually relevant components. Prior work has explored
techniques such as neural code search, retrieval-augmented generation, and embedding-based
similarity metrics to identify applicable code assets efficiently. These methods demonstrate that
structured retrieval and prediction mechanisms can significantly enhance code reuse while
maintaining high-quality output [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>This work builds on these foundations by focusing specifically on predicting the relevance of
pre-stored code snippets given a natural language project description. Unlike prior research that
emphasizes full project generation or LLM-only outputs, this approach isolates the retrieval and
relevance prediction step, providing a scalable foundation for future integration into coding agents
and enterprise software pipelines.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The methodology outlines the overall design of the system and the experimental approach taken in
this research. The aim is to establish a repeatable framework for connecting natural language
project descriptions to relevant code snippets stored in curated repositories. To achieve this,
principles from information retrieval, machine learning, and prompt engineering are combined into
a unified pipeline.</p>
      <p>The section is structured into three main components. First, the process of retrieving candidate
snippets from a company-specific code database is described. Second, the rationale for selecting the
large language model used to evaluate snippet relevance is presented, supported by experimental
comparisons. Finally, the role of prompt engineering in shaping the interaction between project
descriptions and the model is discussed, including the evaluation of different prompting strategies.
Together, these elements form the methodological foundation for assessing how effectively AI can
assist developers in leveraging existing code snippets for project generation. Implementation is
available at https://github.com/Shchoholiev/assets-manager-api.</p>
      <sec id="sec-3-1">
        <title>3.1. Process of getting relevant code snippets</title>
        <p>Before the process of identifying relevant code snippets can begin, snippets must be ingested.
During ingestion, the complete source code of each snippet is provided to a large language model
to generate a rich, task-oriented description that captures purpose, inputs/outputs, dependencies,
preconditions, side effects, security/compliance notes, and typical usage. This description is
persisted as metadata and later serves as the primary semantic signal during selection. Ingestion
also performs schema validation and deduplication to keep the corpus clean.</p>
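        <p>A minimal sketch of this ingestion step is shown below. It is illustrative rather than the project's actual implementation: the helper describe_snippet stands in for the LLM call, the document field names are assumptions, and persistence to CosmosDB is reduced to an in-memory dictionary.</p>
        <preformat>
import hashlib

def describe_snippet(source_code: str) -> str:
    """Stand-in for the LLM call that distils the full source into a
    task-oriented description (purpose, inputs/outputs, dependencies,
    preconditions, side effects, security/compliance notes, usage)."""
    return "task-oriented description produced by the LLM"

def ingest(source_code: str, name: str, language: str, company_id: str,
           corpus: dict) -> dict | None:
    """Ingest one snippet: deduplicate, generate metadata, validate, persist."""
    # Deduplication: skip snippets whose source is already in the corpus.
    digest = hashlib.sha256(source_code.encode("utf-8")).hexdigest()
    if digest in corpus:
        return None

    doc = {
        "id": digest,                      # assumed document key
        "name": name,
        "language": language,
        "companyId": company_id,
        "description": describe_snippet(source_code),
        "code": source_code,
    }
    # Schema validation: every required field must be present and non-empty.
    if not all(doc.values()):
        raise ValueError("snippet document failed schema validation")

    corpus[digest] = doc        # persisted to CosmosDB in the real system
    return doc
        </preformat>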
        <p>
          All code snippets are stored in a NoSQL database (CosmosDB) which serves as a centralized
repository of curated components. CosmosDB is chosen because its document model lets us persist
the full snippet source code alongside rich relationship data within a single logical record, with
automatic indexing for low-latency queries. Each snippet is annotated with metadata such as name,
programming language, description, and company identifier [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. This metadata makes it possible to
enforce strict technical and organizational boundaries before introducing semantic reasoning into
the selection process.
        </p>
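        <p>For illustration, a stored snippet document might look as follows; only the metadata fields named above (name, programming language, description, company identifier) come from the text, while the remaining keys and all example values are assumptions.</p>
        <preformat>
# Illustrative snippet document; field names beyond those mentioned in the
# text, and all example values, are assumptions.
snippet_document = {
    "id": "7f3c9a",
    "name": "JWT auth middleware",
    "language": "C#",
    "companyId": "acme",
    "description": "Validates JWT bearer tokens, supports OIDC login flows "
                   "and MFA, and emits audit logs in the company's standard "
                   "format.",
    "code": "...full snippet source code...",
}
        </preformat>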
        <p>The process of identifying relevant code snippets starts with a natural language project
description provided by the developer. The input includes three components: the programming
language, the company identifier, and the textual project description. These parameters act as
constraints that guide the retrieval pipeline, ensuring that only contextually appropriate snippets
are considered for reuse.</p>
        <p>As shown in Fig. 1, the system first applies deterministic filtering. Snippets are restricted by
programming language to match the intended technology stack and by company identifier to
ensure that only organization-specific, internally approved code is included. This filtering step
prevents irrelevant or incompatible code from entering the workflow and reduces the number of
candidate snippets that need to be evaluated downstream.</p>
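        <p>The sketch below illustrates this deterministic filtering step using the azure-cosmos Python client; the database name, container name, and field names are assumptions, and the production service may use a different SDK.</p>
        <preformat>
from azure.cosmos import CosmosClient

def filter_candidates(endpoint: str, key: str,
                      language: str, company_id: str) -> list[dict]:
    """Deterministic filtering: restrict snippets by programming language and
    company identifier before any semantic reasoning is applied."""
    client = CosmosClient(endpoint, credential=key)
    container = (client.get_database_client("assets")        # assumed names
                       .get_container_client("snippets"))
    query = (
        "SELECT c.id, c.name, c.description FROM c "
        "WHERE c.language = @language AND c.companyId = @companyId"
    )
    return list(container.query_items(
        query=query,
        parameters=[
            {"name": "@language", "value": language},
            {"name": "@companyId", "value": company_id},
        ],
        enable_cross_partition_query=True,
    ))
        </preformat>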
        <p>Once the candidate snippets are identified, they are passed to the large language model (LLM).
At this stage, the LLM evaluates the semantic relationship between the project description and the
available snippets; however, to keep inference practical, only each snippet’s project-level metadata
(its identifier, name, and short description) is sent to the model. During ingestion, the full source is
distilled into a robust, task-oriented description that captures purpose, inputs/outputs,
dependencies, and security/compliance notes; this serves as a compact semantic proxy for the code
that the model can reliably consume. Full source files are intentionally excluded: including raw
code for a large candidate set would quickly exceed typical context windows and make prompts
unwieldy, while per-snippet ranking via separate LLM calls would drive latency and cost to
impractical levels. Although reading full code might improve relevance during research
experiments, it is not viable for a production system operating at enterprise scale. Instead, the
model selects the snippets it deems most relevant based on the metadata and returns a structured
list with brief textual justifications, providing transparency about why particular assets were
chosen.</p>
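        <p>The following sketch shows one way this metadata-only selection call could be structured. The prompt wording is illustrative (the actual prompts are published in the evaluation repository), and the complete parameter stands in for the chat-completion call to the model.</p>
        <preformat>
import json

def build_selection_prompt(description: str, candidates: list[dict]) -> str:
    """Only project-level metadata (id, name, short description) is sent to
    the model; full snippet source code is intentionally excluded."""
    catalog = "\n".join(
        f"- {c['id']}: {c['name']} - {c['description']}" for c in candidates
    )
    return (
        "Project description:\n" + description + "\n\n"
        "Candidate snippets:\n" + catalog + "\n\n"
        'Return JSON only: {"selected": [{"id": "...", "reason": "..."}]}, '
        "using only ids from the candidate list."
    )

def select_snippets(description: str, candidates: list[dict], complete) -> list[dict]:
    # `complete` stands in for the chat-completion call (gpt-4o-mini in this
    # study); it must return the JSON string requested by the prompt.
    raw = complete(build_selection_prompt(description, candidates))
    selected = json.loads(raw)["selected"]
    # References to out-of-scope or nonexistent snippet ids are discarded.
    valid_ids = {c["id"] for c in candidates}
    return [s for s in selected if s["id"] in valid_ids]
        </preformat>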
        <p>For example, given the project description “a secure, modern login service for customer
accounts,” the system first narrows candidates by language and company_id. The LLM then favors
an authentication snippet whose metadata shows support for modern login flows (e.g., OIDC),
MFA, token-based sessions, and the company’s encryption and logging standards. In its rationale, it
notes that these features directly address authentication and data security for the stated use case,
while excluding look-alike snippets that lack required compliance or use a different tech stack.</p>
        <p>The final output consists of the most relevant snippets accompanied by concise model
rationales. This hybrid workflow—combining deterministic filtering with semantic reasoning—
yields selections that are both technically sound and contextually appropriate, increasing
transparency and developer trust. After retrieval, the proposed snippets are confirmed with the
developer, then a starter project is generated and compiled to validate correctness.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model selection</title>
        <p>Several LLMs were evaluated for the task of selecting relevant code snippets given a
natural-language project description. The goal of this study was not end-to-end project generation, but to
determine which model most reliably identifies the correct subset of pre-stored, curated snippets.
To ensure fair comparison, all models received the same project description, the same filtered pool
of candidate snippets (after language/company constraints), and the same prompt format. Each
model returned the snippets it deemed relevant along with a short justification. A benchmark of
100 synthetic project descriptions was constructed to mirror concise enterprise requirements (e.g.,
authentication, logging/auditing, API scaffolding, messaging, batch processing). Descriptions vary
in wording and specificity to test whether models map intent — not just keywords — to appropriate
building blocks. The candidate pool contains 100 curated, company-scoped code snippets from the
CosmosDB repository, each tagged with language and companyId and accompanied by a short
functional description (e.g., "JWT auth middleware," "transactional outbox publisher," "service
template with health checks," "centralized logging adapter," "base CI pipeline").</p>
        <p>Ground-truth sets were defined manually by selecting the minimal snippet set that would
plausibly satisfy each description in a starter-project context. During scoring, only snippet IDs
present in the provided candidate list were considered valid; references to out-of-scope or
nonexistent snippets were treated as errors.</p>
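        <p>A possible scoring routine for this comparison is sketched below; the mapping to the five outcome categories is an assumption consistent with the category names used in the paper, not the authors' published scoring code.</p>
        <preformat>
def categorize(predicted: set[str], ground_truth: set[str]) -> str:
    """Map one prediction to an outcome category by comparing the selected
    snippet ids against the manually defined ground-truth set. The exact
    rules are an assumption consistent with the category names."""
    if predicted == ground_truth:
        return "Exact"
    if predicted.isdisjoint(ground_truth):
        return "Mismatch"            # nothing from the needed set was selected
    if predicted.issuperset(ground_truth):
        return "Partial-Extra"       # everything needed, plus superfluous items
    if predicted.issubset(ground_truth):
        return "Partial-Missing"     # a strict subset of the needed snippets
    return "Partial-Mixed"           # some needed, some extra, some missing
        </preformat>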
        <p>Commonly available models spanning a wide cost/quality range were tested: gpt-4.1-nano,
gpt-4o-mini, o3-mini, gpt-4o, o1, and gpt-4.5. To account for deployment constraints in enterprise
settings, both selection quality (distribution across the five outcome categories) and cost efficiency
(published price per 1M in + 1M out tokens) were considered, as well as qualitative factors such as
stability across prompts.</p>
        <p>The stacked bar chart in Fig. 2 summarizes outcome distributions per model. Higher bars in
“Exact match” and lower bars in “Mismatch” indicate better performance. Mid-tier models
demonstrated strong accuracy without incurring the steep costs of frontier models, while the
smallest model showed more frequent “Partial—Mixed” and “Mismatch” outcomes. In practice,
“Partial—Extra” is often acceptable, whereas “Partial—Missing” and “Mismatch” impose higher
developer overhead.</p>
        <p>Balancing selection quality with cost and latency, gpt-4o-mini was adopted as the default
model. On the 100-case benchmark it delivered competitive Exact rates with acceptable Partial—Extra
at a fraction of the cost of larger models, satisfying enterprise constraints. It also responds
well to prompt design, yielding further gains under guided reasoning prompts. By contrast,
auto-reasoning models such as o3-mini perform better with broad, high-level prompts but lose
accuracy when given more detailed, constrained instructions. Accordingly, gpt-4o-mini is used as the
backbone for the remainder of the study, with its prompt engineering explored in the next section
and o3-mini retained as a comparative baseline.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Prompt engineering</title>
        <p>
          Three prompting styles are compared for the snippet-selection task under identical conditions
(same description, same candidate list filtered by language and company ID). Each prompt enforces
an output schema with valid snippet IDs only and requests a brief rationale. Datasets, prompts,
ground truth, and evaluation scripts are available at
https://github.com/Shchoholiev/assets-manager-start-projects-evaluation.
        </p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Zero-Shot</title>
          <p>
            A single instruction without exemplars that specifies the task and output schema. It is the
lowest-cost, lowest-latency configuration and serves as the baseline. Zero-shot performs well when
snippet names/descriptions are clear and the schema is explicit, but it is sensitive to phrasing and
more prone to over- or under-selection if constraints are not enforced strictly (Fig. 3 for example)
[
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].
          </p>
          <p>Figure 3: Zero-shot prompt example [created by the authors].</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Few-Shot</title>
          <p>
            The instruction is preceded by one compact worked example that demonstrates the mapping from
a description to a set of snippet IDs. The exemplar improves schema adherence and reduces
spurious selections by giving the model a concrete pattern to imitate while keeping token overhead
modest. Care is taken to keep the exemplar short, stylistically consistent with the evaluation items,
and different from the current query to avoid leakage or superficial cue matching (Fig. 4 for
example) [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.3.3. Chain-of-thought (CoT)</title>
        <p>
          The instruction asks the model to articulate a brief reasoning step before emitting the final JSON
answer. This encourages the model to align functions mentioned in the description with
capabilities in the candidate list (e.g., security, auditing, messaging) and helps disambiguate
near-miss snippets. CoT typically increases exact selections and reduces mixed/mismatch outcomes at
the cost of additional tokens; reasoning length is capped and the final answer is still required in a
fenced, machine-parseable schema to preserve determinism (Fig. 5 for instance) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
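        <p>To make the three styles concrete, the sketch below builds illustrative prompt variants; the wording, exemplar, and snippet identifiers are hypothetical and do not reproduce the exact prompts shown in Figs. 3-5 or the evaluation repository.</p>
        <preformat>
SCHEMA = ('Answer with JSON only: {"selected": [{"id": "...", "reason": "..."}]}, '
          "using only ids from the candidate list.")

EXEMPLAR = (
    "Example\n"
    "Description: internal REST API with request logging\n"
    'Answer: {"selected": [{"id": "snippet-api-template", "reason": "service '
    'scaffold"}, {"id": "snippet-logging-adapter", "reason": "centralized '
    'logging"}]}'
)

def build_prompt(style: str, description: str, catalog: str) -> str:
    """Illustrative zero-shot / few-shot / chain-of-thought prompt variants."""
    parts = ["Select the code snippets relevant to the project description.",
             "Candidate snippets:\n" + catalog]
    if style == "few-shot":
        parts.append(EXEMPLAR)        # one compact worked example
    if style == "chain-of-thought":
        parts.append("Reason briefly (three sentences at most) about which "
                     "capabilities the description requires, then give the "
                     "final answer.")
    parts += ["Project description:\n" + description, SCHEMA]
    return "\n\n".join(parts)
        </preformat>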
      </sec>
      <sec id="sec-3-5">
        <title>3.4. Prompt engineering — evaluation and results</title>
        <p>The three prompting styles were assessed on the same 100-case benchmark described in Section
3.2, utilizing the same model (gpt-4o-mini), identical decoding settings, and the same candidate list
per case (filtered by language and companyId). Outputs were scored using the five outcome
categories introduced in Section 3.2: Exact, Partial—Extra, Partial—Missing, Partial—Mixed, and Mismatch. The
distribution of outcomes per style is reported as percentages over the 100 cases.</p>
        <p>Exact matches increased with prompt guidance: 16% (zero-shot) → 22% (few-shot) → 34%
(CoT). Zero-shot produced the highest rate of Partial—Extra (tending to include superfluous
snippets), while few-shot reduced this by anchoring the format and selection behavior to the
exemplar. CoT further improved precision and lowered Mixed/Mismatch cases by encouraging
brief reasoning against the candidate list; it showed a modest rise in Partial—Missing (the model
occasionally chose a minimal, defensible set) that is considered acceptable in practice. Fig. 6
summarizes these distributions.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>A single weighted accuracy is reported that emphasizes exact matches and normalizes to a 0–100
scale:</p>
      <p>Accuracy (%) = (15E + 3X + 1M + 2D) / 15, (1)
where E is the share (%) of Exact matches, X is Partial–Extra, M is Partial–Missing, and D is
Partial–Mixed; W (Mismatch) has weight 0 and is omitted.</p>
      <p>The coefficients encode “developer effort”: Exact gets the dominant weight (15) because it
requires no rework; Extra earns partial credit (3) since the solution is functionally complete with
minor cleanup; Mixed (2) is valued above Missing (1) because it typically contains more of the
needed functionality; and Mismatch contributes nothing. The normalization by 15 makes A=100
when E=100%, keeping the metric interpretable and comparable across experiments.</p>
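      <p>The metric can be computed directly from the outcome shares; the short example below applies Eq. (1) to the gpt-4o-mini shares reported in the following paragraphs and reproduces the 30.3% and 43.1% figures.</p>
      <preformat>
def weighted_accuracy(exact: float, extra: float,
                      missing: float, mixed: float) -> float:
    """Weighted accuracy from Eq. (1); inputs are outcome shares in percent.
    Mismatch has weight 0 and therefore does not appear."""
    return (15 * exact + 3 * extra + 1 * missing + 2 * mixed) / 15

# gpt-4o-mini outcome shares quoted in Section 4:
print(round(weighted_accuracy(16, 52, 5, 27), 1))   # zero-shot        -> 30.3
print(round(weighted_accuracy(34, 22, 18, 26), 1))  # chain-of-thought -> 43.1
      </preformat>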
      <p>On a 100-case benchmark against 100 curated, company-scoped snippets, mid-tier models
offered the best practicality–accuracy balance. With the metric above, gpt-4o-mini reached 34.3%,
which is ≈80% of o3-mini (42.7%) and below gpt-4o (38.9%); the smallest model lagged
(gpt-4.1-nano: 14.6%), and the most expensive frontier model underperformed (gpt-4.5: 13.3%). Despite
scoring below o3-mini, gpt-4o-mini’s cost and latency profile makes it the preferable default
backbone for frequent, large-scale runs in enterprise use cases.</p>
      <p>Prompting materially shifts outcomes for gpt-4o-mini: 30.3% (zero-shot) → 34.3% (few-shot) →
43.1% (CoT) on the same cases. Few-shot raises Exact from 16%→22% and cuts Extra from
52%→36%, which means fewer superfluous snippets to clean up, though Partial–Mixed rises
(27%→36%) as the model hews more tightly to the exemplar. CoT then delivers the biggest jump
by pushing Exact to 34% and lowering Partial–Mixed to 26%, while Partial–Extra drops to 22%;
Partial–Missing increases (5%→18%), but the metric’s heavy weight on Exact dominates, yielding
the best overall score. In practice, CoT’s short, constrained rationales help the model rule out
look-alike snippets (e.g., logging vs. auditing adapters) and align selections to compliance cues in the
metadata, reducing triage despite the uptick in minimal sets. Mismatch stays ~0–1% across
prompts, indicating stable schema adherence.</p>
      <p>On cost, gpt-4o-mini ($0.75) is ~7.33× cheaper per token than o3-mini ($5.50). Normalized by the
new accuracy metric, gpt-4o-mini delivers ~5.9× more accuracy per dollar than o3-mini (≈45.7 vs
7.8 percentage-points per $1 per 1M tokens). This comfortably funds CoT prompting by default
while staying within enterprise constraints. Moreover, o3-mini (an auto-reasoning model) shows
low upside from prompt engineering in this setting because it favors broad, high-level prompts;
tighter, schema-constrained instructions do not yield proportional gains. Hence gpt-4o-mini
remains the best balance of accuracy, predictability, and cost for snippet selection at scale.</p>
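      <p>The cost-effectiveness figures follow from simple arithmetic over the quoted prices and accuracy scores, as the sketch below illustrates.</p>
      <preformat>
# Published price per 1M input + 1M output tokens and weighted accuracy (%)
# as quoted in the text.
price = {"gpt-4o-mini": 0.75, "o3-mini": 5.50}
accuracy = {"gpt-4o-mini": 34.3, "o3-mini": 42.7}

print(round(price["o3-mini"] / price["gpt-4o-mini"], 2))   # ~7.33x cheaper
per_dollar = {m: accuracy[m] / price[m] for m in price}    # accuracy per dollar
print({m: round(v, 1) for m, v in per_dollar.items()})     # ~45.7 vs ~7.8
print(round(per_dollar["gpt-4o-mini"] / per_dollar["o3-mini"], 1))  # ~5.9x
      </preformat>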
    </sec>
    <sec id="sec-5">
      <title>5. Limitations and future directions</title>
      <p>The benchmark relies on synthetic, enterprise-style descriptions and a single-organization snippet
corpus. While this design controls variability and protects proprietary code, it limits external
validity. Real specifications are longer, noisier, and interleave functional and non-functional
requirements; future studies should replicate these experiments on multi-org, real-world backlogs
to assess generalization.</p>
      <p>To better approximate developer effort and risk, weighted metric should be replaced — or at
least calibrated — using an LLM-as-judge protocol rather than fixed coefficients. Concretely, a
judge model would receive the project description, the selected snippet set, and concise metadata
(and, when feasible, quick static checks or minimal tests), then score the outcome on a rubric that
distinguishes critical vs. benign deviations (e.g., missing an authentication dependency vs.
including a harmless utility). The rubric would be anchored with labeled exemplars, using pairwise
comparisons for robustness, and scores would be calibrated via scale-anchoring and isotonic
regression. To ensure reliability, agreement against human ratings would be measured. This
judge-based metric is task-aware, explainable, and better aligned with practitioner costs than plain, static
weights.</p>
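      <p>One possible shape of such a judge call is sketched below; the rubric text, output schema, and scoring flow are assumptions about the proposed protocol rather than an implemented component, and calibration against human ratings would follow separately.</p>
      <preformat>
import json

RUBRIC = (
    "Score the selected snippet set for the project description from 0 to 100. "
    "Treat a missing critical dependency (e.g., authentication) as severe and a "
    "harmless extra utility as minor. Return JSON only: "
    '{"score": 0, "critical_issues": ["..."]}'
)

def judge(description: str, selected: list[dict], complete) -> dict:
    """`complete` stands in for the judge-model call; scale anchoring, isotonic
    regression, and agreement checks against human ratings happen downstream."""
    prompt = "\n\n".join([
        RUBRIC,
        "Project description:\n" + description,
        "Selected snippets (metadata only):\n"
        + "\n".join(f"- {s['name']}: {s['description']}" for s in selected),
    ])
    return json.loads(complete(prompt))
      </preformat>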
      <p>To strengthen external validity, future evaluations should include models from other vendors,
including closed source and open source models, acknowledging that they are trained on different
corpora, supervision mixes, architectural choices, alignment procedures, tokenizers, and context
limits — all of which can materially affect retrieval and selection behavior. Comparisons should use
a standardized protocol (same prompts, decoding settings, candidate pools, and scoring), report
both aggregate accuracy and error profiles, and stratify by domain and prompt style.</p>
      <p>Beyond the methodological constraints discussed above, an important practical limitation lies
in maintaining and scaling snippet databases in industrial settings. As repositories grow, ensuring
snippet freshness, dependency compatibility, and security compliance becomes nontrivial. In
production environments, versioning, deduplication, and quality auditing must be automated
through integration with existing CI/CD and Git workflows. Enterprise deployment further
requires strict access controls, metadata refresh pipelines, and continuous retraining of embeddings
to reflect code evolution. From a scalability standpoint, large-scale snippet retrieval may require
distributed vector databases or hybrid search architectures to sustain low-latency, high-throughput
selection under enterprise workloads. Addressing these challenges will be key to operationalizing
the proposed system in real-world development ecosystems.</p>
      <p>Additionally, while the current benchmark uses synthetic, well-controlled project descriptions
to isolate model behavior, future research should incorporate diverse, real-world project briefs
from open-source and industrial backlogs. Such inputs would introduce realistic noise, ambiguity,
and interleaved functional/non-functional requirements, offering a more rigorous test of snippet
relevance prediction under production conditions.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study demonstrates the potential of integrating AI-powered generative models to automate
software project generation from natural-language inputs. By grounding assembly in curated code
snippets, the approach cuts setup time, standardizes scaffolds, and lets developers focus on
higher-value design work. Importantly, this should not replace unconstrained code synthesis but extend it:
use free-form generation for genuinely novel logic and “glue” while anchoring core functionality in
vetted components. This hybrid reduces the variability of “vibe coding,” improving reliability,
security, and compliance without sacrificing speed—turning AI from a code copier into a
quality-aware accelerator of real-world development.</p>
      <p>From a deployment standpoint, the selection algorithm is a strong candidate for packaging as a
Model Context Protocol (MCP) service, exposing endpoints for deterministic filtering, snippet
metadata retrieval, relevance selection, and starter assembly. Such an MCP tool can be plugged into
Cursor, Windsurf, or other code-generation environments so developers can invoke
snippet-augmented generation directly from the editor, receive concise rationales, and enforce
organizational policies—making the hybrid retrieval plus generation workflow immediately
actionable in day-to-day practice.</p>
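      <p>As a rough sketch, the four operations named above could be exposed with signatures like the following; these are plain Python stubs with assumed names, not an actual MCP tool binding.</p>
      <preformat>
# Plain Python signatures for the four operations named above; an actual MCP
# service would register these as tools and expose them to the editor.
def filter_snippets(language: str, company_id: str) -> list[str]:
    """Deterministic filtering by language and organization; returns snippet ids."""
    ...

def get_snippet_metadata(snippet_ids: list[str]) -> list[dict]:
    """Return name, description, and other metadata for the given ids."""
    ...

def select_relevant(description: str, snippet_ids: list[str]) -> list[dict]:
    """LLM-backed relevance selection with brief rationales."""
    ...

def assemble_starter_project(description: str, snippet_ids: list[str]) -> str:
    """Generate and compile a starter project from the confirmed snippets."""
    ...
      </preformat>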
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors would like to thank the Armed Forces of Ukraine for the opportunity to complete this
work during the full-scale invasion of the Russian Federation on the territory of Ukraine. Also, the
authors wish to extend their gratitude to Kharkiv National University of Radio Electronics for
providing licences for additional software to prepare algorithms and the paper.
During the preparation of this work, the authors used Grammarly Edu in order to check grammar
and spelling. After using these services, the authors reviewed and edited the content as needed and
take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tornhill</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Borg</surname>
          </string-name>
          , “
          <source>Code Red: The Business Impact of Code Quality - A Quantitative Study of 39 Proprietary Production Codebases”</source>
          <year>2022</year>
          , arXiv:
          <fpage>2203</fpage>
          .04374. URL: https://arxiv.org/abs/2203.04374
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khovrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kobziev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Volokhovskyi</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Nazarov</surname>
          </string-name>
          ,
          <article-title>"Using Classifiers Based on Large Language Models and Naïve Bayes for Domain Specific Text,"</article-title>
          <source>2024 IEEE 19th International Conference on Computer Science and Information Technologies (CSIT)</source>
          , Lviv, Ukraine,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          , doi: 10.1109/CSIT65290.
          <year>2024</year>
          .
          <volume>10982586</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Borg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pruvost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mones</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tornhill</surname>
          </string-name>
          , “Increasing, not Diminishing: Investigating the Returns of Highly Maintainable Code”
          <year>2024</year>
          , arXiv:
          <fpage>2401</fpage>
          .13407. URL: https://arxiv.org/abs/2401.13407
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ishibashi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nishimura</surname>
          </string-name>
          , “
          <article-title>Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization”</article-title>
          <year>2024</year>
          , arXiv:
          <fpage>2404</fpage>
          .02183. URL: https://arxiv.org/abs/2404.02183
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zh.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xia</surname>
          </string-name>
          , “
          <article-title>An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities”</article-title>
          <year>2025</year>
          , arXiv:
          <fpage>2501</fpage>
          .13742 URL: https://arxiv.org/abs/2501.13742v1
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , G. Deng,
          <string-name>
            <given-names>Yue</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , and Y. Liu, “
          <article-title>Prompt Injection attack against LLM-integrated Applications”</article-title>
          ,
          <year>2024</year>
          , arXiv:
          <fpage>2306</fpage>
          .05499 URL:  https://arxiv.org/abs/2306.05499
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>“A Practical Survey on Zero-shot Prompt Design for In-context Learning”</article-title>
          ,
          <year>2023</year>
          , arXiv:
          <fpage>2309</fpage>
          .13205 URL: https://arxiv.org/abs/2309.13205
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Buschek</surname>
          </string-name>
          , “
          <article-title>How to Prompt? Opportunities and Challenges of Zero- and Few-Shot Learning for Human-AI Interaction in Creative Applications of Generative Models”</article-title>
          ,
          <year>2022</year>
          , arXiv:
          <fpage>2209</fpage>
          .01390 URL: https://arxiv.org/abs/2209.01390
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , “
          <article-title>Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models”</article-title>
          ,
          <year>2023</year>
          , arXiv:
          <fpage>2305</fpage>
          .16582 URL: https://arxiv.org/abs/2305.16582
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>