<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging LLMs for Formal Software Requirements: Challenges and Prospects</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arshad Beg</string-name>
          <email>arshad.beg@mu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diarmuid O'Donoghue</string-name>
          <email>diarmuid.odonoghue@mu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rosemary Monahan</string-name>
          <email>rosemary.monahan@mu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Maynooth University</institution>
          ,
          <addr-line>Maynooth</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Software correctness is ensured mathematically through formal verification, which requires the generation of a formal requirement specification and an implementation that must be verified. Tools such as model-checkers and theorem provers ensure software correctness by verifying the implementation against the specification. Formal methods deployment is regularly enforced in the development of safety-critical systems, e.g. aerospace, medical devices and autonomous systems. Generating these specifications from informal and ambiguous natural language requirements remains the key challenge. Our project, VERIFYAI, aims to investigate automated and semi-automated approaches to bridge this gap, using techniques from Natural Language Processing (NLP), ontology-based domain modelling, artefact reuse, and large language models (LLMs). This position paper presents a preliminary synthesis of relevant literature to identify recurring challenges and prospective research directions in the generation of verifiable specifications from informal requirements.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Knowledge Representation and Reasoning</kwd>
        <kwd>Formal Languages</kwd>
        <kwd>Software Requirements Engineering</kwd>
        <kwd>Software Requirements Specifications</kwd>
        <kwd>Formal Verification</kwd>
        <kwd>Theorem Proving</kwd>
        <kwd>Model Checking</kwd>
        <kwd>Chain-of-Thought (CoT)</kwd>
        <kwd>Prompt-Engineering</kwd>
        <kwd>Zero-shot prompting</kwd>
        <kwd>One-shot prompting</kwd>
        <kwd>Few-shot prompting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As software systems grow in complexity and criticality, so does the need for scalable verification
methods that ensure correctness and reliability. Formal verification is especially important in
safety-critical systems, where minor software errors can lead to serious consequences, including loss of life,
environmental damage, or large-scale system failures. Under all circumstances, the software used in
sectors like aviation, healthcare, automotive, and nuclear control must behave exactly as intended. While
testing can only check specific scenarios, formal verification uses mathematical techniques to prove
that a system meets its specifications under every possible condition. This level of assurance is crucial
when human safety depends on software behaving reliably. The purpose of sound software engineering
principles is to catch flaws early in the design phase, ensuring consistency between requirements and
implementation. The mathematical techniques used in formal methods improve trust, and support
compliance with industry standards and regulations. However, their adoption in industry is consistently
hindered by the challenges of writing and maintaining formal specifications, which demand rigorous
developer training and significantly increase the software development cycle time by up to 30% [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This
motivates research into automated and semi-automated approaches that can make formal verification
more accessible to a wider audience of software engineers.
      </p>
      <p>
        The VERIFYAI project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] aims to address a central challenge of formal software engineering:
translating informal, natural-language requirements into formal, verifiable specifications. This
paper outlines the challenges and prospects in leveraging Large Language Models (LLMs) for
formalising software requirements. The project aims to integrate techniques from Natural Language
Processing (NLP), ontology-driven domain modelling, artefact reuse, and large language models
(LLMs) to support the automated generation and traceability of formal specifications. As in many other
fields, the development of large language models (LLMs) has opened a world of opportunities for
the challenge of formalising software requirements. We consolidate initial findings to
highlight research gaps and recurring difficulties in LLM-assisted formal specification generation.
Note that the main body of the paper is supported by several detailed Appendices available at
https://github.com/arshadbeg/OVERLAY2025_SupportingDocs.git.
      </p>
      <p>The key contributions of this paper are as follows:</p>
      <p>Identification of Core Challenges in Formalisation: We present a structured analysis of the barriers
to translating informal, natural language software requirements into formal specifications, such as
ambiguity, lack of domain models, and LLM instability.</p>
      <p>VERIFYAI Research Framework: We propose our research framework, which integrates LLMs
with NLP, ontology-driven modelling, and artefact reuse to support the semi-automated generation of
verifiable formal specifications.</p>
      <p>State-of-the-Art Synthesis: Through a focused literature review, we categorise and compare (Section
2, supported by Appendix A) recent LLM-based tools and techniques—such as Req2Spec, SpecGen,
AssertLLM, and nl2spec—highlighting their approaches, strengths, and limitations in requirement
formalisation.</p>
      <p>Experimental Evaluation: We include empirical evaluations (Appendices B and C) comparing multiple
SMT solvers (Alt-Ergo, Z3, CVC4, CVC5) in terms of specification verification success and execution
time, using the Frama-C PathCrawler tool and standard program sets.</p>
      <p>Highlighting Gaps and Future Directions: Based on our synthesis, we outline critical open problems
such as prompt instability, fragility of formal outputs, and the need for domain-specific context
grounding (Appendix D). These pave the way for developing more robust LLM-based formal specification
pipelines.</p>
      <p>Positioning for Long-Term Vision: As a position paper, our work serves as a foundational step
toward a longer-term vision of trustworthy, LLM-assisted formal methods tooling that bridges the
expertise gap in safety-critical software development.</p>
      <p>
        The structure of the paper is: Section 2 describes a focused state of the art. Section 3 outlines the
challenges and future directions for the research, supported by Appendix B, which outlines the experimental
setup and analysis performed with the PathCrawler tool of Frama-C, where we have
re-simulated the methodology presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This analysis is based on four provers available in Frama-C
i.e. Alt-Ergo, Z3, CVC4 and CVC5. Appendix C presents the execution time comparison for these
provers on the programs provided in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Section 4 discusses our plans for future work and Section 5
concludes the paper.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>This section synthesises what we found to be the state of the art. The main research questions for
conducting our systematic literature review on the topic were as follows:
RQ1: What methodologies leverage Large Language Models (LLMs) to transform natural language
software requirements into formal notations?
RQ2: What are the emerging trends and future research directions in using LLMs for formal
requirements formalisation?</p>
      <p>Here, we summarise our main findings. For a comprehensive overview of the literature survey,
including detailed comparisons and categorised insights, we encourage readers to consult Appendix A
and the accompanying GitHub repository, which presents the full set of supporting tables.</p>
      <p>
        GPT-3.5 has assisted requirement analysis for code verification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], while Explanation-Refiner
integrates LLMs with theorem provers for NLI validation and iterative correction [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Evaluation of GPT-4o
with VeriFast shows generation of functional specifications, though verification remains limited due to
redundancy and failed assertions [7].
      </p>
      <p>The nl2spec tool supports interactive synthesis from unstructured requirements [8], while the SpecSyn
tool improves sequence-to-sequence contract generation with a 21% accuracy gain [9]. Req2Spec
converts 71% of BOSCH automotive requirements into formal specs [10], while SpecGen uses prompt
mutation and verification feedback to improve LLM-generated specifications, succeeding on 279 out of
384 benchmark programs [11]. In the hardware domain, AssertLLM synthesizes assertions with 89%
correctness via multi-phase prompting and validation [12], while VLSI applications leverage LLMs for
spec review and generation in SpecLLM [13]. Smart grid requirements have been formalised using
GPT-4o and Claude 3.5, achieving F1 scores between 79% and 94% [14]. The trend in F1-scores observed
by authors of [14] suggested that GPT-4o and Claude 3.5 not only maintain robustness but may actually
benefit from increased system specification complexity, highlighting a potential alignment between
model reasoning depth and specification richness. This behaviour was not mirrored in Gemini 1.5 or
GPT-3.5-turbo, warranting further investigation. NASA’s software verification effort surfaced requirement
errors, demonstrating the practical utility of LLM-in-the-loop workflows [15].</p>
      <p>
        NL-to-LTL translation has seen progress via few-shot prompting and dynamic reasoning [16],
achieving 94.4% accuracy. Likewise, NL-to-JML contract synthesis for Java programs has been explored with
promising results [17]. Historical systems like RSL [18], ARSENAL [19], and RML [20] demonstrated
early rule-based and logic-based extraction pipelines, while hybrid neuro-symbolic systems offer greater
reliability. SAT-LLM couples SMT solvers with LLMs to detect inconsistencies with F1 of 0.91 [21]
and LeanDojo, ReProver, and Thor enhance formal proving via retrieval-augmented generation and
LLM-guided reasoning [
        <xref ref-type="bibr" rid="ref7 ref8">22, 23</xref>
]. IDE-integrated efforts like those combining Copilot with PathCrawler
and EVA demonstrate semi-automated ACSL specification generation [
        <xref ref-type="bibr" rid="ref3 ref9">24, 3</xref>
        ].
      </p>
      <p>
        As we expected, assertion-level synthesis shows better reliability than full contract generation. For
example, Laurel generates assertions for Dafny with over 50% success [
        <xref ref-type="bibr" rid="ref10">25</xref>
        ], and AssertLLM exceeds 89%
correctness when guided by contract type and context. Full specifications are more error-prone, often
requiring multiple prompt iterations or external validation [11].
      </p>
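      <p>The contrast can be illustrated in C with ACSL, the specification language used by Frama-C. The
following sketch is our own illustrative example, not code taken from the cited tools:</p>
      <preformat><![CDATA[
```c
#include <limits.h>

/* Full-contract synthesis: the pre/postconditions must capture the
   complete behaviour of the function, which is where LLM outputs
   tend to be most fragile. */
/*@ requires a > INT_MIN;
    ensures \result >= 0;
    ensures \result == a || \result == -a;
*/
int abs_val(int a) {
    return a < 0 ? -a : a;
}

/* Assertion-level synthesis: a single local claim about one program
   point, with far less context to get wrong. */
int clamp_to_byte(int x) {
    if (x < 0)   x = 0;
    if (x > 255) x = 255;
    /*@ assert 0 <= x <= 255; */
    return x;
}
```
]]></preformat>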
      <p>
LLM selection and prompting strategies critically affect performance. While zero-shot prompting is
strong in base performance [
        <xref ref-type="bibr" rid="ref11">26</xref>
        ], one-shot [
        <xref ref-type="bibr" rid="ref12">27</xref>
        ] and few-shot [
        <xref ref-type="bibr" rid="ref13">28</xref>
        ] offer alternative trade-offs.
Chain-of-Thought (CoT) prompting improves logical flow via intermediate steps [
        <xref ref-type="bibr" rid="ref14">29</xref>
        ], and like Structured CoT
(SCoT) [
        <xref ref-type="bibr" rid="ref15">30</xref>
        ] it can suffer from context decay in long prompts (“lost-in-the-middle”) [
        <xref ref-type="bibr" rid="ref16">31</xref>
        ], yet zero-shot
often remains competitive [
        <xref ref-type="bibr" rid="ref17">32</xref>
        ].
      </p>
      <p>
        Advanced prompting methods like Automate-CoT generate CoT examples automatically [
        <xref ref-type="bibr" rid="ref18">33</xref>
        ], while
Reprompting uses Gibbs sampling to escape prompt local optima [34]. Structured prompting with graphs
and trees improves reasoning robustness and efficiency [35]. RAG (Retrieval-Augmented Generation)
improves grounding for knowledge-intensive synthesis [36]. [37] discusses a wider range of almost
30 prompting strategies, some of which seem not to have been explored in relation to formalising
specifications. Of course, this does not include related approaches such as fine-tuning LLM via LoRA
adaptation training [38], but this may only be applicable when there is access to the LLM’s architecture
and weights. Additionally, reinforcement learning (RL) may help with specific challenges, opening even
more avenues for exploration.
      </p>
      <p>Key Observations:</p>
      <p>
        Based on our literature survey, we proceed with some key observations. We observed a significant
difference between the success rates of assertion generation and full contract synthesis using LLMs.
AssertLLM [12] and Laurel [
        <xref ref-type="bibr" rid="ref10">25</xref>
        ] achieved high accuracy in generating helper assertions for programs
written in Dafny language [39] and design-specific verification statements. These tools achieved an
accuracy of 89% and over 50%, operating at a local level on source code or isolated signals. On the other
hand, [17] reported that generating formal specifications for Java Modelling Language (JML)
contracts or temporal logic formulas resulted in frequent verification failures by the SMT solvers
embedded in OpenJML [40]. This happened even when the output appeared semantically sound, leading
to the conclusion that there is a disparity between human-readable correctness and automated formal
verification, especially when the source code was written for complex tasks.
      </p>
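      <p>This failure mode can be made concrete with a small C/ACSL sketch of our own (not taken from [17],
which concerns JML, but the mechanism is analogous). The postcondition below reads as obviously correct
to a human, yet deductive verifiers such as Frama-C/WP, like the SMT solvers behind OpenJML, generally
cannot discharge it unless the loop annotations are also supplied, and those are exactly what
LLM-generated contracts often omit:</p>
      <preformat><![CDATA[
```c
/* Sums 0 + 1 + ... + (n-1). The ensures clause is humanly obvious,
   but without the loop invariants below a WP-style verifier cannot
   prove it: the contract looks semantically sound yet fails
   automated verification. The bound on n avoids overflow concerns. */
/*@ requires 0 <= n < 10000;
    ensures \result == n * (n - 1) / 2;
*/
int sum_below(int n) {
    int s = 0;
    /*@ loop invariant 0 <= i <= n;
        loop invariant s == i * (i - 1) / 2;
        loop assigns i, s;
        loop variant n - i;
    */
    for (int i = 0; i < n; i++)
        s += i;
    return s;
}
```
]]></preformat>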
      <p>In general, we observed that tasks with small scope and well-defined semantics yield better results
where the limited context in these assertions helps in improving verification accuracy [11, 9]. LLMs
handle such tasks more reliably due to reduced ambiguity and fewer dependencies on broader system
knowledge. On the other hand, for larger program segments, end-to-end contract synthesis involves
multiple interacting components or function bodies. It demands a deeper understanding of program
semantics, logic, and behaviour over time. SpecGen [11] and SpecSyn [9] presented significant progress
in tackling such challenges. However, their outputs require post-processing steps, such as mutation
operators or human-in-the-loop (software testing experts were involved), before the generated outputs
are usable for formal verification.</p>
      <p>Accuracy is also influenced by tool design and integration. For example, tools like nl2spec [8]
improve generated specification quality through step-by-step refinement, adopting an iterative,
user-in-the-loop approach to help address some LLM limitations. Similarly, prompt engineering techniques
utilising guided templates or Chain-of-Thought (CoT) [7, 41, 42] promised improved output coherence
and correctness. These strategies work well in scenarios involving localised tasks, such as assertion
synthesis or narrow-scope descriptions. As the program size and complexity of the specification goal
increase, the chances of ambiguity, under-specification, and logical inconsistency increase. Therefore,
we conclude that the current LLM architectures excel in focused, declarative tasks but require
augmentation for broader specification goals. However, [43] showed that different versions of language
models, including LLMs, can vary greatly in their responses to the same queries, suggesting that much
experimental work might be required to achieve optimal results.</p>
      <p>We conclude from our synthesis of the current literature that there is a growing research trend of
combining the strengths of LLMs, symbolic reasoning and iterative user interaction. At the moment,
assertion generation dominates in terms of accuracy and usability, while parallel efforts in better prompt
design, domain-specific fine-tuning, and verifier-in-the-loop integration are closing the gap for broader
specification synthesis. The challenges of abstraction and consistency drive research efforts in this domain.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Challenges and Future Directions</title>
      <p>Based on our finding, we summarise five key challenges that we have identified, as well as our research
goals with brief description given in Table 1. A detailed description of these is included in Appendix D.</p>
      <p>Table 1: Challenges and corresponding future directions.</p>
      <p>C1: Semantic Ambiguity. Ambiguity in natural language due to context-dependence and jargon
affects requirement interpretation. Needs structured knowledge and human-in-the-loop refinement.
Paired with F1: Human-in-the-loop Formalisation, which combines LLM support with domain expert
oversight to improve accuracy, reduce ambiguity, and increase trust via feedback and interactive
refinements.</p>
      <p>C2: Lack of Ground Truth Datasets. Absence of standardised, annotated datasets limits model
training, reproducibility, and scalability. Paired with F3: Standardised Benchmarks; the creation of
high-quality, domain-diverse datasets will enable consistent evaluation and push the field forward.</p>
      <p>C3: Tool Interoperability. Formal verification tools lack standard interfaces and integration
capabilities, hampering automation. Paired with F4: Neuro-symbolic Reasoning, which combines neural
flexibility with symbolic precision to improve integration, consistency, and constraint enforcement.</p>
      <p>C4: Traceability Across Artefacts. Difficult to maintain consistent trace links between text,
models, code, and tests over the lifecycle. Paired with F5: Interactive Traceability Tools; tools that
enable visual navigation, version tracking, and LLM-assisted trace linking improve usability and
compliance.</p>
      <p>C5: Explainability and User Trust. Limited transparency in LLM-generated outputs reduces trust,
especially in safety-critical domains. Paired with F2: Multi-modal Artefact Alignment; integrating
diverse input types (text, diagrams, tables) through semantic matching increases clarity and confidence
in outputs.</p>
      <p>Semantic ambiguity (C1) due to natural language remains a critical issue, needing structured domain
knowledge and improved human-in-the-loop interventions. The lack of publicly available, high-quality
datasets (C2) hinders model training, reproducibility, and scalability. Tool interoperability (C3) is
impeded by incompatible formats and absence of standardised interfaces, complicating automation.
Ensuring traceability across artefact lifecycles (C4) is essential but difficult without explainable and
collaborative processes. In addition, explainability and user trust (C5) are limited by opaque model
behavior and insufficient rationale in outputs. To address these, human-in-the-loop formalisation (F1)
offers controlled semi-automation and improved trust, while multi-modal artefact alignment (F2) enables
contextual completeness via diverse input formats. The creation of standardised benchmarks (F3) would
mitigate dataset-related limitations and promote progress. Neuro-symbolic reasoning (F4) blends LLM
flexibility with logic-based precision, enhancing model reliability. Finally, interactive traceability tools
(F5) that support collaboration, visual navigation, and auditability are crucial for regulated and complex
software domains.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Embedding LLMs in the Specification Generation</title>
      <sec id="sec-4-1">
        <title>4.1. Planned Evaluation of Prompting Strategies</title>
        <p>
          A systematic and quantitative evaluation of prompting strategies is the central part of our planned
research. In particular, we intend to compare zero-shot, one-shot, few-shot, and Chain-of-Thought (CoT)
prompting across both assertion-level and full-contract generation tasks. Prior work in the literature
[
          <xref ref-type="bibr" rid="ref11 ref12 ref13 ref14 ref15 ref16 ref17 ref18">26–35</xref>
          ] suggests that prompt type and articulation can significantly influence specification quality.
However, a comprehensive evaluation requires a larger and more diverse benchmark than we currently
report.
        </p>
        <p>Future experiments will therefore assess prompt sensitivity using precision, recall, and F1-based
correctness metrics, and investigate robustness under small variations of prompt formulation. We
anticipate that assertion-level tasks will prove more stable under prompt rephrasing, whereas
full-contract synthesis may show higher variability — an observation that motivates deeper analysis in
follow-up work. Our current contribution is to highlight the importance of prompt design in our work
and to outline how this dimension will be systematically investigated going forward.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Human-in-the-Loop Integration and Interoperability Considerations</title>
        <p>Our experiments currently implement the early stages of a human-in-the-loop process through manual
review. In both the Tritype baseline and PathCrawler-augmented workflows, every LLM-generated
ACSL specification was checked by at least one author with formal methods expertise. This review
ensured (i) semantic alignment with intended behavior, (ii) logical completeness, and (iii) iterative
refinement by feeding corrected fragments back into the prompts. Although active learning is not yet
integrated, our project approach anticipates it: revised specifications, along with their source code and
verification results, can be versioned and selectively reused for retraining or fine-tuning.</p>
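        <p>For concreteness, the following is a much-simplified, hypothetical stand-in for the Tritype subject
program (the return codes and the contract are our own illustration, not the actual benchmark); ACSL
contracts of this shape are what the manual review checked for semantic alignment and completeness:</p>
        <preformat><![CDATA[
```c
/* Classifies a triangle by side equality: 2 = equilateral,
   1 = isosceles, 0 = scalene. The behaviors partition the input
   space, so completeness and disjointness can be checked by WP. */
/*@ requires a > 0 && b > 0 && c > 0;
    behavior equilateral:
      assumes a == b && b == c;
      ensures \result == 2;
    behavior isosceles:
      assumes !(a == b && b == c) && (a == b || b == c || a == c);
      ensures \result == 1;
    behavior scalene:
      assumes a != b && b != c && a != c;
      ensures \result == 0;
    complete behaviors;
    disjoint behaviors;
*/
int classify(int a, int b, int c) {
    if (a == b && b == c) return 2;
    if (a == b || b == c || a == c) return 1;
    return 0;
}
```
]]></preformat>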
        <p>Figure 1b shows our proposed workflow, based on the state of the art and the challenges identified in
Sections 2 and 3. Natural language requirements and domain ontologies form the input, grounding the
LLM in the target context. Different prompt strategies (zero-shot, few-shot, and Chain-of-Thought)
shape how inputs are presented for specification generation. Outputs are stored in a tool-neutral
intermediate format (JSON-LD), which can be translated into the syntax required by verification tools.
These tools check the generated specifications, while symbolic reasoning provides constraint-based
feedback that helps refine them. Human reviewers remain part of the loop, validating results, focusing
on manageable units, and collecting useful examples for future adaptation. The design is modular,
so new reasoning engines, prompt methods, or domain adapters can be added without changing the
overall pipeline.</p>
        <p>For interoperability, the pipeline currently uses custom scripts to convert LLM outputs to tool-specific
formats (e.g., ACSL for Frama-C, JML for OpenJML). While functional, this approach is not generalisable.
As part of VERIFYAI, we plan to design a lightweight JSON-LD-based schema that maps to multiple
formal languages. This would allow LLM outputs to be stored in a tool-agnostic format and exported to
different targets, potentially supporting Frama-C, OpenJML, Dafny, and others with minimal per-tool
changes.</p>
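        <p>Since this schema is still to be designed, the following is only a hypothetical sketch of what one
tool-agnostic JSON-LD entry might look like; every field name, the context URL, and the export strings
are illustrative assumptions rather than a finalised design:</p>
        <preformat><![CDATA[
```json
{
  "@context": "https://example.org/verifyai/spec-context.jsonld",
  "@type": "FormalSpecification",
  "@id": "spec:abs_val/post-1",
  "sourceRequirement": "The result of abs_val is never negative.",
  "targetUnit": "abs_val",
  "clause": { "@type": "Postcondition", "expression": "result >= 0" },
  "exports": {
    "acsl": "ensures \\result >= 0;",
    "jml": "ensures \\result >= 0;"
  }
}
```
]]></preformat>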
        <p>Figure 1a (flowchart): Input C Program → PathCrawler Analysis → Symbolic Paths + I/O
Examples → Prompt LLM for ACSL Specs → Annotated C Code (with ACSL) → Frama-C WP + SMT
Solvers → Verification Goals + Outcomes. Caption: Methodology of the initial experiments following
the approach of [<xref ref-type="bibr" rid="ref3">3</xref>], combining an LLM with symbolic
analysis tools in the Frama-C ecosystem. The workflow integrates path-based I/O examples and
verification outputs to guide the generation of context-aware ACSL specifications.</p>
        <p>Figure 1b (flowchart): Input (Natural Language Requirements + Domain Ontologies) →
Different Prompt Strategies (zero-shot, few-shot, CoT) → LLM Specification Generation →
Tool-Neutral Intermediate Format (JSON-LD) → Verification Tools + Symbolic Reasoning Output →
Human-in-the-Loop Validation + Example Collection. Caption: Proposed workflow for the VERIFYAI
pipeline: natural-language requirements and domain ontologies are combined with prompt strategies
to generate formal specifications via LLMs. Outputs in JSON-LD feed verification tools, with
symbolic and human feedback refining results.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Symbolic Reasoning, Traceability, and Scalability Outlook</title>
        <p>In our initial Tritype experiments, symbolic solvers were used only for post-generation verification. We
see potential for tighter integration, where solver feedback—such as unsatisfiable path conditions—could
guide the LLM during generation, reducing logical errors before verification. Currently, traceability
relies on manual annotations linking specification fragments to source code and natural-language
requirements. We are exploring semi-automated approaches where the LLM proposes initial links for
human validation. These links could be stored in a graph database to support version-aware navigation
and visual trace maps, which would be particularly valuable in regulated settings.</p>
        <p>We aim to build datasets that combine real-world, diverse requirements with verified specifications
and execution traces. Currently, we are curating a small seed set from open-source safety-critical
software, supplemented with synthetic examples to cover edge cases. While synthetic data is useful, it
lacks the nuances of industrial requirements, so mixed datasets appear most promising. VERIFYAI’s
prototype handles single-module programs well, but multi-module systems pose memory and latency
challenges. To manage complexity, we plan to employ hierarchical specification synthesis, verifying
each module independently prior to integration. Additionally, we aim to release an annotated subset of
the dataset to support transparency and reproducibility.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This paper identifies clear and concise challenges and prospects in leveraging LLMs for formal software
requirements, based on our initial analysis of the literature. Semantic ambiguity, lack of ground truth
data, tool interoperability, lifecycle traceability, and explainability all present significant barriers to full
automation. However, each challenge also points to fertile ground for innovation. Future directions
such as human-in-the-loop systems, multi-modal alignment, standardised benchmarks, neuro-symbolic
reasoning, and interactive traceability tools offer practical and scalable paths forward. As a final remark,
we can say that as AI and formal methods continue to converge, interdisciplinary collaboration will be
key to bridging the gaps and turning conceptual advances into robust, real-world solutions.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work is partly funded by the ADAPT Research Centre for AI-Driven Digital Content Technology,
which is funded by Research Ireland through the Research Ireland Centres Programme and is co-funded
under the European Regional Development Fund (ERDF) through Grant 13/RC/2106 P2. The submission
aligns with the Digital Content Transformation (DCT) thread of the ADAPT research centre.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>We acknowledge the use of the free versions of OpenAI’s GPT-4 and GPT-4o-mini, solely for refining text.
The text has been thoroughly reviewed and discussed by all authors to ensure accuracy and integrity.
[7] W. Fan, M. Rego, X. Hu, S. Dod, Z. Ni, D. Xie, J. DiVincenzo, L. Tan, Evaluating the ability of large
language models to generate verifiable specifications in verifast, 2025. URL: https://arxiv.org/abs/
2411.02318. arXiv:2411.02318.
[8] M. Cosler, C. Hahn, D. Mendoza, F. Schmitt, C. Trippel, nl2spec: Interactively translating
unstructured natural language to temporal logics with large language models, 2023. URL: https:
//arxiv.org/abs/2303.04864. arXiv:2303.04864.
[9] S. Mandal, A. Chethan, V. Janfaza, S. M. F. Mahmud, T. A. Anderson, J. Turek, J. J. Tithi, A. Muzahid,
Large language models based automatic synthesis of software specifications, 2023. URL: https:
//arxiv.org/abs/2304.09181. arXiv:2304.09181.
[10] A. Nayak, H. P. Timmapathini, V. Murali, K. Ponnalagu, V. G. Venkoparao, A. Post, Req2spec:
Transforming software requirements into formal specifications using natural language processing,
in: Requirements Engineering: Foundation for Software Quality: 28th International Working
Conference, REFSQ 2022, Birmingham, UK, March 21–24, 2022, Proceedings, Springer-Verlag,
Berlin, Heidelberg, 2022, p. 87–95.
[11] L. Ma, S. Liu, Y. Li, X. Xie, L. Bu, Specgen: Automated generation of formal program specifications
via large language models (2024). URL: https://arxiv.org/abs/2401.08807. arXiv:2401.08807.
[12] W. Fang, M. Li, M. Li, Z. Yan, S. Liu, H. Zhang, Z. Xie, Assertllm: Generating hardware verification
assertions from design specifications via multi-llms, in: 2024 IEEE LLM Aided Design Workshop
(LAD), 2024, pp. 1–1. doi:10.1109/LAD62341.2024.10691792.
[13] M. Li, W. Fang, Q. Zhang, Z. Xie, Specllm: Exploring generation and review of vlsi design
specification with large language model, 2024. URL: https://arxiv.org/abs/2401.13266. arXiv:2401.13266.
[14] L. M. Reinpold, M. Schieseck, L. P. Wagner, F. Gehlhof, A. Fay, Exploring llms for verifying
technical system specifications against requirements, 2024. URL: https://arxiv.org/abs/2411.11582.
arXiv:2411.11582.
[15] V. Gervasi, B. Nuseibeh, Lightweight validation of natural language requirements, Softw. Pract.</p>
      <p>Exper. 32 (2002) 113–133. URL: https://doi.org/10.1002/spe.430. doi:10.1002/spe.430.
[16] Y. Xu, J. Feng, W. Miao, Learning from failures: Translation of natural language requirements into
linear temporal logic with large language models, in: 2024 IEEE 24th International Conference
on Software Quality, Reliability and Security (QRS), 2024, pp. 204–215. doi:10.1109/QRS62785.
2024.00029.
[17] I. T. Leong, R. Barbosa, Translating natural language requirements to formal specifications:
A study on gpt and symbolic nlp, in: 2023 53rd Annual IEEE/IFIP International Conference
on Dependable Systems and Networks Workshops (DSN-W), 2023, pp. 259–262. doi:10.1109/DSN-W58399.2023.00065.
[18] W. Nowakowski, M. Śmiałek, A. Ambroziewicz, T. Straszak, Requirements-level language and
tools for capturing software system essence, Computer Science and Information Systems 10 (2013)
1499–1524.
[19] S. Ghosh, D. Elenius, W. Li, P. Lincoln, N. Shankar, W. Steiner, Arsenal: Automatic requirements
specification extraction from natural language, in: S. Rayadurgam, O. Tkachuk (Eds.), NASA
Formal Methods, Springer International Publishing, Cham, 2016, pp. 41–46.
[20] S. J. Greenspan, A. Borgida, J. Mylopoulos, A requirements modeling language and its logic,
Information Systems 11 (1986) 9–23. URL: https://www.sciencedirect.com/science/article/pii/0306437986900207. doi:10.1016/0306-4379(86)90020-7.
[21] M. Fazelnia, M. Mirakhorli, H. Bagheri, Translation titans, reasoning challenges: Satisfiability-aided language models for detecting conflicting requirements, in: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 2294–2298. URL: https://doi.org/10.1145/3691620.3695302. doi:10.1145/3691620.3695302.
[34] W. Xu, A. Banburski-Fahey, N. Jojic, Reprompting: Automated chain-of-thought prompt inference
through Gibbs sampling, CoRR abs/2305.09993 (2023). URL: https://doi.org/10.48550/arXiv.2305.09993. doi:10.48550/arXiv.2305.09993. arXiv:2305.09993.
[35] M. Besta, F. Memedi, Z. Zhang, R. Gerstenberger, N. Blach, P. Nyczyk, M. Copik, G. Kwasniewski,
J. Müller, L. Gianinazzi, A. Kubicek, H. Niewiadomski, O. Mutlu, T. Hoefler, Topologies of reasoning:
Demystifying chains, trees, and graphs of thoughts, CoRR abs/2401.14295 (2024). URL: https://doi.org/10.48550/arXiv.2401.14295. doi:10.48550/arXiv.2401.14295. arXiv:2401.14295.
[36] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih,
T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive NLP
tasks, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural
Information Processing Systems 33: Annual Conference on Neural Information Processing Systems
2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.
[37] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, A. Chadha, A systematic survey of prompt
engineering in large language models: Techniques and applications (2025). URL: https://arxiv.org/abs/2402.07927. arXiv:2402.07927.
[38] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, in: The Tenth International Conference on
Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net, 2022. URL:
https://openreview.net/forum?id=nZeVKeeFYf9.
[39] K. R. M. Leino, Dafny: An automatic program verifier for functional correctness, in: Proceedings of
the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning
(LPAR), volume 6355 of Lecture Notes in Computer Science, Springer, 2010, pp. 348–370. URL:
https://doi.org/10.1007/978-3-642-17511-4_20. doi:10.1007/978-3-642-17511-4_20.
[40] D. R. Cok, Openjml: Software verification for java 7 using jml, openjdk, and eclipse, in: NASA
Formal Methods (NFM 2011), volume 6617 of Lecture Notes in Computer Science, Springer, 2011, pp. 472–
479. URL: https://doi.org/10.1007/978-3-642-20398-5_35. doi:10.1007/978-3-642-20398-5_35.
[41] M. R. H. Misu, C. V. Lopes, I. Ma, J. Noble, Towards ai-assisted synthesis of verified dafny methods, Proc. ACM Softw. Eng. 1 (2024). URL: https://doi.org/10.1145/3643763. doi:10.1145/3643763.
[42] J. Yao, Y. Liu, Z. Dong, M. Guo, H. Hu, K. Keutzer, L. Du, D. Zhou, S. Zhang, Promptcot: Align prompt
distribution via adapted chain-of-thought, in: 2024 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), 2024, pp. 7027–7037. doi:10.1109/CVPR52733.2024.00671.
[43] A. Porshnev, et al., Modelling implicit bias in gender–career associations: A systematic comparison
of language models, PsyArXiv (2025). doi:10.31234/osf.io/p7hvw_v1, preprint, 22 May 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Huisman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gurov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Malkis</surname>
          </string-name>
          ,
          <article-title>Formal methods: From academia to industrial practice. A travel guide</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2002.07279. arXiv:2002.07279.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Beg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>O'Donoghue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Monahan</surname>
          </string-name>
          ,
          <article-title>Formalising software requirements using large language models</article-title>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2506.10704. arXiv:2506.10704.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Granberry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ahrendt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Johansson</surname>
          </string-name>
          ,
          <article-title>Specify what? Enhancing neural specification synthesis by symbolic methods</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Kosmatov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kovács</surname>
          </string-name>
          (Eds.),
          <source>Integrated Formal Methods</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>307</fpage>
          -
          <lpage>325</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Robles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kosmatov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prevosto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Le Gall</surname>
          </string-name>
          ,
          <article-title>High-level program properties in frama-c: Definition, verification and deduction</article-title>
          , in:
          <source>Leveraging Applications of Formal Methods, Verification and Validation. Specification and Verification: 12th International Symposium, ISoLA 2024, Crete, Greece, October 27-31, 2024, Proceedings, Part III</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2024</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>177</lpage>
          . URL: https://doi.org/10.1007/978-3-031-75380-0_10. doi:10.1007/978-3-031-75380-0_10.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Couder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ochoa</surname>
          </string-name>
          ,
          <article-title>Requirements verification through the analysis of source code by large language models</article-title>
          , in:
          <source>SoutheastCon 2024</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>80</lpage>
          . doi:10.1109/SoutheastCon52093.2024.10500073.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Valentino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Dennis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Freitas</surname>
          </string-name>
          ,
          <article-title>Verification and refinement of natural language explanations through llm-symbolic theorem proving</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.01379. arXiv:2405.01379.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chalamala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Godil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Prenger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anandkumar</surname>
          </string-name>
          ,
          <article-title>Leandojo: Theorem proving with retrieval-augmented language models</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Globerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>36</volume>
          , Curran Associates, Inc.,
          <year>2023</year>
          , pp.
          <fpage>21573</fpage>
          -
          <lpage>21612</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tworkowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Czechowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Odrzygóźdź</surname>
          </string-name>
          , P. Miłoś, Y. Wu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jamnik</surname>
          </string-name>
          ,
          <article-title>Thor: Wielding hammers to integrate language models and automated theorem provers</article-title>
          , in: S. Koyejo,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          ,
          Curran Associates, Inc.,
          <year>2022</year>
          , pp.
          <fpage>8360</fpage>
          -
          <lpage>8373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>G.</given-names>
            <surname>Granberry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ahrendt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Johansson</surname>
          </string-name>
          ,
          <article-title>Towards integrating copiloting and formal methods</article-title>
          , in:
          <string-name>
            <given-names>T.</given-names>
            <surname>Margaria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steffen</surname>
          </string-name>
          (Eds.),
          <source>Leveraging Applications of Formal Methods, Verification and Validation. Specification and Verification</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>144</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E.</given-names>
            <surname>Mugnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jhala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Polikarpova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Laurel: Generating dafny assertions using large language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.16792. arXiv:2405.16792.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          , in: S. Koyejo,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          , Curran Associates, Inc.,
          <year>2022</year>
          , pp.
          <fpage>22199</fpage>
          -
          <lpage>22213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Si</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>One-shot learning as instruction data prospector for large language models</article-title>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2312.10302. arXiv:2312.10302.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Self-convinced prompting: Few-shot question answering with repeated introspection</article-title>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2310.05035. arXiv:2310.05035.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          , M. Bosma, b. ichter,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          , in: S. Koyejo,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          , Curran Associates, Inc.,
          <year>2022</year>
          , pp.
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Structured chain-of-thought prompting for code generation</article-title>
          ,
          <source>ACM Trans. Softw. Eng. Methodol</source>
          .
          <volume>34</volume>
          (
          <year>2025</year>
          ). URL: https://doi.org/10.1145/3690635. doi:10.1145/3690635.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ratner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          , T. Pfister,
          <article-title>Found in the middle: Calibrating positional attention bias improves long context utilization</article-title>
          , in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024</source>
          , Association for Computational Linguistics,
          <year>2024</year>
          , pp.
          <fpage>14982</fpage>
          -
          <lpage>14995</lpage>
          . URL: https://doi.org/10.18653/v1/2024.findings-acl.890. doi:10.18653/v1/2024.findings-acl.890.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Durrett</surname>
          </string-name>
          ,
          <article-title>The unreliability of explanations in few-shot prompting for textual reasoning</article-title>
          , in: S. Koyejo,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          , Curran Associates, Inc.,
          <year>2022</year>
          , pp.
          <fpage>30378</fpage>
          -
          <lpage>30392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Diao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Automatic prompt augmentation and selection with chain-of-thought from labeled data</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023</source>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>12113</fpage>
          -
          <lpage>12139</lpage>
          . URL: https://doi.org/10.18653/v1/2023.findings-emnlp.811. doi:10.18653/v1/2023.findings-emnlp.811.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>