<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Conversational Code Generation: a Case Study of Designing a Dialogue System for Generating Driving Scenarios for Testing Autonomous Vehicles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rimvydas Rubavicius</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Valerio Miceli-Barone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex Lascarides</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Subramanian Ramamoorthy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Informatics, University of Edinburgh</institution>
          <addr-line>10 Crichton Street, Edinburgh EH8 9AB</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>26</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Cyber-physical systems like autonomous vehicles are tested in simulation before deployment, using domain-specific programs for scenario specification. To aid the testing of autonomous vehicles in simulation, we design a natural language interface, using an instruction-following large language model (LLM), to assist a non-coding domain expert in synthesising the desired scenarios and vehicle behaviours. We show that using it to convert utterances to the symbolic program is feasible, despite the very small training dataset. Human experiments show that dialogue is critical to successful simulation generation, leading to a 4.5 times higher success rate than generation without engaging in extended conversation.</p>
      </abstract>
      <kwd-group>
        <kwd>Code Generation</kwd>
        <kwd>Autonomous Vehicles</kwd>
        <kwd>Simulation</kwd>
        <kwd>Human-computer Interaction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Testing autonomous vehicles exclusively on public roads, especially in near-crash scenarios, is not
feasible [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Iterative development and testing in simulation is essential [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], especially in the early
stages of the development process. Simulators like CARLA [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] support this iterative development
process for autonomous vehicles (AVs). Using CARLA, engineers can generate driving scenarios using
Scenic [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]: probabilistic scenario description languages such as Scenic specify a distribution over the
driving scenarios that satisfy the description, and then sample many simulation instances for evaluation.
In this way, challenging driving scenarios can be generated [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], which can then be used for evaluation
of control algorithms for autonomous vehicles without a physical system, safely and cost-effectively.
      </p>
      <p>Writing programs to specify driving scenarios in Scenic, or more generally programming in
domain-specific languages (DSLs), is challenging and has a steep learning curve. Domain experts with deep
knowledge of edge cases and desired driving behaviours may lack proficiency in Scenic or other DSLs, or
even when they do have such proficiency, they may still benefit from some assistance in their workflow.
A natural language interface in the form of a dialogue system in which engineers converse with a
chatbot that knows how to program in Scenic is a useful tool for facilitating this. In this paper, we
aim to develop such a dialogue system, and by doing so increase engineer access to simulation-driven
testing, by allowing them to interactively synthesise scenarios using natural language dialogue.</p>
      <p>
        This is a challenging task for two reasons. Firstly, due to data scarcity (only 32 examples of Scenic
programs with natural language descriptions are available online), approaches to developing a dialogue
system using instruction-following large language models (IFLLM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] are not directly applicable.
Secondly, the scenario description provided by the user may be underspecified: it may be missing details
about the desired driving scenario that are only exposed through interaction with
the simulator. Figure 1 illustrates this phenomenon. When a user utters “an ego vehicle overtakes
the two adversary vehicles”, the dialogue system can synthesize a simulation instance satisfying the
description, but the description itself does not provide all information about the situation, which in this
case is the manner of overtaking (it could be one by one, or both at the same time). The user only becomes
aware of the need to provide such details by observing simulation instances. In this context, it
would be helpful for a natural language interface to allow the user to express feedback to the system,
which can then be used to update the program and, in turn, generate updated simulation instances.
      </p>
      <p>[Figure 1. Diagram labels: user, conversation, code generation, program (.scenic), simulator, simulation instances; inaccurate simulation due to incomplete specification; user feedback (“I want the Ego vehicle to overtake the adversary vehicles one by one.”); updated program (.scenic), updated simulation instances.]</p>
      <p>To this end, this paper makes the following contributions: 1) we create a dataset of pairs of English
descriptions and Scenic programs for a variety of driving scenarios; 2) we investigate how IFLLMs can be used to
synthesize Scenic programs from natural language expressions in an embodied conversation where the
user converses with a chatbot and reacts to driving simulation instances; and 3) we conduct human
trials to evaluate the dialogue system, and in particular the value of having multiple dialogue turns.
Our results demonstrate that the dialogue system can be a valuable tool for facilitating the generation
of interactive driving scenarios.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Code Generation The problem of converting natural language to executable programs has been
widely studied as semantic parsing [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9, 10</xref>
        ]. Recently, using IFLLMs has been explored for this task
and also the wider problem of arbitrary code generation, including auto-completion [11] to assist the
software development process. In this paper, we concentrate on developing a dialogue system that
generates Scenic programs using LLMs. It is a low-resource domain where only a handful of examples
are available. This is in contrast to other popular programming languages like Python, which benefit
from many code repositories for training [12]. Furthermore, the code for high-resource programming
languages follows style guidelines, making it easier for models to learn generalisations from the
well-structured data. In contrast, the Scenic program repository was written in different styles by different
software engineers.
      </p>
      <p>Testing AVs Testing is a fundamental process within software engineering, which is particularly
important for designing safety-critical systems, such as AVs [13, 14]. There are a variety of methods
for testing and validating AV design, even with black-box components [15]. This paper focuses on
testing by generating driving scenarios as a complementary procedure to formal verification. Such
testing involves interaction with expert users, which is essential for human-centred design [16]. Driving
scenario generation has been explored before, either by direct scenario generation [17, 18] or by utilising
Scenic [19, 20]. What is unique about this work is: (i) human experiments; and (ii) utilising dialogue
to align the generated driving scenarios with the user’s intent. We view this latter contribution as
essential, given the ubiquitous phenomenon of natural language pragmatics [21]: speakers of natural
language often intend to convey a meaning that goes beyond what they make linguistically explicit,
relying instead on their interlocutor’s capacity to decode the intended content using the linguistic
and non-linguistic context. There is no guarantee that the sampling processes in Scenic match how
competent interlocutors perform this task, and so errors in decoding the intended message via Scenic
sampling may happen, prompting the need for user feedback.</p>
      <p>Dialogue Systems and Tools Natural language interfaces in the form of a dialogue system have
become a popular means for users to interact with systems without needing to know the underlying
system or how to program it [22, 23]. Broadly, dialogue systems are categorised by their function as
goal-oriented or for chit-chat. We are designing a goal-oriented dialogue system with tool use [24],
which in this case is the CARLA simulator. What is unique about our work is that the driving scenarios
generated by the simulator are not observed in the dialogue system and are latent, with observations
coming only from the user’s feedback after observing simulation instances.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The Dialogue System</title>
      <p>This section describes the dialogue system for generating driving simulations via conversations. Figure 2
demonstrates the two modes of the system: code generation (§3.1); and conversation (§3.2), in which
the user responds to simulation instances as depicted in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Code Generation</title>
        <p>To generate Scenic programs from natural language descriptions, the dialogue system uses
retrieval-augmented generation (RAG) [25] with in-context learning [26] to construct the CODE prompt.</p>
        <p>To generate a program <inline-formula><tex-math>\hat{p}</tex-math></inline-formula> from a natural language description <inline-formula><tex-math>d</tex-math></inline-formula>, we constructed a dataset
<inline-formula><tex-math>\mathcal{D} = \{(d_i, p_i)\}_{i=1}^{N}</tex-math></inline-formula> of <inline-formula><tex-math>N = 105</tex-math></inline-formula> description-program pairs<sup>1</sup>, via manual augmentation of the few
exemplars available on the web.<sup>2</sup> In this manual augmentation process, we also standardise
the codebase to minimise the variation between exemplars and ease the learning process in this low-resource regime:
each program follows a single coding style with five distinct code blocks (map and model, constants,
agent’s behaviour, spatial relations, and scenario specification). The dataset <inline-formula><tex-math>\mathcal{D}</tex-math></inline-formula> is indexed via embeddings of descriptions.<sup>3</sup></p>
        <p><sup>1</sup> https://huggingface.co/datasets/assistive-autonomy/scenic-driving-scenarios</p>
        <p><sup>2</sup> https://github.com/BerkeleyLearnVerify/Scenic/tree/2.x/examples</p>
        <p><sup>3</sup> https://huggingface.co/BAAI/bge-small-en-v1.5</p>
        <sec id="sec-3-1-1">
          <title>CODE prompt for code generation.</title>
          <p>[INST] You are a helpful assistant that translates English descriptions to Scenic programs. Scenic is a
domain-specific probabilistic programming language for creating distributions over specified scenarios.
For driving scenarios, each program has the following blocks:
• MAP AND MODEL: importing town assets and enabling simulator;
• CONSTANTS: specifying vehicle blueprint and other constants like vehicle speed, brake intensity
and safety distance;
• AGENT’S BEHAVIOUR: describing how individual vehicles behave in the scenario;
• SPATIAL RELATIONS: outlining the type of road the scenario needs to be synthesized in (e.g.
having or not having intersections)
• SCENARIO SPECIFICATION: creating individual vehicles and pedestrians in the specified roads,
together with constraints that are required to be true for the full simulation as well as the
termination condition.
Here are examples of descriptions and programs: {DESCRIPTION-PROGRAM PAIRS}
Now, please translate the following English description to a Scenic program. Just give the program. No
extra information. Description: {DESCRIPTION}[/INST]</p>
          <p>SUMM prompt for description update.</p>
          <p>[INST] You are a helpful assistant that creates updated driving scenario descriptions. Given a driving
scenario description and corresponding feedback regarding the simulation, your task is to generate an
updated description that incorporates the feedback. Please ensure that the updated description accurately
reflects the intended action based on the feedback received and does not introduce additional information
or lose information, like the number of vehicles or pedestrians in the situation. The description will outline
a specific scenario or context, while the feedback will provide information about how the described
scenario could have been improved or modified. Be sure to maintain the original meaning and intent
of both the initial description and the feedback. Be brief and only return the updated driving scenario
description. Description: {DESCRIPTION} Feedback: {FEEDBACK}[/INST]</p>
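          <p>At inference time, the embedding <inline-formula><tex-math>\mathbf{d}</tex-math></inline-formula> of the user’s description <inline-formula><tex-math>d</tex-math></inline-formula> is used to find the <inline-formula><tex-math>k = 3</tex-math></inline-formula> most similar
description-program pairs from the dataset <inline-formula><tex-math>\mathcal{D}</tex-math></inline-formula> using maximum inner-product search (MIPS):
<disp-formula id="eq1"><label>(1)</label><tex-math>R(\mathbf{d}) = \{(d_i, p_i) \in \mathcal{D} \mid (d_i, p_i) \in \mathrm{MIPS}(\mathcal{D}, \mathbf{d}, k)\}</tex-math></disp-formula>
The description and the retrieved exemplars are then used to create the CODE prompt (see Figure 3 for
the prompt specification), which is in turn used to generate the program <inline-formula><tex-math>\hat{p}</tex-math></inline-formula> using an IFLLM:
<disp-formula id="eq2"><label>(2)</label><tex-math>\hat{p} = \mathrm{IFLLM}(\mathrm{CODE}(R(\mathbf{d}), d))</tex-math></disp-formula>
There is no guarantee that the program <inline-formula><tex-math>\hat{p}</tex-math></inline-formula> will execute. To address this, we perform error feeding: the
program <inline-formula><tex-math>\hat{p}</tex-math></inline-formula> is passed to the simulator to attempt to create simulation instances; if an error occurs,
the natural language description, <inline-formula><tex-math>\hat{p}</tex-math></inline-formula>, and its errors are passed back to the IFLLM, which attempts to self-correct
the output. If, after three attempts of error feeding, the generated program still does not execute, the user is
asked to provide an alternative description.</p>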
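          <p>For illustration, the retrieval in (1), the prompt construction in (2), and error feeding can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the system: all function names, the shortened prompt template, the wording of the repair message, and the callables llm and run_in_simulator are hypothetical stand-ins. It also assumes that "three attempts of error feeding" means up to three repair rounds after the initial generation, and it uses exhaustive search where a real index would be used for larger datasets.</p>
          <preformat><![CDATA[
```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]  # (description, Scenic program)


def top_k_inner_product(query: Sequence[float],
                        index: Sequence[Sequence[float]],
                        k: int = 3) -> List[int]:
    """Exhaustive maximum inner-product search: indices of the k best-scoring vectors."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in index]
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def build_code_prompt(exemplars: Sequence[Pair], description: str) -> str:
    """Fill a shortened, hypothetical stand-in for the CODE template."""
    shots = "\n\n".join(f"Description: {d}\nProgram:\n{p}" for d, p in exemplars)
    return ("[INST] Here are examples of descriptions and programs:\n"
            f"{shots}\n"
            "Now, please translate the following English description to a "
            f"Scenic program. Description: {description}[/INST]")


def generate_with_error_feeding(
        description: str,
        exemplars: Sequence[Pair],
        llm: Callable[[str], str],
        run_in_simulator: Callable[[str], Optional[str]],
        max_feeds: int = 3) -> Optional[str]:
    """Return an executable program, or None once error feeding is exhausted.

    run_in_simulator returns None on success and an error message otherwise.
    """
    base_prompt = build_code_prompt(exemplars, description)
    program = llm(base_prompt)
    for feeds_used in range(max_feeds + 1):
        error = run_in_simulator(program)
        if error is None:
            return program
        if feeds_used == max_feeds:
            break
        # Error feeding: show the model its own program and the simulator error.
        program = llm(f"{base_prompt}\nPrevious program:\n{program}\n"
                      f"Simulator error:\n{error}\nPlease correct the program.")
    return None
```
]]></preformat>
          <p>In this sketch, a return value of None corresponds to the point at which the user would be asked for an alternative description. Passing the model and the simulator in as callables keeps the retry logic independent of any particular inference server or simulator binding.</p>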
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Conversation</title>
        <p>When <inline-formula><tex-math>\hat{p}</tex-math></inline-formula> executes, <inline-formula><tex-math>n = 3</tex-math></inline-formula> instances of the driving scenario simulations are shown to the user, who
is given the opportunity to react to these, either by indicating successful generation or by providing
natural language feedback <inline-formula><tex-math>f</tex-math></inline-formula> that corrects or refines the results, in an attempt to provide simulations
that better align with the user’s communicative intent.</p>
        <p>Feedback is a response to the simulations and not to <inline-formula><tex-math>\hat{p}</tex-math></inline-formula> (which is not observable to the user): it can
feature and refer to information that is not present in the Scenic program but is present in the simulation
due to the nature of random sampling. Because of this, feeding only <inline-formula><tex-math>f</tex-math></inline-formula> to the IFLLM is suboptimal due to
its missing context, which cannot be given in a principled way (e.g. conditioning on the simulation
video or some domain-specific trace would lead to further data scarcity, which is already problematic
in this domain). To alleviate this issue, we feed back an updated description by summarising <inline-formula><tex-math>d</tex-math></inline-formula> and <inline-formula><tex-math>f</tex-math></inline-formula>,
using a SUMM prompt (see Figure 3 for the prompt specification), to generate the updated description
<inline-formula><tex-math>d_1</tex-math></inline-formula>. With this description, our aim is to guide the IFLLM, minimise the overall context, and avoid introducing
hallucinations and unwanted behaviours.</p>
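        <p>The updated description <inline-formula><tex-math>d_1</tex-math></inline-formula> is used to generate the updated program <inline-formula><tex-math>\hat{p}_1</tex-math></inline-formula> following the previously
described code generation procedure.</p>
        <p><disp-formula id="eq3"><label>(3)</label><tex-math>d_1 = \mathrm{IFLLM}(\mathrm{SUMM}(d, f))</tex-math></disp-formula>
<disp-formula id="eq4"><label>(4)</label><tex-math>\hat{p}_1 = \mathrm{IFLLM}(\mathrm{CODE}(R(\mathbf{d}_1), d_1))</tex-math></disp-formula>
The conversation continues for up to <inline-formula><tex-math>T = 4</tex-math></inline-formula> turns. If the user remains unsatisfied with the simulations, the
conversation is deemed unsuccessful.</p>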
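        <p>The conversation procedure described in this section can be sketched as a simple loop. The sketch below is illustrative only and rests on assumptions: the function converse and its three callables are hypothetical names, with generate_program standing in for code generation with error feeding, get_feedback for showing simulation instances to the user, and summarise for the SUMM-based description update in (3). A failed code generation simply ends the loop here, whereas the system would ask the user for an alternative description.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Optional, Tuple


def converse(description: str,
             generate_program: Callable[[str], Optional[str]],
             get_feedback: Callable[[str], Optional[str]],
             summarise: Callable[[str, str], str],
             max_turns: int = 4) -> Optional[Tuple[str, str, int]]:
    """Run the feedback loop; return (description, program, turns used) on success.

    generate_program returns None if no executable program was produced,
    get_feedback returns None if the user is satisfied with the simulations,
    and summarise merges a description and feedback into a new description.
    """
    for turn in range(1, max_turns + 1):
        program = generate_program(description)
        if program is None:
            # The system would ask for an alternative description here;
            # this sketch simply reports failure.
            return None
        feedback = get_feedback(program)
        if feedback is None:
            return description, program, turn
        if turn != max_turns:
            # Update the description instead of feeding raw feedback to the model.
            description = summarise(description, feedback)
    return None
```
]]></preformat>
        <p>Returning None after the final turn mirrors the rule that a conversation is deemed unsuccessful if the user is still unsatisfied. Note that only the updated description is carried between turns, which reflects the aim of keeping the model's context small.</p>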
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We conduct code generation experiments and a human evaluation of the overall dialogue system.
An NVIDIA RTX A6000 is used for serving the IFLLM via a text generation inference server. For human trials,
we use an Alienware Area-51m laptop with an NVIDIA RTX 2080 to generate simulation instances in
CARLA.</p>
      <sec id="sec-4-1">
        <title>4.1. Code Generation Experiments</title>
        <sec id="sec-4-1-1">
          <title>4.1.1. Experimental Setup</title>
          <p>We experiment with several open-source IFLLMs with 7B parameters: Mistral [27], Gemma [28], and
CodeLlama [29]. Each uses <inline-formula><tex-math>k = 3</tex-math></inline-formula> exemplars to construct the CODE prompt, with the exemplars either retrieved using RAG
or selected at random, and with or without error feeding. To cope with the small dataset bias when evaluating different IFLLMs,
we perform leave-one-out validation [30]. We record three metrics: (a) BLEU [31, 32] to measure
generation precision; (b) ROUGE-L [33] to measure generation recall; and (c) Execution (EXEC) to
measure the percentage of the generated code that is executable, all measured with error feeding (if
required). More sophisticated code generation evaluation metrics like CodeBLEU [34], which abstract
away from the surface form and focus on syntactic similarity, are not considered because such
metrics would require adapting a parser to Scenic programs. This limitation is partially mitigated by the program
standardisation performed during dataset construction.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Results and Discussion</title>
          <p>Table 1 records the results for code generation. We observe that an IFLLM that uses a CODE prompt
constructed using RAG performs better than one using a random selection of exemplars, which most of the time results in non-executable
programs. Error feeding boosts performance for IFLLMs that are not fine-tuned on code (Mistral and
Gemma), yet it increases inference time, which for Gemma means that generation does not terminate within a reasonable
time. High BLEU and ROUGE-L indicate a high correspondence between prediction and reference,
as expected after program standardisation. For EXEC, CodeLlama is better than Mistral and Gemma,
which is expected given its supervised fine-tuning on code. Based on the code generation experiments,
we chose CodeLlama with RAG and error feeding for the human evaluation.</p>
        </sec>
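        <p>The leave-one-out protocol used for the EXEC metric can be sketched as follows. This is a hedged illustration rather than the evaluation code behind Table 1: the function leave_one_out_exec and the callables generate and executes are hypothetical, with generate standing in for retrieval plus code generation restricted to the remaining pairs, and executes for an attempt to run the program in the simulator.</p>
        <preformat><![CDATA[
```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]  # (description, reference Scenic program)


def leave_one_out_exec(
        dataset: Sequence[Pair],
        generate: Callable[[str, List[Pair]], Optional[str]],
        executes: Callable[[str], bool]) -> float:
    """Percentage of held-out descriptions whose generated program executes."""
    hits = 0
    for i, (description, _reference) in enumerate(dataset):
        # Exemplars may only come from the other N - 1 pairs.
        pool = [pair for j, pair in enumerate(dataset) if j != i]
        program = generate(description, pool)
        if program is not None and executes(program):
            hits += 1
    return 100.0 * hits / len(dataset)
```
]]></preformat>
        <p>Excluding the held-out pair from the exemplar pool matters here because retrieval would otherwise return the reference program itself. BLEU and ROUGE-L would be computed in the same loop by comparing each generated program with the held-out reference.</p>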
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Human Evaluation</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Experimental Setup</title>
          <p>We perform a human evaluation of the dialogue system. Visual stimuli used for human trials are given
in Figure 4. The user is presented with a (diagrammatic) stimulus and asked to describe it. After that,
CARLA is run to produce 3 simulation instances.<sup>4</sup> After observing them, the user may respond with further
natural language input about what they see, or check “satisfied with the scenario produced” and move to
the next stimulus. The user can have up to <inline-formula><tex-math>T = 4</tex-math></inline-formula> dialogue turns. If code generation does not produce
executable code, the user is asked to paraphrase their description as if it were a fresh interaction. The
stimuli are categorised into three types: bypassing (e.g. the stimulus may yield the user description
“an ego overtakes an adversary”), intersections (e.g. “an ego goes left at a 4-way intersection while the
adversary approaches from behind.”), and pedestrians (e.g. “an ego goes left and yields to the pedestrian”).
The experiments were conducted with 20 users who have a driving license (as a proxy for domain
proficiency), with each asked to describe 5 of the 25 stimuli. In total, 100 conversations were recorded:
44 intersection, 19 bypassing, and 37 pedestrian scenarios.</p>
          <p><sup>4</sup> The dialogue system can produce many (different) instances, but to keep the stimuli contained, we limit it to 3.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Results and Discussion</title>
          <p>[Figure: two panels plotted against the number of dialogue turns (1 to 4): (a) with execution errors; (b) without execution errors.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>We have presented a dialogue system for generating driving scenarios through conversation. Our results show
that users can, in principle, generate desired driving scenarios, with conversation being a crucial element
in reducing scenario generation errors.</p>
      <p>For future work, we envision several developments. Firstly, there remains a gap in generating scenarios,
despite the extended conversation. To address this, further techniques could be considered, including
data augmentation [35], constrained decoding [36], prompt engineering [37], and memory-based
techniques for explicit state tracking [38]. Secondly, the current setup does not dynamically change the
controller and use the interface to test it. This should be evaluated, and the testing of such new
controllers should be considered in tandem with formal verification methods [39].</p>
      <p>[Figure text: example conversations. User feedback in the pedestrian example: “first the pedestrian crosses, then ego vehicle moves.”; “the pedestrian crosses the road on which the ego vehicle enters”; “the pedestrian is to the right of the vehicle”.]</p>
      <sec id="sec-5-1">
        <title>Outcome: simulation generated successfully</title>
        <p>Description 1: at a 3-way intersection, a pedestrian crosses
the road. An ego vehicle waits to make a left turn.</p>
        <p>Description 2: At a 3-way intersection, a pedestrian crosses
the road and an ego vehicle waits to make
a left turn.</p>
        <p>Description 3: At a 3-way intersection, an ego vehicle waits
to make a left turn, and a pedestrian
crosses the road.</p>
        <p>Description 4: At a 3-way intersection, an ego vehicle waits to
make a left turn, and a pedestrian crosses
the road to the right.</p>
        <p>Outcome: failed to generate the simulation</p>
        <p>Description 1: An ego vehicle overtakes an adversary vehicle,
then overtakes another adversary vehicle</p>
        <p>Description 2: An ego vehicle overtakes an adversary vehicle,
then overtakes another adversary vehicle on a
dual carriageway behind two adversary vehicles
in the left lane.</p>
        <p>Description 3: An ego vehicle overtakes an
adversary vehicle, then overtakes another adversary
vehicle on a dual carriageway behind
two adversary vehicles in the left lane.</p>
        <p>Description 4: An ego vehicle overtakes an adversary vehicle,
then overtakes another adversary vehicle on a
dual carriageway behind two adversary vehicles
in the left lane.</p>
        <p>Feedback (turn 4): The road only has two lanes</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the UKRI Strategic Priorities Fund to the UKRI Research Node on
Trustworthy Autonomous Systems Governance and Regulation (grant EP/V026607/1) and the UKRI Turing
AI World Leading Researcher Fellowship on AI for Person-Centred and Teachable Autonomy (grant
EP/Z534833/1).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Grammarly in order to do the following:
Grammar and spell checking; generating a paraphrase or alterantive word. After using this tool, the
author(s) reviewed and edited the content as needed, and we take full responsibility for the publication’s
content.
of the AFNLP, Association for Computational Linguistics, Suntec, Singapore, 2009, pp. 976–984.</p>
      <p>URL: https://aclanthology.org/P09-1110/.
[10] L. Dong, M. Lapata, Language to logical form with neural attention, in: K. Erk, N. A. Smith (Eds.),
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 33–43.</p>
      <p>URL: https://aclanthology.org/P16-1004/. doi:10.18653/v1/P16-1004.
[11] J. Jiang, F. Wang, J. Shen, S. Kim, S. Kim, A survey on large language models for code
generation, CoRR abs/2406.00515 (2024). URL: https://doi.org/10.48550/arXiv.2406.00515. doi:10.48550/
ARXIV.2406.00515. arXiv:2406.00515.
[12] D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, Y. Jernite, M. Mitchell, C. M. Ferrandis, S. Hughes,
T. Wolf, D. Bahdanau, L. von Werra, H. de Vries, The stack: 3 TB of permissively licensed source
code, Trans. Mach. Learn. Res. 2023 (2023). URL: https://openreview.net/forum?id=pxpbTdUEpD.
[13] S. Feng, X. Yan, H. Sun, Y. Feng, H. X. Liu, Intelligent driving intelligence test for autonomous
vehicles with naturalistic and adversarial environment, Nature Communications 12 (2021) 748.</p>
      <p>URL: https://doi.org/10.1038/s41467-021-21007-8. doi:10.1038/s41467-021-21007-8.
[14] B. Padmaja, C. V. K. N. S. N. Moorthy, N. Venkateswarulu, M. M. Bala, Exploration of issues,
challenges and latest developments in autonomous cars, Journal of Big Data 10 (2023) 61. URL:
https://doi.org/10.1186/s40537-023-00701-y. doi:10.1186/s40537-023-00701-y.
[15] A. Corso, R. J. Moss, M. Koren, R. Lee, M. J. Kochenderfer, A survey of algorithms for
blackbox safety validation of cyber-physical systems, J. Artif. Intell. Res. 72 (2021) 377–428. URL:
https://doi.org/10.1613/jair.1.12716. doi:10.1613/JAIR.1.12716.
[16] H. Rosenbrock, Designing human-centred technology: a cross-disciplinary project in
computeraided manufacturing, 1989. URL: https://api.semanticscholar.org/CorpusID:106694042.
[17] C. Chang, S. Wang, J. Zhang, J. Ge, L. Li, Llmscenario: Large language model driven scenario
generation, IEEE Transactions on Systems, Man, and Cybernetics: Systems 54 (2024) 6581–6594.
doi:10.1109/TSMC.2024.3392930.
[18] L. Feng, Q. Li, Z. Peng, S. Tan, B. Zhou, Traficgen: Learning to generate diverse and realistic
trafic scenarios, in: 2023 IEEE International Conference on Robotics and Automation (ICRA),
2023, pp. 3567–3575. doi:10.1109/ICRA48891.2023.10160296.
[19] A. V. Miceli Barone, C. Innes, A. Lascarides, Dialogue-based generation of self-driving simulation
scenarios using large language models, in: A. Padmakumar, M. Inan, Y. Fan, X. Wang, M. Alikhani
(Eds.), Proceedings of the 3rd Combined Workshop on Spatial Language Understanding and
Grounded Communication for Robotics (SpLU-RoboNLP 2023), Association for Computational
Linguistics, Singapore, 2023, pp. 1–12. URL: https://aclanthology.org/2023.splurobonlp-1.1/. doi:10.
18653/v1/2023.splurobonlp-1.1.
[20] J. Zhang, C. Xu, B. Li, Chatscene: Knowledge-enabled safety-critical scenario generation for
autonomous vehicles, in: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2024, pp. 15459–15469.
[21] H. P. Grice, Utterer’s meaning and intentions, Philosophical Review 68 (1969) 147–177.
[22] H. Chen, X. Liu, D. Yin, J. Tang, A survey on dialogue systems: Recent advances and new
frontiers, SIGKDD Explor. Newsl. 19 (2017) 25–35. URL: https://doi.org/10.1145/3166054.3166058.
doi:10.1145/3166054.3166058.
[23] J. Ni, T. Young, V. Pandelea, F. Xue, E. Cambria, Recent advances in deep learning based dialogue
systems: a systematic survey, Artificial Intelligence Review 56 (2023) 3055–3155. URL: https:
//doi.org/10.1007/s10462-022-10248-8. doi:10.1007/s10462-022-10248-8.
[24] T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda,
T. Scialom, Toolformer: Language models can teach themselves to use tools, in: A. Oh, T. Naumann,
A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing
Systems, volume 36, Curran Associates, Inc., 2023, pp. 68539–68551. URL: https://proceedings.neurips.
cc/paper_files/paper/2023/file/d842425e4bf79ba039352da0f658a906-Paper-Conference.pdf.
[25] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis,
W.t. Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for
knowledgeintensive nlp tasks, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin
(Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates,
Inc., 2020, pp. 9459–9474. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/
6b493230205f780e1bc26945df7481e5-Paper.pdf.
[26] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child,
A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray,
B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei,
Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan,
H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran
Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/
1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[27] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand,
G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang,
T. Lacroix, W. E. Sayed, Mistral 7b, CoRR abs/2310.06825 (2023). URL: https://doi.org/10.48550/
arXiv.2310.06825. doi:10.48550/ARXIV.2310.06825. arXiv:2310.06825.
[28] T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale,
J. Love, P. Tafti, L. Hussenot, A. Chowdhery, A. Roberts, A. Barua, A. Botev, A. Castro-Ros,
A. Slone, A. Héliou, A. Tacchetti, A. Bulanova, A. Paterson, B. Tsai, B. Shahriari, C. L. Lan, C. A.
Choquette-Choo, C. Crepy, D. Cer, D. Ippolito, D. Reid, E. Buchatskaya, E. Ni, E. Noland, G. Yan,
G. Tucker, G. Muraru, G. Rozhdestvenskiy, H. Michalewski, I. Tenney, I. Grishchenko, J. Austin,
J. Keeling, J. Labanowski, J. Lespiau, J. Stanway, J. Brennan, J. Chen, J. Ferret, J. Chiu, et al., Gemma:
Open models based on gemini research and technology, CoRR abs/2403.08295 (2024). URL: https:
//doi.org/10.48550/arXiv.2403.08295. doi:10.48550/ARXIV.2403.08295. arXiv:2403.08295.
[29] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin,
A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. Canton-Ferrer, A. Grattafiori, W. Xiong,
A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, G. Synnaeve,
Code llama: Open foundation models for code, CoRR abs/2308.12950 (2023). URL: https:
//doi.org/10.48550/arXiv.2308.12950. doi:10.48550/ARXIV.2308.12950. arXiv:2308.12950.
[30] C. M. Bishop, Pattern recognition and machine learning, 5th Edition, Information science and
statistics, Springer, 2007. URL: https://www.worldcat.org/oclc/71008143.
[31] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine
translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting
of the Association for Computational Linguistics, Association for Computational Linguistics,
Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040/.
doi:10.3115/1073083.1073135.
[32] M. Post, A call for clarity in reporting BLEU scores, in: O. Bojar, R. Chatterjee, C. Federmann,
M. Fishel, Y. Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, C. Monz, M. Negri, A. Névéol,
M. Neves, M. Post, L. Specia, M. Turchi, K. Verspoor (Eds.), Proceedings of the Third Conference on
Machine Translation: Research Papers, Association for Computational Linguistics, Brussels,
Belgium, 2018, pp. 186–191. URL: https://aclanthology.org/W18-6319/. doi:10.18653/v1/W18-6319.
[33] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization
Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL:
https://aclanthology.org/W04-1013/.
[34] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, S. Ma,
Codebleu: a method for automatic evaluation of code synthesis, CoRR abs/2009.10297 (2020). URL:
https://arxiv.org/abs/2009.10297. arXiv:2009.10297.
[35] P. Chen, G. Lampouras, Exploring data augmentation for code generation tasks, in: A. Vlachos,
I. Augenstein (Eds.), Findings of the Association for Computational Linguistics: EACL 2023,
Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1542–1550. URL:
https://aclanthology.org/2023.findings-eacl.114/. doi:10.18653/v1/2023.findings-eacl.114.
[36] S. Geng, M. Josifoski, M. Peyrard, R. West, Grammar-constrained decoding for structured NLP
tasks without finetuning, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference
on Empirical Methods in Natural Language Processing, Association for Computational Linguistics,
Singapore, 2023, pp. 10932–10952. URL: https://aclanthology.org/2023.emnlp-main.674/.
doi:10.18653/v1/2023.emnlp-main.674.
[37] M. L. Siddiq, B. Casey, J. C. S. Santos, A lightweight framework for high-quality code
generation, CoRR abs/2307.08220 (2023). URL: https://doi.org/10.48550/arXiv.2307.08220.
doi:10.48550/ARXIV.2307.08220. arXiv:2307.08220.
[38] P. Jain, M. Lapata, Memory-based semantic parsing, Transactions of the Association for
Computational Linguistics 9 (2021) 1197–1212. URL: https://aclanthology.org/2021.tacl-1.71/.
doi:10.1162/tacl_a_00422.
[39] Y. Wang, M. Nakamura, K. Sakakibara, Y. Okura, Formal specification and verification of an
autonomous vehicle control system by the ots/cafeobj method (S), in: S. Chang (Ed.), The 35th
International Conference on Software Engineering and Knowledge Engineering, SEKE 2023, KSIR
Virtual Conference Center, USA, July 1-10, 2023, KSI Research Inc., 2023, pp. 363–366. URL:
https://doi.org/10.18293/SEKE2023-170. doi:10.18293/SEKE2023-170.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Paddock</surname>
          </string-name>
          ,
          <article-title>Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?</article-title>
          ,
          <source>Transportation Research Part A: Policy and Practice</source>
          <volume>94</volume>
          (
          <year>2016</year>
          )
          <fpage>182</fpage>
          -
          <lpage>193</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0965856416302129. doi:10.1016/j.tra.2016.09.010.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Schöner</surname>
          </string-name>
          ,
          <article-title>Simulation in development and testing of autonomous vehicles</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Bargende</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Reuss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiedemann</surname>
          </string-name>
          (Eds.),
          <source>18. Internationales Stuttgarter Symposium</source>
          , Springer Fachmedien Wiesbaden, Wiesbaden,
          <year>2018</year>
          , pp.
          <fpage>1083</fpage>
          -
          <lpage>1095</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Codevilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koltun</surname>
          </string-name>
          ,
          <article-title>CARLA: an open urban driving simulator</article-title>
          , in:
          <source>1st Annual Conference on Robot Learning, CoRL 2017</source>
          , Mountain View, California, USA, November 13-15, 2017, Proceedings, volume
          <volume>78</volume>
          of Proceedings of Machine Learning Research, PMLR,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . URL: http://proceedings.mlr.press/v78/dosovitskiy17a.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Fremont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dreossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Sangiovanni-Vincentelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Seshia</surname>
          </string-name>
          ,
          <article-title>Scenic: a language for scenario specification and data generation</article-title>
          ,
          <source>Mach. Learn.</source>
          <volume>112</volume>
          (
          <year>2023</year>
          )
          <fpage>3805</fpage>
          -
          <lpage>3849</lpage>
          . URL: https://doi.org/10.1007/s10994-021-06120-5. doi:10.1007/S10994-021-06120-5.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning</article-title>
          ,
          <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>
          <volume>45</volume>
          (
          <year>2023</year>
          )
          <fpage>3461</fpage>
          -
          <lpage>3475</lpage>
          . URL: https://doi.org/10.1109/TPAMI.2022.3190471. doi:10.1109/TPAMI.2022.3190471.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Mo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Scenarionet: Open-source platform for large-scale traffic scenario simulation and modeling</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kelton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Koyejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          ,
          Curran Associates, Inc.,
          <year>2022</year>
          , pp.
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mooney</surname>
          </string-name>
          ,
          <article-title>Learning for semantic parsing with statistical machine translation</article-title>
          , in:
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bilmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chu-Carroll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Human Language Technology Conference of the NAACL, Main Conference</source>
          , Association for Computational Linguistics, New York City, USA,
          <year>2006</year>
          , pp.
          <fpage>439</fpage>
          -
          <lpage>446</lpage>
          . URL: https://aclanthology.org/N06-1056/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <article-title>Learning context-dependent mappings from sentences to logical form</article-title>
          , in:
          <string-name>
            <given-names>K.-Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>