=Paper=
{{Paper
|id=Vol-3864/quasoq-2024-paper-07
|storemode=property
|title=A Novel Approach to Automated Test Script Generation using Large Language Models for Domain-Specific Languages
|pdfUrl=https://ceur-ws.org/Vol-3864/quasoq-2024-paper-07.pdf
|volume=Vol-3864
|authors=Jianing Sun,Jiahui Wang,Yuyan Zhu,Xingyu Li,Ying Xie,Jiaxin Chen
|dblpUrl=https://dblp.org/rec/conf/apsec/SunWZLXC24
}}
==A Novel Approach to Automated Test Script Generation using Large Language Models for Domain-Specific Languages==
Jianing Sun∗, Jiahui Wang, Yuyan Zhu, Xingyu Li, Ying Xie and Jiaxin Chen
Chongqing University, 400044 Chongqing, China
Abstract
This paper presents a novel method for generating automated test scripts for Domain-Specific Languages
(DSLs) in software testing, particularly for the automotive industry. It emphasizes the growing importance
of software testing in ensuring product quality amid IT advancements. The paper reviews software testing's
evolution, modern processes, and the role of Large Language Models (LLMs). It highlights DSLs' significance
and uses the automotive sector to show how LLMs can automate test script generation. Tests indicate that
in cases with a small sample size, the effectiveness of prompt engineering is superior to model fine-tuning.
The proposed method thus relies on well-designed prompts to direct LLMs to produce accurate scripts. The
generation system's overview is discussed, along with an evaluation of the scripts' quality using metrics
like Levenshtein Distance. Results indicate that LLMs boost test automation, defect detection, and software
reliability. Future work will optimize these tools for higher testing automation levels.
Keywords
software testing, domain-specific languages, large language models, Levenshtein Distance
QuASoQ 2024: 12th International Workshop on Quantitative Approaches to Software Quality, 3rd December 2024, Chongqing, China
∗ Corresponding author.
j.sun@cqu.edu.cn (J. Sun)
ORCID: 0009-0001-4943-6038 (J. Sun); 0009-0000-0107-0666 (J. Wang); 0009-0008-1670-4607 (Y. Zhu); 0009-0009-4526-007X (X. Li); 0009-0003-8939-7832 (Y. Xie); 0009-0002-4492-4734 (J. Chen)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1. Introduction
Software testing is a key component in ensuring the quality and reliability of software products. In the rapidly developing information technology era, software has become an indispensable part of our daily life and work. With the increasing complexity and diversification of software functions, the importance of software testing has become increasingly prominent. Software testing is a series of processes designed to check that a software product meets specified requirements and to ensure its quality. It not only helps developers find and fix defects, but also greatly enhances system security, especially in fields with high software safety requirements such as automotive and aviation [1].
1.1. A Brief History of Software Testing
The origins of software testing date back to the 1950s, focusing initially on debugging to identify and rectify software faults [2][3][4]. As software complexity grew, the need for independent testing organizations became apparent. In 1957, Charles Baker first defined program testing, in his review of the book Digital Computer Programming by Dan McCracken, separating it from debugging. Bill Hetzel formalized software testing as a concept at the University of North Carolina in 1972, establishing it to ensure a program performs as intended [5]. Glenford J. Myers further refined this in 1979, describing testing as executing a program to uncover errors [6].
By 1983, IEEE had standardized software testing, defining it as a process (manual or automated) to verify system requirements [7]. The 1990s brought agile methodologies, integrating testing and development and encouraging tester involvement from the earliest development stages [8]. In the 21st century, testing has advanced, with a focus on exploratory testing that highlights tester initiative. The era of AI and big data has intensified scrutiny of software testing. Although the field still relies on 20th-century methods, it anticipates future innovations that could revolutionize testing practices [9].
1.2. Modern Approaches
The modern software testing process is crucial for ensuring software quality and functionality. It starts with requirement analysis, followed by developing a test plan, designing test cases, and preparing test data (Figure 1). The test environment is set up, tests are executed and recorded, and defects are tracked.
Figure 1: Modern software testing process.
Regression and performance testing are conducted,
along with security and system testing. Acceptance testing confirms business requirements are met. Test reports summarize results, and evaluations identify process improvements.
Techniques like automated testing, Continuous Integration (CI), and Continuous Delivery (CD) enhance testing efficiency. Agile testing fosters collaboration between testers and developers. Performance, security, and mobile application testing ensure software reliability across different aspects. Cloud testing leverages cloud resources for extensive testing.
Figure 2: CI/CD is a software development practice in which code changes are automatically integrated, built, and tested, with successful builds being deployed to production.
AI-based testing leverages machine learning to automate software testing processes, enhancing efficiency and accuracy. It encompasses exploratory testing to identify issues without fixed test cases, ensuring broader coverage. Model-driven testing and testability design optimize test case generation and software sustainability. Additionally, managing test data and implementing strategies such as Test Left Shift and Test Right Shift further refine the development cycle. These dynamic approaches adapt to different methodologies, ensuring consistent software quality throughout the testing process.
1.3. Large Language Models
Large Language Models (LLMs) are cutting-edge AI systems specialized in natural language understanding and generation. Trained on extensive datasets and employing neural networks like Transformers, they capture linguistic subtleties and perform a variety of language tasks such as categorization, analysis, translation, and Q&A systems. They discern nuances, generate realistic text, and continuously adapt to linguistic evolution, which also raises concerns over data privacy and ethics.
LLMs have significantly impacted sectors like smart offices, travel, e-commerce, and government by enhancing efficiency and personalization. In software development, LLMs are revolutionizing the field. They aid in document summarization, provide travel advice, and improve user engagement. Tools like GitHub Copilot demonstrate their advantage by assisting in coding tasks [10].
Also, LLMs boost software testing by automating tasks, detecting defects, and ensuring reliability. They improve fuzz and unit testing by creating test cases and suggesting fixes. Research shows their significant benefits in expanding test coverage and error detection. Future efforts will focus on optimizing testing tools and techniques.
2. Domain-Specific Software Testing
2.1. Domain-Specific Languages
Domain-specific languages (DSLs) are specialized languages designed for particular domains or tasks, offering simplified syntax for ease of use by domain experts [11]. They can be integrated with general-purpose languages (GPLs) like Java and C++, enhancing development efficiency through tool support such as analyzers and compilers. DSLs are crucial in various industries, for example, HTML in web development and SQL in databases. They automate tasks like API documentation and legal document generation, improving efficiency and reducing errors. DSLs also facilitate team collaboration by allowing non-technical members to express requirements in a natural language-like format.
For software development, DSLs boost efficiency by simplifying complex representations and promoting code reusability. They accelerate prototyping and iteration, integrating seamlessly into existing tools and workflows. Testing DSL-developed programs requires a detailed plan with automated test scripts for regression testing. Test cases must be readable, and test data should reflect the domain specifics to ensure comprehensive coverage and identify defects. Maintainability of test cases and DSL is essential for ongoing development success.
2.2. Software Testing using DSL from the Automobile Industry
This section addresses the critical need for rigorous testing in passenger car product development, ensuring quality and performance meet standards.
Traditionally, automotive testing relies on manual, labor-intensive translation of requirements into test cases and scripts, causing significant strain on resources. To streamline this process and integrate Continuous Testing/Continuous Delivery
(CT/CD) pipelines, the industry is moving towards automated test development.
The automotive sector is in pursuit of an AI-driven solution to streamline the automation of test script generation for its proprietary DSLs, which are integral to the testing of a spectrum of automotive systems. The prevailing manual methodology is marred by inefficiencies, susceptibility to errors, and variability in code quality, alongside insufficient test coverage. By harnessing the capabilities of large language models, an AI-powered tool has the potential to orchestrate this process, amplifying efficiency, curbing errors, and upholding uniformity in code quality, thereby overcoming existing challenges and invigorating the software development lifecycle.
2.3. Sample Data
The dataset comprises 51 samples (Table 1), each representing a ground-truth mapping from a test case to a test script in a particular DSL format.
For privacy protection purposes, all information, program code and data in this paper have been anonymized.
Table 1
A segment of the test data, describing the test cases, pre-conditions, and the desired test script DSL to be generated.
Test case | Pre-condition | Test script
Forward gear is activated when the car is moving forward | Configuration.Gears.Drive.is_activated() | Signals.Check(signals=[Gears_GearsStatus], values=[Gears_GearsStatus_shift], waiting_time=100)
 | | Gear.Shift('drive')
Reverse gear is activated when the car is being driven reversely | Configuration.Gears.Reverse.is_activated() | Signals.Check(signals=[Gears_GearsStatus], values=[Gears_GearsStatus_shift], waiting_time=100)
 | | Gear.Shift('reverse')
At driving P gear is deactivated | Configuration.Gears.Park.is_deactivated() | current_speed = car.get_speed()
 | | current_gear = car.get_gear_status()
 | | Self.assertNotEqual(current_gear, 'park')
 | | car.stop()
 | | car.shift_gear('park')
 | | current_gear = car.get_gear_status()
3. Approach
Broadly speaking, two dominant strategies have emerged for augmenting the knowledge base of an LLM: prompt engineering, which is particularly effective for modest datasets, and model fine-tuning, which is best suited for more substantial volumes of data. Considering the current data landscape, characterized by a dearth of samples and inherent uncertainties, a comprehensive evaluation was undertaken to compare the merits of both prompt engineering and LLM fine-tuning. This analysis has demonstrated that, under the present circumstances of limited data, prompt engineering emerges as the slightly superior approach.
Consequently, we have opted for prompt engineering for the automated generation of test scripts. This strategic choice is rooted in its ability to deliver good outcomes even within the confines of our data scarcity. By leveraging prompt engineering, we aim to overcome the limitations imposed by scant data availability, thereby enhancing the overall performance and reliability of our test script generation process.
An integral element of our approach is the selection of the foundational Large Language Model. To this end, we have undertaken a model selection process, assessing ChatGLM3, Llama3, and Qwen2. Following this comparison, we determined that Llama3's generative capabilities align more closely with our requirements. Hence, we have chosen Llama3 to serve as the underlying LLM for this study.
3.1. Prompt to Make Precise Test Script Generation
Through an iterative process of refinement, we have arrived at our prompt for generating test scripts, as outlined in Table 2. This refinement ensures our AI model produces outputs that are both accurate and meet our objectives.
Our prompt is divided into four key components (Table 2): First, the list of all available samples, excluding the one currently being generated, provides a comprehensive context. Second, we concentrate on the specific test purpose to create targeted, efficient test scripts. Third, we provide clear instructions in natural English for the LLM to follow, ensuring a seamless and accurate generation process. Lastly, we impose constraints to optimize the generation process, enabling our LLM
model to autonomously produce precise and relevant test scripts without excessive input.
Table 2
Pseudo code of prompt design.
Prompt for test script generation (pseudo code)
dataFrame = All sample mappings except the one which is being generated
testCase = one test case which is being processed
instruction = “Above is a list of test cases and corresponding test scripts, assembled in Json format. Please generate test script for the following test case:”
condition = “Please export generated test script only, no leading text, no leading new lines.”
prompt = dataFrame + CRLF + instruction + testCase + condition
This structured approach not only boosts script accuracy but also enhances the efficiency of our testing process, bringing us closer to our goal of fully automated AI-driven test script generation.
3.2. Test Script Generation System Overview
The test script generation system, illustrated in Figure 3, converts input test cases into executable scripts in the partner's DSL, verifying product functionality. It applies the methodology outlined above, and scripts are evaluated by experts for accuracy and reliability, with corrections made as needed.
Validated scripts are executed and stored, informing future prompts and enhancing script generation over time. This cycle of evaluation and learning improves script quality and reduces manual creation, aiming for an automated, self-improving system that streamlines software testing. As data storage grows, prompts become more complex, reflecting deeper learning and improved autonomy in script generation, ultimately advancing AI in software testing.
Figure 3: Generation System Overview.
4. Result and Evaluation
4.1. Evaluation Metric
This paper employs the Levenshtein Distance [12] to evaluate the textual accuracy of our language model, providing an objective measure of how closely generated text matches the ground truth. This edit distance metric, devised by Vladimir Levenshtein, quantifies the minimum number of single-character edits required to transform one string into another, offering insights into model performance. It plays a crucial role in fields like Natural Language Processing, where it assesses text similarity, and Bioinformatics, where it indicates genetic relatedness. Although it is more computationally demanding for longer strings, our use of dynamic programming keeps it efficient for our analysis. The Levenshtein Distance aids in refining our model, ensuring that the text generation is both accurate and reliable.
The formal definition of Levenshtein Distance between two arbitrary strings 𝑎 and 𝑏 with length of |𝑎| and |𝑏| respectively is given by
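\[
\operatorname{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0, \\
|b| & \text{if } |a| = 0, \\
\operatorname{lev}\big(\operatorname{tail}(a), \operatorname{tail}(b)\big) & \text{if } \operatorname{head}(a) = \operatorname{head}(b), \\
1 + \min
\begin{cases}
\operatorname{lev}\big(\operatorname{tail}(a), b\big) \\
\operatorname{lev}\big(a, \operatorname{tail}(b)\big) \\
\operatorname{lev}\big(\operatorname{tail}(a), \operatorname{tail}(b)\big)
\end{cases} & \text{otherwise,}
\end{cases}
\]
where \(\operatorname{tail}(x)\) of any string \(x\) of length \(n\) is the substring of \(x\) without the first character, i.e. \(\operatorname{tail}(x) = \operatorname{tail}(x_0 x_1 \cdots x_{n-1}) = x_1 x_2 \cdots x_{n-1}\), and \(\operatorname{head}(x)\) of any string \(x\) of length \(n\) is the first character of \(x\), i.e. \(\operatorname{head}(x) = \operatorname{head}(x_0 x_1 \cdots x_{n-1}) = x_0\).
4.2. Result and Discussion
In our comprehensive analysis, we have utilized the Levenshtein Distance alongside the test script generation methodologies previously discussed to assess the output across all 51 data samples. It is crucial to highlight the exceptional stability achieved with the prompts we've designed, particularly when employing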
Llama3 as our LLM. The consistency of Llama3 is noteworthy; for a given data sample, or in other words, with the same prompt, the model reliably produces identical results in each test scenario. This uniformity is a testament to the robustness of our prompt engineering and the model's ability to deliver reliable outcomes. This level of consistency is not only a significant advantage in the context of test script generation but also a key factor in ensuring the reproducibility of our experiments. It allows us to confidently attribute any variations in the output to changes in the input data or to the model's fine-tuning, rather than to the inherent instability of the model itself. By achieving such a high degree of stability, we pave the way for more accurate and meaningful evaluations of our model's performance, which in turn inform our continuous efforts to enhance its capabilities. Moreover, this stability ensures that our test script generation process is not only efficient but also dependable, providing our partners and users with a tool that they can trust to deliver consistent results.
The test results for all 51 samples are displayed in the horizontal bar chart in Figure 4, offering a clear visual representation of our system's performance. The red bars in the chart signify the text lengths of the ground truth test scripts, serving as a benchmark for comparison. They represent the ideal output, against which the effectiveness of our system is measured. The pink bars, on the other hand, denote the lengths of the test scripts generated by our system. This provides insight into the output of our AI-driven script generation process, highlighting the efficiency and effectiveness with which our system translates prompts into executable test scripts. Most importantly, the blue bars in the chart represent the Levenshtein Distances for each sample, a critical metric that quantifies the difference between the generated test scripts and the ground truth. This distance is calculated based on the minimum number of single-character edits required to transform the generated test scripts into the ground truth test scripts. In this context, a shorter blue bar indicates a higher degree of similarity, suggesting that the generated script closely mirrors the ground truth, which is the goal of our system.
As observed from Figure 4, it is evident that the system currently exhibits a noticeable margin of error. This finding is further accentuated and clarified in the subsequent statistical box plot in Figure 5, which provides a more detailed visualization of the distribution of errors across our dataset. It is apparent that our dataset, comprising a mere 51 samples, is significantly limited for a deep learning initiative. The consensus in the field is that a larger dataset is often necessary to train models to achieve higher accuracy and reliability.
Figure 4: Discrete distribution as a horizontal bar chart to illustrate the result evaluation.
Figure 5: Box plot display of the generated test results.
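As a rough illustration of the evaluation procedure described in this section, the following Python sketch builds a leave-one-out prompt in the spirit of Table 2 and scores each generated script with a dynamic-programming Levenshtein Distance. It is a minimal sketch under stated assumptions, not the system used in the paper: the field names `test_case` and `test_script`, the function names, and the `generate` callable (a stand-in for the call to the LLM) are all hypothetical.

```python
import json
from typing import Callable, Dict, List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (row-wise DP)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the DP row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


# Instruction and condition texts are taken from Table 2.
INSTRUCTION = ("Above is a list of test cases and corresponding test scripts, "
               "assembled in Json format. Please generate test script for the "
               "following test case:")
CONDITION = ("Please export generated test script only, no leading text, "
             "no leading new lines.")


def build_prompt(samples: List[Dict[str, str]], index: int) -> str:
    """Assemble dataFrame + CRLF + instruction + testCase + condition,
    where dataFrame holds every sample except the one being generated."""
    others = [s for i, s in enumerate(samples) if i != index]
    data_frame = json.dumps(others, ensure_ascii=False, indent=2)
    return (data_frame + "\r\n" + INSTRUCTION
            + samples[index]["test_case"] + CONDITION)


def evaluate(samples: List[Dict[str, str]],
             generate: Callable[[str], str]) -> List[int]:
    """Return one Levenshtein Distance per sample (generated vs. ground truth).
    `generate` is a hypothetical stand-in for the LLM call."""
    return [levenshtein(generate(build_prompt(samples, i)), s["test_script"])
            for i, s in enumerate(samples)]
```

A distance of 0 corresponds to a generated script that matches the ground truth exactly. Passing the model call in as a plain callable keeps the sketch independent of any particular LLM client library.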
However, it is remarkable to note that despite this constraint, our system has produced flawless results in six instances where the generated test scripts matched the ground truth perfectly. This achievement is particularly impressive given the small sample size and serves as a testament to the potential of our approach using prompt engineering with large language models. The fact that our system was able to generate scripts indistinguishable from the ground truth in these cases suggests that with further optimization and a more extensive dataset, we could see a substantial improvement in the system's overall performance.
This early success with a limited dataset is not just encouraging; it also validates the feasibility of our methodological approach. It indicates that our system has the innate capacity to learn and produce high-quality outputs, even when faced with data scarcity. As we continue to expand our dataset and refine our models, we are confident that the performance will see a marked enhancement, further solidifying the effectiveness of our AI-driven test script generation system in the field of software testing.
5. Conclusion and Future Work
This research highlights the significant impact of LLMs on enhancing software testing efficiency, particularly in the automotive sector. Our findings underscore the superiority of prompt engineering over model fine-tuning, especially with smaller datasets. The Levenshtein Distance proved a reliable metric for script accuracy. Notably, LLMs, such as Llama3, demonstrated remarkable consistency, indicating the robustness of our framework. Even with a limited dataset, our system achieved high accuracy, showcasing LLMs' potential in software testing.
Our study introduces a novel approach to DSL testing, with a user-friendly web application for our test script generation system, enhancing accessibility and testing efficiency. Future work includes expanding our dataset to improve script performance and integrating the system into CI/CD pipelines for real-time testing. Ethical considerations and model transparency will also be prioritized. In conclusion, our research establishes LLMs as a viable solution for automating DSL test script generation, laying the groundwork for future advancements in AI-assisted software testing.
References
[1] Awedikian, Roy, and Bernard Yannou. “Design of a Validation Test Process of an Automotive Software.” International Journal on Interactive Design and Manufacturing (IJIDeM) 4, no. 4 (November 1, 2010): 259–68. https://doi.org/10.1007/s12008-010-0108-2.
[2] Campbell, Robert V. D. “Evolution of Automatic Computation.” In Proceedings of the 1952 ACM National Meeting (Pittsburgh), 29–32. ACM ’52. New York, NY, USA: Association for Computing Machinery, 1952. https://doi.org/10.1145/609784.609786.
[3] Orden, Alex. “Solution of Systems of Linear Inequalities on a Digital Computer.” In Proceedings of the 1952 ACM National Meeting (Pittsburgh), 91–95. ACM ’52. New York, NY, USA: Association for Computing Machinery, 1952. https://doi.org/10.1145/609784.609793.
[4] Demuth, Howard B., John B. Jackson, Edmund Klein, N. Metropolis, Walter Orvedahl, and James H. Richardson. “MANIAC.” In Proceedings of the 1952 ACM National Meeting (Toronto), 13–16. ACM ’52. New York, NY, USA: Association for Computing Machinery, 1952. https://doi.org/10.1145/800259.808982.
[5] Hetzel, William C. Program Test Methods. Prentice-Hall, 1973.
[6] Myers, Glenford J., Corey Sandler, and Tom Badgett. The Art of Software Testing. John Wiley & Sons, 2011.
[7] “IEEE Standard for Software Test Documentation.” Accessed September 17, 2024. https://standards.ieee.org/ieee/829/1217/.
[8] Martin, James. Rapid Application Development. Macmillan Publishing Company, 1991.
[9] Khaliq, Zubair, Sheikh Umar Farooq, and Dawood Ashraf Khan. “Artificial Intelligence in Software Testing: Impact, Problems, Challenges and Prospect.” arXiv, January 14, 2022. https://doi.org/10.48550/arXiv.2201.05371.
[10] Schäfer, Max, Sarah Nadi, Aryaz Eghbali, and Frank Tip. “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation.” IEEE Transactions on Software Engineering 50, no. 1 (January 2024): 85–105. https://doi.org/10.1109/TSE.2023.3334955.
[11] “Domain Specific Languages.” Accessed September 17, 2024. https://martinfowler.com/books/dsl.html.
[12] Levenshtein, Vladimir I. “Двоичные Коды с Исправлением Выпадений, Вставок и Замещений Символов [Binary Codes Capable of Correcting Deletions, Insertions, and Reversals].” Soviet Physics Doklady 163, no. 4 (February 1966): 845–48.