=Paper= {{Paper |id=Vol-3864/quasoq-2024-paper-07 |storemode=property |title=A Novel Approach to Automated Test Script Generation using Large Language Models for Domain-Specific Languages |pdfUrl=https://ceur-ws.org/Vol-3864/quasoq-2024-paper-07.pdf |volume=Vol-3864 |authors=Jianing Sun,Jiahui Wang,Yuyan Zhu,Xingyu Li,Ying Xie,Jiaxin Chen |dblpUrl=https://dblp.org/rec/conf/apsec/SunWZLXC24 }} ==A Novel Approach to Automated Test Script Generation using Large Language Models for Domain-Specific Languages== https://ceur-ws.org/Vol-3864/quasoq-2024-paper-07.pdf
                                Jianing Sun∗, Jiahui Wang, Yuyan Zhu, Xingyu Li, Ying Xie and Jiaxin Chen

                                Chongqing University, 400044 Chongqing, China



                                                  Abstract
                                                  This paper presents a novel method for generating automated test scripts for Domain-Specific Languages
                                                  (DSLs) in software testing, particularly for the automotive industry. It emphasizes the growing importance
                                                  of software testing in ensuring product quality amid IT advancements. The paper reviews software testing's
                                                  evolution, modern processes, and the role of Large Language Models (LLMs). It highlights DSLs' significance
                                                  and uses the automotive sector to show how LLMs can automate test script generation. Tests indicate that
                                                  in cases with a small sample size, the effectiveness of prompt engineering is superior to model fine-tuning.
                                                  The proposed method thus relies on well-designed prompts to direct LLMs to produce accurate scripts. The
                                                  generation system's overview is discussed, along with an evaluation of the scripts' quality using metrics
                                                  like Levenshtein Distance. Results indicate that LLMs boost test automation, defect detection, and software
                                                  reliability. Future work will optimize these tools for higher testing automation levels.

                                                  Keywords
                                                  software testing, domain-specific languages, large language models, Levenshtein Distance



1. Introduction

Software testing is a key component in ensuring the quality and reliability of software products. In the rapidly developing information technology era, software has become an indispensable part of our daily life and work. With the increasing complexity and diversification of software functions, the importance of software testing has become increasingly prominent. Software testing is a series of processes designed to check that a software product meets specified requirements and to ensure its quality. It not only helps developers find and fix defects, but also greatly enhances system security, especially in fields with high software safety requirements such as automotive and aviation [1].

1.1. A Brief History of Software Testing

The origins of software testing date back to the 1950s, focusing initially on debugging to identify and rectify software faults [2][3][4]. As software complexity grew, the need for independent testing organizations became apparent. In 1957, Charles Baker first defined program testing, separating it from debugging, in his review of the book Digital Computer Programming by Dan McCracken. Bill Hetzel formalized software testing as a concept at the University of North Carolina in 1972, establishing it to ensure a program performs as intended [5]. Glenford J. Myers further refined this in 1979, describing testing as executing a program to uncover errors [6].

By 1983, IEEE had standardized software testing, defining it as a process, manual or automated, to verify system requirements [7]. The 1990s brought agile methodologies, integrating testing and development and encouraging tester involvement from the earliest development stages [8]. In the 21st century, testing has advanced, with a focus on exploratory testing that highlights tester initiative. The era of AI and big data has intensified scrutiny of software testing. Although the field still relies on 20th-century methods, it anticipates future innovations that could revolutionize testing practices [9].

1.2. Modern Approaches

The modern software testing process is crucial for ensuring software quality and functionality. It starts with requirement analysis, followed by developing a test plan, designing test cases, and preparing test data (Figure 1). The test environment is set up, tests are executed and recorded, and defects are tracked. Regression and performance testing are conducted,

Figure 1: Modern software testing process.



QuASoQ 2024: 12th International Workshop on Quantitative Approaches to Software Quality, 3rd December 2024, Chongqing, China
∗ Corresponding author.
j.sun@cqu.edu.cn (J. Sun)
0009-0001-4943-6038 (J. Sun); 0009-0000-0107-0666 (J. Wang); 0009-0008-1670-4607 (Y. Zhu); 0009-0009-4526-007X (X. Li); 0009-0003-8939-7832 (Y. Xie); 0009-0002-4492-4734 (J. Chen)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

along with security and system testing. Acceptance testing confirms business requirements are met. Test reports summarize results, and evaluations identify process improvements.

Techniques like automated testing, Continuous Integration (CI), and Continuous Delivery (CD) enhance testing efficiency. Agile testing fosters collaboration between testers and developers. Performance, security, and mobile application testing ensure software reliability across different aspects. Cloud testing leverages cloud resources for extensive testing.

Figure 2: CI/CD is a software development practice in which code changes are automatically integrated, built, and tested, with successful builds being deployed to production.

AI-based testing leverages machine learning to automate software testing processes, enhancing efficiency and accuracy. It encompasses exploratory testing to identify issues without fixed test cases, ensuring broader coverage. Model-driven testing and testability design optimize test case generation and software sustainability. Additionally, managing test data and implementing strategies such as shift-left and shift-right testing further refine the development cycle. These dynamic approaches adapt to different methodologies, ensuring consistent software quality throughout the testing process.

1.3. Large Language Models

Large Language Models (LLMs) are cutting-edge AI systems specialized in natural language understanding and generation. Trained on extensive datasets and employing neural networks like Transformers, they capture linguistic subtleties and perform a variety of language tasks such as categorization, analysis, translation, and question answering. They discern nuances, generate realistic text, and adapt to linguistic evolution, while also raising concerns over data privacy and ethics.

LLMs have significantly impacted sectors like smart offices, travel, e-commerce, and government by enhancing efficiency and personalization. In software development, LLMs are revolutionizing the field. They aid in document summarization, provide travel advice, and improve user engagement. Tools like GitHub Copilot demonstrate their advantage by assisting in coding tasks [10].

LLMs also boost software testing by automating tasks, detecting defects, and ensuring reliability. They improve fuzz and unit testing by creating test cases and suggesting fixes. Research shows their significant benefits in expanding test coverage and error detection. Future efforts will focus on optimizing testing tools and techniques.

2. Domain-Specific Software Testing

2.1. Domain-Specific Languages

Domain-specific languages (DSLs) are specialized languages designed for particular domains or tasks, offering simplified syntax for ease of use by domain experts [11]. They can be integrated with general-purpose languages (GPLs) like Java and C++, enhancing development efficiency through tool support such as analyzers and compilers. DSLs are crucial in various industries, for example, HTML in web development and SQL in databases. They automate tasks like API documentation and legal document generation, improving efficiency and reducing errors. DSLs also facilitate team collaboration by allowing non-technical members to express requirements in a natural-language-like format.

For software development, DSLs boost efficiency by simplifying complex representations and promoting code reusability. They accelerate prototyping and iteration, integrating seamlessly into existing tools and workflows. Testing DSL-developed programs requires a detailed plan with automated test scripts for regression testing. Test cases must be readable, and test data should reflect the domain specifics to ensure comprehensive coverage and identify defects. Maintainability of both the test cases and the DSL is essential for ongoing development success.

2.2. Software Testing using DSL from the Automobile Industry

This section addresses the critical need for rigorous testing in passenger car product development, ensuring quality and performance meet standards.

Traditionally, automotive testing relies on manual, labor-intensive translation of requirements into test cases and scripts, causing significant strain on resources. To streamline this process and integrate Continuous Testing/Continuous Delivery




Table 1
A segment of the test data, describing the test cases, pre-conditions, and the desired test script DSL to be generated.

 Test case               Pre-condition                                   Test script
 Forward gear is         Configuration.Gears.Drive.is_activated()        Signals.Check(signals=[Gears_GearsStatus],
 activated when the                                                      values=[Gears_GearsStatus_shift], waiting_time=100)
 car is moving                                                           Gear.Shift('drive')
 forward
 Reverse gear is         Configuration.Gears.Reverse.is_activated()      Signals.Check(signals=[Gears_GearsStatus],
 activated when the                                                      values=[Gears_GearsStatus_shift], waiting_time=100)
 car is being driven                                                     Gear.Shift('reverse')
 reversely
 At driving P gear is    Configuration.Gears.Park.is_deactivated()       current_speed = car.get_speed()
 deactivated                                                             current_gear = car.get_gear_status()
                                                                         Self.assertNotEqual(current_gear, 'park')
                                                                         car.stop()
                                                                         car.shift_gear('park')
                                                                         current_gear = car.get_gear_status()

(CT/CD) pipelines, the industry is moving towards automated test development.

The automotive sector is seeking an AI-driven solution to automate test script generation for its proprietary DSLs, which are integral to testing a range of automotive systems. The prevailing manual approach suffers from inefficiency, susceptibility to errors, variable code quality, and insufficient test coverage. By harnessing large language models, an AI-powered tool has the potential to automate this process, improving efficiency, reducing errors, and keeping code quality consistent, thereby addressing these challenges across the software development lifecycle.

2.3. Sample Data

The dataset comprises a total of 51 samples (Table 1), each representing a true mapping from a test case to a test script in a particular DSL format.

For privacy protection purposes, all information, program code and data in this paper have been anonymized.

3. Approach

Broadly speaking, two dominant strategies have emerged for augmenting an LLM's knowledge base: prompt engineering, which is particularly effective for modest datasets, and model fine-tuning, which is best suited to larger volumes of data. Because our data are characterized by a small number of samples and inherent uncertainties, we compared prompt engineering against LLM fine-tuning. This comparison showed that, with the limited data currently available, prompt engineering is the slightly superior approach.

Consequently, we chose prompt engineering for the automated generation of test scripts. This choice rests on its ability to deliver good results even with scarce data. By relying on prompt engineering, we aim to overcome the limitations imposed by the small dataset and thereby improve the performance and reliability of our test script generation process.

An integral element of our approach is the selection of the foundational Large Language Model. To this end, we assessed ChatGLM3, Llama3, and Qwen2. Following this comparison, we determined that Llama3's generative capabilities align most closely with our requirements. Hence, we chose Llama3 as the underlying LLM for this study.

3.1. Prompt to Make Precise Test Script Generation

Through an iterative process of refinement, we arrived at our prompt for generating test scripts, as shown in the example. This refinement of the prompt helps our AI model produce outputs that are both accurate and aligned with our objectives.

Our prompt is divided into four key components (Table 2): First, a list of all available samples, excluding the one currently being generated, provides the in-context examples. Second, the specific test case being processed focuses the generation on a targeted, efficient test script. Third, we provide clear instructions in plain English for the LLM to follow, supporting a seamless and accurate generation process. Lastly, we impose constraints on the output to optimize the generation process, enabling our LLM




model to autonomously produce precise and relevant                 reflecting deeper learning and improved autonomy in
test scripts without excessive input.                              script generation, ultimately advancing AI in software
                                                                   testing.
Table 2
Pseudo code of prompt design.

 Prompt for test script generation (pseudo code)
 dataFrame = All sample mappings except the one
 which is being generated
 testCase = one test case which is being processed
 instruction = “Above is a list of test cases and
 corresponding test scripts, assembled in Json format.
 Please generate test script for the following test case:”
 condition = “Please export generated test script only,               Figure 3: Generation System Overview.
 no leading text, no leading new lines.”
 prompt = dataFrame + CRLF + instruction + testCase                4. Result and Evaluation
 + condition
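
The pseudo code of Table 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the field names `test_case`, `pre_condition`, and `test_script`, the use of `json.dumps` to assemble the samples, and the extra line breaks between the prompt parts are assumptions. The two sample rows are taken from Table 1, and the call to the LLM itself is omitted.

```python
import json

# Instruction and condition strings as given in Table 2.
INSTRUCTION = (
    "Above is a list of test cases and corresponding test scripts, "
    "assembled in Json format. "
    "Please generate test script for the following test case:"
)
CONDITION = (
    "Please export generated test script only, "
    "no leading text, no leading new lines."
)
CRLF = "\r\n"

# Two sample mappings from Table 1 (field names are hypothetical).
SAMPLES = [
    {
        "test_case": "Forward gear is activated when the car is moving forward",
        "pre_condition": "Configuration.Gears.Drive.is_activated()",
        "test_script": (
            "Signals.Check(signals=[Gears_GearsStatus], "
            "values=[Gears_GearsStatus_shift], waiting_time=100)\n"
            "Gear.Shift('drive')"
        ),
    },
    {
        "test_case": "Reverse gear is activated when the car is being driven reversely",
        "pre_condition": "Configuration.Gears.Reverse.is_activated()",
        "test_script": (
            "Signals.Check(signals=[Gears_GearsStatus], "
            "values=[Gears_GearsStatus_shift], waiting_time=100)\n"
            "Gear.Shift('reverse')"
        ),
    },
]


def build_prompt(samples, index):
    """Assemble the prompt for samples[index] from all other samples."""
    # dataFrame: every sample mapping except the one being generated.
    examples = [s for i, s in enumerate(samples) if i != index]
    data_frame = json.dumps(examples, indent=2)
    # testCase: the single test case currently being processed.
    test_case = samples[index]["test_case"]
    # Joining with CRLF between all parts is an assumption; Table 2 only
    # shows a CRLF after dataFrame.
    return CRLF.join([data_frame, INSTRUCTION, test_case, CONDITION])


prompt = build_prompt(SAMPLES, 0)
```

Holding out the sample under test keeps its ground-truth script out of the prompt, so the generated script can afterwards be compared against it.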
    This structured approach not only boosts script                4.1. Evaluation Metric
accuracy but also enhances the efficiency of our testing           This paper employs the Levenshtein Distance [12] to
process, bringing us closer to our goal of fully automated         evaluate the textual accuracy of our language model,
AI-driven test script generation.                                  providing an objective measure of how closely
                                                                   generated text matches the ground truth. This edit
3.2. Test Script Generation System                                 distance metric, devised by Vladimir Levenshtein,
         Overview                                                  quantifies the minimum number of single-character
The test script generation system, illustrated in Figure           edits required to transform one string into another,
3, converts input test cases into executable scripts in the        offering insights into model performance. It plays a
partner's DSL language, verifying product functionality.           crucial role in fields like Natural Language Processing,
It uses outlined methodologies, and scripts are evaluated          where it assesses text similarity, and Bioinformatics,
by experts for accuracy and reliability, with corrections          where it indicates genetic relatedness. Despite its higher
made as needed.                                                    computational demands for longer strings, our use of
     Validated scripts are executed and stored,                    dynamic programming makes it an efficient tool for our
informing future prompts and enhancing script                      analysis. The Levenshtein Distance aids in refining our
generation over time. This cycle of evaluation and                 model, ensuring that the text generation is both accurate
learning improves script quality and reduces manual                and reliable.
creation, aiming for an automated, self-improving                      The formal definition of Levenshtein Distance
system that streamlines software testing. As data                  between two arbitrary strings 𝑎 and 𝑏 with length of
storage grows, prompts become more complex,                        |𝑎| and |𝑏| respectively is given by

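\[
\operatorname{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0, \\
|b| & \text{if } |a| = 0, \\
\operatorname{lev}\bigl(\operatorname{tail}(a), \operatorname{tail}(b)\bigr) & \text{if } \operatorname{head}(a) = \operatorname{head}(b), \\
1 + \min
\begin{cases}
\operatorname{lev}\bigl(\operatorname{tail}(a), b\bigr) \\
\operatorname{lev}\bigl(a, \operatorname{tail}(b)\bigr) \\
\operatorname{lev}\bigl(\operatorname{tail}(a), \operatorname{tail}(b)\bigr)
\end{cases} & \text{otherwise.}
\end{cases}
\]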



where $\operatorname{tail}(x)$ of any string $x$ of length $n$ is the substring of $x$ without its first character, i.e. $\operatorname{tail}(x) = \operatorname{tail}(x_1 x_2 \cdots x_n) = x_2 \cdots x_n$, and $\operatorname{head}(x)$ is the first character of $x$, i.e. $\operatorname{head}(x) = \operatorname{head}(x_1 x_2 \cdots x_n) = x_1$.

4.2. Result and Discussion

In our comprehensive analysis, we have utilized the Levenshtein Distance alongside the test script generation methodologies previously discussed to assess the output across all 51 data samples. It is crucial to highlight the exceptional stability achieved with the prompts we've designed, particularly when employing




Llama3 as our LLM. The consistency of Llama3 is noteworthy; for a given data sample, or in other words, with the same prompt, the model reliably produces identical results on every run. This uniformity is a testament to the robustness of our prompt engineering and the model's ability to deliver reliable outcomes. This level of consistency is not only a significant advantage in the context of test script generation but also a key factor in ensuring the reproducibility of our experiments. It allows us to confidently attribute any variations in the output to changes in the input data or to the model's fine-tuning, rather than to the inherent instability of the model itself. By achieving such a high degree of stability, we pave the way for more accurate and meaningful evaluations of our model's performance, which in turn inform our continuous efforts to enhance its capabilities. Moreover, this stability ensures that our test script generation process is not only efficient but also dependable, providing our partners and users with a tool that they can trust to deliver consistent results.

The test results for all 51 samples are displayed in the horizontal bar chart in Figure 4, offering a clear visual representation of our system's performance. The red bars in the chart signify the text lengths of the ground truth test scripts, serving as a benchmark for comparison. They represent the ideal output, against which the effectiveness of our system is measured. The pink bars, on the other hand, denote the lengths of the test scripts generated by our system. This provides insight into the output of our AI-driven script generation
                                                                   with which our system translates prompts into
                                                                   executable test scripts. Most importantly, the blue bars
                                                                   in the chart represent the Levenshtein Distances for
                                                                   each sample, a critical metric that quantifies the
                                                                   difference between the generated test scripts and the
                                                                   ground truth. This distance is calculated based on the
                                                                   minimum number of single-character edits required to
                                                                   transform the generated test scripts into the ground
                                                                   truth test scripts. In this context, a shorter blue bar
                                                                   indicates a higher degree of similarity, suggesting that
                                                                   the generated script closely mirrors the ground truth,
                                                                   which is the goal of our system.
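
The distance behind the blue bars can be computed with a standard dynamic-programming routine. The following is a minimal sketch, not the authors' code; it keeps only two rows of the edit-distance table in memory, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        # The distance is symmetric; keep the shorter string in the inner loop.
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]
```

A distance of 0 means the generated script is character-for-character identical to the ground truth, for example when both are `Gear.Shift('park')`.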
As observed from Figure 4, the system currently exhibits a noticeable margin of error. This finding is further clarified by the box plot in Figure 5, which provides a more detailed visualization of the distribution of errors across our dataset. Our dataset, comprising a mere 51 samples, is significantly limited for a deep learning initiative. The consensus in the field is that a larger dataset is often necessary to train models to achieve higher accuracy and reliability.
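
The quantities shown in a box plot such as Figure 5 can be derived from the per-sample distances. The sketch below is an illustration under stated assumptions: the function name `summarize` is hypothetical, the quartiles use the default method of Python's `statistics.quantiles`, which may differ from the plotting tool used for Figure 5, and the input numbers are made up and are not the paper's results.

```python
import statistics


def summarize(distances):
    """Five-number summary and exact-match count for a list of
    per-sample Levenshtein distances (one integer per sample)."""
    ordered = sorted(distances)
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "samples": len(ordered),
        # A distance of 0 means the script matches the ground truth exactly.
        "exact_matches": sum(1 for d in ordered if d == 0),
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# Made-up distances for illustration only; NOT the paper's results.
example = summarize([0, 0, 4, 10, 20])
```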




Figure 4: Discrete distribution as a horizontal bar chart to illustrate the result evaluation.

Figure 5: Box plot display of the generated test results.




    However, despite this constraint, our system produced six test scripts that
matched the ground truth exactly. Given the small sample size, this result
points to the potential of prompt engineering with large language models. That
the generated scripts were indistinguishable from the ground truth in these
cases suggests that further optimization and a more extensive dataset could
substantially improve the system's overall performance.
    This early success with a limited dataset is encouraging and supports the
feasibility of our approach. It indicates that the system can produce
high-quality outputs even when data is scarce. As we expand the dataset and
refine our models, we expect performance to improve further, strengthening the
case for our AI-driven test script generation system in software testing.

5. Conclusion and Future Work

This research highlights the impact of LLMs on software testing efficiency,
particularly in the automotive sector. Our findings indicate that prompt
engineering outperforms model fine-tuning when the dataset is small. The
Levenshtein Distance proved a reliable metric for script accuracy. Notably,
LLMs such as Llama3 produced consistent results, indicating the robustness of
our framework. Even with a limited dataset, our system achieved high accuracy,
showcasing LLMs' potential in software testing.
    Our study introduces a novel approach to DSL testing and provides a
user-friendly web application for the test script generation system, which
improves accessibility and testing efficiency. Future work includes expanding
the dataset to improve script quality and integrating the system into CI/CD
pipelines for real-time testing. Ethical considerations and model transparency
will also be prioritized. In conclusion, our research establishes LLMs as a
viable solution for automating DSL test script generation, laying the
groundwork for future advancements in AI-assisted software testing.

References
[1]  Awedikian, Roy, and Bernard Yannou. “Design of a Validation Test Process
     of an Automotive Software.” International Journal on Interactive Design
     and Manufacturing (IJIDeM) 4, no. 4 (November 1, 2010): 259–68.
     https://doi.org/10.1007/s12008-010-0108-2.
[2]  Campbell, Robert V. D. “Evolution of Automatic Computation.” In
     Proceedings of the 1952 ACM National Meeting (Pittsburgh), 29–32. ACM ’52.
     New York, NY, USA: Association for Computing Machinery, 1952.
     https://doi.org/10.1145/609784.609786.
[3]  Orden, Alex. “Solution of Systems of Linear Inequalities on a Digital
     Computer.” In Proceedings of the 1952 ACM National Meeting (Pittsburgh),
     91–95. ACM ’52. New York, NY, USA: Association for Computing Machinery,
     1952. https://doi.org/10.1145/609784.609793.
[4]  Demuth, Howard B., John B. Jackson, Edmund Klein, N. Metropolis, Walter
     Orvedahl, and James H. Richardson. “MANIAC.” In Proceedings of the 1952
     ACM National Meeting (Toronto), 13–16. ACM ’52. New York, NY, USA:
     Association for Computing Machinery, 1952.
     https://doi.org/10.1145/800259.808982.
[5]  Hetzel, William C. Program Test Methods. Prentice-Hall, 1973.
[6]  Myers, Glenford J., Corey Sandler, and Tom Badgett. The Art of Software
     Testing. John Wiley & Sons, 2011.
[7]  “IEEE Standard for Software Test Documentation.” Accessed September 17,
     2024. https://standards.ieee.org/ieee/829/1217/.
[8]  Martin, James. Rapid Application Development. Macmillan Publishing
     Company, 1991.
[9]  Khaliq, Zubair, Sheikh Umar Farooq, and Dawood Ashraf Khan. “Artificial
     Intelligence in Software Testing: Impact, Problems, Challenges and
     Prospect.” arXiv, January 14, 2022.
     https://doi.org/10.48550/arXiv.2201.05371.
[10] Schäfer, Max, Sarah Nadi, Aryaz Eghbali, and Frank Tip. “An Empirical
     Evaluation of Using Large Language Models for Automated Unit Test
     Generation.” IEEE Transactions on Software Engineering 50, no. 1
     (January 2024): 85–105. https://doi.org/10.1109/TSE.2023.3334955.
[11] “Domain Specific Languages.” Accessed September 17, 2024.
     https://martinfowler.com/books/dsl.html.
[12] Levenshtein, Vladimir I. “Двоичные Коды с Исправлением Выпадений, Вставок
     и Замещений Символов [Binary Codes Capable of Correcting Deletions,
     Insertions, and Reversals].” Soviet Physics Doklady 163, no. 4
     (February 1966): 845–48.



