<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.jss.2022.111518</article-id>
      <title-group>
        <article-title>Integration of mutation testing into unit test generation using large language models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrii Kovtko</string-name>
          <email>kovtko773@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volodymyr Savkiv</string-name>
          <email>v.b.savkiv@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Halyna Kozbur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ihor Kozbur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rostyslav</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ternopil Ivan Puluj National Technical University</institution>
          ,
          <addr-line>Ruska Str. 56, Ternopil, 46001</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>4</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>The article examines testing as a key component of the software development lifecycle that ensures the quality and stability of the final product. It is demonstrated that the increasing complexity of software leads to greater demands on testing as a critical phase of development. Two main approaches – manual and automated testing – are identified, with a focus placed on unit testing as the primary subject of this study. It is established that unit testing contributes to safer code changes, early defect detection, documentation, and improved code structure. At the same time, several challenges associated with writing unit tests are identified, including high time costs, maintenance difficulties, and increased load on the continuous integration and deployment system. Typical unit test smells are described as ineffective practices that complicate code maintenance, reduce verification accuracy, and may lead to a misleading impression of software quality. The application of artificial intelligence tools, particularly large language models (LLMs), is shown to support the automation of unit test generation, although the quality of generated tests remains inconsistent. A modified approach is proposed for integrating mutation testing into the generation of unit tests using LLMs. The concept of an automated system is presented, incorporating test generation, mutation creation, result evaluation, and iterative improvement. This integration is shown to reveal weak test coverage areas, enhance verification depth, and improve product quality. The proposed system demonstrates potential to reduce testing time, increase software stability, and offer broad applicability across various project environments.</p>
      </abstract>
      <kwd-group>
        <kwd>automated testing</kwd>
        <kwd>unit testing</kwd>
        <kwd>mutation testing</kwd>
        <kwd>test smells</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Software development, much like any other engineering process, comprises a series of
interconnected stages that together constitute the product life cycle (Fig. 1). Each phase – from
planning through to maintenance – plays a pivotal role in ensuring the quality and reliability of the
final solution.</p>
      <p>
        Testing represents one of the critical stages within this life cycle. It involves verifying whether
the software meets specified requirements and produces the expected outcomes. Effective execution
of this phase is essential for delivering a high-quality and secure product [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        As software systems continue to grow in complexity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the demands placed on testing likewise
intensify, establishing it as one of the central research areas in software engineering [3].
Innovation in software testing is crucial not only for enhancing product quality but also for
optimizing development resources, given that testing activities may account for up to 50% of the
total development budget according to various estimates [4].
      </p>
      <sec id="sec-1-1">
        <title>Two complementary approaches can be distinguished in software testing:</title>
        <p>

manual testing – identifying software defects through direct interaction by a human tester
[5];
automated testing – executing tests using automation tools [6].</p>
        <p>Manual testing remains irreplaceable in scenarios requiring flexibility, evaluation of visual
components, or user interface interactions that are difficult to automate. In particular, it plays a
crucial role in assessing user experience (UI/UX), where careful attention must be paid to design,
navigation logic, and overall usability [7].</p>
        <p>Automated testing is particularly effective and appropriate in cases involving repetitive,
large-scale, and formalized testing scenarios [8]. It is essential for regression testing, where existing
functionality must be re-verified repeatedly after code changes. Automation is also highly
beneficial when testing large volumes of data or when rapid test execution across multiple
environments is required – for instance, within CI/CD pipelines.</p>
        <p>It offers significant time and resource savings, especially in long-term projects with stable
requirements. Automated tests provide high verification accuracy, minimize the influence of
human error, and can operate continuously without human intervention. This makes automation a
powerful tool for maintaining software quality in large-scale or mission-critical systems.</p>
        <p>The field of automated testing is undergoing rapid development, particularly due to the
integration of artificial intelligence techniques, which enable new approaches to test generation,
defect detection, and adaptive test strategy management [9], and the use of advanced
high-performance computing methods for analysing results [10, 11]. Among the types of automated
testing, unit testing, integration testing, system testing, and others can be distinguished (Fig. 2).</p>
        <p>In this study, we focus specifically on automated unit testing and possible options for analysing its
effectiveness using modern computing algorithms [12]. We will review existing solutions for
integrating artificial intelligence into the testing process and propose our own conceptual approach
aimed at improving product quality and reducing testing-related costs.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Unit testing</title>
      <p>Automated unit testing involves verifying the correctness of each individual unit (function). The
life cycle of unit testing is illustrated in Fig. 3. Unit tests are typically executed each time changes
are made to the code.</p>
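      <p>As a minimal illustration, the sketch below shows unit tests for a single function, written in
Python with the pytest framework; the apply_discount function is a hypothetical example invented
for this sketch, not code from the projects discussed in this article.</p>
      <preformat># test_pricing.py - minimal unit tests for one isolated unit (run with pytest).
import pytest

def apply_discount(price: float, percent: float) -> float:
    """Unit under test: return the price reduced by the given percentage."""
    if not (0 &lt;= percent &lt;= 100):
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)

def test_apply_discount_reduces_price():
    # Early defect detection: a regression in the formula fails this test.
    assert apply_discount(200.0, 25.0) == 150.0

def test_apply_discount_rejects_invalid_percent():
    # The test also documents the unit's contract for invalid input.
    with pytest.raises(ValueError):
        apply_discount(100.0, 150.0)</preformat>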
      <p>The main advantages of unit testing include:</p>
      <list list-type="bullet">
        <list-item>
          <p>improved safety and reliability of changes made to existing code;</p>
        </list-item>
        <list-item>
          <p>early detection of defects during the development process, which helps reduce the cost of
fixing them;</p>
        </list-item>
        <list-item>
          <p>serving as documentation for individual units, facilitating the understanding of their
interfaces and expected behavior;</p>
        </list-item>
        <list-item>
          <p>encouraging developers to write cleaner, more structured, and maintainable code.</p>
        </list-item>
      </list>
      <p>Unit testing is an integral part of the software development process in most modern projects,
demonstrating a positive impact on code quality and stability [13].</p>
      <p>For instance, in the Linux Kernel project — one of the largest open-source initiatives, with
millions of lines of code and thousands of contributors — unit testing is implemented through the
kselftest framework and KernelCI system to verify the functionality of kernel components during
daily builds. This enables timely detection of regressions and helps maintain system stability.</p>
      <p>In the TensorFlow project, a widely used machine learning library, unit tests cover both core
computational components and the API. They are part of an automated CI pipeline that runs for
each pull request, ensuring quality and consistency across different environments and versions of
the library.</p>
      <p>The study presented in [14], which analyzes over 9,000 deep learning (DL)-related projects,
reveals several important patterns concerning the impact of unit testing. Specifically:</p>
      <list list-type="bullet">
        <list-item>
          <p>the presence of unit tests correlates positively with key development metrics in
open-source projects, such as the number of active contributors and overall developer
engagement;</p>
        </list-item>
        <list-item>
          <p>codebase changes accompanied by unit tests tend to be accepted into the repository more
quickly;</p>
        </list-item>
        <list-item>
          <p>defects in systems with adequate test coverage tend to be resolved more promptly
compared to those in projects lacking unit testing.</p>
        </list-item>
      </list>
      <p>In the study [15], which analyzed over 20,000 projects, it was found that approximately 62% of
them included at least one unit test. The average number of lines of code in projects with unit tests
was around 107,000, whereas projects without tests contained only about 5,605 lines. This
significant difference in scale indicates that large software systems are much less likely to forgo
unit testing, as the need for formalized quality control becomes critical with increasing codebase
size, whereas smaller projects, which are easier to maintain manually, more often operate without
formal testing practices.</p>
      <p>Despite the numerous advantages of unit testing, including its previously discussed positive
impact on code quality, project maintainability, and the onboarding of new developers, it is not a
universal solution for all types of tasks. During implementation, various challenges may arise,
stemming from both technical limitations and human factors. This necessitates a critical
examination not only of the strengths but also of the potential weaknesses of unit testing. The key
issues associated with the use of unit testing are as follows:
</p>
      <p>Developing effective unit tests is a complex and resource-intensive task that requires a deep
understanding of the system’s logic and careful design of test scenarios. For example, a study
conducted by Microsoft Research found that writing unit tests typically consumes between
20% and 50% of the time allocated for developing core functionality, and in some cases up to
60% of the total development time [16].</p>
      <p>Test scenarios require continuous updating and maintenance, as any changes to the
functionality of the program code require corresponding adjustments to the tests.</p>
      <p>A high level of code coverage through unit tests does not guarantee the absence of defects,
as only predefined scenarios are verified, leaving some execution paths untested. A study
[17] analyzing over 7,800 defects in open-source Java projects found only a weak to
moderate correlation between code coverage levels and the number of defects.</p>
      <p>In certain cases, preparing the environment for unit testing is a technically challenging
task, involving the creation of mocks to emulate external dependencies and provide access
to the internal logic of units.</p>
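      <p>For illustration, the sketch below isolates a hypothetical checkout function from an external
payment dependency using Python's standard unittest.mock module; both the function and the
gateway interface are invented for this example.</p>
      <preformat># Emulating an external dependency with the standard unittest.mock module.
# checkout and the gateway's charge() interface are hypothetical examples.
from unittest.mock import Mock

def checkout(gateway, amount: float) -> str:
    # Unit under test: delegates the actual charge to an external service.
    return "ok" if gateway.charge(amount) else "declined"

def test_checkout_with_mocked_gateway():
    gateway = Mock()
    gateway.charge.return_value = True  # emulate a successful charge
    assert checkout(gateway, 9.99) == "ok"
    gateway.charge.assert_called_once_with(9.99)  # verify the interaction</preformat>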
      <p>The use of a large number of unit tests can significantly increase the overall testing time,
which, particularly within CI/CD pipelines, may slow down the overall development and
deployment cycle.</p>
      <p>Given the aforementioned shortcomings, it is evident that the quality of unit testing is largely
determined by the developer’s qualifications and approach to writing tests. Poor test design and
violations of recommended practices can lead to test smells — specific patterns indicative of
suboptimal or problematic testing. Below are some typical situations that represent the primary
causes of test smells [18]:</p>
      <p>Assertion Roulette. This phenomenon occurs when a test method contains multiple
assertions without proper explanations or contextual information, making it difficult to
identify the cause of a potential failure.</p>
      <p>Missing Assert. This situation arises when a test method does not contain any assertions,
thereby stripping the test of its core verification function.</p>
      <p>Empty Test. This smell appears when a test method does not contain any executable
statements, making it a purely formal construct that performs no actual verification.</p>
      <p>Constructor Initialization. This case takes place when a test class initializes its fields
through a constructor rather than using the standard setup mechanisms provided by the
testing framework.</p>
      <p>Eager Test. In this case, the test verifies too many functional elements simultaneously, often
invoking several production code methods within a single test, which reduces its specificity
and clarity.</p>
      <p>Exception Handling. This situation arises when a test uses try-catch blocks to validate
exceptional behavior instead of employing specialized testing constructs designed to assert
expected exceptions.</p>
      <p>Conditional Test Logic. This smell occurs when assertions are placed inside conditional
statements or exception handling blocks, complicating the interpretation of the test’s
intended behavior.</p>
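      <p>Two of these smells are contrasted below in a short Python example; parse_iso_date is a
hypothetical unit, and the second pair of tests shows a smell-free form of the same checks.</p>
      <preformat># Assertion Roulette and Conditional Test Logic on a hypothetical unit.
from datetime import date

def parse_iso_date(text: str) -> date:
    return date.fromisoformat(text)

def test_parse_smelly():
    d = parse_iso_date("2024-05-17")
    # Assertion Roulette: several unexplained assertions in one test.
    assert d.year == 2024
    assert d.month == 5
    # Conditional Test Logic: this assertion only runs on some paths.
    if d.month > 1:
        assert d.day &lt;= 31

# Smell-free form: one focused, labelled assertion per test.
def test_parse_year():
    assert parse_iso_date("2024-05-17").year == 2024, "year component"

def test_parse_month():
    assert parse_iso_date("2024-05-17").month == 5, "month component"</preformat>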
      <p>The issue of test smells has been empirically investigated in [19]. In this study, eight large
open-source Java projects were analyzed. The results demonstrated a tendency for the accumulation of
test smells during the evolution of software systems: for every eliminated instance of a test smell,
typically two new ones emerged. The presence of such smells not only distorts the perception of
test quality by artificially inflating code coverage metrics but also contributes to an increase in the
number of defects within the system.</p>
      <p>Issues related to poor test design were also addressed in [20]. This study involved a survey of 19
developers as well as an empirical analysis of 152 open-source projects. The findings highlight
several important observations. First, a significant proportion of developers do not perceive poor
test design as a critical problem, leading to the neglect of test code maintainability and the
effectiveness of defect detection. Second, test smells are often introduced at the early stages of test
development. Finally, such shortcomings are rarely addressed during subsequent project evolution,
resulting in a gradual decline in the quality of the test environment.</p>
      <p>An overemphasis on numerical coverage metrics, combined with time constraints and
variability in developers' professional skills – typical in real-world projects [21] – further
exacerbates quality problems in unit testing.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Unit test generation</title>
      <p>Recent changes in software testing practices are increasingly associated with the emergence of
technologies that not only accelerate code verification but also redefine its very nature. Artificial
intelligence and machine learning are no longer viewed merely as auxiliary tools – today, they
represent an independent axis of advancement in software quality assurance [22]. These
technologies now assume routine tasks, detect defects earlier than developers, and gradually
reshape the perception of automation boundaries.</p>
      <p>Among these innovations, generative models – particularly large language models (LLMs) – are
demonstrating notable momentum (Fig. 4). According to Google Trends, since early 2023 there has
been a significant rise in interest in LLMs and generative AI. These models are being integrated
into developers' daily toolchains and are influencing how code is written, reviewed, and tested [23].
Products like GitHub Copilot not only generate code snippets — they also begin to model behavior,
anticipate needs, and propose actions that would otherwise be difficult to implement manually.</p>
      <p>In parallel with the growing adoption of LLM-based solutions in practical software testing, there
has also been an intensification of scientific research efforts in this direction. For example, the
study [24] proposed the tool TestPilot, which employs an adaptive approach to unit test generation
involving large language models. The central idea is to construct a query for the model that
includes an extended context of the target function: its name, list of parameters, comments, and
examples of usage from documentation or code snippets. In cases where the generated test fails, the
system formulates a refined query, incorporating the failed test itself and the corresponding error
message. This allows the model to respond more precisely to the previous failure and produce an
improved version of the test.</p>
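      <p>The feedback loop can be sketched schematically as follows. This is not TestPilot's actual code:
llm_generate and run_test are assumed stand-ins for a model call and a test-runner invocation, and
the prompt wording is illustrative only.</p>
      <preformat># Schematic feedback-driven test generation (illustrative, not TestPilot's code).
# Assumed stand-ins: llm_generate(prompt) -> str, run_test(code) -> (bool, str).

def generate_test(fn_context, llm_generate, run_test, max_rounds=3):
    prompt = "Write a unit test for the following function:\n" + fn_context
    for _ in range(max_rounds):
        test_code = llm_generate(prompt)
        passed, error = run_test(test_code)
        if passed:
            return test_code
        # Refined query: include the failed test and its error message so the
        # model can respond to the previous failure, as described above.
        prompt = ("The following test failed:\n" + test_code +
                  "\nError:\n" + error +
                  "\nProduce an improved test for:\n" + fn_context)
    return None  # no passing test within the iteration budget</preformat>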
      <p>Experimental results demonstrate that this strategy leads to the generation of more meaningful,
non-trivial tests that cover realistic usage scenarios. The use of three models – gpt-3.5-turbo,
code-cushman-002, and StarCoder – demonstrated the superiority of the LLM-based approach over
traditional methods, particularly outperforming Nessie [25] – the first unit test generation system
utilizing a feedback-driven mechanism – across several key metrics, including verification
completeness and test relevance.</p>
      <p>Another example of the effective use of large language models for automated unit test
generation is presented in [26], which describes a tool called ChatUniTest – a solution for
automated unit test generation using ChatGPT, implemented according to the
Generation-Validation-Repair approach. Its structure involves three sequential stages of test formation.</p>
      <p>At the preprocessing stage, the system collects the most comprehensive context regarding the
target code. The query includes not only the function signature and its parameters but also
associated comments, usage examples, and other available relevant fragments. The user is able to
manually supplement or refine the query. During the second stage – generation – the constructed
query is sent to an LLM, which produces a test or a test class based on the specified task. The final
stage – post-processing – involves checking the syntactic correctness of the generated code and
executing it to confirm its validity. This approach combines the flexibility of user customization
with the adaptive nature of LLMs and ensures quality control at each stage of test generation.</p>
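      <p>The post-processing stage can be sketched as follows. ChatUniTest itself targets Java; this is an
illustrative Python analogue that checks syntactic correctness with the standard ast module and
executes the candidate test with pytest.</p>
      <preformat># Illustrative post-processing for a generated test (Python analogue; the
# actual tool targets Java). A failure here would trigger the repair stage.
import ast
import pathlib
import subprocess
import tempfile

def validate_generated_test(test_source: str) -> bool:
    # Stage 3a: syntactic correctness of the generated code.
    try:
        ast.parse(test_source)
    except SyntaxError:
        return False
    # Stage 3b: execute the test to confirm its validity.
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "test_generated.py"
        path.write_text(test_source)
        result = subprocess.run(["pytest", "-q", str(path)], capture_output=True)
    return result.returncode == 0</preformat>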
      <p>The analysis of the described examples of automated test generation systems highlights two key
challenges inherent to such approaches. First, crafting an effective prompt for the LLM is critically
important, as the quality of the generated test largely depends on the amount and relevance of the
provided context. Both discussed frameworks demonstrate the authors' emphasis on supplying as
much detailed information about the target function as possible. Second, an essential requirement
for system effectiveness is the verification of generated tests – both in terms of syntactic
correctness and actual execution validity. Without this stage, test generation loses its practical
value, as it does not guarantee reliable automatic coverage.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Automated unit test generation system</title>
      <p>Despite their considerable advantages, automated test generation tools have several limitations that
warrant close attention. In particular, the quality of generated tests tends to be inconsistent and
depends on multiple factors — such as the type of unit being tested, the project's existing test
coverage, and the overall structure and cleanliness of the codebase. In practice, it is not uncommon
for tests generated by AI to fail to properly verify the expected behavior of functions or classes,
thereby undermining the reliability of automated testing.</p>
      <p>One of the solutions for evaluating the quality of generated tests is mutation testing [27]. The
core idea of mutation testing is to deliberately introduce changes into the source code – so-called
mutations – that simulate typical software defects. The purpose of unit tests in this context is to
detect these artificially introduced errors. If the mutated code does not cause any test failures, this
indicates a low capability of the test suite to identify defects.</p>
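      <p>A minimal hand-made mutant is shown below; in practice, frameworks such as mutmut (for
Python) or PIT (for Java) generate and run such mutants automatically. The is_adult function is a
hypothetical unit invented for this sketch.</p>
      <preformat># A hand-made mutant of a hypothetical unit (mutation frameworks
# generate such changes automatically and run the suite against them).

def is_adult(age: int) -> bool:
    return age >= 18  # original unit

def is_adult_mutant(age: int) -> bool:
    return age > 18  # mutation: '>=' replaced by '>'

def test_is_adult_boundary():
    # This test "kills" the mutant: it passes against the original unit
    # but would fail if is_adult were replaced by the mutated version.
    assert is_adult(18) is True</preformat>
      <p>The share of mutants killed by the suite – the mutation score – then serves as a quantitative
measure of its defect-detection capability.</p>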
      <p>Thus, mutation testing serves as an effective technique for identifying weak or superficial tests,
improving the accuracy and depth of logic verification within software units. Applying this
approach at the early stages of the development lifecycle not only enables the timely detection of
defects but also significantly enhances overall test coverage.</p>
      <p>In this context, a concept is proposed for an automated system for generating and improving
unit tests, which combines the capabilities of artificial intelligence and mutation analysis to achieve
high-quality test suites. The architecture of the system consists of the following sequential
components (a schematic sketch of the resulting loop is given after the list):</p>
      <list list-type="order">
        <list-item>
          <p>Test Generation. At this stage, large language models (LLMs), pre-trained on source code
and examples of test scenarios, are employed to automatically generate unit tests. LLMs are
capable of covering a wide range of potential scenarios, including those often overlooked in
manual testing.</p>
        </list-item>
        <list-item>
          <p>Mutation Injection. The codebase is modified by introducing controlled changes –
mutations – using specialized frameworks. This enables the assessment of the generated
tests’ ability to detect intentionally introduced defects.</p>
        </list-item>
        <list-item>
          <p>Analysis and Refinement. Based on the evaluation of mutation testing outcomes, the AI
model updates or augments the existing tests to achieve higher effectiveness.</p>
        </list-item>
        <list-item>
          <p>Iterative Learning. A feedback loop is integrated into the system, allowing the AI to
incrementally enhance the quality of the tests by learning from the outcomes of previous
iterations.</p>
        </list-item>
        <list-item>
          <p>Termination Criterion. The optimization process concludes when the differences in
mutation metrics become negligible and further iterations yield no significant
improvements. At this point, quantitative thresholds are defined to formalize completion.</p>
        </list-item>
      </list>
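      <p>The sketch below outlines the loop formed by these components. It is a high-level illustration,
not an existing implementation: generate_tests, run_mutation_testing, and refine_tests are
placeholders for the stages above, and epsilon stands for the quantitative threshold from the
termination criterion.</p>
      <preformat># High-level sketch of the proposed generate -> mutate -> refine loop.
# All stage functions are placeholders, not an existing implementation.

def improve_test_suite(code, generate_tests, run_mutation_testing, refine_tests,
                       epsilon=0.01, max_iters=10):
    tests = generate_tests(code)  # 1. Test Generation (LLM-based)
    previous_score = 0.0
    for _ in range(max_iters):
        # 2. Mutation Injection: yields a mutation score and surviving mutants.
        score, surviving = run_mutation_testing(code, tests)
        # 5. Termination Criterion: stop when the improvement is negligible.
        if score - previous_score &lt; epsilon:
            break
        previous_score = score
        # 3-4. Analysis, Refinement, and Iterative Learning from survivors.
        tests = refine_tests(tests, surviving)
    return tests</preformat>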
      <p>The proposed integration of generative AI tools with mutation testing methods anticipates several
key outcomes:</p>
      <p>Enhanced Testing Effectiveness. Combining the capabilities of artificial intelligence with
the thoroughness of mutation analysis enables the formation of a test suite that more
adequately covers edge cases and hidden defects. This approach facilitates deeper
verification of software module behavior, even under complex or unpredictable conditions.</p>
      <p>Optimization of Time Expenditure. Automating the test generation process combined with
an iterative refinement mechanism significantly reduces the amount of manual effort
required for quality validation. This allows developers to focus on core development tasks
while minimizing the time spent on analyzing and revising tests.</p>
      <p>Improvement of Final Product Quality. Applying the automated system at early stages of
the software development lifecycle fosters the early detection of defects, reducing the cost
associated with fixing issues identified at later stages and increasing the overall stability
and reliability of the product.</p>
      <p>Versatility in Application. The architecture of the automated test generation system is
designed to support multiple programming languages and testing frameworks, making it
adaptable for deployment across a wide range of projects – from small libraries to
large-scale distributed systems.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>This study analyzed unit testing as a foundation for ensuring software quality, highlighting its key
advantages and associated challenges. It was demonstrated that, despite its high effectiveness
during the early stages of development, unit testing demands significant resources and is
vulnerable to test smells.</p>
      <p>It was established that modern generative AI models can automate the creation of tests;
however, the quality of such tests remains variable. To address this issue, a modified approach
combining AI with mutation testing was proposed. An architecture for an automated unit test
generation system was introduced, incorporating stages of test generation, mutation analysis, and
iterative refinement.</p>
      <p>It is posited that the integration enables the identification of weaknesses in test coverage,
enhances verification depth, and reduces testing costs, making it a promising solution for
large-scale project implementation. Future empirical studies will be necessary to validate this hypothesis
following the practical implementation of the described architecture.</p>
      <p>Further research is planned to explore ways of leveraging mutation testing to provide additional
context for test generation. Moreover, it is intended to assess existing LLMs to determine which
model is best suited for test generation tasks, with the aim of developing a specialized model
tailored to the needs of the proposed automated unit test generation system.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <title>The authors have not employed any Generative AI tools.</title>
        <p>[19] Kim D.J. An empirical study on the evolution of test smell // Proceedings of the 42nd
International Conference on Software Engineering (ICSE 2020). 2020. P. 3. URL:
https://djaekim.github.io/djae.io/img/EvolutionOfTestSmell.pdf.
[20] Tufano M., Palomba F., Bavota G., Penta M.D., Oliveto R., Poshyvanyk D. An empirical
investigation into the nature of test smells // Proceedings of the 31st IEEE/ACM International
Conference on Automated Software Engineering (ASE 2016), Singapore, 2016. P. 4–15. URL:
https://www.cs.wm.edu/~mtufano/publications/C4.pdf. doi: 10.1145/2970276.2970340.
[21] Tufano M., Bavota G., Poshyvanyk D., Di Penta M., Oliveto R., De Lucia A. An empirical study
on developer related factors characterizing fix-inducing commits // Journal of Software:
Evolution and Process. 2017. Vol. 29, No. 1. URL: doi: 10.1002/smr.1797.
[22] Coutinho M., Marques L., Santos A., Dahia M., França C., de Souza Santos R. The role of
generative AI in software development productivity: a pilot case study // Proceedings of the
1st ACM International Conference on AI-Powered Software (AIware 2024). Porto de Galinhas,
Brazil. 2024. ACM, 2024. P. 1–8. URL: https://arxiv.org/pdf/2406.00560v1. doi:
10.48550/arXiv.2406.00560.
[23] Alenezi M., Akour M. AI-driven innovations in software engineering: a review of current
practices and future directions // Applied Sciences. 2025. Vol. 15, No. 3. Article 1344. URL:
https://www.mdpi.com/2076-3417/15/3/1344/pdf?version=1738038423
doi:10.3390/app15031344.
[24] Schäfer M., Nadi S., Eghbali A., Tip F. An empirical evaluation of using large language models
for automated unit test generation // IEEE Transactions on Software Engineering. 2024. Vol.
50, No. 1. P. 85–105. URL: https://arxiv.org/pdf/2302.06527. doi: 10.48550/arXiv.2302.06527
[25] Arteca E., Harner S., Pradel M., Tip F. Nessie: automatically testing JavaScript APIs with
asynchronous callbacks // Proceedings of the 44th IEEE/ACM International Conference on
Software Engineering (ICSE 2022). Pittsburgh, PA, USA, 2022. P. 1494–1505. URL:
https://dl.acm.org/doi/pdf/10.1145/3510003.3510106. doi: 10.1145/3510003.3510106
[26] Chen Y., Hu Z., Zhi C., Han J., Deng S., Yin J. ChatUniTest: a framework for LLM-based test
generation // Companion Proceedings of the 32nd ACM International Conference on the
Foundations of Software Engineering (FSE Companion ’24). 2024. P. 572–576. URL:
https://arxiv.org/pdf/2305.04764. doi: 10.48550/arXiv.2305.04764.
[27] Jia Y., Harman M. An analysis and survey of the development of mutation testing // IEEE
Transactions on Software Engineering. 2011. Vol. 37, No. 5. P. 649–678.
doi:10.1109/TSE.2010.62.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Haiderzai</surname>
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khattab</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>How software testing impact the quality of software systems</article-title>
          ? // International Journal of Engineering Science.
          <year>2019</year>
          . Vol.
          <volume>1</volume>
          , No. 1. P. 1-
          <fpage>9</fpage>
          . doi:
          <volume>10</volume>
          .33545/26633582.
          <year>2019</year>
          .
          <year>v1</year>
          .
          <year>i1a</year>
          .
          <fpage>14</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Nguyen-Duc</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>The Impact of Software Complexity on Cost and Quality: A Comparative Analysis Between Open Source</article-title>
          and Proprietary Software //
          <source>International Journal on Software Engineering and Applications</source>
          .
          <year>2017</year>
          . Vol.
          <volume>8</volume>
          , No. 2. P.
          <volume>17</volume>
          -
          <fpage>31</fpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv.1712.00675.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Mohammadi E. Mapping the structure and evolution of software testing research over the past three decades // Journal of Systems and Software. 2023. Vol. 195. Article No. 111518. doi: 10.1016/j.jss.2022.111518.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Laporte C.Y.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Kim D.J. An empirical study on the evolution of test smell // Proceedings of the 42nd International Conference on Software Engineering (ICSE 2020). 2020. P. 3. URL: https://djaekim.github.io/djae.io/img/EvolutionOfTestSmell.pdf.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Tufano M., Palomba F., Bavota G., Penta M.D., Oliveto R., Poshyvanyk D. An empirical investigation into the nature of test smells // Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016), Singapore, 2016. P. 4–15. URL: https://www.cs.wm.edu/~mtufano/publications/C4.pdf. doi: 10.1145/2970276.2970340.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] Tufano M., Bavota G., Poshyvanyk D., Di Penta M., Oliveto R., De Lucia A. An empirical study on developer related factors characterizing fix-inducing commits // Journal of Software: Evolution and Process. 2017. Vol. 29, No. 1. doi: 10.1002/smr.1797.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Coutinho M., Marques L., Santos A., Dahia M., França C., de Souza Santos R. The role of generative AI in software development productivity: a pilot case study // Proceedings of the 1st ACM International Conference on AI-Powered Software (AIware 2024), Porto de Galinhas, Brazil. ACM, 2024. P. 1–8. URL: https://arxiv.org/pdf/2406.00560v1. doi: 10.48550/arXiv.2406.00560.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] Alenezi M., Akour M. AI-driven innovations in software engineering: a review of current practices and future directions // Applied Sciences. 2025. Vol. 15, No. 3. Article 1344. URL: https://www.mdpi.com/2076-3417/15/3/1344/pdf?version=1738038423. doi: 10.3390/app15031344.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] Schäfer M., Nadi S., Eghbali A., Tip F. An empirical evaluation of using large language models for automated unit test generation // IEEE Transactions on Software Engineering. 2024. Vol. 50, No. 1. P. 85–105. URL: https://arxiv.org/pdf/2302.06527. doi: 10.48550/arXiv.2302.06527.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] Arteca E., Harner S., Pradel M., Tip F. Nessie: automatically testing JavaScript APIs with asynchronous callbacks // Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE 2022), Pittsburgh, PA, USA, 2022. P. 1494–1505. URL: https://dl.acm.org/doi/pdf/10.1145/3510003.3510106. doi: 10.1145/3510003.3510106.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] Chen Y., Hu Z., Zhi C., Han J., Deng S., Yin J. ChatUniTest: a framework for LLM-based test generation // Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE Companion ’24). 2024. P. 572–576. URL: https://arxiv.org/pdf/2305.04764. doi: 10.48550/arXiv.2305.04764.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] Jia Y., Harman M. An analysis and survey of the development of mutation testing // IEEE Transactions on Software Engineering. 2011. Vol. 37, No. 5. P. 649–678. doi: 10.1109/TSE.2010.62.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>