<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Modest-Pharo: Unit Test Generation for Pharo Based on Traces and Metamodels</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gabriel</forename><surname>Darbord</surname></persName>
							<email>gabriel.darbord@inria.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory">UMR 9189 CRIStAL</orgName>
								<orgName type="institution" key="instit1">Univ. Lille</orgName>
								<orgName type="institution" key="instit2">Inria</orgName>
								<orgName type="institution" key="instit3">CNRS</orgName>
								<orgName type="institution" key="instit4">Centrale Lille</orgName>
								<address>
									<postCode>F-59000</postCode>
									<settlement>Lille</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabio</forename><surname>Vandewaeter</surname></persName>
							<email>fabio.vandewaeter.etu@univ-lille.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory">UMR 9189 CRIStAL</orgName>
								<orgName type="institution" key="instit1">Univ. Lille</orgName>
								<orgName type="institution" key="instit2">Inria</orgName>
								<orgName type="institution" key="instit3">CNRS</orgName>
								<orgName type="institution" key="instit4">Centrale Lille</orgName>
								<address>
									<postCode>F-59000</postCode>
									<settlement>Lille</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Anne</forename><surname>Etien</surname></persName>
							<email>anne.etien@inria.fr</email>
							<affiliation key="aff1">
								<orgName type="laboratory">UMR 9189 CRIStAL</orgName>
								<orgName type="institution" key="instit1">Univ. Lille</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">Inria</orgName>
								<orgName type="institution" key="instit4">Centrale Lille</orgName>
								<address>
									<postCode>F-59000</postCode>
									<settlement>Lille</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicolas</forename><surname>Anquetil</surname></persName>
							<email>nicolas.anquetil@inria.fr</email>
							<affiliation key="aff1">
								<orgName type="laboratory">UMR 9189 CRIStAL</orgName>
								<orgName type="institution" key="instit1">Univ. Lille</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">Inria</orgName>
								<orgName type="institution" key="instit4">Centrale Lille</orgName>
								<address>
									<postCode>F-59000</postCode>
									<settlement>Lille</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Benoit</forename><surname>Verhaeghe</surname></persName>
							<email>benoit.verhaeghe@berger-levrault.com</email>
							<affiliation key="aff2">
								<orgName type="institution">Berger-Levrault</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Al</forename><surname>Ceur</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Workshop</forename><surname>Proceedings</surname></persName>
						</author>
						<title level="a" type="main">Modest-Pharo: Unit Test Generation for Pharo Based on Traces and Metamodels</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">7FFF6785E23D4AAA9A133E158D61420D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>test generation</term>
					<term>unit tests</term>
					<term>regression testing</term>
					<term>trace-based</term>
					<term>metamodels</term>
					<term>Pharo</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Unit testing is essential in software development to ensure code functionality and prevent the introduction of bugs. However, challenges such as time constraints and insufficient resource allocation often impede comprehensive testing efforts, leaving software systems vulnerable to regression. To address this issue, we introduce Modest, a language-agnostic approach to unit test generation that uses metamodels and execution traces. This method ensures non-regression by replaying scenarios captured from real-world executions. We demonstrate the application of Modest to Pharo codebases by generating unit tests for two projects. A total of 188 tests were generated and compared to existing tests based on mutation coverage, and we found that combining existing and generated tests increased coverage.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Unit testing is an essential part of software development, serving as a critical mechanism for verifying code functionality and mitigating the risk of introducing bugs. Despite its importance, time constraints and inadequate resource allocation often prevent the widespread adoption of unit testing practices. This can result in codebases that lack proper testing, leaving software systems vulnerable to bugs, issues, and regressions <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>.</p><p>While existing approaches <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref> showed promising results in test generation, they have some limitations, such as being specific to a particular programming language or testing framework. To address this issue, we propose Modest, a language-agnostic approach to test generation. This approach involves the use of metamodels to facilitate the representation and generation of unit tests. The use of metamodels provides a solution that is independent of the programming language and testing framework. It enables automated transformation and code generation. Specifically, we use three metamodels: the unit test metamodel, which represents unit test elements; the code metamodel, which represents the codebase; and the value metamodel, which specifies the values used to test the codebase. Our approach is not intended to replace test-driven development or classic development practices where tests are written during the development phase. Our approach aims to generate unit tests on legacy software systems where tests are partially or completely missing. The generated tests help manage regression and identify new software bugs in existing areas of a system after changes have been made.</p><p>Furthermore, we aim to generate maintainable test code that is easy for humans to understand. 
Human-readable and maintainable tests make it easier for developers to understand how the code works and to make changes to the codebase with confidence <ref type="bibr" target="#b5">[6]</ref>. In addition, human-readable tests can be helpful when onboarding new developers to a project, or when maintaining code written by others.</p><p>Ultimately, maintainable tests can reduce the amount of time spent debugging and fixing issues in the codebase.</p><p>To generate realistic tests, we use application traces consisting of method arguments and return values to leverage values from real business scenarios. Traces refer to the sequential recording of actions or operations in a system during its execution. This information is critical because it provides an accurate representation of how the software behaves at runtime.</p><p>In our previous work <ref type="bibr" target="#b6">[7]</ref>, we introduced two metamodels: the Value metamodel for representing runtime values, and the Unit Test metamodel. Our approach is based on Moose, a platform for software and data analysis <ref type="foot" target="#foot_0">1</ref> . This infrastructure allows us to extract knowledge from software systems and to apply our approach across programming languages.</p><p>In this paper, we present our five-step approach in Section 2. In Section 3 we explain the implementation of some steps in the case of test generation in Pharo. Section 4 presents some results on concrete Pharo applications. We discuss related works in Section 5. In Section 6, conclusions are drawn and perspectives are proposed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Modest: a Unit test Generation Approach</head><p>Modest uses method execution traces to generate unit tests. This approach assumes that the current version of the software system for which tests are being generated is correct, allowing execution traces to be used as an oracle. The process relies on five steps to generate unit tests, as shown in Figure <ref type="figure" target="#fig_0">1</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Codebase Traces</head><note type="other">Value</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Prerequisites</head><p>There are two independent requirements that must be met before the test generation process can begin.</p><p>Step 1: Obtain a model of the application. Using the capabilities of the Moose platform, we create a comprehensive model of the application for which tests are to be generated. This model captures the structural aspects of the application, such as its classes and methods, and their relationships.</p><p>Step 2: Produce traces of the application. Data about the execution of the current version of the software system is recorded as a trace. Each trace corresponds to a specific method execution and must contain the following information: method identity, arguments, return value, and the receiver object. The method identity is a way to know exactly which method was executed. This is critical because multiple methods in the system can have the same signature due to polymorphism. This identity consists of the fully qualified class name and the method signature, including parameter types in the case of statically typed languages.</p><p>For a given project, any method that has no side effects and returns a value is a candidate for instrumentation. Side effects include use of the file system, graphical interfaces, network, global states, and randomness. Each execution of an instrumented method can result in a generated test. Thus, for a given executable comment or existing test, multiple tests can be generated that differ in the value of the arguments and the return value.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Test Generation Process</head><p>Once the prerequisites are met, the following steps are performed iteratively for each test generation cycle.</p><p>Step 3: Import and parse trace data. Traces are imported into Modest and reified to conform to a specific format. This ensures that the imported traces are represented consistently, regardless of the original storage format. The serialized data contained in each trace is parsed to extract relevant information. This parsed data is then reified using our Value metamodel, transforming it into a standard format for further processing. The Value metamodel bridges static code elements, such as method parameters, with dynamic runtime values, such as method arguments.</p><p>Step 4: Build a unit test model. The Unit Test metamodel presented in <ref type="bibr" target="#b6">[7]</ref> is agnostic to the language and the testing framework used. It is built around the Arrange Act Assert (AAA) pattern, a widely used approach to structuring unit tests. We use the trace of a particular method execution to determine the test class and method, as well as the arrange, act, and assert phases of a unit test. The executed method determines the test method, while its class determines the test class. The method arguments determine the arrange and act phases of the test, where they are set up, used, and torn down. Finally, the result obtained from the trace determines the assert phase. We use the result as a test oracle, and the actual return value obtained in the act phase is compared to the expected value from the trace.</p><p>Step 5: Export the unit test model into concrete tests. The unit test model is translated into executable test code specific to the target language and the specific testing framework. This translation involves converting the model into Abstract Syntax Tree (AST) nodes. Finally, the AST nodes are used to generate the actual unit tests.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Adapting the Modest Approach to Pharo</head><p>In this section, we outline our methodology for generating unit tests for Pharo software systems. The Moose platform is used for application modeling, and a Pharo implementation of OpenTelemetry is used to produce execution traces. Consequently, details are given for language-dependent steps, i.e. steps 2, 3, and 5.</p><p>Step 2: Produce traces of the application. We use a Pharo implementation<ref type="foot" target="#foot_1">2</ref> of OpenTelemetry to generate execution traces of the application. OpenTelemetry <ref type="foot" target="#foot_2">3</ref> is an open-source observability framework and standard designed to generate, collect, and export telemetry data such as traces. It provides tools and APIs for instrumenting applications to monitor and analyze their behavior. Our implementation uses MetaLinks <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>, which allows the execution of instrumentation code before and after the method on which it is installed. We use this mechanism to record the method identity, arguments, return value, and the receiver. The instrumentation does not propagate to outgoing calls, only the targeted methods are traced. This preliminary step is only concerned with generating trace data. These traces will be fed into Modest in the following step, which will take place at a later date and possibly in a different Pharo image. Thus, the recorded data must be serialized for storage. In addition, we require that the serialized objects contain enough information to be correctly represented by the value metamodel, such as their runtime type.</p><p>The STON library <ref type="foot" target="#foot_3">4</ref> encodes the runtime type data we need, but it is not able to serialize all types of objects. In addition, STON allows developers to define a custom serialization format for their class. 
While this customization is useful, it makes the object encoding opaque to external tools such as Modest. Consequently, we developed a custom library inspired by Jackson<ref type="foot" target="#foot_4">5</ref> , called PharoJackson<ref type="foot" target="#foot_5">6</ref> , with the goal of being able to serialize any object to JSON in a consistent way. Similar to STON and Jackson, our library includes metadata to express the object type and handles circular references.</p><p>Step 3: Import and parse trace data. When the execution traces are imported into Modest, the data they contain is parsed to extract the relevant information. First, the method identity is used to determine the origin of the trace, corresponding to the method to be tested. For Pharo, this consists of the method selector and the name of the defining class.</p><p>Then, the serialized data containing the method arguments, return value, and receiver is deserialized from JSON to basic data structures: dictionaries, arrays, strings; and primitive data types: numbers, booleans, and null values. Except for dictionaries, all these types represent instances of their corresponding class in Pharo. For example, a JSON array corresponds to an instance of a Pharo Array.</p><p>Listing 1: User and Session objects serialized with PharoJackson 1 { 2 "@type": "User", 3 "@id": 1,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4</head><p>"name": "John Doe", 5 "session": { 6 "@type": "Session", 7 "@id": 2, Dictionaries are a special case because they are used to represent objects. Their key-value pairs correspond to attribute names and values (e.g. in Listing 1, the name attribute on line 4). In addition, metadata is added by PharoJackson: the @type value indicates the class of the object (e.g. line 2 indicates it is an instance of the User class), and the @id value is an identifier for handling circular dependencies (e.g. line 3). If the same object is referenced more than once, it is subsequently represented by a dictionary with a @ref value indicating the identifier of the corresponding object (e.g. line 9).</p><p>Thus, deserializing the trace data returns a graph of basic data structures. The importer of the Value metamodel is designed to interpret this specific format. It traverses the graph and instantiates the corresponding Value entities into a model.</p><p>Step 5: Export the unit test model into concrete tests. The unit test model is translated into executable test code specific to the Pharo language and the SUnit testing framework. Each element of the model is systematically visited.</p><p>Test classes are created using Pharo's built-in class creation API. For clarity and separation from existing tests, newly created test classes are named by appending ModestTest to the name of the tested class, e.g. in Listing 2. As part of the class creation process, each test class is then assigned to an appropriate package. Following Pharo's naming conventions, the test package is named after the package of the tested class, with the suffix -Tests added. 
If the specified test package does not exist, it will be created automatically.</p><p>Listing 2: Definition of the generated test class for the DataFrame class, from the package of the same name.</p><p>TestCase &lt;&lt; #DataFrameModestTest slots: {}; package: 'DataFrame-Tests'</p><p>After visiting a test class within the unit test model, the process moves on to exporting its test methods along with their associated arrange, act, and assert entities. These three entities are linked to value entities, which are visited by a specialized visitor responsible for generating the AST to recreate the values as code, e.g. in Listing 3. Both visitors work together to generate the AST of the test method, which is then materialized and installed in the test class. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Modest in Action on Pharo Projects</head><p>In this section we evaluate our approach on real Pharo projects. First, relevant projects were selected. Then they were instrumented to generate traces. Finally, we present the generated test cases and the benefits of our approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Selection of Projects</head><p>As explained earlier, our approach is based on execution traces. There are several ways to get them in Pharo, such as manually executing the software to be tested. However, this can be difficult if we are not a user or developer of the software system; it requires expert knowledge. Therefore, alternative ways to generate execution traces are needed. As it happens, there are other ways to run valid execution scenarios: tests and examples, such as executable comments or class-side examples using the &lt;example&gt; pragma. Such examples are very common in kernel packages and graphical projects. However, since our approach uses metalinks to generate the trace, it is not possible to select projects from the kernel that are used by the instrumentation itself, such as the Boolean or Collection packages, as this would break the image. Also, as explained in Section 2, our approach does not currently deal with graphical applications, as it requires that the tested method returns a value. Side effects and randomness are also not handled yet, which limits the choice of projects.</p><p>Two projects were selected: DataFrame<ref type="foot" target="#foot_6">7</ref> and LabelContractor <ref type="foot" target="#foot_7">8</ref> . DataFrame is a tabular data structure for data analysis in Pharo. It organizes and represents data in a tabular format, similar to a spreadsheet or database table. It also provides several algorithms for data manipulation. For our evaluation we only considered the DataFrame class. The LabelContractor project is used to reduce the size of labels for graphical interfaces using different strategies. It currently provides 13 different contraction strategies and two ways to combine them. For our evaluation, we considered the project's main class, a tokenizer class, a helper class, and seven strategies. 
We report information about the selected classes in Table <ref type="table" target="#tab_0">1</ref>.</p><p>In the case of DataFrame, traces result from running existing tests. In the case of LabelContractor, traces result from running existing tests and executable comments. In both cases, our approach generates tests from these traces. Since tests already exist for these projects, it is possible to compare them with our generated tests in terms of mutation coverage. To obtain these measurements, we used the MuTalk<ref type="foot" target="#foot_8">9</ref> library.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Results</head><p>We generated tests for the previously introduced projects and classes. To reduce the number of generated tests, we recorded only the first execution of each instrumented method. An example of an existing test is shown in Listing 4, and the test that was generated from its execution is shown in Listing 5.</p><p>We now evaluate how the mutation coverage of the existing tests compares to our generated tests. We also look at how the coverage evolves when both existing and generated tests are considered. Our results are reported in Table <ref type="table" target="#tab_1">2</ref>.     The reason for failed tests is that there are still some objects that are not serializable by our library, such as closures. The DataFrame project has a higher number of failed tests compared to LabelContractor. This difference can be attributed to the greater complexity of the DataFrame project, which uses more objects that are currently not serializable.</p><p>For DataFrame, the mutation coverage achieved by the generated tests is lower than that of the existing tests (43% compared to 59%). However, when both existing and generated tests are combined, the mutation coverage improves to 64%. For LabelContractor, the mutation coverage achieved by the generated tests is 43%, lower than the existing test coverage of 56%. When combined, the mutation coverage also improves to 59%.</p><p>We can see that more mutants are killed by existing tests than by generated tests. This can be explained by the fact that existing tests often use auxiliary methods to initialize test values during the setup phase. In contrast, the generated tests rely on a structural reconstruction approach based solely on constructors (new in Pharo) and accessors, or on reflection. 
During the experiment, mutations were generated for entire classes rather than for specific methods, so existing tests were more likely to encounter and kill a mutation because they execute methods more often and with different arguments.</p><p>These results indicate that the combination of generated and existing tests leads to higher mutation coverage for both projects. The increase can be attributed to the use of structural equality between actual and expected results in the generated tests. This exhaustive recursive comparison helps to identify and kill more mutants than a standard equality check.</p><p>A threat to the validity of the generated tests is their reliance on execution traces. These traces are derived from specific scenarios, and the coverage and effectiveness of the generated tests are inherently tied to the completeness of those scenarios. If the execution traces do not cover relevant code paths or edge cases, the generated tests will also lack coverage in these areas.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Related Works</head><p>EvoSuite <ref type="bibr" target="#b2">[3]</ref> is characterized by its ability to generate JUnit test cases using evolutionary algorithms, with a specific focus on Java. One of its strengths is achieving high levels of code coverage, including branch and line coverage. However, its generated unit tests often have a distinct style that differs from human-written tests, which can affect their readability <ref type="bibr" target="#b9">[10]</ref>. SmallEvoTest <ref type="bibr" target="#b10">[11]</ref> generates unit tests for dynamically typed programming languages, specifically Pharo and GToolkit, by using a type-profiling mechanism and a genetic algorithm to evolve the unit tests. In contrast to these language-specific, evolutionary algorithm-driven approaches, our approach aims to be language-agnostic and uses execution traces to generate tests. We also focus on generating code that is more comprehensible for humans.</p><p>Several research studies have explored the use of execution traces for software testing, recognizing the valuable insight they provide into the behavior of a program at runtime. One web testing approach generates test cases from user execution traces <ref type="bibr" target="#b11">[12]</ref>. To improve the test suite, mutation operators were applied to these test cases, simulating potential real-world failures. Tests that yielded different results were kept because they revealed additional behavior in the web application being tested. Techniques such as Daikon's invariant inference, which identifies likely invariants from execution traces, demonstrate the effectiveness of trace-based testing <ref type="bibr" target="#b12">[13]</ref>. In the future, we could use similar methods to identify interesting test scenarios from traces.</p><p>In recent years, test generation tools using deep learning have attracted considerable interest. 
Among these tools, AthenaTest <ref type="bibr" target="#b3">[4]</ref> stands out for its ability to generate unit test cases for Java programs by learning from actual methods and developer-written tests. Developer surveys indicate that AthenaTest outperforms other tools such as EvoSuite in both test coverage and readability. Building on AthenaTest, A3Test <ref type="bibr" target="#b4">[5]</ref> introduces improvements by integrating assertion knowledge and ensuring consistency in naming and test signatures, resulting in improved correctness and method coverage. CodeT <ref type="bibr" target="#b13">[14]</ref> presents a method that uses pre-trained language models to automatically generate test cases to evaluate the quality and correctness of code solutions. Despite these advances, deep learning-based tools still face notable challenges because they require extensive training data and significant computational resources.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper, we introduced Modest, a language-agnostic approach to test generation that uses metamodels to generate unit tests. This approach ensures non-regression by replaying scenarios captured by execution traces. Finally, we showed how Modest can be applied to Pharo by generating unit tests for two projects.</p><p>Looking ahead, several avenues for further development of Modest are possible. These include experimenting with trace selection and mutation <ref type="bibr" target="#b11">[12]</ref>, mining for invariants <ref type="bibr" target="#b12">[13]</ref>, optimizing the generated test suite through coverage modeling, and pruning recreated objects to focus on relevant data. In addition, we plan to evaluate our approach on a larger scale to better understand its effectiveness and applicability. A key aspect of future work will be the criteria for selecting relevant scenarios or traces, which are currently determined by the user. By addressing these areas, we aim to further refine Modest and increase its utility in managing regression in software systems.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The 5 steps of the Modest approach. Entities representing the code to be tested are shown in green (left column), entities representing runtime information are shown in orange (middle column), and entities representing the generated tests are shown in blue (right column).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>8 "</head><label>8</label><figDesc></figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Listing 3 :</head><label>3</label><figDesc>Generated code recreating the object from Listing 1.1 (user := User new)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Listing 4 :testTokenize 2 3 |</head><label>43</label><figDesc>Existing test from the tokenizer class of the LabelContractor project. LbCTokenizer new tokenize: 'CK123J') 5 equals: #( 'C' 'K123' 'J' ) asOrderedCollection Listing 5: Test generated from the execution trace of Listing 4. 1 expected aString lbCTokenizer actual | 4 expected := OrderedCollection withAll: { 'C'. 'K123'. 'J' }.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>7 actual:</head><label>7</label><figDesc>= lbCTokenizer tokenize: aString.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>8</head><label></label><figDesc>self assert: actual equals: expected</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Selected Pharo projects and the number of evaluated classes. The table shows the number of methods, existing tests, and executable comments for the selected classes. It also shows the number of methods covered by tests and comments, representing the methods for which tests were generated.</figDesc><table><row><cell>1-8</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Generated tests for selected Pharo projects and their results. The table shows the number of generated tests, the number of tests that passed and failed, and the mutation coverage achieved by these tests. The combined mutation coverage indicates the coverage when both existing and generated tests are evaluated together.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://moosetechnology.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/Gabriel-Darbord/opentelemetry-pharo</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://opentelemetry.io/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://github.com/svenvc/ston</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://github.com/FasterXML/jackson</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://github.com/Modest-Project/PharoJackson</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">https://github.com/PolyMathOrg/DataFrame</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">https://github.com/moosetechnology/LabelContractor</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">https://github.com/pharo-contributions/mutalk</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A survey of unit testing practices</title>
		<author>
			<persName><forename type="first">P</forename><surname>Runeson</surname></persName>
		</author>
		<idno type="DOI">10.1109/MS.2006.91</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Software</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Are there any unit tests? an empirical study on unit testing in open source python projects</title>
		<author>
			<persName><forename type="first">F</forename><surname>Trautsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Grabowski</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICST.2017.26</idno>
	</analytic>
	<monogr>
		<title level="m">2017 IEEE International Conference on Software Testing, Verification and Validation (ICST)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="207" to="218" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Evosuite: Automatic test suite generation for object-oriented software</title>
		<author>
			<persName><forename type="first">G</forename><surname>Fraser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arcuri</surname></persName>
		</author>
		<idno type="DOI">10.1145/2025113.2025179</idno>
		<idno>doi:10.1145/2025113.2025179</idno>
		<ptr target="https://doi.org/10.1145/2025113.2025179" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE &apos;11</title>
				<meeting>the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE &apos;11<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="416" to="419" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Unit test case generation with transformers and focal context</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tufano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Drain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Svyatkovskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sundaresan</surname></persName>
		</author>
		<idno>CoRR abs/2009.05617</idno>
		<ptr target="https://arxiv.org/abs/2009.05617" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Alagarsamy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tantithamthavorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aleti</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.10352</idno>
		<title level="m">A3test: Assertion-augmented automated test case generation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Modeling readability to improve unit tests</title>
		<author>
			<persName><forename type="first">E</forename><surname>Daka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Fraser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Weimer</surname></persName>
		</author>
		<idno type="DOI">10.1145/2786805.2786838</idno>
		<ptr target="https://doi.org/10.1145/2786805.2786838" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015</title>
				<meeting>the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="107" to="118" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A unit test metamodel for test generation</title>
		<author>
			<persName><forename type="first">G</forename><surname>Darbord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Etien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Anquetil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Verhaeghe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Derras</surname></persName>
		</author>
		<ptr target="https://hal.science/hal-04219649" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 International Workshop on Smalltalk Technologies, CEUR Workshop Proceedings</title>
				<meeting>the 2023 International Workshop on Smalltalk Technologies, CEUR Workshop Proceedings</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Sub-method Structural and Behavioral Reflection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Denker</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
		<respStmt>
			<orgName>University of Bern</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Sub-method, partial behavioral reflection with reflectivity: Looking back on 10 years of use</title>
		<author>
			<persName><forename type="first">S</forename><surname>Costiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Aranega</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Denker</surname></persName>
		</author>
		<idno type="DOI">10.22152/programming-journal.org/2020/4/5</idno>
	</analytic>
	<monogr>
		<title level="j">The Art, Science, and Engineering of Programming</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">An empirical investigation on the readability of manual and generated test cases</title>
		<author>
			<persName><forename type="first">G</forename><surname>Grano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Scalabrino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">C</forename><surname>Gall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Oliveto</surname></persName>
		</author>
		<idno type="DOI">10.1145/3196321.3196363</idno>
		<idno>doi:10.1145/3196321.3196363</idno>
		<ptr target="https://doi.org/10.1145/3196321.3196363" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th Conference on Program Comprehension, ICPC &apos;18</title>
				<meeting>the 26th Conference on Program Comprehension, ICPC &apos;18<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="348" to="351" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">SmallEvoTest: Genetically created unit tests for smalltalk</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bergel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Galindo-Gutiérrez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fernandez-Blanco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-P</forename><surname>Sandoval-Alcocer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Workshop on Smalltalk Technologies, CEUR Workshop Proceedings</title>
				<meeting>the International Workshop on Smalltalk Technologies, CEUR Workshop Proceedings</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Test case generation based on mutations over user execution traces</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C R</forename><surname>Paiva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Restivo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Almeida</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Software Quality Journal</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="1173" to="1186" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The Daikon system for dynamic detection of likely invariants</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Ernst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Perkins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccamant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Pacheco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Tschantz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiao</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.scico.2007.01.015</idno>
		<ptr target="https://doi.org/10.1016/j.scico.2007.01.015" />
	</analytic>
	<monogr>
		<title level="j">Science of Computer Programming</title>
		<imprint>
			<biblScope unit="volume">69</biblScope>
			<biblScope unit="page" from="35" to="45" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
	<note>special issue on Experimental Software and Toolkits</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">CodeT: Code generation with generated tests</title>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-G</forename><surname>Lou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2207.10397</idno>
		<ptr target="https://arxiv.org/abs/2207.10397" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
