<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ReLESS: A Framework for Assessing Safety in Deep Learning Systems</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Nan</forename><surname>Jia</surname></persName>
							<email>njia@gradcenter.cuny.edu</email>
							<affiliation key="aff0">
								<orgName type="department">the Graduate Center</orgName>
								<orgName type="institution">CUNY</orgName>
								<address>
									<addrLine>365 Fifth Avenue</addrLine>
									<postCode>10016</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Anita</forename><surname>Raja</surname></persName>
							<email>anita.raja@hunter.cuny.edu</email>
							<affiliation key="aff0">
								<orgName type="department">the Graduate Center</orgName>
								<orgName type="institution">CUNY</orgName>
								<address>
									<addrLine>365 Fifth Avenue</addrLine>
									<postCode>10016</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">CUNY</orgName>
								<orgName type="institution" key="instit2">Hunter College</orgName>
								<address>
									<addrLine>695 Park Avenue</addrLine>
									<postCode>10065</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Raffi</forename><surname>Khatchadourian</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">the Graduate Center</orgName>
								<orgName type="institution">CUNY</orgName>
								<address>
									<addrLine>365 Fifth Avenue</addrLine>
									<postCode>10016</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">CUNY</orgName>
								<orgName type="institution" key="instit2">Hunter College</orgName>
								<address>
									<addrLine>695 Park Avenue</addrLine>
									<postCode>10065</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ReLESS: A Framework for Assessing Safety in Deep Learning Systems</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C500760DFD7C252B2B8D57A1305B0489</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>learning-enabled software systems</term>
					<term>machine learning systems</term>
					<term>refactoring</term>
					<term>trusted AI software architectures</term>
					<term>AI safety</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Traditionally, software refactoring helps to improve a system's internal structure and enhance its non-functional features, such as reliability and run-time performance, while preserving external behavior, including original program semantics. However, in the context of learning-enabled software systems (LESS), e.g., Machine Learning (ML) systems, it is unclear which portions of a software's semantics require preservation at the development phase. This is mainly because (a) the behavior of LESS is not defined until run-time; and (b) ML algorithms are inherently iterative and non-deterministic. Consequently, there is a knowledge gap in what refactoring truly means in the context of LESS, as such systems have no guarantee of a predetermined correct answer. We thus conjecture that, to construct robust and safe LESS, it is imperative to understand the flexibility of refactoring LESS compared to traditional software and to measure it. In this paper, we introduce a novel conceptual framework named ReLESS for evaluating refactorings for supervised learning by (i) exploring the transformation methodologies taken by state-of-the-art LESS refactorings that focus on singular metrics, (ii) reviewing informal notions of semantics preservation and the level at which they occur (source code vs. trained model), and (iii) empirically comparing and contrasting existing LESS refactorings in the context of image classification problems. This framework will set the foundation not only to formalize a standard definition of semantics preservation in LESS but also to combine four metrics (accuracy, run-time performance, robustness, and interpretability) into a multi-objective optimization function, instead of the single-objective functions used in existing works, to assess LESS refactorings. In the future, our work could seek reliable LESS refactorings that generalize over diverse systems.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Developers of Learning-Enabled Software Systems (LESS) face the challenge of constructing highly reliable large-scale systems, as evidenced in previous research <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. With the pervasive integration of dynamic Machine Learning (ML) models in these operational software systems, safety, efficiency, and adaptability with respect to evolving user requirements become paramount. Moreover, software systems inherently evolve throughout their life-cycle <ref type="bibr" target="#b2">[3]</ref>, which traditionally incurs substantial costs and risks, particularly in the context of large, complex systems <ref type="bibr" target="#b3">[4]</ref>. Although LESS shares these traits with conventional software, its data-driven nature accentuates the propensity for evolution <ref type="bibr" target="#b4">[5]</ref>. This divergence from traditional software poses unique challenges for testing and verification due to its data-driven and uncertain requirements. Notably, the efficacy of resulting ML models, including Large Language Models (LLMs), improves with more extensive data inputs, necessitating a delicate balance between user privacy protection and model refinement in large-scale systems. Consequently, there arises a pressing need for validation and testing methodologies tailored to the distinctive characteristics of AI-driven systems.</p><p>This evolving research agenda underscores a critical reassessment of priorities in AI system development. Furthermore, as AI technologies permeate various sectors of society, scalable systems must effectively consider and adapt to legal, policy, and employment implications. These technical attributes not only underpin the functional aspects of AI applications but also facilitate their alignment with essential ethical standards and societal expectations <ref type="bibr" target="#b5">[6]</ref>. 
This imperative is further underscored by a recent U.S. government-issued Executive Order <ref type="bibr" target="#b6">[7]</ref> and the EU AI Act <ref type="bibr" target="#b7">[8]</ref>, emphasizing the necessity for Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Moreover, to ensure the positive societal impact of AI systems, accuracy, runtime performance, robustness, and interpretability are crucial technical attributes that directly support broader ethical objectives.</p><p>Recent works <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b8">9]</ref> have highlighted a variety of metrics for assessing the impacts of LESS transformation. These metrics include aspects such as ensuring safety and fairness, protecting privacy, fostering collaboration, considering legal and policy ramifications, and evaluating impacts on employment. Recent studies <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14]</ref> have investigated whether original and transformed systems should behave consistently before and after transformation. These studies illustrate the potential trade-offs between accuracy and each respective metric. Although various metrics like fairness and privacy are considered, in this work, we focus on accuracy, run-time performance, robustness, and interpretability as a starting point with the intent to cover the majority of AI safety concerns in LESS <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>. 
We argue that comprehending and harnessing the flexibility of refactoring in LESS represents a pivotal stride toward enhancing the safety of AI systems.</p><p>A detailed exposition of these metrics, as discussed in the state-of-the-art literature, is provided in Section 2.</p><p>Traditionally, the criterion for refactoring <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref> is that the same input must produce the same output; any deviation is considered a behavior change of the program and a threat that can lead to system crashes <ref type="bibr" target="#b18">[19]</ref>. However, refactoring is underexplored in the context of LESS, including deep learning frameworks <ref type="bibr" target="#b0">[1]</ref>. LESS, unlike traditional software systems, benefit from randomness yet lack a guarantee of a predefined exact outcome due to their reliance on the quantity and quality of data, which complicates predictions about the effects of refactoring.</p><p>This paper aims to bridge the knowledge gap between refactoring practices in traditional software <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b23">24,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b26">27]</ref> and LESS <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b27">28,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b29">30]</ref> by introducing ReLESS (Refactoring of Learning-Enabled Software Systems), an evaluation framework for standardizing and formalizing refactoring methodologies. In this work, we describe this framework in the context of supervised learning tasks, specifically image classification problems. 
Our hypothesis posits that the criteria for successful refactoring-namely, source-to-source transformation and semantic preservation-assume unique yet complementary implications in the context of LESS as opposed to traditional software systems.</p><p>Specifically, ReLESS will allow for the possibility that transformations might produce outputs that are slightly different from the original output as long as they lead to improvements in other performance metrics of the system. Determining how "different" the output can be from the original is a research question we seek to address. Moreover, our approach aims to discover and preserve safety-critical metrics during ReLESS while further mitigating the uncertainties introduced by their non-deterministic nature. While current approaches emphasize knowledge distillation (transferring knowledge from a large neural network to a smaller, resource-efficient one) and regularization (a technique for mitigating over-fitting), our vision for the future of ReLESS includes approaches that combine connectionist models (e.g., neural networks) and symbolic (e.g., decision tree) approaches, as well as Bayesian and analogizer (e.g., K-nearest neighbor, support vector machine) approaches.</p><p>This paper is structured as follows: in Section 2, we first provide a comprehensive analysis of state-of-the-art refactoring methodologies in LESS and discuss how these works trade off accuracy with respect to specific metrics such as run-time performance, robustness <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b15">16]</ref>, or interpretability. In Section 3, we contrast existing practices and scrutinize informal notions of semantics preservation across different levels (source code vs. trained model). 
We then motivate a novel thread of inquiry for the ReLESS evaluation framework and its multi-objective optimization function that combines the aforementioned multiple metrics to guarantee the AI system's safety. Section 4 presents preliminary experiments utilizing ReLESS to gauge LESS safety and associated parameters. Finally, in Section 5 we discuss the main insights gleaned from this work and our future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In recent years, considerable research has been conducted on LESS refactoring, with significant observations on balancing a single metric against model accuracy. Several studies have focused on image classification or object detection, addressing this tension and presenting innovative verification techniques. However, these approaches often suffer from a lack of generalization and a narrow scope of metrics, which we aim to address in our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Refactoring Types in Software Development</head><p>Refactoring <ref type="bibr" target="#b16">[17]</ref>, a well-known technique for the evolution and maintenance of traditional software, alters a system's internal structure without changing its behavior <ref type="bibr" target="#b17">[18]</ref> to improve non-functional characteristics such as run-time performance, security, and modularity, and to pay down technical debt <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32,</ref><ref type="bibr" target="#b32">33,</ref><ref type="bibr" target="#b33">34,</ref><ref type="bibr" target="#b34">35]</ref>. It can be considered a series of typically automated procedures for modifying code, such as renaming variables to enhance comprehension <ref type="bibr" target="#b18">[19]</ref>; such modifications frequently occur automatically within development environments. Formally, a refactoring is a program transformation, potentially spanning multiple, non-adjacent program statements or expressions, that is: (i) source-to-source and (ii) semantics-preserving, i.e., the behavior of the program is the same before and after the refactoring.</p><p>Even though refactoring is a well-established practice in traditional software development, it is not as well understood in LESS. Existing refactoring attempts in LESS are implicitly performed via controlling randomness <ref type="bibr" target="#b10">[11]</ref>, decomposing trained models <ref type="bibr" target="#b11">[12]</ref>, or defining new requirements <ref type="bibr" target="#b12">[13]</ref>. The lack of refactoring tools, techniques, and an evaluation framework for LESS is a significant challenge for developers and researchers <ref type="bibr" target="#b1">[2]</ref>. Our research aims to develop a multi-objective evaluation framework for LESS. 
We study it in the context of a specific class of supervised learning problems, namely image classification tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Image Classification Problems and Evaluation</head><p>While the continuous evaluation of ML models <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b35">36]</ref> has highlighted modularity, reliability, robustness, and interpretability, these assessments, done independently, fall short of ensuring the safety of AI systems as a whole. Consider, for instance, the role of accuracy, a widely accepted metric <ref type="bibr" target="#b36">[37]</ref> for gauging the success of models in the image classification task. Benchmark models for this problem class, originating from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), have continually improved on accuracy. To date, the record for the highest accuracy on the ImageNet benchmark is an impressive 92.4%, set by OmniVec(ViT).<ref type="foot" target="#foot_0">1</ref> However, while high accuracy is indispensable, it is not the sole criterion for the adequacy of a model, especially in safety-critical applications. In such applications, other non-functional metrics demand equal consideration to ensure the comprehensive robustness and reliability of LESS.</p><p>Ensuring AI systems maintain safety and fairness <ref type="bibr" target="#b37">[38]</ref> across various conditions and inputs is crucial for applications like autonomous driving and medical diagnosis <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b38">39]</ref>. Reliability is equally important, as dependable systems yield consistent results, fostering trust among stakeholders and accountability among developers. While state-of-the-art models often match or exceed human performance in image classification tasks, understanding errors and their solutions remains challenging <ref type="bibr" target="#b39">[40]</ref>. 
Evaluating model performance is vital, especially in safety-critical scenarios, yet the opaque nature of the learning component hinders transparency and interpretability.</p><p>Our proposed framework ReLESS combines accuracy, runtime performance, robustness, and interpretability, using a multi-objective optimization function. By experimenting with metrics drawn from existing literature and through preliminary evaluations of them, we validate target systems' performance both before and after refactoring. Our findings illuminate the trade-offs researchers make between accuracy and other performance metrics. Importantly, our evaluation process considers not just a single metric versus accuracy but integrates multiple metrics to understand various system maintenance challenges. This approach helps mitigate the "black-box" nature of AI learning components, providing clearer insights into system behavior and performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Baselines for Comparison</head><p>Chen et al. <ref type="bibr" target="#b10">[11]</ref> analyzed refactoring for image classification tasks at the algorithmic level with various models using dynamic analysis, record-and-replay, and profile-and-patch. The focus of their approach is to control randomness and hardware non-determinism to guarantee that the 𝑂𝑢𝑡𝑝𝑢𝑡 and performance metrics are the same as in the original system across seven models (Lenet1/4/5, ResNet-38/56, WRN-28-10, and ModelX). Models are then reproduced efficiently and accurately across different hardware.</p><p>Pan and Rajan <ref type="bibr" target="#b11">[12]</ref> hypothesized that decomposing learning models into reusable components can affect refactoring outputs and statistical performance on the MNIST <ref type="bibr" target="#b40">[41]</ref> dataset. They ran four DNN models across sixteen experiments with varying hidden layers and datasets, demonstrating that removing irrelevant edges in the network can lead to similar accuracy and preserve most of the semantics. They found that 9 out of 16 cases were functionally equivalent to the original models, based on the Jaccard Index, with intra-dataset performance from decomposed models slightly outperforming models built from scratch (e.g., MNIST (+0.30%)).</p><p>Adopting the methodology from Hu et al. <ref type="bibr" target="#b12">[13]</ref>, we succeeded in obtaining the original and filtered images from ImageNet <ref type="bibr" target="#b41">[42]</ref>. Image filters such as brightness, contrast, defocus/blur, frost, Gaussian noise, and JPEG compression are crucial for testing the robustness of the refactored systems <ref type="bibr" target="#b42">[43]</ref> because they involve pictures that humans can recognize correctly and easily before and after filtering, thus setting a baseline for model performance in similar conditions.</p></div>
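Pan and Rajan's Jaccard-Index comparison can be illustrated with a minimal sketch. The function below is a standard Jaccard Index; the "correctly classified sample" index lists are toy data for illustration, not results from their experiments.

```python
# Hedged sketch: comparing an original and a decomposed model via the
# Jaccard Index over the sets of test samples each classifies correctly.
# The sample indices below are illustrative toy data.

def jaccard_index(a, b):
    """Jaccard Index of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are conventionally fully similar
    return len(a & b) / len(a | b)

# Indices of test samples each model classified correctly (toy data).
correct_original = [0, 1, 2, 4, 5, 7, 8, 9]
correct_decomposed = [0, 1, 2, 4, 5, 7, 8]

similarity = jaccard_index(correct_original, correct_decomposed)
print(f"Jaccard Index: {similarity:.3f}")  # 7 shared / 8 total = 0.875
```

A similarity of 1.0 would indicate functional equivalence under this measure; what threshold below 1.0 still counts as "equivalent" is exactly the kind of application-dependent question ReLESS raises.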
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>Given the context of refactoring in ML systems as discussed in the previous section, we present ReLESS, a conceptual framework created especially to tackle two important research goals. First, to investigate the definition and operation of semantic preservation during system transformation procedures within LESS. Second, to explore approaches for evaluating the safety of LESS during the system's transition, building on our formalization of semantic preservation from the previous goal. Consider Fig. <ref type="figure" target="#fig_0">1</ref>, which depicts the refactoring of traditional software systems. Here, 𝑃𝑟𝑜𝑔 1 represents an (automated) refactoring that takes as input a program 𝑃𝑟𝑜𝑔 2 to be refactored and produces a refactored program 𝑃𝑟𝑜𝑔 ′ 2 . Note that 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 are source code, i.e., textual descriptions. We assume that 𝑃𝑟𝑜𝑔 1 is a nontrivial refactoring, i.e., that 𝑃𝑟𝑜𝑔 2 ≠ 𝑃𝑟𝑜𝑔 ′ 2 . As refactoring typically deals with real-world languages with non-trivial semantics, the semantic equivalence of 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 is normally assessed empirically by executing 𝑃𝑟𝑜𝑔 2 's test suites and comparing the results. Thus, to evaluate the refactoring 𝑃𝑟𝑜𝑔 1 , 𝐼𝑛𝑝𝑢𝑡 is fed to both 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 for all test suite inputs. The 𝑂𝑢𝑡𝑝𝑢𝑡 is then compared-ideally, all tests have the same results before and after the refactoring. If so, then 𝑂𝑢𝑡𝑝𝑢𝑡 = 𝑂𝑢𝑡𝑝𝑢𝑡 ′ , and 𝑃𝑟𝑜𝑔 1 is considered validated. Otherwise, 𝑂𝑢𝑡𝑝𝑢𝑡 ≠ 𝑂𝑢𝑡𝑝𝑢𝑡 ′ , meaning there is a bug <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref> in the system. Since traditional software is typically deterministic and its logic is not driven by dynamic data models, the process works in a relatively straightforward fashion. In fact, the larger the test suite, the greater the confidence that the refactoring works. 
<ref type="foot" target="#foot_1">2</ref> On the other hand, given the non-deterministic intricacies inherent in LESS, the traditional refactoring process as described in Fig. <ref type="figure" target="#fig_0">1</ref> is insufficient. Consequently, we construct an auxiliary diagram, Fig. <ref type="figure" target="#fig_1">2</ref>, that facilitates a more direct and nuanced evaluation of the transformations. Now consider Fig. <ref type="figure" target="#fig_1">2</ref>, representing ReLESS with citations of related work in the supervised learning context. <ref type="foot" target="#foot_2">3</ref> Here, 𝑃𝑟𝑜𝑔 1 represents an (automated) refactoring that takes as input an ML algorithm 𝑃𝑟𝑜𝑔 2 to be refactored and produces a refactored ML algorithm 𝑃𝑟𝑜𝑔 ′ 2 . Note that 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 are now ML algorithms, i.e., source code. We again assume that 𝑃𝑟𝑜𝑔 1 is a non-trivial refactoring, i.e., that 𝑃𝑟𝑜𝑔 2 ≠ 𝑃𝑟𝑜𝑔 ′ 2 . To evaluate the refactoring 𝑃𝑟𝑜𝑔 1 , two steps are taken: (a) a training dataset is fed to both 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 , which then produce the compiled, aka trained, ML models 𝑃𝑟𝑜𝑔 3 and 𝑃𝑟𝑜𝑔 ′ 3 , respectively; (b) an evaluation (testing) dataset is fed to both 𝑃𝑟𝑜𝑔 3 and 𝑃𝑟𝑜𝑔 ′ 3 . One or more such datasets (both training and evaluation) may be used. The 𝑂𝑢𝑡𝑝𝑢𝑡-in this case, predictions or classifications-is then compared. If 𝑃𝑟𝑜𝑔 1 results in no accuracy loss, then 𝑂𝑢𝑡𝑝𝑢𝑡 = 𝑂𝑢𝑡𝑝𝑢𝑡 ′ . Otherwise, 𝑂𝑢𝑡𝑝𝑢𝑡 ≠ 𝑂𝑢𝑡𝑝𝑢𝑡 ′ , meaning 𝑃𝑟𝑜𝑔 1 causes some accuracy loss when refactoring 𝑃𝑟𝑜𝑔 2 . Note that, unlike in the traditional refactoring evaluation case, whether there is a bug in 𝑃𝑟𝑜𝑔 1 in this situation is not straightforward to determine and is not a topic of focus in this paper. Because LESS can be non-deterministic and has logic that is driven by dynamic data models, whether 𝑃𝑟𝑜𝑔 1 is considered valid may depend on multiple factors. For instance, if the accuracy loss is within a certain threshold, then 𝑃𝑟𝑜𝑔 1 may be considered valid. 
If the accuracy loss is above the threshold, then 𝑃𝑟𝑜𝑔 1 may be considered invalid.</p><p>A supplementary contribution of our proposed framework is that it has an additional layer where both the transformation and the output comparison can occur. For instance, there is a dashed line in Fig. <ref type="figure" target="#fig_1">2</ref> from 𝑃𝑟𝑜𝑔 3 to 𝑃𝑟𝑜𝑔 1 and 𝑃𝑟𝑜𝑔 1 to 𝑃𝑟𝑜𝑔 ′ 3 , indicating that the program transformation can also take place on the trained ML models. In the traditional setting (Fig. <ref type="figure" target="#fig_0">1</ref>), because the transformation is not source-to-source, it would not be considered a refactoring in the traditional sense but instead viewed as a compiler optimization. However, in the LESS context, ML algorithms are typically written in interpreted languages (e.g., Python), where a compiler is not involved. Because the model training (compilation) process can be lengthy (days or even weeks) depending on the dataset size, transforming the ML algorithm to produce a new ML model as part of the refactoring process can be time-consuming <ref type="bibr" target="#b43">[44]</ref>. Instead, it may be advantageous in this context to perform the refactoring at the testing level to avoid retraining. Such a "refactoring" is done on LESS by Pan and Rajan <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b44">45]</ref>. Although the transformation is on the trained ML model, their goal of enhanced modularity is a classical refactoring outcome.</p></div>
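The traditional validation loop of Fig. 1 can be sketched in a few lines. Here `prog2`, `prog2_prime`, and the test suite are hypothetical stand-ins for the programs and inputs in the figure, not code from any cited system.

```python
# Minimal sketch of the traditional refactoring check from Fig. 1:
# feed every test-suite Input to the original and refactored program
# and require identical Outputs. prog2 / prog2_prime are illustrative.

def prog2(xs):
    """Original program: sum of squares, written with an explicit loop."""
    total = 0
    for v in xs:
        total += v * v
    return total

def prog2_prime(xs):
    """Refactored program: same semantics, expressed declaratively."""
    return sum(v * v for v in xs)

def validate_refactoring(original, refactored, test_inputs):
    """True iff Output equals Output' for every test-suite input."""
    return all(original(i) == refactored(i) for i in test_inputs)

test_suite = [[], [1], [1, 2, 3], [-4, 5, 0]]
print(validate_refactoring(prog2, prog2_prime, test_suite))  # True
```

As the paper notes, this only builds empirical confidence: a larger test suite strengthens, but never proves, semantic equivalence.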
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Determination of Semantic Equivalence</head><p>Our objective is ultimately to build a tool, where users provide original code (the old system), that determines which refactorings (new systems) would satisfy semantic equivalence. We identify different levels at which this could occur: (a) the ML algorithm level (case 1), and (b) the ML model level (case 2). We will demonstrate how existing works assess semantic equivalence from a single-lens point of view. Drawing on these observations, our approach will instead create a multi-objective evaluation (rather than the single-objective function used by the current state-of-the-art).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Case 1: Semantic Equivalence at the ML Algorithm Level</head><p>𝑃𝑟𝑜𝑔 2 = 𝑃𝑟𝑜𝑔 ′ 2 : This equivalence implies that 𝑃𝑟𝑜𝑔 3 and 𝑃𝑟𝑜𝑔 ′ 3 are also semantically equivalent, as shown in Fig. <ref type="figure" target="#fig_1">2</ref>. In this case, 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 are semantically equivalent since 𝑂𝑢𝑡𝑝𝑢𝑡 = 𝑂𝑢𝑡𝑝𝑢𝑡 ′ . However, the average training time (in hours) of the model for this refactoring in Chen et al. <ref type="bibr" target="#b10">[11]</ref> increases from 0.017 to 0.023 for Lenet1 and from 7.08 to 14.979 in the case of ModelX. Their approach also incurs higher storage overhead for 𝑃𝑟𝑜𝑔 ′ 3 (due to random seed recording). Such an approach hinders model generalization to unseen data, as the training process no longer explores various possibilities; this constrains the robustness of an ML model. Deterministic methods are also more susceptible to overfitting, as models can memorize the training data too closely, limiting their performance on new data. Lastly, ensuring complete determinism can be computationally expensive and challenging, especially in complex, multi-threaded, or distributed computing environments. This work highlights the tension between semantic preservation and model optimization.</p></div>
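The seed-recording idea behind this determinism control can be sketched with Python's standard `random` module. The `init_weights` function and the recorded seed value are illustrative stand-ins, not the cited authors' implementation, which also handles hardware non-determinism.

```python
import random

# Illustrative sketch of seed recording for deterministic replay: fix
# and store the seed so a re-run reproduces the exact same stochastic
# choices (here, a toy weight initialization stands in for training).

def init_weights(n, seed):
    """Toy stochastic step: n uniform weights drawn from a seeded RNG."""
    rng = random.Random(seed)  # the seed is recorded alongside the model
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

recorded_seed = 42  # stored so the refactored system can replay the original
w_original = init_weights(4, recorded_seed)
w_replayed = init_weights(4, recorded_seed)

# Determinism: identical seeds yield identical "training" randomness --
# at the cost of the exploration that, as argued above, aids robustness.
print(w_original == w_replayed)  # True
```

The recorded seed is precisely the storage overhead noted above: every source of randomness must be captured for the replay to be exact.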
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Case 2: Semantic Equivalence at the ML Model Level</head><p>𝑃𝑟𝑜𝑔 3 = 𝑃𝑟𝑜𝑔 ′ 3 : This means that the trained ML models are the same. It follows again that 𝑂𝑢𝑡𝑝𝑢𝑡 = 𝑂𝑢𝑡𝑝𝑢𝑡 ′ , and 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 are semantically equivalent in the traditional sense. As we are considering non-trivial refactorings as discussed earlier, we assume 𝑃𝑟𝑜𝑔 2 ≠ 𝑃𝑟𝑜𝑔 ′ 2 , meaning that the refactoring 𝑃𝑟𝑜𝑔 1 has made some non-trivial transformation. An example of such a transformation would be to enhance the run-time performance of the training; the trained model would be the same but the training process would be faster. For instance, Castro Vélez et al. <ref type="bibr" target="#b45">[46]</ref> show that the 𝑃𝑟𝑜𝑔 ′ 3 run-time is ∼9.22 seconds faster than 𝑃𝑟𝑜𝑔 3 by applying a hybrid training technique in imperative Deep Learning (DL) programs. In TensorFlow 2 <ref type="bibr" target="#b46">[47]</ref>, for example, the tf.function decorator can be applied to certain (model) Python functions found in imperative code to speed up the training process. Developers and scientists, then, can write natural, debuggable DL code in an imperative style while retaining the run-time performance typically found in legacy DL frameworks that support deferred-execution style programming models. Applying tf.function to (otherwise eagerly-executed) imperative DL code can be-if done correctly-a semantics-preserving refactoring <ref type="bibr" target="#b45">[46]</ref>.</p><p>𝑃𝑟𝑜𝑔 3 ≠ 𝑃𝑟𝑜𝑔 ′ 3 : This means that the trained models are not the same. It follows that it is possible that 𝑂𝑢𝑡𝑝𝑢𝑡 ≠ 𝑂𝑢𝑡𝑝𝑢𝑡 ′ , meaning that 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 may not be semantically equivalent in the traditional sense. There are several situations that may occur here, e.g., (i) different hyperparameters are used. 
(ii) hybridization is misused, resulting in semantically inequivalent code <ref type="bibr" target="#b45">[46]</ref>, (iii) 𝑃𝑟𝑜𝑔 ′ 3 may be an optimized DL model, e.g., having fewer edges, being more modular, and avoiding over-fitting. In Fig. <ref type="figure" target="#fig_1">2</ref>, 𝑃𝑟𝑜𝑔 ′ 3 represents a modular and refactored system derived from 𝑃𝑟𝑜𝑔 3 via 𝑃𝑟𝑜𝑔 1 , where semantics is preserved through separation of concerns, such as using supervised classification labels for maintenance and reduced model training time <ref type="bibr" target="#b11">[12]</ref>. This indicates that 𝑃𝑟𝑜𝑔 ′ 3 does better than 𝑃𝑟𝑜𝑔 3 with respect to ReLESS optimization while preserving the potential to explore generalizability and scalability.</p><p>Our analysis not only sheds light on the current state-of-the-art but also establishes a linkage between program transformation techniques and their operational viability in scenarios where safety is of paramount concern. We then use these observations to formally define semantic preservation in LESS using a multi-objective optimization function rather than a single-objective one.</p></div>
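One hedged way to operationalize the model-level check 𝑃𝑟𝑜𝑔 3 = 𝑃𝑟𝑜𝑔 ′ 3 is to compare trained parameters within a numerical tolerance. The function and toy weight vectors below are illustrative assumptions, not part of the cited works, which compare models at the output level as well.

```python
# Hedged sketch: deciding model-level equivalence (Prog3 vs. Prog3') by
# comparing trained parameters within a tolerance. Weight vectors are toy
# data; real checks would also compare architectures and outputs.

def models_equivalent(weights_a, weights_b, tol=1e-6):
    """True iff every corresponding parameter differs by at most tol."""
    if len(weights_a) != len(weights_b):
        return False  # e.g., Prog3' has fewer edges after optimization
    return all(abs(a - b) <= tol for a, b in zip(weights_a, weights_b))

prog3 = [0.5, -1.25, 3.0]                # original trained parameters
prog3_same = [0.5, -1.25, 3.0 + 1e-9]    # same model via a faster training path
prog3_pruned = [0.5, -1.25]              # optimized model with fewer edges

print(models_equivalent(prog3, prog3_same))    # True:  Prog3 = Prog3'
print(models_equivalent(prog3, prog3_pruned))  # False: Prog3 != Prog3'
```

In the first case the refactoring improved training run-time while leaving the model unchanged; in the second, equivalence must instead be argued at the output level, as ReLESS proposes.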
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Semantic Preservation: Formal Definition and Verification Metrics</head><p>We first define the semantic preservation of LESS based on varying ranges of the output. The Venn diagram in Fig. <ref type="figure" target="#fig_2">3</ref> shows the outputs from the original code and the proposed ReLESS. The upper circle, in blue, is the output from the original code, e.g., the probability of correct labels for a classification or prediction task. The lower circle, in yellow, is the output from ReLESS. This diagram examines where the two outputs are equivalent (the overlapping area) and where they differ. Suppose 𝛿 is the acceptable range of overlap, i.e., how much developers/engineers/scientists are willing to trade accuracy for other factors, viz. robustness, run-time performance, and interpretability. Ideally, the overlapping area should be as large as possible, but this is not always the case and is application-dependent. For instance, if the system is time-critical, then response time is emphasized in the optimization even though there are marginal accuracy losses.</p><p>If the system is safety-critical, then the accuracy should be preserved as much as possible. That said, we posit that to achieve semantic preservation in ReLESS, it is inadequate to consider accuracy as the sole optimization metric.</p><p>Prior works <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b37">38]</ref> have formalized the balance between accuracy and reliability/robustness and fairness. OBrien et al. <ref type="bibr" target="#b47">[48]</ref> define non-functional LESS metrics as run-time performance (speed), security, privacy, and memory (storage). Building upon these foundational studies, we extend the evaluation framework for semantic preservation to explicitly encompass safety as an overarching theme. Run-time performance, as highlighted by OBrien et al. 
<ref type="bibr" target="#b47">[48]</ref>, serves not only as a measure of efficiency but also influences system safety by ensuring timely responses in critical scenarios. Robustness, as documented by Hu et al. <ref type="bibr" target="#b12">[13]</ref>, is directly linked to safety, reflecting the system's capacity to withstand errors and adversities. Finally, interpretability, introduced by <ref type="bibr" target="#b35">[36]</ref>, enhances safety by providing clarity on decision-making processes, thereby allowing for greater accountability and easier identification of potential safety breaches. These three metrics collectively forge a more resilient and safety-conscious framework for assessing semantic preservation in LESS.</p><p>This tailored approach allows for a more integrated and holistic assessment of LESS, aligning closely with contemporary LESS development and deployment needs. None of the three transformations changes LESS's external behavior (semantics) <ref type="bibr" target="#b48">[49]</ref>. The formal notation combines accuracy with these non-functional metrics through customized importance factors, letting the engineers, scientists, and researchers who work on a model specify which properties it should emphasize. We propose a multi-objective optimization function, akin to the approach of Nguyen et al. <ref type="bibr" target="#b37">[38]</ref>, to determine the difference (loss function) between a LESS and its corresponding ReLESS. We argue that if the loss is below a certain threshold with constraints (as discussed in Fig. <ref type="figure" target="#fig_2">3</ref>), then semantic preservation is maintained.</p><p>As one of the state-of-the-art formal methods, optimization via loss functions is central to the training of ML/DL models <ref type="bibr" target="#b37">[38]</ref>. It is recognized for its adaptability to a wide range of applications. 
Because different trade-offs arise when refactoring ML/DL systems, we construct a multi-objective optimization function. Moreover, optimization standardizes each metric term that must be balanced against accuracy in the loss function, making the whole system understandable to the target audience. This optimization applies to models ranging from classical ML (random forest, gradient boosting) to DNNs, with supporting libraries, e.g., auto-sklearn <ref type="bibr" target="#b49">[50]</ref> and AutoKeras <ref type="bibr" target="#b50">[51]</ref>.</p></div>
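To make the δ-tolerance criterion of Fig. 3 concrete, a minimal sketch follows. The function and parameter names (`semantics_preserved`, `delta`) are illustrative assumptions of ours, not the actual ReLESS implementation; it treats semantic preservation as the disagreement rate between the original and refactored models staying within the tolerance δ:

```python
def semantics_preserved(output_old, output_new, delta):
    """Fig. 3 criterion (sketch): the disagreement rate between the
    original LESS outputs and the refactored ReLESS outputs must not
    exceed the application-dependent tolerance delta."""
    assert len(output_old) == len(output_new)
    disagreements = sum(1 for a, b in zip(output_old, output_new) if a != b)
    return disagreements / len(output_old) <= delta

# Identical label sequences trivially preserve semantics.
print(semantics_preserved([1, 0, 1, 1], [1, 0, 1, 1], delta=0.0))  # True
# One disagreement in four instances exceeds a 10% tolerance.
print(semantics_preserved([1, 0, 1, 1], [1, 0, 0, 1], delta=0.1))  # False
```

For a safety-critical system, δ would be set near zero, while a time-critical system might accept a larger δ in exchange for faster responses.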
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Accuracy, Run-time Performance, Robustness, and Interpretability</head><p>To comprehensively evaluate the performance of ReLESS, we consider accuracy alongside three key loss-based metrics: run-time performance, robustness, and interpretability. a. ACCuracy (ACC) is the fraction of correct outputs over the total number of instances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>𝐴𝐶𝐶 = (𝑇𝑃 + 𝑇𝑁) / (𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁)</head><p>where, in Eq. (3), 𝒟 is the input dataset, 𝑥 is the training data, and 𝑦 is the set of corresponding labels for a supervised learning task, such as image classification. Similar to RTPI, ROBI captures the observed difference in the loss function between the old and new models. Our definition of robustness is based on <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b51">52]</ref>, where a refactored system's robustness is verified by its loss function after adversarial training; for classical ML, robustness after refactoring can also be verified by the loss function after data augmentation, feature engineering, and ensemble learning.</p><p>Molnar <ref type="bibr" target="#b52">[53]</ref> defines interpretability in machine learning using interpretable models and a simplified loss function. The loss function serves as a quantitative measure to compare the interpretability of different models while maintaining accuracy. This approach blends a conceptual understanding of model behavior (through interpretable models) with a practical, measurable way (using the loss function) to assess and compare the clarity and comprehensibility of different models. We compute this metric using this definition of interpretability on a subset of 𝒟. We can then compare the difference between the new and old models and their corresponding interpretability scores <ref type="bibr" target="#b35">[36]</ref>. The implication is that a lower simplified loss on the explainable refactored system correlates with higher interpretability, which is plausible, but the exact method of determining explainability is essential here.</p><p>To sum up, we define a multi-objective optimization loss function that facilitates balancing the importance of various objectives depending on the application domain. 
Each metric is formulated as a ratio or a normalized value, which is typical in performance evaluation to provide a standardized measure of improvement or degradation. In the equation below, each metric term is the loss measurement calculated from Eqs. ( <ref type="formula">2</ref>) to (4) for RTPI, ROBI, and INTI, respectively, and accuracy is ACC; 𝑓𝜃 is the model with its parameters 𝜃, and 𝒟 is the dataset. Each term's weight coefficient 𝜔 𝑖 is assumed to be user-defined and indicates the importance of each of the metrics during model evaluation.</p><formula xml:id="formula_2">𝑚𝑖𝑛 𝕃(𝑓𝜃, 𝒟) = 𝜔 1 × ACC + 𝜔 2 × RTPI + 𝜔 3 × ROBI + 𝜔 4 × INTI (5)</formula><p>where 𝜔 1 , 𝜔 2 , 𝜔 3 , and 𝜔 4 are weights that reflect the importance of each term in the loss function.</p><p>The multi-objective optimization function in this formalism enables the determination of whether a ReLESS is a semantically preserving transformation of its LESS. Moreover, when fusing these measurements, it is essential to include the measure of accuracy because, regardless of the importance of speed of operation, robustness, and interpretability, producing correct outputs is the cornerstone of model evaluation. In other words, accuracy is always a first-class objective. Only by considering the critical role of accuracy can we ensure that a model is trustworthy <ref type="bibr" target="#b53">[54]</ref>.</p><p>Expanding on the conceptual structure presented in the preceding section, we describe a preliminary experimental configuration intended to closely assess the accuracy, run-time performance, robustness, and interpretability of LESS.</p></div>
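The metric definitions in Eqs. (1)-(4) and their weighted combination in Eq. (5) can be sketched as follows. All function names and the sample measurements are ours, for illustration only, and the weights ω_i are user-supplied as described above:

```python
def acc(tp, tn, fp, fn):
    # Eq. (1): fraction of correctly classified instances.
    return (tp + tn) / (tp + tn + fp + fn)

def rtpi(rtp_old, rtp_new):
    # Eq. (2): relative run-time improvement of the refactored code.
    return (rtp_old - rtp_new) / rtp_old

def robi(losses_old, losses_new):
    # Eq. (3): average over instances of 1 minus the relative loss change.
    n = len(losses_old)
    return sum(1 - (ln - lo) / lo for lo, ln in zip(losses_old, losses_new)) / n

def inti(losses_explainable):
    # Eq. (4): mean loss of the explainable model on a data subset.
    return sum(losses_explainable) / len(losses_explainable)

def multi_objective_loss(metrics, weights):
    # Eq. (5): weighted sum of the ACC, RTPI, ROBI, and INTI terms.
    return sum(w * m for w, m in zip(weights, metrics))

# Example with made-up measurements and equal user-defined weights.
terms = [acc(90, 85, 10, 15),          # 0.875
         rtpi(10.0, 5.0),              # 0.5: the new code is twice as fast
         robi([1.0, 2.0], [0.5, 1.0]), # per-instance adversarial losses
         inti([0.2, 0.4])]
print(multi_objective_loss(terms, [0.25, 0.25, 0.25, 0.25]))
```

In practice, the weights would be chosen per application, e.g., a larger ω for RTPI in a time-critical system, before comparing the combined value against the acceptance threshold.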
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head><p>This section describes the experimental setup employed for a preliminary evaluation of the proposed ReLESS framework for a simple case study. We describe the datasets used for experiments, followed by an explanation of the experimental design and the metrics adopted to assess the efficacy of the refactorings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Datasets and Models</head><p>As indicated in Section 2, we study ReLESS in the context of two image classification datasets: ImageNet and MNIST. The ImageNet dataset <ref type="bibr" target="#b41">[42]</ref>, comprising 1.2 million images across 1000 categories, is used to assess reliability and robustness, with a specific subset of 50,000 images filtered from <ref type="bibr" target="#b42">[43]</ref>. The MNIST dataset <ref type="bibr" target="#b40">[41]</ref>, containing 60,000 training images and 10,000 test images of handwritten digits, serves as the basis for initial evaluations. These datasets enable preliminary assessments of the refactorings' effectiveness before proceeding to more complex scenarios. Our experimental models include fully connected neural networks with 1 to 4 layers for the MNIST dataset, and pre-trained complex architectures such as AlexNet <ref type="bibr" target="#b54">[55]</ref>, ResNet50 <ref type="bibr" target="#b55">[56]</ref>, VGG16 <ref type="bibr" target="#b56">[57]</ref>, and GoogleNet <ref type="bibr" target="#b57">[58]</ref> for the ImageNet dataset.</p></div>
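As a rough sketch of the simplest models in this setup, a fully connected MNIST network can be expressed as a stack of weight matrices. The helper names, the hidden width of 128, and the reading of "1 to 4 layers" as the number of hidden layers are our illustrative assumptions, not the code used in the experiments:

```python
import numpy as np

def build_fcn(depth, in_dim=784, hidden=128, out_dim=10, seed=0):
    """Build a fully connected network with `depth` hidden layers
    (1 to 4 in the MNIST experiments) as random weight matrices."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * depth + [out_dim]
    return [rng.standard_normal((a, b)) * 0.01 for a, b in zip(dims, dims[1:])]

def forward(layers, x):
    # Plain forward pass with ReLU activations between hidden layers.
    for w in layers[:-1]:
        x = np.maximum(x @ w, 0.0)
    return x @ layers[-1]

# A batch of 32 flattened 28x28 images maps to 10 class scores.
batch = np.zeros((32, 28 * 28))
print(forward(build_fcn(depth=3), batch).shape)  # (32, 10)
```

Varying `depth` from 1 to 4 reproduces the range of model sizes whose refactored variants are compared in Table 1.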
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Experiment and Results</head><p>In our experimental setup, we applied the methodologies outlined by Pan and Rajan <ref type="bibr" target="#b11">[12]</ref> and Hu et al. <ref type="bibr" target="#b12">[13]</ref>, along with the techniques detailed in Section 3, across both datasets to scrutinize the refactored systems with respect to accuracy, run-time performance, robustness, and interpretability. The results of the experiments are summarized in Table <ref type="table" target="#tab_0">1</ref>.</p><p>From Table <ref type="table" target="#tab_0">1</ref>, we observe that the refactored models exhibit a marginal decrease in accuracy on the MNIST dataset, with a difference of 0.0001. This decrease is attributed to the expanded modular complexity, which also results in a run-time increase of 414.7 seconds. The modularity of the refactored model is significantly higher than that of the original model, with a difference of 8. The robustness of the refactored model is also higher, with a difference of 2.0476. The interpretability of the refactored model is higher, with an accuracy difference of 0.0769. Increases in both metrics indicate that the refactored system exhibits improved robustness and interpretability after decomposition. However, although robustness has improved, the accuracy of the refactored systems on the ImageNet dataset has decreased, falling below that of a coin flip. Therefore, modularity appears not only to be harmless but also beneficial to system safety, as it maintains accuracy and improves robustness. However, for the optimization of the aforementioned complex systems, more effort is required to prevent accuracy loss, particularly in safety-critical tasks. More details can be found at https://github.com/NanJ90/ReLess-testing-tool.</p><p>To summarize, we present an initial assessment of the ReLESS evaluation framework and describe the datasets used, the experimental design, and the metrics for evaluating refactorings. 
The comparative analysis of original and refactored models reveals that different datasets and models can exhibit significant variations across performance metrics. For instance, while the performance of the ImageNet model remained relatively consistent after refactoring, the modularized MNIST model took 168 times longer than the original. This underscores the critical importance of evaluating effects across multiple datasets and models to gain comprehensive insights into performance implications w.r.t. accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future Work</head><p>Our contribution in this work includes a review of literature focused on refactoring in LESS, with particular emphasis on safety considerations. This review critically analyzes the spectrum of assessments presented across various studies, each contributing to a facet of the AI safety standard. We further explore and elucidate the interrelationships between these safety metrics and the accuracy of AI systems, highlighting the implications for model development and deployment. Our preliminary results set a potential foundation to help drive the long-term evolution and robustness of LESS, qualities traditionally enjoyed by conventional systems during development and deployment, and thereby improve the safety of LESS. The scientists and engineers who develop AI systems will be able to rely on the refactored systems and trust them to make decisions that are safe, secure, and trustworthy. Future work includes understanding how the thresholds in Fig. <ref type="figure" target="#fig_2">3</ref> will be determined for various applications and how the user can determine the weights for the various metrics. We have described an initial validation of our framework; however, further experimentation that includes more metrics, such as fairness and privacy, and extends the validation to a variety of problem domains and case studies is essential to comprehensively assess its effectiveness and generalizability. This would also enable practitioners to prioritize specific components when evaluating LESS and could even lead to design-to-criteria LESS.</p></div>
𝑂𝑢𝑡𝑝𝑢𝑡 and 𝑂𝑢𝑡𝑝𝑢𝑡 ′ are the outputs of 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 , respectively, given 𝐼 𝑛𝑝𝑢𝑡. Assume 𝑃𝑟𝑜𝑔 2 ≠ 𝑃𝑟𝑜𝑔 ′ 2 .</figDesc><graphic coords="3,87.57,610.34,180.48,103.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Refactoring LESS. 𝑃𝑟𝑜𝑔 1 is the refactoring. 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 are the ML algorithm to be refactored and the refactored ML algorithm, respectively. Assume 𝑃𝑟𝑜𝑔 2 ≠ 𝑃𝑟𝑜𝑔 ′ 2 . 𝑃𝑟𝑜𝑔 3 and 𝑃𝑟𝑜𝑔 ′ 3 are the outputs (trained models) of 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 , respectively, given a training set. 𝑂𝑢𝑡𝑝𝑢𝑡 and 𝑂𝑢𝑡𝑝𝑢𝑡 ′ are the outputs (predictions/classifications) of 𝑃𝑟𝑜𝑔 3 and 𝑃𝑟𝑜𝑔 ′ 3 , respectively, given a testing set.</figDesc><graphic coords="3,309.59,65.61,225.63,175.62" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: 𝑂𝑢𝑡𝑝𝑢𝑡 ′ \ 𝑂𝑢𝑡𝑝𝑢𝑡. 𝑂𝑢𝑡𝑝𝑢𝑡 and 𝑂𝑢𝑡𝑝𝑢𝑡 ′ are supervised classification tasks' labels from the original and refactored models, respectively. |𝛿| = |𝑂𝑢𝑡𝑝𝑢𝑡 ′ \ 𝑂𝑢𝑡𝑝𝑢𝑡|.</figDesc><graphic coords="5,87.57,567.21,180.48,147.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>𝑇𝑃 and 𝑇𝑁 are the numbers of positive and negative instances correctly classified, and 𝐹𝑃 and 𝐹𝑁 are the numbers of instances incorrectly classified. b. Run-Time Performance Improvement (RTPI) is determined by comparing the observed run-times of the original (old) code and the new (transformed) code. 𝑅𝑇𝑃𝐼 = (𝑅𝑇𝑃_𝑜𝑙𝑑 − 𝑅𝑇𝑃_𝑛𝑒𝑤) / 𝑅𝑇𝑃_𝑜𝑙𝑑 (2) c. ROBustness Improvement is indicated as ROBI. 𝑅𝑂𝐵𝐼 = (1/|𝒟|) Σ_{𝑥,𝑦∈𝒟} (1 − (𝐿𝑜𝑠𝑠_𝑛𝑒𝑤(𝑦, 𝑓) − 𝐿𝑜𝑠𝑠_𝑜𝑙𝑑(𝑦, 𝑓)) / 𝐿𝑜𝑠𝑠_𝑜𝑙𝑑(𝑦, 𝑓)) (3)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>d. INTerpretability Improvement is indicated as INTI. 𝐼𝑁𝑇𝐼 = (1/|𝒟_𝑠𝑢𝑏𝑠𝑒𝑡|) Σ_{𝑥,𝑦∈𝒟_𝑠𝑢𝑏𝑠𝑒𝑡} 𝐿𝑜𝑠𝑠(𝑦, 𝑓_𝑒𝑥𝑝𝑙𝑎𝑖𝑛𝑎𝑏𝑙𝑒(𝑥)) (4)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Performance comparison of the original and refactored models on the MNIST and ImageNet datasets.</figDesc><table><row><cell>Metric</cell><cell>Original</cell><cell>MNIST Refactored</cell><cell>Difference</cell><cell>Original</cell><cell>ImageNet Refactored</cell><cell>Difference</cell></row><row><cell>Accuracy</cell><cell>0.9491</cell><cell>0.9490</cell><cell>0.0001</cell><cell>0.7948</cell><cell>&lt;0.5</cell><cell>&gt;0.2948</cell></row><row><cell>Run-time (seconds)</cell><cell>4.2</cell><cell>419.3</cell><cell>414.7</cell><cell>163.7</cell><cell>174</cell><cell>10.3000</cell></row><row><cell>Robustness</cell><cell>0.1744</cell><cell>2.2220</cell><cell>2.0476</cell><cell>0.9453</cell><cell>10.0754</cell><cell>9.1301</cell></row><row><cell>Interpretability (accuracy)</cell><cell>0.7882</cell><cell>0.8651</cell><cell>0.0769</cell><cell>&lt;0.5</cell><cell>&lt;0.5</cell><cell>0</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://paperswithcode.com/paper/omnivec-learning-robustrepresentations-with</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Traditional software may be concurrent, potentially experiencing race conditions, or may rely on its (changing) environment. In such cases, "flaky" tests may arise, which would challenge refactoring validation. In this case, the test suites can be executed several times to identify stable tests.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">While our current investigation focuses on supervised learning, we plan to extend the framework to other types of learning (unsupervised, reinforcement) as part of our future work.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We thank Ayan Kohli for the initial investigation into semantic similarity and the anonymous reviewers for their helpful comments. This work is supported by the National Science Foundation (NSF) under Agreement No.CCF-2200343.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Software Engineering for AI-Based Systems: A Survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Martínez-Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bogner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Franch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Oriol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Siebert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Trendowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Vollmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wagner</surname></persName>
		</author>
		<idno type="DOI">10.1145/3487043</idno>
		<ptr target="http://arxiv.org/abs/2105.01984" />
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Software Engineering and Methodology</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="1" to="59" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The ML test score: A rubric for ML production readiness and technical debt reduction</title>
		<author>
			<persName><forename type="first">E</forename><surname>Breck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Nielsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Salib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sculley</surname></persName>
		</author>
		<idno type="DOI">10.1109/BigData.2017.8258038</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Big Data (Big Data)</title>
				<imprint>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="page" from="1123" to="1132" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<idno>ISO/IEC 14764</idno>
		<title level="m">Software Engineering - Software Life Cycle Processes - Maintenance, International Organizations for Standardization</title>
				<meeting><address><addrLine>Geneva, Switzerland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Refactoring sequential java code for concurrency via concurrent libraries</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Marrero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Ernst</surname></persName>
		</author>
		<idno type="DOI">10.1109/icse.2009.5070539</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="397" to="407" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Hidden technical debt in Machine Learning systems</title>
		<author>
			<persName><forename type="first">D</forename><surname>Sculley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Holt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Golovin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Davydov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Phillips</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ebner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-F</forename><surname>Crespo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dennison</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Information Processing Systems</title>
				<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="2503" to="2511" />
		</imprint>
	</monogr>
	<note>NIPS &apos;15</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Ariadne: Analysis for machine learning programs</title>
		<author>
			<persName><forename type="first">J</forename><surname>Dolby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shinnar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Allain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reinen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3211346.3211349</idno>
	</analytic>
	<monogr>
		<title level="m">International Workshop on Machine Learning and Programming Languages, MAPL 2018</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<ptr target="https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/" />
		<title level="m">Executive order on the safe, secure, and trustworthy development and use of artificial intelligence</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Madiega</surname></persName>
		</author>
		<ptr target="https://www.europarl.europa.eu/RegData/etudes/BRIE/2021/698792/EPRS_BRI(2021)698792_EN.pdf" />
		<title level="m">Artificial intelligence act, European Parliament: European Parliamentary Research Service</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Russell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Norvig</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-82681-9</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-82681-9" />
		<title level="m">Artificial Intelligence: A Modern Approach</title>
				<imprint>
			<publisher>Pearson</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>4 ed</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Robustness may be at odds with accuracy</title>
		<author>
			<persName><forename type="first">D</forename><surname>Tsipras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Santurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Engstrom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Turner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madry</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1805.12152</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Towards training reproducible deep learning models</title>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Rajbahadur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">M J</forename><surname>Jiang</surname></persName>
		</author>
		<idno type="DOI">10.1145/3510003.3510163</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering, ICSE &apos;22</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="2202" to="2214" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">On decomposing a deep neural network into modules</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rajan</surname></persName>
		</author>
		<idno type="DOI">10.1145/3368089.3409668</idno>
		<ptr target="https://dl.acm.org/doi/10.1145/3368089.3409668" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</title>
				<meeting>the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="889" to="900" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">If a Human Can See It, So Should Your System: Reliability Requirements for Machine Vision Components</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Marsso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Czarnecki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chechik</surname></persName>
		</author>
		<idno type="DOI">10.1145/3510003.3510109</idno>
		<idno type="arXiv">arXiv:2202.03930</idno>
		<ptr target="http://arxiv.org/abs/2202.03930" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 44th International Conference on Software Engineering</title>
				<meeting>the 44th International Conference on Software Engineering</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1145" to="1156" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Speed/accuracy trade-offs for modern convolutional object detectors</title>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Rathod</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Balan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">S</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wojna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guadarrama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">P</forename><surname>Murphy</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:206595627" />
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2016">2017. 2016</date>
			<biblScope unit="page" from="3296" to="3297" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Formal Specification for Deep Neural Networks</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Seshia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Desai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dreossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fremont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shivakumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vazquez-Chanlatte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yue</surname></persName>
		</author>
		<idno>UCB/EECS-2018-25</idno>
		<ptr target="http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-25.html" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
		<respStmt>
			<orgName>EECS Department, University of California, Berkeley</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Security versus accuracy: Trade-off data modeling to safe fault classification systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ge</surname></persName>
		</author>
		<idno type="DOI">10.1109/TNNLS.2023.3251999</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks and Learning Systems</title>
		<imprint>
			<biblScope unit="page" from="1" to="12" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">F</forename><surname>Opdyke</surname></persName>
		</author>
		<title level="m">Refactoring object-oriented frameworks</title>
				<meeting><address><addrLine>Champaign, IL, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1992">1992</date>
		</imprint>
		<respStmt>
			<orgName>University of Illinois at Urbana-Champaign</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Refactoring: Improving the Design of Existing Code</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fowler</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>Addison-Wesley</publisher>
			<pubPlace>Boston, MA, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">The birth of refactoring: A retrospective on the nature of high-impact software engineering</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">G</forename><surname>Griswold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">F</forename><surname>Opdyke</surname></persName>
		</author>
		<idno type="DOI">10.1109/MS.2015.107</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Software</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="30" to="38" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A field study of refactoring challenges and benefits</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zimmermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nagappan</surname></persName>
		</author>
		<idno type="DOI">10.1145/2393596.2393655</idno>
	</analytic>
	<monogr>
		<title level="m">Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</title>
				<meeting><address><addrLine>Cary, North Carolina</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">On preserving the behavior in software refactoring: A systematic mapping study</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Alomar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Mkaouer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ouni</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.infsof.2021.106675</idno>
	</analytic>
	<monogr>
		<title level="j">Information and Software Technology</title>
		<imprint>
			<biblScope unit="volume">140</biblScope>
			<biblScope unit="page">106675</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Safe automated refactoring for intelligent parallelization of Java 8 streams</title>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bagherzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ahmed</surname></persName>
		</author>
		<idno type="DOI">10.1109/icse.2019.00072</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering, ICSE &apos;19, ACM/IEEE</title>
				<meeting><address><addrLine>Piscataway, NJ, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="619" to="630" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Refactoring using type constraints</title>
		<author>
			<persName><forename type="first">F</forename><surname>Tip</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Fuhrer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kieżun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Ernst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Balaban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">De</forename><surname>Sutter</surname></persName>
		</author>
		<idno type="DOI">10.1145/1961204.1961205</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Programming Languages and Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page">47</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Automated refactoring of legacy Java software to enumerated types</title>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sawin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rountev</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSM.2007.4362635</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Maintenance, ICSM &apos;07</title>
				<meeting><address><addrLine>Paris, France</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="224" to="233" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Automated refactoring of legacy Java software to default methods</title>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Masuhara</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSE.2017.16</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering, ICSE &apos;17</title>
				<meeting><address><addrLine>Piscataway, NJ, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Press</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="82" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Proactive empirical assessment of new language feature adoption via automated refactoring: The case of Java 8 default methods</title>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Masuhara</surname></persName>
		</author>
		<idno type="DOI">10.22152/programming-journal.org/2018/2/6</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on the Art, Science, and Engineering of Programming, volume 2 of Programming &apos;18</title>
				<meeting><address><addrLine>AOSA, Nice, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page">30</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">A survey of software refactoring</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tourwé</surname></persName>
		</author>
		<idno type="DOI">10.1109/TSE.2004.1265817</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Software Engineering</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="126" to="139" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">On neural network equivalence checking using SMT solvers</title>
		<author>
			<persName><forename type="first">C</forename><surname>Eleftheriadis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kekatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Katsaros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tripakis</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-15839-1_14</idno>
	</analytic>
	<monogr>
		<title level="m">Formal Modeling and Analysis of Timed Systems</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">S</forename><surname>Bogomolov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Parker</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="237" to="257" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Understanding software-2.0: A study of machine learning library usage and evolution</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dilhara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ketkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dig</surname></persName>
		</author>
		<idno type="DOI">10.1145/3453478</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Software Engineering and Methodology</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Discovering repetitive code changes in Python ML systems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dilhara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ketkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sannidhi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dig</surname></persName>
		</author>
		<idno type="DOI">10.1145/3510003.3510225</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering, ICSE &apos;22</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="736" to="748" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">A large-scale empirical study on self-admitted technical debt</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bavota</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Russo</surname></persName>
		</author>
		<idno type="DOI">10.1145/2901739.2901742</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Mining Software Repositories, MSR &apos;16</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="315" to="326" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Managing technical debt in software-reliant systems</title>
		<author>
			<persName><forename type="first">N</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kazman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kruchten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maccormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ozkaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sangwan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Seaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sullivan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zazworka</surname></persName>
		</author>
		<idno type="DOI">10.1145/1882362.1882373</idno>
	</analytic>
	<monogr>
		<title level="m">FSE/SDP Workshop on Future of Software Engineering Research, FoSER &apos;10</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="47" to="52" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Christians</surname></persName>
		</author>
		<title level="m">Self-admitted technical debt: An investigation from farm to table to refactoring</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">An exploration of technical debt</title>
		<author>
			<persName><forename type="first">E</forename><surname>Tom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aurum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vidgen</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jss.2012.12.052</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Systems and Software</title>
		<imprint>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="page" from="1498" to="1516" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">An empirical study of refactorings and technical debt in Machine Learning systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bagherzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stewart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raja</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSE43902.2021.00033</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering, ICSE &apos;21, IEEE/ACM</title>
				<meeting><address><addrLine>Madrid, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="238" to="250" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">Semantic-preserving adversarial code comprehension</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhao</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2209.05130</idno>
		<idno type="arXiv">arXiv:2209.05130</idno>
		<ptr target="http://arxiv.org/abs/2209.05130" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">A distance-based weighting framework for boosting the performance of dynamic ensemble selection</title>
		<author>
			<persName><forename type="first">Z.-L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X.-G</forename><surname>Luo</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.ipm.2019.03.009</idno>
		<ptr target="https://www.sciencedirect.com/science/article/pii/S030645731830712X" />
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="page" from="1300" to="1316" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biswas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rajan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.09297</idno>
		<title level="m">Fix fairness, don&apos;t ruin accuracy: Performance aware fairness repair using automl</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<title level="m" type="main">Towards stable and efficient training of verifiably robust neural networks</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gowal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stanforth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Boning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-J</forename><surname>Hsieh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.06316</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Omnivec: Learning robust representations with cross modal sharing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sharma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</title>
				<meeting>the IEEE/CVF Winter Conference on Applications of Computer Vision</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1236" to="1248" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>LeCun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cortes</surname></persName>
		</author>
		<ptr target="http://yann.lecun.com/exdb/mnist/" />
		<title level="m">MNIST handwritten digit database</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">ImageNet Large Scale Visual Recognition Challenge</title>
		<author>
			<persName><forename type="first">O</forename><surname>Russakovsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Krause</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Satheesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Karpathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Khosla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11263-015-0816-y</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">115</biblScope>
			<biblScope unit="page" from="211" to="252" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note>IJCV</note>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dietterich</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1903.12261</idno>
		<title level="m">Benchmarking neural network robustness to common corruptions and perturbations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2106.07139</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:235421816" />
		<title level="m">Pre-trained models: Past, present and future</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">Decomposing convolutional neural networks into reusable and replaceable modules</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rajan</surname></persName>
		</author>
		<idno type="DOI">10.1145/3510003.3510051</idno>
		<idno type="arXiv">arXiv:2110.07720</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="524" to="535" />
		</imprint>
	</monogr>
	<note>ICSE &apos;22</note>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Challenges in migrating imperative deep learning programs to graph execution: An empirical study</title>
		<author>
			<persName><forename type="first">T</forename><surname>Castro Vélez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bagherzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raja</surname></persName>
		</author>
		<idno type="DOI">10.1145/3524842.3528455</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Mining Software Repositories, MSR &apos;22</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="469" to="481" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<title level="m" type="main">Better performance with tf.function</title>
		<ptr target="https://tensorflow.org/guide/function" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
		<respStmt>
			<orgName>Google LLC</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<monogr>
		<title level="m" type="main">23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software</title>
		<author>
			<persName><forename type="first">D</forename><surname>Obrien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biswas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Imtiaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Abdalkareem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Shihab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rajan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page">13</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">Refactoring: Current research and future trends</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Demeyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Du Bois</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Stenten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Van Gorp</surname></persName>
		</author>
		<idno type="DOI">10.1016/S1571-0661(05)82624-6</idno>
		<ptr target="https://www.sciencedirect.com/science/article/pii/S1571066105826246" />
	</analytic>
	<monogr>
		<title level="j">Electronic Notes in Theoretical Computer Science</title>
		<imprint>
			<biblScope unit="volume">82</biblScope>
			<biblScope unit="page" from="483" to="499" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
	<note>LDTA&apos;2003: Language Descriptions, Tools and Applications</note>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">Efficient and robust automated machine learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Feurer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Eggensperger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Springenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="2962" to="2970" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<analytic>
		<title level="a" type="main">AutoKeras: An AutoML library for deep learning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Chollet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v24/20-1355.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="1" to="6" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<analytic>
		<title level="a" type="main">Robustness and accuracy could be reconcilable by (proper) definition</title>
		<author>
			<persName><forename type="first">T</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yan</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:247011694" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b52">
	<monogr>
		<title level="m" type="main">Interpretable machine learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Molnar</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>Lulu.com</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b53">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Ao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Siddharthan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.03179</idno>
		<title level="m">Empirical optimal risk to quantify model trustworthiness for failure detection</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b54">
	<monogr>
		<title level="m" type="main">One weird trick for parallelizing convolutional neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1404.5997</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b55">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b56">
	<monogr>
		<title level="m" type="main">Very deep convolutional networks for large-scale image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.1556</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b57">
	<analytic>
		<title level="a" type="main">Going deeper with convolutions</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sermanet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Reed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Anguelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rabinovich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
