<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ReLESS: A Framework for Assessing Safety in Deep Learning Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nan Jia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anita Raja</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafi Khatchadourian</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CUNY, Hunter College</institution>
          ,
          <addr-line>695 Park Avenue, New York, NY 10065</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CUNY, the Graduate Center</institution>
          ,
          <addr-line>365 Fifth Avenue, New York, NY 10016</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Traditionally, software refactoring helps to improve a system's internal structure and enhance its non-functional features, such as reliability and run-time performance, while preserving external behavior, including the original program semantics. However, in the context of learning-enabled software systems (LESS), e.g., Machine Learning (ML) systems, it is unclear which portions of a software's semantics require preservation at the development phase. This is mainly because (a) the behavior of LESS is not defined until run-time and (b) ML algorithms are inherently iterative and non-deterministic. Consequently, there is a knowledge gap in what refactoring truly means in the context of LESS, as such systems have no guarantee of a predetermined correct answer. We thus conjecture that, to construct robust and safe LESS, it is imperative to understand the flexibility of refactoring LESS compared to traditional software and to measure it. In this paper, we introduce a novel conceptual framework named ReLESS for evaluating refactorings for supervised learning by (i) exploring the transformation methodologies taken by state-of-the-art LESS refactorings that focus on singular metrics, (ii) reviewing informal notions of semantics preservation and the level at which they occur (source code vs. trained model), and (iii) empirically comparing and contrasting existing LESS refactorings in the context of image classification problems. This framework will set the foundation to not only formalize a standard definition of semantics preservation in LESS but also combine four metrics, namely accuracy, run-time performance, robustness, and interpretability, into a multi-objective optimization function, instead of the single-objective functions used in existing works, to assess LESS refactorings. In the future, our work could seek reliable LESS refactorings that generalize over diverse systems.</p>
      </abstract>
      <kwd-group>
        <kwd>learning-enabled software systems</kwd>
        <kwd>machine learning systems</kwd>
        <kwd>refactoring</kwd>
        <kwd>trusted AI software architectures</kwd>
        <kwd>AI safety</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Developers of Learning-Enabled Software Systems (LESS)
face the challenge of constructing highly reliable large-scale
systems, as evidenced in previous research [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. With
the pervasive integration of dynamic Machine Learning
(ML) models in these operational software systems, safety,
efficiency, and adaptability with respect to evolving user
requirements become paramount. Moreover, software
systems inherently evolve throughout their life-cycle [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which
traditionally incurs substantial costs and risks, particularly
in the context of large, complex systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Although
LESS shares these traits with conventional software, its
data-driven nature accentuates the propensity for
evolution [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This divergence from traditional software poses
unique challenges for testing and verification due to its
data-driven and uncertain requirements. Notably, the
efficacy of resulting ML models, including Large Language
Models (LLMs), improves with more extensive data inputs,
necessitating a delicate balance between user privacy
protection and model refinement in large-scale systems.
Consequently, there arises a pressing need for validation and
testing methodologies tailored to the distinctive
characteristics of AI-driven systems.
      </p>
      <p>
        This evolving research agenda underscores a critical
reassessment of priorities in AI system development.
Furthermore, as AI technologies permeate various sectors of society,
scalable systems must effectively consider and adapt to legal,
policy, and employment implications. These technical
attributes not only underpin the functional aspects of AI
applications but also facilitate their alignment with essential
ethical standards and societal expectations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This imperative
is further underscored by a recent U.S. government-issued
Executive Order [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the EU AI Act [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], emphasizing
the necessity for Safe, Secure, and Trustworthy
Development and Use of Artificial Intelligence. Moreover, to ensure
the positive societal impact of AI systems, accuracy,
runtime performance, robustness, and interpretability are crucial
technical attributes that directly support broader ethical
objectives.
      </p>
      <p>
        Recent works [
        <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
        ] have highlighted a variety of metrics
for assessing the impacts of LESS transformation. These
metrics include aspects such as ensuring safety and fairness,
protecting privacy, fostering collaboration, considering legal
and policy ramifications, and evaluating impacts on
employment. Recent studies [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14">10, 11, 12, 13, 14</xref>
        ] have investigated
whether original and transformed systems should behave
consistently before and after transformation. These studies
illustrate the potential trade-offs between accuracy and each
respective metric. Although various metrics like fairness
and privacy are considered, in this work, we focus on
accuracy, run-time performance, robustness, and interpretability
as a starting point with the intent to cover the majority
of AI safety concerns in LESS [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
        ]. We argue that
comprehending and harnessing the flexibility of refactoring
in LESS represents a pivotal stride toward enhancing the
safety of AI systems.
      </p>
      <p>A detailed exposition of these metrics, as discussed in the
state-of-the-art literature, is provided in Section 2.</p>
      <p>
        Traditionally, the criterion for refactoring [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ] is that
the same input must produce the same output; any
deviation is considered a behavior change of the program and
a threat leading to system crash [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However,
refactoring is underexplored in the context of LESS, including deep
learning frameworks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. LESS, unlike traditional software
systems, benefit from randomness yet lack a guarantee
of a predefined exact outcome due to their reliance on the
quantity and quality of data, complicating predictions about
the effects of refactoring.
      </p>
      <p>
        This paper aims to bridge the knowledge gap between
refactoring practices in traditional software [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref23 ref24 ref25 ref26 ref27 ref4">4, 20, 21, 22, 23,
24, 25, 26, 27</xref>
        ], and LESS [
        <xref ref-type="bibr" rid="ref12 ref28 ref29 ref30">12, 28, 29, 30</xref>
        ] by introducing
ReLESS (Refactoring of Learning-Enabled Software Systems),
an evaluation framework for standardizing and
formalizing refactoring methodologies. In this work, we describe
this framework in the context of supervised learning tasks,
specifically image classification problems. Our
hypothesis posits that the criteria for successful refactoring—
namely source-to-source transformation and
semantic preservation—assume unique, yet complementary
implications in the context of LESS as opposed to
traditional software systems.
      </p>
      <p>Specifically, ReLESS will allow for the possibility that
transformations might produce outputs that are slightly
different from the original output as long as they lead to
improvements in other performance metrics of the system.
Determining how "different" the output can be from the
original is a research question we seek to address.
Moreover, our approach aims to discover and preserve
safety-critical metrics during ReLESS while further mitigating the
uncertainties introduced by their non-deterministic nature.
While current approaches emphasize knowledge
distillation (transferring knowledge from a large neural network
to a smaller, resource-efficient one) and regularization (a
technique for solving over-fitting), our vision for the future
of ReLESS includes approaches that combine connectionist
models (e.g., neural networks) and symbolic (e.g., decision
tree) approaches as well as Bayesian and analogizer (e.g.,
K-nearest neighbor, support vector machine) approaches.</p>
      <p>
        This paper is structured as follows: in Section 2 we first
provide a comprehensive analysis of state-of-the-art
refactoring methodologies in LESS and discuss how these works
trade off accuracy with respect to specific metrics such as
run-time performance, robustness [
        <xref ref-type="bibr" rid="ref11 ref13 ref14 ref16">11, 13, 14, 16</xref>
        ], or
interpretability. In Section 3, we contrast existing practices and
scrutinize informal notions of semantics preservation across
different levels (source code vs. trained model). We then
motivate a novel thread of inquiry for the ReLESS evaluation
framework and its multi-objective optimization function
that combines the aforementioned multiple metrics to
guarantee the AI system’s safety. Section 4 presents preliminary
experiments utilizing ReLESS to gauge LESS safety and
associated parameters. Finally, in Section 5 we discuss the main
insights gleaned from this work and our future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In recent years, various research has been conducted on
LESS refactoring, with significant observations in balancing
a single metric against model accuracy. Several studies
have focused on image classification or object detection,
addressing this tension and presenting innovative verification
techniques. However, these approaches often face limitations,
namely a lack of generalization and a narrow scope of metrics,
which we aim to address in our work.</p>
      <sec id="sec-2-1">
        <title>2.1. Refactoring Types in Software Development</title>
        <p>
          Refactoring [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], a well-known technique for the evolution
and maintenance of traditional software, alters a system’s
internal structure without changing its behavior [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] to
improve non-functional characteristics such as run-time
performance, security, and modularity, and to pay down
technical debt [
          <xref ref-type="bibr" rid="ref31 ref32 ref33 ref34 ref35">31, 32, 33, 34, 35</xref>
          ]. It can be considered as
a series of typically automatic procedures for modifying
code, such as variable name changes to enhance
comprehension [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], without an explicit focus on automated
refactoring, as these modifications frequently occur automatically
within a system-based environment. Formally, a refactoring
is a program transformation potentially spanning multiple,
non-adjacent program statements or expressions that is:
(i) source-to-source and (ii) semantics-preserving, i.e., the
behavior of the program is the same before and after the
refactoring.
        </p>
        <p>
          Even though refactoring is a well-established practice in
traditional software development, it is not as well understood in
LESS. Existing refactoring attempts in LESS are implicitly
performed via controlling randomness [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], decomposing
trained models [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], or defining new requirements [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The
lack of refactoring tools and techniques, and an evaluation
framework for LESS is a significant challenge for developers
and researchers [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Our research aims to develop a
multi-objective evaluation framework for LESS. We study it in the
context of a specific class of supervised learning problems,
namely image classification tasks.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.2. Image Classification Problems and Evaluation</title>
        <p>
          While the continuous evaluation of ML models [
          <xref ref-type="bibr" rid="ref11 ref12 ref13 ref36">11, 12, 13,
36</xref>
          ] has highlighted modularity, reliability, robustness, and
interpretability, these assessments done independently fall
short of ensuring the safety of AI systems as a whole.
Consider for instance the role of accuracy, which is the widely
accepted metric [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] for gauging the success of models in
the image classification task. Benchmark models for this
problem class originating from the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC) have continually
improved on accuracy. To date, the record for the highest
accuracy on the ImageNet benchmark is an impressive 92.4%,
set by OmniVec (ViT) (https://paperswithcode.com/paper/omnivec-learning-robustrepresentations-with). However, while high accuracy is
indispensable, it is not the sole criterion for the adequacy of
a model, especially within contexts of safety-critical
applications. In such applications, other non-functional metrics
demand equal consideration to ensure the comprehensive
robustness and reliability of LESS.
        </p>
        <p>
          Ensuring AI systems maintain safety and fairness [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]
across various conditions and inputs is crucial for
applications like autonomous driving and medical
diagnosis [
          <xref ref-type="bibr" rid="ref13 ref16 ref39">13, 16, 39</xref>
          ]. Reliability is equally important, as
dependable systems yield consistent results, fostering trust among
stakeholders and accountability among developers. While
state-of-the-art models often match or exceed human
performance in image classification tasks, understanding errors
and their solutions remains challenging [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. Evaluating
model performance is vital, especially in safety-critical
scenarios, yet the opaque nature of the learning component
hinders transparency and interpretability.
        </p>
        <p>Our proposed framework ReLESS combines accuracy,
runtime performance, robustness, and interpretability, using a
multi-objective optimization function. By experimenting
with metrics drawn from existing literature and through
preliminary evaluations of them, we validate target systems’
performance both before and after refactoring. Our findings
illuminate the trade-offs researchers make between accuracy
and other performance metrics. Importantly, our evaluation
process considers not just a single metric versus accuracy
but integrates multiple metrics to understand various system
maintenance challenges. This approach helps mitigate the
"black-box" nature of AI learning components, providing
clearer insights into system behavior and performance.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.3. Baselines for Comparison</title>
        <p>
          Chen et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] analyzed refactoring for image classification
tasks at the algorithmic level with various models using
dynamic analysis, record-and-replay, and profile-and-patch.
The focus of their approach is to control randomness and
hardware non-determinism to guarantee that the accuracy
and performance metrics are the same as the original
system in seven models (Lenet1/4/5, ResNet-38/56, WRN-28-10,
and ModelX). Models are then reproduced efficiently and
accurately across different hardware.
        </p>
        <p>
          Pan and Rajan [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] hypothesized that decomposing
learning models into reusable components can affect
refactoring outputs and statistical performance in the MNIST [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ]
dataset. They run four DNN models across sixteen
experiments with varying hidden layers and datasets,
demonstrating that removing irrelevant edges in the network can lead
to similar accuracy and preserve the most semantics. They
found that 9 out of 16 cases were functionally equivalent to
the original models, based on the Jaccard Index, with
intra-dataset performance from decomposed models slightly
outperforming models built from scratch (e.g., MNIST(+0.30%)).
        </p>
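        <p>For illustration, one plausible reading of such a Jaccard-Index comparison over model predictions (a minimal sketch of our own; it is not necessarily Pan and Rajan's exact formulation, and the helper name is hypothetical) is:</p>
        <preformat>
import numpy as np

def jaccard_of_correct_sets(y_true, preds_original, preds_decomposed):
    """Jaccard Index between the sets of inputs classified correctly by the
    original model and by the decomposed (refactored) model."""
    correct_orig = set(np.flatnonzero(preds_original == y_true))
    correct_dec = set(np.flatnonzero(preds_decomposed == y_true))
    union = correct_orig.union(correct_dec)
    if not union:
        return 1.0  # neither model classifies anything correctly; treat as identical
    return len(correct_orig.intersection(correct_dec)) / len(union)

# Hypothetical usage: preds_* are predicted label vectors on the same test set.
y_true = np.array([1, 0, 3, 3, 7])
preds_original = np.array([1, 0, 3, 5, 7])
preds_decomposed = np.array([1, 0, 3, 3, 2])
print(jaccard_of_correct_sets(y_true, preds_original, preds_decomposed))  # 0.6
        </preformat>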
        <p>
          Adopting the methodology from Hu et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], we
succeeded in obtaining the original and filtered images from
ImageNet [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ]. Image filters such as brightness, contrast,
defocus/blur, frost, Gaussian noise, and JPEG compression,
are crucial for testing the robustness of the refactored
systems [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ] because they involve pictures that humans can
recognize correctly and easily before and after filtering,
thus setting a baseline for model performance in similar
conditions.
        </p>
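        <p>As an illustration of how such filters can be applied, the following is a minimal sketch under our own assumptions, using Pillow and NumPy rather than the original corruption implementations:</p>
        <preformat>
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def corrupted_variants(path):
    """Produce a few filtered variants of one image for robustness testing."""
    img = Image.open(path).convert("RGB")
    variants = {
        "brightness": ImageEnhance.Brightness(img).enhance(1.5),
        "contrast": ImageEnhance.Contrast(img).enhance(0.5),
        "defocus_blur": img.filter(ImageFilter.GaussianBlur(radius=3)),
    }
    # Additive Gaussian noise, clipped back to the valid pixel range.
    arr = np.asarray(img).astype(np.float32)
    noisy = np.clip(arr + np.random.normal(0.0, 25.0, arr.shape), 0, 255)
    variants["gaussian_noise"] = Image.fromarray(noisy.astype(np.uint8))
    return variants

# Hypothetical usage: feed each variant to the original and refactored systems
# and compare their predictions against the label of the clean image.
        </preformat>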
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Given the context of refactoring in ML systems as discussed
in the previous section, we present ReLESS, a conceptual
framework created especially to tackle two important
research goals. First, to investigate the definition and
operation of semantic preservation during system transformation
procedures within LESS. Second, to explore approaches
for evaluating the safety of LESS during the system's
transformation, building on our formalization of semantic preservation
from the first goal.</p>
      <p>
        Consider Fig. 1, which depicts the situation representing
the refactoring of traditional software systems. Here, R1
represents an (automated) refactoring that takes as input a
program P2 to be refactored and produces a refactored
program P2′. Note that P2 and P2′ are source code,
i.e., textual descriptions. We assume that R1 is a non-trivial
refactoring, i.e., that P2 ≠ P2′. As refactoring
typically deals with real-world languages with non-trivial
semantics, the semantic equivalence of P2 and P2′
is normally assessed empirically by executing P2's test
suites and comparing the results. Thus, to evaluate the
refactoring R1, every test suite input is fed to both P2 and P2′.
The outputs are then compared; ideally,
all tests have the same results before and after the
refactoring. If so, then Out = Out′, and R1 is considered
validated. Otherwise, Out ≠ Out′, meaning there is
a bug [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ] in the system. Since traditional software is
typically deterministic and its logic is not driven by dynamic
data models, the process works in a relatively
straightforward fashion. In fact, the larger the test suite, the greater
the confidence that the refactoring works. (Traditional software
may be concurrent, potentially experiencing race conditions, or may
rely on its (changing) environment. In such cases, "flaky" tests may
arise, which would challenge refactoring validation; the test suites
can then be executed several times to identify stable tests.) On the other
hand, given the non-deterministic intricacies inherent in
LESS, the traditional refactoring evaluation as described in Fig. 1 is
insufficient. Consequently, we construct an auxiliary diagram,
Fig. 2, that facilitates a more direct and nuanced evaluation
of the transformations.
      </p>
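      <p>A minimal sketch of this traditional validation loop (our illustration; the function names are hypothetical) follows:</p>
      <preformat>
def validate_refactoring(original_fn, refactored_fn, test_inputs):
    """Classic check: a refactoring is validated only if every test input
    yields identical output before and after the transformation."""
    for x in test_inputs:
        if original_fn(x) != refactored_fn(x):
            return False  # behavior changed: the refactoring introduced a bug
    return True

# Hypothetical usage with a trivial program P2 and its refactoring P2'.
def p2(x):
    return x * 2 + 1

def p2_prime(x):  # refactored form; should be semantically identical
    return (x + x) + 1

assert validate_refactoring(p2, p2_prime, range(100))
      </preformat>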
      <p>Now consider Fig. 2, representing ReLESS with citations
of related work in the supervised learning context. (While our
current investigation focuses on supervised learning, we plan to
extend the framework to other types of learning, e.g., unsupervised
and reinforcement learning, as part of our future work.) Here,
R1 represents an (automated) refactoring that takes as
input an ML algorithm A2 to be refactored and produces
a refactored ML algorithm A2′. Note that A2 and A2′
are ML algorithm source code, i.e., textual descriptions. We
again assume that R1 is a non-trivial refactoring, i.e.,
that A2 ≠ A2′. To evaluate the refactoring R1, two
steps are taken: (a) a training dataset is fed to both A2
and A2′, which are then compiled into the respective trained
ML models M3 and M3′; (b) an evaluation (testing) dataset
is fed to both M3 and M3′, and one or more
such datasets (both training and evaluation) may be used.
The outputs, in this case predictions or classifications, are
then compared. If R1 results in no accuracy loss, then
Out = Out′. Otherwise, Out ≠ Out′, meaning
R1 causes some accuracy loss when refactoring A2.
Note that unlike in the traditional refactoring evaluation
case, whether there is a bug in R1 in this situation is not
straightforward to determine and is not a topic of focus in
this paper. Because LESS can be non-deterministic and has
logic that is driven by dynamic data models, whether R1
is considered valid may depend on multiple factors. For
instance, if the accuracy loss is within a certain threshold,
then R1 may be considered valid. If the accuracy loss is
above the threshold, then R1 may be considered invalid.</p>
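      <p>The ReLESS counterpart relaxes exact output equality to a tolerance on accuracy loss. A minimal sketch (our illustration; the threshold value and helper names are assumptions rather than part of the framework's formal definition) follows:</p>
      <preformat>
import numpy as np

def accuracy(preds, labels):
    return float(np.mean(preds == labels))

def validate_less_refactoring(preds_old, preds_new, labels, max_accuracy_loss=0.01):
    """R1 is deemed valid if the refactored model M3' loses no more than
    max_accuracy_loss accuracy relative to the original model M3."""
    return accuracy(preds_new, labels) + max_accuracy_loss >= accuracy(preds_old, labels)

# Hypothetical usage: preds_* come from evaluating M3 and M3' on the same test set.
labels = np.array([0, 1, 1, 0, 1])
preds_old = np.array([0, 1, 1, 0, 0])  # 80% accurate
preds_new = np.array([0, 1, 1, 1, 0])  # 60% accurate
print(validate_less_refactoring(preds_old, preds_new, labels))  # False: 0.2 loss exceeds 0.01
      </preformat>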
      <p>
        A supplementary contribution of our proposed
framework is that it has an additional layer where both the
transformation and output comparison could occur. For instance,
there is a dashed line in Fig. 2 from M3 to R1 and M3′,
indicating that the transformation and output comparison can
also take place on the trained ML models. In the traditional setting (Fig. 1), because the transformation is not
source-to-source, it would not be considered a refactoring
in the traditional sense but instead viewed as compiler
optimization. However, in the LESS context, ML algorithms
are typically written in interpreted languages (e.g., Python),
where a compiler is not involved. Because the model
training (compilation) process can potentially be lengthy
(days or even weeks) depending on the dataset size,
transforming the ML algorithm to produce a new ML model as
part of the refactoring process can be time-consuming [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ].
Instead, it may be advantageous in this context to perform
the refactoring at the testing level to avoid retraining. Such
a "refactoring" is done on LESS by Pan and Rajan [
        <xref ref-type="bibr" rid="ref12 ref45">12, 45</xref>
        ].
Although the transformation is on the trained ML model,
their goal of enhanced modularity is a classical refactoring
outcome.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Determination of Semantic Equivalence</title>
        <p>Our objective is to ultimately build a tool, where users
provide original code (old system), that determines which
refactorings (new systems) would satisfy semantic equivalence.
We identify diferent levels at which this could occur:
semantic equivalence at: (a) the ML algorithm level (case 1),
and (b) the ML model level (case 2). We will demonstrate
how existing works perform semantic equivalence from a
single-lens point of view. Drawing on these efforts,
however, our approach will create a multi-objective evaluation
(instead of a single-objective function used by the current
state-of-the-art).</p>
        <sec id="sec-3-1-1">
          <title>Case 1: Semantic Equivalence at the ML Algorithm Level</title>
          <p>
            A2 = A2′: This equivalence implies that M3 and
M3′ are also semantically equivalent, as shown in Fig. 2. In
this case, Out = Out′. But the average training time (in hours)
of the model for this refactoring in Chen et al. [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] increases
from 0.017 to 0.023 for Lenet1 and from 7.08 to 14.979 in the
case of ModelX. Their approach also has higher storage
overhead for M3′ (due to random seed recording). Such an
approach does not facilitate model generalization to unseen
data by making the training process explore various
possibilities, which constrains the robustness of an ML model.
Deterministic methods are also more susceptible to
over-fitting, as models can memorize the training data too closely,
limiting their performance on new data. Lastly, ensuring
complete determinism can be computationally expensive
and challenging, especially in complex, multi-threaded, or
distributed computing environments. This work highlights
the tension between semantic preservation and model
optimization.
          </p>
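          <p>For concreteness, the seed-control portion of such determinism can be sketched as follows (our illustration assuming TensorFlow; Chen et al.'s record-and-replay and profile-and-patch tooling goes well beyond this):</p>
          <preformat>
import os
import random

import numpy as np
import tensorflow as tf

def make_training_deterministic(seed=42):
    """Pin the main sources of randomness so that repeated runs of A2 (and A2')
    can be compared run-for-run. Chen et al.'s record-and-replay tooling goes
    further (e.g., taming hardware non-determinism); this is only the seed part."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    os.environ["TF_DETERMINISTIC_OPS"] = "1"  # deterministic GPU kernels where supported
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

make_training_deterministic()
          </preformat>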
          <p>
            Case 2: Semantic Equivalence at the ML Model Level
M3 = M3′: This means that the trained ML models
are the same. It follows again that Out = Out′, and
A2 and A2′ are semantically equivalent in the traditional
sense. As we are considering non-trivial
refactorings as discussed earlier, we assume A2 ≠ A2′, meaning
that the refactoring R1 has made some non-trivial
transformation. An example of such a transformation would be
to enhance the run-time performance of the training; the
trained model would be the same but the training process
would be faster. For instance, Castro Vélez et al. [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ] show
that, by applying a hybrid training technique, the run-time
of M3′ is ∼9.22 seconds faster than that of M3 in imperative Deep
Learning (DL) programs. In TensorFlow 2 [
            <xref ref-type="bibr" rid="ref47">47</xref>
            ], for
example, the tf.function decorator can be applied to certain
(model) Python functions found in imperative code to speed
up the training process. Developers and scientists, then, can
write natural, debuggable DL code in an imperative style
while retaining the run-time performance typically found in
legacy DL frameworks that support deferred-execution style
programming models. Applying tf.function to
(otherwise eagerly-executed) imperative DL code can be—if done
correctly—a semantics-preserving refactoring [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ].
M3 ≠ M3′: This means that the trained models are not the same.
It follows that it is possible that Out ≠ Out′,
meaning that A2 and A2′ may not be semantically
equivalent in the traditional sense. There are several
situations that may occur here, e.g., (i) different
hyperparameters are used, (ii) hybridization is misused, resulting in
semantically inequivalent code [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ], (iii) M3′ may be an
optimized DL model, e.g., having fewer edges, being more
modular, and avoiding over-fitting. In Fig. 2, M3′ represents a
modular and refactored system derived from M3 via R1, where
semantics is preserved through separation of concerns, such
as using supervised classification labels for maintenance and
reduced model training time [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. This indicates that M3′
does better than M3 with respect to ReLESS optimization
while preserving the potential to explore generalizability
and scalability.
          </p>
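          <p>A minimal sketch of the hybridization just described (our own example with placeholder model and data; tf.function itself is the TensorFlow 2 API discussed above) follows:</p>
          <preformat>
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD()

@tf.function  # traces the eager code into a graph, typically speeding up training
def train_step(x, y):
    with tf.GradientTape() as tape:
        probs = model(x, training=True)
        loss = loss_fn(y, probs)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# One step on a random batch; removing @tf.function keeps the semantics
# but usually runs slower (eager execution).
x = tf.random.normal([32, 4])
y = tf.random.uniform([32], maxval=10, dtype=tf.int32)
print(float(train_step(x, y)))
          </preformat>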
          <p>Our analysis not only sheds light on the current
state-of-the-art but also establishes a linkage between program
transformation techniques and their operational viability
in scenarios where safety is of paramount concern. We
then use these observations to formally define semantic
preservation using a multi-objective optimization function
rather than a single-objective one in LESS.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Semantic Preservation: Formal Definition and Verification Metrics</title>
        <p>We first define the semantic preservation of LESS based
on varying ranges of the output. The Venn diagram Fig. 3
shows the outputs from the original code and proposed
ReLESS. The upper circle in blue is the output from the original
code, e.g., the probability of correct labels for a classification
or prediction task. The lower circle in yellow is the output
from ReLESS. This diagram examines where the two outputs
are equivalent (overlapping area) and where they are
diferent. Suppose  is the acceptable range of overlap, i.e., how
much developers/engineers/scientists are willing to trade
accuracy with other factors viz. robustness, run-time
performance, interpretability etc. Ideally, the overlapped area
should be as large as possible, but this is not always the case
and is application-dependent. For instance, if the system is
time-critical, then the response time is emphasized in the
optimization even though there are marginal accuracy losses.
If the system is safety-critical, then the accuracy should be
preserved as much as possible. That said, we posit that to
achieve semantic preservation in ReLESS, it is inadequate to
consider accuracy as the sole optimization metric.</p>
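        <p>To make the overlap notion of Fig. 3 concrete, a minimal sketch (our illustration; the tolerance value and the per-example agreement measure are assumptions) could compare the two systems' output probabilities example by example:</p>
        <preformat>
import numpy as np

def output_overlap(probs_original, probs_refactored, tol=0.05):
    """Fraction of examples whose class probabilities from the original and
    refactored systems agree within an absolute tolerance tol."""
    worst_gap = np.abs(probs_original - probs_refactored).max(axis=1)
    agree = tol - worst_gap >= 0  # True where the largest per-class gap is within tol
    return float(np.mean(agree))

# Hypothetical usage: rows are examples, columns are class probabilities.
p_old = np.array([[0.90, 0.10], [0.20, 0.80], [0.60, 0.40]])
p_new = np.array([[0.88, 0.12], [0.30, 0.70], [0.58, 0.42]])
print(output_overlap(p_old, p_new))  # 2 of 3 examples agree within 0.05
        </preformat>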
        <p>
          Prior works [
          <xref ref-type="bibr" rid="ref13 ref38">13, 38</xref>
          ] have formalized balancing between
accuracy and reliability/robustness and fairness. OBrien et al.
[
          <xref ref-type="bibr" rid="ref48">48</xref>
          ] define non-functional LESS metrics as run-time
performance (speed), security, privacy, and memory (storage).
Building upon these foundational studies, we extend the
evaluation framework for semantic preservation to
explicitly encompass safety as an overarching theme. Run-time
performance, as highlighted by OBrien et al. [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ], serves not
only as a measure of efficiency but also influences system
safety by ensuring timely responses in critical scenarios.
Robustness, as documented by Hu et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], is directly linked
to safety, reflecting the system’s capacity to withstand errors
and adversities. Finally, interpretability, introduced by [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ],
enhances safety by providing clarity on decision-making
processes, thereby allowing for greater accountability and
easier identification of potential safety breaches. These
three metrics collectively forge a more resilient and
safety-conscious framework for assessing semantic preservation
in LESS.
        </p>
        <p>
          This tailored approach allows for a more integrated and
holistic assessment of LESS, aligning closely with
contemporary LESS development and deployment needs. All three
non-functional metrics are combined with accuracy using customized
importance factors to guide which degree of flexibility the
engineers, scientists, and researchers want the model they
work on to emphasize. We propose a multi-objective
optimization function, akin to Nguyen et al. [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]’s approach, to
determine the difference (loss function) between a LESS and
its corresponding ReLESS. We argue that if the loss is below
a certain threshold with constraints (as discussed in Fig. 3),
then semantic preservation is maintained.
        </p>
        <p>
          As one of the state-of-the-art formal methods,
optimization via loss functions is central to the training of ML/DL
models [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]. It is recognized for its adaptability to a wide
range of applications. Different trade-offs exist when
refactoring in ML/DL systems, so a multi-objective optimization
function is constructed. Besides, optimization can
standardize each metric term that needs to be balanced with accuracy
in loss function to make the whole system understandable
to the target audience. The range of optimization applied is
from classical ML models (random forest, gradient boost) to
DNN models with supported libraries, e.g., auto-sklearn [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ]
and AutoKeras [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ].
        </p>
        <sec id="sec-3-3-1">
          <title>3.2.1. Accuracy, Run-time Performance, Robustness, and Interpretability</title>
          <p>To comprehensively evaluate the performance of ReLESS,
we will consider accuracy with three key loss functions:
run-time performance, robustness, and interpretability.</p>
          <p>a. ACCuracy (ACC) is the number of correct outputs over the
total number of instances:</p>
          <p>ACC = (TP + TN) / (TP + TN + FP + FN) (1)</p>
          <p>where TP and TN are the number of positive instances and
negative instances correctly classified, and FP and FN are
the number of instances incorrectly classified.</p>
          <p>b. Run-Time Performance Improvement (RTPI) is
determined by comparing the observed run-time of the original
(old) code and new (transformed) code:</p>
          <p>RTPI = (T_old − T_new) / T_old (2)</p>
          <p>where T_old and T_new are the observed run-times of the
original and transformed code, respectively.</p>
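          <p>A direct reading of Eqs. (1) and (2) in code (a minimal sketch; the normalized form of RTPI reflects our reconstruction of the equation) follows:</p>
          <preformat>
def acc(tp, tn, fp, fn):
    """Eq. (1): fraction of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)

def rtpi(t_old, t_new):
    """Eq. (2): run-time improvement of the transformed code, normalized by the
    original run-time (our reading of the reconstructed equation)."""
    return (t_old - t_new) / t_old

print(acc(tp=480, tn=450, fp=20, fn=50))  # 0.93
print(rtpi(t_old=120.0, t_new=90.0))      # 0.25
          </preformat>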
          <p>c. ROBustness Improvement is indicated as ROBI:</p>
          <p>ROBI = (1 / |D|) ∗ Σ_(x, y) ∈ D [L_old(x, y) − L_new(x, y)] (3)</p>
          <p>
            where D is the input dataset, T is the training dataset, and Y
is the set of corresponding labels for a supervised learning
task, such as image classification. Similar to RTPI, the
observed difference is captured in the loss functions of the old and new
models: ROBI is observed as the difference in the loss
function between the old and new models. Our definition of
robustness is based on [
            <xref ref-type="bibr" rid="ref10 ref52">10, 52</xref>
            ], where the robustness of a system
after refactoring is verified by its loss function after
adversarial training; for classical ML, robustness after refactoring can
also be verified by its loss function after data augmentation,
feature engineering, and ensemble learning.
          </p>
          <p>
            d. INTerpretability Improvement is indicated as INTI.
          </p>
          <p>INTI = (1 / |S|) ∗ Σ_x ∈ S L(f(x), g(x)) (4)</p>
          <p>
            where S is a subset of the input dataset D, f is the trained
model, and g is an interpretable (surrogate) model.
Molnar [
            <xref ref-type="bibr" rid="ref53">53</xref>
            ] defines interpretability in machine learning
using interpretable models and a simplified loss function. The
loss function serves as a quantitative measure to compare
the interpretability of different models while maintaining
accuracy. This approach blends the conceptual
understanding of model behavior (through interpretable models) with a
practical, measurable way (using the loss function) to assess
and compare the clarity and comprehensibility of different
models. We compute this metric using this definition of
interpretability, taking a subset S of D as input. We are able to
compare the difference between the new and old models via
the corresponding interpretability score [
            <xref ref-type="bibr" rid="ref36">36</xref>
            ]. The
implication is that a simpler loss function on the explainable
refactored system correlates with higher interpretability,
which is plausible, but the exact method of determining
explainability is essential here.
          </p>
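          <p>A sketch of Eqs. (3) and (4) (our reconstruction; the per-example losses, the surrogate model, and the subset choice are assumptions made for illustration) follows:</p>
          <preformat>
import numpy as np

def robi(loss_old, loss_new):
    """Eq. (3): mean per-example reduction in loss after refactoring; the inputs
    are per-example losses of the old and new models on the same dataset D."""
    return float(np.mean(loss_old - loss_new))

def interpretability_score(surrogate_loss):
    """Eq. (4): mean loss of an interpretable surrogate g mimicking the model f
    on a subset S of D; lower values suggest the model is easier to approximate."""
    return float(np.mean(surrogate_loss))

def inti(surrogate_loss_old, surrogate_loss_new):
    """Interpretability improvement: old surrogate loss minus new surrogate loss."""
    return interpretability_score(surrogate_loss_old) - interpretability_score(surrogate_loss_new)

# Hypothetical usage with made-up per-example losses.
print(robi(np.array([0.9, 0.7, 0.8]), np.array([0.6, 0.5, 0.7])))  # approx. 0.2
print(inti(np.array([0.30, 0.40]), np.array([0.20, 0.30])))        # approx. 0.1
          </preformat>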
          <p>To sum up, we define a multi-objective optimization loss
function that facilitates balancing the importance of various
objectives depending on the application domain. Each
metric is formulated as a ratio or a normalized value, which is
typical in performance evaluation to provide a standardized
measure of improvement or degradation. In the equation
below, the metric terms are the loss measurements
calculated from Eqs. (2) to (4) for the RTPI, ROBI, and
INTI metrics respectively, and accuracy is ACC; f is the
model with its parameters, and D is the dataset. Each term's
weight coefficient w_i is assumed to be user-defined and
indicates the importance of each of the metrics during model
evaluation.</p>
          <p>L(f, D) = w1 × ACC + w2 × RTPI + w3 × ROBI + w4 × INTI (5)</p>
          <p>where w1, w2, w3, and w4 are weights that reflect the
importance of each term in the loss function.</p>
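          <p>A minimal sketch of Eq. (5) (our illustration; the weights and metric values are placeholders, and treating larger values as better for every term is an assumption) follows:</p>
          <preformat>
def reless_objective(acc, rtpi, robi, inti, weights=(0.4, 0.2, 0.2, 0.2)):
    """Eq. (5): weighted combination of accuracy and the improvement metrics,
    with user-defined importance factors weights = (w1, w2, w3, w4)."""
    w1, w2, w3, w4 = weights
    return w1 * acc + w2 * rtpi + w3 * robi + w4 * inti

# Hypothetical usage: score an original system and a candidate ReLESS.
original = reless_objective(acc=0.95, rtpi=0.0, robi=0.0, inti=0.0)
candidate = reless_objective(acc=0.94, rtpi=0.25, robi=0.20, inti=0.10)
print(original, candidate)  # approx. 0.38 vs. 0.486 under these weights
          </preformat>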
          <p>
            The multi-objective optimization function in this
formalism enables the determination of whether a ReLESS is a
semantically preserving transformation to its LESS.
Moreover, when fusing these measurements, it is also essential to
include the measure of accuracy because regardless of the
importance of the speed of operation, robustness, and
interpretability, producing correct outputs is the cornerstone
of model evaluation. In other words, accuracy is always a
first-class objective. Only by considering the critical role of
accuracy can we ensure that a model is trustworthy [
            <xref ref-type="bibr" rid="ref54">54</xref>
            ].
          </p>
          <p>Expanding on the conceptual structure presented in the
preceding part, we describe a preliminary experimental
configuration intended to closely assess the
accuracy, run-time
performance, robustness, and interpretability of LESS.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>This section describes the experimental setup employed for
a preliminary evaluation of the proposed ReLESS framework
for a simple case study. We describe the datasets used for
experiments, followed by an explanation of the experimental
design and the metrics adopted to assess the efficacy of the
refactorings.</p>
      <sec id="sec-4-1">
        <title>4.1. Datasets and Models</title>
        <p>
          As indicated in Section 2, we study ReLESS in the context of
two image classification datasets: the ImageNet dataset and
the MNIST dataset. The ImageNet dataset [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ], comprised
of 1.2 million images across 1000 categories, is utilized for
the evaluation to assess reliability and robustness, with a
specific subset of 50,000 images filtered following Hu et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The MNIST
dataset [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ], containing 60,000 training images and 10,000
test images of handwritten digits, serves as the basis for
initial evaluations. These datasets enable preliminary
assessments of the refactorings' effectiveness before
proceeding to more complex scenarios. Our experimental models
include fully connected neural networks with 1 to 4 layers
for the MNIST dataset, and pre-trained complex
architectures such as AlexNet [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ], ResNet50 [
          <xref ref-type="bibr" rid="ref56">56</xref>
          ], VGG16 [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ], and
GoogleNet [
          <xref ref-type="bibr" rid="ref58">58</xref>
          ] for the ImageNet dataset.
        </p>
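        <p>For reference, a minimal sketch of loading such pre-trained ImageNet architectures (assuming a recent torchvision; the exact checkpoints used in our experiments may differ) follows:</p>
        <preformat>
import torch
from torchvision import models

# Pre-trained ImageNet classifiers corresponding to the architectures above.
backbones = {
    "alexnet": models.alexnet(weights="IMAGENET1K_V1"),
    "resnet50": models.resnet50(weights="IMAGENET1K_V1"),
    "vgg16": models.vgg16(weights="IMAGENET1K_V1"),
    "googlenet": models.googlenet(weights="IMAGENET1K_V1"),
}

for name, model in backbones.items():
    model.eval()  # inference mode for evaluation-only experiments
    with torch.no_grad():
        out = model(torch.randn(1, 3, 224, 224))
    print(name, out.shape)  # each head produces 1000 ImageNet class scores
        </preformat>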
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiment and Results</title>
        <p>
          In our experimental setup, we applied the methodologies
outlined by Pan and Rajan [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and Hu et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], along
with techniques detailed in Section 3, across both datasets
to scrutinize the refactored systems with respect to accuracy,
run-time performance, robustness, and interpretability. The
results of experiments are summarized in Table 1.
        </p>
      <p>From Table 1, we observe that the refactored models
exhibit a marginal decrease in accuracy on the MNIST dataset,
with a difference of 0.001. This decrease is attributed to the
expanded modular complexity, which results in a run-time
increase of 414.7 seconds. The modularity of the refactored
model is significantly higher than the original model, with
a difference of 8. The robustness of the refactored model is
also higher, with a difference of 2.0476. The interpretability
of the refactored model is higher, with an accuracy
difference of 0.0769. Increases in both metrics indicate that
the refactored system exhibits improved robustness and
interpretability after decomposing. However, although the
robustness has improved, the accuracy for refactored
systems using the ImageNet dataset has decreased, falling
below that of a coin flip. Therefore, modularity appears not
only to be harmless but also beneficial to system safety, as
it maintains accuracy and improves robustness. However, for
the optimization of the aforementioned complex systems,
more efforts are required to prevent accuracy loss,
particularly in safety-critical tasks. More details can be found at
https://github.com/NanJ90/ReLess-testing-tool.</p>
      <p>To summarize, we present an initial assessment of the
ReLESS evaluation framework and describe the datasets used,
the experimental design, and the metrics for evaluating
refactorings. The comparative analysis of original and
refactored models reveals that different datasets and models can
exhibit significant variations across different performance
metrics. For instance, while the performance of the
ImageNet model remained relatively consistent after speedup,
the modularized MNIST model took 168 times longer than
the original. This underscores the critical importance of
evaluating effects across multiple datasets and models to
gain comprehensive insights into performance implications
w.r.t. accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>Our contribution in this work includes a review of
literature focused on refactoring in LESS, particularly with an
emphasis on safety considerations. This review critically
analyzes the spectrum of assessments presented across
various studies, each contributing to a facet of the AI safety
standard. We further explore and elucidate the
interrelationships between these safety metrics and the accuracy
of AI systems, highlighting the implications for model
development and deployment. Our preliminary results set a
potential foundation to help LESS attain the long-term evolvability
and robustness that are traditionally enjoyed by
conventional systems during development and deployment,
and thereby improve the safety of LESS. The scientists and
engineers who develop AI systems will be able to rely on the
refactored systems and trust them to make decisions that
are safe, secure, and trustworthy. Our future work includes
understanding how the thresholds in Fig. 3 will be determined
for various applications and how the user can determine
the weights for the various metrics. We have described an
initial validation of our framework; however, further
experimentation that includes more metrics, such as fairness and
privacy, and extends the validation to a variety of problem
domains and case studies is essential to comprehensively
assess its effectiveness and generalizability. This would also
enable practitioners to prioritize specific components when
evaluating LESS and could even lead to design-to-criteria
LESS.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank Ayan Kohli for the initial investigation into
semantic similarity and the anonymous reviewers for their helpful
comments. This work is supported by the National Science
Foundation (NSF) under Agreement No. CCF-2200343.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Martínez-Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bogner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Franch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oriol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Siebert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trendowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Vollmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <source>Software Engineering for AI-Based Systems: A Survey</source>
          ,
          <source>ACM Transactions on Software Engineering and Methodology</source>
          <volume>31</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>59</lpage>
          . URL: http://arxiv.
          <source>org/abs/2105</source>
          .
          <year>01984</year>
          . doi:
          <volume>10</volume>
          .1145/3487043. arXiv:
          <fpage>2105</fpage>
          .
          <year>01984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Breck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cai</surname>
          </string-name>
          , E. Nielsen,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          ,
          <article-title>The ML test score: A rubric for ML production readiness and technical debt reduction</article-title>
          ,
          <source>in: 2017 IEEE International Conference on Big Data (Big Data)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1123</fpage>
          -
          <lpage>1132</lpage>
          . doi:
          <volume>10</volume>
          .1109/BigData.
          <year>2017</year>
          .
          <volume>8258038</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] ISO/IEC 14764,
          <string-name>
            <surname>Software</surname>
            <given-names>Engineering - Software</given-names>
          </string-name>
          <string-name>
            <surname>Life Cycle Processes - Maintenance</surname>
          </string-name>
          , International Organizations for Standardization, Geneva, Switzerland,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <article-title>Refactoring sequential java code for concurrency via concurrent libraries</article-title>
          , in: International Conference on Software Engineering, IEEE,
          <year>2009</year>
          , pp.
          <fpage>397</fpage>
          -
          <lpage>407</lpage>
          . doi:
          <volume>10</volume>
          .1109/icse.
          <year>2009</year>
          .
          <volume>5070539</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Golovin</surname>
          </string-name>
          , E. Davydov,
          <string-name>
            <given-names>T.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ebner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Crespo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dennison</surname>
          </string-name>
          ,
          <article-title>Hidden technical debt in Machine Learning systems</article-title>
          ,
          <source>in: Neural Information Processing Systems</source>
          , volume
          <volume>2</volume>
          <source>of NIPS '15</source>
          , MIT Press,
          <year>2015</year>
          , pp.
          <fpage>2503</fpage>
          -
          <lpage>2511</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dolby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shinnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Allain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reinen</surname>
          </string-name>
          ,
          <article-title>Ariadne: Analysis for machine learning programs</article-title>
          ,
          <source>in: International Workshop on Machine Learning and Programming Languages, MAPL</source>
          <year>2018</year>
          ,
          <string-name>
            <surname>ACM</surname>
            <given-names>SIGPLAN</given-names>
          </string-name>
          , ACM, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . doi:
          <volume>10</volume>
          .1145/32 11346.3211349.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>[7] Executive order on the safe, secure, and trustworthy development</article-title>
          and
          <source>use of artificial intelligence</source>
          ,
          <year>2023</year>
          . URL: https://www.whitehouse.gov/briefing-room/presidential-actions/
          <year>2023</year>
          /10/30/executive
          <article-title>-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Madiega</surname>
          </string-name>
          ,
          <source>Artificial intelligence act</source>
          ,
          <source>European Parliament: European Parliamentary Research Service</source>
          (
          <year>2021</year>
          ). URL: https://www.europarl.europa.eu/RegData/etudes/BRIE/
          <year>2021</year>
          /698792/EPRS_BRI(
          <year>2021</year>
          )
          <article-title>698792_ EN</article-title>
          .pdf .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Norvig</surname>
          </string-name>
          , Artificial Intelligence:
          <article-title>A Modern Approach</article-title>
          ., 4 ed.,
          <source>Pearson</source>
          ,
          <year>2020</year>
          . doi:https://doi. org/10.1007/978-3-
          <fpage>030</fpage>
          -82681-9.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Engstrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madry</surname>
          </string-name>
          , Robustness may be at odds with accuracy,
          <year>2019</year>
          . arXiv:
          <year>1805</year>
          .12152.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Rajbahadur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. M. J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Towards training reproducible deep learning models</article-title>
          , in: International Conference on Software Engineering, ICSE '22,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>2202</fpage>
          -
          <lpage>2214</lpage>
          . doi:
          <volume>10</volume>
          .1145/3510003.3510163.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rajan</surname>
          </string-name>
          ,
          <article-title>On decomposing a deep neural network into modules</article-title>
          ,
          <source>in: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</source>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>889</fpage>
          -
          <lpage>900</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3368089.3409668. doi:
          <volume>10</volume>
          .1145/3368089.3409668.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marsso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Czarnecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chechik</surname>
          </string-name>
          ,
          <article-title>If a Human Can See It, So Should Your System: Reliability Requirements for Machine Vision Components</article-title>
          , in
          <source>: Proceedings of the 44th International Conference on Software Engineering</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1145</fpage>
          -
          <lpage>1156</lpage>
          . URL: http://arxiv.org/abs/2202.03930. doi:10.1 145/3510003.3510109. arXiv:
          <volume>2202</volume>
          .
          <fpage>03930</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rathod</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Balan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wojna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <article-title>Speed/accuracy trade-ofs for modern convolutional object detectors</article-title>
          ,
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2016</year>
          )
          <fpage>3296</fpage>
          -
          <lpage>3297</lpage>
          . URL: https://api.semanticscholar.or g/CorpusID:206595627.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Seshia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dreossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fremont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          , E. Kim,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shivakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vazquez-Chanlatte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <article-title>Formal Specification for Deep Neural Networks</article-title>
          ,
          <source>Technical Report UCB/EECS-2018-25</source>
          , EECS Department, University of California, Berkeley,
          <year>2018</year>
          . URL: http: //www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EE CS-2018-25.html.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <article-title>Security versus accuracy: Trade-of data modeling to safe fault classification systems</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi:
          <volume>10</volume>
          .1109/TNNLS.
          <year>2023</year>
          .
          <volume>3251999</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>W. F.</given-names>
            <surname>Opdyke</surname>
          </string-name>
          ,
          <article-title>Refactoring object-oriented frameworks</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Illinois at UrbanaChampaign, Champaign, IL, USA,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fowler</surname>
          </string-name>
          ,
          <article-title>Refactoring: Improving the Design of Existing Code, Addison-</article-title>
          <string-name>
            <surname>Wesley</surname>
          </string-name>
          , Boston, MA, USA,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W. G.</given-names>
            <surname>Griswold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. F.</given-names>
            <surname>Opdyke</surname>
          </string-name>
          ,
          <article-title>The birth of refactoring: A retrospective on the nature of high-impact software engineering research 32 (</article-title>
          <year>2015</year>
          )
          <fpage>30</fpage>
          -
          <lpage>38</lpage>
          . doi:
          <volume>10</volume>
          .1109/ MS.
          <year>2015</year>
          .
          <volume>107</volume>
          , conference Name: IEEE Software.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nagappan</surname>
          </string-name>
          ,
          <article-title>A field study of refactoring challenges and benefits</article-title>
          ,
          <source>in: Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</source>
          , ACM, Cary, North Carolina,
          <year>2012</year>
          , p.
          <fpage>1</fpage>
          . doi:
          <volume>10</volume>
          .1145/2393596.2393655.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>E. A.</given-names>
            <surname>AlOmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Mkaouer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ouni</surname>
          </string-name>
          ,
          <article-title>On preserving the behavior in software refactoring: A systematic mapping study</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>140</volume>
          (
          <year>2021</year>
          )
          <article-title>106675</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.infs of.
          <year>2021</year>
          .
          <volume>106675</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bagherzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>Safe automated refactoring for intelligent parallelization of Java 8 streams</article-title>
          , in: International Conference on Software Engineering, ICSE '19, ACM/IEEE, IEEE, Piscataway, NJ, USA,
          <year>2019</year>
          , pp.
          <fpage>619</fpage>
          -
          <lpage>630</lpage>
          . doi:
          <volume>10</volume>
          .1109/icse.
          <year>2019</year>
          .
          <volume>00072</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tip</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Fuhrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kieżun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Balaban</surname>
          </string-name>
          , B. De Sutter,
          <article-title>Refactoring using type constraints</article-title>
          ,
          <source>ACM Transactions on Programming Languages and Systems</source>
          <volume>33</volume>
          (
          <year>2011</year>
          ) 9:
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          :
          <fpage>47</fpage>
          . doi:
          <volume>10</volume>
          .1145/1961204.1961205.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sawin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rountev</surname>
          </string-name>
          ,
          <article-title>Automated refactoring of legacy Java software to enumerated types</article-title>
          ,
          <source>in: International Conference on Software Maintenance, ICSM '07</source>
          , IEEE, Paris, France,
          <year>2007</year>
          , pp.
          <fpage>224</fpage>
          -
          <lpage>233</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICSM.
          <year>2007</year>
          .
          <volume>4362635</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Masuhara</surname>
          </string-name>
          ,
          <article-title>Automated refactoring of legacy Java software to default methods</article-title>
          , in: International Conference on Software Engineering, ICSE '17, ACM/IEEE, IEEE Press, Piscataway, NJ, USA,
          <year>2017</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>93</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICSE.
          <year>2017</year>
          .
          <volume>16</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Masuhara</surname>
          </string-name>
          ,
          <article-title>Proactive empirical assessment of new language feature adoption via automated refactoring: The case of Java 8 default methods</article-title>
          ,
          <source>in: International Conference on the Art, Science, and Engineering of Programming</source>
          , volume
          <volume>2</volume>
          of Programming '18,
          <string-name>
            <surname>AOSA</surname>
          </string-name>
          , Nice, France,
          <year>2018</year>
          , pp.
          <volume>6</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          :
          <fpage>30</fpage>
          . doi:
          <volume>10</volume>
          .22152/programming-journal.
          <source>org/201 8/2/6.</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tourwe</surname>
          </string-name>
          ,
          <article-title>A survey of software refactoring</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>30</volume>
          (
          <year>2004</year>
          )
          <fpage>126</fpage>
          -
          <lpage>139</lpage>
          . doi:
          <volume>10</volume>
          .1109/TSE.
          <year>2004</year>
          .
          <volume>1265817</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Eleftheriadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kekatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Katsaros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tripakis</surname>
          </string-name>
          ,
          <article-title>On neural network equivalence checking using SMT solvers</article-title>
          , in: S. Bogomolov, D. Parker (Eds.),
          <source>Formal Modeling and Analysis of Timed Systems, Lecture Notes in Computer Science</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>257</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>031</fpage>
          -15839-1_
          <fpage>14</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dilhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ketkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dig</surname>
          </string-name>
          ,
          <year>Understanding software2</year>
          .
          <article-title>0: A study of machine learning library usage and evolution</article-title>
          ,
          <source>ACM Transactions on Software Engineering and Methodology</source>
          <volume>30</volume>
          (
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .1145/3453478.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dilhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ketkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sannidhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dig</surname>
          </string-name>
          ,
          <article-title>Discovering repetitive code changes in python ml systems</article-title>
          , in: International Conference on Software Engineering, ICSE '22,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>736</fpage>
          -
          <lpage>748</lpage>
          . doi:
          <volume>10</volume>
          .1145/3510 003.3510225.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bavota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <article-title>A large-scale empirical study on self-admitted technical debt</article-title>
          ,
          <source>in: International Conference on Mining Software Repositories, MSR '16</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA,
          <year>2016</year>
          , pp.
          <fpage>315</fpage>
          -
          <lpage>326</lpage>
          . doi:
          <volume>10</volume>
          .1145/2901739.2901742.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brown</surname>
          </string-name>
          , Y. Cai,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kazman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kruchten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. MacCormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nord</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Ozkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sangwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Seaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zazworka</surname>
          </string-name>
          ,
          <article-title>Managing technical debt in software-reliant systems</article-title>
          , in: FSE/SDP Workshop on Future of Software Engineering Research, FoSER '10,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA,
          <year>2010</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>52</lpage>
          . doi:
          <volume>10</volume>
          .1145/1882362.
          <year>1882</year>
          373.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>B.</given-names>
            <surname>Christians</surname>
          </string-name>
          ,
          <article-title>Self-admitted technical debt-an investigation from farm to table to refactoring</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aurum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vidgen</surname>
          </string-name>
          ,
          <article-title>An exploration of technical debt</article-title>
          ,
          <source>Journal of Systems and Software</source>
          <volume>86</volume>
          (
          <year>2013</year>
          )
          <fpage>1498</fpage>
          -
          <lpage>1516</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.jss.
          <year>2012</year>
          .
          <volume>12</volume>
          .052.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bagherzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raja</surname>
          </string-name>
          ,
          <article-title>An empirical study of refactorings and technical debt in Machine Learning systems</article-title>
          , in: International Conference on Software Engineering, ICSE '21, IEEE/ACM, IEEE, Madrid, Spain,
          <year>2021</year>
          , pp.
          <fpage>238</fpage>
          -
          <lpage>250</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICSE43902.
          <year>2021</year>
          .
          <volume>00033</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , Semantic-preserving
          <source>adversarial code comprehension</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/ab s/2209.05130. doi:
          <volume>10</volume>
          .48550/arXiv.2209.05130. arXiv:
          <volume>2209</volume>
          .05130 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Z.-L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-G.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>A distancebased weighting framework for boosting the performance of dynamic ensemble selection 56 (</article-title>
          <year>2019</year>
          )
          <fpage>1300</fpage>
          -
          <lpage>1316</lpage>
          . URL: https://www.sciencedirect.com/science/ar ticle/pii/S030645731830712X. doi:
          <volume>10</volume>
          .1016/j.ipm.
          <year>2019</year>
          .
          <volume>03</volume>
          .009.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>G.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rajan</surname>
          </string-name>
          ,
          <article-title>Fix fairness, don't ruin accuracy: Performance aware fairness repair using automl</article-title>
          ,
          <source>arXiv preprint arXiv:2306.09297</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gowal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stanforth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Boning</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-J. Hsieh</surname>
          </string-name>
          ,
          <article-title>Towards stable and efifcient training of verifiably robust neural networks</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1906</year>
          .06316.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>S.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , G. Sharma,
          <article-title>Omnivec: Learning robust representations with cross modal sharing</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1236</fpage>
          -
          <lpage>1248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , C. Cortes,
          <article-title>MNIST handwritten digit database (</article-title>
          <year>2010</year>
          ). URL: http://yann.lecun.com/exd b/mnist/.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>O.</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satheesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , L. Fei-Fei,
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          (IJCV)
          <volume>115</volume>
          (
          <year>2015</year>
          )
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          . doi:
          <volume>10</volume>
          .1007/s11263-015-0816-y.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          ,
          <article-title>Benchmarking neural network robustness to common corruptions</article-title>
          and perturbations,
          <year>2019</year>
          . arXiv:
          <year>1903</year>
          .12261.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z</surname>
          </string-name>
          . Zhang,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Han,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , J. rong Wen,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Pre-trained models: Past, present and future</article-title>
          ,
          <source>ArXiv abs/2106</source>
          .07139 (
          <year>2021</year>
          ). URL: https://api.semanticscholar.org/CorpusID:235421816.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rajan</surname>
          </string-name>
          ,
          <article-title>Decomposing convolutional neural networks into reusable and replaceable modules</article-title>
          , in: International Conference on Software Engineering, number arXiv:
          <volume>2110</volume>
          .07720 in ICSE '
          <volume>22</volume>
          , arXiv, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>524</fpage>
          -
          <lpage>535</lpage>
          . doi:
          <volume>10</volume>
          .1145/3510003. 3510051. arXiv:
          <volume>2110</volume>
          .
          <fpage>07720</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>T.</given-names>
            <surname>Castro</surname>
          </string-name>
          <string-name>
            <surname>Vélez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bagherzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raja</surname>
          </string-name>
          ,
          <article-title>Challenges in migrating imperative deep learning programs to graph execution: An empirical study</article-title>
          ,
          <source>in: International Conference on Mining Software Repositories, MSR '22</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>469</fpage>
          -
          <lpage>481</lpage>
          . doi:
          <volume>10</volume>
          .1145/3524842.3528455.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <surname>Google</surname>
            <given-names>LLC</given-names>
          </string-name>
          ,
          <article-title>Better performance with tf</article-title>
          .
          <source>function</source>
          ,
          <year>2021</year>
          . URL: https://tensorflow.org/guide/function.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <surname>D. OBrien</surname>
            , S. Biswas,
            <given-names>S. M.</given-names>
          </string-name>
          <string-name>
            <surname>Imtiaz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Abdalkareem</surname>
          </string-name>
          , E. Shihab, H. Rajan, 23 Shades of Self-Admitted
          <source>Technical Debt: An Empirical Study on Machine Learning Software</source>
          (
          <year>2022</year>
          )
          <fpage>13</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Demeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Du</given-names>
            <surname>Bois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stenten</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. Van Gorp</surname>
          </string-name>
          ,
          <source>Refactoring: Current research and future trends, Electronic Notes in Theoretical Computer Science</source>
          <volume>82</volume>
          (
          <year>2003</year>
          )
          <fpage>483</fpage>
          -
          <lpage>499</lpage>
          . URL: https://www.sciencedirect.co m/science/article/pii/S1571066105826246. doi:ht tps://doi.org/10.1016/S1571-
          <volume>0661</volume>
          (
          <issue>05</issue>
          )826
          <fpage>24</fpage>
          -
          <lpage>6</lpage>
          , lDTA'
          <fpage>2003</fpage>
          - Language descriptions,
          <source>Tools and Applications.</source>
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>M.</given-names>
            <surname>Feurer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Eggensperger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Springenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Eficient and robust automated machine learning</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ),
          <year>2015</year>
          , pp.
          <fpage>2962</fpage>
          -
          <lpage>2970</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Autokeras:</surname>
          </string-name>
          <article-title>An automl library for deep learning</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . URL: http://jmlr.org /papers/v24/
          <fpage>20</fpage>
          -
          <lpage>1355</lpage>
          .html.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Robustness and accuracy could be reconcilable by (proper) definition</article-title>
          , in: International
          <source>Conference on Machine Learning</source>
          ,
          <year>2022</year>
          . URL: https://api.semanticscholar.org/CorpusID: 247011694.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <surname>C. Molnar,</surname>
          </string-name>
          <article-title>Interpretable machine learning</article-title>
          ,
          <source>Lulu. com</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddharthan</surname>
          </string-name>
          ,
          <article-title>Empirical optimal risk to quantify model trustworthiness for failure detection</article-title>
          ,
          <source>arXiv preprint arXiv:2308.03179</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <article-title>One weird trick for parallelizing convolutional neural networks</article-title>
          ,
          <year>2014</year>
          . arXiv:
          <volume>1404</volume>
          .
          <fpage>5997</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <year>2015</year>
          . arXiv:
          <volume>1409</volume>
          .
          <fpage>1556</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <article-title>Going deeper with convolutions</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>