<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ReLESS: A Framework for Assessing Safety in Deep Learning Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nan Jia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anita Raja</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafi Khatchadourian</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CUNY, Hunter College</institution>
          ,
          <addr-line>695 Park Avenue, New York, NY 10065</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CUNY, the Graduate Center</institution>
          ,
          <addr-line>365 Fifth Avenue, New York, NY 10016</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Traditionally, software refactoring helps to improve a system's internal structure and enhance its non-functional features, such as reliability and run-time performance, while preserving external behavior, including the original program semantics. However, in the context of learning-enabled software systems (LESS), e.g., Machine Learning (ML) systems, it is unclear which portions of a software's semantics require preservation at the development phase. This is mainly because (a) the behavior of LESS is not defined until run-time and (b) ML algorithms are inherently iterative and non-deterministic. Consequently, there is a knowledge gap in what refactoring truly means in the context of LESS, as such systems have no guarantee of a predetermined correct answer. We thus conjecture that, to construct robust and safe LESS, it is imperative to understand the flexibility of refactoring LESS compared to traditional software and to measure it. In this paper, we introduce a novel conceptual framework named ReLESS for evaluating refactorings for supervised learning by (i) exploring the transformation methodologies taken by state-of-the-art LESS refactorings that focus on singular metrics, (ii) reviewing informal notions of semantics preservation and the level at which they occur (source code vs. trained model), and (iii) empirically comparing and contrasting existing LESS refactorings in the context of image classification problems. This framework will set the foundation to not only formalize a standard definition of semantics preservation in LESS but also combine four metrics, namely accuracy, run-time performance, robustness, and interpretability, into a multi-objective optimization function, instead of the single-objective functions used in existing works, to assess LESS refactorings. In the future, our work could seek reliable LESS refactorings that generalize over diverse systems.</p>
      </abstract>
      <kwd-group>
        <kwd>learning-enabled software systems</kwd>
        <kwd>machine learning systems</kwd>
        <kwd>refactoring</kwd>
        <kwd>trusted AI software architectures</kwd>
        <kwd>AI safety</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Developers of Learning-Enabled Software Systems (LESS)
face the challenge of constructing highly reliable large-scale
systems, as evidenced in previous research [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. With
the pervasive integration of dynamic Machine Learning
(ML) models in these operational software systems, safety,
efficiency, and adaptability with respect to evolving user
requirements become paramount. Moreover, software
systems inherently evolve throughout their life-cycle [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which
traditionally incurs substantial costs and risks, particularly
in the context of large, complex systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Although
LESS shares these traits with conventional software, its
data-driven nature accentuates the propensity for
evolution [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This divergence from traditional software poses
unique challenges for testing and verification due to its
data-driven and uncertain requirements. Notably, the
efficacy of resulting ML models, including Large Language
Models (LLMs), improves with more extensive data inputs,
necessitating a delicate balance between user privacy
protection and model refinement in large-scale systems.
Consequently, there arises a pressing need for validation and
testing methodologies tailored to the distinctive
characteristics of AI-driven systems.
      </p>
      <p>
        This evolving research agenda underscores a critical
reassessment of priorities in AI system development.
Furthermore, as AI technologies permeate various sectors of society,
scalable systems must effectively consider and adapt to legal,
policy, and employment implications. These technical
attributes not only underpin the functional aspects of AI
applications but also facilitate their alignment with essential
ethical standards and societal expectations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This imperative
is further underscored by a recent U.S. government-issued
Executive Order [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the EU AI Act [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], emphasizing
the necessity for Safe, Secure, and Trustworthy
Development and Use of Artificial Intelligence. Moreover, to ensure
the positive societal impact of AI systems, accuracy,
runtime performance, robustness, and interpretability are crucial
technical attributes that directly support broader ethical
objectives.
      </p>
      <p>
        Recent works [
        <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
        ] have highlighted a variety of metrics
for assessing the impacts of LESS transformation. These
metrics include aspects such as ensuring safety and fairness,
protecting privacy, fostering collaboration, considering legal
and policy ramifications, and evaluating impacts on
employment. Recent studies [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14">10, 11, 12, 13, 14</xref>
        ] have investigated
whether original and transformed systems should behave
consistently before and after transformation. These studies
illustrate the potential trade-offs between accuracy and each
respective metric. Although various metrics like fairness
and privacy are considered, in this work, we focus on
accuracy, run-time performance, robustness, and interpretability
as a starting point with the intent to cover the majority
of AI safety concerns in LESS [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
        ]. We argue that
comprehending and harnessing the flexibility of refactoring
in LESS represents a pivotal stride toward enhancing the
safety of AI systems.
      </p>
      <p>A detailed exposition of these metrics, as discussed in the
state-of-the-art literature, is provided in Section 2.</p>
      <p>
        Traditionally, the criterion for refactoring [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ] is that
the same input must produce the same output; any
deviation is considered a behavior change of the program and
a threat leading to system crash [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However,
refactoring is underexplored in the context of LESS, including deep
learning frameworks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. LESS, unlike traditional software
systems, benefit from randomness yet lack a guarantee
of a predefined exact outcome due to their reliance on the
quantity and quality of data, complicating predictions about
the effects of refactoring.
      </p>
      <p>
        This paper aims to bridge the knowledge gap between
refactoring practices in traditional software [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref23 ref24 ref25 ref26 ref27 ref4">4, 20, 21, 22, 23,
24, 25, 26, 27</xref>
        ], and LESS [
        <xref ref-type="bibr" rid="ref12 ref28 ref29 ref30">12, 28, 29, 30</xref>
        ] by introducing
ReLESS (Refactoring of Learning-Enabled Software Systems),
an evaluation framework for standardizing and
formalizing refactoring methodologies. In this work, we describe
this framework in the context of supervised learning tasks,
specifically image classification problems. Our
hypothesis posits that the criteria for successful refactoring—
namely source-to-source transformation and
semantic preservation—assume unique, yet complementary
implications in the context of LESS as opposed to
traditional software systems.
      </p>
      <p>Specifically, ReLESS will allow for the possibility that
transformations might produce outputs that are slightly
different from the original output as long as they lead to
improvements in other performance metrics of the system.
Determining how "different" the output can be from the
original is a research question we seek to address.
Moreover, our approach aims to discover and preserve
safety-critical metrics during ReLESS while further mitigating the
uncertainties introduced by their non-deterministic nature.
While current approaches emphasize knowledge
distillation (transferring knowledge from a large neural network
to a smaller, resource-efficient one) and regularization (a
technique for solving over-fitting), our vision for the future
of ReLESS includes approaches that combine connectionist
models (e.g., neural networks) and symbolic (e.g., decision
tree) approaches as well as Bayesian and analogizer (e.g.,
K-nearest neighbor, support vector machine) approaches.</p>
      <p>
        This paper is structured as follows: in Section 2 we first
provide a comprehensive analysis of state-of-the-art
refactoring methodologies in LESS and discuss how these works
trade off accuracy with respect to specific metrics such as
run-time performance, robustness [
        <xref ref-type="bibr" rid="ref11 ref13 ref14 ref16">11, 13, 14, 16</xref>
        ], or
interpretability. In Section 3, we contrast existing practices and
scrutinize informal notions of semantics preservation across
different levels (source code vs. trained model). We then
motivate a novel thread of inquiry for the ReLESS evaluation
framework and its multi-objective optimization function
that combines the aforementioned multiple metrics to
guarantee the AI system’s safety. Section 4 presents preliminary
experiments utilizing ReLESS to gauge LESS safety and
associated parameters. Finally, in Section 5 we discuss the main
insights gleaned from this work and our future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In recent years, various research has been conducted on
LESS refactoring, with significant observations in balancing
a single metric against model accuracy. Several studies
have focused on image classification or object detection,
addressing this tension and presenting innovative verification
techniques. However, these approaches often face limitations,
namely a lack of generalization and a narrow scope of metrics,
which we aim to address in our work.</p>
      <sec id="sec-2-1">
        <title>2.1. Refactoring Types in Software Development</title>
        <p>
          Refactoring [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], a well-known technique for the evolution
and maintenance of traditional software, alters a system’s
internal structure without changing its behavior [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] to
improve non-functional characteristics such as run-time
performance, security, and modularity, and to pay down
technical debt [
          <xref ref-type="bibr" rid="ref31 ref32 ref33 ref34 ref35">31, 32, 33, 34, 35</xref>
          ]. It can be considered as
a series of typically automatic procedures for modifying
code, such as variable name changes to enhance
comprehension [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], without an explicit focus on automated
refactoring, as these modifications frequently occur automatically
within a system-based environment. Formally, a refactoring
is a program transformation potentially spanning multiple,
non-adjacent program statements or expressions that is:
(i) source-to-source and (ii) semantics-preserving, i.e., the
behavior of the program is the same before and after the
refactoring.
        </p>
        <p>
          Even though refactoring is a well-established practice in
traditional software development, it is not as well understood in
LESS. Existing refactoring attempts in LESS are implicitly
performed via controlling randomness [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], decomposing
trained models [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], or defining new requirements [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The
lack of refactoring tools and techniques, and an evaluation
framework for LESS is a significant challenge for developers
and researchers [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Our research aims to develop a
multi-objective evaluation framework for LESS. We study it in the
context of a specific class of supervised learning problems,
namely image classification tasks.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.2. Image Classification Problems and Evaluation</title>
        <p>
          While the continuous evaluation of ML models [
          <xref ref-type="bibr" rid="ref11 ref12 ref13 ref36">11, 12, 13,
36</xref>
          ] has highlighted modularity, reliability, robustness, and
interpretability, these assessments done independently fall
short of ensuring the safety of AI systems as a whole.
Consider for instance the role of accuracy, which is the widely
accepted metric [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] for gauging the success of models in
the image classification task. Benchmark models for this
problem class originating from the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC) have continually
improved on accuracy. To date, the record for the highest
accuracy on the ImageNet benchmark is an impressive 92.4%,
set by OmniVec (ViT) (https://paperswithcode.com/paper/omnivec-learning-robustrepresentations-with). However, while high accuracy is
indispensable, it is not the sole criterion for the adequacy of
a model, especially within contexts of safety-critical
applications. In such applications, other non-functional metrics
demand equal consideration to ensure the comprehensive
robustness and reliability of LESS.
        </p>
        <p>
          Ensuring AI systems maintain safety and fairness [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]
across various conditions and inputs is crucial for
applications like autonomous driving and medical
diagnosis [
          <xref ref-type="bibr" rid="ref13 ref16 ref39">13, 16, 39</xref>
          ]. Reliability is equally important, as
dependable systems yield consistent results, fostering trust among
stakeholders and accountability among developers. While
state-of-the-art models often match or exceed human
performance in image classification tasks, understanding errors
and their solutions remains challenging [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. Evaluating
model performance is vital, especially in safety-critical
scenarios, yet the opaque nature of the learning component
hinders transparency and interpretability.
        </p>
        <p>Our proposed framework ReLESS combines accuracy,
runtime performance, robustness, and interpretability, using a
multi-objective optimization function. By experimenting
with metrics drawn from existing literature and through
preliminary evaluations of them, we validate target systems’
performance both before and after refactoring. Our findings
illuminate the trade-offs researchers make between accuracy
and other performance metrics. Importantly, our evaluation
process considers not just a single metric versus accuracy
but integrates multiple metrics to understand various system
maintenance challenges. This approach helps mitigate the
"black-box" nature of AI learning components, providing
clearer insights into system behavior and performance.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.3. Baselines for Comparison</title>
        <p>
          Chen et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] analyzed refactoring for image classification
tasks at the algorithmic level with various models using
dynamic analysis, record-and-replay, and profile-and-patch.
The focus of their approach is to control randomness and
hardware non-determinism to guarantee that the accuracy
and performance metrics are the same as the original
system in seven models (Lenet1/4/5, ResNet-38/56, WRN-28-10,
and ModelX). Models are then reproduced efficiently and
accurately across different hardware.
        </p>
        <p>
          Pan and Rajan [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] hypothesized that decomposing
learning models into reusable components can affect
refactoring outputs and statistical performance in the MNIST [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ]
dataset. They run four DNN models across sixteen
experiments with varying hidden layers and datasets,
demonstrating that removing irrelevant edges in the network can lead
to similar accuracy and preserve the most semantics. They
found that 9 out of 16 cases were functionally equivalent to
the original models, based on the Jaccard Index, with
intra-dataset performance from decomposed models slightly
outperforming models built from scratch (e.g., MNIST(+0.30%)).
        </p>
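        <p>For illustration, one plausible reading of such a Jaccard-Index comparison over model predictions (a minimal sketch of our own; it is not necessarily Pan and Rajan's exact formulation, and the helper name is hypothetical) is:</p>
        <preformat>
import numpy as np

def jaccard_of_correct_sets(y_true, preds_original, preds_decomposed):
    """Jaccard Index between the sets of inputs classified correctly by the
    original model and by the decomposed (refactored) model."""
    correct_orig = set(np.flatnonzero(preds_original == y_true))
    correct_dec = set(np.flatnonzero(preds_decomposed == y_true))
    union = correct_orig.union(correct_dec)
    if not union:
        return 1.0  # neither model classifies anything correctly; treat as identical
    return len(correct_orig.intersection(correct_dec)) / len(union)

# Hypothetical usage: preds_* are predicted label vectors on the same test set.
y_true = np.array([1, 0, 3, 3, 7])
preds_original = np.array([1, 0, 3, 5, 7])
preds_decomposed = np.array([1, 0, 3, 3, 2])
print(jaccard_of_correct_sets(y_true, preds_original, preds_decomposed))  # 0.6
        </preformat>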
        <p>
          Adopting the methodology from Hu et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], we
succeeded in obtaining the original and filtered images from
ImageNet [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ]. Image filters such as brightness, contrast,
defocus/blur, frost, Gaussian noise, and JPEG compression,
are crucial for testing the robustness of the refactored
systems [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ] because they involve pictures that humans can
recognize correctly and easily before and after filtering,
thus setting a baseline for model performance in similar
conditions.
        </p>
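        <p>As an illustration of how such filters can be applied, the following is a minimal sketch under our own assumptions, using Pillow and NumPy rather than the original corruption implementations:</p>
        <preformat>
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def corrupted_variants(path):
    """Produce a few filtered variants of one image for robustness testing."""
    img = Image.open(path).convert("RGB")
    variants = {
        "brightness": ImageEnhance.Brightness(img).enhance(1.5),
        "contrast": ImageEnhance.Contrast(img).enhance(0.5),
        "defocus_blur": img.filter(ImageFilter.GaussianBlur(radius=3)),
    }
    # Additive Gaussian noise, clipped back to the valid pixel range.
    arr = np.asarray(img).astype(np.float32)
    noisy = np.clip(arr + np.random.normal(0.0, 25.0, arr.shape), 0, 255)
    variants["gaussian_noise"] = Image.fromarray(noisy.astype(np.uint8))
    return variants

# Hypothetical usage: feed each variant to the original and refactored systems
# and compare their predictions against the label of the clean image.
        </preformat>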
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Given the context of refactoring in ML systems as discussed
in the previous section, we present ReLESS, a conceptual
framework created especially to tackle two important
research goals. First, to investigate the definition and
operation of semantic preservation during system transformation
procedures within LESS. Second, to explore approaches
for evaluating the safety of LESS during the system's
transformation, building on our formalization of semantic preservation
from the first goal.</p>
      <p>
        Consider Fig. 1, which depicts the situation representing
the refactoring of traditional software systems. Here, R1
represents an (automated) refactoring that takes as input a
program P2 to be refactored and produces a refactored
program P2′. Note that P2 and P2′ are source code,
i.e., textual descriptions. We assume that R1 is a non-trivial
refactoring, i.e., that P2 ≠ P2′. As refactoring
typically deals with real-world languages with non-trivial
semantics, the semantic equivalence of P2 and P2′
is normally assessed empirically by executing P2's test
suites and comparing the results. Thus, to evaluate the
refactoring R1, every test suite input is fed to both P2 and P2′.
The outputs are then compared; ideally,
all tests have the same results before and after the
refactoring. If so, then Out = Out′, and R1 is considered
validated. Otherwise, Out ≠ Out′, meaning there is
a bug [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ] in the system. Since traditional software is
typically deterministic and its logic is not driven by dynamic
data models, the process works in a relatively
straightforward fashion. In fact, the larger the test suite, the greater
the confidence that the refactoring works. (Traditional software
may be concurrent, potentially experiencing race conditions, or may
rely on its (changing) environment. In such cases, "flaky" tests may
arise, which would challenge refactoring validation; the test suites
can then be executed several times to identify stable tests.) On the other
hand, given the non-deterministic intricacies inherent in
LESS, the traditional refactoring evaluation as described in Fig. 1 is
insufficient. Consequently, we construct an auxiliary diagram,
Fig. 2, that facilitates a more direct and nuanced evaluation
of the transformations.
      </p>
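      <p>A minimal sketch of this traditional validation loop (our illustration; the function names are hypothetical) follows:</p>
      <preformat>
def validate_refactoring(original_fn, refactored_fn, test_inputs):
    """Classic check: a refactoring is validated only if every test input
    yields identical output before and after the transformation."""
    for x in test_inputs:
        if original_fn(x) != refactored_fn(x):
            return False  # behavior changed: the refactoring introduced a bug
    return True

# Hypothetical usage with a trivial program P2 and its refactoring P2'.
def p2(x):
    return x * 2 + 1

def p2_prime(x):  # refactored form; should be semantically identical
    return (x + x) + 1

assert validate_refactoring(p2, p2_prime, range(100))
      </preformat>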
      <p>Now consider Fig. 2, representing ReLESS with citations
of related work in the supervised learning context. (While our
current investigation focuses on supervised learning, we plan to
extend the framework to other types of learning, e.g., unsupervised
and reinforcement learning, as part of our future work.) Here,
R1 represents an (automated) refactoring that takes as
input an ML algorithm A2 to be refactored and produces
a refactored ML algorithm A2′. Note that A2 and A2′
are ML algorithm source code, i.e., textual descriptions. We
again assume that R1 is a non-trivial refactoring, i.e.,
that A2 ≠ A2′. To evaluate the refactoring R1, two
steps are taken: (a) a training dataset is fed to both A2
and A2′, which are then compiled into the respective trained
ML models M3 and M3′; (b) an evaluation (testing) dataset
is fed to both M3 and M3′, and one or more
such datasets (both training and evaluation) may be used.
The outputs, in this case predictions or classifications, are
then compared. If R1 results in no accuracy loss, then
Out = Out′. Otherwise, Out ≠ Out′, meaning
R1 causes some accuracy loss when refactoring A2.
Note that unlike in the traditional refactoring evaluation
case, whether there is a bug in R1 in this situation is not
straightforward to determine and is not a topic of focus in
this paper. Because LESS can be non-deterministic and has
logic that is driven by dynamic data models, whether R1
is considered valid may depend on multiple factors. For
instance, if the accuracy loss is within a certain threshold,
then R1 may be considered valid. If the accuracy loss is
above the threshold, then R1 may be considered invalid.</p>
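      <p>The ReLESS counterpart relaxes exact output equality to a tolerance on accuracy loss. A minimal sketch (our illustration; the threshold value and helper names are assumptions rather than part of the framework's formal definition) follows:</p>
      <preformat>
import numpy as np

def accuracy(preds, labels):
    return float(np.mean(preds == labels))

def validate_less_refactoring(preds_old, preds_new, labels, max_accuracy_loss=0.01):
    """R1 is deemed valid if the refactored model M3' loses no more than
    max_accuracy_loss accuracy relative to the original model M3."""
    return accuracy(preds_new, labels) + max_accuracy_loss >= accuracy(preds_old, labels)

# Hypothetical usage: preds_* come from evaluating M3 and M3' on the same test set.
labels = np.array([0, 1, 1, 0, 1])
preds_old = np.array([0, 1, 1, 0, 0])  # 80% accurate
preds_new = np.array([0, 1, 1, 1, 0])  # 60% accurate
print(validate_less_refactoring(preds_old, preds_new, labels))  # False: 0.2 loss exceeds 0.01
      </preformat>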
      <p>
        A supplementary contribution of our proposed
framework is that it has an additional layer where both the
transformation and output comparison could occur. For instance,
there is a dashed line in Fig. 2 from M3 to R1 and M3′,
indicating that the transformation and output comparison can
also take place on the trained ML models. In the traditional setting (Fig. 1), because the transformation is not
source-to-source, it would not be considered a refactoring
in the traditional sense but instead viewed as compiler
optimization. However, in the LESS context, ML algorithms
are typically written in interpreted languages (e.g., Python),
where a compiler is not involved. Because the model
training (compilation) process can potentially be lengthy
(days or even weeks) depending on the dataset size,
transforming the ML algorithm to produce a new ML model as
part of the refactoring process can be time-consuming [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ].
Instead, it may be advantageous in this context to perform
the refactoring at the testing level to avoid retraining. Such
a "refactoring" is done on LESS by Pan and Rajan [
        <xref ref-type="bibr" rid="ref12 ref45">12, 45</xref>
        ].
Although the transformation is on the trained ML model,
their goal of enhanced modularity is a classical refactoring
outcome.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Determination of Semantic Equivalence</title>
        <p>Our objective is to ultimately build a tool, where users
provide original code (old system), that determines which
refactorings (new systems) would satisfy semantic equivalence.
We identify diferent levels at which this could occur:
semantic equivalence at: (a) the ML algorithm level (case 1),
and (b) the ML model level (case 2). We will demonstrate
how existing works perform semantic equivalence from a
single-lens point of view. Drawing on these efforts,
however, our approach will create a multi-objective evaluation
(instead of a single-objective function used by the current
state-of-the-art).</p>
        <sec id="sec-3-1-1">
          <title>Case 1: Semantic Equivalence at the ML Algorithm Level</title>
          <p>
            A2 = A2′: This equivalence implies that M3 and
M3′ are also semantically equivalent, as shown in Fig. 2. In
this case, Out = Out′. But the average training time (in hours)
of the model for this refactoring in Chen et al. [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] increases
from 0.017 to 0.023 for Lenet1 and from 7.08 to 14.979 in the
case of ModelX. Their approach also has higher storage
overhead for M3′ (due to random seed recording). Such an
approach does not facilitate model generalization to unseen
data by making the training process explore various
possibilities, which constrains the robustness of an ML model.
Deterministic methods are also more susceptible to
over-fitting, as models can memorize the training data too closely,
limiting their performance on new data. Lastly, ensuring
complete determinism can be computationally expensive
and challenging, especially in complex, multi-threaded, or
distributed computing environments. This work highlights
the tension between semantic preservation and model
optimization.
          </p>
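          <p>For concreteness, the seed-control portion of such determinism can be sketched as follows (our illustration assuming TensorFlow; Chen et al.'s record-and-replay and profile-and-patch tooling goes well beyond this):</p>
          <preformat>
import os
import random

import numpy as np
import tensorflow as tf

def make_training_deterministic(seed=42):
    """Pin the main sources of randomness so that repeated runs of A2 (and A2')
    can be compared run-for-run. Chen et al.'s record-and-replay tooling goes
    further (e.g., taming hardware non-determinism); this is only the seed part."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    os.environ["TF_DETERMINISTIC_OPS"] = "1"  # deterministic GPU kernels where supported
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

make_training_deterministic()
          </preformat>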
          <p>
            Case 2: Semantic Equivalence at the ML Model Level
M3 = M3′: This means that the trained ML models
are the same. It follows again that Out = Out′, and
A2 and A2′ are semantically equivalent in the traditional
sense. As we are considering non-trivial
refactorings as discussed earlier, we assume A2 ≠ A2′, meaning
that the refactoring R1 has made some non-trivial
transformation. An example of such a transformation would be
to enhance the run-time performance of the training; the
trained model would be the same but the training process
would be faster. For instance, Castro Vélez et al. [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ] show
that, by applying a hybrid training technique, the run-time
of M3′ is ∼9.22 seconds faster than that of M3 in imperative Deep
Learning (DL) programs. In TensorFlow 2 [
            <xref ref-type="bibr" rid="ref47">47</xref>
            ], for
example, the tf.function decorator can be applied to certain
(model) Python functions found in imperative code to speed
up the training process. Developers and scientists, then, can
write natural, debuggable DL code in an imperative style
while retaining the run-time performance typically found in
legacy DL frameworks that support deferred-execution style
programming models. Applying tf.function to
(otherwise eagerly-executed) imperative DL code can be—if done
correctly—a semantics-preserving refactoring [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ].
M3 ≠ M3′: This means that the trained models are not the same.
It follows that it is possible that Out ≠ Out′,
meaning that A2 and A2′ may not be semantically
equivalent in the traditional sense. There are several
situations that may occur here, e.g., (i) different
hyperparameters are used, (ii) hybridization is misused, resulting in
semantically inequivalent code [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ], (iii) M3′ may be an
optimized DL model, e.g., having fewer edges, being more
modular, and avoiding over-fitting. In Fig. 2, M3′ represents a
modular and refactored system derived from M3 via R1, where
semantics is preserved through separation of concerns, such
as using supervised classification labels for maintenance and
reduced model training time [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. This indicates that M3′
does better than M3 with respect to ReLESS optimization
while preserving the potential to explore generalizability
and scalability.
          </p>
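          <p>A minimal sketch of the hybridization just described (our own example with placeholder model and data; tf.function itself is the TensorFlow 2 API discussed above) follows:</p>
          <preformat>
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD()

@tf.function  # traces the eager code into a graph, typically speeding up training
def train_step(x, y):
    with tf.GradientTape() as tape:
        probs = model(x, training=True)
        loss = loss_fn(y, probs)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# One step on a random batch; removing @tf.function keeps the semantics
# but usually runs slower (eager execution).
x = tf.random.normal([32, 4])
y = tf.random.uniform([32], maxval=10, dtype=tf.int32)
print(float(train_step(x, y)))
          </preformat>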
          <p>Our analysis not only sheds light on the current
state-of-the-art but also establishes a linkage between program
transformation techniques and their operational viability
in scenarios where safety is of paramount concern. We
then use these observations to formally define semantic
preservation using a multi-objective optimization function
rather than a single-objective one in LESS.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Semantic Preservation: Formal Definition and Verification Metrics</title>
        <p>We first define the semantic preservation of LESS based
on varying ranges of the output. The Venn diagram Fig. 3
shows the outputs from the original code and proposed
ReLESS. The upper circle in blue is the output from the original
code, e.g., the probability of correct labels for a classification
or prediction task. The lower circle in yellow is the output
from ReLESS. This diagram examines where the two outputs
are equivalent (overlapping area) and where they are
diferent. Suppose  is the acceptable range of overlap, i.e., how
much developers/engineers/scientists are willing to trade
accuracy with other factors viz. robustness, run-time
performance, interpretability etc. Ideally, the overlapped area
should be as large as possible, but this is not always the case
and is application-dependent. For instance, if the system is
time-critical, then the response time is emphasized in the
optimization even though there are marginal accuracy losses.
If the system is safety-critical, then the accuracy should be
preserved as much as possible. That said, we posit that to
achieve semantic preservation in ReLESS, it is inadequate to
consider accuracy as the sole optimization metric.</p>
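        <p>To make the overlap notion of Fig. 3 concrete, a minimal sketch (our illustration; the tolerance value and the per-example agreement measure are assumptions) could compare the two systems' output probabilities example by example:</p>
        <preformat>
import numpy as np

def output_overlap(probs_original, probs_refactored, tol=0.05):
    """Fraction of examples whose class probabilities from the original and
    refactored systems agree within an absolute tolerance tol."""
    worst_gap = np.abs(probs_original - probs_refactored).max(axis=1)
    agree = tol - worst_gap >= 0  # True where the largest per-class gap is within tol
    return float(np.mean(agree))

# Hypothetical usage: rows are examples, columns are class probabilities.
p_old = np.array([[0.90, 0.10], [0.20, 0.80], [0.60, 0.40]])
p_new = np.array([[0.88, 0.12], [0.30, 0.70], [0.58, 0.42]])
print(output_overlap(p_old, p_new))  # 2 of 3 examples agree within 0.05
        </preformat>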
        <p>
          Prior works [
          <xref ref-type="bibr" rid="ref13 ref38">13, 38</xref>
          ] have formalized balancing between
accuracy and reliability/robustness and fairness. OBrien et al.
[
          <xref ref-type="bibr" rid="ref48">48</xref>
          ] define non-functional LESS metrics as run-time
performance (speed), security, privacy, and memory (storage).
Building upon these foundational studies, we extend the
evaluation framework for semantic preservation to
explicitly encompass safety as an overarching theme. Run-time
performance, as highlighted by OBrien et al. [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ], serves not
only as a measure of efficiency but also influences system
safety by ensuring timely responses in critical scenarios.
Robustness, as documented by Hu et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], is directly linked
to safety, reflecting the system’s capacity to withstand errors
and adversities. Finally, interpretability, introduced by [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ],
enhances safety by providing clarity on decision-making
processes, thereby allowing for greater accountability and
easier identification of potential safety breaches. These
three metrics collectively forge a more resilient and
safety-conscious framework for assessing semantic preservation
in LESS.
        </p>
        <p>
          This tailored approach allows for a more integrated and
holistic assessment of LESS, aligning closely with
contemporary LESS development and deployment needs. All three
non-functional metrics are combined with accuracy using customized
importance factors to guide which degree of flexibility the
engineers, scientists, and researchers want the model they
work on to emphasize. We propose a multi-objective
optimization function, akin to Nguyen et al. [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]’s approach, to
determine the difference (loss function) between a LESS and
its corresponding ReLESS. We argue that if the loss is below
a certain threshold with constraints (as discussed in Fig. 3),
then semantic preservation is maintained.
        </p>
        <p>
          As one of the state-of-the-art formal methods,
optimization via loss functions is central to the training of ML/DL
models [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]. It is recognized for its adaptability to a wide
range of applications. Different trade-offs exist when
refactoring in ML/DL systems, so a multi-objective optimization
function is constructed. Besides, optimization can
standardize each metric term that needs to be balanced with accuracy
in loss function to make the whole system understandable
to the target audience. The range of optimization applied is
from classical ML models (random forest, gradient boost) to
DNN models with supported libraries, e.g., auto-sklearn [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ]
and AutoKeras [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ].
        </p>
        <sec id="sec-3-3-1">
          <title>3.2.1. Accuracy, Run-time Performance, Robustness, and Interpretability</title>
          <p>To comprehensively evaluate the performance of ReLESS,
we will consider accuracy with three key loss functions:
run-time performance, robustness, and interpretability.</p>
          <p>a. ACCuracy (ACC) is the number of correct outputs over the
total number of instances:</p>
          <p>ACC = (TP + TN) / (TP + TN + FP + FN) (1)</p>
          <p>where TP and TN are the number of positive instances and
negative instances correctly classified, and FP and FN are
the number of instances incorrectly classified.</p>
          <p>b. Run-Time Performance Improvement (RTPI) is
determined by comparing the observed run-time of the original
(old) code and new (transformed) code:</p>
          <p>RTPI = (T_old − T_new) / T_old (2)</p>
          <p>where T_old and T_new are the observed run-times of the
original and transformed code, respectively.</p>
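          <p>A direct reading of Eqs. (1) and (2) in code (a minimal sketch; the normalized form of RTPI reflects our reconstruction of the equation) follows:</p>
          <preformat>
def acc(tp, tn, fp, fn):
    """Eq. (1): fraction of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)

def rtpi(t_old, t_new):
    """Eq. (2): run-time improvement of the transformed code, normalized by the
    original run-time (our reading of the reconstructed equation)."""
    return (t_old - t_new) / t_old

print(acc(tp=480, tn=450, fp=20, fn=50))  # 0.93
print(rtpi(t_old=120.0, t_new=90.0))      # 0.25
          </preformat>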
          <p>c. ROBustness Improvement is indicated as ROBI:</p>
          <p>ROBI = (1 / |D|) ∗ Σ_(x, y) ∈ D [L_old(x, y) − L_new(x, y)] (3)</p>
          <p>
            where D is the input dataset, T is the training dataset, and Y
is the set of corresponding labels for a supervised learning
task, such as image classification. Similar to RTPI, the
observed difference is captured in the loss functions of the old and new
models: ROBI is observed as the difference in the loss
function between the old and new models. Our definition of
robustness is based on [
            <xref ref-type="bibr" rid="ref10 ref52">10, 52</xref>
            ], where the robustness of a system
after refactoring is verified by its loss function after
adversarial training; for classical ML, robustness after refactoring can
also be verified by its loss function after data augmentation,
feature engineering, and ensemble learning.
          </p>
          <p>
            d. INTerpretability Improvement is indicated as INTI.
          </p>
          <p>INTI = (1 / |S|) ∗ Σ_x ∈ S L(f(x), g(x)) (4)</p>
          <p>
            where S is a subset of the input dataset D, f is the trained
model, and g is an interpretable (surrogate) model.
Molnar [
            <xref ref-type="bibr" rid="ref53">53</xref>
            ] defines interpretability in machine learning
using interpretable models and a simplified loss function. The
loss function serves as a quantitative measure to compare
the interpretability of different models while maintaining
accuracy. This approach blends the conceptual
understanding of model behavior (through interpretable models) with a
practical, measurable way (using the loss function) to assess
and compare the clarity and comprehensibility of different
models. We compute this metric using this definition of
interpretability, taking a subset S of D as input. We are able to
compare the difference between the new and old models via
the corresponding interpretability score [
            <xref ref-type="bibr" rid="ref36">36</xref>
            ]. The
implication is that a simpler loss function on the explainable
refactored system correlates with higher interpretability,
which is plausible, but the exact method of determining
explainability is essential here.
          </p>
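          <p>A sketch of Eqs. (3) and (4) (our reconstruction; the per-example losses, the surrogate model, and the subset choice are assumptions made for illustration) follows:</p>
          <preformat>
import numpy as np

def robi(loss_old, loss_new):
    """Eq. (3): mean per-example reduction in loss after refactoring; the inputs
    are per-example losses of the old and new models on the same dataset D."""
    return float(np.mean(loss_old - loss_new))

def interpretability_score(surrogate_loss):
    """Eq. (4): mean loss of an interpretable surrogate g mimicking the model f
    on a subset S of D; lower values suggest the model is easier to approximate."""
    return float(np.mean(surrogate_loss))

def inti(surrogate_loss_old, surrogate_loss_new):
    """Interpretability improvement: old surrogate loss minus new surrogate loss."""
    return interpretability_score(surrogate_loss_old) - interpretability_score(surrogate_loss_new)

# Hypothetical usage with made-up per-example losses.
print(robi(np.array([0.9, 0.7, 0.8]), np.array([0.6, 0.5, 0.7])))  # approx. 0.2
print(inti(np.array([0.30, 0.40]), np.array([0.20, 0.30])))        # approx. 0.1
          </preformat>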
          <p>To sum up, we define a multi-objective optimization loss
function that facilitates balancing the importance of various
objectives depending on the application domain. Each
metric is formulated as a ratio or a normalized value, which is
typical in performance evaluation to provide a standardized
measure of improvement or degradation. In the equation
below, the metric terms are the loss measurements
calculated from Eqs. (2) to (4) for the RTPI, ROBI, and
INTI metrics respectively, and accuracy is ACC; f is the
model with its parameters, and D is the dataset. Each term's
weight coefficient w_i is assumed to be user-defined and
indicates the importance of each of the metrics during model
evaluation.</p>
          <p>L(f, D) = w1 × ACC + w2 × RTPI + w3 × ROBI + w4 × INTI (5)</p>
          <p>where w1, w2, w3, and w4 are weights that reflect the
importance of each term in the loss function.</p>
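          <p>A minimal sketch of Eq. (5) (our illustration; the weights and metric values are placeholders, and treating larger values as better for every term is an assumption) follows:</p>
          <preformat>
def reless_objective(acc, rtpi, robi, inti, weights=(0.4, 0.2, 0.2, 0.2)):
    """Eq. (5): weighted combination of accuracy and the improvement metrics,
    with user-defined importance factors weights = (w1, w2, w3, w4)."""
    w1, w2, w3, w4 = weights
    return w1 * acc + w2 * rtpi + w3 * robi + w4 * inti

# Hypothetical usage: score an original system and a candidate ReLESS.
original = reless_objective(acc=0.95, rtpi=0.0, robi=0.0, inti=0.0)
candidate = reless_objective(acc=0.94, rtpi=0.25, robi=0.20, inti=0.10)
print(original, candidate)  # approx. 0.38 vs. 0.486 under these weights
          </preformat>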
          <p>
            The multi-objective optimization function in this
formalism enables the determination of whether a ReLESS is a
semantically preserving transformation to its LESS.
Moreover, when fusing these measurements, it is also essential to
include the measure of accuracy because regardless of the
importance of the speed of operation, robustness, and
interpretability, producing correct outputs is the cornerstone
of model evaluation. In other words, accuracy is always a
first-class objective. Only by considering the critical role of
accuracy can we ensure that a model is trustworthy [
            <xref ref-type="bibr" rid="ref54">54</xref>
            ].
          </p>
          <p>Expanding on the conceptual structure presented in the
preceding part, we describe a preliminary experimental
configuration intended to closely assess the
accuracy, run-time
performance, robustness, and interpretability of LESS.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>This section describes the experimental setup employed for
a preliminary evaluation of the proposed ReLESS framework
for a simple case study. We describe the datasets used for
experiments, followed by an explanation of the experimental
design and the metrics adopted to assess the efficacy of the
refactorings.</p>
      <sec id="sec-4-1">
        <title>4.1. Datasets and Models</title>
        <p>
          As indicated in Section 2, we study ReLESS in the context of
two image classification datasets: the ImageNet dataset and
the MNIST dataset. The ImageNet dataset [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ], comprised
of 1.2 million images across 1000 categories, is utilized for
the evaluation to assess reliability and robustness, with a
specific subset of 50,000 images filtered following Hu et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The MNIST
dataset [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ], containing 60,000 training images and 10,000
test images of handwritten digits, serves as the basis for
initial evaluations. These datasets enable preliminary
assessments of the refactorings' effectiveness before
proceeding to more complex scenarios. Our experimental models
include fully connected neural networks with 1 to 4 layers
for the MNIST dataset, and pre-trained complex
architectures such as AlexNet [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ], ResNet50 [
          <xref ref-type="bibr" rid="ref56">56</xref>
          ], VGG16 [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ], and
GoogleNet [
          <xref ref-type="bibr" rid="ref58">58</xref>
          ] for the ImageNet dataset.
        </p>
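        <p>For reference, a minimal sketch of loading such pre-trained ImageNet architectures (assuming a recent torchvision; the exact checkpoints used in our experiments may differ) follows:</p>
        <preformat>
import torch
from torchvision import models

# Pre-trained ImageNet classifiers corresponding to the architectures above.
backbones = {
    "alexnet": models.alexnet(weights="IMAGENET1K_V1"),
    "resnet50": models.resnet50(weights="IMAGENET1K_V1"),
    "vgg16": models.vgg16(weights="IMAGENET1K_V1"),
    "googlenet": models.googlenet(weights="IMAGENET1K_V1"),
}

for name, model in backbones.items():
    model.eval()  # inference mode for evaluation-only experiments
    with torch.no_grad():
        out = model(torch.randn(1, 3, 224, 224))
    print(name, out.shape)  # each head produces 1000 ImageNet class scores
        </preformat>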
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiment and Results</title>
        <p>
          In our experimental setup, we applied the methodologies
outlined by Pan and Rajan [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and Hu et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], along
with techniques detailed in Section 3, across both datasets
to scrutinize the refactored systems with respect to accuracy,
run-time performance, robustness, and interpretability. The
results of experiments are summarized in Table 1.
        </p>
      <p>From Table 1, we observe that the refactored models
exhibit a marginal decrease in accuracy on the MNIST dataset,
with a difference of 0.001. This decrease is attributed to the
expanded modular complexity, which results in a run-time
increase of 414.7 seconds. The modularity of the refactored
model is significantly higher than the original model, with
a difference of 8. The robustness of the refactored model is
also higher, with a difference of 2.0476. The interpretability
of the refactored model is higher, with an accuracy
difference of 0.0769. Increases in both metrics indicate that
the refactored system exhibits improved robustness and
interpretability after decomposing. However, although the
robustness has improved, the accuracy for refactored
systems using the ImageNet dataset has decreased, falling
below that of a coin flip. Therefore, modularity appears not
only to be harmless but also beneficial to system safety, as
it maintains accuracy and improves robustness. However, for
the optimization of the aforementioned complex systems,
more efforts are required to prevent accuracy loss,
particularly in safety-critical tasks. More details can be found at
https://github.com/NanJ90/ReLess-testing-tool.</p>
      <p>To summarize, we present an initial assessment of the
ReLESS evaluation framework and describe the datasets used,
the experimental design, and the metrics for evaluating
refactorings. The comparative analysis of original and
refactored models reveals that different datasets and models can
exhibit significant variations across different performance
metrics. For instance, while the performance of the
ImageNet model remained relatively consistent after speedup,
the modularized MNIST model took 168 times longer than
the original. This underscores the critical importance of
evaluating effects across multiple datasets and models to
gain comprehensive insights into performance implications
w.r.t. accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>Our contribution in this work includes a review of
literature focused on refactoring in LESS, particularly with an
emphasis on safety considerations. This review critically
analyzes the spectrum of assessments presented across
various studies, each contributing to a facet of the AI safety
standard. We further explore and elucidate the
interrelationships between these safety metrics and the accuracy
of AI systems, highlighting the implications for model
development and deployment. Our preliminary results set a
potential foundation to help LESS attain the long-term evolvability
and robustness that are traditionally enjoyed by
conventional systems during development and deployment,
and thereby improve the safety of LESS. The scientists and
engineers who develop AI systems will be able to rely on the
refactored systems and trust them to make decisions that
are safe, secure, and trustworthy. Our future work includes
understanding how the thresholds in Fig. 3 will be determined
for various applications and how the user can determine
the weights for the various metrics. We have described an
initial validation of our framework; however, further
experimentation that includes more metrics, such as fairness and
privacy, and extends the validation to a variety of problem
domains and case studies is essential to comprehensively
assess its effectiveness and generalizability. This would also
enable practitioners to prioritize specific components when
evaluating LESS and could even lead to design-to-criteria
LESS.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank Ayan Kohli for the initial investigation into
semantic similarity and the anonymous reviewers for their helpful
comments. This work is supported by the National Science
Foundation (NSF) under Agreement No. CCF-2200343.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Martínez-Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bogner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Franch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oriol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Siebert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trendowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Vollmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <source>Software Engineering for AI-Based Systems: A Survey</source>
          ,
          <source>ACM Transactions on Software Engineering and Methodology</source>
          <volume>31</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>59</lpage>
          . URL: http://arxiv.
          <source>org/abs/2105</source>
          .
          <year>01984</year>
          . doi:
          <volume>10</volume>
          .1145/3487043. arXiv:
          <fpage>2105</fpage>
          .
          <year>01984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Breck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cai</surname>
          </string-name>
          , E. Nielsen,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          ,
          <article-title>The ML test score: A rubric for ML production readiness and technical debt reduction</article-title>
          ,
          <source>in: 2017 IEEE International Conference on Big Data (Big Data)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1123</fpage>
          -
          <lpage>1132</lpage>
          . doi:
          <volume>10</volume>
          .1109/BigData.
          <year>2017</year>
          .
          <volume>8258038</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] ISO/IEC 14764,
          <string-name>
            <surname>Software</surname>
            <given-names>Engineering - Software</given-names>
          </string-name>
          <string-name>
            <surname>Life Cycle Processes - Maintenance</surname>
          </string-name>
          , International Organizations for Standardization, Geneva, Switzerland,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <article-title>Refactoring sequential java code for concurrency via concurrent libraries</article-title>
          , in: International Conference on Software Engineering, IEEE,
          <year>2009</year>
          , pp.
          <fpage>397</fpage>
          -
          <lpage>407</lpage>
          . doi:
          <volume>10</volume>
          .1109/icse.
          <year>2009</year>
          .
          <volume>5070539</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Golovin</surname>
          </string-name>
          , E. Davydov,
          <string-name>
            <given-names>T.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ebner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Crespo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dennison</surname>
          </string-name>
          ,
          <article-title>Hidden technical debt in Machine Learning systems</article-title>
          ,
          <source>in: Neural Information Processing Systems</source>
          , volume
          <volume>2</volume>
          <source>of NIPS '15</source>
          , MIT Press,
          <year>2015</year>
          , pp.
          <fpage>2503</fpage>
          -
          <lpage>2511</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dolby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shinnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Allain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reinen</surname>
          </string-name>
          ,
          <article-title>Ariadne: Analysis for machine learning programs</article-title>
          ,
          <source>in: International Workshop on Machine Learning and Programming Languages, MAPL</source>
          <year>2018</year>
          ,
          <string-name>
            <surname>ACM</surname>
            <given-names>SIGPLAN</given-names>
          </string-name>
          , ACM, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . doi:
          <volume>10</volume>
          .1145/32 11346.3211349.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>[7] Executive order on the safe, secure, and trustworthy development</article-title>
          and
          <source>use of artificial intelligence</source>
          ,
          <year>2023</year>
          . URL: https://www.whitehouse.gov/briefing-room/presidential-actions/
          <year>2023</year>
          /10/30/executive
          <article-title>-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Madiega</surname>
          </string-name>
          ,
          <source>Artificial intelligence act</source>
          ,
          <source>European Parliament: European Parliamentary Research Service</source>
          (
          <year>2021</year>
          ). URL: https://www.europarl.europa.eu/RegData/etudes/BRIE/
          <year>2021</year>
          /698792/EPRS_BRI(
          <year>2021</year>
          )
          <article-title>698792_ EN</article-title>
          .pdf .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Norvig</surname>
          </string-name>
          , Artificial Intelligence:
          <article-title>A Modern Approach</article-title>
          ., 4 ed.,
          <source>Pearson</source>
          ,
          <year>2020</year>
          . doi:https://doi. org/10.1007/978-3-
          <fpage>030</fpage>
          -82681-9.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Engstrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madry</surname>
          </string-name>
          , Robustness may be at odds with accuracy,
          <year>2019</year>
          . arXiv:
          <year>1805</year>
          .12152.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Rajbahadur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. M. J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Towards training reproducible deep learning models</article-title>
          , in: International Conference on Software Engineering, ICSE '22,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>2202</fpage>
          -
          <lpage>2214</lpage>
          . doi:
          <volume>10</volume>
          .1145/3510003.3510163.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rajan</surname>
          </string-name>
          ,
          <article-title>On decomposing a deep neural network into modules</article-title>
          ,
          <source>in: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</source>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>889</fpage>
          -
          <lpage>900</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3368089.3409668. doi:
          <volume>10</volume>
          .1145/3368089.3409668.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marsso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Czarnecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chechik</surname>
          </string-name>
          ,
          <article-title>If a Human Can See It, So Should Your System: Reliability Requirements for Machine Vision Components</article-title>
          , in
          <source>: Proceedings of the 44th International Conference on Software Engineering</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1145</fpage>
          -
          <lpage>1156</lpage>
          . URL: http://arxiv.org/abs/2202.03930. doi:10.1 145/3510003.3510109. arXiv:
          <volume>2202</volume>
          .
          <fpage>03930</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rathod</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Balan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wojna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <article-title>Speed/accuracy trade-ofs for modern convolutional object detectors</article-title>
          ,
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2016</year>
          )
          <fpage>3296</fpage>
          -
          <lpage>3297</lpage>
          . URL: https://api.semanticscholar.or g/CorpusID:206595627.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Seshia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dreossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fremont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          , E. Kim,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shivakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vazquez-Chanlatte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <article-title>Formal Specification for Deep Neural Networks</article-title>
          ,
          <source>Technical Report UCB/EECS-2018-25</source>
          , EECS Department, University of California, Berkeley,
          <year>2018</year>
          . URL: http: //www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EE CS-2018-25.html.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <article-title>Security versus accuracy: Trade-of data modeling to safe fault classification systems</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi:
          <volume>10</volume>
          .1109/TNNLS.
          <year>2023</year>
          .
          <volume>3251999</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>W. F.</given-names>
            <surname>Opdyke</surname>
          </string-name>
          ,
          <article-title>Refactoring object-oriented frameworks</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Illinois at UrbanaChampaign, Champaign, IL, USA,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fowler</surname>
          </string-name>
          ,
          <article-title>Refactoring: Improving the Design of Existing Code, Addison-</article-title>
          <string-name>
            <surname>Wesley</surname>
          </string-name>
          , Boston, MA, USA,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W. G.</given-names>
            <surname>Griswold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. F.</given-names>
            <surname>Opdyke</surname>
          </string-name>
          ,
          <article-title>The birth of refactoring: A retrospective on the nature of high-impact software engineering research 32 (</article-title>
          <year>2015</year>
          )
          <fpage>30</fpage>
          -
          <lpage>38</lpage>
          . doi:
          <volume>10</volume>
          .1109/ MS.
          <year>2015</year>
          .
          <volume>107</volume>
          , conference Name: IEEE Software.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nagappan</surname>
          </string-name>
          ,
          <article-title>A field study of refactoring challenges and benefits</article-title>
          ,
          <source>in: Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</source>
          , ACM, Cary, North Carolina,
          <year>2012</year>
          , p.
          <fpage>1</fpage>
          . doi:
          <volume>10</volume>
          .1145/2393596.2393655.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>E. A.</given-names>
            <surname>AlOmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Mkaouer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ouni</surname>
          </string-name>
          ,
          <article-title>On preserving the behavior in software refactoring: A systematic mapping study</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>140</volume>
          (
          <year>2021</year>
          )
          <article-title>106675</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.infs of.
          <year>2021</year>
          .
          <volume>106675</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bagherzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>Safe automated refactoring for intelligent parallelization of Java 8 streams</article-title>
          , in: International Conference on Software Engineering, ICSE '19, ACM/IEEE, IEEE, Piscataway, NJ, USA,
          <year>2019</year>
          , pp.
          <fpage>619</fpage>
          -
          <lpage>630</lpage>
          . doi:
          <volume>10</volume>
          .1109/icse.
          <year>2019</year>
          .
          <volume>00072</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tip</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Fuhrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kieżun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Balaban</surname>
          </string-name>
          , B. De Sutter,
          <article-title>Refactoring using type constraints</article-title>
          ,
          <source>ACM Transactions on Programming Languages and Systems</source>
          <volume>33</volume>
          (
          <year>2011</year>
          ) 9:
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          :
          <fpage>47</fpage>
          . doi:
          <volume>10</volume>
          .1145/1961204.1961205.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sawin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rountev</surname>
          </string-name>
          ,
          <article-title>Automated refactoring of legacy Java software to enumerated types</article-title>
          ,
          <source>in: International Conference on Software Maintenance, ICSM '07</source>
          , IEEE, Paris, France,
          <year>2007</year>
          , pp.
          <fpage>224</fpage>
          -
          <lpage>233</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICSM.
          <year>2007</year>
          .
          <volume>4362635</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Masuhara</surname>
          </string-name>
          ,
          <article-title>Automated refactoring of legacy Java software to default methods</article-title>
          , in: International Conference on Software Engineering, ICSE '17, ACM/IEEE, IEEE Press, Piscataway, NJ, USA,
          <year>2017</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>93</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICSE.
          <year>2017</year>
          .
          <volume>16</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Masuhara</surname>
          </string-name>
          ,
          <article-title>Proactive empirical assessment of new language feature adoption via automated refactoring: The case of Java 8 default methods</article-title>
          ,
          <source>in: International Conference on the Art, Science, and Engineering of Programming</source>
          , volume
          <volume>2</volume>
          of Programming '18,
          <string-name>
            <surname>AOSA</surname>
          </string-name>
          , Nice, France,
          <year>2018</year>
          , pp.
          <volume>6</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          :
          <fpage>30</fpage>
          . doi:
          <volume>10</volume>
          .22152/programming-journal.
          <source>org/201 8/2/6.</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tourwe</surname>
          </string-name>
          ,
          <article-title>A survey of software refactoring</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>30</volume>
          (
          <year>2004</year>
          )
          <fpage>126</fpage>
          -
          <lpage>139</lpage>
          . doi:
          <volume>10</volume>
          .1109/TSE.
          <year>2004</year>
          .
          <volume>1265817</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Eleftheriadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kekatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Katsaros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tripakis</surname>
          </string-name>
          ,
          <article-title>On neural network equivalence checking using SMT solvers</article-title>
          , in: S. Bogomolov, D. Parker (Eds.),
          <source>Formal Modeling and Analysis of Timed Systems, Lecture Notes in Computer Science</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>257</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>031</fpage>
          -15839-1_
          <fpage>14</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dilhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ketkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dig</surname>
          </string-name>
          ,
          <year>Understanding software2</year>
          .
          <article-title>0: A study of machine learning library usage and evolution</article-title>
          ,
          <source>ACM Transactions on Software Engineering and Methodology</source>
          <volume>30</volume>
          (
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .1145/3453478.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dilhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ketkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sannidhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dig</surname>
          </string-name>
          ,
          <article-title>Discovering repetitive code changes in python ml systems</article-title>
          , in: International Conference on Software Engineering, ICSE '22,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>736</fpage>
          -
          <lpage>748</lpage>
          . doi:
          <volume>10</volume>
          .1145/3510 003.3510225.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bavota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <article-title>A large-scale empirical study on self-admitted technical debt</article-title>
          ,
          <source>in: International Conference on Mining Software Repositories, MSR '16</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA,
          <year>2016</year>
          , pp.
          <fpage>315</fpage>
          -
          <lpage>326</lpage>
          . doi:
          <volume>10</volume>
          .1145/2901739.2901742.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brown</surname>
          </string-name>
          , Y. Cai,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kazman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kruchten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. MacCormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nord</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Ozkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sangwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Seaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zazworka</surname>
          </string-name>
          ,
          <article-title>Managing technical debt in software-reliant systems</article-title>
          , in: FSE/SDP Workshop on Future of Software Engineering Research, FoSER '10,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA,
          <year>2010</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>52</lpage>
          . doi:
          <volume>10</volume>
          .1145/1882362.
          <year>1882</year>
          373.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>B.</given-names>
            <surname>Christians</surname>
          </string-name>
          ,
          <article-title>Self-admitted technical debt-an investigation from farm to table to refactoring</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aurum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vidgen</surname>
          </string-name>
          ,
          <article-title>An exploration of technical debt</article-title>
          ,
          <source>Journal of Systems and Software</source>
          <volume>86</volume>
          (
          <year>2013</year>
          )
          <fpage>1498</fpage>
          -
          <lpage>1516</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.jss.
          <year>2012</year>
          .
          <volume>12</volume>
          .052.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bagherzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raja</surname>
          </string-name>
          ,
          <article-title>An empirical study of refactorings and technical debt in Machine Learning systems</article-title>
          , in: International Conference on Software Engineering, ICSE '21, IEEE/ACM, IEEE, Madrid, Spain,
          <year>2021</year>
          , pp.
          <fpage>238</fpage>
          -
          <lpage>250</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICSE43902.
          <year>2021</year>
          .
          <volume>00033</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , Semantic-preserving
          <source>adversarial code comprehension</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/ab s/2209.05130. doi:
          <volume>10</volume>
          .48550/arXiv.2209.05130. arXiv:
          <volume>2209</volume>
          .05130 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Z.-L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-G.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>A distancebased weighting framework for boosting the performance of dynamic ensemble selection 56 (</article-title>
          <year>2019</year>
          )
          <fpage>1300</fpage>
          -
          <lpage>1316</lpage>
          . URL: https://www.sciencedirect.com/science/ar ticle/pii/S030645731830712X. doi:
          <volume>10</volume>
          .1016/j.ipm.
          <year>2019</year>
          .
          <volume>03</volume>
          .009.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>G.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rajan</surname>
          </string-name>
          ,
          <article-title>Fix fairness, don't ruin accuracy: Performance aware fairness repair using automl</article-title>
          ,
          <source>arXiv preprint arXiv:2306.09297</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gowal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stanforth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Boning</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-J. Hsieh</surname>
          </string-name>
          ,
          <article-title>Towards stable and efifcient training of verifiably robust neural networks</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1906</year>
          .06316.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>S.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , G. Sharma,
          <article-title>Omnivec: Learning robust representations with cross modal sharing</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1236</fpage>
          -
          <lpage>1248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , C. Cortes,
          <article-title>MNIST handwritten digit database (</article-title>
          <year>2010</year>
          ). URL: http://yann.lecun.com/exd b/mnist/.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>O.</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satheesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , L. Fei-Fei,
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          (IJCV)
          <volume>115</volume>
          (
          <year>2015</year>
          )
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          . doi:
          <volume>10</volume>
          .1007/s11263-015-0816-y.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          ,
          <article-title>Benchmarking neural network robustness to common corruptions</article-title>
          and perturbations,
          <year>2019</year>
          . arXiv:
          <year>1903</year>
          .12261.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z</surname>
          </string-name>
          . Zhang,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Han,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , J. rong Wen,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Pre-trained models: Past, present and future</article-title>
          ,
          <source>ArXiv abs/2106</source>
          .07139 (
          <year>2021</year>
          ). URL: https://api.semanticscholar.org/CorpusID:235421816.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rajan</surname>
          </string-name>
          ,
          <article-title>Decomposing convolutional neural networks into reusable and replaceable modules</article-title>
          , in: International Conference on Software Engineering, number arXiv:
          <volume>2110</volume>
          .07720 in ICSE '
          <volume>22</volume>
          , arXiv, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>524</fpage>
          -
          <lpage>535</lpage>
          . doi:
          <volume>10</volume>
          .1145/3510003. 3510051. arXiv:
          <volume>2110</volume>
          .
          <fpage>07720</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>T.</given-names>
            <surname>Castro</surname>
          </string-name>
          <string-name>
            <surname>Vélez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bagherzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raja</surname>
          </string-name>
          ,
          <article-title>Challenges in migrating imperative deep learning programs to graph execution: An empirical study</article-title>
          ,
          <source>in: International Conference on Mining Software Repositories, MSR '22</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>469</fpage>
          -
          <lpage>481</lpage>
          . doi:
          <volume>10</volume>
          .1145/3524842.3528455.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <surname>Google</surname>
            <given-names>LLC</given-names>
          </string-name>
          ,
          <article-title>Better performance with tf</article-title>
          .
          <source>function</source>
          ,
          <year>2021</year>
          . URL: https://tensorflow.org/guide/function.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <surname>D. OBrien</surname>
            , S. Biswas,
            <given-names>S. M.</given-names>
          </string-name>
          <string-name>
            <surname>Imtiaz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Abdalkareem</surname>
          </string-name>
          , E. Shihab, H. Rajan, 23 Shades of Self-Admitted
          <source>Technical Debt: An Empirical Study on Machine Learning Software</source>
          (
          <year>2022</year>
          )
          <fpage>13</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Demeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Du</given-names>
            <surname>Bois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stenten</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. Van Gorp</surname>
          </string-name>
          ,
          <source>Refactoring: Current research and future trends, Electronic Notes in Theoretical Computer Science</source>
          <volume>82</volume>
          (
          <year>2003</year>
          )
          <fpage>483</fpage>
          -
          <lpage>499</lpage>
          . URL: https://www.sciencedirect.co m/science/article/pii/S1571066105826246. doi:ht tps://doi.org/10.1016/S1571-
          <volume>0661</volume>
          (
          <issue>05</issue>
          )826
          <fpage>24</fpage>
          -
          <lpage>6</lpage>
          , lDTA'
          <fpage>2003</fpage>
          - Language descriptions,
          <source>Tools and Applications.</source>
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>M.</given-names>
            <surname>Feurer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Eggensperger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Springenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Eficient and robust automated machine learning</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ),
          <year>2015</year>
          , pp.
          <fpage>2962</fpage>
          -
          <lpage>2970</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Autokeras:</surname>
          </string-name>
          <article-title>An automl library for deep learning</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . URL: http://jmlr.org /papers/v24/
          <fpage>20</fpage>
          -
          <lpage>1355</lpage>
          .html.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Robustness and accuracy could be reconcilable by (proper) definition</article-title>
          , in: International
          <source>Conference on Machine Learning</source>
          ,
          <year>2022</year>
          . URL: https://api.semanticscholar.org/CorpusID: 247011694.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <surname>C. Molnar,</surname>
          </string-name>
          <article-title>Interpretable machine learning</article-title>
          ,
          <source>Lulu. com</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddharthan</surname>
          </string-name>
          ,
          <article-title>Empirical optimal risk to quantify model trustworthiness for failure detection</article-title>
          ,
          <source>arXiv preprint arXiv:2308.03179</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <article-title>One weird trick for parallelizing convolutional neural networks</article-title>
          ,
          <year>2014</year>
          . arXiv:
          <volume>1404</volume>
          .
          <fpage>5997</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <year>2015</year>
          . arXiv:
          <volume>1409</volume>
          .
          <fpage>1556</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <article-title>Going deeper with convolutions</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>