<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ReLESS: A Framework for Assessing Safety in Deep Learning Systems</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Nan</forename><surname>Jia</surname></persName>
							<email>njia@gradcenter.cuny.edu</email>
							<affiliation key="aff0">
								<orgName type="department">the Graduate Center</orgName>
								<orgName type="institution">CUNY</orgName>
								<address>
									<addrLine>365 Fifth Avenue</addrLine>
									<postCode>10016</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Anita</forename><surname>Raja</surname></persName>
							<email>anita.raja@hunter.cuny.edu</email>
							<affiliation key="aff0">
								<orgName type="department">the Graduate Center</orgName>
								<orgName type="institution">CUNY</orgName>
								<address>
									<addrLine>365 Fifth Avenue</addrLine>
									<postCode>10016</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">CUNY</orgName>
								<orgName type="institution" key="instit2">Hunter College</orgName>
								<address>
									<addrLine>695 Park Avenue</addrLine>
									<postCode>10065</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Raffi</forename><surname>Khatchadourian</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">the Graduate Center</orgName>
								<orgName type="institution">CUNY</orgName>
								<address>
									<addrLine>365 Fifth Avenue</addrLine>
									<postCode>10016</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">CUNY</orgName>
								<orgName type="institution" key="instit2">Hunter College</orgName>
								<address>
									<addrLine>695 Park Avenue</addrLine>
									<postCode>10065</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ReLESS: A Framework for Assessing Safety in Deep Learning Systems</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C500760DFD7C252B2B8D57A1305B0489</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>learning-enabled software systems</term>
					<term>machine learning systems</term>
					<term>refactoring</term>
					<term>trusted AI software architectures</term>
					<term>AI safety</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Traditionally, software refactoring helps to improve a system's internal structure and enhance its non-functional features, such as reliability and run-time performance, while preserving external behavior, including original program semantics. However, in the context of learning-enabled software systems (LESS), e.g., Machine Learning (ML) systems, it is unclear which portions of a software's semantics require preservation at the development phase. This is mainly because (a) the behavior of LESS is not defined until run-time; and (b) ML algorithms are inherently iterative and non-deterministic. Consequently, there is a knowledge gap in what refactoring truly means in the context of LESS, as such systems have no guarantee of a predetermined correct answer. We thus conjecture that, to construct robust and safe LESS, it is imperative to understand the flexibility of refactoring LESS compared to traditional software and to measure it. In this paper, we introduce a novel conceptual framework named ReLESS for evaluating refactorings for supervised learning by (i) exploring the transformation methodologies taken by state-of-the-art LESS refactorings that focus on singular metrics, (ii) reviewing informal notions of semantics preservation and the level at which they occur (source code vs. trained model), and (iii) empirically comparing and contrasting existing LESS refactorings in the context of image classification problems. This framework will set the foundation not only to formalize a standard definition of semantics preservation in LESS but also to combine four metrics (accuracy, run-time performance, robustness, and interpretability) into a multi-objective optimization function, instead of the single-objective functions used in existing works, to assess LESS refactorings. In the future, our work could seek reliable LESS refactorings that generalize over diverse systems.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Developers of Learning-Enabled Software Systems (LESS) face the challenge of constructing highly reliable large-scale systems, as evidenced in previous research <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. With the pervasive integration of dynamic Machine Learning (ML) models in these operational software systems, safety, efficiency, and adaptability with respect to evolving user requirements become paramount. Moreover, software systems inherently evolve throughout their life-cycle <ref type="bibr" target="#b2">[3]</ref>, which traditionally incurs substantial costs and risks, particularly in the context of large, complex systems <ref type="bibr" target="#b3">[4]</ref>. Although LESS shares these traits with conventional software, its data-driven nature accentuates the propensity for evolution <ref type="bibr" target="#b4">[5]</ref>. This divergence from traditional software poses unique challenges for testing and verification due to its data-driven and uncertain requirements. Notably, the efficacy of resulting ML models, including Large Language Models (LLMs), improves with more extensive data inputs, necessitating a delicate balance between user privacy protection and model refinement in large-scale systems. Consequently, there arises a pressing need for validation and testing methodologies tailored to the distinctive characteristics of AI-driven systems.</p><p>This evolving research agenda underscores a critical reassessment of priorities in AI system development. Furthermore, as AI technologies permeate various sectors of society, scalable systems must effectively consider and adapt to legal, policy, and employment implications. These technical attributes not only underpin the functional aspects of AI applications but also facilitate their alignment with essential ethical standards and societal expectations <ref type="bibr" target="#b5">[6]</ref>. 
This imperative is further underscored by a recent U.S. government-issued Executive Order <ref type="bibr" target="#b6">[7]</ref> and the EU AI Act <ref type="bibr" target="#b7">[8]</ref>, emphasizing the necessity for Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Moreover, to ensure the positive societal impact of AI systems, accuracy, runtime performance, robustness, and interpretability are crucial technical attributes that directly support broader ethical objectives.</p><p>Recent works <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b8">9]</ref> have highlighted a variety of metrics for assessing the impacts of LESS transformation. These metrics include aspects such as ensuring safety and fairness, protecting privacy, fostering collaboration, considering legal and policy ramifications, and evaluating impacts on employment. Recent studies <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14]</ref> have investigated whether original and transformed systems should behave consistently before and after transformation. These studies illustrate the potential trade-offs between accuracy and each respective metric. Although various metrics like fairness and privacy are considered, in this work, we focus on accuracy, run-time performance, robustness, and interpretability as a starting point with the intent to cover the majority of AI safety concerns in LESS <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>. 
We argue that comprehending and harnessing the flexibility of refactoring in LESS represents a pivotal stride toward enhancing the safety of AI systems.</p><p>A detailed exposition of these metrics, as discussed in the state-of-the-art literature, is provided in Section 2.</p><p>Traditionally, the criterion for refactoring <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref> is that the same input must produce the same output; any deviation is considered a behavior change of the program and a threat that can lead to system crashes <ref type="bibr" target="#b18">[19]</ref>. However, refactoring is underexplored in the context of LESS, including deep learning frameworks <ref type="bibr" target="#b0">[1]</ref>. LESS, unlike traditional software systems, benefit from randomness yet lack a guarantee of a predefined exact outcome due to their reliance on the quantity and quality of data, which complicates predictions about the effects of refactoring.</p><p>This paper aims to bridge the knowledge gap between refactoring practices in traditional software <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b23">24,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b26">27]</ref> and LESS <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b27">28,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b29">30]</ref> by introducing ReLESS (Refactoring of Learning-Enabled Software Systems), an evaluation framework for standardizing and formalizing refactoring methodologies. In this work, we describe this framework in the context of supervised learning tasks, specifically image classification problems. 
Our hypothesis posits that the criteria for successful refactoring-namely, source-to-source transformation and semantic preservation-assume unique yet complementary implications in the context of LESS as opposed to traditional software systems.</p><p>Specifically, ReLESS will allow for the possibility that transformations might produce outputs that are slightly different from the original output as long as they lead to improvements in other performance metrics of the system. Determining how "different" the output can be from the original is a research question we seek to address. Moreover, our approach aims to discover and preserve safety-critical metrics during ReLESS while further mitigating the uncertainties introduced by their non-deterministic nature. While current approaches emphasize knowledge distillation (transferring knowledge from a large neural network to a smaller, resource-efficient one) and regularization (a technique for mitigating over-fitting), our vision for the future of ReLESS includes approaches that combine connectionist models (e.g., neural networks) and symbolic (e.g., decision tree) approaches, as well as Bayesian and analogizer (e.g., K-nearest neighbor, support vector machine) approaches.</p><p>This paper is structured as follows: in Section 2, we first provide a comprehensive analysis of state-of-the-art refactoring methodologies in LESS and discuss how these works trade off accuracy with respect to specific metrics such as run-time performance, robustness <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b15">16]</ref>, or interpretability. In Section 3, we contrast existing practices and scrutinize informal notions of semantics preservation across different levels (source code vs. trained model). 
We then motivate a novel thread of inquiry for the ReLESS evaluation framework and its multi-objective optimization function that combines the aforementioned multiple metrics to guarantee the AI system's safety. Section 4 presents preliminary experiments utilizing ReLESS to gauge LESS safety and associated parameters. Finally, in Section 5 we discuss the main insights gleaned from this work and our future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In recent years, considerable research has been conducted on LESS refactoring, with significant observations on balancing a single metric against model accuracy. Several studies have focused on image classification or object detection, addressing this tension and presenting innovative verification techniques. However, these approaches often suffer from a lack of generalization and a narrow scope of metrics, which we aim to address in our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Refactoring Types in Software Development</head><p>Refactoring <ref type="bibr" target="#b16">[17]</ref>, a well-known technique for the evolution and maintenance of traditional software, alters a system's internal structure without changing its behavior <ref type="bibr" target="#b17">[18]</ref> to improve non-functional characteristics such as run-time performance, security, and modularity, and to pay down technical debt <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32,</ref><ref type="bibr" target="#b32">33,</ref><ref type="bibr" target="#b33">34,</ref><ref type="bibr" target="#b34">35]</ref>. It can be considered a series of typically automated procedures for modifying code, such as renaming variables to enhance comprehension <ref type="bibr" target="#b18">[19]</ref>; such modifications frequently occur automatically within development environments. Formally, a refactoring is a program transformation, potentially spanning multiple, non-adjacent program statements or expressions, that is: (i) source-to-source and (ii) semantics-preserving, i.e., the behavior of the program is the same before and after the refactoring.</p><p>Even though refactoring is a well-established practice in traditional software development, it is not as well understood in LESS. Existing refactoring attempts in LESS are implicitly performed via controlling randomness <ref type="bibr" target="#b10">[11]</ref>, decomposing trained models <ref type="bibr" target="#b11">[12]</ref>, or defining new requirements <ref type="bibr" target="#b12">[13]</ref>. The lack of refactoring tools, techniques, and an evaluation framework for LESS is a significant challenge for developers and researchers <ref type="bibr" target="#b1">[2]</ref>. Our research aims to develop a multi-objective evaluation framework for LESS. 
We study it in the context of a specific class of supervised learning problems, namely image classification tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Image Classification Problems and Evaluation</head><p>While the continuous evaluation of ML models <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b35">36]</ref> has highlighted modularity, reliability, robustness, and interpretability, these assessments, done independently, fall short of ensuring the safety of AI systems as a whole. Consider, for instance, the role of accuracy, a widely accepted metric <ref type="bibr" target="#b36">[37]</ref> for gauging the success of models in the image classification task. Benchmark models for this problem class, originating from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), have continually improved on accuracy. To date, the record for the highest accuracy on the ImageNet benchmark is an impressive 92.4%, set by OmniVec(ViT).<ref type="foot" target="#foot_0">1</ref> However, while high accuracy is indispensable, it is not the sole criterion for the adequacy of a model, especially in safety-critical applications. In such applications, other non-functional metrics demand equal consideration to ensure the comprehensive robustness and reliability of LESS.</p><p>Ensuring AI systems maintain safety and fairness <ref type="bibr" target="#b37">[38]</ref> across various conditions and inputs is crucial for applications like autonomous driving and medical diagnosis <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b38">39]</ref>. Reliability is equally important, as dependable systems yield consistent results, fostering trust among stakeholders and accountability among developers. While state-of-the-art models often match or exceed human performance in image classification tasks, understanding errors and their solutions remains challenging <ref type="bibr" target="#b39">[40]</ref>. 
Evaluating model performance is vital, especially in safety-critical scenarios, yet the opaque nature of the learning component hinders transparency and interpretability.</p><p>Our proposed framework ReLESS combines accuracy, runtime performance, robustness, and interpretability, using a multi-objective optimization function. By experimenting with metrics drawn from existing literature and through preliminary evaluations of them, we validate target systems' performance both before and after refactoring. Our findings illuminate the trade-offs researchers make between accuracy and other performance metrics. Importantly, our evaluation process considers not just a single metric versus accuracy but integrates multiple metrics to understand various system maintenance challenges. This approach helps mitigate the "black-box" nature of AI learning components, providing clearer insights into system behavior and performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Baselines for Comparison</head><p>Chen et al. <ref type="bibr" target="#b10">[11]</ref> analyzed refactoring for image classification tasks at the algorithmic level with various models using dynamic analysis, record-and-replay, and profile-and-patch. The focus of their approach is to control randomness and hardware non-determinism to guarantee that the 𝑂𝑢𝑡𝑝𝑢𝑡 and performance metrics are the same as in the original system across seven models (Lenet1/4/5, ResNet-38/56, WRN-28-10, and ModelX). Models are then reproduced efficiently and accurately across different hardware.</p><p>Pan and Rajan <ref type="bibr" target="#b11">[12]</ref> hypothesized that decomposing learning models into reusable components can affect refactoring outputs and statistical performance on the MNIST <ref type="bibr" target="#b40">[41]</ref> dataset. They ran four DNN models across sixteen experiments with varying hidden layers and datasets, demonstrating that removing irrelevant edges in the network can lead to similar accuracy and preserve most of the semantics. They found that 9 out of 16 cases were functionally equivalent to the original models, based on the Jaccard Index, with intra-dataset performance from decomposed models slightly outperforming models built from scratch (e.g., MNIST (+0.30%)).</p><p>Adopting the methodology from Hu et al. <ref type="bibr" target="#b12">[13]</ref>, we succeeded in obtaining the original and filtered images from ImageNet <ref type="bibr" target="#b41">[42]</ref>. Image filters such as brightness, contrast, defocus/blur, frost, Gaussian noise, and JPEG compression are crucial for testing the robustness of the refactored systems <ref type="bibr" target="#b42">[43]</ref> because they involve pictures that humans can recognize correctly and easily before and after filtering, thus setting a baseline for model performance in similar conditions.</p></div>
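Pan and Rajan's Jaccard-Index comparison can be illustrated with a minimal sketch. The function below is a standard Jaccard Index; the "correctly classified sample" index lists are toy data for illustration, not results from their experiments.

```python
# Hedged sketch: comparing an original and a decomposed model via the
# Jaccard Index over the sets of test samples each classifies correctly.
# The sample indices below are illustrative toy data.

def jaccard_index(a, b):
    """Jaccard Index of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are conventionally fully similar
    return len(a & b) / len(a | b)

# Indices of test samples each model classified correctly (toy data).
correct_original = [0, 1, 2, 4, 5, 7, 8, 9]
correct_decomposed = [0, 1, 2, 4, 5, 7, 8]

similarity = jaccard_index(correct_original, correct_decomposed)
print(f"Jaccard Index: {similarity:.3f}")  # 7 shared / 8 total = 0.875
```

A similarity of 1.0 would indicate functional equivalence under this measure; what threshold below 1.0 still counts as "equivalent" is exactly the kind of application-dependent question ReLESS raises.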
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>Given the context of refactoring in ML systems as discussed in the previous section, we present ReLESS, a conceptual framework created especially to tackle two important research goals. First, to investigate the definition and operation of semantic preservation during system transformation procedures within LESS. Second, to explore approaches for evaluating the safety of LESS during the system's transition, building on our formalization of semantic preservation from the previous goal. Consider Fig. <ref type="figure" target="#fig_0">1</ref>, which depicts the refactoring of traditional software systems. Here, 𝑃𝑟𝑜𝑔 1 represents an (automated) refactoring that takes as input a program 𝑃𝑟𝑜𝑔 2 to be refactored and produces a refactored program 𝑃𝑟𝑜𝑔 ′ 2 . Note that 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 are source code, i.e., textual descriptions. We assume that 𝑃𝑟𝑜𝑔 1 is a nontrivial refactoring, i.e., that 𝑃𝑟𝑜𝑔 2 ≠ 𝑃𝑟𝑜𝑔 ′ 2 . As refactoring typically deals with real-world languages with non-trivial semantics, the semantic equivalence of 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 is normally assessed empirically by executing 𝑃𝑟𝑜𝑔 2 's test suites and comparing the results. Thus, to evaluate the refactoring 𝑃𝑟𝑜𝑔 1 , 𝐼𝑛𝑝𝑢𝑡 is fed to both 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 for all test suite inputs. The 𝑂𝑢𝑡𝑝𝑢𝑡 is then compared-ideally, all tests have the same results before and after the refactoring. If so, then 𝑂𝑢𝑡𝑝𝑢𝑡 = 𝑂𝑢𝑡𝑝𝑢𝑡 ′ , and 𝑃𝑟𝑜𝑔 1 is considered validated. Otherwise, 𝑂𝑢𝑡𝑝𝑢𝑡 ≠ 𝑂𝑢𝑡𝑝𝑢𝑡 ′ , meaning there is a bug <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref> in the system. Since traditional software is typically deterministic and its logic is not driven by dynamic data models, the process works in a relatively straightforward fashion. In fact, the larger the test suite, the greater the confidence that the refactoring works. 
<ref type="foot" target="#foot_1">2</ref> On the other hand, given the non-deterministic intricacies inherent in LESS, the traditional refactoring process as described in Fig. <ref type="figure" target="#fig_0">1</ref> is insufficient. Consequently, we construct an auxiliary diagram, Fig. <ref type="figure" target="#fig_1">2</ref>, that facilitates a more direct and nuanced evaluation of the transformations. Now consider Fig. <ref type="figure" target="#fig_1">2</ref>, representing ReLESS with citations of related work in the supervised learning context. <ref type="foot" target="#foot_2">3</ref> Here, 𝑃𝑟𝑜𝑔 1 represents an (automated) refactoring that takes as input an ML algorithm 𝑃𝑟𝑜𝑔 2 to be refactored and produces a refactored ML algorithm 𝑃𝑟𝑜𝑔 ′ 2 . Note that 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 are now ML algorithms, i.e., source code. We again assume that 𝑃𝑟𝑜𝑔 1 is a non-trivial refactoring, i.e., that 𝑃𝑟𝑜𝑔 2 ≠ 𝑃𝑟𝑜𝑔 ′ 2 . To evaluate the refactoring 𝑃𝑟𝑜𝑔 1 , two steps are taken: (a) a training dataset is fed to both 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 , which then produce the compiled, aka trained, ML models 𝑃𝑟𝑜𝑔 3 and 𝑃𝑟𝑜𝑔 ′ 3 , respectively; (b) an evaluation (testing) dataset is fed to both 𝑃𝑟𝑜𝑔 3 and 𝑃𝑟𝑜𝑔 ′ 3 . One or more such datasets (both training and evaluation) may be used. The 𝑂𝑢𝑡𝑝𝑢𝑡-in this case, predictions or classifications-is then compared. If 𝑃𝑟𝑜𝑔 1 results in no accuracy loss, then 𝑂𝑢𝑡𝑝𝑢𝑡 = 𝑂𝑢𝑡𝑝𝑢𝑡 ′ . Otherwise, 𝑂𝑢𝑡𝑝𝑢𝑡 ≠ 𝑂𝑢𝑡𝑝𝑢𝑡 ′ , meaning 𝑃𝑟𝑜𝑔 1 causes some accuracy loss when refactoring 𝑃𝑟𝑜𝑔 2 . Note that, unlike in the traditional refactoring evaluation case, whether there is a bug in 𝑃𝑟𝑜𝑔 1 in this situation is not straightforward to determine and is not a topic of focus in this paper. Because LESS can be non-deterministic and has logic that is driven by dynamic data models, whether 𝑃𝑟𝑜𝑔 1 is considered valid may depend on multiple factors. For instance, if the accuracy loss is within a certain threshold, then 𝑃𝑟𝑜𝑔 1 may be considered valid. 
If the accuracy loss is above the threshold, then 𝑃𝑟𝑜𝑔 1 may be considered invalid.</p><p>A supplementary contribution of our proposed framework is that it has an additional layer where both the transformation and the output comparison can occur. For instance, there is a dashed line in Fig. <ref type="figure" target="#fig_1">2</ref> from 𝑃𝑟𝑜𝑔 3 to 𝑃𝑟𝑜𝑔 1 and 𝑃𝑟𝑜𝑔 1 to 𝑃𝑟𝑜𝑔 ′ 3 , indicating that the program transformation can also take place on the trained ML models. In the traditional setting (Fig. <ref type="figure" target="#fig_0">1</ref>), because the transformation is not source-to-source, it would not be considered a refactoring in the traditional sense but instead viewed as a compiler optimization. However, in the LESS context, ML algorithms are typically written in interpreted languages (e.g., Python), where a compiler is not involved. Because the model training (compilation) process can be lengthy (days or even weeks) depending on the dataset size, transforming the ML algorithm to produce a new ML model as part of the refactoring process can be time-consuming <ref type="bibr" target="#b43">[44]</ref>. Instead, it may be advantageous in this context to perform the refactoring at the testing level to avoid retraining. Such a "refactoring" is done on LESS by Pan and Rajan <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b44">45]</ref>. Although the transformation is on the trained ML model, their goal of enhanced modularity is a classical refactoring outcome.</p></div>
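The traditional validation loop of Fig. 1 can be sketched in a few lines. Here `prog2`, `prog2_prime`, and the test suite are hypothetical stand-ins for the programs and inputs in the figure, not code from any cited system.

```python
# Minimal sketch of the traditional refactoring check from Fig. 1:
# feed every test-suite Input to the original and refactored program
# and require identical Outputs. prog2 / prog2_prime are illustrative.

def prog2(xs):
    """Original program: sum of squares, written with an explicit loop."""
    total = 0
    for v in xs:
        total += v * v
    return total

def prog2_prime(xs):
    """Refactored program: same semantics, expressed declaratively."""
    return sum(v * v for v in xs)

def validate_refactoring(original, refactored, test_inputs):
    """True iff Output equals Output' for every test-suite input."""
    return all(original(i) == refactored(i) for i in test_inputs)

test_suite = [[], [1], [1, 2, 3], [-4, 5, 0]]
print(validate_refactoring(prog2, prog2_prime, test_suite))  # True
```

As the paper notes, this only builds empirical confidence: a larger test suite strengthens, but never proves, semantic equivalence.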
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Determination of Semantic Equivalence</head><p>Our objective is ultimately to build a tool, where users provide original code (the old system), that determines which refactorings (new systems) would satisfy semantic equivalence. We identify different levels at which this could occur: (a) the ML algorithm level (case 1), and (b) the ML model level (case 2). We will demonstrate how existing works assess semantic equivalence from a single-lens point of view. Drawing on these observations, our approach will instead create a multi-objective evaluation (rather than the single-objective function used by the current state-of-the-art).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Case 1: Semantic Equivalence at the ML Algorithm Level</head><p>𝑃𝑟𝑜𝑔 2 = 𝑃𝑟𝑜𝑔 ′ 2 : This equivalence implies that 𝑃𝑟𝑜𝑔 3 and 𝑃𝑟𝑜𝑔 ′ 3 are also semantically equivalent, as shown in Fig. <ref type="figure" target="#fig_1">2</ref>. In this case, 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 are semantically equivalent since 𝑂𝑢𝑡𝑝𝑢𝑡 = 𝑂𝑢𝑡𝑝𝑢𝑡 ′ . However, the average training time (in hours) of the model for this refactoring in Chen et al. <ref type="bibr" target="#b10">[11]</ref> increases from 0.017 to 0.023 for Lenet1 and from 7.08 to 14.979 in the case of ModelX. Their approach also incurs higher storage overhead for 𝑃𝑟𝑜𝑔 ′ 3 (due to random seed recording). Such an approach hinders model generalization to unseen data, as the training process no longer explores various possibilities; this constrains the robustness of an ML model. Deterministic methods are also more susceptible to overfitting, as models can memorize the training data too closely, limiting their performance on new data. Lastly, ensuring complete determinism can be computationally expensive and challenging, especially in complex, multi-threaded, or distributed computing environments. This work highlights the tension between semantic preservation and model optimization.</p></div>
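The seed-recording idea behind this determinism control can be sketched with Python's standard `random` module. The `init_weights` function and the recorded seed value are illustrative stand-ins, not the cited authors' implementation, which also handles hardware non-determinism.

```python
import random

# Illustrative sketch of seed recording for deterministic replay: fix
# and store the seed so a re-run reproduces the exact same stochastic
# choices (here, a toy weight initialization stands in for training).

def init_weights(n, seed):
    """Toy stochastic step: n uniform weights drawn from a seeded RNG."""
    rng = random.Random(seed)  # the seed is recorded alongside the model
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

recorded_seed = 42  # stored so the refactored system can replay the original
w_original = init_weights(4, recorded_seed)
w_replayed = init_weights(4, recorded_seed)

# Determinism: identical seeds yield identical "training" randomness --
# at the cost of the exploration that, as argued above, aids robustness.
print(w_original == w_replayed)  # True
```

The recorded seed is precisely the storage overhead noted above: every source of randomness must be captured for the replay to be exact.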
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Case 2: Semantic Equivalence at the ML Model Level</head><p>𝑃𝑟𝑜𝑔 3 = 𝑃𝑟𝑜𝑔 ′ 3 : This means that the trained ML models are the same. It follows again that 𝑂𝑢𝑡𝑝𝑢𝑡 = 𝑂𝑢𝑡𝑝𝑢𝑡 ′ , and 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 are semantically equivalent in the traditional sense. As we are considering non-trivial refactorings as discussed earlier, we assume 𝑃𝑟𝑜𝑔 2 ≠ 𝑃𝑟𝑜𝑔 ′ 2 , meaning that the refactoring 𝑃𝑟𝑜𝑔 1 has made some non-trivial transformation. An example of such a transformation would be to enhance the run-time performance of the training; the trained model would be the same but the training process would be faster. For instance, Castro Vélez et al. <ref type="bibr" target="#b45">[46]</ref> show that the 𝑃𝑟𝑜𝑔 ′ 3 run-time is ∼9.22 seconds faster than 𝑃𝑟𝑜𝑔 3 by applying a hybrid training technique in imperative Deep Learning (DL) programs. In TensorFlow 2 <ref type="bibr" target="#b46">[47]</ref>, for example, the tf.function decorator can be applied to certain (model) Python functions found in imperative code to speed up the training process. Developers and scientists, then, can write natural, debuggable DL code in an imperative style while retaining the run-time performance typically found in legacy DL frameworks that support deferred-execution style programming models. Applying tf.function to (otherwise eagerly-executed) imperative DL code can be-if done correctly-a semantics-preserving refactoring <ref type="bibr" target="#b45">[46]</ref>.</p><p>𝑃𝑟𝑜𝑔 3 ≠ 𝑃𝑟𝑜𝑔 ′ 3 : This means that the trained models are not the same. It follows that it is possible that 𝑂𝑢𝑡𝑝𝑢𝑡 ≠ 𝑂𝑢𝑡𝑝𝑢𝑡 ′ , meaning that 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 may not be semantically equivalent in the traditional sense. There are several situations that may occur here, e.g., (i) different hyperparameters are used. 
(ii) hybridization is misused, resulting in semantically inequivalent code <ref type="bibr" target="#b45">[46]</ref>, (iii) 𝑃𝑟𝑜𝑔 ′ 3 may be an optimized DL model, e.g., having fewer edges, being more modular, and avoiding over-fitting. In Fig. <ref type="figure" target="#fig_1">2</ref>, 𝑃𝑟𝑜𝑔 ′ 3 represents a modular and refactored system derived from 𝑃𝑟𝑜𝑔 3 via 𝑃𝑟𝑜𝑔 1 , where semantics is preserved through separation of concerns, such as using supervised classification labels for maintenance and reduced model training time <ref type="bibr" target="#b11">[12]</ref>. This indicates that 𝑃𝑟𝑜𝑔 ′ 3 does better than 𝑃𝑟𝑜𝑔 3 with respect to ReLESS optimization while preserving the potential to explore generalizability and scalability.</p><p>Our analysis not only sheds light on the current state-of-the-art but also establishes a linkage between program transformation techniques and their operational viability in scenarios where safety is of paramount concern. We then use these observations to formally define semantic preservation in LESS using a multi-objective optimization function rather than a single-objective one.</p></div>
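One hedged way to operationalize the model-level check 𝑃𝑟𝑜𝑔 3 = 𝑃𝑟𝑜𝑔 ′ 3 is to compare trained parameters within a numerical tolerance. The function and toy weight vectors below are illustrative assumptions, not part of the cited works, which compare models at the output level as well.

```python
# Hedged sketch: deciding model-level equivalence (Prog3 vs. Prog3') by
# comparing trained parameters within a tolerance. Weight vectors are toy
# data; real checks would also compare architectures and outputs.

def models_equivalent(weights_a, weights_b, tol=1e-6):
    """True iff every corresponding parameter differs by at most tol."""
    if len(weights_a) != len(weights_b):
        return False  # e.g., Prog3' has fewer edges after optimization
    return all(abs(a - b) <= tol for a, b in zip(weights_a, weights_b))

prog3 = [0.5, -1.25, 3.0]                # original trained parameters
prog3_same = [0.5, -1.25, 3.0 + 1e-9]    # same model via a faster training path
prog3_pruned = [0.5, -1.25]              # optimized model with fewer edges

print(models_equivalent(prog3, prog3_same))    # True:  Prog3 = Prog3'
print(models_equivalent(prog3, prog3_pruned))  # False: Prog3 != Prog3'
```

In the first case the refactoring improved training run-time while leaving the model unchanged; in the second, equivalence must instead be argued at the output level, as ReLESS proposes.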
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Semantic Preservation: Formal Definition and Verification Metrics</head><p>We first define the semantic preservation of LESS based on varying ranges of the output. The Venn diagram in Fig. <ref type="figure" target="#fig_2">3</ref> shows the outputs from the original code and the proposed ReLESS. The upper circle, in blue, is the output from the original code, e.g., the probability of correct labels for a classification or prediction task. The lower circle, in yellow, is the output from ReLESS. This diagram examines where the two outputs are equivalent (the overlapping area) and where they differ. Suppose 𝛿 is the acceptable range of overlap, i.e., how much developers/engineers/scientists are willing to trade accuracy for other factors, viz. robustness, run-time performance, and interpretability. Ideally, the overlapping area should be as large as possible, but this is not always the case and is application-dependent. For instance, if the system is time-critical, then response time is emphasized in the optimization even though there are marginal accuracy losses.</p><p>If the system is safety-critical, then the accuracy should be preserved as much as possible. That said, we posit that to achieve semantic preservation in ReLESS, it is inadequate to consider accuracy as the sole optimization metric.</p><p>Prior works <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b37">38]</ref> have formalized the balance between accuracy and reliability/robustness and fairness. OBrien et al. <ref type="bibr" target="#b47">[48]</ref> define non-functional LESS metrics as run-time performance (speed), security, privacy, and memory (storage). Building upon these foundational studies, we extend the evaluation framework for semantic preservation to explicitly encompass safety as an overarching theme. Run-time performance, as highlighted by OBrien et al. 
<ref type="bibr" target="#b47">[48]</ref>, serves not only as a measure of efficiency but also influences system safety by ensuring timely responses in critical scenarios. Robustness, as documented by Hu et al. <ref type="bibr" target="#b12">[13]</ref>, is directly linked to safety, reflecting the system's capacity to withstand errors and adversities. Finally, interpretability, introduced by <ref type="bibr" target="#b35">[36]</ref>, enhances safety by providing clarity on decision-making processes, thereby allowing for greater accountability and easier identification of potential safety breaches. These three metrics collectively forge a more resilient and safety-conscious framework for assessing semantic preservation in LESS.</p><p>This tailored approach allows for a more integrated and holistic assessment of LESS, aligning closely with contemporary LESS development and deployment needs. None of the three transformations changes LESS's external behavior (semantics) <ref type="bibr" target="#b48">[49]</ref>. The formal notation combines accuracy with these non-functional metrics through customized importance factors, letting the engineers, scientists, and researchers who work on a model specify which properties it should emphasize. We propose a multi-objective optimization function, akin to the approach of Nguyen et al. <ref type="bibr" target="#b37">[38]</ref>, to determine the difference (loss function) between a LESS and its corresponding ReLESS. We argue that if the loss is below a certain threshold with constraints (as discussed in Fig. <ref type="figure" target="#fig_2">3</ref>), then semantic preservation is maintained.</p><p>As one of the state-of-the-art formal methods, optimization via loss functions is central to the training of ML/DL models <ref type="bibr" target="#b37">[38]</ref>. It is recognized for its adaptability to a wide range of applications. 
Because different trade-offs arise when refactoring ML/DL systems, we construct a multi-objective optimization function. Moreover, optimization standardizes each metric term that must be balanced against accuracy in the loss function, making the whole system understandable to the target audience. This optimization applies to models ranging from classical ML (random forest, gradient boosting) to DNNs, with supporting libraries, e.g., auto-sklearn <ref type="bibr" target="#b49">[50]</ref> and AutoKeras <ref type="bibr" target="#b50">[51]</ref>.</p></div>
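To make the δ-tolerance criterion of Fig. 3 concrete, a minimal sketch follows. The function and parameter names (`semantics_preserved`, `delta`) are illustrative assumptions of ours, not the actual ReLESS implementation; it treats semantic preservation as the disagreement rate between the original and refactored models staying within the tolerance δ:

```python
def semantics_preserved(output_old, output_new, delta):
    """Fig. 3 criterion (sketch): the disagreement rate between the
    original LESS outputs and the refactored ReLESS outputs must not
    exceed the application-dependent tolerance delta."""
    assert len(output_old) == len(output_new)
    disagreements = sum(1 for a, b in zip(output_old, output_new) if a != b)
    return disagreements / len(output_old) <= delta

# Identical label sequences trivially preserve semantics.
print(semantics_preserved([1, 0, 1, 1], [1, 0, 1, 1], delta=0.0))  # True
# One disagreement in four instances exceeds a 10% tolerance.
print(semantics_preserved([1, 0, 1, 1], [1, 0, 0, 1], delta=0.1))  # False
```

For a safety-critical system, δ would be set near zero, while a time-critical system might accept a larger δ in exchange for faster responses.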
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Accuracy, Run-time Performance, Robustness, and Interpretability</head><p>To comprehensively evaluate the performance of ReLESS, we consider accuracy alongside three key loss-based metrics: run-time performance, robustness, and interpretability. a. ACCuracy (ACC) is the fraction of correct outputs over the total number of instances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>𝐴𝐶𝐶 = (𝑇𝑃 + 𝑇𝑁) / (𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁)</head><p>where, in Eq. (3), 𝒟 is the input dataset, 𝑥 is the training data, and 𝑦 is the set of corresponding labels for a supervised learning task, such as image classification. Similar to RTPI, ROBI captures the observed difference in the loss function between the old and new models. Our definition of robustness is based on <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b51">52]</ref>, where a refactored system's robustness is verified by its loss function after adversarial training; for classical ML, robustness after refactoring can also be verified by the loss function after data augmentation, feature engineering, and ensemble learning.</p><p>Molnar <ref type="bibr" target="#b52">[53]</ref> defines interpretability in machine learning using interpretable models and a simplified loss function. The loss function serves as a quantitative measure to compare the interpretability of different models while maintaining accuracy. This approach blends a conceptual understanding of model behavior (through interpretable models) with a practical, measurable way (using the loss function) to assess and compare the clarity and comprehensibility of different models. We compute this metric using this definition of interpretability on a subset of 𝒟. We can then compare the difference between the new and old models and their corresponding interpretability scores <ref type="bibr" target="#b35">[36]</ref>. The implication is that a lower simplified loss on the explainable refactored system correlates with higher interpretability, which is plausible, but the exact method of determining explainability is essential here.</p><p>To sum up, we define a multi-objective optimization loss function that facilitates balancing the importance of various objectives depending on the application domain. 
Each metric is formulated as a ratio or a normalized value, which is typical in performance evaluation to provide a standardized measure of improvement or degradation. In the equation below, each metric term is the loss measurement calculated from Eqs. ( <ref type="formula">2</ref>) to (4) for RTPI, ROBI, and INTI, respectively, and accuracy is ACC; 𝑓𝜃 is the model with its parameters 𝜃, and 𝒟 is the dataset. Each term's weight coefficient 𝜔 𝑖 is assumed to be user-defined and indicates the importance of each of the metrics during model evaluation.</p><formula xml:id="formula_2">𝑚𝑖𝑛 𝕃(𝑓𝜃, 𝒟) = 𝜔 1 × ACC + 𝜔 2 × RTPI + 𝜔 3 × ROBI + 𝜔 4 × INTI (5)</formula><p>where 𝜔 1 , 𝜔 2 , 𝜔 3 , and 𝜔 4 are weights that reflect the importance of each term in the loss function.</p><p>The multi-objective optimization function in this formalism enables the determination of whether a ReLESS is a semantically preserving transformation of its LESS. Moreover, when fusing these measurements, it is essential to include the measure of accuracy because, regardless of the importance of speed of operation, robustness, and interpretability, producing correct outputs is the cornerstone of model evaluation. In other words, accuracy is always a first-class objective. Only by considering the critical role of accuracy can we ensure that a model is trustworthy <ref type="bibr" target="#b53">[54]</ref>.</p><p>Expanding on the conceptual structure presented in the preceding section, we describe a preliminary experimental configuration intended to closely assess the accuracy, run-time performance, robustness, and interpretability of LESS.</p></div>
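The metric definitions in Eqs. (1)-(4) and their weighted combination in Eq. (5) can be sketched as follows. All function names and the sample measurements are ours, for illustration only, and the weights ω_i are user-supplied as described above:

```python
def acc(tp, tn, fp, fn):
    # Eq. (1): fraction of correctly classified instances.
    return (tp + tn) / (tp + tn + fp + fn)

def rtpi(rtp_old, rtp_new):
    # Eq. (2): relative run-time improvement of the refactored code.
    return (rtp_old - rtp_new) / rtp_old

def robi(losses_old, losses_new):
    # Eq. (3): average over instances of 1 minus the relative loss change.
    n = len(losses_old)
    return sum(1 - (ln - lo) / lo for lo, ln in zip(losses_old, losses_new)) / n

def inti(losses_explainable):
    # Eq. (4): mean loss of the explainable model on a data subset.
    return sum(losses_explainable) / len(losses_explainable)

def multi_objective_loss(metrics, weights):
    # Eq. (5): weighted sum of the ACC, RTPI, ROBI, and INTI terms.
    return sum(w * m for w, m in zip(weights, metrics))

# Example with made-up measurements and equal user-defined weights.
terms = [acc(90, 85, 10, 15),          # 0.875
         rtpi(10.0, 5.0),              # 0.5: the new code is twice as fast
         robi([1.0, 2.0], [0.5, 1.0]), # per-instance adversarial losses
         inti([0.2, 0.4])]
print(multi_objective_loss(terms, [0.25, 0.25, 0.25, 0.25]))
```

In practice, the weights would be chosen per application, e.g., a larger ω for RTPI in a time-critical system, before comparing the combined value against the acceptance threshold.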
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head><p>This section describes the experimental setup employed for a preliminary evaluation of the proposed ReLESS framework for a simple case study. We describe the datasets used for experiments, followed by an explanation of the experimental design and the metrics adopted to assess the efficacy of the refactorings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Datasets and Models</head><p>As indicated in Section 2, we study ReLESS in the context of two image classification datasets: ImageNet and MNIST. The ImageNet dataset <ref type="bibr" target="#b41">[42]</ref>, comprising 1.2 million images across 1000 categories, is used to assess reliability and robustness, with a specific subset of 50,000 images filtered from <ref type="bibr" target="#b42">[43]</ref>. The MNIST dataset <ref type="bibr" target="#b40">[41]</ref>, containing 60,000 training images and 10,000 test images of handwritten digits, serves as the basis for initial evaluations. These datasets enable preliminary assessments of the refactorings' effectiveness before proceeding to more complex scenarios. Our experimental models include fully connected neural networks with 1 to 4 layers for the MNIST dataset, and pre-trained complex architectures such as AlexNet <ref type="bibr" target="#b54">[55]</ref>, ResNet50 <ref type="bibr" target="#b55">[56]</ref>, VGG16 <ref type="bibr" target="#b56">[57]</ref>, and GoogleNet <ref type="bibr" target="#b57">[58]</ref> for the ImageNet dataset.</p></div>
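As a rough sketch of the simplest models in this setup, a fully connected MNIST network can be expressed as a stack of weight matrices. The helper names, the hidden width of 128, and the reading of "1 to 4 layers" as the number of hidden layers are our illustrative assumptions, not the code used in the experiments:

```python
import numpy as np

def build_fcn(depth, in_dim=784, hidden=128, out_dim=10, seed=0):
    """Build a fully connected network with `depth` hidden layers
    (1 to 4 in the MNIST experiments) as random weight matrices."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * depth + [out_dim]
    return [rng.standard_normal((a, b)) * 0.01 for a, b in zip(dims, dims[1:])]

def forward(layers, x):
    # Plain forward pass with ReLU activations between hidden layers.
    for w in layers[:-1]:
        x = np.maximum(x @ w, 0.0)
    return x @ layers[-1]

# A batch of 32 flattened 28x28 images maps to 10 class scores.
batch = np.zeros((32, 28 * 28))
print(forward(build_fcn(depth=3), batch).shape)  # (32, 10)
```

Varying `depth` from 1 to 4 reproduces the range of model sizes whose refactored variants are compared in Table 1.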
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Experiment and Results</head><p>In our experimental setup, we applied the methodologies outlined by Pan and Rajan <ref type="bibr" target="#b11">[12]</ref> and Hu et al. <ref type="bibr" target="#b12">[13]</ref>, along with the techniques detailed in Section 3, across both datasets to scrutinize the refactored systems with respect to accuracy, run-time performance, robustness, and interpretability. The results of the experiments are summarized in Table <ref type="table" target="#tab_0">1</ref>.</p><p>From Table <ref type="table" target="#tab_0">1</ref>, we observe that the refactored models exhibit a marginal decrease in accuracy on the MNIST dataset, with a difference of 0.0001. This decrease is attributed to the expanded modular complexity, which also results in a run-time increase of 414.7 seconds. The modularity of the refactored model is significantly higher than that of the original model, with a difference of 8. The robustness of the refactored model is also higher, with a difference of 2.0476. The interpretability of the refactored model is higher, with an accuracy difference of 0.0769. Increases in both metrics indicate that the refactored system exhibits improved robustness and interpretability after decomposition. However, although robustness has improved, the accuracy of the refactored systems on the ImageNet dataset has decreased, falling below that of a coin flip. Therefore, modularity appears not only to be harmless but also beneficial to system safety, as it maintains accuracy and improves robustness. However, for the optimization of the aforementioned complex systems, more effort is required to prevent accuracy loss, particularly in safety-critical tasks. More details can be found at https://github.com/NanJ90/ReLess-testing-tool.</p><p>To summarize, we present an initial assessment of the ReLESS evaluation framework and describe the datasets used, the experimental design, and the metrics for evaluating refactorings. 
The comparative analysis of original and refactored models reveals that different datasets and models can exhibit significant variations across performance metrics. For instance, while the performance of the ImageNet model remained relatively consistent after refactoring, the modularized MNIST model took 168 times longer than the original. This underscores the critical importance of evaluating effects across multiple datasets and models to gain comprehensive insights into performance implications w.r.t. accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future Work</head><p>Our contribution in this work includes a review of literature focused on refactoring in LESS, with particular emphasis on safety considerations. This review critically analyzes the spectrum of assessments presented across various studies, each contributing to a facet of the AI safety standard. We further explore and elucidate the interrelationships between these safety metrics and the accuracy of AI systems, highlighting the implications for model development and deployment. Our preliminary results set a potential foundation to help drive the long-term evolution and robustness of LESS, qualities traditionally enjoyed by conventional systems during development and deployment, and thereby improve the safety of LESS. The scientists and engineers who develop AI systems will be able to rely on the refactored systems and trust them to make decisions that are safe, secure, and trustworthy. Future work includes understanding how the thresholds in Fig. <ref type="figure" target="#fig_2">3</ref> will be determined for various applications and how the user can determine the weights for the various metrics. We have described an initial validation of our framework; however, further experimentation that includes more metrics, such as fairness and privacy, and extends the validation to a variety of problem domains and case studies is essential to comprehensively assess its effectiveness and generalizability. This would also enable practitioners to prioritize specific components when evaluating LESS and could even lead to design-to-criteria LESS.</p></div>
𝑂𝑢𝑡𝑝𝑢𝑡 and 𝑂𝑢𝑡𝑝𝑢𝑡 ′ are the outputs of 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 , respectively, given 𝐼 𝑛𝑝𝑢𝑡. Assume 𝑃𝑟𝑜𝑔 2 ≠ 𝑃𝑟𝑜𝑔 ′ 2 .</figDesc><graphic coords="3,87.57,610.34,180.48,103.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Refactoring LESS. 𝑃𝑟𝑜𝑔 1 is the refactoring. 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 are the ML algorithm to be refactored and the refactored ML algorithm, respectively. Assume 𝑃𝑟𝑜𝑔 2 ≠ 𝑃𝑟𝑜𝑔 ′ 2 . 𝑃𝑟𝑜𝑔 3 and 𝑃𝑟𝑜𝑔 ′ 3 are the outputs (trained models) of 𝑃𝑟𝑜𝑔 2 and 𝑃𝑟𝑜𝑔 ′ 2 , respectively, given a training set. 𝑂𝑢𝑡𝑝𝑢𝑡 and 𝑂𝑢𝑡𝑝𝑢𝑡 ′ are the outputs (predictions/classifications) of 𝑃𝑟𝑜𝑔 3 and 𝑃𝑟𝑜𝑔 ′ 3 , respectively, given a testing set.</figDesc><graphic coords="3,309.59,65.61,225.63,175.62" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: 𝑂𝑢𝑡𝑝𝑢𝑡 ′ \ 𝑂𝑢𝑡𝑝𝑢𝑡. 𝑂𝑢𝑡𝑝𝑢𝑡 and 𝑂𝑢𝑡𝑝𝑢𝑡 ′ are supervised classification tasks' labels from the original and refactored models, respectively. |𝛿| = |𝑂𝑢𝑡𝑝𝑢𝑡 ′ \ 𝑂𝑢𝑡𝑝𝑢𝑡|.</figDesc><graphic coords="5,87.57,567.21,180.48,147.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>𝑇𝑃 and 𝑇𝑁 are the numbers of positive and negative instances correctly classified, and 𝐹𝑃 and 𝐹𝑁 are the numbers of instances incorrectly classified. b. Run-Time Performance Improvement (RTPI) is determined by comparing the observed run-times of the original (old) code and the new (transformed) code. 𝑅𝑇𝑃𝐼 = (𝑅𝑇𝑃_𝑜𝑙𝑑 − 𝑅𝑇𝑃_𝑛𝑒𝑤) / 𝑅𝑇𝑃_𝑜𝑙𝑑 (2) c. ROBustness Improvement is indicated as ROBI. 𝑅𝑂𝐵𝐼 = (1/|𝒟|) Σ_{𝑥,𝑦∈𝒟} (1 − (𝐿𝑜𝑠𝑠_𝑛𝑒𝑤(𝑦, 𝑓) − 𝐿𝑜𝑠𝑠_𝑜𝑙𝑑(𝑦, 𝑓)) / 𝐿𝑜𝑠𝑠_𝑜𝑙𝑑(𝑦, 𝑓)) (3)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>d. INTerpretability Improvement is indicated as INTI. 𝐼𝑁𝑇𝐼 = (1/|𝒟_𝑠𝑢𝑏𝑠𝑒𝑡|) Σ_{𝑥,𝑦∈𝒟_𝑠𝑢𝑏𝑠𝑒𝑡} 𝐿𝑜𝑠𝑠(𝑦, 𝑓_𝑒𝑥𝑝𝑙𝑎𝑖𝑛𝑎𝑏𝑙𝑒(𝑥)) (4)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Performance comparison of the original and refactored models on the MNIST and ImageNet datasets.</figDesc><table><row><cell>Metric</cell><cell>Original</cell><cell>MNIST Refactored</cell><cell>Difference</cell><cell>Original</cell><cell>ImageNet Refactored</cell><cell>Difference</cell></row><row><cell>Accuracy</cell><cell>0.9491</cell><cell>0.9490</cell><cell>0.0001</cell><cell>0.7948</cell><cell>&lt;0.5</cell><cell>&gt;0.2948</cell></row><row><cell>Run-time (seconds)</cell><cell>4.2</cell><cell>419.3</cell><cell>414.7</cell><cell>163.7</cell><cell>174</cell><cell>10.3000</cell></row><row><cell>Robustness</cell><cell>0.1744</cell><cell>2.2220</cell><cell>2.0476</cell><cell>0.9453</cell><cell>10.0754</cell><cell>9.1301</cell></row><row><cell>Interpretability (accuracy)</cell><cell>0.7882</cell><cell>0.8651</cell><cell>0.0769</cell><cell>&lt;0.5</cell><cell>&lt;0.5</cell><cell>0</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://paperswithcode.com/paper/omnivec-learning-robustrepresentations-with</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Traditional software may be concurrent, potentially experiencing race conditions, or may rely on its (changing) environment. In such cases, "flaky" tests may arise, which would challenge refactoring validation. In this case, the test suites can be executed several times to identify stable tests.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">While our current investigation focuses on supervised learning, we plan to extend the framework to other types of learning (unsupervised, reinforcement) as part of our future work.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We thank Ayan Kohli for the initial investigation into semantic similarity and the anonymous reviewers for their helpful comments. This work is supported by the National Science Foundation (NSF) under Agreement No.CCF-2200343.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Software Engineering for AI-Based Systems: A Survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Martínez-Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bogner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Franch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Oriol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Siebert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Trendowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Vollmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wagner</surname></persName>
		</author>
		<idno type="DOI">10.1145/3487043</idno>
		<ptr target="http://arxiv.org/abs/2105.01984" />
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Software Engineering and Methodology</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="1" to="59" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The ML test score: A rubric for ML production readiness and technical debt reduction</title>
		<author>
			<persName><forename type="first">E</forename><surname>Breck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Nielsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Salib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sculley</surname></persName>
		</author>
		<idno type="DOI">10.1109/BigData.2017.8258038</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Big Data (Big Data)</title>
				<imprint>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="page" from="1123" to="1132" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<idno>ISO/IEC 14764</idno>
		<title level="m">Software Engineering - Software Life Cycle Processes - Maintenance, International Organizations for Standardization</title>
				<meeting><address><addrLine>Geneva, Switzerland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Refactoring sequential java code for concurrency via concurrent libraries</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Marrero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Ernst</surname></persName>
		</author>
		<idno type="DOI">10.1109/icse.2009.5070539</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="397" to="407" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Hidden technical debt in Machine Learning systems</title>
		<author>
			<persName><forename type="first">D</forename><surname>Sculley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Holt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Golovin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Davydov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Phillips</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ebner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-F</forename><surname>Crespo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dennison</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Information Processing Systems</title>
				<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="2503" to="2511" />
		</imprint>
	</monogr>
	<note>NIPS &apos;15</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Ariadne: Analysis for machine learning programs</title>
		<author>
			<persName><forename type="first">J</forename><surname>Dolby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shinnar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Allain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reinen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3211346.3211349</idno>
	</analytic>
	<monogr>
		<title level="m">International Workshop on Machine Learning and Programming Languages, MAPL 2018</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<ptr target="https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/" />
		<title level="m">Executive order on the safe, secure, and trustworthy development and use of artificial intelligence</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Madiega</surname></persName>
		</author>
		<ptr target="https://www.europarl.europa.eu/RegData/etudes/BRIE/2021/698792/EPRS_BRI(2021)698792_EN.pdf" />
		<title level="m">Artificial intelligence act, European Parliament: European Parliamentary Research Service</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Russell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Norvig</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-82681-9</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-82681-9" />
		<title level="m">Artificial Intelligence: A Modern Approach</title>
				<imprint>
			<publisher>Pearson</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>4 ed</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Robustness may be at odds with accuracy</title>
		<author>
			<persName><forename type="first">D</forename><surname>Tsipras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Santurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Engstrom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Turner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madry</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1805.12152</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Towards training reproducible deep learning models</title>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Rajbahadur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">M J</forename><surname>Jiang</surname></persName>
		</author>
		<idno type="DOI">10.1145/3510003.3510163</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering, ICSE &apos;22</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="2202" to="2214" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">On decomposing a deep neural network into modules</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rajan</surname></persName>
		</author>
		<idno type="DOI">10.1145/3368089.3409668</idno>
		<ptr target="https://dl.acm.org/doi/10.1145/3368089.3409668" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</title>
				<meeting>the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="889" to="900" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">If a Human Can See It, So Should Your System: Reliability Requirements for Machine Vision Components</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Marsso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Czarnecki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chechik</surname></persName>
		</author>
		<idno type="DOI">10.1145/3510003.3510109</idno>
		<idno type="arXiv">arXiv:2202.03930</idno>
		<ptr target="http://arxiv.org/abs/2202.03930" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 44th International Conference on Software Engineering</title>
				<meeting>the 44th International Conference on Software Engineering</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1145" to="1156" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Speed/accuracy trade-offs for modern convolutional object detectors</title>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Rathod</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Balan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">S</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wojna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guadarrama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">P</forename><surname>Murphy</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:206595627" />
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2016">2017. 2016</date>
			<biblScope unit="page" from="3296" to="3297" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Formal Specification for Deep Neural Networks</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Seshia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Desai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dreossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fremont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shivakumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vazquez-Chanlatte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yue</surname></persName>
		</author>
		<idno>UCB/EECS-2018-25</idno>
		<ptr target="http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-25.html" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
		<respStmt>
			<orgName>EECS Department, University of California, Berkeley</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Security versus accuracy: Trade-off data modeling to safe fault classification systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ge</surname></persName>
		</author>
		<idno type="DOI">10.1109/TNNLS.2023.3251999</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks and Learning Systems</title>
		<imprint>
			<biblScope unit="page" from="1" to="12" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">F</forename><surname>Opdyke</surname></persName>
		</author>
		<title level="m">Refactoring object-oriented frameworks</title>
				<meeting><address><addrLine>Champaign, IL, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1992">1992</date>
		</imprint>
		<respStmt>
			<orgName>University of Illinois at Urbana-Champaign</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Refactoring: Improving the Design of Existing Code</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fowler</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>Addison-Wesley</publisher>
			<pubPlace>Boston, MA, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">The birth of refactoring: A retrospective on the nature of high-impact software engineering</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">G</forename><surname>Griswold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">F</forename><surname>Opdyke</surname></persName>
		</author>
		<idno type="DOI">10.1109/MS.2015.107</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Software</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="30" to="38" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A field study of refactoring challenges and benefits</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zimmermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nagappan</surname></persName>
		</author>
		<idno type="DOI">10.1145/2393596.2393655</idno>
	</analytic>
	<monogr>
		<title level="m">Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering</title>
				<meeting><address><addrLine>Cary, North Carolina</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">On preserving the behavior in software refactoring: A systematic mapping study</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Alomar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Mkaouer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ouni</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.infsof.2021.106675</idno>
	</analytic>
	<monogr>
		<title level="j">Information and Software Technology</title>
		<imprint>
			<biblScope unit="volume">140</biblScope>
			<biblScope unit="page">106675</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Safe automated refactoring for intelligent parallelization of Java 8 streams</title>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bagherzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ahmed</surname></persName>
		</author>
		<idno type="DOI">10.1109/icse.2019.00072</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering, ICSE &apos;19, ACM/IEEE</title>
				<meeting><address><addrLine>Piscataway, NJ, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="619" to="630" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Refactoring using type constraints</title>
		<author>
			<persName><forename type="first">F</forename><surname>Tip</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Fuhrer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kieżun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Ernst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Balaban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">De</forename><surname>Sutter</surname></persName>
		</author>
		<idno type="DOI">10.1145/1961204.1961205</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Programming Languages and Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page">47</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Automated refactoring of legacy Java software to enumerated types</title>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sawin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rountev</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSM.2007.4362635</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Maintenance, ICSM &apos;07</title>
				<meeting><address><addrLine>Paris, France</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="224" to="233" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Automated refactoring of legacy Java software to default methods</title>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Masuhara</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSE.2017.16</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering, ICSE &apos;17</title>
				<meeting><address><addrLine>Piscataway, NJ, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Press</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="82" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Proactive empirical assessment of new language feature adoption via automated refactoring: The case of Java 8 default methods</title>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Masuhara</surname></persName>
		</author>
		<idno type="DOI">10.22152/programming-journal.org/2018/2/6</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on the Art, Science, and Engineering of Programming, volume 2 of Programming &apos;18</title>
				<meeting><address><addrLine>AOSA, Nice, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page">30</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">A survey of software refactoring</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tourwé</surname></persName>
		</author>
		<idno type="DOI">10.1109/TSE.2004.1265817</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Software Engineering</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="126" to="139" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">On neural network equivalence checking using SMT solvers</title>
		<author>
			<persName><forename type="first">C</forename><surname>Eleftheriadis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kekatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Katsaros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tripakis</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-15839-1_14</idno>
	</analytic>
	<monogr>
		<title level="m">Formal Modeling and Analysis of Timed Systems</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">S</forename><surname>Bogomolov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Parker</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="237" to="257" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Understanding software-2.0: A study of machine learning library usage and evolution</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dilhara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ketkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dig</surname></persName>
		</author>
		<idno type="DOI">10.1145/3453478</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Software Engineering and Methodology</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Discovering repetitive code changes in Python ML systems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dilhara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ketkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sannidhi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dig</surname></persName>
		</author>
		<idno type="DOI">10.1145/3510003.3510225</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering, ICSE &apos;22</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="736" to="748" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">A large-scale empirical study on self-admitted technical debt</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bavota</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Russo</surname></persName>
		</author>
		<idno type="DOI">10.1145/2901739.2901742</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Mining Software Repositories, MSR &apos;16</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="315" to="326" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Managing technical debt in software-reliant systems</title>
		<author>
			<persName><forename type="first">N</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kazman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kruchten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maccormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ozkaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sangwan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Seaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sullivan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zazworka</surname></persName>
		</author>
		<idno type="DOI">10.1145/1882362.1882373</idno>
	</analytic>
	<monogr>
		<title level="m">FSE/SDP Workshop on Future of Software Engineering Research, FoSER &apos;10</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="47" to="52" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Christians</surname></persName>
		</author>
		<title level="m">Self-admitted technical debt: An investigation from farm to table to refactoring</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">An exploration of technical debt</title>
		<author>
			<persName><forename type="first">E</forename><surname>Tom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aurum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vidgen</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jss.2012.12.052</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Systems and Software</title>
		<imprint>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="page" from="1498" to="1516" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">An empirical study of refactorings and technical debt in Machine Learning systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bagherzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stewart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raja</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSE43902.2021.00033</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering, ICSE &apos;21, IEEE/ACM</title>
				<meeting><address><addrLine>Madrid, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="238" to="250" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">Semantic-preserving adversarial code comprehension</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhao</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2209.05130</idno>
		<idno type="arXiv">arXiv:2209.05130</idno>
		<ptr target="http://arxiv.org/abs/2209.05130" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">A distance-based weighting framework for boosting the performance of dynamic ensemble selection</title>
		<author>
			<persName><forename type="first">Z.-L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X.-G</forename><surname>Luo</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.ipm.2019.03.009</idno>
		<ptr target="https://www.sciencedirect.com/science/article/pii/S030645731830712X" />
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="page" from="1300" to="1316" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biswas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rajan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.09297</idno>
		<title level="m">Fix fairness, don&apos;t ruin accuracy: Performance aware fairness repair using automl</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<title level="m" type="main">Towards stable and efficient training of verifiably robust neural networks</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gowal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stanforth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Boning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-J</forename><surname>Hsieh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.06316</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Omnivec: Learning robust representations with cross modal sharing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sharma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</title>
				<meeting>the IEEE/CVF Winter Conference on Applications of Computer Vision</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1236" to="1248" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>LeCun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cortes</surname></persName>
		</author>
		<ptr target="http://yann.lecun.com/exdb/mnist/" />
		<title level="m">MNIST handwritten digit database</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">ImageNet Large Scale Visual Recognition Challenge</title>
		<author>
			<persName><forename type="first">O</forename><surname>Russakovsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Krause</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Satheesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Karpathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Khosla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11263-015-0816-y</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">115</biblScope>
			<biblScope unit="page" from="211" to="252" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note>IJCV</note>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dietterich</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1903.12261</idno>
		<title level="m">Benchmarking neural network robustness to common corruptions and perturbations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2106.07139</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:235421816" />
		<title level="m">Pre-trained models: Past, present and future</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">Decomposing convolutional neural networks into reusable and replaceable modules</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rajan</surname></persName>
		</author>
		<idno type="DOI">10.1145/3510003.3510051</idno>
		<idno type="arXiv">arXiv:2110.07720</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Software Engineering</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="524" to="535" />
		</imprint>
	</monogr>
	<note>ICSE &apos;22</note>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Challenges in migrating imperative deep learning programs to graph execution: An empirical study</title>
		<author>
			<persName><forename type="first">T</forename><surname>Castro Vélez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Khatchadourian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bagherzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raja</surname></persName>
		</author>
		<idno type="DOI">10.1145/3524842.3528455</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Mining Software Repositories, MSR &apos;22</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="469" to="481" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<title level="m" type="main">Better performance with tf.function</title>
		<ptr target="https://tensorflow.org/guide/function" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
		<respStmt>
			<orgName>Google LLC</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<monogr>
		<title level="m" type="main">23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software</title>
		<author>
			<persName><forename type="first">D</forename><surname>Obrien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biswas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Imtiaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Abdalkareem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Shihab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rajan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page">13</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">Refactoring: Current research and future trends</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Demeyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Du Bois</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Stenten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Van Gorp</surname></persName>
		</author>
		<idno type="DOI">10.1016/S1571-0661(05)82624-6</idno>
		<ptr target="https://www.sciencedirect.com/science/article/pii/S1571066105826246" />
	</analytic>
	<monogr>
		<title level="j">Electronic Notes in Theoretical Computer Science</title>
		<imprint>
			<biblScope unit="volume">82</biblScope>
			<biblScope unit="page" from="483" to="499" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
	<note>LDTA&apos;2003: Language Descriptions, Tools and Applications</note>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">Efficient and robust automated machine learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Feurer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Eggensperger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Springenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="2962" to="2970" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<analytic>
		<title level="a" type="main">AutoKeras: An AutoML library for deep learning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Chollet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v24/20-1355.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="1" to="6" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<analytic>
		<title level="a" type="main">Robustness and accuracy could be reconcilable by (proper) definition</title>
		<author>
			<persName><forename type="first">T</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yan</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:247011694" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b52">
	<monogr>
		<title level="m" type="main">Interpretable machine learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Molnar</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>Lulu.com</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b53">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Ao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Siddharthan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.03179</idno>
		<title level="m">Empirical optimal risk to quantify model trustworthiness for failure detection</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b54">
	<monogr>
		<title level="m" type="main">One weird trick for parallelizing convolutional neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1404.5997</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b55">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b56">
	<monogr>
		<title level="m" type="main">Very deep convolutional networks for large-scale image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.1556</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b57">
	<analytic>
		<title level="a" type="main">Going deeper with convolutions</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sermanet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Reed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Anguelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rabinovich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
