<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Hyper-parameter Tuning for Adversarially Robust Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Pedro</forename><surname>Mendes</surname></persName>
							<email>pgmendes@andrew.cmu.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Software and Societal Systems Department</orgName>
								<orgName type="institution">Carnegie Mellon University</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">INESC-ID</orgName>
								<orgName type="department" key="dep2">Instituto Superior Técnico</orgName>
								<orgName type="institution">Universidade de Lisboa</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Romano</surname></persName>
							<email>romano@inesc-id.pt</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">INESC-ID</orgName>
								<orgName type="department" key="dep2">Instituto Superior Técnico</orgName>
								<orgName type="institution">Universidade de Lisboa</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">David</forename><surname>Garlan</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Software and Societal Systems Department</orgName>
								<orgName type="institution">Carnegie Mellon University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Hyper-parameter Tuning for Adversarially Robust Models</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">06123E042E221A72CD5BC4FB12C8A135</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This work focuses on the problem of hyper-parameter tuning (HPT) for robust (i.e., adversarially trained) models, shedding light on the new challenges and opportunities arising during the HPT process for robust models. To this end, we conduct an extensive experimental study based on three popular deep models and explore exhaustively nine (discretized) hyper-parameters (HPs), two fidelity dimensions, and two attack bounds, for a total of 19208 configurations (corresponding to 50 thousand GPU hours).</p><p>Through this study, we show that the complexity of the HPT problem is further exacerbated in adversarial settings due to the need to independently tune the HPs used during standard and adversarial training: succeeding in doing so (i.e., adopting different HP settings in both phases) can lead to a reduction of up to 80% and 43% of the error for clean and adversarial inputs, respectively. We also identify new opportunities to reduce the cost of HPT for robust models. Specifically, we propose to leverage cheap adversarial training methods to obtain inexpensive, yet highly correlated, estimations of the quality achievable using more robust/expensive state-of-the-art methods. We show that, by exploiting this novel idea in conjunction with a recent multi-fidelity optimizer (taKG), the efficiency of the HPT process can be enhanced by up to 2.1×.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Adversarial attacks <ref type="bibr" target="#b0">[1]</ref> aim at causing model misclassifications by introducing small perturbations in the input. White-box methods like Projected Gradient Descent (PGD) <ref type="bibr" target="#b1">[2]</ref> have been shown to be extremely effective in synthesizing perturbations that are small enough to be unnoticeable by humans, while severely hindering the model's performance. Fortunately, models can be hardened against this type of attack via a so-called "Adversarial Training" (AT) process. During AT, which typically takes place after an initial standard training (ST) phase <ref type="bibr" target="#b2">[3]</ref>, adversarial examples are synthesized and added (with their intended label) to the training set. Recently, several AT methods have been proposed <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b3">4]</ref> that explore different trade-offs between robustness and computational efficiency. Unfortunately, the most robust AT methods impose significant overhead (up to 7× in the models tested in this work) with respect to standard training.</p><p>These costs are further amplified when considering another crucial phase of model building, namely hyper-parameter tuning (HPT). In fact, HPT methods require training a model multiple times using different hyper-parameter (HP) configurations. Consequently, the overheads introduced by AT also lead to an increase in the cost of HPT. Further, AT and ST share common HPs, which raises the question of whether AT should simply employ the same HP settings used during ST, or whether a new HPT process should be executed to select different HPs for the AT phase. 
In the latter case, the dimensionality of the HP space to be optimized grows significantly, exacerbating the HPT cost.</p><p>Hence, this work focuses on the problem of HPT for adversarially trained models with the twofold goal of i) shedding light on the new challenges (i.e., additional costs) that emerge when performing HPT for robust models, and ii) proposing novel techniques to reduce these costs by exploiting opportunities that emerge in this context.</p><p>We pursue the first goal via an extensive experimental study based on three popular models/datasets widely used to evaluate AT methods (ResNet50/ImageNet, ResNet18/SVHN, and CNN/Cifar10). In this study, we discretize and exhaustively explore the HP space composed of up to nine HPs, which we evaluate considering two "fidelity dimensions" <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref> for the training process and two attack strengths. Overall, we test a total of 19208 configurations, and we make this dataset publicly accessible in the hope that it will aid the design of future HPT methods specialized for AT.</p><p>Leveraging this data, we investigate a key design choice for the HPT process of robust models, namely the decision of whether to adopt the same vs. different HP settings during AT and ST (for the HPs common to the two phases). To this end, we focus on three key HPs of deep models: learning rate, momentum, and batch size. Our empirical study shows that allowing the use of different HP settings during ST and AT can bring substantial benefits in terms of model quality, reducing the standard and adversarial error by up to 80% and 43%, respectively.</p><p>Further, our study demonstrates that, while the cost and complexity of HPT are heightened in adversarial settings, unique opportunities arise in the context of robust models that can be exploited to effectively mitigate these costs. 
Specifically, we show that it is possible to leverage cheap AT methods to obtain inexpensive, yet highly correlated, estimations of the quality achievable using more robust/expensive methods (PGD <ref type="bibr" target="#b1">[2]</ref>). Besides studying the trade-offs between cost reduction and HP quality correlation with different AT methods, we extend a recent multi-fidelity optimizer (taKG <ref type="bibr" target="#b6">[7]</ref>) to incorporate the choice of the AT method as an additional dimension to reduce the HPT cost. We evaluate the proposed method using our dataset and show that incorporating the choice of the AT method as an additional fidelity dimension in taKG leads to speed-ups of up to 2.1×, with gains that extend up to 3.7× w.r.t. popular HPT methods such as HyperBand <ref type="bibr" target="#b7">[8]</ref>. These reductions in the optimization time not only translate to significant reductions in energy consumption during training but also result in corresponding decreases in pollutant emissions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background and Related Work</head><p>In this section, we first provide background information on AT techniques (Section 2.1) and then discuss related works in the area of HPT (Section 2.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Adversarial Training</head><p>Adversarial attacks aim at introducing small perturbations to input data, often small enough to be hardly perceivable by humans, with the goal of leading the model to generate an erroneous output. These attacks reveal vulnerabilities of current model training techniques, underscoring the need for developing robust models in different domains. Thus, several works <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref> developed new techniques to mitigate these vulnerabilities and defend against adversarial attacks, tackling them in different, often orthogonal or complementary, ways, such as adversarial training <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref>, detection of adversarial attacks <ref type="bibr" target="#b11">[12]</ref>, or pre-processing techniques that filter adversarial perturbations <ref type="bibr" target="#b12">[13]</ref>. Next, we review existing AT approaches, which represent the focus of this work.</p><p>AT aims at improving the robustness of machine learning (ML) models by i) first generating adversarially perturbed inputs, and ii) feeding these adversarial examples, along with the correct corresponding label, to the model during the training phase. More formally, this process can be described as follows. 
Unlike ST, which determines the model's parameters 𝜃 by minimizing the loss between the model's prediction for the clean input 𝑓 𝜃 (𝑥) and the original class 𝑦, i.e.,</p><formula xml:id="formula_0">min 𝜃 { E 𝑥,𝑦∼𝐷 [𝐿(𝑓 𝜃 (𝑥), 𝑦)] }</formula><p>AT first computes a perturbation 𝛿, smaller than a maximum predefined bound 𝜖, which will mislead the current model, and then trains the model with that perturbed input. This approach leads to the formulation of the following optimization problem:</p><formula xml:id="formula_1">min 𝜃 { E 𝑥,𝑦∼𝐷 [ max ‖𝛿‖≤𝜖 𝐿(𝑓 𝜃 (𝑥 + 𝛿), 𝑦) ] }</formula><p>Several methods have been developed to solve this optimization problem (or variants thereof, e.g., <ref type="bibr" target="#b3">[4]</ref>), and the resulting techniques are based on different assumptions about, e.g., the availability of the model to the attacker (i.e., white-box <ref type="bibr" target="#b1">[2]</ref> or black-box <ref type="bibr" target="#b13">[14]</ref>), whether the underlying model is differentiable <ref type="bibr" target="#b0">[1]</ref> or not <ref type="bibr" target="#b13">[14]</ref>, and the existence of bounds on the attacker's capabilities <ref type="bibr" target="#b10">[11]</ref>. Among these techniques, two of the most popular are the Fast Gradient Sign Method (FGSM) <ref type="bibr" target="#b0">[1]</ref> and Projected Gradient Descent (PGD) <ref type="bibr" target="#b1">[2]</ref>. Both techniques hypothesize that attackers can inject bounded perturbations and have access to a differentiable model. FGSM, and its later variants <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b14">15]</ref>, rely on the gradient of the loss function to compute small perturbations in an efficient way. More in detail, for a given clean input 𝑥, this method adjusts the perturbation 𝛿 by the magnitude of the bound in the direction of the gradient of the loss function, i.e., 𝛿 = 𝜖 • sign(∇ 𝛿 𝐿(𝑓 𝜃 (𝑥 + 𝛿), 𝑦)). 
PGD iteratively generates adversarial examples by taking small steps in the direction of the gradient of the loss function and projecting the perturbed inputs back onto the 𝜖-ball around the original input, i.e., repeat: 𝛿 = 𝒫(𝛿 + 𝛼∇ 𝛿 𝐿(𝑓 𝜃 (𝑥 + 𝛿), 𝑦)), where 𝒫 is the projection onto the ball of radius 𝜖 and 𝛼 can be seen as analogous to the learning rate in gradient-descent-based training. Due to its iterative nature, PGD incurs a notably higher computational cost than FGSM <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b8">9]</ref>, but it is also regarded as one of the strongest methods to generate adversarial examples. In fact, prior work has shown that PGD attacks can fool robust models trained via FGSM and that PGD-based AT produces models robust to larger perturbations <ref type="bibr" target="#b1">[2]</ref> and with higher adversarial accuracy. FGSM is also known to suffer from catastrophic overfitting <ref type="bibr" target="#b15">[16]</ref>, in which the model's adversarial accuracy collapses after some training iterations. Henceforth, we will focus on FGSM and PGD, which, as mentioned, are among the most widely used and effective methods for generating adversarial examples <ref type="bibr" target="#b14">[15]</ref>. In fact, these methods have been extensively studied and compared in the literature <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b16">17]</ref> and represent a natural starting point for investigating the trade-offs related to HPT that arise in the context of AT.</p><p>Independently of the technique used to perform AT, a relevant question, first investigated by Gupta et al. <ref type="bibr" target="#b2">[3]</ref>, is whether to use an initial ST phase before performing AT, or whether to use exclusively AT. 
This study showed that using an initial ST phase normally helps to reduce the computational cost while yielding models with comparable quality. This result motivates one of the key questions that we aim at answering in this work, namely whether the ST and AT phases should share the settings for their common HPs.</p></div>
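The FGSM and PGD update rules above are compact enough to sketch directly. Below is a minimal, self-contained NumPy illustration on a toy differentiable model (a linear classifier with logistic loss); the model, loss, and step sizes are illustrative assumptions, not the paper's setup, and the PGD step uses the common ℓ∞ sign-gradient variant of the update.

```python
import numpy as np

def logistic_loss(w, x, y):
    # L(f_w(x), y) for a linear model f_w(x) = w . x and label y in {-1, +1}
    return np.log1p(np.exp(-y * np.dot(w, x)))

def loss_grad_x(w, x, y):
    # Gradient of the logistic loss w.r.t. the *input* x (not the weights)
    z = y * np.dot(w, x)
    return -y * w / (1.0 + np.exp(z))

def fgsm(w, x, y, eps):
    # FGSM: a single step of size eps in the sign of the input gradient,
    # i.e., delta = eps * sign(grad_x L)
    return x + eps * np.sign(loss_grad_x(w, x, y))

def pgd(w, x, y, eps, alpha, iters):
    # PGD (l-inf variant): repeated sign-gradient steps of size alpha,
    # each followed by the projection P back onto the eps-ball around x
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + alpha * np.sign(loss_grad_x(w, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection onto the ball
    return x_adv
```

The projection step is what bounds the attacker's power: whatever the iterates do, the final perturbation never exceeds 𝜖 per coordinate, matching the bounded-attacker assumption shared by FGSM and PGD.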
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Hyper-parameter Tuning</head><p>HPT is a critical phase to optimize the performance of ML models. As the scale and complexity of models increase, along with the number of HPs that can be tuned in modern ML methods <ref type="bibr" target="#b17">[18]</ref>, HPT is a notoriously time-consuming process, whose cost can become prohibitive due to the need to repeatedly train complex models on large datasets.</p><p>To address this issue, a large spectrum of the literature on HPT relies on Bayesian Optimization (BO) <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b22">23]</ref>. BO employs modeling techniques (e.g., Gaussian Processes) to guide the optimization process and leverages the model's knowledge and uncertainty (via a so-called acquisition function) to select which configurations to test. Although the use of BO can help to increase the convergence speed of the optimization process, the cost of testing multiple HP configurations can quickly become prohibitive, especially when considering complex models trained over large datasets.</p><p>To tackle this problem, multi-fidelity techniques <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b6">7]</ref> exploit cheap low-fidelity evaluations (e.g., training with a fraction of the available data or using a reduced number of training epochs) and extrapolate this knowledge to recommend high-fidelity configurations. 
This allows for reducing the cost of testing HP configurations, while still providing useful information to guide the search for the optimal high-fidelity configuration(s) <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b18">19]</ref>. HyperBand <ref type="bibr" target="#b7">[8]</ref> is a popular multi-fidelity and model-free approach that promotes good-quality configurations to higher budgets and discards the poor-quality ones using a simple, yet effective, successive halving approach <ref type="bibr" target="#b23">[24]</ref>. Several approaches extended HyperBand, using models to identify good configurations <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b25">26]</ref> or to reduce the number of configurations to test <ref type="bibr" target="#b26">[27]</ref>. While these works adopt a single budget type (e.g., training time or dataset size), other approaches, such as taKG <ref type="bibr" target="#b6">[7]</ref>, make joint use of multiple budget/fidelity dimensions during the optimization process, gaining additional flexibility to reduce the optimization cost. taKG selects the next configuration and the corresponding budgets via model-based predictive techniques that estimate the cost incurred and the information gained by sampling a given configuration for a given setting of the available fidelity dimensions.</p><p>In the area of HPT for robust models, the work most closely related to ours is the study by Duesterwald et al. <ref type="bibr" target="#b27">[28]</ref>. This work empirically investigated the relation between the bounds on adversarial perturbations (𝜖) and the model's accuracy/robustness. Further, it showed that the ratio of clean/adversarial examples included in a batch (during AT) can have a positive impact on the model's quality and represents, as such, a key HP. Based on this finding, we incorporate this HP among the ones tested in our study. 
In contrast to that work, we focus on i) quantifying the benefits of using different HPs during ST and AT, and ii) exploiting the correlation between cheaper AT methods (such as FGSM) and more expensive ones (such as PGD) to enhance the efficiency of multi-fidelity HPT algorithms.</p></div>
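As a concrete illustration of the successive halving scheme underlying HyperBand (Section 2.2), the sketch below evaluates all candidate configurations at a small budget, keeps the best 1/η fraction, and repeats with an η-times larger budget. This is a simplified sketch, not HyperBand itself (which additionally varies the number of initial configurations across brackets); `evaluate` is an assumed user-supplied callback returning the validation error of a configuration trained under a given budget.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    # configs:  list of candidate HP configurations
    # evaluate: callback (config, budget) -> validation error (lower is better)
    budget = min_budget
    while len(configs) > 1:
        # Rank all surviving configurations at the current budget ...
        ranked = sorted(configs, key=lambda c: evaluate(c, budget))
        # ... keep the best 1/eta fraction, and raise the budget by eta.
        configs = ranked[:max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]
```

The key trade-off is that most configurations are only ever trained at small budgets, so the total cost is dominated by the few survivors that reach the full budget.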
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">HPT for Robust Models: Challenges and Opportunities</head><p>As mentioned, this work aims at shedding light on the challenges and opportunities that arise when performing HPT for adversarially robust models. More precisely, we seek to answer the following questions: i) should the HPs of ST and AT be tuned independently? (Section 3.2); and ii) can cheap AT methods be leveraged to accelerate HPT? (Section 3.3). In order to answer these questions, we have collected (and made publicly available) a dataset, which we obtained by varying some of the most impactful HPs for three popular neural models/datasets and measuring the resulting model quality. We provide a detailed description of the dataset in Section 3.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Experimental Setup</head><p>We base our study on three widely-used models and datasets (ResNet50/ImageNet, ResNet18/SVHN, and CNN/Cifar10). All the models were trained using one worker, except for ResNet18/SVHN, for which two workers were used. We used Nvidia Tesla V100 GPUs to train the ResNet50, and Nvidia GeForce RTX 2080 GPUs for the remaining models. All models and training procedures were implemented in Python3 via the PyTorch framework.</p><p>To evaluate the models, we considered up to nine different HPs, as summarized in Table <ref type="table" target="#tab_1">1</ref>. The first three HPs in this table apply to both the ST and AT phases. 𝛼 is an HP that applies exclusively to AT, whereas the last two HPs (%RAT and %AE) regulate the balance between ST and AT (see Section 2). Specifically, %RAT defines the share of computational resources allocated to the AT phase, and %AE indicates the ratio of adversarial inputs contained in the batches during the AT phase (as suggested by Duesterwald et al. <ref type="bibr" target="#b27">[28]</ref>). We further consider several settings of the bound 𝜖 on the attacker's power (see Table <ref type="table" target="#tab_2">2</ref>). Note that the reported values of 𝜖 are normalized by 255. Finally, we also consider two fidelity dimensions, namely the number of training epochs and the number of PGD iterations (see Section 3.3). The model's quality is evaluated using the standard error (i.e., the error on clean inputs) and the adversarial error (i.e., the error on adversarially perturbed inputs).</p><p>For each model, we exhaustively explored the (discretized) space defined by the HPs, bound 𝜖, and fidelities, which yields a search space encompassing a total of 19208 configurations. Building this dataset required around fifty thousand GPU hours, and we have made it publicly accessible in the hope that it will aid the design of future HPT methods specialized for AT. 
Additional information to ensure the reproducibility of results is provided in the public repository<ref type="foot" target="#foot_0">1</ref>.</p></div>
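Exhaustively exploring a discretized HP space, as done to build the dataset above, amounts to enumerating the Cartesian product of the per-HP value lists. A minimal sketch (the grid below uses hypothetical placeholder values, not the actual grids from Tables 1-3):

```python
from itertools import product

# Hypothetical discretized grid; the actual HPs and values are in Table 1.
grid = {
    "learning_rate": [1e-3, 1e-2, 1e-1],
    "momentum":      [0.0, 0.9],
    "batch_size":    [64, 256],
    "pct_rat":       [0.3, 0.5, 0.7],   # %RAT: share of epochs given to AT
}

# One dict per configuration: every combination of the per-HP value lists.
configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
```

This toy grid yields 3 × 2 × 2 × 3 = 36 configurations; the paper's full grid, which additionally spans 𝜖 and the two fidelity dimensions, reaches 19208.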
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Should the HPs of ST and AT be tuned independently?</head><p>This section aims at answering the following question: given that the ST and AT phases share several HPs (e.g., batch size, learning rate, and momentum in the models considered in this study), how relevant is it to use different settings for these HPs in the two training phases? Note that, if we assume the existence of 𝐶 HPs in common between ST and AT, then enabling the use of different values for these HPs in each training stage causes the dimensionality of the HP space to grow from 𝐶 to 2𝐶 (not accounting for any HPs not in common) and, ultimately, leads to a significant increase in the cost/complexity of the HPT problem. Specifically, for the scenarios considered in this study, the cardinality of the HP space grows from 320 to 2560 distinct configurations. Hence, we argue that such a cost is practically justified only if it is counterbalanced by relevant gains in terms of error reduction. To answer this question, we trained the models for 16 epochs and used different settings for the common HPs of ST and AT (Table <ref type="table" target="#tab_1">1</ref>). We consider three different settings (30%, 50%, and 70%) <ref type="foot" target="#foot_1">2</ref> for the relative amount of resources (epochs) available for AT (%RAT), as well as different settings of the perturbation bound 𝜖.</p><p>We consider that the model's HPs can be optimized according to three criteria: i) clean data error (Error), ii) adversarial error (AdvError), and iii) the average of clean and adversarial error (MeanError). 
For each of these three optimization criteria, %RAT, and bound 𝜖, we report in Figure <ref type="figure">1</ref> the percentage reduction of the target optimization metric achieved by the optimal HP configuration obtained when allowing for (but not imposing) different settings of the HPs common to the ST and AT phases, with respect to the optimal HP configuration obtained when using the same settings for the common HPs in both training phases, namely: %Error Reduction = (Error same HPs − Error diff HPs ) / Error same HPs × 100</p><p>The results show that adopting different HP settings in the two phases can lead to significant error reductions for all three optimization criteria. The peak gains extend up to approx. 30% and are achieved for the case of ResNet18 with (relatively) large values of 𝜖 and when allocating a low percentage of epochs to AT (%RAT=30%). Overall, the geometric mean of the % error reduction (across all models and settings of 𝜖 and %RAT) is 9%, 5%, and 6% for the Error, AdvError, and MeanError criteria, respectively.</p><p>Next, Figure <ref type="figure">2</ref> provides a different perspective, quantifying the benefits achieved by separately optimizing the HPs of the AT phase (vs. reusing in the AT phase the settings of the HPs in common with the ST phase), assuming an initial ST phase has been executed using any of the possible HP settings. Specifically, we report the Cumulative Distribution Function (CDF) of the percentage of error reduction (for each of the three target optimization metrics), when allowing the use of the same or different common HP settings for the two phases and while varying the remaining (non-common) HPs, namely %RAT, %AE, and 𝛼. These results highlight that, by independently tuning the HPs of the two training stages, the model's quality is enhanced by up to approx. 
80%, 43%, and 56%, when minimizing the standard, adversarial, or mean error, respectively.</p><p>Overall, these results may be justified by considering that the optimization objectives and constraints of the ST and AT phases are different, hence benefiting from different HP settings. During ST, the training procedure focuses on maximizing standard accuracy, and the model's goal is to learn representations that generalize well to new data. In contrast, AT seeks to increase robustness against adversarial attacks, and the model needs to learn to correctly differentiate between clean and perturbed examples. Further, the AT phase benefits from a pre-trained model (using clean data) and, as such, this model is expected to require relatively small weight adjustments to defend against adversarial inputs. Thus, this phase is likely to benefit from more conservative settings of HPs such as the learning rate and momentum than the initial ST, whose convergence could be accelerated via the use of more aggressive settings for the same HPs. In fact, we confirmed this by analyzing the configurations that yield the 10 largest error reductions in Figure <ref type="figure">2</ref>: the better-quality models used lower learning rates and batch sizes in the AT phase.</p><p>Another factor that can justify the need for different HP settings during ST and AT is related to the observation that the bound on the admissible perturbation (𝜖) can have a deep impact on the model's performance, by exposing an inherent (and well-known <ref type="bibr" target="#b28">[29]</ref>) trade-off: as the bound increases, the model may become more robust to adversarial inputs, but at the cost of an increase in the misclassification rate of clean inputs. To achieve an optimal trade-off between robustness and accuracy, it may be necessary to adjust the tuning of the HPs used during AT as 𝜖 varies, which in turn implies that the optimal HP settings used during ST and AT can differ. 
In fact, by analyzing the results obtained on ResNet18/SVHN, for example, we see that the amplitude of the bound has an impact on the (adversarial) error reduction achievable by independently tuning the HPs of the two training phases: the 90th percentile of the percentage of clean error reduction is 50% and 65% using 𝜖=4 and 𝜖=8, respectively (see Fig. <ref type="figure">2b</ref>).</p></div>
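The error-reduction metric used throughout this section can be computed with a one-line helper. This is an illustrative sketch of the %Error Reduction formula defined earlier, not code from the paper's repository:

```python
def pct_error_reduction(error_same_hps, error_diff_hps):
    # %Error Reduction = (Error_same - Error_diff) / Error_same * 100,
    # i.e., the relative improvement from allowing different HP settings
    # in the ST and AT phases over forcing identical common-HP settings.
    return (error_same_hps - error_diff_hps) / error_same_hps * 100.0
```

For instance, a clean error dropping from 0.50 (same HPs) to 0.10 (different HPs) corresponds to an 80% reduction, the peak value reported above.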
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Can cheap AT methods be leveraged to accelerate HPT?</head><p>So far, we have shown that in adversarial settings the complexity of the HPT problem is exacerbated by the need to optimize a larger HP space. In this section, we show that, fortunately, AT also provides new opportunities to reduce the HPT cost. Specifically, we propose and evaluate a novel idea: leveraging alternative AT methods, which impose lower computational costs but provide weaker robustness guarantees, to sample HP configurations in a cheap, yet informative, way. As discussed in Section 2, PGD is an iterative method, where each iteration refines the perturbation with the objective of maximizing the loss. Hence, a straightforward way to reduce its cost (at the expense of robustness) is to reduce the number of executed iterations. We also note that the computational cost of FGSM is equivalent to that of a single PGD iteration. We build on these observations to propose incorporating the number of PGD iterations as an additional fidelity dimension in multi-fidelity HPT optimizers, such as taKG <ref type="bibr" target="#b6">[7]</ref>. We choose to test the proposed idea with taKG since this technique supports the use of an arbitrary number of fidelity dimensions (e.g., dataset size and number of epochs) and determines how to explore the multi-dimensional fidelity space via black-box modeling techniques (see Section 2.2).</p><p>In order to assess the soundness and limitations of the proposed approach, we first analyze the correlation of the standard and adversarial error between HP configurations that use PGD with 20 iterations (which we consider as the maximum-fidelity/full budget) vs. PGD with 10 and 5 iterations and FGSM (which, as already mentioned, is computationally equivalent to 1 iteration of PGD). 
In Figure <ref type="figure" target="#fig_2">3</ref>, we observe that the correlation varies for different bounds on adversarial perturbations across the considered models/datasets. We omit the correlation for ResNet50/ImageNet using 𝜖=4 since the results are very similar to those for 𝜖=2. The scatter plots clearly show the existence of a very strong correlation (above 95%) for all the considered methods for ResNet50/ImageNet and for all the considered bounds. For ResNet18/SVHN and CNN/Cifar10, the correlation of PGD with 5 and 10 iterations remains quite strong (always above 80% and typically above 90%), whereas lower correlations (as low as 53%) can be observed for FGSM, especially when considering the adversarial error and larger values of 𝜖. This is expected, as previous works <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b15">16]</ref> had indeed observed that FGSM tends to be less robust than PGD when larger 𝜖 values are used (being subject to issues such as catastrophic overfitting, which leads to a sudden drop in adversarial accuracy). Still, even for FGSM, the correlation is always above 90% with CNN/Cifar10 and is relatively high (around 70%) also with ResNet18/SVHN for the smaller considered bound (𝜖 = 4).</p><p>We also report, in Figure <ref type="figure" target="#fig_2">3c</ref>, the CDF of the training time reduction using FGSM, PGD with 5 and 10 iterations, and ST w.r.t. PGD20 for ResNet50/ImageNet. The CDFs show that the training time reductions for a given AT method vary, since the ratio of computed adversarial examples depends on the %RAT and %AE parameters. 
Overall, the maximum (median) training time reduction is approximately 83% (54%), 66% (42%), 47% (28%), and 86% (53%) for FGSM, PGD5, PGD10, and ST, respectively, compared to PGD20, which confirms that leveraging these "cheap" surrogate methods can significantly reduce the cost of testing HP configurations.</p><p>Supported by these findings, we evaluate our proposal by integrating the number of PGD iterations as an additional fidelity dimension in taKG <ref type="bibr" target="#b6">[7]</ref>. As discussed in Section 2, taKG is an HPT method that natively supports the use of multiple fidelity types, which we refer to as fidelity dimensions, e.g., number of epochs, input size, and dataset size. Based on the results of the previous section, we independently optimize the HPs of the ST and AT phases, which yields a search space composed of a total of nine HPs (Table <ref type="table" target="#tab_1">1</ref>). The following multi-fidelity solutions are compared:</p><p>• taKG (epochs, PGD iter): the proposed solution, which uses taKG as the underlying HPT method and employs as fidelities the number of epochs and the number of PGD iterations. We discretize these two dimensions (Table <ref type="table" target="#tab_3">3</ref>). • taKG (epochs): taKG using as fidelity only the number of epochs. • taKG (PGD iter): taKG using as fidelity only the number of PGD iterations. • HB (epochs): HyperBand <ref type="bibr" target="#b7">[8]</ref>, a popular (single-dimensional) multi-fidelity optimizer, which uses the number of epochs as fidelity.</p><p>We further compare against BO using EI as its acquisition function and Random Search (RS). These optimizers only perform high-fidelity evaluations. 
The evaluation of these alternative solutions is performed by exploiting the dataset already described in Section 3.1, which specifies the model quality (error and adversarial error) for all possible HP, 𝜖, and fidelity settings reported in Tables 1, 2 and 3.</p><p>We define the optimization problem as follows:</p><formula xml:id="formula_3">min 𝑥 𝜆•Error(𝑥, 𝑠 = 1)+(1−𝜆)•Adv.Error(𝑥, 𝑠 = 1) (2)</formula><p>where 𝑥 is a vector defining the HPs, 𝑠 is a vector that encodes the ratio of budget allocated to each fidelity dimension, and 𝜆 is a weight factor that we set to 0.5 to equally balance the standard and adversarial errors. For a fair comparison, when a single fidelity dimension (e.g., epochs) is used, we set the other fidelity dimension (e.g., PGD iterations) to its maximum value. We run each optimizer using 20 independent seeds. We set the bound 𝜖 to 2, 8, and 12 to optimize ResNet50/ImageNet, ResNet18/SVHN, and CNN/Cifar10, respectively. Based on Figure <ref type="figure" target="#fig_2">3</ref>, these three settings correspond to scenarios with relatively high, low, and medium correlations for the budget dimension defined by PGD iterations, respectively. Figure <ref type="figure">4</ref> reports the average and standard deviation of the optimization goal (i.e., 0.5 • Error + 0.5 • Adv.Error) as a function of the optimization time for the different HPT optimizers and different models/datasets. The results show that the proposed solution, which adopts PGD iterations as an extra fidelity along with the number of epochs (taKG (epochs &amp; PGD iter)), clearly outperforms all the alternative solutions on ResNet50/ImageNet and ResNet18/SVHN. The largest gains can be observed with ResNet18/SVHN (Fig. <ref type="figure">4b</ref>). Here, at the end of the optimization process, the proposed solution identifies a configuration of the same quality as the ones suggested by the other baselines, namely taKG (epochs), HB, and BO-EI, while achieving speed-ups of 2.1×, 3.7×, and 5.4×, respectively.
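</p><p>Equation (2) amounts to a simple scalarization of the two errors, evaluated at full fidelity (𝑠 = 1). A minimal sketch of the optimization goal follows (the function name and the example error values are illustrative):</p>

```python
def hpt_objective(error, adv_error, lam=0.5):
    """Scalarized goal from Eq. (2): lam * Error + (1 - lam) * Adv.Error,
    both measured at full fidelity (s = 1)."""
    return lam * error + (1.0 - lam) * adv_error

# Example: a configuration with 8% standard error and 34% adversarial error.
print(round(hpt_objective(0.08, 0.34), 4))  # 0.21 with the default lam = 0.5
```

<p>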
Using the same metric (time needed to recommend a configuration of the same quality as the ones suggested by the other optimizers at the end of the optimization process) with ResNet50/ImageNet, the proposed solution achieves slightly smaller, but still significant, speed-ups, namely 1.28×, 1.97×, and 2.45× w.r.t. taKG (epochs), HB, and BO-EI, respectively. Interestingly, with ResNet50/ImageNet the proposed solution also provides solid speed-ups during the first stage of the optimization. Specifically, if we analyze the first half of the optimization process (corresponding to approx. 83 hours, see Figure <ref type="figure">4a</ref>), the proposed solution identifies configurations of the same quality as taKG (epochs), HB, and BO-EI, with speed-ups of 1.7×, 2×, and 2.6×, respectively.</p><p>With CNN/Cifar10 (Figure <ref type="figure">4c</ref>), the proposed approach remains the best-performing solution, although with smaller gains when compared to taKG with epochs. Still, the proposed solution can identify configurations of the same quality as the best alternative (taKG (epochs)) while saving approx. 40% of the time (i.e., in 22 hours vs. 32 hours). We argue that the gains with CNN/Cifar10 are relatively lower than in the other scenarios considered in Figure <ref type="figure">4</ref> because the models used in those scenarios are larger and more complex and, as such, benefit more from the cost reduction opportunities provided by using a reduced number of PGD iterations.</p><p>We also observe that the exclusive use of PGD iterations as fidelity with taKG yields worse performance than using solely the number of epochs. This is not surprising, given that the number of epochs is arguably one of the most direct ways of controlling the cost of configuration sampling and is, indeed, among the most commonly adopted budgets in multi-fidelity optimizers <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b26">27]</ref>.
This result confirms that PGD iterations represent a valuable means of accelerating multi-fidelity HPT optimizers when training robust models and that this fidelity complements, but does not replace, "conventional" budget settings such as the number of epochs or the dataset size.</p></div>
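The speed-up metric used above (time needed to match the final quality of a baseline) can be computed directly from the incumbent, i.e., best-so-far, trace of each optimizer. A minimal sketch with hypothetical traces (the numbers below are illustrative, not taken from Figure 4):

```python
import numpy as np

def time_to_quality(times, goals, target):
    """First wall-clock time at which the incumbent (best-so-far) goal
    reaches the target quality; NaN if it is never reached."""
    incumbent = np.minimum.accumulate(goals)
    hits = np.flatnonzero(target >= incumbent)
    return float(times[hits[0]]) if hits.size else float("nan")

# Hypothetical incumbent traces (hours, goal value) for two optimizers.
t_ours = np.array([1.0, 5.0, 10.0, 20.0])
g_ours = np.array([0.40, 0.30, 0.21, 0.20])
t_base = np.array([1.0, 5.0, 10.0, 20.0])
g_base = np.array([0.45, 0.35, 0.28, 0.21])

target = g_base.min()  # quality of the baseline's final recommendation
speedup = time_to_quality(t_base, g_base, target) / time_to_quality(t_ours, g_ours, target)
print(f"speed-up: {speedup:.1f}x")  # baseline needs 20 h, ours 10 h: 2.0x
```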
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions and Future Work</head><p>This paper focused on the problem of HPT for robust models. By means of an extensive experimental study, we first quantified the relevance of independently tuning the HPs used during standard and adversarial training. We then proposed and evaluated a novel fidelity dimension that becomes available in the context of AT. Specifically, we have shown that cheaper AT methods can be used to obtain inexpensive estimates of the quality achievable via expensive state-of-the-art AT methods and that this information can be effectively exploited to accelerate HPT. We extended taKG, a state-of-the-art HPT method, by incorporating the number of PGD iterations as an additional fidelity dimension (along with the number of epochs) and achieved cost reductions of up to 2.1×.</p><p>It is worth noting that the idea of employing "cheap" AT methods as proxies to estimate the quality of HP configurations under more robust/expensive methods is generic, in the sense that it can be applied, at least in principle, to any multi-fidelity optimizer. As part of our future work, we plan to integrate this novel approach into a new HPT framework specifically designed to cope with adversarially robust models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>𝜃 (𝑥 + 𝛿), 𝑦)]}. The model's robustness depends on the bound 𝜖 used to produce the adversarial examples and on the strength of the method used to compute those examples.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :Figure 2 :</head><label>12</label><figDesc>Figure 1: Reduction of the mean (in black), standard (in blue), and adversarial (in red) error of the optimal configuration if the same or different hyper-parameters are used for the 2 phases of training using different scenarios and benchmarks.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Standard and adversarial error correlation between PGD20 and FGSM, PGD5, and PGD10 varying the bound 𝜖 and CDF of the training time reduction obtained using cheaper AT algorithms w.r.t PGD20 (Figure 3c).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>𝜖 12 Figure 4 :</head><label>124</label><figDesc>Figure 4: Average standard and adversarial error using different optimizers.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>1. Should the HPs that are common to the AT and ST phases be tuned independently? More in detail, we aim at quantifying to what extent the model's quality is affected if one uses the same vs different HP settings during the ST and AT phases (see Section 3.2).</figDesc><table /><note>2. Is it possible to reduce the cost of HPT by testing HP settings using cheaper (but less robust) AT methods? How correlated are the performance of alternative AT approaches and what factors (e.g., the perturbation bound or the cost of the techniques) impact such correlation? To what extent can this approach enhance the HPT's process efficiency? (see Section 3.3.)</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Hyper-parameters considered</figDesc><table><row><cell>Hyper-parameter</cell><cell>Values</cell></row><row><cell>Learning Rate (ST and AT)</cell><cell>{0.1, 0.01}</cell></row><row><cell>Batch Momentum (ST and AT)</cell><cell>{0.9, 0.99} for ResNet50 {0, 0.9} otherwise</cell></row><row><cell>Batch Size (ST and AT)</cell><cell>{256, 512} for ResNet50 {128, 256} otherwise</cell></row><row><cell>𝛼 (PGD learning rate)</cell><cell>{10 −2 , 10 −3 }</cell></row><row><cell>% resources (time or epochs) for AT (%RAT)</cell><cell>{0, 30, 50, 70, 100}</cell></row><row><cell>% adversarial examples in each batch (%AE)</cell><cell>{30, 50, 70, 100}</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Bounds 𝜖 per benchmarks</figDesc><table><row><cell cols="2">Model &amp; Benchmark Bound 𝜖</cell></row><row><cell>ResNet50/ImageNet</cell><cell>{2, 4}</cell></row><row><cell>ResNet18/SVHN</cell><cell>{4, 8}</cell></row><row><cell>CNN/Cifar10</cell><cell>{8, 12}</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Fidelities considered</figDesc><table><row><cell>Fidelities</cell><cell>Values</cell></row><row><cell cols="2">PGD iterations {1 (FGSM), 5, 10, 20}</cell></row><row><cell>Epochs</cell><cell>{1,2,4,8,16}</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/pedrogbmendes/HPT_advTrain</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">We exclude the cases %RAT={0,100} in this study to focus on scenarios that contain both the ST and AT phases.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was supported by the Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) through the Carnegie Mellon Portugal Program under grant SFRH/BD/151470/2021 via projects with reference UIDB/50021/2020 and C645008882-00000055.PRR, by the NSA grant H98230-23-C-0274, and by the Advanced Cyberinfrastructure Coordination Ecosystem: Services &amp; Support (ACCESS) program, where we used the Bridges-2 GPU and Ocean resources at the Pittsburgh Supercomputing Center through allocation CIS220073, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<title level="m">Explaining and harnessing adversarial examples</title>
				<imprint>
			<publisher>ICLR</publisher>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Towards deep learning models resistant to adversarial attacks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Madry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Makelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tsipras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vladu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>ICLR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Improving the affordability of robustness training for dnns</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dube</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Verma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Theoretically principled trade-off between robustness and accuracy</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Xing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">E</forename><surname>Ghaoui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jordan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>ICML</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Hoos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Leyton-Brown</surname></persName>
		</author>
		<title level="m">Sequential model-based optimization for general algorithm configuration</title>
				<imprint>
			<publisher>LION</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Swersky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Snoek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Adams</surname></persName>
		</author>
		<title level="m">Multi-task bayesian optimization</title>
				<imprint>
			<publisher>NeurIPS</publisher>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Practical multi-fidelity bayesian optimization for hyperparameter tuning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Toscano-Palmerin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">I</forename><surname>Frazier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>UAI</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Hyperband: A novel bandit-based approach to hyperparameter optimization</title>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jamieson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Desalvo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rostamizadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Talwalkar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Wong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rice</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Kolter</surname></persName>
		</author>
		<title level="m">Fast is better than free: Revisiting adversarial training</title>
				<imprint>
			<publisher>ICLR</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Shafahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Najibi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ghiasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Dickerson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Studer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">S</forename><surname>Davis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Goldstein</surname></persName>
		</author>
		<title level="m">Adversarial training for free!</title>
				<imprint>
			<publisher>NeurIPS</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Deepfool: A simple and accurate method to fool deep neural networks</title>
		<author>
			<persName><forename type="first">S.-M</forename><surname>Moosavi-Dezfooli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fawzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Frossard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">On detecting adversarial perturbations</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Metzen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Genewein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bischoff</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>ICLR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cisse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<title level="m">Countering adversarial images using input transformations</title>
				<imprint>
			<publisher>ICLR</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Practical black-box attacks against machine learning</title>
		<author>
			<persName><forename type="first">N</forename><surname>Papernot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mcdaniel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">B</forename><surname>Celik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Swami</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>ASIA CCS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Andriushchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Flammarion</surname></persName>
		</author>
		<title level="m">Understanding and improving fast adversarial training</title>
				<imprint>
			<publisher>NeurIPS</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Overfitting in adversarially robust deep learning</title>
		<author>
			<persName><forename type="first">L</forename><surname>Rice</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Z</forename><surname>Kolter</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>ICML</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Recent advances in adversarial training for adversarial robustness</title>
		<author>
			<persName><forename type="first">T</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Energy and policy considerations for deep learning in NLP</title>
		<author>
			<persName><forename type="first">E</forename><surname>Strubell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ganesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mccallum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACL</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Trimtuner: Efficient optimization of machine learning jobs in the cloud via sub-sampling</title>
		<author>
			<persName><forename type="first">P</forename><surname>Mendes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Casimiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Romano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Garlan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MASCOTS</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Fast bayesian optimization of machine learning hyperparameters on large datasets</title>
		<author>
			<persName><forename type="first">A</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Falkner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bartels</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AISTATS</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">The application of bayesian methods for seeking the extremum</title>
		<author>
			<persName><forename type="first">J</forename><surname>Mockus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tiesis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zilinskas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Toward Global Optimization</title>
				<imprint>
			<date type="published" when="1978">1978</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Lynceus: Cost-efficient tuning and provisioning of data analytic jobs</title>
		<author>
			<persName><forename type="first">M</forename><surname>Casimiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Didona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Romano</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDCS</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Swersky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Snoek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Adams</surname></persName>
		</author>
		<idno>ArXiv:1406.3896</idno>
		<title level="m">Freeze-thaw bayesian optimization</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Non-stochastic best arm identification and hyperparameter optimization</title>
		<author>
			<persName><forename type="first">K</forename><surname>Jamieson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Talwalkar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AISTATS</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">BOHB: Robust and efficient hyperparameter optimization at scale</title>
		<author>
			<persName><forename type="first">S</forename><surname>Falkner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICML</title>
		<imprint>
			<biblScope unit="volume">80</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">DEHB: evolutionary hyperband for scalable, robust and efficient hyperparameter optimization</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">H</forename><surname>Awad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Mallik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Hyperjump: Accelerating hyperband via risk modelling</title>
		<author>
			<persName><forename type="first">P</forename><surname>Mendes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Casimiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Romano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Garlan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>AAAI</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Duesterwald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Murthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Venkataraman</surname></persName>
		</author>
		<idno>ArXiv abs/1905.03837</idno>
		<title level="m">Exploring the hyperparameter landscape of adversarial robustness</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">Robustness may be at odds with accuracy</title>
		<author>
			<persName><forename type="first">D</forename><surname>Tsipras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Santurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Engstrom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Turner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madry</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>ICLR</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
