Hyper-parameter Tuning for Adversarially Robust Models Pedro Mendes1,2,* , Paolo Romano2 and David Garlan1 1 Software and Societal Systems Department, Carnegie Mellon University 2 INESC-ID and Instituto Superior Tecnico, Universidade de Lisboa Abstract This work focuses on the problem of hyper-parameter tuning (HPT) for robust (i.e., adversarially trained) models, shedding light on the new challenges and opportunities arising during the HPT process for robust models. To this end, we conduct an extensive experimental study based on three popular deep models and explore exhaustively nine (discretized) hyper-parameters (HPs), two fidelity dimensions, and two attack bounds, for a total of 19208 configurations (corresponding to 50 thousand GPU hours). Through this study, we show that the complexity of the HPT problem is further exacerbated in adversarial settings due to the need to independently tune the HPs used during standard and adversarial training: succeeding in doing so (i.e., adopting different HP settings in both phases) can lead to a reduction of up to 80% and 43% of the error for clean and adversarial inputs, respectively. We also identify new opportunities to reduce the cost of HPT for robust models. Specifically, we propose to leverage cheap adversarial training methods to obtain inexpensive, yet highly correlated, estimations of the quality achievable using more robust/expensive state-of-the-art methods. We show that, by exploiting this novel idea in conjunction with a recent multi-fidelity optimizer (taKG), the efficiency of the HPT process can be enhanced by up to 2.1×. 1. Introduction and CNN/Cifar10). In this study, we discretize and ex- haustively explore the HP space composed of up to nine Adversarial attacks [1] aim at causing model misclassifica- HPs, which we evaluated considering two “fidelity dimen- tions by introducing small perturbations in the input. White- sion” [5, 6] for the training process and two attack strengths. box methods like Projected Gradient Descent (PGD) [2] Overall, we test a total of 19208 configurations and we make have been shown to be extremely effective in synthesiz- this dataset publicly accessible in the hope that it will aid ing perturbations that are small enough to be noticeable the design of future HPT methods specialized for AT. by humans, while severely hindering the model’s perfor- Leveraging this data, we investigate a key design choice mance. Fortunately, models can be hardened against this for the HPT process of robust models, namely the decision of type of attack via a so-called “Adversarial Training” (AT) whether to adopt the same vs. different HP settings during process. During AT, which typically takes place after an AT and ST (for the common HPs in the 2 phases). To this initial standard training (ST) phase [3], adversarial exam- end, we focus on 3 key HPs of deep models: learning rate, ples are synthesized and added (with their intended label) momentum, and batch size. Our empirical study shows that to the training set. Recently, several AT methods have been allowing the use of different HP settings during ST and AT proposed [1, 2, 4] that explore different trade-offs between can bring substantial benefits in terms of model quality, by robustness and computational efficiency. Unfortunately, the reducing up to 80% and 43% the standard and adversarial most robust AT methods impose significant overhead (up error, respectively. to 7× in the models tested in this work) with respect to Further, our study demonstrates that while the cost and standard training. complexity of HPT are heightened in adversarial settings, These costs are further amplified when considering it also reveals that, in the context of robust models, unique another crucial phase of model building, namely hyper- opportunities can be exploited to effectively mitigate these parameter tuning (HPT). In fact, HPT methods require train- costs. Specifically, we show that it is possible to leverage ing a model multiple times using different hyper-parameter cheap AT methods to obtain inexpensive, yet highly cor- (HP) configurations. Consequently, the overheads intro- related, estimations of the quality achievable using more duced by AT lead also to an increase in the cost of HPT. robust/expensive methods (PGD [2]). Besides studying the Further, AT and ST share common HPs, which raises the trade-offs between cost reduction and HP quality correlation question of whether AT should simply employ the same HP with different AT methods, we extend a recent multi-fidelity settings used during ST, or if a new HPT process should optimizer (taKG [7]) to incorporate the choice of the AT be executed to select different HPs for the AT phase. In method as an additional dimension to reduce the HPT cost. the latter case, the dimensionality of the HP space to be We evaluate the proposed method using our dataset and optimized grows significantly, exacerbating the HPT cost. show that incorporating the choice of the AT method as Hence, this work focuses on the problem of HPT for ad- an additional fidelity dimension in taKG leads up to 2.1× versarially trained models with the twofold goal of i) shed- speed-ups, with gains that extend up to 3.7× w.r.t. popular ding light on the new challenges (i.e., additional costs) that HPT methods, as HyperBand [8]. These reductions in the emerge when performing HPT for robust models, and ii) optimization time not only translate to significant reduc- proposing novel techniques to reduce these costs, by ex- tions in energy consumption during training but also result ploiting opportunities that emerge in this context. in corresponding decreases in pollutant emissions. We pursue the first goal via an extensive experimental study based on 3 popular models/datasets widely used to evaluate AT methods (ResNet50/ImageNet, ResNet18/SVHN, 2. Background and Related Work In this section, we first provide background information on The IJCAI-2024 AISafety Workshop AT techniques (Section 2.1) and then discuss related works * Corresponding author. $ pgmendes@andrew.cmu.edu (P. Mendes); romano@inesc-id.pt in the area of HPT (Section 2.2). (P. Romano); dg4d@andrew.cmu.edu (D. Garlan) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribu- tion 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 2.1. Adversarial Training the model’s adversarial accuracy collapses after some train- ing iterations. Henceforth, we will focus on FGSM and PGD, Adversarial attacks aim at introducing small perturbations which, as mentioned, are among the most widely used and to input data, often small enough to be hardly perceivable effective methods for generating adversarial examples [15]. by humans, with the goal to lead the model to generate an In fact, these methods have been extensively studied and erroneous output. These attacks reveal vulnerabilities of compared in the literature [15, 16, 10, 9, 17] and represent a current model training techniques, underscoring the need natural starting point for investigating the trade-offs related for developing robust models in different domains. Thus, to HPT that arise in the context of AT. several works [1, 2, 9, 10, 11] developed new techniques to Independently of the technique used to perform AT, a mitigate these vulnerabilities and defend against adversarial relevant finding, first investigated by Gupta et al. [3], is attacks, by tackling them via different and often orthogonal whether to use an initial ST phase before performing AT, or or complementary ways, such as adversarial training [1, 2, 9, whether to use exclusively AT. This study showed that using 10], detection of adversarial attacks [12], or pre-processing an initial ST phase normally helps to reduce the computa- techniques to filter adversarial perturbations [13]. Next, we tional cost while yielding models with comparable quality. will review existing AT approaches, which represent the This result motivates one of the key questions that we aim focus of this work. at answering in this work, namely whether the ST and AT AT aims at improving the robustness of machine learning phases should share the settings for their common HPs. (ML) models by i) first generating adversarially perturbed in- puts, and ii) feeding these adversarial examples, along with the correct corresponding label, during the model train- 2.2. Hyper-parameter Tuning ing phase. More formally, this process can be described as HPT is a critical phase to optimize the performance of ML follows. Unlike ST, which determines the models’ parame- models. As the scale and complexity of models increase, ters 𝜃 by minimizing the loss function between the model’s along with the number of HP that can possibly be tuned prediction for the clean input 𝑓𝜃 (𝑥) and the original class in modern ML methods [18], HPT is a notoriously time- 𝑦, i.e., min{ E [𝐿(𝑓𝜃 (𝑥), 𝑦)]}, AT first computes a per- consuming process, whose cost can become prohibitive due 𝜃 𝑥,𝑦∼𝐷 turbation 𝛿, smaller than a maximum predefined bound to the need to repetitively train complex models on large 𝜖, which will mislead the current model, and then trains datasets. the model with that perturbed input. This approach leads To address this issue, a large spectrum of the literature on to the formulation of the following optimization problem: HPT relies on Bayesian Optimization (BO) [19, 20, 21, 6, 7, min{ E [ max 𝐿(𝑓𝜃 (𝑥 + 𝛿), 𝑦)]}. The model’s robust- 22, 5, 23]. BO employs modeling techniques (e.g., Gaussian 𝜃 𝑥,𝑦∼𝐷 ‖𝛿‖<𝜖 Processes) to guide the optimization process and leverages ness depends on the bound 𝜖 used to produce the adver- the model’s knowledge and uncertainty (via a, so called, sarial examples and on the strength of the method used to acquisition function) to select which configurations to test. compute those examples. Although the use of BO can help to increase the conver- Several methods have been developed to solve this op- gence speed of the optimization process, the cost of testing timization problem (or variants thereof as, e.g., [4]) and multiple HP configurations can quickly become prohibitive, the resulting techniques are based on different assumptions especially when considering complex models trained over about, e.g., the availability of the model for the attacker (i.e., large datasets. white-box [2] or black-box [14]), whether the underlying To tackle this problem, multi-fidelity techniques [5, 6, 20, model is differentiable [1] or not [14], and the existence 19, 23, 7] exploit cheap low-fidelity evaluations (e.g., train- of bounds on attacker capabilities [11]. Among these tech- ing with a fraction of the available data or using a reduced niques, two of the most popular ones are Fast Gradient number of training epochs) and extrapolate this knowledge Sign Method (FGSM) [1] and Projected Gradient Descent to recommend high-fidelity configurations. This allows for (PGD) [2]). Both techniques hypothesize that attackers can reducing the cost of testing HP configurations, while still inject bounded perturbations and have access to a differen- providing useful information to guide the search for the op- tiable model. FGSM, and its later variants [9, 10, 15], rely timal high-fidelity configuration(s) [5, 19]. HyperBand [8] on gradient descent to compute small perturbations in an is a popular multi-fidelity and model-free approach that efficient way. More in detail, for a given clean input 𝑥, this promotes good quality configurations to higher budgets method adjusts the perturbation 𝛿 by the magnitude of the and discards the poor quality ones using a simple, yet effec- bound in the direction of the gradient of the loss function, tive, successive halving approach [24]. Several approaches i.e., 𝛿 = 𝜖 · sign(∇𝛿 𝐿(𝑓𝜃 (𝑥 + 𝛿), 𝑦)). PGD iteratively gen- extended HyperBand using models to identify good config- erates adversarial examples by taking small steps in the urations [25, 26] or shortcut the number of configurations direction of the gradient of the loss function and projecting to test [27]. While these works adopt a single budget type the perturbed inputs back onto the 𝜖-ball around the original (e.g., training time or dataset size), other approaches, such input, i.e., Repeat: 𝛿 = 𝒫(𝛿+𝛼∇𝛿 𝐿(𝑓𝜃 (𝑥+𝛿), 𝑦)), where as taKG [7], make joint usage of multiple budget/fidelity di- 𝒫 is the projection onto the ball of radius 𝜖 and 𝛼 can be seen mensions during the optimization process to have additional as analogous to the learning rate in gradient-descent-based flexibility to reduce the optimization cost. taKG selects the training. Due to its iterative nature, PGD incurs a notably next configuration and different budgets via model-based higher computational cost than FGSM [10, 9], but it is also predictive techniques that estimate the cost incurred and regarded as one of the strongest methods to generate ad- information gained by sampling a given configuration for a versarial examples. In fact, prior work has shown that PGD given setting of the available fidelity dimensions. attacks can fool robust models trained via FGSM and that In the area of HPT for robust models, the work that is PGD-based AT produces models robust to larger perturba- more closely related to ours is the study by Duesterwald et tions [2] and with higher adversarial accuracy. FGSM is also al. [28]. This work has investigated empirically the relations known to suffer from catastrophic-overfitting [16], in which between the bounds on adversarial perturbations (𝜖) and the model’s accuracy/robustness. Further, it showed that Table 1 the ratio of clean/adversarial examples included in a batch Hyper-parameters considered (during AT) can have a positive impact on the model’s qual- Hyper-parameter Values ity and represents, as such, a key HP. Based on this finding, we incorporate this HP among the ones tested in our study. Learning Rate (ST and AT) {0.1, 0.01} Differently from that work, we focus on i) quantifying the Batch Momentum (ST and AT) {0.9, 0.99} for ResNet50 benefits of using different HPs during ST and AT, and ii) ex- {0, 0.9} otherwise ploiting the correlation between cheaper AT methods (such {256, 512} for ResNet50 Batch Size (ST and AT) {128, 256} otherwise as FGSM) to enhance the efficiency of multi-fidelity HPT 𝛼 (PGD learning rate) {10−2 , 10−3 } algorithms. % resources (time or epochs) {0, 30, 50, 70, 100} for AT (%RAT) 3. HPT for Robust Models: % adversarial examples {30, 50, 70, 100} in each batch (%AE) Challenges and Opportunities Table 2 As mentioned, this work aims at shedding light on the chal- Bounds 𝜖 per benchmarks lenges and opportunities that arise when performing HPT for adversarially robust models. More precisely, we seek to Model & Benchmark Bound 𝜖 answer the following questions: ResNet50/ImageNet {2, 4} 1. Should the HPs that are common to the AT and ResNet18/SVHN {4, 8} ST phases be tuned independently? More in detail, CNN/Cifar10 {8, 12} we aim at quantifying to what extent the model’s quality is affected if one uses the same vs different Table 3 HP settings during the ST and AT phases (see Fidelities considered Section 3.2). Fidelities Values PGD iterations {1 (FGSM), 5, 10, 20} 2. Is it possible to reduce the cost of HPT by testing HP Epochs {1,2,4,8,16} settings using cheaper (but less robust) AT methods? How correlated are the performance of alternative AT approaches and what factors (e.g., the perturba- Section 3.3). The model’s quality is evaluated using standard tion bound or the cost of the techniques) impact such error (i.e., error using clean inputs) and adversarial error correlation? To what extent can this approach en- (i.e., error using adversarially perturbed inputs). hance the HPT’s process efficiency? (see Section 3.3.) For each model, we exhaustively explored the (discretized) In order to answer the above questions, we have collected space defined by the HPs, bound 𝜖, and fidelities, which (and made publicly available) a dataset, which we obtained yields a search space encompassing a total of 19208 config- by varying some of the most impactful HPs for three popular urations. Building this dataset required around fifty thou- neural models/datasets and measured the resulting model sand GPU hours and we have made it publicly accessible in quality. We provide a detailed description of the dataset in the hope that it will aid the design of future HPT methods Section 3.1. specialized for AT. Additional information to ensure the re- producibility of results is provided in the public repository1 . 3.1. Experimental Setup 3.2. Should the HPs of ST and AT be tuned We base our study on three widely-used models and datasets (ResNet50/ImageNet, ResNet18/SVHN, and CNN/Cifar10). independently? All the models were trained using 1 worker, except SVHN, This section aims at answering the following question: given in which two workers were used. We used Nvidia Tesla that the ST and AT phases share several HPs (e.g., batch V100 GPUs to train the ResNet50, and Nvidia GeForce RTX size, learning rate, and momentum in the models considered 2080 for the remaining models. All models and training in this study), how relevant is it to use different settings procedures were implemented in Python3 via the Pytorch for these HPs in the two training phases? Note that, if we framework. assume the existence of 𝐶 HPs in common between ST and To evaluate the models, we considered up to nine differ- AT, then enabling the use of different values for these HPs ent HPs, as summarized in Table 1. The first three HPs in in each training stage causes a growth of the dimensionality this table apply to both the ST and AT phases. 𝛼 is an HP of the HP space from 𝐶 to 2𝐶 (not accounting for any HP that applies exclusively to AT, whereas the last two HPs not in common) and, ultimately, to a significant increase in (%RAT and %AE) regulate the balance between ST and AT the cost/complexity of the HPT problem. Specifically, for (see Section 2). Specifically, %RAT defines the number of the scenarios considered in this study, the cardinality of the computational resources allocated to the AT phase, and HP space grows from 320 to 2560 distinct configurations. %AE indicates the ratio of adversarial inputs contained in Hence, we argue that such a cost is practically justified the batches during the AT phase (as suggested by Duester- only if it is counterbalanced by relevant gains in terms of wald et al. [28]). We further consider several settings of error reduction. To answer this question, we trained the the bound 𝜖 on the attacker power (see Table 2). Note that models during 16 epochs and used different settings for the reported values of 𝜖 are normalized by 255. Finally, we the common HPs for ST and AT (Table 1). We consider also consider two fidelity dimensions, namely the number of training epochs and the number of PGD iterations (see 1 https://github.com/pedrogbmendes/HPT_advTrain 15 30 10 % (Adv) Error Reduction % (Adv) Error Reduction % (Adv) Error Reduction 20 7.5 10 5 5 10 2.5 0 0 0 %RAT =12 %RAT =12 %RAT =12 %RGADT, =8 %RAT =8 %RAT =8 %RGADT, =2 %RAT =2 %RGADT, =2 %RAT =4 %RAT =4 %RAT =4 %RGADT, =4 %RAT =4 %RAT =4 %RAT =8 %RGADT, =8 %RGADT, =8 =30 =50 =70 =30 =50 =70 =30 =50 =70 =30 =50 =70 =30 =50 =70 =30 =50 =70 PGD, PGD, PGD, PGD, PGD, PGD, PGD, PGD, PGD, PGD, PGD, PGD, P P P P P P (a) ResNet50/ImageNet (b) ResNet18/SVHN (c) CNN/Cifar10 Figure 1: Reduction of the mean (in black), standard (in blue), and adversarial (in red) error of the optimal configuration if the same or different hyper-parameters are used for the 2 phases of training using different scenarios and benchmarks. 1.0 1.00 1.00 0.75 0.75 0.5 CDF CDF CDF 0.50 0.50 =2 =4 =8 =4 0.25 =8 0.25 =12 0 10 20 0 25 50 75 0 20 40 % (Adv.) Error Reduction % (Adv.) Error Reduction % (Adv.) Error Reduction (a) ResNet50/ImageNet (b) ResNet18/SVHN (c) CNN/Cifar10 Figure 2: Cumulative distribution functions (CDFs) of the mean (in black), standard (in blue), and adversarial (in red) error reduction when using different HP settings for ST and AT w.r.t. the case in which common HPs are used in both phases. We use dashed and continuous lines to refer to different values of 𝜖. three different settings (30%, 50%, and 70%)2 for the relative executed an initial ST phase using any of the possible HPs amount of resources (epochs) available for AT (%RAT), as settings. Specifically, we report the Cumulative Distribution well as different settings of the perturbation bound 𝜖. Function (CDF) of the percentage of error reduction (for each We consider that the model’s HPs can be optimized ac- of the three target optimization metrics), when allowing the cording to three criteria: i) clean data error (Error), ii) ad- use of the same or different common HP settings for the versarial error (AdvError), and iii) the average of clean and two phases and while varying the remaining (non-common) adversarial error (MeanError). For each of these three opti- HPs, namely %RAT, %AE, and 𝛼. These results allow us to mization criteria, %RAT and bound 𝜖, we report in Figure 1 highlight that by independently tuning the HPs of the two the percentage of reduction of the target optimization metric training stages, the model’s quality is enhanced by up to for the optimal HP configuration obtained by allowing for approx. 80%, 43%, and 56%, when minimizing the standard, (but not imposing) different settings of the HPs in common adversarial, or mean error, resp. to the ST and AT phases, with respect to the optimal HP Overall, these results may be justified by considering configuration if one opts for using the same settings for the that the optimization objectives and constraints of the ST common HPs in both training phases, namely: and AT phases are different, hence benefiting from using different HP settings. During ST, the training procedure Errorsame HPs − Errordiff HPs focuses on maximizing standard accuracy, and the model’s %Error Reduction = ×100 (1) goal is to learn representations that generalize well to new Errorsame HPs data. In contrast, AT seeks to increase robustness against The results show that adopting different HP settings in adversarial attacks, and the model needs to learn to differ- the two phases can lead to significant error reductions for entiate between clean and perturbed examples correctly. all the three optimization criteria. The peak gains extend Further, the AT phase benefits from a pre-trained model up to approx. 30% and are achieved for the case of ResNet18 (using clean data), and, as such, this model is expected to re- with (relatively) large values of 𝜖 and when allocating a quire relatively small weight adjustments to defend against low percentage of epochs to AT (%RAT=30%). Overall, the adversarial inputs. Thus, this phase is likely to benefit from geometric means of the % error reduction (across all models more conservative settings of HPs such as learning rate and and settings of 𝜖 and %RAT) is 9%, 5%, and 6% for the Error, momentum than the initial ST, whose convergence could AdvError, and MeanError criterion, respectively. be accelerated via the use of more aggressive settings for Next, Figure 2 provides a different perspective in order to the same HPs. In fact, we confirmed this fact by analyzing quantify the benefits achieved by separately optimizing the the configurations that yield the 10 largest error reductions HP of the AT phase (vs. using for the AT phase the same HPs in Figure 2: better quality models used lower learning rates settings in common with the ST phase), assuming to have and batch sizes in the AT phase. 2 Another factor that can justify the need for using different We exclude the cases %RAT={0,100} in this study to focus on scenarios that contain both the ST and AT phases. HP settings during ST and AT is related to the observation 1.0 1.0 1.00 0.9 0.75 0.9 Adv Error 0.8 0.50 Error CDF 0.7 0.8 FGSM FGSM, =0.95 FGSM, =0.96 0.25 PGD5 0.6 PGD5, =0.97 PGD5, =0.98 PGD10 PGD10, =0.97 0.7 PGD10, =0.97 ST 0.5 0.00 0.6 0.8 1.0 0.7 0.8 0.9 1.0 25 50 75 Error (PGD20) Adv Error (PGD20) Training Time Reduction [%] (a) Error Corr. (b) Adv. Error Corr. (c) CDF time reduction ResNet50/ImageNet 𝜖-2 ResNet50/ImageNet 𝜖-2 ResNet50/ImageNet 0.8 0.8 0.8 0.9 0.6 0.6 Adv Error Adv Error 0.8 Error Error 0.6 0.4 0.4 FGSM, =0.78 FGSM, =0.66 FGSM, =0.68 0.7 FGSM, =0.53 0.2 PGD5, =0.81 0.4 PGD5, =0.83 0.2 PGD5, =0.91 PGD5, =0.91 PGD10, =0.85 PGD10, =0.87 PGD10, =0.95 0.6 PGD10, =0.95 0.2 0.4 0.6 0.8 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.6 0.7 0.8 0.9 Error (PGD20) Adv Error (PGD20) Error (PGD20) Adv Error (PGD20) (d) Error Corr. (e) Adv. Error Corr. (f) Error Corr. (g) Adv. Error Corr. ResNet18/SVHN 𝜖-4 ResNet18/SVHN 𝜖-4 ResNet18/SVHN 𝜖-8 ResNet18/SVHN 𝜖-8 0.6 0.7 0.6 0.7 0.5 0.6 0.5 0.6 Adv Error Adv Error 0.4 0.4 Error Error 0.5 0.5 0.3 FGSM, =0.98 FGSM, =0.95 0.3 FGSM, =0.96 FGSM, =0.93 0.2 PGD5, =0.99 0.4 PGD5, =0.99 0.2 PGD5, =0.98 0.4 PGD5, =0.98 PGD10, =0.99 PGD10, =0.99 PGD10, =0.99 PGD10, =0.99 0.1 0.2 0.4 0.6 0.30.3 0.4 0.5 0.6 0.7 0.1 0.2 0.4 0.6 0.30.3 0.4 0.5 0.6 0.7 Error (PGD20) Adv Error (PGD20) Error (PGD20) Adv Error (PGD20) (h) Error Corr. (i) Adv. Error Corr. (j) Error Corr. (k) Adv. Error Corr. CNN/Cifar10 𝜖-8 CNN/Cifar10 𝜖-8 CNN/Cifar10 𝜖-12 CNN/Cifar10 𝜖-12 Figure 3: Standard and adversarial error correlation between PGD20 and FGSM, PGD5, and PGD10 varying the bound 𝜖 and CDF of the training time reduction obtained using cheaper AT algorithms w.r.t PGD20 (Figure 3c). that the bound on the admissible perturbation (𝜖) can have yet informative, way. As discussed in Section 2, PGD is a deep impact on the model’s performance, by exposing an iterative method, where each iteration refines the per- an inherent (and well-known [29]) trade-off: as the bound turbation with the objective of maximizing loss. Hence, a increases, the model may become more robust to adversarial straightforward way to reduce its cost (at the expense of ro- inputs but at the cost of an increase in the misclassification bustness) is to reduce the number of executed iterations. We rate of clean inputs. To achieve an optimal trade-off between also note that the computational cost of FGSM is equivalent robustness and accuracy, it may be necessary to adjust the to that of a single PGD iteration. tuning of the HPs used during AT as 𝜖 varies, which in turn We build on these observations to propose incorporating implies that the optimal HPs settings used during ST and AT the number of PGD iterations as an additional fidelity di- can be different. In fact, by analyzing the results obtained on mension in multi-fidelity HPT optimizers, such as taKG [7]. ResNet18/SVHN, for example, we see that the amplitude of We choose to test the proposed idea with taKG since this the bound has an impact on the (adversarial) error reduction technique supports the use of an arbitrary number of fidelity achievable by tuning independently the HPs of two phases of dimensions (e.g., dataset size and number of epochs) and training: the 90𝑡ℎ percentile of the percentage of clean error determines how to explore the multi-dimensional fidelity reduction is 50% and 65% using 𝜖=4 and 𝜖=8, respectively via black-box modeling techniques (see Section 2.2). (see Fig. 2b). In order to assess the soundness and limitations of the proposed approach, we first analyze the correlation of the 3.3. Can cheap AT methods be leveraged to standard and adversarial error between HP configurations that use PGD with 20 iterations (which we consider as the accelerate HPT? maximum-fidelity/full budget) vs. PGD with 10 and 5 it- So far, we have shown that in adversarial settings the com- erations and FGSM (which, as already mentioned, is com- plexity of the HPT problem is exacerbated due to the need putationally equivalent to 1 iteration of PGD). In Figure 3, for optimizing a larger HP space. In this section, we show we observe that the correlation varies for different bounds that, fortunately, AT provides also new opportunities to on adversarial perturbations across the considered models/- reduce the HPT cost. Specifically, we propose and evalu- datasets. We omit the correlation for ResNet50/ImageNet ate a novel idea: leveraging alternative AT methods, which using 𝜖=4 since the results are very similar to 𝜖=2. The impose lower computational costs but provide weaker ro- scatter plots clearly show the existence of a very strong bustness guarantees to sample HP configurations in a cheap, correlation (above 95%) for all the considered methods for 0.70 0.50 0.40 taKG (epochs) taKG (epochs) taKG (epochs) 0.5 Error + 0.5 Adv.Error taKG (PGD iter) taKG (PGD iter) taKG (PGD iter) 0.5 Error + 0.5 Adv.Error 0.5 Error + 0.5 Adv.Error 0.68 taKG (epochs & PGD iter) taKG (epochs & PGD iter) taKG (epochs & PGD iter) 0.66 HB (epochs) HB (epochs) HB (epochs) BO-EI BO-EI BO-EI Random Search Random Search Random Search 0.64 0.40 0.30 0.62 0.600 25 50 75 100 125 150 0 5 10 15 20 25 30 0 10 20 30 40 Time [hours] Time [hours] Time [hours] (a) ResNet50/ImageNet 𝜖-2 (b) ResNet18/SVHN 𝜖-8 (c) CNN/Cifar10 𝜖-12 Figure 4: Average standard and adversarial error using different optimizers. ResNet50/ImageNet and for all the considered bounds. For perform high-fidelity evaluations. The evaluation of these ResNet18/SVHN and CNN/Cifar10, the correlation of PGD alternative solutions is performed by exploiting the dataset 5 and 10 iterations remains quite strong (always above 80% already described in Section 3.1, which specifies the model and typically above 90%), whereas lower correlations (as low quality (error and adversarial error) for all possible HPs, 𝜖 as 53%) can be observed for FGSM, especially when consider- and fidelity settings reported in Tables 1, 2 and 3. ing adversarial error and larger values of 𝜖 . This is expected, We define the optimization problem as follows: as previous works [15, 16] had indeed observed that FGSM tends to be less robust than PGD when larger 𝜖 values are min 𝜆·Error(𝑥, 𝑠 = 1)+(1−𝜆)·Adv.Error(𝑥, 𝑠 = 1) (2) 𝑥 used (being subject to issues such as catastrophic-overfitting that lead to a sudden drop of adversarial accuracy). Still, where 𝑥 is a vector defining the HPs, 𝑠 is a vector that even for FGSM, the correlation is always above 90% with encodes the ratio of budget allocated for each fidelity di- CNN/Cifar10 and is relatively high (around 70%) also with mension, and 𝜆 is a weight factor that we set to 0.5 to equally ResNet18/SVHN for the smaller considered bound (𝜖 = 4). balance the standard and adversarial errors. For a fair com- We also report, in Figure 3c, the CDF of the training time parison, when a single fidelity dimension (e.g., epochs) is reduction using FGSM, PGD with 5 and 10 iterations, and used, we set the other fidelity dimension (e.g., PGD itera- ST w.r.t. PGD20 for ResNet50/ImageNet. The CDFs show tions) to its maximum value. We run each optimizer using that the training time reductions for a given AT method 20 independent seeds. We set the bound 𝜖 to 2, 8, and 12 to vary since the ratio of computed adversarial examples de- optimize ResNet50/ ImageNet, ResNet18/SVHN, and CNN/- pends on the %RAT and %AE parameters. Overall, the max- Cifar10. Based on Figure 3, the three settings correspond imum (median) training time reduction is approximately to scenarios with relatively high, low, and medium correla- 83% (54%), 66% (42%), 47% (28%), and 86% (53%) for FGSM, tions for the budget dimension defined by PGD iterations, PGD5, PGD10, and STD, compared to PGD20, which con- respectively. firms that leveraging these “cheap” surrogate methods can Figure 4 reports the average and standard deviation of significantly reduce the cost of testing HP configurations. the optimization goal (i.e., 0.5 · Error + 0.5 · Adv.Error) as Supported by these findings, we evaluate our proposal by a function of the optimization time for the different HPT integrating the number of PGD iterations as an additional optimizers and different models/datasets. The results show fidelity dimension in taKG [7]. As discussed in Section 2, that the proposed solution, which adopts PGD iterations taKG is a HPT method that natively supports the use of mul- as extra fidelity along with the number of epochs (taKG - tiple fidelity types, which we refer to as fidelity dimensions, epochs & PGD iter), clearly outperforms all the alternative e.g., number of epochs, input size, and dataset size. Based solutions in ResNet50/ImageNet and ResNet18/SVHN. The on the results of the previous section, we independently largest gains can be observed with ResNet18/SVHN (Fig. 4b). optimize the HPs of the ST and AT phases, which yields a Here, at the end of the optimization process, the proposed search space composed of a total of nine HPs (Table 1). The solution identifies a configuration of the same quality as following multi-fidelity solutions are compared: the ones suggested by the other baselines, namely taKG - epochs, HB, BO-EI, by achieving speed-ups of 2.1×, 3.7×, • taKG (epochs, PGD iter): the proposed solution, and 5.4×, respectively. Using the same metric (time spent which uses taKG as the underlying HPT method to recommend a configuration of the same quality as the and employs as fidelity the number of epochs and ones suggested by the other optimizers at the end of the op- the number of PGD iterations. We discretize these 2 timization process) with ResNet50/ImageNet, the proposed dimensions (Table 3). solutions achieve slightly smaller, but still significant speed- • taKG (epochs): taKG using as fidelity only the num- ups, namely 1.28×, 1.97× and 2.45× w.r.t. taKG - epochs, ber of epochs. HB, BO-EI. Interestingly, with ResNet50/ImageNet the pro- • taKG (PGD iter): taKG using as fidelity only the num- posed solution provides solid speed-ups also during the first ber of PGD iterations. stage of the optimization. Specifically, if we analyze the first • HB (epochs): HyperBand [8], a popular (single- half of the optimization process (corresponding to approx. dimensional) multi-fidelity optimizer, which uses 83 hours (see Figure 4a)) the proposed solution identifies the number of epochs as fidelity. configurations of the same quality as taKG - epochs, HB, BO-EI, with speed-ups of 1.7×, 2× and 2.6×, respectively. We further compare against BO using EI as acquisition With CNN/Cifar10 (Figure 4c), the proposed approach re- function and Random Search (RS). These optimizers only mains the best-performing solution, although with smaller gains when compared to taKG with epochs. Still, the pro- References posed solution can identify configurations with the same quality as the best alternative (taKG epochs) by saving ap- [1] I. Goodfellow, J. Shlens, C. Szegedy, Explaining and prox. 40% of the time (i.e., in 22 hours vs. 32 hours). We harnessing adversarial examples, in: ICLR, 2015. argue that the gains with CNN/Cifar10 are relatively lower [2] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, than in the other scenarios considered in Figure 4 since Towards deep learning models resistant to adversarial those models are larger and more complex. As such, they attacks, in: ICLR, 2018. benefit more from the cost reduction opportunities provided [3] S. Gupta, P. Dube, A. Verma, Improving the affordabil- by using a reduced number of PGD iterations. ity of robustness training for dnns, in: CVPR, 2020. We also observe that the exclusive use of PGD iterations [4] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. E. Ghaoui, M. Jordan, with taKG yields worse performance than using solely num- Theoretically principled trade-off between robustness ber of epochs. This is not surprising, given that number of and accuracy, in: ICML, 2019. epochs is arguably one of the most direct ways of controlling [5] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential the cost of configuration sampling and is, indeed, among model-based optimization for general algorithm con- the most commonly adopted budgets in multi-fidelity opti- figuration, in: LION, 2011. mizers [20, 8, 27]. This result confirms that PGD iterations [6] K. Swersky, J. Snoek, R. Adams, Multi-task bayesian represent a valuable mean to accelerate multi-fidelity HPT optimization, in: NeurIPS, 2013. optimizers to train robust models and that it complements, [7] J. Wu, S. Toscano-Palmerin, P. I. Frazier, A. G. Wil- but does not replace, "conventional" budget settings like son, Practical multi-fidelity bayesian optimization for number of epochs or dataset size. hyperparameter tuning, in: UAI, 2019. [8] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Tal- walkar, Hyperband: A novel bandit-based approach 4. Conclusions and Future work to hyperparameter optimization, Journal of Machine Learning Research (2018). This paper focused on the problem of HPT for robust mod- [9] E. Wong, L. Rice, Z. Kolter, Fast is better than free: els. By means of an extensive experimental study, we first Revisiting adversarial training, in: ICLR, 2020. quantified the relevance of independently tuning the HPs [10] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. P. Dickerson, used during standard and adversarial training. We then C. Studer, L. S. Davis, G. Taylor, T. Goldstein, Adver- proposed and evaluated a novel fidelity dimension that be- sarial training for free!, in: NeurIPS, 2019. comes available in the context of AT. Specifically, we have [11] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard, Deep- shown that cheaper AT methods can be used to obtain inex- fool: A simple and accurate method to fool deep neural pensive estimations of the quality achievable via expensive networks, in: CVPR, 2016. state-of-the-art AT methods and that this information can be [12] J. H. Metzen, T. Genewein, V. Fischer, B. Bischoff, On effectively exploited to accelerate HPT. We extended taKG, detecting adversarial perturbations, in: ICLR, 2017. a state-of-the-art HPT method, by incorporating the PGD [13] C. Guo, M. Rana, M. Cisse, L. van der Maaten, Coun- iterations as an additional fidelity dimension (along with tering adversarial images using input transformations, the number of epochs) and achieved cost reductions by up in: ICLR, 2018. to 2.1×. [14] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. It is worth noting that the idea of employing “cheap” AT Celik, A. Swami, Practical black-box attacks against methods as proxies to estimate the quality of HP config- machine learning, in: ASIA CCS, 2017. urations with more robust/expensive methods is generic, [15] M. Andriushchenko, N. Flammarion, Understanding in the sense that it can be applied, at least theoretically, to and improving fast adversarial training, in: NeurIPS, any multi-fidelity optimizer. As part of our future work, we 2020. plan to integrate this novel approach in a new HPT frame- [16] L. Rice, E. Wong, J. Z. Kolter, Overfitting in adversari- work, specifically designed to cope with adversarially robust ally robust deep learning, in: ICML, 2020. models. [17] T. Bai, J. Luo, J. Zhao, B. Wen, Q. Wang, Recent ad- vances in adversarial training for adversarial robust- Acknowledgments ness, in: IJCAI, 2021. [18] E. Strubell, A. Ganesh, A. McCallum, Energy and This work was supported by the Fundação para a Ciência e policy considerations for deep learning in NLP, in: a Tecnología (Portuguese Foundation for Science and Tech- ACL, 2019. nology) through the Carnegie Mellon Portugal Program [19] P. Mendes, M. Casimiro, P. Romano, D. Garlan, Trim- under grant SFRH/BD/151470/2021 via projects with refer- tuner: Efficient optimization of machine learning jobs ence UIDB/50021/2020 and C645008882-00000055.PRR, by in the cloud via sub-sampling, in: MASCOTS, 2020. the NSA grant H98230-23-C-0274, and by the Advanced [20] A. Klein, S. Falkner, S. Bartels, et al., Fast bayesian Cyberinfrastructure Coordination Ecosystem: Services & optimization of machine learning hyperparameters on Support (ACCESS) program, where we used the Bridges-2 large datasets, in: AISTATS, 2017. GPU and Ocean resources at the Pittsburgh Supercomputing [21] J. Mockus, V. Tiesis, A. Zilinskas, The application Center through allocation CIS220073, which is supported of bayesian methods for seeking the extremum, in: by National Science Foundation grants #2138259, #2138286, Toward Global Optimization, 1978. #2138307, #2137603, and #2138296. [22] M. Casimiro, D. Didona, P. Romano, et al., Lynceus: Cost-efficient tuning and provisioning of data analytic jobs, in: ICDCS, 2020. [23] K. Swersky, J. Snoek, R. Adams, Freeze-thaw bayesian optimization, ArXiv:1406.3896 (2014). [24] K. Jamieson, A. Talwalkar, Non-stochastic best arm identification and hyperparameter optimization, in: AISTATS, 2016. [25] S. Falkner, A. Klein, F. Hutter, BOHB: Robust and efficient hyperparameter optimization at scale, in: ICML, volume 80, 2018. [26] N. H. Awad, N. Mallik, F. Hutter, DEHB: evolutionary hyberband for scalable, robust and efficient hyperpa- rameter optimization, in: IJCAI, 2021. [27] P. Mendes, M. Casimiro, P. Romano, D. Garlan, Hyper- jump: Accelerating hyperband via risk modelling, in: AAAI, 2023. [28] E. Duesterwald, A. Murthi, G. Venkataraman, et al., Ex- ploring the hyperparameter landscape of adversarial robustness, ArXiv abs/1905.03837 (2019). [29] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, A. Madry, Robustness may be at odds with accuracy, in: ICLR, 2019.