Hyper-parameter Tuning for Adversarially Robust Models
                         Pedro Mendes1,2,* , Paolo Romano2 and David Garlan1
                         1
                             Software and Societal Systems Department, Carnegie Mellon University
                         2
                             INESC-ID and Instituto Superior Tecnico, Universidade de Lisboa


                                             Abstract
                                             This work focuses on the problem of hyper-parameter tuning (HPT) for robust (i.e., adversarially trained) models, shedding light on the
                                             new challenges and opportunities arising during the HPT process for robust models. To this end, we conduct an extensive experimental
                                             study based on three popular deep models and explore exhaustively nine (discretized) hyper-parameters (HPs), two fidelity dimensions,
                                             and two attack bounds, for a total of 19208 configurations (corresponding to 50 thousand GPU hours).
                                                 Through this study, we show that the complexity of the HPT problem is further exacerbated in adversarial settings due to the need
                                             to independently tune the HPs used during standard and adversarial training: succeeding in doing so (i.e., adopting different HP settings
                                             in both phases) can lead to a reduction of up to 80% and 43% of the error for clean and adversarial inputs, respectively. We also identify
                                             new opportunities to reduce the cost of HPT for robust models. Specifically, we propose to leverage cheap adversarial training methods
                                             to obtain inexpensive, yet highly correlated, estimations of the quality achievable using more robust/expensive state-of-the-art methods.
                                             We show that, by exploiting this novel idea in conjunction with a recent multi-fidelity optimizer (taKG), the efficiency of the HPT
                                             process can be enhanced by up to 2.1×.


                         1. Introduction                                                                                                     and CNN/Cifar10). In this study, we discretize and ex-
                                                                                                                                             haustively explore the HP space composed of up to nine
                         Adversarial attacks [1] aim at causing model misclassifica-                                                         HPs, which we evaluated considering two “fidelity dimen-
                         tions by introducing small perturbations in the input. White-                                                       sion” [5, 6] for the training process and two attack strengths.
                         box methods like Projected Gradient Descent (PGD) [2]                                                               Overall, we test a total of 19208 configurations and we make
                         have been shown to be extremely effective in synthesiz-                                                             this dataset publicly accessible in the hope that it will aid
                         ing perturbations that are small enough to be noticeable                                                            the design of future HPT methods specialized for AT.
                         by humans, while severely hindering the model’s perfor-                                                                Leveraging this data, we investigate a key design choice
                         mance. Fortunately, models can be hardened against this                                                             for the HPT process of robust models, namely the decision of
                         type of attack via a so-called “Adversarial Training” (AT)                                                          whether to adopt the same vs. different HP settings during
                         process. During AT, which typically takes place after an                                                            AT and ST (for the common HPs in the 2 phases). To this
                         initial standard training (ST) phase [3], adversarial exam-                                                         end, we focus on 3 key HPs of deep models: learning rate,
                         ples are synthesized and added (with their intended label)                                                          momentum, and batch size. Our empirical study shows that
                         to the training set. Recently, several AT methods have been                                                         allowing the use of different HP settings during ST and AT
                         proposed [1, 2, 4] that explore different trade-offs between                                                        can bring substantial benefits in terms of model quality, by
                         robustness and computational efficiency. Unfortunately, the                                                         reducing up to 80% and 43% the standard and adversarial
                         most robust AT methods impose significant overhead (up                                                              error, respectively.
                         to 7× in the models tested in this work) with respect to                                                               Further, our study demonstrates that while the cost and
                         standard training.                                                                                                  complexity of HPT are heightened in adversarial settings,
                            These costs are further amplified when considering                                                               it also reveals that, in the context of robust models, unique
                         another crucial phase of model building, namely hyper-                                                              opportunities can be exploited to effectively mitigate these
                         parameter tuning (HPT). In fact, HPT methods require train-                                                         costs. Specifically, we show that it is possible to leverage
                         ing a model multiple times using different hyper-parameter                                                          cheap AT methods to obtain inexpensive, yet highly cor-
                         (HP) configurations. Consequently, the overheads intro-                                                             related, estimations of the quality achievable using more
                         duced by AT lead also to an increase in the cost of HPT.                                                            robust/expensive methods (PGD [2]). Besides studying the
                         Further, AT and ST share common HPs, which raises the                                                               trade-offs between cost reduction and HP quality correlation
                         question of whether AT should simply employ the same HP                                                             with different AT methods, we extend a recent multi-fidelity
                         settings used during ST, or if a new HPT process should                                                             optimizer (taKG [7]) to incorporate the choice of the AT
                         be executed to select different HPs for the AT phase. In                                                            method as an additional dimension to reduce the HPT cost.
                         the latter case, the dimensionality of the HP space to be                                                           We evaluate the proposed method using our dataset and
                         optimized grows significantly, exacerbating the HPT cost.                                                           show that incorporating the choice of the AT method as
                            Hence, this work focuses on the problem of HPT for ad-                                                           an additional fidelity dimension in taKG leads up to 2.1×
                         versarially trained models with the twofold goal of i) shed-                                                        speed-ups, with gains that extend up to 3.7× w.r.t. popular
                         ding light on the new challenges (i.e., additional costs) that                                                      HPT methods, as HyperBand [8]. These reductions in the
                         emerge when performing HPT for robust models, and ii)                                                               optimization time not only translate to significant reduc-
                         proposing novel techniques to reduce these costs, by ex-                                                            tions in energy consumption during training but also result
                         ploiting opportunities that emerge in this context.                                                                 in corresponding decreases in pollutant emissions.
                            We pursue the first goal via an extensive experimental
                         study based on 3 popular models/datasets widely used to
                         evaluate AT methods (ResNet50/ImageNet, ResNet18/SVHN,                                                              2. Background and Related Work
                                                                                                                                             In this section, we first provide background information on
                          The IJCAI-2024 AISafety Workshop                                                                                   AT techniques (Section 2.1) and then discuss related works
                         *
                           Corresponding author.
                          $ pgmendes@andrew.cmu.edu (P. Mendes); romano@inesc-id.pt
                                                                                                                                             in the area of HPT (Section 2.2).
                          (P. Romano); dg4d@andrew.cmu.edu (D. Garlan)
                                     © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribu-
                                     tion 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
2.1. Adversarial Training                                            the model’s adversarial accuracy collapses after some train-
                                                                     ing iterations. Henceforth, we will focus on FGSM and PGD,
Adversarial attacks aim at introducing small perturbations
                                                                     which, as mentioned, are among the most widely used and
to input data, often small enough to be hardly perceivable
                                                                     effective methods for generating adversarial examples [15].
by humans, with the goal to lead the model to generate an
                                                                     In fact, these methods have been extensively studied and
erroneous output. These attacks reveal vulnerabilities of
                                                                     compared in the literature [15, 16, 10, 9, 17] and represent a
current model training techniques, underscoring the need
                                                                     natural starting point for investigating the trade-offs related
for developing robust models in different domains. Thus,
                                                                     to HPT that arise in the context of AT.
several works [1, 2, 9, 10, 11] developed new techniques to
                                                                        Independently of the technique used to perform AT, a
mitigate these vulnerabilities and defend against adversarial
                                                                     relevant finding, first investigated by Gupta et al. [3], is
attacks, by tackling them via different and often orthogonal
                                                                     whether to use an initial ST phase before performing AT, or
or complementary ways, such as adversarial training [1, 2, 9,
                                                                     whether to use exclusively AT. This study showed that using
10], detection of adversarial attacks [12], or pre-processing
                                                                     an initial ST phase normally helps to reduce the computa-
techniques to filter adversarial perturbations [13]. Next, we
                                                                     tional cost while yielding models with comparable quality.
will review existing AT approaches, which represent the
                                                                     This result motivates one of the key questions that we aim
focus of this work.
                                                                     at answering in this work, namely whether the ST and AT
   AT aims at improving the robustness of machine learning
                                                                     phases should share the settings for their common HPs.
(ML) models by i) first generating adversarially perturbed in-
puts, and ii) feeding these adversarial examples, along with
the correct corresponding label, during the model train-             2.2. Hyper-parameter Tuning
ing phase. More formally, this process can be described as           HPT is a critical phase to optimize the performance of ML
follows. Unlike ST, which determines the models’ parame-             models. As the scale and complexity of models increase,
ters 𝜃 by minimizing the loss function between the model’s           along with the number of HP that can possibly be tuned
prediction for the clean input 𝑓𝜃 (𝑥) and the original class         in modern ML methods [18], HPT is a notoriously time-
𝑦, i.e., min{ E [𝐿(𝑓𝜃 (𝑥), 𝑦)]}, AT first computes a per-            consuming process, whose cost can become prohibitive due
         𝜃   𝑥,𝑦∼𝐷
turbation 𝛿, smaller than a maximum predefined bound                 to the need to repetitively train complex models on large
𝜖, which will mislead the current model, and then trains             datasets.
the model with that perturbed input. This approach leads                To address this issue, a large spectrum of the literature on
to the formulation of the following optimization problem:            HPT relies on Bayesian Optimization (BO) [19, 20, 21, 6, 7,
min{ E [ max 𝐿(𝑓𝜃 (𝑥 + 𝛿), 𝑦)]}. The model’s robust-                 22, 5, 23]. BO employs modeling techniques (e.g., Gaussian
 𝜃    𝑥,𝑦∼𝐷 ‖𝛿‖<𝜖
                                                                     Processes) to guide the optimization process and leverages
ness depends on the bound 𝜖 used to produce the adver-
                                                                     the model’s knowledge and uncertainty (via a, so called,
sarial examples and on the strength of the method used to
                                                                     acquisition function) to select which configurations to test.
compute those examples.
                                                                     Although the use of BO can help to increase the conver-
   Several methods have been developed to solve this op-
                                                                     gence speed of the optimization process, the cost of testing
timization problem (or variants thereof as, e.g., [4]) and
                                                                     multiple HP configurations can quickly become prohibitive,
the resulting techniques are based on different assumptions
                                                                     especially when considering complex models trained over
about, e.g., the availability of the model for the attacker (i.e.,
                                                                     large datasets.
white-box [2] or black-box [14]), whether the underlying
                                                                        To tackle this problem, multi-fidelity techniques [5, 6, 20,
model is differentiable [1] or not [14], and the existence
                                                                     19, 23, 7] exploit cheap low-fidelity evaluations (e.g., train-
of bounds on attacker capabilities [11]. Among these tech-
                                                                     ing with a fraction of the available data or using a reduced
niques, two of the most popular ones are Fast Gradient
                                                                     number of training epochs) and extrapolate this knowledge
Sign Method (FGSM) [1] and Projected Gradient Descent
                                                                     to recommend high-fidelity configurations. This allows for
(PGD) [2]). Both techniques hypothesize that attackers can
                                                                     reducing the cost of testing HP configurations, while still
inject bounded perturbations and have access to a differen-
                                                                     providing useful information to guide the search for the op-
tiable model. FGSM, and its later variants [9, 10, 15], rely
                                                                     timal high-fidelity configuration(s) [5, 19]. HyperBand [8]
on gradient descent to compute small perturbations in an
                                                                     is a popular multi-fidelity and model-free approach that
efficient way. More in detail, for a given clean input 𝑥, this
                                                                     promotes good quality configurations to higher budgets
method adjusts the perturbation 𝛿 by the magnitude of the
                                                                     and discards the poor quality ones using a simple, yet effec-
bound in the direction of the gradient of the loss function,
                                                                     tive, successive halving approach [24]. Several approaches
i.e., 𝛿 = 𝜖 · sign(∇𝛿 𝐿(𝑓𝜃 (𝑥 + 𝛿), 𝑦)). PGD iteratively gen-
                                                                     extended HyperBand using models to identify good config-
erates adversarial examples by taking small steps in the
                                                                     urations [25, 26] or shortcut the number of configurations
direction of the gradient of the loss function and projecting
                                                                     to test [27]. While these works adopt a single budget type
the perturbed inputs back onto the 𝜖-ball around the original
                                                                     (e.g., training time or dataset size), other approaches, such
input, i.e., Repeat: 𝛿 = 𝒫(𝛿+𝛼∇𝛿 𝐿(𝑓𝜃 (𝑥+𝛿), 𝑦)), where
                                                                     as taKG [7], make joint usage of multiple budget/fidelity di-
𝒫 is the projection onto the ball of radius 𝜖 and 𝛼 can be seen
                                                                     mensions during the optimization process to have additional
as analogous to the learning rate in gradient-descent-based
                                                                     flexibility to reduce the optimization cost. taKG selects the
training. Due to its iterative nature, PGD incurs a notably
                                                                     next configuration and different budgets via model-based
higher computational cost than FGSM [10, 9], but it is also
                                                                     predictive techniques that estimate the cost incurred and
regarded as one of the strongest methods to generate ad-
                                                                     information gained by sampling a given configuration for a
versarial examples. In fact, prior work has shown that PGD
                                                                     given setting of the available fidelity dimensions.
attacks can fool robust models trained via FGSM and that
                                                                        In the area of HPT for robust models, the work that is
PGD-based AT produces models robust to larger perturba-
                                                                     more closely related to ours is the study by Duesterwald et
tions [2] and with higher adversarial accuracy. FGSM is also
                                                                     al. [28]. This work has investigated empirically the relations
known to suffer from catastrophic-overfitting [16], in which
                                                                     between the bounds on adversarial perturbations (𝜖) and
the model’s accuracy/robustness. Further, it showed that        Table 1
the ratio of clean/adversarial examples included in a batch     Hyper-parameters considered
(during AT) can have a positive impact on the model’s qual-
                                                                           Hyper-parameter                       Values
ity and represents, as such, a key HP. Based on this finding,
we incorporate this HP among the ones tested in our study.              Learning Rate (ST and AT)                {0.1, 0.01}
Differently from that work, we focus on i) quantifying the            Batch Momentum (ST and AT)
                                                                                                        {0.9, 0.99} for ResNet50
benefits of using different HPs during ST and AT, and ii) ex-                                               {0, 0.9} otherwise
ploiting the correlation between cheaper AT methods (such                                               {256, 512} for ResNet50
                                                                         Batch Size (ST and AT)
                                                                                                          {128, 256} otherwise
as FGSM) to enhance the efficiency of multi-fidelity HPT
                                                                          𝛼 (PGD learning rate)               {10−2 , 10−3 }
algorithms.
                                                                       % resources (time or epochs)
                                                                                                           {0, 30, 50, 70, 100}
                                                                              for AT (%RAT)
3. HPT for Robust Models:                                                 % adversarial examples
                                                                                                            {30, 50, 70, 100}
                                                                           in each batch (%AE)
   Challenges and Opportunities
                                                                Table 2
As mentioned, this work aims at shedding light on the chal-
                                                                Bounds 𝜖 per benchmarks
lenges and opportunities that arise when performing HPT
for adversarially robust models. More precisely, we seek to                     Model & Benchmark          Bound 𝜖
answer the following questions:                                                   ResNet50/ImageNet           {2, 4}
    1. Should the HPs that are common to the AT and                                ResNet18/SVHN              {4, 8}
       ST phases be tuned independently? More in detail,                             CNN/Cifar10             {8, 12}
       we aim at quantifying to what extent the model’s
       quality is affected if one uses the same vs different    Table 3
       HP settings during the ST and AT phases (see             Fidelities considered
       Section 3.2).                                                             Fidelities             Values
                                                                               PGD iterations     {1 (FGSM), 5, 10, 20}
    2. Is it possible to reduce the cost of HPT by testing HP                    Epochs                {1,2,4,8,16}
       settings using cheaper (but less robust) AT methods?
       How correlated are the performance of alternative
       AT approaches and what factors (e.g., the perturba-
                                                                Section 3.3). The model’s quality is evaluated using standard
       tion bound or the cost of the techniques) impact such
                                                                error (i.e., error using clean inputs) and adversarial error
       correlation? To what extent can this approach en-
                                                                (i.e., error using adversarially perturbed inputs).
       hance the HPT’s process efficiency? (see Section 3.3.)
                                                                    For each model, we exhaustively explored the (discretized)
   In order to answer the above questions, we have collected    space defined by the HPs, bound 𝜖, and fidelities, which
(and made publicly available) a dataset, which we obtained      yields a search space encompassing a total of 19208 config-
by varying some of the most impactful HPs for three popular     urations. Building this dataset required around fifty thou-
neural models/datasets and measured the resulting model         sand GPU hours and we have made it publicly accessible in
quality. We provide a detailed description of the dataset in    the hope that it will aid the design of future HPT methods
Section 3.1.                                                    specialized for AT. Additional information to ensure the re-
                                                                producibility of results is provided in the public repository1 .
3.1. Experimental Setup
                                                                3.2. Should the HPs of ST and AT be tuned
We base our study on three widely-used models and datasets
(ResNet50/ImageNet, ResNet18/SVHN, and CNN/Cifar10).                 independently?
All the models were trained using 1 worker, except SVHN,        This section aims at answering the following question: given
in which two workers were used. We used Nvidia Tesla            that the ST and AT phases share several HPs (e.g., batch
V100 GPUs to train the ResNet50, and Nvidia GeForce RTX         size, learning rate, and momentum in the models considered
2080 for the remaining models. All models and training          in this study), how relevant is it to use different settings
procedures were implemented in Python3 via the Pytorch          for these HPs in the two training phases? Note that, if we
framework.                                                      assume the existence of 𝐶 HPs in common between ST and
   To evaluate the models, we considered up to nine differ-     AT, then enabling the use of different values for these HPs
ent HPs, as summarized in Table 1. The first three HPs in       in each training stage causes a growth of the dimensionality
this table apply to both the ST and AT phases. 𝛼 is an HP       of the HP space from 𝐶 to 2𝐶 (not accounting for any HP
that applies exclusively to AT, whereas the last two HPs        not in common) and, ultimately, to a significant increase in
(%RAT and %AE) regulate the balance between ST and AT           the cost/complexity of the HPT problem. Specifically, for
(see Section 2). Specifically, %RAT defines the number of       the scenarios considered in this study, the cardinality of the
computational resources allocated to the AT phase, and          HP space grows from 320 to 2560 distinct configurations.
%AE indicates the ratio of adversarial inputs contained in      Hence, we argue that such a cost is practically justified
the batches during the AT phase (as suggested by Duester-       only if it is counterbalanced by relevant gains in terms of
wald et al. [28]). We further consider several settings of      error reduction. To answer this question, we trained the
the bound 𝜖 on the attacker power (see Table 2). Note that      models during 16 epochs and used different settings for
the reported values of 𝜖 are normalized by 255. Finally, we     the common HPs for ST and AT (Table 1). We consider
also consider two fidelity dimensions, namely the number
of training epochs and the number of PGD iterations (see        1
                                                                    https://github.com/pedrogbmendes/HPT_advTrain
                          15                                                                  30                                                                   10


                                                                                                                                         % (Adv) Error Reduction
    % (Adv) Error Reduction


                                                                        % (Adv) Error Reduction
                                                                                              20                                                               7.5
                          10
                                                                                                                                                                    5
                              5                                                               10
                                                                                                                                                               2.5
                              0                                                                   0                                                                 0


                                                                                                                                                                   %RAT =12

                                                                                                                                                                   %RAT =12

                                                                                                                                                                   %RAT =12
                                                                                                                                                                   %RGADT, =8

                                                                                                                                                                   %RAT =8

                                                                                                                                                                   %RAT =8
                              %RGADT, =2

                              %RAT =2

                              %RGADT, =2

                              %RAT =4

                              %RAT =4

                              %RAT =4


                                                                                                  %RGADT, =4

                                                                                                  %RAT =4

                                                                                                  %RAT =4

                                                                                                  %RAT =8

                                                                                                  %RGADT, =8

                                                                                                  %RGADT, =8


                                                                                                                                                                         =30

                                                                                                                                                                         =50

                                                                                                                                                                         =70
                                    =30

                                    =50

                                    =70

                                    =30

                                    =50

                                    =70


                                                                                                        =30

                                                                                                        =50

                                                                                                        =70

                                                                                                        =30

                                                                                                        =50

                                                                                                        =70


                                                                                                                                                                         =30

                                                                                                                                                                         =50

                                                                                                                                                                         =70
                                                                                                                                                                    PGD,

                                                                                                                                                                    PGD,
                               PGD,


                               PGD,

                               PGD,

                               PGD,


                                                                                                   PGD,

                                                                                                   PGD,

                                                                                                   PGD,


                                                                                                                                                                   PGD,

                                                                                                                                                                   PGD,

                                                                                                                                                                   PGD,
                                                                                                                                                                    P
                               P


                               P


                                                                                                   P


                                                                                                   P

                                                                                                   P
                                        (a) ResNet50/ImageNet                                                (b) ResNet18/SVHN                                                 (c) CNN/Cifar10
                                    Figure 1: Reduction of the mean (in black), standard (in blue), and adversarial (in red) error of the optimal configuration if the
                                    same or different hyper-parameters are used for the 2 phases of training using different scenarios and benchmarks.


                              1.0                                                                 1.00                                                             1.00
                                                                                                  0.75                                                             0.75
                              0.5
      CDF


                                                                          CDF


                                                                                                                                           CDF
                                                                                                  0.50                                                             0.50
                                                                =2                                                               =4                                                              =8
                                                                =4                                0.25                           =8                                0.25                          =12
                                    0              10           20                                       0        25        50     75                                     0            20        40
                                        % (Adv.) Error Reduction                                             % (Adv.) Error Reduction                                         % (Adv.) Error Reduction
                                        (a) ResNet50/ImageNet                                                (b) ResNet18/SVHN                                                 (c) CNN/Cifar10
                                    Figure 2: Cumulative distribution functions (CDFs) of the mean (in black), standard (in blue), and adversarial (in red) error
                                    reduction when using different HP settings for ST and AT w.r.t. the case in which common HPs are used in both phases. We
                                    use dashed and continuous lines to refer to different values of 𝜖.


three different settings (30%, 50%, and 70%)2 for the relative                                                          executed an initial ST phase using any of the possible HPs
amount of resources (epochs) available for AT (%RAT), as                                                                settings. Specifically, we report the Cumulative Distribution
well as different settings of the perturbation bound 𝜖.                                                                 Function (CDF) of the percentage of error reduction (for each
   We consider that the model’s HPs can be optimized ac-                                                                of the three target optimization metrics), when allowing the
cording to three criteria: i) clean data error (Error), ii) ad-                                                         use of the same or different common HP settings for the
versarial error (AdvError), and iii) the average of clean and                                                           two phases and while varying the remaining (non-common)
adversarial error (MeanError). For each of these three opti-                                                            HPs, namely %RAT, %AE, and 𝛼. These results allow us to
mization criteria, %RAT and bound 𝜖, we report in Figure 1                                                              highlight that by independently tuning the HPs of the two
the percentage of reduction of the target optimization metric                                                           training stages, the model’s quality is enhanced by up to
for the optimal HP configuration obtained by allowing for                                                               approx. 80%, 43%, and 56%, when minimizing the standard,
(but not imposing) different settings of the HPs in common                                                              adversarial, or mean error, resp.
to the ST and AT phases, with respect to the optimal HP                                                                    Overall, these results may be justified by considering
configuration if one opts for using the same settings for the                                                           that the optimization objectives and constraints of the ST
common HPs in both training phases, namely:                                                                             and AT phases are different, hence benefiting from using
                                                                                                                        different HP settings. During ST, the training procedure
                                                   Errorsame HPs − Errordiff HPs                                        focuses on maximizing standard accuracy, and the model’s
    %Error Reduction =                                                           ×100 (1)                               goal is to learn representations that generalize well to new
                                                          Errorsame HPs
                                                                                                                        data. In contrast, AT seeks to increase robustness against
   The results show that adopting different HP settings in                                                              adversarial attacks, and the model needs to learn to differ-
the two phases can lead to significant error reductions for                                                             entiate between clean and perturbed examples correctly.
all the three optimization criteria. The peak gains extend                                                              Further, the AT phase benefits from a pre-trained model
up to approx. 30% and are achieved for the case of ResNet18                                                             (using clean data), and, as such, this model is expected to re-
with (relatively) large values of 𝜖 and when allocating a                                                               quire relatively small weight adjustments to defend against
low percentage of epochs to AT (%RAT=30%). Overall, the                                                                 adversarial inputs. Thus, this phase is likely to benefit from
geometric means of the % error reduction (across all models                                                             more conservative settings of HPs such as learning rate and
and settings of 𝜖 and %RAT) is 9%, 5%, and 6% for the Error,                                                            momentum than the initial ST, whose convergence could
AdvError, and MeanError criterion, respectively.                                                                        be accelerated via the use of more aggressive settings for
   Next, Figure 2 provides a different perspective in order to                                                          the same HPs. In fact, we confirmed this fact by analyzing
quantify the benefits achieved by separately optimizing the                                                             the configurations that yield the 10 largest error reductions
HP of the AT phase (vs. using for the AT phase the same HPs                                                             in Figure 2: better quality models used lower learning rates
settings in common with the ST phase), assuming to have                                                                 and batch sizes in the AT phase.
2                                                                                                                          Another factor that can justify the need for using different
    We exclude the cases %RAT={0,100} in this study to focus on scenarios
    that contain both the ST and AT phases.                                                                             HP settings during ST and AT is related to the observation
                       1.0                                                       1.0                                           1.00
                       0.9                                                                                                     0.75
                                                                                 0.9


                                                                         Adv Error
                       0.8
                                                                                                                               0.50


                   Error


                                                                                                                            CDF
                       0.7                                                       0.8                                                                                   FGSM
                                              FGSM, =0.95                                             FGSM, =0.96              0.25                                    PGD5
                       0.6                    PGD5, =0.97                                             PGD5, =0.98                                                      PGD10
                                              PGD10, =0.97                       0.7                  PGD10, =0.97                                                     ST
                       0.5                                                                                                     0.00
                                0.6            0.8                 1.0                  0.7     0.8     0.9           1.0                   25      50      75
                                      Error (PGD20)                                        Adv Error (PGD20)                          Training Time Reduction [%]
                                 (a) Error Corr.                                       (b) Adv. Error Corr.                       (c) CDF time reduction
                              ResNet50/ImageNet 𝜖-2                                    ResNet50/ImageNet 𝜖-2                          ResNet50/ImageNet

    0.8                                               0.8                                              0.8
                                                                                                                                                        0.9
    0.6                                                                                                0.6
                                              Adv Error


                                                                                                                                                Adv Error
                                                                                                                                                        0.8
Error


                                                                                                   Error
                                                      0.6
    0.4                                                                                                0.4
                           FGSM, =0.78                                           FGSM, =0.66                                 FGSM, =0.68                0.7                     FGSM, =0.53
    0.2                    PGD5, =0.81                0.4                        PGD5, =0.83           0.2                   PGD5, =0.91                                        PGD5, =0.91
                           PGD10, =0.85                                          PGD10, =0.87                                PGD10, =0.95               0.6                     PGD10, =0.95
          0.2       0.4 0.6 0.8                                 0.4       0.6      0.8                          0.2      0.4 0.6 0.8                             0.6     0.7 0.8 0.9
                 Error (PGD20)                                     Adv Error (PGD20)                                  Error (PGD20)                                    Adv Error (PGD20)
           (d) Error Corr.                                     (e) Adv. Error Corr.                             (f) Error Corr.                                  (g) Adv. Error Corr.
          ResNet18/SVHN 𝜖-4                                     ResNet18/SVHN 𝜖-4                              ResNet18/SVHN 𝜖-8                                  ResNet18/SVHN 𝜖-8

    0.6                                               0.7                                              0.6                                              0.7
    0.5                                               0.6                                              0.5                                              0.6
                                              Adv Error


                                                                                                                                                Adv Error
    0.4                                                                                                0.4
Error


                                                                                                   Error


                                                      0.5                                                                                               0.5
    0.3                    FGSM, =0.98                                        FGSM, =0.95
                                                                                                       0.3                  FGSM, =0.96                                        FGSM, =0.93
    0.2                    PGD5, =0.99                0.4                     PGD5, =0.99              0.2                  PGD5, =0.98                 0.4                    PGD5, =0.98
                           PGD10, =0.99                                       PGD10, =0.99                                  PGD10, =0.99                                       PGD10, =0.99
    0.1    0.2            0.4           0.6           0.30.3      0.4 0.5 0.6                0.7       0.1      0.2            0.4        0.6           0.30.3         0.4 0.5 0.6         0.7
                 Error (PGD20)                                    Adv Error (PGD20)                                   Error (PGD20)                                    Adv Error (PGD20)
           (h) Error Corr.                                     (i) Adv. Error Corr.                            (j) Error Corr.                                   (k) Adv. Error Corr.
          CNN/Cifar10 𝜖-8                                         CNN/Cifar10 𝜖-8                             CNN/Cifar10 𝜖-12                                     CNN/Cifar10 𝜖-12
          Figure 3: Standard and adversarial error correlation between PGD20 and FGSM, PGD5, and PGD10 varying the bound 𝜖 and
          CDF of the training time reduction obtained using cheaper AT algorithms w.r.t PGD20 (Figure 3c).


that the bound on the admissible perturbation (𝜖) can have                                                 yet informative, way. As discussed in Section 2, PGD is
a deep impact on the model’s performance, by exposing                                                      an iterative method, where each iteration refines the per-
an inherent (and well-known [29]) trade-off: as the bound                                                  turbation with the objective of maximizing loss. Hence, a
increases, the model may become more robust to adversarial                                                 straightforward way to reduce its cost (at the expense of ro-
inputs but at the cost of an increase in the misclassification                                             bustness) is to reduce the number of executed iterations. We
rate of clean inputs. To achieve an optimal trade-off between                                              also note that the computational cost of FGSM is equivalent
robustness and accuracy, it may be necessary to adjust the                                                 to that of a single PGD iteration.
tuning of the HPs used during AT as 𝜖 varies, which in turn                                                   We build on these observations to propose incorporating
implies that the optimal HPs settings used during ST and AT                                                the number of PGD iterations as an additional fidelity di-
can be different. In fact, by analyzing the results obtained on                                            mension in multi-fidelity HPT optimizers, such as taKG [7].
ResNet18/SVHN, for example, we see that the amplitude of                                                   We choose to test the proposed idea with taKG since this
the bound has an impact on the (adversarial) error reduction                                               technique supports the use of an arbitrary number of fidelity
achievable by tuning independently the HPs of two phases of                                                dimensions (e.g., dataset size and number of epochs) and
training: the 90𝑡ℎ percentile of the percentage of clean error                                             determines how to explore the multi-dimensional fidelity
reduction is 50% and 65% using 𝜖=4 and 𝜖=8, respectively                                                   via black-box modeling techniques (see Section 2.2).
(see Fig. 2b).                                                                                                In order to assess the soundness and limitations of the
                                                                                                           proposed approach, we first analyze the correlation of the
3.3. Can cheap AT methods be leveraged to                                                                  standard and adversarial error between HP configurations
                                                                                                           that use PGD with 20 iterations (which we consider as the
     accelerate HPT?                                                                                       maximum-fidelity/full budget) vs. PGD with 10 and 5 it-
So far, we have shown that in adversarial settings the com-                                                erations and FGSM (which, as already mentioned, is com-
plexity of the HPT problem is exacerbated due to the need                                                  putationally equivalent to 1 iteration of PGD). In Figure 3,
for optimizing a larger HP space. In this section, we show                                                 we observe that the correlation varies for different bounds
that, fortunately, AT provides also new opportunities to                                                   on adversarial perturbations across the considered models/-
reduce the HPT cost. Specifically, we propose and evalu-                                                   datasets. We omit the correlation for ResNet50/ImageNet
ate a novel idea: leveraging alternative AT methods, which                                                 using 𝜖=4 since the results are very similar to 𝜖=2. The
impose lower computational costs but provide weaker ro-                                                    scatter plots clearly show the existence of a very strong
bustness guarantees to sample HP configurations in a cheap,                                                correlation (above 95%) for all the considered methods for
                        0.70                                                                     0.50                                                                  0.40
                                               taKG (epochs)                                                        taKG (epochs)                                                         taKG (epochs)
 0.5 Error + 0.5 Adv.Error                     taKG (PGD iter)                                                      taKG (PGD iter)                                                       taKG (PGD iter)


                                                                          0.5 Error + 0.5 Adv.Error


                                                                                                                                                0.5 Error + 0.5 Adv.Error
                        0.68
                                               taKG (epochs & PGD iter)                                             taKG (epochs & PGD iter)                                              taKG (epochs & PGD iter)
                        0.66                   HB (epochs)                                                          HB (epochs)                                                           HB (epochs)
                                               BO-EI                                                                BO-EI                                                                 BO-EI
                                               Random Search                                                        Random Search                                                         Random Search
                        0.64
                                                                                                 0.40                                                                  0.30
                        0.62

                        0.600      25   50     75 100 125 150                                         0    5    10 15 20            25     30                               0     10       20         30         40
                                             Time [hours]                                                         Time [hours]                                                         Time [hours]
                                 (a) ResNet50/ImageNet 𝜖-2                                                (b) ResNet18/SVHN 𝜖-8                                                 (c) CNN/Cifar10 𝜖-12

                                 Figure 4: Average standard and adversarial error using different optimizers.


ResNet50/ImageNet and for all the considered bounds. For                                                                perform high-fidelity evaluations. The evaluation of these
ResNet18/SVHN and CNN/Cifar10, the correlation of PGD                                                                   alternative solutions is performed by exploiting the dataset
5 and 10 iterations remains quite strong (always above 80%                                                              already described in Section 3.1, which specifies the model
and typically above 90%), whereas lower correlations (as low                                                            quality (error and adversarial error) for all possible HPs, 𝜖
as 53%) can be observed for FGSM, especially when consider-                                                             and fidelity settings reported in Tables 1, 2 and 3.
ing adversarial error and larger values of 𝜖 . This is expected,                                                           We define the optimization problem as follows:
as previous works [15, 16] had indeed observed that FGSM
tends to be less robust than PGD when larger 𝜖 values are                                                                min 𝜆·Error(𝑥, 𝑠 = 1)+(1−𝜆)·Adv.Error(𝑥, 𝑠 = 1) (2)
                                                                                                                            𝑥
used (being subject to issues such as catastrophic-overfitting
that lead to a sudden drop of adversarial accuracy). Still,                                                             where 𝑥 is a vector defining the HPs, 𝑠 is a vector that
even for FGSM, the correlation is always above 90% with                                                                 encodes the ratio of budget allocated for each fidelity di-
CNN/Cifar10 and is relatively high (around 70%) also with                                                               mension, and 𝜆 is a weight factor that we set to 0.5 to equally
ResNet18/SVHN for the smaller considered bound (𝜖 = 4).                                                                 balance the standard and adversarial errors. For a fair com-
   We also report, in Figure 3c, the CDF of the training time                                                           parison, when a single fidelity dimension (e.g., epochs) is
reduction using FGSM, PGD with 5 and 10 iterations, and                                                                 used, we set the other fidelity dimension (e.g., PGD itera-
ST w.r.t. PGD20 for ResNet50/ImageNet. The CDFs show                                                                    tions) to its maximum value. We run each optimizer using
that the training time reductions for a given AT method                                                                 20 independent seeds. We set the bound 𝜖 to 2, 8, and 12 to
vary since the ratio of computed adversarial examples de-                                                               optimize ResNet50/ ImageNet, ResNet18/SVHN, and CNN/-
pends on the %RAT and %AE parameters. Overall, the max-                                                                 Cifar10. Based on Figure 3, the three settings correspond
imum (median) training time reduction is approximately                                                                  to scenarios with relatively high, low, and medium correla-
83% (54%), 66% (42%), 47% (28%), and 86% (53%) for FGSM,                                                                tions for the budget dimension defined by PGD iterations,
PGD5, PGD10, and STD, compared to PGD20, which con-                                                                     respectively.
firms that leveraging these “cheap” surrogate methods can                                                                  Figure 4 reports the average and standard deviation of
significantly reduce the cost of testing HP configurations.                                                             the optimization goal (i.e., 0.5 · Error + 0.5 · Adv.Error) as
   Supported by these findings, we evaluate our proposal by                                                             a function of the optimization time for the different HPT
integrating the number of PGD iterations as an additional                                                               optimizers and different models/datasets. The results show
fidelity dimension in taKG [7]. As discussed in Section 2,                                                              that the proposed solution, which adopts PGD iterations
taKG is a HPT method that natively supports the use of mul-                                                             as extra fidelity along with the number of epochs (taKG -
tiple fidelity types, which we refer to as fidelity dimensions,                                                         epochs & PGD iter), clearly outperforms all the alternative
e.g., number of epochs, input size, and dataset size. Based                                                             solutions in ResNet50/ImageNet and ResNet18/SVHN. The
on the results of the previous section, we independently                                                                largest gains can be observed with ResNet18/SVHN (Fig. 4b).
optimize the HPs of the ST and AT phases, which yields a                                                                Here, at the end of the optimization process, the proposed
search space composed of a total of nine HPs (Table 1). The                                                             solution identifies a configuration of the same quality as
following multi-fidelity solutions are compared:                                                                        the ones suggested by the other baselines, namely taKG -
                                                                                                                        epochs, HB, BO-EI, by achieving speed-ups of 2.1×, 3.7×,
                             • taKG (epochs, PGD iter): the proposed solution,                                          and 5.4×, respectively. Using the same metric (time spent
                               which uses taKG as the underlying HPT method                                             to recommend a configuration of the same quality as the
                               and employs as fidelity the number of epochs and                                         ones suggested by the other optimizers at the end of the op-
                               the number of PGD iterations. We discretize these 2                                      timization process) with ResNet50/ImageNet, the proposed
                               dimensions (Table 3).                                                                    solutions achieve slightly smaller, but still significant speed-
                             • taKG (epochs): taKG using as fidelity only the num-                                      ups, namely 1.28×, 1.97× and 2.45× w.r.t. taKG - epochs,
                               ber of epochs.                                                                           HB, BO-EI. Interestingly, with ResNet50/ImageNet the pro-
                             • taKG (PGD iter): taKG using as fidelity only the num-                                    posed solution provides solid speed-ups also during the first
                               ber of PGD iterations.                                                                   stage of the optimization. Specifically, if we analyze the first
                             • HB (epochs): HyperBand [8], a popular (single-                                           half of the optimization process (corresponding to approx.
                               dimensional) multi-fidelity optimizer, which uses                                        83 hours (see Figure 4a)) the proposed solution identifies
                               the number of epochs as fidelity.                                                        configurations of the same quality as taKG - epochs, HB,
                                                                                                                        BO-EI, with speed-ups of 1.7×, 2× and 2.6×, respectively.
  We further compare against BO using EI as acquisition                                                                    With CNN/Cifar10 (Figure 4c), the proposed approach re-
function and Random Search (RS). These optimizers only                                                                  mains the best-performing solution, although with smaller
gains when compared to taKG with epochs. Still, the pro-          References
posed solution can identify configurations with the same
quality as the best alternative (taKG epochs) by saving ap-        [1] I. Goodfellow, J. Shlens, C. Szegedy, Explaining and
prox. 40% of the time (i.e., in 22 hours vs. 32 hours). We             harnessing adversarial examples, in: ICLR, 2015.
argue that the gains with CNN/Cifar10 are relatively lower         [2] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu,
than in the other scenarios considered in Figure 4 since               Towards deep learning models resistant to adversarial
those models are larger and more complex. As such, they                attacks, in: ICLR, 2018.
benefit more from the cost reduction opportunities provided        [3] S. Gupta, P. Dube, A. Verma, Improving the affordabil-
by using a reduced number of PGD iterations.                           ity of robustness training for dnns, in: CVPR, 2020.
  We also observe that the exclusive use of PGD iterations         [4] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. E. Ghaoui, M. Jordan,
with taKG yields worse performance than using solely num-              Theoretically principled trade-off between robustness
ber of epochs. This is not surprising, given that number of            and accuracy, in: ICML, 2019.
epochs is arguably one of the most direct ways of controlling      [5] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential
the cost of configuration sampling and is, indeed, among               model-based optimization for general algorithm con-
the most commonly adopted budgets in multi-fidelity opti-              figuration, in: LION, 2011.
mizers [20, 8, 27]. This result confirms that PGD iterations       [6] K. Swersky, J. Snoek, R. Adams, Multi-task bayesian
represent a valuable mean to accelerate multi-fidelity HPT             optimization, in: NeurIPS, 2013.
optimizers to train robust models and that it complements,         [7] J. Wu, S. Toscano-Palmerin, P. I. Frazier, A. G. Wil-
but does not replace, "conventional" budget settings like              son, Practical multi-fidelity bayesian optimization for
number of epochs or dataset size.                                      hyperparameter tuning, in: UAI, 2019.
                                                                   [8] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Tal-
                                                                       walkar, Hyperband: A novel bandit-based approach
4. Conclusions and Future work                                         to hyperparameter optimization, Journal of Machine
                                                                       Learning Research (2018).
This paper focused on the problem of HPT for robust mod-           [9] E. Wong, L. Rice, Z. Kolter, Fast is better than free:
els. By means of an extensive experimental study, we first             Revisiting adversarial training, in: ICLR, 2020.
quantified the relevance of independently tuning the HPs          [10] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. P. Dickerson,
used during standard and adversarial training. We then                 C. Studer, L. S. Davis, G. Taylor, T. Goldstein, Adver-
proposed and evaluated a novel fidelity dimension that be-             sarial training for free!, in: NeurIPS, 2019.
comes available in the context of AT. Specifically, we have       [11] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard, Deep-
shown that cheaper AT methods can be used to obtain inex-              fool: A simple and accurate method to fool deep neural
pensive estimations of the quality achievable via expensive            networks, in: CVPR, 2016.
state-of-the-art AT methods and that this information can be      [12] J. H. Metzen, T. Genewein, V. Fischer, B. Bischoff, On
effectively exploited to accelerate HPT. We extended taKG,             detecting adversarial perturbations, in: ICLR, 2017.
a state-of-the-art HPT method, by incorporating the PGD           [13] C. Guo, M. Rana, M. Cisse, L. van der Maaten, Coun-
iterations as an additional fidelity dimension (along with             tering adversarial images using input transformations,
the number of epochs) and achieved cost reductions by up               in: ICLR, 2018.
to 2.1×.                                                          [14] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B.
   It is worth noting that the idea of employing “cheap” AT            Celik, A. Swami, Practical black-box attacks against
methods as proxies to estimate the quality of HP config-               machine learning, in: ASIA CCS, 2017.
urations with more robust/expensive methods is generic,           [15] M. Andriushchenko, N. Flammarion, Understanding
in the sense that it can be applied, at least theoretically, to        and improving fast adversarial training, in: NeurIPS,
any multi-fidelity optimizer. As part of our future work, we           2020.
plan to integrate this novel approach in a new HPT frame-         [16] L. Rice, E. Wong, J. Z. Kolter, Overfitting in adversari-
work, specifically designed to cope with adversarially robust          ally robust deep learning, in: ICML, 2020.
models.                                                           [17] T. Bai, J. Luo, J. Zhao, B. Wen, Q. Wang, Recent ad-
                                                                       vances in adversarial training for adversarial robust-
Acknowledgments                                                        ness, in: IJCAI, 2021.
                                                                  [18] E. Strubell, A. Ganesh, A. McCallum, Energy and
This work was supported by the Fundação para a Ciência e               policy considerations for deep learning in NLP, in:
a Tecnología (Portuguese Foundation for Science and Tech-              ACL, 2019.
nology) through the Carnegie Mellon Portugal Program              [19] P. Mendes, M. Casimiro, P. Romano, D. Garlan, Trim-
under grant SFRH/BD/151470/2021 via projects with refer-               tuner: Efficient optimization of machine learning jobs
ence UIDB/50021/2020 and C645008882-00000055.PRR, by                   in the cloud via sub-sampling, in: MASCOTS, 2020.
the NSA grant H98230-23-C-0274, and by the Advanced               [20] A. Klein, S. Falkner, S. Bartels, et al., Fast bayesian
Cyberinfrastructure Coordination Ecosystem: Services &                 optimization of machine learning hyperparameters on
Support (ACCESS) program, where we used the Bridges-2                  large datasets, in: AISTATS, 2017.
GPU and Ocean resources at the Pittsburgh Supercomputing          [21] J. Mockus, V. Tiesis, A. Zilinskas, The application
Center through allocation CIS220073, which is supported                of bayesian methods for seeking the extremum, in:
by National Science Foundation grants #2138259, #2138286,              Toward Global Optimization, 1978.
#2138307, #2137603, and #2138296.                                 [22] M. Casimiro, D. Didona, P. Romano, et al., Lynceus:
                                                                       Cost-efficient tuning and provisioning of data analytic
                                                                       jobs, in: ICDCS, 2020.
                                                                  [23] K. Swersky, J. Snoek, R. Adams, Freeze-thaw bayesian
                                                                       optimization, ArXiv:1406.3896 (2014).
[24] K. Jamieson, A. Talwalkar, Non-stochastic best arm
     identification and hyperparameter optimization, in:
     AISTATS, 2016.
[25] S. Falkner, A. Klein, F. Hutter, BOHB: Robust and
     efficient hyperparameter optimization at scale, in:
     ICML, volume 80, 2018.
[26] N. H. Awad, N. Mallik, F. Hutter, DEHB: evolutionary
     hyberband for scalable, robust and efficient hyperpa-
     rameter optimization, in: IJCAI, 2021.
[27] P. Mendes, M. Casimiro, P. Romano, D. Garlan, Hyper-
     jump: Accelerating hyperband via risk modelling, in:
     AAAI, 2023.
[28] E. Duesterwald, A. Murthi, G. Venkataraman, et al., Ex-
     ploring the hyperparameter landscape of adversarial
     robustness, ArXiv abs/1905.03837 (2019).
[29] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner,
     A. Madry, Robustness may be at odds with accuracy,
     in: ICLR, 2019.