=Paper=
{{Paper
|id=Vol-3856/paper_1
|storemode=property
|title=Hyper-parameter Tuning for Adversarially Robust Models
|pdfUrl=https://ceur-ws.org/Vol-3856/paper_1.pdf
|volume=Vol-3856
|authors=Pedro Mendes,Paolo Romano,David Garlan
|dblpUrl=https://dblp.org/rec/conf/aisafety/Mendes0G24
}}
==Hyper-parameter Tuning for Adversarially Robust Models==
Pedro Mendes¹,²,*, Paolo Romano² and David Garlan¹
¹ Software and Societal Systems Department, Carnegie Mellon University
² INESC-ID and Instituto Superior Técnico, Universidade de Lisboa
Abstract
This work focuses on the problem of hyper-parameter tuning (HPT) for robust (i.e., adversarially trained) models, shedding light on the
new challenges and opportunities arising during the HPT process for robust models. To this end, we conduct an extensive experimental
study based on three popular deep models and explore exhaustively nine (discretized) hyper-parameters (HPs), two fidelity dimensions,
and two attack bounds, for a total of 19208 configurations (corresponding to 50 thousand GPU hours).
Through this study, we show that the complexity of the HPT problem is further exacerbated in adversarial settings due to the need
to independently tune the HPs used during standard and adversarial training: succeeding in doing so (i.e., adopting different HP settings
in both phases) can lead to a reduction of up to 80% and 43% of the error for clean and adversarial inputs, respectively. We also identify
new opportunities to reduce the cost of HPT for robust models. Specifically, we propose to leverage cheap adversarial training methods
to obtain inexpensive, yet highly correlated, estimations of the quality achievable using more robust/expensive state-of-the-art methods.
We show that, by exploiting this novel idea in conjunction with a recent multi-fidelity optimizer (taKG), the efficiency of the HPT
process can be enhanced by up to 2.1×.
1. Introduction

Adversarial attacks [1] aim at causing model misclassifications by introducing small perturbations in the input. White-box methods like Projected Gradient Descent (PGD) [2] have been shown to be extremely effective in synthesizing perturbations that are small enough to be hardly noticeable by humans, while severely hindering the model's performance. Fortunately, models can be hardened against this type of attack via a so-called "Adversarial Training" (AT) process. During AT, which typically takes place after an initial standard training (ST) phase [3], adversarial examples are synthesized and added (with their intended label) to the training set. Recently, several AT methods have been proposed [1, 2, 4] that explore different trade-offs between robustness and computational efficiency. Unfortunately, the most robust AT methods impose significant overhead (up to 7× in the models tested in this work) with respect to standard training.

These costs are further amplified when considering another crucial phase of model building, namely hyper-parameter tuning (HPT). In fact, HPT methods require training a model multiple times using different hyper-parameter (HP) configurations. Consequently, the overheads introduced by AT also lead to an increase in the cost of HPT. Further, AT and ST share common HPs, which raises the question of whether AT should simply employ the same HP settings used during ST, or whether a new HPT process should be executed to select different HPs for the AT phase. In the latter case, the dimensionality of the HP space to be optimized grows significantly, exacerbating the HPT cost.

Hence, this work focuses on the problem of HPT for adversarially trained models with the twofold goal of i) shedding light on the new challenges (i.e., additional costs) that emerge when performing HPT for robust models, and ii) proposing novel techniques to reduce these costs by exploiting opportunities that emerge in this context.

We pursue the first goal via an extensive experimental study based on 3 popular models/datasets widely used to evaluate AT methods (ResNet50/ImageNet, ResNet18/SVHN, and CNN/Cifar10). In this study, we discretize and exhaustively explore the HP space composed of up to nine HPs, which we evaluated considering two "fidelity dimensions" [5, 6] for the training process and two attack strengths. Overall, we test a total of 19208 configurations, and we make this dataset publicly accessible in the hope that it will aid the design of future HPT methods specialized for AT.

Leveraging this data, we investigate a key design choice for the HPT process of robust models, namely the decision of whether to adopt the same vs. different HP settings during AT and ST (for the common HPs in the 2 phases). To this end, we focus on 3 key HPs of deep models: learning rate, momentum, and batch size. Our empirical study shows that allowing the use of different HP settings during ST and AT can bring substantial benefits in terms of model quality, reducing the standard and adversarial error by up to 80% and 43%, respectively.

Further, our study demonstrates that, while the cost and complexity of HPT are heightened in adversarial settings, unique opportunities can be exploited in the context of robust models to effectively mitigate these costs. Specifically, we show that it is possible to leverage cheap AT methods to obtain inexpensive, yet highly correlated, estimations of the quality achievable using more robust/expensive methods (PGD [2]). Besides studying the trade-offs between cost reduction and HP quality correlation with different AT methods, we extend a recent multi-fidelity optimizer (taKG [7]) to incorporate the choice of the AT method as an additional dimension to reduce the HPT cost. We evaluate the proposed method using our dataset and show that incorporating the choice of the AT method as an additional fidelity dimension in taKG leads to up to 2.1× speed-ups, with gains that extend up to 3.7× w.r.t. popular HPT methods such as HyperBand [8]. These reductions in the optimization time not only translate to significant reductions in energy consumption during training but also result in corresponding decreases in pollutant emissions.

The IJCAI-2024 AISafety Workshop.
* Corresponding author.
pgmendes@andrew.cmu.edu (P. Mendes); romano@inesc-id.pt (P. Romano); dg4d@andrew.cmu.edu (D. Garlan)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Background and Related Work

In this section, we first provide background information on AT techniques (Section 2.1) and then discuss related works in the area of HPT (Section 2.2).
2.1. Adversarial Training

Adversarial attacks aim at introducing small perturbations to input data, often small enough to be hardly perceivable by humans, with the goal of leading the model to generate an erroneous output. These attacks reveal vulnerabilities of current model training techniques, underscoring the need for developing robust models in different domains. Thus, several works [1, 2, 9, 10, 11] developed new techniques to mitigate these vulnerabilities and defend against adversarial attacks, tackling them in different and often orthogonal or complementary ways, such as adversarial training [1, 2, 9, 10], detection of adversarial attacks [12], or pre-processing techniques to filter adversarial perturbations [13]. Next, we review existing AT approaches, which represent the focus of this work.

AT aims at improving the robustness of machine learning (ML) models by i) first generating adversarially perturbed inputs, and ii) feeding these adversarial examples, along with the correct corresponding label, to the model training phase. More formally, this process can be described as follows. Unlike ST, which determines the model's parameters 𝜃 by minimizing the loss function between the model's prediction for the clean input 𝑓_𝜃(𝑥) and the original class 𝑦, i.e., min_𝜃 E_{𝑥,𝑦∼𝐷}[𝐿(𝑓_𝜃(𝑥), 𝑦)], AT first computes a perturbation 𝛿, smaller than a maximum predefined bound 𝜖, which will mislead the current model, and then trains the model with that perturbed input. This approach leads to the formulation of the following optimization problem: min_𝜃 E_{𝑥,𝑦∼𝐷}[max_{‖𝛿‖<𝜖} 𝐿(𝑓_𝜃(𝑥 + 𝛿), 𝑦)]. The model's robustness depends on the bound 𝜖 used to produce the adversarial examples and on the strength of the method used to compute those examples.

Several methods have been developed to solve this optimization problem (or variants thereof, e.g., [4]), and the resulting techniques are based on different assumptions about, e.g., the availability of the model to the attacker (i.e., white-box [2] or black-box [14]), whether the underlying model is differentiable [1] or not [14], and the existence of bounds on the attacker's capabilities [11]. Among these techniques, two of the most popular ones are the Fast Gradient Sign Method (FGSM) [1] and Projected Gradient Descent (PGD) [2]. Both techniques hypothesize that attackers can inject bounded perturbations and have access to a differentiable model. FGSM, and its later variants [9, 10, 15], rely on gradient descent to compute small perturbations in an efficient way. More in detail, for a given clean input 𝑥, this method adjusts the perturbation 𝛿 by the magnitude of the bound in the direction of the gradient of the loss function, i.e., 𝛿 = 𝜖 · sign(∇_𝛿 𝐿(𝑓_𝜃(𝑥 + 𝛿), 𝑦)). PGD iteratively generates adversarial examples by taking small steps in the direction of the gradient of the loss function and projecting the perturbed inputs back onto the 𝜖-ball around the original input, i.e., Repeat: 𝛿 = 𝒫(𝛿 + 𝛼∇_𝛿 𝐿(𝑓_𝜃(𝑥 + 𝛿), 𝑦)), where 𝒫 is the projection onto the ball of radius 𝜖 and 𝛼 can be seen as analogous to the learning rate in gradient-descent-based training. Due to its iterative nature, PGD incurs a notably higher computational cost than FGSM [10, 9], but it is also regarded as one of the strongest methods to generate adversarial examples. In fact, prior work has shown that PGD attacks can fool robust models trained via FGSM and that PGD-based AT produces models robust to larger perturbations [2] and with higher adversarial accuracy. FGSM is also known to suffer from catastrophic overfitting [16], in which the model's adversarial accuracy collapses after some training iterations. Henceforth, we will focus on FGSM and PGD, which, as mentioned, are among the most widely used and effective methods for generating adversarial examples [15]. In fact, these methods have been extensively studied and compared in the literature [15, 16, 10, 9, 17] and represent a natural starting point for investigating the trade-offs related to HPT that arise in the context of AT.

Independently of the technique used to perform AT, a relevant question, first investigated by Gupta et al. [3], is whether to use an initial ST phase before performing AT, or whether to use exclusively AT. This study showed that using an initial ST phase normally helps to reduce the computational cost while yielding models of comparable quality. This result motivates one of the key questions that we aim at answering in this work, namely whether the ST and AT phases should share the settings of their common HPs.

2.2. Hyper-parameter Tuning

HPT is a critical phase to optimize the performance of ML models. As the scale and complexity of models increase, along with the number of HPs that can possibly be tuned in modern ML methods [18], HPT is a notoriously time-consuming process, whose cost can become prohibitive due to the need to repetitively train complex models on large datasets.

To address this issue, a large spectrum of the literature on HPT relies on Bayesian Optimization (BO) [19, 20, 21, 6, 7, 22, 5, 23]. BO employs modeling techniques (e.g., Gaussian Processes) to guide the optimization process and leverages the model's knowledge and uncertainty (via a so-called acquisition function) to select which configurations to test. Although the use of BO can help to increase the convergence speed of the optimization process, the cost of testing multiple HP configurations can quickly become prohibitive, especially when considering complex models trained over large datasets.

To tackle this problem, multi-fidelity techniques [5, 6, 20, 19, 23, 7] exploit cheap low-fidelity evaluations (e.g., training with a fraction of the available data or using a reduced number of training epochs) and extrapolate this knowledge to recommend high-fidelity configurations. This allows for reducing the cost of testing HP configurations, while still providing useful information to guide the search for the optimal high-fidelity configuration(s) [5, 19]. HyperBand [8] is a popular multi-fidelity and model-free approach that promotes good-quality configurations to higher budgets and discards the poor-quality ones using a simple, yet effective, successive halving approach [24]. Several approaches extended HyperBand using models to identify good configurations [25, 26] or to shortcut the number of configurations to test [27]. While these works adopt a single budget type (e.g., training time or dataset size), other approaches, such as taKG [7], make joint usage of multiple budget/fidelity dimensions during the optimization process to have additional flexibility to reduce the optimization cost. taKG selects the next configuration and the different budgets via model-based predictive techniques that estimate the cost incurred and the information gained by sampling a given configuration for a given setting of the available fidelity dimensions.

In the area of HPT for robust models, the work that is most closely related to ours is the study by Duesterwald et al. [28]. This work has investigated empirically the relations between the bounds on adversarial perturbations (𝜖) and the model's accuracy/robustness. Further, it showed that the ratio of clean/adversarial examples included in a batch (during AT) can have a positive impact on the model's quality and represents, as such, a key HP. Based on this finding, we incorporate this HP among the ones tested in our study. Differently from that work, we focus on i) quantifying the benefits of using different HPs during ST and AT, and ii) exploiting the correlation between cheaper AT methods (such as FGSM) and more expensive ones to enhance the efficiency of multi-fidelity HPT algorithms.

Table 1. Hyper-parameters considered:
- Learning Rate (ST and AT): {0.1, 0.01}
- Momentum (ST and AT): {0.9, 0.99} for ResNet50; {0, 0.9} otherwise
- Batch Size (ST and AT): {256, 512} for ResNet50; {128, 256} otherwise
- 𝛼 (PGD learning rate): {10⁻², 10⁻³}
- % resources (time or epochs) for AT (%RAT): {0, 30, 50, 70, 100}
- % adversarial examples in each batch (%AE): {30, 50, 70, 100}
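To make the FGSM and PGD formulations of Section 2.1 concrete, the following is a minimal sketch in plain Python on a toy logistic model with an analytic input gradient, using an ℓ∞ ball for the projection (one common instantiation of ‖𝛿‖ < 𝜖). The function names are illustrative; this is not the paper's PyTorch implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # binary cross-entropy of a logistic model f_theta(x) = sigmoid(w.x + b)
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def input_grad(w, b, x, y):
    # gradient of the loss w.r.t. the INPUT x (not the weights): (p - y) * w
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def fgsm(w, b, x, y, eps):
    # FGSM: delta = eps * sign(grad_x L), a single gradient evaluation
    g = input_grad(w, b, x, y)
    return [eps * (1.0 if gi > 0 else -1.0 if gi < 0 else 0.0) for gi in g]

def pgd(w, b, x, y, eps, alpha, iters):
    # PGD: repeat delta = P(delta + alpha * grad_x L), where P clips back
    # onto the l-infinity ball of radius eps (one common choice of norm)
    delta = [0.0] * len(x)
    for _ in range(iters):
        g = input_grad(w, b, [xi + di for xi, di in zip(x, delta)], y)
        delta = [max(-eps, min(eps, di + alpha * gi)) for di, gi in zip(delta, g)]
    return delta
```

On a fixed model, both attacks increase the loss of the attacked input; PGD simply refines the same update iteratively, which is why reducing its iteration count trades robustness for cost (the observation exploited in Section 3.3).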
3. HPT for Robust Models: Challenges and Opportunities

As mentioned, this work aims at shedding light on the challenges and opportunities that arise when performing HPT for adversarially robust models. More precisely, we seek to answer the following questions:

1. Should the HPs that are common to the AT and ST phases be tuned independently? More in detail, we aim at quantifying to what extent the model's quality is affected if one uses the same vs. different HP settings during the ST and AT phases (see Section 3.2).

2. Is it possible to reduce the cost of HPT by testing HP settings using cheaper (but less robust) AT methods? How correlated is the performance of alternative AT approaches, and what factors (e.g., the perturbation bound or the cost of the techniques) impact such correlation? To what extent can this approach enhance the efficiency of the HPT process? (see Section 3.3)

In order to answer the above questions, we have collected (and made publicly available) a dataset, which we obtained by varying some of the most impactful HPs for three popular neural models/datasets and measuring the resulting model quality. We provide a detailed description of the dataset in Section 3.1.

Table 2. Bounds 𝜖 per benchmark:
- ResNet50/ImageNet: {2, 4}
- ResNet18/SVHN: {4, 8}
- CNN/Cifar10: {8, 12}

Table 3. Fidelities considered:
- PGD iterations: {1 (FGSM), 5, 10, 20}
- Epochs: {1, 2, 4, 8, 16}

3.1. Experimental Setup

We base our study on three widely-used models and datasets (ResNet50/ImageNet, ResNet18/SVHN, and CNN/Cifar10). All the models were trained using 1 worker, except for SVHN, for which two workers were used. We used Nvidia Tesla V100 GPUs to train ResNet50, and Nvidia GeForce RTX 2080 GPUs for the remaining models. All models and training procedures were implemented in Python3 via the Pytorch framework.

To evaluate the models, we considered up to nine different HPs, as summarized in Table 1. The first three HPs in this table apply to both the ST and AT phases. 𝛼 is an HP that applies exclusively to AT, whereas the last two HPs (%RAT and %AE) regulate the balance between ST and AT (see Section 2). Specifically, %RAT defines the amount of computational resources allocated to the AT phase, and %AE indicates the ratio of adversarial inputs contained in the batches during the AT phase (as suggested by Duesterwald et al. [28]). We further consider several settings of the bound 𝜖 on the attacker power (see Table 2). Note that the reported values of 𝜖 are normalized by 255. Finally, we also consider two fidelity dimensions, namely the number of training epochs and the number of PGD iterations (see Section 3.3). The model's quality is evaluated using the standard error (i.e., error using clean inputs) and the adversarial error (i.e., error using adversarially perturbed inputs).

For each model, we exhaustively explored the (discretized) space defined by the HPs, bound 𝜖, and fidelities, which yields a search space encompassing a total of 19208 configurations. Building this dataset required around fifty thousand GPU hours, and we have made it publicly accessible in the hope that it will aid the design of future HPT methods specialized for AT. Additional information to ensure the reproducibility of the results is provided in the public repository¹.

¹ https://github.com/pedrogbmendes/HPT_advTrain

3.2. Should the HPs of ST and AT be tuned independently?

This section aims at answering the following question: given that the ST and AT phases share several HPs (e.g., batch size, learning rate, and momentum in the models considered in this study), how relevant is it to use different settings for these HPs in the two training phases? Note that, if we assume the existence of 𝐶 HPs in common between ST and AT, then enabling the use of different values for these HPs in each training stage causes a growth of the dimensionality of the HP space from 𝐶 to 2𝐶 (not accounting for any HPs not in common) and, ultimately, a significant increase in the cost/complexity of the HPT problem. Specifically, for the scenarios considered in this study, the cardinality of the HP space grows from 320 to 2560 distinct configurations. Hence, we argue that such a cost is practically justified only if it is counterbalanced by relevant gains in terms of error reduction. To answer this question, we trained the models during 16 epochs and used different settings for the common HPs of ST and AT (Table 1).
(a) ResNet50/ImageNet (b) ResNet18/SVHN (c) CNN/Cifar10
Figure 1: Reduction of the mean (in black), standard (in blue), and adversarial (in red) error of the optimal configuration if the
same or different hyper-parameters are used for the 2 phases of training using different scenarios and benchmarks.
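As a sanity check on the search-space sizes quoted in Section 3.2 (320 vs. 2560 configurations), the counts follow directly from the value-set sizes in Table 1; the split below into shared and AT-only HPs is inferred from that table:

```python
from math import prod

# Number of values per HP, taken from Table 1 (the ResNet50-specific and
# other value sets have the same sizes): learning rate, momentum, and
# batch size are shared by ST and AT; alpha, %RAT, and %AE are AT-only.
shared = {"learning_rate": 2, "momentum": 2, "batch_size": 2}
at_only = {"alpha": 2, "rat_pct": 5, "ae_pct": 4}

# Shared HPs set once for both phases:
same_hps = prod(shared.values()) * prod(at_only.values())
# Each shared HP set independently in ST and AT (C -> 2C dimensions):
diff_hps = prod(shared.values()) ** 2 * prod(at_only.values())
```

This reproduces the reported growth from 320 to 2560 configurations, i.e., a factor of 2^𝐶 = 8 for the 𝐶 = 3 common HPs.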
1.0 1.00 1.00
0.75 0.75
0.5
CDF
CDF
CDF
0.50 0.50
=2 =4 =8
=4 0.25 =8 0.25 =12
0 10 20 0 25 50 75 0 20 40
% (Adv.) Error Reduction % (Adv.) Error Reduction % (Adv.) Error Reduction
(a) ResNet50/ImageNet (b) ResNet18/SVHN (c) CNN/Cifar10
Figure 2: Cumulative distribution functions (CDFs) of the mean (in black), standard (in blue), and adversarial (in red) error
reduction when using different HP settings for ST and AT w.r.t. the case in which common HPs are used in both phases. We
use dashed and continuous lines to refer to different values of 𝜖.
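The axes of Figures 1 and 2 report the relative error reduction defined in Eq. (1) of Section 3.2; as a one-line helper (hypothetical function name):

```python
def pct_error_reduction(error_same_hps, error_diff_hps):
    # Eq. (1): relative reduction, in percent, of the optimal error when the
    # common ST/AT HPs may take different settings in the two training phases
    return (error_same_hps - error_diff_hps) / error_same_hps * 100.0
```

For instance, optimal errors of 0.5 with shared HPs and 0.1 with independently tuned HPs correspond to an 80% reduction, the peak standard-error gain reported in this study.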
We consider three different settings (30%, 50%, and 70%)² for the relative amount of resources (epochs) available for AT (%RAT), as well as different settings of the perturbation bound 𝜖.

(² We exclude the cases %RAT={0,100} in this study to focus on scenarios that contain both the ST and AT phases.)

We consider that the model's HPs can be optimized according to three criteria: i) clean data error (Error), ii) adversarial error (AdvError), and iii) the average of clean and adversarial error (MeanError). For each of these three optimization criteria, %RAT, and bound 𝜖, we report in Figure 1 the percentage of reduction of the target optimization metric for the optimal HP configuration obtained by allowing for (but not imposing) different settings of the HPs in common to the ST and AT phases, with respect to the optimal HP configuration if one opts for using the same settings for the common HPs in both training phases, namely:

%Error Reduction = (Error_same HPs − Error_diff HPs) / Error_same HPs × 100    (1)

The results show that adopting different HP settings in the two phases can lead to significant error reductions for all three optimization criteria. The peak gains extend up to approx. 30% and are achieved for the case of ResNet18 with (relatively) large values of 𝜖 and when allocating a low percentage of epochs to AT (%RAT=30%). Overall, the geometric mean of the % error reduction (across all models and settings of 𝜖 and %RAT) is 9%, 5%, and 6% for the Error, AdvError, and MeanError criteria, respectively.

Next, Figure 2 provides a different perspective in order to quantify the benefits achieved by separately optimizing the HPs of the AT phase (vs. using for the AT phase the same HP settings in common with the ST phase), assuming to have executed an initial ST phase using any of the possible HP settings. Specifically, we report the Cumulative Distribution Function (CDF) of the percentage of error reduction (for each of the three target optimization metrics) when allowing the use of the same or different common HP settings for the two phases, while varying the remaining (non-common) HPs, namely %RAT, %AE, and 𝛼. These results allow us to highlight that, by independently tuning the HPs of the two training stages, the model's quality is enhanced by up to approx. 80%, 43%, and 56% when minimizing the standard, adversarial, or mean error, respectively.

Overall, these results may be justified by considering that the optimization objectives and constraints of the ST and AT phases are different, hence benefiting from different HP settings. During ST, the training procedure focuses on maximizing standard accuracy, and the model's goal is to learn representations that generalize well to new data. In contrast, AT seeks to increase robustness against adversarial attacks, and the model needs to learn to correctly differentiate between clean and perturbed examples. Further, the AT phase benefits from a pre-trained model (using clean data), and, as such, this model is expected to require relatively small weight adjustments to defend against adversarial inputs. Thus, this phase is likely to benefit from more conservative settings of HPs such as learning rate and momentum than the initial ST, whose convergence could be accelerated via the use of more aggressive settings for the same HPs. In fact, we confirmed this by analyzing the configurations that yield the 10 largest error reductions in Figure 2: better-quality models used lower learning rates and batch sizes in the AT phase.
Figure 3: Standard and adversarial error correlation between PGD20 and FGSM, PGD5, and PGD10, varying the bound 𝜖, and CDF of the training time reduction obtained using cheaper AT algorithms w.r.t. PGD20 (Figure 3c).
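The per-scatter-plot coefficients in Figure 3 can be obtained with a plain Pearson correlation over paired per-configuration errors (cheap attack vs. PGD20). A minimal sketch follows; the exact correlation statistic used for the figure is not stated in this excerpt, Pearson being one common choice:

```python
import math

def pearson(xs, ys):
    # Pearson correlation between paired error measurements, one pair per
    # HP configuration: xs = errors under a cheap attack (e.g., FGSM),
    # ys = errors under the full-budget attack (PGD20)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Coefficients close to 1 (e.g., the 0.95+ values observed for ResNet50/ImageNet) indicate that ranking configurations by their cheap-attack error is a reliable proxy for their PGD20 error, which is what makes low-fidelity evaluations useful for HPT.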
Another factor that can justify the need for using different HP settings during ST and AT is related to the observation that the bound on the admissible perturbation (𝜖) can have a deep impact on the model's performance, by exposing an inherent (and well-known [29]) trade-off: as the bound increases, the model may become more robust to adversarial inputs, but at the cost of an increase in the misclassification rate of clean inputs. To achieve an optimal trade-off between robustness and accuracy, it may be necessary to adjust the tuning of the HPs used during AT as 𝜖 varies, which in turn implies that the optimal HP settings used during ST and AT can be different. In fact, by analyzing the results obtained on ResNet18/SVHN, for example, we see that the amplitude of the bound has an impact on the (adversarial) error reduction achievable by independently tuning the HPs of the two training phases: the 90th percentile of the percentage of clean error reduction is 50% and 65% using 𝜖=4 and 𝜖=8, respectively (see Fig. 2b).

3.3. Can cheap AT methods be leveraged to accelerate HPT?

So far, we have shown that in adversarial settings the complexity of the HPT problem is exacerbated by the need to optimize a larger HP space. In this section, we show that, fortunately, AT also provides new opportunities to reduce the HPT cost. Specifically, we propose and evaluate a novel idea: leveraging alternative AT methods, which impose lower computational costs but provide weaker robustness guarantees, to sample HP configurations in a cheap, yet informative, way. As discussed in Section 2, PGD is an iterative method in which each iteration refines the perturbation with the objective of maximizing the loss. Hence, a straightforward way to reduce its cost (at the expense of robustness) is to reduce the number of executed iterations. We also note that the computational cost of FGSM is equivalent to that of a single PGD iteration.

We build on these observations to propose incorporating the number of PGD iterations as an additional fidelity dimension in multi-fidelity HPT optimizers, such as taKG [7]. We choose to test the proposed idea with taKG since this technique supports the use of an arbitrary number of fidelity dimensions (e.g., dataset size and number of epochs) and determines how to explore the multi-dimensional fidelity space via black-box modeling techniques (see Section 2.2).

In order to assess the soundness and limitations of the proposed approach, we first analyze the correlation of the standard and adversarial error between HP configurations that use PGD with 20 iterations (which we consider as the maximum-fidelity/full budget) vs. PGD with 10 and 5 iterations and FGSM (which, as already mentioned, is computationally equivalent to 1 iteration of PGD). In Figure 3, we observe that the correlation varies for different bounds on adversarial perturbations across the considered models/datasets. We omit the correlation for ResNet50/ImageNet using 𝜖=4 since the results are very similar to 𝜖=2. The scatter plots clearly show the existence of a very strong correlation (above 95%) for all the considered methods for ResNet50/ImageNet and for all the considered bounds.
(a) ResNet50/ImageNet 𝜖-2 (b) ResNet18/SVHN 𝜖-8 (c) CNN/Cifar10 𝜖-12
Figure 4: Average standard and adversarial error using different optimizers.
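The optimization goal plotted in Figure 4 is the equally weighted combination of standard and adversarial error, i.e., Eq. (2) with 𝜆 = 0.5. Selecting the best full-budget configuration from a table of measured errors could look like this (the data layout and values below are hypothetical):

```python
def objective(error, adv_error, lam=0.5):
    # Eq. (2) at full budget (s = 1): lam * Error + (1 - lam) * Adv.Error
    return lam * error + (1.0 - lam) * adv_error

# Hypothetical per-configuration results: (standard error, adversarial error)
results = {
    "cfg_a": (0.30, 0.50),
    "cfg_b": (0.25, 0.60),
    "cfg_c": (0.40, 0.35),
}
best = min(results, key=lambda cfg: objective(*results[cfg]))
```

Here `cfg_c` wins despite its higher standard error, illustrating how 𝜆 trades clean accuracy against robustness.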
For ResNet18/SVHN and CNN/Cifar10, the correlation of PGD with 5 and 10 iterations remains quite strong (always above 80% and typically above 90%), whereas lower correlations (as low as 53%) can be observed for FGSM, especially when considering the adversarial error and larger values of 𝜖. This is expected, as previous works [15, 16] had indeed observed that FGSM tends to be less robust than PGD when larger 𝜖 values are used (being subject to issues such as catastrophic overfitting, which leads to a sudden drop of adversarial accuracy). Still, even for FGSM, the correlation is always above 90% with CNN/Cifar10 and is relatively high (around 70%) also with ResNet18/SVHN for the smaller considered bound (𝜖 = 4).

We also report, in Figure 3c, the CDF of the training time reduction using FGSM, PGD with 5 and 10 iterations, and ST w.r.t. PGD20 for ResNet50/ImageNet. The CDFs show that the training time reductions for a given AT method vary, since the ratio of computed adversarial examples depends on the %RAT and %AE parameters. Overall, the maximum (median) training time reduction is approximately 83% (54%), 66% (42%), 47% (28%), and 86% (53%) for FGSM, PGD5, PGD10, and ST, respectively, compared to PGD20, which confirms that leveraging these "cheap" surrogate methods can significantly reduce the cost of testing HP configurations.

Supported by these findings, we evaluate our proposal by integrating the number of PGD iterations as an additional fidelity dimension in taKG [7]. As discussed in Section 2, taKG is an HPT method that natively supports the use of multiple fidelity types, which we refer to as fidelity dimensions, e.g., number of epochs, input size, and dataset size. Based on the results of the previous section, we independently optimize the HPs of the ST and AT phases, which yields a search space composed of a total of nine HPs (Table 1). The following multi-fidelity solutions are compared:

- taKG (epochs & PGD iter): the proposed solution, which uses taKG as the underlying HPT method and employs as fidelities the number of epochs and the number of PGD iterations. We discretize these 2 dimensions (Table 3).
- taKG (epochs): taKG using as fidelity only the number of epochs.

[…] perform high-fidelity evaluations. The evaluation of these alternative solutions is performed by exploiting the dataset already described in Section 3.1, which specifies the model quality (error and adversarial error) for all possible HPs, 𝜖, and fidelity settings reported in Tables 1, 2, and 3.

We define the optimization problem as follows:

min_𝑥 𝜆·Error(𝑥, 𝑠 = 1) + (1−𝜆)·Adv.Error(𝑥, 𝑠 = 1)    (2)

where 𝑥 is a vector defining the HPs, 𝑠 is a vector that encodes the ratio of budget allocated to each fidelity dimension, and 𝜆 is a weight factor that we set to 0.5 to equally balance the standard and adversarial errors. For a fair comparison, when a single fidelity dimension (e.g., epochs) is used, we set the other fidelity dimension (e.g., PGD iterations) to its maximum value. We run each optimizer using 20 independent seeds. We set the bound 𝜖 to 2, 8, and 12 to optimize ResNet50/ImageNet, ResNet18/SVHN, and CNN/Cifar10, respectively. Based on Figure 3, these three settings correspond to scenarios with relatively high, low, and medium correlations for the budget dimension defined by PGD iterations, respectively.

Figure 4 reports the average and standard deviation of the optimization goal (i.e., 0.5·Error + 0.5·Adv.Error) as a function of the optimization time for the different HPT optimizers and different models/datasets. The results show that the proposed solution, which adopts PGD iterations as an extra fidelity along with the number of epochs (taKG - epochs & PGD iter), clearly outperforms all the alternative solutions in ResNet50/ImageNet and ResNet18/SVHN. The largest gains can be observed with ResNet18/SVHN (Fig. 4b). Here, at the end of the optimization process, the proposed solution identifies a configuration of the same quality as the ones suggested by the other baselines, namely taKG - epochs, HB, and BO-EI, while achieving speed-ups of 2.1×, 3.7×, and 5.4×, respectively. Using the same metric (time spent to recommend a configuration of the same quality as the ones suggested by the other optimizers at the end of the optimization process) with ResNet50/ImageNet, the proposed solution achieves slightly smaller, but still significant, speed-ups, namely 1.28×, 1.97×, and 2.45× w.r.t. taKG - epochs, HB, and BO-EI. Interestingly, with ResNet50/ImageNet the pro-
• taKG (PGD iter): taKG using as fidelity only the num- posed solution provides solid speed-ups also during the first
ber of PGD iterations. stage of the optimization. Specifically, if we analyze the first
• HB (epochs): HyperBand [8], a popular (single- half of the optimization process (corresponding to approx.
dimensional) multi-fidelity optimizer, which uses 83 hours (see Figure 4a)) the proposed solution identifies
the number of epochs as fidelity. configurations of the same quality as taKG - epochs, HB,
BO-EI, with speed-ups of 1.7×, 2× and 2.6×, respectively.
We further compare against BO using EI as acquisition With CNN/Cifar10 (Figure 4c), the proposed approach re-
function and Random Search (RS). These optimizers only mains the best-performing solution, although with smaller
gains when compared to taKG with epochs. Still, the proposed solution can identify configurations with the same quality as the best alternative (taKG - epochs) by saving approx. 40% of the time (i.e., in 22 hours vs. 32 hours). We argue that the gains with CNN/Cifar10 are relatively lower than in the other scenarios considered in Figure 4 because the models used in those scenarios are larger and more complex and, as such, benefit more from the cost reduction opportunities provided by using a reduced number of PGD iterations.

We also observe that the exclusive use of PGD iterations with taKG yields worse performance than using solely the number of epochs. This is not surprising, given that the number of epochs is arguably one of the most direct ways of controlling the cost of configuration sampling and is, indeed, among the most commonly adopted budgets in multi-fidelity optimizers [20, 8, 27]. This result confirms that PGD iterations represent a valuable means to accelerate multi-fidelity HPT optimizers when training robust models, and that they complement, but do not replace, “conventional” budget settings like the number of epochs or dataset size.

4. Conclusions and Future work

This paper focused on the problem of HPT for robust models. By means of an extensive experimental study, we first quantified the relevance of independently tuning the HPs used during standard and adversarial training. We then proposed and evaluated a novel fidelity dimension that becomes available in the context of AT. Specifically, we have shown that cheaper AT methods can be used to obtain inexpensive estimations of the quality achievable via expensive state-of-the-art AT methods and that this information can be effectively exploited to accelerate HPT. We extended taKG, a state-of-the-art HPT method, by incorporating the PGD iterations as an additional fidelity dimension (along with the number of epochs) and achieved cost reductions of up to 2.1×.

It is worth noting that the idea of employing “cheap” AT methods as proxies to estimate the quality of HP configurations with more robust/expensive methods is generic, in the sense that it can be applied, at least in principle, to any multi-fidelity optimizer. As part of our future work, we plan to integrate this novel approach in a new HPT framework specifically designed to cope with adversarially robust models.

Acknowledgments

This work was supported by the Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) through the Carnegie Mellon Portugal Program under grant SFRH/BD/151470/2021 via projects with reference UIDB/50021/2020 and C645008882-00000055.PRR, by the NSA grant H98230-23-C-0274, and by the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, where we used the Bridges-2 GPU and Ocean resources at the Pittsburgh Supercomputing Center through allocation CIS220073, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

References

[1] I. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, in: ICLR, 2015.
[2] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, in: ICLR, 2018.
[3] S. Gupta, P. Dube, A. Verma, Improving the affordability of robustness training for DNNs, in: CVPR, 2020.
[4] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. E. Ghaoui, M. Jordan, Theoretically principled trade-off between robustness and accuracy, in: ICML, 2019.
[5] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration, in: LION, 2011.
[6] K. Swersky, J. Snoek, R. Adams, Multi-task Bayesian optimization, in: NeurIPS, 2013.
[7] J. Wu, S. Toscano-Palmerin, P. I. Frazier, A. G. Wilson, Practical multi-fidelity Bayesian optimization for hyperparameter tuning, in: UAI, 2019.
[8] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, Hyperband: A novel bandit-based approach to hyperparameter optimization, Journal of Machine Learning Research (2018).
[9] E. Wong, L. Rice, Z. Kolter, Fast is better than free: Revisiting adversarial training, in: ICLR, 2020.
[10] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. P. Dickerson, C. Studer, L. S. Davis, G. Taylor, T. Goldstein, Adversarial training for free!, in: NeurIPS, 2019.
[11] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard, DeepFool: A simple and accurate method to fool deep neural networks, in: CVPR, 2016.
[12] J. H. Metzen, T. Genewein, V. Fischer, B. Bischoff, On detecting adversarial perturbations, in: ICLR, 2017.
[13] C. Guo, M. Rana, M. Cisse, L. van der Maaten, Countering adversarial images using input transformations, in: ICLR, 2018.
[14] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, A. Swami, Practical black-box attacks against machine learning, in: ASIA CCS, 2017.
[15] M. Andriushchenko, N. Flammarion, Understanding and improving fast adversarial training, in: NeurIPS, 2020.
[16] L. Rice, E. Wong, J. Z. Kolter, Overfitting in adversarially robust deep learning, in: ICML, 2020.
[17] T. Bai, J. Luo, J. Zhao, B. Wen, Q. Wang, Recent advances in adversarial training for adversarial robustness, in: IJCAI, 2021.
[18] E. Strubell, A. Ganesh, A. McCallum, Energy and policy considerations for deep learning in NLP, in: ACL, 2019.
[19] P. Mendes, M. Casimiro, P. Romano, D. Garlan, TrimTuner: Efficient optimization of machine learning jobs in the cloud via sub-sampling, in: MASCOTS, 2020.
[20] A. Klein, S. Falkner, S. Bartels, et al., Fast Bayesian optimization of machine learning hyperparameters on large datasets, in: AISTATS, 2017.
[21] J. Mockus, V. Tiesis, A. Zilinskas, The application of Bayesian methods for seeking the extremum, in: Toward Global Optimization, 1978.
[22] M. Casimiro, D. Didona, P. Romano, et al., Lynceus: Cost-efficient tuning and provisioning of data analytic jobs, in: ICDCS, 2020.
[23] K. Swersky, J. Snoek, R. Adams, Freeze-thaw Bayesian optimization, ArXiv:1406.3896 (2014).
[24] K. Jamieson, A. Talwalkar, Non-stochastic best arm identification and hyperparameter optimization, in: AISTATS, 2016.
[25] S. Falkner, A. Klein, F. Hutter, BOHB: Robust and efficient hyperparameter optimization at scale, in: ICML, volume 80, 2018.
[26] N. H. Awad, N. Mallik, F. Hutter, DEHB: Evolutionary Hyperband for scalable, robust and efficient hyperparameter optimization, in: IJCAI, 2021.
[27] P. Mendes, M. Casimiro, P. Romano, D. Garlan, HyperJump: Accelerating HyperBand via risk modelling, in: AAAI, 2023.
[28] E. Duesterwald, A. Murthi, G. Venkataraman, et al., Exploring the hyperparameter landscape of adversarial robustness, ArXiv abs/1905.03837 (2019).
[29] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, A. Madry, Robustness may be at odds with accuracy, in: ICLR, 2019.