<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hyper-parameter Tuning for Adversarially Robust Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pedro Mendes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Romano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Garlan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INESC-ID and Instituto Superior Tecnico, Universidade de Lisboa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Software and Societal Systems Department, Carnegie Mellon University</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>This work focuses on the problem of hyper-parameter tuning (HPT) for robust (i.e., adversarially trained) models, shedding light on the new challenges and opportunities arising during the HPT process for robust models. To this end, we conduct an extensive experimental study based on three popular deep models and exhaustively explore nine (discretized) hyper-parameters (HPs), two fidelity dimensions, and two attack bounds, for a total of 19208 configurations (corresponding to 50 thousand GPU hours). Through this study, we show that the complexity of the HPT problem is further exacerbated in adversarial settings due to the need to independently tune the HPs used during standard and adversarial training: succeeding in doing so (i.e., adopting different HP settings in the two phases) can reduce the error for clean and adversarial inputs by up to 80% and 43%, respectively. We also identify new opportunities to reduce the cost of HPT for robust models. Specifically, we propose to leverage cheap adversarial training methods to obtain inexpensive, yet highly correlated, estimations of the quality achievable using more robust/expensive state-of-the-art methods. We show that, by exploiting this novel idea in conjunction with a recent multi-fidelity optimizer (taKG), the efficiency of the HPT process can be enhanced by up to 2.1×.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Adversarial attacks [
        <xref ref-type="bibr" rid="ref1">1</xref>
] aim at causing model
misclassifications by introducing small perturbations in the input.
White-box methods like Projected Gradient Descent (PGD) [
        <xref ref-type="bibr" rid="ref2">2</xref>
]
have been shown to be extremely effective in
synthesizing perturbations that are small enough to go unnoticed
by humans, while severely hindering the model’s
performance. Fortunately, models can be hardened against this
type of attack via a so-called “Adversarial Training” (AT)
process. During AT, which typically takes place after an
initial standard training (ST) phase [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], adversarial
examples are synthesized and added (with their intended label)
to the training set. Recently, several AT methods have been
proposed [
        <xref ref-type="bibr" rid="ref1 ref2 ref4">1, 2, 4</xref>
] that explore different trade-offs between
robustness and computational efficiency. Unfortunately, the
most robust AT methods impose significant overhead (up
to 7× in the models tested in this work) with respect to
standard training.
      </p>
      <p>These costs are further amplified when considering
another crucial phase of model building, namely
hyper-parameter tuning (HPT). In fact, HPT methods require
training a model multiple times using different hyper-parameter
(HP) configurations. Consequently, the overheads
introduced by AT also increase the cost of HPT.
Further, AT and ST share common HPs, which raises the
question of whether AT should simply employ the same HP
settings used during ST, or whether a new HPT process should
be executed to select different HPs for the AT phase. In
the latter case, the dimensionality of the HP space to be
optimized grows significantly, exacerbating the HPT cost.</p>
      <p>Hence, this work focuses on the problem of HPT for
adversarially trained models with the twofold goal of i)
shedding light on the new challenges (i.e., additional costs) that
emerge when performing HPT for robust models, and ii)
proposing novel techniques to reduce these costs by
exploiting opportunities that arise in this context.</p>
      <p>
We pursue the first goal via an extensive experimental
study based on 3 popular models/datasets widely used to
evaluate AT methods (ResNet50/ImageNet, ResNet18/SVHN,
and CNN/Cifar10). In this study, we discretize and
exhaustively explore the HP space composed of up to nine
HPs, which we evaluated considering two “fidelity
dimensions” [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] for the training process and two attack strengths.
Overall, we test a total of 19208 configurations and we make
this dataset publicly accessible in the hope that it will aid
the design of future HPT methods specialized for AT.
      </p>
      <p>Leveraging this data, we investigate a key design choice
for the HPT process of robust models, namely the decision of
whether to adopt the same vs. different HP settings during
AT and ST (for the HPs common to the two phases). To this
end, we focus on 3 key HPs of deep models: learning rate,
momentum, and batch size. Our empirical study shows that
allowing the use of different HP settings during ST and AT
can bring substantial benefits in terms of model quality,
reducing the standard and adversarial error by up to 80%
and 43%, respectively.</p>
      <p>
Further, our study shows that, while the cost and
complexity of HPT are heightened in adversarial settings,
unique opportunities can be exploited in the context of
robust models to effectively mitigate these
costs. Specifically, we show that it is possible to leverage
cheap AT methods to obtain inexpensive, yet highly
correlated, estimations of the quality achievable using more
robust/expensive methods (PGD [
        <xref ref-type="bibr" rid="ref2">2</xref>
]). Besides studying the
trade-offs between cost reduction and HP quality correlation
with different AT methods, we extend a recent multi-fidelity
optimizer (taKG [
        <xref ref-type="bibr" rid="ref7">7</xref>
]) to incorporate the choice of the AT
method as an additional dimension to reduce the HPT cost.
We evaluate the proposed method using our dataset and
show that incorporating the choice of the AT method as
an additional fidelity dimension in taKG leads to speed-ups
of up to 2.1×, with gains that extend up to 3.7× w.r.t. popular
HPT methods, such as HyperBand [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These reductions in the
optimization time not only translate to significant
reductions in energy consumption during training but also result
in corresponding decreases in pollutant emissions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>In this section, we first provide background information on
AT techniques (Section 2.1) and then discuss related works
in the area of HPT (Section 2.2).</p>
      <sec id="sec-2-1">
        <title>2.1. Adversarial Training</title>
        <p>
Adversarial attacks aim at introducing small perturbations
to input data, often small enough to be hardly perceivable
by humans, with the goal of leading the model to generate an
erroneous output. These attacks reveal vulnerabilities of
current model training techniques, underscoring the need
for developing robust models in different domains. Thus,
several works [
          <xref ref-type="bibr" rid="ref1 ref10 ref11 ref2 ref9">1, 2, 9, 10, 11</xref>
] developed new techniques to
mitigate these vulnerabilities and defend against adversarial
attacks, tackling them in different and often orthogonal
or complementary ways, such as adversarial training [
          <xref ref-type="bibr" rid="ref1 ref10 ref2 ref9">1, 2, 9,
10</xref>
          ], detection of adversarial attacks [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], or pre-processing
techniques to filter adversarial perturbations [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Next, we
will review existing AT approaches, which represent the
focus of this work.
        </p>
        <p>AT aims at improving the robustness of machine learning
(ML) models by i) first generating adversarially perturbed
inputs, and ii) feeding these adversarial examples, along with
the correct corresponding label, during the model
training phase. More formally, this process can be described as
follows. Unlike ST, which determines the model’s
parameters θ by minimizing the loss function L between the model’s
prediction for the clean input, f_θ(x), and the original class
y, i.e., min_θ {E_(x,y)∼D [L(f_θ(x), y)]}, AT first computes a
perturbation δ, smaller than a maximum predefined bound
ε, which will mislead the current model, and then trains
the model with that perturbed input. This approach leads
to the formulation of the following optimization problem:
min_θ {E_(x,y)∼D [max_{‖δ‖&lt;ε} L(f_θ(x + δ), y)]}. The model’s
robustness depends on the bound ε used to produce the
adversarial examples and on the strength of the method used to
compute those examples.</p>
        <p>
          Several methods have been developed to solve this
optimization problem (or variants thereof as, e.g., [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]) and
the resulting techniques are based on diferent assumptions
about, e.g., the availability of the model for the attacker (i.e.,
white-box [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] or black-box [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]), whether the underlying
model is diferentiable [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] or not [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], and the existence
of bounds on attacker capabilities [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Among these
techniques, two of the most popular ones are Fast Gradient
Sign Method (FGSM) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and Projected Gradient Descent
(PGD) [
          <xref ref-type="bibr" rid="ref2">2</xref>
]. Both techniques assume that attackers can
inject bounded perturbations and have access to a
differentiable model. FGSM, and its later variants [
          <xref ref-type="bibr" rid="ref10 ref15 ref9">9, 10, 15</xref>
          ], rely
on gradient descent to compute small perturbations in an
efficient way. In more detail, for a given clean input x, this
method adjusts the perturbation δ by the magnitude of the
bound in the direction of the gradient of the loss function,
i.e., δ = ε · sign(∇_x L(f_θ(x + δ), y)). PGD iteratively
generates adversarial examples by taking small steps in the
direction of the gradient of the loss function and projecting
the perturbed inputs back onto the ε-ball around the original
input, i.e., Repeat: δ = P_ε(δ + α ∇_x L(f_θ(x + δ), y)), where
P_ε is the projection onto the ball of radius ε and α can be seen
as analogous to the learning rate in gradient-descent-based
training. Due to its iterative nature, PGD incurs a notably
higher computational cost than FGSM [
          <xref ref-type="bibr" rid="ref10 ref9">10, 9</xref>
          ], but it is also
regarded as one of the strongest methods to generate
adversarial examples. In fact, prior work has shown that PGD
attacks can fool robust models trained via FGSM and that
PGD-based AT produces models robust to larger
perturbations [
          <xref ref-type="bibr" rid="ref2">2</xref>
] and with higher adversarial accuracy. FGSM is also
known to suffer from catastrophic overfitting [
          <xref ref-type="bibr" rid="ref16">16</xref>
], in which
the model’s adversarial accuracy collapses after some
training iterations. Henceforth, we will focus on FGSM and PGD,
which, as mentioned, are among the most widely used and
effective methods for generating adversarial examples [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
In fact, these methods have been extensively studied and
compared in the literature [
          <xref ref-type="bibr" rid="ref10 ref15 ref16 ref17 ref9">15, 16, 10, 9, 17</xref>
] and represent a
natural starting point for investigating the trade-offs related
to HPT that arise in the context of AT.
        </p>
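<p>To make the two attack rules above concrete, the following is a minimal NumPy sketch for a linear softmax classifier. The model, loss, and step sizes here are illustrative assumptions (not the networks used in this work), and PGD is shown in its common sign-based L∞ variant:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(W, x, y):
    """Cross-entropy loss of a linear model z = W @ x for true class y."""
    return -np.log(softmax(W @ x)[y])

def grad_x(W, x, y):
    """Gradient of the loss w.r.t. the input x (chain rule through z = W @ x)."""
    p = softmax(W @ x)
    p[y] -= 1.0          # dL/dz for softmax cross-entropy
    return W.T @ p       # dL/dx

def fgsm(W, x, y, eps):
    """FGSM: a single signed-gradient step of magnitude eps."""
    return eps * np.sign(grad_x(W, x, y))

def pgd(W, x, y, eps, alpha, n_iter):
    """PGD: iterative steps of size alpha, each followed by projection
    onto the L-inf ball of radius eps around the clean input."""
    delta = np.zeros_like(x)
    for _ in range(n_iter):
        delta = delta + alpha * np.sign(grad_x(W, x + delta, y))
        delta = np.clip(delta, -eps, eps)   # projection step
    return delta
```

Both routines return a perturbation with ‖δ‖∞ ≤ eps; FGSM costs one gradient evaluation while PGD costs n_iter of them, which is the cost asymmetry exploited later in Section 3.3.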
        <p>Independently of the technique used to perform AT, a
relevant design choice, first investigated by Gupta et al., is
whether to use an initial ST phase before performing AT, or
whether to use exclusively AT. That study showed that using
an initial ST phase normally helps to reduce the
computational cost while yielding models of comparable quality.
This result motivates one of the key questions that we aim
to answer in this work, namely whether the ST and AT
phases should share the settings of their common HPs.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Hyper-parameter Tuning</title>
        <p>
HPT is a critical phase for optimizing the performance of ML
models. As the scale and complexity of models increase,
along with the number of HPs that can possibly be tuned
in modern ML methods [
          <xref ref-type="bibr" rid="ref18">18</xref>
], HPT is a notoriously
time-consuming process, whose cost can become prohibitive due
to the need to repeatedly train complex models on large
datasets.
        </p>
        <p>
          To address this issue, a large spectrum of the literature on
HPT relies on Bayesian Optimization (BO) [
          <xref ref-type="bibr" rid="ref19 ref20 ref21 ref22 ref23 ref5 ref6 ref7">19, 20, 21, 6, 7,
22, 5, 23</xref>
]. BO employs modeling techniques (e.g., Gaussian
Processes) to guide the optimization process and leverages
the model’s knowledge and uncertainty (via a so-called
acquisition function) to select which configurations to test.
Although the use of BO can help increase the
convergence speed of the optimization process, the cost of testing
multiple HP configurations can quickly become prohibitive,
especially when considering complex models trained over
large datasets.
        </p>
        <p>
          To tackle this problem, multi-fidelity techniques [
          <xref ref-type="bibr" rid="ref19 ref20 ref23 ref5 ref6 ref7">5, 6, 20,
19, 23, 7</xref>
          ] exploit cheap low-fidelity evaluations (e.g.,
training with a fraction of the available data or using a reduced
number of training epochs) and extrapolate this knowledge
to recommend high-fidelity configurations. This allows for
reducing the cost of testing HP configurations, while still
providing useful information to guide the search for the
optimal high-fidelity configuration(s) [
          <xref ref-type="bibr" rid="ref19 ref5">5, 19</xref>
          ]. HyperBand [
          <xref ref-type="bibr" rid="ref8">8</xref>
]
is a popular multi-fidelity and model-free approach that
promotes good-quality configurations to higher budgets
and discards poor-quality ones using a simple, yet
effective, successive-halving approach [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. Several approaches
extended HyperBand using models to identify good
configurations [
          <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
] or to reduce the number of configurations
to test [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. While these works adopt a single budget type
(e.g., training time or dataset size), other approaches, such
as taKG [
          <xref ref-type="bibr" rid="ref7">7</xref>
], jointly use multiple budget/fidelity
dimensions during the optimization process to gain additional
flexibility in reducing the optimization cost. taKG selects the
next configuration and the different budgets via model-based
predictive techniques that estimate the cost incurred and
the information gained by sampling a given configuration at a
given setting of the available fidelity dimensions.
        </p>
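<p>As a point of reference for the model-free baseline discussed above, the successive-halving subroutine at the core of HyperBand can be sketched as follows. This is a minimal illustration with a hypothetical evaluate function; the full HyperBand bracket scheduling over multiple (budget, configuration-count) trade-offs is omitted:</p>

```python
def successive_halving(configs, evaluate, min_budget, eta=2, rounds=3):
    """Evaluate all configs at a small budget, keep the best 1/eta fraction,
    and repeat with an eta-times larger budget (the core idea of HyperBand)."""
    survivors = list(configs)
    budget = min_budget
    for _ in range(rounds):
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))  # lower error wins
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]
```

Here evaluate(config, budget) would train with the given budget (e.g., number of epochs) and return a validation error, so poor configurations are discarded after only cheap, low-fidelity evaluations.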
        <p>
In the area of HPT for robust models, the work most
closely related to ours is the study by Duesterwald et
al. [
          <xref ref-type="bibr" rid="ref28">28</xref>
]. This work empirically investigated the relation
between the bound on adversarial perturbations (ε) and
the model’s accuracy/robustness. Further, it showed that
the ratio of clean/adversarial examples included in a batch
(during AT) can have a positive impact on the model’s
quality and, as such, represents a key HP. Based on this finding,
we incorporate this HP among the ones tested in our study.
Differently from that work, we focus on i) quantifying the
benefits of using different HPs during ST and AT, and ii)
exploiting the correlation between cheaper AT methods (such
as FGSM) and more expensive ones to enhance the efficiency
of multi-fidelity HPT algorithms.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. HPT for Robust Models: Challenges and Opportunities</title>
      <p>As mentioned, this work aims at shedding light on the
challenges and opportunities that arise when performing HPT
for adversarially robust models. More precisely, we seek to
answer the following questions:
1. Should the HPs that are common to the AT and
ST phases be tuned independently? In more detail,
we aim at quantifying to what extent the model’s
quality is affected if one uses the same vs. different
HP settings during the ST and AT phases (see
Section 3.2).
2. Is it possible to reduce the cost of HPT by testing HP
settings using cheaper (but less robust) AT methods?
How correlated is the performance of alternative
AT approaches and what factors (e.g., the
perturbation bound or the cost of the techniques) impact such
correlation? To what extent can this approach
enhance the efficiency of the HPT process? (see Section 3.3.)
To answer the above questions, we have collected
(and made publicly available) a dataset, which we obtained
by varying some of the most impactful HPs for three popular
neural models/datasets and measuring the resulting model
quality. We provide a detailed description of the dataset in
Section 3.1.</p>
      <sec id="sec-4-1">
        <title>3.1. Experimental Setup</title>
        <p>We base our study on three widely used models and datasets
(ResNet50/ImageNet, ResNet18/SVHN, and CNN/Cifar10).
All the models were trained using 1 worker, except for SVHN,
for which two workers were used. We used Nvidia Tesla
V100 GPUs to train the ResNet50, and Nvidia GeForce RTX
2080 GPUs for the remaining models. All models and training
procedures were implemented in Python 3 via the PyTorch
framework.</p>
        <p>
To evaluate the models, we considered up to nine
different HPs, as summarized in Table 1. The first three HPs in
this table apply to both the ST and AT phases. ε is an HP
that applies exclusively to AT, whereas the last two HPs
(%RAT and %AE) regulate the balance between ST and AT
(see Section 2). Specifically, %RAT defines the fraction of
computational resources allocated to the AT phase, and
%AE indicates the ratio of adversarial inputs contained in
the batches during the AT phase (as suggested by
Duesterwald et al. [
          <xref ref-type="bibr" rid="ref28">28</xref>
]). We further consider several settings of
the bound ε on the attacker power (see Table 2). Note that
the reported values of ε are normalized by 255. Finally, we
also consider two fidelity dimensions, namely the number
of training epochs and the number of PGD iterations (see
Section 3.3). The model’s quality is evaluated using the standard
error (i.e., the error on clean inputs) and the adversarial error
(i.e., the error on adversarially perturbed inputs).
        </p>
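<p>The role of the %AE knob can be illustrated with a short sketch of how an AT mini-batch might be assembled. The function and variable names below are hypothetical, and the actual batching code of this study may differ:</p>

```python
import random

def make_at_batch(clean_examples, perturb, batch_size, pct_ae):
    """Build an AT mini-batch in which a pct_ae fraction of the inputs
    is replaced by adversarially perturbed versions (the %AE parameter)."""
    batch = random.sample(clean_examples, batch_size)
    n_adv = round(batch_size * pct_ae)
    # perturb the first n_adv sampled examples; labels stay unchanged
    return [perturb(xy) for xy in batch[:n_adv]] + batch[n_adv:]
```

With pct_ae = 1.0 every input in the batch is adversarial, while pct_ae = 0.0 degenerates to standard training on clean inputs.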
        <p>For each model, we exhaustively explored the (discretized)
space defined by the HPs, the bound ε, and the fidelities, which
yields a search space encompassing a total of 19208
configurations. Building this dataset required around fifty
thousand GPU hours, and we have made it publicly accessible in
the hope that it will aid the design of future HPT methods
specialized for AT. Additional information to ensure the
reproducibility of results is provided in the public repository1.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Should the HPs of ST and AT be tuned independently?</title>
        <p>This section aims at answering the following question: given
that the ST and AT phases share several HPs (e.g., batch
size, learning rate, and momentum in the models considered
in this study), how relevant is it to use different settings
for these HPs in the two training phases? Note that, if we
assume the existence of k HPs in common between ST and
AT, then enabling the use of different values for these HPs
in each training stage causes the dimensionality of the HP
space to grow from k to 2k (not accounting for any HPs
not in common) and, ultimately, leads to a significant increase in
the cost/complexity of the HPT problem. Specifically, for
the scenarios considered in this study, the cardinality of the
HP space grows from 320 to 2560 distinct configurations.
Hence, we argue that such a cost is practically justified
only if it is counterbalanced by relevant gains in terms of
error reduction. To answer this question, we trained the
models for 16 epochs and used different settings for
the common HPs of ST and AT (Table 1). We consider
three different settings (30%, 50%, and 70%)2 for the relative
amount of resources (epochs) available for AT (%RAT), as
well as different settings of the perturbation bound ε.
1https://github.com/pedrogbmendes/HPT_advTrain</p>
        <p>We consider that the model’s HPs can be optimized
according to three criteria: i) clean data error (Error), ii)
adversarial error (AdvError), and iii) the average of clean and
adversarial error (MeanError). For each of these three
optimization criteria, %RAT and bound ε, we report in Figure 1
the percentage reduction of the target optimization metric
for the optimal HP configuration obtained by allowing for
(but not imposing) different settings of the HPs common
to the ST and AT phases, with respect to the optimal HP
configuration obtained when using the same settings for the
common HPs in both training phases, namely:
%Error Reduction = (Error_same HPs − Error_diff HPs) / Error_same HPs × 100 (1)</p>
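<p>In code, this metric is simply the following (a trivial helper, shown only to pin down the sign convention of Eq. (1)):</p>

```python
def pct_error_reduction(error_same_hps, error_diff_hps):
    """Eq. (1): relative error reduction (%) obtained by allowing different
    HP settings in the ST and AT phases; positive values favor different HPs."""
    return (error_same_hps - error_diff_hps) / error_same_hps * 100.0
```

For instance, an error dropping from 0.50 to 0.10 corresponds to the 80% reduction quoted in the abstract.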
        <p>The results show that adopting different HP settings in
the two phases can lead to significant error reductions for
all three optimization criteria. The peak gains extend
up to approx. 30% and are achieved for the case of ResNet18
with (relatively) large values of ε and when allocating a
low percentage of epochs to AT (%RAT=30%). Overall, the
geometric mean of the % error reduction (across all models
and settings of ε and %RAT) is 9%, 5%, and 6% for the Error,
AdvError, and MeanError criteria, respectively.</p>
        <p>Next, Figure 2 provides a different perspective to
quantify the benefits achieved by separately optimizing the
HPs of the AT phase (vs. reusing, for the AT phase, the HP
settings in common with the ST phase), assuming an initial
ST phase has been executed using any of the possible HP
settings. Specifically, we report the Cumulative Distribution
Function (CDF) of the percentage of error reduction (for each
of the three target optimization metrics), when allowing the
use of the same or different common HP settings for the
two phases and while varying the remaining (non-common)
HPs, namely %RAT, %AE, and ε. These results highlight
that, by independently tuning the HPs of the two
training stages, the model’s quality is enhanced by up to
approx. 80%, 43%, and 56%, when minimizing the standard,
adversarial, or mean error, respectively.
2We exclude the cases %RAT={0,100} in this study to focus on scenarios
that contain both the ST and AT phases.</p>
        <p>Overall, these results may be justified by considering
that the optimization objectives and constraints of the ST
and AT phases are different, hence benefiting from
different HP settings. During ST, the training procedure
focuses on maximizing standard accuracy, and the model’s
goal is to learn representations that generalize well to new
data. In contrast, AT seeks to increase robustness against
adversarial attacks, and the model needs to learn to
correctly differentiate between clean and perturbed examples.
Further, the AT phase benefits from a pre-trained model
(using clean data) and, as such, this model is expected to
require relatively small weight adjustments to defend against
adversarial inputs. Thus, this phase is likely to benefit from
more conservative settings of HPs such as learning rate and
momentum than the initial ST, whose convergence can
be accelerated via more aggressive settings for
the same HPs. We confirmed this by analyzing
the configurations that yield the 10 largest error reductions
in Figure 2: better-quality models used lower learning rates
and batch sizes in the AT phase.</p>
        <p>
Another factor that can justify the need for using different
HP settings during ST and AT is related to the observation
that the bound on the admissible perturbation (ε) can have
a deep impact on the model’s performance, by exposing
an inherent (and well-known [
          <xref ref-type="bibr" rid="ref29">29</xref>
]) trade-off: as the bound
increases, the model may become more robust to adversarial
inputs, but at the cost of an increase in the misclassification
rate of clean inputs. To achieve an optimal trade-off between
robustness and accuracy, it may be necessary to adjust the
tuning of the HPs used during AT as ε varies, which in turn
implies that the optimal HP settings used during ST and AT
can differ. In fact, by analyzing the results obtained on
ResNet18/SVHN, for example, we see that the amplitude of
the bound has an impact on the (adversarial) error reduction
achievable by independently tuning the HPs of the two phases of
training: the 90th percentile of the percentage of clean error
reduction is 50% and 65% using ε=4 and ε=8, respectively
(see Fig. 2b).
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Can cheap AT methods be leveraged to accelerate HPT?</title>
        <p>So far, we have shown that in adversarial settings the
complexity of the HPT problem is exacerbated by the need
to optimize a larger HP space. In this section, we show
that, fortunately, AT also provides new opportunities to
reduce the HPT cost. Specifically, we propose and
evaluate a novel idea: leveraging alternative AT methods, which
impose lower computational costs but provide weaker
robustness guarantees, to sample HP configurations in a cheap,
yet informative, way. As discussed in Section 2, PGD is
an iterative method, where each iteration refines the
perturbation with the objective of maximizing the loss. Hence, a
straightforward way to reduce its cost (at the expense of
robustness) is to reduce the number of executed iterations. We
also note that the computational cost of FGSM is equivalent
to that of a single PGD iteration.</p>
        <p>
          We build on these observations to propose incorporating
the number of PGD iterations as an additional fidelity
dimension in multi-fidelity HPT optimizers, such as taKG [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
We choose to test the proposed idea with taKG since this
technique supports the use of an arbitrary number of fidelity
dimensions (e.g., dataset size and number of epochs) and
determines how to explore the multi-dimensional fidelity
via black-box modeling techniques (see Section 2.2).
        </p>
        <p>
          In order to assess the soundness and limitations of the
proposed approach, we first analyze the correlation of the
standard and adversarial error between HP configurations
that use PGD with 20 iterations (which we consider as the
maximum-fidelity/full budget) vs. PGD with 10 and 5
iterations and FGSM (which, as already mentioned, is
computationally equivalent to 1 iteration of PGD). In Figure 3,
we observe that the correlation varies for different bounds
on adversarial perturbations across the considered
models/datasets. We omit the correlation for ResNet50/ImageNet
using ε=4 since the results are very similar to ε=2. The
scatter plots clearly show the existence of a very strong
correlation (above 95%) for all the considered methods for
ResNet50/ImageNet and for all the considered bounds. For
ResNet18/SVHN and CNN/Cifar10, the correlation of PGD
with 5 and 10 iterations remains quite strong (always above 80%
and typically above 90%), whereas lower correlations (as low
as 53%) can be observed for FGSM, especially when
considering the adversarial error and larger values of ε. This is expected,
as previous works [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
] had indeed observed that FGSM
tends to be less robust than PGD when larger ε values are
used (being subject to issues such as catastrophic overfitting,
which leads to a sudden drop in adversarial accuracy). Still,
even for FGSM, the correlation is always above 90% with
CNN/Cifar10 and is relatively high (around 70%) also with
ResNet18/SVHN for the smaller considered bound (ε = 4).
        </p>
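<p>Given paired error measurements of the same HP configurations at low and high fidelity, the correlation analysis above can be reproduced with a rank-correlation computation such as the following. Spearman’s coefficient is shown as one plausible choice; the exact coefficient used in Figure 3 is not restated here:</p>

```python
import numpy as np

def ranks(a):
    """Position of each entry in the sorted order (ties broken arbitrarily)."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(len(a), dtype=float)
    return r

def spearman(low_fid_errors, high_fid_errors):
    """Spearman rank correlation: the Pearson correlation of the ranks.
    A value near 1 means the cheap AT method orders HP configurations
    almost exactly like the expensive one."""
    rx = ranks(np.asarray(low_fid_errors))
    ry = ranks(np.asarray(high_fid_errors))
    return float(np.corrcoef(rx, ry)[0, 1])
```

A rank correlation is a natural fit here because a multi-fidelity optimizer only needs the cheap method to preserve the ordering of configurations, not their absolute error values.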
        <p>We also report, in Figure 3c, the CDF of the training time
reduction using FGSM, PGD with 5 and 10 iterations, and
ST w.r.t. PGD20 for ResNet50/ImageNet. The CDFs show
that the training time reductions for a given AT method
vary, since the ratio of computed adversarial examples
depends on the %RAT and %AE parameters. Overall, the
maximum (median) training time reduction is approximately
83% (54%), 66% (42%), 47% (28%), and 86% (53%) for FGSM,
PGD5, PGD10, and ST, respectively, compared to PGD20, which
confirms that leveraging these “cheap” surrogate methods can
significantly reduce the cost of testing HP configurations.</p>
        <p>
Supported by these findings, we evaluate our proposal by
integrating the number of PGD iterations as an additional
fidelity dimension in taKG [
          <xref ref-type="bibr" rid="ref7">7</xref>
]. As discussed in Section 2,
taKG is an HPT method that natively supports multiple fidelity
types, which we refer to as fidelity dimensions,
e.g., number of epochs, input size, and dataset size. Based
on the results of the previous section, we independently
optimize the HPs of the ST and AT phases, which yields a
search space composed of a total of nine HPs (Table 1). The
following multi-fidelity solutions are compared:
• taKG (epochs, PGD iter): the proposed solution,
which uses taKG as the underlying HPT method
and employs as fidelity the number of epochs and
the number of PGD iterations. We discretize these 2
dimensions (Table 3).
• taKG (epochs): taKG using as fidelity only the
number of epochs.
• taKG (PGD iter): taKG using as fidelity only the
number of PGD iterations.
• HB (epochs): HyperBand [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], a popular
(single-dimensional) multi-fidelity optimizer, which uses
the number of epochs as fidelity.
        </p>
        <p>We further compare against BO using EI as acquisition
function and Random Search (RS). These optimizers only
perform high-fidelity evaluations. The evaluation of these
alternative solutions is performed by exploiting the dataset
already described in Section 3.1, which specifies the model
quality (error and adversarial error) for all possible HP
and fidelity settings reported in Tables 1, 2 and 3.</p>
        <p>We define the optimization problem as follows:

min_λ  w · Error(λ, b = 1) + (1 − w) · Adv.Error(λ, b = 1)   (2)

where λ is a vector defining the HPs, b is a vector that
encodes the ratio of budget allocated to each fidelity
dimension, and w is a weight factor that we set to 0.5 to equally
balance the standard and adversarial errors. For a fair
comparison, when a single fidelity dimension (e.g., epochs) is
used, we set the other fidelity dimension (e.g., PGD
iterations) to its maximum value. We run each optimizer using
20 independent seeds. We set the bound ε to 2, 8, and 12 to
optimize ResNet50/ImageNet, ResNet18/SVHN, and
CNN/Cifar10. Based on Figure 3, the three settings correspond
to scenarios with relatively high, low, and medium
correlations for the budget dimension defined by PGD iterations,
respectively.</p>
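The objective in Eq. (2) is a plain convex combination of the two errors; a minimal sketch, where the function name and the example error values are illustrative:

```python
def hpt_objective(error, adv_error, w=0.5):
    """Weighted HPT goal of Eq. (2): w * Error + (1 - w) * Adv.Error,
    with both errors measured at full budget (b = 1)."""
    return w * error + (1 - w) * adv_error

# w = 0.5 weighs clean and adversarial error equally, as in the paper
print(hpt_objective(0.10, 0.30))  # ~0.2
```

Setting w closer to 1 prioritizes accuracy on clean inputs; setting it closer to 0 prioritizes robustness.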
        <p>Figure 4 reports the average and standard deviation of
the optimization goal (i.e., 0.5 · Error + 0.5 · Adv.Error) as
a function of the optimization time for the different HPT
optimizers and different models/datasets. The results show
that the proposed solution, which adopts PGD iterations
as extra fidelity along with the number of epochs (taKG
epochs &amp; PGD iter), clearly outperforms all the alternative
solutions in ResNet50/ImageNet and ResNet18/SVHN. The
largest gains can be observed with ResNet18/SVHN (Fig. 4b).
Here, at the end of the optimization process, the proposed
solution identifies a configuration of the same quality as
the ones suggested by the other baselines, namely taKG -
epochs, HB, and BO-EI, while achieving speed-ups of 2.1×, 3.7×,
and 5.4×, respectively. Using the same metric (time spent
to recommend a configuration of the same quality as the
ones suggested by the other optimizers at the end of the
optimization process) with ResNet50/ImageNet, the proposed
solution achieves slightly smaller, but still significant,
speed-ups, namely 1.28×, 1.97×, and 2.45× w.r.t. taKG - epochs,
HB, and BO-EI. Interestingly, with ResNet50/ImageNet the
proposed solution provides solid speed-ups also during the first
stage of the optimization. Specifically, if we analyze the first
half of the optimization process (corresponding to approx.
83 hours, see Figure 4a), the proposed solution identifies
configurations of the same quality as taKG - epochs, HB, and
BO-EI, with speed-ups of 1.7×, 2×, and 2.6×, respectively.</p>
        <p>With CNN/Cifar10 (Figure 4c), the proposed approach
remains the best-performing solution, although with smaller
gains when compared to taKG with epochs. Still, the
proposed solution can identify configurations with the same
quality as the best alternative (taKG epochs) by saving
approx. 40% of the time (i.e., 22 hours vs. 32 hours). We
argue that the gains with CNN/Cifar10 are relatively lower
than in the other scenarios considered in Figure 4 because
the models in those scenarios are larger and more complex
and, as such, benefit more from the cost reduction
opportunities provided by using a reduced number of PGD iterations.</p>
        <p>
          We also observe that the exclusive use of PGD iterations
as fidelity with taKG yields worse performance than using solely
the number of epochs. This is not surprising, given that the
number of epochs is arguably one of the most direct ways of
controlling the cost of configuration sampling and is, indeed,
among the most commonly adopted budgets in multi-fidelity
optimizers [
          <xref ref-type="bibr" rid="ref20 ref27 ref8">20, 8, 27</xref>
          ]. This result confirms that PGD iterations
represent a valuable means of accelerating multi-fidelity HPT
optimizers when training robust models, and that this fidelity
complements, but does not replace, “conventional” budget
settings like the number of epochs or dataset size.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions and Future work</title>
      <p>This paper focused on the problem of HPT for robust
models. By means of an extensive experimental study, we first
quantified the relevance of independently tuning the HPs
used during standard and adversarial training. We then
proposed and evaluated a novel fidelity dimension that
becomes available in the context of AT. Specifically, we have
shown that cheaper AT methods can be used to obtain
inexpensive estimations of the quality achievable via expensive
state-of-the-art AT methods and that this information can be
effectively exploited to accelerate HPT. We extended taKG,
a state-of-the-art HPT method, by incorporating the PGD
iterations as an additional fidelity dimension (along with
the number of epochs), achieving cost reductions of up
to 2.1×.</p>
      <p>It is worth noting that the idea of employing “cheap” AT
methods as proxies to estimate the quality of HP
configurations with more robust/expensive methods is generic,
in the sense that it can be applied, at least theoretically, to
any multi-fidelity optimizer. As part of our future work, we
plan to integrate this novel approach in a new HPT
framework, specifically designed to cope with adversarially robust
models.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the Fundação para a Ciência e
a Tecnologia (Portuguese Foundation for Science and
Technology) through the Carnegie Mellon Portugal Program
under grant SFRH/BD/151470/2021 via projects with
reference UIDB/50021/2020 and C645008882-00000055.PRR, by
the NSA grant H98230-23-C-0274, and by the Advanced
Cyberinfrastructure Coordination Ecosystem: Services &amp;
Support (ACCESS) program, where we used the Bridges-2
GPU and Ocean resources at the Pittsburgh Supercomputing
Center through allocation CIS220073, which is supported
by National Science Foundation grants #2138259, #2138286,
#2138307, #2137603, and #2138296.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <article-title>Explaining and harnessing adversarial examples</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Makelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vladu</surname>
          </string-name>
          ,
          <article-title>Towards deep learning models resistant to adversarial attacks</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dube</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <article-title>Improving the affordability of robustness training for DNNs</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Ghaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <article-title>Theoretically principled trade-off between robustness and accuracy</article-title>
          ,
          <source>in: ICML</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Leyton-Brown</surname>
          </string-name>
          ,
          <article-title>Sequential model-based optimization for general algorithm configuration</article-title>
          ,
          <source>in: LION</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <article-title>Multi-task bayesian optimization</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Toscano-Palmerin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. I.</given-names>
            <surname>Frazier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <article-title>Practical multi-fidelity bayesian optimization for hyperparameter tuning</article-title>
          ,
          <source>in: UAI</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jamieson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>DeSalvo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rostamizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talwalkar</surname>
          </string-name>
          ,
          <article-title>Hyperband: A novel bandit-based approach to hyperparameter optimization</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kolter</surname>
          </string-name>
          ,
          <article-title>Fast is better than free: Revisiting adversarial training</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shafahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghiasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Dickerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Studer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <article-title>Adversarial training for free!</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.-M.</given-names>
            <surname>Moosavi-Dezfooli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fawzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Frossard</surname>
          </string-name>
          ,
          <article-title>Deepfool: A simple and accurate method to fool deep neural networks</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Metzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Genewein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bischoff</surname>
          </string-name>
          ,
          <article-title>On detecting adversarial perturbations</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cisse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          ,
          <article-title>Countering adversarial images using input transformations</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Papernot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McDaniel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. B.</given-names>
            <surname>Celik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swami</surname>
          </string-name>
          ,
          <article-title>Practical black-box attacks against machine learning</article-title>
          ,
          <source>in: ASIA CCS</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Andriushchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Flammarion</surname>
          </string-name>
          ,
          <article-title>Understanding and improving fast adversarial training</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Kolter</surname>
          </string-name>
          ,
          <article-title>Overfitting in adversarially robust deep learning</article-title>
          ,
          <source>in: ICML</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Recent advances in adversarial training for adversarial robustness</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Strubell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <article-title>Energy and policy considerations for deep learning in NLP</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Casimiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Romano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garlan</surname>
          </string-name>
          ,
          <article-title>Trimtuner: Efficient optimization of machine learning jobs in the cloud via sub-sampling</article-title>
          ,
          <source>in: MASCOTS</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Falkner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bartels</surname>
          </string-name>
          , et al.,
          <article-title>Fast bayesian optimization of machine learning hyperparameters on large datasets</article-title>
          ,
          <source>in: AISTATS</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mockus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tiesis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zilinskas</surname>
          </string-name>
          ,
          <article-title>The application of bayesian methods for seeking the extremum</article-title>
          ,
          <source>in: Toward Global Optimization</source>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Casimiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Didona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Romano</surname>
          </string-name>
          , et al.,
          <article-title>Lynceus: Cost-efficient tuning and provisioning of data analytic jobs</article-title>
          ,
          <source>in: ICDCS</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <article-title>Freeze-thaw bayesian optimization</article-title>
          ,
          <source>ArXiv:1406.3896</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Jamieson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talwalkar</surname>
          </string-name>
          ,
          <article-title>Non-stochastic best arm identification and hyperparameter optimization</article-title>
          ,
          <source>in: AISTATS</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Falkner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>BOHB: Robust and efficient hyperparameter optimization at scale</article-title>
          , in: ICML, volume
          <volume>80</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Awad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mallik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>DEHB: Evolutionary hyperband for scalable, robust and efficient hyperparameter optimization</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Casimiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Romano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garlan</surname>
          </string-name>
          ,
          <article-title>Hyperjump: Accelerating hyperband via risk modelling</article-title>
          ,
          <source>in: AAAI</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>E.</given-names>
            <surname>Duesterwald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Murthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Venkataraman</surname>
          </string-name>
          , et al.,
          <article-title>Exploring the hyperparameter landscape of adversarial robustness</article-title>
          ,
          <source>ArXiv:1905.03837</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Engstrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madry</surname>
          </string-name>
          ,
          <article-title>Robustness may be at odds with accuracy</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>