<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Error-Silenced Quantization: Bridging Robustness and Compactness</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhicong Tang</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yinpeng Dong</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hang Su</string-name>
          <email>suhangss@mail.tsinghua.edu.cn</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>As deep neural networks (DNNs) advance rapidly, quantization has become a widely used standard for deployment on resource-limited hardware. However, DNNs are widely accepted to be vulnerable to adversarial attacks, and quantization is found to further weaken their robustness. Adversarial training has proved a feasible defense but depends on a larger network capacity, which conflicts with quantization. In this work, we therefore propose a novel method, Error-silenced Quantization, that relaxes this requirement and achieves both robustness and compactness. We first observe the Error Amplification Effect, i.e., small perturbations on adversarial samples being amplified through layers, and then design a pairing to directly silence the error. Comprehensive experimental results on CIFAR-10 and CIFAR-100 show that our method fixes the robustness drop against alternative threat models and even outperforms full-precision models. Finally, we study different pairing schemes and secure our method against the obfuscated gradient problem that undermines many previous defenses.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Deep neural networks (DNNs) have demonstrated extraordinary performance in a wide range of applications, including visual understanding [Krizhevsky et al., 2012; He et al., 2016], speech recognition [Graves et al., 2013], and natural language processing [Devlin et al., 2019]. As these applications develop, DNNs are increasingly deployed in embedded and edge devices, such as mobile phones, IoT devices and autonomous driving systems. To facilitate such deployment, quantization [Wu et al., 2016; Jacob et al., 2018] has been proposed and has become an industry standard for deep learning hardware and an accelerator for inference in real-time applications [Rastegari et al., 2016].</p>
      <p>However, it is accepted that DNNs are vulnerable to adversarial attacks [Szegedy et al., 2014; Goodfellow et al., 2015]. […] since it is inspired by the Error Amplification Effect and aims at silencing the error in both activation and predictions.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <sec id="sec-2-1">
        <title>2.1 Compression with quantization</title>
        <p>In this section, we briefly introduce two typical quantized
networks, including Binary Weight Network (BWN) [Rastegari
et al., 2016] and Ternary Weight Network (TWN) [Li and
Liu, 2016].</p>
        <p>Firstly, the weight W of a DNN can be denoted layer-wise by W<sub>l</sub> = {W<sub>1</sub>, …, W<sub>i</sub>, …, W<sub>m</sub>}, where the l-th layer has m output channels and W<sub>i</sub> ∈ ℝ<sup>d</sup> is the weight of the i-th filter. Quantization converts each weight W<sub>i</sub> into Q<sub>i</sub> ∈ S<sup>d</sup>, where S consists of at most 2<sup>n</sup> sparse values in an n-bit quantization.</p>
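        <p>BWN takes a scaling factor α ∈ ℝ<sup>+</sup> and S = {−α, +α}. Solving the optimization J = min ‖W<sub>i</sub> − B<sub>i</sub>‖ yields
          <disp-formula><label>(1)</label><tex-math>B_{ij} = \alpha \, \mathrm{sign}(W_{ij}) \quad \text{and} \quad \alpha = \frac{1}{d} \sum_{j=1}^{d} |W_{ij}|.</tex-math></disp-formula>
        </p>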
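        <p>TWN introduces a 0 state over BWN, with S = {−α, 0, +α}, to approximate the real-valued weight W<sub>i</sub> more precisely. It solves the optimization J = min ‖W<sub>i</sub> − T<sub>i</sub>‖ as
          <disp-formula><label>(2)</label><tex-math><![CDATA[T_{ij} = \begin{cases} -\alpha, & W_{ij} < -\Delta \\ 0, & |W_{ij}| \le \Delta \\ +\alpha, & W_{ij} > \Delta \end{cases} \quad \text{and} \quad \alpha = \frac{1}{|I_\Delta|} \sum_{j \in I_\Delta} |W_{ij}|,]]></tex-math></disp-formula>
          where Δ = (0.7/d) Σ<sub>j=1..d</sub> |W<sub>ij</sub>| and I<sub>Δ</sub> = {j : |W<sub>ij</sub>| &gt; Δ}.</p>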
        <p>Then B<sub>i</sub> and T<sub>i</sub> are the 1-bit and 2-bit quantized Q<sub>i</sub> that form the space-efficient weight Q. Since the factor α requires little storage, BWN compresses a 32-bit full-precision model by 32× and TWN compresses it by 16×.</p>
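        <p>As an illustration of (1) and (2), the following NumPy sketch quantizes a single filter. It is a minimal sketch and not the implementation used in the experiments; the function names bwn_quantize and twn_quantize, the example weights, and the convention of mapping a zero weight to +α in the binary case are assumptions made for this example.</p>
        <preformat><![CDATA[
```python
import numpy as np


def bwn_quantize(w):
    """Binarize one filter w (1-D array) to {-alpha, +alpha} as in Eq. (1)."""
    alpha = np.mean(np.abs(w))            # alpha = (1/d) * sum_j |w_j|
    # Assumption: a weight that is exactly zero is mapped to +alpha.
    return alpha * np.where(w >= 0, 1.0, -1.0), alpha


def twn_quantize(w):
    """Ternarize one filter w to {-alpha, 0, +alpha} as in Eq. (2)."""
    delta = 0.7 * np.mean(np.abs(w))      # threshold Delta = (0.7/d) * sum_j |w_j|
    mask = np.abs(w) > delta              # index set I_Delta
    alpha = np.mean(np.abs(w[mask])) if mask.any() else 0.0
    return alpha * np.sign(w) * mask, alpha


# Hypothetical filter with d = 4 weights.
w = np.array([0.9, -0.1, 0.4, -0.6])
b, alpha_b = bwn_quantize(w)
t, alpha_t = twn_quantize(w)
```
]]></preformat>
        <p>For this example filter, the binary version keeps only the signs with a shared magnitude, while the ternary version additionally zeroes the weight whose magnitude falls below the threshold Δ.</p>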
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Adversarial attacks and defenses</title>
        <p>Given an image x, an adversarial attack seeks a noise δ such that the classifier’s prediction on the input x<sub>adv</sub> = x + δ is wrong. Defenses aim to maintain the robustness of the classifier, i.e., the prediction accuracy on the input x<sub>adv</sub>. Here we list the attacks and defenses used in the experiments.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.2.1 Attacks</title>
        <p>Fast Gradient Sign Method (FGSM) is an L∞-bounded one-step attack proposed by [Goodfellow et al., 2015] that computes adversarial samples by following the sign of the gradient of the loss function L with step size ε.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Projected Gradient Descent (PGD)</title>
        <p>PGD, proposed by [Madry et al., 2018], repeats FGSM and starts with a random step to escape the sharp curvature near the original input. It is thought to be the strongest first-order attack.</p>
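        <p>The relation between the two attacks can be sketched on a toy model. The example below is a minimal sketch under stated assumptions: it uses a linear softmax classifier so that the input gradient has a closed form, it assumes pixel values in [0, 1], and all names (loss_grad_x, fgsm, pgd) and the random weights are hypothetical and not taken from the experiments.</p>
        <preformat><![CDATA[
```python
import numpy as np


def loss_grad_x(W, b, x, y):
    """Gradient of the softmax cross-entropy loss with respect to the input x."""
    z = W @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0                       # dL/dz = softmax(z) - onehot(y)
    return W.T @ p


def fgsm(W, b, x, y, eps):
    """One-step L-infinity attack: move eps along the sign of the gradient."""
    x_adv = x + eps * np.sign(loss_grad_x(W, b, x, y))
    return np.clip(x_adv, 0.0, 1.0)   # assumed valid pixel range


def pgd(W, b, x, y, eps, step, n_steps, rng):
    """Repeated FGSM steps with a random start, projected into the eps-ball."""
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(loss_grad_x(W, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection onto the ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid image
    return x_adv


rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), np.zeros(3)   # toy 3-class linear classifier
x, y = rng.uniform(size=8), 1                 # toy input and label
x_fgsm = fgsm(W, b, x, y, eps=8 / 255)
x_pgd = pgd(W, b, x, y, eps=8 / 255, step=2 / 255, n_steps=10, rng=rng)
```
]]></preformat>
        <p>Both outputs stay within the L∞ ball of radius ε around x; PGD differs only in the random start and the repeated, projected steps.</p>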
        <p>C&amp;W Attack [Carlini and Wagner, 2017] uses a tanh change of variables instead of box-constrained methods and optimizes the difference between logits rather than a single logit. It is an iterative attack and among the strongest L2 attacks.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Decoupling Direction and Norm Attack (DDN)</title>
        <p>DDN [Rony et al., 2019] is a recently proposed L2 attack that outperforms C&amp;W. It iterates FGSM with ε adjusted in each round, leading to a finer-grained search for adversarial images.</p>
        <sec id="sec-2-5-1">
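          <table-wrap>
            <label>Table 1</label>
            <caption><p>Robustness under FGSM attacks of varied perturbation strength (five settings, in source order).</p></caption>
            <table>
              <thead>
                <tr><th>Model</th><th colspan="5">FGSM perturbation strength</th></tr>
              </thead>
              <tbody>
                <tr><td>NAT-Full</td><td>36.19</td><td>27.96</td><td>20.76</td><td>14.53</td><td>7.79</td></tr>
                <tr><td>NAT-VQ-BWN</td><td>35.38</td><td>21.07</td><td>11.79</td><td>7.59</td><td>4.99</td></tr>
                <tr><td>ADV-Full</td><td>47.22</td><td>43.65</td><td>36.63</td><td>24.60</td><td>11.16</td></tr>
                <tr><td>ADV-VQ-BWN</td><td>40.84</td><td>28.34</td><td>19.00</td><td>12.74</td><td>7.74</td></tr>
              </tbody>
            </table>
          </table-wrap>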
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>2.2.2 Defenses</title>
        <p>Adversarial training [Goodfellow et al., 2015; Kurakin et al., 2017; Madry et al., 2018] is currently the strongest and most commonly used defense. It augments the training set with adversarial samples through the optimization
          <disp-formula><label>(3)</label><tex-math>\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\delta \in \Delta} L(x + \delta, y; \theta) \Big],</tex-math></disp-formula>
          where pairs of example x ∈ ℝ<sup>d</sup> and ground-truth y follow an underlying data distribution D, δ ∈ Δ is the allowed adversarial noise added to image x to deceive the classifier, θ is the model weight to be optimized and L is the loss function.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The Error Amplification Effect</title>
      <p>Counter-intuitively, a conventionally quantized DNN is more vulnerable [Lin et al., 2019] to adversarial attacks. One convincing explanation is the Error Amplification Effect discovered by [Liao et al., 2018]. Specifically, tiny perturbations can be amplified as they pass through the layers, become sizable enough to deceive the network and eventually push the classification result into an incorrect bucket. Moreover, quantizing a DNN worsens its robustness compared with the original full-precision model by enlarging the granularity of the weights, making its response more sensitive to the input. As shown in Table 1, quantized models yield consistently inferior robustness under FGSM attacks of varied perturbation strength.</p>
      <p>To investigate the effect in detail, we conducted pre-experiments on CIFAR-100 [Krizhevsky and Hinton, 2009] with ResNet-152 [He et al., 2016]. Adversarial samples are generated by an untargeted 10-step PGD attacker with ε = 8/255 and step size 2/255, following [Madry et al., 2018]. In Figure 1, we test four settings with the attack, and evaluate and plot the distance D<sub>l</sub> between the clean and perturbed activation of each layer as
        <disp-formula><label>(4)</label><tex-math>D_l(x, x_{adv}) = \frac{\| F_l(x) - F_l(x_{adv}) \|_2}{\| F_l(x) \|_2},</tex-math></disp-formula>
        where F<sub>l</sub> denotes the activation after the l-th ResNet module. For convenience, we denote the training scheme with the prefixes NAT- and ADV-, the quantization scheme with the infixes -VQ- and -EQ-, and the weight precision with the suffixes -Full, -BWN and -TWN, and we use acronyms in all tables.</p>
      <p>In the left part of Figure 1, the adversarial noise applied to the input image is relatively small compared to the image itself (8 versus 255 in this setting). However, as inference proceeds, the magnitude of the initial perturbation is amplified through the latter part of the network. Once the perturbation is amplified enough, the model is misled to a wrong bucket and the accuracy drops sharply.</p>
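      <p>For concreteness, the relative distance in (4) can be computed as in the following minimal sketch. The function name relative_distance and the toy activations are assumptions made for this example; in the experiments the activations come from the ResNet modules.</p>
      <preformat><![CDATA[
```python
import numpy as np


def relative_distance(act_clean, act_adv):
    """D_l = ||F_l(x) - F_l(x_adv)||_2 / ||F_l(x)||_2 for one layer."""
    diff = np.linalg.norm((act_clean - act_adv).ravel())
    return diff / np.linalg.norm(act_clean.ravel())


# Hypothetical activations of one layer for a clean and a perturbed input.
f_clean = np.array([[3.0, 0.0], [0.0, 4.0]])
f_adv = np.array([[3.0, 0.5], [0.0, 4.0]])
d = relative_distance(f_clean, f_adv)   # 0.5 / 5.0 = 0.1
```
]]></preformat>
      <p>Evaluating this quantity layer by layer for the same input pair gives one curve of the kind discussed for Figure 1.</p>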
      <p>From the experimental results above we make the following observations: (i) The error of the activation eventually accumulates enough to push the prediction to a wrong bucket. (ii) All models suffer from the effect, while quantization reduces robustness by a wide margin. (iii) With vanilla quantization methods, the robustness gain of adversarial training is drastically degraded.</p>
      <p>Therefore, the currently used vanilla quantization methods are shown to be of limited practical use, and the Error Amplification Effect may be a key to robustness-aware quantization.</p>
    </sec>
    <sec id="sec-4">
      <title>Method</title>
      <p>Motivated by the Error Amplification Effect above, we introduce a quantization scheme that simultaneously preserves the robustness of the original full-capacity model and the compactness of low-bandwidth quantization. The concurrent training and quantizing procedure is described in Algorithm 1.</p>
      <p>We first follow the commonly used min-max robustness optimization and formulate the overall robustness and compactness target as
        <disp-formula><label>(5)</label><tex-math>\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\delta \in \Delta} L(x + \delta, y; \theta) \Big] \quad \text{s.t.} \quad \frac{\mathrm{size}(\theta_{full})}{\mathrm{size}(\theta)} = c,</tex-math></disp-formula>
        where θ<sub>full</sub> is the original full-precision weight (W), θ is the finally quantized weight (Q), size(·) is the memory size needed to store the weight and c is the target compression rate.</p>
      <p>Equation (5) can be divided into two parts: (i) minimize the loss on adversarially perturbed inputs for robustness; (ii) compress the model weight to meet the target rate for compactness. In our method, the latter is handled by a quantization algorithm that allows simultaneous training, and the former is handled by directly controlling the amplified error, i.e., pairing activation. Since the activation of an adversarial input deviates largely from that of its original image, a natural solution to control the error is training the network to diminish this deviation.</p>
      <p>Algorithm 1: Error-silenced Quantization.</p>
      <p>Let D<sub>l</sub>(x, x′) be a function that calculates the relative distance between the activations of the l-th layer when the model is fed with x and x′ respectively, which can be a normalized L2 or L1 distance. With a set S of layers to control, the robustness regularization that optimizes the former part of (5) is given in (6) over the layers l ∈ S, where each layer l has a sensitivity parameter that determines the threshold of the amplified error between clean and adversarial samples. The model is forced to infer close activations on the l-th layer if its sensitivity parameter is large and is allowed to tolerate sizable differences if the parameter is small.</p>
      <p>With the pairing objective, we train the model with clean samples and then pair the activation of particular layers, rather than directly training on adversarial samples. Equation (6) can also be divided into two parts that separately tackle the classification accuracy on clean and adversarial images.</p>
      <p>The first part is designed to maintain the performance of the model, because the development of robustness often comes at the cost of prediction accuracy [Su et al., 2018]. With the second part, we train the model to diminish the deviation and infer close activations. A model that behaves similarly on clean and adversarial inputs is expected to reach similar prediction accuracy on both.</p>
      <p>As a special case, pairing is applied only on the final output layer of the network, on which the following experiments focus. The pairing then simplifies to the distance between the logits on clean and adversarial samples. In the optimization (6), the perturbations are generated to maximize the error of the selected activations. However, in this work we generate them with untargeted white-box attacks, because these are believed to be the strongest attacks and so far no attack studies and magnifies the error directly.</p>
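      <p>One possible reading of the logit-pairing special case is sketched below: a classification loss on the clean input plus a weighted relative L2 distance between clean and adversarial logits. This is an illustrative assumption and not the exact objective of the method; the names cross_entropy, logit_pairing_loss and the weight lam, as well as the normalization by the clean logit norm (borrowed from (4)), are hypothetical choices for this example.</p>
      <preformat><![CDATA[
```python
import numpy as np


def cross_entropy(logits, y):
    """Softmax cross-entropy of one example with integer label y."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[y])


def logit_pairing_loss(logits_clean, logits_adv, y, lam):
    """Clean classification loss plus lam times the relative L2 logit gap."""
    gap = np.linalg.norm(logits_clean - logits_adv) / np.linalg.norm(logits_clean)
    return cross_entropy(logits_clean, y) + lam * gap


# Hypothetical logits of one image and of its adversarial counterpart.
clean = np.array([2.0, 0.0, -1.0])
adv = np.array([1.0, 0.5, -1.0])
loss_no_pair = logit_pairing_loss(clean, adv, y=0, lam=0.0)
loss_paired = logit_pairing_loss(clean, adv, y=0, lam=1.0)
```
]]></preformat>
      <p>With lam set to zero the loss reduces to ordinary training on clean inputs; a larger lam penalizes the deviation of the adversarial logits more strongly.</p>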
      <p>Previous work [Madry et al., 2018] has shown that PGD is the most powerful first-order attack. We follow this conclusion and compute adversarial perturbations by PGD attacks with settings consistent with [Madry et al., 2018], modifying only the iteration number and step size.</p>
      <p>Our method upholds and improves the robustness of quantized models by concurrently updating and quantizing their weights. Accordingly, we choose the Stochastic Quantization method introduced in [Dong et al., 2019]. In our method, a model with partially quantized weights is fed with clean and adversarial inputs, and the full-precision weights are updated by the estimated gradients. For comparison, vanilla Stochastic Quantization trains models with clean inputs only.</p>
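      <p>The notion of a partially quantized weight can be illustrated as follows. This is a simplified sketch: a given ratio of the filters is binarized while the rest keep their full-precision values. The filters are chosen uniformly at random here purely for simplicity, which is an assumption of this example; the selection rule of the cited Stochastic Quantization method is not reproduced, and the name partially_quantize is hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np


def partially_quantize(W, ratio, rng):
    """Binarize a random `ratio` of the filters (rows) of W, keep the rest."""
    m = W.shape[0]
    n_quant = int(round(ratio * m))
    rows = rng.choice(m, size=n_quant, replace=False)   # assumed uniform choice
    W_mixed = W.copy()
    alpha = np.mean(np.abs(W[rows]), axis=1, keepdims=True)   # per-filter scale
    W_mixed[rows] = alpha * np.where(W[rows] >= 0, 1.0, -1.0)
    return W_mixed, rows


rng = np.random.default_rng(0)
W = rng.normal(size=(10, 16))            # toy layer: 10 filters of dimension 16
W_mixed, rows = partially_quantize(W, ratio=0.2, rng=rng)
```
]]></preformat>
      <p>The forward pass would use W_mixed, while the gradient update is applied to the full-precision W, in line with the description above.</p>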
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>In this section, our experiments demonstrate that the proposed method can effectively retain and further improve the robustness when a model is quantized to low bandwidth. The method also diminishes the aforementioned Error Amplification Effect by a large margin compared with both full-precision and vanilla quantized models. Finally, we show that the method delivers more convincing performance than two baselines: adversarial training before and during quantization.</p>
      <sec id="sec-5-1">
        <title>5.1 Settings</title>
        <p>We apply Wide ResNet 28-10 [Zagoruyko and Komodakis, 2016] on CIFAR-10 [Krizhevsky and Hinton, 2009] and ResNet-152 on CIFAR-100. Six models in each setting are tested with clean inputs, white-box attacks and transfer attacks.</p>
        <p>During training, we augment the training set with the same PGD attacker as above and train models with an Adam optimizer [Kingma and Ba, 2015] for 150 epochs. The hyperparameters are left at their defaults without fine-tuning.</p>
        <p>During quantization, we pair the activation after the final layer (logits) by the L2 norm and use an SGD optimizer with learning rate 0.1, momentum 0.9 and weight decay 10<sup>−4</sup> to train for 120 epochs, consistent with [Dong et al., 2019]. However, the quantization ratio is updated by the uniform scheme, i.e., beginning at 0.2 and increased by 0.2 every 25 epochs.</p>
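        <p>The uniform scheme for the quantization ratio can be written as a small schedule function. This is a sketch under the assumptions that epochs are counted from zero, that the ratio is raised at the start of each 25-epoch period, and that it is capped at 1.0; the function name quantization_ratio is hypothetical.</p>
        <preformat><![CDATA[
```python
def quantization_ratio(epoch, start=0.2, increment=0.2, period=25):
    """Fraction of the weights quantized at a given 0-indexed epoch."""
    steps = epoch // period
    # Rounding removes floating-point residue such as 0.6000000000000001.
    return min(1.0, round(start + increment * steps, 10))


# Ratio at selected epochs of a 120-epoch run.
schedule = [quantization_ratio(e) for e in (0, 24, 25, 50, 75, 100, 119)]
```
]]></preformat>
        <p>Under these assumptions the ratio reaches 1.0 at epoch 100, so the last 20 of the 120 epochs train a fully quantized model.</p>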
      </sec>
      <sec id="sec-5-2">
        <title>5.2 Retaining robustness of quantized models</title>
        <p>For white-box attack tests, we use a 20-step PGD attacker with step size 0.1, which is slightly stronger than that used for training. We also analyze the robustness against other adversarial attacks, using ε = 16/255 FGSM to study one-step attacks, and 100-step ε = 1 DDN and 20-step ε = 1 C&amp;W to study L2-bounded attacks.</p>
        <sec id="sec-5-2-1">
          <p>Table columns: Clean, FGSM, PGD, DDN, C&amp;W. Models: NF, NEB, AF, AVB, AEB, AET.</p>
          <p>(b) Transfer attack accuracy (in %). Attacks are generated by row and applied by column; for example, the AF model reaches an accuracy of 60.84% on adversarial inputs generated with the NEB model.</p>
          <p>For transfer attack tests, all adversarial samples are generated by the same PGD attacker as in the white-box stage. When the model generating the attacks and the model being attacked share the same setting, we train and quantize a separate model from scratch.</p>
          <p>As shown in Tables 2a and 3a, the vanilla quantized models exhibit weak robustness, and adversarial training before quantization helps little. With conventional methods, the robustness gained by adversarial training is degraded to nearly none. With our method, in contrast, the accuracy stays around or above that of full-precision models on both datasets. Compared to the gap left by vanilla quantization, our method is shown to limit the harsh drop to a reasonably small level and works for both naturally and adversarially trained models.</p>
          <p>In the cross transfer attack scenario (Tables 2b and 3b), our robustly quantized models achieve sound results. For adversarial attacks generated from NF models, which is often the situation, the proposed method helps quantized models steadily beat the AF model. Our method also establishes solid defenses against other attacks; for example, in Table 3b the -EQ- models exceed the AF model under attacks generated by other quantized models.</p>
          <p>We also notice that the NEB model and the AEB model perform almost the same, which further demonstrates an advantage of our method: adversarial training before quantization is not required. Lastly, the method manages to maintain and even improve accuracy on clean data.</p>
          <p>(b) Transfer attack accuracy (in %). Attacks are generated by row and applied by column; for example, the AF model reaches an accuracy of 36.80% on adversarial inputs generated with the NEB model.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3 Silencing the Error Amplification Effect</title>
        <p>We re-evaluate the error in latent layers to investigate whether
the method manages to silence it. The relative distance is
defined in (4) and sampled after every ResNet module. The
experiment is conducted on ResNet-152 and CIFAR-100.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.3.1 Results</title>
        <p>Though the input is perturbed by the same magnitude, the error is amplified quite differently across models, as shown in Figure 2. With conventional quantization, the error of the ADV-VQ-BWN model grows to up to 4 times that of the ADV-Full model, which is a possible explanation for the large robustness drop. With our method, in contrast, the models keep the error below that of their full-precision counterparts throughout the inference.</p>
        <p>[Xu et al., 2018] conclude that image quantization, i.e., reduction in color bit depth, is an effective defense. However, quantization of network weights instead weakens robustness. [Lin et al., 2019] showed that it tends to intensify the Error Amplification Effect when ε &gt; 3/255, and in our experiments this already starts from ε = 1/255 (Table 1). Our method overcomes this threshold and pushes it beyond ε = 8/255, as shown in Figure 2.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.4 Beyond standalone adversarial training</title>
        <p>To demonstrate the necessity of pairing, we add experiments on adversarial training during vanilla quantization on ResNet-152 and CIFAR-100.</p>
        <p>For adversarial training during vanilla quantization, models are fed with perturbed samples only and updated by the original min-max optimization. All adversarial samples are generated with the same PGD attacker as in the white-box section and all models are quantized for 120 epochs.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.4.1 Results</title>
        <p>As shown in the upper part of Table 4, adversarial training during vanilla quantization retains limited robustness and does not match our method. For naturally trained models, adversarial training raises robustness to 19% against PGD but lags 1% behind our method. For adversarially trained models, adversarial training fails to maintain the robustness and leaves a drop of 5.5%, which is three times ours.</p>
        <p>We hold that the following hypotheses may explain the inconsistent performance of adversarial training in ordinary training and in quantization: (i) Quantization limits the capacity of the model, while adversarial training requires a significantly larger capacity. (ii) With limited capacity, the model has difficulty learning and therefore suffers from lower accuracy on both clean and adversarial inputs. In contrast, with our method the model learns to predict only clean inputs and to infer close activations on adversarial inputs.</p>
        <p>
          We run additional experiments on 2-bit quantization to test the hypothesis above. Though TWN models reach higher accuracy on the training set, which supports our hypothesis that adversarial training is hindered by limited network capacity, they attain the same or even inferior results on the test set compared to BWN models. This suggests that while higher bandwidth enables adversarial training, it itself undermines robustness
          <xref ref-type="bibr" rid="ref10 ref19 ref21 ref24 ref4 ref5">[Lin et al., 2019]</xref>. In contrast, our method better balances the trade-off between adversarial training and low-bandwidth weights.
        </p>
        <p>In this section, we discuss the equivalence of different pairing schemes and take pairing logits as a universal pairing. We also discuss the obfuscated gradients problem, which undermines many previous defenses, and further secure the robustness of our method.</p>
        <p>While we offer a general pairing objective in (6) and (7) that can cover any layers, only the output logits are paired in the experiments. Here we show that, though pairing the activations may produce lower errors, pairing the logits achieves the same accuracy and better balances training cost and performance. We investigate ResNet-152 on CIFAR-100 and pair the activations after the 4th, 12th and 48th ResNet modules.</p>
        <p>In Table 5, the close accuracy of the two pairing schemes confirms that pairing more activations provides minor improvements while requiring considerable additional computation and storage of intermediate results. It brings a large memory cost, especially when training on GPUs. Furthermore, pairing activations may introduce unnecessary requirements on network capacity, as in the case of adversarial training. The smaller gap between the two pairing settings on TWN also points to this.</p>
      </sec>
      <sec id="sec-5-7">
        <title>6.2 Secure the sense of robustness</title>
        <p>A noticeable coincidence is that our simplified activation pairing scheme, pairing logits, is considerably similar to the Adversarial Logit Pairing proposed in [Kannan et al., 2018]. With that method, the authors claim state-of-the-art robustness on ImageNet. However, it was found [Athalye et al., 2018] to suffer severely from obfuscated gradients and to provide a false sense of security that can be easily circumvented with non-gradient-based attacks.</p>
        <p>In [Athalye et al., 2018], it is reported that defenses suffering from obfuscated gradients are vulnerable to black-box attacks that operate by estimating gradients instead of computing them directly. To thoroughly examine whether our method is truly secure, we test it with the L2-bounded Boundary attack [Brendel et al., 2018] and N Attack [Li et al., 2019] as decision-based and score-based black-box attacks, respectively. We vary the perturbation strength from ε = 0 to ε = 4 and compare the accuracy of quantized models with their full-precision counterparts.</p>
        <p>As shown in Figures 3a and 3b, our quantized models achieve accuracy consistently close to or better than that of the ADV-Full model as the strength varies. All results confirm that our method does not suffer from the obfuscated gradient problem and provides a genuine sense of robustness. A possible explanation is that we use untargeted attacks for training while [Kannan et al., 2018] use targeted attacks.</p>
        <p>Figure 3: (a) Decision-based Boundary attack test accuracy (in %). (b) Score-based N Attack test accuracy (in %).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This paper aims to tackle the issue of achieving both robustness and compactness in DNNs. Inspired by the Error Amplification Effect, we relax the capacity requirements of adversarial training by pairing, and propose a quantization that optimizes accuracy on benign and adversarial inputs simultaneously. Extensive experiments across four threat models, two datasets and two networks support the superior robustness of the proposed method over vanilla approaches and even full-precision counterparts, while still reaching high compression rates. Further shown to be free from obfuscated gradients, our method manages to bridge robustness and compactness for DNNs and further applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Athalye et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Anish</given-names>
            <surname>Athalye</surname>
          </string-name>
          , Nicholas Carlini, and
          <string-name>
            <given-names>David A.</given-names>
            <surname>Wagner</surname>
          </string-name>
          .
          <article-title>Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples</article-title>
          .
          <source>In ICML</source>
          , volume
          <volume>80</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Brendel et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Wieland</given-names>
            <surname>Brendel</surname>
          </string-name>
          , Jonas Rauber, and
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Bethge</surname>
          </string-name>
          .
          <article-title>Decision-based adversarial attacks: Reliable attacks against black-box machine learning models</article-title>
          .
          <source>In ICLR. OpenReview.net</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Carlini and Wagner,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Nicholas</given-names>
            <surname>Carlini</surname>
          </string-name>
          and
          <string-name>
            <given-names>David A.</given-names>
            <surname>Wagner</surname>
          </string-name>
          .
          <article-title>Towards evaluating the robustness of neural networks</article-title>
          .
          <source>In IEEE Symposium on Security and Privacy</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Devlin et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In NAACL-HLT</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Dong et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Yinpeng</given-names>
            <surname>Dong</surname>
          </string-name>
          , Renkun Ni,
          <string-name>
            <given-names>Jianguo</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yurong</given-names>
            <surname>Chen</surname>
          </string-name>
          , Hang Su, and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Stochastic quantization for learning accurate low-bit deep neural networks</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Eykholt et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Eykholt</surname>
          </string-name>
          , Ivan Evtimov, Earlence Fernandes,
          <string-name>
            <given-names>Bo</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amir</given-names>
            <surname>Rahmati</surname>
          </string-name>
          , Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and
          <string-name>
            <given-names>Dawn</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>Robust physical-world attacks on deep learning visual classification</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Galloway et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Angus</given-names>
            <surname>Galloway</surname>
          </string-name>
          , Graham W. Taylor, and Medhat Moussa.
          <article-title>Attacking binarized neural networks</article-title>
          .
          <source>In ICLR. OpenReview.net</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Goodfellow et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Ian J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , Jonathon Shlens, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <article-title>Explaining and harnessing adversarial examples</article-title>
          .
          <source>In ICLR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Graves et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Abdel-rahman</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Speech recognition with deep recurrent neural networks</article-title>
          .
          <source>In ICASSP</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Gui et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Shupeng</given-names>
            <surname>Gui</surname>
          </string-name>
          , Haotao Wang,
          <string-name>
            <given-names>Haichuan</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chen</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhangyang</given-names>
            <surname>Wang</surname>
          </string-name>
          , and Ji Liu.
          <article-title>Model compression with adversarial robustness: A unified optimization framework</article-title>
          .
          <source>In NeurIPS</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [He et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Jacob et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Benoit</given-names>
            <surname>Jacob</surname>
          </string-name>
          , Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, and
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          .
          <article-title>Quantization and training of neural networks for efficient integer-arithmetic-only inference</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Kannan et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Harini</given-names>
            <surname>Kannan</surname>
          </string-name>
          , Alexey Kurakin, and
          <string-name>
            <given-names>Ian J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          .
          <article-title>Adversarial logit pairing</article-title>
          .
          <source>CoRR</source>
          , abs/1803.06373,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Kingma and Ba,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>In ICLR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Krizhevsky and Hinton,
          <year>2009</year>
          ]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Learning multiple layers of features from tiny images</article-title>
          .
          <source>Technical report, University of Toronto</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Krizhevsky et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Kurakin et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Alexey</given-names>
            <surname>Kurakin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ian J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samy</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Adversarial machine learning at scale</article-title>
          .
          <source>In ICLR. OpenReview.net</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Li and Liu,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Fengfu</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bin</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Ternary weight networks</article-title>
          .
          <source>In NIPS workshop on EMDNN</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Li et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Yandong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lijun</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Liqiang</given-names>
            <surname>Wang</surname>
          </string-name>
          , Tong Zhang, and Boqing Gong.
          <article-title>NATTACK: learning the distributions of adversarial examples for an improved black-box attack on deep neural networks</article-title>
          .
          <source>In ICML</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Liao et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Fangzhou</given-names>
            <surname>Liao</surname>
          </string-name>
          , Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Defense against adversarial attacks using high-level representation guided denoiser</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Lin et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Ji</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chuang</given-names>
            <surname>Gan</surname>
          </string-name>
          , and Song Han.
          <article-title>Defensive quantization: When efficiency meets robustness</article-title>
          .
          <source>In ICLR. OpenReview.net</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Madry et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Aleksander</given-names>
            <surname>Madry</surname>
          </string-name>
          , Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and
          <string-name>
            <given-names>Adrian</given-names>
            <surname>Vladu</surname>
          </string-name>
          .
          <article-title>Towards deep learning models resistant to adversarial attacks</article-title>
          .
          <source>In ICLR. OpenReview.net</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Rastegari et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Mohammad</given-names>
            <surname>Rastegari</surname>
          </string-name>
          , Vicente Ordonez, Joseph Redmon, and
          <string-name>
            <given-names>Ali</given-names>
            <surname>Farhadi</surname>
          </string-name>
          .
          <article-title>XNOR-Net: ImageNet classification using binary convolutional neural networks</article-title>
          .
          <source>In ECCV</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Rony et al.,
          <year>2019</year>
          ] Jérôme Rony, Luiz G. Hafemann, Luiz S. Oliveira, Ismail Ben Ayed, Robert Sabourin, and
          <string-name>
            <given-names>Eric</given-names>
            <surname>Granger</surname>
          </string-name>
          .
          <article-title>Decoupling direction and norm for efficient gradient-based L2 adversarial attacks and defenses</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Sharif et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Mahmood</given-names>
            <surname>Sharif</surname>
          </string-name>
          , Sruti Bhagavatula, Lujo Bauer, and
          <string-name>
            <given-names>Michael K.</given-names>
            <surname>Reiter</surname>
          </string-name>
          .
          <article-title>Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition</article-title>
          .
          <source>In ACM CCS</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [Su et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Dong</given-names>
            <surname>Su</surname>
          </string-name>
          , Huan Zhang, Hongge Chen, Jinfeng Yi,
          <string-name>
            <given-names>Pin-Yu</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yupeng</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <article-title>Is robustness the cost of accuracy? - A comprehensive study on the robustness of 18 deep image classification models</article-title>
          .
          <source>In ECCV</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Szegedy et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan,
          <string-name>
            <given-names>Ian J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rob</given-names>
            <surname>Fergus</surname>
          </string-name>
          .
          <article-title>Intriguing properties of neural networks</article-title>
          .
          <source>In ICLR</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Wu et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Jiaxiang</given-names>
            <surname>Wu</surname>
          </string-name>
          , Cong Leng, Yuhang Wang,
          <string-name>
            <given-names>Qinghao</given-names>
            <surname>Hu</surname>
          </string-name>
          , and Jian Cheng.
          <article-title>Quantized convolutional neural networks for mobile devices</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Xu et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Weilin</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David</given-names>
            <surname>Evans</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yanjun</given-names>
            <surname>Qi</surname>
          </string-name>
          .
          <article-title>Feature squeezing: Detecting adversarial examples in deep neural networks</article-title>
          .
          <source>In NDSS</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [Zagoruyko and Komodakis,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Zagoruyko</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nikos</given-names>
            <surname>Komodakis</surname>
          </string-name>
          .
          <article-title>Wide residual networks</article-title>
          .
          <source>In BMVC</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>