<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Calibrated Multi-Probabilistic Prediction as a Defense against Adversarial Attacks</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jonathan</forename><surname>Peck</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Applied Mathematics, Computer Science and Statistics</orgName>
								<orgName type="institution">Ghent University</orgName>
								<address>
									<postCode>9000</postCode>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Data Mining and Modeling for Biomedicine</orgName>
								<orgName type="institution">VIB Inflammation Research Center</orgName>
								<address>
									<postCode>9052</postCode>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bart</forename><surname>Goossens</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">Department of Telecommunications and Information Processing</orgName>
								<orgName type="institution">IMEC/Ghent University</orgName>
								<address>
									<postCode>9000</postCode>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yvan</forename><surname>Saeys</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Applied Mathematics, Computer Science and Statistics</orgName>
								<orgName type="institution">Ghent University</orgName>
								<address>
									<postCode>9000</postCode>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Data Mining and Modeling for Biomedicine</orgName>
								<orgName type="institution">VIB Inflammation Research Center</orgName>
								<address>
									<postCode>9052</postCode>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Calibrated Multi-Probabilistic Prediction as a Defense against Adversarial Attacks</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">366F12193C6E79338C0A6E41F5D637F9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:01+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Machine learning techniques have made great progress in recent years, achieving state-of-the-art performance in areas such as natural language processing <ref type="bibr" target="#b2">[3]</ref> as well as image and speech recognition <ref type="bibr" target="#b1">[2]</ref>. However, the theoretical properties of the deep neural networks responsible for this success remain poorly understood. At present, there is no theory which satisfactorily explains the success of deep learning, and many open questions remain <ref type="bibr" target="#b5">[6]</ref>. A peculiar example of this lack of theoretical understanding is the existence of so-called adversarial perturbations <ref type="bibr" target="#b0">[1]</ref>: small modifications to the inputs of a model which can drastically change its output, even though the alterations are insignificant to a human observer.</p><p>In this work, we propose a novel defense against adversarial manipulation, the MultIVAP, which aims to scale to realistic problems and provide non-trivial robustness. It is based on methods from conformal prediction and therefore enjoys frequentist guarantees of validity <ref type="bibr" target="#b3">[4]</ref>. Empirical evaluations as well as theoretical results support the idea that our defense can be scaled to realistic models. We evaluate our method against existing (oblivious) adversarial attacks as well as a white-box attack specifically designed to fool the MultIVAP. We find that these attacks have limited success when the norms of the perturbations are reasonably constrained.</p><p>The basic construction of the MultIVAP is as follows. Given any machine learning classifier, we use the inductive Venn-ABERS predictor algorithm by Vovk et al. <ref type="bibr" target="#b4">[5]</ref> in a one-vs-all manner in order to obtain a pair of probabilities <formula xml:id="formula_0">(p_0^(i), p_1^(i))</formula> for each class. Intuitively, the pair (p_0^(i), p_1^(i)) forms lower and upper bounds on the probability that the given sample belongs to class i. These probabilities are then processed into a multi-probabilistic prediction by solving a mixed integer linear program (MILP). The output of the MultIVAP is the solution to this optimization problem, which consists of a vector of bits (α_1, ..., α_K). Here, α_i indicates whether we can accept the label i for the given input at the ε significance level, where ε ∈ [0, 1] is a user-specified parameter. Table <ref type="table" target="#tab_0">1</ref> shows experimental results when we evaluate the MultIVAP on four different image recognition tasks. For each task, we report several metrics: η, the ℓ_∞ norm bound on the magnitude of the perturbations; ε, the significance level at which these results were obtained; the accuracy of the MultIVAP and of the underlying model; and the adversarial error of the MultIVAP. Note that on three out of four tasks, the MultIVAP increases the accuracy of the classifier. Moreover, the adversarial error of the MultIVAP is significantly lower than that of unprotected machine learning classifiers evaluated against adaptive white-box attacks, which is almost invariably close to 100%. The computational overhead incurred by this construction is roughly linear in the number of classes of the task.</p><p>We conclude that the MultIVAP is a computationally efficient procedure for protecting multi-class classifiers against adversarial perturbations. We make our code available at https://github.com/saeyslab/multivap.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1:</head><label>1</label><figDesc>Results of the MultIVAP on the adversarial white-box attack.</figDesc><table><row><cell>Task</cell><cell>η</cell><cell>ε</cell><cell>Accuracy (baseline)</cell><cell>Adversarial error</cell></row><row><cell>Fashion-MNIST</cell><cell>0.3</cell><cell>24.20%</cell><cell>94.22% (93.84%)</cell><cell>18.96%</cell></row><row><cell>CIFAR-10</cell><cell>0.03</cell><cell>20.77%</cell><cell>83.36% (81.51%)</cell><cell>27.55%</cell></row><row><cell>Asirra</cell><cell>0.03</cell><cell>41.86%</cell><cell>88.56% (89.04%)</cell><cell>47.57%</cell></row><row><cell>SVHN</cell><cell>0.03</cell><cell>25.23%</cell><cell>96.81% (96.40%)</cell><cell>9.85%</cell></row></table></figure>
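To make the acceptance step concrete, the following is a minimal illustrative sketch, not the MILP described above: it simply thresholds the one-vs-all Venn-ABERS bounds, accepting label i unless even its upper probability p_1^(i) falls below ε. The function name `accept_labels` and the thresholding rule are our own assumptions for illustration; the actual MultIVAP derives the bit vector by solving the optimization problem.

```python
from typing import List, Tuple

def accept_labels(bounds: List[Tuple[float, float]], eps: float) -> List[int]:
    """Illustrative acceptance rule over Venn-ABERS probability bounds.

    bounds[i] = (p0_i, p1_i): lower and upper bounds on the probability
    that the sample belongs to class i, from one-vs-all Venn-ABERS
    predictors. A label is accepted (alpha_i = 1) when it cannot be
    ruled out at significance eps, i.e. its upper bound is >= eps.
    """
    return [1 if p1 >= eps else 0 for (_p0, p1) in bounds]

# Three classes: one strongly supported, one negligible, one uncertain.
bounds = [(0.70, 0.85), (0.02, 0.04), (0.10, 0.20)]
print(accept_labels(bounds, eps=0.05))  # → [1, 0, 1]
```

Note that, unlike a plain argmax classifier, such a rule can accept several labels at once (or none), which is what makes the prediction multi-probabilistic rather than a single point estimate.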
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We thank the NVIDIA Corporation for the donation of a Titan Xp GPU with which we were able to carry out our experiments. Jonathan Peck is sponsored by a fellowship of the Research Foundation Flanders (FWO). Yvan Saeys is an ISAC Marylou Ingram scholar. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Wild patterns: Ten years after the rise of adversarial machine learning</title>
		<author>
			<persName><forename type="first">Battista</forename><surname>Biggio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabio</forename><surname>Roli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">84</biblScope>
			<biblScope unit="page" from="317" to="331" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Lingvo: a modular and scalable framework for sequence-to-sequence modeling</title>
		<author>
			<persName><forename type="first">Jonathan</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Patrick</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yonghui</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhifeng</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mia</forename><forename type="middle">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ye</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anjuli</forename><surname>Kannan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tara</forename><surname>Sainath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuan</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chung-Cheng</forename><surname>Chiu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1902.08295</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The Evolved Transformer</title>
		<author>
			<persName><forename type="first">David</forename><surname>So</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chen</forename><surname>Liang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">Kamalika</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Ruslan</forename><surname>Salakhutdinov</surname></persName>
		</editor>
		<meeting>the 36th International Conference on Machine Learning<address><addrLine>Long Beach, California, USA</addrLine></address></meeting>
		<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2019-06">June 2019</date>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="page" from="5877" to="5886" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Algorithmic learning in a random world</title>
		<author>
			<persName><forename type="first">Vladimir</forename><surname>Vovk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Gammerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Glenn</forename><surname>Shafer</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2005">2005</date>
			<publisher>Springer Science &amp; Business Media</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Large-scale probabilistic predictors with and without guarantees of validity</title>
		<author>
			<persName><forename type="first">Vladimir</forename><surname>Vovk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ivan</forename><surname>Petej</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Valentina</forename><surname>Fedorova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="892" to="900" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Understanding deep learning requires rethinking generalization</title>
		<author>
			<persName><forename type="first">Chiyuan</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samy</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Moritz</forename><surname>Hardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benjamin</forename><surname>Recht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Vinyals</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1611.03530</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
