Combining Universal Adversarial Perturbations

Maurus Kühne1[0000-0002-4205-3552] and Beat Tödtli2[0000-0003-3674-2340]

1 Fernfachhochschule Schweiz, maurus.kuehne@students.ffhs.ch
2 Institut für Informations- und Prozessmanagement, FHS St. Gallen, beat.toedtli@ost.ch

Copyright © 2020 by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Universal adversarial perturbations (UAPs) are small perturbations imposed on images that are able to fool a single convolutional neural network image classifier. They have been shown to generalise well to other neural networks. Here, we report on our reproduction effort of the results given in a work by Moosavi-Dezfooli et al. on UAPs and study two methods to construct UAPs for several neural networks. While the results are not strong enough to draw general conclusions, they suggest that UAPs indeed profit from being constructed on several neural networks. Also, we show that a linear interpolation between two UAPs does not produce a viable UAP on both networks.

Keywords: Adversarial Training · Universal Adversarial Perturbation

1 Introduction

The discovery of Szegedy et al. [12] that several machine learning models, including deep neural networks, are vulnerable to adversarial attacks was seminal for a new subfield of studying deep learning. Probably the most intriguing, but also unsettling result was that adversarial examples can be made quite imperceptible to the human eye while still fooling a convolutional neural network into misclassifying the image [3]. Subsequent work has developed various algorithms in a variety of white-box, grey-box and black-box attack scenarios as well as defensive strategies such as adversarial training [11]. Moosavi-Dezfooli et al. [5,6] have demonstrated that universal perturbations exist, i.e. that a single set of pixel modifications can be found that fools a network on a large fraction of the training data set. Moreover, universal adversarial perturbations (UAPs) also fool other convolutional networks. The authors of [5,6] show good generalization results for UAPs generated with their procedure DeepFool [6] across different deep learning architectures.

These results suggest that neural networks, and convolutional neural networks for image classification in particular, partly share a common structure that can be exploited by universal adversarial perturbations (UAPs), while yet other aspects are different. Thus, understanding UAPs provides a window into the weaknesses of neural networks. By building UAPs that are viable on several networks, a common weakness of those networks is identified. This in turn helps in building more robust networks, for example again by adversarial training. Even more generally, therefore, we view work on UAPs as one more way of (partly) approaching the bigger question of why neural networks with a large number of weights work at all (or when they don't).

In this context, we ask whether combining two neural networks in generating adversarial perturbations can improve the transferability of UAPs to new convolutional neural networks. We observe that in the approach by Moosavi-Dezfooli et al. [5,6], implementing such a combination is particularly simple. We report on our effort to reproduce their results, provide our code and investigate whether modifications of their UAP algorithm are able to improve the generalisation capability of UAPs.
To do so, we devise modifications of the UAP generation algorithm that take into account several networks at the same time3. Specifically, we assess whether incorporating information from a second neural network architecture improves the fooling rate of UAPs on a third neural network. We investigate three combination procedures and compare them with the original adversarial attack procedure.

3 Instead of combining UAPs from different networks, UAPs of networks trained on different data domains can be combined, as done e.g. by Naseer et al. [9].

Given the practical potential and relevance of Deep Learning and the potential security threats of adversarial perturbations, finding a robust resolution is important and urgent. However, research into the topic is hampered by the reproducibility crisis in machine learning [10]. Research results are often difficult to reproduce due to undocumented values for hyperparameters, software library versions etc. [4]. As this was also the case for our reproduction efforts of the UAP results of Moosavi-Dezfooli et al., we endorse the NeurIPS-2019 code submission policy by providing our code and including their reproducibility checklist.

This paper is organised as follows. In Sec. 2 we briefly review related work. In Sec. 3 we present the basic methods used to generate adversarial and universal adversarial perturbations as introduced by Moosavi-Dezfooli et al. [5]. We then present two modifications of their UAP generation algorithm to combine UAPs for several networks and investigate linear interpolations between UAPs. In Sec. 4 we first describe our reproduction effort of the original work of Moosavi-Dezfooli et al. [5] to produce UAPs on a set of networks and provide our fooling rates. We then present our results on the three methods to produce UAPs for several networks and show that the transferability to a third network is improved. In Sec. 5 we discuss these results and conclude with Sec. 6.

2 Related Work

Many authors have suggested adversarial perturbation generating methods in various different settings. These attempts often generate per-instance perturbations. Among the image-agnostic methods, i.e. those generating UAPs, one can distinguish data-driven and data-independent techniques, white-box and black-box attacks (depending on whether the internal structure of the network to be attacked is accessible to the attacker), and targeted and untargeted attacks (i.e. whether a misclassification into a particular class is required, or whether any misclassification is counted as a success). For a review and references, see [1].

UAPs can be generated e.g. by gradient descent on a loss function or learned using generative models. These methods being data-driven, they require access to training data, preferably the training data of the network to be fooled. Data-independent techniques such as Fast Feature Fool [8] or GD-UAP [7] do not need this access, but usually have access to the internal state of the network to be fooled. Our approach is a data-dependent white-box attack closely tied to the UAPs of Moosavi-Dezfooli et al. [5].

Many of these methods generate perturbations that transfer well between different model architectures. As part of our contribution we seek methods that optimise the transferability of perturbations between models. Our approach tries to achieve this by using multiple models to generate perturbations.
3 Methods

3.1 General Setting

Given an image classifier k̂(x) = sgn(f(x)) that is based on the sign of a classification function f : Rⁿ → R, adversarial attacks seek a perturbation v such that k̂(x + v) = sgn(f(x + v)) ≠ sgn(f(x)) = k̂(x).

In the following we briefly describe the adversarial perturbation generating method DeepFool and its generalisations to UAPs in the multi-class classification setting. We then describe the two approaches analysed here to build UAPs from several networks.

3.2 DeepFool

In the generation procedure DeepFool [6], the perturbation v for an image x is defined to be the shortest vector4 (using the Lp-norm ‖·‖p) such that x + v lies on a decision boundary. If f(x) is a linear function f(x) = wᵀx + b, then for the ℓ2-norm v can be shown to be v = −(f(x)/‖w‖₂²) w. As f is nonlinear in general, the Taylor approximation of the function f around x, f(x + v) ≈ f(x) + (∇f(x))ᵀ v, is used to iteratively reach the decision boundary. In the multi-class setting considered here, an additional step is required that identifies the closest decision boundary.

4 In practice and also in our code, the vectors (elements of a vector space) such as x and v have additional structure such as (for images) height, width and depth that is used to implement the classifier k̂. In the field of Deep Learning these data structures are therefore often called tensors.

3.3 Universal Adversarial Perturbations with DeepFool

Perturbations for each image in a dataset X (such as those generated using DeepFool) can be combined to form universal adversarial perturbations for a single network [5]. The procedure is given in Alg. 1 for a classifier set K with one element. Essentially, DeepFool perturbations of images that are not yet misclassified are added to obtain a universal perturbation. Whenever the norm of the perturbation becomes large, a rescaling is applied. The perturbation is scaled back to satisfy a norm bound given by ‖v‖p ≤ ξ (which potentially undoes the successful perturbation of some images). For a small value of ξ, this ensures that the perturbation remains largely invisible.

Intriguingly, although the directions to the class boundaries vary for different training images, the resulting average over all image perturbations works well according to [5], even for other convolutional neural networks whose class boundaries might be expected to look rather different.

3.4 Multi-Classifier Universal Adversarial Perturbations

In the following subsections, we detail two approaches to generating UAPs for several classifiers.

Alternated Generation of Perturbations Since UAPs are constructed by adding up the perturbations generated using DeepFool, there is a natural way to combine perturbations generated by the two networks: we add up the contributions from all networks. We note that adding up the perturbations is not a commutative operation, since projections take place once the size (norm) of the perturbation becomes too large. The precise procedure used here is given in Alg. 1.

Variants of Alg. 1 exist that sample the images differently. One might for example generate perturbations by alternating the classifier for each image. We show this variant in Alg. 2. We evaluate both variants and compare their performance in Tab. 2.
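The full pseudocode is given in Alg. 1 and Alg. 2 below. For concreteness, the accumulation-and-rescaling step shared by both variants can be sketched in Python as follows. This is a minimal illustration only, assuming PyTorch classifiers that return class logits and an externally supplied deepfool routine for the per-image perturbation; all names are illustrative and do not correspond to our actual implementation.

import random
import torch

def combined_uap(images, classifiers, deepfool, xi=10.0, p=float("inf"),
                 max_epochs=20, target_rate=0.8):
    """Sketch of Alg. 1: accumulate per-image DeepFool perturbations over all
    classifiers and rescale whenever the perturbation norm exceeds xi."""
    v = torch.zeros_like(images[0])           # images: list of image tensors
    for _ in range(max_epochs):
        fooled, checked = 0, 0
        for x in images:
            for clf in classifiers:
                pred_clean = clf(x.unsqueeze(0)).argmax(dim=1)
                pred_pert = clf((x + v).unsqueeze(0)).argmax(dim=1)
                checked += 1
                if torch.equal(pred_clean, pred_pert):   # image not yet fooled
                    dv = deepfool(clf, x + v)            # per-image perturbation
                    v = (v + dv).detach()
                    if v.norm(p) > xi:                   # keep ||v||_p <= xi
                        v = xi * v / v.norm(p)
                else:
                    fooled += 1
        if fooled / checked >= target_rate:              # average fooling rate reached
            break
        random.shuffle(images)                           # reshuffle for the next pass
    return v

Here the rescaling is applied only when the norm bound is exceeded, which is one way to realise the rescaling step of Alg. 1.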
Interpolation Between UAPs on Individual Networks A simple yet instructive alternative to the above rather involved perturbation construction is given by a simple weighted average of the perturbation vectors generated on the individual networks. If we restrict our attention to the combination of two neural networks for now, then the weighted averages of the perturbations lie on a line in the high-dimensional vector space of images. In the following, we specifically investigate whether any perturbation lying on this line improves on the fooling capability of the two endpoints with respect to a third network. More formally, given UAPs vf and vg of two neural networks f and g, we consider

vfg(λ) = λ vf + (1 − λ) vg    (1)

for values λ ∈ [0, 1]. We seek the value of λ that maximises the average fooling rate over f, g and a third network, given the current training data set.

Outlook to Other Approaches We have also investigated other, more involved approaches that led to unsatisfactory results. In particular, we were interested in constructing a multi-class DeepFool procedure that uses the gradients of two networks towards the next class boundary to compute a perturbation of a single image. One might try to find perturbations towards class boundaries that are aligned as much as possible, but where the class boundaries correspond to different classes in different networks. As of now, these attempts have not provided efficient UAPs for multiple networks.

Input: data set X, set of classifiers K, norm ‖·‖p, perturbation bound ξ
Output: universal perturbation vector v
Initialise v ← 0
while average fooling rate is too low and max. number of iterations is not reached do
    foreach image x ∈ X do
        foreach k̂ ∈ K do
            if k̂(x) = k̂(x + v) then
                ∆v ← DeepFool(x + v, k̂)
                v ← v + ∆v
                v ← ξ v / ‖v‖p
            end
        end
    end
    Shuffle X
end
return v

Algorithm 1: Computation of universal adversarial perturbations for multiple neural networks. The function DeepFool computes an adversarial perturbation as described in Sec. 3.2. For each image, perturbation updates are computed for all classifiers, added up and rescaled to have norm ξ. Note that the method is identical to the single-classifier UAP method in [5] when the set K consists of a single classifier.

Input: data set X, ordered set of classifiers K, norm ‖·‖p, perturbation bound ξ
Output: universal perturbation vector v
Initialise v ← 0
while average fooling rate is too low and maximum number of iterations is not reached do
    foreach image x ∈ X do
        k̂ ← first element of K
        K ← cyclic rotation of K
        if k̂(x) = k̂(x + v) then
            ∆v ← DeepFool(x + v, k̂)
            v ← v + ∆v
            v ← ξ v / ‖v‖p
        end
    end
    Shuffle X
end
return v

Algorithm 2: Computation of universal adversarial perturbations for multiple neural networks. For each image, perturbation updates are computed for the next classifier in the classifier sequence. Then the sequence is cyclically rotated.
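The only difference between Alg. 1 and Alg. 2 is which classifiers are consulted for a given image. A hedged sketch of the Alg. 2 scheduling, reusing the assumptions and names of the sketch above, could look as follows; the surrounding epoch loop and fooling-rate check are as before.

from collections import deque
import torch

def alg2_inner_loop(images, classifiers, deepfool, v, xi=10.0, p=float("inf")):
    """One pass over the data set: each image is handled by the next classifier
    in the cyclically rotated sequence, as in Alg. 2."""
    queue = deque(classifiers)
    for x in images:
        clf = queue[0]
        queue.rotate(-1)                     # cyclic rotation of K
        if torch.equal(clf(x.unsqueeze(0)).argmax(dim=1),
                       clf((x + v).unsqueeze(0)).argmax(dim=1)):
            v = (v + deepfool(clf, x + v)).detach()
            if v.norm(p) > xi:               # keep ||v||_p <= xi
                v = xi * v / v.norm(p)
    return v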
4 Results

In this section we first discuss our attempt at the reproduction of the results in [5] regarding the transferability of UAPs across neural network architectures. We then provide the results of our approaches to construct UAPs based on two neural network architectures at the same time.

4.1 Reproduction of the Original Results on Universal Adversarial Perturbations

Moosavi-Dezfooli et al. tested the DeepFool and Universal Adversarial Perturbations algorithms on 5 different neural networks [5]. We choose the same networks as the ones used in [5] and use publicly available pretrained weights for all models, since the weights used in [5] were not specified. We compare the achieved fooling rates with those given in the original paper. Separate experiments have been performed by training individual UAPs on each of the networks and subsequently measuring the fooling rates on all available networks. All perturbations were generated using the same random subset of 10'000 images of the ImageNet training set [2], and the fooling rates were measured on the ImageNet validation set (containing 50'000 images).

Fig. 1 shows the perturbations generated. In their general structure and appearance, they are similar to the ones reported in [5]. In particular, the fine line-shaped structures in green and magenta are quite recognisable and are present also in the original results in [5]. We believe that this indicates that deviations from the original results reported below stem from configuration details rather than a fundamental reproduction mistake. Therefore and despite the lower fooling rates reported below, we feel justified to use this setup to study our methods to generate a combined UAP for several networks.

Figure 1. Visualisation of the generated perturbations (one panel each for Inception, VGG-16, VGG-19, VGG-F and ResNet-152). To visualise the perturbations, we shifted them by +ξ and extended them to the entire colour space with a scalar multiplication.

Tab. 1 shows the achieved fooling rates for the tested models (left values, in boldface) and the fooling rates reported by Moosavi-Dezfooli et al. [5] (right values). The first column indicates the network used to generate the UAP while the first row gives the network on which the fooling rate is measured. The main diagonal therefore contains the self-fooling rates, i.e. the fooling rates achieved using the same model for generating the perturbation and measuring the fooling rate.

Table 1. Fooling rates using the UAP method on the ImageNet validation set for several neural networks. Left values are our reproduction results; the values to the right are the results reported in [5].

              VGG-F       Inception   VGG-16      VGG-19      ResNet-152
  VGG-F       90% / 94%   56% / 48%   32% / 42%   32% / 42%   24% / 47%
  Inception   42% / 46%   82% / 79%   16% / 39%   16% / 40%   16% / 46%
  VGG-16      46% / 63%   60% / 57%   59% / 78%   52% / 73%   38% / 63%
  VGG-19      45% / 64%   58% / 54%   51% / 74%   54% / 78%   33% / 58%
  ResNet-152  42% / 46%   55% / 51%   31% / 47%   33% / 46%   73% / 84%

To reproduce these numbers, several insufficiently documented design choices had to be researched. For example, the original results are not stated for a given epoch but using a stopping condition on the error rate. Values reported here are for epoch 20, at which point the mean value over the last 5 epochs typically varies by less than 0.005. Parameter values are p = ∞ and ξ = 10. We used a maximum of 10 DeepFool update iterations. The overshoot parameter was η = 0.02. These parameter values were taken from the work of Moosavi-Dezfooli et al. [5,6].

Another important parameter is the number of tested class boundaries5, called num_classes. It gives the number of class boundaries in whose direction a perturbation is searched for. Computation time is highly sensitive to this parameter, and checking all 1000 class boundaries of ImageNet was infeasible. Higher values generally improve the fooling performance, though exceptions are observed. Other minor parameters such as random initialisation values for the train-test split, etc. were taken from the provided code. We have tried to optimise these parameters using grid searches, but the available computing resources have limited these efforts.

5 As given in the code at https://github.com/LTS4/universal/
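Throughout, a fooling rate is the fraction of validation images whose top-1 prediction changes once the perturbation is added. A minimal sketch of such a measurement, assuming a PyTorch model and a batched ImageNet validation loader (names are illustrative and do not correspond to our actual implementation):

import torch

@torch.no_grad()
def fooling_rate(model, loader, v):
    """Fraction of images whose top-1 prediction changes when v is added."""
    model.eval()
    fooled, total = 0, 0
    for x, _ in loader:
        pred_clean = model(x).argmax(dim=1)
        pred_pert = model(x + v).argmax(dim=1)
        fooled += (pred_clean != pred_pert).sum().item()
        total += x.shape[0]
    return fooled / total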
As part of this contribution we provide our code for the above reproduction effort and include the NeurIPS-2019 reproducibility checklist6.

6 See the file checklist.md in https://github.com/mauruskuehne/lwda-paper

4.2 Evaluation of Approaches to Construct Universal Adversarial Perturbations for Multiple Neural Networks

In this section we present our results for constructing UAPs that make use of two neural networks. We discuss results for the two approaches presented in Sec. 3.4, the alternating generation of perturbations and the linear interpolation between UAPs on individual networks.

The alternating generation of perturbations as given in Alg. 1 has been applied to the set of classifiers K = {Inception, VGG-16}. In Tab. 2 the fooling rates are reported for these two networks and for ResNet-152. The table has two sections. In the upper section, the network listed in the first column is used to generate the perturbation. The columns give the fooling rates on the network given in the column title as well as their average as our measure for the generalisability of the UAPs. We include the original network in this average. The second section reports the results of applying the alternating generation of UAPs and the interpolation method given in Eq. (1) with λ = 0.05.

Table 2. Comparison of the fooling rates by UAPs trained on individual networks (upper three rows) and of combination methods for several networks (lower three rows). Among the alternating generation variants (Alg. 1 and Alg. 2) and the linear interpolation procedure (Eq. (1)), Alg. 2 performs best with respect to the average fooling rate over all three networks. Results are reported on the validation set, using the l∞-norm, ξ = 10 and 20 UAP iterations.

                                           Inception V1   VGG-16   ResNet-152   Average
  Inception V1 UAP                         82%            16%      16%          38%
  VGG-16 UAP                               60%            59%      38%          52%
  ResNet-152 UAP                           55%            31%      73%          53%
  linear interpolation with λ = 0.05       60%            59%      38%          52%
  alternating generation of UAPs, Alg. 1   63%            55%      36%          51%
  alternating generation of UAPs, Alg. 2   67%            61%      41%          56%

Alg. 1 achieved results similar to the ones of a perturbation generated on VGG-16 only. Alg. 2 achieved slightly higher fooling rates on VGG-16 and ResNet-152 than a perturbation generated directly on VGG-16 (with an absolute increase of 2 and 3 percentage points, respectively). For Inception, the perturbation achieved a fooling rate of 67%. This is 15 percentage points below the measured self-fooling rate of Inception (82%) but 7 percentage points higher than the fooling rate achieved with a VGG-16 model.

For a linear interpolation between the UAPs of VGG-16 and Inception V1, the best results were achieved for λ = 0.05 (see Eq. (1)). Using this configuration, the fooling rates remained essentially unchanged compared to the UAP generated on VGG-16 only. The small value λ = 0.05 results in a perturbation that is similar to the VGG-16 perturbation, as the VGG-16 perturbation is weighted with 1 − λ = 0.95 while the Inception perturbation contributes only with a weight of 5%. Choosing λ ∈ {0.1, 0.15, . . . , 0.4} resulted in perturbations with lower fooling rates for both models with respect to the rates achieved by separately training UAPs on the two networks. For λ ∈ [0.4, 1.0], the fooling rates for Inception improved again but did not exceed the fooling rate of a UAP generated on Inception itself. The fooling rate on VGG-16 continued to deteriorate, stabilising at a low fooling rate of ∼ 15% for λ ≥ 0.65. This suggests that linear interpolation does not result in improved fooling rates.
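The interpolation experiment summarised above amounts to sweeping λ over a grid, forming vfg(λ) as in Eq. (1) and recording the fooling rates on the models of interest. A minimal sketch, assuming the fooling_rate helper sketched in Sec. 4.1 is in scope; it is illustrative only and not the code used for the reported numbers.

import torch

def interpolation_sweep(v_f, v_g, models, loader, step=0.05):
    """Evaluate v_fg(lam) = lam*v_f + (1 - lam)*v_g on each model for
    lam in {0, step, ..., 1} and return the per-lambda fooling rates."""
    results = []
    for lam in torch.arange(0.0, 1.0 + step / 2, step):
        v = lam * v_f + (1.0 - lam) * v_g
        rates = [fooling_rate(m, loader, v) for m in models]
        results.append((float(lam), rates, sum(rates) / len(rates)))
    return results  # list of (lambda, per-model rates, average rate)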
5 Discussion

5.1 Reproduction of the Original Results on Universal Adversarial Perturbations

For most models the fooling rates reported in the original paper could not be achieved, indicating that our reproduction results fell short of being satisfactory. For VGG-F, VGG-16, VGG-19 and ResNet-152 our self-fooling rates were between 4 and 24 absolute percentage points lower. For Inception we achieved a self-fooling rate 3 absolute percentage points higher than the one reported in the original paper. Further research is needed to state the precise conditions under which a reliable reproduction of the reported fooling rates is possible. As a step in this direction we have provided our code.

The non-diagonal values in Tab. 1 are large but typically significantly smaller than the diagonal values. They show a degree of transferability of UAPs generated with DeepFool to other models. Therefore, despite the reproducibility problems, these results broadly confirm that UAPs generated with DeepFool generalise to other network architectures. Nevertheless, it is clear that some aspects of UAPs are specific to a given neural network architecture. We discuss our results on finding a way to improve the non-diagonal elements (potentially at the cost of the diagonal ones) in the next section. Interestingly, we achieved lower fooling rates than Moosavi-Dezfooli et al., except for the Inception network, for which we achieve 3 to 8 absolute percentage points higher fooling rates. This difference may be due to different stopping criteria, resulting in Moosavi-Dezfooli et al. running fewer optimisation epochs. Another possibility is that the chosen hyperparameter values for DeepFool and UAP might be particularly well suited or optimised for the Inception model. This in turn would explain the lower fooling rates achieved on other models.

5.2 Alternating Generation of Perturbations and Linear Interpolation Between UAPs on Individual Networks

As the results in Sec. 4.2 clearly show, a linear interpolation between two UAPs does not give good results. This suggests that using a weighted average to combine UAPs is not a suitable approach to produce good UAPs for several neural networks. A more sophisticated UAP-combination procedure is clearly necessary to generate perturbations that fool both networks to a high degree.

The results given by the alternating generation of perturbations (Alg. 2) are much better (see Tab. 2). The fooling rate of the perturbation generated jointly on Inception and VGG-16 is better than the ones generated on any one of the two networks. A perturbation generated on Inception achieves a fooling rate of 16% on VGG-16, while a perturbation generated on VGG-16 achieves a fooling rate of 60% on Inception. Both rates are lower than the ones of a perturbation generated jointly on Inception and VGG-16, which achieves 67% on Inception and 61% on VGG-16. Furthermore, the fooling rate of a jointly trained UAP on Inception and VGG-16 on a third network (ResNet-152) is better than the fooling rate of both single-network UAPs. The UAP generated jointly on both networks even worked slightly better for VGG-16 than the one trained on VGG-16 alone. The reason for this effect and its statistical significance are not yet established.

In judging these results, we note that estimates of their uncertainties are still lacking due to constraints on our computational resources.
Ideally, one would like to provide mean and standard deviation values of all these numbers over multiple train-test splits, as encouraged by the NeurIPS-2019 reproducibility checklist. Furthermore, the choice of the ImageNet training and testing data and the hyperparameter values (such as p, ξ, num_classes, etc.) should be investigated.

6 Conclusions

The results reported here on generalising UAPs across several networks clearly have to be interpreted cautiously given the fact that even the reproduction of previously reported results has not been satisfactory. Establishing reproducibility standards for machine learning publications remains a crucial challenge that is hampering progress.

With the above caution in mind, the results reported here suggest that finding universal adversarial perturbations that generalise across different convolutional neural networks is not a hopeless endeavour. As we found, such a UAP is likely not a linear combination of UAPs of different networks but must be constructed in a more subtle way. Our best approach, Alg. 2, most certainly is not optimal. Nevertheless, it already shows some promising results: the generalisability of the fooling rates to ResNet-152 is enhanced by combining the UAPs of two networks, with respect to the UAPs generated on either one of the Inception or VGG-16 networks. This suggests that combining several or even many networks might produce UAPs that are efficient on a whole class of trained convolutional neural networks.

References

1. Chaubey, A., Agrawal, N., Barnwal, K., Guliani, K.K., Mehta, P.: Universal adversarial perturbations: A survey. ArXiv (2020), http://arxiv.org/abs/2005.08087
2. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: ImageNet: A large-scale hierarchical image database. In: CVPR09. pp. 248–255. IEEE Computer Society (2009). https://doi.org/10.1109/CVPR.2009.5206848
3. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015), http://arxiv.org/abs/1412.6572
4. Gundersen, O.E., Kjensmo, S.: State of the art: Reproducibility in artificial intelligence. In: McIlraith, S.A., Weinberger, K.Q. (eds.) AAAI. pp. 1644–1651. AAAI Press (2018), http://dblp.uni-trier.de/db/conf/aaai/aaai2018.html#GundersenK18
5. Moosavi-Dezfooli, S.M., Fawzi, A., Fawzi, O., Frossard, P.: Universal Adversarial Perturbations. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 86–94 (July 2017). https://doi.org/10.1109/CVPR.2017.17, ISSN: 1063-6919
6. Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2574–2582 (June 2016). https://doi.org/10.1109/CVPR.2016.282, ISSN: 1063-6919
7. Mopuri, K.R., Ganeshan, A., Babu, R.V.: Generalizable data-free objective for crafting universal adversarial perturbations. CoRR (2018), http://arxiv.org/abs/1801.08092
8. Mopuri, K.R., Garg, U., Babu, R.V.: Fast Feature Fool: A data independent approach to universal adversarial perturbations. CoRR (2017), http://arxiv.org/abs/1707.05572
9. Naseer, M.M., Khan, S.H., Khan, M.H., Shahbaz Khan, F., Porikli, F.: Cross-domain transferability of adversarial perturbations. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 12905–12915. Curran Associates, Inc. (2019), http://papers.nips.cc/paper/9450-cross-domain-transferability-of-adversarial-perturbations.pdf
10. Raff, E.: A step toward quantifying independently reproducible machine learning research. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 5485–5495. Curran Associates, Inc. (2019), http://papers.nips.cc/paper/8787-a-step-toward-quantifying-independently-reproducible-machine-learning-research.pdf
11. Ren, K., Zheng, T., Qin, Z., Liu, X.: Adversarial attacks and defenses in deep learning. Engineering 6(3), 346–360 (2020). https://doi.org/10.1016/j.eng.2019.12.012
12. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: International Conference on Learning Representations. ICLR (2014), http://arxiv.org/abs/1312.6199