Classification of Encrypted Word Embeddings using Recurrent Neural Networks

Robert Podschwadt (rpodschwadt1@student.gsu.edu), Georgia State University, Atlanta, Georgia
Daniel Takabi (takabi@gsu.edu), Georgia State University, Atlanta, Georgia

Houston '20, February 07, 2020, Houston, TX. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Deep learning has made many exciting applications possible, and given the popularity of social networks and user generated content there is no shortage of data for these applications. The content generated by users is written or spoken in natural language, which needs to be processed by computers. Recurrent Neural Networks (RNNs) are a popular choice for language processing due to their ability to process sequential data. On the other hand, this data is some of the most privacy sensitive information. Therefore, privacy-preserving methods for natural language processing are crucial. In this paper, we focus on settings where a client has private data and wants to use machine learning as a service (MLaaS) to perform classification on the data without the need to disclose the data to the entity offering the service. We employ homomorphic encryption techniques to achieve this. Homomorphic encryption allows data to be processed without being decrypted, thereby protecting the users' privacy. Although homomorphic encryption has been used for privacy-preserving machine learning, most of the work has focused on image processing and convolutional neural networks (CNNs); RNNs have not been studied. In this work, we use homomorphic encryption to build privacy-preserving RNNs for natural language processing tasks. We show that RNNs can be run over encrypted data without loss in accuracy compared to a plaintext implementation by evaluating our system on a sentiment classification task on the IMDb movie review dataset.

CCS CONCEPTS
• Security and privacy; • Computing methodologies → Natural language processing; Neural networks;

KEYWORDS
privacy preserving machine learning, recurrent neural networks, homomorphic encryption
1 INTRODUCTION
Artificial neural networks have been very successful and popular over the last few years in a variety of domains. CNNs have shown better than human performance in image classification tasks [13, 38] and have also been applied to language processing tasks [16]. RNNs, another type of neural network, are specifically designed to work with sequences. Unlike other types of networks, RNNs take the output of the previous sequence step into consideration. There are different types of RNN architectures, such as Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) and a simple fully connected variant, the Elman network [15]. In this work we use Elman networks and, unless specified otherwise, will use the term RNN instead of Elman network. Recurrent architectures are very popular in natural language processing (NLP) due to the sequential nature of language. There are many different sub-fields in NLP; in this work we investigate the task of sentiment classification.

Many companies have built a business around offering MLaaS. In MLaaS the model is hosted in the cloud. The service provider has the infrastructure and know-how to build the models. The client owns the data and sends it to the provider (also called the server) for processing.

A concern for the client of MLaaS is the privacy of the data. To process the data the server needs access to it, which is often unwanted or unacceptable depending on the sensitivity of the data. There are three main techniques for preserving the privacy of the data while still allowing ML algorithms to work: 1) Secure Multiparty Computation (SMC), 2) Differential Privacy (DP) and 3) Homomorphic Encryption (HE).

In previous work a variety of different machine learning algorithms have been adapted for privacy-preserving processing, such as linear regression [29], linear classifiers [4, 17], decision trees [1, 4] and neural networks [14, 29, 32]. Solutions based on SMC [29, 32] come with a huge communication overhead.

We propose an approach that is based on homomorphic encryption and recurrent neural networks. It does not require interactive communication between client and server like SMC approaches do, but in the case of longer sequences we use interactive communication to control the noise introduced by HE. Very little prior work deals with recurrent neural networks. Much of the work is done on CNNs in the image domain [10, 14, 22]. Zhang et al. [39] perform encrypted speech recognition, which is an NLP task, but the model used is also a CNN. Badawi et al. [2] research privacy-preserving text classification, which is the task we also address in this paper, but the authors do not use an RNN. To the best of our knowledge, there is only one prior paper working with a recurrent architecture: Lou and Jiang propose a system [26] that is capable of implementing LSTM networks based on TFHE [7]. Their LSTM model, however, suffers from a small drop in accuracy when running on encrypted data. Our solution is able to maintain the same accuracy as the plaintext model. We present a solution that can process RNNs with arbitrary length input sequences in a privacy-preserving manner and introduce a way of using word embeddings with encrypted data. To ensure the privacy of the data we rely on the CKKS [6] crypto scheme. We evaluate our system on a text classification task.

The basic idea of our proposed approach is running RNNs on encrypted data by taking advantage of HE schemes. The server hosts the trained model; the client transmits the encrypted data for processing and receives an encrypted result. The training of the model is done on plaintext. In this work, we make the following main contributions:

• We propose an approach that combines RNNs, specifically Elman networks, and homomorphic encryption to perform inference over encrypted data in natural language processing tasks.
• We present an innovative approach to work with word embeddings for encrypted data.
• We perform thorough benchmarking of our system, both with respect to run time performance and communication cost. Our results demonstrate that we are able to run RNNs over encrypted data without sacrificing accuracy and with reasonable performance and communication cost.
1.1 Threat Model and Problem Statement
In this paper, we apply privacy-preserving machine learning techniques based on HE to RNNs. We focus on a client-server setting such as MLaaS, in which the client has full control over the data and the server has full control over the model. We assume that the model has been trained on plaintext data and that the server offers inference as a service to the clients. The clients want to use the inference service and wish to keep their data private, while the server wishes to keep its model private.

Threat Model: We assume that all parties are honest but curious. They will not deviate from the protocol but will attempt to learn any information possible in the process. The server does not share information about the architecture of the model with the client. The client encrypts the data and sends it to the server for processing. If it is possible, the server will process the data and send back the final result in encrypted form. In some cases data will be sent back to the client, where it is decrypted, encrypted again to remove the built-up noise and sent back to the server to continue processing. In addition to the privacy of the data, our goal is to achieve accurate predictions. This means the predictions made on encrypted data should be as close as possible to predictions made on plaintext data.

2 BACKGROUND

2.1 Homomorphic Encryption
Homomorphic encryption (HE) schemes are similar to other asymmetric encryption schemes in that they have a public key pk for encrypting (Enc) data and a private or secret key sk for decryption (Dec). Additionally, HE schemes also have a so-called evaluation function, Eval. This evaluation function allows the evaluation of a circuit C over encrypted data without the need for decryption. Given a set of plaintexts {m_i}_0^n and their encryptions {c_i}_0^n = Enc(pk, {m_i}_0^n), the circuit C can be evaluated as:

Dec(sk, Eval(pk, C, c_0, ..., c_n)) = C(m_0, ..., m_n).

Most modern HE schemes are based on the ring learning with errors (RLWE) problem. Roughly speaking, to encrypt a plaintext some noise is added, and the decryption process is the removal of that noise. For more details see Brakerski et al. [5] and Cheon et al. [6]. When operations are performed on the ciphertexts the noise grows, and once it passes a certain threshold the ciphertext can no longer be decrypted correctly. Multiplications add much more noise than additions. A way of controlling the noise is to use so-called leveled homomorphic encryption (LHE). LHE allows for a certain number of multiplications, based on the parameters chosen for the encryption scheme, and the evaluation of circuits of a known depth. Computation cost can be mitigated in some cases by using single instruction multiple data (SIMD) techniques introduced by Smart and Vercauteren [34].

2.2 Recurrent Neural Networks
In contrast to fully connected or convolutional neural networks, which are feed-forward only, recurrent neural networks feed some part of their hidden state back into themselves. There are many different types of recurrent neural network cells, with Long Short-Term Memory (LSTM) [23] and Gated Recurrent Unit (GRU) [8] being the most popular ones. These cells are more complex than the simple RNNs we focus on in this paper. While LSTM and GRU lead to better performance, we focus on the simpler RNN type due to its lower computational complexity.

The RNN used in this paper consists of three main components: the input (x_t), the hidden state (s_t) and the output (o) of the network at time step t. The hidden state s_t for one neuron is calculated by the following formula: s_t = f(x_t · w + s_{t-1} · v), where f is the activation function and · is the vector dot product. Tanh, defined as tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}), is the most common activation function used in RNNs.
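To make the recurrence concrete, the NumPy sketch below applies the formula above to a whole layer by replacing the per-neuron weight vectors w and v with weight matrices W and V. The dimensions mirror the 128-unit layer used later in Section 4; the random weights and inputs are purely illustrative and not part of the actual implementation.

```python
import numpy as np

def elman_step(x_t, s_prev, W, V):
    # One step of the simple (Elman) RNN from Section 2.2:
    # s_t = f(x_t . W + s_{t-1} . V), with f = tanh.
    return np.tanh(x_t @ W + s_prev @ V)

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 128, 128, 5            # toy sequence of 5 embedding vectors
W = rng.normal(scale=0.1, size=(d_in, d_hidden))
V = rng.normal(scale=0.1, size=(d_hidden, d_hidden))

s = np.zeros(d_hidden)                           # initial hidden state
for x_t in rng.normal(size=(seq_len, d_in)):     # stand-ins for word embeddings
    s = elman_step(x_t, s, W, V)
print(s.shape)                                   # (128,)
```

Under HE, the dot products map directly onto ciphertext additions and multiplications; the tanh call does not, which is what motivates the polynomial approximation discussed next.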
2.3 Polynomial Approximation: Theoretical Foundation
One of the major limitations of homomorphic encryption is the limited set of operations that can be performed. CKKS supports addition and multiplication; division is supported only for plaintext divisors. Essentially, this allows us to evaluate only polynomials. Tanh, a popular activation function in RNNs, cannot be expressed as a polynomial, which means we cannot evaluate it over encrypted data. A way to circumvent this is to find a polynomial approximation. Hesamifard et al. [21] use an approach that is based on Chebyshev polynomials. Let C(X) be the family of all continuous real-valued functions on a non-empty compact space X, and let µ be a finite measure on X. The authors define the inner product of f, g ∈ C(X) as ⟨f, g⟩ = ∫_X f g dµ. To generate Chebyshev polynomials they use dµ = dx / √(1 − x²) as the measure on [−1, 1]. For better computational performance we want to stick to low-degree polynomials.
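As a rough illustration of this idea (and not the authors' exact procedure), the sketch below interpolates tanh at Chebyshev points with NumPy and converts the result to the ordinary power basis. The interval [−4, 4] is an assumed input range, and the resulting coefficients will differ from the replacement polynomial reported in Section 3.3.

```python
import numpy as np

# Degree-3 Chebyshev interpolation of tanh, then conversion to the power basis.
# A simple stand-in for the Chebyshev-based approach of Hesamifard et al. [21].
domain = [-4.0, 4.0]   # assumed input range, not specified here
cheb = np.polynomial.Chebyshev.interpolate(np.tanh, deg=3, domain=domain)
poly = cheb.convert(kind=np.polynomial.Polynomial)
print(poly.coef)       # polynomial coefficients, constant term first

x = np.linspace(domain[0], domain[1], 5)
print(np.tanh(x))      # reference values
print(poly(x))         # polynomial approximation at the same points
```

Keeping the degree at three keeps the multiplicative depth of the encrypted activation low.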
2.4 NLP with Neural Networks
Recurrent neural networks are widely used for addressing challenges in natural language processing and have reached state-of-the-art performance on tasks such as speech recognition [18, 19], generating image descriptions [25, 36], machine translation [3, 11] and language modeling [28, 35]. The implementation of an NLP pipeline using RNNs can be broken down into four major parts: 1) designing the network, 2) encoding the data, 3) training the model and 4) inference on new instances. In the next section we look at these steps in detail and describe the changes that are necessary for computation in a privacy-preserving setting.

3 THE PROPOSED PRIVACY-PRESERVING CLASSIFICATION FOR RECURRENT NEURAL NETWORKS
Looking at the components of the RNN pipeline described in Section 2.4, we determine what changes need to be made to adhere to the constraints of homomorphic encryption.

Network Design. As long as we only use fully connected and recurrent layers, the only consideration we need to make concerns the activation functions that are being used. All other operations inside an RNN can be performed over encrypted data using HE schemes. However, it is not possible to implement common activation functions within current HE schemes. We aim to find the best low-degree polynomial approximation to replace the activation functions within the RNN.

Data Encoding. In this paper, we use word embeddings as an encoding scheme for textual data. We describe our approach to handling embeddings in more detail in Section 3.1.

Model Training. In this paper, we assume that the training of the model is performed by the server on plain training data.

Inference. This is the part of the pipeline in our system that is run on encrypted data. At no point during this process is the data decrypted on the server, thus ensuring its privacy is protected. During processing by the model, the encrypted data accumulates noise. We describe a way of circumventing the problem of the noise crossing the threshold, after which correct decryption is no longer possible, in Section 3.2. Once the data has been processed by the entire network, the result of the classification is sent back to the client. The result of the classification is still encrypted and needs to be decrypted by the client.

A variety of activation functions have been proposed as replacements for common activation functions used in NNs. Dowlin et al. [14] use polynomials of degree 2 to substitute the Sigmoid function in CNNs, and Shortell and Shokoufandeh [33] use a polynomial of degree 3 to approximate the natural logarithm. Hesamifard et al. [21] use Chebyshev polynomials to approximate activation functions such as ReLU, Sigmoid and Tanh. We use the approach of [21] to approximate Tanh, which is the most popular activation function in RNNs. The Softmax function cannot be performed over encrypted data, but since it is typically the very last function of a neural network, we move it to the client side. The server computes the neural network all the way to the inputs of the Softmax function; the Softmax function is then applied by the client after decryption to obtain the classification results.

3.1 Encrypted word embeddings
Word embeddings are a way to turn words into real-valued vectors. The embedding layer is basically a lookup table that maps any word in a dictionary to a real-valued vector. The lookup of an embedding for a given word cannot be performed efficiently in HE schemes. We address this problem by moving the embedding layer out of the RNN and to the client, where it can be performed in plaintext. After performing the embedding lookup, the client encrypts the embeddings and sends the result to the server. To enhance the privacy of the model, the model owner can use one of the many pretrained embeddings such as GloVe [30], ELMo [31], BERT [12] or XLNet [37] and share those with the client.

3.2 Noise growth in HE
In an RNN architecture, a sequence is processed by feeding its entries into a fully connected layer which also takes the output that layer produced for the previous sequence entry. The current entry and the previous output are combined into the new output. Due to the noise build-up in HE we need to keep track of the number of operations performed on ciphertexts. To process a sequence of length n with an RNN layer, the resulting ciphertext needs to pass through the layer n times. That means n dot products and activation functions are applied. It is not always possible to process all of the sequence entries due to the accumulated noise. Our approach is to send the encrypted data back to the client, where it is decrypted and re-encrypted, thereby removing the built-up noise.

3.3 Implementation
We use CKKS to protect the privacy of the client data. The server trains a plaintext model and shares the embedding matrix with the client. The activation in the model needs to be compatible with HE; this is achieved by approximating Tanh using the method by Hesamifard et al. [21]. The client performs the embedding process and encrypts the result. The encrypted embeddings are sent to the server where they are processed. When the noise built up during computation reaches the limit, the data is sent back to the client where it is decrypted, thereby removing all noise, re-encrypted and sent back to the server. Once the model has completed processing, the server sends the still encrypted result back to the client where it can be decrypted. We implement our proposed solution in C++11. We train the model using Keras [9], and the homomorphic encryption primitives are provided by HElib [20]. On the plaintext, we tried different activation functions and found that Tanh and Tanh approximations work best. Other activation functions such as x² or the linear function cause the model not to train properly. We find that the best replacement for our purposes is: −0.00163574303018748x³ + 0.249476365628036x.

4 EXPERIMENTAL RESULTS
The experiments were performed on an Ubuntu 18.04 64-bit machine with an AMD Ryzen 5 2600 @ 3.5GHz processor and 32GB of RAM. The IMDb [27] dataset contains 50,000 movie reviews labeled as either positive or negative, of which 25,000 are used as training and 25,000 as test data. The tokenization is performed by Keras. We train a model to perform sentiment classification, that is, classifying a review as either positive or negative. Out of the 25,000 training instances we use 2,000 as validation data for hyperparameter tuning. We use a vocabulary of the top 20,000 words. We pad or truncate the reviews to be 200 words long. Our model consists of an embedding layer that turns the words in the reviews into real-valued vectors of dimension 128. The embedding matrix is randomly initialized and updated during the training process. The embedding layer is followed by an RNN layer with 128 units. We use the Tanh approximation from Section 3.3 as activation function. The last layer is a fully connected layer with two units and Softmax activation. The training is performed on the plain data using Keras and yields 86.47% accuracy on the unseen test data. We achieve the same accuracy on the encrypted data.
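For reference, the following tf.keras sketch reconstructs this architecture from the description above; it is not the authors' code. The degree-3 polynomial from Section 3.3 is used as the RNN activation, while the use of tf.keras (rather than the original Keras [9] setup), the optimizer and the loss are assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models

def tanh_approx(x):
    # Degree-3 Tanh replacement reported in Section 3.3.
    return -0.00163574303018748 * (x * x * x) + 0.249476365628036 * x

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),  # lookup moves to the client at inference time
    layers.SimpleRNN(128, activation=tanh_approx),      # simple (Elman) RNN with 128 units
    layers.Dense(2, activation="softmax"),              # Softmax is applied client-side under HE
])
model.compile(optimizer="adam",                # assumed; the paper does not name the optimizer
              loss="categorical_crossentropy", # assumes one-hot positive/negative labels
              metrics=["accuracy"])

dummy_batch = np.random.randint(0, 20000, size=(2, 200))  # two padded reviews of length 200
print(model(dummy_batch).shape)                            # (2, 2) class probabilities
```

At inference time the trained embedding matrix is shared with the client, which performs the lookup in plaintext and sends the encrypted vectors to the server (Section 3.1).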
We extract the learned weights and run experiments with different batch sizes. In our experiments the noise growth exceeds the workable threshold after 27 timesteps. This means we need to add communication between client and server seven times to refresh the noise in order to classify the IMDb sequences of length 200. The amount of data that needs to be transmitted depends on the batch size. The encrypted embeddings are larger than the plaintext data by a factor of 1,280. See Table 1 for different batch sizes. The Embeddings column is the amount of data that is initially transferred from the client to the server. Noisy ciphertext gives the size of the data the server sends to the client to be refreshed, and Refreshed ciphertext is the re-encrypted answer. These are the values for only one refresh operation. The Batch column is the total amount of data transferred between client and server during classification of one batch, which requires seven refresh rounds.

Table 1: Data transferred during encrypted classification

Batch size   Embeddings   Noisy ciphertext   Refreshed ciphertext   Batch
1            125MB        0.939MB            0.623MB                135MB
4            287MB        2.2MB              1.5MB                  312MB
32           1,843MB      14MB               9MB                    2,004MB
64           3,548MB      27MB               18MB                   3,863MB
128          7,065MB      54MB               35MB                   7,869MB
256          14,336MB     106MB              70MB                   15,568MB

The data that needs to be transmitted initially makes up the largest portion of the transfer. To run our network, seven noise removal communications are required. At a batch size of 256 the server sends 106MB to the client and the client responds with 70MB. One round of noise removal therefore requires 176MB to be transferred; all seven rounds take 1,232MB, which is less than 10% of the initial transfer. The increase in size of the ciphertexts is nearly linear. Smaller ciphertexts carry more overhead per instance than larger ones.

Table 2: Run times of inference on the IMDb test set

Batch size   Encrypted (sample / batch)   Plain (sample / batch)
1            70.6s / 70.6s                1.83s / 1.8s
4            20.2s / 80.7s                0.496s / 1.99s
32           5.8s / 184.6s                0.072s / 2.30s
64           4.3s / 272.7s                0.055s / 3.52s
128          4.2s / 547.6s                0.046s / 5.89s
256          6.5s / 1658.7s               0.039s / 9.96s

Table 2 lists the execution time for different batch sizes. The times are given for encrypted and plain data, both as the actual time it takes to process the batch and as the resulting time per instance. The noise removal is not performed by the client, though; it is simulated on the server. The measurements also do not include the encryption and transfer of the embeddings. We can see that increasing the batch size leads to lower per-instance classification time. The effect is lost when increasing the batch size from 128 to 256; on the plain data we still see improvement after that point. To get an accurate comparison, the plaintext measurements are performed on the same implementation as the encrypted experiments. The growth in execution time for the encrypted version appears exponential, while the plain version appears logarithmic. Our implementation performs best on encrypted data with a batch size of 128 and worst with a batch size of one if we look at the time per sample. The overhead is smallest, though, for one instance per batch: there the encrypted version is 40 times slower than the plain version. For our optimal batch size of 128 the encrypted version is 92 times slower. This is due to the different growth rates of execution time for the encrypted and plain data.
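The refresh schedule itself can be sanity-checked with a short sketch. The toy model below (our own illustration, not part of the implementation) counts the decrypt-and-re-encrypt round trips needed when the noise budget tolerates 27 RNN timesteps between refreshes; for the length-200 IMDb sequences it reproduces the seven refresh rounds described above.

```python
def count_refreshes(seq_len, steps_per_budget=27):
    """Count client round trips needed for a sequence of seq_len timesteps
    when the ciphertext noise budget tolerates steps_per_budget RNN steps."""
    refreshes = 0
    steps_since_refresh = 0
    for _ in range(seq_len):
        if steps_since_refresh == steps_per_budget:
            refreshes += 1            # client decrypts, re-encrypts, sends back
            steps_since_refresh = 0
        steps_since_refresh += 1      # server applies one RNN step under HE
    return refreshes

print(count_refreshes(200))  # 7, matching the seven refresh rounds in Section 4
```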
5 RELATED WORK
Badawi et al. [2] presented PrivFT, a system for privacy-preserving text classification built on Facebook's fastText (Joulin et al. [24]). The main difference to our work is that we use a recurrent architecture. In PrivFT the embedding operation is also not outsourced to the client: the client needs to one-hot encode each word, encrypt it and send it to the server, where the embedding operation is performed as a matrix multiplication. The message size is similar. The inference time for a single instance on IMDb is higher in our scenario, but using larger batch sizes allows us to get a lower per-instance time. In contrast to our work, PrivFT features schemes for training on encrypted data and a CKKS implementation with GPU acceleration. Lou and Jiang created SHE [26], a privacy-preserving neural network framework based on TFHE. It offers support for LSTM cells. The authors replace the computationally expensive and high-noise matrix operations normally required by LSTMs with much cheaper shift operations. Zhang et al. [39] perform a different NLP task, namely encrypted speech recognition, based on a CNN. The last step of the network, which matches the output to actual text, is performed on the client side.

6 CONCLUSION
In this paper, we present an approach that allows the use of recurrent neural networks on homomorphically encrypted data based on the CKKS scheme. We present a solution to perform NLP tasks over encrypted data using recurrent neural networks, in our case sentiment analysis on the IMDb dataset. We are able to achieve this with no loss in accuracy compared to the plaintext model. This is made possible by introducing communication between client and server to refresh the noise. We trade network traffic for the ability to efficiently use word embeddings. Our future work aims at investigating other recurrent architectures such as LSTM and GRU.
REFERENCES
[1] Louis J. M. Aslett, Pedro M. Esperança, and Chris C. Holmes. 2015. Encrypted statistical machine learning: new privacy preserving methods. CoRR (2015).
[2] Ahmad Al Badawi, Luong Hoang, Chan Fook Mun, Kim Laine, and Khin Mi Mi Aung. 2019. PrivFT: Private and Fast Text Classification with Homomorphic Encryption. arXiv preprint arXiv:1908.06972 (2019).
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.0473
[4] Raphael Bost, Raluca Ada Popa, Stephen Tu, and Shafi Goldwasser. 2015. Machine Learning Classification over Encrypted Data. In 22nd Annual Network and Distributed System Security Symposium, NDSS, San Diego, California, USA.
[5] Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. 2012. (Leveled) Fully Homomorphic Encryption Without Bootstrapping. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS '12). ACM, New York, NY, USA, 309–325.
[6] Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. 2017. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Advances in Cryptology – ASIACRYPT 2017, Tsuyoshi Takagi and Thomas Peyrin (Eds.). Springer International Publishing, Cham, 409–437.
[7] Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. August 2016. TFHE: Fast Fully Homomorphic Encryption Library. https://tfhe.github.io/tfhe/.
[8] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1724–1734. https://doi.org/10.3115/v1/D14-1179
[9] François Chollet et al. 2017. Keras. https://github.com/fchollet/keras.
[10] Edward Chou, Josh Beal, Daniel Levy, Serena Yeung, Albert Haque, and Li Fei-Fei. 2018. Faster CryptoNets: Leveraging sparsity for real-world encrypted inference. arXiv preprint arXiv:1811.09953 (2018).
[11] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
[13] Terrance Devries and Graham W. Taylor. 2017. Improved Regularization of Convolutional Neural Networks with Cutout. CoRR abs/1708.04552 (2017). arXiv:1708.04552 http://arxiv.org/abs/1708.04552
[14] Nathan Dowlin, Ran Gilad-Bachrach, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. 2016. CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy. Technical Report MSR-TR-2016-3.
[15] Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14, 2 (1990), 179–211.
[16] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 1243–1252.
[17] Thore Graepel, Kristin Lauter, and Michael Naehrig. 2013. ML Confidential: Machine Learning on Encrypted Data. In Proceedings of the 15th International Conference on Information Security and Cryptology (ICISC '12). Springer-Verlag.
[18] Alex Graves and Navdeep Jaitly. 2014. Towards End-to-end Speech Recognition with Recurrent Neural Networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (ICML '14). JMLR.org, II-1764–II-1772. http://dl.acm.org/citation.cfm?id=3044805.3045089
[19] A. Graves, A. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 6645–6649. https://doi.org/10.1109/ICASSP.2013.6638947
[20] Shai Halevi and Victor Shoup. 2014. Algorithms in HElib. In Advances in Cryptology - CRYPTO - 34th Annual Cryptology Conference, CA, USA, Proceedings.
[21] Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. 2016. CryptoDL: Towards Deep Learning over Encrypted Data. In Annual Computer Security Applications Conference (ACSAC).
[22] Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. 2019. Deep Neural Networks Classification over Encrypted Data. In Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy. ACM, 97–108.
[23] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[24] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, Valencia, Spain, 427–431. https://www.aclweb.org/anthology/E17-2068
[25] Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Qian Lou and Lei Jiang. 2019. SHE: A Fast and Accurate Deep Neural Network for Encrypted Data. In Advances in Neural Information Processing Systems. 10035–10043.
[27] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, 142–150. http://www.aclweb.org/anthology/P11-1015
[28] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.
[29] P. Mohassel and Y. Zhang. 2017. SecureML: A System for Scalable Privacy-Preserving Machine Learning. In IEEE Symposium on Security and Privacy (SP).
[30] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
[31] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
[32] M. Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M. Songhori, Thomas Schneider, and Farinaz Koushanfar. 2018. Chameleon: A Hybrid Secure Computation Framework for Machine Learning Applications. CoRR abs/1801.03239 (2018). arXiv:1801.03239 http://arxiv.org/abs/1801.03239
[33] Thomas Shortell and Ali Shokoufandeh. 2015. Secure Signal Processing Using Fully Homomorphic Encryption. In Advanced Concepts for Intelligent Vision Systems - 16th International Conference, ACIVS, Italy, Proceedings.
[34] N. P. Smart and F. Vercauteren. 2014. Fully homomorphic SIMD operations. Designs, Codes and Cryptography 71, 1 (01 Apr 2014), 57–81. https://doi.org/10.1007/s10623-012-9720-4
[35] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM Neural Networks for Language Modeling. In INTERSPEECH.
[36] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research), Francis Bach and David Blei (Eds.), Vol. 37. PMLR, Lille, France, 2048–2057. http://proceedings.mlr.press/v37/xuc15.html
[37] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237 (2019).
[38] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. CoRR abs/1605.07146 (2016). arXiv:1605.07146 http://arxiv.org/abs/1605.07146
[39] S. Zhang, Y. Gong, and D. Yu. 2019. Encrypted Speech Recognition Using Deep Polynomial Networks. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5691–5695. https://doi.org/10.1109/ICASSP.2019.8683721