Classification of Encrypted Word Embeddings using Recurrent Neural Networks

Robert Podschwadt (rpodschwadt1@student.gsu.edu), Georgia State University, Atlanta, Georgia
Daniel Takabi (takabi@gsu.edu), Georgia State University, Atlanta, Georgia

Houston '20, February 07, 2020, Houston, TX. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Deep learning has made many exciting applications possible, and given the popularity of social networks and user generated content there is no shortage of data for these applications. The content generated by users is written or spoken in natural language, which needs to be processed by computers. Recurrent Neural Networks (RNNs) are a popular choice for language processing due to their ability to process sequential data. On the other hand, this data is some of the most privacy sensitive information. Therefore, privacy-preserving methods for natural language processing are crucial. In this paper, we focus on settings where a client has private data and wants to use machine learning as a service (MLaaS) to perform classification on the data without the need to disclose the data to the entity offering the service. We employ homomorphic encryption techniques to achieve this. Homomorphic encryption allows data to be processed without being decrypted, thereby protecting the users' privacy. Although homomorphic encryption has been used for privacy-preserving machine learning, most of the work has focused on image processing and convolutional neural networks (CNNs); RNNs have not been studied. In this work, we use homomorphic encryption to build privacy-preserving RNNs for natural language processing tasks. We show that RNNs can be run over encrypted data without loss in accuracy compared to a plaintext implementation by evaluating our system on a sentiment classification task on the IMDb movie review dataset.

CCS CONCEPTS
• Security and privacy; • Computing methodologies → Natural language processing; Neural networks;

KEYWORDS
privacy preserving machine learning, recurrent neural networks, homomorphic encryption
1 INTRODUCTION
Artificial neural networks have been very successful and popular over the last few years in a variety of domains. CNNs have shown better than human performance in image classification tasks [13, 38] and have also been applied to language processing tasks [16]. RNNs, another type of neural network, are specifically designed to work with sequences. Unlike other types of networks, RNNs take the output of the previous sequence step into consideration. There are different types of RNN architectures, such as Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) and a simple fully connected variant, the Elman network [15]. In this work we use Elman networks and, unless specified otherwise, will use the term RNN instead of Elman network. Recurrent architectures are very popular in natural language processing (NLP) due to the sequential nature of language. There are many different sub-fields in NLP; in this work we investigate the task of sentiment classification.

Many companies have built a business around offering MLaaS. In MLaaS the model is hosted in the cloud. The service provider has the infrastructure and know-how to build the models. The client owns the data and sends it to the provider (also called the server) for processing.

A concern for the client of MLaaS is the privacy of the data. To process the data the server needs access to it, which is often unwanted or unacceptable depending on the sensitivity of the data. There are three main techniques for preserving the privacy of the data while still allowing ML algorithms to work: 1) Secure Multiparty Computation (SMC), 2) Differential Privacy (DP) and 3) Homomorphic Encryption (HE).

In previous work a variety of different machine learning algorithms have been adapted for privacy-preserving processing, such as linear regression [29], linear classifiers [4, 17], decision trees [1, 4] and neural networks [14, 29, 32]. Solutions based on SMC [29, 32] come with a huge communication overhead.

We propose an approach that is based on homomorphic encryption and recurrent neural networks. It does not require interactive communication between client and server like SMC approaches do, but in the case of longer sequences we use interactive communication to control the noise introduced by HE. Very little prior work deals with recurrent neural networks. Much of the work is done on CNNs in the image domain [10, 14, 22]. Zhang et al. [39] perform encrypted speech recognition, which is an NLP task, but the model used is also a CNN. Badawi et al. [2] research privacy-preserving text classification, which is the task we also address in this paper, but the authors do not use an RNN. To the best of our knowledge, there is only one prior paper working with a recurrent architecture: Lou and Jiang propose a system [26] that is capable of implementing LSTM networks based on TFHE [7]. Their LSTM model, however, suffers from a small drop in accuracy when running on encrypted data. Our solution is able to maintain the same accuracy as the plaintext model. We present a solution that can process RNNs with arbitrary length input sequences in a privacy-preserving manner and introduce a way of using word embeddings with encrypted data. To ensure the privacy of the data we rely on the CKKS [6] crypto scheme. We evaluate our system on a text classification task.

The basic idea of our proposed approach is running RNNs on encrypted data by taking advantage of HE schemes. The server hosts the trained model; the client transmits the encrypted data for processing and receives an encrypted result. The training of the model is done on plaintext. In this work, we make the following main contributions:

• We propose an approach that combines RNNs, specifically Elman networks, and homomorphic encryption to perform inference over encrypted data in natural language processing tasks.
• We present an innovative approach to work with word embeddings for encrypted data.
• We perform thorough benchmarking of our system, both with respect to run time performance and communication cost. Our results demonstrate that we are able to run RNNs over encrypted data without sacrificing accuracy and with reasonable performance and communication cost.
1.1 Threat Model and Problem Statement
In this paper, we apply privacy-preserving machine learning techniques based on HE to RNNs. We focus on a client-server setting such as MLaaS, in which the client has full control over the data and the server has full control over the model. We assume that the model has been trained on plaintext data and that the server offers inference as a service to the clients. The clients want to use the inference service and wish to keep their data private, while the server wishes to keep its model private.

Threat Model: We assume that all parties are honest but curious. They will not deviate from the protocol but will attempt to learn any information possible in the process. The server does not share information about the architecture of the model with the client. The client encrypts the data and sends it to the server for processing. If it is possible, the server will process the data and send back the final result in encrypted form. In some cases data will be sent back to the client, where it is decrypted, encrypted again to remove the built-up noise and sent back to the server to continue processing. In addition to the privacy of the data, our goal is to achieve accurate predictions. This means the predictions made on encrypted data should be as close as possible to predictions made on plaintext data.

2 BACKGROUND

2.1 Homomorphic Encryption
Homomorphic encryption (HE) schemes are similar to other asymmetric encryption schemes in that they have a public key pk for encrypting (Enc) data and a private or secret key sk for decryption (Dec). Additionally, HE schemes also have a so-called evaluation function, Eval. This evaluation function allows the evaluation of a circuit C over encrypted data without the need for decryption. Given a set of plaintexts {m_i}_0^n and their encryptions {c_i}_0^n = Enc(pk, {m_i}_0^n), the circuit C can be evaluated as:

Dec(sk, Eval(pk, C, c_0, ..., c_n)) = C(m_0, ..., m_n).

Most modern HE schemes are based on the ring learning with errors (RLWE) problem. Roughly speaking, to encrypt a plaintext some noise is added, and the decryption process is the removal of that noise. For more details see Brakerski et al. [5] and Cheon et al. [6]. When operations are performed on the ciphertexts the noise grows, and once it passes a certain threshold the ciphertext can no longer be decrypted correctly. Multiplications add much more noise than additions. A way of controlling the noise is to use so-called leveled homomorphic encryption (LHE). LHE allows for a certain number of multiplications, based on the parameters chosen for the encryption scheme, and the evaluation of circuits of a known depth. Computation cost can be mitigated in some cases by using single instruction multiple data (SIMD) techniques introduced by Smart and Vercauteren [34].

2.2 Recurrent Neural Networks
In contrast to fully connected or convolutional neural networks, which are feed-forward only, recurrent neural networks feed some part of their hidden state back into themselves. There are many different types of recurrent neural network cells, with Long Short-Term Memory (LSTM) [23] and Gated Recurrent Unit (GRU) [8] being the most popular ones. These cells are more complex than the simple RNNs we focus on in this paper. While LSTM and GRU lead to better performance, we focus on the simpler RNN type due to its lower computational complexity.

The RNN used in this paper consists of three main components: the input (x_t), the hidden state (s_t) and the output (o) of the network at time step t. The hidden state s_t for one neuron is calculated by the following formula: s_t = f(x_t · w + s_{t-1} · v), where f is the activation function and · is the vector dot product. Tanh, defined as tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}), is the most common activation function used in RNNs.
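To make the recurrence concrete, the NumPy sketch below applies the formula above to a whole layer by replacing the per-neuron weight vectors w and v with weight matrices W and V. The dimensions mirror the 128-unit layer used later in Section 4; the random weights and inputs are purely illustrative and not part of the actual implementation.

```python
import numpy as np

def elman_step(x_t, s_prev, W, V):
    # One step of the simple (Elman) RNN from Section 2.2:
    # s_t = f(x_t . W + s_{t-1} . V), with f = tanh.
    return np.tanh(x_t @ W + s_prev @ V)

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 128, 128, 5            # toy sequence of 5 embedding vectors
W = rng.normal(scale=0.1, size=(d_in, d_hidden))
V = rng.normal(scale=0.1, size=(d_hidden, d_hidden))

s = np.zeros(d_hidden)                           # initial hidden state
for x_t in rng.normal(size=(seq_len, d_in)):     # stand-ins for word embeddings
    s = elman_step(x_t, s, W, V)
print(s.shape)                                   # (128,)
```

Under HE, the dot products map directly onto ciphertext additions and multiplications; the tanh call does not, which is what motivates the polynomial approximation discussed next.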
2.3 Polynomial Approximation: Theoretical Foundation
One of the major limitations of homomorphic encryption is the limited set of operations that can be performed. CKKS supports addition and multiplication; division is supported only for plaintext divisors. Essentially, this allows us to evaluate only polynomials. Tanh, a popular activation function in RNNs, cannot be expressed as a polynomial, which means we cannot evaluate it over encrypted data. A way to circumvent this is to find a polynomial approximation. Hesamifard et al. [21] use an approach that is based on Chebyshev polynomials. Let C(X) be the family of all continuous real-valued functions on a non-empty compact space X, and let µ be a finite measure on X. The authors define the inner product of f, g ∈ C(X) as ⟨f, g⟩ = ∫_X f g dµ. To generate Chebyshev polynomials they use dµ = dx / √(1 − x²) as the measure on [−1, 1]. For better computational performance we want to stick to low-degree polynomials.
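As a rough illustration of this idea (and not the authors' exact procedure), the sketch below interpolates tanh at Chebyshev points with NumPy and converts the result to the ordinary power basis. The interval [−4, 4] is an assumed input range, and the resulting coefficients will differ from the replacement polynomial reported in Section 3.3.

```python
import numpy as np

# Degree-3 Chebyshev interpolation of tanh, then conversion to the power basis.
# A simple stand-in for the Chebyshev-based approach of Hesamifard et al. [21].
domain = [-4.0, 4.0]   # assumed input range, not specified here
cheb = np.polynomial.Chebyshev.interpolate(np.tanh, deg=3, domain=domain)
poly = cheb.convert(kind=np.polynomial.Polynomial)
print(poly.coef)       # polynomial coefficients, constant term first

x = np.linspace(domain[0], domain[1], 5)
print(np.tanh(x))      # reference values
print(poly(x))         # polynomial approximation at the same points
```

Keeping the degree at three keeps the multiplicative depth of the encrypted activation low.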
2.4 NLP with Neural Networks
Recurrent neural networks are widely used for addressing challenges in natural language processing and have reached state-of-the-art performance on tasks such as speech recognition [18, 19], generating image descriptions [25, 36], machine translation [3, 11] and language modeling [28, 35]. The implementation of an NLP pipeline using RNNs can be broken down into four major parts: 1) designing the network, 2) encoding the data, 3) training the model and 4) inference on new instances. In the next section we look at these steps in detail and describe the changes that are necessary for computation in a privacy-preserving setting.

3 THE PROPOSED PRIVACY-PRESERVING CLASSIFICATION FOR RECURRENT NEURAL NETWORKS
Looking at the components of the RNN pipeline described in Section 2.4, we determine what changes need to be made to adhere to the constraints of homomorphic encryption.

Network Design. As long as we only use fully connected and recurrent layers, the only consideration we need to make concerns the activation functions that are being used. All other operations inside an RNN can be performed over encrypted data using HE schemes. However, it is not possible to implement common activation functions within current HE schemes. We aim to find the best low-degree polynomial approximation to replace the activation functions within the RNN.

Data Encoding. In this paper, we use word embeddings as an encoding scheme for textual data. We describe our approach to handling embeddings in more detail in Section 3.1.

Model Training. In this paper, we assume that the training of the model is performed by the server on plain training data.

Inference. This is the part of the pipeline in our system that is run on encrypted data. At no point during this process is the data decrypted on the server, thus ensuring its privacy is protected. During processing by the model, the encrypted data accumulates noise. We describe a way of circumventing the problem of the noise crossing the threshold, after which correct decryption is no longer possible, in Section 3.2. Once the data has been processed by the entire network, the result of the classification is sent back to the client. The result of the classification is still encrypted and needs to be decrypted by the client.

A variety of activation functions have been proposed as replacements for common activation functions used in NNs. Dowlin et al. [14] use polynomials of degree 2 to substitute the Sigmoid function in CNNs, and Shortell and Shokoufandeh [33] use a polynomial of degree 3 to approximate the natural logarithm. Hesamifard et al. [21] use Chebyshev polynomials to approximate activation functions such as ReLU, Sigmoid and Tanh. We use the approach of [21] to approximate Tanh, which is the most popular activation function in RNNs. The Softmax function cannot be performed over encrypted data, but since it is typically the very last function of a neural network, we move it to the client side. The server computes the neural network all the way to the inputs of the Softmax function; the Softmax function is then applied by the client after decryption to obtain the classification results.

3.1 Encrypted word embeddings
Word embeddings are a way to turn words into real-valued vectors. The embedding layer is basically a lookup table that maps any word in a dictionary to a real-valued vector. The lookup of an embedding for a given word cannot be performed efficiently in HE schemes. We address this problem by moving the embedding layer out of the RNN and to the client, where it can be performed in plaintext. After performing the embedding lookup, the client encrypts the embeddings and sends the result to the server. To enhance the privacy of the model, the model owner can use one of the many pretrained embeddings such as GloVe [30], ELMo [31], BERT [12] or XLNet [37] and share those with the client.

3.2 Noise growth in HE
In an RNN architecture, a sequence is processed by feeding its entries into a fully connected layer which also takes the output that layer produced for the previous sequence entry. The current entry and the previous output are combined into the new output. Due to the noise build-up in HE we need to keep track of the number of operations performed on ciphertexts. To process a sequence of length n with an RNN layer, the resulting ciphertext needs to pass through the layer n times. That means n dot products and activation functions are applied. It is not always possible to process all of the sequence entries due to the accumulated noise. Our approach is to send the encrypted data back to the client, where it is decrypted and re-encrypted, thereby removing the built-up noise.

3.3 Implementation
We use CKKS to protect the privacy of the client data. The server trains a plaintext model and shares the embedding matrix with the client. The activation in the model needs to be compatible with HE; this is achieved by approximating Tanh using the method by Hesamifard et al. [21]. The client performs the embedding process and encrypts the result. The encrypted embeddings are sent to the server where they are processed. When the noise built up during computation reaches the limit, the data is sent back to the client where it is decrypted, thereby removing all noise, re-encrypted and sent back to the server. Once the model has completed processing, the server sends the still encrypted result back to the client where it can be decrypted. We implement our proposed solution in C++11. We train the model using Keras [9], and the homomorphic encryption primitives are provided by HElib [20]. On the plaintext, we tried different activation functions and found that Tanh and Tanh approximations work best. Other activation functions such as x² or the linear function cause the model not to train properly. We find that the best replacement for our purposes is: −0.00163574303018748x³ + 0.249476365628036x.

4 EXPERIMENTAL RESULTS
The experiments were performed on an Ubuntu 18.04 64-bit machine with an AMD Ryzen 5 2600 @ 3.5GHz processor and 32GB of RAM. The IMDb [27] dataset contains 50,000 movie reviews labeled as either positive or negative, of which 25,000 are used as training and 25,000 as test data. The tokenization is performed by Keras. We train a model to perform sentiment classification, that is, classifying a review as either positive or negative. Out of the 25,000 training instances we use 2,000 as validation data for hyperparameter tuning. We use a vocabulary of the top 20,000 words. We pad or truncate the reviews to be 200 words long. Our model consists of an embedding layer that turns the words in the reviews into real-valued vectors of dimension 128. The embedding matrix is randomly initialized and updated during the training process. The embedding layer is followed by an RNN layer with 128 units. We use the Tanh approximation from Section 3.3 as activation function. The last layer is a fully connected layer with two units and Softmax activation. The training is performed on the plain data using Keras and yields 86.47% accuracy on the unseen test data. We achieve the same accuracy on the encrypted data.
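For reference, the following tf.keras sketch reconstructs this architecture from the description above; it is not the authors' code. The degree-3 polynomial from Section 3.3 is used as the RNN activation, while the use of tf.keras (rather than the original Keras [9] setup), the optimizer and the loss are assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models

def tanh_approx(x):
    # Degree-3 Tanh replacement reported in Section 3.3.
    return -0.00163574303018748 * (x * x * x) + 0.249476365628036 * x

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),  # lookup moves to the client at inference time
    layers.SimpleRNN(128, activation=tanh_approx),      # simple (Elman) RNN with 128 units
    layers.Dense(2, activation="softmax"),              # Softmax is applied client-side under HE
])
model.compile(optimizer="adam",                # assumed; the paper does not name the optimizer
              loss="categorical_crossentropy", # assumes one-hot positive/negative labels
              metrics=["accuracy"])

dummy_batch = np.random.randint(0, 20000, size=(2, 200))  # two padded reviews of length 200
print(model(dummy_batch).shape)                            # (2, 2) class probabilities
```

At inference time the trained embedding matrix is shared with the client, which performs the lookup in plaintext and sends the encrypted vectors to the server (Section 3.1).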
We extract the learned weights and run experiments with different batch sizes. In our experiments the noise growth exceeds the workable threshold after 27 timesteps. This means we need to add communication between client and server seven times to refresh the noise in order to classify the IMDb sequences of length 200. The amount of data that needs to be transmitted depends on the batch size. The encrypted embeddings are larger than the plaintext data by a factor of 1,280. See Table 1 for different batch sizes. The Embeddings column is the amount of data that is initially transferred from the client to the server. Noisy ciphertext gives the size of the data the server sends to the client to be refreshed, and Refreshed ciphertext is the re-encrypted answer. These are the values for only one refresh operation. The Batch column is the total amount of data transferred between client and server during classification of one batch, which requires seven refresh rounds.

Table 1: Data transferred during encrypted classification

Batch size   Embeddings   Noisy ciphertext   Refreshed ciphertext   Batch
1            125MB        0.939MB            0.623MB                135MB
4            287MB        2.2MB              1.5MB                  312MB
32           1,843MB      14MB               9MB                    2,004MB
64           3,548MB      27MB               18MB                   3,863MB
128          7,065MB      54MB               35MB                   7,869MB
256          14,336MB     106MB              70MB                   15,568MB

The data that needs to be transmitted initially makes up the largest portion of the transfer. To run our network, seven noise removal communications are required. At a batch size of 256 the server sends 106MB to the client and the client responds with 70MB. One round of noise removal therefore requires 176MB to be transferred; all seven rounds take 1,232MB, which is less than 10% of the initial transfer. The increase in size of the ciphertexts is nearly linear. Smaller ciphertexts carry more overhead per instance than larger ones.

Table 2: Run times of inference on the IMDb test set

Batch size   Encrypted (sample / batch)   Plain (sample / batch)
1            70.6s / 70.6s                1.83s / 1.8s
4            20.2s / 80.7s                0.496s / 1.99s
32           5.8s / 184.6s                0.072s / 2.30s
64           4.3s / 272.7s                0.055s / 3.52s
128          4.2s / 547.6s                0.046s / 5.89s
256          6.5s / 1658.7s               0.039s / 9.96s

Table 2 lists the execution time for different batch sizes. The times are given for encrypted and plain data, both as the actual time it takes to process the batch and as the resulting time per instance. The noise removal is not performed by the client, though; it is simulated on the server. The measurements also do not include the encryption and transfer of the embeddings. We can see that increasing the batch size leads to lower per-instance classification time. The effect is lost when increasing the batch size from 128 to 256; on the plain data we still see improvement after that point. To get an accurate comparison, the plaintext measurements are performed on the same implementation as the encrypted experiments. The growth in execution time for the encrypted version appears exponential, while the plain version appears logarithmic. Our implementation performs best on encrypted data with a batch size of 128 and worst with a batch size of one if we look at the time per sample. The overhead is smallest, though, for one instance per batch: there the encrypted version is 40 times slower than the plain version. For our optimal batch size of 128 the encrypted version is 92 times slower. This is due to the different growth rates of execution time for the encrypted and plain data.
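The refresh schedule itself can be sanity-checked with a short sketch. The toy model below (our own illustration, not part of the implementation) counts the decrypt-and-re-encrypt round trips needed when the noise budget tolerates 27 RNN timesteps between refreshes; for the length-200 IMDb sequences it reproduces the seven refresh rounds described above.

```python
def count_refreshes(seq_len, steps_per_budget=27):
    """Count client round trips needed for a sequence of seq_len timesteps
    when the ciphertext noise budget tolerates steps_per_budget RNN steps."""
    refreshes = 0
    steps_since_refresh = 0
    for _ in range(seq_len):
        if steps_since_refresh == steps_per_budget:
            refreshes += 1            # client decrypts, re-encrypts, sends back
            steps_since_refresh = 0
        steps_since_refresh += 1      # server applies one RNN step under HE
    return refreshes

print(count_refreshes(200))  # 7, matching the seven refresh rounds in Section 4
```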
5 RELATED WORK
Badawi et al. [2] presented PrivFT, a system for privacy-preserving text classification built on Facebook's fastText (Joulin et al. [24]). The main difference to our work is that we use a recurrent architecture. In PrivFT the embedding operation is also not outsourced to the client: the client needs to one-hot encode each word, encrypt it and send it to the server, where the embedding operation is performed as a matrix multiplication. The message size is similar. The inference time for a single instance on IMDb is higher in our scenario, but using larger batch sizes allows us to get a lower per-instance time. In contrast to our work, PrivFT features schemes for training on encrypted data and a CKKS implementation with GPU acceleration. Lou and Jiang created SHE [26], a privacy-preserving neural network framework based on TFHE. It offers support for LSTM cells. The authors replace the computationally expensive and high-noise matrix operations normally required by LSTMs with much cheaper shift operations. Zhang et al. [39] perform a different NLP task, namely encrypted speech recognition, based on a CNN. The last step of the network, which matches the output to actual text, is performed on the client side.

6 CONCLUSION
In this paper, we present an approach that allows the use of recurrent neural networks on homomorphically encrypted data based on the CKKS scheme. We present a solution to perform NLP tasks over encrypted data using recurrent neural networks, in our case sentiment analysis on the IMDb dataset. We are able to achieve this with no loss in accuracy compared to the plaintext model. This is made possible by introducing communication between client and server to refresh the noise. We trade network traffic for the ability to efficiently use word embeddings. Our future work aims at investigating other recurrent architectures such as LSTM and GRU.
REFERENCES
[1] Louis J. M. Aslett, Pedro M. Esperança, and Chris C. Holmes. 2015. Encrypted statistical machine learning: new privacy preserving methods. CoRR (2015).
[2] Ahmad Al Badawi, Luong Hoang, Chan Fook Mun, Kim Laine, and Khin Mi Mi Aung. 2019. PrivFT: Private and Fast Text Classification with Homomorphic Encryption. arXiv preprint arXiv:1908.06972 (2019).
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.0473
[4] Raphael Bost, Raluca Ada Popa, Stephen Tu, and Shafi Goldwasser. 2015. Machine Learning Classification over Encrypted Data. In 22nd Annual Network and Distributed System Security Symposium, NDSS, San Diego, California, USA.
[5] Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. 2012. (Leveled) Fully Homomorphic Encryption Without Bootstrapping. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS '12). ACM, New York, NY, USA, 309–325.
[6] Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. 2017. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Advances in Cryptology – ASIACRYPT 2017, Tsuyoshi Takagi and Thomas Peyrin (Eds.). Springer International Publishing, Cham, 409–437.
[7] Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. August 2016. TFHE: Fast Fully Homomorphic Encryption Library. https://tfhe.github.io/tfhe/.
[8] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1724–1734. https://doi.org/10.3115/v1/D14-1179
[9] François Chollet et al. 2017. Keras. https://github.com/fchollet/keras.
[10] Edward Chou, Josh Beal, Daniel Levy, Serena Yeung, Albert Haque, and Li Fei-Fei. 2018. Faster CryptoNets: Leveraging sparsity for real-world encrypted inference. arXiv preprint arXiv:1811.09953 (2018).
[11] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
[13] Terrance Devries and Graham W. Taylor. 2017. Improved Regularization of Convolutional Neural Networks with Cutout. CoRR abs/1708.04552 (2017). arXiv:1708.04552 http://arxiv.org/abs/1708.04552
[14] Nathan Dowlin, Ran Gilad-Bachrach, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. 2016. CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy. Technical Report MSR-TR-2016-3.
[15] Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14, 2 (1990), 179–211.
[16] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 1243–1252.
[17] Thore Graepel, Kristin Lauter, and Michael Naehrig. 2013. ML Confidential: Machine Learning on Encrypted Data. In Proceedings of the 15th International Conference on Information Security and Cryptology (ICISC '12). Springer-Verlag.
[18] Alex Graves and Navdeep Jaitly. 2014. Towards End-to-end Speech Recognition with Recurrent Neural Networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (ICML '14). JMLR.org, II-1764–II-1772. http://dl.acm.org/citation.cfm?id=3044805.3045089
[19] A. Graves, A. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 6645–6649. https://doi.org/10.1109/ICASSP.2013.6638947
[20] Shai Halevi and Victor Shoup. 2014. Algorithms in HElib. In Advances in Cryptology - CRYPTO - 34th Annual Cryptology Conference, CA, USA, Proceedings.
[21] Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. 2016. CryptoDL: Towards Deep Learning over Encrypted Data. In Annual Computer Security Applications Conference (ACSAC).
[22] Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. 2019. Deep Neural Networks Classification over Encrypted Data. In Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy. ACM, 97–108.
[23] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[24] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, Valencia, Spain, 427–431. https://www.aclweb.org/anthology/E17-2068
[25] Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Qian Lou and Lei Jiang. 2019. SHE: A Fast and Accurate Deep Neural Network for Encrypted Data. In Advances in Neural Information Processing Systems. 10035–10043.
[27] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, 142–150. http://www.aclweb.org/anthology/P11-1015
[28] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.
[29] P. Mohassel and Y. Zhang. 2017. SecureML: A System for Scalable Privacy-Preserving Machine Learning. In IEEE Symposium on Security and Privacy (SP).
[30] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
[31] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
[32] M. Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M. Songhori, Thomas Schneider, and Farinaz Koushanfar. 2018. Chameleon: A Hybrid Secure Computation Framework for Machine Learning Applications. CoRR abs/1801.03239 (2018). arXiv:1801.03239 http://arxiv.org/abs/1801.03239
[33] Thomas Shortell and Ali Shokoufandeh. 2015. Secure Signal Processing Using Fully Homomorphic Encryption. In Advanced Concepts for Intelligent Vision Systems - 16th International Conference, ACIVS, Italy, Proceedings.
[34] N. P. Smart and F. Vercauteren. 2014. Fully homomorphic SIMD operations. Designs, Codes and Cryptography 71, 1 (01 Apr 2014), 57–81. https://doi.org/10.1007/s10623-012-9720-4
[35] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM Neural Networks for Language Modeling. In INTERSPEECH.
[36] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research), Francis Bach and David Blei (Eds.), Vol. 37. PMLR, Lille, France, 2048–2057. http://proceedings.mlr.press/v37/xuc15.html
[37] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237 (2019).
[38] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. CoRR abs/1605.07146 (2016). arXiv:1605.07146 http://arxiv.org/abs/1605.07146
[39] S. Zhang, Y. Gong, and D. Yu. 2019. Encrypted Speech Recognition Using Deep Polynomial Networks. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5691–5695. https://doi.org/10.1109/ICASSP.2019.8683721