Privacy-Preserving Character Language Modelling
Patricia Thaine, Gerald Penn
Department of Computer Science, University of Toronto
{pthaine, gpenn}@cs.toronto.edu

Abstract
Some of the most sensitive information we generate is either written or spoken using natural language. Privacy-preserving methods for natural language processing are therefore crucial, especially considering the ever-growing number of data breaches. However, there has been little work in this area up until now. In fact, no privacy-preserving methods have been proposed for many of the most basic NLP tasks. We propose a method for calculating character bigram and trigram probabilities over sensitive data using homomorphic encryption. Where determining an encrypted character's probability using a plaintext bigram model has a runtime of 1945.29 ms per character, an encrypted bigram model takes us 88549.1 ms, a plaintext trigram model takes 3868.65 ms, and an encrypted trigram model takes 102766 ms.

Introduction
Character-level language models are used in a variety of tasks, such as character prediction for facilitating text messaging. Despite the sensitive nature of the input data, there has been very little work done on privacy-preserving character language models. One example of such work is Apple's use of a version of differential privacy called randomized response to improve their emoji QuickType predictions (https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf). Briefly, what users truly type is sent to Apple with a certain probability p, and fake input is sent with probability 1 − p. While this technique can be used for efficiently training a privacy-preserving emoji prediction system, it is easy to see that on its own it would not preserve the privacy of natural language input while preserving its utility.

We propose a privacy-preserving character-level n-gram language model, the inputs to which are entirely accurate and entirely private. We use Homomorphic Encryption (HE) for this purpose. HE has so far been used for a small number of natural language processing and information retrieval tasks. Some of these include spam filtering (Pathak, Sharifi, and Raj 2011), hidden-Markov-model-based spoken keyword spotting (Pathak et al. 2011), speaker recognition (Pathak and Raj 2013), n-gram-based similar document detection (Jiang and Samanthula 2011), language identification (Monet and Clier 2016), keyword search, and bag-of-words frequency counting (Grinman 2016).

Homomorphic Encryption
Homomorphic encryption schemes allow computations to be performed on encrypted data without needing to decrypt it. For this work, we use the Brakerski-Fan-Vercauteren (BFV) Ring-Learning-With-Errors-based fully homomorphic encryption scheme (Brakerski 2012; Fan and Vercauteren 2012), with the encryption and homomorphic multiplication improvements presented in (Halevi, Polyakov, and Shoup 2018). This scheme is implemented in the PALISADE Lattice Cryptography Library (https://git.njit.edu/palisade/PALISADE). However, the algorithm we propose can be implemented using any homomorphic encryption scheme that allows for addition and component-wise multiplication in the encrypted domain.

Notation, Scheme Overview, and Chosen Parameters
We will be using the same notation as (Brakerski 2012) and (Fan and Vercauteren 2012), but as we provide only a brief overview of the homomorphic encryption scheme, the specific optimizations introduced in (Fan and Vercauteren 2012; Halevi, Polyakov, and Shoup 2018) will not be discussed. Let R = Z[x]/(f(x)) be an integer ring, where f(x) ∈ Z[x] is a monic irreducible polynomial of degree d. Bold lowercase letters denote elements of R, and their coefficients are denoted by indices (e.g., a = Σ_{i=0..d−1} a_i · x^i). Z_q, where q > 1, q ∈ Z, denotes the set of integers (−q/2, q/2]; q is referred to as the ciphertext modulus. An integer n's i-th bit is denoted n[i], counting from i = 0. The secret key is sk = (1, s), where s ← χ and χ is the scheme's error distribution. The public key is pk = ([−(a · s + e)]_q, a), where a ← R_q, e ← χ.

• Encrypt(pk, m): for a message m ∈ R_t, let p0 = pk[0], p1 = pk[1], u ← R_2, e1, e2 ← χ:
    ct = ([p0 · u + e1 + ∆ · m]_q, [p1 · u + e2]_q)
  where ∆ = ⌊q/t⌋.
• Decrypt(sk, ct): let s = sk[1], c0 = ct[0], c1 = ct[1], and compute
    m = [⌊t · [c0 + c1 · s]_q / q⌉]_t
  where ⌊·⌉ denotes rounding to the nearest integer.
• Add(ct1, ct2): ([ct1[0] + ct2[0]]_q, [ct1[1] + ct2[1]]_q)
• Add(ct1, pt2): ([ct1[0] + pt2[0]]_q, [ct1[1] + pt2[1]]_q)
• Mul(ct1, ct2): For this paper, we use component-wise multiplication, a simplified description of which is ([ct1[0] · ct2[0]]_q, [ct1[1] · ct2[1]]_q). The algorithmic details for obtaining this result can be found in (Fan and Vercauteren 2012).
• Mul(ct1, pt2): As with the Add function, it is possible to multiply a ciphertext with a plaintext, resulting in ([ct1[0] · pt2[0]]_q, [ct1[1] · pt2[1]]_q).

Using homomorphic encryption, we can perform linear and (depending on the encryption scheme) polynomial operations on encrypted data (multiplication, addition, or both). We can neither divide a ciphertext, nor exponentiate using an encrypted exponent; we can, however, keep track of a numerator and a corresponding denominator separately. For clarity, we shall refer to the encrypted version of a value ∗ as E(∗) and use × to represent Mul.

To represent a character, the user sends the server a one-hot vector along with k ordered vectors of size k. Each of these vectors is a zero-vector, except for a vector of ones at the letter's 'designated index'. Here is a simple example for a language containing k = 3 character types. Assume we want to convert the letter 'a'; the server is sent two matrices which represent the letter 'a', denoted by Ma1 and Ma2:

            ([1 1 1])
Ma1 = E([1 0 0]),  Ma2 = E ([0 0 0])
            ([0 0 0])

Say 'a' is followed by 'b'; then the server receives:

            ([0 0 0])
Mb1 = E([0 1 0]),  Mb2 = E ([1 1 1])
            ([0 0 0])

The user wants to know how likely it is for 'b' to follow 'a'.
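Before turning to the server's side, the user-side construction of these matrices can be sketched in the clear. This is our own illustration (the helper names are not part of PALISADE); in the actual protocol, both structures are encrypted with BFV before being sent, so the server never sees where the ones are.

```python
def one_hot(ch, alphabet):
    """M_c1: one-hot vector marking the character's index."""
    v = [0] * len(alphabet)
    v[alphabet.index(ch)] = 1
    return v

def designated_index_matrix(ch, alphabet):
    """M_c2: k ordered vectors of size k, all zero-vectors except
    for a vector of ones at the character's designated index."""
    k = len(alphabet)
    i = alphabet.index(ch)
    return [[1] * k if row == i else [0] * k for row in range(k)]

alphabet = ['a', 'b', 'c']            # k = 3 character types
Ma1 = one_hot('a', alphabet)          # [1, 0, 0]
Ma2 = designated_index_matrix('a', alphabet)
# Ma2 == [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
```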
The server is able to calculate this like so, multiplying its bigram probability matrix component-wise with Ma2:

[p11 p12 p13]      ([1 1 1])
[p21 p22 p23]  × E ([0 0 0])
[p31 p32 p33]      ([0 0 0])

Resulting in:

      ([p11 p12 p13])
I = E ([ 0   0   0 ])
      ([ 0   0   0 ])

We then take the inner products of each row of I with Mb1 and add them all together. The result, E(p12), is sent back to the user, who is able to decrypt it.

Optimization: Single Instruction Multiple Data
Single Instruction Multiple Data (SIMD) is explained in (Smart and Vercauteren 2014). Using the Chinese Remainder Theorem, an operation between two SIMD-encoded lists of encrypted numbers can be performed by the same operation between two regularly-encoded encrypted numbers.

Encoding Variables
The very first step in converting an algorithm to an HE-friendly form is to make the data amenable to HE analysis. This includes transforming floating-point numbers into approximations, such as by computing an approximate rational representation for them, or by clearing them: multiplying them by a pre-specified power of 10 and then rounding to the nearest integer (Graepel, Lauter, and Naehrig 2012). We follow the method suggested in (Aslett, Esperança, and Holmes 2015): choose the number of decimal places to be retained based on a desired level of accuracy φ, then multiply the data by 10^φ and round to the nearest integer. Since we are dealing with probabilities, we do not lose much information when converting, say, 99.6% to 99.

Privacy-Preserving Bigram and Trigram Models
We assume that a user has some sensitive data requiring character-level predictions to be made and that a server has a character-level language model that they do not want to share. We train a bigram model and a trigram model using plaintext from part of the Enron email dataset, which we pre-process to only contain k = 27 character types (space and the 26 lowercase letters). A user's emails are then converted into binary vectors of dimension k.

Bigram Probabilities
For the bigram model, we convert characters into one-hot vectors (e.g., the vector for 'a' has a 1 at index 0). Along with the one-hot vector, k ordered vectors of size k are sent, exactly as in the k = 3 example above.

Trigram Probabilities
To use the trigram model, we again convert characters into one-hot vectors. Along with them, however, we must send a bit more information. Say we have a trigram probability matrix whose rows correspond to two-character contexts, sorted as follows:

c1c1
c1c2
c1c3
c2c1
c2c2
c2c3
c3c1
c3c2
c3c3

If we want to access the row containing the probabilities of context cx cy, we can find its index with the equation x·t + y, where t is the number of character types and x, y are the characters' indices, counted from 0. Let us use 'a' as an example again, with t = 3. The user must again send along Ma1 = E([1 0 0]), as well as:

        ([1 1 1])          ([1 1 1])
        ([0 0 0])          ([1 1 1])
        ([0 0 0])          ([1 1 1])
        ([1 1 1])          ([0 0 0])
Ma2 = E ([0 0 0]), Ma3 = E ([0 0 0])
        ([0 0 0])          ([0 0 0])
        ([1 1 1])          ([0 0 0])
        ([0 0 0])          ([0 0 0])
        ([0 0 0])          ([0 0 0])

If 'b' follows 'a', the server is then sent Mb1 = E([0 1 0]) and:

        ([0 0 0])          ([0 0 0])
        ([1 1 1])          ([0 0 0])
        ([0 0 0])          ([0 0 0])
        ([0 0 0])          ([1 1 1])
Mb2 = E ([1 1 1]), Mb3 = E ([1 1 1])
        ([0 0 0])          ([1 1 1])
        ([0 0 0])          ([0 0 0])
        ([1 1 1])          ([0 0 0])
        ([0 0 0])          ([0 0 0])

The probability of 'aba', given a trigram probability matrix Ptrigram, can be calculated as follows:

I = Ptrigram × (Ma3 × Mb2)

Next, the server takes the inner products of each row of I with Ma1 and adds them all together. The result, E(p121), is sent back to the user, who is able to decrypt it.

Plaintext Model, Encrypted Input
It takes us 1945.29 ms to output the probability of one encrypted character given the preceding character (also encrypted) and a plaintext character bigram model. It takes us 3868.65 ms to output the probability of one encrypted character given its two preceding characters (both encrypted) and a plaintext character trigram model. These results are based on the parameters listed in the Security section.

Encrypted Model, Encrypted Input
It takes us 88549.1 ms to output the probability of one encrypted character given the preceding character (also encrypted) and an encrypted character bigram model. It takes us 102766 ms to output the probability of one encrypted character given its two preceding characters (both encrypted) and an encrypted character trigram model. These results are also based on the parameters listed in the Security section.

Additional runtime comparisons are provided in Figure 1 and Figure 2. While the runtime of encrypted models might limit the practicality of their deployment, plaintext models running on encrypted data are practical for deployment, especially when considering the parallelizability of predictions and the speed-ups that better RAM could lead to.

Conclusion and Future Work
We described a method for calculating character-level bigram and trigram probabilities given encrypted data and performed runtime experiments across various security levels to test the scalability of our algorithms. Our next steps will be to adapt this method to word-level language modelling, as well as to create a protocol for training n-gram models on encrypted data.

Security
The BFV scheme has semantic security (Albrecht et al. 2018), which means that "whatever an eavesdropper can compute about the cleartext given the ciphertext, he can also compute without the ciphertext" (Shafi and Micali 1984). For our first few experiments, we chose parameters that give over a 128-bit security level according to the values presented in (Albrecht et al. 2018). This means that it would take over 2^128 computations to crack the decryption key. We set the BFV scheme's parameters as follows, to guarantee 128-bit security:
• Plaintext Modulus (t) = 65537,
• σ = 3.2,
• Root Hermite Factor = 1.0081,
• m = 16384,
• Ciphertext Modulus (q) = 1532495403905125851249873756002082622499024400469688321.
To run the 256-bit security experiments, we set m = 32768.
Experiments
The following experiments were run on an Intel Core i7-8650U CPU @ 1.90GHz with 16GB of RAM.

Figure 1: Runtime comparisons of bigram and trigram predictions for character language models with 27 to 31 characters, across three security levels, between encrypted and plaintext character language models.

Figure 2: Runtime comparisons, at 256-bit security, of bigram and trigram predictions for character language models with 27 to 31 characters.

References
Albrecht, M.; Chase, M.; Chen, H.; Ding, J.; Goldwasser, S.; Gorbunov, S.; Hoffstein, J.; Lauter, K.; Lokam, S.; Micciancio, D.; Moody, D.; Morrison, T.; Sahai, A.; and Vaikuntanathan, V. 2018. Homomorphic encryption security standard. Technical report, HomomorphicEncryption.org, Cambridge, MA.
Aslett, L. J.; Esperança, P. M.; and Holmes, C. C. 2015. Encrypted statistical machine learning: new privacy preserving methods. arXiv preprint arXiv:1508.06845.
Brakerski, Z. 2012. Fully homomorphic encryption without modulus switching from classical GapSVP. In Advances in Cryptology - CRYPTO 2012. Springer. 868-886.
Fan, J., and Vercauteren, F. 2012. Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive 2012:144.
Graepel, T.; Lauter, K.; and Naehrig, M. 2012. ML confidential: Machine learning on encrypted data. In International Conference on Information Security and Cryptology, 1-21. Springer.
Grinman, A. J. 2016. Natural language processing on encrypted patient data. Ph.D. Dissertation, Massachusetts Institute of Technology.
Halevi, S.; Polyakov, Y.; and Shoup, V. 2018. An improved RNS variant of the BFV homomorphic encryption scheme. IACR Cryptology ePrint Archive 2018:117.
Jiang, W., and Samanthula, B. K. 2011. N-gram based secure similar document detection. In IFIP Annual Conference on Data and Applications Security and Privacy, 239-246. Springer.
Monet, N., and Clier, J. 2016. Privacy-preserving text language identification using homomorphic encryption. US Patent 9,288,039.
Pathak, M. A., and Raj, B. 2013. Privacy-preserving speaker verification and identification using Gaussian mixture models. IEEE Transactions on Audio, Speech, and Language Processing 21(2):397-406.
Pathak, M. A.; Rane, S.; Sun, W.; and Raj, B. 2011. Privacy preserving probabilistic inference with hidden Markov models. In ICASSP, 5868-5871.
Pathak, M. A.; Sharifi, M.; and Raj, B. 2011. Privacy preserving spam filtering. arXiv preprint arXiv:1102.4021.
Shafi, G., and Micali, S. 1984. Probabilistic encryption. Journal of Computer and System Sciences 28(2):270-299.
Smart, N. P., and Vercauteren, F. 2014. Fully homomorphic SIMD operations. Designs, Codes and Cryptography 71(1):57-81.