
Privacy-Preserving Character Language Modelling

Patricia Thaine

Gerald Penn

gpenn@cs.toronto.edu, Department of Computer Science, University of Toronto

Some of the most sensitive information we generate is written or spoken in natural language. Privacy-preserving methods for natural language processing are therefore crucial, especially given the ever-growing number of data breaches. However, there has been little work in this area so far; in fact, no privacy-preserving methods have been proposed for many of the most basic NLP tasks. We propose a method for calculating character bigram and trigram probabilities over sensitive data using homomorphic encryption. Determining an encrypted character's probability takes 1945.29 ms per character with a plaintext bigram model, 3868.65 ms with an encrypted bigram model, 88549.1 ms with a plaintext trigram model, and 102766 ms with an encrypted trigram model.

Introduction

Character-level language models are used in a variety of tasks, such as character prediction for facilitating text messaging. Despite the sensitive nature of the input data, very little work has been done on privacy-preserving character language models. One example of such work is Apple’s use of a version of differential privacy called randomized response to improve their emoji QuickType predictions¹. Briefly, what users truly type is sent to Apple with a certain probability $p$, and fake input is sent with probability $1 - p$. While this technique can be used to efficiently train a privacy-preserving emoji prediction system, on its own it cannot keep natural language input private while also preserving its utility.
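As a rough illustration of the randomized response idea described above, the following Python sketch reports the true input with probability p and a fake input otherwise. The function name, the toy vocabulary, and the choice to draw the fake input uniformly from the vocabulary are assumptions made for illustration; they are not taken from Apple's system.

```python
import random

def randomized_response(true_value, vocabulary, p, rng=random):
    """Report true_value with probability p, otherwise a fake value.

    Assumption: the fake value is drawn uniformly from the vocabulary.
    """
    if rng.random() < p:
        return true_value
    return rng.choice(vocabulary)

rng = random.Random(0)
vocab = ["smile", "party", "thumbs_up"]  # hypothetical toy vocabulary
reports = [randomized_response("thumbs_up", vocab, p=0.75, rng=rng)
           for _ in range(10_000)]
share_true = reports.count("thumbs_up") / len(reports)
print(f"share of reports equal to the true value: {share_true:.2f}")
```

Any single report is deniable, while aggregate frequencies remain estimable. For free-form natural language, however, each fake report destroys the content of that message, which is the utility problem noted above.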

We propose a privacy-preserving character-level n-gram language model, the inputs to which are entirely accurate and entirely private. We use Homomorphic Encryption (HE) for this purpose. HE has so far been used for a small number of natural language processing and information retrieval tasks. These include spam filtering (Pathak, Sharifi, and Raj 2011), hidden-Markov-model-based spoken keyword spotting (Pathak et al. 2011), speaker recognition (Pathak and Raj 2013), n-gram-based similar document detection (Jiang and Samanthula 2011), language identification (Monet and Clier 2016), keyword search, and bag-of-words frequency counting (Grinman 2016).

¹ https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf

Homomorphic Encryption

Homomorphic encryption schemes allow for computations to be performed on encrypted data without needing to decrypt it.

For this work, we use the Brakerski-Fan-Vercauteren (BFV) Ring-Learning-With-Errors-based fully homomorphic encryption scheme (Brakerski 2012; Fan and Vercauteren 2012), with the encryption and homomorphic multiplication improvements presented in (Halevi, Polyakov, and Shoup 2018). This scheme is implemented in the PALISADE Lattice Cryptography Library². However, the algorithm we propose can be implemented using any homomorphic encryption scheme that allows for addition and component-wise multiplication in the encrypted domain.

Notation, Scheme Overview, and Chosen Parameters

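We will be using the same notation as (Brakerski 2012) and (Fan and Vercauteren 2012), but as we provide only a brief overview of the homomorphic encryption scheme, the specific optimizations introduced in (Fan and Vercauteren 2012) and (Halevi, Polyakov, and Shoup 2018) will not be discussed. Let $R = \mathbb{Z}[x]/(f(x))$ be an integer ring, where $f(x) \in \mathbb{Z}[x]$ is a monic irreducible polynomial of degree $d$. Bold lowercase letters denote elements of $R$, and their coefficients are denoted by indices (e.g., $\mathbf{a} = \sum_{i=0}^{d-1} a_i x^i$). $\mathbb{Z}_q$, where $q > 1$, $q \in \mathbb{Z}$, denotes the set of integers $(-q/2, q/2]$. $q$ is referred to as the ciphertext modulus. An integer $n$’s $i$-th bit is denoted $n[i]$, from $i = 0$. The secret key is $sk = (1, \mathbf{s})$, where $\mathbf{s} \leftarrow \chi$. The public key is $pk = ([-(\mathbf{a} \cdot \mathbf{s} + \mathbf{e})]_q, \mathbf{a})$, where $\mathbf{a} \leftarrow R_q$ and $\mathbf{e} \leftarrow \chi$. Here $\chi$ denotes the error distribution and $\leftarrow$ denotes sampling.

Encrypt($pk$, $\mathbf{m}$): for a message $\mathbf{m} \in R_t$, let $\mathbf{p}_0 = pk[0]$, $\mathbf{p}_1 = pk[1]$, $\mathbf{u} \leftarrow R_2$, $\mathbf{e}_1, \mathbf{e}_2 \leftarrow \chi$, and $\Delta = \lfloor q/t \rfloor$:

$$ct = \left([\mathbf{p}_0 \cdot \mathbf{u} + \mathbf{e}_1 + \Delta \cdot \mathbf{m}]_q,\ [\mathbf{p}_1 \cdot \mathbf{u} + \mathbf{e}_2]_q\right)$$

Decrypt($sk$, $ct$): let $\mathbf{s} = sk[1]$, $\mathbf{c}_0 = ct[0]$, $\mathbf{c}_1 = ct[1]$:

$$\left[\left\lfloor \frac{t}{q} \cdot [\mathbf{c}_0 + \mathbf{c}_1 \cdot \mathbf{s}]_q \right\rceil\right]_t$$

Add($ct_1$, $ct_2$): $([ct_1[0] + ct_2[0]]_q,\ [ct_1[1] + ct_2[1]]_q)$

Add($ct_1$, $pt_2$): $([ct_1[0] + pt_2[0]]_q,\ [ct_1[1] + pt_2[1]]_q)$

Mul($ct_1$, $ct_2$): For this paper, we use component-wise multiplication, a simplified description of which is $([ct_1[0] \cdot ct_2[0]]_q,\ [ct_1[1] \cdot ct_2[1]]_q)$. The algorithmic details for obtaining this result can be found in (Fan and Vercauteren 2012).

Mul($ct_1$, $pt_2$): As with the Add function, it is possible to multiply a ciphertext with a plaintext, resulting in $([ct_1[0] \cdot pt_2[0]]_q,\ [ct_1[1] \cdot pt_2[1]]_q)$.

² https://git.njit.edu/palisade/PALISADE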
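To make the Encrypt, Decrypt, and Add operations above more concrete, here is a toy pure-Python sketch of a BFV-style scheme over $\mathbb{Z}_q[x]/(x^D + 1)$. It is not the PALISADE implementation and it is not secure: the parameters D, Q, and T are tiny hypothetical values, every "small" polynomial (secret, $\mathbf{u}$, and errors) is drawn from {-1, 0, 1} as a stand-in for the distributions used in practice, and the message is scaled by the usual BFV factor floor(q/t), named DELTA here. Only ciphertext addition is shown.

```python
import random

D = 16            # ring dimension: polynomials are reduced modulo x^D + 1
Q = 1 << 15       # toy ciphertext modulus q (assumption, far too small for security)
T = 16            # toy plaintext modulus t
DELTA = Q // T    # scaling factor floor(q / t)

def center(x, modulus):
    """Reduce x into the interval (-modulus/2, modulus/2]."""
    x %= modulus
    return x - modulus if x > modulus // 2 else x

def add(a, b, modulus=Q):
    return [center(x + y, modulus) for x, y in zip(a, b)]

def mul(a, b, modulus=Q):
    """Multiply two polynomials modulo x^D + 1 (x^D wraps around to -1)."""
    out = [0] * D
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < D:
                out[i + j] += x * y
            else:
                out[i + j - D] -= x * y
    return [center(c, modulus) for c in out]

def small(rng):
    """Sample a polynomial with coefficients in {-1, 0, 1}."""
    return [rng.choice((-1, 0, 1)) for _ in range(D)]

def keygen(rng):
    s = small(rng)
    a = [rng.randrange(Q) for _ in range(D)]
    e = small(rng)
    p0 = [center(-c, Q) for c in add(mul(a, s), e)]   # [-(a*s + e)]_q
    return s, (p0, a)

def encrypt(pk, m, rng):
    p0, p1 = pk
    u, e1, e2 = small(rng), small(rng), small(rng)
    c0 = add(add(mul(p0, u), e1), [DELTA * x for x in m])
    c1 = add(mul(p1, u), e2)
    return c0, c1

def decrypt(s, ct):
    c0, c1 = ct
    noisy = add(c0, mul(c1, s))                       # DELTA*m + small noise
    return [round(T * x / Q) % T for x in noisy]

def add_ct(ct1, ct2):
    return add(ct1[0], ct2[0]), add(ct1[1], ct2[1])

rng = random.Random(1)
sk, pk = keygen(rng)
m1 = [rng.randrange(T) for _ in range(D)]
m2 = [rng.randrange(T) for _ in range(D)]
total = decrypt(sk, add_ct(encrypt(pk, m1, rng), encrypt(pk, m2, rng)))
print(total == [(x + y) % T for x, y in zip(m1, m2)])  # True
```

With these toy values the accumulated noise after one addition stays well below DELTA/2, so rounding in `decrypt` always recovers the coefficient-wise sum modulo T.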

Using homomorphic encryption, we can perform linear and (depending on the encryption scheme) polynomial operations on encrypted data (multiplication, addition, or both). We can neither divide a ciphertext nor exponentiate using an encrypted exponent. We can, however, keep track separately of a numerator and a corresponding denominator. For clarity, we shall write $E(x)$ for the encrypted version of a value $x$ and use $\odot$ to represent Mul.
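The idea of tracking a numerator and a denominator separately can be sketched in plaintext Python as follows. This is only an illustration under assumed conventions: probabilities are stored as integers out of a public scale of 100, the numerator stands in for the value that would be multiplied homomorphically, and the denominator is a public quantity kept in the clear. The names `SCALE`, `encode`, and `multiply_encoded` are hypothetical.

```python
from fractions import Fraction

SCALE = 100  # assumed public scale: probabilities stored as integers out of 100

def encode(p):
    return round(p * SCALE)

def multiply_encoded(values):
    """Multiply integer-encoded probabilities; return (numerator, denominator)."""
    numerator, denominator = 1, 1
    for v in values:
        numerator *= v          # stands in for a homomorphic Mul on ciphertexts
        denominator *= SCALE    # public scale, tracked in the clear
    return numerator, denominator

num, den = multiply_encoded([encode(0.25), encode(0.5)])
print(Fraction(num, den))  # 1/8
```

The division is deferred until after decryption, so no ciphertext ever needs to be divided.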

Optimization: Single Instruction Multiple Data

Single Instruction Multiple Data (SIMD) is explained in (Smart and Vercauteren 2014). Using the Chinese Remainder Theorem, an operation between two SIMD-encoded lists of encrypted numbers can be performed with the same single operation that would be applied to two regularly-encoded encrypted numbers, acting on all list entries at once.
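The packing principle can be illustrated with the integer form of the Chinese Remainder Theorem. The sketch below is an analogue only: HE libraries pack values into polynomial slots rather than into integers, and the moduli chosen here are arbitrary toy values. It shows that one addition or multiplication on packed values acts on every slot at once.

```python
from math import prod

MODULI = (5, 7, 11)          # pairwise coprime toy "slots" (hypothetical values)
M = prod(MODULI)             # 385

def pack(values):
    """Encode one residue per modulus into a single integer modulo M."""
    x = 0
    for v, m in zip(values, MODULI):
        partial = M // m
        x += v * partial * pow(partial, -1, m)  # modular inverse, Python 3.8+
    return x % M

def unpack(x):
    return [x % m for m in MODULI]

a, b = pack([1, 2, 3]), pack([4, 5, 6])
print(unpack((a + b) % M))   # [0, 0, 9]
print(unpack((a * b) % M))   # [4, 3, 7]
```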

Encoding Variables

The very first step in converting an algorithm to an HE-friendly form is to make the data amenable to HE analysis. This includes transforming floating-point numbers into integer approximations, for example by computing an approximate rational representation for them, multiplying them by a pre-specified power of 10, and then rounding to the nearest integer (Graepel, Lauter, and Naehrig 2012).

We follow the method suggested in (Aslett, Esperança, and Holmes 2015): choose the number of decimal places to be retained based on a desired level of accuracy $\phi$, then multiply the data by $10^{\phi}$ and round to the nearest integer. Since we are dealing with probabilities, we do not lose much information when converting, say, 99.6% to 99.
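A minimal Python sketch of this fixed-point encoding is shown below. The function names and the `decimals` parameter (the number of decimal places retained) are illustrative choices, not names from the cited work.

```python
def encode(value, decimals):
    """Scale by 10**decimals and round to the nearest integer."""
    return round(value * 10 ** decimals)

def decode(encoded, decimals):
    """Undo the scaling (done by the data owner after decryption)."""
    return encoded / 10 ** decimals

print(encode(0.0312, 3))   # 31
print(decode(31, 3))       # 0.031
```

The rounding error is at most half a unit in the last retained decimal place, which is the accuracy trade-off controlled by `decimals`.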

Privacy-Preserving Bigram and Trigram Models

We assume that a user has some sensitive data requiring character-level predictions to be made and that a server has a character-level language model that they do not want to share. We train a bigram model and a trigram model using plaintext from part of the Enron email dataset, which we pre-process to only contain k = 27 types (space and 26 lowercase letters). A user’s emails are then converted into binary vectors of dimension k.
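The pre-processing and vector conversion described above might look as follows in Python. This is a sketch under assumptions: characters outside the 27 types are simply dropped, and the space character is placed at the last index (the text only fixes ‘a’ at index 0). The names `ALPHABET`, `normalize`, and `one_hot` are hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase + " "   # k = 27; space placed last (assumption)
K = len(ALPHABET)

def normalize(text):
    """Lowercase the text and keep only the 27 supported character types."""
    return "".join(c for c in text.lower() if c in ALPHABET)

def one_hot(char):
    """Binary vector of dimension K with a single 1 at the character's index."""
    vec = [0] * K
    vec[ALPHABET.index(char)] = 1
    return vec

clean = normalize("Hi, Bob!")
print(clean)                 # hi bob
print(one_hot("a")[:3])      # [1, 0, 0]
```

In the protocol, each such vector would be encrypted by the user before being sent to the server.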

Bigram Probabilities

For the bigram model, we convert characters into one-hot vectors (e.g., the vector for ‘a’ has a 1 at index 0). Along with the one-hot vector, $k$ ordered vectors of size $k$ are sent. Each of these vectors is a zero vector, except for the one at the letter’s ‘designated index’, which is a vector of ones. Here is a simple example for a language containing $k = 3$ character types. Assume we want to convert the letter ‘a’; the server is sent two encrypted matrices that represent ‘a’, denoted by $M_{a1}$ and $M_{a2}$:

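$$M_{a1} = E\left(\begin{bmatrix} 1 & 0 & 0 \end{bmatrix}\right); \quad M_{a2} = E\left(\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\right)$$

Say ‘a’ is followed by ‘b’; then, by the same encoding, the server receives for ‘b’:

$$M_{b1} = E\left(\begin{bmatrix} 0 & 1 & 0 \end{bmatrix}\right); \quad M_{b2} = E\left(\begin{bmatrix} 0 & 0 & 0 \\ 1 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix}\right)$$

The user wants to know how likely it is for ‘b’ to follow ‘a’. The server is able to calculate this like so:

[equation missing in source]

Resulting in:

[equation missing in source]

Trigram Probabilities

To use the trigram model, we again convert characters into one-hot vectors. Along with them, however, we must send a bit more information. Say we have a trigram probability matrix whose columns are sorted as follows:

[matrix layout missing in source]

Plaintext Model, Encrypted Input

It takes us 1945.29 ms to output the probability of one encrypted character given the preceding character (also encrypted) and a plaintext character bigram model. It takes us 88549.1 ms to output the probability of one encrypted character given its two preceding characters (both encrypted) and a plaintext character trigram model. These results are based on the parameters listed in the Security section.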

Encrypted Model, Encrypted Input

It takes us 3868.65 ms to output the probability of one encrypted character given the preceding character (also encrypted) and an encrypted character bigram model. It takes us 102766 ms to output the probability of one encrypted character given its two preceding characters (both encrypted) and an encrypted character trigram model. These results are also based on the parameters listed in the Security section.

Additional runtime comparisons are provided in Figure 1 and Figure 2. While the runtime of encrypted models might limit the practicality of their deployment, plaintext models running on encrypted data are practical to deploy, especially considering that predictions can be parallelized and that better RAM could lead to further speed-ups.
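The per-character runtimes reported in this section can be put in perspective with a little arithmetic. The Python sketch below computes the slowdown of an encrypted model relative to a plaintext model and gives an idealised wall-clock estimate when characters are distributed over several workers. The even split across workers with no overhead is an assumption, and the helper names are hypothetical.

```python
RUNTIME_MS = {
    ("bigram", "plaintext model"): 1945.29,
    ("bigram", "encrypted model"): 3868.65,
    ("trigram", "plaintext model"): 88549.1,
    ("trigram", "encrypted model"): 102766.0,
}

def slowdown(order):
    """Ratio of encrypted-model to plaintext-model runtime per character."""
    return (RUNTIME_MS[(order, "encrypted model")]
            / RUNTIME_MS[(order, "plaintext model")])

def wall_clock_seconds(n_chars, ms_per_char, workers=1):
    """Idealised estimate: characters split evenly over independent workers."""
    per_worker = -(-n_chars // workers)  # ceiling division
    return per_worker * ms_per_char / 1000

for order in ("bigram", "trigram"):
    print(order, round(slowdown(order), 2))
print(wall_clock_seconds(100, RUNTIME_MS[("bigram", "plaintext model")], workers=4))
```

Under these assumptions, encrypting the model roughly doubles the bigram runtime, while moving from bigrams to trigrams is the dominant cost.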

Conclusion and Future Work

We described a method for calculating character-level bigram and trigram probabilities given encrypted data and performed runtime experiments across various security levels to test the scalability of our algorithms. Our next steps will be to adapt this method to word-level language modeling, as well as to create a protocol for training n-gram models on encrypted data.

The probability of ‘aba’, given a trigram probability matrix $P_{trigram}$, can be calculated as follows:

$$I = P_{trigram} \odot (M_{a3} \odot M_{b3})$$

where $\odot$ denotes Mul (component-wise multiplication). Next, the server takes the inner products of each row of $I$ with $M_{a1}$ and adds them all together. The result, $E(p_{121})$, is sent back to the user, who is able to decrypt it.
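The mask-then-sum selection described above can be sketched in plaintext Python, with ordinary integers standing in for ciphertexts. The matrix layout is an assumption made for this illustration, since it is not specified here: the table has one row per two-character context and one column per next character, and the two context masks have rows of ones wherever the context matches. All names and the toy table values are hypothetical.

```python
from itertools import product

ALPHABET = "abc"                                   # toy language with k = 3
K = len(ALPHABET)
CONTEXTS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # aa, ab, ..., cc

def one_hot(ch):
    return [int(c == ch) for c in ALPHABET]

def context_mask(position, ch):
    """(K*K) x K mask: rows of ones where the context has ch at position."""
    return [[int(ctx[position] == ch)] * K for ctx in CONTEXTS]

def hadamard(x, y):
    """Component-wise product, standing in for Mul on ciphertexts."""
    return [[a * b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def trigram_probability(p_trigram, c1, c2, c3):
    mask = hadamard(context_mask(0, c1), context_mask(1, c2))
    selected = hadamard(p_trigram, mask)           # only the (c1, c2) row survives
    target = one_hot(c3)
    # Inner product of every row with the one-hot vector, then sum the results.
    return sum(sum(v * t for v, t in zip(row, target)) for row in selected)

# Arbitrary toy table of integer-encoded values, one row per context.
P = [[(i + j) % 7 * 10 for j in range(K)] for i in range(len(CONTEXTS))]
print(trigram_probability(P, "a", "b", "a") == P[CONTEXTS.index("ab")][0])  # True
```

Because the selection uses only multiplications and additions, the server never needs to learn which row or column was picked.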

Security

The BFV scheme has semantic security (Albrecht et al. 2018), which means that “whatever an eavesdropper can compute about the cleartext given the ciphertext, he can also compute without the ciphertext” (Goldwasser and Micali 1984). For our first few experiments, we chose parameters that give over a 128-bit security level according to the values presented in (Albrecht et al. 2018). This means that it would take over $2^{128}$ computations to crack the decryption key.
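To give a sense of scale for the 128-bit security level, the short Python calculation below converts the stated number of computations into years. The attacker speed of one trillion computations per second is a purely hypothetical assumption chosen for illustration.

```python
SECURITY_BITS = 128
OPS_PER_SECOND = 10 ** 12            # hypothetical attacker speed (assumption)
SECONDS_PER_YEAR = 365 * 24 * 3600

years = 2 ** SECURITY_BITS / OPS_PER_SECOND / SECONDS_PER_YEAR
print(f"{years:.2e} years")
```

Under this assumption the result is on the order of $10^{19}$ years.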

We set the BFV scheme’s parameters as follows, to guarantee 128-bit security:

Plaintext Modulus ($t$) = 65537,
$\sigma$ = 3.2,
Root Hermite Factor = 1.0081,
$m$ = 16384,
Ciphertext Modulus ($q$) = 1532495403905125851249873756002082622499024400469688321.

To run 256-bit security experiments we set $m$ = 32768.
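For reference, the parameter set above can be written down as a plain Python configuration. The dictionary keys are hypothetical names and do not correspond to PALISADE's API. The sketch assumes that the ciphertext modulus is the single integer obtained by joining its two digit groups and that the value 3.2 is the standard deviation of the error distribution.

```python
BFV_PARAMS_128 = {
    "plaintext_modulus_t": 65537,
    "sigma": 3.2,                      # assumed: error standard deviation
    "root_hermite_factor": 1.0081,
    "m": 16384,
    "ciphertext_modulus_q": 1532495403905125851249873756002082622499024400469688321,
}

# The 256-bit experiments change only m.
BFV_PARAMS_256 = {**BFV_PARAMS_128, "m": 32768}

q_bits = BFV_PARAMS_128["ciphertext_modulus_q"].bit_length()
print(q_bits)  # 180
```

The bit length of the ciphertext modulus is the quantity usually compared against the limits tabulated in (Albrecht et al. 2018) for a given ring dimension and security level.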

Experiments

The following experiments were run on an Intel Core i7-8650U CPU @ 1.90 GHz with 16 GB of RAM.

References

Albrecht, M.; Chase, M.; Chen, H.; Ding, J.; Goldwasser, S.; Gorbunov, S.; Hoffstein, J.; Lauter, K.; Lokam, S.; Micciancio, D.; Moody, D.; Morrison, T.; Sahai, A.; and Vaikuntanathan, V. 2018. Homomorphic encryption security standard. Technical report, HomomorphicEncryption.org, Cambridge MA.

Aslett, L. J.; Esperança, P. M.; and Holmes, C. C. 2015. Encrypted statistical machine learning: new privacy preserving methods. arXiv preprint arXiv:1508.06845.

Brakerski, Z. 2012. Fully homomorphic encryption without modulus switching from classical GapSVP. In Advances in Cryptology - CRYPTO 2012. Springer. 868-886.

Fan, J., and Vercauteren, F. 2012. Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive 2012:144.

Goldwasser, S., and Micali, S. 1984. Probabilistic encryption. Journal of Computer and System Sciences 28(2):270-299.

Graepel, T.; Lauter, K.; and Naehrig, M. 2012. ML Confidential: Machine learning on encrypted data. In International Conference on Information Security and Cryptology, 1-21.

Grinman, A. J. 2016. Natural language processing on encrypted patient data. Ph.D. Dissertation, Massachusetts Institute of Technology.

Halevi, S.; Polyakov, Y.; and Shoup, V. 2018. An improved RNS variant of the BFV homomorphic encryption scheme. IACR Cryptology ePrint Archive 2018:117.

Jiang, W., and Samanthula, B. K. 2011. N-gram based secure similar document detection. In IFIP Annual Conference on Data and Applications Security and Privacy, 239-246.

Monet, N., and Clier, J. 2016. Privacy-preserving text language identification using homomorphic encryption. US Patent 9,288,039.

Pathak, M. A., and Raj, B. 2013. Privacy-preserving speaker verification and identification using Gaussian mixture models. IEEE Transactions on Audio, Speech, and Language Processing 21(2):397-406.

Pathak, M. A.; Rane, S.; Sun, W.; and Raj, B. 2011. Privacy preserving probabilistic inference with hidden Markov models. In ICASSP, 5868-5871.

Pathak, M. A.; Sharifi, M.; and Raj, B. 2011. Privacy preserving spam filtering. arXiv preprint arXiv:1102.4021.

Smart, N. P., and Vercauteren, F. 2014. Fully homomorphic SIMD operations. Designs, Codes and Cryptography 71(1):57-81.