Privacy-Preserving Character Language Modelling
Patricia Thaine, Gerald Penn
Department of Computer Science, University of Toronto
{pthaine, gpenn}@cs.toronto.edu

Abstract
Some of the most sensitive information we generate is either written or spoken using natural language. Privacy-preserving methods for natural language processing are therefore crucial, especially considering the ever-growing number of data breaches. However, there has been little work in this area up until now. In fact, no privacy-preserving methods have been proposed for many of the most basic NLP tasks. We propose a method for calculating character bigram and trigram probabilities over sensitive data using homomorphic encryption. Where determining an encrypted character's probability using a plaintext bigram model has a runtime of 1945.29 ms per character, an encrypted bigram model takes us 88549.1 ms, a plaintext trigram model takes 3868.65 ms, and an encrypted trigram model takes 102766 ms.

Introduction
Character-level language models are used in a variety of tasks, such as character prediction for facilitating text messaging. Despite the sensitive nature of the input data, there has been very little work done on privacy-preserving character language models. One example of such work is Apple's use of a version of differential privacy called randomized response to improve their emoji QuickType predictions (https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf). Briefly, what users truly type is sent to Apple with a certain probability p, and fake input is sent with probability 1 − p. While this technique can be used for efficiently training a privacy-preserving emoji prediction system, it is easy to see that on its own it would not preserve the privacy of natural language input while preserving its utility.

We propose a privacy-preserving character-level n-gram language model, the inputs to which are entirely accurate and entirely private. We use Homomorphic Encryption (HE) for this purpose. HE has so far been used for a small number of natural language processing and information retrieval tasks. Some of these include spam filtering (Pathak, Sharifi, and Raj 2011), hidden-Markov-model-based spoken keyword spotting (Pathak et al. 2011), speaker recognition (Pathak and Raj 2013), n-gram-based similar document detection (Jiang and Samanthula 2011), language identification (Monet and Clier 2016), keyword search, and bag-of-words frequency counting (Grinman 2016).

Homomorphic Encryption
Homomorphic encryption schemes allow computations to be performed on encrypted data without needing to decrypt it. For this work, we use the Brakerski-Fan-Vercauteren (BFV) Ring-Learning-With-Errors-based fully homomorphic encryption scheme (Brakerski 2012; Fan and Vercauteren 2012), with the encryption and homomorphic multiplication improvements presented in (Halevi, Polyakov, and Shoup 2018). This scheme is implemented in the PALISADE Lattice Cryptography Library (https://git.njit.edu/palisade/PALISADE). However, the algorithm we propose can be implemented using any homomorphic encryption scheme that allows for addition and component-wise multiplication in the encrypted domain.

Notation, Scheme Overview, and Chosen Parameters
We will be using the same notation as (Brakerski 2012) and (Fan and Vercauteren 2012), but as we provide only a brief overview of the homomorphic encryption scheme, the specific optimizations introduced in (Fan and Vercauteren 2012; Halevi, Polyakov, and Shoup 2018) will not be discussed. Let R = Z[x]/(f(x)) be an integer ring, where f(x) ∈ Z[x] is a monic irreducible polynomial of degree d. Bold lowercase letters denote elements of R, and their coefficients are denoted by indices (e.g., a = Σ_{i=0..d−1} a_i · x^i). Z_q, where q > 1, q ∈ Z, denotes the set of integers (−q/2, q/2]; q is referred to as the ciphertext modulus. An integer n's i-th bit is denoted n[i], counting from i = 0. The secret key is sk = (1, s), where s ← χ and χ is the scheme's error distribution. The public key is pk = ([−(a · s + e)]_q, a), where a ← R_q, e ← χ.

• Encrypt(pk, m): for a message m ∈ R_t, let p0 = pk[0], p1 = pk[1], u ← R_2, e1, e2 ← χ:
    ct = ([p0 · u + e1 + ∆ · m]_q, [p1 · u + e2]_q)
  where ∆ = ⌊q/t⌋.
• Decrypt(sk, ct): let s = sk[1], c0 = ct[0], c1 = ct[1], and compute
    m = [⌊t · [c0 + c1 · s]_q / q⌉]_t
  where ⌊·⌉ denotes rounding to the nearest integer.
• Add(ct1, ct2): ([ct1[0] + ct2[0]]_q, [ct1[1] + ct2[1]]_q)
• Add(ct1, pt2): ([ct1[0] + pt2[0]]_q, [ct1[1] + pt2[1]]_q)
• Mul(ct1, ct2): For this paper, we use component-wise multiplication, a simplified description of which is ([ct1[0] · ct2[0]]_q, [ct1[1] · ct2[1]]_q). The algorithmic details for obtaining this result can be found in (Fan and Vercauteren 2012).
• Mul(ct1, pt2): As with the Add function, it is possible to multiply a ciphertext with a plaintext, resulting in ([ct1[0] · pt2[0]]_q, [ct1[1] · pt2[1]]_q).

Using homomorphic encryption, we can perform linear and (depending on the encryption scheme) polynomial operations on encrypted data (multiplication, addition, or both). We can neither divide a ciphertext, nor exponentiate using an encrypted exponent; we can, however, keep track of a numerator and a corresponding denominator separately. For clarity, we shall refer to the encrypted version of a value ∗ as E(∗) and use × to represent Mul.

To represent a character, the user sends the server a one-hot vector along with k ordered vectors of size k. Each of these vectors is a zero-vector, except for a vector of ones at the letter's 'designated index'. Here is a simple example for a language containing k = 3 character types. Assume we want to convert the letter 'a'; the server is sent two matrices which represent the letter 'a', denoted by Ma1 and Ma2:

            ([1 1 1])
Ma1 = E([1 0 0]),  Ma2 = E ([0 0 0])
            ([0 0 0])

Say 'a' is followed by 'b'; then the server receives:

            ([0 0 0])
Mb1 = E([0 1 0]),  Mb2 = E ([1 1 1])
            ([0 0 0])

The user wants to know how likely it is for 'b' to follow 'a'.
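Before turning to the server's side, the user-side construction of these matrices can be sketched in the clear. This is our own illustration (the helper names are not part of PALISADE); in the actual protocol, both structures are encrypted with BFV before being sent, so the server never sees where the ones are.

```python
def one_hot(ch, alphabet):
    """M_c1: one-hot vector marking the character's index."""
    v = [0] * len(alphabet)
    v[alphabet.index(ch)] = 1
    return v

def designated_index_matrix(ch, alphabet):
    """M_c2: k ordered vectors of size k, all zero-vectors except
    for a vector of ones at the character's designated index."""
    k = len(alphabet)
    i = alphabet.index(ch)
    return [[1] * k if row == i else [0] * k for row in range(k)]

alphabet = ['a', 'b', 'c']            # k = 3 character types
Ma1 = one_hot('a', alphabet)          # [1, 0, 0]
Ma2 = designated_index_matrix('a', alphabet)
# Ma2 == [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
```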
The server is able to calculate this like so, multiplying its bigram probability matrix component-wise with Ma2:

[p11 p12 p13]      ([1 1 1])
[p21 p22 p23]  × E ([0 0 0])
[p31 p32 p33]      ([0 0 0])

Resulting in:

      ([p11 p12 p13])
I = E ([ 0   0   0 ])
      ([ 0   0   0 ])

We then take the inner products of each row of I with Mb1 and add them all together. The result, E(p12), is sent back to the user, who is able to decrypt it.

Optimization: Single Instruction Multiple Data
Single Instruction Multiple Data (SIMD) is explained in (Smart and Vercauteren 2014). Using the Chinese Remainder Theorem, an operation between two SIMD-encoded lists of encrypted numbers can be performed by the same operation between two regularly-encoded encrypted numbers.

Encoding Variables
The very first step in converting an algorithm to an HE-friendly form is to make the data amenable to HE analysis. This includes transforming floating-point numbers into approximations, such as by computing an approximate rational representation for them, or by clearing them: multiplying them by a pre-specified power of 10 and then rounding to the nearest integer (Graepel, Lauter, and Naehrig 2012). We follow the method suggested in (Aslett, Esperança, and Holmes 2015): choose the number of decimal places to be retained based on a desired level of accuracy φ, then multiply the data by 10^φ and round to the nearest integer. Since we are dealing with probabilities, we do not lose much information when converting, say, 99.6% to 99.

Privacy-Preserving Bigram and Trigram Models
We assume that a user has some sensitive data requiring character-level predictions to be made and that a server has a character-level language model that they do not want to share. We train a bigram model and a trigram model using plaintext from part of the Enron email dataset, which we pre-process to only contain k = 27 character types (space and the 26 lowercase letters). A user's emails are then converted into binary vectors of dimension k.

Bigram Probabilities
For the bigram model, we convert characters into one-hot vectors (e.g., the vector for 'a' has a 1 at index 0). Along with the one-hot vector, k ordered vectors of size k are sent, exactly as in the k = 3 example above.

Trigram Probabilities
To use the trigram model, we again convert characters into one-hot vectors. Along with them, however, we must send a bit more information. Say we have a trigram probability matrix whose rows correspond to two-character contexts, sorted as follows:

c1c1
c1c2
c1c3
c2c1
c2c2
c2c3
c3c1
c3c2
c3c3

If we want to access the row containing the probabilities of context cx cy, we can find its index with the equation x·t + y, where t is the number of character types and x, y are the characters' indices, counted from 0. Let us use 'a' as an example again, with t = 3. The user must again send along Ma1 = E([1 0 0]), as well as:

        ([1 1 1])          ([1 1 1])
        ([0 0 0])          ([1 1 1])
        ([0 0 0])          ([1 1 1])
        ([1 1 1])          ([0 0 0])
Ma2 = E ([0 0 0]), Ma3 = E ([0 0 0])
        ([0 0 0])          ([0 0 0])
        ([1 1 1])          ([0 0 0])
        ([0 0 0])          ([0 0 0])
        ([0 0 0])          ([0 0 0])

If 'b' follows 'a', the server is then sent Mb1 = E([0 1 0]) and:

        ([0 0 0])          ([0 0 0])
        ([1 1 1])          ([0 0 0])
        ([0 0 0])          ([0 0 0])
        ([0 0 0])          ([1 1 1])
Mb2 = E ([1 1 1]), Mb3 = E ([1 1 1])
        ([0 0 0])          ([1 1 1])
        ([0 0 0])          ([0 0 0])
        ([1 1 1])          ([0 0 0])
        ([0 0 0])          ([0 0 0])

The probability of 'aba', given a trigram probability matrix Ptrigram, can be calculated as follows:

I = Ptrigram × (Ma3 × Mb2)

Next, the server takes the inner products of each row of I with Ma1 and adds them all together. The result, E(p121), is sent back to the user, who is able to decrypt it.

Plaintext Model, Encrypted Input
It takes us 1945.29 ms to output the probability of one encrypted character given the preceding character (also encrypted) and a plaintext character bigram model. It takes us 3868.65 ms to output the probability of one encrypted character given its two preceding characters (both encrypted) and a plaintext character trigram model. These results are based on the parameters listed in the Security section.

Encrypted Model, Encrypted Input
It takes us 88549.1 ms to output the probability of one encrypted character given the preceding character (also encrypted) and an encrypted character bigram model. It takes us 102766 ms to output the probability of one encrypted character given its two preceding characters (both encrypted) and an encrypted character trigram model. These results are also based on the parameters listed in the Security section.

Additional runtime comparisons are provided in Figure 1 and Figure 2. While the runtime of encrypted models might limit the practicality of their deployment, plaintext models running on encrypted data are practical for deployment, especially when considering the parallelizability of predictions and the speed-ups that better RAM could lead to.

Conclusion and Future Work
We described a method for calculating character-level bigram and trigram probabilities given encrypted data and performed runtime experiments across various security levels to test the scalability of our algorithms. Our next steps will be to adapt this method to word-level language modelling, as well as to create a protocol for training n-gram models on encrypted data.

Security
The BFV scheme has semantic security (Albrecht et al. 2018), which means that "whatever an eavesdropper can compute about the cleartext given the ciphertext, he can also compute without the ciphertext" (Shafi and Micali 1984). For our first few experiments, we chose parameters that give over a 128-bit security level according to the values presented in (Albrecht et al. 2018). This means that it would take over 2^128 computations to crack the decryption key. We set the BFV scheme's parameters as follows, to guarantee 128-bit security:
• Plaintext Modulus (t) = 65537,
• σ = 3.2,
• Root Hermite Factor = 1.0081,
• m = 16384,
• Ciphertext Modulus (q) = 1532495403905125851249873756002082622499024400469688321.
To run the 256-bit security experiments, we set m = 32768.
Experiments
The following experiments were run on an Intel Core i7-8650U CPU @ 1.90GHz with 16GB of RAM.

Figure 1: Runtime comparisons of bigram and trigram predictions for character language models with 27 to 31 characters, across three security levels, between encrypted and plaintext character language models.

Figure 2: Runtime comparisons, at 256-bit security, of bigram and trigram predictions for character language models with 27 to 31 characters.

References
Albrecht, M.; Chase, M.; Chen, H.; Ding, J.; Goldwasser, S.; Gorbunov, S.; Hoffstein, J.; Lauter, K.; Lokam, S.; Micciancio, D.; Moody, D.; Morrison, T.; Sahai, A.; and Vaikuntanathan, V. 2018. Homomorphic encryption security standard. Technical report, HomomorphicEncryption.org, Cambridge, MA.
Aslett, L. J.; Esperança, P. M.; and Holmes, C. C. 2015. Encrypted statistical machine learning: new privacy preserving methods. arXiv preprint arXiv:1508.06845.
Brakerski, Z. 2012. Fully homomorphic encryption without modulus switching from classical GapSVP. In Advances in Cryptology - CRYPTO 2012. Springer. 868-886.
Fan, J., and Vercauteren, F. 2012. Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive 2012:144.
Graepel, T.; Lauter, K.; and Naehrig, M. 2012. ML confidential: Machine learning on encrypted data. In International Conference on Information Security and Cryptology, 1-21. Springer.
Grinman, A. J. 2016. Natural language processing on encrypted patient data. Ph.D. Dissertation, Massachusetts Institute of Technology.
Halevi, S.; Polyakov, Y.; and Shoup, V. 2018. An improved RNS variant of the BFV homomorphic encryption scheme. IACR Cryptology ePrint Archive 2018:117.
Jiang, W., and Samanthula, B. K. 2011. N-gram based secure similar document detection. In IFIP Annual Conference on Data and Applications Security and Privacy, 239-246. Springer.
Monet, N., and Clier, J. 2016. Privacy-preserving text language identification using homomorphic encryption. US Patent 9,288,039.
Pathak, M. A., and Raj, B. 2013. Privacy-preserving speaker verification and identification using Gaussian mixture models. IEEE Transactions on Audio, Speech, and Language Processing 21(2):397-406.
Pathak, M. A.; Rane, S.; Sun, W.; and Raj, B. 2011. Privacy preserving probabilistic inference with hidden Markov models. In ICASSP, 5868-5871.
Pathak, M. A.; Sharifi, M.; and Raj, B. 2011. Privacy preserving spam filtering. arXiv preprint arXiv:1102.4021.
Shafi, G., and Micali, S. 1984. Probabilistic encryption. Journal of Computer and System Sciences 28(2):270-299.
Smart, N. P., and Vercauteren, F. 2014. Fully homomorphic SIMD operations. Designs, Codes and Cryptography 71(1):57-81.