<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Privacy-Preserving Character Language Modelling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patricia Thaine</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerald Penn</string-name>
          <email>gpenng@cs.toronto.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Toronto</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Some of the most sensitive information we generate is either written or spoken using natural language. Privacy-preserving methods for natural language processing are therefore crucial, especially considering the ever-growing number of data breaches. However, there has been little work in this area up until now. In fact, no privacy-preserving methods have been proposed for many of the most basic NLP tasks. We propose a method for calculating character bigram and trigram probabilities over sensitive data using homomorphic encryption. Where determining an encrypted character's probability using a plaintext bigram model has a runtime of 1945.29 ms per character, an encrypted bigram model takes us 88549.1 ms, a plaintext trigram model takes 3868.65 ms, and an encrypted trigram model takes 102766 ms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Character-level language models are used in a variety of
tasks, such as character prediction for facilitating text
messaging. Despite the sensitive nature of the input data, there
has been very little work done on privacy-preserving
character language models. One example of such work is Apple’s
use of a version of differential privacy called randomized
response to improve their emoji QuickType predictions1.
Briefly, what users truly type is sent to Apple with a
certain probability p and fake input is sent with a probability
1 p. While this technique can be used for efficiently
training a privacy-preserving emoji prediction system, it is easy
to see that on its own it would not preserve the privacy of
natural language input while preserving its utility.</p>
      <p>
        We propose a privacy-preserving character-level n-gram
language model, the inputs to which are entirely accurate
and entirely private. We use Homomorphic Encryption (HE)
for this purpose. HE has so far been used for a small
number of natural language processing and information
retrieval tasks. Some of these include spam filtering
        <xref ref-type="bibr" rid="ref12 ref13 ref9">(Pathak,
Sharifi, and Raj 2011)</xref>
        , hidden-Markov-model-based spoken
keyword spotting
        <xref ref-type="bibr" rid="ref12 ref13">(Pathak et al. 2011)</xref>
        , speaker recognition
        <xref ref-type="bibr" rid="ref11">(Pathak and Raj 2013)</xref>
        , n-gram-based similar document
detection
        <xref ref-type="bibr" rid="ref9">(Jiang and Samanthula 2011)</xref>
        , language
identifica1https://machinelearning.apple.com/
docs/learning-with-privacy-at-scale/
appledifferentialprivacysystem.pdf
tion
        <xref ref-type="bibr" rid="ref10">(Monet and Clier 2016)</xref>
        , keyword search, and
bag-ofwords frequency counting
        <xref ref-type="bibr" rid="ref6">(Grinman 2016)</xref>
        .
      </p>
      <sec id="sec-1-1">
        <title>Homomorphic Encryption</title>
        <p>Homomorphic encryption schemes allow for computations
to be performed on encrypted data without needing to
decrypt it.</p>
        <p>
          For this work, we use the Brakerski-Fan-Vercauteren
(BFV) Ring-Learning-With-Errors-based fully
homomorphic encryption scheme
          <xref ref-type="bibr" rid="ref3">(Brakerski 2012)</xref>
          <xref ref-type="bibr" rid="ref4">(Fan and
Vercauteren 2012)</xref>
          , with the encryption and homomorphic
multiplication improvements presented in
          <xref ref-type="bibr" rid="ref7">(Halevi, Polyakov,
and Shoup 2018)</xref>
          . This scheme is implemented in the
PALISADE Lattice Cryptography Library2. However, the
algorithm we propose can be implemented using any
homomorphic encryption scheme that allows for addition and
component-wise multiplication in the encrypted domain.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Notation, Scheme Overview, and Chosen</title>
      </sec>
      <sec id="sec-1-3">
        <title>Parameters</title>
        <p>
          We will be using the same notation as
          <xref ref-type="bibr" rid="ref3">(Brakerski 2012)</xref>
          and
          <xref ref-type="bibr" rid="ref4">(Fan and Vercauteren 2012)</xref>
          , but as we provide only a
brief overview of the homomorphic encryption scheme, the
specific optimizations introduced in
          <xref ref-type="bibr" rid="ref4">(Fan and Vercauteren
2012)</xref>
          ,
          <xref ref-type="bibr" rid="ref7">(Halevi, Polyakov, and Shoup 2018)</xref>
          will not be
discussed. Let R = Z[x]=(f (x)) be an integer ring, where
f (x) 2 Z[x] is a monic irreducible polynomial of degree d.
Bold lowercase letters denote elements of R and their
coefficients will be denoted by indices (e.g., a = Pid=01 ai xi. Zq,
where q &gt; 1; q 2 Z, denotes the set of integers ( q=2:q=2].
q is referred to as the ciphertext modulus. An integer n’s
i-th bit is denoted n[i], from i = 0. The secret key is
sk = (1; s), where s . The public key is called
pk = ([ (a s + e]q; a), where a Rq, e .
        </p>
        <p>
          Encrypt(pk,m): message m 2 Rt, p0 = pk[0], p1 =
pk[1], u R2, e1; e2 :
ct = [p0 u + e1 +
m]q; [p1 u + e2]q
Decrypt(sk, ct): s = sk, c0 = ct[0], c1 = ct[1].
hj t [c0 + c1 s]q mi
q
t
2https://git.njit.edu/palisade/PALISADE
Add(ct1; ct2): ([ct1[0] + ct2[0]]q; [ct1[1] + ct2[1]]q)
Add(ct1; pt2): ([ct1[0] + pt2[0]]q; [ct1[1] + pt2[1]]q)
Mul(ct1; ct2): For this paper, we use
componentwise multiplication, a simplified description of which is:
([ct1[0] ct2[0]]q; [ct1[1] ct2[1]]q). The algorithmic
details for obtaining this result can be found in
          <xref ref-type="bibr" rid="ref4">(Fan and
Vercauteren 2012)</xref>
          .
        </p>
        <p>Mul(ct1; pt2): Like with the Add function, it is
possible to multiply a ciphertext with plaintext, resulting in:
([ct1[0] pt2[0]]q; [ct1[1] pt2[1]]q).</p>
        <p>Using homomorphic encryption, we can perform linear and
(depending on the encryption scheme) polynomial
operations on encrypted data (multiplication, addition, or both).
We can neither divide a ciphertext, nor exponentiate using
an encrypted exponent. We can keep track separately of a
numerator and a corresponding denominator. For clarity, we
shall refer to the encrypted version of a value as E( ) and
to represent Mul.</p>
      </sec>
      <sec id="sec-1-4">
        <title>Optimization: Single Instruction Multiple Data</title>
        <p>
          Single Instruction Multiple Data (SIMD) is explained in
          <xref ref-type="bibr" rid="ref16">(Smart and Vercauteren 2014)</xref>
          . Using the Chinese
Remainder Theorem, an operation between two SIMD-encoded lists
of encrypted numbers can be performed by the same
operations between two regularly-encoded encrypted numbers.
        </p>
      </sec>
      <sec id="sec-1-5">
        <title>Encoding Variables</title>
        <p>
          The very first step in converting an algorithm to an
HEfriendly form is to make the data amenable to HE
analysis. This includes transforming floating point numbers into
approximations, such as by computing an approximate
rational representation for them, clearing them by multiplying
them by a pre-specified power of 10, and then rounding to
the nearest integer
          <xref ref-type="bibr" rid="ref4 ref5">(Graepel, Lauter, and Naehrig 2012)</xref>
          .
        </p>
        <p>
          We follow the method suggested in
          <xref ref-type="bibr" rid="ref2">(Aslett, Esperanc¸a,
and Holmes 2015)</xref>
          : choose the number of decimal places
to be retained based on a desired level of accuracy , then
multiply the data by 10 and round to the nearest integer.
Since we are dealing with probabilities, we do not lose much
information when converting, say, 99.6% to 99.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Privacy-Preserving Bigram and Trigram</title>
    </sec>
    <sec id="sec-3">
      <title>Models</title>
      <p>We assume that a user has some sensitive data requiring
character-level predictions to be made and that a server has
a character-level language model that they do not want to
share. We train a bigram model and a trigram model
using plaintext from part of the Enron email dataset, which
we pre-process to only contain k = 27 types (space and 26
lowercase letters). A user’s emails are then converted into
binary vectors of dimension k.</p>
      <sec id="sec-3-1">
        <title>Bigram Probabilities</title>
        <p>For the bigram model, we convert characters into one-hot
vectors (e.g., the vector for ‘a’ has a 1 at index 0). Along
with the one-hot vector, k ordered vectors of size k are sent.
Each of these vectors are zero-vectors, except for a vector
of ones which is at the letter’s ‘designated index’. Here’s a
simple example for a language containing k = 3 character
types. Assume we want to convert letter ‘a’; the server is
sent two matrices which represent the letter ‘a’ denoted by
Ma1 and Ma2:</p>
        <p>Ma1 = E([1
0</p>
        <p>0]); Ma2 = E
Say ‘a’ is followed by ‘b’, then the server receives:</p>
        <p>The user wants to know how likely is it for ‘b’ to follow
‘a’. The server is able to calculate this like so:</p>
        <p>Resulting in:
To use the trigram model, we again convert characters into
one-hot vectors. Along with them, however, we must send
a bit more information. Say we have a trigram probability
matrix whose columns are sorted as follows:
Plaintext Model, Encrypted Input It takes us 1945.29
ms to output the probability of one encrypted character
given the preceding character (also encrypted) and a
plaintext character bigram model. It takes us 88549.1 ms to
output the probability of one encrypted character given its two
preceding characters (both encrypted) and a plaintext
character trigram model. These results are based on the
parameters listed in Section .</p>
        <p>Encrypted Model, Encrypted Input It takes us 3868.65
ms to output the probability of one encrypted character
given the preceding character (also encrypted) and an
encrypted character bigram model. It takes us 102766 ms to
output the probability of one encrypted character given its
two preceding characters (both encrypted) and an encrypted
character trigram model. These results are also based on the
parameters listed in Section .</p>
        <p>Additional runtime comparisons are provided in Figure 1
and Figure 2. While the runtime of encrypted models might
limit the practicality of their deployment, plaintext models
running on encrypted data are practical for deployment,
especially when considering predictions parallelizability and
the speed ups that better RAM could lead to.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>We described a method for calculating character-level
bigram and trigram probabilities given encrypted data and
perform runtime experiments across various security levels to
test the scalability of our algorithms. Our next steps will be
to adapt this method to word-level language modeling, as
well as to create a protocol for training n-gram models on
encrypted data.</p>
      <p>The probability of ‘aba’, given a trigram probability
matrix Ptrigram can be calculated as follows:</p>
      <p>I = Ptrigram
(Ma3</p>
      <p>Mb3)</p>
      <p>Next, the server take the inner products of each row of I
with Ma1 and adds them all together. The result, E(p121),
is sent back to the user, who is able to decrypt it.</p>
      <sec id="sec-4-1">
        <title>Security</title>
        <p>
          The BGV scheme has semantic security
          <xref ref-type="bibr" rid="ref1">(Albrecht et al.
2018)</xref>
          , which means that “whatever an eavesdropper can
compute about the cleartext given the ciphertext, he can also
compute without the ciphertext”
          <xref ref-type="bibr" rid="ref14">(Shafi and Micali 1984)</xref>
          .
For our first few experiments, we chose parameters that give
over a 128-bit security level the values presented in
          <xref ref-type="bibr" rid="ref1">(Albrecht et al. 2018)</xref>
          . This means that it would take over 2128
computations to crack the decryption key.
        </p>
        <p>We set the BFV scheme’s parameters as follows, to
guarantee 128-bit security:</p>
        <p>Plaintext Modulus (t) = 65537,</p>
        <p>= 3:2,
Root Hermite Factor = 1:0081,
m = 16384,
Ciphertext Modulus (q) = 153249540390512585124987
3756002082622499024400469688321
To run 256-bit security experiments we set m = 32768.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>The following experiments were run on an Intel Core
i-78650U CPU @ 1.90GHz and a 16GB RAM.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Albrecht</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chase</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Goldwasser,
          <string-name>
            <given-names>S.</given-names>
            ;
            <surname>Gorbunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ;
            <surname>Hoffstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ;
            <surname>Lauter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ;
            <surname>Lokam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ;
            <surname>Micciancio</surname>
          </string-name>
          , D.; Moody, D.; Morrison,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Sahai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ; and
            <surname>Vaikuntanathan</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Homomorphic encryption security standard</article-title>
          .
          <source>Technical report</source>
          , HomomorphicEncryption.org, Cambridge MA.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Aslett</surname>
            ,
            <given-names>L. J.</given-names>
          </string-name>
          ; Esperanc¸a, P. M.; and
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>C. C.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Encrypted statistical machine learning: new privacy preserving methods</article-title>
          .
          <source>arXiv preprint arXiv:1508</source>
          .
          <fpage>06845</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Brakerski</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Fully homomorphic encryption without modulus switching from classical GapSVP</article-title>
          . In Advances in cryptology-crypto
          <year>2012</year>
          . Springer.
          <fpage>868</fpage>
          -
          <lpage>886</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vercauteren</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Somewhat Practical Fully Homomorphic Encryption</article-title>
          .
          <source>IACR Cryptology ePrint Archive</source>
          <year>2012</year>
          :
          <fpage>144</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Graepel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lauter</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and Naehrig,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2012</year>
          .
          <article-title>Ml confidential: Machine learning on encrypted data</article-title>
          .
          <source>In International Conference on Information Security and Cryptology</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Grinman</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Natural language processing on encrypted patient data</article-title>
          .
          <source>Ph.D. Dissertation</source>
          , Massachusetts Institute of Technology.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Halevi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Polyakov,
          <string-name>
            <given-names>Y.</given-names>
            ; and
            <surname>Shoup</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>An improved rns variant of the bfv homomorphic encryption scheme</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>IACR Cryptology ePrint Archive</source>
          <year>2018</year>
          :
          <fpage>117</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Samanthula</surname>
            ,
            <given-names>B. K.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>N-gram based secure similar document detection</article-title>
          .
          <source>In IFIP Annual Conference on Data and Applications Security and Privacy</source>
          ,
          <volume>239</volume>
          -
          <fpage>246</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Monet</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Clier</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Privacy-preserving text language identification using homomorphic encryption</article-title>
          .
          <source>US Patent 9</source>
          ,
          <issue>288</issue>
          ,
          <fpage>039</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Raj</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Privacy-preserving speaker verification and identification using gaussian mixture models</article-title>
          .
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          <volume>21</volume>
          (
          <issue>2</issue>
          ):
          <fpage>397</fpage>
          -
          <lpage>406</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rane</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Raj</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Privacy preserving probabilistic inference with hidden markov models</article-title>
          .
          <source>In ICASSP</source>
          ,
          <fpage>5868</fpage>
          -
          <lpage>5871</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sharifi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Raj</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Privacy preserving spam filtering</article-title>
          .
          <source>arXiv preprint arXiv:1102</source>
          .
          <fpage>4021</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Shafi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Micali</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>1984</year>
          .
          <article-title>Probabilistic encryption</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>Journal of computer and system sciences 28</source>
          <volume>(2)</volume>
          :
          <fpage>270</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Smart</surname>
            ,
            <given-names>N. P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vercauteren</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Fully homomorphic simd operations</article-title>
          .
          <source>Designs, codes and cryptography 71</source>
          <volume>(1)</volume>
          :
          <fpage>57</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>