Privacy Preserving Reinforcement Learning over Distributed Datasets

Régis Loeb², Timothy Verstraeten¹, Ann Nowé¹, and Ann Dooms¹

¹ Vrije Universiteit Brussel, Belgium
² Université Libre de Bruxelles, Belgium

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

As the IoT becomes increasingly popular, the possibility arises to exchange data within a group, or fleet, of similar devices. This allows institutions to share their data about a specific control task in order to boost learning. However, such data sets are often confidential and cannot be shared in their raw form. We propose a privacy-preserving reinforcement learning technique that allows knowledge transfer among similar agents.

Our starting point is the setting of fleet reinforcement learning (RL), in which a group of similar agents (the fleet) must each solve a similar task. The objective of an agent is to learn a control policy that maximizes the expected cumulative reward. Before the learning process, agents are allowed to share data. However, as agents in reality exhibit small discrepancies (e.g., degradation or production errors), not all data samples are representative of a particular agent. Therefore, knowledge should be transferred only when it is relevant.

To construct a privacy-preserving (PP) fleet RL method, we enhance the fleet Gaussian process reinforcement learning (FGPRL) method introduced in [1] with a secure multi-party computation (SMPC) protocol. In FGPRL, the goal is to estimate the transition function, such that the optimal policy can be inferred. This is done using a Gaussian process (GP). A GP is a collection of annotated random variables, any finite number of which have a joint Gaussian distribution. In a regression context, these random variables are the outputs of an unknown function, and their annotations are the inputs to that function. The most important parameter of a GP is its covariance kernel, which describes how these outputs are correlated, i.e., how knowledge about one output gives information about another. Coregionalization extends this idea to the outputs over multiple agents. A coregionalized GP is used in FGPRL as the joint transition model, such that the transitions of fleet members can be correlated in order to effectively share data based on the similarities between the members.

In order to perform prediction on new input points X_*, a GP is conditioned on the fleet's training data. The following formulas give the posterior statistics of f_* := f(X_*):

    E[f_*] = K(X_*, X) K(X, X)^{-1} y                               (1)
    V[f_*] = K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)         (2)

where K(X, Y) denotes the covariance matrix between all pairs of points in X and Y.
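As a (non-private) point of reference, the sketch below evaluates (1) and (2) in Python with NumPy. The squared-exponential kernel, the function names, and the toy data are our own illustrative assumptions; FGPRL itself uses a coregionalized kernel over the fleet, which is not reproduced here.

```python
# Minimal (non-private) GP posterior sketch for equations (1) and (2).
# The squared-exponential kernel is an illustrative choice, not the
# coregionalized kernel of FGPRL.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Covariance matrix K(A, B) between all pairs of rows of A and B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X, y, X_star):
    """Posterior mean (1) and covariance (2) of f_* := f(X_star)."""
    # Small jitter for numerical stability (an implementation detail,
    # not part of (1)-(2)).
    K_xx = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))
    K_sx = rbf_kernel(X_star, X)
    K_ss = rbf_kernel(X_star, X_star)
    K_inv = np.linalg.inv(K_xx)           # the bottleneck discussed later
    mean = K_sx @ K_inv @ y               # E[f_*], equation (1)
    cov = K_ss - K_sx @ K_inv @ K_sx.T    # V[f_*], equation (2)
    return mean, cov

# Toy usage: predict f(1.5) for f(x) = sin(x) from five samples.
X = np.linspace(0.0, 3.0, 5).reshape(-1, 1)
y = np.sin(X).ravel()
mean, cov = gp_posterior(X, y, np.array([[1.5]]))
```

In the privacy-preserving setting described next, this same linear algebra must be carried out without any party ever seeing the fleet's data in the clear.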
In order to make the FGPRL method PP, we have to compute (1) and (2) in such a way that no agent in the fleet leaks information about its own samples. To this end, we use SMPC, an m-party cryptographic protocol. It consists of specifying a random process describing a desired functionality that maps m inputs (one local input per party) to m outputs (one local output per party). If the m parties could trust some external party, they would each send their local input to that trusted party, who would compute the outcome of the process and send each party its corresponding output. This imaginary trusted party can be emulated by the mutually distrustful parties themselves. We consider two models of SMPC, with different assumptions under which such an emulation is possible. These assumptions concern the communication channels, the adversarial behavior, and the desired level of emulation of the trusted party (which is equivalent to the level of security).

We use secret sharings and perform computations modulo q on them. A value x is secretly shared over Z_q (the ring of integers modulo q, represented by {0, 1, ..., q-1}) by picking x_1, ..., x_m uniformly at random in Z_q under the constraint that x = (x_1 + ... + x_m) mod q, and then distributing each share x_i to party i; a minimal sketch of this scheme is given after the references. Additions, subtractions, and multiplications by a public constant can be performed on secret sharings locally by the parties, without any interaction. This is not the case for other operations; [2] provides protocols to securely compute the building-block operations that we compose: secure matrix multiplication, truncation (in order to deal with real numbers rather than only integers), and secure covariance matrix inversion. Choosing the right covariance kernel and using the secure protocols mentioned above, we manage to compute (1) and (2) on secret sharings.

We implemented our method on the same experimental setting as described in [1], which consists of three mountain cars with different masses, and managed to share data between the cars successfully. Although there is no loss in accuracy, the computation is heavy, the bottleneck being the matrix inversion appearing in equations (1) and (2). That said, the covariance matrix inversion is performed only once, before any prediction on a new input is made. The prediction itself is relatively light in terms of computation, but the amount of communication required is significant. We investigated one potential solution to lighten the communication load: the introduction of homomorphic encryption (HE) into our protocols. We leave its implementation for future work.

References

1. Timothy Verstraeten and Ann Nowé. Reinforcement learning for fleet applications using coregionalized Gaussian processes, 2018.
2. Martine De Cock, Rafael Dowsley, Anderson Nascimento, and Stacey Newman. Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data, 2015.
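For completeness, below is a minimal sketch of the m-party additive secret sharing described above, including the local (interaction-free) addition of two shared values. The modulus q = 2^32 and all names are illustrative assumptions; a real deployment would additionally need secure channels and the multiplication, truncation, and inversion protocols of [2].

```python
# Minimal sketch of additive secret sharing over Z_q, as described above.
# q = 2**32 is an illustrative choice of modulus.
import secrets

Q = 2**32  # ring modulus q

def share(x, m):
    """Split x into m additive shares with x = sum(shares) mod Q."""
    shares = [secrets.randbelow(Q) for _ in range(m - 1)]
    shares.append((x - sum(shares)) % Q)  # last share fixes the sum
    return shares

def reconstruct(shares):
    """Recover the secret by summing all shares modulo Q."""
    return sum(shares) % Q

def add_shares(a_shares, b_shares):
    """Each party adds its own two shares locally: no interaction needed."""
    return [(a + b) % Q for a, b in zip(a_shares, b_shares)]

# Toy usage with m = 3 parties.
a, b = 42, 1000
a_sh, b_sh = share(a, 3), share(b, 3)
assert reconstruct(a_sh) == a
assert reconstruct(add_shares(a_sh, b_sh)) == (a + b) % Q
```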