1. Introduction

A. Gilad Kusne

National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA

aaron.kusne@nist.gov

Olivera Kotevska

Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN, 37830, USA

kotevskao@ornl.gov

Atlanta, GA

We investigate the use of Bayesian inference-based [...] either Gaussian or Laplacian additive noise. We build [...]

Differential privacy policies allow one to preserve data privacy while sharing and analyzing data. However, these policies are susceptible to an array of attacks. In particular, a portion of the data intended to be privacy protected is often exposed online. Access to these pre-privacy protected data samples can then be used to reverse engineer the privacy policy. With knowledge of the generating privacy policy, an attacker can use machine learning to approximate the full set of originating data. Bayesian inference is one method for reverse engineering both model and model parameters. We present a methodology for evaluating and ranking privacy policy robustness to Bayesian inference-based reverse engineering, and demonstrate this method across data with a variety of temporal trends.

Keywords: Differential privacy, Bayesian inference, Privacy policy, Privacy defenses

1. Introduction

In recent years, the number of devices connected to the Internet and online services has increased drastically [1], leading to an exponential growth in data generation [2]. This trend is visible across different domains and applications including, among many others, streaming medical, personal tracking, and energy use data [1]. Typically, sensing systems are digitized and connected to network-based analysis tools, and the success of these data streaming devices results in increasing adoption and deployment.

While this increases convenience across many aspects of life, it also creates dangers when sharing sensitive information. This is especially true for sharing unprotected data over the Internet. Around 98% of device traffic is unencrypted and transmitted over the Internet [3]. Cybercriminals have taken notice of this behavior. On average, sensor-based devices are probed for security vulnerabilities around 800 times per hour, with 400 login attempts and 130 successful logins on each device [4].

To address these rapidly growing security risks, significant effort has been devoted to the development of privacy preservation algorithms and their integration into existing platforms. Some of the most used algorithms are randomization, k-anonymity, l-diversity, cryptography, and differential privacy (DP) [5]. These methods have been successfully demonstrated on big data [6], deep learning [7], medical records [8], as well as other domains.

[...] attacks to identify the privacy policy employed when [...] preliminary work we evaluate the likelihood of an adver[...] temporal trends. We then use analysis results to rank [...]

CIKM'22: Privacy Algorithms in Systems (PAS), October 21, 2022

2. Model

2.1. Bayesian Inference

Bayesian inference [14] is a statistical sampling method for determining the probability of a hypothesis given data, through the use of Bayes' theorem [15]. A common application of Bayesian inference is to identify the most likely parameter values $\theta$ of a generating model $M(\theta)$ for observed data $D$. Toward this goal, for a given model, a prior over the parameters is needed. The prior belief for the parameter values is given by the probability density function $p(\theta)$ (or the probability mass function $P(\theta)$ if the parameters take on discrete values). The probability of observing data $D$ given particular values for the parameters $\theta$ is given by $p(D \mid M(\theta))$, also known as the likelihood. Through the use of Bayes' theorem, the prior and likelihood are combined to determine the probability of different values of $\theta$ given the observed $D$. This probability is known as the posterior and is represented by $p(M(\theta) \mid D)$. Bayesian inference employs statistical sampling of the model parameters' prior and forward computation of the likelihood to evaluate the posterior. For this work, Markov Chain Monte Carlo (MCMC) is the Bayesian inference sampling method used.
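As a rough illustration of MCMC-based posterior sampling, the following minimal sketch uses a random-walk Metropolis sampler to infer the standard deviation of zero-mean Gaussian noise under a uniform prior. The sampler variant, the function names, the prior bounds, and the synthetic residuals are all assumptions made for illustration; the text does not state which MCMC algorithm or software was used.

```python
import numpy as np

def log_likelihood(residuals, sigma):
    """Log-likelihood of zero-mean Gaussian noise with standard deviation sigma."""
    n = residuals.size
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - np.sum(residuals**2) / (2 * sigma**2))

def log_prior(sigma, low=1e-3, high=10.0):
    """Uniform prior on [low, high] (assumed bounds); -inf outside the support."""
    return 0.0 if low <= sigma <= high else -np.inf

def metropolis(residuals, n_steps=5000, step=0.05, sigma0=1.0, seed=0):
    """Random-walk Metropolis sampler over the noise standard deviation."""
    rng = np.random.default_rng(seed)
    sigma = sigma0
    log_post = log_prior(sigma) + log_likelihood(residuals, sigma)
    samples = np.empty(n_steps)
    for t in range(n_steps):
        proposal = sigma + step * rng.normal()
        lp = log_prior(proposal)
        if np.isfinite(lp):  # proposals outside the prior support are rejected
            log_post_new = lp + log_likelihood(residuals, proposal)
            # Accept with probability min(1, posterior ratio)
            if np.log(rng.uniform()) < log_post_new - log_post:
                sigma, log_post = proposal, log_post_new
        samples[t] = sigma
    return samples

# Synthetic "exposed" noise samples (hypothetical data)
data_rng = np.random.default_rng(1)
true_sigma = 0.5
residuals = data_rng.normal(0.0, true_sigma, size=200)

samples = metropolis(residuals)
posterior_mean = samples[1000:].mean()  # discard burn-in
```

The retained samples approximate the posterior over the noise scale; their spread gives a direct view of how well the parameter is pinned down by the exposed data.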

2.2. Gaussian Process

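A Gaussian process [16] (GP) is a common Bayesian nonparametric regression tool. To learn the function $f: x \to y$ for the data $D = \{(x_i, y_i)\}_{i=1}^{N}$, a prior $p(f) = \mathcal{GP}(m(x), k(x, x'))$ is assumed to quantify epistemic uncertainty. Here $m(x)$ is a mean function and $k(x, x')$ is a covariance function. Expected noise in the data (aleatoric uncertainty) is quantified by selecting a likelihood, a common one being $p(y \mid f) = \mathcal{N}(f, \sigma^2)$, which assumes homoskedastic, normally distributed noise with standard deviation $\sigma$. The prior and likelihood are then combined to determine the posterior. When the prior and likelihood are both multivariate normal distributions, the posterior is analytically solvable, giving $p(f \mid D) = \mathcal{GP}(\mu(x), \Sigma(x, x'))$, where:

$$\mu(x) = m(x) + k_x^\top (K + \sigma^2 I)^{-1} (\mathbf{y} - \mathbf{m})$$

$$\Sigma(x, x') = k(x, x') - k_x^\top (K + \sigma^2 I)^{-1} k_{x'}$$

Here $\mathbf{y}$ is the vector of $\{y_i\}$, $\mathbf{m}$ is the vector of $\{m(x_i)\}$, $K$ is the matrix with entries $K_{ij} = k(x_i, x_j)$, $k_x$ is the vector with entries $k(x_i, x)$, and $k_{x'}$ is the vector with entries $k(x_i, x')$.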
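The closed-form GP posterior can be sketched in a few lines of NumPy. This is a minimal example assuming a zero mean function and a squared exponential kernel; the function names, hyperparameter values, and training data are hypothetical and chosen only to show the computation of the posterior mean and covariance.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_std=0.1, kernel=sq_exp_kernel):
    """Posterior mean and covariance of a zero-mean GP with Gaussian noise."""
    K = kernel(x_train, x_train) + noise_std**2 * np.eye(x_train.size)
    K_s = kernel(x_train, x_test)    # covariances between train and test inputs
    K_ss = kernel(x_test, x_test)    # prior covariance at the test inputs
    L = np.linalg.cholesky(K)        # stable alternative to an explicit inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov

# Hypothetical training data: noiseless samples of a sine wave
x_train = np.linspace(0, 5, 20)
y_train = np.sin(x_train)
x_test = np.array([1.0, 2.5, 4.0])

mean, cov = gp_posterior(x_train, y_train, x_test, noise_std=0.05)
```

The Cholesky factorization is used here in place of a direct matrix inverse purely for numerical stability; the result is the same posterior.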

For this work, the squared exponential, Matern 5/2, exponential, and Brownian kernels are used to define data stream temporal trends.

2.3. Our approach

We present a methodology for ranking privacy policy robustness to reverse engineering in the presence of exposed or compromised data. We use probabilistic methods to determine the accuracy with which an adversarial actor can identify the privacy policy and its employed parameters from a compromised data stream. Bayesian inference is used to quantify the likelihood (i.e., probability) of each privacy policy being the generating policy, along with a posterior probability density function for each policy's parameter. Here the target parameter is the privacy loss measure value $\epsilon$. The privacy policies are then ranked for robustness to this type of attack for the given data.

Ten data stream samples are drawn from each of four Gaussian processes, which differ in kernel. The kernels used are the squared exponential, Matern 5/2, exponential, and Brownian. Each data stream is then privacy protected under two privacy policies: Gaussian additive noise or Laplacian additive noise. For each policy, we explore DP privacy loss measure values of $\epsilon \in \{0.1, 0.5, 1.0\}$. We assume that a number $n$ of pre-privacy protected data samples from each data stream are exposed. We apply Bayesian inference to each of these situations and quantify the sum log likelihood (SLL) of the generating privacy noise policy being either Gaussian or Laplacian. When Bayesian inference identifies that the true generating policy is less likely than the alternative, that policy is ranked greater in robustness to this form of privacy policy reverse engineering. Here, to represent the adversarial attacker's limited knowledge of the privacy policy parameters, the Bayesian inference uses uniform priors over $[10^{-5}, 10^{-1}]$ and $[10^{-3}, 10]$ for $\delta$ and $\epsilon$, respectively.
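The data generation step described above can be sketched as follows. This is an illustrative sketch only: the kernel length scale, the time grid, the value of `delta`, and the noise-scale formulas (the textbook Gaussian and Laplace mechanisms, with a maximum allowed error of one tenth of the current value standing in for the sensitivity) are assumptions and may differ from the exact settings used in the experiments.

```python
import numpy as np

def sq_exp(a, b, ell=0.2):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def matern52(a, b, ell=0.2):
    r = np.abs(a[:, None] - b[None, :]) / ell
    return (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

def exponential(a, b, ell=0.2):
    return np.exp(-np.abs(a[:, None] - b[None, :]) / ell)

def brownian(a, b):
    return np.minimum(a[:, None], b[None, :])

KERNELS = {"sq_exp": sq_exp, "matern52": matern52,
           "exponential": exponential, "brownian": brownian}

def draw_streams(kernel, t, n_streams, rng, jitter=1e-6):
    """Draw zero-mean GP prior samples on grid t; rows are data streams."""
    cov = kernel(t, t) + jitter * np.eye(t.size)  # jitter keeps Cholesky stable
    L = np.linalg.cholesky(cov)
    return (L @ rng.standard_normal((t.size, n_streams))).T

def privatize(y, epsilon, policy, rng, delta=1e-5):
    """Add zero-mean noise; scales follow the textbook mechanisms (assumed)."""
    max_ae = np.maximum(0.1 * np.abs(y), 1e-12)  # assumed sensitivity proxy
    if policy == "gaussian":
        scale = np.sqrt(2 * np.log(1.25 / delta)) * max_ae / epsilon
        return y + rng.normal(0.0, scale)
    if policy == "laplacian":
        return y + rng.laplace(0.0, max_ae / epsilon)
    raise ValueError(f"unknown policy: {policy}")

rng = np.random.default_rng(0)
t = np.linspace(0.01, 1.0, 100)  # start above 0 so the Brownian kernel is valid
protected = {}
for name, kernel in KERNELS.items():
    streams = draw_streams(kernel, t, n_streams=10, rng=rng)
    for policy in ("gaussian", "laplacian"):
        for eps in (0.1, 0.5, 1.0):
            protected[(name, policy, eps)] = privatize(streams, eps, policy, rng)
```

Each entry of `protected` holds ten privacy-protected streams for one combination of kernel, policy, and $\epsilon$, which mirrors the grid of cases examined in the experiments.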

For each set of data $D_i$ privacy protected using policy $i \in \{G, L\}$ (Gaussian or Laplacian), we compute the four SLLs, $SLL_{i,j} = \log p(D_i \mid M_j(\theta))$, where $j \in \{G, L\}$ is the candidate policy. The most likely policy is then selected.

A measure of whether the used policy is well obfuscated is given by $\Delta_i = SLL_{i,j} - SLL_{i,i}$ with $j \neq i$, so that $\Delta_i$ is positive (negative) if the wrong (right) policy is estimated to be more likely given the data. A larger positive value indicates a more difficult challenge for Bayesian inference-based reverse engineering and a larger negative value indicates an easier challenge. We investigate $\Delta_i$ as a function of data stream generating kernel, $\epsilon$ value, and size of exposed data stream sample.
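The SLL difference between the wrong and the right policy can be illustrated with a simplified computation. Note that this sketch swaps the MCMC-based Bayesian inference for closed-form maximum-likelihood scale estimates, so it is a stand-in for the procedure rather than a reproduction of it; the function names and the synthetic residuals (protected value minus exposed raw value) are hypothetical.

```python
import numpy as np

def sll_gaussian(residuals):
    """Sum log likelihood under zero-mean Gaussian noise at the MLE scale."""
    sigma = np.sqrt(np.mean(residuals**2))
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - residuals**2 / (2 * sigma**2))

def sll_laplacian(residuals):
    """Sum log likelihood under zero-mean Laplacian noise at the MLE scale."""
    b = np.mean(np.abs(residuals))
    return np.sum(-np.log(2 * b) - np.abs(residuals) / b)

def delta_sll(residuals, true_policy):
    """SLL of the wrong policy minus SLL of the right policy."""
    sll = {"gaussian": sll_gaussian(residuals),
           "laplacian": sll_laplacian(residuals)}
    wrong = "laplacian" if true_policy == "gaussian" else "gaussian"
    return sll[wrong] - sll[true_policy]

# Synthetic residuals for each generating policy (hypothetical data)
rng = np.random.default_rng(0)
res_g = rng.normal(0.0, 1.0, size=2000)
res_l = rng.laplace(0.0, 1.0, size=2000)

delta_g = delta_sll(res_g, "gaussian")
delta_l = delta_sll(res_l, "laplacian")
```

With many exposed samples, both differences come out negative, meaning the right policy is identified; with few samples the sign can flip, which is the regime where a policy counts as better obfuscated.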

2.3.1. Assumption

We assume that a number $n$ of raw data points are available. We investigate the robustness of DP Gaussian and Laplacian additive noise to the exposure of varying numbers of data points as well as different values of the DP privacy loss measure $\epsilon$.

2.3.2. Investigating Privacy Risks

We performed Bayesian inference experiments to determine the privacy policy and its parameters used for privacy protection. Here the target data stream is privacy protected using the equation $\tilde{y}_i = y_i + \eta_i$, with data index $i$ and noise $\eta_i$ given by either the Gaussian or Laplacian distribution with mean of zero and standard deviation $\sigma_G$ (Gaussian) or scale $b_L$ (Laplacian) given by the following equations:

$$\sigma_G = \sqrt{2 \ln(1.25/\delta)} \cdot [\ldots] \quad (1)$$

$$b_L = [\ldots] \quad (2)$$

where MaxAE is the maximum allowed error, a function of the raw data. Here the MaxAE is set to one tenth the value of the current data stream value [17]. The adversary is able to obtain data samples prior to privacy protection along with the same data after privacy protection. The pre-privacy protected data can be obtained in a few ways, including public exposure by the data stream sensor, by the user (e.g., sharing data on social media), or by the adversary temporarily placing a similar sensing device close to the first, e.g., using a microphone or software hack to listen in to part of a conversation held over a cellphone or IoT device. The resulting pre-privacy protected data can then be used to reverse engineer the privacy policy. Knowledge of the privacy policy can then be used to extract raw data from privacy protected data collected before or after the data exposure occurs, as in [13].

3. Results

[...] Bayesian inference-based model determination for the Laplacian and Gaussian additive noise privacy policies. The difference in SLL, $\Delta_i = SLL_{i,j} - SLL_{i,i}$ with $j \neq i$ (the SLL of the wrong policy minus that of the right policy for data generated under policy $i \in \{G, L\}$), is shown for each investigated case, where $\Delta$ for one policy is plotted in orange and for the other in green. The mean of each is indicated by a solid line and the standard deviation is indicated by the colored regions. Diamond markers indicate a positive mean value and squares indicate a negative mean value; these correspond to the wrong and right policy having greater likelihood, respectively. Interestingly, $\Delta_L$ tends to be larger than $\Delta_G$, indicating a greater ease in identifying the Gaussian policy over the Laplacian policy.

Additionally, both means tend to lower values with an increasing number of exposed data samples. In other words, with access to larger amounts of pre-privacy protected data there is an increasing probability of identifying the correct policy, as would be expected. A relationship between the choice of kernel or $\epsilon$ and the resulting $\Delta$ is not clear. Further investigation should be performed where the variance of additive noise is a larger percentage of the data stream variance.

[...] in identifying the correct value of $\epsilon$ is plotted for each case. Red, orange, green, and blue indicate the four combinations of generating policy $i$ and candidate policy $j$, $p(D_i \mid M_j(\theta))$ with $i, j \in \{G, L\}$. A greater [...] increases the difficulty in identifying [...]. A greater robustness to parameter determination is shown by [...]; lower robustness is seen for [...]. Application of the Laplacian additive noise policy tends to provide greater robustness over the Gaussian policy. As expected, there also appears to be a subtle reduction in parameter estimation error with an increasing number of data points.

4. Conclusion

As more data is shared online, the need for privacy preservation is becoming critical to ensure user confidence in sharing and analyzing personal data with online services. In this paper we investigated the robustness of privacy policies when a subset of pre-privacy protected data is exposed. We demonstrated a methodology for selecting the privacy policy that is more difficult to identify through Bayesian inference-based model and parameter determination. For the range of data stream trends investigated, the Laplacian noise privacy policy was more difficult to identify than the Gaussian policy, for both Bayesian inference-based model and parameter determination.

We hope our results and discussion will be helpful to [...] sets.

Acknowledgment

This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

References

[1] "Number of IoT connected devices worldwide",

[4] "4 Ways Cyber Attackers May Be Hacking Your IoT Devices Right Now", shorturl.at/bdjm1, Accessed: August 17, 2022.

[5] Khan, Foley, B. O'Sullivan, From k-anonymity to differential privacy: A brief introduction to formal privacy models (2021).

[6] S. H. Begum, Nausheen, A comparative analysis of differential privacy vs other privacy mechanisms for big data, in: 2018 2nd International Conference on Inventive Systems and Control (ICISC), 2018, pp. 512-516.

[7] Vasa, Thakkar, Deep learning: Differential privacy preservation in the era of big data, Journal of Computer Information Systems (2022) 1-24.

[8] Kumar, Kumar, Privacy preservation of electronic health record: Current status and future direction, Handbook of Computer Networks and Cyber Security (2020) 715-739.

[9] Peralta-Peterson, Kotevska, Effectiveness of privacy techniques in smart metering systems, in: 2021 International Conference on Computational Science and Computational Intelligence (CSCI), IEEE, 2021, pp. 675-678.

[10] Zhao, Chen, A survey on differential privacy for unstructured data content, ACM Computing Surveys (CSUR) (2022).

[11] Al-Rubaie, J. M. Chang, Privacy-preserving machine learning: Threats and solutions, IEEE Security & Privacy 17 (2019) 49-58.

[12] Rigaki, Garcia, A survey of privacy attacks in machine learning, arXiv preprint arXiv:2007.07646 (2020).

[13] Kotevska, Johnson, A. G. Kusne, Analyzing data privacy for edge systems, in: 2022 IEEE International Conference on Smart Computing (SMARTCOMP), IEEE, 2022, pp. 223-228.

[14] G. E. Box, G. C. Tiao, Bayesian inference in statistical analysis, John Wiley & Sons, 2011.

[15] Pishro-Nik, Introduction to probability, statistics, and random processes (2016).

[16] C. E. Rasmussen, Gaussian processes in machine learning, in: Summer school on machine learning, Springer, 2003, pp. 63-71.

[17] M. U. Hassan, M. H. Rehmani, Kotagiri, Zhang, J. Chen, Differential privacy for renewable energy resources based smart metering, Journal of Parallel and Distributed Computing 131 (2019) 69-80.