Privacy policy robustness to reverse engineering

A. Gilad Kusne¹,*, Olivera Kotevska²
¹ National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA
² Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN, 37830, USA

CIKM'22: Privacy Algorithms in Systems (PAS), October 21, 2022, Atlanta, GA
* Corresponding author: aaron.kusne@nist.gov (A. G. Kusne); kotevskao@ornl.gov (O. Kotevska)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
Differential privacy policies allow one to preserve data privacy while sharing and analyzing data. However, these policies are susceptible to an array of attacks. In particular, a portion of the data intended to be privacy protected is often exposed online. Access to these pre-privacy protected data samples can then be used to reverse engineer the privacy policy. With knowledge of the generating privacy policy, an attacker can use machine learning to approximate the full set of originating data. Bayesian inference is one method for reverse engineering both the model and the model parameters. We present a methodology for evaluating and ranking privacy policy robustness to Bayesian inference-based reverse engineering, and demonstrate this method on data with a variety of temporal trends.

Keywords
Differential privacy, Bayesian inference, Privacy policy, Privacy defenses

1. Introduction

In recent years, the number of devices connected to the Internet and online services has increased drastically [1], leading to an exponential growth in data generation [2]. This trend is visible across different domains and applications including, among many others, streaming medical, personal tracking, and energy use data [1]. Typically, sensing systems are digitized and connected to network-based analysis tools, and the success of these data streaming devices results in increasing adoption and deployment.

Although the proliferation of connected devices increases convenience across many aspects of life, it also creates dangers when sharing sensitive information. This is especially true for sharing unprotected data over the Internet. Around 98% of device traffic is unencrypted and transmitted over the Internet [3]. Cybercriminals have taken notice of this behavior. On average, sensor-based devices are probed for security vulnerabilities around 800 times per hour, with 400 login attempts and 130 successful logins on each device [4].

To address these rapidly growing security risks, significant effort has been devoted to the development of privacy preservation algorithms and their integration into existing platforms. Some of the most used algorithms are randomization, k-anonymity, l-diversity, cryptography, and differential privacy (DP) [5]. These methods have been successfully demonstrated on big data [6], deep learning [7], and medical records [8], as well as other domains. Recent studies have shown that DP is the most effective approach due to its rigorous privacy definition and low computational overhead for continuous (i.e., streaming) data sets [9]. A recent survey identified that DP provides successful privacy preservation, with the most common DP mechanisms being Laplacian and Gamma distributions and randomized response [10].

A differentially private model ensures that adversaries are incapable of inferring high-confidence information about a single record from released models or output results [11]. However, adversaries may manage to infer or identify sensitive information by employing additional unprotected, publicly released data, especially when equipped with machine learning tools. Common attack types include the re-identification attack, membership inference attack, model inversion attack, model extraction attack, and model attribute inference attack [12]. These attacks seek to extract information about the data, model, or attributes.

We investigate the use of Bayesian inference-based attacks to identify the privacy policy employed when pre-privacy protected data samples are available. In this preliminary work we evaluate the likelihood that an adversary can differentiate between DP mechanisms employing either Gaussian or Laplacian additive noise, building on the previous work of [13]. Application to data with different temporal trends is explored: the data streams are sampled from zero-mean Gaussian processes with different kernels, each producing a different temporal trend. We then use the analysis results to rank privacy policy robustness to such reverse engineering.

2. Model

2.1. Bayesian Inference

Bayesian inference [14] is a statistical sampling method for determining the probability of a hypothesis given data, through the use of Bayes' theorem [15]. A common application of Bayesian inference is to identify the most likely parameter values θ of a generating model M(θ) for observed data Y. Toward this goal, for a given model, a prior over the parameters is needed. The prior belief for the parameter values θ is given by the probability density function p(θ) (or the probability mass function P(θ) if the parameters θ take on discrete values). The probability of observing data Y given particular values for the parameters is given by p(Y|M(θ)), also known as the likelihood. Through the use of Bayes' theorem, the prior and likelihood are combined to determine the probability of different values of θ given the observed Y. This probability is known as the posterior and is represented by p(M(θ)|Y). Bayesian inference employs statistical sampling of the model parameters' prior and forward computation of the likelihood to evaluate the posterior. For this work, Markov chain Monte Carlo (MCMC) is the sampling method used.
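To make the inference step concrete, the following is a minimal sketch of a Metropolis-style MCMC sampler for the posterior over an additive-noise scale, given residuals formed from exposed raw samples and their privacy-protected counterparts. This is not the authors' implementation: the use of NumPy, the function names, the proposal width, and the uniform prior range are illustrative assumptions.

```python
import numpy as np

def log_lik(residuals, scale, policy):
    """Sum log-likelihood of the observed noise residuals under a candidate policy."""
    if policy == "laplace":
        return np.sum(-np.abs(residuals) / scale - np.log(2.0 * scale))
    return np.sum(-0.5 * (residuals / scale) ** 2
                  - np.log(scale * np.sqrt(2.0 * np.pi)))  # gaussian

def metropolis_scale(residuals, policy, n_steps=5000, step=0.05, seed=0):
    """Metropolis sampling of the posterior over the noise scale,
    with a uniform prior on [1e-3, 10] (an illustrative range)."""
    rng = np.random.default_rng(seed)
    scale, samples = 1.0, []
    for _ in range(n_steps):
        proposal = scale + rng.normal(0.0, step)
        if 1e-3 <= proposal <= 10.0:  # reject proposals outside the prior support
            if np.log(rng.uniform()) < (log_lik(residuals, proposal, policy)
                                        - log_lik(residuals, scale, policy)):
                scale = proposal
        samples.append(scale)
    return np.array(samples)

# Toy usage: residuals = (privacy-protected sample) - (exposed raw sample)
residuals = np.random.default_rng(1).laplace(0.0, 0.5, size=20)
posterior_laplace = metropolis_scale(residuals, "laplace")
posterior_gauss = metropolis_scale(residuals, "gaussian")
```

The sum log likelihood of the residuals under each candidate policy, evaluated over the posterior samples, is the quantity compared in Section 2.3.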
2.2. Gaussian Process

A Gaussian process (GP) [16] is a common Bayesian nonparametric regression tool. To learn the function f : X → y for the data D = {(xᵢ, yᵢ)}, i = 1, …, n, a prior p(f) = N(μ(x), K(x, x′)) is assumed to quantify epistemic uncertainty. Here μ(x) is a mean function and K(x, x′) is a covariance function. Expected noise in the data (aleatoric uncertainty) is quantified by selecting a likelihood, a common one being p(D|f) = N(f, Iσ²), which assumes homoskedastic, normally distributed noise with standard deviation σ. The prior and likelihood are then combined to determine the posterior. When the prior and likelihood are both multivariate normal distributions, the posterior is analytically solvable, giving p(f|D) = N(μₙ(x), Kₙ(x, x′)), where:

μₙ(x) = μ(x) + kᵀ(K + σ²I)⁻¹ y
Kₙ(x, x′) = K(x, x′) − kᵀ(K + σ²I)⁻¹ k′

Here y is the vector of {yᵢ}, kᵢ = K(x, xᵢ), and k′ᵢ = K(x′, xᵢ). For this work, the squared exponential, Matern 5/2, exponential, and Brownian kernels are used to define the data stream temporal trends.
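As an illustration of how such data streams can be generated, the sketch below draws samples from a zero-mean GP with a squared-exponential kernel. The length scale, time grid, jitter, and sample count are illustrative assumptions rather than the experimental settings used in this work.

```python
import numpy as np

def squared_exponential(t1, t2, length_scale=5.0):
    """Squared-exponential (RBF) covariance between two sets of time points."""
    d = t1[:, None] - t2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def sample_streams(kernel, t, n_samples=10, jitter=1e-8, seed=0):
    """Draw n_samples data streams from a zero-mean GP evaluated on grid t."""
    rng = np.random.default_rng(seed)
    K = kernel(t, t) + jitter * np.eye(len(t))  # jitter keeps K positive definite
    return rng.multivariate_normal(np.zeros(len(t)), K, size=n_samples)

t = np.arange(100.0)
streams = sample_streams(squared_exponential, t)  # shape (10, 100)
```

The Matern 5/2, exponential, and Brownian kernels used in the experiments would simply swap the covariance function; for example, the Brownian-motion kernel is K(t, t′) = min(t, t′).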
2.3. Our approach

We present a methodology for ranking privacy policy robustness to reverse engineering in the presence of exposed or compromised data. We use probabilistic methods to determine the accuracy with which an adversarial actor can identify the privacy policy and its employed parameters from a compromised data stream. Bayesian inference is used to quantify the likelihood (i.e., probability) of each privacy policy being the generating policy and a posterior probability density function for each policy's parameter. Here the target parameter is the privacy loss measure value ε. The privacy policies are then ranked for robustness to this type of attack for the given data.

Ten data stream samples are drawn from each of four Gaussian processes, which differ in kernel. The kernels used are the squared exponential, Matern 5/2, exponential, and Brownian. These data streams are then each privacy protected under two privacy policies - using Gaussian additive noise or Laplacian additive noise. For each policy, we explore DP privacy loss measure values of ε = [0.1, 0.5, 1.0]. We assume that x pre-privacy protected data samples from each data stream are exposed. We apply Bayesian inference to each of these situations and quantify the sum log likelihood (SLL) of the generating privacy noise policy being either Gaussian or Laplacian. When Bayesian inference identifies that the true generating policy is less likely than the alternative, that policy is ranked greater in robustness to this form of privacy policy reverse engineering. Here, to represent the adversarial attacker's limited knowledge of the privacy policy parameters, the Bayesian inference uses a uniform prior over [10⁻⁵, 10⁻¹] for δ and over [10⁻³, 10] for ε.

For each set of data D(mⱼ) privacy protected using policy mⱼ ∈ {Gaussian, Laplacian}, we compute the four SLLs: Lᵢ,ⱼ = L(mᵢ|D(mⱼ)). The most likely policy is then selected. A measure of whether the used policy is well obfuscated is given by Δᵢ = Lᵢ,ⱼ − Lᵢ,ᵢ, with Δᵢ positive if the wrong policy is estimated to be more likely given the data and negative if the right policy is. A larger positive value indicates a more difficult challenge for Bayesian inference-based reverse engineering, and a larger negative value indicates an easier challenge. We investigate Δᵢ as a function of the data stream generating kernel, the ε value, and the size of the exposed data stream sample.

2.3.1. Assumption

We assume that x raw data points are available. We investigate the robustness of DP Gaussian and Laplacian additive noise to the exposure of varying numbers of data points as well as different values of the DP privacy loss measure ε.

2.3.2. Investigating Privacy Risks

We performed Bayesian inference experiments to determine the privacy policy and its parameters used for privacy protection. Here the target data stream is privacy protected using the equation yᵢ = yᵢ + nᵢ, with data index i and noise nᵢ drawn from either the Gaussian or Laplacian distribution with mean of zero and scale (or standard deviation) given by the following equations:

σ = √(2 · log(1.25 / δ)) · (sensitivity / ε)    (1)

sensitivity = (2 · MaxAE) / √2    (2)

where MaxAE is the maximum allowed error, a function of the raw data. Here the MaxAE is set to one tenth of the current data stream value [17]. The adversary is able to obtain data samples prior to privacy protection along with the same data after privacy protection. The pre-privacy protected data can be obtained in a few ways, including public exposure by the data stream sensor, by the user (e.g., sharing data on social media), or by the adversary temporarily placing a similar sensing device close to the first, e.g., using a microphone or software hack to listen in to part of a conversation held over a cellphone or IoT device. The resulting pre-privacy protected data can then be used to reverse engineer the privacy policy. Knowledge of the privacy policy can then be used to extract raw data from privacy protected data collected before or after the data exposure occurs, as in [13].
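For concreteness, a minimal sketch of the two additive-noise policies is given below. The Gaussian scale follows Eqs. (1)-(2); the Laplacian scale of sensitivity/ε is the standard Laplace-mechanism calibration and is an assumption here, since the text does not state it explicitly. The function name and NumPy usage are likewise illustrative.

```python
import numpy as np

def protect_stream(y, eps, delta=1e-3, policy="gaussian", seed=0):
    """Privacy-protect a data stream y with zero-mean additive noise.

    MaxAE is set to one tenth of the current stream value and the sensitivity
    follows Eq. (2). The Laplace scale sensitivity / eps is the standard
    Laplace-mechanism choice (an assumption, not given in the text).
    """
    rng = np.random.default_rng(seed)
    max_ae = 0.1 * np.abs(y)                     # one tenth of the current value
    sensitivity = 2.0 * max_ae / np.sqrt(2.0)    # Eq. (2)
    if policy == "gaussian":
        sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / eps  # Eq. (1)
        return y + rng.normal(0.0, sigma)
    return y + rng.laplace(0.0, sensitivity / eps)

# Toy usage on a positive-valued stream
protected = protect_stream(np.sin(np.linspace(0.0, 6.0, 100)) + 2.0, eps=0.5)
```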
3. Results

Figure 1 compares the robustness to Bayesian inference-based model determination for the Laplacian and Gaussian additive noise privacy policies. The difference in SLL, Δᵢ = Lᵢ,ⱼ − Lᵢ,ᵢ, for each investigated case is shown, where Δ_Laplace is plotted in orange and Δ_Gaussian in green. The mean of each is indicated by a solid line and the standard deviation by the colored region. Diamond markers indicate a positive mean value and squares a negative mean value; these correspond to the wrong and right policy having greater likelihood, respectively. Interestingly, Δ_Laplace tends to be larger than Δ_Gaussian, indicating a greater ease in identifying the Gaussian policy than the Laplacian policy. Additionally, both means tend toward lower values with an increasing number of exposed data samples. In other words, with access to larger amounts of pre-privacy protected data there is an increasing probability of identifying the correct policy, as would be expected. A relationship between the choice of kernel or ε and the resulting Δᵢ is not clear. Further investigation should be performed where the variance of the additive noise is a larger percentage of the data stream variance.

Figure 2 provides a plot of robustness to Bayesian inference-based parameter determination. The deviation in identifying the correct value of ε is plotted for each case. Red, orange, green, and blue indicate L(mᵢ|D(mⱼ)) for L(Lap|D(Lap)), L(Gaus|D(Lap)), L(Lap|D(Gaus)), and L(Gaus|D(Gaus)), respectively. A greater ε increases the difficulty in identifying ε. The greatest robustness to parameter determination is shown by L(Lap|D(Lap)), while the lowest robustness is seen for L(Gaus|D(Gaus)). The application of the Laplacian additive noise policy tends to provide greater robustness than the Gaussian policy. As expected, there also appears to be a subtle reduction in parameter estimation error with an increasing number of data points.

Figure 1: Robustness to Bayesian inference-based model determination. The difference in sum log likelihood (SLL), Δᵢ = Lᵢ,ⱼ − Lᵢ,ᵢ, for each investigated case. Δ_Laplace is plotted in orange and Δ_Gaussian in green, with the mean indicated by a solid line and the standard deviation by the colored region. Diamond markers indicate a positive mean value and squares a negative mean value; these correspond to the wrong and right policy having greater likelihood, respectively.

Figure 2: Robustness to Bayesian inference-based parameter determination. The deviation in identifying the correct value of ε under each case. Red, orange, green, and blue indicate L(mᵢ|D(mⱼ)) for L(Lap|D(Lap)), L(Gaus|D(Lap)), L(Lap|D(Gaus)), and L(Gaus|D(Gaus)), respectively. A greater ε increases the difficulty in identifying ε. The most difficulty is shown with L(Lap|D(Lap)), while the greatest ease is shown with L(Gaus|D(Gaus)). There also appears to be a subtle reduction in estimation error with an increasing number of data points. Error bars are slightly shifted for visibility.
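To tie the pieces together, the following illustrative sketch assembles the quantity Δᵢ behind Figure 1 for a single stream. It reuses the helper functions from the earlier sketches (sample_streams, squared_exponential, protect_stream, metropolis_scale, log_lik); the offset, sample counts, burn-in, and posterior-averaging choices are assumptions, not the authors' experimental settings.

```python
import numpy as np

def delta_for_policy(true_policy, eps=0.5, n_exposed=20, seed=0):
    """Delta_i for one stream: SLL of the wrong policy minus SLL of the true
    policy, each averaged over its posterior scale samples. Reuses the
    illustrative helpers defined in the earlier sketches."""
    wrong_policy = "gaussian" if true_policy == "laplace" else "laplace"
    t = np.arange(100.0)
    # Offset the zero-mean GP sample so MaxAE = 0.1 * |y| stays away from zero.
    stream = sample_streams(squared_exponential, t, n_samples=1, seed=seed)[0] + 5.0
    protected = protect_stream(stream, eps=eps, policy=true_policy, seed=seed)
    exposed = slice(0, n_exposed)                    # raw points seen by the adversary
    residuals = protected[exposed] - stream[exposed]
    sll = {}
    for policy in (true_policy, wrong_policy):
        scales = metropolis_scale(residuals, policy)[1000:]  # drop burn-in
        sll[policy] = np.mean([log_lik(residuals, s, policy) for s in scales])
    return sll[wrong_policy] - sll[true_policy]      # positive: true policy obfuscated

print("Delta_Laplace :", delta_for_policy("laplace"))
print("Delta_Gaussian:", delta_for_policy("gaussian"))
```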
4. Conclusion

As more data is shared online, privacy preservation is becoming critical to ensuring user confidence in sharing and analyzing personal data with online services. In this paper we investigated the robustness of privacy policies when a subset of the pre-privacy protected data is exposed. We demonstrated a methodology for selecting the privacy policy that is more difficult to identify through Bayesian inference-based model and parameter determination. For the range of data stream trends investigated, the Laplacian noise privacy policy was more difficult to identify than the Gaussian policy, for both Bayesian inference-based model and parameter determination. We hope our results and discussion will be helpful to the community using privacy protection for their data sets.

Acknowledgment

This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

References

[1] "Number of IoT connected devices worldwide", https://www.statista.com/statistics/1183457/iot-connected-devices-worldwide/, Accessed: August 9, 2022.
[2] "The Growth in Connected IoT Devices", shorturl.at/lmnv1, Accessed: August 9, 2022.
[3] "2020 Unit 42 IoT Threat Report", https://start.paloaltonetworks.com/unit-42-iot-threat-report, Accessed: August 17, 2022.
[4] "4 Ways Cyber Attackers May Be Hacking Your IoT Devices Right Now", shorturl.at/bdjm1, Accessed: August 17, 2022.
[5] M. Khan, S. Foley, B. O'Sullivan, From k-anonymity to differential privacy: A brief introduction to formal privacy models (2021).
[6] S. H. Begum, F. Nausheen, A comparative analysis of differential privacy vs other privacy mechanisms for big data, in: 2018 2nd International Conference on Inventive Systems and Control (ICISC), 2018, pp. 512-516.
[7] J. Vasa, A. Thakkar, Deep learning: Differential privacy preservation in the era of big data, Journal of Computer Information Systems (2022) 1-24.
[8] A. Kumar, R. Kumar, Privacy preservation of electronic health record: Current status and future direction, Handbook of Computer Networks and Cyber Security (2020) 715-739.
[9] M. Peralta-Peterson, O. Kotevska, Effectiveness of privacy techniques in smart metering systems, in: 2021 International Conference on Computational Science and Computational Intelligence (CSCI), IEEE, 2021, pp. 675-678.
[10] Y. Zhao, J. Chen, A survey on differential privacy for unstructured data content, ACM Computing Surveys (CSUR) (2022).
[11] M. Al-Rubaie, J. M. Chang, Privacy-preserving machine learning: Threats and solutions, IEEE Security & Privacy 17 (2019) 49-58.
[12] M. Rigaki, S. Garcia, A survey of privacy attacks in machine learning, arXiv preprint arXiv:2007.07646 (2020).
[13] O. Kotevska, J. Johnson, A. G. Kusne, Analyzing data privacy for edge systems, in: 2022 IEEE International Conference on Smart Computing (SMARTCOMP), IEEE, 2022, pp. 223-228.
[14] G. E. Box, G. C. Tiao, Bayesian inference in statistical analysis, John Wiley & Sons, 2011.
[15] H. Pishro-Nik, Introduction to probability, statistics, and random processes (2016).
[16] C. E. Rasmussen, Gaussian processes in machine learning, in: Summer school on machine learning, Springer, 2003, pp. 63-71.
[17] M. U. Hassan, M. H. Rehmani, R. Kotagiri, J. Zhang, J. Chen, Differential privacy for renewable energy resources based smart metering, Journal of Parallel and Distributed Computing 131 (2019) 69-80.