Privacy policy robustness to reverse engineering

A. Gilad Kusne¹,*, Olivera Kotevska²
¹ National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA
² Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN, 37830, USA

CIKM'22: Privacy Algorithms in Systems (PAS), October 21, 2022, Atlanta, GA
* Corresponding author: aaron.kusne@nist.gov (A. G. Kusne); kotevskao@ornl.gov (O. Kotevska)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
Differential privacy policies allow one to preserve data privacy while sharing and analyzing data. However, these policies are susceptible to an array of attacks. In particular, a portion of the data intended to be privacy protected is often exposed online. Access to these pre-privacy protected data samples can then be used to reverse engineer the privacy policy. With knowledge of the generating privacy policy, an attacker can use machine learning to approximate the full set of originating data. Bayesian inference is one method for reverse engineering both the model and the model parameters. We present a methodology for evaluating and ranking privacy policy robustness to Bayesian inference-based reverse engineering, and demonstrate this method on data with a variety of temporal trends.

Keywords
Differential privacy, Bayesian inference, Privacy policy, Privacy defenses

1. Introduction

In recent years, the number of devices connected to the Internet and online services has increased drastically [1], leading to an exponential growth in data generation [2]. This trend is visible across different domains and applications including, among many others, streaming medical, personal tracking, and energy use data [1]. Typically, sensing systems are digitized and connected to network-based analysis tools, and the success of these data streaming devices results in increasing adoption and deployment.

Although the proliferation of connected devices increases convenience across many aspects of life, it also creates dangers when sharing sensitive information. This is especially true for sharing unprotected data over the Internet. Around 98% of device traffic is unencrypted and transmitted over the Internet [3]. Cybercriminals have taken notice of this behavior. On average, sensor-based devices are probed for security vulnerabilities around 800 times per hour, with 400 login attempts and 130 successful logins on each device [4].

To address these rapidly growing security risks, significant effort has been devoted to the development of privacy preservation algorithms and their integration into existing platforms. Some of the most used algorithms are randomization, k-anonymity, l-diversity, cryptography, and differential privacy (DP) [5]. These methods have been successfully demonstrated on big data [6], deep learning [7], and medical records [8], as well as other domains. Recent studies have shown that DP is the most effective approach due to its rigorous privacy definition and low computational overhead for continuous (i.e., streaming) data sets [9]. A recent survey identified that DP provides successful privacy preservation, with the most common DP mechanisms being Laplacian and Gamma distributions and randomized response [10].

A differentially private model ensures that adversaries are incapable of inferring high-confidence information about a single record from released models or output results [11]. However, adversaries may manage to infer or identify sensitive information by employing additional unprotected, publicly released data, especially when equipped with machine learning tools. Common attack types include the re-identification attack, membership inference attack, model inversion attack, model extraction attack, and model attribute inference attack [12]. These attacks seek to extract information about the data, model, or attributes.

We investigate the use of Bayesian inference-based attacks to identify the privacy policy employed when pre-privacy protected data samples are available. In this preliminary work we evaluate the likelihood that an adversary can differentiate between DP mechanisms employing either Gaussian or Laplacian additive noise, building on the previous work of [13]. Application to data with different temporal trends is explored: the data streams are sampled from zero-mean Gaussian processes with different kernels, each producing a different temporal trend. We then use the analysis results to rank privacy policy robustness to such reverse engineering.

2. Model

2.1. Bayesian Inference

Bayesian inference [14] is a statistical sampling method for determining the probability of a hypothesis given data, through the use of Bayes' theorem [15]. A common application of Bayesian inference is to identify the most likely parameter values θ of a generating model M(θ) for observed data Y. Toward this goal, for a given model, a prior over the parameters is needed. The prior belief for the parameter values θ is given by the probability density function p(θ) (or the probability mass function P(θ) if the parameters θ take on discrete values). The probability of observing data Y given particular values for the parameters is given by p(Y|M(θ)), also known as the likelihood. Through the use of Bayes' theorem, the prior and likelihood are combined to determine the probability of different values of θ given the observed Y. This probability is known as the posterior and is represented by p(M(θ)|Y). Bayesian inference employs statistical sampling of the model parameters' prior and forward computation of the likelihood to evaluate the posterior. For this work, Markov chain Monte Carlo (MCMC) is the sampling method used.
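To make the inference step concrete, the following is a minimal sketch of a Metropolis-style MCMC sampler for the posterior over an additive-noise scale, given residuals formed from exposed raw samples and their privacy-protected counterparts. This is not the authors' implementation: the use of NumPy, the function names, the proposal width, and the uniform prior range are illustrative assumptions.

```python
import numpy as np

def log_lik(residuals, scale, policy):
    """Sum log-likelihood of the observed noise residuals under a candidate policy."""
    if policy == "laplace":
        return np.sum(-np.abs(residuals) / scale - np.log(2.0 * scale))
    return np.sum(-0.5 * (residuals / scale) ** 2
                  - np.log(scale * np.sqrt(2.0 * np.pi)))  # gaussian

def metropolis_scale(residuals, policy, n_steps=5000, step=0.05, seed=0):
    """Metropolis sampling of the posterior over the noise scale,
    with a uniform prior on [1e-3, 10] (an illustrative range)."""
    rng = np.random.default_rng(seed)
    scale, samples = 1.0, []
    for _ in range(n_steps):
        proposal = scale + rng.normal(0.0, step)
        if 1e-3 <= proposal <= 10.0:  # reject proposals outside the prior support
            if np.log(rng.uniform()) < (log_lik(residuals, proposal, policy)
                                        - log_lik(residuals, scale, policy)):
                scale = proposal
        samples.append(scale)
    return np.array(samples)

# Toy usage: residuals = (privacy-protected sample) - (exposed raw sample)
residuals = np.random.default_rng(1).laplace(0.0, 0.5, size=20)
posterior_laplace = metropolis_scale(residuals, "laplace")
posterior_gauss = metropolis_scale(residuals, "gaussian")
```

The sum log likelihood of the residuals under each candidate policy, evaluated over the posterior samples, is the quantity compared in Section 2.3.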
2.2. Gaussian Process

A Gaussian process (GP) [16] is a common Bayesian nonparametric regression tool. To learn the function f : X → y for the data D = {(xᵢ, yᵢ)}, i = 1, …, n, a prior p(f) = N(μ(x), K(x, x′)) is assumed to quantify epistemic uncertainty. Here μ(x) is a mean function and K(x, x′) is a covariance function. Expected noise in the data (aleatoric uncertainty) is quantified by selecting a likelihood, a common one being p(D|f) = N(f, Iσ²), which assumes homoskedastic, normally distributed noise with standard deviation σ. The prior and likelihood are then combined to determine the posterior. When the prior and likelihood are both multivariate normal distributions, the posterior is analytically solvable, giving p(f|D) = N(μₙ(x), Kₙ(x, x′)), where:

μₙ(x) = μ(x) + kᵀ(K + σ²I)⁻¹ y
Kₙ(x, x′) = K(x, x′) − kᵀ(K + σ²I)⁻¹ k′

Here y is the vector of {yᵢ}, kᵢ = K(x, xᵢ), and k′ᵢ = K(x′, xᵢ). For this work, the squared exponential, Matern 5/2, exponential, and Brownian kernels are used to define the data stream temporal trends.
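As an illustration of how such data streams can be generated, the sketch below draws samples from a zero-mean GP with a squared-exponential kernel. The length scale, time grid, jitter, and sample count are illustrative assumptions rather than the experimental settings used in this work.

```python
import numpy as np

def squared_exponential(t1, t2, length_scale=5.0):
    """Squared-exponential (RBF) covariance between two sets of time points."""
    d = t1[:, None] - t2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def sample_streams(kernel, t, n_samples=10, jitter=1e-8, seed=0):
    """Draw n_samples data streams from a zero-mean GP evaluated on grid t."""
    rng = np.random.default_rng(seed)
    K = kernel(t, t) + jitter * np.eye(len(t))  # jitter keeps K positive definite
    return rng.multivariate_normal(np.zeros(len(t)), K, size=n_samples)

t = np.arange(100.0)
streams = sample_streams(squared_exponential, t)  # shape (10, 100)
```

The Matern 5/2, exponential, and Brownian kernels used in the experiments would simply swap the covariance function; for example, the Brownian-motion kernel is K(t, t′) = min(t, t′).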
2.3. Our approach

We present a methodology for ranking privacy policy robustness to reverse engineering in the presence of exposed or compromised data. We use probabilistic methods to determine the accuracy with which an adversarial actor can identify the privacy policy and its employed parameters from a compromised data stream. Bayesian inference is used to quantify the likelihood (i.e., probability) of each privacy policy being the generating policy and a posterior probability density function for each policy's parameter. Here the target parameter is the privacy loss measure value ε. The privacy policies are then ranked for robustness to this type of attack for the given data.

Ten data stream samples are drawn from each of four Gaussian processes, which differ in kernel. The kernels used are the squared exponential, Matern 5/2, exponential, and Brownian. These data streams are then each privacy protected under two privacy policies - using Gaussian additive noise or Laplacian additive noise. For each policy, we explore DP privacy loss measure values of ε = [0.1, 0.5, 1.0]. We assume that x pre-privacy protected data samples from each data stream are exposed. We apply Bayesian inference to each of these situations and quantify the sum log likelihood (SLL) of the generating privacy noise policy being either Gaussian or Laplacian. When Bayesian inference identifies that the true generating policy is less likely than the alternative, that policy is ranked greater in robustness to this form of privacy policy reverse engineering. Here, to represent the adversarial attacker's limited knowledge of the privacy policy parameters, the Bayesian inference uses a uniform prior over [10⁻⁵, 10⁻¹] for δ and over [10⁻³, 10] for ε.

For each set of data D(mⱼ) privacy protected using policy mⱼ ∈ {Gaussian, Laplacian}, we compute the four SLLs: Lᵢ,ⱼ = L(mᵢ|D(mⱼ)). The most likely policy is then selected. A measure of whether the used policy is well obfuscated is given by Δᵢ = Lᵢ,ⱼ − Lᵢ,ᵢ, with Δᵢ positive if the wrong policy is estimated to be more likely given the data and negative if the right policy is. A larger positive value indicates a more difficult challenge for Bayesian inference-based reverse engineering, and a larger negative value indicates an easier challenge. We investigate Δᵢ as a function of the data stream generating kernel, the ε value, and the size of the exposed data stream sample.

2.3.1. Assumption

We assume that x raw data points are available. We investigate the robustness of DP Gaussian and Laplacian additive noise to the exposure of varying numbers of data points as well as different values of the DP privacy loss measure ε.

2.3.2. Investigating Privacy Risks

We performed Bayesian inference experiments to determine the privacy policy and its parameters used for privacy protection. Here the target data stream is privacy protected using the equation yᵢ = yᵢ + nᵢ, with data index i and noise nᵢ drawn from either the Gaussian or Laplacian distribution with mean of zero and scale (or standard deviation) given by the following equations:

σ = √(2 · log(1.25 / δ)) · (sensitivity / ε)    (1)

sensitivity = (2 · MaxAE) / √2    (2)

where MaxAE is the maximum allowed error, a function of the raw data. Here the MaxAE is set to one tenth of the current data stream value [17]. The adversary is able to obtain data samples prior to privacy protection along with the same data after privacy protection. The pre-privacy protected data can be obtained in a few ways, including public exposure by the data stream sensor, by the user (e.g., sharing data on social media), or by the adversary temporarily placing a similar sensing device close to the first, e.g., using a microphone or software hack to listen in to part of a conversation held over a cellphone or IoT device. The resulting pre-privacy protected data can then be used to reverse engineer the privacy policy. Knowledge of the privacy policy can then be used to extract raw data from privacy protected data collected before or after the data exposure occurs, as in [13].
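For concreteness, a minimal sketch of the two additive-noise policies is given below. The Gaussian scale follows Eqs. (1)-(2); the Laplacian scale of sensitivity/ε is the standard Laplace-mechanism calibration and is an assumption here, since the text does not state it explicitly. The function name and NumPy usage are likewise illustrative.

```python
import numpy as np

def protect_stream(y, eps, delta=1e-3, policy="gaussian", seed=0):
    """Privacy-protect a data stream y with zero-mean additive noise.

    MaxAE is set to one tenth of the current stream value and the sensitivity
    follows Eq. (2). The Laplace scale sensitivity / eps is the standard
    Laplace-mechanism choice (an assumption, not given in the text).
    """
    rng = np.random.default_rng(seed)
    max_ae = 0.1 * np.abs(y)                     # one tenth of the current value
    sensitivity = 2.0 * max_ae / np.sqrt(2.0)    # Eq. (2)
    if policy == "gaussian":
        sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / eps  # Eq. (1)
        return y + rng.normal(0.0, sigma)
    return y + rng.laplace(0.0, sensitivity / eps)

# Toy usage on a positive-valued stream
protected = protect_stream(np.sin(np.linspace(0.0, 6.0, 100)) + 2.0, eps=0.5)
```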
3. Results

Figure 1 compares the robustness to Bayesian inference-based model determination for the Laplacian and Gaussian additive noise privacy policies. The difference in SLL, Δᵢ = Lᵢ,ⱼ − Lᵢ,ᵢ, for each investigated case is shown, where Δ_Laplace is plotted in orange and Δ_Gaussian in green. The mean of each is indicated by a solid line and the standard deviation by the colored region. Diamond markers indicate a positive mean value and squares a negative mean value; these correspond to the wrong and right policy having greater likelihood, respectively. Interestingly, Δ_Laplace tends to be larger than Δ_Gaussian, indicating a greater ease in identifying the Gaussian policy than the Laplacian policy. Additionally, both means tend toward lower values with an increasing number of exposed data samples. In other words, with access to larger amounts of pre-privacy protected data there is an increasing probability of identifying the correct policy, as would be expected. A relationship between the choice of kernel or ε and the resulting Δᵢ is not clear. Further investigation should be performed where the variance of the additive noise is a larger percentage of the data stream variance.

Figure 2 provides a plot of robustness to Bayesian inference-based parameter determination. The deviation in identifying the correct value of ε is plotted for each case. Red, orange, green, and blue indicate L(mᵢ|D(mⱼ)) for L(Lap|D(Lap)), L(Gaus|D(Lap)), L(Lap|D(Gaus)), and L(Gaus|D(Gaus)), respectively. A greater ε increases the difficulty in identifying ε. The greatest robustness to parameter determination is shown by L(Lap|D(Lap)), while the lowest robustness is seen for L(Gaus|D(Gaus)). The application of the Laplacian additive noise policy tends to provide greater robustness than the Gaussian policy. As expected, there also appears to be a subtle reduction in parameter estimation error with an increasing number of data points.

Figure 1: Robustness to Bayesian inference-based model determination. The difference in sum log likelihood (SLL), Δᵢ = Lᵢ,ⱼ − Lᵢ,ᵢ, for each investigated case. Δ_Laplace is plotted in orange and Δ_Gaussian in green, with the mean indicated by a solid line and the standard deviation by the colored region. Diamond markers indicate a positive mean value and squares a negative mean value; these correspond to the wrong and right policy having greater likelihood, respectively.

Figure 2: Robustness to Bayesian inference-based parameter determination. The deviation in identifying the correct value of ε under each case. Red, orange, green, and blue indicate L(mᵢ|D(mⱼ)) for L(Lap|D(Lap)), L(Gaus|D(Lap)), L(Lap|D(Gaus)), and L(Gaus|D(Gaus)), respectively. A greater ε increases the difficulty in identifying ε. The most difficulty is shown with L(Lap|D(Lap)), while the greatest ease is shown with L(Gaus|D(Gaus)). There also appears to be a subtle reduction in estimation error with an increasing number of data points. Error bars are slightly shifted for visibility.
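To tie the pieces together, the following illustrative sketch assembles the quantity Δᵢ behind Figure 1 for a single stream. It reuses the helper functions from the earlier sketches (sample_streams, squared_exponential, protect_stream, metropolis_scale, log_lik); the offset, sample counts, burn-in, and posterior-averaging choices are assumptions, not the authors' experimental settings.

```python
import numpy as np

def delta_for_policy(true_policy, eps=0.5, n_exposed=20, seed=0):
    """Delta_i for one stream: SLL of the wrong policy minus SLL of the true
    policy, each averaged over its posterior scale samples. Reuses the
    illustrative helpers defined in the earlier sketches."""
    wrong_policy = "gaussian" if true_policy == "laplace" else "laplace"
    t = np.arange(100.0)
    # Offset the zero-mean GP sample so MaxAE = 0.1 * |y| stays away from zero.
    stream = sample_streams(squared_exponential, t, n_samples=1, seed=seed)[0] + 5.0
    protected = protect_stream(stream, eps=eps, policy=true_policy, seed=seed)
    exposed = slice(0, n_exposed)                    # raw points seen by the adversary
    residuals = protected[exposed] - stream[exposed]
    sll = {}
    for policy in (true_policy, wrong_policy):
        scales = metropolis_scale(residuals, policy)[1000:]  # drop burn-in
        sll[policy] = np.mean([log_lik(residuals, s, policy) for s in scales])
    return sll[wrong_policy] - sll[true_policy]      # positive: true policy obfuscated

print("Delta_Laplace :", delta_for_policy("laplace"))
print("Delta_Gaussian:", delta_for_policy("gaussian"))
```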
4. Conclusion

As more data is shared online, privacy preservation is becoming critical to ensuring user confidence in sharing and analyzing personal data with online services. In this paper we investigated the robustness of privacy policies when a subset of the pre-privacy protected data is exposed. We demonstrated a methodology for selecting the privacy policy that is more difficult to identify through Bayesian inference-based model and parameter determination. For the range of data stream trends investigated, the Laplacian noise privacy policy was more difficult to identify than the Gaussian policy, for both Bayesian inference-based model and parameter determination. We hope our results and discussion will be helpful to the community using privacy protection for their data sets.

Acknowledgment

This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

References

[1] "Number of IoT connected devices worldwide", https://www.statista.com/statistics/1183457/iot-connected-devices-worldwide/, Accessed: August 9, 2022.
[2] "The Growth in Connected IoT Devices", shorturl.at/lmnv1, Accessed: August 9, 2022.
[3] "2020 Unit 42 IoT Threat Report", https://start.paloaltonetworks.com/unit-42-iot-threat-report, Accessed: August 17, 2022.
[4] "4 Ways Cyber Attackers May Be Hacking Your IoT Devices Right Now", shorturl.at/bdjm1, Accessed: August 17, 2022.
[5] M. Khan, S. Foley, B. O'Sullivan, From k-anonymity to differential privacy: A brief introduction to formal privacy models (2021).
[6] S. H. Begum, F. Nausheen, A comparative analysis of differential privacy vs other privacy mechanisms for big data, in: 2018 2nd International Conference on Inventive Systems and Control (ICISC), 2018, pp. 512-516.
[7] J. Vasa, A. Thakkar, Deep learning: Differential privacy preservation in the era of big data, Journal of Computer Information Systems (2022) 1-24.
[8] A. Kumar, R. Kumar, Privacy preservation of electronic health record: Current status and future direction, Handbook of Computer Networks and Cyber Security (2020) 715-739.
[9] M. Peralta-Peterson, O. Kotevska, Effectiveness of privacy techniques in smart metering systems, in: 2021 International Conference on Computational Science and Computational Intelligence (CSCI), IEEE, 2021, pp. 675-678.
[10] Y. Zhao, J. Chen, A survey on differential privacy for unstructured data content, ACM Computing Surveys (CSUR) (2022).
[11] M. Al-Rubaie, J. M. Chang, Privacy-preserving machine learning: Threats and solutions, IEEE Security & Privacy 17 (2019) 49-58.
[12] M. Rigaki, S. Garcia, A survey of privacy attacks in machine learning, arXiv preprint arXiv:2007.07646 (2020).
[13] O. Kotevska, J. Johnson, A. G. Kusne, Analyzing data privacy for edge systems, in: 2022 IEEE International Conference on Smart Computing (SMARTCOMP), IEEE, 2022, pp. 223-228.
[14] G. E. Box, G. C. Tiao, Bayesian inference in statistical analysis, John Wiley & Sons, 2011.
[15] H. Pishro-Nik, Introduction to probability, statistics, and random processes (2016).
[16] C. E. Rasmussen, Gaussian processes in machine learning, in: Summer school on machine learning, Springer, 2003, pp. 63-71.
[17] M. U. Hassan, M. H. Rehmani, R. Kotagiri, J. Zhang, J. Chen, Differential privacy for renewable energy resources based smart metering, Journal of Parallel and Distributed Computing 131 (2019) 69-80.