<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Privacy policy robustness to reverse engineering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A. Gilad Kusne</string-name>
          <email>aaron.kusne@nist.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivera Kotevska</string-name>
          <email>kotevskao@ornl.gov</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Standards and Technology</institution>
          ,
          <addr-line>100 Bureau Drive, Gaithersburg, MD, 20899</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Oak Ridge National Laboratory</institution>
          ,
          <addr-line>1 Bethel Valley Road, Oak Ridge, TN, 37830</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Differential privacy policies allow one to preserve data privacy while sharing and analyzing data. However, these policies are susceptible to an array of attacks. In particular, a portion of the data that is meant to be privacy protected is often exposed online. Access to these pre-privacy protected data samples can then be used to reverse engineer the privacy policy. With knowledge of the generating privacy policy, an attacker can use machine learning to approximate the full set of originating data. Bayesian inference is one method for reverse engineering both model and model parameters. We present a methodology for evaluating and ranking privacy policy robustness to Bayesian inference-based reverse engineering, and demonstrate this method across data with a variety of temporal trends.</p>
      </abstract>
      <kwd-group>
        <kwd>Differential privacy</kwd>
        <kwd>Bayesian inference</kwd>
        <kwd>Privacy policy</kwd>
        <kwd>Privacy defenses</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, the number of devices connected to the Internet and online services has increased drastically [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], leading to an exponential growth in data generation [2]. This trend is visible across different domains and applications including, among many others, streaming medical, personal tracking, and energy use data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Typically, sensing systems are digitized and connected to network-based analysis tools, and the success of these data streaming devices results in increasing adoption and deployment. While this increases convenience across many aspects of life, it also creates dangers when sharing sensitive information. This is especially true for sharing unprotected data over the Internet. Around 98% of device traffic is unencrypted and transmitted over the Internet [3]. Cybercriminals have taken notice of this behavior. On average, sensor-based devices are probed for security vulnerabilities around 800 times per hour, with 400 login attempts and 130 successful logins on each device [
        <xref ref-type="bibr" rid="ref2">4</xref>
        ].
      </p>
      <p>
        To address these rapidly growing security risks, significant effort has been devoted to the development of privacy preservation algorithms and their integration into existing platforms. Some of the most used algorithms are randomization, k-anonymity, l-diversity, cryptography, and differential privacy (DP) [
        <xref ref-type="bibr" rid="ref3">5</xref>
        ]. These methods have been successfully demonstrated on big data [
        <xref ref-type="bibr" rid="ref4">6</xref>
        ], deep learning [
        <xref ref-type="bibr" rid="ref5">7</xref>
        ], medical records [
        <xref ref-type="bibr" rid="ref6">8</xref>
        ], as well as other domains.
      </p>
      <p>We investigate the use of Bayesian inference-based attacks to identify the privacy policy employed when data are protected with either Gaussian or Laplacian additive noise. In this preliminary work we evaluate the likelihood of an adversary identifying the privacy policy from exposed data with a variety of temporal trends. We then use analysis results to rank the privacy policies by their robustness to this attack.</p>
      <p>CIKM’22: Privacy Algorithms in Systems (PAS), October 21, 2022, Atlanta, GA</p>
    </sec>
    <sec id="sec-2">
      <title>2. Model</title>
      <sec id="sec-2-1">
        <title>2.1. Bayesian Inference</title>
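        <p>
          Bayesian inference [
          <xref ref-type="bibr" rid="ref12">14</xref>
          ] is a statistical sampling method for determining the probability of a hypothesis given data, through the use of Bayes’ theorem [
          <xref ref-type="bibr" rid="ref13">15</xref>
          ]. A common application for Bayesian inference is to identify the most likely parameter values <inline-formula><tex-math>\theta</tex-math></inline-formula> of a generating model <inline-formula><tex-math>M(\theta)</tex-math></inline-formula> for observed data <inline-formula><tex-math>D</tex-math></inline-formula>. Toward this goal, for a given model, a prior over the parameters is needed. The prior belief for the parameter values <inline-formula><tex-math>\theta</tex-math></inline-formula> is given by the probability density function <inline-formula><tex-math>p(\theta)</tex-math></inline-formula> (or the probability mass function <inline-formula><tex-math>P(\theta)</tex-math></inline-formula> if the parameters <inline-formula><tex-math>\theta</tex-math></inline-formula> take on discrete values). The probability of observing data <inline-formula><tex-math>D</tex-math></inline-formula> given particular values for the parameters is given by <inline-formula><tex-math>p(D \mid M(\theta))</tex-math></inline-formula>, also known as the likelihood. Through the use of Bayes’ theorem, the prior and likelihood are combined to determine the probability of different values of <inline-formula><tex-math>\theta</tex-math></inline-formula> given the observed <inline-formula><tex-math>D</tex-math></inline-formula>. This probability is known as the posterior and is represented by <inline-formula><tex-math>p(M(\theta) \mid D)</tex-math></inline-formula>. Bayesian inference employs statistical sampling of the model parameters’ prior and forward computation of the likelihood to evaluate the posterior. For this work, Markov Chain Monte Carlo (MCMC) is the Bayesian inference sampling method used.
        </p>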
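        <p>As an illustration of MCMC-based Bayesian inference, the following minimal sketch applies a random-walk Metropolis sampler to a toy problem: inferring the mean of normally distributed observations under a uniform prior, using the fact that the posterior is proportional to the likelihood times the prior. The toy model, the function names, the step size, and the prior bounds are all assumptions made for this example; they are not the sampler configuration used in this work.</p>
        <preformat preformat-type="code">
```python
import math
import random


def log_likelihood(theta, data, sd=1.0):
    """log p(D | M(theta)) for independent normal observations with mean theta."""
    norm = math.log(sd * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((d - theta) / sd) ** 2 - norm for d in data)


def log_prior(theta, low=-10.0, high=10.0):
    """Log of a uniform prior p(theta) on [low, high] (assumed bounds)."""
    inside = theta >= low and high >= theta
    return -math.log(high - low) if inside else -math.inf


def metropolis(data, n_steps=6000, burn_in=1000, step=0.5, seed=0):
    """Random-walk Metropolis sampler for the posterior p(M(theta) | D)."""
    rng = random.Random(seed)
    theta = 0.0
    log_post = log_prior(theta) + log_likelihood(theta, data)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        log_post_new = log_prior(proposal) + log_likelihood(proposal, data)
        # Accept with probability min(1, posterior ratio); 1 - random() lies in (0, 1].
        if log_post_new - log_post >= math.log(1.0 - rng.random()):
            theta, log_post = proposal, log_post_new
        samples.append(theta)
    return samples[burn_in:]  # discard the burn-in portion of the chain


data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(50)]  # synthetic observations D
posterior = metropolis(data)
posterior_mean = sum(posterior) / len(posterior)
```
        </preformat>
        <p>With a flat prior, the posterior mean estimated from the chain should lie close to the sample mean of the data, which gives a quick sanity check on the sampler.</p>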
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Gaussian Process</title>
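        <p>
          A Gaussian process [
          <xref ref-type="bibr" rid="ref14">16</xref>
          ] (GP) is a common Bayesian nonparametric regression tool. To learn the function <inline-formula><tex-math>f : x \to y</tex-math></inline-formula> for the data <inline-formula><tex-math>D = \{(x_i, y_i)\}_{i=1}^{N}</tex-math></inline-formula>, a prior <inline-formula><tex-math>f(x) \sim \mathcal{GP}(m(x), K(x, x'))</tex-math></inline-formula> is assumed to quantify epistemic uncertainty. Here <inline-formula><tex-math>m(x)</tex-math></inline-formula> is a mean function and <inline-formula><tex-math>K(x, x')</tex-math></inline-formula> is a covariance function. Expected noise in the data (aleatoric uncertainty) is quantified by selecting a likelihood, a common one being <inline-formula><tex-math>p(y \mid f) = \mathcal{N}(f, \sigma^{2})</tex-math></inline-formula>, which assumes homoskedastic, normally distributed noise with standard deviation <inline-formula><tex-math>\sigma</tex-math></inline-formula>. The prior and likelihood are then combined to determine the posterior. When the prior and likelihood are both multivariate normal distributions, the posterior is analytically solvable, giving <inline-formula><tex-math>p(f \mid D) = \mathcal{GP}(m_{\mathrm{post}}(x), K_{\mathrm{post}}(x, x'))</tex-math></inline-formula>, where:
          <disp-formula><tex-math>m_{\mathrm{post}}(x) = m(x) + K_{*} (K + \sigma^{2} I)^{-1} (\mathbf{y} - m(X))</tex-math></disp-formula>
          <disp-formula><tex-math>K_{\mathrm{post}}(x, x') = K(x, x') - K_{*} (K + \sigma^{2} I)^{-1} K_{*}'</tex-math></disp-formula>
        </p>
        <p>Here <inline-formula><tex-math>\mathbf{y}</tex-math></inline-formula> is the vector of <inline-formula><tex-math>\{y_i\}</tex-math></inline-formula>, <inline-formula><tex-math>X</tex-math></inline-formula> is the set of inputs <inline-formula><tex-math>\{x_i\}</tex-math></inline-formula>, <inline-formula><tex-math>K = K(X, X)</tex-math></inline-formula>, <inline-formula><tex-math>K_{*} = K(x, X)</tex-math></inline-formula>, and <inline-formula><tex-math>K_{*}' = K(X, x')</tex-math></inline-formula>.</p>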
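        <p>The analytic GP posterior described above can be sketched in a few lines. The example below is a minimal illustration using only the Python standard library; it assumes a constant prior mean, a squared exponential kernel with unit length scale, and hypothetical helper names (cholesky, chol_solve, gp_posterior). It is not the implementation used for the experiments.</p>
        <preformat preformat-type="code">
```python
import math


def sq_exp(a, b, length=1.0, variance=1.0):
    """Squared exponential covariance K(a, b)."""
    return variance * math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(mat):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(mat)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(mat[i][i] - s)
            else:
                low[i][j] = (mat[i][j] - s) / low[j][j]
    return low


def chol_solve(low, rhs):
    """Solve (low low^T) x = rhs by forward then backward substitution."""
    n = len(rhs)
    z = [0.0] * n
    for i in range(n):
        z[i] = (rhs[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def gp_posterior(x_train, y_train, x_new, noise_sd=0.1, prior_mean=0.0):
    """Posterior mean and variance of f(x_new) under a constant-mean GP prior."""
    gram = [[sq_exp(a, b) + (noise_sd ** 2 if i == j else 0.0)
             for j, b in enumerate(x_train)]
            for i, a in enumerate(x_train)]  # K + sigma^2 I
    low = cholesky(gram)
    k_star = [sq_exp(x_new, b) for b in x_train]  # K(x, X)
    alpha = chol_solve(low, [y - prior_mean for y in y_train])
    mean = prior_mean + sum(k * a for k, a in zip(k_star, alpha))
    weights = chol_solve(low, k_star)
    var = sq_exp(x_new, x_new) - sum(k * w for k, w in zip(k_star, weights))
    return mean, var


x_train = [0.0, 1.0, 2.0]
y_train = [0.0, 0.8, 0.9]
mean_near, var_near = gp_posterior(x_train, y_train, 1.0, noise_sd=0.01)
mean_far, var_far = gp_posterior(x_train, y_train, 10.0, noise_sd=0.01)
```
        </preformat>
        <p>Near an observed input the posterior mean follows the data and the variance collapses toward the noise level, while far from the data the posterior reverts to the prior mean and variance.</p>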
        <p>For this work, the squared exponential, Matérn 5/2, exponential, and Brownian kernels are used to define data stream temporal trends.</p>
        <p><bold>2.3. Our approach</bold></p>
        <p>We present a methodology for ranking privacy policy robustness to reverse engineering in the presence of exposed or compromised data. We use probabilistic methods to determine the accuracy with which an adversarial actor can identify the privacy policy and its employed parameters from a compromised data stream. Bayesian inference is used to quantify the likelihood (i.e., probability) of each privacy policy being the generating policy and a posterior probability density function for each policy’s parameter. Here the target parameter is the privacy loss measure value <inline-formula><tex-math>\varepsilon</tex-math></inline-formula>. The privacy policies are then ranked for robustness to this type of attack for the given data.</p>
        <p>Ten data stream samples are drawn from each of four Gaussian processes, which differ in kernel. The kernels used are the squared exponential, Matérn 5/2, exponential, and Brownian. These data streams are then each privacy protected under two privacy policies: Gaussian additive noise or Laplacian additive noise. For each policy, we explore the use of varying DP privacy loss measure values of <inline-formula><tex-math>\varepsilon \in \{0.1, 0.5, 1.0\}</tex-math></inline-formula>. We assume that <inline-formula><tex-math>n</tex-math></inline-formula> pre-privacy protected data samples from each data stream are exposed. We apply Bayesian inference to each of these situations and quantify the sum log likelihood (SLL) of the generating privacy noise policy being either Gaussian or Laplacian. When Bayesian inference identifies that the true generating policy is less likely than the alternative, that policy is ranked greater in robustness to this form of privacy policy reverse engineering. Here, to represent the adversarial attacker’s limited knowledge of the privacy policy parameters, the Bayesian inference uses uniform priors over <inline-formula><tex-math>[10^{-5}, 10^{-1}]</tex-math></inline-formula> and <inline-formula><tex-math>[10^{-3}, 10]</tex-math></inline-formula> for <inline-formula><tex-math>\delta</tex-math></inline-formula> and <inline-formula><tex-math>\varepsilon</tex-math></inline-formula>, respectively.</p>
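        <p>The data stream generation step can be sketched as follows: zero-mean GP samples are drawn for each of the four kernels by multiplying a Cholesky factor of the covariance matrix with standard normal draws. This is a minimal standard-library sketch; the time grid, unit length scales, the jitter term, and all function names are assumptions chosen for illustration rather than the settings used in this work.</p>
        <preformat preformat-type="code">
```python
import math
import random


def squared_exponential(s, t, length=1.0):
    return math.exp(-0.5 * ((s - t) / length) ** 2)


def matern52(s, t, length=1.0):
    r = math.sqrt(5.0) * abs(s - t) / length
    return (1.0 + r + r * r / 3.0) * math.exp(-r)


def exponential(s, t, length=1.0):
    return math.exp(-abs(s - t) / length)


def brownian(s, t):
    return min(s, t)  # covariance of standard Brownian motion for s, t >= 0


KERNELS = {
    "squared_exponential": squared_exponential,
    "matern52": matern52,
    "exponential": exponential,
    "brownian": brownian,
}


def cholesky(mat):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(mat)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(mat[i][i] - s)
            else:
                low[i][j] = (mat[i][j] - s) / low[j][j]
    return low


def sample_stream(kernel, times, rng, jitter=1e-6):
    """Draw one zero-mean GP sample at the given times."""
    # The small diagonal jitter keeps the covariance numerically positive definite.
    cov = [[kernel(s, t) + (jitter if i == j else 0.0)
            for j, t in enumerate(times)]
           for i, s in enumerate(times)]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in times]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(times))]


rng = random.Random(0)
times = [0.25 * (k + 1) for k in range(20)]  # assumed time grid
streams = {name: [sample_stream(kernel, times, rng) for _ in range(10)]
           for name, kernel in KERNELS.items()}  # ten streams per kernel
```
        </preformat>
        <p>Each entry of streams holds ten sampled data streams for one kernel, which can then be privacy protected under either noise policy.</p>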
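        <p>For each data set <inline-formula><tex-math>D_i</tex-math></inline-formula> privacy protected using policy <inline-formula><tex-math>i \in \{G, L\}</tex-math></inline-formula>, where <inline-formula><tex-math>G</tex-math></inline-formula> and <inline-formula><tex-math>L</tex-math></inline-formula> denote the Gaussian and Laplacian additive noise policies, we compute the SLL <inline-formula><tex-math>S_{i,j}</tex-math></inline-formula> of the data under each candidate policy <inline-formula><tex-math>j \in \{G, L\}</tex-math></inline-formula>, giving four SLLs in total. The most likely policy is then selected.</p>
        <p>A measure of whether the used policy is well obfuscated is given by <inline-formula><tex-math>\Delta_i = S_{i,j} - S_{i,i}</tex-math></inline-formula> with <inline-formula><tex-math>j \neq i</tex-math></inline-formula>, which is positive (negative) if the wrong (right) policy is estimated to be more likely given the data. A larger positive value indicates a more difficult challenge for Bayesian inference-based reverse engineering and a more negative value indicates an easier challenge. We investigate <inline-formula><tex-math>\Delta</tex-math></inline-formula> as a function of data stream generating kernel, <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> value, and size of exposed data stream sample.</p>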
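        <p>The SLL comparison can be illustrated with a simplified sketch. Instead of the MCMC-based inference over policy parameters used in this work, the example below plugs in the closed-form maximum-likelihood scale for each candidate noise distribution and assumes a constant noise scale across the exposed samples. The residuals are the differences between privacy protected and raw values; the function names are hypothetical.</p>
        <preformat preformat-type="code">
```python
import math
import random


def sll_gaussian(residuals):
    """Sum log likelihood under zero-mean Gaussian noise at its ML standard deviation."""
    sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    norm = math.log(sd * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * (r / sd) ** 2 - norm for r in residuals)


def sll_laplacian(residuals):
    """Sum log likelihood under zero-mean Laplacian noise at its ML scale."""
    scale = sum(abs(r) for r in residuals) / len(residuals)
    return sum(-abs(r) / scale - math.log(2.0 * scale) for r in residuals)


def sll_difference(residuals, true_policy):
    """SLL of the wrong policy minus SLL of the true policy ("G" or "L")."""
    sll = {"G": sll_gaussian(residuals), "L": sll_laplacian(residuals)}
    wrong_policy = "L" if true_policy == "G" else "G"
    return sll[wrong_policy] - sll[true_policy]


def laplace_sample(rng, scale):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


rng = random.Random(0)
gauss_residuals = [rng.gauss(0.0, 1.0) for _ in range(2000)]
laplace_residuals = [laplace_sample(rng, 1.0) for _ in range(2000)]
delta_gauss = sll_difference(gauss_residuals, "G")
delta_laplace = sll_difference(laplace_residuals, "L")
```
        </preformat>
        <p>With many exposed samples both differences are negative, meaning the true policy is identified; with few exposed samples the sign can flip, which is the regime in which a policy is considered well obfuscated.</p>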
        <sec id="sec-2-2-1">
          <title>2.3.1. Assumption</title>
          <p>We assume that <inline-formula><tex-math>n</tex-math></inline-formula> raw data points are available. We investigate the robustness of DP Gaussian and Laplacian additive noise to the exposure of varying numbers of data points as well as different values of the DP privacy loss measure <inline-formula><tex-math>\varepsilon</tex-math></inline-formula>.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.3.2. Investigating Privacy Risks</title>
          <p>We performed Bayesian inference experiments to determine the privacy policy and its parameters used for privacy protection. Here the target data stream is privacy protected using the equation <inline-formula><tex-math>\tilde{y}_i = y_i + \eta_i</tex-math></inline-formula>, with data index <inline-formula><tex-math>i</tex-math></inline-formula> and noise <inline-formula><tex-math>\eta_i</tex-math></inline-formula> given by either the Gaussian or Laplacian distribution with mean of zero and standard deviation <inline-formula><tex-math>\sigma</tex-math></inline-formula> (Gaussian) or scale <inline-formula><tex-math>b</tex-math></inline-formula> (Laplacian) given by the following equations:
            <disp-formula><label>(1)</label><tex-math>\sigma = \sqrt{2 \ln(1.25/\delta)} \, \frac{\mathrm{MaxAE}}{\varepsilon}</tex-math></disp-formula>
            <disp-formula><label>(2)</label><tex-math>b = \frac{\mathrm{MaxAE}}{\varepsilon}</tex-math></disp-formula>
            where MaxAE is the maximum allowed error.
          </p>
          <p>
            The adversary is able to obtain data samples prior to privacy protection along with the same data after privacy protection. The pre-privacy protected data can be obtained in a few ways, including public exposure by the data stream sensor, by the user (e.g., sharing data on social media), or by the adversary temporarily placing a similar sensing device close to the first, e.g., using a microphone or software hack to listen in to part of a conversation held over a cellphone or IoT device. The resulting pre-privacy protected data can then be used to reverse engineer the privacy policy. Knowledge of the privacy policy can then be used to extract raw data from privacy protected data collected before or after the data exposure occurs, as in [
            <xref ref-type="bibr" rid="ref11">13</xref>
            ].
          </p>
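          <p>The privacy protection step can be sketched as below. The sketch assumes the standard Gaussian-mechanism standard deviation, sqrt(2 ln(1.25/delta)) MaxAE / epsilon, and the standard Laplace-mechanism scale, MaxAE / epsilon, with MaxAE taken as one tenth of the magnitude of the current data value. These forms, the default delta, and the function names are assumptions for illustration and may differ from the exact expressions used in this work.</p>
          <preformat preformat-type="code">
```python
import math
import random


def gaussian_sd(max_ae, epsilon, delta):
    """Assumed Gaussian-mechanism standard deviation for sensitivity max_ae."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * max_ae / epsilon


def laplace_scale(max_ae, epsilon):
    """Assumed Laplace-mechanism scale for sensitivity max_ae."""
    return max_ae / epsilon


def protect(stream, policy, epsilon, delta=1e-3, seed=0):
    """Add zero-mean Gaussian or Laplacian noise to each value of a data stream."""
    rng = random.Random(seed)
    protected = []
    for value in stream:
        max_ae = 0.1 * abs(value)  # one tenth of the current data stream value
        if policy == "gaussian":
            noise = rng.gauss(0.0, gaussian_sd(max_ae, epsilon, delta))
        elif policy == "laplacian":
            scale = laplace_scale(max_ae, epsilon)
            if scale == 0.0:
                noise = 0.0  # a zero value receives no noise under this MaxAE choice
            else:
                # Difference of two exponentials with mean `scale` is Laplace(0, scale).
                noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        else:
            raise ValueError("policy must be 'gaussian' or 'laplacian'")
        protected.append(value + noise)
    return protected


raw = [1.0, 2.0, 4.0, 0.0, 3.0]
protected_gaussian = protect(raw, "gaussian", epsilon=0.5)
protected_laplacian = protect(raw, "laplacian", epsilon=0.5)
```
          </preformat>
          <p>Under these assumed forms, a smaller privacy loss value epsilon gives a larger noise scale, and the noise magnitude follows the magnitude of the raw data through MaxAE.</p>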
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
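      <fig>
        <caption>
          <p>Bayesian inference-based model determination for the Laplacian and Gaussian additive noise privacy policies. The difference in SLL between the wrong policy <inline-formula><tex-math>j</tex-math></inline-formula> and the true policy <inline-formula><tex-math>i</tex-math></inline-formula>, <inline-formula><tex-math>\Delta_i = S_{i,j} - S_{i,i}</tex-math></inline-formula>, is shown for each investigated case, with the values for one policy plotted in orange and those for the other in green. The means of each are indicated by a solid line and the standard deviation is indicated by the colored regions. Diamond markers indicate a positive mean value and squares indicate a negative mean value; these correspond to the wrong and right policy having greater likelihood, respectively.</p>
        </caption>
      </fig>
      <p>Interestingly, <inline-formula><tex-math>\Delta_L</tex-math></inline-formula> (Laplacian-protected data) tends to be larger than <inline-formula><tex-math>\Delta_G</tex-math></inline-formula> (Gaussian-protected data), indicating a greater ease in identifying the Gaussian policy over the Laplacian policy.</p>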
      <p>Additionally, both means tend toward lower values with an increasing number of exposed data samples. In other words, with access to larger amounts of pre-privacy protected data there is an increasing probability of identifying the correct policy, as would be expected. A relationship between the choice of kernel or <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> and the resulting <inline-formula><tex-math>\Delta</tex-math></inline-formula> is not clear. Further investigation should be performed where the variance of additive noise is a larger percentage of the data stream variance.</p>
      <p>
        The maximum allowed error (MaxAE) that sets the additive noise scale is a function of the raw data. Here the MaxAE is set to one tenth the value of the current data stream value [
        <xref ref-type="bibr" rid="ref15">17</xref>
        ].
      </p>
      <fig>
        <caption>
          <p>The error in identifying the correct value of <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> is plotted for each case, with red, orange, green, and blue indicating the four combinations of generating and candidate policy. A greater <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> increases the difficulty in identifying its value. Robustness to parameter determination differs across these cases.</p>
        </caption>
      </fig>
      <p>Application of the Laplacian additive noise policy tends to provide greater robustness over the Gaussian policy. As expected, there also appears to be a subtle reduction in parameter estimation error with an increasing number of data points.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>As more data is shared online, the need for privacy preservation is becoming critical to ensure user confidence in sharing and analyzing personal data with online services. In this paper we investigated the robustness of privacy policies when a subset of pre-privacy protected data is exposed. We demonstrate a methodology for selecting the privacy policy that is more difficult to identify through Bayesian inference-based model and parameter determination. For the range of data stream trends investigated, the Laplacian noise privacy policy was more difficult to identify compared to the Gaussian policy, for both Bayesian inference-based model and parameter determination.</p>
      <p>We hope our results and discussion will be helpful to [...] sets.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>“Number of IoT connected devices worldwide”</article-title>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [4]
          <article-title>“4 Ways Cyber Attackers May Be Hacking Your IoT Devices Right Now”</article-title>
          , shorturl.at/bdjm1, Accessed: August 17,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Foley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>O'Sullivan</surname>
          </string-name>
          ,
          <article-title>From k-anonymity to differential privacy: A brief introduction to formal privacy models</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Begum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nausheen</surname>
          </string-name>
          ,
          <article-title>A comparative analysis of differential privacy vs other privacy mechanisms for big data</article-title>
          ,
          <source>in: 2018 2nd International Conference on Inventive Systems and Control (ICISC)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>512</fpage>
          -
          <lpage>516</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vasa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thakkar</surname>
          </string-name>
          ,
          <article-title>Deep learning: Differential privacy preservation in the era of big data</article-title>
          ,
          <source>Journal of Computer Information Systems</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Privacy preservation of electronic health record: Current status and future direction</article-title>
          ,
          <source>Handbook of Computer Networks and Cyber Security</source>
          (
          <year>2020</year>
          )
          <fpage>715</fpage>
          -
          <lpage>739</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Peralta-Peterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kotevska</surname>
          </string-name>
          ,
          <article-title>Effectiveness of privacy techniques in smart metering systems</article-title>
          , in: 2021
          <source>International Conference on Computational Science and Computational Intelligence (CSCI)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>675</fpage>
          -
          <lpage>678</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A survey on differential privacy for unstructured data content</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Rubaie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Privacy-preserving machine learning: Threats and solutions</article-title>
          ,
          <source>IEEE Security &amp; Privacy</source>
          <volume>17</volume>
          (
          <year>2019</year>
          )
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rigaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <article-title>A survey of privacy attacks in machine learning</article-title>
          ,
          <source>arXiv preprint arXiv:2007.07646</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kotevska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Kusne</surname>
          </string-name>
          ,
          <article-title>Analyzing data privacy for edge systems</article-title>
          ,
          <source>in: 2022 IEEE International Conference on Smart Computing (SMARTCOMP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Box</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Tiao</surname>
          </string-name>
          ,
          <article-title>Bayesian inference in statistical analysis</article-title>
          , John Wiley &amp; Sons,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Pishro-Nik</surname>
          </string-name>
          , Introduction to probability, statistics, and random processes (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          ,
          <article-title>Gaussian processes in machine learning</article-title>
          ,
          <source>in: Summer school on machine learning</source>
          , Springer,
          <year>2003</year>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. U.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Rehmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kotagiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Differential privacy for renewable energy resources based smart metering</article-title>
          ,
          <source>Journal of Parallel and Distributed Computing</source>
          <volume>131</volume>
          (
          <year>2019</year>
          )
          <fpage>69</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>