<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the interplay of Explainability, Privacy and Predictive Performance with Explanation-assisted Model Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fatima Ezzeddine</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rinad Akel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ihab Sbeity</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Giordano</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc Langheinrich</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Omran Ayoub</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lebanese University</institution>
          ,
          <addr-line>Beirut</addr-line>
          ,
          <country country="LB">Lebanon</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università della Svizzera italiana</institution>
          ,
          <addr-line>Lugano</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Applied Sciences and Arts of Southern Switzerland</institution>
          ,
          <addr-line>Lugano</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Machine Learning as a Service (MLaaS) has gained considerable traction as a means for deploying powerful predictive models, offering ease of use that enables organizations to leverage advanced analytics without substantial investments in specialized infrastructure or expertise. However, MLaaS platforms must be safeguarded against security and privacy attacks, such as model extraction (MEA) attacks. The increasing integration of explainable AI (XAI) within MLaaS has introduced an additional privacy challenge, as attackers can exploit model explanations, particularly counterfactual explanations (CFs), to facilitate MEA. In this paper, we investigate the trade-offs among model performance, privacy, and explainability when employing Differential Privacy (DP), a promising technique for mitigating CF-facilitated MEA. We evaluate two distinct DP strategies: DP applied during the training of the classification model and DP applied at the explainer during CF generation.</p>
      </abstract>
      <kwd-group>
        <kwd>Counterfactual Explanations</kwd>
        <kwd>Model Extraction Attack</kwd>
        <kwd>Differential Privacy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Machine Learning (ML) as a Service (MLaaS) is becoming increasingly popular for deploying powerful
predictive models as it facilitates access to ML training and deployment tools, while eliminating the
need for extensive computational resources [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The adoption of MLaaS, however, introduces important
security and privacy risks. For instance, adversaries can query the deployed ML models through
application programming interfaces (APIs) to perform various types of attacks, such as membership
inference (MIA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and model extraction (MEA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These attacks, if successful, pose serious threats to
data privacy and intellectual property. For instance, MIA can reveal whether specific data points were
used in training, while MEA enables adversaries to replicate proprietary models, leading to financial
losses and competitive disadvantages, and facilitates further data privacy attacks by giving the adversary
access to a copy of the model. To defend against these attacks, data privacy-enhancing technologies such as
Differential Privacy (DP) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] exist. DP has shown effectiveness in defending against such attacks and is
therefore widely adopted in use cases that require data and model sharing and deployment [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. DP
enables privacy-preserving training of deep neural networks (DNNs) to effectively mitigate inference
attacks by adding a controlled amount of noise to either the raw data or the model weights; it ensures that
individual data points have minimal influence on the model’s response, which limits the amount of
sensitive information leaked when an attacker queries the model.
      </p>
      <p>
        Recently, with the increasing demand for transparency in automated decision-making, MLaaS
platforms are starting to incorporate Explainable Artificial Intelligence (XAI) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] techniques into their
workflows to provide explanations of the model’s decisions [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. These platforms now provide not
only the final decisions of ML models but also explanations of the underlying processes. The increased
transparency provided by XAI introduces new challenges for preserving privacy and safeguarding
MLaaS platforms from adversarial threats, as model explanations can inadvertently reveal information
about the model’s decision boundaries [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Specifically, counterfactual explanations (CFs) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which aim to
identify the smallest changes to input data that would alter an ML model’s prediction to a desired
outcome, can reveal the factors most influential in the model’s decision-making. Recent research has indeed
explored how explanations can be leveraged to enhance the effectiveness of such attacks [
        <xref ref-type="bibr" rid="ref10 ref11 ref6 ref7 ref8">8, 7, 10, 6, 11</xref>
        ].
Complementing this, DP can also be applied at the explanation level, where it masks explanations to
limit their utility to adversaries while balancing interpretability and privacy [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ]. As DP can impact predictive performance and explanation quality, and can be applied at both
levels, a growing body of research highlights the importance of DP in developing mitigation strategies
that specifically address the risks introduced by explanations, emphasizing the need to adapt, utilize,
or extend existing defense methods to counter the exploitation of explainability. In this work, we focus
on analyzing a mitigation framework that integrates DP at the model and at the explainer, and we
investigate the interplay between (i) the model’s accuracy, as DP is expected to influence the model’s
inference capability, (ii) privacy, as employing DP provides resilience against attacks, and (iii)
explainability, as noise added to the model or the explainer may impact the quality of explanations [
        <xref ref-type="bibr" rid="ref12 ref6">6, 12</xref>
        ]. We aim to quantify this interplay and extract insights on where to employ DP (at the model,
at the explainer, or at both) and on the noise level required to balance predictive performance,
explainability, and privacy.
      </p>
      <p>
        To perform the attack, we employ a recently proposed MEA technique based on Knowledge Distillation
(KD) due to its proven performance and practicality [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In terms of mitigation strategies, we employ
DP at the ML model using Differentially Private SGD (DP-SGD) and at the explainer using a DP-based
Generative Adversarial Network (GAN) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with varying noise levels. To this end, we investigate the
following research questions (RQs):
• RQ1: To what extent does applying DP at the model, at the explainer, or at both effectively mitigate
MEA facilitated by CFs?
• RQ2: How does the noise level in DP influence the effectiveness of MEAs that leverage CFs?
• RQ3: In what ways does the quality of CF explanations differ when DP is applied at the model
compared to the explainer?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Several studies have explored leveraging XAI techniques and exploiting model explanations to perform
privacy attacks. In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the authors explore the vulnerabilities of Local Interpretable Model-agnostic
Explanations and show that an adversary can generate new data samples near the decision boundary
and, consequently, perform MEA by crafting adaptive queries. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the authors show that, by leveraging
gradient-based explanations, adversaries can enhance the effectiveness of MIA. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the authors
propose a methodology that performs MEA by jointly minimizing classification and explanation loss,
thereby improving its fidelity. Other works explore the use of CFs to enhance the effectiveness of
MEA. For instance, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] introduces a methodology that relies on model predictions and CFs to train a
substitute model. Similarly, [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] presents a novel strategy where CF pairs, including the CF of the CF,
serve as training samples for MEA. More recently, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposes a methodology based on KD techniques
that exploits CFs to perform MEA effectively while minimizing the number of queries to an MLaaS
system, and also generates private CFs with DP. Moreover, [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] explores the theoretical foundations of MEA
with CFs, highlighting the risks associated with providing CF explanations.
      </p>
      <p>
        Several approaches have been proposed to prevent adversaries from exploiting model explanations
for privacy attacks. In [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], the authors propose an approach that builds on the concept of providing
CFs that are not derived from the entire feature space but instead are generated within a designated
space. Some works developed methodologies to generate explanations while limiting the exposure of
sensitive insights related to decision boundaries, training data, or model architectures. Authors in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
present an approach to generate differentially private CFs using functional mechanisms to protect the
underlying model from potential inference attacks. In contrast, [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] proposes a novel approach that
constructs private recourse paths as CFs using differentially private clustering. The authors in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] focus
on GAN-based CFs (proposed in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]), injecting DP into the training process of the generator responsible
for generating CFs, which limits the memorization of private data points.
      </p>
      <p>Similar to these works, we focus on identifying a mitigation strategy against attacks that exploit the
model’s explanations. Specifically, we explore the application of DP to the ML model, the explainer,
and both simultaneously. Despite the numerous studies utilizing DP for mitigation strategies, our work
is, to the best of our knowledge, the first to explore the application of DP in both the ML model and
the explainer, to investigate their effectiveness in countering MEA, and to examine their influence on
the quality of explanations. Additionally, our work explores the interplay between preserving model
privacy and generating privacy-preserving CFs, as well as the implications for defending against MEA.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Formulation and Methodology</title>
      <p>We consider a setting in which DP can be employed at the model or at the explainer to counter potential attacks.</p>
      <p>
        Given a dataset D = {(x_i, y_i)}_{i=1}^N, where x_i are feature vectors and y_i are the corresponding labels,
a target model f(x; θ), trained and optimized to achieve high performance on D, is deployed as MLaaS
and is queryable through an API (as shown in Fig. 1). An adversary attempts to extract an
approximation of f(x; θ) using queries and the provided CFs. The attacker conducts MEA by exploiting
CFs and varying the number of queries. To perform our analysis, we proceed as follows:
• Step 1: Train target models as baseline models f_baseline(x; θ_base).
• Step 2: Generate CFs x_CF = G(x; θ_G) by training a CounterGAN for f_baseline.
• Step 3: Simulate the MEA, where the adversary queries the models with random points and collects
pairs of predictions and CFs. The adversary trains an extracted model using the KD-based method
proposed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
• Step 4: Measure MEA success by computing the agreement on a separate dataset to quantify and
compare the level of agreement between extracted and original models/explanations.
• Step 5: Assess the quality of CFs using metrics such as prediction gain and realism (explained in
more detail in Sec. 4).
      </p>
      <p>The effectiveness of the MEA is measured using similarity metrics such as agreement. In practice,
this agreement expectation is estimated empirically using a set of n test inputs {x_1, x_2, . . . , x_n}:
Agreement = (1/n) Σ_{i=1}^{n} 1[ f(x_i) = f̂(x_i) ], where the indicator function 1[·] counts the number
of times the extracted model’s predictions match the target model’s predictions.</p>
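      <p>As a concrete illustration, the agreement can be computed directly from the predictions of the two models on
the test inputs. The following minimal Python sketch assumes Keras-style models that return class-probability
vectors; the function and variable names are illustrative and not taken from the original implementation.</p>
      <preformat>
import numpy as np

def agreement(target_model, extracted_model, X_test):
    """Fraction of test inputs on which the extracted model's predicted
    class matches the target model's predicted class."""
    y_target = np.argmax(target_model.predict(X_test), axis=1)
    y_extracted = np.argmax(extracted_model.predict(X_test), axis=1)
    return float(np.mean(y_target == y_extracted))
      </preformat>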
      <p>
        As a mitigation against MEA, we employ two strategies: 1) DP-Model (DP-SGD): we apply
DP-SGD during the training of f(x; θ); 2) DP-Explainer (DP in CounterGAN): we inject DP noise
at the generator G(x; θ_G) so that it outputs private CFs ([
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). We then perform MEA leveraging CFs under
different DP settings, i.e., the approach adopted and the noise level σ of the privacy mechanism, and evaluate
the adversary’s MEA success and CF quality. Specifically, in Step 1, f_baseline(x; θ_base) is first trained on D
without DP. We also train a DP-protected model f_DP(x; θ_DP) using DP-SGD with noise level σ.
Similarly, in Step 2, we also train a private CounterGAN G_DP(x; θ_G) to generate private CFs,
varying the noise level. In Step 3, the attacker applies MEA to extract f̂(x; θ̂) using the KD-based
method, using either the CFs generated by G(x; θ_G) or those generated by G_DP(x; θ_G). For the comparative
analysis, we consider four distinct scenarios: (1) No DP: a baseline scenario that does not incorporate DP at any
level, allowing the evaluation of the unprotected model’s performance and vulnerability. (2) DP-Model: only
the target model employs DP. This protects the model from adversarial replication while the explanation
generator remains unprotected. (3) DP-Explainer: DP is applied to the explanation generator. This
scenario assesses the impact of DP on the explanations’ utility without directly affecting the target
model. (4) DP-Model-Explainer: both the target model and the explanation generator are protected with
DP, aiming to balance model performance, explanation quality, and resistance to MEA.
      </p>
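      <p>One plausible way to organize the comparative analysis is as a grid of (model noise, explainer noise)
configurations. The sketch below is purely illustrative: it enumerates the scenario labels used in Section 5, with
None denoting that DP is not applied at that level, and assumes that the combined DP-Model-Explainer scenario
crosses the two sets of noise levels.</p>
      <preformat>
# Noise levels considered in our experiments; None means DP is not applied at that level.
NOISE_LEVELS = [0.1, 0.5, 0.9]

scenarios = {"No DP": (None, None)}
scenarios.update({f"DP-Model-{s}": (s, None) for s in NOISE_LEVELS})
scenarios.update({f"DP-Explainer-{s}": (None, s) for s in NOISE_LEVELS})
scenarios.update({f"DP-Model-Explainer-{sm}-{se}": (sm, se)
                  for sm in NOISE_LEVELS for se in NOISE_LEVELS})

for name, (model_noise, explainer_noise) in scenarios.items():
    print(name, model_noise, explainer_noise)
      </preformat>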
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Settings</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets, Target and Threat Model</title>
        <p>
          We perform an evaluation on 2 datasets: Housing [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and EEG Eye State[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The Housing dataset
describes housing prices and includes 20,640 instances and 8 features, a mix of socio-economic,
demographic, and geographic attributes. The target variable represents the median house value and
is converted into two classes using a threshold defined by the median. The EEG Eye State dataset
comprises EEG measurements recorded using a Neuroheadset and contains 14,980 data points and
14 features. The target variable is a binary label representing the eye-closed or eye-open state.
        </p>
        <p>The target model f is a DNN with 16 hidden layers of 64, 32, 16, 32, 64, 128, 64, 32, 128, 64, 128, 64, 128,
64, 32, and 16 neurons per layer, a GELU activation function, and a softmax activation function
in the output layer. We employ the Adam optimizer for the cases where DP is not used and TensorFlow
Privacy’s DPKerasAdamOptimizer for the cases where DP is applied. The model is trained without DP
and with noise levels of 0.1, 0.5, and 0.9 for the DP cases, with varying learning rates (0.001, 0.002, and
0.01), and with l2_norm_clip varied between 1 and 1.5 (l2_norm_clip bounds the sensitivity of the
gradients by limiting the influence of any single training example on the overall gradient update, which
is a crucial step before adding noise). Note that the more noise, the higher the privacy. The target
models are trained using 80% of the corresponding dataset, and the best-performing model in terms of
accuracy was chosen.</p>
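        <p>The following sketch illustrates how such a target model can be trained with DP-SGD via TensorFlow
Privacy, assuming the layer widths and optimizer named above; the batch size, number of epochs, and the specific
noise level shown are illustrative values rather than the exact configuration used in our experiments.</p>
        <preformat>
import tensorflow as tf
import tensorflow_privacy

# Hidden-layer widths of the target DNN as described above (GELU activations).
HIDDEN = [64, 32, 16, 32, 64, 128, 64, 32, 128, 64, 128, 64, 128, 64, 32, 16]

def build_target_model(n_features, n_classes=2):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(HIDDEN[0], activation="gelu",
                                    input_shape=(n_features,)))
    for units in HIDDEN[1:]:
        model.add(tf.keras.layers.Dense(units, activation="gelu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    return model

batch_size = 250  # illustrative; must be divisible by num_microbatches
model = build_target_model(n_features=8)  # e.g. the Housing dataset

# DP-SGD: per-example gradients are clipped to l2_norm_clip and Gaussian
# noise scaled by noise_multiplier is added before the weight update.
optimizer = tensorflow_privacy.DPKerasAdamOptimizer(
    l2_norm_clip=1.0,
    noise_multiplier=0.5,   # noise levels of 0.1, 0.5, 0.9 in our study
    num_microbatches=batch_size,
    learning_rate=0.001)

# Per-example losses (reduction=NONE) are required so gradients can be
# clipped per microbatch before noise is added.
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)

model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=30, batch_size=batch_size)
        </preformat>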
        <p>To simulate a realistic attack scenario, we assume that the attacker has no prior knowledge of the
training data distribution and does not know the architecture of the target model, but can build a simple
threat model ϒ. The model ϒ consists of 5 layers, with 32, 64, 128, and 64 neurons with ReLU activation,
followed by a softmax output layer. The attacker generates random data points to query the
model, within a range of -3 to 3 for each feature, and extracts CFs to feed as input to the KD-based MEA.
Our evaluation involves performing MEA while varying the number of queries from 50 to 1000, and
therefore the size of the input to KD. For optimization, we use Adam for the cases where DP is not used
and TensorFlow Privacy’s DPKerasAdamOptimizer for the cases where DP is used, assessing model
performance under three different noise levels, 0.1, 0.5, and 0.9. We tune the hyperparameters of the
KD-based approach, specifically alpha within the range of 0.1 to 0.5 and temperature within the range of 1
to 10. We compute the MEA agreement over the 20% test set and report the average results of 5 runs.</p>
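        <p>As a reference point, a standard Hinton-style knowledge-distillation objective with the alpha and
temperature hyperparameters mentioned above can be sketched as follows; this is a generic stand-in, and the
exact KD-based loss of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] may differ in its details.</p>
        <preformat>
import tensorflow as tf

def distillation_loss(teacher_probs, student_logits, y_true,
                      alpha=0.3, temperature=5.0):
    """Generic KD objective: weighted sum of hard-label cross-entropy and
    soft-label KL divergence at temperature T. A stand-in for the KD-based
    MEA objective; alpha in [0.1, 0.5] and T in [1, 10] are tuned."""
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True)
    # Soften both distributions with the temperature before comparing them.
    soft_teacher = tf.nn.softmax(tf.math.log(teacher_probs + 1e-9) / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    soft = tf.keras.losses.kullback_leibler_divergence(soft_teacher, soft_student)
    return alpha * hard + (1.0 - alpha) * (temperature ** 2) * soft
        </preformat>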
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Counterfactual generator</title>
        <p>The generator of CounterGAN takes an input feature vector and processes it through 4 layers:
three layers with 64, 32, and 64 neurons and ReLU activations, and a final layer with Tanh activation.
The discriminator follows a simple feedforward design, consisting of 128, 128, and 64 neurons with ReLU
activation and a final output layer with Sigmoid activation. In the No-DP scenario, we used the
standard Adam optimizer. For the scenarios where DP is employed, we applied DP to the generator using noise
levels of 0.1, 0.5, and 0.9, with TensorFlow Privacy’s DPKerasAdamOptimizer. We
varied the learning rate over 0.05, 0.005, 0.01, and 0.001, and l2_norm_clip over 1, 1.5,
and 3. We report the average results of 5 runs. We consider the following metrics to assess the
influence of employing privacy on the CFs.</p>
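        <p>A minimal sketch of the generator and discriminator described above, with DP injected only into the
generator’s optimizer, is given below. The adversarial training loop itself follows the residual CounterGAN
formulation of [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and is omitted here, the Tanh output is interpreted as having the input dimensionality, and the
hyperparameter values shown are illustrative.</p>
        <preformat>
import tensorflow as tf
import tensorflow_privacy

def build_generator(n_features):
    # 64-32-64 ReLU layers and a Tanh output of the input dimensionality.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_features, activation="tanh"),
    ])

def build_discriminator(n_features):
    # 128-128-64 ReLU layers followed by a Sigmoid real/fake output.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# DP-Explainer: only the generator's updates are privatized.
dp_generator_optimizer = tensorflow_privacy.DPKerasAdamOptimizer(
    l2_norm_clip=1.0, noise_multiplier=0.5,  # 0.1, 0.5 or 0.9 in our study
    num_microbatches=1, learning_rate=0.001)
discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
        </preformat>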
        <p>
          • Prediction Gain: quantifies how the explainer modifies the input to influence the model’s decision
by measuring the change in the classifier’s confidence score for a specific target class c when
replacing the original data point x with its CF x_CF: Δ = f(x_CF, c) − f(x, c), where f(x_CF, c) is
the probability score for the target class c of the CF and f(x, c) is that of the initial point.
• Realism: quantifies how well a data instance fits within a data distribution, in order to evaluate how well CFs
and private CFs generated under different noise levels match the original training data distribution. It is
defined as: Realism = (1/N) Σ_{i=1}^{N} ‖input_i − reconstruction_i‖², where input_i represents the original
data point, reconstruction_i is the corresponding autoencoder reconstruction, and N is the total
number of instances ([
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]). A lower realism value indicates that the data point is more realistic.
        </p>
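        <p>Both metrics can be computed directly from model outputs. The sketch below, with illustrative function
names, assumes a Keras-style classifier that returns class probabilities and an autoencoder fitted to the original
training data.</p>
        <preformat>
import numpy as np

def prediction_gain(classifier, X, X_cf, target_class):
    """Mean change in the classifier's confidence for the target class when
    the original points are replaced by their counterfactuals."""
    p_cf = classifier.predict(X_cf)[:, target_class]
    p_orig = classifier.predict(X)[:, target_class]
    return float(np.mean(p_cf - p_orig))

def realism(autoencoder, X_cf):
    """Mean squared reconstruction error of the counterfactuals under an
    autoencoder fitted to the training data; lower values are more realistic."""
    reconstruction = autoencoder.predict(X_cf)
    return float(np.mean(np.sum((X_cf - reconstruction) ** 2, axis=1)))
        </preformat>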
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. ML Model Predictive Performance</title>
        <p>Fig. 2 reports the predictive performance metrics of the models while varying the noise level across
the two datasets used in our evaluations. As previously mentioned, we consider three noise levels when
applying DP, 0.1, 0.5 and 0.9, and we refer to each case as DP-Model-noise level. As expected, the results
across the two datasets indicate a decline in predictive performance metrics as the noise level increases.
For instance, in the EEG dataset, accuracy, precision, recall, and F1-score are 0.94, 0.92, 0.9, and 0.91,
respectively, when no DP is applied. However, at the highest noise level considered (0.9), these metrics
drop to 0.85, 0.78, 0.66, and 0.72, respectively. Similar results are seen across the Housing dataset, where
predictive performance metrics show a declining trend as the noise level applied increases.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Effectiveness of Differential Privacy in Mitigating MEA</title>
        <p>We consider the three scenarios of applying DP, namely DP-Model, DP-Explainer, and DP-Model-Explainer,
and the baseline No DP scenario. Additionally, when we incorporate DP at the explainer, we
refer to each case as DP-Explainer-noise level (e.g., DP-Explainer-0.1). This evaluation allows us to
address RQ1 and RQ2. Figure 3 shows the agreement achieved by MEA across the various combinations
of applying DP for varying noise levels and numbers of queries on the Housing dataset. We start
with No DP (Fig. 3(a)), which allows us to quantify solely the impact of employing different levels
of noise at the explainer on the success of the MEA. Results show a general trend where the MEA
is more successful as the number of queries increases across all cases (i.e., independent of the
noise level applied). Comparing the agreement when employing different noise levels, results show
that employing more noise, as expected, provides more defense against MEA. Specifically, with a noise
level of 0.9, agreement ranges between 50 and 72 as the number of queries increases up to 1000. In
contrast, when employing noise levels of 0.5 and 0.1, agreement falls within the ranges of 62–75 and
60–78, respectively. In the absence of DP at the explainer, agreement starts at 70 with 50 queries and
reaches 80 when 1000 queries are used. We now focus on the cases where DP is employed at the model
level (Fig. 3(b), (c), and (d)). Generally, results show a similar trend across all cases, where agreement
increases with the number of queries used to perform the MEA. Comparing the agreement achieved
when employing different noise levels in each case, results show, as expected, that employing higher
noise levels at the explainer implies better protection against MEA. For instance, when employing
DP-Model with a noise level of 0.1 (Fig. 3(b)), the highest agreement observed is 70 when DP-Explainer
is also employed (which is a DP-Model-Explainer case), compared to 76 without DP-Explainer. Similarly,
with a noise level of 0.5 at the model (Fig. 3(c)), the agreement consistently remains lower than in the
No DP case, reaching a maximum of 70.63 versus 75 when DP is only applied at the model. Similar
trends were observed for DP-Model-0.9. Figure 4 shows the agreement achieved by MEA on the EEG
dataset across the various cases. The results show similar trends to those observed with the Housing
dataset. When no DP is applied to the model (Fig. 4(a)), the agreement improves with more queries,
ranging between 68% and 96%, with the highest agreement observed when DP is not applied at all.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Impact of Differential Privacy on Quality of Explanations</title>
        <p>Figure 5 shows the prediction gain achieved by the explainer across the Housing and EEG datasets for
varying noise levels. In the Housing dataset, a clear trend emerges: as the DP-Explainer noise increases, the
prediction gain decreases, which means that employing more noise decreases the CF’s probability toward
the desired class. For example, with No DP, the prediction gain starts at 0.488; however, when DP-Explainer
with a noise level of 0.9 is applied, it drops dramatically to 0.055. This decline is observed consistently
across all model noise levels. Moreover, when DP-Model noise levels (0.5 and 0.9) are introduced, the
prediction gain observed is less than that of No DP and DP-Model-0.1, regardless of the
DP-Explainer noise. The EEG dataset follows a comparable pattern. In scenarios without DP
applied to the model, the prediction gain ranges from 0.568 to 0.222 as the DP-Explainer
noise increases. When the model is subjected to DP noise at levels of 0.1, 0.5, and 0.9, the prediction gains
are consistently lower. We now focus on analyzing the impact of incorporating DP on realism. Across
both datasets, increasing the DP-Explainer noise consistently results in higher realism scores, indicating
less realistic CFs and a degradation in CF quality. In the Housing dataset, even without any DP-Model
noise, the realism score ranges from 0.113 up to 1.116 at a DP-Explainer noise of 0.9. This degradation is
further amplified when additional DP-Model noise is introduced, e.g., with a DP-Model noise of 0.1,
the realism score ranges from 0.356 to 3.289 as the DP-Explainer noise increases, and similar patterns are
observed for DP-Model-0.5 and 0.9. The EEG dataset exhibits a comparable pattern, although the No
DP realism scores are generally higher.</p>
        <p>Discussion on Performance-Privacy-Explanations Interplay: Results indicate that introducing
DP mechanisms affects model performance, although the extent of this impact varies according to
the specific use case and dataset. Similarly, the quality of the generated CF explanations is influenced
by the privacy parameters applied. Experiments reveal that even slight amounts of noise, whether
introduced at the DP-Model level or within the DP-Explainer, can alter CF quality. In terms of the effectiveness
of DP interventions in the context of MEA, our analysis shows that introducing minimal noise at the model
level generally offers some resistance to MEA, while higher noise levels provide a more robust defense,
albeit at the cost of reduced model performance. When examining the impact of noise on the CFs,
we observe that small increments in noise can slightly reduce the success rate of MEA, but further
increases yield a more pronounced protective effect. Notably, when both the model and the explainer
are simultaneously subjected to DP, a synergistic improvement in resistance to MEA is observed.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we investigate the impact of differential privacy (DP) in mitigating model extraction
attacks (MEAs) that leverage counterfactual explanations (CFs) within Machine Learning as a Service
(MLaaS) environments. We evaluate DP implemented at the ML model level via DP-Stochastic
Gradient Descent, at the explanation level, and at both simultaneously, to investigate their respective
impacts on MEA resilience. Our analysis, conducted across two datasets, demonstrates and quantifies a
fundamental trade-off between privacy protection and utility: the introduction of DP noise
effectively hinders an adversary’s ability to reconstruct the target model, yet it
simultaneously compromises both model performance and the quality of the generated CFs. Further research
will include testing other DP-based methods for generating CFs, other MEA methods, and more datasets.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tramèr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Juels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ristenpart</surname>
          </string-name>
          ,
          <article-title>Stealing machine learning models via prediction APIs</article-title>
          ,
          <source>in: 25th USENIX Security Symposium (USENIX Security 16)</source>
          , USENIX Association,
          <year>2016</year>
          , pp.
          <fpage>601</fpage>
          -
          <lpage>618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shokri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stronati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shmatikov</surname>
          </string-name>
          ,
          <article-title>Membership inference attacks against machine learning models</article-title>
          ,
          <source>in: 2017 IEEE Symposium on Security and Privacy (SP)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <article-title>Differential privacy</article-title>
          ,
          <source>in: International Colloquium on Automata, Languages, and Programming</source>
          , Springer,
          <year>2006</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. B.</given-names>
            <surname>McMahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mironov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Deep learning with differential privacy</article-title>
          ,
          <source>in: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>308</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <article-title>A survey of methods for explaining black box models</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 51</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ezzeddine</surname>
          </string-name>
          ,
          <article-title>Privacy implications of explainable ai in data-driven systems</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shokri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strobel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zick</surname>
          </string-name>
          ,
          <article-title>On the privacy risks of model explanations</article-title>
          ,
          <source>in: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Spartalis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Semertzidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Daras</surname>
          </string-name>
          ,
          <article-title>Balancing xai with privacy and security considerations</article-title>
          ,
          <source>in: European Symposium on Research in Computer Security</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wachter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mittelstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Counterfactual explanations without opening the black box: Automated decisions and the gdpr</article-title>
          ,
          <source>Harv. JL &amp; Tech. 31</source>
          (
          <year>2017</year>
          )
          <fpage>841</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ezzeddine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ayoub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Giordano</surname>
          </string-name>
          ,
          <article-title>Knowledge distillation-based model extraction attack using private counterfactual explanations</article-title>
          ,
          <source>arXiv preprint arXiv:2404.03348</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>U.</given-names>
            <surname>Aïvodji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bolot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gambs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehnaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yvinec</surname>
          </string-name>
          ,
          <article-title>Model extraction from counterfactual explanations</article-title>
          ,
          <source>in: Proceedings of the 2020 conference on fairness, accountability, and transparency</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saracino</surname>
          </string-name>
          ,
          <article-title>Further insights: Balancing privacy, explainability, and utility in machine learning-based tabular data analysis</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Availability, Reliability and Security</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Oksuz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ayday</surname>
          </string-name>
          , Autolycus:
          <article-title>Exploiting explainable artificial intelligence (xai) for model extraction attacks against interpretable models</article-title>
          ,
          <source>Proceedings on Privacy Enhancing Technologies</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Towards explainable model extraction attacks</article-title>
          ,
          <source>International Journal of Intelligent Systems</source>
          <volume>37</volume>
          (
          <year>2022</year>
          )
          <fpage>9936</fpage>
          -
          <lpage>9956</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Miao</surname>
          </string-name>
          , Dualcf:
          <article-title>Efficient model extraction attack from counterfactual explanations</article-title>
          ,
          <source>in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1318</fpage>
          -
          <lpage>1329</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dissanayake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <article-title>Model reconstruction using counterfactual explanations: A perspective from polytope theory</article-title>
          ,
          <source>Advances in Neural Information Processing Systems (NeurIPS)</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>Counterfactual explanation at will, with zero privacy leakage</article-title>
          ,
          <source>Proceedings of the ACM on Management of Data</source>
          <volume>2</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Differentially private counterfactuals via functional mechanism</article-title>
          ,
          <source>arXiv preprint arXiv:2208.02878</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pentyala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kariyappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lécué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Magazzeni</surname>
          </string-name>
          ,
          <article-title>Privacy-preserving algorithmic recourse</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nemirovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thiebaut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , Countergan:
          <article-title>Generating counterfactuals for real-time recourse and interpretability using residual gans</article-title>
          ,
          <source>in: Uncertainty in Artificial Intelligence, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1488</fpage>
          -
          <lpage>1497</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Scikit-learn Developers</surname>
          </string-name>
          , California housing dataset,
          <year>2024</year>
          . URL: scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html, accessed: 2024-01-04.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>O.</given-names>
            <surname>Roesler</surname>
          </string-name>
          ,
          <article-title>EEG eye state</article-title>
          ,
          <source>UCI Machine Learning Repository</source>
          ,
          <year>2013</year>
          . URL: doi.org/10.24432/C57G7J, accessed: 2024-01-04.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>