<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the interplay of Explainability, Privacy and Predictive Performance with Explanation-assisted Model Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fatima Ezzeddine</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rinad Akel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ihab Sbeity</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Giordano</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc Langheinrich</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Omran Ayoub</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lebanese University</institution>
          ,
          <addr-line>Beirut</addr-line>
          ,
          <country country="LB">Lebanon</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università della Svizzera italiana</institution>
          ,
          <addr-line>Lugano</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Applied Sciences and Arts of Southern Switzerland</institution>
          ,
          <addr-line>Lugano</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Machine Learning as a Service (MLaaS) has gained considerable traction as a means for deploying powerful predictive models, offering ease of use that enables organizations to leverage advanced analytics without substantial investments in specialized infrastructure or expertise. However, MLaaS platforms must be safeguarded against security and privacy attacks, such as model extraction (MEA) attacks. The increasing integration of explainable AI (XAI) within MLaaS has introduced an additional privacy challenge, as attackers can exploit model explanations, particularly counterfactual explanations (CFs), to facilitate MEA. In this paper, we investigate the trade-offs among model performance, privacy, and explainability when employing Differential Privacy (DP), a promising technique for mitigating CF-facilitated MEA. We evaluate two distinct DP strategies: DP applied during the training of the classification model and DP applied at the explainer during CF generation.</p>
      </abstract>
      <kwd-group>
        <kwd>Counterfactual Explanations</kwd>
        <kwd>Model Extraction Attack</kwd>
        <kwd>Differential Privacy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Machine Learning (ML) as a Service (MLaaS) is becoming increasingly popular for deploying powerful
predictive models as it facilitates access to ML training and deployment tools, while eliminating the
need for extensive computational resources [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The adoption of MLaaS, however, introduces important
security and privacy risks. For instance, adversaries can query the deployed ML models through
application programming interfaces (APIs) to perform various types of attacks, such as membership
inference (MIA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and model extraction (MEA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These attacks, if successful, pose serious threats to
data privacy and intellectual property. For instance, MIA can reveal whether specific data points were
used in training, while MEA enables adversaries to replicate proprietary models, leading to financial
losses and competitive disadvantages, and facilitates further data privacy attacks by giving the adversary
access to a copy of the model. To defend against these attacks, data privacy-enhancing technologies such as
Differential Privacy (DP) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] exist. DP has shown effectiveness in defending against such attacks and is
therefore widely adopted in use cases that require data and model sharing and deployment [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. DP
enables privacy-preserving training of deep neural networks (DNNs) to effectively mitigate inference
attacks by adding a controlled amount of noise to either the raw data or the model weights; it ensures that
individual data points have minimal influence on the model’s response, which limits the amount of
sensitive information leaked when an attacker queries the model.
      </p>
      <p>
        Recently, with the increasing demand for transparency in automated decision-making, MLaaS
platforms are starting to incorporate Explainable Artificial Intelligence (XAI) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] techniques into their
workflows to provide explanations of the model’s decisions [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. These platforms now provide not
only the final decisions of ML models but also explanations of the underlying processes. The increased
transparency provided by XAI introduces new challenges for preserving privacy and safeguarding
MLaaS platforms from adversarial threats, as model explanations can inadvertently reveal information
about the model’s decision boundaries [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Specifically, counterfactual explanations (CFs) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which aim to
identify the smallest changes to input data that would alter an ML model’s prediction to a desired
outcome, can reveal the factors most influential in the model’s decision-making. Recent research has indeed
explored how explanations can be leveraged to enhance the effectiveness of such attacks [
        <xref ref-type="bibr" rid="ref10 ref11 ref6 ref7 ref8">8, 7, 10, 6, 11</xref>
        ].
Complementing this, DP can also be applied at the explanation level, where it masks explanations to
limit their utility to adversaries while balancing interpretability and privacy [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ]. As DP can impact predictive performance and explanation quality, and can be applied at both
levels, a growing body of research highlights the importance of DP in developing mitigation strategies
that specifically address the risks introduced by explanations, emphasizing the need to adapt, utilize,
or extend existing defense methods to counter the exploitation of explainability. In this work, we focus
on analyzing a mitigation framework that integrates DP at the model and at the explainer, and we
investigate the interplay between (i) the model’s accuracy, as DP is expected to influence the model’s
inference capability, (ii) privacy, as employing DP provides resilience against attacks, and (iii)
explainability, as noise added to the model or the explainer may impact the quality of explanations [
        <xref ref-type="bibr" rid="ref12 ref6">6, 12</xref>
        ]. We aim to quantify this interplay and extract insights on where to employ DP (at the model,
at the explainer, or at both) and on the noise level required to balance predictive performance,
explainability, and privacy.
      </p>
      <p>
        To perform the attack, we employ a recently proposed MEA technique based on Knowledge Distillation
(KD) due to its proven performance and practicality [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In terms of mitigation strategies, we employ
DP at the ML model using Differentially Private SGD (DP-SGD) and at the explainer using a DP-based
Generative Adversarial Network (GAN) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with varying noise levels. To this end, we investigate the
following research questions (RQs):
• RQ1: To what extent does applying DP at the model, at the explainer, or at both effectively mitigate
MEA facilitated by CFs?
• RQ2: How does the noise level in DP influence the effectiveness of MEAs that leverage CFs?
• RQ3: In what ways does the quality of CF explanations differ when DP is applied at the model
compared to the explainer?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Several studies have explored leveraging XAI techniques and exploiting model explanations to perform
privacy attacks. In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the authors explore the vulnerabilities of Local Interpretable Model-agnostic
Explanations and show that an adversary can generate new data samples near the decision boundary
and, consequently, perform MEA by crafting adaptive queries. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the authors show that, by leveraging
gradient-based explanations, adversaries can enhance the effectiveness of MIA. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the authors
propose a methodology that performs MEA by jointly minimizing classification and explanation loss,
thereby improving its fidelity. Other works explore the use of CFs to enhance the effectiveness of
MEA. For instance, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] introduces a methodology that relies on model predictions and CFs to train a
substitute model. Similarly, [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] presents a novel strategy where CF pairs, including the CF of the CF,
serve as training samples for MEA. More recently, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposes a methodology based on KD techniques
that exploits CFs to perform MEA effectively while minimizing the number of queries to an MLaaS
system, and also generates private CFs with DP. Moreover, [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] explores the theoretical foundations of MEA
with CFs, highlighting the risks associated with providing CF explanations.
      </p>
      <p>
        Several approaches have been proposed to prevent adversaries from exploiting model explanations
for privacy attacks. In [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], the authors propose an approach that builds on the concept of providing
CFs that are not derived from the entire feature space but instead are generated within a designated
space. Some works developed methodologies to generate explanations while limiting the exposure of
sensitive insights related to decision boundaries, training data, or model architectures. Authors in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
present an approach to generate differentially private CFs using functional mechanisms to protect the
underlying model from potential inference attacks. In contrast, [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] proposes a novel approach that
constructs private recourse paths as CFs using differentially private clustering. The authors in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] focus
on GAN-based CFs (proposed in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]), injecting DP into the training process of the generator responsible
for generating CFs, which limits the memorization of private data points.
      </p>
      <p>Similar to these works, we focus on identifying a mitigation strategy against attacks that exploit the
model’s explanations. Specifically, we explore the application of DP to the ML model, the explainer,
and both simultaneously. Despite the numerous studies utilizing DP for mitigation strategies, our work
is, to the best of our knowledge, the first to explore the application of DP in both the ML model and
the explainer, to investigate their effectiveness in countering MEA, and to examine their influence on
the quality of explanations. Additionally, our work explores the interplay between preserving model
privacy and generating privacy-preserving CFs, as well as the implications for defending against MEA.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Formulation and Methodology</title>
      <p>We consider a setting in which DP can be employed at the model or at the explainer to counter potential attacks.</p>
      <p>
        Given a dataset D = {(x_i, y_i)}_{i=1}^N, where x_i are feature vectors and y_i are the corresponding labels,
a target model f(x; θ), trained and optimized to achieve high performance on D, is deployed as MLaaS
and is queryable through an API (as shown in Fig. 1). An adversary attempts to extract an
approximation of f(x; θ) using queries and the provided CFs. The attacker conducts MEA by exploiting
CFs and varying the number of queries. To perform our analysis, we proceed as follows:
• Step 1: Train target models as baseline models f_baseline(x; θ_base).
• Step 2: Generate CFs x_CF = G(x; θ_G) by training a CounterGAN for f_baseline.
• Step 3: Simulate the MEA, where the adversary queries the models with random points and collects
pairs of predictions and CFs. The adversary trains an extracted model using the KD-based method
proposed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
• Step 4: Measure MEA success by computing the agreement on a separate dataset to quantify and
compare the level of agreement between extracted and original models/explanations.
• Step 5: Assess the quality of CFs using metrics such as prediction gain and realism (explained in
more detail in Sec. 4).
      </p>
      <p>The effectiveness of the MEA is measured using similarity metrics such as agreement. In practice,
this agreement expectation is estimated empirically using a set of n test inputs {x_1, x_2, . . . , x_n}:
Agreement = (1/n) Σ_{i=1}^{n} 1[ f(x_i) = f̂(x_i) ], where the indicator function 1[·] counts the number
of times the extracted model’s predictions match the target model’s predictions.</p>
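      <p>As a concrete illustration, the agreement can be computed directly from the predictions of the two models on
the test inputs. The following minimal Python sketch assumes Keras-style models that return class-probability
vectors; the function and variable names are illustrative and not taken from the original implementation.</p>
      <preformat>
import numpy as np

def agreement(target_model, extracted_model, X_test):
    """Fraction of test inputs on which the extracted model's predicted
    class matches the target model's predicted class."""
    y_target = np.argmax(target_model.predict(X_test), axis=1)
    y_extracted = np.argmax(extracted_model.predict(X_test), axis=1)
    return float(np.mean(y_target == y_extracted))
      </preformat>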
      <p>
        As a mitigation against MEA, we employ two strategies: 1) DP-Model (DP-SGD): we apply
DP-SGD during the training of f(x; θ); 2) DP-Explainer (DP in CounterGAN): we inject DP noise
at the generator G(x; θ_G) so that it outputs private CFs ([
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). We then perform MEA leveraging CFs under
different DP settings, i.e., the approach adopted and the noise level σ of the privacy mechanism, and evaluate
the adversary’s MEA success and CF quality. Specifically, in Step 1, f_baseline(x; θ_base) is first trained on D
without DP. We also train a DP-protected model f_DP(x; θ_DP) using DP-SGD with noise level σ.
Similarly, in Step 2, we also train a private CounterGAN G_DP(x; θ_G) to generate private CFs,
varying the noise level. In Step 3, the attacker applies MEA to extract f̂(x; θ̂) using the KD-based
method, using either the CFs generated by G(x; θ_G) or those generated by G_DP(x; θ_G). For the comparative
analysis, we consider four distinct scenarios: (1) No DP: a baseline scenario that does not incorporate DP at any
level, allowing the evaluation of the unprotected model’s performance and vulnerability. (2) DP-Model: only
the target model employs DP. This protects the model from adversarial replication while the explanation
generator remains unprotected. (3) DP-Explainer: DP is applied to the explanation generator. This
scenario assesses the impact of DP on the explanations’ utility without directly affecting the target
model. (4) DP-Model-Explainer: both the target model and the explanation generator are protected with
DP, aiming to balance model performance, explanation quality, and resistance to MEA.
      </p>
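      <p>One plausible way to organize the comparative analysis is as a grid of (model noise, explainer noise)
configurations. The sketch below is purely illustrative: it enumerates the scenario labels used in Section 5, with
None denoting that DP is not applied at that level, and assumes that the combined DP-Model-Explainer scenario
crosses the two sets of noise levels.</p>
      <preformat>
# Noise levels considered in our experiments; None means DP is not applied at that level.
NOISE_LEVELS = [0.1, 0.5, 0.9]

scenarios = {"No DP": (None, None)}
scenarios.update({f"DP-Model-{s}": (s, None) for s in NOISE_LEVELS})
scenarios.update({f"DP-Explainer-{s}": (None, s) for s in NOISE_LEVELS})
scenarios.update({f"DP-Model-Explainer-{sm}-{se}": (sm, se)
                  for sm in NOISE_LEVELS for se in NOISE_LEVELS})

for name, (model_noise, explainer_noise) in scenarios.items():
    print(name, model_noise, explainer_noise)
      </preformat>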
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Settings</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets, Target and Threat Model</title>
        <p>
          We perform an evaluation on 2 datasets: Housing [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and EEG Eye State[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The Housing dataset
describes housing prices and includes 20,640 instances and 8 features, a mix of socio-economic,
demographic, and geographic attributes. The target variable represents the median house value and
is converted into two classes using a threshold defined by the median. The EEG Eye State dataset
comprises EEG measurements recorded using a Neuroheadset and contains 14,980 data points and
14 features. The target variable is a binary label representing the eye-closed or eye-open state.
        </p>
        <p>The target model f is a DNN with 16 hidden layers of 64, 32, 16, 32, 64, 128, 64, 32, 128, 64, 128, 64, 128,
64, 32, and 16 neurons per layer, a GELU activation function, and a softmax activation function
in the output layer. We employ the Adam optimizer for the cases where DP is not used and TensorFlow
Privacy’s DPKerasAdamOptimizer for the cases where DP is applied. The model is trained without DP
and with noise levels of 0.1, 0.5, and 0.9 for the DP cases, with varying learning rates (0.001, 0.002, and
0.01), and with l2_norm_clip varied between 1 and 1.5 (l2_norm_clip bounds the sensitivity of the
gradients by limiting the influence of any single training example on the overall gradient update, which
is a crucial step before adding noise). Note that the more noise, the higher the privacy. The target
models are trained using 80% of the corresponding dataset, and the best-performing model in terms of
accuracy was chosen.</p>
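        <p>The following sketch illustrates how such a target model can be trained with DP-SGD via TensorFlow
Privacy, assuming the layer widths and optimizer named above; the batch size, number of epochs, and the specific
noise level shown are illustrative values rather than the exact configuration used in our experiments.</p>
        <preformat>
import tensorflow as tf
import tensorflow_privacy

# Hidden-layer widths of the target DNN as described above (GELU activations).
HIDDEN = [64, 32, 16, 32, 64, 128, 64, 32, 128, 64, 128, 64, 128, 64, 32, 16]

def build_target_model(n_features, n_classes=2):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(HIDDEN[0], activation="gelu",
                                    input_shape=(n_features,)))
    for units in HIDDEN[1:]:
        model.add(tf.keras.layers.Dense(units, activation="gelu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    return model

batch_size = 250  # illustrative; must be divisible by num_microbatches
model = build_target_model(n_features=8)  # e.g. the Housing dataset

# DP-SGD: per-example gradients are clipped to l2_norm_clip and Gaussian
# noise scaled by noise_multiplier is added before the weight update.
optimizer = tensorflow_privacy.DPKerasAdamOptimizer(
    l2_norm_clip=1.0,
    noise_multiplier=0.5,   # noise levels of 0.1, 0.5, 0.9 in our study
    num_microbatches=batch_size,
    learning_rate=0.001)

# Per-example losses (reduction=NONE) are required so gradients can be
# clipped per microbatch before noise is added.
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)

model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=30, batch_size=batch_size)
        </preformat>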
        <p>To simulate a realistic attack scenario, we assume that the attacker has no prior knowledge of the
training data distribution and does not know the architecture of the target model, but can build a simple
threat model ϒ. The model ϒ consists of 5 layers, with 32, 64, 128, and 64 neurons with ReLU activation,
followed by a softmax output layer. The attacker generates random data points to query the
model, within a range of -3 to 3 for each feature, and extracts CFs to feed as input to the KD-based MEA.
Our evaluation involves performing MEA while varying the number of queries from 50 to 1000, and
therefore the size of the input to KD. For optimization, we use Adam for the cases where DP is not used
and TensorFlow Privacy’s DPKerasAdamOptimizer for the cases where DP is used, assessing model
performance under three different noise levels, 0.1, 0.5, and 0.9. We tune the hyperparameters of the
KD-based approach, specifically alpha within the range of 0.1 to 0.5 and temperature within the range of 1
to 10. We compute the MEA agreement over the 20% test set and report the average results of 5 runs.</p>
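        <p>As a reference point, a standard Hinton-style knowledge-distillation objective with the alpha and
temperature hyperparameters mentioned above can be sketched as follows; this is a generic stand-in, and the
exact KD-based loss of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] may differ in its details.</p>
        <preformat>
import tensorflow as tf

def distillation_loss(teacher_probs, student_logits, y_true,
                      alpha=0.3, temperature=5.0):
    """Generic KD objective: weighted sum of hard-label cross-entropy and
    soft-label KL divergence at temperature T. A stand-in for the KD-based
    MEA objective; alpha in [0.1, 0.5] and T in [1, 10] are tuned."""
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True)
    # Soften both distributions with the temperature before comparing them.
    soft_teacher = tf.nn.softmax(tf.math.log(teacher_probs + 1e-9) / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    soft = tf.keras.losses.kullback_leibler_divergence(soft_teacher, soft_student)
    return alpha * hard + (1.0 - alpha) * (temperature ** 2) * soft
        </preformat>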
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Counterfactual generator</title>
        <p>The generator of CounterGAN takes an input feature vector and processes it through 4 layers:
three layers with 64, 32, and 64 neurons and ReLU activations, and a final layer with Tanh activation.
The discriminator follows a simple feedforward design, consisting of 128, 128, and 64 neurons with ReLU
activation and a final output layer with Sigmoid activation. In the No-DP scenario, we used the
standard Adam optimizer. For the scenarios where DP is employed, we applied DP to the generator using noise
levels of 0.1, 0.5, and 0.9, with TensorFlow Privacy’s DPKerasAdamOptimizer. We
varied the learning rate over 0.05, 0.005, 0.01, and 0.001, and l2_norm_clip over 1, 1.5,
and 3. We report the average results of 5 runs. We consider the following metrics to assess the
influence of employing privacy on the CFs.</p>
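        <p>A minimal sketch of the generator and discriminator described above, with DP injected only into the
generator’s optimizer, is given below. The adversarial training loop itself follows the residual CounterGAN
formulation of [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and is omitted here, the Tanh output is interpreted as having the input dimensionality, and the
hyperparameter values shown are illustrative.</p>
        <preformat>
import tensorflow as tf
import tensorflow_privacy

def build_generator(n_features):
    # 64-32-64 ReLU layers and a Tanh output of the input dimensionality.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_features, activation="tanh"),
    ])

def build_discriminator(n_features):
    # 128-128-64 ReLU layers followed by a Sigmoid real/fake output.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# DP-Explainer: only the generator's updates are privatized.
dp_generator_optimizer = tensorflow_privacy.DPKerasAdamOptimizer(
    l2_norm_clip=1.0, noise_multiplier=0.5,  # 0.1, 0.5 or 0.9 in our study
    num_microbatches=1, learning_rate=0.001)
discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
        </preformat>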
        <p>
          • Prediction Gain: quantifies how the explainer modifies the input to influence the model’s decision
by measuring the change in the classifier’s confidence score for a specific target class c when
replacing the original data point x with its CF x_CF: Δ = f(x_CF, c) − f(x, c), where f(x_CF, c) is
the probability score for the target class c of the CF and f(x, c) is that of the initial point.
• Realism: quantifies how well a data instance fits within a data distribution, in order to evaluate how well CFs
and private CFs generated under different noise levels match the original training data distribution. It is
defined as: Realism = (1/N) Σ_{i=1}^{N} ‖input_i − reconstruction_i‖², where input_i represents the original
data point, reconstruction_i is the corresponding autoencoder reconstruction, and N is the total
number of instances ([
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]). A lower realism value indicates that the data point is more realistic.
        </p>
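        <p>Both metrics can be computed directly from model outputs. The sketch below, with illustrative function
names, assumes a Keras-style classifier that returns class probabilities and an autoencoder fitted to the original
training data.</p>
        <preformat>
import numpy as np

def prediction_gain(classifier, X, X_cf, target_class):
    """Mean change in the classifier's confidence for the target class when
    the original points are replaced by their counterfactuals."""
    p_cf = classifier.predict(X_cf)[:, target_class]
    p_orig = classifier.predict(X)[:, target_class]
    return float(np.mean(p_cf - p_orig))

def realism(autoencoder, X_cf):
    """Mean squared reconstruction error of the counterfactuals under an
    autoencoder fitted to the training data; lower values are more realistic."""
    reconstruction = autoencoder.predict(X_cf)
    return float(np.mean(np.sum((X_cf - reconstruction) ** 2, axis=1)))
        </preformat>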
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. ML Model Predictive Performance</title>
        <p>Fig. 2 reports the predictive performance metrics of the models while varying the noise level across
the two datasets used in our evaluations. As previously mentioned, we consider three noise levels when
applying DP, 0.1, 0.5 and 0.9, and we refer to each case as DP-Model-noise level. As expected, the results
across the two datasets indicate a decline in predictive performance metrics as the noise level increases.
For instance, in the EEG dataset, accuracy, precision, recall, and F1-score are 0.94, 0.92, 0.9, and 0.91,
respectively, when no DP is applied. However, at the highest noise level considered (0.9), these metrics
drop to 0.85, 0.78, 0.66, and 0.72, respectively. Similar results are seen across the Housing dataset, where
predictive performance metrics show a declining trend as the noise level applied increases.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Effectiveness of Differential Privacy in Mitigating MEA</title>
        <p>We consider the three scenarios of applying DP, namely DP-Model, DP-Explainer, and DP-Model-Explainer,
and the baseline No DP scenario. Additionally, when we incorporate DP at the explainer, we
refer to each case as DP-Explainer-noise level (e.g., DP-Explainer-0.1). This evaluation allows us to
address RQ1 and RQ2. Figure 3 shows the agreement achieved by MEA across the various combinations
of applying DP for varying noise levels and numbers of queries on the Housing dataset. We start
with No DP (Fig. 3(a)), which allows us to quantify solely the impact of employing different levels
of noise at the explainer on the success of the MEA. Results show a general trend where the MEA
is more successful as the number of queries increases across all cases (i.e., independent of the
noise level applied). Comparing the agreement when employing different noise levels, results show
that employing more noise, as expected, provides more defense against MEA. Specifically, with a noise
level of 0.9, agreement ranges between 50 and 72 as the number of queries increases up to 1000. In
contrast, when employing noise levels of 0.5 and 0.1, agreement falls within the ranges of 62–75 and
60–78, respectively. In the absence of DP at the explainer, agreement starts at 70 with 50 queries and
reaches 80 when 1000 queries are used. We now focus on the cases where DP is employed at the model
level (Fig. 3(b), (c), and (d)). Generally, results show a similar trend across all cases, where agreement
increases with the number of queries used to perform the MEA. Comparing the agreement achieved
when employing different noise levels in each case, results show, as expected, that employing higher
noise levels at the explainer implies better protection against MEA. For instance, when employing
DP-Model with a noise level of 0.1 (Fig. 3(b)), the highest agreement observed is 70 when DP-Explainer
is also employed (which is a DP-Model-Explainer case), compared to 76 without DP-Explainer. Similarly,
with a noise level of 0.5 at the model (Fig. 3(c)), the agreement consistently remains lower than in the
No DP case, reaching a maximum of 70.63 versus 75 when DP is only applied at the model. Similar
trends were observed for DP-Model-0.9. Figure 4 shows the agreement achieved by MEA on the EEG
dataset across the various cases. The results show similar trends to those observed with the Housing
dataset. When no DP is applied to the model (Fig. 4(a)), the agreement improves with more queries,
ranging between 68% and 96%, with the highest agreement observed when DP is not applied at all.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Impact of Differential Privacy on Quality of Explanations</title>
        <p>Figure 5 shows the prediction gain achieved by the explainer across the Housing and EEG datasets for
varying noise levels. In the Housing dataset, a clear trend emerges: as the DP-Explainer noise increases, the
prediction gain decreases, which means that employing more noise decreases the CF’s probability toward
the desired class. For example, with No DP, the prediction gain starts at 0.488; however, when DP-Explainer
with a noise level of 0.9 is applied, it drops dramatically to 0.055. This decline is observed consistently
across all model noise levels. Moreover, when DP-Model noise levels (0.5 and 0.9) are introduced, the
prediction gain observed is less than that of No DP and DP-Model-0.1, regardless of the
DP-Explainer noise. The EEG dataset follows a comparable pattern. In scenarios without DP
applied to the model, the prediction gain ranges from 0.568 to 0.222 as the DP-Explainer
noise increases. When the model is subjected to DP noise at levels of 0.1, 0.5, and 0.9, the prediction gains
are consistently lower. We now focus on analyzing the impact of incorporating DP on realism. Across
both datasets, increasing the DP-Explainer noise consistently results in higher realism scores, indicating
less realistic CFs and a degradation in CF quality. In the Housing dataset, even without any DP-Model
noise, the realism score ranges from 0.113 up to 1.116 at a DP-Explainer noise of 0.9. This degradation is
further amplified when additional DP-Model noise is introduced, e.g., with a DP-Model noise of 0.1,
the realism score ranges from 0.356 to 3.289 as the DP-Explainer noise increases, and similar patterns are
observed for DP-Model-0.5 and 0.9. The EEG dataset exhibits a comparable pattern, although the No
DP realism scores are generally higher.</p>
        <p>Discussion on Performance-Privacy-Explanations Interplay: Results indicate that introducing
DP mechanisms affects model performance, although the extent of this impact varies according to
the specific use case and dataset. Similarly, the quality of the generated CF explanations is influenced
by the privacy parameters applied. Experiments reveal that even slight amounts of noise, whether
introduced at the DP-Model level or within the DP-Explainer, can alter CF quality. In terms of the effectiveness
of DP interventions in the context of MEA, our analysis shows that introducing minimal noise at the model
level generally offers some resistance to MEA, while higher noise levels provide a more robust defense,
albeit at the cost of reduced model performance. When examining the impact of noise on the CFs,
we observe that small increments in noise can slightly reduce the success rate of MEA, but further
increases yield a more pronounced protective effect. Notably, when both the model and the explainer
are simultaneously subjected to DP, a synergistic improvement in resistance to MEA is observed.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we investigate the impact of differential privacy (DP) in mitigating model extraction
attacks (MEAs) that leverage counterfactual explanations (CFs) within Machine Learning as a Service
(MLaaS) environments. We evaluate DP implemented at the ML model level via DP-Stochastic
Gradient Descent, at the explanation level, and at both simultaneously, to investigate their respective
impacts on MEA resilience. Our analysis, conducted across two datasets, demonstrates and quantifies a
fundamental trade-off between privacy protection and utility: the introduction of DP noise
effectively hinders an adversary’s ability to reconstruct the target model, yet it
simultaneously compromises both model performance and the quality of the generated CFs. Further research
will include testing other DP-based methods for generating CFs, other MEA methods, and more datasets.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tramèr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Juels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ristenpart</surname>
          </string-name>
          ,
          <article-title>Stealing machine learning models via prediction APIs</article-title>
          ,
          <source>in: 25th USENIX Security Symposium (USENIX Security 16)</source>
          , USENIX Association,
          <year>2016</year>
          , pp.
          <fpage>601</fpage>
          -
          <lpage>618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shokri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stronati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shmatikov</surname>
          </string-name>
          ,
          <article-title>Membership inference attacks against machine learning models</article-title>
          ,
          <source>in: 2017 IEEE Symposium on Security and Privacy (SP)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <article-title>Differential privacy</article-title>
          ,
          <source>in: International Colloquium on Automata, Languages, and Programming</source>
          , Springer,
          <year>2006</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. B.</given-names>
            <surname>McMahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mironov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Deep learning with differential privacy</article-title>
          ,
          <source>in: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>308</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <article-title>A survey of methods for explaining black box models</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 51</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ezzeddine</surname>
          </string-name>
          ,
          <article-title>Privacy implications of explainable ai in data-driven systems</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shokri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strobel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zick</surname>
          </string-name>
          ,
          <article-title>On the privacy risks of model explanations</article-title>
          ,
          <source>in: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Spartalis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Semertzidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Daras</surname>
          </string-name>
          ,
          <article-title>Balancing xai with privacy and security considerations</article-title>
          ,
          <source>in: European Symposium on Research in Computer Security</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wachter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mittelstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Counterfactual explanations without opening the black box: Automated decisions and the gdpr</article-title>
          ,
          <source>Harv. JL &amp; Tech. 31</source>
          (
          <year>2017</year>
          )
          <fpage>841</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ezzeddine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ayoub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Giordano</surname>
          </string-name>
          ,
          <article-title>Knowledge distillation-based model extraction attack using private counterfactual explanations</article-title>
          ,
          <source>arXiv preprint arXiv:2404.03348</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>U.</given-names>
            <surname>Aïvodji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bolot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gambs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehnaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yvinec</surname>
          </string-name>
          ,
          <article-title>Model extraction from counterfactual explanations</article-title>
          ,
          <source>in: Proceedings of the 2020 conference on fairness, accountability, and transparency</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saracino</surname>
          </string-name>
          ,
          <article-title>Further insights: Balancing privacy, explainability, and utility in machine learning-based tabular data analysis</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Availability, Reliability and Security</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Oksuz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ayday</surname>
          </string-name>
          , Autolycus:
          <article-title>Exploiting explainable artificial intelligence (xai) for model extraction attacks against interpretable models</article-title>
          ,
          <source>Proceedings on Privacy Enhancing Technologies</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Towards explainable model extraction attacks</article-title>
          ,
          <source>International Journal of Intelligent Systems</source>
          <volume>37</volume>
          (
          <year>2022</year>
          )
          <fpage>9936</fpage>
          -
          <lpage>9956</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Miao</surname>
          </string-name>
          , Dualcf:
          <article-title>Efficient model extraction attack from counterfactual explanations</article-title>
          ,
          <source>in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1318</fpage>
          -
          <lpage>1329</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dissanayake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <article-title>Model reconstruction using counterfactual explanations: A perspective from polytope theory</article-title>
          ,
          <source>Advances in Neural Information Processing Systems (NeurIPS)</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>Counterfactual explanation at will, with zero privacy leakage</article-title>
          ,
          <source>Proceedings of the ACM on Management of Data</source>
          <volume>2</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Differentially private counterfactuals via functional mechanism</article-title>
          ,
          <source>arXiv preprint arXiv:2208.02878</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pentyala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kariyappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lécué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Magazzeni</surname>
          </string-name>
          ,
          <article-title>Privacy-preserving algorithmic recourse</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nemirovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thiebaut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , Countergan:
          <article-title>Generating counterfactuals for real-time recourse and interpretability using residual gans</article-title>
          ,
          <source>in: Uncertainty in Artificial Intelligence, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1488</fpage>
          -
          <lpage>1497</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Scikit-learn Developers</surname>
          </string-name>
          , California housing dataset,
          <year>2024</year>
          . URL: scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html, accessed: 2024-01-04.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>O.</given-names>
            <surname>Roesler</surname>
          </string-name>
          ,
          <article-title>EEG eye state</article-title>
          ,
          <source>UCI Machine Learning Repository</source>
          ,
          <year>2013</year>
          . URL: doi.org/10.24432/C57G7J, accessed: 2024-01-04.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>