<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluating Differential Privacy Approaches for Query Obfuscation in Information Retrieval (Discussion Paper)</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Guglielmo</forename><surname>Faggioli</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Padova</orgName>
								<address>
									<settlement>Padova</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicola</forename><surname>Ferro</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Padova</orgName>
								<address>
									<settlement>Padova</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluating Differential Privacy Approaches for Query Obfuscation in Information Retrieval (Discussion Paper)</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">8A27DC8F3ABCFC8FEBA7261EB89AF106</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:23+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Protecting the privacy of a user while they interact with an Information Retrieval (IR) system is crucial. This becomes more challenging when the IR system is not cooperative in satisfying the user's privacy needs. Recent advancements in Natural Language Processing (NLP) have demonstrated the effectiveness of Differential Privacy (DP) in safeguarding text privacy for tasks such as spam detection and sentiment analysis, even under the assumption of a non-cooperative system. Our investigation explores whether DP methods, originally designed for specific NLP tasks, can effectively obscure queries in IR. Our analyses show that the Vickrey DP mechanism, employing the Mahalanobis norm with a privacy budget ranging from 𝜖 = 10 to 12.5, provides state-of-the-art privacy protection with improved effectiveness. Unlike previous methods, DP allows users to fine-tune their desired level of privacy by adjusting the privacy budget 𝜖. This flexibility offers a tunable balance between system effectiveness and privacy, in contrast to the more rigid nature of previous approaches.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Information Retrieval (IR) systems are a commodity used for many tasks, including searching for personal information, such as symptoms and diseases, political opinions, or egosurfing. Such searches can be used to profile the user and can put their privacy at risk. For example, an insurance company might try to access the user's queries to determine whether they have any disease, or a malicious employee of a search engine might access the query log to blackmail them. To alleviate this, obfuscation approaches hide the sensitive information need by breaking it down into multiple non-sensitive queries. To this end, some approaches rely on replacing words with generalizations, i.e., hypernyms <ref type="bibr" target="#b1">[2]</ref>. Other strategies use a local corpus to determine which words, by co-occurring in the documents with those in the query, induce the same ranked list <ref type="bibr" target="#b2">[3]</ref>. We investigate for the first time whether Differential Privacy (DP) mechanisms, originally designed for specific Natural Language Processing (NLP) tasks, can effectively be used in IR to obfuscate queries. DP <ref type="bibr" target="#b3">[4]</ref> is a state-of-the-art framework meant to privately release sensitive information. The general idea is to use a randomized mechanism that introduces noise into the computation. Thanks to this, the user can "plausibly deny" the output: it is impossible to prove that the output corresponds to the input of the user rather than to the randomness of the mechanism. DP is particularly effective in the NLP domain. A line of research <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref> operationalizes DP to release text by obfuscating each word individually. 
Such mechanisms work as follows: i) each word in the text is mapped to a non-contextual embedding space; ii) the embeddings are perturbed with noise drawn from a specific distribution; iii) each word is replaced with the word closest to its noisy embedding. A major advantage of DP is that it allows setting the privacy budget based on the needs of the user. This differs from current obfuscation mechanisms in IR, which are either active or inactive and cannot be tuned to the user's needs. In this work, we focus on three DP mechanisms: the Calibrated Multivariate Perturbation (CMP) <ref type="bibr" target="#b4">[5]</ref>, the Mahalanobis <ref type="bibr" target="#b5">[6]</ref>, and the Vickrey <ref type="bibr" target="#b6">[7]</ref> mechanisms. These approaches were originally devised and tested for NLP tasks such as text classification and sentiment analysis. We assume that the IR system does not preserve user privacy and might even be malicious. In our use case, users are the ones concerned about their privacy. They do not want to reveal their real information needs and prefer to transmit obfuscated queries to the IR system while still retrieving relevant documents. Therefore, to operationalize our mechanism, we assume each user to locally obfuscate their query and transmit the obfuscated query, or possibly multiple queries, to the IR system instead of their real one. Our goal is to determine whether the DP mechanisms introduced above can successfully obfuscate users' information needs while still retrieving relevant documents.</p></div>
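The three-step word-level obfuscation pipeline described above can be sketched as follows. This is a minimal toy example, not the authors' code: the three-word embedding table stands in for real GloVe vectors, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-dimensional embedding table standing in for GloVe.
VOCAB = {
    "flu":     np.array([0.9, 0.1, 0.0]),
    "fever":   np.array([0.8, 0.2, 0.1]),
    "cough":   np.array([0.7, 0.3, 0.0]),
    "weather": np.array([0.0, 0.9, 0.4]),
}

def sample_noise(n, epsilon):
    """CMP-style noise: uniform direction times a Gamma(n, 1/eps) radius."""
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=n, scale=1.0 / epsilon)
    return direction * radius

def obfuscate_word(word, epsilon):
    # Steps i) and ii): map to embedding, perturb with sampled noise.
    noisy = VOCAB[word] + sample_noise(3, epsilon)
    # Step iii): replace with the vocabulary word nearest to the noisy vector.
    return min(VOCAB, key=lambda w: np.linalg.norm(VOCAB[w] - noisy))

def obfuscate_query(query, epsilon):
    return [obfuscate_word(w, epsilon) for w in query]

print(obfuscate_query(["flu", "cough"], epsilon=0.5))   # strong noise
print(obfuscate_query(["flu", "cough"], epsilon=50.0))  # weak noise
```

A small 𝜖 produces large noise radii and therefore frequent word replacements; a large 𝜖 tends toward the identity mapping, mirroring the behaviour discussed in Section 3.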
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Approaches</head><p>All the approaches described in this work are based on a relaxation of classical DP, called Metric-DP. To achieve traditional DP in a metric space, an obfuscation mechanism should have an equal probability of obfuscating any pair of points as the same point, irrespective of their distance. While this grants the highest level of privacy, it also requires high levels of noise, decreasing the utility of the data. In the case of metric spaces, it is often sufficient that the probability of obfuscating two points with the same one be proportional to the distance between the two points. Equivalently, the probability of sampling a certain noise vector decreases with the norm of the noise itself. To this end, a relaxation of DP, called Metric-DP, has been introduced. Metric-DP <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref> is defined as follows: given a privacy budget 𝜖 and a distance measure 𝑑 : R^𝑝 × R^𝑝 → [0, ∞), a randomized mechanism ℳ : R^𝑝 → R^𝑝 defined over a geometric space is Metric-DP iff, for any three points 𝑤, 𝑤′, 𝑤̂ ∈ R^𝑝, the following holds: 𝑃𝑟{ℳ(𝑤) = 𝑤̂} / 𝑃𝑟{ℳ(𝑤′) = 𝑤̂} ≤ exp(𝜖 𝑑(𝑤, 𝑤′)). If 𝑑(𝑤, 𝑤′) is small, 𝑤 and 𝑤′ are more likely to be obfuscated with the same point. Vice versa, far-apart points might be obfuscated with different points without violating the privacy constraints.</p><p>We describe here the three major DP efforts for obfuscating text in the NLP scenario, which we evaluate for the IR task. In more detail, these approaches take as input a sequence of words. Each word is mapped into a non-contextual embedding, such as GloVe <ref type="bibr" target="#b10">[11]</ref>. Then, the embedding is obfuscated by adding some appropriately sampled noise to it. 
To ensure that Metric-DP is achieved, the noise vector 𝑧 is expected to be sampled from a distribution 𝑓 such that the probability of observing 𝑧 is 𝑓(𝑧) ∝ exp(−𝜖||𝑧||), i.e., the probability of sampling a noise vector decreases exponentially with its norm ||𝑧||. Finally, the closest word to the noisy embedding is used to obfuscate the corresponding word in the original text. We propose to use these approaches in the IR scenario to perturb the queries, instead of the documents as done for NLP tasks.</p><p>The Calibrated Multivariate Perturbation (CMP) mechanism, defined by Feyisetan et al. <ref type="bibr" target="#b4">[5]</ref>, samples a noise vector for each term in the query following an n-dimensional Laplace distribution. Such sampling draws two components: i) an n-dimensional unit vector 𝑝 ∈ R^𝑛 that represents the direction of the perturbation; ii) the radius of the perturbation 𝑟 ∈ R+, sampled from a Gamma distribution. To sample 𝑝, a vector 𝑁 ∈ R^𝑛 is drawn from a multivariate normal distribution with location 0 and identity covariance matrix I𝑛, i.e., 𝑁 ∼ 𝒩(0, I𝑛); then 𝑝 = 𝑁/||𝑁||2. The radius 𝑟 of the noise is sampled from a Gamma distribution with shape 𝑛 and scale 1/𝜖, i.e., 𝑟 ∼ 𝐺𝑎𝑚(𝑛, 1/𝜖). Observe that the stricter the privacy requirement, i.e., the smaller the 𝜖, the bigger the noise. The noise 𝑧 is defined as 𝑧 = 𝑝 · 𝑟. To perturb a word 𝑤, the noise vector 𝑧 is added to the original word embedding 𝜑(𝑤) ∈ R^𝑛, and the word closest to the noisy word embedding is used as obfuscation. Feyisetan et al. <ref type="bibr" target="#b4">[5]</ref> demonstrate that for any word sequence 𝒲𝑙 of length 𝑙 ≥ 1 and any 𝜖 &gt; 0, CMP satisfies 𝜖𝑑-privacy, where 𝑑 is the Euclidean distance.</p><p>The second mechanism investigated is the Mahalanobis (Mhl) mechanism. Xu et al. 
<ref type="bibr" target="#b5">[6]</ref> noticed that the perturbation induced by the CMP mechanism tends to be weak, especially for high 𝜖. They hypothesize that sampling the direction of the perturbation on a circumference (||𝑝||2 = 1) increases the risk of sampling a point in an empty region. Therefore, Xu et al. adapt the CMP mechanism by transforming the support of the noise direction from a circumference into an ellipse whose orientation can be set towards the other embeddings. To do so, it is necessary to modify the sampling mechanism so that, instead of sampling 𝑝 such that ||𝑝||2 = 1, 𝑝 is sampled so that ||𝑝||𝑀 = 1, where || · ||𝑀 is the Mahalanobis norm. To ensure that the noise 𝑧 is sampled such that its probability distribution is 𝑓(𝑧) ∝ exp(−𝜖||𝑧||𝑀), a vector 𝑁 is sampled from the multivariate normal distribution 𝑁 ∼ 𝒩(0, I𝑛); then 𝑝 = Σ^{1/2} · (𝑁/||𝑁||2), where Σ ∈ R^{𝑛×𝑛} is the covariance matrix of all the word embeddings. This forces the noise towards more populated areas. The sampling of the norm 𝑟 of the noise is the same as for CMP.</p><p>Finally, we investigate the Vickrey (Vkr) mechanism. The Mhl mechanism still tends to obfuscate a word with itself for large 𝜖. To reduce the probability of masking a token with itself, Xu et al. <ref type="bibr" target="#b6">[7]</ref> define the Vickrey DP mechanism (we refer to it as Vkr). Vkr is based on two steps. In the first step, a noisy vector is sampled using any of the mechanisms described above: we can instantiate Vkr with either the Mhl mechanism (Vkr𝑀ℎ𝑙) or the CMP mechanism (Vkr𝐶𝑀𝑃). In the second step, with probability 𝑃𝑟 the word corresponding to the closest embedding to the noisy vector is used as the obfuscation word; with probability 1 − 𝑃𝑟, the word corresponding to the second closest embedding is used instead. 
The probability 𝑃𝑟 is defined as</p><formula xml:id="formula_0">𝑃𝑟(𝑡, 𝑣̂) = (1 − 𝑡)||𝜑(𝑢2) − 𝑣̂||2 / (𝑡||𝜑(𝑢1) − 𝑣̂||2 + (1 − 𝑡)||𝜑(𝑢2) − 𝑣̂||2),</formula><p>where 𝜑(𝑢1) and 𝜑(𝑢2) are, respectively, the closest and second closest word embeddings to 𝑣̂, the perturbed embedding of 𝑤, and 𝑡 is an additional free parameter. We set 𝑡 = 0.75, the best-performing value reported in <ref type="bibr" target="#b6">[7]</ref>.</p></div>
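The Mahalanobis rescaling and the Vickrey two-candidate selection described above can be sketched as follows. This is an assumed toy implementation, not the authors' code: the random embedding matrix and helper names are illustrative, and the Cholesky factor is used as a square root of the embedding covariance Σ.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = rng.normal(size=(50, 8))          # toy embedding matrix: 50 words, dim 8

def mahalanobis_noise(epsilon, n=8):
    """Mhl noise: direction rescaled by Sigma^(1/2), Gamma(n, 1/eps) radius."""
    cov = np.cov(EMB, rowvar=False)                     # Sigma of all embeddings
    sqrt_cov = np.linalg.cholesky(cov + 1e-9 * np.eye(n))
    direction = rng.normal(size=n)
    direction = sqrt_cov @ (direction / np.linalg.norm(direction))
    radius = rng.gamma(shape=n, scale=1.0 / epsilon)
    return direction * radius

def vickrey_obfuscate(word_idx, epsilon, t=0.75):
    """Vkr_Mhl: pick the first or second nearest word with probability Pr."""
    noisy = EMB[word_idx] + mahalanobis_noise(epsilon)
    dists = np.linalg.norm(EMB - noisy, axis=1)
    u1, u2 = np.argsort(dists)[:2]      # closest and second closest words
    d1, d2 = dists[u1], dists[u2]
    # Pr(t, v_hat) from the formula above.
    pr = (1 - t) * d2 / (t * d1 + (1 - t) * d2)
    return u1 if rng.random() < pr else u2

print(vickrey_obfuscate(word_idx=0, epsilon=10.0))
```

Note the limiting behaviour: when the noisy vector lands on a word exactly (d1 = 0), Pr = 1 and the nearest word is always chosen; otherwise the second-nearest word keeps a non-zero probability, which is what reduces self-obfuscation for large 𝜖.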
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Evaluation</head><p>We consider two collections: TREC Robust '04 and TREC Deep Learning '19 (DL '19). As word embeddings, we use GloVe <ref type="bibr" target="#b10">[11]</ref> with 300 dimensions, trained on the Common Crawl. In terms of retrieval models, we consider a sparse bag-of-words model, BM25, and a dense bi-encoder, Contriever <ref type="bibr" target="#b11">[12]</ref>. As baselines, we compare the aforementioned DP approaches with two non-DP obfuscation approaches devised explicitly for the IR task: the seminal work by Arampatzis et al. <ref type="bibr" target="#b1">[2]</ref>, labeled AED, and the recent state-of-the-art solution by Fröbe et al. <ref type="bibr" target="#b2">[3]</ref>, labeled FSH. For each approach and for each query, we generate 20 obfuscation queries. Table <ref type="table" target="#tab_0">1</ref> shows, as a proxy of the privacy achieved by the mechanisms, the average similarity between the original query and the obfuscation queries generated to hide it. We compute the similarity as the dot product between the MiniLM <ref type="bibr" target="#b12">[13]</ref> representations of the original query and the obfuscated ones. As expected from a DP mechanism, the higher the 𝜖, the higher the similarity between the queries: with 𝜖 = 50, on both Robust '04 and DL '19, CMP and Mhl achieve a similarity higher than 95%. This indicates that the generated queries are almost identical to the original ones and there is no substantial privacy protection. 
FSH, which explicitly removes synonyms and hypernyms from the queries, is particularly safe: its privacy corresponds to that of a DP Vkr𝐶𝑀𝑃 mechanism with 𝜖 ∈ [5, 10] or a Vkr𝑀ℎ𝑙 mechanism with 𝜖 ∈ [10, 12.5] on Robust '04, and to that of the Vkr𝐶𝑀𝑃 and Vkr𝑀ℎ𝑙 mechanisms with 𝜖 ∈ [5, 10] on DL '19. The privacy achieved by AED can be matched with 𝜖 in the range [10, 12.5] by CMP and Mhl on both collections. The 𝜖 values that grant a comparable level of privacy are much higher for the Vkr-based mechanisms, especially Vkr𝑀ℎ𝑙, on both collections: this means that the Vkr mechanisms are substantially more secure from a privacy perspective.</p><p>As both CMP and Mhl are less effective from a privacy perspective, we focus the following analyses on the Vkr mechanisms, with 𝜖 ∈ {10, 12.5, 15}. In more detail, we compare these DP mechanisms with AED and FSH along three axes: i) the obfuscation; ii) the pooled recall; iii) the nDCG@10 observed when re-ranking the documents pooled by the obfuscation queries. We define the obfuscation as 1 minus the similarity between the original query and the obfuscated one. The pooled recall is obtained by transmitting 20 obfuscated queries to the IR system: from each ranked list returned in response to an obfuscated query, we select the first 100 documents and merge all the resulting sets of documents. We compute the recall on this new set of documents. Finally, to compute nDCG@10, we rerank the pooled documents using a different IR model (TAS-B, to avoid biasing toward any of the IR models under evaluation) and evaluate the quality of this ranked list. For each approach, these measures are reported on a radar plot where, as a rule of thumb, a larger area corresponds to more desirable results. 
Figure <ref type="figure" target="#fig_1">1</ref> reports the radar plots, showing the performance of the different obfuscation approaches over the three axes mentioned above. We notice that the area corresponding to the AED approach (in red) is encompassed within the area corresponding to Vkr𝑀ℎ𝑙 with 𝜖 = 15 (green). In fact, on the Robust '04 collection, AED achieves nDCG@10 of 0.410 and 0.424 for BM25 and Contriever respectively, recall of 0.420 and 0.419, and obfuscation of 0.513, while Vkr𝑀ℎ𝑙 with 𝜖 = 15 obtains nDCG@10 of 0.416 and 0.431, recall of 0.493 and 0.462, and obfuscation of 0.618. The exception is DL '19 with Contriever as the IR system, where AED has higher recall than Vkr𝑀ℎ𝑙 (0.497 against 0.418). Nevertheless, this larger recall does not translate into much larger nDCG@10, indicating that Vkr𝑀ℎ𝑙 is preferable over AED: it has comparable nDCG@10 (0.604 against 0.607 for AED) with improved obfuscation (0.785 against 0.491). When it comes to FSH (purple), the behaviour depends on the collection. On DL '19, using Vkr𝑀ℎ𝑙 with 𝜖 = 10 (blue) provides an edge over FSH: they have comparable obfuscation (0.916 the former, 0.923 the latter), but Vkr𝑀ℎ𝑙 has much larger nDCG@10 (0.254 compared to 0.064). On the Robust '04 collection, Vkr𝑀ℎ𝑙 with 𝜖 = 12.5 is needed to overcome FSH in terms of nDCG@10 (0.349 and 0.355 for BM25 and Contriever respectively, against 0.140 and 0.194). While Vkr𝑀ℎ𝑙 with 𝜖 = 12.5 exhibits slightly lower nDCG@10 than AED, its obfuscation (0.719) is relatively close to that of FSH (0.797) and much higher than that of AED (0.513). 
As a general guideline, we propose to use Vkr𝑀ℎ𝑙 as the obfuscation mechanism, with 𝜖 chosen in the interval [10, 15] according to the trade-off between privacy and effectiveness preferred by the user.</p></div>
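The pooled-recall protocol described in this section can be sketched as follows. Helper names are ours, and `retrieve` stands in for any IR system returning a ranked list of document ids; the paper pools the top 100 documents of 20 obfuscated queries.

```python
def pooled_recall(obfuscated_queries, retrieve, relevant_docs, depth=100):
    """retrieve(query) -> ranked list of doc ids; relevant_docs: set of ids.

    Pools the top-`depth` documents of every obfuscated query's ranked list,
    then computes the recall of the pool against the relevant set.
    """
    pool = set()
    for q in obfuscated_queries:
        pool.update(retrieve(q)[:depth])
    return len(pool & relevant_docs) / len(relevant_docs)

# Toy usage with a stand-in retrieval function over a 10-document corpus.
corpus = {f"d{i}" for i in range(10)}
fake_retrieve = lambda q: sorted(corpus)        # dummy fixed ranked list
rel = {"d1", "d3", "d5"}
print(pooled_recall(["q'1", "q'2"], fake_retrieve, rel, depth=5))
```

In the paper's pipeline, the resulting pool is then re-ranked with a separate model (TAS-B) against the original query to compute nDCG@10; that step is omitted here.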
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion and Future Work</head><p>In this work, we analyzed for the first time the performance of three DP mechanisms, originally designed for NLP, on the query obfuscation IR task. We evaluated these mechanisms in the IR setting by considering three aspects: their obfuscation capabilities, their effectiveness in terms of recall, and their ability to retrieve highly relevant documents. Our findings highlight that the Vickrey mechanism with 𝜖 ∈ [10, 12.5] achieves higher privacy guarantees, with improved effectiveness, than current state-of-the-art approaches. Furthermore, lower or higher values of 𝜖 allow better satisfying the user in terms of either privacy or accuracy, depending on their inclination. As future work, we plan to investigate how to perturb dense representations of the queries and to combine them with generative language models to produce obfuscation queries with the same dense representation but different terms.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Performance of different obfuscation mechanisms over three axes: pooled recall, nDCG@10 of the reranked documents, obfuscation (obf), measured as 1-similarity. Cnt. stands for "Contriever".</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Average MiniLM sentence similarity between the original query and 20 obfuscation queries generated with different approaches.</figDesc><table><row><cell></cell><cell cols="9">Robust '04</cell><cell cols="9">DL '19</cell></row><row><cell>𝜖</cell><cell>1</cell><cell>5</cell><cell>10</cell><cell>12.5</cell><cell>15</cell><cell>17.5</cell><cell>20</cell><cell>50</cell><cell>No DP</cell><cell>1</cell><cell>5</cell><cell>10</cell><cell>12.5</cell><cell>15</cell><cell>17.5</cell><cell>20</cell><cell>50</cell><cell>No DP</cell></row><row><cell>CMP</cell><cell>0.074</cell><cell>0.100</cell><cell>0.396</cell><cell>0.672</cell><cell>0.871</cell><cell>0.961</cell><cell>0.987</cell><cell>0.996</cell><cell></cell><cell>0.024</cell><cell>0.032</cell><cell>0.214</cell><cell>0.458</cell><cell>0.681</cell><cell>0.824</cell><cell>0.903</cell><cell>0.952</cell><cell></cell></row><row><cell>Mhl</cell><cell>0.077</cell><cell>0.095</cell><cell>0.244</cell><cell>0.427</cell><cell>0.627</cell><cell>0.794</cell><cell>0.907</cell><cell>0.996</cell><cell></cell><cell>0.020</cell><cell>0.034</cell><cell>0.119</cell><cell>0.241</cell><cell>0.427</cell><cell>0.610</cell><cell>0.750</cell><cell>0.951</cell><cell></cell></row><row><cell>Vkr𝐶𝑀𝑃</cell><cell>0.077</cell><cell>0.100</cell><cell>0.278</cell><cell>0.412</cell><cell>0.511</cell><cell>0.578</cell><cell>0.622</cell><cell>0.760</cell><cell></cell><cell>0.028</cell><cell>0.032</cell><cell>0.137</cell><cell>0.211</cell><cell>0.308</cell><cell>0.372</cell><cell>0.413</cell><cell>0.565</cell><cell></cell></row><row><cell>Vkr𝑀ℎ𝑙</cell><cell>0.076</cell><cell>0.096</cell><cell>0.188</cell><cell>0.282</cell><cell>0.382</cell><cell>0.472</cell><cell>0.533</cell><cell>0.746</cell><cell></cell><cell>0.023</cell><cell>0.026</cell><cell>0.084</cell><cell>0.149</cell><cell>0.215</cell><cell>0.284</cell><cell>0.333</cell><cell>0.553</cell><cell></cell></row><row><cell>AED</cell><cell cols="8"></cell><cell>0.487</cell><cell cols="8"></cell><cell>0.509</cell></row><row><cell>FSH</cell><cell cols="8"></cell><cell>0.203</cell><cell cols="8"></cell><cell>0.077</cell></row></table></figure>
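The similarity proxy underlying Table 1 (and its complement, the obfuscation measure) can be sketched as follows. This is a toy illustration: a deterministic bag-of-words encoder stands in for the MiniLM sentence representations actually used, and all helper names are ours.

```python
import numpy as np

def encode(query, dim=64):
    """Toy deterministic sentence encoder (stand-in for MiniLM),
    returning a unit-norm vector so the dot product acts like a cosine."""
    vec = np.zeros(dim)
    for word in query.split():
        word_rng = np.random.default_rng(abs(hash(word)) % (2**32))
        vec += word_rng.normal(size=dim)
    return vec / (np.linalg.norm(vec) + 1e-12)

def obfuscation(original, obfuscated_queries):
    """1 minus the average dot-product similarity between the original
    query and its obfuscation queries, as in Section 3."""
    sims = [encode(original) @ encode(q) for q in obfuscated_queries]
    return 1.0 - float(np.mean(sims))

print(obfuscation("flu symptoms", ["flu symptoms"]))   # identical: ~0, no privacy
print(obfuscation("flu symptoms", ["weather today"]))  # unrelated: close to 1
```

High similarity (obfuscation near 0), as observed for CMP and Mhl at 𝜖 = 50, means the obfuscated queries leak the original information need almost unchanged.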
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Query Obfuscation for Information Retrieval through Differential Privacy</title>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Procs. of ECIR 2024</title>
				<meeting>Procs. of ECIR 2024</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Enhancing deniability against query-logs</title>
		<author>
			<persName><forename type="first">A</forename><surname>Arampatzis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Efraimidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Drosatos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Procs. of ECIR 2011</title>
				<meeting>Procs. of ECIR 2011</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="117" to="128" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Efficient query obfuscation with keyqueries</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">O</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Procs. of WI-IAT &apos;21</title>
				<meeting>Procs. of WI-IAT &apos;21</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="154" to="161" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The algorithmic foundations of differential privacy</title>
		<author>
			<persName><forename type="first">C</forename><surname>Dwork</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Found. Trends Theor. Comput. Sci</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="211" to="407" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Privacy-and utility-preserving textual analysis via calibrated multivariate perturbations</title>
		<author>
			<persName><forename type="first">O</forename><surname>Feyisetan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Balle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Drake</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Diethe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Procs. of WSDM &apos;20</title>
				<meeting>Procs. of WSDM &apos;20</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="178" to="186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A differentially private text perturbation method using regularized mahalanobis metric</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aggarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Feyisetan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Teissier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Procs. of the Second Workshop on Privacy in NLP</title>
				<meeting>Procs. of the Second Workshop on Privacy in NLP</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">On a utilitarian approach to privacy preserving text generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aggarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Feyisetan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Teissier</surname></persName>
		</author>
		<idno>CoRR abs/2104.11838</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Geo-indistinguishability: differential privacy for location-based systems</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Andrés</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bordenabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chatzikokolakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Palamidessi</surname></persName>
		</author>
		<idno type="DOI">10.1145/2508859.2516735</idno>
	</analytic>
	<monogr>
		<title level="m">ACM SIGSAC Conference on Computer and Communications Security, CCS&apos;13</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Sadeghi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><forename type="middle">D</forename><surname>Gligor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Yung</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2013-11-04">November 4-8, 2013</date>
			<biblScope unit="page" from="901" to="914" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Broadening the scope of differential privacy using metrics</title>
		<author>
			<persName><forename type="first">K</forename><surname>Chatzikokolakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Andrés</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bordenabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Palamidessi</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-642-39077-7_5</idno>
	</analytic>
	<monogr>
		<title level="m">Privacy Enhancing Technologies -13th International Symposium, PETS 2013</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Cristofaro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Wright</surname></persName>
		</editor>
		<meeting><address><addrLine>Bloomington, IN, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">July 10-12, 2013</date>
			<biblScope unit="volume">7981</biblScope>
			<biblScope unit="page" from="82" to="102" />
		</imprint>
	</monogr>
	<note>Proceedings</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A framework of metrics for differential privacy from local sensitivity</title>
		<author>
			<persName><forename type="first">P</forename><surname>Laud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pankova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pettai</surname></persName>
		</author>
		<idno type="DOI">10.2478/popets-2020-0023</idno>
	</analytic>
	<monogr>
		<title level="j">Proc. Priv. Enhancing Technol</title>
		<imprint>
			<biblScope unit="volume">2020</biblScope>
			<biblScope unit="page" from="175" to="208" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Glove: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Procs. of EMNLP 2014</title>
				<meeting>Procs. of EMNLP 2014</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Unsupervised dense information retrieval with contrastive learning</title>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Caron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Trans. Mach. Learn. Res</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Procs. of NeurIPS &apos;20</title>
				<meeting>Procs. of NeurIPS &apos;20</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
