<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Differential Privacy Approaches for Query Obfuscation in Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guglielmo Faggioli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padova</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Protecting the privacy of a user while they interact with an Information Retrieval (IR) system is crucial. This becomes more challenging when the IR system is not cooperative in satisfying the user's privacy needs. Recent advancements in Natural Language Processing (NLP) have demonstrated Differential Privacy's (DP) effectiveness in safeguarding text privacy for tasks like spam detection and sentiment analysis, even under the assumption of a non-cooperative system. Our investigation explores whether DP methods, originally designed for specific NLP tasks, can effectively obscure queries in IR. Our analyses show that the Vickrey DP mechanism, employing the Mahalanobis norm with a privacy budget ranging from ε = 10 to 12.5, provides cutting-edge privacy protection and enhances effectiveness. Unlike previous methods, DP allows users to fine-tune their desired level of privacy by adjusting the privacy budget ε. This flexibility offers a balance between how effective the system is and how much privacy is maintained, unlike the more rigid nature of previous approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        the randomness of the mechanism. DP is particularly effective in the NLP domain. A line of
research [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ] operationalizes DP to release text by obfuscating each word individually. Such
mechanisms work as follows: i) each word in the text is mapped to a non-contextual embedding
space; ii) the embeddings are perturbed with noise drawn from a specific distribution; iii) each
        ] operationalizes DP to release text by obfuscating each word individually. Such
mechanisms work as follows: i) each word in the text is mapped to a non-contextual embedding
space; ii) the embeddings are perturbed with noise drawn from a specific distribution; iii) each
word is replaced with the word closest to the noisy embedding. A major advantage of DP
is that it allows setting the privacy budget based on the needs of the user. This is different
from current obfuscation mechanisms in IR, which can only be either active or inactive and
cannot be tuned to the user's needs. In this work, we focus on three DP mechanisms: the
Calibrated Multivariate Perturbation (CMP) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the Mahalanobis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and the Vickrey [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These
approaches were originally devised and tested for NLP tasks that include text classification
and sentiment analysis. We assume the IR system to not preserve user privacy, and to possibly
be malicious. In our use case, users are the ones concerned about their privacy. They do not
want to reveal their real information needs and prefer to transmit obfuscated queries to the IR
system while still retrieving relevant documents. Therefore, to operationalize our mechanism,
we assume each user to locally obfuscate their query and transmit the obfuscated query, or
possibly multiple queries, to the IR system instead of their real query. Our goal is to determine
if the DP mechanisms introduced above can successfully obfuscate users’ information needs
while still retrieving relevant documents.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Approaches</title>
      <p>
        All the approaches described in this work are based on a relaxation of classical DP, called
Metric-DP. To achieve traditional DP in a metric space, an obfuscation mechanism should
have an equal probability of obfuscating any pair of points as the same point, irrespective of
their distance. While this grants the highest level of privacy, it also requires high levels of
noise, decreasing the utility of the data. In the case of metric spaces, it is often sufficient if
the probability of obfuscating two points with the same one is proportional to the distance
between the two points. Equivalently, the probability of sampling a certain noise is inversely
proportional to the norm of the noise itself. To this end, a relaxation of DP, called Metric-DP,
has been introduced. Metric-DP [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ] is defined as follows: given a privacy budget ε and
a distance measure d : Rⁿ × Rⁿ → [0, ∞), a randomized mechanism ℳ : Rⁿ → Rⁿ defined
over a geometric space is Metric-DP if, for any three points in the space x, x′, x̂ ∈ Rⁿ, the
following holds: P{ℳ(x) = x̂} / P{ℳ(x′) = x̂} ≤ exp(ε · d(x, x′)). If d(x, x′) is small, x and x′ are more
likely to be obfuscated with the same point. Vice versa, far-apart points might be obfuscated
with different points, without violating privacy constraints.
      </p>
      <p>
        We describe here the three major DP efforts for obfuscating text in the NLP scenario, which
we evaluate for the IR task. More in detail, these approaches take as input a sequence of
words. Each word is mapped into a non-contextual embedding, such as GloVe [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Then,
the embedding is obfuscated by adding appositely sampled noise to it. To ensure that
Metric-DP is achieved, the noise vector z is expected to be sampled from a distribution p such
that the probability of observing z is p(z) ∝ exp(−ε‖z‖), i.e., the probability of sampling a
noise with norm ‖z‖ is inversely proportional to ‖z‖. Finally, the closest word to the noisy
embedding is used to obfuscate the corresponding word in the original text. We propose to use
these approaches in the IR scenario to perturb the queries instead of the documents, as done for
NLP tasks.
      </p>
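The word-by-word pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vocabulary, embedding matrix, and `sample_noise` function are hypothetical placeholders standing in for GloVe and the DP samplers described below.

```python
import numpy as np

def obfuscate(query_terms, vocab, emb, sample_noise):
    """Word-level Metric-DP obfuscation sketch: map each term to its
    non-contextual embedding, add sampled noise, and replace the term
    with the vocabulary word closest to the noisy embedding."""
    obfuscated = []
    for w in query_terms:
        noisy = emb[vocab.index(w)] + sample_noise(emb.shape[1])
        nearest = int(np.argmin(np.linalg.norm(emb - noisy, axis=1)))
        obfuscated.append(vocab[nearest])
    return obfuscated
```

Any of the three mechanisms described next plugs in as `sample_noise`; only the noise distribution changes.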
      <p>
        The Calibrated Multivariate Perturbation (CMP) mechanism, defined by Feyisetan et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], is
based on sampling a noise vector for each term in the query following an n-dimensional Laplace
distribution. Such sampling draws two quantities: i) an n-dimensional unit vector v ∈ Rⁿ that
represents the direction of the perturbation; ii) the radius of the perturbation r ∈ R⁺, sampled
from a Gamma distribution. To sample v, a vector u ∈ Rⁿ is sampled from a multivariate
normal distribution with location 0 and identity covariance matrix I: u ∼ 𝒩(0, I). Then
v = u/‖u‖₂. The radius r of the noise is sampled from a Gamma distribution with shape n
and scale 1/ε, as r ∼ Γ(n, 1/ε). It is possible to observe that the stronger the privacy
requirement, i.e., the smaller the ε, the bigger the noise. The noise z is defined as z = r · v.
To perturb a word w, the noise vector z is added to the original word embedding ϕ(w) ∈ Rⁿ,
and the word closest to the noisy word embedding is used as obfuscation. Feyisetan
et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] demonstrate that for any word sequence of length ℓ ≥ 1 and any ε &gt; 0, CMP
satisfies ε-privacy with respect to d, where d is the Euclidean distance.
      </p>
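The CMP sampling steps can be sketched directly; this is a minimal NumPy rendition of the two-vector scheme above (direction v, radius r), not the reference code of Feyisetan et al.

```python
import numpy as np

def sample_cmp_noise(n, epsilon, rng=None):
    """Sample z with density p(z) proportional to exp(-epsilon * ||z||_2):
    a uniform direction on the unit sphere scaled by a Gamma(n, 1/epsilon) radius."""
    rng = rng or np.random.default_rng()
    u = rng.normal(size=n)                       # u ~ N(0, I)
    v = u / np.linalg.norm(u)                    # unit direction v = u / ||u||_2
    r = rng.gamma(shape=n, scale=1.0 / epsilon)  # radius r ~ Gamma(n, 1/epsilon)
    return r * v
```

Since E[r] = n/ε, shrinking the privacy budget ε inflates the expected noise norm, matching the trade-off stated above.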
      <p>
        The second mechanism investigated is the Mahalanobis (Mhl) mechanism. Xu et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] noticed
that the perturbation induced by the CMP mechanism tends to be weak, especially for high ε. They
hypothesize that sampling the direction of the perturbation on a circumference (‖v‖₂ = 1)
increases the risk of sampling a point in an empty region. Therefore, Xu et al. adapt the CMP
mechanism by transforming the direction of the noise from a circumference to an ellipse whose
orientation can be set to be towards the other embeddings. To do so, it is necessary to modify
the sampling mechanism so that, instead of sampling v such that ‖v‖₂ = 1, v is sampled so
that ‖v‖_M = 1, where ‖·‖_M is the Mahalanobis norm. To ensure that the noise z is sampled
such that its probability distribution is p(z) ∝ exp(−ε‖z‖_M), a vector u is sampled from the
multivariate normal distribution u ∼ 𝒩(0, I). Then, v is such that v = Σ^{1/2} · (u/‖u‖₂),
where Σ ∈ Rⁿˣⁿ is the covariance matrix of all the word embeddings. This forces the noise
towards more populated areas. The sampling of the norm r of the noise is the same as for CMP.
      </p>
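A minimal sketch of the modified direction sampling, assuming a precomputed matrix square root `Sigma_sqrt` of the (possibly regularized) embedding covariance Σ; the parameter names are ours, not from the original paper.

```python
import numpy as np

def sample_mahalanobis_noise(Sigma_sqrt, epsilon, rng=None):
    """Mahalanobis mechanism sketch: direction v = Sigma^{1/2} u / ||u||_2
    (noise elongated toward denser embedding regions), radius as in CMP."""
    rng = rng or np.random.default_rng()
    n = Sigma_sqrt.shape[0]
    u = rng.normal(size=n)                       # u ~ N(0, I)
    v = Sigma_sqrt @ (u / np.linalg.norm(u))     # direction on the Mahalanobis ellipse
    r = rng.gamma(shape=n, scale=1.0 / epsilon)  # same radius sampling as CMP
    return r * v
```

With Σ = I the ellipse degenerates to the unit sphere and the sampler reduces to CMP.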
      <p>
        Finally, we investigate the Vickrey (Vkr) mechanism. Mhl still tends to obfuscate a word
with itself for large ε. To reduce the probability of masking a token with itself, Xu et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
        define the Vickrey DP mechanism (we refer to it as Vkr). Vkr is based on two steps. In the
first step, a noisy vector is sampled using any of the mechanisms described above: we can
instantiate Vkr with either the Mhl mechanism (Vkr_Mhl) or the CMP mechanism (Vkr_CMP). In
the second step, with probability P the word corresponding to the closest embedding to the
noisy vector is used as the obfuscation word. Vice versa, with probability 1 − P the word
corresponding to the second closest embedding is used as obfuscation. The probability P is
defined as P(w, ẑ) = (1−t)‖ϕ(w_(2)) − ẑ‖₂ / (t‖ϕ(w_(1)) − ẑ‖₂ + (1−t)‖ϕ(w_(2)) − ẑ‖₂),
where ϕ(w_(1)) and ϕ(w_(2)) are respectively the
closest and second closest word embeddings to ẑ, the perturbed embedding of w, and t is an
additional free parameter. We set t = 0.75, being the best performing [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
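The two-step selection can be sketched as below; the vocabulary and embedding matrix are illustrative placeholders, and P follows the formula above with the default t = 0.75.

```python
import numpy as np

def vickrey_select(noisy_emb, vocab, emb, t=0.75, rng=None):
    """Vickrey selection step: return the closest word with probability P,
    the second closest with probability 1 - P, where
    P = (1-t)*d2 / (t*d1 + (1-t)*d2)."""
    rng = rng or np.random.default_rng()
    d = np.linalg.norm(emb - noisy_emb, axis=1)
    i1, i2 = np.argsort(d)[:2]  # indices of closest and second closest embeddings
    p_first = (1 - t) * d[i2] / (t * d[i1] + (1 - t) * d[i2])
    return vocab[i1] if rng.random() < p_first else vocab[i2]
```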
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>
        We consider two different collections, TREC Robust '04 and TREC Deep Learning (DL '19). As
word embeddings, we used GloVe [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] with 300 dimensions trained on the Common Crawl.
[Figure: average MiniLM sentence similarity between the original query and 20 obfuscation queries generated with different approaches.]
[…] a DP Vkr_CMP mechanism with ε ∈ [5, 10] or a Vkr_Mhl mechanism with ε ∈ [10, 12.5] for the Robust '04, and
DP Vkr_CMP and Vkr_Mhl mechanisms with ε ∈ [5, 10] for the DL '19. The privacy achieved
by AED can be achieved with ε in the range [10, 12.5] by CMP and Mhl on both collections. The ε
values that grant a comparable level of privacy are much higher for the Vkr-based mechanisms,
especially Vkr_Mhl, on both collections – this means that the Vkr mechanisms are substantially
more secure from a privacy perspective.
      </p>
      <p>As both CMP and Mhl are less effective from a privacy perspective, we focus the following
analyses on the Vkr mechanism, with ε ∈ {10, 12.5, 15}. More in detail, we compare these DP
mechanisms with AED and FSH along three axes: i) the obfuscation; ii) the pooled recall;
iii) the nDCG@10 observed if we re-rank the documents pooled by the obfuscation queries.
We define the obfuscation as 1 minus the similarity between the original query and the obfuscated
one. The pooled recall is obtained by transmitting to the IR system 20 obfuscated queries: for
each ranked list in response to an obfuscated query, we select the first 100 documents and
merge all the sets of documents obtained. We compute the recall on this new set of documents.
Finally, to compute nDCG@10, we re-rank the pooled documents using a different IR model
(we use TAS-B to avoid biasing toward any IR model) and evaluate the quality of this ranked
list. For each approach, these measures are reported on a radar plot where, as a rule of thumb,
a larger area corresponds to more desirable results. Figure 1 reports the radar plots, showing
the performance of the different obfuscation approaches over the three axes mentioned above. We
notice that the area corresponding to the AED approach (in red) is encompassed within the
area corresponding to Vkr_Mhl with ε = 15 (green). In fact, on the Robust '04 collection, AED
achieves nDCG@10 of 0.410 and 0.424 for BM25 and Contriever respectively, recall of 0.420 and
0.419, and obfuscation of 0.513. Vice versa, Vkr_Mhl with ε = 15 obtains nDCG@10 of 0.416 and
0.431, recall of 0.493 and 0.462, and obfuscation of 0.618. The exception is DL '19 with Contriever
as the IR system, where AED has higher recall than Vkr_Mhl (0.497 against 0.418). Nevertheless,
this larger recall does not correspond to a much larger nDCG@10, indicating that Vkr_Mhl is
preferable over AED, as it has comparable nDCG@10 (0.604 for Vkr_Mhl against 0.607 for AED),
with improved obfuscation (0.785 against 0.491). When it comes to FSH (purple), the behaviour
depends on the collection. On DL '19, using Vkr_Mhl with ε = 10 (blue) provides an edge
over FSH: they have comparable obfuscation (0.916 the former, 0.923 the latter), but Vkr_Mhl has
a much larger nDCG@10 (0.254 compared to 0.064). On the Robust '04 collection, to overcome
FSH in terms of nDCG@10 (0.140 and 0.194), it is necessary to use Vkr_Mhl with ε = 12.5
(nDCG@10 of 0.349 and 0.355 for BM25 and Contriever respectively). While Vkr_Mhl with
ε = 12.5 exhibits nDCG@10 performance slightly lower than AED, its obfuscation (0.719) is
relatively close to that of FSH (0.797), much closer than AED's (0.513). As a general guideline,
we propose to use Vkr_Mhl as the obfuscation mechanism, with ε chosen in the interval [10, 15],
depending on the optimal trade-off between privacy and effectiveness, as chosen by the user.
      </p>
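The pooled-recall computation described above amounts to a small set operation; the document identifiers below are illustrative.

```python
def pooled_recall(ranked_lists, relevant, k=100):
    """Union the top-k documents retrieved for each obfuscated query,
    then compute recall of that pool against the relevant set."""
    pool = set()
    for ranking in ranked_lists:
        pool.update(ranking[:k])
    return len(pool & relevant) / len(relevant)
```

With 20 rankings (one per obfuscated query) and k = 100, this reproduces the pooled recall reported on the radar plots.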
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>In this work, we analyzed for the first time the performance of three DP mechanisms, originally
designed for NLP, on the query obfuscation IR task. We evaluated these mechanisms in the
IR setting by considering three aspects: their obfuscation capabilities, their effectiveness in
terms of recall, and their ability to retrieve highly relevant documents. Our findings
highlight that the Vickrey mechanism with ε ∈ [10, 12.5] achieves higher privacy guarantees,
with improved effectiveness, than current state-of-the-art approaches. Furthermore, lower
or higher values of ε allow the user to be better satisfied, either in terms of privacy or accuracy,
depending on their inclinations. As future work, we plan to investigate how to perturb dense
representations of the queries and to combine them with generative language models to produce
obfuscation queries with the same dense representation but different terms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <article-title>Query Obfuscation for Information Retrieval through Differential Privacy</article-title>
          , in: Procs.
          <source>of ECIR</source>
          <year>2024</year>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arampatzis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Efraimidis</surname>
          </string-name>
          , G. Drosatos,
          <article-title>Enhancing deniability against query-logs</article-title>
          ,
          <source>in: Procs. of ECIR</source>
          <year>2011</year>
          ,
          <year>2011</year>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. O.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <article-title>Efficient query obfuscation with keyqueries</article-title>
          ,
          <source>in: Procs. of WI-IAT '21</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>154</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>The algorithmic foundations of differential privacy</article-title>
          ,
          <source>Found. Trends Theor. Comput. Sci. 9</source>
          (
          <year>2014</year>
          )
          <fpage>211</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Feyisetan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Balle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Drake</surname>
          </string-name>
          , T. Diethe,
          <article-title>Privacy- and utility-preserving textual analysis via calibrated multivariate perturbations</article-title>
          ,
          <source>in: Procs. of WSDM '20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>178</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Feyisetan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Teissier</surname>
          </string-name>
          ,
          <article-title>A differentially private text perturbation method using regularized mahalanobis metric</article-title>
          , in: Procs. of the Second Workshop on Privacy in NLP,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Feyisetan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Teissier</surname>
          </string-name>
          ,
          <article-title>On a utilitarian approach to privacy preserving text generation</article-title>
          ,
          <source>CoRR abs/2104.11838</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Andrés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bordenabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatzikokolakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Palamidessi</surname>
          </string-name>
          ,
          <article-title>Geo-indistinguishability: differential privacy for location-based systems</article-title>
          , in: A.
          <string-name>
            <surname>Sadeghi</surname>
            ,
            <given-names>V. D.</given-names>
          </string-name>
          <string-name>
            <surname>Gligor</surname>
          </string-name>
          , M. Yung (Eds.),
          <source>2013 ACM SIGSAC Conference on Computer and Communications Security, CCS'13</source>
          , Berlin, Germany, November 4-8
          ,
          <year>2013</year>
          , ACM,
          <year>2013</year>
          , pp.
          <fpage>901</fpage>
          -
          <lpage>914</lpage>
          . doi:10.1145/2508859.2516735.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatzikokolakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Andrés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bordenabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Palamidessi</surname>
          </string-name>
          ,
          <article-title>Broadening the scope of differential privacy using metrics</article-title>
          , in: E. D.
          <string-name>
            <surname>Cristofaro</surname>
          </string-name>
          , M. K. Wright (Eds.),
          Privacy Enhancing Technologies - 13th
          <source>International Symposium, PETS</source>
          <year>2013</year>
          ,
          Bloomington, IN, USA, July 10-12
          ,
          <year>2013</year>
          . Proceedings, volume
          <volume>7981</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2013</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>102</lpage>
          . URL: https://doi.org/10.1007/978-3-642-39077-7_5. doi:10.1007/978-3-642-39077-7_5.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Laud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pankova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pettai</surname>
          </string-name>
          ,
          <article-title>A framework of metrics for differential privacy from local sensitivity</article-title>
          ,
          <source>Proc. Priv. Enhancing Technol</source>
          .
          <year>2020</year>
          (
          <year>2020</year>
          )
          <fpage>175</fpage>
          -
          <lpage>208</lpage>
          . doi:10.2478/popets-2020-0023.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Glove:
          <article-title>Global vectors for word representation</article-title>
          ,
          <source>in: Procs. of EMNLP</source>
          <year>2014</year>
          ,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Caron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave,
          <article-title>Unsupervised dense information retrieval with contrastive learning</article-title>
          ,
          <source>Trans. Mach. Learn. Res</source>
          . (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers</article-title>
          , in: Procs. of NeurIPS '20,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>