Inference-proof Data Filtering for a Probabilistic Setting*

J. Biskup¹, P.A. Bonatti², C. Galdi², and L. Sauro²
¹ Fakultät für Informatik, Technische Universität Dortmund
² Dip. Ing. Elet. e Tecnologie dell'Informazione, Università di Napoli Federico II

* This work has been partially supported by the European Union's Horizon 2020 research and innovation programme under grant agreement N. 731601.

Abstract. In querying semantic data, access control must take into account the information that is implicitly entailed by the accessible part of triple stores and ontologies. While there exist inference control frameworks for this purpose, they still have a limitation: the confidentiality criterion does not take into account the probabilistic knowledge of the attacker. Therefore, the existing controlled query evaluation methods may return answers that actually reveal that a secret is true with very high probability. Given that such probabilistic knowledge is becoming more and more widely available as a result of analytics of various sorts, it is important to develop a refined confidentiality framework where probabilistic knowledge is taken into due account. Accordingly, in this paper, we extend and generalize an abstract data filtering framework for confidentiality-preserving, policy-based data publishing. The confidentiality requirement is strengthened so that the probability that a secret is true is bounded by a small constant ε. We formally define such a probabilistic setting, then we study two greedy data publishing methods based on refusals and lies, respectively. The refusal-based method is proved to be secure and maximally cooperative among a class of "reasonable" methods. We prove also that the natural generalization of the lying method is not secure. Furthermore, we extend the complexity hardness results from the deterministic framework to the probabilistic one.

Keywords: A priori knowledge, Confidentiality criterion, Confidentiality-preserving data publishing, Cooperativeness, Inference control, Lying, Privacy, Probabilistic methods, Refusal

1 Introduction

The need for inference-proof data publishing was recognized well before the semantic web was born. Still, semantic (meta)data, which are expressed in standardized knowledge representation languages such as RDFS and OWL, are by design well-suited to the automated derivation of implied data. The applicable inference engines are becoming more and more powerful and scalable. This makes the need for inference-proof access control techniques particularly crucial in this area.

The literature is rich in data filtering methods for confidentiality-preserving data publishing (cf. [15]) that achieve secrecy properties of different kinds and strengths. In recent work [4], we have proposed an abstract framework that covers a wide range of deterministic filtering methods while not relying on any particular data model or query representation. Within this framework, we studied possibilistic secrecy with respect to an observer-specific confidentiality policy which consists of (potential) secrets in the form of yes–no queries. For no secret query should a user be able to derive a "yes" answer, no matter how powerful the rational reasoning method exploited by the user and no matter how many resources are employed. In this framework, confidentiality relies on the assumption that one counterexample suffices to prevent the user from believing that the secret is true. However, reality may be different.
The wide range of analytics tools for semantic data provides rich statistical information about the real world, which may make the available counterexamples unlikely. Moreover, if the majority of the interpretations of a knowledge base agree that a secret is true, the user may be inclined to believe it. In general terms, a variety of probabilistic information may lead the user to believe that a secret is actually true with high probability. Then the aforementioned confidentiality framework should be refined by strengthening its confidentiality criterion so as to accommodate a priori probabilistic information about the domain of discourse.

Accordingly, in this paper, we generalize and extend the setting of [4] in the following ways. First, we consider probabilistic a priori knowledge in the form of a probability distribution over the set of all possible data sources. Second, we measure the information regarding a secret query's validity learnt by observing filtered data and reasoning about it. This measurement is taken as the conditional probability of the secret query's validity under the observation. Third, we refine the secrecy criterion by requiring the observer to believe in the truth of secrets with very low probability, bounded by a suitable threshold ε.

Example 1. Let us consider a simple artificial situation where the set of data sources contains the 8 possible interpretations of 3 RDF triples p, q and r, represented by the words pqr, pqr̄, pq̄r, pq̄r̄, p̄qr, p̄qr̄, p̄q̄r, p̄q̄r̄. Intuitively speaking, the filtering should keep the joint validity of p and q confidential, formally expressed by the query S = {pqr, pqr̄}, (the characteristic function of) which returns "yes" (true) for the two interpretations with representation pq∗, and "no" (false) otherwise. Accordingly, the confidentiality policy is just the singleton {S}. A straightforward filtering f hides the secret when applied to an interpretation by setting the first variable always to p̄. Since in general we will allow uncertainty in a generated view, a filtering will return a set of data sources, and thus this filtering is formally defined by f(xyz) := {p̄yz}. However, seeing the verbatim view {p̄yz}, an intelligent and knowledgeable observer can easily determine the inferred view {pyz, p̄yz}, each of whose elements might be the actual data source underlying the filtering. Nevertheless, since {pqr, pqr̄} ⊄ {pyz, p̄yz}, an observer will always believe in the possibility that the secret is not valid in the actual data source. For our probabilistic setting we are even more ambitious: we want the observer to believe in a probability of the non-validity of the secret not less than 1 − ε, for some security parameter ε ∈ [0, 1].

So let us further assume that the anticipated observer knows a priori the following probability distribution µ: each of the two protected interpretations occurs only relatively rarely, with probability 1/16; each of the two interpretations containing both p̄ and q̄ is more likely, having probability 3/16; and each of the remaining four interpretations containing either p or q is equally distributed, with probability 2/16. Then we have µ(S) = µ({pqr, pqr̄}) = 1/16 + 1/16 = 1/8. By similar elementary calculations, for each case we can determine the a posteriori probability of the secret being valid under the condition of the inferred view based on the observation of the verbatim view.
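These calculations are elementary enough to replay mechanically. The following Python sketch (ours, purely illustrative; the names delta, mu and f are ad hoc) computes the conditional probabilities for the filtering and distribution just described:

from fractions import Fraction
from itertools import product

# Interpretations are triples (p, q, r) of truth values.
D = list(product([True, False], repeat=3))

def delta(d):
    """A priori probability of a single interpretation (Example 1)."""
    p, q, r = d
    if p and q:
        return Fraction(1, 16)       # the two protected interpretations pq*
    if not p and not q:
        return Fraction(3, 16)       # the two interpretations with p̄ and q̄
    return Fraction(2, 16)           # the remaining four interpretations

def mu(A):
    return sum(delta(d) for d in A)

S = {d for d in D if d[0] and d[1]}  # secret: p and q jointly valid

def f(d):
    """Straightforward filtering: set the first variable always to false."""
    p, q, r = d
    return frozenset({(False, q, r)})

for d in D:
    view = {e for e in D if f(e) == f(d)}    # inferred view [d]_f
    posterior = mu(S & view) / mu(view)      # a posteriori probability of S
    print(d, "->", sorted(view), posterior)
# Every posterior is 1/3 or 0.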
For the straightforward filtering f defined above and the assumed probability distribution µ, the calculations show that these probabilities are either 1/3 or 0, and thus the filtering is seen to be secure for each security parameter ε in the interval (1/3, 1]. This example is summarized in Table 1, and variants of it serve throughout the paper.

Table 1. Summarized example, using the following notations: "→" indicates that the interpretation d is mapped on the singleton containing the neighboring interpretation in the same line; "↓" indicates that the interpretation d is mapped on the singleton containing itself; [.] denotes the inferred view, formed as the equivalence class consisting of the two interpretations in the respective line; µ(S | [.]) denotes the derived conditional probability of the secret S under the inferred view [.]

A priori µ(d) | Interpretation d | Filtering f(d) | Filtering f(d) | Interpretation d | A priori µ(d) | Inferred view µ([.]) | Secret fraction µ(S ∩ [.]) | A posteriori µ(S | [.])
1/16 | pqr  | → | ↓ | p̄qr  | 2/16 | 3/16 | 1/16 | 1/3
1/16 | pqr̄  | → | ↓ | p̄qr̄  | 2/16 | 3/16 | 1/16 | 1/3
2/16 | pq̄r  | → | ↓ | p̄q̄r  | 3/16 | 5/16 | 0    | 0
2/16 | pq̄r̄  | → | ↓ | p̄q̄r̄  | 3/16 | 5/16 | 0    | 0

Popular probabilistic models and approaches have already been introduced. Differential privacy [11,12,13,14] is currently one of the most important of them. Under suitable assumptions it is very effective and relatively easy to apply. Unfortunately, when (probabilistic) background knowledge such as record correlations is available, differential privacy may fail to preserve confidentiality [20]. For this reason, investigating methods that preserve confidentiality in a probabilistic sense, in the presence of probabilistic background knowledge, is still an important topic. Moreover, differential privacy is based on the perturbation of data or query answers, while we intend to consider (also) methods that do not report incorrect answers.

Accordingly, in this paper we provide as a further contribution a maximally cooperative, secure query-answering method based on a greedy refusal-based algorithm. We also prove that – differently from the deterministic case – its lying-based analogue is not secure. Finally, we extend some of the computational hardness results from the deterministic framework to the probabilistic one. This abstract framework should be regarded as a preliminary, general feasibility study that constitutes the first step towards inference-proof, confidential access control methods for semantic data. Thus the implementation of concrete mechanisms still lies beyond the scope of this paper.

The paper is organized as follows. After formally defining the probabilistic setting in Section 2, we analyze the safety of the greedy filtering methods based on refusal and lying in Section 3 and Section 4, respectively. In Section 5, we investigate the degree of cooperativeness of the greedy refusal method (a form of optimality). Section 6 reports the complexity results. Finally, in Section 7, we briefly summarize and evaluate our achievements, discuss related work, and list some challenging open problems.

2 A Probabilistic Setting

As in the deterministic case presented in [4], the probabilistic framework is based on an arbitrary set D of data sources, where each data source d ∈ D is treated as an abstract entity. To enable reasoning about the probability of sets of data sources, we need to introduce a σ-algebra ΣD ⊆ ℘(D), where ℘(D) denotes the powerset of D.
By definition, ΣD includes both D and ∅ and is closed under the set operations of complement, countable union and countable intersection. As for perfect encryption, we assume that the attacking observer has probabilistic a priori knowledge about the owner's data source. Such knowledge is formalized by a probability measure µ : ΣD → [0, 1], that is, any function which satisfies the following conditions: (i) µ(D) = 1 and (ii) if A_1, A_2, ... ∈ ΣD are pairwise disjoint, then µ(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i). (Function µ is the analogue of the data generation function P of [20].) We refer to [1] for a wider introduction to probability theory; here we just report a few properties which will be used later on:

– (complementation) µ(Ā) = 1 − µ(A), where Ā is the complement of A;
– (monotonicity) if A ⊆ A′, then µ(A) ≤ µ(A′);
– (∩-continuity) if A_1, A_2, ... is a descending chain, i.e. A_{i+1} ⊆ A_i for all i, then µ(⋂_{i=1}^∞ A_i) = lim_i µ(A_i).

Some subsets Q of D can be regarded as a Boolean query, indicating whether a data source d satisfies Q, i.e., d ∈ Q, or not, i.e., d ∉ Q. To comply with the probabilistic framework, each query must be measurable. Let B ⊆ ΣD be the set of queries considered. Clearly, since queries have to be expressed syntactically, B is countable.

To keep some aspects of his data source secret from the observer, the owner can specify a confidentiality policy consisting of a finite set S ⊆ B of queries, in this context called (potential) secrets. Informally speaking, if the owner's data source satisfies a secret then the observer should not be able to learn this fact; conversely, the observer is allowed to know that a secret is not satisfied.

To preserve secrets, the owner applies a filtering f : D → ℘(D) \ {∅} on his actual data source d and publishes f(d) as a (verbatim) secure view, making the observer uncertain about which data source in f(d) is the actual one, and possibly even misleading the observer by not including the actual data source d in f(d). Assuming that the observer is a rational agent that knows the filtering f, the observer can compute the inferred filtering [·]_f : D → ℘(D) \ {∅}, defined by

[d]_f := { d′ | f(d) = f(d′) } .   (1)

So, given the published verbatim view f(d), the observer can construct the inferred view [d]_f, which may still be "uncertain" (i.e., there are multiple data sources) but definitely contains the actual data source d. To enforce the confidentiality policy, the filtering should prevent the observer from inferring that a secret is satisfied. For instance, in a deterministic setting, the filtering should enforce [d]_f ⊄ S for all secrets S ∈ S. Finally, to complete the probabilistic setting, we require that [d]_f ∈ ΣD, i.e., each set [d]_f of data sources indistinguishable under f is supposed to be measurable. Then we strengthen the security criterion by requiring that the probability that a secret is satisfied by d – given the observable view f(d) – should not exceed a (small) threshold ε.

Definition 1. For a probabilistic setting with a priori knowledge µ : ΣD → [0, 1], confidentiality policy S, and threshold ε ∈ [0, 1], a filtering f : D → ℘(D) \ {∅} with measurable inferred views is (µ, S, ε)-secure iff for all data sources d ∈ D and for all secrets S ∈ S, it holds that

µ(S ∩ [d]_f) ≤ ε · µ([d]_f) .   (2)
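For a finite D with all subsets measurable, Definition 1 can be checked by exhaustive enumeration. A minimal sketch in Python (a hypothetical helper of ours, not part of the framework):

from fractions import Fraction

def is_secure(D, delta, secrets, eps, f):
    """Brute-force check of Definition 1 for a finite set D of data sources.

    delta   -- discrete a priori distribution, d -> probability
    secrets -- the confidentiality policy, a collection of subsets of D
    eps     -- the threshold ε
    f       -- the filtering, d -> frozenset of data sources
    """
    def mu(A):
        return sum(delta(d) for d in A)
    for d in D:
        view = {e for e in D if f(e) == f(d)}   # inferred view [d]_f, eq. (1)
        for S in secrets:
            # inequality (2): µ(S ∩ [d]_f) ≤ ε · µ([d]_f)
            if mu(S & view) > eps * mu(view):
                return False
    return True

Applied to Example 1 (with delta, S and f as in the earlier sketch), is_secure(D, delta, [S], eps, f) returns True exactly for eps ≥ Fraction(1, 3).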
Note that inequality (2) means that, whenever defined (i.e., µ([d]_f) ≠ 0), the conditional probability of S under the inferred view [d]_f,

µ(S | [d]_f) := µ(S ∩ [d]_f) / µ([d]_f) ,

is bounded by ε.

The abstract framework presented so far can be embodied in the context of Linked Data as follows: data sources consist of RDF-graphs, Boolean queries are represented by ASK forms of SPARQL, and secrets are specific Boolean queries, i.e. confidential pieces of information that can be retrieved through (a sequence of) ASK forms. (Roughly speaking, an ASK form is a query that returns true if the RDF-graph satisfies a specified graph pattern, and false otherwise. Graph patterns allow one to verify, for example, whether a confidential set of RDF-triples or a specified node denoting some sensitive individual occurs in the graph.) Moreover, the a priori knowledge of the observer is represented by a probability distribution over possible RDF-graphs. More precisely, the set D is the set of all possible RDF-graphs over a given signature of interest and ΣD is the whole powerset ℘(D). Then, µ is determined by a discrete probability distribution δ over RDF-graphs such that for all A ⊆ D,

µ(A) = Σ_{d∈A} δ(d) .   (3)

Finally, a Boolean query Q_a is the denotation of an ASK form a, that is, the set of RDF-graphs where the form a returns true.

In the remainder of this paper, we will study the secure filterings f that result from a greedy construction of a decreasing sequence of sets L_i(d) of data sources, for each input data source d ∈ D:

D =: L_0(d) ⊇ L_1(d) ⊇ ··· ⊇ L_i(d) ⊇ ··· ⊇ ⋂_i L_i(d) =: f(d) .   (4)

All constructions will be based on an exhaustive enumeration B_en = ⟨Q_1, ..., Q_i, ...⟩ of all queries in B. Stepwise, each query Q_i is submitted to a kind of stateful censor that keeps track of previous answers to determine whether Q_i should be answered correctly or distorted. This decision determines L_i(d) from L_{i−1}(d).

The sequence ⟨L_i(d)⟩_i also has a dynamic interpretation. The observer chooses the queries Q_1, Q_2, ..., Q_i, ... and submits them iteratively to the confidential data source. The query answering system returns for each Q_i the corresponding direct answer A_i(d), leaving the computation of the accumulated information represented by L_i(d) to the observer. In this dynamic setting, the confidentiality criterion is that at each step i, the accumulated information L_i(d) of the direct answers to Q_1, Q_2, ..., Q_i should not tell too much about the secrets.
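To make the Linked Data embodiment tangible, here is a toy sketch (entirely our own construction: the triples, the ask helper and the uniform δ are invented for illustration) of how ASK denotations become measurable subsets of D as in eq. (3):

from fractions import Fraction
from itertools import combinations

# Toy embodiment: RDF-graphs modelled as frozensets of triples.
TRIPLES = [
    ("ex:alice", "ex:given_name", "Alice"),
    ("ex:alice", "ex:treated_for", "ex:flu"),
    ("ex:bob", "ex:given_name", "Bob"),
]
D = [frozenset(g) for k in range(len(TRIPLES) + 1)
     for g in combinations(TRIPLES, k)]        # all graphs over the signature

def ask(pattern):
    """Denotation Q_a of an ASK form: the set of graphs matching the
    pattern, where None acts as a wildcard."""
    def matches(t):
        return all(p is None or p == c for p, c in zip(pattern, t))
    return frozenset(g for g in D if any(matches(t) for t in g))

def mu(A):
    """Eq. (3) with a uniform discrete distribution δ, for illustration."""
    return sum(Fraction(1, len(D)) for d in A)

Q = ask((None, "ex:treated_for", "ex:flu"))    # "is anybody treated for flu?"
print(mu(Q))                                   # a priori probability: 1/2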
3 Greedy Refusal

Under the refusal approach to data filtering, harmful queries are explicitly notified as being hidden. In a dynamic query-answering environment, this may result in returning the special answer "mum" to a query whose correct answer would violate the confidentiality policy.

In the data filtering framework there is yet another interpretation of the refusal approach: some specific information about the actual data source is hidden by providing a view which contains not only the actual data source but also other data sources. If this generation of uncertainty is properly done, then the observer cannot know which element of the view is the actual one, and therefore he is not able to infer that specific information.

A crucial point of the refusal approach is to block so-called meta-inferences. For instance, if a filtering f refused only the queries that entail a secret, then refused answers would always correspond to a secret satisfied by the actual data source d. The basic strategy for blocking this kind of attack is to make the critical part of the censor decision independent of the actual data source. This strategy has already been proposed in the seminal work [27] on the refusal approach, further elaborated and refined for various models, e.g., [2,3,7,5], adopted for the abstract framework [4], and will also be exploited in the rest of this section for the probabilistic setting under consideration.

Using the notations of Definition 1 and instantiating (4), for an enumeration B_en = ⟨Q_1, ..., Q_i, ...⟩ of all queries in B, the (greedy) refusal filtering f_re results from the iterative application of the following distortion criterion to each query Q_i:

IF [censor criterion evaluating Q_i's harmfulness in a d-independent way]
   for some secret S ∈ S,
   µ(S ∩ L_{i−1}(d) ∩ Q_i) > ε · µ(L_{i−1}(d) ∩ Q_i) or
   µ(S ∩ L_{i−1}(d) ∩ Q̄_i) > ε · µ(L_{i−1}(d) ∩ Q̄_i)
THEN [distortion by refusal]
   L_i(d) := L_{i−1}(d)
ELSE [honest answer w.r.t. d]
   A_i(d) := IF d ∈ Q_i THEN Q_i ELSE Q̄_i
   L_i(d) := L_{i−1}(d) ∩ A_i(d)

Example 2. Resuming Example 1 with µ(S) = µ({pqr, pqr̄}) = 1/16 + 1/16 = 2/16, we consider the refusal filtering for d := pqr. We choose ε := 6/16 as security parameter and take the powerset of the set of all interpretations over the propositional variables p, q and r as the set of all queries.

We start the enumeration of the 2^8 queries with "p?", intuitively asking whether p is valid, formalized as Q_1 := {pqr, pqr̄, pq̄r, pq̄r̄} with µ(Q_1) = 6/16. Then Q̄_1 = {p̄qr, p̄qr̄, p̄q̄r, p̄q̄r̄} with µ(Q̄_1) = 10/16. Accordingly, µ(S ∩ Q_1) = µ({pqr, pqr̄}) = 2/16 and µ(S ∩ Q̄_1) = µ(∅) = 0. Evaluating the censoring condition we get µ(S | Q_1) = 2/6 < 6/16 and µ(S | Q̄_1) = 0 < 6/16, and thus the honest answer A_1(pqr) := Q_1 is processed to determine L_1(d) := Q_1.

We continue the enumeration with the query "q?", intuitively asking whether q is valid, formalized as Q_2 := {pqr, pqr̄, p̄qr, p̄qr̄}. Evaluating the censoring condition, we then get µ(L_1(d) ∩ Q_2) = µ({pqr, pqr̄}) = 2/16 and µ(S ∩ L_1(d) ∩ Q_2) = µ({pqr, pqr̄}) = 2/16, and thus µ(S | L_1(d) ∩ Q_2) = 1 leads to a refusal resulting in L_2(d) := L_1(d) = {pqr, pqr̄, pq̄r, pq̄r̄}.

Further continuing with "r?", i.e., Q_3 := {pqr, pq̄r, p̄qr, p̄q̄r}, we get µ(L_2(d) ∩ Q_3) = µ({pqr, pq̄r}) = 3/16 and µ(S ∩ L_2(d) ∩ Q_3) = µ({pqr}) = 1/16, and thus µ(S | L_2(d) ∩ Q_3) = 1/3 < 6/16, as well as µ(L_2(d) ∩ Q̄_3) = µ({pqr̄, pq̄r̄}) = 3/16 and µ(S ∩ L_2(d) ∩ Q̄_3) = µ({pqr̄}) = 1/16, and thus µ(S | L_2(d) ∩ Q̄_3) = 1/3 < 6/16, leading to an honest reaction such that L_3(d) := L_2(d) ∩ A_3(d) = {pqr, pq̄r}.

Though somewhat tedious to check, all further queries do not change this intermediate view and, thus, for the actual data source pqr we get {pqr, pq̄r} as the final verbatim view, leaving an observer uncertain whether or not q is valid. One may note that in Example 1 we got the verbatim view {pqr, p̄qr}, leaving the status of p open; we would obtain this result with the refusal filtering if we exchanged the processing of the first two queries.

Clearly, in the context of RDF-graphs considered above, the presented approach requires that an ASK form return a notification of the kind query_refused in case of refusal.
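The censor above transcribes almost literally into executable form. The following Python sketch (ours; it replays Example 2 on its three initial queries) illustrates the d-independent censor check:

from fractions import Fraction
from itertools import product

D = list(product([True, False], repeat=3))       # interpretations (p, q, r)

def delta(d):
    p, q, r = d
    if p and q:
        return Fraction(1, 16)
    if not p and not q:
        return Fraction(3, 16)
    return Fraction(2, 16)

def mu(A):
    return sum(delta(d) for d in A)

def greedy_refusal(d, queries, secrets, eps):
    """Greedy refusal censor: refuse Q_i whenever one of the two possible
    honest answers could push some secret above eps -- a check that does
    not depend on the actual data source d."""
    L = set(D)                                   # L_0(d) = D
    for Q in queries:
        notQ = set(D) - Q
        harmful = any(
            mu(S & L & Q) > eps * mu(L & Q) or
            mu(S & L & notQ) > eps * mu(L & notQ)
            for S in secrets)
        if not harmful:                          # honest answer w.r.t. d
            L &= Q if d in Q else notQ
        # on refusal, L is left unchanged
    return L

S = {d for d in D if d[0] and d[1]}              # secret "p and q"
p_query = {d for d in D if d[0]}                 # "p?"
q_query = {d for d in D if d[1]}                 # "q?"
r_query = {d for d in D if d[2]}                 # "r?"

view = greedy_refusal((True, True, True), [p_query, q_query, r_query],
                      [S], Fraction(6, 16))
print(sorted(view))   # {pqr, pq̄r}: "q?" refused, "p?" and "r?" answered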
Example 3. Consider an RDF-graph containing sensitive information about medical treatments in a small town of 10 000 citizens. For privacy reasons, we want to enforce anonymity by preventing the identification of any specific person with a confidence larger than 0.5. Now Oreste Galli, who actually occurs in the data source, wants to verify that the refusal framework works properly by maliciously querying the system about himself. In particular, he considers two queries: the former asks whether there exists an individual in the graph whose name is Oreste, the latter asks for the presence of an individual who has his street address, Machiavelli Street. The queries may look like:

Q1 = ASK(?x ex:given_name "Oreste")
Q2 = ASK(?x ex:street_address "Machiavelli")

The a priori knowledge is that (i) the data source contains 1% of the whole population, (ii) 10 people are called Oreste, and (iii) 15 people live in Machiavelli Street. Then, by applying Bayes' theorem, the conditional probability of inferring the secret S that Oreste Galli occurs in the data source by querying Q1 is given by

µ(S | Q1) = µ(Q1 | S) · µ(S) / µ(Q1) .

Note that the fact that Oreste Galli occurs in the graph implies that there exists a person called Oreste; in other words, since S ⊂ Q1, we have that µ(Q1 | S) = 1. Furthermore, µ(S) is the prior probability that Oreste Galli occurs in the graph, which is equal to the percentage of people occurring in the graph, i.e. 0.01. Finally, µ(Q1) is the probability that, given a generic graph containing 1% of the whole population, there exists at least one person in the graph called Oreste. This is given by the formula

µ(Q1) = (N − K) / N ,

where N = C(10 000, 100) (a binomial coefficient) is the number of all possible graphs containing 1% of the whole population, and K is the number of possible graphs where no Oreste occurs. Since only 10 people out of 10 000 are called Oreste, K = C(9 990, 100). Then, we straightforwardly have that µ(S | Q1) = 0.104. Note that, since S ⊆ Q1, it follows that µ(S | Q̄1) is equal to zero. Consequently, since µ(S | Q1) < 0.5, the censor correctly answers Q1.

Then, the user queries Q2. By using Bayes' theorem again, the censor estimates the conditional probability of S given Q2, provided that Q1 has already been answered and hence the prior probability is 0.104 instead of 0.01. By a similar calculation, the resulting probability is 0.74, so the censor refuses Q2.
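The numbers in Example 3 can be reproduced with a few lines of Python (a sketch under the example's assumptions; in particular, µ(Q2 | S) = 1 because Oreste Galli lives in Machiavelli Street, and the posterior after Q1 is reused as the new prior):

from math import comb

population, sample = 10_000, 100
orestes, machiavellis = 10, 15

def prob_at_least_one(k):
    """Probability that a random 1%-sample hits at least one of k people:
    (N - K)/N with N = C(10000, 100) and K = C(10000 - k, 100)."""
    return 1 - comb(population - k, sample) / comb(population, sample)

prior = sample / population                    # µ(S) = 0.01
post_q1 = prior / prob_at_least_one(orestes)   # Bayes with µ(Q1|S) = 1
print(round(post_q1, 4))                       # ≈ 0.1045, the text's 0.104

post_q2 = post_q1 / prob_at_least_one(machiavellis)
print(round(post_q2, 3))                       # ≈ 0.746 > 0.5: refuse Q2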
Theorem 1. Let B_en be an enumeration of B. For each µ, S and ε ∈ [0, 1], f_re is a filtering function with measurable inferred views; moreover, f_re is (µ, S, ε)-secure, provided the following precondition holds:

µ(S) = µ(S ∩ L_0(d)) / µ(L_0(d)) ≤ ε , for all S ∈ S.

(This precondition is required since no secret which is violated ex ante can be protected by any filtering.)

Proof. We first prove the following sub-statements.

Fact 1: f_re(d) ⊆ [d]_{f_re}. Assume that d′ ∈ f_re(d); we will show by induction on the iterative construction of f_re that f_re(d′) = f_re(d); consequently, d′ ∈ [d]_{f_re}. Since L_0(d′) = D = L_0(d), the base case i = 0 trivially holds. Let i > 0 and assume by induction hypothesis that L_{i−1}(d) = L_{i−1}(d′). Clearly, then for all S ∈ S, both µ(S | L_{i−1}(d) ∩ Q_i) = µ(S | L_{i−1}(d′) ∩ Q_i) and µ(S | L_{i−1}(d) ∩ Q̄_i) = µ(S | L_{i−1}(d′) ∩ Q̄_i); this means that the refusal behavior of the censor on the query Q_i will be the same for both d and d′. Moreover, since d′ ∈ f_re(d), we also have that d′ ∈ A_i(d) and, hence, A_i(d) = A_i(d′). Consequently, L_i(d) = L_i(d′).

Fact 2: d ∈ f_re(d). Again, by induction on the iterative construction of f_re, the base case, d ∈ L_0(d) = D, holds. Furthermore, assume that d ∈ L_{i−1}(d), where i > 0. If the censor refuses Q_i, then L_i(d) = L_{i−1}(d) and the induction hypothesis directly implies that d ∈ L_i(d). Otherwise, L_i(d) = L_{i−1}(d) ∩ A_i(d). Since by induction hypothesis d ∈ L_{i−1}(d) and by construction d ∈ A_i(d), also in this case d ∈ L_i(d).

Fact 3: [d]_{f_re} ⊆ f_re(d). Assume d′ ∈ [d]_{f_re}, i.e., f_re(d) = f_re(d′). By Fact 2, d′ ∈ f_re(d′), so we have d′ ∈ f_re(d).

That f_re is a filtering function, i.e. f_re(d) ≠ ∅ for all d ∈ D, directly follows from Fact 2. Moreover, by Facts 1 and 3, f_re(d) = [d]_{f_re}. Note that, since B ⊆ ΣD and f_re(d) consists of a countable intersection of elements of B or their complements, f_re(d) ∈ ΣD and hence [d]_{f_re} ∈ ΣD too.

Finally, the statement's precondition and the definition of f_re immediately imply the following invariant:

µ(S ∩ L_i(d)) ≤ ε · µ(L_i(d)) , for all i and for all S ∈ S.

Hence, taking the limits, it holds that

lim_i µ(S ∩ L_i(d)) ≤ ε · lim_i µ(L_i(d)) .

On the other hand, by the ∩-continuity of µ, µ(f_re(d)) = µ(⋂_i L_i(d)) = lim_i µ(L_i(d)) and µ(S ∩ f_re(d)) = µ(S ∩ ⋂_i L_i(d)) = µ(⋂_i (S ∩ L_i(d))) = lim_i µ(S ∩ L_i(d)). Putting it all together, we have that

µ(S ∩ f_re(d)) ≤ ε · µ(f_re(d)) .

The theorem immediately follows by recalling that f_re(d) = [d]_{f_re}. ⊓⊔

Remark 1. The above theorem and its proof show that the refusal approach is inference-proof in a very strong sense. Namely, as f_re(d) always coincides with [d]_{f_re}, inferring does not provide any information beyond what is immediately visible from the verbatim view. Moreover, the two views contain only limited information about the secrets, as specified by the security parameter ε.

4 Greedy Lying

Under the lying approach to data filtering a harmful piece of data is implicitly distorted, without any notification, by saying that it is false. In a dynamic query-answering environment, such a distortion consists in returning the complement of the correct answer to a query Q whenever the correct answer violates the confidentiality policy. In the framework of data filterings, lying aims at providing a view that contains a single data source (if necessary, different from the actual one) where all the secrets are false.

A crucial point of the lying approach is to make a current answer consistent with previously accumulated knowledge and to prepare for being able to provide definite (though possibly untrue) answers to any further queries. The basic strategy for achieving these goals is to always protect the disjunction of all secrets (in the abstract framework, their union). This strategy has already been proposed in the seminal work [9] on the lying approach, further elaborated and refined for various models, e.g., [2,3,7,8], adopted for the abstract framework [4], and will also be tentatively exploited next in the probabilistic setting.

Interestingly, the natural probabilistic generalization of the deterministic greedy lying approach defined in [4] is not secure, as shown in the following. Using the notations of Definition 1 and instantiating (4), for an enumeration B_en = ⟨Q_1, ..., Q_i, ...⟩ of queries in B, the natural probabilistic generalization of the (greedy) lying filtering f_ly is obtained by iteratively applying the following distortion mechanism to all queries Q_i (i = 1, 2, ...):
A_i(d) := IF d ∈ Q_i THEN Q_i ELSE Q̄_i
IF [censoring by checking harmfulness of the correct (d-dependent) answer]
   µ(⋃S ∩ L_{i−1}(d) ∩ A_i(d)) > ε · µ(L_{i−1}(d) ∩ A_i(d))
THEN [distortion by lying]
   L_i(d) := L_{i−1}(d) ∩ Ā_i(d)
ELSE [honest answer w.r.t. d]
   L_i(d) := L_{i−1}(d) ∩ A_i(d)

The precondition for applying the greedy lying method, by analogy with the deterministic case, should be:

µ(⋃S) ≤ ε .

Proposition 1. There exist µ, S and ε for which the above precondition is satisfied but f_ly is not (µ, S, ε)-secure.

Proof. Let us consider the following setting. The set of data sources is D = {d1, d2, d3}. The a priori knowledge for the d_i is described by the following measures: µ(d1) = 0.7, µ(d2) = 0.2 and µ(d3) = 0.1. The confidentiality policy S contains only the secret S = {d2}, i.e., S = {{d2}}. The value of the threshold is given by ε = 0.5. The set of queries is B = ℘(D). We employ the following enumeration of the queries, B_en = ⟨Q_1, ..., Q_8⟩, where the Q_i are defined as follows: Q_1 = {d2} = S, Q_2 = {d2, d3}, Q_3 = ∅, Q_4 = {d1}, Q_5 = {d3}, Q_6 = {d1, d2}, Q_7 = {d1, d3}, Q_8 = {d1, d2, d3}.

Let us then examine the execution of the filtering f_ly as defined above. We first show the computation for the data source d2. We start with L_0(d2) = D = {d1, d2, d3}. The answers to the queries are computed as follows, where – in order to ease the presentation – we use the notation µ_i(d) = µ(⋃S | L_{i−1}(d) ∩ A_i(d)) = µ({d2} | L_{i−1}(d) ∩ A_i(d)):

– Q_1 = {d2}: Since d2 ∈ Q_1, it follows that A_1(d2) = Q_1 = {d2} and, thus, L_0(d2) ∩ A_1(d2) = {d2}. We obtain that µ_1(d2) = µ(⋃S | L_0(d2) ∩ A_1(d2)) = µ({d2} | {d2}) = 1 ≥ ε = 0.5. In this case, the filter lies, setting L_1(d2) := L_0(d2) ∩ Ā_1(d2) = {d1, d2, d3} ∩ {d1, d3} = {d1, d3}.

– Q_2 = {d2, d3}: Since d2 ∈ Q_2, it follows that A_2(d2) = Q_2 = {d2, d3} and, thus, L_1(d2) ∩ A_2(d2) = {d3}. We obtain that µ_2(d2) = µ({d2} | {d3}) = 0 and, thus, L_2(d2) = L_1(d2) ∩ A_2(d2) = {d1, d3} ∩ {d2, d3} = {d3}.

– Q_3 = ∅: Since d2 ∉ Q_3, it follows that A_3(d2) = Q̄_3 = {d1, d2, d3} and, thus, L_2(d2) ∩ A_3(d2) = {d3}. As before, µ_3(d2) = µ({d2} | {d3}) = 0 and, thus, L_3(d2) = L_2(d2) ∩ A_3(d2) = {d3} ∩ {d1, d2, d3} = {d3}.

– Q_4 = {d1}: Since d2 ∉ Q_4, it follows that A_4(d2) = Q̄_4 = {d2, d3} and, thus, L_3(d2) ∩ A_4(d2) = {d3}. As before, µ_4(d2) = µ({d2} | {d3}) = 0 and, thus, L_4(d2) = L_3(d2) ∩ A_4(d2) = {d3} ∩ {d2, d3} = {d3}.

– Q_5 = {d3}: Since d2 ∉ Q_5, it follows that A_5(d2) = Q̄_5 = {d1, d2} and, thus, L_4(d2) ∩ A_5(d2) = ∅. In this case the filter computes L_5(d2) = ∅.

The complete construction of f_ly is summarized in Table 2.

Table 2. Construction of f_ly

      | d1           | d2           | d3
L_0() | {d1, d2, d3} | {d1, d2, d3} | {d1, d2, d3}
L_1() | {d1, d3}     | {d1, d3}     | {d1, d3}
L_2() | {d1}         | {d3}         | {d3}
L_3() | {d1}         | {d3}         | {d3}
L_4() | {d1}         | {d3}         | {d3}
L_5() | {d1}         | ∅            | ∅
L_6() | {d1}         | ∅            | ∅
L_7() | {d1}         | ∅            | ∅
L_8() | {d1}         | ∅            | ∅

Given the above table, we can write that [d1]_{f_ly} = {d1} and [d2]_{f_ly} = [d3]_{f_ly} = {d2, d3}. This means that µ(⋃S | [d2]_{f_ly}) = µ({d2} | {d2, d3}) = 2/3 > ε, and thus the proposition follows. ⊓⊔

Proposition 1 shows that the natural probabilistic generalization of the greedy lying approach is not secure. The existence of secure probabilistic variants of greedy lying is left as an open problem.
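The counterexample can be checked mechanically. The following Python sketch (ours) runs the lying censor on the proof's setting and reproduces Table 2's final views:

from fractions import Fraction

D = ["d1", "d2", "d3"]
prob = {"d1": Fraction(7, 10), "d2": Fraction(2, 10), "d3": Fraction(1, 10)}

def mu(A):
    return sum(prob[d] for d in A)

def greedy_lying(d, queries, union_of_secrets, eps):
    """Natural probabilistic generalization of greedy lying: distort the
    answer whenever the honest one would push the union of secrets above eps."""
    L = frozenset(D)
    for Q in queries:
        A = Q if d in Q else frozenset(D) - Q    # correct answer A_i(d)
        if mu(union_of_secrets & L & A) > eps * mu(L & A):
            A = frozenset(D) - A                 # distortion by lying
        L &= A
    return L

secret = frozenset({"d2"})
queries = [frozenset(q) for q in
           [{"d2"}, {"d2", "d3"}, set(), {"d1"}, {"d3"},
            {"d1", "d2"}, {"d1", "d3"}, {"d1", "d2", "d3"}]]

views = {d: greedy_lying(d, queries, secret, Fraction(1, 2)) for d in D}
print(views)   # d1 -> {d1}; d2 and d3 -> ∅, hence [d2]_fly = {d2, d3}
# µ(S | [d2]_fly) = 0.2 / 0.3 = 2/3 > 1/2: the filtering is not secure.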
5 Cooperativeness

In this section we show that, under some natural assumptions, the greedy method based on refusals is maximally cooperative, that is, it hides a minimal amount of information. Maximal cooperativeness is formalized in [4] as follows:

Definition 2 ([4]). A filtering f is more cooperative than a filtering g iff for all d ∈ D, [d]_f ⊆ [d]_g. If f1 is more cooperative than f2 then we write f1 ⪰ f2. If f1 ⪰ f2 and f2 ⋡ f1, then we write f1 ≻ f2.

Informally speaking, if f is more cooperative than g then f systematically refuses to answer fewer queries than g (or the same queries as g), because the partition induced by f is finer than (or equal to) g's. Maximally cooperative filterings are called optimal.

Definition 3 ([4]). A secure filtering f is optimal iff there exists no secure filtering f′ such that f′ ≻ f.

Currently, we do not know whether the greedy refusal filtering f_re is optimal in this strong sense. It is difficult to compare f_re with arbitrary filterings g because g might exploit partitions (i.e. secure views) that cannot be denoted with the query language (while f_re, by construction, can only exploit the expressive power of B). What we are going to prove is that f_re is maximally cooperative among the filterings whose views can be defined using B. This class of filterings is formally defined as follows:

Definition 4. A filtering g is query-based iff for all d ∈ D, there exists a query Q ∈ B such that [d]_g = Q.

To prove the optimality of f_re with respect to query-based filterings, we adopt some mild restrictions. We assume that data sources are countable: D = {d1, d2, ..., di, ...}. For example, this is true of knowledge bases, as well as database tables whose values range over countable domains such as strings, integers and rational numbers. A property of this discrete framework is that all (possibly infinite) collections of pairwise disjoint subsets of D are countable. (To see this, associate each X ⊆ D in the collection to the integer min{i | di ∈ X}.)

The cooperativeness theorem is now formalized as follows:

Theorem 2. If D is countable, then there exists no (µ, S, ε)-secure query-based filtering g such that g ≻ f_re.

Proof. In the following, for the sake of readability, we abbreviate f_re with f. Suppose the theorem does not hold (we will derive a contradiction). Then there exists a (µ, S, ε)-secure query-based filtering g ≻ f. This means that for all d ∈ D, [d]_g ⊆ [d]_f and for some d0 ∈ D, [d0]_g ⊂ [d0]_f. Since g is query-based, for some step k of the greedy construction the query Q_k satisfies Q_k = [d0]_g. Note that in f's construction, Q_k is refused (otherwise [d0]_f ⊆ [d0]_g would hold, which is a contradiction). This means that for some secret S0 ∈ S, either

µ(S0 ∩ L_{k−1}(d0) ∩ Q_k) > ε · µ(L_{k−1}(d0) ∩ Q_k) , or   (5)
µ(S0 ∩ L_{k−1}(d0) ∩ Q̄_k) > ε · µ(L_{k−1}(d0) ∩ Q̄_k) .   (6)

Note that L_{k−1}(d0) ⊇ [d0]_f ⊃ [d0]_g = Q_k, therefore (5) entails µ(S0 ∩ [d0]_g) > ε · µ([d0]_g). This inequality cannot hold, because g is (µ, S, ε)-secure by assumption; so (5) does not hold, and hence (6) must hold. Finally, we prove that the security of g implies that (6) cannot hold either, which proves the theorem.

Let C = {[d]_g | d ∈ L_{k−1}(d0) ∩ Q̄_k}. Recall that C can be enumerated, because its elements are pairwise disjoint: so let C = {X1, X2, ..., Xi, ...}, where each Xi is an equivalence class induced by g. Since g is (µ, S, ε)-secure, for all Xi ∈ C we have

µ(S0 ∩ Xi) ≤ ε · µ(Xi) .   (7)
It follows that

µ(S0 ∩ L_{k−1}(d0) ∩ Q̄_k) = µ(S0 ∩ ⋃C)
  = Σ_{i=1}^∞ µ(S0 ∩ Xi)
  ≤ Σ_{i=1}^∞ ε · µ(Xi)        by (7)
  = ε · µ(⋃C)
  = ε · µ(L_{k−1}(d0) ∩ Q̄_k) .

This contradicts (6). ⊓⊔

6 Computational Complexity

The computational complexity of the non-probabilistic framework has been studied in [4] using finite D (the abstract analogue of propositional logic frameworks). For such frameworks, it can be assumed that all subsets of D are measurable and that B = ℘(D). Such finite non-probabilistic frameworks can be seen as a special case of the probabilistic framework: For all X ⊆ D, let

µ0(X) = |X| / |D|   and   ε0 = 1 − 1/|D| .

Then it can be proved that:

Proposition 2. For all S ⊆ B, a filtering is (µ0, S, ε0)-secure iff it is secure in the sense of [4].

Thanks to this proposition, all the computational hardness results of [4] can immediately be extended to the probabilistic framework. In particular, for any given µ, S, and ε:

– (Optimality checking) Deciding whether a given filtering f is an optimal (µ, S, ε)-secure filtering is coNP-hard.
– (Query availability) Given a data source d ∈ D and a query of interest Q ∈ B, deciding whether there exists a (µ, S, ε)-secure filtering f that preserves Q on d (that is, [d]_f ⊆ Q iff d ∈ Q) is NP-complete.
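Proposition 2 can be sanity-checked by brute force on a small finite D: under µ0 and ε0, inequality (2) holds for a nonempty view E and secret S exactly when E ⊄ S. A sketch of this check (ours, not the paper's proof):

from fractions import Fraction
from itertools import combinations

n = 4
D = frozenset(range(n))
eps0 = 1 - Fraction(1, n)                      # ε0 = 1 − 1/|D|

def mu0(X):
    return Fraction(len(X), n)                 # uniform measure µ0

subsets = [frozenset(c) for r in range(1, n + 1)
           for c in combinations(D, r)]        # nonempty subsets of D

for S in subsets:                              # each subset as a secret
    for E in subsets:                          # each subset as an inferred view
        probabilistic = mu0(S & E) <= eps0 * mu0(E)   # inequality (2)
        possibilistic = not E <= S                    # [d]_f ⊄ S, as in [4]
        assert probabilistic == possibilistic
print("Definition 1 with (µ0, ε0) coincides with the deterministic criterion.")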
7 Conclusions and related work

Extending recent work [4], we presented an abstract probabilistic setting for data filterings to achieve confidentiality according to a policy, in the presence of probabilistic background knowledge. The security property is probabilistic, too: a rational and knowledgeable observer should not be able to believe in the truth of a secret with a probability larger than a given parameter ε.

We proved that a natural, probabilistic generalization of the refusal-based controlled query evaluation approach is both secure and maximally cooperative among a class of "reasonable" filterings, which define their secure views using the query language. Interestingly, the corresponding lying approach is not secure (differently from the deterministic framework). Finally, computational hardness results for some decision problems have been extended from the deterministic case to the probabilistic case.

The security criterion can easily be modified by associating a different threshold ε_i with each secret S_i, and adapting the precondition of Theorem 1 accordingly. In this way one can simulate relativistic privacy preservation requirements, where the inferred probability of each secret S remains close to the prior probability µ(S) of S, just by defining the thresholds as ε_i = µ(S_i) + ε̃_i. Then the precondition of Theorem 1 is satisfied by the definitions. The proofs do not require any significant changes.

Our work continues a long line of research on imposing confidentiality constraints on computing systems, including logic-oriented information systems used for query answering and data publishing. The research line on refusals and controlled interaction execution has been devoted to non-probabilistic methods, so far [27,3,7,6,5]. Several of these papers additionally, and [9,8] explicitly, deal with lies in non-probabilistic settings. Similarly, the non-interference property for inference control in non-logical, operational settings has originally not been probabilistic [16], but some later and recent work also considers probabilities, e.g. [17,26,10,19,29], in particular to measure the (average, entropy-based) information flow from security high (secret) inputs to security low (open) outputs. Moreover, formalized in a more general framework of value transmission over a randomized channel based on [10], the authors of [19] emphasize that the actual probabilities might differ from those believed by the attacker, and they then study both the (decrease of) belief-uncertainty – defined as min-entropy of belief vulnerability – and the impact of the posterior inaccuracy of the attacker's belief. These studies, however, leave constructive methods to minimize secrecy-violating effects to future work. Also inspired by [10] – and rediscovering fundamental insight about refusals [27] – , now within a framework of procedural programs whose execution semantics are seen as a transformation of probabilities of states over numeric variables, the authors of [23] propose to employ efficiently updatable probabilistic polyhedra to approximate the currently assumed probabilistic knowledge of the attacker.

Other popular methods, such as k-anonymity with l-diversity [25,28,22], are partly covered by the deterministic abstract framework [4]. In some sense, though, k-anonymity lies in between possibilistic and probabilistic methods, since its first confidentiality criterion is aimed at ensuring a sufficient number of alternative states of the world (governed by the k parameter) to prevent re-identification. The approach of weakened relational views [5] pursues a closely related goal within a more general setting. Moreover, probabilistic refinements of l-diversity of values associated with a k-block of worlds attempt to let each of the associated values appear equally plausible.

Halpern and O'Neill studied probabilistic secrecy in the context of dynamic systems [18]. Their approach is based on a modal logic of beliefs whose models encode a streamlined account of system runs. Our technical analysis differs in several respects. Their focus is on the definition and logical characterization of security properties; they do not show how to develop secure systems, nor do they deal with cooperativeness. Moreover, they do not estimate the complexity of achieving security.

The methods based on lies (which have been extensively studied in the above non-probabilistic contexts) can be regarded as the mainstream method among probabilistic methods, in the form of random answer perturbations [15]. Concerning differential privacy [11,12,13,14], the main differences from our framework are two: (i) The differential privacy model does not embody prior knowledge such as record correlation, which may affect confidentiality [20]; in our model the a priori knowledge modeled by µ is used to ensure confidentiality even in the presence of record correlations and the like. (ii) The filterings for differential privacy are probabilistic while our filterings are not. Extending our framework to probabilistic filterings is an interesting topic for further research.

Recent more thorough comparisons and discussions of various probabilistic approaches to enforcing confidentiality of sensitive information in provider-consumer interactions are provided by [21] and [24].

The challenges for future work include dealing with the inherent complexity of some decision problems of interest (see, e.g., [4,29]), working safely with approximate estimates of µ (see, e.g., [10,19,14]), and designing efficient implementations of the secure methods (see, e.g., [23]).

References

1. P. Billingsley. Probability and Measure, 3rd edition. John Wiley & Sons, New York, NY, USA, 1995.
2. J. Biskup and P. A. Bonatti. Lying versus refusal for known potential secrets. Data Knowl. Eng., 38(2):199–222, 2001.
3. J. Biskup and P. A. Bonatti. Controlled query evaluation for enforcing confidentiality in complete information systems. Int. J. Inf. Sec., 3(1):14–27, 2004.
4. J. Biskup, P. A. Bonatti, C. Galdi, and L. Sauro. Optimality and complexity of inference-proof data filtering and CQE. In M. Kutylowski and J. Vaidya, editors, European Symposium on Research in Computer Security, ESORICS 2014, Part II, volume 8713 of Lecture Notes in Computer Science, pages 165–181. Springer, 2014.
5. J. Biskup and M. Preuß. Information control by policy-based relational weakening templates. In I. G. Askoxylakis, S. Ioannidis, S. K. Katsikas, and C. A. Meadows, editors, Computer Security - ESORICS 2016 - 21st European Symposium on Research in Computer Security, Part II, volume 9879 of Lecture Notes in Computer Science, pages 361–381. Springer, 2016.
6. J. Biskup and C. Tadros. Preserving confidentiality while reacting on iterated queries and belief revisions. Ann. Math. Artif. Intell., 73(1-2):75–123, 2015.
7. J. Biskup and T. Weibert. Keeping secrets in incomplete databases. Int. J. Inf. Sec., 7(3):199–217, 2008.
8. J. Biskup and L. Wiese. A sound and complete model-generation procedure for consistent and confidentiality-preserving databases. Theoretical Computer Science, 412:4044–4072, 2011.
9. P. A. Bonatti, S. Kraus, and V. S. Subrahmanian. Foundations of secure deductive databases. IEEE Trans. Knowl. Data Eng., 7(3):406–422, 1995.
10. M. R. Clarkson, A. C. Myers, and F. B. Schneider. Quantifying information flow with beliefs. Journal of Computer Security, 17(5):655–701, 2009.
11. I. Dinur and K. Nissim. Revealing information while preserving privacy. In F. Neven, C. Beeri, and T. Milo, editors, 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS 2003, pages 202–210. ACM, 2003.
12. C. Dwork. Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, editors, Automata, Languages and Programming, 33rd International Colloquium, ICALP 2006, Part II, volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer, 2006.
13. C. Dwork. Differential privacy: A survey of results. In M. Agrawal, D. Du, Z. Duan, and A. Li, editors, Theory and Applications of Models of Computation, 5th International Conference, TAMC 2008, volume 4978 of Lecture Notes in Computer Science, pages 1–19. Springer, 2008.
14. C. Dwork and G. N. Rothblum. Concentrated differential privacy. CoRR, abs/1603.01887, 2016.
15. B. C. M. Fung, K. Wang, A. W.-C. Fu, and P. S. Yu. Introduction to Privacy-Preserving Data Publishing – Concepts and Techniques. Chapman & Hall/CRC, Boca Raton, FL, 2010.
16. J. A. Goguen and J. Meseguer. Unwinding and inference control. In IEEE Symposium on Security and Privacy, pages 75–87, 1984.
17. J. W. Gray III. Toward a mathematical foundation for information flow security. Journal of Computer Security, 1(3-4):255–294, 1992.
18. J. Y. Halpern and K. R. O'Neill. Secrecy in multiagent systems. ACM Trans. Inf. Syst. Secur., 12(1):5.1–5.47, 2008.
19. S. Hamadou, C. Palamidessi, and V. Sassone. Quantifying leakage in the presence of unreliable sources of information. J. Comput. Syst. Sci., 88:27–52, 2017.
20. D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In T. K. Sellis, R. J. Miller, A. Kementsietsidis, and Y. Velegrakis, editors, ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 193–204. ACM, 2011.
21. J. Liu, L. Xiong, and J. Luo. Semantic security: Privacy definitions revisited. Trans. Data Privacy, 6(3):185–198, 2013.
22. A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. L-diversity: privacy beyond k-anonymity. TKDD, 1(1), 2007.
23. P. Mardziel, S. Magill, M. Hicks, and M. Srivatsa. Dynamic enforcement of knowledge-based security policies using probabilistic abstract interpretation. Journal of Computer Security, 21(4):463–532, 2013.
24. C. Palamidessi. Quantitative approaches to the protection of private information: State of the art and some open challenges. In R. Focardi and A. C. Myers, editors, Principles of Security and Trust - 4th International Conference, POST 2015, volume 9036 of Lecture Notes in Computer Science, pages 3–7. Springer, 2015.
25. P. Samarati. Protecting respondents' identities in microdata release. IEEE Trans. Knowl. Data Eng., 13(6):1010–1027, 2001.
26. T. Santen. Preservation of probabilistic information flow under refinement. Inf. Comput., 206(2-4):213–249, 2008.
27. G. L. Sicherman, W. de Jonge, and R. P. van de Riet. Answering queries without revealing secrets. ACM Trans. Database Syst., 8(1):41–59, 1983.
28. L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
29. H. Yasuoka and T. Terauchi. Quantitative information flow as safety and liveness hyperproperties. Theor. Comput. Sci., 538:167–182, 2014.