                          Inference-proof Data Filtering
                            for a Probabilistic Setting*

                J. Biskup¹, P.A. Bonatti², C. Galdi², and L. Sauro²

        ¹ Fakultät für Informatik, Technische Universität Dortmund
        ² Dip. Ing. Elet. e Tecnologie dell'Informazione, Università di Napoli Federico II



        Abstract. In querying semantic data, access control must take into account the
        information that is implicitly entailed by the accessible part of triple stores and
        ontologies. While there exist inference control frameworks for this purpose, they
        still have a limitation: the confidentiality criterion does not take into account the
        probabilistic knowledge of the attacker. Therefore, the existing controlled query
        evaluation methods may return answers that actually reveal that a secret is true
        with very high probability. Given that such probabilistic knowledge is becoming
        more and more widely available as a result of analytics of various sorts, it is
        important to develop a refined confidentiality framework where probabilistic
        knowledge is taken into due account. Accordingly, in this paper, we extend and
        generalize an abstract data filtering framework for confidentiality-preserving,
        policy-based data publishing. The confidentiality requirement is strengthened so
        that the probability that a secret is true is bounded by a small constant ε. We
        formally define such a probabilistic setting, then study two greedy data publishing
        methods, based on refusals and on lies, respectively. The refusal-based method is
        proved to be secure and maximally cooperative among a class of "reasonable"
        methods. We also prove that the natural generalization of the lying method is not
        secure. Furthermore, we extend the complexity hardness results from the
        deterministic framework to the probabilistic one.

        Keywords: A priori knowledge, Confidentiality criterion, Confidentiality-
        preserving data publishing, Cooperativeness, Inference control, Lying, Privacy,
        Probabilistic methods, Refusal


1     Introduction
The need for inference-proof data publishing was recognized well before the semantic
web was born. Still, semantic (meta)data, which are expressed with standardized
knowledge representation languages such as RDFS and OWL, are by design well suited
to the automated derivation of implied data. The applicable inference engines are
becoming more and more powerful and scalable. This makes the need for inference-proof
access control techniques particularly crucial in this area.
    The literature is rich in data filtering methods for confidentiality-preserving data
publishing (cf. [15]) that achieve secrecy properties of different kinds and strengths. In
* This work has been partially supported by the European Union's Horizon 2020 research and
  innovation programme under grant agreement N. 731601.

recent work [4], we have proposed an abstract framework that covers a wide range of
deterministic filtering methods while not relying on any particular data model or query
representation. Within this framework, we studied possibilistic secrecy regarding an
observer-specific confidentiality policy which consists of (potential) secrets in the form
of yes–no queries.
     For no secret query should a user be able to derive a "yes" answer, no matter how
powerful the rational reasoning method exploited by the user is and no matter how many
resources are employed. In this framework, confidentiality relies on the assumption that
one counterexample suffices to prevent the user from believing that the secret is true.
     However, reality may be different. The wide range of analytics tools for semantic
data provides rich statistical information about the real world, which may make the
available counterexamples unlikely. Moreover, if the majority of the interpretations of a
knowledge base agree that a secret is true, the user may be inclined to believe it. To put
this in general terms, a variety of probabilistic information may lead the user to believe
that a secret is actually true with high probability. Then the aforementioned confiden-
tiality framework should be refined by strengthening its confidentiality criterion so as
to accommodate a priori probabilistic information about the domain of discourse.
     Accordingly, in this paper, we generalize and extend the setting of [4] in the follow-
ing ways. First, we consider probabilistic a priori knowledge in the form of a probability
distribution over the set of all possible data sources. Second, we measure the informa-
tion regarding a secret query's validity learnt by observing filtered data and reasoning
about it. This measurement is taken as the conditional probability of the secret query's
validity under the observation. Third, we refine the secrecy criterion by requiring the
observer to believe in the truth of secrets with very low probability, bounded by a suit-
able threshold ε.

Example 1. Let us consider a simple artificial situation where the set of data sources
contains the 8 possible interpretations of 3 RDF triples p, q and r, represented by
the words pqr, pqr̄, pq̄r, pq̄r̄, p̄qr, p̄qr̄, p̄q̄r, p̄q̄r̄. Intuitively speaking, the filtering
should keep a joint validity of p and q confidential, formally expressed by the query
S = {pqr, pqr̄}, (the characteristic function of) which returns “yes” (true) for the two
interpretations with representation pq∗, and “no” (false) otherwise. Accordingly, the
confidentiality policy is just the singleton {S}.
    A straightforward filtering f hides the secret when applied to an interpretation by
always setting the first variable to p̄. Since in general we will allow uncertainty in a
generated view, a filtering will return a set of data sources, and thus this filtering is
formally defined by f(xyz) := {p̄yz}. However, seeing the verbatim view
{p̄yz}, an intelligent and knowledgeable observer can easily determine the inferred
view {pyz, p̄yz}, each of whose elements might be the actual data source underlying
the filtering. Nevertheless, since {pqr, pqr̄} ⊈ {pyz, p̄yz}, an observer will always
believe in the possibility that the secret is not valid in the actual data source. For our
probabilistic setting we are even more ambitious: we want the observer to believe in
a probability of the non-validity of the secret of not less than 1 − ε, for some security
parameter ε ∈ [0, 1].
    So let us further assume that the anticipated observer knows a priori the following
probability distribution µ: each of the two protected interpretations occurs only
relatively rarely, with probability 1/16; each of the two interpretations containing both p̄
and q̄ is more likely, having probability 3/16; and each of the remaining four interpreta-
tions containing either p or q is equally distributed, with probability 2/16. Then we have
µ(S) = µ({pqr, pqr̄}) = 1/16 + 1/16 = 1/8. By similar elementary calculations, for each
case we can determine the a posteriori probability of the secret being valid under the
condition of the inferred view based on the observation of the verbatim view. For the
straightforward filtering f defined above and the assumed probability distribution µ the
calculations show that these probabilities are either 1/3 or 0, and thus the filtering is seen
to be secure for each security parameter ε in the interval (1/3, 1].
    This example is summarized in Table 1 and recomputed in the sketch below; variants
of it serve as running examples throughout the paper.
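
As a cross-check, the following minimal sketch (Python with exact rationals; the
0/1-triple encoding of interpretations and all names are our own illustration, not part
of the formal framework) recomputes the conditional probabilities µ(S | [.]):

    from fractions import Fraction as F

    # Interpretations as truth-value triples for (p, q, r): (1,1,1) is pqr,
    # (0,1,1) is p̄qr, and so on. Weights are numerators over 16 (Example 1).
    w = {(1,1,1): 1, (1,1,0): 1,                 # the two protected interpretations
         (0,0,1): 3, (0,0,0): 3}                 # the two containing both p̄ and q̄
    D = [(p,q,r) for p in (1,0) for q in (1,0) for r in (1,0)]
    mu = lambda A: sum(F(w.get(d, 2), 16) for d in A)   # the remaining four get 2/16

    S = {(1,1,1), (1,1,0)}                       # secret: p and q jointly valid
    f = lambda d: frozenset({(0, d[1], d[2])})   # straightforward f(xyz) := {p̄yz}

    for d in D:
        view = {e for e in D if f(e) == f(d)}    # inferred view [d]_f
        print(d, mu(S & view) / mu(view))        # 1/3 on the q-rows, 0 on the q̄-rows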


Table 1. Summarized example, using the following notation: "→" indicates that the interpre-
tation d is mapped onto the singleton containing the neighboring interpretation in the same line;
"↓" indicates that the interpretation d is mapped onto the singleton containing itself; [.] denotes
the inferred view, formed as the equivalence class consisting of the two interpretations in the
respective line; µ(S | [.]) denotes the derived conditional probability of the secret S under the
inferred view [.]

    A priori  Interpre-  Filter-   Filter-   Interpre-  A priori  Inferred  Secret      A poste-
    µ(d)      tation d   ing f(d)  ing f(d)  tation d   µ(d)      view      fraction    riori
                                                                  µ([.])    µ(S ∩ [.])  µ(S | [.])
    1/16      pqr        →         ↓         p̄qr        2/16      3/16      1/16        1/3
    1/16      pqr̄        →         ↓         p̄qr̄        2/16      3/16      1/16        1/3
    2/16      pq̄r        →         ↓         p̄q̄r        3/16      5/16      0           0
    2/16      pq̄r̄        →         ↓         p̄q̄r̄        3/16      5/16      0           0



     Popular probabilistic models and approaches have already been introduced; differ-
ential privacy [11,12,13,14] is currently one of the most important of them. Under
suitable assumptions it is very effective and relatively easy to apply. Unfortunately,
when (probabilistic) background knowledge such as record correlations is available,
differential privacy may fail to preserve confidentiality [20]. For this reason, investi-
gating methods that preserve confidentiality in a probabilistic sense, in the presence
of probabilistic background knowledge, is still an important topic. Moreover, differen-
tial privacy is based on the perturbation of data or query answers, while we intend to
consider (also) methods that do not report incorrect answers.
     Accordingly, in this paper we provide as a further contribution a maximally coop-
erative, secure query-answering method based on a greedy refusal-based algorithm. We
also prove that – differently from the deterministic case – its lying-based analogue is
not secure. Finally, we extend some of the computational hardness results from the
deterministic framework to the probabilistic one.
     This abstract framework should be regarded as a preliminary, general feasibility
study that constitutes a first step towards inference-proof, confidential access control
methods for semantic data. Thus the implementation of concrete mechanisms still lies
beyond the scope of this paper.

    The paper is organized as follows. After formally defining the probabilistic setting
in Section 2, we analyze the security of the greedy filtering methods based on refusal and
lying in Section 3 and Section 4, respectively. In Section 5, we investigate the degree of
cooperativeness of the greedy refusal method (a form of optimality). Section 6 reports
the complexity results. Finally, in Section 7, we briefly summarize and evaluate our
achievements, discuss related work, and list some challenging open problems.


2      A Probabilistic Setting

As in the deterministic case presented in [4], the probabilistic framework is based on an
arbitrary set D of data sources, where each data source d ∈ D is treated as an abstract
entity. To enable reasoning about the probability of sets of data sources, we need to
introduce a σ-algebra ΣD ⊆ ℘(D), where ℘(D) denotes the powerset of D. By defini-
tion, ΣD includes both D and ∅ and is closed under the set operations of complement,
countable union and countable intersection. As for perfect encryption, we assume that
the attacking observer has probabilistic a priori knowledge about the owner's data
source. Such knowledge is formalized by a probability measure µ : ΣD −→ [0, 1],
that is, any function which satisfies the following conditions: (i) µ(D) = 1 and (ii) if
A1, A2, . . . ∈ ΣD are pairwise disjoint, then µ( ⋃i Ai ) = Σi µ(Ai).³
    We refer to [1] for a broader introduction to probability theory; here we just report a
few properties which will be used later on:

    – (complementation) µ(Ā) = 1 − µ(A), where Ā is the complement of A;
    – (monotonicity) if A ⊆ A′, then µ(A) ≤ µ(A′);
    – (∩-continuity) if A1, A2, . . . is a descending chain, i.e. Ai+1 ⊆ Ai for all i, then
      µ( ⋂i Ai ) = limi µ(Ai).

    Subsets Q of D can be regarded as Boolean queries, indicating whether a
data source d satisfies Q, i.e., d ∈ Q, or not, i.e., d ∉ Q. To comply with the proba-
bilistic framework, each query must be measurable. Let B ⊆ ΣD be the set of queries
considered. Clearly, since queries have to be expressed syntactically, B is countable.
    To keep some aspects of his data source secret from the observer, the owner can specify
a confidentiality policy consisting of a finite set S ⊆ B of queries, in this context called
(potential) secrets. Informally speaking, if the owner's data source satisfies a secret, then
the observer should not be able to learn this fact; conversely, the observer is allowed to
know that a secret is not satisfied.
    To preserve secrets, the owner applies a filtering f : D −→ ℘(D) \ {∅} to his
actual data source d and publishes f(d) as a (verbatim) secure view, making the ob-
server uncertain about which data source in f(d) is the actual one, and possibly even
misleading the observer by not including the actual data source d in f(d). Assuming
that the observer is a rational agent that knows the filtering f, the observer can compute
the inferred filtering [ · ]f : D −→ ℘(D) \ {∅}, defined by

      [ d ]f := { d′ | f(d) = f(d′) } .                                               (1)

 ³ Function µ is the analogue of the data generation function P of [20].

So, given the published verbatim view f(d), the observer can construct the inferred
view [ d ]f, which may still be "uncertain" (i.e., there are multiple data sources) but
definitely contains the actual data source d. To enforce the confidentiality policy,
the filtering should prevent the observer from inferring that a secret is satisfied. For
instance, in a deterministic setting, the filtering should enforce [ d ]f ⊈ S for all secrets
S ∈ S.
     Finally, to complete the probabilistic setting, we require that [ d ]f ∈ ΣD, i.e., each
set [ d ]f of data sources indistinguishable under f is supposed to be measurable. Then
we strengthen the security criterion by requiring that the probability that a secret is sat-
isfied by d – given the observable view f(d) – should not exceed a (small) threshold ε.
Definition 1. For a probabilistic setting with a priori knowledge µ : ΣD −→ [0, 1],
confidentiality policy S, and threshold ε ∈ [0, 1], a filtering f : D −→ ℘(D) \ {∅}
with measurable inferred views is (µ, S, ε)-secure iff for all data sources d ∈ D and for
all secrets S ∈ S, it holds that

      µ(S ∩ [ d ]f) ≤ ε · µ([ d ]f) .                                                 (2)

Note that inequality (2) means that, whenever defined (i.e., µ([ d ]f) ≠ 0), the condi-
tional probability of S under the inferred view [ d ]f,

      µ( S | [ d ]f ) := µ(S ∩ [ d ]f) / µ([ d ]f) ,

is bounded by ε.
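
For a finite set D, where every subset may be taken as measurable, condition (2) can be
checked by brute force. The following sketch (Python; the function and parameter names
are hypothetical) tests the inequality over all inferred views:

    # Sketch: decide (µ, S, ε)-security per Definition 1 for finite D.
    # `mu` maps a set of data sources to its probability, `f` maps a data
    # source to its (hashable) verbatim view, `policy` is the set of secrets.
    def is_secure(mu, f, policy, eps, D):
        for d in D:
            view = {e for e in D if f(e) == f(d)}    # inferred view [d]_f, cf. (1)
            for S in policy:
                if mu(S & view) > eps * mu(view):    # inequality (2) violated
                    return False
        return True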
    The abstract framework presented so far can be embodied in the context of Linked
Data as follows: data sources consist of RDF-graphs, Boolean queries are represented
by ASK forms of SPARQL,⁴ and secrets are specific Boolean queries, i.e., confidential
pieces of information that can be retrieved through (a sequence of) ASK forms. More-
over, the a priori knowledge of the observer is represented by a probability distribution
over possible RDF-graphs. More precisely, the set D is the set of all possible RDF-
graphs over a given signature of interest and ΣD is the whole powerset ℘(D). Then, µ
is determined by a discrete probability distribution δ over RDF-graphs such that for all
A ⊆ D,

      µ(A) = Σ_{d ∈ A} δ(d) .                                                         (3)

Finally, a Boolean query Qa is the denotation of an ASK form a, that is, the set of RDF-
graphs on which the form a returns true.
    In the remainder of this paper, we will study the secure filterings f that result from
a greedy construction of a decreasing sequence of sets Li(d) of data sources, for each
input data source d ∈ D:

      D =: L0(d) ⊇ L1(d) ⊇ · · · ⊇ Li(d) ⊇ · · · ⊇ ⋂i Li(d) := f(d) .                 (4)

 ⁴ Roughly speaking, an ASK form is a query that returns true if the RDF-graph satisfies a
   specified graph pattern, and false otherwise. Graph patterns allow one to verify, for example,
   whether a confidential set of RDF-triples or a specified node denoting some sensitive individual
   occurs in the graph.

All constructions will be based on an exhaustive enumeration B_en = ⟨Q1, . . . , Qi, . . .⟩
of all queries in B. Stepwise, each query Qi is submitted to a kind of stateful censor that
keeps track of previous answers to determine whether Qi should be answered correctly
or distorted. This decision determines Li(d) from Li−1(d).
    The sequence ⟨Li(d)⟩i also has a dynamic interpretation. The observer chooses the
queries Q1, Q2, . . . , Qi, . . . and submits them iteratively to the confidential data source.
The query answering system returns for each Qi the corresponding direct answer Ai(d),
leaving the computation of the accumulated information represented by Li(d) to the
observer. In this dynamic setting, the confidentiality criterion is that at each step i, the
accumulated information Li(d) of the direct answers to Q1, Q2, . . . , Qi should not tell
too much about the secrets.


3    Greedy Refusal

Under the refusal approach to data filtering, the hiding of harmful queries is explicitly
notified in some way. In a dynamic query-answering environment, this may result in
returning the special answer "mum" to a query whose correct answer would violate the
confidentiality policy.
    In the data filtering framework there is yet another interpretation of the refusal ap-
proach: some specific information about the actual data source is hidden by providing
a view which contains not only the actual data source but also other data sources. If
this generation of uncertainty is properly done, then the observer cannot know which
element of the view is the actual one, and therefore he is not able to infer that specific
information.
    A crucial point of the refusal approach is to block so-called meta-inferences. For
instance, if a filtering f refused only the queries that entail a secret, then refused an-
swers would always correspond to a secret satisfied by the actual data source d. The
basic strategy for blocking this kind of attack is to make the critical part of the censor
decision independent of the actual data source. This strategy has already been pro-
posed in the seminal work [27] on the refusal approach, further elaborated and refined
for various models, e.g., [2,3,7,5], adopted for the abstract framework [4], and will also
be exploited in the rest of this section for the probabilistic setting under consideration.
    Using the notations of Definition 1 and instantiating (4), for an enumeration B_en =
⟨Q1, . . . , Qi, . . .⟩ of all queries in B, the (greedy) refusal filtering fre results from the
iterative application of the following distortion criterion to each query Qi:

    IF  [censor criterion evaluating Qi's harmfulness in a d-independent way]
        for some secret S ∈ S,
        µ( S ∩ Li−1(d) ∩ Qi ) > ε · µ( Li−1(d) ∩ Qi ) or
        µ( S ∩ Li−1(d) ∩ Q̄i ) > ε · µ( Li−1(d) ∩ Q̄i )
    THEN [distortion by refusal]
        Li(d) := Li−1(d)
    ELSE [honest answer w.r.t. d]
        Ai(d) := IF d ∈ Qi THEN Qi ELSE Q̄i
        Li(d) := Li−1(d) ∩ Ai(d)
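
A minimal executable sketch of this censor loop (Python with exact rationals,
instantiated with the finite setting of Example 1; Example 2 below traces the same run
by hand, and the encoding and names are our own) could look as follows:

    from fractions import Fraction as F

    # The instance of Examples 1 and 2: the 8 interpretations of p, q, r.
    D = frozenset((p,q,r) for p in (1,0) for q in (1,0) for r in (1,0))
    w = {(1,1,1): 1, (1,1,0): 1, (0,0,1): 3, (0,0,0): 3}   # weights over 16
    mu = lambda A: sum(F(w.get(d, 2), 16) for d in A)
    policy = [frozenset({(1,1,1), (1,1,0)})]               # S = { S }
    eps = F(6, 16)

    def refusal_filter(d, queries):
        L = set(D)                                         # L_0(d) = D
        for Q in queries:
            Qc = D - Q                                     # complement of Q_i
            harmful = any(mu(S & L & Q)  > eps * mu(L & Q) or
                          mu(S & L & Qc) > eps * mu(L & Qc)
                          for S in policy)                 # d-independent check
            if not harmful:                                # honest answer w.r.t. d;
                L &= Q if d in Q else Qc                   # on refusal, L is unchanged
        return frozenset(L)

    ask = lambda i: frozenset(d for d in D if d[i] == 1)   # "p?", "q?", "r?"
    print(refusal_filter((1,1,1), [ask(0), ask(1), ask(2)]))
    # -> {(1,1,1), (1,0,1)}, i.e. {pqr, pq̄r}

Per the final remark of Example 2, the remaining queries of the exhaustive enumeration
leave this intermediate view unchanged.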

Example 2. Resuming Example 1 with µ(S) = µ({pqr, pqr̄}) = 1/16 + 1/16 = 2/16, we
consider the refusal filtering for d := pqr. We choose ε := 6/16 as security parameter
and take the powerset of the set of all interpretations over the propositional variables p,
q and r as the set of all queries.
    We start the enumeration of the 2⁸ queries with "p?", intuitively asking whether p
is valid, formalized as Q1 := {pqr, pqr̄, pq̄r, pq̄r̄} with µ(Q1) = 6/16. Then Q̄1 :=
{p̄qr, p̄qr̄, p̄q̄r, p̄q̄r̄} with µ(Q̄1) = 10/16. Accordingly, µ(S ∩ Q1) = µ({pqr, pqr̄}) = 2/16
and µ(S ∩ Q̄1) = µ(∅) = 0. Evaluating the censoring condition we get µ(S | Q1) =
1/3 < 6/16 and µ(S | Q̄1) = 0 < 6/16, and thus the honest answer A1(pqr) := Q1 is
processed to determine L1(d) := Q1.
    We continue the enumeration with the query "q?", intuitively asking whether q is
valid, formalized as Q2 := {pqr, pqr̄, p̄qr, p̄qr̄}. Evaluating the censoring condition,
we then get µ(L1(d) ∩ Q2) = µ({pqr, pqr̄}) = 2/16 and µ(S ∩ (L1(d) ∩ Q2)) =
µ({pqr, pqr̄}) = 2/16, and thus µ( S ∩ (L1(d) ∩ Q2) | (L1(d) ∩ Q2) ) = 1 leads to a
refusal resulting in L2(d) := L1(d) = {pqr, pqr̄, pq̄r, pq̄r̄}.
    Further continuing with "r?", i.e., Q3 := {pqr, pq̄r, p̄qr, p̄q̄r}, we get µ(L2(d) ∩
Q3) = µ({pqr, pq̄r}) = 3/16 and µ(S ∩ (L2(d) ∩ Q3)) = µ({pqr}) = 1/16 and thus µ( S ∩
(L2(d) ∩ Q3) | (L2(d) ∩ Q3) ) = 1/3 < 6/16, as well as µ(L2(d) ∩ Q̄3) = µ({pqr̄, pq̄r̄}) =
3/16 and µ(S ∩ (L2(d) ∩ Q̄3)) = µ({pqr̄}) = 1/16 and thus µ( S ∩ (L2(d) ∩ Q̄3) | (L2(d) ∩
Q̄3) ) = 1/3 < 6/16, leading to an honest reaction such that L3(d) := L2(d) ∩ A3(d) =
{pqr, pq̄r}.
    Though somewhat tedious to check, all further queries do not change that intermedi-
ate view and thus, for the actual data source pqr, we get {pqr, pq̄r} as the final verbatim
view, leaving an observer uncertain whether or not q is valid. One may note that in Ex-
ample 1 we got the verbatim view {pqr, p̄qr}, leaving the status of p open; we would
obtain this result by the refusal filtering if we exchanged the processing of the first two
queries.
Clearly, in the context of RDF-graphs considered above, the presented approach re-
quires that an ASK form returns a notification of the kind query_refused in case of
refusal.
Example 3. Consider an RDF-graph containing sensitive information about medical
treatments in a small town of 10 000 citizens. For privacy reasons, we want to enforce
anonymity by preventing the identification of any specific person with a confidence
larger than 0.5. Then, Oreste Galli, who actually occurs in the data source, wants to
verify that the refusal framework works properly by maliciously querying the system
about himself. In particular, he considers two queries: the former asks whether there ex-
ists an individual in the graph whose name is Oreste, the latter asks for the presence of
an individual who has his street address, Machiavelli Street. These queries may
look like:
    Q1 = ASK(?x       ex : given_name “Oreste”)
    Q2 = ASK(?x       ex : street_address “Machiavelli”)
The a priori knowledge is that (i) the data source contains 1% of the whole pop-
ulation, (ii) 10 people are called Oreste, and (iii) 15 people live in Machiavelli Street.

Then, by applying Bayes' theorem, the conditional probability of the secret
S – that Oreste Galli occurs in the data source – after querying Q1 is given by

      µ(S | Q1) = µ(Q1 | S) · µ(S) / µ(Q1) .

Note that the fact that Oreste Galli occurs in the graph implies that there exists a person
called Oreste; in other words, since S ⊂ Q1, we have that µ(Q1 | S) = 1. Furthermore,
µ(S) is the prior probability that Oreste Galli occurs in the graph, which is equal to the
percentage of people occurring in the graph, i.e., 0.01. Finally, µ(Q1) is the probability
that, given a generic graph containing 1% of the whole population, there exists at
least one person in the graph called Oreste. This is given by the formula

      µ(Q1) = (N − K) / N ,

where N = C(10 000, 100) is the number of all possible graphs containing 1% of the whole
population, and K is the number of possible graphs where no Oreste occurs. Since only
10 people out of 10 000 are called Oreste, K = C(9 990, 100). Then, we straightforwardly have
that µ(S | Q1) = 0.104.
     Note that, since S ⊆ Q1, it follows that µ(S | Q̄1) is equal to zero. Consequently,
since µ(S | Q1) < 0.5, the censor correctly answers Q1. Then, the user queries Q2.
By using Bayes' theorem again, the censor estimates the conditional probability of
S given Q2, provided that Q1 has already been answered and hence the prior proba-
bility is 0.104 instead of 0.01. By a similar calculation, the resulting probability is 0.74;
consequently, the censor refuses Q2.
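
The arithmetic of this example is easy to reproduce. A small sketch (Python, standard
library only), under the same assumed sampling model, namely that the graph is a
uniformly random 100-person subset of the 10 000 citizens:

    from math import comb

    N = comb(10_000, 100)        # all graphs containing 1% of the population
    K = comb(9_990, 100)         # graphs containing none of the 10 Orestes
    mu_Q1 = (N - K) / N          # probability that some Oreste occurs
    posterior1 = 1 * 0.01 / mu_Q1            # Bayes: mu(Q1 | S) = 1, mu(S) = 0.01
    print(f"{posterior1:.4f}")   # 0.1045 -- the 0.104 reported above

    # Second query: at least one of the 15 Machiavelli Street residents occurs.
    mu_Q2 = (N - comb(9_985, 100)) / N
    posterior2 = posterior1 / mu_Q2
    print(f"{posterior2:.4f}")   # 0.7465 -- the 0.74 above; > 0.5, so Q2 is refused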

Theorem 1. Let B_en be an enumeration of B. For each µ, S and ε ∈ [0, 1], fre is
a filtering function with measurable inferred views; moreover, fre is (µ, S, ε)-secure,
provided the following precondition holds:

                µ(S) = µ(S ∩ L0(d)) / µ(L0(d)) ≤ ε , for all S ∈ S.⁵

Proof. We first prove the following sub-statements.
    Fact 1: fre(d) ⊆ [d]fre. Assume that d′ ∈ fre(d); we will show by induction on
the iterative construction of fre that fre(d′) = fre(d); consequently, d′ ∈ [d]fre. Since
L0(d′) = D = L0(d), the base case i = 0 trivially holds. Let i > 0 and assume
by induction hypothesis that Li−1(d) = Li−1(d′). Clearly, then for all S ∈ S, both
µ(S | Li−1(d) ∩ Qi) = µ(S | Li−1(d′) ∩ Qi) and µ(S | Li−1(d) ∩ Q̄i) = µ(S |
Li−1(d′) ∩ Q̄i); this means that the refusal behavior of the censor to the query Qi will
be the same for both d and d′. Moreover, since d′ ∈ fre(d), we also have that d′ ∈ Ai(d)
and, hence, Ai(d) = Ai(d′). Consequently, Li(d) = Li(d′).
    Fact 2: d ∈ fre(d). Again, by induction on the iterative construction of fre, the base
case, d ∈ L0(d) = D, holds. Furthermore, assume that d ∈ Li−1(d), where i > 0.
 ⁵ This precondition is required since no secret which is violated ex ante can be protected by any
   filtering.

If the censor refuses Qi, then Li(d) = Li−1(d) and the induction hypothesis directly
implies that d ∈ Li(d). Otherwise, Li(d) = Li−1(d) ∩ Ai(d). Since by induction
hypothesis d ∈ Li−1(d) and by construction d ∈ Ai(d), also in this case d ∈ Li(d).
     Fact 3: [d]fre ⊆ fre(d). Assume d′ ∈ [ d ]fre, i.e., fre(d) = fre(d′). By Fact 2,
d′ ∈ fre(d′), so we have d′ ∈ fre(d).

     That fre is a filtering function, i.e. fre(d) ≠ ∅ for all d ∈ D, directly follows
from Fact 2. Moreover, by Facts 1 and 3, fre(d) = [d]fre. Note that, since B ⊆ ΣD
and fre(d) consists of a countable intersection of elements of B or their complements,
fre(d) ∈ ΣD and hence [d]fre ∈ ΣD too. Finally, the statement's precondition and the
definition of fre immediately imply the following invariant:

               µ(S ∩ Li(d)) ≤ ε · µ(Li(d)) , for all i and for all S ∈ S.

Hence, taking the limits, it holds that:

                         limi µ(S ∩ Li(d)) ≤ ε · limi µ(Li(d)) .

On the other hand, by the ∩-continuity of µ, µ(fre(d)) = µ( ⋂i Li(d) ) = limi µ(Li(d))
and µ(S ∩ fre(d)) = µ( S ∩ ⋂i Li(d) ) = µ( ⋂i (S ∩ Li(d)) ) = limi µ(S ∩ Li(d)).
Putting all together, we have that

                               µ(S ∩ fre(d)) ≤ ε · µ(fre(d)) .

The theorem immediately follows by recalling that fre(d) = [d]fre.                  ⊓⊔
Remark 1. The above theorem and its proof show that the refusal approach is inference-
proof in a very strong sense. Namely, as fre(d) always coincides with [d]fre, inferring
does not provide any information beyond what is immediately visible from the verbatim
view. Moreover, the two views contain only limited information about the secrets, as
specified by the security parameter ε.


4   Greedy Lying
Under the lying approach to data filtering, a harmful piece of data is implicitly distorted,
without any notification, by saying that it is false. In a dynamic query-answering envi-
ronment, such a distortion consists in returning the complement of the correct answer to
a query Q whenever the correct answer violates the confidentiality policy. In the frame-
work of data filterings, lying aims at providing a view that contains a single data source
(if necessary, different from the actual one) in which all the secrets are false.
     A crucial point of the lying approach is to make a current answer consistent with
previously accumulated knowledge and to prepare for being able to provide definite
(though possibly untrue) answers to any further queries. The basic strategy for achiev-
ing these goals is to always protect the disjunction of all secrets (in the abstract frame-
work, their union). This strategy has already been proposed in the seminal work [9] on
the lying approach, further elaborated and refined for various models, e.g., [2,3,7,8],
adopted for the abstract framework [4], and will also be tentatively exploited next in the
probabilistic setting.

    Interestingly, the natural probabilistic generalization of the deterministic greedy
lying approach defined in [4] is not secure, as shown in the following.
    Using the notations of Definition 1 and instantiating (4), for an enumeration
B_en = ⟨Q1, . . . , Qi, . . .⟩ of queries in B, the natural probabilistic generalization of
the (greedy) lying filtering fly is obtained by iteratively applying the following distor-
tion mechanism to all queries Qi (i = 1, 2, . . .):

     Ai(d) := IF d ∈ Qi THEN Qi ELSE Q̄i
     IF    [censoring by checking harmfulness of the correct (d-dependent) answer]
           µ( (⋃S) ∩ Li−1(d) ∩ Ai(d) ) > ε · µ( Li−1(d) ∩ Ai(d) )
     THEN [distortion by lying]
           Li(d) := Li−1(d) ∩ Āi(d)
     ELSE [honest answer w.r.t. d]
           Li(d) := Li−1(d) ∩ Ai(d)

The precondition for applying the greedy lying method, by analogy with the determin-
istic case, should be:

                                   µ( ⋃S ) ≤ ε .
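
A minimal executable sketch of this lying censor (Python with exact rationals; for
concreteness it is instantiated with the data of Proposition 1 below, and the integer
encoding of the data sources is our own) could be:

    from fractions import Fraction as F

    # The instance of Proposition 1 below: D = {d1, d2, d3}, S = {{d2}}, eps = 1/2.
    D = frozenset({1, 2, 3})                        # integers stand for d1, d2, d3
    point = {1: F(7,10), 2: F(2,10), 3: F(1,10)}    # a priori knowledge
    mu = lambda A: sum(point[d] for d in A)
    U = frozenset({2})                              # union of all secrets
    eps = F(1, 2)
    queries = [frozenset(q) for q in
               ({2}, {2,3}, set(), {1}, {3}, {1,2}, {1,3}, {1,2,3})]

    def lying_filter(d):
        L = set(D)
        for Q in queries:
            A = Q if d in Q else D - Q              # correct answer w.r.t. d
            if mu(U & L & A) > eps * mu(L & A):     # harmful: lie, i.e. take
                A = D - A                           # the complemented answer
            L &= A
        return frozenset(L)

    print(lying_filter(2))    # frozenset(), i.e. ∅ -- the d2 column of Table 2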

Proposition 1. There exist µ, S and ε for which the above precondition is satisfied but
fly is not (µ, S, ε)-secure.

Proof. Let us consider the following setting. The set of data sources is D = {d1, d2, d3}.
The a priori knowledge for di is described by the following measures: µ(d1) = 0.7,
µ(d2) = 0.2 and µ(d3) = 0.1. The confidentiality policy S contains only the secret
S = {d2}, i.e., S = {{d2}}. The value of the threshold is given by ε = 0.5. The set
of queries is B = ℘(D). We employ the following enumeration for the queries, B_en =
⟨Q1, . . . , Q8⟩, where the Qi are defined as follows: Q1 = {d2} = S, Q2 = {d2, d3},
Q3 = ∅, Q4 = {d1}, Q5 = {d3}, Q6 = {d1, d2}, Q7 = {d1, d3}, Q8 = {d1, d2, d3}.
    Let us then examine the execution of the filtering fly as defined above. We first
show the computation for the data source d2. We start with L0(d2) = D = {d1, d2, d3}.
The answers to the queries are computed as follows, where – in order to ease the
presentation – we will use the notation µi(d) = µ( ⋃S | Li−1(d) ∩ Ai(d) ) =
µ( {d2} | Li−1(d) ∩ Ai(d) ):

 – Q1 = {d2}: Since d2 ∈ Q1, it follows that A1(d2) = Q1 = {d2} and, thus,
   L0(d2) ∩ A1(d2) = {d2}. We obtain that µ1(d2) = µ( ⋃S | L0(d2) ∩ A1(d2) ) =
   µ({d2} | {d2}) = 1 ≥ ε = 0.5. In this case, the filter lets

            L1(d2) = L0(d2) ∩ Ā1(d2) = {d1, d2, d3} ∩ {d1, d3} = {d1, d3} .

 – Q2 = {d2, d3}: Since d2 ∈ Q2, it follows that A2(d2) = Q2 = {d2, d3} and, thus,
   L1(d2) ∩ A2(d2) = {d3}. We obtain that µ2(d2) = µ({d2} | {d3}) = 0 and, thus,

                L2(d2) = L1(d2) ∩ A2(d2) = {d1, d3} ∩ {d2, d3} = {d3} .

 – Q3 = ∅: Since d2 ∉ Q3, it follows that A3(d2) = Q̄3 = {d1, d2, d3} and, thus,
   L2(d2) ∩ A3(d2) = {d3}. As before, µ3(d2) = µ({d2} | {d3}) = 0 and, thus,

                L3(d2) = L2(d2) ∩ A3(d2) = {d3} ∩ {d1, d2, d3} = {d3} .

 – Q4 = {d1}: Since d2 ∉ Q4, it follows that A4(d2) = Q̄4 = {d2, d3} and, thus,
   L3(d2) ∩ A4(d2) = {d3}. As before, µ4(d2) = µ({d2} | {d3}) = 0 and, thus,

                L4(d2) = L3(d2) ∩ A4(d2) = {d3} ∩ {d2, d3} = {d3} .

 – Q5 = {d3}: Since d2 ∉ Q5, it follows that A5(d2) = Q̄5 = {d1, d2} and, thus,
   L4(d2) ∩ A5(d2) = ∅. In this case the filter computes

                                     L5(d2) = ∅ .
The complete construction of fly is summarized in Table 2.


                                  Table 2. Construction of fly

                                 d1              d2              d3
                     L0(·)  {d1, d2, d3}    {d1, d2, d3}    {d1, d2, d3}
                     L1(·)  {d1, d3}        {d1, d3}        {d1, d3}
                     L2(·)  {d1}            {d3}            {d3}
                     L3(·)  {d1}            {d3}            {d3}
                     L4(·)  {d1}            {d3}            {d3}
                     L5(·)  {d1}            ∅               ∅
                     L6(·)  {d1}            ∅               ∅
                     L7(·)  {d1}            ∅               ∅
                     L8(·)  {d1}            ∅               ∅



Given the above table, we can write that [d1]fly = {d1} and [d2]fly = [d3]fly =
{d2, d3}. This means that µ( ⋃S | [d2]fly ) = µ({d2} | {d2, d3}) = 2/3 > ε, thus the
proposition follows.                                                                  ⊓⊔
     Proposition 1 shows that the natural probabilistic generalization of the greedy lying
approach is not secure. The existence of secure probabilistic variants of greedy lying is
left as an open problem.


5     Cooperativeness
In this section we show that, under some natural assumptions, the greedy method based
on refusals is maximally cooperative, that is, it hides a minimal amount of information.
Maximal cooperativeness is formalized in [4] as follows:
Definition 2 ([4]). A filtering f is more cooperative than a filtering g iff for all d ∈ D,
[d]f ⊆ [d]g. If f1 is more cooperative than f2 then we write f1 ⪯ f2. If f1 ⪯ f2 and
f2 ⋠ f1, then we write f1 ≺ f2.

Informally speaking, if f is more cooperative than g then f systematically refuses to
answer fewer queries than g (or the same queries as g), because the partition induced by
f is finer than (or equal to) g's. Maximally cooperative filterings are called optimal.

Definition 3 ([4]). A secure filtering f is optimal iff there exists no secure filtering f′
such that f′ ≺ f.
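
For finite D, these relations can be checked directly from the inferred views. A small
sketch (Python; the helper names are our own):

    # Sketch: compare two filterings per Definitions 2 and 3 for finite D.
    def inferred_view(f, d, D):
        return frozenset(e for e in D if f(e) == f(d))      # [d]_f, cf. (1)

    def more_cooperative(f, g, D):                          # f ⪯ g
        return all(inferred_view(f, d, D) <= inferred_view(g, d, D) for d in D)

    def strictly_more_cooperative(f, g, D):                 # f ≺ g
        return more_cooperative(f, g, D) and not more_cooperative(g, f, D)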
    Currently, we do not know whether the greedy refusal filtering fre is optimal in this
strong sense. It is difficult to compare fre with arbitrary filterings g because g might
exploit partitions (i.e. secure views) that cannot be denoted with the query language
(while fre , by construction, can only exploit the expressive power of B). What we are
going to prove is that fre is maximally cooperative among the filterings whose views
can be defined using B. This class of filterings is formally defined as follows:
Definition 4. A filtering g is query-based iff for all d ∈ D, there exists a query Q ∈ B
such that [d]g = Q.
   To prove the optimality of fre with respect to query-based filterings, we adopt some
mild restrictions. We assume that data sources are countable:

                                   D = {d1 , d2 , . . . , di , . . .}.

For example, this is true of knowledge bases, as well as database tables whose values
range over countable domains such as strings, integers and rational numbers. A property
of this discrete framework is that all (possibly infinite) collections of pairwise disjoint
subsets of D are countable.⁶ The cooperativeness theorem is now formalized as follows:
Theorem 2. If D is countable, then there exists no (µ, S, ε)-secure query-based filter-
ing g such that g ≺ fre.

Proof. In the following, for the sake of readability, we abbreviate fre with f. Suppose
the theorem does not hold (we will derive a contradiction). Then there exists a (µ, S, ε)-
secure query-based filtering g ≺ f. This means that for all d ∈ D, [d]g ⊆ [d]f and, for
some d0 ∈ D, [d0]g ⊂ [d0]f.
     Since g is query-based, for some step k of the greedy construction the query Qk
satisfies Qk = [d0]g. Note that in f's construction, Qk is refused (otherwise [d0]f ⊆
[d0]g would hold, which is a contradiction). This means that for some secret S0 ∈ S,
either

      µ(S0 ∩ Lk−1(d0) ∩ Qk) > ε · µ(Lk−1(d0) ∩ Qk) , or                             (5)

      µ(S0 ∩ Lk−1(d0) ∩ Q̄k) > ε · µ(Lk−1(d0) ∩ Q̄k) .                               (6)

Note that Lk−1(d0) ⊇ [d0]f ⊃ [d0]g = Qk, therefore (5) entails µ(S0 ∩ [d0]g) >
ε · µ([d0]g). This inequality cannot hold, because g is (µ, S, ε)-secure by assumption,
so (5) does not hold, and hence (6) must hold.
     Finally, we prove that the security of g implies that (6) cannot hold either, which
proves the theorem.
 ⁶ To see this, associate each X ⊆ D in the collection to the integer min{i | di ∈ X}.

     Let C = {[d]g | d ∈ Lk−1(d0) ∩ Q̄k}. Recall that C can be enumerated, because
its elements are pairwise disjoint: so let C = {X1, X2, . . . , Xi, . . .}, where each Xi is
an equivalence class induced by g.
     Since g is (µ, S, ε)-secure, for all Xi ∈ C we have

      µ(S0 ∩ Xi) ≤ ε · µ(Xi) .                                                      (7)

It follows that

      µ(S0 ∩ Lk−1(d0) ∩ Q̄k) = µ( S0 ∩ ⋃C )
                             = Σi µ(S0 ∩ Xi)
                             ≤ Σi ε · µ(Xi)        by (7)
                             = ε · µ( ⋃C )
                             = ε · µ( Lk−1(d0) ∩ Q̄k ) .

This contradicts (6).                                                                ⊓⊔


6     Computational Complexity

The computational complexity of the non-probabilistic framework has been studied in
[4] using finite D (the abstract analogue of propositional logic frameworks). For such
frameworks, it can be assumed that all subsets of D are measurable and that B =
℘(D). Such finite non-probabilistic frameworks can be seen as a special case of the
probabilistic framework: for all X ⊆ D, let

                µ0(X) = |X| / |D|     and     ε0 = 1 − 1/|D| .

Then it can be proved that:
Proposition 2. For all S ⊆ B, a filtering is (µ0, S, ε0)-secure iff it is secure in the
sense of [4].
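
A small sketch (Python with exact rationals; the function name is our own) of the
embedding behind this proposition:

    from fractions import Fraction as F

    # Sketch: the uniform measure µ0 and threshold ε0 that embed the finite
    # non-probabilistic framework of [4] into the probabilistic one.
    def embed(D):
        n = len(D)
        mu0 = lambda X: F(len(X), n)
        eps0 = 1 - F(1, n)
        return mu0, eps0

    # With these, condition (2), i.e. mu0(S ∩ [d]f) <= eps0 * mu0([d]f), fails
    # exactly when [d]f ⊆ S, i.e. precisely when the possibilistic criterion
    # of [4] ([d]f ⊄ S for every d and S) is violated.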
    Thanks to this proposition, all the computational hardness results of [4] can imme-
diately be extended to the probabilistic framework. In particular, for any given µ, S,
and ε:

    – (Optimality checking) Deciding whether a given filtering f is an optimal (µ, S, ε)-
      secure filtering is coNP-hard.
    – (Query availability) Given a data source d ∈ D and a query of interest Q ∈ B,
      deciding whether there exists a (µ, S, ε)-secure filtering f that preserves Q on d
      (that is, [d]f ⊆ Q iff d ∈ Q) is NP-complete.

7    Conclusions and related work

Extending recent work [4], we presented an abstract probabilistic setting for data fil-
terings to achieve confidentiality according to a policy, in the presence of probabilistic
background knowledge. The security property is probabilistic, too: a rational and knowl-
edgeable observer should not be able to believe in the truth of a secret with a probability
larger than a given parameter ε.
    We proved that a natural, probabilistic generalization of the refusal-based controlled
query evaluation approach is both secure and maximally cooperative among a class of
"reasonable" filterings, which define their secure views using the query language. Interest-
ingly, the corresponding lying approach is not secure (differently from the deterministic
framework). Finally, computational hardness results for some decision problems have
been extended from the deterministic case to the probabilistic case.
    The security criterion can easily be modified by associating a different threshold
εi to each secret Si, and adapting the precondition of Theorem 1 accordingly. In this
way one can simulate relativistic privacy preservation requirements, where the inferred
probability of each secret S remains close to the prior probability µ(S) of S, just by
defining the thresholds as εi = µ(Si) + ε̃i. Then the precondition of Theorem 1 is
satisfied by the definitions. Proofs do not require any significant changes.
    Our work continues a long line of research on imposing confidentiality constraints
on computing systems, including logic-oriented information systems used for query an-
swering and data publishing. The research line on refusals and controlled interaction
execution has been devoted to non-probabilistic methods so far [27,3,7,6,5]. Several of
these papers additionally, and [9,8] explicitly, deal with lies in non-probabilistic settings.
    Similarly, the non-interference property for inference control in non-logical, opera-
tional settings was originally not probabilistic [16], but some later and recent work
also considers probabilities, e.g. [17,26,10,19,29], in particular to measure the (aver-
age, entropy-based) information flow from security-high (secret) inputs to security-low
(open) outputs. Moreover, formalized in a more general framework of value transmis-
sion over a randomized channel, the authors of [19], building on [10], emphasize that the
actual probabilities might differ from those believed by the attacker, and they then study
both the (decrease of) belief-uncertainty – defined as min-entropy of belief vulnerability
– and the impact of the posterior inaccuracy of the attacker's belief. These studies, how-
ever, leave constructive methods to minimize secrecy-violating effects to future work.
Also inspired by [10] – and rediscovering fundamental insight about refusals [27] –,
now within a framework of procedural programs whose execution semantics are seen
as a transformation of probabilities of states over numeric variables, the authors of [23]
propose to employ efficiently updatable probabilistic polyhedra to approximate the cur-
rently assumed probabilistic knowledge of the attacker.
    Other popular methods, such as k-anonymity with l-diversity [25,28,22], are partly
covered by the deterministic abstract framework [4]. In some sense, though, k-
anonymity lies in between possibilistic and probabilistic methods, since its first con-
fidentiality criterion is aimed at ensuring a sufficient number of alternative states of the
world (governed by the k parameter) to prevent re-identification. The approach of weak-
ened relational views [5] pursues a closely related goal within a more general setting.

Moreover, probabilistic refinements of l-diversity of values associated with a k-block
of worlds attempt to let each of the associated values appear to be equally plausible.
     Halpern and O’Neill studied probabilistic secrecy in the context of dynamic sys-
tems [18]. Their approach is based on a modal logic of beliefs whose models encode a
streamlined account of system runs. Our technical analysis differs in several respects.
Their focus is on the definition and logical characterization of security properties; they
do not show how to develop secure systems, nor do they deal with cooperativeness.
Moreover, they do not estimate the complexity of achieving security.
     The methods based on lies (which have been extensively studied in the above non-
probabilistic contexts) can be regarded as the mainstream approach among probabilistic
methods, in the form of random answer perturbations [15]. Concerning differential privacy
[11,12,13,14], the main differences from our framework are two: (i) the differential
privacy model does not embody prior knowledge such as record correlation, which may
affect confidentiality [20]; in our model the a priori knowledge modeled by µ is used to
ensure confidentiality even in the presence of record correlations and the like. (ii) The
filterings for differential privacy are probabilistic while our filterings are not. Extending
our framework to probabilistic filterings is an interesting topic for further research.
     More thorough recent comparisons and discussions of various probabilistic ap-
proaches to enforcing confidentiality of sensitive information in provider-consumer in-
teractions are provided by [21] and [24].
     The challenges for future work include dealing with the inherent complexity of
some decision problems of interest (see, e.g., [4,29]), working safely with approximate
estimates of µ (see, e.g., [10,19,14]), and designing efficient implementations of the
secure methods (see, e.g., [23]).

References
 1. P. Billingsley. Probability and Measure, 3rd edition. John Wiley & Sons, New York, NY,
    USA, 1995.
 2. J. Biskup and P. A. Bonatti. Lying versus refusal for known potential secrets. Data Knowl.
    Eng., 38(2):199–222, 2001.
 3. J. Biskup and P. A. Bonatti. Controlled query evaluation for enforcing confidentiality in
    complete information systems. Int. J. Inf. Sec., 3(1):14–27, 2004.
 4. J. Biskup, P. A. Bonatti, C. Galdi, and L. Sauro. Optimality and complexity of inference-
    proof data filtering and CQE. In M. Kutylowski and J. Vaidya, editors, European Symposium
    on Research in Computer Security, ESORICS 2014, Part II, volume 8713 of Lecture Notes
    in Computer Science, pages 165–181. Springer, 2014.
 5. J. Biskup and M. Preuß. Information control by policy-based relational weakening templates.
    In I. G. Askoxylakis, S. Ioannidis, S. K. Katsikas, and C. A. Meadows, editors, Computer
    Security - ESORICS 2016 - 21st European Symposium on Research in Computer Security,
    Part II, volume 9879 of Lecture Notes in Computer Science, pages 361–381. Springer, 2016.
 6. J. Biskup and C. Tadros. Preserving confidentiality while reacting on iterated queries and
    belief revisions. Ann. Math. Artif. Intell., 73(1-2):75–123, 2015.
 7. J. Biskup and T. Weibert. Keeping secrets in incomplete databases. Int. J. Inf. Sec., 7(3):199–
    217, 2008.
 8. J. Biskup and L. Wiese. A sound and complete model-generation procedure for consistent
    and confidentiality-preserving databases. Theoretical Computer Science, 412:4044–4072,
    2011.

 9. P. A. Bonatti, S. Kraus, and V. S. Subrahmanian. Foundations of secure deductive databases.
    IEEE Trans. Knowl. Data Eng., 7(3):406–422, 1995.
10. M. R. Clarkson, A. C. Myers, and F. B. Schneider. Quantifying information flow with beliefs.
    Journal of Computer Security, 17(5):655–701, 2009.
11. I. Dinur and K. Nissim. Revealing information while preserving privacy. In F. Neven,
    C. Beeri, and T. Milo, editors, 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Prin-
    ciples of Database Systems, PODS 2003, pages 202–210. ACM, 2003.
12. C. Dwork. Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener,
    editors, Automata, Languages and Programming, 33rd International Colloquium, ICALP
    2006, Part II, volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer,
    2006.
13. C. Dwork. Differential privacy: A survey of results. In M. Agrawal, D. Du, Z. Duan,
    and A. Li, editors, Theory and Applications of Models of Computation, 5th International
    Conference, TAMC 2008, volume 4978 of Lecture Notes in Computer Science, pages 1–19.
    Springer, 2008.
14. C. Dwork and G. N. Rothblum. Concentrated differential privacy. CoRR, abs/1603.01887,
    2016.
15. B. C. M. Fung, K. Wang, A. W.-C. Fu, and P. S. Yu. Introduction to Privacy-Preserving Data
    Publishing – Concepts and Techniques. Chapman & Hall/CRC, Boca Raton, FL, 2010.
16. J. A. Goguen and J. Meseguer. Unwinding and inference control. In IEEE Symposium on
    Security and Privacy, pages 75–87, 1984.
17. J. W. Gray III. Toward a mathematical foundation for information flow security. Journal of
    Computer Security, 1(3-4):255–294, 1992.
18. J. Y. Halpern and K. R. O’Neill. Secrecy in multiagent systems. ACM Trans. Inf. Syst. Secur.,
    12(1):5.1–5.47, 2008.
19. S. Hamadou, C. Palamidessi, and V. Sassone. Quantifying leakage in the presence of unreli-
    able sources of information. J. Comput. Syst. Sci., 88:27–52, 2017.
20. D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In T. K. Sellis, R. J. Miller,
    A. Kementsietsidis, and Y. Velegrakis, editors, ACM SIGMOD International Conference on
    Management of Data, SIGMOD 2011, pages 193–204. ACM, 2011.
21. J. Liu, L. Xiong, and J. Luo. Semantic security: Privacy definitions revisited. Trans. Data
    Privacy, 6(3):185–198, 2013.
22. A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. L-diversity: privacy
    beyond k-anonymity. TKDD, 1(1), 2007.
23. P. Mardziel, S. Magill, M. Hicks, and M. Srivatsa. Dynamic enforcement of knowledge-
    based security policies using probabilistic abstract interpretation. Journal of Computer Se-
    curity, 21(4):463–532, 2013.
24. C. Palamidessi. Quantitative approaches to the protection of private information: State of the
    art and some open challenges. In R. Focardi and A. C. Myers, editors, Principles of Security
    and Trust - 4th International Conference, POST 2015, volume 9036 of Lecture Notes in
    Computer Science, pages 3–7. Springer, 2015.
25. P. Samarati. Protecting respondents’ identities in microdata release. IEEE Trans. Knowl.
    Data Eng., 13(6):1010–1027, 2001.
26. T. Santen. Preservation of probabilistic information flow under refinement. Inf. Comput.,
    206(2-4):213–249, 2008.
27. G. L. Sicherman, W. de Jonge, and R. P. van de Riet. Answering queries without revealing
    secrets. ACM Trans. Database Syst., 8(1):41–59, 1983.
28. L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncer-
    tainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
29. H. Yasuoka and T. Terauchi. Quantitative information flow as safety and liveness hyperprop-
    erties. Theor. Comput. Sci., 538:167–182, 2014.