<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Serenade: An Approach for Diferentially Private Greedy Redescription Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maiju Karjalainen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Esther Galbrun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pauli Miettinen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Eastern Finland</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present an approach for mining diferentially private redescriptions. Applying diferential privacy to pattern mining approaches such as redescription mining is less straightforward than with numerical methods, in part because local patterns are more susceptible to the kind of noise inherent to diferential privacy, and in part because the typically combinatorial algorithms require many operations and become computationally unaofrdable when overloaded with privacy preserving procedures. Our solution, and the two algorithms we provide, address both the computational bottlenecks as well as the issues with noise. They might also provide inspiration for other diferentially private pattern mining algorithms. Extensive experiments on real-world data validate the usefulness of our algorithms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As awareness about the rights of data subjects rises, so does the demand for
privacypreserving data mining methods. This trend, combined with concerns about black-box
models being used in many domains, has lead to a need for methods that couple
interpretability and privacy. Here we present algorithms answering this need, bringing
diferential privacy to redescription mining.</p>
      <p>
        Diferential privacy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is the most promising approach for privacy preserving data
analysis. It allows to provide guarantees on the level of privacy ofered, even in the
presence of side information. Diferentially private results are also secure against
postprocessing: no amount of post-processing will reveal any further private information from
the results of diferentially-private algorithms.
      </p>
      <p>
        It is widely recognized that pattern and rule-based approaches, such as redescription
mining [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], are highly interpretable data analysis approaches. However, integrating
diferential privacy with such approaches is not simple, as they are based on local
properties of the data, which are more prone to privacy leakage than global properties.
Moreover, the combinatorial nature of these approaches makes them typically more
susceptible to noise than continuous optimization methods.
      </p>
      <p>We tackle these obstacles and present two algorithms based on existing approaches
for diferential privacy and redescription mining: a greedy algorithm for redescription
mining, and the exponential and Laplace mechanisms for diferential privacy. Their
successful combination requires developing novel methods. In particular, we present a
general technique to lower the sensitivity of the quality measure to magnitudes well below
1, allowing much more accurate results to be drawn from the exponential mechanism
without compromising the privacy. In addition, we use weighted reservoir sampling for
memoization, yielding an eficient implementation of the exponential mechanism.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>Redescription mining. Redescription mining is a data analysis task that aims at finding
two diferent ways to describe almost the same observations. In our setting, the data set
 consists of a pair of tables, respectively denoted D and D . Both tables are over the
same entities (rows) but disjoint collections of attributes (columns).</p>
      <p>Such a pair of data tables, along with user-defined parameters, constitute the input
of redescription mining. A redescription consists of a pair of queries denoted by   and
  , one over either data table. A query consists of literals over Boolean, categorical
or numerical attributes, combined with logical conjunction and disjunction operators,
possibly involving negations. For instance, the query</p>
      <p>= (¬ Hypertension ∧ [State = CA]) ∨ [BirthY ≤ 1950]
describes entities, here individuals, who either do not sufer from hypertension and reside
in California, or are born before 1950. The collection of entities described by the queries
is called the support of the query, denoted supp( ).</p>
      <p>The main quality of a redescription is that the queries have similar supports. Referred
to as its accuracy, it is measured using the Jaccard coeficient:
 (  ,   ) = |supp(  ) ∩ supp(  )| / |supp(  ) ∪ supp(  )| .
(1)
In addition to having a high accuracy, to be of interest a redescription should cover
enough entities. This can be ensured by setting a minimum support threshold.</p>
      <p>
        Since the introduction of redescription mining [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], various algorithms have been
proposed for the task, including algorithms based on decision trees [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ], on mining
and combining frequent itemsets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], or considering regression and correlation models [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
We build on the ReReMi algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], that iteratively grows the queries in a greedy
fashion. Further details can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Diferential privacy. Diferential privacy ensures that it cannot be inferred, within
a fixed probability, whether or not a particular individual is included in a data set,
regardless of what other information the adversary might have. This is achieved by
withholding direct access to the data and only returning randomized outputs.</p>
      <p>Formally, if  and  ′ are two data sets that difer in exactly one row,  is a randomized
algorithm operating on these tables,  is any set of potential results  might return, and
ℛ ← ∅
 ← IntialPairs( init/2, ,   ,  )
for  0 ∈  do
ℓ ← 0
 0 ← ComputeQuality( ℓ,  qual/( || ))
while ℓ &lt;  and  ℓ &gt;  min do
ℓ ← ℓ + 1
 ℓ ← ExtendOne({ ℓ−1},  ext/(2 || ),   ,  )
 ℓ ← ComputeQuality( ℓ,  qual/( || ))</p>
      <p>
        Add to ℛ the last  ℓ with  ℓ &gt;  min, if any
return ℛ
 ∈ [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ] is the privacy parameter (or budget), then  is  -diferentially private (later
also simply  -private) if
      </p>
      <p>
        Pr[ ( ) ∈  ] ≤   Pr[ ( ′) ∈  ] ,
(2)
where the probability is over the randomness of  (see [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]).
      </p>
      <p>
        The core mechanism for ensuring diferential privacy with numerical outputs is the
Laplace mechanism, that perturbs numerical outputs with Laplace(0, Δ/ )-distributed
noise, where Δ is the sensitivity of  , i.e. the maximum change in the output of 
given two inputs that difer in exactly one row. For other than numerical outputs, the
exponential mechanism [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is used. The space of all possible outputs of  must be
endowed with a data-specific quality function  (︀  ( )︀) , and the answer is sampled from
this space with probability ∝ exp(︀  ·  (︀  ( )︀) . This ensures 2 Δ -privacy [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Further
developments in diferential privacy include concentrated diferential privacy [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which
trades better behaviour under composition to weaker definition of privacy. Our method
is trivially ( · (  − 1)/2,  )-concentrated diferentially private [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], but further analysis
of concentrated diferential privacy is beyond the scope of this paper.
      </p>
      <p>
        After the initial formulation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the introduction of the exponential mechanism
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], diferential privacy has seen significant interest in academia, and recent years have
brought uptake by the industry as well (e.g. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). Real-world use of diferential privacy
has indicated that the typical assumption that  &lt; 1 is often unachievable (e.g. the US
Census Bureau used  = 19.61 for releasing the 2020 US Census results [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]). Diferentially
private algorithms have been proposed for mining frequent itemsets [14, 15], association
rules [16, 17], as well as subgraphs [18] or sequences [19].
ℛ ← ∅
for all  ∈  do
      </p>
      <p>ℛ ← ℛ ∪ { (, ComputeQuality(,  qual/|| ))}
return ℛ</p>
    </sec>
    <sec id="sec-3">
      <title>3. The Algorithms</title>
      <p>
        In this section we present two algorithms for diferentially private redescription mining:
SerenadeCS and SerenadeES. Both algorithms are based on the greedy redescription
mining approach of ReReMi and assume that one side of the data has only Boolean
and categorical attributes, a common assumption in redescription mining (see, e.g. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]).
      </p>
      <p>Both algorithms consist of a client with no direct access to the data and a diferentially
private engine, with full access to the data, that the client queries. The two algorithms
difer mainly on their client-side implementations, i.e. how they use the diferentially
private engine’s methods. With SerenadeCS, more of the decisions are made on the
client side (CS), resulting in the ability to prune low-quality results at the cost of an
increased budget expenditure. SerenadeES provides a thriftier alternative by letting
the diferentially private engine make the decisions (engine side, ES).</p>
      <p>We will next explain how the clients work before explaining how the diferentially
private part has been implemented.</p>
      <sec id="sec-3-1">
        <title>3.1. Client side</title>
        <p>The pseudocode for the algorithms is given in Algorithms 1 and 2. Both algorithms
mine redescriptions through three main steps. The IntialPairs method (Algorithm 4)
returns redescriptions with a single literal in both queries. These pairs are then iteratively
extended into longer redescriptions. The ExtendOne method (Algorithm 3) takes a set
of redescriptions as input and returns a new redescription, obtained by extending one of
the input redescriptions by one literal. The ComputeQuality method computes the
noisy Jaccard and support size of a redescription. The two algorithms difer on when the
quality of redescriptions is computed and how to select the next redescription to extend.</p>
        <p>SerenadeCS closely follows the approach of ReReMi: it considers one initial pair at
a time (line 4 in Algorithm 1) for at most a fixed number of extensions (line 7). In each
iteration, SerenadeCS calls ExtendOne with the current candidate as input (line 9),
queries ComputeQuality to get the noisy Jaccard of the extension (line 10), and
 [(  ,   )] ← ( ′ ,  ′ ) with age 1
( ′ ,  ′ ) ← 
, (  ,   ) ←  .pop()
increment the age of every (
return ( ′ ,  ′ )</p>
        <p>[(  ,   )]; delete  [(  ,   )]
Algorithm 3 ExtendOne, extending a redescription
Output: Extended redescription ( ′ ,  ′ )</p>
        <sec id="sec-3-1-1">
          <title>Static variables: Associative array  , priority queue</title>
          <p>Private information: Data 
= (D , D</p>
          <p>), sensitivity Δ given  ,   , and 
Input: Data  , collection of redescriptions  , privacy budget  , min. support   , steepness 
select the extensions with the highest key key( (
 ,   ,   ,  ), / Δ )
 ,   ) ∈ 
and remove expired from  and 
decides whether to continue with the extension process. This ensures we only attempt to
extend redescriptions that are good enough, but it has two drawbacks: i) it needs to call
ComputeQuality for every extension, using plenty of budget; ii) it might attempt to
extend a redescription which has no good extensions, efectively wasting the associated
budget. The number of redescriptions returned by SerenadeCS is not predetermined,
though it is never larger than the number of initial pairs  .</p>
          <p>SerenadeES addresses these drawbacks of SerenadeCS. At each of a fixed number of
extensions, it sends all current candidate redescriptions that can be extended to
ExtendOne and receives one extension back (line 4 in Algorithm 2). The decision of which
redescription to extend is made in the diferentially private engine. Furthermore, the
Jaccard is only computed for the redescriptions that are returned (line 9). This approach
saves on the privacy budget compared to SerenadeCS. It also allows some redescriptions
to be extended multiple times while not wasting budget on redescriptions that cannot be
extended. On the flip side,</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>ExtendOne might return a bad redescription, and this will only become apparent when it is too late to replace the candidate. SerenadeES always returns as many redescriptions as there were initial pairs.</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Diferentially Private Engine</title>
        <p>Next, we present the three methods of the diferentially private engine. We start with a
general overview and description of the main ingredients of our algorithms.
Reservoir sampling.</p>
        <p>
          Greedy extension is a standard strategy for building
redescriptions [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Extending redescription (
        </p>
        <p>
          ,   ) consists in appending a literal to either of
the queries using a disjunction or conjunction. The extension with the highest accuracy
that satisfies constraints (e.g. minimum support threshold) is reported. For numerical
attributes, all relevant intervals are evaluated in turn. A pruning trick [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] allows to
identify relevant intervals, i.e. those that impact the accuracy, and avoid testing those
5:
6:
7:
8:
for all reservoirs   do
        </p>
        <p>Add redescription from   to ℛ
return ℛ
Algorithm 4 IntialPairs, generating initial candidates
Input: Privacy budget  , number of initial pairs  , minimum support threshold   , steepness 
Output: A set of initial redescriptions ℛ = {(  ,   )}
Private information: Data  = (D , D ), sensitivity Δ given  ,   and 
1: function IntialPairs(, ,   ,  )
2:
3:
4:
ℛ ← ∅
Set   as size-1 reservoir sampler for  = 1, . . . , 
for all literals  over the non-numerical attributes of D , numerical attributes  in D ,
and meaningful intervals [,  ] of  do</p>
        <p>Add initial pair (, [ ≤  ≤  ]) to all reservoir samplers   where key(︀  (, [ ≤  ≤
 ],   ,  ), / ( Δ ))︀ is greater than the current key
that do not (see Section 3.3).</p>
        <p>A diferentially private variant of such an extension strategy is best implemented with
the exponential mechanism. Instead of reporting the best extension, the algorithm will
randomly pick one of the candidates with probabilities proportional to their accuracies.
The set of candidates will also include the current, unextended, redescription associated
to a sampling probability reflecting its quality. That way, the structure of the queries
will not reveal anything sensitive about the data (see Section 3.3 about the case of
continuous-valued attributes).</p>
        <p>
          A naïve approach would be to store all candidates, but this would use too much
memory. Instead, we use weighted reservoir sampling [20, 21]. The idea is to keep a
reservoir of size  and evaluate candidate extensions in turn. For each one, we sample a
value  ∼ Unif [
          <xref ref-type="bibr" rid="ref1">0,1</xref>
          ]. If the probability of the candidate being selected by the exponential
mechanism is  , we compute its key as  1/ and store the candidate in the reservoir if
this key is larger than the current smallest key in the reservoir (see [20, 21] for proofs of
correctness). That is, for each candidate (  ,   ) we compute
key( (  ,   , Θ),  ) = log(Unif [
          <xref ref-type="bibr" rid="ref1">0,1</xref>
          ]) · exp(︀  ·  (  ,   , Θ))︀ ,
(3)
where  (  ,   , Θ) is a function measuring the quality of (  ,   ) given parameters in Θ,
and retain the candidates with highest keys.
        </p>
        <p>Extending a redescription. When called from SerenadeES, ExtendOne (Algorithm 3)
receives a collection of redescriptions  . In each call, the method samples from the
distribution of all extensions of all redescriptions in  , and one selected candidate replaces
the extended redescription. This proceeds for a fixed number of iterations, progressively
extending and replacing redescriptions in  . Between two successive calls to ExtendOne,
only one redescription changes.</p>
        <p>Our sampling approach also allows us to first draw a large sample and then take a
smaller sample from it, by simply retaining the entries associated with the largest keys.
As the collection  can be large, we use this strategy to save computational time.</p>
        <p>Specifically, ExtendOne stores for each redescription encountered so far the sampled
extension for that redescription, together with its key.</p>
        <sec id="sec-3-2-1">
          <title>When a new redescription is</title>
          <p>added to  , ExtendOne samples an extension for it and stores it. The redescription
whose sampled extension has the largest key is selected and the corresponding extended
redescription is returned.</p>
          <p>This memoization can lead to suboptimal choices if an extension of a redescription
gets a very low key value even though it is relatively good, and vice versa. To counteract
that, ExtendOne prunes stored extensions that are too old, i.e. extensions that have
not been selected in the past extension rounds. The maximum age of an extension is a
user-supplied parameter. When called from SerenadeCS, ExtendOne operates on a
single redescription and returns its candidate extension having the highest key, without
any memoization.</p>
          <p>Computing initial pairs and redescription quality. IntialPairs (Algorithm 4) builds
initial pairs by enumerating all pairs of attributes (, 
), where  is an attribute from D

and  is an attribute from D . The selection of the initial pairs from all pairs is based on
a reservoir sampling approach. Assuming w.l.o.g. that the non-numerical attributes are
in D , we consider as initial query   every literal consisting of a Boolean attribute or an
attribute–category pair. For   , we consider every literal (including all relevant intervals)
from D</p>
          <p>to match   , in the same way as when extending an existing redescription. The
sampling is done over the queries   . In order to generate  initial pairs, we populate
 diferent size-1 reservoirs, corresponding to  independent samples with replacement
from the exponential mechanism, ensuring privacy.</p>
          <p>ComputeQuality computes the size of the intersection and the size of the union of
the input queries’ supports and returns them after applying Laplace noise. The noisy
Jaccard of the redescription is computed as the noisy intersection size divided by the
noisy union size.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Handling Continuous Attributes</title>
        <p>Continuous attributes require extra care in both computational eficiency and privacy.
When appearing in a query, a continuous attribute  ∈ R will be bounded within an
interval, as [</p>
        <p>
          ≤  ≤  ] (where either  or  can be infinite). Finding these intervals
eficiently is the first problem. We follow the cut-point-based approach of
ReReMi [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>This approach is based on the fact that only some values of  will have an efect on the
support of the extended query when used as threshold. If an entity is not in supp(  ),
whether or not it belongs to supp([</p>
        <p>≤  ≤  ]) is indiferent w.r.t. the support of the
extended query. Hence we only need to consider as thresholds those values that determine
whether entities in supp(
−∞ and ∞, are called cut-points.</p>
        <p>This approach extends to disjunctions and negations. It also helps us with privacy.
Returning the exact value of cut-points can leak information about the data, as it reveals
 ) are included in supp([ ≤  ≤  ]). These values, along with
that  takes this exact value at least once.1 Let us assume that the lower cut-points
and upper cut-points are −∞ =  1 &lt;  2 &lt; · · · &lt;   and  1 &lt;  2 &lt; · · · &lt;   = ∞,
respectively, and that we have chosen to use interval [  ≤  ≤   ]. Instead of reporting
  , we sample the value  ′ we report as lower threshold from Unif (  −1,  ] and instead
of   , we sample a value  ′ from Unif [  ,  +1). Note that if  = 1 or  =  , we actually
have a half-interval, which we write [ ≤  ′ ] or [ ′ ≤  ], and no sampling is necessary for
that threshold. This sampling ensures that the accuracy of the redescription remains the
same while no information is leaked about the exact values in the data.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Lower-Sensitivity Quality Function</title>
        <p>A key element of the algorithm is the distribution of the budget between the steps.
Both IntialPairs and ExtendOne use the exponential mechanism, which provides
2 Δ -privacy for quality  of sensitivity Δ . For our application, the standard quality
function is the Jaccard coeficient, which has sensitivity 1, since it takes values between 0
and 1. However, for interesting redescriptions, the support size is well above 0 and, as
a result, adding or removing a single row will not change the Jaccard coeficient much.
This motivates us to modify the quality function to take into account the support size.
Specifically, we multiply the Jaccard coeficient with a logistic function centered at the
user-defined minimum support threshold   . The quality function  becomes
 (  ,   ,   ,  ) =  (  ,   )/(︀ 1 + exp(︀ − · (|supp(  ) ∩ supp(  )| −   ))︀ ,
(4)
where  is a parameter adjusting the steepness of the logistic curve.</p>
        <p>For large values of  the sensitivity of  will still be 1. For smaller values of  , however,
it is possible to significantly reduce the sensitivity of  . Given   ,  , and data set size  ,
the sensitivity of  can be calculated in  ( 2) time by considering all possible values for
numerator and denominator in (1), plugging them into (4), and identifying the largest
change that can result from the addition of a single row. Furthermore, the steepness can
be optimized with a simple line search over diferent sensitivity parameters. With the
data sets and parameters used in our experiments, the sensitivity was always less than
0.001, leaving more budget to be spent on initial pairs and extensions.</p>
        <p>Notice that knowing the sensitivity releases information about the data as it depends
on the data set size. This leaves a few options. The first is that the diferentially private
engine computes the optimal steepness and uses the corresponding sensitivity when
distributing the budget. This approach is completely private, but means that the user
has no knowledge of the actual noise levels used. The other extreme is to use a weaker
form of diferential privacy, where the user knows the size  of the data set. The user
can then find the optimal  minimizing the sensitivity of (4). As an intermediate option,
the user might query for  , use the noisy estimate to compute near-optimal steepness,
and give that as a parameter to the algorithm. This way, at the cost of some privacy
budget, the user retains some knowledge of the quality function and knows its steepness,
but no privacy is lost.
1The fact that categorical attributes take each category at least once is not considered private information.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Other Parameters</title>
        <p>The two key parameters of the algorithms are the number of initial pairs,  , and the
number of extensions,  or  . The higher  is set, the more redescriptions will be
returned, but each of them will receive a smaller portion of the privacy budget, both
when generating the initial pairs and when computing the privatized supports and
accuracies. On the other hand, the number of extensions ( or  ) dictates how complex
the redescriptions can become. SerenadeCS will return no more than  redescriptions,
and none of them will have more than  extensions (i.e. they have at most  + 2 literals
in total). SerenadeES will return exactly  redescriptions, but some of them might have
just two literals, or some of them might have  + 2 literals. Hence, if SerenadeCS is set
to perform  extensions, a comparable setting for SerenadeES is  =  ·  extensions.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Privacy</title>
        <p>The privacy of the proposed approach is mainly based on the exponential mechanism.
For this, we need to sample from every possible extension with the correct probability.
Next we show that this is the case. We then continue by analysing the use of the budget,
and conclude that the methods are indeed private.</p>
        <p>Lemma 1. Given a redescription  ℓ−1 = (  ,   ), the probability that ExtendOne in
line 9 of Algorithm 1 returns an extension  ℓ = ( ′ ,  ′ ), where either i)  ′ is an arbitrary
single-literal extension of   and   =  ′ , ii)  ′ is an arbitrary single-literal extension
of   and   =  ′ , or iii)   =  ′ and   =  ′ , is ∝ exp( ·  ( ′ ,  ′ ,   ,  )).
Proof. We can assume w.l.o.g. that   =  ′ . Assume first that D does not contain
continuous attributes. To select the extension of   , ExtendOne can be considered
to go through every extension, computing their quality according to  , and placing
the extension into a stream together with a weight parameter that is set to exp( ·  ).
Finally, the algorithm will also place the unextended   into the stream with weight
corresponding to the quality of (  ,   ). The weighted reservoir sampler will select one
item from this stream, so that the probability of an item being selected is proportional
to its weight [21, 20], thus sampling the extension with correct probability.</p>
        <p>For continuous attributes, the algorithm further perturbs the boundaries as explained
in Section 3.3. As this does not afect the quality, all extensions with the same quality
have equal probability to be selected, concluding the proof.</p>
        <p>Lemma 2. The extended redescription  = ( ′ ,  ′ ) returned by ExtendOne when called
in line 4 of Algorithm 2 is selected with probability proportional to exp( ·  ( ′ ,  ′ ,   ,  ))
from all possible at-most-single literal extension to all redescriptions in  .
Lemma 3. Given a pair of single-literal queries (  ,   ) over data sets D and D , the
probability that IntialPairs returns (  ,   ) is ∝  · exp( ·  (  ,   ,   ,  )).
Lemma 4. The total budget used by SerenadeCS and SerenadeES never exceeds  tot.</p>
        <p>
          The proof for Lemma 3 follows the same lines as for Lemma 1, except that the sampling
is repeated  times, and is omitted. The proofs for Lemma 2 and Lemma 4 are postponed
to the full version of the paper. 2
Theorem 5. SerenadeCS and SerenadeES are  tot-diferentially private.
Proof. Per Lemma 3, both algorithms generate initial pairs privately. Each initial pair
is an independent query to the data, spending altogether budget  init. Every extension
in SerenadeCS is an independent, diferentially private query (Lemma 1). Due to
the composability of diferential privacy [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], they do not reveal more information than
permitted by their allocated budget. The same holds for extensions in SerenadeES.
Finally, quality calculations in ComputeQuality use the standard Laplace mechanism
and are private, in line with their allocated budget. The total budgets do not exceed the
allocated  tot (Lemma 4).
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>To test our proposed methods, we conducted both quantitative and qualitative
experiments. The purpose of the quantitative experiments is to investigate the behavior of the
algorithms with diferent parameters and carry out an ablation study. The qualitative
experiments are meant to assess the usefulness of our approach for real-world data
analysis. For this reason, we did not do any parameter tuning as that would consume
some of the privacy budget.</p>
      <p>All algorithms are implemented in Python and the experiments are run with Python
3.6.8 on a machine with 2 AMD EPYC 7702 processors with 64 cores each and 1 TB of
main memory.</p>
      <sec id="sec-4-1">
        <title>4.1. Data and Other Methods</title>
        <p>
          The Mammals data set contains information about which mammal species inhabit which
areas of Europe on one side [22], and climate information on the other side [23]. This data
set has often been used in redescription mining literature (e.g. [
          <xref ref-type="bibr" rid="ref4 ref5 ref8">8, 24, 4, 25, 5, 26</xref>
          ]). The
data contains 194 species and 48 climate attributes as its columns, with 2575 localities
as its rows.
        </p>
        <p>The MIMIC data set [27]3 is an example of health data where patient privacy is critical.
It contains de-identified health information about hospital patients. Our dataset contains
two views: the diagnoses view and the lab events view. The aim when mining this data
set is to find out what kind of lab events are associated with various diagnoses of the
examined patients. The data set contains 107 Boolean attributes in the diagnoses view
(1 denotes a positive diagnosis) and 104 ternary attributes in the lab events view, with
46 065 patients as its rows.
2The source code and the appendix containing all details that are postponed to the full version of the
paper are presented in https://github.com/maijuka/Serenade
3Data set available at https://physionet.org/content/mimiciii/1.4/.</p>
        <p>The VAA data set [28] contains the background information and answers to questions
regarding political opinions of candidates to the Finnish parliamentary elections of 2011.
The data was collected by an online voting advice application (VAA). We removed
candidates who did not provide answers to all questions, leaving 1656 candidates, 9
background attributes and 107 opinion attributes.</p>
        <p>The Open Sex-Role Inventory [29] (SR data set) is used to study gender roles by
measuring masculinity and femininity. The Right-Wing Authoritarianism Scale [30]
(RightW data set) was developed to measure the psychology of the followers of facist
regimes.4 Columns recording to the time spent answering the questions were removed
from the RightW and the SR data sets. These data sets have the same variables on both
sides; IntialPairs and ExtendOne are implemented to ensure the same attribute
cannot be used in both queries of the same redescription.</p>
        <p>We also conducted experiments with other datasets. The results were not significant
and are postponed to the full version of the paper.</p>
        <p>
          The main studied algorithms are SerenadeCS and SerenadeES. In addition, for
ablation studies, we used a variation of SerenadeES called NoSens. It works like
SerenadeES, but uses the standard Jaccard (1) instead of the lower-sensitivity version (4)
for computing keys (3). We also did a comparison to ReReMi, the baseline non-private
redescription mining algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>For all experiments, we set  min = 0.3 for SerenadeCS and   to 10 % of data set
size  .</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Quantitative Experiments</title>
        <p>In this empirical evaluation, we investigate the efects of parameters and carry out an
ablation study using NoSens.</p>
        <p>
          Privacy budget. A typical value for the privacy budget is  tot = 1, though much higher
total budgets are sometimes used (e.g. [
          <xref ref-type="bibr" rid="ref13">31, 32, 13</xref>
          ]). Hence we first studied the efect
of varying the total budget  tot along with diferent ways to distribute it between the
diferent tasks. Total budgets of 0.5, 1, 10, and 100 were used. Of these,  tot = 0.5 and
 tot = 1 are typical values found in the literature,  tot = 10 is closer to values sometimes
used in practice [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], and  tot = 100 can be seen as a baseline of how the algorithms
perform when privacy requirements are lifted.
        </p>
        <p>Another parameter to consider is the distribution of the total budget between  init,  ext,
and  qual. We used five diferent ways of distributing  tot between  init,  ext, and  qual,
respectively in the following proportions:  1 = (0.33, 0.33, 0.33),  2 = (0.45, 0.45, 0.1),
 3 = (0.6, 0.3, 0.1),  4 = (0.3, 0.6, 0.1), and  5 = (0.25, 0.25, 0.5). The maximum age of
an extension was set to 40,  = 20,  = 4, and  = 80. Each experiment was repeated
ifve times.</p>
        <p>Results for these experiments on the Mammals data set are presented in Fig. 1(a–b)
where we see that both SerenadeCS and SerenadeES are capable of returning very
4Data sets available at https://openpsychometrics.org/_rawdata/.
0.8
0.6
0.4
0.2
d
1.0
0.8
0.6
0.4
0.2
good results when allocated suficiently high budgets. On the other hand, NoSens
struggles even when allocated very high budgets. Given its ineficient use of the budget,
this is not surprising. Results for similar experiments on three other data sets are
presented in Fig. 1(c–e) where we see that SerenadeCS returns clearly better results
with almost all parameters for these datasets. We can also see that VAA (Fig 1(e)) yielded
poor results regardless of budget and algorithm. This is explained by the fact that the
non diferentially private redescription mining algorithm ReReMi5 also fails to mine good
quality redescriptions from this data set. These results are postponed to the full version
of the paper. From Fig. 1(f) we see that we get slightly better initial pairs with a higher
budget, which could be the reason why the distributions  3 and  2, having the highest
budgets for initial pairs, perform slightly better than the other budget distributions.</p>
        <p>SerenadeES typically has higher variance on the accuracy of the results, giving
accuracies both higher and lower than SerenadeCS. This is in line with the expected
behavior. Of the diferent budget distributions,  3 seems to be consistently giving the
best results, although the diferences are not significant.</p>
        <p>
          An analyst is typically not interested in results with very low reported accuracy (noisy
Jaccard). Figure 2 repeats Fig. 1(a–b) for SerenadeES and SerenadeCS after removing
redescriptions with noisy Jaccard below 0.3. The average and median for SerenadeES
5Source code available at https://pypi.org/project/python-siren/ [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
0.6
d
r
a
c
c0.4
a
J
e
u
r0.2
        </p>
        <p>T
b
1.0
0.8
0.6
0.4
0.2
0.8
0.6
0.4
0.2
improve, and especially the lower end of the tail rises, as expected. We can also see
that this pruning leaves some redescriptions with true accuracy below 0.3, but does
not actually discard many of those reported to be good. The discrepancy between the
reported and true accuracy is an important feature: if the system reports a redescription
as having a high accuracy, the user should be able to trust that it truly is so.
Number of initial pairs and reliability of noisy Jaccards. Next we investigate the efects
of the number of initial pairs. For these tests, the maximum age was always set to be
twice the number of initial pairs, and the number of extension rounds was four times
the number of initial pairs. Budget distribution  1 was used throughout. Recall that
more initial pairs means the budget is divided between more redescriptions. This shows
especially strongly in the diferences between the true and reported accuracies (Figure 3).</p>
        <p>Figure 3 also shows that SerenadeCS returns fewer redescriptions than SerenadeES
and that it occasionally reports overly optimistic Jaccard values even with  = 20.
This is due to its mediocre use of the budget, leaving just slivers for relevant quality
computations. Note that reported accuracies for SerenadeCS are cut above  min = 0.3
through filtering. With 80 initial pairs, reported Jaccards become less correlated with
true Jaccards for both methods, as budget per redescription dwindles.</p>
        <p>Maximum age. The final quantitative study probes the efects of the maximum age
memoization in ExtendOne in SerenadeES. We investigated this by setting  = 80
and trying maximum ages 0 (computing extensions anew in every iteration), 5, 10, 20,
40, 60, and 81 (stored extensions never expire). We found that the memoization has a
clear beneficial efect on the running time. Using maximum age 20 brings the running
time down to about 25 min, from over 70 min without memoization (maximum age 0).
Most redescriptions are selected for extension within about 20 steps, so the running time
does not improve noticeably beyond that. On the other hand, the maximum age did not
seem to have a significant efect on the quality.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Qualitative Experiments</title>
        <p>Next, we present a few results mined from diferent datasets, as anecdotal examples of
results one can expect by our algorithms with these kinds of data sets. We did five runs
of the algorithms for the MIMIC, VAA, RightW, SR, SSC, Psycho, and Pref data sets with
total budget 1 and budget distribution  3. The results were chosen over five runs for both
algorithms, so the total budget per data set is 10. The results with VAA, SSC, Psycho and
Pref datasets, and the raw redescriptions are postponed to the full version of the paper.</p>
        <p>An example result (  ,   ) from MIMIC with SerenadeCS is</p>
        <p>( Atrial Fibrillation , Uric Acid = 1 ∨ Creatine Kinase = 0 ) ,
meaning that the set of patients who had atrial fibrillation (a type of arrhythmia, abnormal
heart rhythm) corresponds to the set of patients whose uric acid (waste product in blood)
test results were abnormal or whose creatine kinase (enzyme related to energy production)
test results returned normal values. The reported accuracy of this redescription is 0.424
and its true accuracy is 0.227. Such results can be used to understand the procedures
at this hospital, and as they are privatized, the results can also be communicated more
widely without putting the privacy of patients in jeopardy.</p>
        <p>A result with SerenadeES from the RightW data set tells that the set of participants
who moderately or strongly agreed that “You have to admire those who challenged the
law and the majority’s view by protesting for women’s abortion rights, for animal rights,
or to abolish school prayer” corresponds to the set of participants who did not vote. The
reported accuracy of this redescription is 0.428 and its true accuracy is 0.538.</p>
        <p>A result from the SR data set with SerenadeCS the set of participants who agreed or
slightly agreed that they are the happiest when they are in their bed corresponds to the
set of participants who agreed or slightly agreed that they give people handmade gifts
and are between 14 and 32 years of age. The reported accuracy of this redescription is
0.651 and its true accuracy is 0.725.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>It is the nature of diferential privacy and local pattern mining that the user can never be
entirely certain whether a pattern reported by the algorithm actually represents a strong
pattern in the data. But careful selection of the parameters allows to avoid most of the
pitfalls, and when a privacy budget higher than 1 is available, the problem essentially
vanishes. Even with a privacy budget of at most 1, the results can be very useful. Of the
two approaches presented, SerenadeCS gives results with higher (true and reported)
Jaccard, but with lower correlation between the two, meaning that a good-looking result
from SerenadeCS might turn out not to be so good in reality. SerenadeES, on the
other hand, returns more truthful Jaccards, thanks to its better use of budget, but its
extension method might lead to generally somewhat inferior results.</p>
      <p>Overall, the reported strong patterns are reliably present in the data. This means that
a user who obtains interesting strong patterns via these diferentially private approaches
can seek further means to confirm them, for example by applying for a research permit for
the data set or by conducting a separate confirmatory study. The techniques presented
in this paper might also be used to implement other diferentially private pattern mining
algorithms in the future.
[14] C. Zeng, J. F. Naughton, J.-Y. Cai, On diferentially private frequent itemset mining,</p>
      <p>Proc. VLDB Endowment 6 (2012) 25–36.
[15] J. Lee, C. W. Clifton, Top-k frequent itemsets via diferentially private FP-trees, in:</p>
      <p>KDD ’14, 2014, pp. 931–940.
[16] Y.-T. Tsou, H. Zhen, X. Jiang, Y. Huang, S.-Y. Kuo, DPARM: Diferentially private
association rules mining, IEEE Access 8 (2020) 142131–142147.
[17] M. Maruseac, G. Ghinita, Precision-enhanced diferentially-private mining of
highconfidence association rules, IEEE Trans. Dependable Secure Comput. 17 (2018)
1297–1309.
[18] S. Xu, S. Su, L. Xiong, X. Cheng, K. Xiao, Diferentially private frequent subgraph
mining, in: ICDE’16, 2016, pp. 229–240.
[19] L. Bonomi, L. Xiong, A two-phase algorithm for mining sequential patterns with
diferential privacy, in: CIKM’13, 2013, pp. 269–278.
[20] P. S. Efraimidis, P. G. Spirakis, Weighted random sampling with reservoir, Inf. Proc.</p>
      <p>Lett. 97 (2006) 181–185.
[21] M. Kolonko, D. Wäsch, Sequential reservoir sampling with a nonuniform distribution,</p>
      <p>ACM Trans. Math. Soft. 32 (2006) 257–273.
[22] A. J. Mitchell-Jones, G. Amori, W. Bogdanowicz, B. Krystufek, P. J. H. Reijnders,
F. Spitzenberger, M. Stubbe, J. B. M. Thissen, V. Vohralik, J. Zima, The Atlas of
European Mammals, Academic Press, 1999.
[23] R. J. Hijmans, S. E. Cameron, L. J. Parra, P. G. Jones, A. Jarvis, Very High
Resolution Interpolated Climate Surfaces for Global Land Areas, Int. J. Climatol.
25 (2005) 1965–1978. URL: www.worldclim.org.
[24] E. Galbrun, P. Miettinen, Siren: An interactive tool for mining and visualizing
geospatial redescriptions [demo], in: KDD’12, 2012, pp. 1544–1547.
[25] J. Kalofolias, E. Galbrun, P. Miettinen, From sets of good redescriptions to good
sets of redescriptions, in: ICDM’16, 2016, pp. 211–220.
[26] E. Galbrun, P. Miettinen, Mining Redescriptions with Siren, ACM Trans. Knowl.</p>
      <p>Discov. Data 12 (2018).
[27] A. E. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi,
B. Moody, P. Szolovits, L. A. Celi, R. G. Mark, MIMIC-III, a freely accessible
critical care database, Scientific data 3 (2016) 160035.
[28] Helsingin Sanomat, Parliamentary elections 2011: Candidate responses to helsingin
sanomat candidate selector, 2016. URL: http://urn.fi/urn:nbn:fi:fsd:T-FSD2701,
version 2.1.
[29] T. E. Malloy, Bem sex role inventory, Wiley Online Library, 2010.
[30] B. Altemeyer, Right-wing authoritarianism, Univ. of Manitoba Press, 1983.
[31] R. Balu, T. Furon, Diferentially private matrix factorization using sketching
techniques, in: IH&amp;MMSec’16, 2016, pp. 57–62.
[32] S. Fletcher, M. Z. Islam, Decision tree classification with diferential privacy: A
survey, ACM Comput. Surv. 52 (2019) 83:1–83:33.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>McSherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Calibrating noise to sensitivity in private data analysis</article-title>
          ,
          <source>in: TCC'06</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>265</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Galbrun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Miettinen</surname>
          </string-name>
          , Redescription Mining, Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Helm</surname>
          </string-name>
          , Turning CARTwheels:
          <article-title>An alternating algorithm for mining redescriptions</article-title>
          ,
          <source>in: KDD'04</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>266</fpage>
          -
          <lpage>275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zinchenko</surname>
          </string-name>
          , E. Galbrun,
          <string-name>
            <given-names>P.</given-names>
            <surname>Miettinen</surname>
          </string-name>
          ,
          <article-title>Mining Predictive Redescriptions with Trees</article-title>
          ,
          <source>in: ICDM'15 Workshops</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1672</fpage>
          -
          <lpage>1675</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mihelčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Džeroski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lavrač</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Šmuc</surname>
          </string-name>
          ,
          <article-title>A framework for redescription set construction</article-title>
          ,
          <source>Expert Syst. Appl</source>
          .
          <volume>68</volume>
          (
          <year>2017</year>
          )
          <fpage>196</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Miettinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          ,
          <article-title>Finding subgroups having several descriptions: Algorithms for redescription mining</article-title>
          ,
          <source>in: SDM'08</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>334</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F. I.</given-names>
            <surname>Stamm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strohmaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lemmerich</surname>
          </string-name>
          ,
          <article-title>Redescription model mining</article-title>
          ,
          <source>in: KDD'21</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1521</fpage>
          -
          <lpage>1529</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Galbrun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Miettinen</surname>
          </string-name>
          ,
          <article-title>From black and white to full color: Extending redescription mining outside the Boolean world</article-title>
          ,
          <source>Stat. Anal. Data Min</source>
          .
          <volume>5</volume>
          (
          <year>2012</year>
          )
          <fpage>284</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>The algorithmic foundations of diferential privacy</article-title>
          ,
          <source>Found. Trends Theor. Comput. Sci. 9</source>
          (
          <year>2014</year>
          )
          <fpage>211</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>McSherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talwar</surname>
          </string-name>
          ,
          <article-title>Mechanism design via diferential privacy</article-title>
          ,
          <source>in: FOCS'07</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>94</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. N.</given-names>
            <surname>Rothblum</surname>
          </string-name>
          , Concentrated diferential privacy,
          <source>arXiv preprint arXiv:1603</source>
          .
          <year>01887</year>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yekhanin</surname>
          </string-name>
          ,
          <article-title>Collecting Telemetry Data Privately</article-title>
          , in: NIPS'
          <fpage>17</fpage>
          ,
          <year>2017</year>
          , pp.
          <fpage>3571</fpage>
          -
          <lpage>3580</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>US</given-names>
            <surname>Census Bureau</surname>
          </string-name>
          ,
          <article-title>Census bureau sets key parameters to protect privacy in 2020 census results</article-title>
          ,
          <year>2021</year>
          . URL: https://www.census.gov/newsroom/press-releases/
          <year>2021</year>
          / 2020-census
          <article-title>-key-parameters</article-title>
          .html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>