1. Introduction

Serenade: An Approach for Diferentially Private Greedy Redescription Mining

Maiju Karjalainen

Esther Galbrun

Pauli Miettinen

0 0 University of Eastern Finland

We present an approach for mining diferentially private redescriptions. Applying diferential privacy to pattern mining approaches such as redescription mining is less straightforward than with numerical methods, in part because local patterns are more susceptible to the kind of noise inherent to diferential privacy, and in part because the typically combinatorial algorithms require many operations and become computationally unaofrdable when overloaded with privacy preserving procedures. Our solution, and the two algorithms we provide, address both the computational bottlenecks as well as the issues with noise. They might also provide inspiration for other diferentially private pattern mining algorithms. Extensive experiments on real-world data validate the usefulness of our algorithms.

1. Introduction

As awareness about the rights of data subjects rises, so does the demand for privacypreserving data mining methods. This trend, combined with concerns about black-box models being used in many domains, has lead to a need for methods that couple interpretability and privacy. Here we present algorithms answering this need, bringing diferential privacy to redescription mining.

Diferential privacy [ 1 ] is the most promising approach for privacy preserving data analysis. It allows to provide guarantees on the level of privacy ofered, even in the presence of side information. Diferentially private results are also secure against postprocessing: no amount of post-processing will reveal any further private information from the results of diferentially-private algorithms.

It is widely recognized that pattern and rule-based approaches, such as redescription mining [ 2 ], are highly interpretable data analysis approaches. However, integrating diferential privacy with such approaches is not simple, as they are based on local properties of the data, which are more prone to privacy leakage than global properties. Moreover, the combinatorial nature of these approaches makes them typically more susceptible to noise than continuous optimization methods.

We tackle these obstacles and present two algorithms based on existing approaches for diferential privacy and redescription mining: a greedy algorithm for redescription mining, and the exponential and Laplace mechanisms for diferential privacy. Their successful combination requires developing novel methods. In particular, we present a general technique to lower the sensitivity of the quality measure to magnitudes well below 1, allowing much more accurate results to be drawn from the exponential mechanism without compromising the privacy. In addition, we use weighted reservoir sampling for memoization, yielding an eficient implementation of the exponential mechanism.

2. Background and Related Work

Redescription mining. Redescription mining is a data analysis task that aims at finding two diferent ways to describe almost the same observations. In our setting, the data set consists of a pair of tables, respectively denoted D and D . Both tables are over the same entities (rows) but disjoint collections of attributes (columns).

Such a pair of data tables, along with user-defined parameters, constitute the input of redescription mining. A redescription consists of a pair of queries denoted by and , one over either data table. A query consists of literals over Boolean, categorical or numerical attributes, combined with logical conjunction and disjunction operators, possibly involving negations. For instance, the query

= (¬ Hypertension ∧ [State = CA]) ∨ [BirthY ≤ 1950] describes entities, here individuals, who either do not sufer from hypertension and reside in California, or are born before 1950. The collection of entities described by the queries is called the support of the query, denoted supp( ).

The main quality of a redescription is that the queries have similar supports. Referred to as its accuracy, it is measured using the Jaccard coeficient: ( , ) = |supp( ) ∩ supp( )| / |supp( ) ∪ supp( )| . (1) In addition to having a high accuracy, to be of interest a redescription should cover enough entities. This can be ensured by setting a minimum support threshold.

Since the introduction of redescription mining [ 3 ], various algorithms have been proposed for the task, including algorithms based on decision trees [ 3, 4, 5 ], on mining and combining frequent itemsets [ 6 ], or considering regression and correlation models [ 7 ]. We build on the ReReMi algorithm [ 8 ], that iteratively grows the queries in a greedy fashion. Further details can be found in [ 2 ].

Diferential privacy. Diferential privacy ensures that it cannot be inferred, within a fixed probability, whether or not a particular individual is included in a data set, regardless of what other information the adversary might have. This is achieved by withholding direct access to the data and only returning randomized outputs.

Formally, if and ′ are two data sets that difer in exactly one row, is a randomized algorithm operating on these tables, is any set of potential results might return, and ℛ ← ∅ ← IntialPairs( init/2, , , ) for 0 ∈ do ℓ ← 0 0 ← ComputeQuality( ℓ, qual/( || )) while ℓ < and ℓ > min do ℓ ← ℓ + 1 ℓ ← ExtendOne({ ℓ−1}, ext/(2 || ), , ) ℓ ← ComputeQuality( ℓ, qual/( || ))

Add to ℛ the last ℓ with ℓ > min, if any return ℛ ∈ [ 0, 1 ] is the privacy parameter (or budget), then is -diferentially private (later also simply -private) if

Pr[ ( ) ∈ ] ≤ Pr[ ( ′) ∈ ] , (2) where the probability is over the randomness of (see [ 9 ]).

The core mechanism for ensuring diferential privacy with numerical outputs is the Laplace mechanism, that perturbs numerical outputs with Laplace(0, Δ/ )-distributed noise, where Δ is the sensitivity of , i.e. the maximum change in the output of given two inputs that difer in exactly one row. For other than numerical outputs, the exponential mechanism [ 10 ] is used. The space of all possible outputs of must be endowed with a data-specific quality function (︀ ( )︀) , and the answer is sampled from this space with probability ∝ exp(︀ · (︀ ( )︀) . This ensures 2 Δ -privacy [ 10 ]. Further developments in diferential privacy include concentrated diferential privacy [ 11 ], which trades better behaviour under composition to weaker definition of privacy. Our method is trivially ( · ( − 1)/2, )-concentrated diferentially private [ 11 ], but further analysis of concentrated diferential privacy is beyond the scope of this paper.

After the initial formulation [ 1 ] and the introduction of the exponential mechanism [ 10 ], diferential privacy has seen significant interest in academia, and recent years have brought uptake by the industry as well (e.g. [ 12 ]). Real-world use of diferential privacy has indicated that the typical assumption that < 1 is often unachievable (e.g. the US Census Bureau used = 19.61 for releasing the 2020 US Census results [ 13 ]). Diferentially private algorithms have been proposed for mining frequent itemsets [14, 15], association rules [16, 17], as well as subgraphs [18] or sequences [19]. ℛ ← ∅ for all ∈ do

ℛ ← ℛ ∪ { (, ComputeQuality(, qual/|| ))} return ℛ

3. The Algorithms

In this section we present two algorithms for diferentially private redescription mining: SerenadeCS and SerenadeES. Both algorithms are based on the greedy redescription mining approach of ReReMi and assume that one side of the data has only Boolean and categorical attributes, a common assumption in redescription mining (see, e.g. [ 8 ]).

Both algorithms consist of a client with no direct access to the data and a diferentially private engine, with full access to the data, that the client queries. The two algorithms difer mainly on their client-side implementations, i.e. how they use the diferentially private engine’s methods. With SerenadeCS, more of the decisions are made on the client side (CS), resulting in the ability to prune low-quality results at the cost of an increased budget expenditure. SerenadeES provides a thriftier alternative by letting the diferentially private engine make the decisions (engine side, ES).

We will next explain how the clients work before explaining how the diferentially private part has been implemented.

3.1. Client side

The pseudocode for the algorithms is given in Algorithms 1 and 2. Both algorithms mine redescriptions through three main steps. The IntialPairs method (Algorithm 4) returns redescriptions with a single literal in both queries. These pairs are then iteratively extended into longer redescriptions. The ExtendOne method (Algorithm 3) takes a set of redescriptions as input and returns a new redescription, obtained by extending one of the input redescriptions by one literal. The ComputeQuality method computes the noisy Jaccard and support size of a redescription. The two algorithms difer on when the quality of redescriptions is computed and how to select the next redescription to extend.

SerenadeCS closely follows the approach of ReReMi: it considers one initial pair at a time (line 4 in Algorithm 1) for at most a fixed number of extensions (line 7). In each iteration, SerenadeCS calls ExtendOne with the current candidate as input (line 9), queries ComputeQuality to get the noisy Jaccard of the extension (line 10), and [( , )] ← ( ′ , ′ ) with age 1 ( ′ , ′ ) ← , ( , ) ← .pop() increment the age of every ( return ( ′ , ′ )

[( , )]; delete [( , )] Algorithm 3 ExtendOne, extending a redescription Output: Extended redescription ( ′ , ′ )

Static variables: Associative array , priority queue

Private information: Data = (D , D

), sensitivity Δ given , , and Input: Data , collection of redescriptions , privacy budget , min. support , steepness select the extensions with the highest key key( ( , , , ), / Δ ) , ) ∈ and remove expired from and decides whether to continue with the extension process. This ensures we only attempt to extend redescriptions that are good enough, but it has two drawbacks: i) it needs to call ComputeQuality for every extension, using plenty of budget; ii) it might attempt to extend a redescription which has no good extensions, efectively wasting the associated budget. The number of redescriptions returned by SerenadeCS is not predetermined, though it is never larger than the number of initial pairs .

SerenadeES addresses these drawbacks of SerenadeCS. At each of a fixed number of extensions, it sends all current candidate redescriptions that can be extended to ExtendOne and receives one extension back (line 4 in Algorithm 2). The decision of which redescription to extend is made in the diferentially private engine. Furthermore, the Jaccard is only computed for the redescriptions that are returned (line 9). This approach saves on the privacy budget compared to SerenadeCS. It also allows some redescriptions to be extended multiple times while not wasting budget on redescriptions that cannot be extended. On the flip side,

ExtendOne might return a bad redescription, and this will only become apparent when it is too late to replace the candidate. SerenadeES always returns as many redescriptions as there were initial pairs. 3.2. Diferentially Private Engine

Next, we present the three methods of the diferentially private engine. We start with a general overview and description of the main ingredients of our algorithms. Reservoir sampling.

Greedy extension is a standard strategy for building redescriptions [ 2 ]. Extending redescription (

, ) consists in appending a literal to either of the queries using a disjunction or conjunction. The extension with the highest accuracy that satisfies constraints (e.g. minimum support threshold) is reported. For numerical attributes, all relevant intervals are evaluated in turn. A pruning trick [ 8 ] allows to identify relevant intervals, i.e. those that impact the accuracy, and avoid testing those 5: 6: 7: 8: for all reservoirs do

Add redescription from to ℛ return ℛ Algorithm 4 IntialPairs, generating initial candidates Input: Privacy budget , number of initial pairs , minimum support threshold , steepness Output: A set of initial redescriptions ℛ = {( , )} Private information: Data = (D , D ), sensitivity Δ given , and 1: function IntialPairs(, , , ) 2: 3: 4: ℛ ← ∅ Set as size-1 reservoir sampler for = 1, . . . , for all literals over the non-numerical attributes of D , numerical attributes in D , and meaningful intervals [, ] of do

Add initial pair (, [ ≤ ≤ ]) to all reservoir samplers where key(︀ (, [ ≤ ≤ ], , ), / ( Δ ))︀ is greater than the current key that do not (see Section 3.3).

A diferentially private variant of such an extension strategy is best implemented with the exponential mechanism. Instead of reporting the best extension, the algorithm will randomly pick one of the candidates with probabilities proportional to their accuracies. The set of candidates will also include the current, unextended, redescription associated to a sampling probability reflecting its quality. That way, the structure of the queries will not reveal anything sensitive about the data (see Section 3.3 about the case of continuous-valued attributes).

A naïve approach would be to store all candidates, but this would use too much memory. Instead, we use weighted reservoir sampling [20, 21]. The idea is to keep a reservoir of size and evaluate candidate extensions in turn. For each one, we sample a value ∼ Unif [ 0,1 ]. If the probability of the candidate being selected by the exponential mechanism is , we compute its key as 1/ and store the candidate in the reservoir if this key is larger than the current smallest key in the reservoir (see [20, 21] for proofs of correctness). That is, for each candidate ( , ) we compute key( ( , , Θ), ) = log(Unif [ 0,1 ]) · exp(︀ · ( , , Θ))︀ , (3) where ( , , Θ) is a function measuring the quality of ( , ) given parameters in Θ, and retain the candidates with highest keys.

Extending a redescription. When called from SerenadeES, ExtendOne (Algorithm 3) receives a collection of redescriptions . In each call, the method samples from the distribution of all extensions of all redescriptions in , and one selected candidate replaces the extended redescription. This proceeds for a fixed number of iterations, progressively extending and replacing redescriptions in . Between two successive calls to ExtendOne, only one redescription changes.

Our sampling approach also allows us to first draw a large sample and then take a smaller sample from it, by simply retaining the entries associated with the largest keys. As the collection can be large, we use this strategy to save computational time.

Specifically, ExtendOne stores for each redescription encountered so far the sampled extension for that redescription, together with its key.

When a new redescription is

added to , ExtendOne samples an extension for it and stores it. The redescription whose sampled extension has the largest key is selected and the corresponding extended redescription is returned.

This memoization can lead to suboptimal choices if an extension of a redescription gets a very low key value even though it is relatively good, and vice versa. To counteract that, ExtendOne prunes stored extensions that are too old, i.e. extensions that have not been selected in the past extension rounds. The maximum age of an extension is a user-supplied parameter. When called from SerenadeCS, ExtendOne operates on a single redescription and returns its candidate extension having the highest key, without any memoization.

Computing initial pairs and redescription quality. IntialPairs (Algorithm 4) builds initial pairs by enumerating all pairs of attributes (, ), where is an attribute from D and is an attribute from D . The selection of the initial pairs from all pairs is based on a reservoir sampling approach. Assuming w.l.o.g. that the non-numerical attributes are in D , we consider as initial query every literal consisting of a Boolean attribute or an attribute–category pair. For , we consider every literal (including all relevant intervals) from D

to match , in the same way as when extending an existing redescription. The sampling is done over the queries . In order to generate initial pairs, we populate diferent size-1 reservoirs, corresponding to independent samples with replacement from the exponential mechanism, ensuring privacy.

ComputeQuality computes the size of the intersection and the size of the union of the input queries’ supports and returns them after applying Laplace noise. The noisy Jaccard of the redescription is computed as the noisy intersection size divided by the noisy union size.

3.3. Handling Continuous Attributes

Continuous attributes require extra care in both computational eficiency and privacy. When appearing in a query, a continuous attribute ∈ R will be bounded within an interval, as [

≤ ≤ ] (where either or can be infinite). Finding these intervals eficiently is the first problem. We follow the cut-point-based approach of ReReMi [ 8 ].

This approach is based on the fact that only some values of will have an efect on the support of the extended query when used as threshold. If an entity is not in supp( ), whether or not it belongs to supp([

≤ ≤ ]) is indiferent w.r.t. the support of the extended query. Hence we only need to consider as thresholds those values that determine whether entities in supp( −∞ and ∞, are called cut-points.

This approach extends to disjunctions and negations. It also helps us with privacy. Returning the exact value of cut-points can leak information about the data, as it reveals ) are included in supp([ ≤ ≤ ]). These values, along with that takes this exact value at least once.1 Let us assume that the lower cut-points and upper cut-points are −∞ = 1 < 2 < · · · < and 1 < 2 < · · · < = ∞, respectively, and that we have chosen to use interval [ ≤ ≤ ]. Instead of reporting , we sample the value ′ we report as lower threshold from Unif ( −1, ] and instead of , we sample a value ′ from Unif [ , +1). Note that if = 1 or = , we actually have a half-interval, which we write [ ≤ ′ ] or [ ′ ≤ ], and no sampling is necessary for that threshold. This sampling ensures that the accuracy of the redescription remains the same while no information is leaked about the exact values in the data.

3.4. Lower-Sensitivity Quality Function

A key element of the algorithm is the distribution of the budget between the steps. Both IntialPairs and ExtendOne use the exponential mechanism, which provides 2 Δ -privacy for quality of sensitivity Δ . For our application, the standard quality function is the Jaccard coeficient, which has sensitivity 1, since it takes values between 0 and 1. However, for interesting redescriptions, the support size is well above 0 and, as a result, adding or removing a single row will not change the Jaccard coeficient much. This motivates us to modify the quality function to take into account the support size. Specifically, we multiply the Jaccard coeficient with a logistic function centered at the user-defined minimum support threshold . The quality function becomes ( , , , ) = ( , )/(︀ 1 + exp(︀ − · (|supp( ) ∩ supp( )| − ))︀ , (4) where is a parameter adjusting the steepness of the logistic curve.

For large values of the sensitivity of will still be 1. For smaller values of , however, it is possible to significantly reduce the sensitivity of . Given , , and data set size , the sensitivity of can be calculated in ( 2) time by considering all possible values for numerator and denominator in (1), plugging them into (4), and identifying the largest change that can result from the addition of a single row. Furthermore, the steepness can be optimized with a simple line search over diferent sensitivity parameters. With the data sets and parameters used in our experiments, the sensitivity was always less than 0.001, leaving more budget to be spent on initial pairs and extensions.

Notice that knowing the sensitivity releases information about the data as it depends on the data set size. This leaves a few options. The first is that the diferentially private engine computes the optimal steepness and uses the corresponding sensitivity when distributing the budget. This approach is completely private, but means that the user has no knowledge of the actual noise levels used. The other extreme is to use a weaker form of diferential privacy, where the user knows the size of the data set. The user can then find the optimal minimizing the sensitivity of (4). As an intermediate option, the user might query for , use the noisy estimate to compute near-optimal steepness, and give that as a parameter to the algorithm. This way, at the cost of some privacy budget, the user retains some knowledge of the quality function and knows its steepness, but no privacy is lost. 1The fact that categorical attributes take each category at least once is not considered private information.

3.5. Other Parameters

The two key parameters of the algorithms are the number of initial pairs, , and the number of extensions, or . The higher is set, the more redescriptions will be returned, but each of them will receive a smaller portion of the privacy budget, both when generating the initial pairs and when computing the privatized supports and accuracies. On the other hand, the number of extensions ( or ) dictates how complex the redescriptions can become. SerenadeCS will return no more than redescriptions, and none of them will have more than extensions (i.e. they have at most + 2 literals in total). SerenadeES will return exactly redescriptions, but some of them might have just two literals, or some of them might have + 2 literals. Hence, if SerenadeCS is set to perform extensions, a comparable setting for SerenadeES is = · extensions.

3.6. Privacy

The privacy of the proposed approach is mainly based on the exponential mechanism. For this, we need to sample from every possible extension with the correct probability. Next we show that this is the case. We then continue by analysing the use of the budget, and conclude that the methods are indeed private.

Lemma 1. Given a redescription ℓ−1 = ( , ), the probability that ExtendOne in line 9 of Algorithm 1 returns an extension ℓ = ( ′ , ′ ), where either i) ′ is an arbitrary single-literal extension of and = ′ , ii) ′ is an arbitrary single-literal extension of and = ′ , or iii) = ′ and = ′ , is ∝ exp( · ( ′ , ′ , , )). Proof. We can assume w.l.o.g. that = ′ . Assume first that D does not contain continuous attributes. To select the extension of , ExtendOne can be considered to go through every extension, computing their quality according to , and placing the extension into a stream together with a weight parameter that is set to exp( · ). Finally, the algorithm will also place the unextended into the stream with weight corresponding to the quality of ( , ). The weighted reservoir sampler will select one item from this stream, so that the probability of an item being selected is proportional to its weight [21, 20], thus sampling the extension with correct probability.

For continuous attributes, the algorithm further perturbs the boundaries as explained in Section 3.3. As this does not afect the quality, all extensions with the same quality have equal probability to be selected, concluding the proof.

Lemma 2. The extended redescription = ( ′ , ′ ) returned by ExtendOne when called in line 4 of Algorithm 2 is selected with probability proportional to exp( · ( ′ , ′ , , )) from all possible at-most-single literal extension to all redescriptions in . Lemma 3. Given a pair of single-literal queries ( , ) over data sets D and D , the probability that IntialPairs returns ( , ) is ∝ · exp( · ( , , , )). Lemma 4. The total budget used by SerenadeCS and SerenadeES never exceeds tot.

The proof for Lemma 3 follows the same lines as for Lemma 1, except that the sampling is repeated times, and is omitted. The proofs for Lemma 2 and Lemma 4 are postponed to the full version of the paper. 2 Theorem 5. SerenadeCS and SerenadeES are tot-diferentially private. Proof. Per Lemma 3, both algorithms generate initial pairs privately. Each initial pair is an independent query to the data, spending altogether budget init. Every extension in SerenadeCS is an independent, diferentially private query (Lemma 1). Due to the composability of diferential privacy [ 9 ], they do not reveal more information than permitted by their allocated budget. The same holds for extensions in SerenadeES. Finally, quality calculations in ComputeQuality use the standard Laplace mechanism and are private, in line with their allocated budget. The total budgets do not exceed the allocated tot (Lemma 4).

4. Experimental Evaluation

To test our proposed methods, we conducted both quantitative and qualitative experiments. The purpose of the quantitative experiments is to investigate the behavior of the algorithms with diferent parameters and carry out an ablation study. The qualitative experiments are meant to assess the usefulness of our approach for real-world data analysis. For this reason, we did not do any parameter tuning as that would consume some of the privacy budget.

All algorithms are implemented in Python and the experiments are run with Python 3.6.8 on a machine with 2 AMD EPYC 7702 processors with 64 cores each and 1 TB of main memory.

4.1. Data and Other Methods

The Mammals data set contains information about which mammal species inhabit which areas of Europe on one side [22], and climate information on the other side [23]. This data set has often been used in redescription mining literature (e.g. [ 8, 24, 4, 25, 5, 26 ]). The data contains 194 species and 48 climate attributes as its columns, with 2575 localities as its rows.

The MIMIC data set [27]3 is an example of health data where patient privacy is critical. It contains de-identified health information about hospital patients. Our dataset contains two views: the diagnoses view and the lab events view. The aim when mining this data set is to find out what kind of lab events are associated with various diagnoses of the examined patients. The data set contains 107 Boolean attributes in the diagnoses view (1 denotes a positive diagnosis) and 104 ternary attributes in the lab events view, with 46 065 patients as its rows. 2The source code and the appendix containing all details that are postponed to the full version of the paper are presented in https://github.com/maijuka/Serenade 3Data set available at https://physionet.org/content/mimiciii/1.4/.

The VAA data set [28] contains the background information and answers to questions regarding political opinions of candidates to the Finnish parliamentary elections of 2011. The data was collected by an online voting advice application (VAA). We removed candidates who did not provide answers to all questions, leaving 1656 candidates, 9 background attributes and 107 opinion attributes.

The Open Sex-Role Inventory [29] (SR data set) is used to study gender roles by measuring masculinity and femininity. The Right-Wing Authoritarianism Scale [30] (RightW data set) was developed to measure the psychology of the followers of facist regimes.4 Columns recording to the time spent answering the questions were removed from the RightW and the SR data sets. These data sets have the same variables on both sides; IntialPairs and ExtendOne are implemented to ensure the same attribute cannot be used in both queries of the same redescription.

We also conducted experiments with other datasets. The results were not significant and are postponed to the full version of the paper.

The main studied algorithms are SerenadeCS and SerenadeES. In addition, for ablation studies, we used a variation of SerenadeES called NoSens. It works like SerenadeES, but uses the standard Jaccard (1) instead of the lower-sensitivity version (4) for computing keys (3). We also did a comparison to ReReMi, the baseline non-private redescription mining algorithm [ 8 ].

For all experiments, we set min = 0.3 for SerenadeCS and to 10 % of data set size .

4.2. Quantitative Experiments

In this empirical evaluation, we investigate the efects of parameters and carry out an ablation study using NoSens.

Privacy budget. A typical value for the privacy budget is tot = 1, though much higher total budgets are sometimes used (e.g. [ 31, 32, 13 ]). Hence we first studied the efect of varying the total budget tot along with diferent ways to distribute it between the diferent tasks. Total budgets of 0.5, 1, 10, and 100 were used. Of these, tot = 0.5 and tot = 1 are typical values found in the literature, tot = 10 is closer to values sometimes used in practice [ 13 ], and tot = 100 can be seen as a baseline of how the algorithms perform when privacy requirements are lifted.

Another parameter to consider is the distribution of the total budget between init, ext, and qual. We used five diferent ways of distributing tot between init, ext, and qual, respectively in the following proportions: 1 = (0.33, 0.33, 0.33), 2 = (0.45, 0.45, 0.1), 3 = (0.6, 0.3, 0.1), 4 = (0.3, 0.6, 0.1), and 5 = (0.25, 0.25, 0.5). The maximum age of an extension was set to 40, = 20, = 4, and = 80. Each experiment was repeated ifve times.

Results for these experiments on the Mammals data set are presented in Fig. 1(a–b) where we see that both SerenadeCS and SerenadeES are capable of returning very 4Data sets available at https://openpsychometrics.org/_rawdata/. 0.8 0.6 0.4 0.2 d 1.0 0.8 0.6 0.4 0.2 good results when allocated suficiently high budgets. On the other hand, NoSens struggles even when allocated very high budgets. Given its ineficient use of the budget, this is not surprising. Results for similar experiments on three other data sets are presented in Fig. 1(c–e) where we see that SerenadeCS returns clearly better results with almost all parameters for these datasets. We can also see that VAA (Fig 1(e)) yielded poor results regardless of budget and algorithm. This is explained by the fact that the non diferentially private redescription mining algorithm ReReMi5 also fails to mine good quality redescriptions from this data set. These results are postponed to the full version of the paper. From Fig. 1(f) we see that we get slightly better initial pairs with a higher budget, which could be the reason why the distributions 3 and 2, having the highest budgets for initial pairs, perform slightly better than the other budget distributions.

SerenadeES typically has higher variance on the accuracy of the results, giving accuracies both higher and lower than SerenadeCS. This is in line with the expected behavior. Of the diferent budget distributions, 3 seems to be consistently giving the best results, although the diferences are not significant.

An analyst is typically not interested in results with very low reported accuracy (noisy Jaccard). Figure 2 repeats Fig. 1(a–b) for SerenadeES and SerenadeCS after removing redescriptions with noisy Jaccard below 0.3. The average and median for SerenadeES 5Source code available at https://pypi.org/project/python-siren/ [ 8 ] 0.6 d r a c c0.4 a J e u r0.2

T b 1.0 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 improve, and especially the lower end of the tail rises, as expected. We can also see that this pruning leaves some redescriptions with true accuracy below 0.3, but does not actually discard many of those reported to be good. The discrepancy between the reported and true accuracy is an important feature: if the system reports a redescription as having a high accuracy, the user should be able to trust that it truly is so. Number of initial pairs and reliability of noisy Jaccards. Next we investigate the efects of the number of initial pairs. For these tests, the maximum age was always set to be twice the number of initial pairs, and the number of extension rounds was four times the number of initial pairs. Budget distribution 1 was used throughout. Recall that more initial pairs means the budget is divided between more redescriptions. This shows especially strongly in the diferences between the true and reported accuracies (Figure 3).

Figure 3 also shows that SerenadeCS returns fewer redescriptions than SerenadeES and that it occasionally reports overly optimistic Jaccard values even with = 20. This is due to its mediocre use of the budget, leaving just slivers for relevant quality computations. Note that reported accuracies for SerenadeCS are cut above min = 0.3 through filtering. With 80 initial pairs, reported Jaccards become less correlated with true Jaccards for both methods, as budget per redescription dwindles.

Maximum age. The final quantitative study probes the efects of the maximum age memoization in ExtendOne in SerenadeES. We investigated this by setting = 80 and trying maximum ages 0 (computing extensions anew in every iteration), 5, 10, 20, 40, 60, and 81 (stored extensions never expire). We found that the memoization has a clear beneficial efect on the running time. Using maximum age 20 brings the running time down to about 25 min, from over 70 min without memoization (maximum age 0). Most redescriptions are selected for extension within about 20 steps, so the running time does not improve noticeably beyond that. On the other hand, the maximum age did not seem to have a significant efect on the quality.

4.3. Qualitative Experiments

Next, we present a few results mined from diferent datasets, as anecdotal examples of results one can expect by our algorithms with these kinds of data sets. We did five runs of the algorithms for the MIMIC, VAA, RightW, SR, SSC, Psycho, and Pref data sets with total budget 1 and budget distribution 3. The results were chosen over five runs for both algorithms, so the total budget per data set is 10. The results with VAA, SSC, Psycho and Pref datasets, and the raw redescriptions are postponed to the full version of the paper.

An example result ( , ) from MIMIC with SerenadeCS is

( Atrial Fibrillation , Uric Acid = 1 ∨ Creatine Kinase = 0 ) , meaning that the set of patients who had atrial fibrillation (a type of arrhythmia, abnormal heart rhythm) corresponds to the set of patients whose uric acid (waste product in blood) test results were abnormal or whose creatine kinase (enzyme related to energy production) test results returned normal values. The reported accuracy of this redescription is 0.424 and its true accuracy is 0.227. Such results can be used to understand the procedures at this hospital, and as they are privatized, the results can also be communicated more widely without putting the privacy of patients in jeopardy.

A result with SerenadeES from the RightW data set tells that the set of participants who moderately or strongly agreed that “You have to admire those who challenged the law and the majority’s view by protesting for women’s abortion rights, for animal rights, or to abolish school prayer” corresponds to the set of participants who did not vote. The reported accuracy of this redescription is 0.428 and its true accuracy is 0.538.

A result from the SR data set with SerenadeCS the set of participants who agreed or slightly agreed that they are the happiest when they are in their bed corresponds to the set of participants who agreed or slightly agreed that they give people handmade gifts and are between 14 and 32 years of age. The reported accuracy of this redescription is 0.651 and its true accuracy is 0.725.

5. Conclusions

It is the nature of diferential privacy and local pattern mining that the user can never be entirely certain whether a pattern reported by the algorithm actually represents a strong pattern in the data. But careful selection of the parameters allows to avoid most of the pitfalls, and when a privacy budget higher than 1 is available, the problem essentially vanishes. Even with a privacy budget of at most 1, the results can be very useful. Of the two approaches presented, SerenadeCS gives results with higher (true and reported) Jaccard, but with lower correlation between the two, meaning that a good-looking result from SerenadeCS might turn out not to be so good in reality. SerenadeES, on the other hand, returns more truthful Jaccards, thanks to its better use of budget, but its extension method might lead to generally somewhat inferior results.

Overall, the reported strong patterns are reliably present in the data. This means that a user who obtains interesting strong patterns via these diferentially private approaches can seek further means to confirm them, for example by applying for a research permit for the data set or by conducting a separate confirmatory study. The techniques presented in this paper might also be used to implement other diferentially private pattern mining algorithms in the future. [14] C. Zeng, J. F. Naughton, J.-Y. Cai, On diferentially private frequent itemset mining,

Proc. VLDB Endowment 6 (2012) 25–36. [15] J. Lee, C. W. Clifton, Top-k frequent itemsets via diferentially private FP-trees, in:

KDD ’14, 2014, pp. 931–940. [16] Y.-T. Tsou, H. Zhen, X. Jiang, Y. Huang, S.-Y. Kuo, DPARM: Diferentially private association rules mining, IEEE Access 8 (2020) 142131–142147. [17] M. Maruseac, G. Ghinita, Precision-enhanced diferentially-private mining of highconfidence association rules, IEEE Trans. Dependable Secure Comput. 17 (2018) 1297–1309. [18] S. Xu, S. Su, L. Xiong, X. Cheng, K. Xiao, Diferentially private frequent subgraph mining, in: ICDE’16, 2016, pp. 229–240. [19] L. Bonomi, L. Xiong, A two-phase algorithm for mining sequential patterns with diferential privacy, in: CIKM’13, 2013, pp. 269–278. [20] P. S. Efraimidis, P. G. Spirakis, Weighted random sampling with reservoir, Inf. Proc.

Lett. 97 (2006) 181–185. [21] M. Kolonko, D. Wäsch, Sequential reservoir sampling with a nonuniform distribution,

ACM Trans. Math. Soft. 32 (2006) 257–273. [22] A. J. Mitchell-Jones, G. Amori, W. Bogdanowicz, B. Krystufek, P. J. H. Reijnders, F. Spitzenberger, M. Stubbe, J. B. M. Thissen, V. Vohralik, J. Zima, The Atlas of European Mammals, Academic Press, 1999. [23] R. J. Hijmans, S. E. Cameron, L. J. Parra, P. G. Jones, A. Jarvis, Very High Resolution Interpolated Climate Surfaces for Global Land Areas, Int. J. Climatol. 25 (2005) 1965–1978. URL: www.worldclim.org. [24] E. Galbrun, P. Miettinen, Siren: An interactive tool for mining and visualizing geospatial redescriptions [demo], in: KDD’12, 2012, pp. 1544–1547. [25] J. Kalofolias, E. Galbrun, P. Miettinen, From sets of good redescriptions to good sets of redescriptions, in: ICDM’16, 2016, pp. 211–220. [26] E. Galbrun, P. Miettinen, Mining Redescriptions with Siren, ACM Trans. Knowl.

Discov. Data 12 (2018). [27] A. E. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, R. G. Mark, MIMIC-III, a freely accessible critical care database, Scientific data 3 (2016) 160035. [28] Helsingin Sanomat, Parliamentary elections 2011: Candidate responses to helsingin sanomat candidate selector, 2016. URL: http://urn.fi/urn:nbn:fi:fsd:T-FSD2701, version 2.1. [29] T. E. Malloy, Bem sex role inventory, Wiley Online Library, 2010. [30] B. Altemeyer, Right-wing authoritarianism, Univ. of Manitoba Press, 1983. [31] R. Balu, T. Furon, Diferentially private matrix factorization using sketching techniques, in: IH&MMSec’16, 2016, pp. 57–62. [32] S. Fletcher, M. Z. Islam, Decision tree classification with diferential privacy: A survey, ACM Comput. Surv. 52 (2019) 83:1–83:33.

[1]

Dwork ,

McSherry ,

Nissim ,

Smith , Calibrating noise to sensitivity in private data analysis , in: TCC'06 , 2006 , pp. 265 - 284 .

[2]

Galbrun ,

Miettinen , Redescription Mining, Springer, 2018 .

[3]

Ramakrishnan ,

Kumar ,

Mishra ,

Potts ,

R. F.

Helm , Turning CARTwheels: An alternating algorithm for mining redescriptions , in: KDD'04 , 2004 , pp. 266 - 275 .

[4]

Zinchenko , E. Galbrun,

Miettinen , Mining Predictive Redescriptions with Trees , in: ICDM'15 Workshops , 2015 , pp. 1672 - 1675 .

[5]

Mihelčić ,

Džeroski ,

Lavrač ,

Šmuc , A framework for redescription set construction , Expert Syst. Appl . 68 ( 2017 ) 196 - 215 .

[6]

Gallo ,

Miettinen ,

Mannila , Finding subgroups having several descriptions: Algorithms for redescription mining , in: SDM'08 , 2008 , pp. 334 - 345 .

[7]

F. I.

Stamm ,

Becker ,

Strohmaier ,

Lemmerich , Redescription model mining , in: KDD'21 , 2021 , pp. 1521 - 1529 .

[8]

Galbrun ,

Miettinen , From black and white to full color: Extending redescription mining outside the Boolean world , Stat. Anal. Data Min . 5 ( 2012 ) 284 - 303 .

[9]

Dwork ,

Roth , The algorithmic foundations of diferential privacy , Found. Trends Theor. Comput. Sci. 9 ( 2014 ) 211 - 407 .

[10]

McSherry ,

Talwar , Mechanism design via diferential privacy , in: FOCS'07 , 2007 , pp. 94 - 103 .

[11]

Dwork ,

G. N.

Rothblum , Concentrated diferential privacy, arXiv preprint arXiv:1603 . 01887 ( 2016 ).

[12]

Ding ,

Kulkarni ,

Yekhanin , Collecting Telemetry Data Privately , in: NIPS' 17 , 2017 , pp. 3571 - 3580 .

[13]

Census Bureau , Census bureau sets key parameters to protect privacy in 2020 census results , 2021 . URL: https://www.census.gov/newsroom/press-releases/ 2021 / 2020-census -key-parameters .html.