=Paper=
{{Paper
|id=Vol-2029/paper6
|storemode=property
|title=Data-driven Anonymization Process Applied to Time Series
|pdfUrl=https://ceur-ws.org/Vol-2029/paper6.pdf
|volume=Vol-2029
|authors=Vincent Thouvenot,Damien Nogues,Catherine Gouttas
|dblpUrl=https://dblp.org/rec/conf/simbig/ThouvenotNG17
}}
==Data-driven Anonymization Process Applied to Time Series==
Data-driven anonymization process applied to time series

Vincent Thouvenot (Thales Communications & Security), Damien Nogues (Thales Digital Factory), Catherine Gouttas (Thales Digital Factory) — firstname.lastname@thalesgroup.com

Abstract

Digital transformation and Big Data allow the use of highly valuable data. However, these data can be individual or sensitive, and represent an obvious threat to privacy. Anonymization, which achieves a trade-off between data protection and data utility, can be used in this context. There is no global anonymization technique which fits all applications. Here, we describe a data-driven anonymization process and apply it to simulated electrical load data.

1 Introduction

This paper is mainly written as a dissemination exercise about anonymization and good practice around it. Indeed, while anonymization is quite well known from an academic point of view, this is not yet the case among French and European industrial actors. However, privacy protection is a fundamental and growing task for them. Digital transformation brings the creation of global data lakes and allows the development of new valuable business. Moreover, some governments push for an increasing release of Open Data, which should promote the opening of digital knowledge and ensure an open and valuable digital ecosystem. At the same time, the European Union sets up rules to protect citizens, which establish that citizens have a right to protection of their individual data. The data have to be fairly processed for a specific purpose, and with the individuals' agreement. This is an important point, because keeping a maximum of personal data for future, unspecified mining tasks that might become possible with future methods is inconsistent with this rule. People have the right to access and rectify their individual data. Under the 2016 regulation, applied from 2018, any company offering goods or services (including cloud services) to EU citizens may be subject to regulation.

Besides this legal context, Big Data technologies enable the treatment of massive, dynamic and unstructured data, and facilitate data crossing, weakening privacy protection. The data concerned can be personal (name, address, etc.), and allow one to (almost) directly identify an individual. Sensitive data, like religious or political beliefs, pose a risk for individual privacy too. Smartphones, smart objects, loyalty cards, online purchases, social media: these are large sources of individual and sensitive data, which lead to an obvious risk for privacy. Protecting privacy means preventing the isolation of an individual, the correlation of information from different datasets for one individual, and the possibility of obtaining information on an individual through exogenous variables. Despite appearances, the trade-off behind anonymization is not an easy task. A dataset can contain identifiers; just deleting or encrypting them is generally not sufficient (see e.g. Hansell (2006); Narayanan and Shmatikov (2008)). Other information can be quasi-identifiers, which allow re-identification when they are crossed (e.g., in a health dataset, sex and age). There are sensitive data (e.g., disease in a health context). Finally, there are the remaining attributes. In addition, the trade-off depends on three parameters. First, it depends on data typology: bank transaction data (see e.g. Ezhilarasi and Hariharan (2015)) will not require the same treatment as unstructured data like social media data, for which one must hide metadata, identifying content and the relational graph (see e.g. Zhou et al. (2008); Chester et al. (2013)). Second, the trade-off depends on the future use of the data. Third, anonymization strength is time dependent: new datasets and new re-identification methods can be used to attack privacy, so an anonymization methodology that is admissible today can become inadmissible a few months or years later.
We illustrate an anonymization workflow on (simulated) electrical smart meter data, a symbolic example of sensitive and individual data whose exploitation possibilities are new and illustrate new anonymization needs. In France, smart meters are currently being deployed, with deployment planned to end in 2020, and will allow the gathering of infra-daily household electrical load. These data are by nature personal and sensitive. Infra-daily individual loads make it possible to detect if and when someone is at home, which can for instance increase the risk of burglary. Individual habits can be detected (see e.g. Blazakis et al. (2016)), and from them sensitive information like religion (e.g., behavior during Ramadan) or some illegal activities (e.g., the very particular load pattern of a cannabis plantation) can be derived. Providing these data is therefore a complicated challenge. Indeed, household electrical load will be available at different levels of time granularity (see e.g. Tudor et al. (2013); Buchmann et al. (2013)). To simplify the context: in France, infra-daily load will be available to the individuals themselves, who can choose to temporarily give a third party access to these data; the electricity distributor will gather daily data (even infra-daily in particular contexts); and the provider will access monthly data. Although infra-daily data are noisy, access to identified daily (or even monthly) data easily allows re-identification of infra-daily consumption (see e.g. Tudor et al. (2013)). In this context, just hiding direct identifiers is inadequate. On the other hand, these data have strong potential value and many actors are interested in them: distributors, to manage their network and plan maintenance; local communities, to improve their urban policy; providers, to propose new, more adaptive pricing; or start-ups, to offer individuals services to optimize their consumption. Depending on the future use of the data, it is not necessary to keep the same information: for instance, designing a new adaptive pricing and dimensioning the network from household electrical load require different information. Finally, the currently acceptable methods may be questioned when data from gas smart meters, which are also being deployed, become available, since gas meter data depend on phenomena similar to those driving electrical meter data.

The article is divided into four further sections. In Section 2, we give a short survey of anonymization. In Section 3, we describe our global anonymization process, from data gathering to dataset publication. Section 4 presents a simulator consistent with the French electrical smart meter anonymization task; we use simulated data because we do not have access to real data. In Section 5, we apply the process of Section 3 to the simulation of Section 4.

2 A short survey about anonymization

For a dissemination task, it seems important to briefly discuss the difference between anonymization and encryption, because confusing the two is a common mistake. Data encryption consists in transforming data with a mathematical algorithm; the process can be reversed with the right algorithm and the encryption key. It can be used to transfer data between two entities. Encrypted data are still individual, and so still personal data if the original data are personal. Although encryption can usefully be one component of de-identification, it is neither necessary nor sufficient for anonymization.

Pseudonymisation, which consists of hiding identifying metadata, can be inefficient (see Danezis (2013)). De-identification techniques fall into two categories: transforming data so that the published individuals are no longer real, and aggregating and generalizing individuals, so that the data provided represent a set of individuals. These techniques can be used separately or combined.

Permutation techniques and puzzling approaches deconstruct, transform and/or change the data design (see e.g. Agrawal and Srikant (2000), Zhang et al. (2007)). Noise addition techniques are popular: for instance, Dufaux and Ebrahimi (2006) randomly transform a video representation by inverting some signs of decomposition coefficients, and apply this to privacy protection in video surveillance. Aggarwal and Yu (2008) propose a survey of randomization methods. Classical randomization has some advantages: the noise is independent of the data and does not require the entire dataset, so it can be applied during data collection and on distributed systems. Liu et al. (2008) survey attacks on privacy obtained by perturbation methods. When the anonymizer adds additive noise, attackers can use methods like filtering (spectral, singular value decomposition, principal component analysis, etc.), maximum a posteriori estimation, or distribution analysis. Under good conditions, multiplicative perturbations have good properties, for instance preserving Euclidean distance.
An attacker who knows a sample of inputs and outputs, or who has independent samples from the original distribution, can reverse such anonymization. Differential privacy (see Dwork (2006), Dwork and Roth (2014)) is a very popular form of noise addition: random noise is added in such a way that the mechanism produces the same output with almost the same probability on any two adjacent inputs. A basic process to achieve differential privacy is to sample the dataset without replacement and to add fictive individuals. Differential privacy makes it possible to reason about privacy loss and to bound the risk. Chatzikokolakis et al. (2013) apply differential privacy to mobility traces and try to develop a mechanism more efficient than simply adding independent noise. McSherry and Mironov (2009) propose a differential privacy framework to produce recommendations from collective user behavior in the Netflix Prize dataset.
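The calibrated-noise view of differential privacy sketched above can be illustrated with the standard Laplace mechanism. This is a generic textbook illustration, not the specific mechanism of any work cited here; the load values, their assumed bound, and all variable names are hypothetical.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release a numeric query result with Laplace noise of scale sensitivity / epsilon."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
loads = np.array([1.2, 0.8, 3.1, 2.4])   # hypothetical daily loads, assumed bounded in [0, 5]
# Changing one bounded record moves the mean by at most (upper bound) / n
sensitivity = 5.0 / len(loads)
private_mean = laplace_mechanism(loads.mean(), sensitivity, epsilon=1.0, rng=rng)
```

A smaller epsilon gives stronger protection (larger noise) and lower utility, which is exactly the protection/utility trade-off discussed throughout this survey.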
Privacy can also be protected by creating sets of individuals. K-anonymization (see Sweeney (2002)) consists of generalizing quasi-identifying information so that at least k individuals share the same values. K-anonymity can be broken when all the individuals of (at least) one class have the same sensitive value. L-diversity (see Machanavajjhala et al. (2006)) forces each class to contain at least l different values of the sensitive attribute. In t-closeness (see Li et al. (2007)), the sensitive attribute in each class has to respect its distribution in the total population. Optimal generalization and suppression is NP-hard. Moreover, as noted in Domingo-Ferrer and Torra (2005), generalization and suppression can be ill-adapted to ordinal categorical and continuous attributes; Domingo-Ferrer and Torra (2005) propose microaggregation for this task. In microaggregation, data are partitioned into several clusters of size at least k containing similar records; an aggregation operator (e.g., mean or median for continuous variables) is then applied to compute the centroid of each cluster. Besides the clustering method, microaggregation has two important parameters: the minimum size of each cluster, which adjusts the level of privacy protection, and the function used to compute the aggregate value, which is linked with future data utility and the protection level. The aggregation function can be a mean, sum, median, quantile, partial autocorrelation function, time slicing profile, density, etc. Domingo-Ferrer and Torra (2005) partition data with the Maximum Distance to Average Vector (MDAV) algorithm. Aggarwal et al. (2006) propose a microaggregation in which some atypical individuals may remain unclustered and thus unpublished. Byun et al. (2007), Lin and Wei (2008), Li et al. (2002), Xu and Numao (2015) and Loukides and Shao (2008) propose greedy heuristics to achieve k-anonymity through clustering with non-NP-hard complexity. Bergeat et al. (2014) compare two software packages for k-anonymization on a French health dataset of more than 20 million records. Gedik and Liu (2008) use k-anonymity to protect mobile location privacy.
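As a minimal illustration of microaggregation on a single continuous attribute (a toy sketch with a fixed cluster size and a mean aggregator, not the MDAV algorithm; the attribute and its values are hypothetical): records are sorted, grouped into clusters of at least k neighbors, and replaced by their cluster mean.

```python
import numpy as np

def microaggregate(values, k):
    """Replace each value by the mean of a cluster of at least k similar records."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    out = np.empty_like(values)
    n, start = len(values), 0
    while start < n:
        # The last cluster absorbs the remainder so no cluster has fewer than k records
        end = n if n - start < 2 * k else start + k
        idx = order[start:end]
        out[idx] = values[idx].mean()
        start = end
    return out

ages = np.array([21.0, 23.0, 22.0, 35.0, 37.0, 36.0, 80.0])
print(microaggregate(ages, k=3))   # → [22. 22. 22. 47. 47. 47. 47.]
```

Note how the outlier (80) is hidden inside the mean of its cluster: this is the utility cost paid for making each published value shared by at least k individuals.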
When working on time series, the previous techniques can be used. For instance, Shou et al. (2013) propose what they call (k, P)-anonymity to preserve patterns in time series. For electrical household load protection, Chin et al. (2016) solve an optimization problem with two components: the first concerns the information leakage rate of consumer load given grid load, and the second the cost of errors. Shi et al. (2011) propose a form of differential privacy to protect time series. Zhang et al. (2015) propose noise generation to protect cloud data. Hong et al. (2013) provide a survey of time series privacy protection.

After de-identification, it is important to measure the degree of de-identification. Venkatasubramanian (2008) surveys the metrics proposed to measure privacy and privacy loss, dividing them into three categories: statistical methods, which take into account the variance of the perturbed variable; probabilistic methods, based on information theory and Bayesian analysis; and computational methods, which start from the idea of a resource-bounded attacker and measure privacy loss as a function of the information available to such an attacker. Tóth et al. (2004) work on message transmission and analyze two entropy-based anonymity measures, separately measuring global anonymity, which quantifies the effort necessary to fully compromise the dataset (what we later call the journalist scenario), and local anonymity, which quantifies the probability that the transmissions of one user are compromised (what we later call the prosecutor scenario). Gambs et al. (2014) work on attacks on geolocated data. Nin and Torra (2009) propose a framework of protection and re-identification for time series. Ma and Yau (2015) propose information measures for quantifying the privacy protection of time series data.

The interested reader can find in the first-rate book of Torra (2017) the stakes of data privacy and the associated techniques.

3 Anonymization process

Each anonymization task has a specific part and a generic part, and so is unique. In this section we describe our global data-driven anonymization process (see Figure 1), which separates these two parts.

Figure 1: Global anonymization process

After data gathering, we have to tag the data according to their categories: identifiers, quasi-identifiers, sensitive data, and remaining attributes. The next step consists in pseudonymisation: identifiers have to be hidden (deletion, encryption, etc.). Obviously, this is generally not enough to ensure privacy protection. We then have to establish one metric to measure data protection and another one for data utility. As anonymization cannot be total and perfect, we have to choose a threshold of acceptable re-identification risk. Despite the de-identification process, a residual risk remains, which has to be compared with the benefits.

It is necessary to build a re-identification framework, which is driven by the context and has to be realistic. The worst-case situation, where the attacker is almost all-knowing, is probably not realistic and degrades the data utility / privacy protection trade-off. The re-identification framework depends on many parameters. The attacker type must be determined: the attacker's resources depend on who he is (e.g., a member of the organization which anonymizes the data, who has access to plenty of data with which to attack the anonymized dataset; a member of a related organization with access to similar data which can be crossed; a machine learning expert who can deploy efficient re-identification models; a neighbor with access to contextual data; etc.). Many motivations are possible (retrieving information about an individual for aggressive commercial offers or burglary; harming the organization which manages the anonymization, for instance to take over data governance; demonstrating re-identification capability; etc.). The framework also depends on the attack category: crossing anonymized data with a non-anonymized dataset is a classical way of attempting re-identification, but information inside the anonymized data can be used too (e.g., anonymized Internet queries can be identified by crossing the location and interests of an individual). It further depends on the chosen meaning of re-identification, which can be identifying an individual or identifying sensitive information about an individual. For instance, if household loads have been anonymized by pseudonymization followed by noise addition, a noisy curve can be matched to a customer; with microaggregation, it may be possible to find a customer's cluster and deduce probable sensitive information and behavior. The framework also depends on the technique used to measure re-identification risk. To be valuable, attackers have to be confident in their re-identification. When we compute the risk of individual identification, true positives are not the only key performance indicator: here, a true positive means that the right individual in the anonymized data has been matched to the individual in the non-anonymized data, but this individual may also match many other individuals in the anonymized data, which decreases the re-identification risk, and identification errors decrease the attacker's confidence too. The risk measure has to combine all this information. Lastly, the framework depends on the re-identification scenario. Many are possible: targeting all or almost all individuals known to be in the dataset (journalist scenario), targeting one or a few individuals (prosecutor scenario), trying to distinguish studies run with and without one individual, etc.

Then, as anonymization is a trade-off between utility and protection, we have to choose the minimum utility of the data. There are two cases. Data can be provided to a third party to answer a specific need: only the necessary information, and no more, has to be provided, and after anonymization the intended study has to remain possible. For instance, if an electricity provider wants to build a new daily pricing, it only needs precise daily profiles. Alternatively, data can be released without a specific need, for instance in Open Data. We also need to compute a metric of privacy cost: for instance, when we add noise to a time series, we can compute a signal-to-noise ratio.
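The signal-to-noise ratio mentioned above can serve as a simple privacy-cost metric when noise addition is the chosen de-identification technique. A minimal sketch, in which the toy load curve and the noise scale are assumptions:

```python
import numpy as np

def snr_db(original, anonymized):
    """Signal-to-noise ratio in decibels between a series and its noisy release."""
    original = np.asarray(original, dtype=float)
    noise = np.asarray(anonymized, dtype=float) - original
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(1)
load = 2.0 + np.sin(np.linspace(0.0, 8.0 * np.pi, 480))   # toy infra-daily load curve
noisy = load + rng.normal(scale=0.3, size=load.size)
cost = snr_db(load, noisy)   # lower SNR = stronger perturbation = more protection, less utility
```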
Lastly, as anonymization cannot be perfect, we have to choose the limit of acceptable re-identification risk, which determines the achievement point of the trade-off. It depends on the level of individuality and sensitivity, and the choice is driven by the exportation model used at the end. We can choose a publish-and-forget model, where data are typically released as Open Data; then it is (almost) impossible to stop the data sharing. Another model is the data use agreement model, where an agreement specifies what the third party can do. Finally, we can use a closed model, where data stay in a closed environment and the third party only has access to the results of its requests.

To avoid scalability problems during this non-industrial step, we can perform an optional sub-sampling. De-identification methods, which depend on data typology and future data use, are then applied, and we measure the re-identification risk. When the risk is lower than the authorized maximum, we transform the entire dataset. We then measure the utility of the anonymized data: if the minimum utility is not met, we restart all the steps of this paragraph; otherwise we export the data under the chosen model.

4 Application context: simulated electrical load

Electrical household load simulation has a rich literature. Many authors use variants of Markov chains (e.g. Labeeuw and Deconinck (2013), Muratori et al. (2013), McLoughlin et al. (2010)). McQueen et al. (2004) use a Monte Carlo simulation model of load demand taking into account the statistical spread of demand in each half hour, using data sampled from a gamma distribution. Paatero and Lund (2005) use a bottom-up load model for generating household load profiles. Pompey et al. (2015) train Additive Models (see Hastie and Tibshirani (1990)) to achieve massive-scale simulation of electrical load in smart grids. Additive Models have already proven their efficiency for modelling and forecasting electricity load at the aggregate level (see Pierrot and Goude (2011) in France, Fan and Hyndman (2012) in Australia) as well as at the local level (see e.g. Nedellec et al. (2014)).

We simulate three types of curves. The first, "second house load", is almost constant with added random noise. The second, "little professional", is represented by piecewise-constant curves whose load is almost null at night and during the weekend and almost constant during the day, where the jump intensities and activity periods are randomly chosen. The third, "household load", has calendar and thermal components. To simulate the last, we apply an idea similar to Pompey et al. (2015). We train simulation models on the GEFCom 2012 dataset (see Hong et al. (2014)), which comes from a Kaggle challenge and contains the hourly load demand of 20 local areas in the USA and the temperature of 11 weather stations. We train many different Additive Models on this dataset, whose feature sets contain calendar parameters (type of day, number of days since the beginning of the year, etc.) and a random number of raw and smoothed temperatures, trained on different periods. Then, we compute two months of their forecasts with random translations of the features (e.g., one additive model design is translated by one hour and its temperatures by two degrees Fahrenheit) to introduce variety in the simulated load. We also train some quantile additive models (see Gaillard et al. (2016)) to simulate extreme behaviors. Our simulated data are smoother than real individual load, but this is not a real problem here. Indeed, for our application, what matters is that the simulation respects some assumptions: we assume there are different levels of load and three families of curves, and that consumptions are almost unique even at low frequency, as shown in Tudor et al. (2013). To enforce a uniqueness assumption, we arbitrarily impose that the daily aggregates over one week of two "household load" curves, rounded to the thousands, cannot be equal. We choose the daily aggregation time scale because we face an attacker using daily data, and by considering one week we integrate the weekly cycle, which is important for electrical data.
We assume that a provider needs to refine a pricing for an area of its customers. It needs some infra-daily information (peak of demand, profile, etc.). In our scenario, a potential attacker has access to identified daily data. We assume the exportation model is a publish-and-forget model, which explains the possibility for an attacker to have daily data access.

Table 1: De-identification techniques for time series

Method — Parameters — Utility preserved
Random noise — signal-to-noise ratio, noise family — trend, anomaly detection, total scope
Permutation — type and quantity of permutations — total scope
Transformation — type of transformation (standardization, etc.) — total scope, trend, anomaly detection
Time slicing — time scale — total scope, anomaly detection
Differential privacy — sampling probability and fictive curves — total scope, trend, anomaly detection
Microaggregation — clustering, cluster size, aggregation function — total scope, cluster trend, profile
Scope aggregation — scope, aggregation function — total scope, scope trend, profile

5 Application of the anonymization process

After data gathering, we have to tag the data. Here we assume the data have two attributes: an individual identifier and the simulated time series. First, we pseudonymize the individual identifier by substituting it with random numbers drawn without replacement to ensure uniqueness. We have only one quasi-identifier, which is also the sensitive data: the simulated time series. In our simulation framework, the highest-level attacker could be a competing organization which holds identified data at a fine granularity (daily data), with a good level of optimization and computing power, whose objective is to find information about customers' behaviors in order to make aggressive offers. The data are provided to answer a specific need of a third party: having data with which to establish a new pricing. For this purpose, microaggregation is relevant, and we protect privacy through the minimum size of the clusters. The components of the trade-off are the choice of the clustering methodology, the choice of the minimum cluster size, and the choice of the aggregation operator. With microaggregation, an attacker could, at worst, identify probable customer behaviors: the smaller the minimum cluster size, or the more a single load contributes to the building of a cluster, the more certain the attacker can be of its deductions.
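The pseudonymisation step described above (substituting each identifier with a unique random number drawn without replacement) can be sketched in a few lines; the meter names below are hypothetical.

```python
import numpy as np

def pseudonymize(identifiers, rng):
    """Substitute each identifier with a unique random code (drawn without replacement)."""
    codes = rng.permutation(len(identifiers))   # a permutation is unique by construction
    return {ident: int(code) for ident, code in zip(identifiers, codes)}

rng = np.random.default_rng(42)
mapping = pseudonymize(["meter_A", "meter_B", "meter_C"], rng)   # hypothetical identifiers
```

As the survey stresses, the mapping alone does not protect privacy: the attached time series remains a quasi-identifier, which is why the rest of this section applies microaggregation on top of it.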
5.1 Statistical and Machine Learning Setup

Table 1 presents some techniques which can be used to protect time series. Permutation, which consists of exchanging data from one curve to another, is a form of noise introduction and can create unrealistic curves. Transformation can be smoothing, standardization, etc., with the objective of erasing some individualities. Time slicing breaks individual trends. Differential privacy can be completed by post-treatment to improve protection. The re-identification risk of the first five techniques of Table 1 concerns customer identification and can be studied with classical time series identification and classification techniques (deep learning, ensemble methods, k-nearest neighbors, etc.). Moreover, some methods are reversible: denoising methods can be applied against the first one; for a deterministic transformation, the perturbation can be inverted with knowledge of the transformation parameters; and attempting to rebuild the chronological order can reverse time slicing. Sensitive information can be recovered from the last two methods when the attacker can identify the cluster of a customer. Here, we work to provide data to a provider who wants to build a new pricing. It needs precise profiles, and we choose to use microaggregation. As a provider will not propose one tariff per individual but several tariffs for large groups, we do not need precise individual data.

5.2 Chosen methodology

We use time series clustering (see Liao (2005), Rani and Sikka (2012)). As explained in these surveys, there are many ways to cluster time series. First, a clustering algorithm can be applied directly on the raw data; however, this can be inefficient because the data are noisy. The second way consists of extracting features from the time series and applying a clustering algorithm on these features. The third is model-based approaches, where time series are modelled before being clustered. Another important choice is the distance measure: Euclidean distance, Kullback-Leibler divergence, Dynamic Time Warping (see Berndt and Clifford (1994)), etc.
Wavelet decomposition (see Beylkin et al. (1991)) has already been used in time series studies because it allows working on the different frequency levels of the signal and denoising it. We apply the pre-processing of Cugliari et al. (2016), who successfully apply disaggregated load clustering to forecast load demand. After projecting the time series, the authors compute the relative contribution of each energy level. Assume that (φ, ψ_{j,k}) is a Haar basis. A continuous signal can be approximated in a truncated Haar basis by

f̂(t) = c₀ φ(t) + Σ_{j=0}^{J−1} Σ_{k=0}^{2^j − 1} d_{j,k} ψ_{j,k}(t),

where c₀ and the d_{j,k} are the decomposition coefficients obtained with the Fast Wavelet Transform algorithm. Then, the relative contribution of level j is defined by

rel_j = logit( ‖d_j‖² / Σ_{k=0}^{J−1} ‖d_k‖² ),  where logit(p) = ln( p / (1 − p) ).

Among these features, we do not consider c₀, which corresponds to the mean level of each load: we focus on the shape of the profiles, which is what matters for establishing a new pricing. As in Cugliari et al. (2016), we use the relative contributions after a Haar decomposition.
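The relative-contribution features can be computed with a plain fast Haar transform. The sketch below is a minimal version assuming a signal length that is a power of two and strictly positive energy at every detail level; normalization conventions may differ from the cited implementation.

```python
import numpy as np

def haar_details(signal):
    """Fast Haar transform: returns the detail coefficients d_j of each level (finest first)."""
    a = np.asarray(signal, dtype=float)
    details = []
    while a.size > 1:
        pairs = a.reshape(-1, 2)
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0))   # detail coefficients
        a = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)               # approximation
    return details

def relative_contributions(signal):
    """rel_j = logit(||d_j||^2 / sum_k ||d_k||^2); the mean level c0 is deliberately dropped."""
    energies = np.array([np.sum(d ** 2) for d in haar_details(signal)])
    p = energies / energies.sum()
    return np.log(p / (1.0 - p))   # logit of each level's energy share

sig = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 3.0])   # toy load, length 2^3
feats = relative_contributions(sig)
```

Dropping c₀ makes the features invariant to the overall consumption level, so the clustering below compares profile shapes rather than load magnitudes.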
Instead of K-Means, we use Maximum Distance to Average Vector generic (MDAV-generic), presented by Domingo-Ferrer and Torra (2005), because we want clusters of minimum size k regardless of the number of clusters, and not k clusters regardless of their size. Instead of MDAV-generic, we could have used less rigid algorithms like V-MDAV (see Solanas and Ballesté (2006)), which does not force each cluster (except the last ones) to be of fixed size. With MDAV-generic, we know the number of clusters in advance, which can be of interest when there is a data delivery requirements specification with the third party. We benchmark the technique against a mean-based aggregation and a variance-based aggregation: by mean (resp. variance) based aggregation, we mean running k-nearest-neighbors on the mean (resp. variance) load of each individual, initialized at the smallest mean (resp. variance). These benchmarks have the advantage of being easy to implement and fast to compute.

Algorithm 1 MDAV-generic
Assume D is the relative contribution dataset and k an integer.
1. While card(D) ≥ 3k:
   (a) Compute the attribute-wise average of all records in D.
   (b) Find the record d1 most distant from this average (in Euclidean norm).
   (c) Find the record d2 most distant from d1.
   (d) Use d1 and d2 as the centers of two clusters of size k.
   (e) Delete the records of these two clusters and go back to the beginning.
2. If 2k ≤ card(D) ≤ 3k − 1:
   (a) Compute the attribute-wise average of the remaining records in D.
   (b) Find the record d1 most distant from this average (Euclidean norm).
   (c) Use d1 as the center of a cluster of size k.
   (d) Form another cluster with the remaining records in D.
3. If card(D) < 2k, form a cluster with the remaining records.
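Algorithm 1 can be implemented directly; the sketch below follows the steps above on a small numeric dataset (Euclidean norm, clusters of size at least k). One simplification is assumed: if d2 happens to be absorbed by the cluster formed around d1, it still serves as the second center.

```python
import numpy as np

def mdav_generic(X, k):
    """MDAV-generic (Domingo-Ferrer and Torra, 2005): partition records into clusters of size >= k."""
    X = np.asarray(X, dtype=float)
    remaining = np.arange(len(X))
    clusters = []

    def most_distant(idx, point):
        return idx[np.argmax(np.linalg.norm(X[idx] - point, axis=1))]

    def take_cluster(idx, center):
        order = np.argsort(np.linalg.norm(X[idx] - X[center], axis=1))
        return idx[order[:k]], idx[order[k:]]   # cluster of the k nearest records, and the rest

    while len(remaining) >= 3 * k:                        # step 1
        centroid = X[remaining].mean(axis=0)
        d1 = most_distant(remaining, centroid)            # step 1(b)
        d2 = most_distant(remaining, X[d1])               # step 1(c)
        cluster, remaining = take_cluster(remaining, d1)
        clusters.append(cluster)
        cluster, remaining = take_cluster(remaining, d2)  # d2 acts as a center even if absorbed above
        clusters.append(cluster)
    if len(remaining) >= 2 * k:                           # step 2
        centroid = X[remaining].mean(axis=0)
        d1 = most_distant(remaining, centroid)
        cluster, remaining = take_cluster(remaining, d1)
        clusters.append(cluster)
    if len(remaining) > 0:                                # step 3
        clusters.append(remaining)
    return clusters
```

Every cluster returned has at least k members, which is precisely the privacy guarantee the minimum cluster size provides.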
The clustering algorithm divides the individuals into subsets of predetermined minimum size. Then, we have to choose the aggregation technique. We can compute the median load of the individuals of each cluster at each time step; this choice minimizes the absolute loss in each cluster and hides some extreme values. However, there is a non-zero probability that one individual of a cluster is (almost) always the median, in which case publishing the median is equivalent to publishing the load of that individual. Alternatively, the mean can be computed at each instant for each cluster. However, the attacker has access, for each individual i, to the vector A_i of daily loads, and to the infra-daily loads (l_j)_j of each final aggregate (and thus to the daily loads (L_j) of each cluster). The attacker can then try to solve, for each cluster j,

argmin_{p_i ∈ {0,1}} ‖ L_j − (1 / Σ_i p_i) Σ_i p_i A_i ‖².

This adversary model is equivalent to a knapsack problem (see Kellerer et al. (2004)), which can be solved by many algorithms (e.g., dynamic programming or simulated annealing). Even though it is an NP-hard problem and the attacker needs an exact solution, many works show it is possible to treat the problem in a massively parallel way with GPU programming (see Boyer et al. (2012), Suri et al. (2012)). Adding even a small amount of noise breaks this knapsack formulation.

5.3 Results

We compare performance when the clustering algorithm is applied on 1,400 centralized simulated time series. Recall that the objective is to deliver profiles as homogeneous as possible to a third party which wants to propose a new pricing. That means the third party has to receive homogeneous clusters, and the data utility measure has to address this point. If the data were given for a forecasting task, we would measure utility differently, for instance by computing the MAPE (Mean Absolute Percentage Error) on a test subset (see e.g. Pierrot and Goude (2011)). Here, for a pricing task, there is no forecasting need; this illustrates the dependency between anonymization and the future use of the data. We measure data utility with indicators like the silhouette index and the Davies-Bouldin index, which both measure cluster homogeneity. We do not use indicators like the root mean square error, because there are different load levels. We run k-anonymization for k from 4 to 28 by steps of 4.
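The silhouette index used as a utility measure here can be computed with a few lines of NumPy (a minimal sketch assuming at least two clusters; scikit-learn's silhouette_score and davies_bouldin_score provide the same measures; the toy clusters below are assumptions):

```python
import numpy as np

def silhouette(X, labels):
    """Per-individual silhouette index in [-1, 1]; assumes at least two clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distance matrix
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, same].mean() if same.any() else 0.0            # mean intra-cluster distance
        b = min(D[i, labels == lab].mean()                      # nearest other cluster
                for lab in set(labels.tolist()) if lab != labels[i])
        scores[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return scores
```

Applied to the cluster assignments produced by microaggregation, positive per-individual scores indicate well-clustered records, matching the interpretation of Figure 3.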
In Figure 2, we plot the relative computation time, where the reference level is the computation time of 28-anonymization. The computation time decreases when k grows. Figure 3 gives an estimate of the silhouette index density for each k-anonymization. This index, computed for each individual, lies between -1 and 1. The stronger this index, the more strongly the individual is connected to its own cluster and the farther it is from the other clusters. It should be greater than 0 if an individual is well clustered. Obviously, when k grows (data protection increases), the index decreases (data utility decreases). We see three modes in the density: the upper one corresponds to second homes, the middle one to professionals and the last one to households.

The Davies-Bouldin index represents the average similarity between each cluster and its most similar one, averaged over all the clusters. The lower this index, the better the clustering. In Figure 4, we give the ratio between the Davies-Bouldin index of each k-anonymization and that of 28-anonymization. Figure 4 illustrates the data utility loss caused by increasing data protection. There is a big gap in data utility between 4-anonymization and 8-anonymization; however, 4-anonymization gives insufficient data protection.

Figure 4: Davies-Bouldin index ratio

In Figure 5, we compare MDAV-generic applied to the relative contributions of the wavelet decomposition with the two trivial aggregations: one done by mean level (before centralization) and the other by standard deviation level. In relative Davies-Bouldin terms, we give the ratio between the Davies-Bouldin index of each k-anonymization by mean and by standard deviation and the Davies-Bouldin index when MDAV-generic is applied with the same k. All the indicators show that MDAV-generic applied to the relative contributions of the wavelet decomposition outperforms the two forms of trivial aggregation, and that second-order (variance-based) aggregation is more efficient than first-order (mean-based) aggregation.

Figure 5: Relative Davies-Bouldin index between the benchmarks and MDAV-generic

6 Conclusion

The process described here formalizes a global data-driven anonymization process, facilitating the separation between the specific and generic parts of anonymization and integrating business knowledge and Data Science algorithms. Through this process formalization, we optimize the trade-off between privacy protection and data utility.

To illustrate the process, we simulate a context close to (but different from) the provision of electrical smart meter data. We assume a third party tries to design new pricing, and we propose a microaggregation process for time series using the preprocessing methodology of Cugliari et al. (2016) and the clustering algorithm of Domingo-Ferrer and Torra (2005). Instead of punctual information given at each instant, it could be interesting to give a probabilistic view of the load.

Our example is based on static data. In many applications, it is interesting to receive data in an almost streaming way. One of the next steps is to develop incremental microaggregation and to measure the privacy loss and the data utility in this context. Another future work is the development of a big data framework allowing anonymization with noise addition, differential privacy and scalable microaggregation, separating the specific part, inherent to each type of data, business constraints and future data use, from the generic part. Lastly, here we work on global datalake anonymization, which assumes there is a level where raw data are stored before transformation. Local anonymization, where data are anonymized at the individual level, remains to be studied.

References

C.C. Aggarwal and P.S. Yu. 2008. A survey of randomization methods for privacy-preserving data mining. In Privacy-Preserving Data Mining - Models and Algorithms, pages 137-156.

G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. 2006. Achieving anonymity via clustering. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '06, ACM, New York, NY, USA, pages 153-162. https://doi.org/10.1145/1142351.1142374.

R. Agrawal and R. Srikant. 2000. Privacy preserving data mining. In ACM SIGMOD Conference.

M. Bergeat, N. Cuppens-Boulahia, F. Cuppens, N. Jess, F. Dupont, S. Oulmakhzoune, and G. De Peretti. 2014. A French Anonymization Experiment with Health Data. In PSD 2014: Privacy in Statistical Databases, Eivissa, Spain. https://hal.archives-ouvertes.fr/hal-01214624.

D.J. Berndt and J. Clifford. 1994. Using dynamic time warping to find patterns in time series. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, KDD Workshop, AAAI Press, pages 359-370.

G. Beylkin, R. Coifman, and V. Rokhlin. 1991. Fast wavelet transforms and numerical algorithms I. Comm. Pure Appl. Math. 44(2):141-183. https://doi.org/10.1002/cpa.3160440202.

K. Blazakis, S. Davarzani, G. Stavrakakis, and I. Pisica. 2016. Lessons learnt from mining meter data of residential consumers. Periodica Polytechnica Electrical Engineering and Computer Science 60(4):266-272. https://doi.org/10.3311/PPee.9993.

V. Boyer, D. El Baz, and M. Elkihel. 2012. Solving knapsack problems on GPU. Computers and Operations Research 39(1):42-47. Special Issue on Knapsack Problems and Applications. https://doi.org/10.1016/j.cor.2011.03.014.

E. Buchmann, K. Böhm, T. Burghardt, and S. Kessler. 2013. Re-identification of smart meter data. Personal and Ubiquitous Computing 17(4):653-662. https://doi.org/10.1007/s00779-012-0513-6.

J.W. Byun, A. Kamra, E. Bertino, and N. Li. 2007. Efficient k-Anonymization Using Clustering Techniques. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 188-200.

K. Chatzikokolakis, C. Palamidessi, and M. Stronati. 2013. A predictive differentially-private mechanism for mobility traces. CoRR abs/1311.4008. http://arxiv.org/abs/1311.4008.

S. Chester, B. M. Kapron, G. Srivastava, and S. Venkatesh. 2013. Complexity of social network anonymization. Social Netw. Analys. Mining 3(2):151-166.

J-X. Chin, T. Tinoco De Rubira, and G. Hug. 2016. Privacy-protecting energy management unit through model-distribution predictive control. CoRR abs/1612.05120. http://arxiv.org/abs/1612.05120.

J. Cugliari, Y. Goude, and J. M. Poggi. 2016. Disaggregated electricity forecasting using wavelet-based clustering of individual consumers. In 2016 IEEE International Energy Conference (ENERGYCON), pages 1-6. https://doi.org/10.1109/ENERGYCON.2016.7514087.

G. Danezis. 2013. Privacy technology options for protecting and processing utility reading.

J. Domingo-Ferrer and V. Torra. 2005. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery 11(2):195-212. https://doi.org/10.1007/s10618-005-0007-5.

F. Dufaux and T. Ebrahimi. 2006. Scrambling for video surveillance with privacy. In 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), pages 160-160. https://doi.org/10.1109/CVPRW.2006.184.

C. Dwork. 2006. Differential Privacy. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 1-12.

C. Dwork and A. Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3-4):211-407. https://doi.org/10.1561/0400000042.

K. Ezhilarasi and M. Hariharan. 2015. Protecting sensitive information in bank transaction with data anonymization. International Journal For Technological Research In Engineering 2(11).

S. Fan and R.J. Hyndman. 2012. Short-term load forecasting based on a semi-parametric additive model. IEEE Transactions on Power Systems 27(1):134-141. https://doi.org/10.1109/TPWRS.2011.2162082.

P. Gaillard, Y. Goude, and R. Nedellec. 2016. Additive models and robust aggregation for GEFCom2014 probabilistic electric load and electricity price forecasting. International Journal of Forecasting 32(3):1038-1050.

S. Gambs, M-O. Killijian, and M. Nunez del Prado Cortez. 2014. De-anonymization attack on geolocated data. Journal of Computer and System Sciences 80(8):1597-1614. Special Issue on Theory and Applications in Parallel and Distributed Computing Systems.

B. Gedik and L. Liu. 2008. Protecting location privacy with personalized k-anonymity: Architecture and algorithms. IEEE Transactions on Mobile Computing 7(1):1-18. https://doi.org/10.1109/TMC.2007.1062.

S. Hansell. 2006. AOL removes search data on vast group of web users. New York Times.

T. J. Hastie and R. J. Tibshirani. 1990. Generalized additive models. London: Chapman & Hall.

S-K. Hong, K. Gurjar, H-S. Kim, and Y.S. Moon. 2013. A survey on privacy preserving time-series data mining. 3rd International Conference on Intelligent Computational Systems (ICICS'2013).

T. Hong, P. Pinson, and S. Fan. 2014. Global energy forecasting competition 2012. International Journal of Forecasting 30(2):357-363.

H. Kellerer, U. Pferschy, and D. Pisinger. 2004. Introduction to NP-Completeness of knapsack problems. Springer.

W. Labeeuw and G. Deconinck. 2013. Residential electrical load model based on mixture model clustering and Markov models. IEEE Transactions on Industrial Informatics 9(3):1561-1569. https://doi.org/10.1109/TII.2013.2240309.

N. Li, T. Li, and S. Venkatasubramanian. 2007. t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering, pages 106-115. https://doi.org/10.1109/ICDE.2007.367856.

Y. Li, S. Zhu, L. Wang, and S. Jajodia. 2002. A Privacy-Enhanced Microaggregation Method. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 148-159.

T. Warren Liao. 2005. Clustering of time series data, a survey. Pattern Recognition 38:1857-1874.

J-L. Lin and M-C. Wei. 2008. An efficient clustering method for k-anonymization. In Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society, PAIS 2008.

K. Liu, C. Giannella, and K. Kargupta. 2008. A survey of attack techniques on privacy-preserving data perturbation methods.

G. Loukides and J-H. Shao. 2008. An efficient clustering algorithm for k-anonymisation. Journal of Computer Science and Technology 23(2):188-202. https://doi.org/10.1007/s11390-008-9121-3.

C. Y.T. Ma and D. K.Y. Yau. 2015. On information-theoretic measures for quantifying privacy protection of time-series data. In Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, ASIA CCS '15, ACM, New York, NY, USA, pages 427-438. https://doi.org/10.1145/2714576.2714577.

A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. 2006. l-diversity: Privacy beyond k-anonymity. In ICDE.

F. McLoughlin, A. Duffy, and M. Conlon. 2010. The generation of domestic electricity load profiles through Markov chain modelling. Euro-Asian Journal of Sustainable Energy Development Policy 3.

D.H.O. McQueen, P. R. Hyland, and S. J. Watson. 2004. Monte Carlo simulation of residential electricity demand for forecasting maximum demand on distribution networks. IEEE Transactions on Power Systems 19(3):1685-1689.

F. McSherry and I. Mironov. 2009. Differentially private recommender systems: Building privacy into the Netflix Prize contenders. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '09, ACM, New York, NY, USA, pages 627-636. https://doi.org/10.1145/1557019.1557090.

M. Muratori, M. C. Roberts, R. Sioshansi, V. Marano, and G. Rizzoni. 2013. A highly resolved modeling technique to simulate residential power demand. Applied Energy 107:465-473.

A. Narayanan and V. Shmatikov. 2008. Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (SP 2008), pages 111-125. https://doi.org/10.1109/SP.2008.33.

R. Nedellec, J. Cugliari, and Y. Goude. 2014. GEFCom2012: Electric load forecasting and backcasting with semi-parametric models. International Journal of Forecasting 30(2):375-381.

J. Nin and V. Torra. 2009. Towards the evaluation of time series protection methods. Information Sciences 179(11):1663-1677. Including Special Issue on Chance Discovery.

J.V. Paatero and P.D. Lund. 2005. A model for generating household electricity load profiles. International Journal of Energy Research 30.

A. Pierrot and Y. Goude. 2011. Short-term electricity load forecasting with generalized additive models. Proceedings of ISAP power, pages 593-600.

P. Pompey, A. Bondu, Y. Goude, and M. Sinn. 2015. Massive-Scale Simulation of Electrical Load in Smart Grids Using Generalized Additive Models. Springer International Publishing, Cham, pages 193-212.

S. Rani and G. Sikka. 2012. Recent techniques of clustering of time series data: A survey. International Journal of Computer Applications.

E. Shi, T.-H. H. Chan, E. G. Rieffel, R. Chow, and D. Song. 2011. Privacy-preserving aggregation of time-series data. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2011, San Diego, California, USA.

L. Shou, X. Shang, K. Chen, G. Chen, and C. Zhang. 2013. Supporting pattern-preserving anonymization for time-series data. IEEE Transactions on Knowledge and Data Engineering 25(4):877-892. https://doi.org/10.1109/TKDE.2011.249.

A. Solanas and A. Ballesté. 2006. V-MDAV: Variable group size multivariate microaggregation.

B. Suri, U. D. Bordoloi, and P. Eles. 2012. A scalable GPU-based approach to accelerate the multiple-choice knapsack problem. In 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1126-1129. https://doi.org/10.1109/DATE.2012.6176665.

L. Sweeney. 2002. k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(05):557-570. https://doi.org/10.1142/S0218488502001648.

V. Torra. 2017. Data Privacy: Foundations, New Developments and the Big Data Challenge (Studies in Big Data). Springer.

G. Tóth, Z. Hornák, et al. 2004. Measuring anonymity revisited.

V. Tudor, M. Almgren, and M. Papatriantafilou. 2013. Analysis of the impact of data granularity on privacy for the smart grid. In WPES '13: Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society.

S. Venkatasubramanian. 2008. Measures of Anonymity. Springer US, Boston, MA, pages 81-103.

X. Xu and M. Numao. 2015. An efficient generalized clustering method for achieving k-anonymization. In 2015 Third International Symposium on Computing and Networking (CANDAR), pages 499-502. https://doi.org/10.1109/CANDAR.2015.61.

G. Zhang, X. Liu, and Y. Yang. 2015. Time-series pattern based effective noise generation for privacy protection on cloud. IEEE Transactions on Computers 64(5):1456-1469. https://doi.org/10.1109/TC.2014.2298013.

Q. Zhang, N. Koudas, D. Srivastava, and T. Yu. 2007. Aggregate query answering on anonymized tables. In 2007 IEEE 23rd International Conference on Data Engineering, pages 116-125. https://doi.org/10.1109/ICDE.2007.367857.

B. Zhou, J. Pei, and W. Luk. 2008. A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explor. Newsl. 10:12-22. https://doi.org/10.1145/1540276.1540279.