=Paper=
{{Paper
|id=Vol-2029/paper6
|storemode=property
|title=Data-driven Anonymization Process Applied to Time Series
|pdfUrl=https://ceur-ws.org/Vol-2029/paper6.pdf
|volume=Vol-2029
|authors=Vincent Thouvenot,Damien Nogues,Catherine Gouttas
|dblpUrl=https://dblp.org/rec/conf/simbig/ThouvenotNG17
}}
==Data-driven Anonymization Process Applied to Time Series==
Data-driven anonymization process applied to time series

Vincent Thouvenot (Thales Communications & Security), Damien Nogues (Thales Digital Factory), Catherine Gouttas (Thales Digital Factory) — firstname.lastname@thalesgroup.com

Abstract

Digital transformation and Big Data allow the use of highly valuable data. However, these data can be individual or sensitive, and represent an obvious threat to privacy. Anonymization, which achieves a trade-off between data protection and data utility, can be used in this context. There is no global anonymization technique which fits all applications. Here, we describe a data-driven anonymization process and apply it to simulated electrical load data.

1 Introduction

This paper is mainly written as a dissemination exercise about anonymization and good practice around it. Indeed, while anonymization is quite well known from an academic point of view, this is not yet the case among French and European industrial actors. However, privacy protection is a fundamental and growing task for them. Digital transformation brings the creation of global data lakes and allows the development of new valuable business. Moreover, some governments push for an increasing release of Open Data, which should promote the opening of digital knowledge and ensure an open and valuable digital ecosystem. At the same time, the European Union sets up rules to protect citizens, which establish that citizens have a right to protection of their individual data. The data have to be fairly processed for a specific purpose, and with the individuals' agreement. This is an important point, because keeping a maximum of personal data for future, unspecified mining tasks that might become possible with future methods is inconsistent with this rule. People have the right to access and rectify their individual data. Under the 2016 regulation, applied from 2018, any company offering goods or services (including cloud services) to EU citizens may be subject to regulation.

Besides this legal context, Big Data technologies enable the treatment of massive, dynamic and unstructured data, and facilitate data crossing, weakening privacy protection. The data concerned can be personal (name, address, etc.), and allow one to (almost) directly identify an individual. Sensitive data, like religious or political beliefs, pose a risk for individual privacy too. Smartphones, smart objects, loyalty cards, online purchases, social media: these are large sources of individual and sensitive data, which lead to an obvious risk for privacy. Protecting privacy means preventing the isolation of an individual, the correlation of information from different datasets for one individual, and the possibility of obtaining information on an individual through exogenous variables. Despite appearances, the trade-off behind anonymization is not an easy task. A dataset can contain identifiers; just deleting or encrypting them is generally not sufficient (see e.g. Hansell (2006); Narayanan and Shmatikov (2008)). Other information can be quasi-identifiers, which allow re-identification when they are crossed (e.g., in a health dataset, sex and age). There are sensitive data (e.g., disease in a health context). Finally, there are the remaining attributes. In addition, the trade-off depends on three parameters. First, it depends on data typology: bank transaction data (see e.g. Ezhilarasi and Hariharan (2015)) will not require the same treatment as unstructured data like social media data, for which one must hide metadata, identifying content and the relational graph (see e.g. Zhou et al. (2008); Chester et al. (2013)). Second, the trade-off depends on the future use of the data. Third, anonymization strength is time dependent: new datasets and new re-identification methods can be used to attack privacy, so an anonymization methodology that is admissible today can become inadmissible a few months or years later.
We illustrate an anonymization workflow on (simulated) electrical smart meter data, a symbolic example of sensitive and individual data whose exploitation possibilities are new and illustrate new anonymization needs. In France, smart meters are currently being deployed, with deployment planned to end in 2020, and will allow the gathering of infra-daily household electrical load. These data are by nature personal and sensitive. Infra-daily individual loads make it possible to detect if and when someone is at home, which can for instance increase the risk of burglary. Individual habits can be detected (see e.g. Blazakis et al. (2016)), and from them sensitive information like religion (e.g., behavior during Ramadan) or some illegal activities (e.g., the very particular load pattern of a cannabis plantation) can be derived. Providing these data is therefore a complicated challenge. Indeed, household electrical load will be available at different levels of time granularity (see e.g. Tudor et al. (2013); Buchmann et al. (2013)). To simplify the context: in France, infra-daily load will be available to the individuals themselves, who can choose to temporarily give a third party access to these data; the electricity distributor will gather daily data (even infra-daily in particular contexts); and the provider will access monthly data. Although infra-daily data are noisy, access to identified daily (or even monthly) data easily allows re-identification of infra-daily consumption (see e.g. Tudor et al. (2013)). In this context, just hiding direct identifiers is inadequate. On the other hand, these data have strong potential value and many actors are interested in them: distributors, to manage their network and plan maintenance; local communities, to improve their urban policy; providers, to propose new, more adaptive pricing; or start-ups, to offer individuals services to optimize their consumption. Depending on the future use of the data, it is not necessary to keep the same information: for instance, designing a new adaptive pricing and dimensioning the network from household electrical load require different information. Finally, the currently acceptable methods may be questioned when data from gas smart meters, which are also being deployed, become available, since gas meter data depend on phenomena similar to those driving electrical meter data.

The article is divided into four further sections. In Section 2, we give a short survey of anonymization. In Section 3, we describe our global anonymization process, from data gathering to dataset publication. Section 4 presents a simulator consistent with the French electrical smart meter anonymization task; we use simulated data because we do not have access to real data. In Section 5, we apply the process of Section 3 to the simulation of Section 4.

2 A short survey about anonymization

For a dissemination task, it seems important to briefly discuss the difference between anonymization and encryption, because confusing the two is a common mistake. Data encryption consists in transforming data with a mathematical algorithm; the process can be reversed with the right algorithm and the encryption key. It can be used to transfer data between two entities. Encrypted data are still individual, and so still personal data if the original data are personal. Although encryption can usefully be one component of de-identification, it is neither necessary nor sufficient for anonymization.

Pseudonymisation, which consists of hiding identifying metadata, can be inefficient (see Danezis (2013)). De-identification techniques fall into two categories: transforming data so that the published individuals are no longer real, and aggregating and generalizing individuals, so that the data provided represent a set of individuals. These techniques can be used separately or combined.

Permutation techniques and puzzling approaches deconstruct, transform and/or change the data design (see e.g. Agrawal and Srikant (2000), Zhang et al. (2007)). Noise addition techniques are popular: for instance, Dufaux and Ebrahimi (2006) randomly transform a video representation by inverting some signs of decomposition coefficients, and apply this to privacy protection in video surveillance. Aggarwal and Yu (2008) propose a survey of randomization methods. Classical randomization has some advantages: the noise is independent of the data and does not require the entire dataset, so it can be applied during data collection and on distributed systems. Liu et al. (2008) survey attacks on privacy obtained by perturbation methods. When the anonymizer adds additive noise, attackers can use methods like filtering (spectral, singular value decomposition, principal component analysis, etc.), maximum a posteriori estimation, or distribution analysis. Under good conditions, multiplicative perturbations have good properties, for instance preserving Euclidean distance.
An attacker who knows a sample of inputs and outputs, or who has independent samples from the original distribution, can reverse such anonymization. Differential privacy (see Dwork (2006), Dwork and Roth (2014)) is a very popular form of noise addition: random noise is added in such a way that the mechanism produces the same output with almost the same probability on any two adjacent inputs. A basic process to achieve differential privacy is to sample the dataset without replacement and to add fictive individuals. Differential privacy makes it possible to reason about privacy loss and to bound the risk. Chatzikokolakis et al. (2013) apply differential privacy to mobility traces and try to develop a mechanism more efficient than simply adding independent noise. McSherry and Mironov (2009) propose a differential privacy framework to produce recommendations from collective user behavior in the Netflix Prize dataset.
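The calibrated-noise view of differential privacy sketched above can be illustrated with the standard Laplace mechanism. This is a generic textbook illustration, not the specific mechanism of any work cited here; the load values, their assumed bound, and all variable names are hypothetical.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release a numeric query result with Laplace noise of scale sensitivity / epsilon."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
loads = np.array([1.2, 0.8, 3.1, 2.4])   # hypothetical daily loads, assumed bounded in [0, 5]
# Changing one bounded record moves the mean by at most (upper bound) / n
sensitivity = 5.0 / len(loads)
private_mean = laplace_mechanism(loads.mean(), sensitivity, epsilon=1.0, rng=rng)
```

A smaller epsilon gives stronger protection (larger noise) and lower utility, which is exactly the protection/utility trade-off discussed throughout this survey.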
Privacy can also be protected by creating sets of individuals. K-anonymization (see Sweeney (2002)) consists of generalizing quasi-identifying information so that at least k individuals share the same values. K-anonymity can be broken when all the individuals of (at least) one class have the same sensitive value. L-diversity (see Machanavajjhala et al. (2006)) forces each class to contain at least l different values of the sensitive attribute. In t-closeness (see Li et al. (2007)), the sensitive attribute in each class has to respect its distribution in the total population. Optimal generalization and suppression is NP-hard. Moreover, as noted in Domingo-Ferrer and Torra (2005), generalization and suppression can be ill-adapted to ordinal categorical and continuous attributes; Domingo-Ferrer and Torra (2005) propose microaggregation for this task. In microaggregation, data are partitioned into several clusters of size at least k containing similar records; an aggregation operator (e.g., mean or median for continuous variables) is then applied to compute the centroid of each cluster. Besides the clustering method, microaggregation has two important parameters: the minimum size of each cluster, which adjusts the level of privacy protection, and the function used to compute the aggregate value, which is linked with future data utility and the protection level. The aggregation function can be a mean, sum, median, quantile, partial autocorrelation function, time slicing profile, density, etc. Domingo-Ferrer and Torra (2005) partition data with the Maximum Distance to Average Vector (MDAV) algorithm. Aggarwal et al. (2006) propose a microaggregation in which some atypical individuals may remain unclustered and thus unpublished. Byun et al. (2007), Lin and Wei (2008), Li et al. (2002), Xu and Numao (2015) and Loukides and Shao (2008) propose greedy heuristics to achieve k-anonymity through clustering with non-NP-hard complexity. Bergeat et al. (2014) compare two software packages for k-anonymization on a French health dataset of more than 20 million records. Gedik and Liu (2008) use k-anonymity to protect mobile location privacy.
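As a minimal illustration of microaggregation on a single continuous attribute (a toy sketch with a fixed cluster size and a mean aggregator, not the MDAV algorithm; the attribute and its values are hypothetical): records are sorted, grouped into clusters of at least k neighbors, and replaced by their cluster mean.

```python
import numpy as np

def microaggregate(values, k):
    """Replace each value by the mean of a cluster of at least k similar records."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    out = np.empty_like(values)
    n, start = len(values), 0
    while start < n:
        # The last cluster absorbs the remainder so no cluster has fewer than k records
        end = n if n - start < 2 * k else start + k
        idx = order[start:end]
        out[idx] = values[idx].mean()
        start = end
    return out

ages = np.array([21.0, 23.0, 22.0, 35.0, 37.0, 36.0, 80.0])
print(microaggregate(ages, k=3))   # → [22. 22. 22. 47. 47. 47. 47.]
```

Note how the outlier (80) is hidden inside the mean of its cluster: this is the utility cost paid for making each published value shared by at least k individuals.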
When working on time series, the previous techniques can be used. For instance, Shou et al. (2013) propose what they call (k, P)-anonymity to preserve patterns in time series. For electrical household load protection, Chin et al. (2016) solve an optimization problem with two components: the first concerns the information leakage rate of consumer load given grid load, and the second the cost of errors. Shi et al. (2011) propose a form of differential privacy to protect time series. Zhang et al. (2015) propose noise generation to protect cloud data. Hong et al. (2013) provide a survey of time series privacy protection.

After de-identification, it is important to measure the degree of de-identification. Venkatasubramanian (2008) surveys the metrics proposed to measure privacy and privacy loss, dividing them into three categories: statistical methods, which take into account the variance of the perturbed variable; probabilistic methods, based on information theory and Bayesian analysis; and computational methods, which start from the idea of a resource-bounded attacker and measure privacy loss as a function of the information available to such an attacker. Tóth et al. (2004) work on message transmission and analyze two entropy-based anonymity measures, separately measuring global anonymity, which quantifies the effort necessary to fully compromise the dataset (what we later call the journalist scenario), and local anonymity, which quantifies the probability that the transmissions of one user are compromised (what we later call the prosecutor scenario). Gambs et al. (2014) work on attacks on geolocated data. Nin and Torra (2009) propose a framework of protection and re-identification for time series. Ma and Yau (2015) propose information measures for quantifying the privacy protection of time series data.

The interested reader can find in the first-rate book of Torra (2017) the stakes of data privacy and the associated techniques.

3 Anonymization process

Each anonymization task has a specific part and a generic part, and so is unique. In this section we describe our global data-driven anonymization process (see Figure 1), which separates these two parts.

Figure 1: Global anonymization process

After data gathering, we have to tag the data according to their categories: identifiers, quasi-identifiers, sensitive data, and remaining attributes. The next step consists in pseudonymisation: identifiers have to be hidden (deletion, encryption, etc.). Obviously, this is generally not enough to ensure privacy protection. We then have to establish one metric to measure data protection and another one for data utility. As anonymization cannot be total and perfect, we have to choose a threshold of acceptable re-identification risk. Despite the de-identification process, a residual risk remains, which has to be compared with the benefits.

It is necessary to build a re-identification framework, which is driven by the context and has to be realistic. The worst-case situation, where the attacker is almost all-knowing, is probably not realistic and degrades the data utility / privacy protection trade-off. The re-identification framework depends on many parameters. The attacker type must be determined: the attacker's resources depend on who he is (e.g., a member of the organization which anonymizes the data, who has access to plenty of data with which to attack the anonymized dataset; a member of a related organization with access to similar data which can be crossed; a machine learning expert who can deploy efficient re-identification models; a neighbor with access to contextual data; etc.). Many motivations are possible (retrieving information about an individual for aggressive commercial offers or burglary; harming the organization which manages the anonymization, for instance to take over data governance; demonstrating re-identification capability; etc.). The framework also depends on the attack category: crossing anonymized data with a non-anonymized dataset is a classical way of attempting re-identification, but information inside the anonymized data can be used too (e.g., anonymized Internet queries can be identified by crossing the location and interests of an individual). It further depends on the chosen meaning of re-identification, which can be identifying an individual or identifying sensitive information about an individual. For instance, if household loads have been anonymized by pseudonymization followed by noise addition, a noisy curve can be matched to a customer; with microaggregation, it may be possible to find a customer's cluster and deduce probable sensitive information and behavior. The framework also depends on the technique used to measure re-identification risk. To be valuable, attackers have to be confident in their re-identification. When we compute the risk of individual identification, true positives are not the only key performance indicator: here, a true positive means that the right individual in the anonymized data has been matched to the individual in the non-anonymized data, but this individual may also match many other individuals in the anonymized data, which decreases the re-identification risk, and identification errors decrease the attacker's confidence too. The risk measure has to combine all this information. Lastly, the framework depends on the re-identification scenario. Many are possible: targeting all or almost all individuals known to be in the dataset (journalist scenario), targeting one or a few individuals (prosecutor scenario), trying to distinguish studies run with and without one individual, etc.

Then, as anonymization is a trade-off between utility and protection, we have to choose the minimum utility of the data. There are two cases. Data can be provided to a third party to answer a specific need: only the necessary information, and no more, has to be provided, and after anonymization the intended study has to remain possible. For instance, if an electricity provider wants to build a new daily pricing, it only needs precise daily profiles. Alternatively, data can be released without a specific need, for instance in Open Data. We also need to compute a metric of privacy cost: for instance, when we add noise to a time series, we can compute a signal-to-noise ratio.
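The signal-to-noise ratio mentioned above can serve as a simple privacy-cost metric when noise addition is the chosen de-identification technique. A minimal sketch, in which the toy load curve and the noise scale are assumptions:

```python
import numpy as np

def snr_db(original, anonymized):
    """Signal-to-noise ratio in decibels between a series and its noisy release."""
    original = np.asarray(original, dtype=float)
    noise = np.asarray(anonymized, dtype=float) - original
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(1)
load = 2.0 + np.sin(np.linspace(0.0, 8.0 * np.pi, 480))   # toy infra-daily load curve
noisy = load + rng.normal(scale=0.3, size=load.size)
cost = snr_db(load, noisy)   # lower SNR = stronger perturbation = more protection, less utility
```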
Lastly, as anonymization cannot be perfect, we have to choose the limit of acceptable re-identification risk, which determines the achievement point of the trade-off. It depends on the level of individuality and sensitivity, and the choice is driven by the exportation model used at the end. We can choose a publish-and-forget model, where data are typically released as Open Data; then it is (almost) impossible to stop the data sharing. Another model is the data use agreement model, where an agreement specifies what the third party can do. Finally, we can use a closed model, where data stay in a closed environment and the third party only has access to the results of its requests.

To avoid scalability problems during this non-industrial step, we can perform an optional sub-sampling. De-identification methods, which depend on data typology and future data use, are then applied, and we measure the re-identification risk. When the risk is lower than the authorized maximum, we transform the entire dataset. We then measure the utility of the anonymized data: if the minimum utility is not met, we restart all the steps of this paragraph; otherwise we export the data under the chosen model.

4 Application context: simulated electrical load

Electrical household load simulation has a rich literature. Many authors use variants of Markov chains (e.g. Labeeuw and Deconinck (2013), Muratori et al. (2013), McLoughlin et al. (2010)). McQueen et al. (2004) use a Monte Carlo simulation model of load demand taking into account the statistical spread of demand in each half hour, using data sampled from a gamma distribution. Paatero and Lund (2005) use a bottom-up load model for generating household load profiles. Pompey et al. (2015) train Additive Models (see Hastie and Tibshirani (1990)) to achieve massive-scale simulation of electrical load in smart grids. Additive Models have already proven their efficiency for modelling and forecasting electricity load at the aggregate level (see Pierrot and Goude (2011) in France, Fan and Hyndman (2012) in Australia) as well as at the local level (see e.g. Nedellec et al. (2014)).

We simulate three types of curves. The first, "second house load", is almost constant with added random noise. The second, "little professional", is represented by piecewise-constant curves whose load is almost null at night and during the weekend and almost constant during the day, where the jump intensities and activity periods are randomly chosen. The third, "household load", has calendar and thermal components. To simulate the last, we apply an idea similar to Pompey et al. (2015). We train simulation models on the GEFCom 2012 dataset (see Hong et al. (2014)), which comes from a Kaggle challenge and contains the hourly load demand of 20 local areas in the USA and the temperature of 11 weather stations. We train many different Additive Models on this dataset, whose feature sets contain calendar parameters (type of day, number of days since the beginning of the year, etc.) and a random number of raw and smoothed temperatures, trained on different periods. Then, we compute two months of their forecasts with random translations of the features (e.g., one additive model design is translated by one hour and its temperatures by two degrees Fahrenheit) to introduce variety in the simulated load. We also train some quantile additive models (see Gaillard et al. (2016)) to simulate extreme behaviors. Our simulated data are smoother than real individual load, but this is not a real problem here. Indeed, for our application, what matters is that the simulation respects some assumptions: we assume there are different levels of load and three families of curves, and that consumptions are almost unique even at low frequency, as shown in Tudor et al. (2013). To enforce a uniqueness assumption, we arbitrarily impose that the daily aggregates over one week of two "household load" curves, rounded to the thousands, cannot be equal. We choose the daily aggregation time scale because we face an attacker using daily data, and by considering one week we integrate the weekly cycle, which is important for electrical data.
We assume that a provider needs to refine a pricing for an area of its customers. It needs some infra-daily information (peak of demand, profile, etc.). In our scenario, a potential attacker has access to identified daily data. We assume the exportation model is a publish-and-forget model, which explains the possibility for an attacker to have daily data access.

Table 1: De-identification techniques for time series

Method — Parameters — Utility preserved
Random noise — signal-to-noise ratio, noise family — trend, anomaly detection, total scope
Permutation — type and quantity of permutations — total scope
Transformation — type of transformation (standardization, etc.) — total scope, trend, anomaly detection
Time slicing — time scale — total scope, anomaly detection
Differential privacy — sampling probability and fictive curves — total scope, trend, anomaly detection
Microaggregation — clustering, cluster size, aggregation function — total scope, cluster trend, profile
Scope aggregation — scope, aggregation function — total scope, scope trend, profile

5 Application of the anonymization process

After data gathering, we have to tag the data. Here we assume the data have two attributes: an individual identifier and the simulated time series. First, we pseudonymize the individual identifier by substituting it with random numbers drawn without replacement to ensure uniqueness. We have only one quasi-identifier, which is also the sensitive data: the simulated time series. In our simulation framework, the highest-level attacker could be a competing organization which holds identified data at a fine granularity (daily data), with a good level of optimization and computing power, whose objective is to find information about customers' behaviors in order to make aggressive offers. The data are provided to answer a specific need of a third party: having data with which to establish a new pricing. For this purpose, microaggregation is relevant, and we protect privacy through the minimum size of the clusters. The components of the trade-off are the choice of the clustering methodology, the choice of the minimum cluster size, and the choice of the aggregation operator. With microaggregation, an attacker could, at worst, identify probable customer behaviors: the smaller the minimum cluster size, or the more a single load contributes to the building of a cluster, the more certain the attacker can be of its deductions.
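The pseudonymisation step described above (substituting each identifier with a unique random number drawn without replacement) can be sketched in a few lines; the meter names below are hypothetical.

```python
import numpy as np

def pseudonymize(identifiers, rng):
    """Substitute each identifier with a unique random code (drawn without replacement)."""
    codes = rng.permutation(len(identifiers))   # a permutation is unique by construction
    return {ident: int(code) for ident, code in zip(identifiers, codes)}

rng = np.random.default_rng(42)
mapping = pseudonymize(["meter_A", "meter_B", "meter_C"], rng)   # hypothetical identifiers
```

As the survey stresses, the mapping alone does not protect privacy: the attached time series remains a quasi-identifier, which is why the rest of this section applies microaggregation on top of it.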
5.1 Statistical and Machine Learning Setup

Table 1 presents some techniques which can be used to protect time series. Permutation, which consists of exchanging data from one curve to another, is a form of noise introduction and can create unrealistic curves. Transformation can be smoothing, standardization, etc., with the objective of erasing some individualities. Time slicing breaks individual trends. Differential privacy can be completed by post-treatment to improve protection. The re-identification risk of the first five techniques of Table 1 concerns customer identification and can be studied with classical time series identification and classification techniques (deep learning, ensemble methods, k-nearest neighbors, etc.). Moreover, some methods are reversible: denoising methods can be applied against the first one; for a deterministic transformation, the perturbation can be inverted with knowledge of the transformation parameters; and attempting to rebuild the chronological order can reverse time slicing. Sensitive information can be recovered from the last two methods when the attacker can identify the cluster of a customer. Here, we work to provide data to a provider who wants to build a new pricing. It needs precise profiles, and we choose to use microaggregation. As a provider will not propose one tariff per individual but several tariffs for large groups, we do not need precise individual data.

5.2 Chosen methodology

We use time series clustering (see Liao (2005), Rani and Sikka (2012)). As explained in these surveys, there are many ways to cluster time series. First, a clustering algorithm can be applied directly on the raw data; however, this can be inefficient because the data are noisy. The second way consists of extracting features from the time series and applying a clustering algorithm on these features. The third is model-based approaches, where time series are modelled before being clustered. Another important choice is the distance measure: Euclidean distance, Kullback-Leibler divergence, Dynamic Time Warping (see Berndt and Clifford (1994)), etc.
Wavelet decomposition (see Beylkin et al. (1991)) has already been used in time series studies because it allows working on the different frequency levels of the signal and denoising it. We apply the pre-processing of Cugliari et al. (2016), who successfully apply disaggregated load clustering to forecast load demand. After projecting the time series, the authors compute the relative contribution of each energy level. Assume that (φ, ψ_{j,k}) is a Haar basis. A continuous signal can be approximated in a truncated Haar basis by

f̂(t) = c₀ φ(t) + Σ_{j=0}^{J−1} Σ_{k=0}^{2^j − 1} d_{j,k} ψ_{j,k}(t),

where c₀ and the d_{j,k} are the decomposition coefficients obtained with the Fast Wavelet Transform algorithm. Then, the relative contribution of level j is defined by

rel_j = logit( ‖d_j‖² / Σ_{k=0}^{J−1} ‖d_k‖² ),  where logit(p) = ln( p / (1 − p) ).

Among these features, we do not consider c₀, which corresponds to the mean level of each load: we focus on the shape of the profiles, which is what matters for establishing a new pricing. As in Cugliari et al. (2016), we use the relative contributions after a Haar decomposition.
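The relative-contribution features can be computed with a plain fast Haar transform. The sketch below is a minimal version assuming a signal length that is a power of two and strictly positive energy at every detail level; normalization conventions may differ from the cited implementation.

```python
import numpy as np

def haar_details(signal):
    """Fast Haar transform: returns the detail coefficients d_j of each level (finest first)."""
    a = np.asarray(signal, dtype=float)
    details = []
    while a.size > 1:
        pairs = a.reshape(-1, 2)
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0))   # detail coefficients
        a = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)               # approximation
    return details

def relative_contributions(signal):
    """rel_j = logit(||d_j||^2 / sum_k ||d_k||^2); the mean level c0 is deliberately dropped."""
    energies = np.array([np.sum(d ** 2) for d in haar_details(signal)])
    p = energies / energies.sum()
    return np.log(p / (1.0 - p))   # logit of each level's energy share

sig = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 3.0])   # toy load, length 2^3
feats = relative_contributions(sig)
```

Dropping c₀ makes the features invariant to the overall consumption level, so the clustering below compares profile shapes rather than load magnitudes.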
Instead of K-Means, we use Maximum Distance to Average Vector generic (MDAV-generic), presented by Domingo-Ferrer and Torra (2005), because we want clusters of minimum size k regardless of the number of clusters, and not k clusters regardless of their size. Instead of MDAV-generic, we could have used less rigid algorithms like V-MDAV (see Solanas and Ballesté (2006)), which does not force each cluster (except the last ones) to be of fixed size. With MDAV-generic, we know the number of clusters in advance, which can be of interest when there is a data delivery requirements specification with the third party. We benchmark the technique against a mean-based aggregation and a variance-based aggregation: by mean (resp. variance) based aggregation, we mean running k-nearest-neighbors on the mean (resp. variance) load of each individual, initialized at the smallest mean (resp. variance). These benchmarks have the advantage of being easy to implement and fast to compute.

Algorithm 1 MDAV-generic
Assume D is the relative contribution dataset and k an integer.
1. While card(D) ≥ 3k:
   (a) Compute the attribute-wise average of all records in D.
   (b) Find the record d1 most distant from this average (in Euclidean norm).
   (c) Find the record d2 most distant from d1.
   (d) Use d1 and d2 as the centers of two clusters of size k.
   (e) Delete the records of these two clusters and go back to the beginning.
2. If 2k ≤ card(D) ≤ 3k − 1:
   (a) Compute the attribute-wise average of the remaining records in D.
   (b) Find the record d1 most distant from this average (Euclidean norm).
   (c) Use d1 as the center of a cluster of size k.
   (d) Form another cluster with the remaining records in D.
3. If card(D) < 2k, form a cluster with the remaining records.
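Algorithm 1 can be implemented directly; the sketch below follows the steps above on a small numeric dataset (Euclidean norm, clusters of size at least k). One simplification is assumed: if d2 happens to be absorbed by the cluster formed around d1, it still serves as the second center.

```python
import numpy as np

def mdav_generic(X, k):
    """MDAV-generic (Domingo-Ferrer and Torra, 2005): partition records into clusters of size >= k."""
    X = np.asarray(X, dtype=float)
    remaining = np.arange(len(X))
    clusters = []

    def most_distant(idx, point):
        return idx[np.argmax(np.linalg.norm(X[idx] - point, axis=1))]

    def take_cluster(idx, center):
        order = np.argsort(np.linalg.norm(X[idx] - X[center], axis=1))
        return idx[order[:k]], idx[order[k:]]   # cluster of the k nearest records, and the rest

    while len(remaining) >= 3 * k:                        # step 1
        centroid = X[remaining].mean(axis=0)
        d1 = most_distant(remaining, centroid)            # step 1(b)
        d2 = most_distant(remaining, X[d1])               # step 1(c)
        cluster, remaining = take_cluster(remaining, d1)
        clusters.append(cluster)
        cluster, remaining = take_cluster(remaining, d2)  # d2 acts as a center even if absorbed above
        clusters.append(cluster)
    if len(remaining) >= 2 * k:                           # step 2
        centroid = X[remaining].mean(axis=0)
        d1 = most_distant(remaining, centroid)
        cluster, remaining = take_cluster(remaining, d1)
        clusters.append(cluster)
    if len(remaining) > 0:                                # step 3
        clusters.append(remaining)
    return clusters
```

Every cluster returned has at least k members, which is precisely the privacy guarantee the minimum cluster size provides.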
The clustering algorithm divides the individuals into subsets of predetermined minimum size. Then, we have to choose the aggregation technique. We can compute the median load of the individuals of each cluster at each time step; this choice minimizes the absolute loss in each cluster and hides some extreme values. However, there is a non-zero probability that one individual of a cluster is (almost) always the median, in which case publishing the median is equivalent to publishing the load of that individual. Alternatively, the mean can be computed at each instant for each cluster. However, the attacker has access, for each individual i, to the vector A_i of daily loads, and to the infra-daily loads (l_j)_j of each final aggregate (and thus to the daily loads (L_j) of each cluster). The attacker can then try to solve, for each cluster j,

argmin_{p_i ∈ {0,1}} ‖ L_j − (1 / Σ_i p_i) Σ_i p_i A_i ‖².

This adversary model is equivalent to a knapsack problem (see Kellerer et al. (2004)), which can be solved by many algorithms (e.g., dynamic programming or simulated annealing). Even though it is an NP-hard problem and the attacker needs an exact solution, many works show it is possible to treat the problem in a massively parallel way with GPU programming (see Boyer et al. (2012), Suri et al. (2012)). Adding even a small amount of noise breaks this knapsack formulation.

5.3 Results

We compare performance when the clustering algorithm is applied on 1,400 centralized simulated time series. Recall that the objective is to deliver profiles as homogeneous as possible to a third party which wants to propose a new pricing. That means the third party has to receive homogeneous clusters, and the data utility measure has to address this point. If the data were given for a forecasting task, we would measure utility differently, for instance by computing the MAPE (Mean Absolute Percentage Error) on a test subset (see e.g. Pierrot and Goude (2011)). Here, for a pricing task, there is no forecasting need; this illustrates the dependency between anonymization and the future use of the data. We measure data utility with indicators like the silhouette index and the Davies-Bouldin index, which both measure cluster homogeneity. We do not use indicators like the root mean square error, because there are different load levels. We run k-anonymization for k from 4 to 28 by steps of 4.
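The silhouette index used as a utility measure here can be computed with a few lines of NumPy (a minimal sketch assuming at least two clusters; scikit-learn's silhouette_score and davies_bouldin_score provide the same measures; the toy clusters below are assumptions):

```python
import numpy as np

def silhouette(X, labels):
    """Per-individual silhouette index in [-1, 1]; assumes at least two clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distance matrix
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, same].mean() if same.any() else 0.0            # mean intra-cluster distance
        b = min(D[i, labels == lab].mean()                      # nearest other cluster
                for lab in set(labels.tolist()) if lab != labels[i])
        scores[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return scores
```

Applied to the cluster assignments produced by microaggregation, positive per-individual scores indicate well-clustered records, matching the interpretation of Figure 3.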
In Figure 2, we plot the relative computation time, where the reference level is the computation time of 28-anonymization. The computation time decreases when k grows. Figure 3 gives an estimate of the silhouette index density for each k-anonymization. This index, computed for each individual, lies between -1 and 1. The stronger this index, the more strongly the individual is connected to its own cluster and the farther it is from the other clusters. It should be greater than 0 if an individual is well clustered. Obviously, when k grows (data protection increases), the index decreases (data utility decreases). We see three modes in the density: the upper one corresponds to second homes, the middle one to professionals and the last one to households.

The Davies-Bouldin index represents the average similarity between each cluster and its most similar one, averaged over all the clusters. The lower this index, the better the clustering. In Figure 4, we give the ratio between the Davies-Bouldin index of each k-anonymization and that of 28-anonymization. Figure 4 illustrates the data utility loss caused by increasing data protection. There is a big gap in data utility between 4-anonymization and 8-anonymization; however, 4-anonymization gives insufficient data protection.

Figure 4: Davies-Bouldin index ratio

In Figure 5, we compare MDAV-generic applied to the relative contributions of the wavelet decomposition with the two trivial aggregations: one done by mean level (before centralization) and the other by standard deviation level. In relative Davies-Bouldin terms, we give the ratio between the Davies-Bouldin index of each k-anonymization by mean and by standard deviation and the Davies-Bouldin index when MDAV-generic is applied with the same k. All the indicators show that MDAV-generic applied to the relative contributions of the wavelet decomposition outperforms the two forms of trivial aggregation, and that second-order (variance-based) aggregation is more efficient than first-order (mean-based) aggregation.

Figure 5: Relative Davies-Bouldin index between the benchmarks and MDAV-generic

6 Conclusion

The process described here formalizes a global data-driven anonymization process, facilitating the separation between the specific and generic parts of anonymization and integrating business knowledge and Data Science algorithms. Through this process formalization, we optimize the trade-off between privacy protection and data utility.

To illustrate the process, we simulate a context close to (but different from) the provision of electrical smart meter data. We assume a third party tries to design new pricing, and we propose a microaggregation process for time series using the preprocessing methodology of Cugliari et al. (2016) and the clustering algorithm of Domingo-Ferrer and Torra (2005). Instead of punctual information given at each instant, it could be interesting to give a probabilistic view of the load.

Our example is based on static data. In many applications, it is interesting to receive data in an almost streaming way. One of the next steps is to develop incremental microaggregation and to measure the privacy loss and the data utility in this context. Another future work is the development of a big data framework allowing anonymization with noise addition, differential privacy and scalable microaggregation, separating the specific part, inherent to each type of data, business constraints and future data use, from the generic part. Lastly, here we work on global datalake anonymization, which assumes there is a level where raw data are stored before transformation. Local anonymization, where data are anonymized at the individual level, remains to be studied.

References

C.C. Aggarwal and P.S. Yu. 2008. A survey of randomization methods for privacy-preserving data mining. In Privacy-Preserving Data Mining - Models and Algorithms, pages 137-156.

G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. 2006. Achieving anonymity via clustering. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '06, ACM, New York, NY, USA, pages 153-162. https://doi.org/10.1145/1142351.1142374.

R. Agrawal and R. Srikant. 2000. Privacy preserving data mining. In ACM SIGMOD Conference.

M. Bergeat, N. Cuppens-Boulahia, F. Cuppens, N. Jess, F. Dupont, S. Oulmakhzoune, and G. De Peretti. 2014. A French Anonymization Experiment with Health Data. In PSD 2014: Privacy in Statistical Databases, Eivissa, Spain. https://hal.archives-ouvertes.fr/hal-01214624.

D.J. Berndt and J. Clifford. 1994. Using dynamic time warping to find patterns in time series. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, KDD Workshop, AAAI Press, pages 359-370.

G. Beylkin, R. Coifman, and V. Rokhlin. 1991. Fast wavelet transforms and numerical algorithms I. Comm. Pure Appl. Math. 44(2):141-183. https://doi.org/10.1002/cpa.3160440202.

K. Blazakis, S. Davarzani, G. Stavrakakis, and I. Pisica. 2016. Lessons learnt from mining meter data of residential consumers. Periodica Polytechnica Electrical Engineering and Computer Science 60(4):266-272. https://doi.org/10.3311/PPee.9993.

V. Boyer, D. El Baz, and M. Elkihel. 2012. Solving knapsack problems on GPU. Computers and Operations Research 39(1):42-47. Special Issue on Knapsack Problems and Applications. https://doi.org/10.1016/j.cor.2011.03.014.

E. Buchmann, K. Böhm, T. Burghardt, and S. Kessler. 2013. Re-identification of smart meter data. Personal and Ubiquitous Computing 17(4):653-662. https://doi.org/10.1007/s00779-012-0513-6.

J.W. Byun, A. Kamra, E. Bertino, and N. Li. 2007. Efficient k-Anonymization Using Clustering Techniques. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 188-200.

K. Chatzikokolakis, C. Palamidessi, and M. Stronati. 2013. A predictive differentially-private mechanism for mobility traces. CoRR abs/1311.4008. http://arxiv.org/abs/1311.4008.

S. Chester, B. M. Kapron, G. Srivastava, and S. Venkatesh. 2013. Complexity of social network anonymization. Social Netw. Analys. Mining 3(2):151-166.

J-X. Chin, T. Tinoco De Rubira, and G. Hug. 2016. Privacy-protecting energy management unit through model-distribution predictive control. CoRR abs/1612.05120. http://arxiv.org/abs/1612.05120.

J. Cugliari, Y. Goude, and J. M. Poggi. 2016. Disaggregated electricity forecasting using wavelet-based clustering of individual consumers. In 2016 IEEE International Energy Conference (ENERGYCON), pages 1-6. https://doi.org/10.1109/ENERGYCON.2016.7514087.

G. Danezis. 2013. Privacy technology options for protecting and processing utility reading.

J. Domingo-Ferrer and V. Torra. 2005. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery 11(2):195-212. https://doi.org/10.1007/s10618-005-0007-5.

F. Dufaux and T. Ebrahimi. 2006. Scrambling for video surveillance with privacy. In 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), pages 160-160. https://doi.org/10.1109/CVPRW.2006.184.

C. Dwork. 2006. Differential Privacy. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 1-12.

C. Dwork and A. Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3-4):211-407. https://doi.org/10.1561/0400000042.

K. Ezhilarasi and M. Hariharan. 2015. Protecting sensitive information in bank transaction with data anonymization. International Journal For Technological Research In Engineering 2(11).

S. Fan and R.J. Hyndman. 2012. Short-term load forecasting based on a semi-parametric additive model. IEEE Transactions on Power Systems 27(1):134-141. https://doi.org/10.1109/TPWRS.2011.2162082.

P. Gaillard, Y. Goude, and R. Nedellec. 2016. Additive models and robust aggregation for GEFCom2014 probabilistic electric load and electricity price forecasting. International Journal of Forecasting 32(3):1038-1050.

S. Gambs, M-O. Killijian, and M. Nunez del Prado Cortez. 2014. De-anonymization attack on geolocated data. Journal of Computer and System Sciences 80(8):1597-1614. Special Issue on Theory and Applications in Parallel and Distributed Computing Systems.

B. Gedik and L. Liu. 2008. Protecting location privacy with personalized k-anonymity: Architecture and algorithms. IEEE Transactions on Mobile Computing 7(1):1-18. https://doi.org/10.1109/TMC.2007.1062.

S. Hansell. 2006. AOL removes search data on vast group of web users. New York Times.

T. J. Hastie and R. J. Tibshirani. 1990. Generalized additive models. London: Chapman & Hall.

S-K. Hong, K. Gurjar, H-S. Kim, and Y.S. Moon. 2013. A survey on privacy preserving time-series data mining. 3rd International Conference on Intelligent Computational Systems (ICICS'2013).

T. Hong, P. Pinson, and S. Fan. 2014. Global energy forecasting competition 2012. International Journal of Forecasting 30(2):357-363.

H. Kellerer, U. Pferschy, and D. Pisinger. 2004. Introduction to NP-Completeness of knapsack problems. Springer.

W. Labeeuw and G. Deconinck. 2013. Residential electrical load model based on mixture model clustering and Markov models. IEEE Transactions on Industrial Informatics 9(3):1561-1569. https://doi.org/10.1109/TII.2013.2240309.

N. Li, T. Li, and S. Venkatasubramanian. 2007. t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering, pages 106-115. https://doi.org/10.1109/ICDE.2007.367856.

Y. Li, S. Zhu, L. Wang, and S. Jajodia. 2002. A Privacy-Enhanced Microaggregation Method. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 148-159.

T. Warren Liao. 2005. Clustering of time series data, a survey. Pattern Recognition 38:1857-1874.

J-L. Lin and M-C. Wei. 2008. An efficient clustering method for k-anonymization. In Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society, PAIS 2008.

K. Liu, C. Giannella, and K. Kargupta. 2008. A survey of attack techniques on privacy-preserving data perturbation methods.

G. Loukides and J-H. Shao. 2008. An efficient clustering algorithm for k-anonymisation. Journal of Computer Science and Technology 23(2):188-202. https://doi.org/10.1007/s11390-008-9121-3.

C. Y.T. Ma and D. K.Y. Yau. 2015. On information-theoretic measures for quantifying privacy protection of time-series data. In Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, ASIA CCS '15, ACM, New York, NY, USA, pages 427-438. https://doi.org/10.1145/2714576.2714577.

A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. 2006. l-diversity: Privacy beyond k-anonymity. In ICDE.

F. McLoughlin, A. Duffy, and M. Conlon. 2010. The generation of domestic electricity load profiles through Markov chain modelling. Euro-Asian Journal of Sustainable Energy Development Policy 3.

D.H.O. McQueen, P. R. Hyland, and S. J. Watson. 2004. Monte Carlo simulation of residential electricity demand for forecasting maximum demand on distribution networks. IEEE Transactions on Power Systems 19(3):1685-1689.

F. McSherry and I. Mironov. 2009. Differentially private recommender systems: Building privacy into the Netflix Prize contenders. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '09, ACM, New York, NY, USA, pages 627-636. https://doi.org/10.1145/1557019.1557090.

M. Muratori, M. C. Roberts, R. Sioshansi, V. Marano, and G. Rizzoni. 2013. A highly resolved modeling technique to simulate residential power demand. Applied Energy 107:465-473.

A. Narayanan and V. Shmatikov. 2008. Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (SP 2008), pages 111-125. https://doi.org/10.1109/SP.2008.33.

R. Nedellec, J. Cugliari, and Y. Goude. 2014. GEFCom2012: Electric load forecasting and backcasting with semi-parametric models. International Journal of Forecasting 30(2):375-381.

J. Nin and V. Torra. 2009. Towards the evaluation of time series protection methods. Information Sciences 179(11):1663-1677. Including Special Issue on Chance Discovery.

J.V. Paatero and P.D. Lund. 2005. A model for generating household electricity load profiles. International Journal of Energy Research 30.

A. Pierrot and Y. Goude. 2011. Short-term electricity load forecasting with generalized additive models. Proceedings of ISAP power, pages 593-600.

P. Pompey, A. Bondu, Y. Goude, and M. Sinn. 2015. Massive-Scale Simulation of Electrical Load in Smart Grids Using Generalized Additive Models. Springer International Publishing, Cham, pages 193-212.

S. Rani and G. Sikka. 2012. Recent techniques of clustering of time series data: A survey. International Journal of Computer Applications.

E. Shi, T.-H. H. Chan, E. G. Rieffel, R. Chow, and D. Song. 2011. Privacy-preserving aggregation of time-series data. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2011, San Diego, California, USA.

L. Shou, X. Shang, K. Chen, G. Chen, and C. Zhang. 2013. Supporting pattern-preserving anonymization for time-series data. IEEE Transactions on Knowledge and Data Engineering 25(4):877-892. https://doi.org/10.1109/TKDE.2011.249.

A. Solanas and A. Ballesté. 2006. V-MDAV: Variable group size multivariate microaggregation.

B. Suri, U. D. Bordoloi, and P. Eles. 2012. A scalable GPU-based approach to accelerate the multiple-choice knapsack problem. In 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1126-1129. https://doi.org/10.1109/DATE.2012.6176665.

L. Sweeney. 2002. k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(05):557-570. https://doi.org/10.1142/S0218488502001648.

V. Torra. 2017. Data Privacy: Foundations, New Developments and the Big Data Challenge (Studies in Big Data). Springer.

G. Tóth, Z. Hornák, et al. 2004. Measuring anonymity revisited.

V. Tudor, M. Almgren, and M. Papatriantafilou. 2013. Analysis of the impact of data granularity on privacy for the smart grid. In WPES '13: Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society.

S. Venkatasubramanian. 2008. Measures of Anonymity. Springer US, Boston, MA, pages 81-103.

X. Xu and M. Numao. 2015. An efficient generalized clustering method for achieving k-anonymization. In 2015 Third International Symposium on Computing and Networking (CANDAR), pages 499-502. https://doi.org/10.1109/CANDAR.2015.61.

G. Zhang, X. Liu, and Y. Yang. 2015. Time-series pattern based effective noise generation for privacy protection on cloud. IEEE Transactions on Computers 64(5):1456-1469. https://doi.org/10.1109/TC.2014.2298013.

Q. Zhang, N. Koudas, D. Srivastava, and T. Yu. 2007. Aggregate query answering on anonymized tables. In 2007 IEEE 23rd International Conference on Data Engineering, pages 116-125. https://doi.org/10.1109/ICDE.2007.367857.

B. Zhou, J. Pei, and W. Luk. 2008. A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explor. Newsl. 10:12-22. https://doi.org/10.1145/1540276.1540279.