=Paper= {{Paper |id=Vol-2029/paper6 |storemode=property |title=Data-driven Anonymization Process Applied to Time Series |pdfUrl=https://ceur-ws.org/Vol-2029/paper6.pdf |volume=Vol-2029 |authors=Vincent Thouvenot,Damien Nogues,Catherine Gouttas |dblpUrl=https://dblp.org/rec/conf/simbig/ThouvenotNG17 }} ==Data-driven Anonymization Process Applied to Time Series== https://ceur-ws.org/Vol-2029/paper6.pdf
             Data-driven anonymization process applied to time series


       Vincent Thouvenot           Damien Nogues         Catherine Gouttas
Thales Communications & Security Thales Digital Factory Thales Digital Factory
                    firstname.lastname@thalesgroup.com




                      Abstract

     Digital transformation and Big Data allow the use of highly valuable data. However, these data can be individual or sensitive, and represent an obvious threat to privacy. Anonymization, which achieves a trade-off between data protection and data utility, can be used in this context. There is no global anonymization technique that fits all applications. Here, we describe a data-driven anonymization process and apply it to simulated electrical load data.

1   Introduction

This paper is mainly written as a dissemination exercise about anonymization and good practice around it. Indeed, while anonymization is quite well known from an academic point of view, this is not yet the case among French and European industrial actors. Privacy protection is nevertheless a fundamental and growing task for them. Digital transformation brings the creation of global data lakes and allows the development of new, valuable business. Moreover, some governments increasingly mandate the release of Open Data, which should promote the opening of digital knowledge and foster an open and valuable digital ecosystem. At the same time, the European Union has set up rules to protect citizens, which establish that citizens have a right to the protection of their individual data. Data have to be fairly processed, for a specific purpose, and with the individuals' agreement. This is an important point: keeping a maximum of personal data for future, unspecified mining tasks that might emerge through future methods is inconsistent with this rule. People have the right to access and rectify their individual data. With the 2016 regulation, applied from 2018, any company offering goods or services (including cloud services) to EU citizens may be subject to this regulation.

Besides this legal context, Big Data technologies enable the treatment of massive, dynamic and unstructured data, and facilitate data crossing, weakening privacy protection. The data concerned can be personal (name, address, etc.) and allow one to (almost) directly identify an individual. Sensitive data, like religious or political beliefs, pose a risk for individual privacy too. Smartphones, smart objects, loyalty cards, online purchases, social media: these are large sources of individual and sensitive data, which lead to an obvious risk for privacy. Protecting privacy means preventing the isolation of an individual, the correlation of information about one individual across different datasets, and the possibility of obtaining information on an individual through exogenous variables. Despite appearances, the trade-off behind anonymization is not an easy task. A dataset can contain identifiers; simply deleting or encrypting them is generally not sufficient (see e.g. Hansell (2006); Narayanan and Shmatikov (2008)). Other attributes can be quasi-identifiers, which allow re-identification when they are crossed (e.g., sex and age in a health dataset). There are sensitive data (e.g., disease in a health context). Finally, there are some remaining attributes. In addition, the trade-off depends on three parameters. First, it depends on data typology: bank transaction data (see e.g. Ezhilarasi and Hariharan (2015)) will not require the same treatment as unstructured data like social media data, which require hiding metadata, identifying content and the relational graph (see e.g. Zhou et al. (2008); Chester et al. (2013)). Second, the trade-off depends on the future use of the data. Third, the anonymization strength is time dependent: new datasets and new re-identification methods can be used to attack privacy, so an anonymization methodology that is admissible today can become inadmissible a few months or years later.
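The attribute categories introduced above (identifiers, quasi-identifiers, sensitive data, remaining attributes) can be made concrete with a toy tagging map; the attribute names below are hypothetical, echoing the health-dataset example, not taken from any real schema.

```python
# Toy illustration of tagging dataset attributes by privacy category
# (hypothetical attribute names, echoing the health-dataset example).
ATTRIBUTE_TAGS = {
    "name": "identifier",        # directly identifies an individual
    "sex": "quasi-identifier",   # re-identifying when crossed with others
    "age": "quasi-identifier",
    "disease": "sensitive",      # private information to protect
    "room_number": "remaining",  # neither identifying nor sensitive here
}

def attributes_by_tag(tags):
    """Group attribute names by their privacy category."""
    groups = {}
    for attribute, tag in tags.items():
        groups.setdefault(tag, []).append(attribute)
    return groups

groups = attributes_by_tag(ATTRIBUTE_TAGS)
print(groups["quasi-identifier"])  # ['sex', 'age']
```

Such a tagging is the starting point of the process described in Section 3: each category then receives a different treatment.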
   We illustrate an anonymization workflow on (simulated) electrical smart meter data provision, a symbolic example of sensitive and individual data whose exploitation has only recently become possible and which illustrates new anonymization needs. In France, smart meters are currently being deployed, with deployment due to be completed in 2020; they will allow the gathering of infra-daily household electrical load. These data are by nature personal and sensitive. Infra-daily individual loads make it possible to detect if and when someone is at home, and can increase the risk of burglary, for instance. Individual habits can be detected (see e.g. (Blazakis et al., 2016)), and sensitive data such as religion (e.g. during Ramadan) or illegal activities (e.g. the very particular load pattern of a cannabis plantation) can be derived. Providing these data is therefore a complicated challenge. Indeed, household electrical load will be available at different levels of time granularity (see e.g. Tudor et al. (2013); Buchmann et al. (2013)). To simplify the context, in France, infra-daily load will be available to the individuals themselves, who can choose to temporarily give a third party access to these data. The electricity distributor will gather daily data (even infra-daily data in particular contexts). The provider will have access to monthly data. Although infra-daily data are noisy, access to identified daily (or even monthly) data makes it easy to re-identify infra-daily consumption (see e.g. (Tudor et al., 2013)). In this context, merely hiding direct identifiers is inadequate. On the other side, these data have a strong potential value, and many actors are interested in them: distributors, to manage their network and carry out maintenance tasks; local communities, to improve their urban policy; providers, to propose new, more adaptive pricing; or start-ups, to offer individuals services to optimize their consumption. Depending on the future use of the data, it is not necessary to keep the same information: for instance, designing a new adaptive pricing and dimensioning the network from household electrical load require different information. Finally, the currently acceptable methods may be called into question when data from gas smart meters, which are also being deployed, become available, since gas meter data depend on phenomena similar to those driving electrical meter data.

   The article is divided into four sections. In Section 2, we give a short survey about anonymization. In Section 3, we describe our global anonymization process, from data gathering to dataset publication. Section 4 presents a simulator suited to the French electrical smart meter anonymization task; we use simulated data because we do not have access to real data. In Section 5, we apply the process of Section 3 to the simulation of Section 4.

2   A short survey about anonymization

For a dissemination task, it seems important to briefly discuss the differences between anonymization and encryption, because confusing the two is a common mistake. Data encryption consists in using mathematical algorithms to transform data. The process can be reversed with the right algorithm and the encryption key. It can be used to transfer data between two entities. The encrypted data are still individual, and thus still personal, if the original data are personal. Although encryption can usefully be one of the components of de-identification, it is neither necessary nor sufficient for anonymization.

   Pseudonymisation, which consists of hiding identifying metadata, may not be efficient (see Danezis (2013)). De-identification falls into two categories of techniques: transforming the data so that the published individuals are not real, and aggregating and generalizing individuals, so that the published data represent a set of individuals. These techniques can be used separately or combined.

   Permutation techniques and puzzling approaches deconstruct, transform and/or change the data design (see e.g. Agrawal and Srikant (2000), Zhang et al. (2007)). Noise addition techniques are popular. For instance, Dufaux and Ebrahimi (2006) randomly transform a video representation by inverting some signs of decomposition coefficients, and apply this to privacy protection in video surveillance. Aggarwal and Yu (2008) propose a survey of randomization methods. Classical randomization has some advantages: the noise is independent of the data and does not require the entire dataset, so it can be applied during data collection and on distributed systems. Liu et al. (2008) propose a survey of attack techniques against privacy obtained by perturbation methods. When the anonymizer adds additive noise, attackers can use methods such as filtering (spectral, singular value decomposition, principal component analysis, etc.), maximum a posteriori estimation, or distribution analysis. Under good conditions, multiplicative perturbations have good properties, for instance preserving Euclidean distance. An attacker who knows a sample of inputs and outputs, or who has some independent samples from the original distribution, can reverse the anonymization. Differential privacy (see Dwork (2006), Dwork and Roth (2014)) is a very popular form of noise addition: random noise is added in such a way that the mechanism produces the same output with almost the same probability on two adjacent inputs. A basic process to achieve differential privacy is to sample the dataset without replacement and to add fictive individuals. Differential privacy makes it possible to reason about privacy loss and to bound the risk. Chatzikokolakis et al. (2013) apply differential privacy to mobility traces and try to develop a mechanism more efficient than just adding independent noise. McSherry and Mironov (2009) propose a differential privacy framework to produce recommendations from collective user behavior in the Netflix Prize dataset.

   Privacy can also be protected by creating sets of individuals. K-anonymization (see Sweeney (2002)) consists of generalizing quasi-identifying information so that at least k individuals share the same values. K-anonymity can be broken when all the individuals of (at least) one class have the same sensitive value. L-diversity (see Machanavajjhala et al. (2006)) forces each class to have at least l different values of the sensitive attribute. In t-closeness (see Li et al. (2007)), the sensitive data in each class has to respect its distribution in the total population. Generalization and suppression is NP-hard. Moreover, as expressed in Domingo-Ferrer and Torra (2005), generalization and suppression may not be adapted to ordinal categorical and continuous attributes. Domingo-Ferrer and Torra (2005) propose to use microaggregation for this task. In microaggregation, data are partitioned into several clusters of size at least k containing similar records; an aggregation operator (e.g. mean or median for a continuous variable) is then applied to compute the centroid of each cluster. Besides the clustering method, microaggregation has two important parameters: the minimum dimension of each cluster, which adjusts the level of privacy protection, and the function computing the aggregate value, which is linked to future data utility and to the protection level. The aggregation function can be a mean, sum, median, quantile, partial autocorrelation function, time slicing profile, density, etc. Domingo-Ferrer and Torra (2005) partition data through the Maximum Distance to Average Vector (MDAV) algorithm. Aggarwal et al. (2006) propose a microaggregation where some atypical individuals may not be clustered and thus not published. Byun et al. (2007), Lin and Wei (2008), Li et al. (2002), Xu and Numao (2015) and Loukides and Shao (2008) propose greedy heuristics to achieve k-anonymity through clustering without NP-hard complexity. Bergeat et al. (2014) compare two software tools for k-anonymization on a French health dataset of more than 20 million records. Gedik and Liu (2008) use k-anonymity to protect mobile location privacy.

   When working on time series, the previous techniques can be used. For instance, Shou et al. (2013) propose what they name (k, P)-anonymity to preserve patterns in time series. For electrical household load protection, Chin et al. (2016) propose to solve an optimization problem with two components: the first concerns the information leakage rate of consumer load given grid load, and the second the cost of errors. Shi et al. (2011) propose a form of differential privacy to protect time series. Zhang et al. (2015) propose noise generation to protect cloud data. Hong et al. (2013) propose a survey of time series privacy protection.

   After de-identification, it is important to measure the de-identification degree. Venkatasubramanian (2008) surveys the metrics proposed to measure privacy and privacy loss, dividing them into three categories: statistical methods, which take into account the variance of the perturbed variable; probabilistic methods, based on information theory and Bayesian analysis; and computational methods, which start from the idea of a resource-bounded attacker and measure privacy loss as a function of the information available to such an attacker. Tóth et al. (2004) work on message transmission and analyze two entropy-based anonymity measures. They measure separately global anonymity, which quantifies the effort necessary to fully compromise the dataset (what we later call the journalist scenario), and local anonymity, which quantifies the probability that the transmissions of one user are compromised (what we later call the prosecutor scenario). Gambs et al. (2014) work on attacks on geolocated data. Nin and Torra (2009) propose a framework of protection and re-identification for
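The noise-addition flavour of differential privacy surveyed above can be sketched with the standard Laplace mechanism on a bounded aggregate query. This is a minimal, generic sketch, not the authors' process; the load values, bound and epsilon below are illustrative assumptions.

```python
import math
import random

def laplace_sample(scale, rng):
    """Draw one Laplace(0, scale) sample by inverse-CDF transform."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_total_load(loads, epsilon, load_bound, rng):
    """Epsilon-differentially-private release of a total load.

    After clipping, each individual contributes at most `load_bound`,
    so adding or removing one individual changes the total by at most
    `load_bound` (the query sensitivity). Laplace noise of scale
    load_bound / epsilon then yields epsilon-differential privacy
    for this query.
    """
    clipped = [min(max(x, 0.0), load_bound) for x in loads]
    return sum(clipped) + laplace_sample(load_bound / epsilon, rng)

rng = random.Random(0)
loads = [1.2, 0.8, 2.5, 1.9]  # hypothetical daily loads (kWh)
noisy_total = dp_total_load(loads, epsilon=1.0, load_bound=3.0, rng=rng)
```

Smaller epsilon means stronger protection but noisier output, which is exactly the utility/protection trade-off discussed throughout the paper.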
time series. Ma and Yau (2015) propose some information measures for quantifying the privacy protection of time-series data.

   Interested readers will find in the first-rate book Torra (2017) the stakes of data privacy and the associated techniques.

3   Anonymization process

Each anonymization task has a specific and a generic part, and so is unique. In this section we describe our global data-driven anonymization process (see Figure 1), which separates these two parts.

Figure 1: Global anonymization process

   After data gathering, we have to tag the data according to their categories: identifiers, quasi-identifiers, sensitive data, and remaining attributes. The next step is data pseudonymisation: identifiers have to be hidden (deletion, encryption, etc.). Obviously, this is generally not enough to ensure privacy protection. We then have to establish one metric to measure data protection and another one for data utility. As anonymization cannot be total and perfect, we have to choose a threshold of acceptable re-identification risk. Despite the de-identification process, a residual risk remains, which has to be weighed against the benefits.

   It is necessary to build a re-identification framework, which is driven by the context and has to be realistic. This means that the worst-case situation, where an attacker is almost all-knowing, is probably not realistic and degrades the data utility / privacy protection trade-off. The re-identification framework depends on many parameters. The attacker type must be determined: the attacker's resources depend on who he is (e.g. a member of the organization which anonymizes the data, who has access to plenty of data for attacking the anonymized dataset; a member of a related organization which has access to similar data that can be crossed; a Machine Learning expert who can deploy efficient re-identification models; a neighbor who has access to contextual data; etc.). Many reasons can motivate the attacker (retrieving information about an individual for aggressive commercial offers or burglary; harming the organization which manages the anonymization, for instance to recover data governance; demonstrating re-identification capability; etc.). The re-identification framework also depends on the attack category. Crossing anonymized data with a dataset which is not anonymized is a classical way to attempt re-identification; the information within the anonymized data can be used too (e.g. anonymized Internet requests can be identified by crossing the location and interests of an individual). It further depends on the chosen meaning of re-identification, which can mean identifying an individual or identifying sensitive information about an individual. For instance, if a household load has been anonymized by pseudonymization followed by noise addition, the noisy curve can be matched to a customer; with microaggregation, it could be possible to find the customer's cluster and possibly deduce probable sensitive information and behavior for that customer. The framework also depends on the technique used to measure re-identification risk. To be valuable, attackers have to be confident in their re-identification. When we compute the risk of individual identification, true positives are not the only key performance indicator. Here, a true positive means that the right individual in the anonymized data has been matched to the individual in the non-anonymized data. However, this individual in the non-anonymized data may be matched to many other individuals in the anonymized data, which decreases the re-identification risk; identification errors decrease the attacker's confidence too. The risk measure has to combine all this information. Lastly, the re-identification framework depends on the re-identification scenario. Many are possible: we can target all or almost all individuals when we know they are in the dataset (journalist scenario), we can target one or a few individuals (prosecutor scenario), we can try to distinguish studies with and without one individual, etc.

   Then, as anonymization is a trade-off between utility and protection, we have to choose the minimum utility of the data. There are two cases. Data can be provided to a third party to answer a specific need; then only the necessary information, and no more, has to be provided, and the intended study must remain possible after anonymization. For instance, if an electricity provider wants to design a new daily pricing, it only needs a precise daily profile. Data can also be given without a specific need, for instance when pushed to Open Data. We need to compute a metric of privacy cost: for instance, when we add noise to a time series, we can compute a signal-to-noise ratio.

   Lastly, as anonymization cannot be perfect, we have to choose the acceptable limit for re-identification risk. It determines the achievement point of the trade-off and depends on the level of individuality and sensitivity. The choice is driven by the exportation model used at the end. We can choose a publish-and-forget model, where typically data are provided as Open Data; then it is (almost) impossible to stop data sharing. Another model is the data use agreement model, where an agreement settles what the third party can do. Finally, we can use a closed model, where data stay in a closed environment and the third party only has access to the results of its requests.

   To avoid scalability problems during the non-industrial step, we can perform an optional sub-sampling. The de-identification methods, which depend on the data typology and future data use, are then applied, and we measure the re-identification risk. When the risk is lower than the authorized maximum, we transform the entire dataset. Then we measure the utility of the anonymized data. If the minimum utility is not reached, we restart all the steps of this paragraph; otherwise we export the data according to the chosen model.

4   Application context: simulated electrical load

Electrical household load simulation has a rich literature. Many authors use variants of Markov chains (e.g. Labeeuw and Deconinck (2013), Muratori et al. (2013), McLoughlin et al. (2010)). McQueen et al. (2004) use a Monte Carlo simulation model of load demand, taking into account the statistical spread of demand in each half hour using data sampled from a gamma distribution. Paatero and Lund (2005) use a bottom-up load model for generating household load profiles. Pompey et al. (2015) train Additive Models (see Hastie and Tibshirani (1990)) to achieve massive-scale simulation of electrical load in Smart Grids. Additive Models have already proven their efficiency in modelling and forecasting electricity load at the aggregate level (see Pierrot and Goude (2011) in France, Fan and Hyndman (2012) in Australia) as well as at the local level (see e.g. Nedellec et al. (2014)).

   We simulate three types of curves. We name the first "second house load": almost constant curves with added random noise. The second is the "little professional": piecewise-constant curves with a load that is almost null during week-ends and nights and almost constant during the day, where the jump intensity and activity periods are randomly chosen. The third is the "household load", with calendar and thermic components. To simulate the last, we apply an idea similar to Pompey et al. (2015). We train simulation models on the GEFCom 2012 dataset (see Hong et al. (2014)), which comes from a Kaggle challenge and contains the hourly load demand of 20 local areas in the USA and the temperature of 11 weather stations. We train many different Additive Models on this dataset, whose feature sets contain calendar parameters (type of day, number of days since the beginning of the year, etc.) and a random number of raw and smoothed temperatures. They are trained on different periods. Then we compute two months of their forecasts with random translations of the features (e.g. one additive model design is translated by one hour and its temperatures by two degrees Fahrenheit) to introduce variety into the simulated load. We also train some quantile additive models (see Gaillard et al. (2016)) to simulate extreme behaviors. Our simulated data are smoother than real individual loads, but this is not a real problem here. Indeed, for our application, it is important that the simulation respects some assumptions: we assume there are different levels of load and three curve families, and that consumptions are almost unique even at low frequency, as shown in Tudor et al. (2013). To respect the uniqueness assumption, we arbitrarily impose that the daily aggregations over one week of two "household load" curves, rounded to the thousands, cannot be equal. We choose the daily time scale of aggregation because we face an attacker using daily data; by considering one week, we integrate the weekly cycle, which is important when considering electrical data.

   We assume that a provider needs to refine a pricing for an area of its customers. It needs some infra-daily information (peak of demand, profile, etc.). In our scenario, a potential attacker has access to identified daily data. We assume the exportation model is a "publish and forget" model, which explains the possibility for an attacker to
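The three simulated curve families can be caricatured with a much simpler generator than the GAM-based simulator described above; the shapes, levels and thresholds below are illustrative assumptions, not the authors' models.

```python
import math
import random

HOURS_PER_WEEK = 24 * 7

def second_house_load(rng, base=0.3, noise=0.05):
    """'Second house load': almost constant, plus random noise."""
    return [max(0.0, base + rng.gauss(0.0, noise))
            for _ in range(HOURS_PER_WEEK)]

def little_professional_load(rng, standby=0.05):
    """'Little professional': almost null at night and on the week-end,
    an almost constant level during randomly chosen working hours."""
    level = rng.uniform(1.0, 3.0)  # random jump intensity
    start, stop = rng.randint(6, 9), rng.randint(17, 20)  # activity period
    curve = []
    for h in range(HOURS_PER_WEEK):
        day, hour = divmod(h, 24)
        working = day < 5 and start <= hour < stop
        curve.append(level if working else standby)
    return curve

def household_load(rng, temperatures):
    """'Household load': daily calendar cycle plus a thermic (heating)
    component that grows as temperature drops below 15 degrees C."""
    curve = []
    for h, temp in enumerate(temperatures):
        hour = h % 24
        calendar = 0.5 + 0.4 * math.sin(math.pi * (hour - 7) / 24) ** 2
        thermic = 0.05 * max(0.0, 15.0 - temp)
        curve.append(max(0.0, calendar + thermic + rng.gauss(0.0, 0.05)))
    return curve

rng = random.Random(0)
temps = [10.0 + 5.0 * math.sin(2 * math.pi * h / 24)
         for h in range(HOURS_PER_WEEK)]
curves = [second_house_load(rng),
          little_professional_load(rng),
          household_load(rng, temps)]
```

One week of hourly values per curve is enough to exercise the daily and weekly cycles that matter in the attack scenario.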
have daily data access.

    Method                 Parameters                                            Utility
    Random noise           signal-to-noise ratio, noise family                   trend, anomaly detection, total scope
    Permutation            type and quantity of permutations                     total scope
    Transformation         type of transformation (standardization, etc.)        total scope, trend, anomaly detection
    Time slicing           time scale                                            total scope, anomaly detection
    Differential privacy   sampling probability and fictive curves               total scope, trend, anomaly detection
    Microaggregation       clustering, cluster dimension, aggregation function   total scope, cluster trend, profile
    Scope aggregation      scope, aggregation function                           total scope, scope trend, profile

          Table 1: De-identification for time series

5   Application of the anonymization process

After data gathering, we have to tag the data. Here we assume the data have two attributes: an individual identifier and a simulated time series. First, we pseudonymize the individual identifier by substituting it with random numbers drawn without replacement, to ensure uniqueness. We have only one quasi-identifier, which is also the sensitive data: the simulated time series. In our simulation framework, the highest level of attacker can be a competing organization which has identified data at a fine granularity level (daily data), with a good level of optimization and computing, and whose objective is to find information about customers' behaviors to make

be completed by post-treatment to improve protection. The re-identification risks of the first five techniques of Table 1 concern customer identification and can be studied with classical time series identification and classification techniques (Deep Learning, ensemble methods, K Nearest Neighbors, etc.). Moreover, some methods are reversible. For instance, denoising methods can be applied against the first one; for a deterministic transformation, the perturbation can be inverted with the knowledge of the transformation parameters; attempting to rebuild the chronological order can reverse time slicing. Sensitive information can be recovered from the two last methods when attackers can identify the cluster of a customer. Here we work to provide data to a provider who wants to design a new pricing. It needs a precise profile, and we choose to use microaggregation: as a provider will not propose one tariff per individual, but several tariffs for large groups, we do not need precise individual data.

5.2   Chosen methodology

We use time series clustering (see Liao (2005), Rani and Sikka (2012)). As explained in these surveys, there are many ways to cluster time series. First, a clustering algorithm can directly be applied
aggressive supply. Data are provided to answer                                              on raw data. However, it can be inefficient be-
to a specific need of a third party: having data                                            cause of noisy data. Second consists to extract
to etablish new pricing. For this topic, microag-                                           features from time series and applies clustering al-
gregation is relevent. Then, we protect privacy                                             gorithm on these features. Third is model-based
throught the minimum dimension of clusters. The                                             approaches where time series are modelled before
components of the trade-off are the choice of clus-                                         being clustered. Another important choice is the
tering methodology, the choice of the dimension                                             distance measure like Euclidean, Kullback-Leibler
of clusters and the choice of the aggregator oper-                                          divergence, Dynamic Time Warping (see Berndt
ator. With microaggregation, an attacker could,                                             and Clifford (1994)), etc.
at worst, identify probable customers behaviors.                                               Wavelets Decomposition (see Beylkin et al.
More the minimum dimension of clusters is weak                                              (1991)) have yet being used in time series stud-
or more one load participates at the building of a                                          ies because it allows to work on the different lev-
cluster, more the attacker can be certain of its de-                                        els of frequencies of the signal and to denoise the
ductions.                                                                                   signal. We apply the pre-processing of Cugliari
                                                                                            et al. (2016) which successfully applies disaggre-
5.1 Statistical and Machine Learning Setup                                                  gated load clustering to forecast load demand. Af-
Table 1 presents some techniques which can be                                               ter time series projection, authors compute rela-
used to protect time series. Permutation, which                                             tive contribution of each energy level. Assume
consists to exchange data from one curve to an-                                             that ( , j,k ) is a Haar basis. A continuous sig-
other, is a form of noise introduction and can                                              nal can be approximated in a truncated Haar ba-
                                                                                                                      P     Pj
create unrealistic curves. Transformation can be                                            sis: fˆ(t) = c0 (t) + Jj=01 2k=01 dj,k j,k (t),
smoothing, standardization, etc. whose objec-                                               where c0 , dj,k are the decomposition coefficients
tive is to erase some individualities. Time slicing                                         obtained with Fast Wavelet Transform algorithm.
breaks individual trends. Differential privacy can                                          Then, we define relative contribution of level j by



                                                                                       85
               ✓                   ◆
relj = logit
                     ||dj ||2
                   PJ 1 2 2     , where logit(p) =            which can be solved by many algorithms (e.g. pro-
    ⇣    ⌘          k=0 ||dk ||2                              gramming dynamic or simulated annealing). Even
ln 1 p p . In these features, we do not consider              it is a NP-hard problem and the attacker needs an
c0 , which corresponds to the mean level of each              exact solution, many works show it is possible to
load. We focus our effort on the profiles form                consider the problem in a multi-parallel way and
which is important to establish new pricing.                  use GPU programming (see Boyer et al. (2012),
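This feature extraction can be sketched in a few lines. The code below is an illustrative implementation (not the authors' code), assuming series of dyadic length and a manual Haar transform:

```python
import numpy as np

def haar_relative_contributions(x):
    """Logit of the per-level relative energy of the Haar detail coefficients.

    x: 1-D array whose length is a power of two.
    Returns one feature per detail level; the approximation c_0 is ignored,
    as in the paper, to focus on profile shape rather than mean level.
    """
    x = np.asarray(x, dtype=float)
    energies = []
    approx = x
    while len(approx) > 1:
        pair = approx.reshape(-1, 2)
        detail = (pair[:, 0] - pair[:, 1]) / np.sqrt(2.0)  # d_{j,k}
        approx = (pair[:, 0] + pair[:, 1]) / np.sqrt(2.0)  # next approximation
        energies.append(np.sum(detail ** 2))               # ||d_j||_2^2
    energies = np.array(energies)
    rel = energies / energies.sum()                        # relative contribution
    return np.log(rel / (1.0 - rel))                       # logit

feats = haar_relative_contributions(np.sin(np.linspace(0, 8 * np.pi, 64)))
print(feats.shape)  # one feature per detail level: (6,)
```

Since the logit is the inverse of the sigmoid, applying the sigmoid to the features recovers the relative contributions, which sum to one.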
As in Cugliari et al. (2016), we use the relative contributions after a Haar decomposition. In place of K-Means, we use Maximum Distance to Average Vector generic (MDAV-generic), presented by Domingo-Ferrer and Torra (2005), because we want clusters of minimum size k regardless of the number of clusters, and not k clusters regardless of their size. Instead of MDAV-generic, we could have used less rigid algorithms like V-MDAV (see Solanas and Ballesté (2006)), which does not force each cluster (except some of the last ones) to be of a fixed size. With MDAV-generic, we know in advance the number of clusters, which can be interesting when there is a data provision requirements specification with the third party. We benchmark the technique against a mean-based aggregation and a variance-based aggregation. By mean (resp. variance) based aggregation, we mean running K Nearest Neighbors on the mean (resp. variance) load of each individual, initialized with the smallest mean (resp. variance). These benchmarks have some advantages: they are easy to implement and quickly computed.

The clustering algorithm divides the individuals into subsets of pre-determined minimum size. Then, we have to choose the aggregation technique. We can compute the median load of the individuals of each cluster at each time step. This choice minimizes the absolute loss for each cluster and hides some extreme values. However, there is a non-zero probability that one individual of one cluster is (almost) always the median. In this case, giving the median is equivalent to giving the load of one individual.
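As a minimal numerical illustration of the two aggregators (synthetic data, not the paper's simulated loads):

```python
import numpy as np

rng = np.random.default_rng(0)
loads = rng.gamma(2.0, 1.0, size=(12, 48))  # 12 individuals, 48 intra-day loads
clusters = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])  # cluster label per individual

# Aggregate each cluster at each instant: the median hides extreme values,
# while the mean spreads every individual's contribution.
median_curves = np.array([np.median(loads[clusters == c], axis=0) for c in range(3)])
mean_curves = np.array([loads[clusters == c].mean(axis=0) for c in range(3)])
print(median_curves.shape, mean_curves.shape)  # (3, 48) (3, 48)
```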
The mean can be computed at each instant for each cluster. However, the attacker has access, for each individual i, to A_i, the vector of daily loads, and to (l_j)_j, the infra-daily loads of each final aggregate (and so to (L_j)_j, the daily loads of each cluster). The attacker can try to solve, for each cluster j,

    argmin_{p_i ∈ {0,1}}  || L_j - (1 / Σ_i p_i) Σ_i p_i A_i ||_2 .

This adversary model is then equivalent to a knapsack problem (see Kellerer et al. (2004)), which can be solved by many algorithms (e.g. dynamic programming or simulated annealing). Even though it is an NP-hard problem and the attacker needs an exact solution, many works show that it is possible to tackle the problem in a massively parallel way with GPU programming (see Boyer et al. (2012), Suri et al. (2012)). Adding a noise, even a small one, allows to get out of the knapsack problem.

Algorithm 1 MDAV-generic
Assume D is the relative contribution dataset and k an integer.
 1. While card(D) ≥ 3k:
    (a) Compute the attribute-wise average of all records in D.
    (b) Find the record d1 most distant from this average in terms of Euclidean norm.
    (c) Find the record d2 most distant from d1.
    (d) Use d1 and d2 as the centers of two clusters of size k.
    (e) Delete the records of the two clusters and go back to the beginning.
 2. If 2k ≤ card(D) ≤ 3k - 1:
    (a) Compute the attribute-wise average of the remaining records in D.
    (b) Find the record d1 most distant from this average (Euclidean norm).
    (c) Use d1 as the center of a cluster of size k.
    (d) Form another cluster with the other remaining records in D.
 3. If card(D) < 2k, form a cluster with the remaining records.

5.3   Results

We compare the performance when the clustering algorithm is applied on 1,400 centralized simulated time series. Recall that the objective consists in providing profiles as homogeneous as possible to a third party which wants to propose a new pricing. That means the third party has to collect homogeneous clusters, and the data utility measure has to reflect this point.
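Algorithm 1 can be sketched as follows, under our reading of the pseudo-code (here d2 is searched after removing d1's cluster, which avoids d2 being swallowed by it; this is an illustration, not the authors' implementation):

```python
import numpy as np

def mdav_generic(data, k):
    """MDAV-generic: partition records into clusters of size >= k."""
    idx = list(range(len(data)))
    clusters = []

    def extract(center, pool):
        # Build one cluster: the k records of `pool` closest to `center`.
        dists = np.linalg.norm(data[pool] - data[center], axis=1)
        chosen = [pool[i] for i in np.argsort(dists)[:k]]
        clusters.append(chosen)
        return [i for i in pool if i not in chosen]

    while len(idx) >= 3 * k:                      # step 1
        centroid = data[idx].mean(axis=0)
        d1 = idx[int(np.argmax(np.linalg.norm(data[idx] - centroid, axis=1)))]
        idx2 = extract(d1, idx)
        d2 = idx2[int(np.argmax(np.linalg.norm(data[idx2] - data[d1], axis=1)))]
        idx = extract(d2, idx2)
    if len(idx) >= 2 * k:                         # step 2
        centroid = data[idx].mean(axis=0)
        d1 = idx[int(np.argmax(np.linalg.norm(data[idx] - centroid, axis=1)))]
        idx = extract(d1, idx)
    if idx:                                       # step 3
        clusters.append(idx)
    return clusters

parts = mdav_generic(np.random.default_rng(1).normal(size=(23, 6)), k=5)
print(sorted(len(c) for c in parts))  # → [5, 5, 5, 8]
```

Every cluster has at least k records, so releasing only per-cluster aggregates yields k-anonymity-style protection on the profiles.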
Figure 2: Process relative time
Figure 3: Silhouette Index estimated density

If data are given for a forecasting task, we should measure the utility differently, for instance by computing the MAPE (Mean Absolute Percentage Error) on a test subset (e.g. see Pierrot and Goude (2011)). Here, in a pricing task, there is no forecasting need. This illustrates the dependency between anonymization and the future use of the data. We measure data utility by computing indicators like the silhouette index and the Davies-Bouldin index. These two indicators are measures of the homogeneity of the clusters. We do not use indicators like the Root Mean Square Error, because there are different load levels. We work from 4- to 28-anonymization by steps of 4. In Figure 2, we plot the relative computation time, the reference level being the computation time of 28-anonymization. The computation time decreases when k grows. Figure 3 gives an estimation of the Silhouette Index density for each k-anonymization. This index, computed for each individual, lies between -1 and 1. The higher this index, the more strongly the individual is connected to its cluster and the further away it is from the other clusters. It has to be above 0 for an individual to be well clustered. Obviously, when k grows (data protection increases), the index decreases (data utility decreases). We see three modes in the density: the upper one corresponds to second homes, the middle one to professionals and the last one to households.

Figure 4: Bouldin-Davies Index ratio
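Both indicators are available off the shelf; a minimal sketch on synthetic, well-separated feature clouds (assuming scikit-learn is available; this is not the paper's data):

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(2)
# Two well-separated synthetic "profile feature" clouds of 50 points each.
feats = np.vstack([rng.normal(0.0, 0.3, (50, 5)), rng.normal(3.0, 0.3, (50, 5))])
labels = np.repeat([0, 1], 50)

sil = silhouette_score(feats, labels)      # in [-1, 1], higher is better
dbi = davies_bouldin_score(feats, labels)  # >= 0, lower is better
print(round(sil, 2), round(dbi, 2))
```

In the paper's setting, `feats` would be the wavelet relative contributions and `labels` the microaggregation clusters, so the scores quantify the data utility kept by each k-anonymization.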
The Bouldin-Davies index represents the average similarity between each cluster and its most similar one, averaged over all the clusters. The lower this index, the better the clustering. In Figure 4, we give the ratio between the Bouldin-Davies Index of each k-anonymization and the one of 28-anonymization. Figure 4 illustrates the data utility loss caused by increasing data protection. There is a big gap in data utility between 4-anonymization and 8-anonymization; however, the latter's data protection is insufficient.

In Figure 5 we compare MDAV-generic applied on the relative contributions of the wavelet decomposition with the two benchmark aggregations: one based on the mean level (before centralization) and the other on the standard deviation level. In the relative Bouldin-Davies Index we give, for each k-anonymization, the ratio between the Bouldin-Davies Index obtained by the mean- and standard-deviation-based aggregations and the one obtained when MDAV-generic is applied with the same k. All the indicators show that MDAV-generic applied on the relative contributions of the wavelet decomposition outperforms the two forms of trivial aggregation, and that the second-order (variance-based) aggregation is more efficient than the first-order (mean-based) one.

6   Conclusion

The process allows to formalize a global data-driven anonymization process, facilitating the separation between the specific and generic parts of anonymization and integrating business knowledge and Data Science algorithms. Through this process formalization, we optimize the trade-off between privacy protection and data utility.
Figure 5: Relative Bouldin-Davies Index between benchmarks and MDAV-generic

To illustrate the process, we simulate a context close to (but different from) the situation of electrical smart meter data provision. We assume a third party tries to establish a new pricing, and we propose a microaggregation process for time series using the pre-processing methodology of Cugliari et al. (2016) and the clustering algorithm of Domingo-Ferrer and Torra (2005). Instead of the punctual information given at each instant, it could be interesting to provide a probabilistic view of the load.

Our example is based on static data. In many applications, it is interesting to receive data in an almost streaming way. One of the next steps is to develop incremental microaggregation and to measure the privacy loss and the data utility in this context. Another future work is the development of a big data framework allowing anonymization with noise addition, differential privacy and scalable microaggregation, separating the part specific to each type of data, business constraints and future data utility from the generic part. Lastly, here we work on global datalake anonymization, which assumes there is a level where raw data are stored before transformation. Local anonymization, where data are anonymized at the individual level, has to be studied.

References

C.C. Aggarwal and P.S. Yu. 2008. A survey of randomization methods for privacy-preserving data mining. In Privacy-Preserving Data Mining - Models and Algorithms, pages 137-156.

G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. 2006. Achieving anonymity via clustering. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, New York, NY, USA, PODS '06, pages 153-162. https://doi.org/10.1145/1142351.1142374.

R. Agrawal and R. Srikant. 2000. Privacy preserving data mining. In ACM SIGMOD Conference.

M. Bergeat, N. Cuppens-Boulahia, F. Cuppens, N. Jess, F. Dupont, S. Oulmakhzoune, and G. De Peretti. 2014. A French Anonymization Experiment with Health Data. In PSD 2014: Privacy in Statistical Databases. Eivissa, Spain. https://hal.archives-ouvertes.fr/hal-01214624.

D.J. Berndt and J. Clifford. 1994. Using dynamic time warping to find patterns in time series. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, KDD Workshop. AAAI Press, pages 359-370.

G. Beylkin, R. Coifman, and V. Rokhlin. 1991. Fast wavelet transforms and numerical algorithms I. Comm. Pure Appl. Math. 44(2):141-183. https://doi.org/10.1002/cpa.3160440202.

K. Blazakis, S. Davarzani, G. Stavrakakis, and I. Pisica. 2016. Lessons learnt from mining meter data of residential consumers. Periodica Polytechnica Electrical Engineering and Computer Science 60(4):266-272. https://doi.org/10.3311/PPee.9993.

V. Boyer, D. El Baz, and M. Elkihel. 2012. Solving knapsack problems on GPU. Computers and Operations Research 39(1):42-47. Special Issue on Knapsack Problems and Applications. https://doi.org/10.1016/j.cor.2011.03.014.

E. Buchmann, K. Böhm, T. Burghardt, and S. Kessler. 2013. Re-identification of smart meter data. Personal and Ubiquitous Computing 17(4):653-662. https://doi.org/10.1007/s00779-012-0513-6.

J.W. Byun, A. Kamra, E. Bertino, and N. Li. 2007. Efficient k-Anonymization Using Clustering Techniques. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 188-200.

K. Chatzikokolakis, C. Palamidessi, and M. Stronati. 2013. A predictive differentially-private mechanism for mobility traces. CoRR abs/1311.4008. http://arxiv.org/abs/1311.4008.

S. Chester, B. M. Kapron, G. Srivastava, and S. Venkatesh. 2013. Complexity of social network anonymization. Social Netw. Analys. Mining 3(2):151-166.

J-X. Chin, T. Tinoco De Rubira, and G. Hug. 2016. Privacy-protecting energy management unit through model-distribution predictive control. CoRR abs/1612.05120. http://arxiv.org/abs/1612.05120.

J. Cugliari, Y. Goude, and J. M. Poggi. 2016. Disaggregated electricity forecasting using wavelet-based clustering of individual consumers. In 2016 IEEE International Energy Conference (ENERGYCON), pages 1-6. https://doi.org/10.1109/ENERGYCON.2016.7514087.
G. Danezis. 2013. Privacy technology options for protecting and processing utility readings.

J. Domingo-Ferrer and V. Torra. 2005. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery 11(2):195-212. https://doi.org/10.1007/s10618-005-0007-5.

F. Dufaux and T. Ebrahimi. 2006. Scrambling for video surveillance with privacy. In 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), pages 160-160. https://doi.org/10.1109/CVPRW.2006.184.

C. Dwork. 2006. Differential Privacy. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 1-12.

C. Dwork and A. Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3-4):211-407. https://doi.org/10.1561/0400000042.

K. Ezhilarasi and M. Hariharan. 2015. Protecting sensitive information in bank transactions with data anonymization. International Journal For Technological Research In Engineering 2(11).

S. Fan and R.J. Hyndman. 2012. Short-term load forecasting based on a semi-parametric additive model. IEEE Transactions on Power Systems 27(1):134-141. https://doi.org/10.1109/TPWRS.2011.2162082.

P. Gaillard, Y. Goude, and R. Nedellec. 2016. Additive models and robust aggregation for GEFCom2014 probabilistic electric load and electricity price forecasting. International Journal of Forecasting 32(3):1038-1050.

S. Gambs, M-O. Killijian, and M. Nunez del Prado Cortez. 2014. De-anonymization attack on geolocated data. Journal of Computer and System Sciences 80(8):1597-1614. Special Issue on Theory and Applications in Parallel and Distributed Computing Systems.

B. Gedik and L. Liu. 2008. Protecting location privacy with personalized k-anonymity: Architecture and algorithms. IEEE Transactions on Mobile Computing 7(1):1-18. https://doi.org/10.1109/TMC.2007.1062.

S. Hansell. 2006. AOL removes search data on vast group of web users. New York Times.

T. J. Hastie and R. J. Tibshirani. 1990. Generalized additive models. London: Chapman & Hall.

S-K. Hong, K. Gurjar, H-S. Kim, and Y.S. Moon. 2013. A survey on privacy preserving time-series data mining. 3rd International Conference on Intelligent Computational Systems (ICICS'2013).

T. Hong, P. Pinson, and S. Fan. 2014. Global energy forecasting competition 2012. International Journal of Forecasting 30(2):357-363.

H. Kellerer, U. Pferschy, and D. Pisinger. 2004. Introduction to NP-Completeness of knapsack problems. Springer.

W. Labeeuw and G. Deconinck. 2013. Residential electrical load model based on mixture model clustering and Markov models. IEEE Transactions on Industrial Informatics 9(3):1561-1569. https://doi.org/10.1109/TII.2013.2240309.

N. Li, T. Li, and S. Venkatasubramanian. 2007. t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering, pages 106-115. https://doi.org/10.1109/ICDE.2007.367856.

Y. Li, S. Zhu, L. Wang, and S. Jajodia. 2002. A Privacy-Enhanced Microaggregation Method. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 148-159.

T. Warren Liao. 2005. Clustering of time series data, a survey. Pattern Recognition 38:1857-1874.

J-L. Lin and M-C. Wei. 2008. An efficient clustering method for k-anonymization. In Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society, PAIS 2008, pages 499-502.

K. Liu, C. Giannella, and K. Kargupta. 2008. A survey of attack techniques on privacy-preserving data perturbation methods.

G. Loukides and J-H. Shao. 2008. An efficient clustering algorithm for k-anonymisation. Journal of Computer Science and Technology 23(2):188-202. https://doi.org/10.1007/s11390-008-9121-3.

C. Y.T. Ma and D. K.Y. Yau. 2015. On information-theoretic measures for quantifying privacy protection of time-series data. In Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security. ACM, New York, NY, USA, ASIA CCS '15, pages 427-438. https://doi.org/10.1145/2714576.2714577.

A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. 2006. l-diversity: Privacy beyond k-anonymity. In ICDE.

F. McLoughlin, A. Duffy, and M. Conlon. 2010. The generation of domestic electricity load profiles through Markov chain modelling. Euro-Asian Journal of Sustainable Energy Development Policy 3.

D.H.O. McQueen, P. R. Hyland, and S. J. Watson. 2004. Monte Carlo simulation of residential electricity demand for forecasting maximum demand on distribution networks. IEEE Transactions on Power Systems 19(3):1685-1689.
F. McSherry and I. Mironov. 2009. Differentially private recommender systems: building privacy into the net. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, NY, USA, KDD '09, pages 627-636. https://doi.org/10.1145/1557019.1557090.

M. Muratori, M. C. Roberts, R. Sioshansi, V. Marano, and G. Rizzoni. 2013. A highly resolved modeling technique to simulate residential power demand. Applied Energy 107:465-473.

A. Narayanan and V. Shmatikov. 2008. Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (SP 2008), pages 111-125. https://doi.org/10.1109/SP.2008.33.

R. Nedellec, J. Cugliari, and Y. Goude. 2014. GEFCom2012: Electric load forecasting and backcasting with semi-parametric models. International Journal of Forecasting 30(2):375-381.

J. Nin and V. Torra. 2009. Towards the evaluation of time series protection methods. Information Sciences 179(11):1663-1677. Including Special Issue on Chance Discovery.

J.V. Paatero and P.D. Lund. 2005. A model for generating household electricity load profiles. International Journal of Energy Research 30.

A. Pierrot and Y. Goude. 2011. Short-term electricity load forecasting with generalized additive models. Proceedings of ISAP power, pages 593-600.

P. Pompey, A. Bondu, Y. Goude, and M. Sinn. 2015. Massive-Scale Simulation of Electrical Load in Smart Grids Using Generalized Additive Models. Springer International Publishing, Cham, pages 193-212.

S. Rani and G. Sikka. 2012. Recent techniques of clustering of time series data: A survey. International Journal of Computer Applications.

E. Shi, T.-H. H. Chan, E. G. Rieffel, R. Chow, and D. Song. 2011. Privacy-preserving aggregation of time-series data. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2011, San Diego, California, USA, 6th February - 9th February 2011.

L. Shou, X. Shang, K. Chen, G. Chen, and C. Zhang. 2013. Supporting pattern-preserving anonymization for time-series data. IEEE Transactions on Knowledge and Data Engineering 25(4):877-892. https://doi.org/10.1109/TKDE.2011.249.

A. Solanas and A. Ballesté. 2006. V-MDAV: Variable group size multivariate microaggregation.

B. Suri, U. D. Bordoloi, and P. Eles. 2012. A scalable GPU-based approach to accelerate the multiple-choice knapsack problem. In 2012 Design, Automation and Test in Europe Conference and Exhibition (DATE), pages 1126-1129. https://doi.org/10.1109/DATE.2012.6176665.

L. Sweeney. 2002. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(05):557-570. https://doi.org/10.1142/S0218488502001648.

V. Torra. 2017. Data Privacy: Foundations, New Developments and the Big Data Challenge (Studies in Big Data). Springer.

G. Tóth, Z. Hornák, et al. 2004. Measuring anonymity revisited.

V. Tudor, M. Almgren, and M. Papatriantafilou. 2013. Analysis of the impact of data granularity on privacy for the smart grid. In WPES '13: Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society.

S. Venkatasubramanian. 2008. Measures of Anonymity. Springer US, Boston, MA, pages 81-103.

X. Xu and M. Numao. 2015. An efficient generalized clustering method for achieving k-anonymization. In 2015 Third International Symposium on Computing and Networking (CANDAR), pages 499-502. https://doi.org/10.1109/CANDAR.2015.61.

G. Zhang, X. Liu, and Y. Yang. 2015. Time-series pattern based effective noise generation for privacy protection on cloud. IEEE Transactions on Computers 64(5):1456-1469. https://doi.org/10.1109/TC.2014.2298013.

Q. Zhang, N. Koudas, D. Srivastava, and T. Yu. 2007. Aggregate query answering on anonymized tables. In 2007 IEEE 23rd International Conference on Data Engineering, pages 116-125. https://doi.org/10.1109/ICDE.2007.367857.

B. Zhou, J. Pei, and W. Luk. 2008. A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explor. Newsl. 10:12-22. https://doi.org/10.1145/1540276.1540279.