=Paper= {{Paper |id=Vol-2699/paper12 |storemode=property |title=An Anonymiser Tool for Sensitive Graph Data |pdfUrl=https://ceur-ws.org/Vol-2699/paper12.pdf |volume=Vol-2699 |authors=Charini Nanayakkara,Peter Christen,Thilina Ranbaduge |dblpUrl=https://dblp.org/rec/conf/cikm/NanayakkaraCR20 }} ==An Anonymiser Tool for Sensitive Graph Data== https://ceur-ws.org/Vol-2699/paper12.pdf
An Anonymiser Tool for Sensitive Graph Data
Charini Nanayakkara, Peter Christen and Thilina Ranbaduge
Research School of Computer Science, The Australian National University, Canberra, ACT 2600 Australia


                                          Abstract
                                          Analysis of graph data is extensively conducted in numerous domains to learn the relationships between and behaviour of
                                          connected entities. Many graphs contain sensitive data, for example social network users and their posts, or genealogical
                                          records such as birth and death certificates. This has limited the use and publication of such sensitive graph data sets. While
                                          there are various techniques available to anonymise tabular data, anonymising graph data while maintaining the node and
                                          edge structure of the original graph, such as node attributes and the similarities between nodes, is a challenging task. In this
                                          paper, we present a web tool which can anonymise sensitive graph data while maintaining the similarity structure of the
                                          original graph by employing a cluster-based mapping of sensitive to public attribute values, as well as randomly shifting date
                                          values. Our demonstration will illustrate the tool on two example data sets of historical birth records.

                                          Keywords
                                          Graph anonymisation, sensitive data, data privacy, data generation, cluster mapping, string similarity


1. Introduction                                                                                                    anonymise a graph such that identifying the real-world
                                                                                                                   entities represented by nodes in the graph is made dif-
Representing databases as graphs is often necessary                                                                ficult, while it is still possible to conduct analysis on
in modern data analysis tasks due to many databases                                                                the anonymised graph with suitable machine learning
having inter-relationships between their records. For                                                              algorithms.
example, a social network data set would represent                                                                    Therefore, existing graph anonymisation tools are
individuals as nodes and their relationships as edges,                                                             not useful in generating an anonymised, human in-
whereas several census data sets may be connected                                                                  terpretable version of a given sensitive graphdata set.
with edges to show the records that potentially belong                                                             An anonymised human interpretable data set is how-
to members of the same family (such as siblings). The                                                              ever important to allow inspection of the data set in
types of data which require a graph representation are                                                             the context of transparency of how a machine learn-
often sensitive (such as data that represent real peo-                                                             ing algorithm performs on that data set, or to be able
ple) and therefore cannot be shared publicly. This re-                                                             to publish a data set for educational purposes.
quires the anonymisation of a graph such that sensi-                                                                  In this paper, we present our web tool DOYEN (Data
tive data cannot be reidentified, while the structure of                                                           anOnYmiser for sENsitive Graph Data) which can gen-
the graph, the relationships between nodes and their                                                               erate an anonymised version of a sensitive graph data
attributes, are being preserved.                                                                                   set while maintaining its graph structure. Our method
   Anonymisation of graph data is a topic that has been                                                            replaces the sensitive attribute values of nodes with
explored by several previous studies [1, 2]. However,                                                              values from public lookup tables using a cluster based
these studies focus on protecting a data set against pri-                                                          mapping technique. The initial implementation of the
vacy attacks [3, 4] and therefore often compromise the                                                             DOYEN tool anonymises family data where graph con-
structure of the original graph by removing or adding                                                              nectivity represents sibling relationships. Such fam-
edges and/or nodes that are vulnerable to reidentifica-                                                            ily data are required for numerous social science stud-
tion due to their unique characteristics. Furthermore,                                                             ies and for the reconstruction of (historical) popula-
the data sets resulting from existing anonymisation                                                                tions [5].
techniques do not necessarily have to be interpretable                                                                We anticipate that DOYEN would be instrumental
by humans. Rather, the aim of these techniques is to                                                               in mitigating hindrances to such research work due
Proceedings of the CIKM 2020 Workshops, October 19-20, 2020,
                                                                                                                   to the inability of publishing sensitive graph data. We
Galway, Ireland                                                                                                    next provide a brief overview of the DOYEN system. In
email: charini.nanayakkara@anu.edu.au (C. Nanayakkara);                                                            Sect. 3 we describe the cluster-based mapping, and in
peter.christen@anu.edu.au (P. Christen);                                                                           Sect. 4 the process of generating anonymised data. We
thilina.ranbaduge@anu.edu.au (T. Ranbaduge)
orcid: 0000-0002-7603-1845 (C. Nanayakkara);
                                                                                                                   then discuss in Sect. 5 the demonstration of DOYEN
0000-0003-3435-2015 (P. Christen); 0000-0001-5405-3704 (T.                                                         and illustrate its front end.
Ranbaduge)
                                    ยฉ 2020 Copyright for this paper by its authors. Use permitted under Creative
                                    Commons License Attribution 4.0 International (CC BY 4.0).
 CEUR
 Workshop
 Proceedings
               http://ceur-ws.org
               ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)
Public lookup tables                            Cluster based mapping          Anonymised records    Algorithm 1: Cluster based attribute value mapping
 Female FN       Male FN       Surname                                           Family: F1 R1
 Ann             Bob           Fernandes
                                                 Mathieson        Fernandes      Name:               Input:
                                                 Matthewson       Fernanders
 Anna            Bobby         Fernanders
                                                 Matheson         Fernando
                                                                                   Richie            - ๐ƒ๐‘  : Sensitive input data set       - ๐€: Set of sensitive attributes
 Elizabeth       Richard       Fernando                                            Fernandes
 ...             Richie
                 ...
                               Whitacre                                          DOB:                - ๐‹: Attribute value lookup tables
                               Whittacre                                           08โˆ’09โˆ’2005
                               ...
                                                 John             Bob
                                                                                  Family: F1 R2      Output:
Input parameters                                 Jonah            Bobby
                                                                                  Name:              - ๐Œ: Attribute value mapping table
 a) Date range: 01โˆ’01โˆ’2000 to 31โˆ’12โˆ’2010                                            Bobby
 b) Random date shift range: +/โˆ’ 15 days
                                                 Steven           Richie
                                                                                    Fernando
                                                                                  DOB:
                                                                                                      1: ๐Œ = { },       // Initialise empty attribute value mapping table
Records from sensitive input graph               Stevenson        Richard           15โˆ’04โˆ’2007        2: for ๐‘Ž๐‘– โˆˆ ๐€ do:              // Iterate over sensitive attributes
Family: F1 R1 Family: F1 R2 Family: F2 R3                                         Family: F2 R3       3:   ๐‚๐‘‘ = ๐†๐ž๐ญ๐‚๐ฅ๐ฎ๐ฌ๐ญ๐ž๐ซ(๐ƒ๐‘  .๐š๐‘– )           // Cluster input attribute values
Name:           Name:            Name:                                            Name:
  Steven           Jonah           John                                             Bob
                                                                                                      4:   ๐‚๐‘™ = ๐†๐ž๐ญ๐‚๐ฅ๐ฎ๐ฌ๐ญ๐ž๐ซ(๐‹.๐š๐‘– )           // Cluster lookup attribute values
                                                 Maclean          Whittacre
  Mathieson        Matheson        Mclean
                                                 Mclean           Whitacre
                                                                                    Whitacre          5:   for ๐œ๐‘— โˆˆ ๐‚๐‘‘ do:           // Iterate over attribute value clusters
DOB:            DOB:             DOB:                                             DOB:
  05โˆ’06โˆ’1895       06โˆ’01โˆ’1897      18โˆ’09โˆ’1900                                       02โˆ’12โˆ’2010
                                                                                                      6:      ๐œโ€ฒ๐‘— = ๐๐ž๐ฌ๐ญ๐Œ๐š๐ญ๐œ๐ก(๐œ๐‘— .๐ฌ, ๐œ๐‘— .๐ฅ, |๐œ๐‘— .๐ฏ|, ๐‚๐‘™ )  // Find best match
                                                                                                      7:      ๐‚๐‘™ .๐‘Ÿ๐‘’๐‘š๐‘œ๐‘ฃ๐‘’(๐œโ€ฒ๐‘— ) // Remove selected cluster from lookup clusters
Figure 1: Overview of the DOYEN system to generate                                                    8:      ๐Œ๐š๐ฉ๐•๐š๐ฅ๐ฎ๐ž๐ฌ(๐Œ, ๐œ๐‘— .๐ฏ, ๐œโ€ฒ๐‘— .๐ฏ) // Map attribute values in clusters
                                                                                                      9: return ๐Œ
anonymised graph data.



2. System Overview                                                                                  values. The data set ๐ƒ๐‘  can be represented as a graph
                                                                                                    ๐†๐‘  = (๐•๐‘  , ๐„๐‘  ), where a node (vertex) in ๐•๐‘  represents
The DOYEN tool anonymises a given sensitive input                                                   a record ๐‘Ÿ๐‘– โˆˆ ๐ƒ๐‘  , and an edge in ๐„๐‘  corresponds to the
graph data set by replacing sensitive attribute values                                              pairwise attribute similarity of the record pair (๐‘Ÿ๐‘– , ๐‘Ÿ๐‘— ).
with values from public lookup tables. At the same                                                  Such similarities, as calculated by comparing attribute
time DOYEN maintains the structure of the graph and                                                 values, are often used to show the strength or impor-
the similarities between its nodes by conducting at-                                                tance of relationships in graph data [7]. We refer to
tribute value replacement in a way that preserves the                                               the set of sensitive attributes in ๐ƒ๐‘  as ๐€ = {๐‘Ž1 , โ€ฆ , ๐‘Ž๐‘› },
similarities between nodes.                                                                         and the values from each sensitive attribute ๐‘Ž๐‘– in the
   Figure 1 illustrates a high level overview of DOYEN.                                             input data set and the lookup tables as ๐ƒ๐‘  .๐š๐‘– and ๐‹.๐š๐‘– ,
The sensitive graph data set and one or more lookup                                                 respectively.
tables with attribute values extracted from a public                                                   Assuming that the sensitive attributes ๐€ have been
data source are provided as input to DOYEN. As ex-                                                  used to calculate the pairwise similarities between rec-
ample we show a family graph data set where siblings                                                ords in ๐ƒ๐‘  , we need to ensure that these similarities
have the same family ID and colour, and highly simi-                                                are maintained in the anonymised data set we gener-
lar first name and last name pairs (where similarity is                                             ate. This means that we need to retain the similarity
calculated with an approximate string similarity func-                                              structure of ๐†๐‘  = (๐•๐‘  , ๐„๐‘  ) in the generated anonymised
tion [6]) are shown as edges (dashed lines). DOYEN                                                  graph ๐†๐‘Ž = (๐•๐‘Ž , ๐„๐‘Ž ) which represents the anonymised
first clusters the sensitive attribute values from the in-                                          data set ๐ƒ๐‘Ž . To achieve this goal, we use a one-to-one
put graph data set and each public lookup table sep-                                                cluster mapping approach where we map an attribute
arately, and then maps the generated clusters of sen-                                               value cluster from the sensitive input data set ๐ƒ๐‘  to an
sitive values to the clusters generated from a lookup                                               attribute value cluster from the public lookup tables ๐‹
table using a mapping approach as we describe below.                                                such that the intra cluster similarities are highly simi-
   For each node in the input graph data set an anony-                                              lar across the two clusters.
mised node is then generated by replacing each sen-                                                    Algorithm 1 outlines this approach to anonymise
sitive attribute value with the corresponding mapped                                                the sensitive attribute values in a given graph data set.
value from a public lookup table. If the graph data con-                                            The input to the algorithm are the set of attributes ๐€
tains date values, as shown in our example in Fig. 1,                                               (such as names and addresses of people), the sensitive
they are anonymised by shifting dates within a user                                                 graph data set ๐ƒ๐‘  , and the lookup tables ๐‹ which con-
specified range as we discuss in Sect 4.                                                            tain values that attributes ๐‘Ž๐‘– โˆˆ ๐€ could assume.
                                                                                                       In lines 2 to 4, the algorithm iterates over the sen-
                                                                                                    sitive attributes ๐‘Ž๐‘– โˆˆ ๐€, and clusters the correspond-
3. Cluster based Attribute Value                                                                    ing attribute values in the input data set ๐ƒ๐‘  .๐š๐‘– and the
   Mapping                                                                                          lookup table ๐‹.๐š๐‘– . Next, in lines 5 and 6, we iterate
                                                                                                    over the attribute value clusters ๐œ๐‘— โˆˆ ๐‚๐‘‘ from the in-
We now describe our anonymisation approach for sen-                                                 put data set ๐ƒ๐‘  and find the best matching attribute
sitive attribute values. Assume we have a sensitive in-                                             value cluster from the lookup attribute value clusters
put data set ๐ƒ๐‘  containing records ๐‘Ÿ โˆˆ ๐ƒ๐‘  that repre-                                               ๐‚๐‘™ . For a given attribute value cluster ๐œ๐‘— , we obtain its
sent entities, and external lookup tables ๐‹ of attribute                                            sorted values ๐œ๐‘— .๐ฏ = [๐‘ฃ1 , ๐‘ฃ2 , โ€ฆ ๐‘ฃ๐‘š ], the vector of pair-
wise similarities of attribute values in cluster ๐œ๐‘— .๐ฌ =                Table 1
[๐‘ ๐‘ฃ1 ,๐‘ฃ2 , โ€ฆ , ๐‘ ๐‘ฃ1 ,๐‘ฃ๐‘š , ๐‘ ๐‘ฃ2 ,๐‘ฃ3 , โ€ฆ , ๐‘ ๐‘ฃ๐‘šโˆ’1 ,๐‘ฃ๐‘š ], and the attribute   Number of unique attribute values
value length vector ๐œ๐‘— .๐ฅ = [|๐‘ฃ1 |, |๐‘ฃ2 |, โ€ฆ , |๐‘ฃ๐‘š |]. The best             Attribute       First name      Surname     Address
matching cluster for ๐œ๐‘— is identified with the function                                     F        M
๐๐ž๐ฌ๐ญ๐Œ๐š๐ญ๐œ๐ก(), which takes as input the pairwise simi-                      Lookup tables   33,334   19,151     51,350      2,255
larities vector ๐œ๐‘— .๐ฌ, the vector of attribute value lengths                Data set 1     274      231        152         124
                                                                            Data set 2     375      333        266         219
๐œ๐‘— .๐ฅ, the number of attribute values in cluster |๐œ๐‘— .๐ฏ|, and
the set of lookup attribute value clusters ๐‚๐‘™ . The func-
tion ๐๐ž๐ฌ๐ญ๐Œ๐š๐ญ๐œ๐ก() finds the most similar lookup clus-
ter ๐œโ€ฒ๐‘— to ๐œ๐‘— from the set of lookup clusters ๐‚๐‘™ (which                 create a corresponding date ๐‘‘1โ€ฒ where ๐‘‘๐‘š๐‘–๐‘› โ‰ค ๐‘‘1โ€ฒ โ‰ค
are of the same size as ๐œ๐‘— ) using Euclidean distances                  ๐‘‘๐‘š๐‘Ž๐‘ฅ . Then, the remaining date values in the record
calculated between their similarity vectors and their                   group are shifted by ๐‘‘1โ€ฒ โˆ’ ๐‘‘1 days and each newly gen-
attribute value length vectors. The cluster ๐œโ€ฒ๐‘— โˆˆ ๐‚๐‘™                    erated date value, excluding the earliest date ๐‘‘1โ€ฒ , is ran-
with the minimal weighted distances is selected to be                   domly shifted by ยฑฮ”๐‘‘ days such that any temporal
mapped to cluster ๐œ๐‘— . If ๐‚๐‘™ does not contain clusters                  constraints across the generated date values are main-
of size |๐œ๐‘— .๐ฏ| then subsets of clusters from ๐‚๐‘™ which are              tained. For example, if ๐ƒ๐‘  contains birth records of sib-
larger than |๐œ๐‘— .๐ฏ| are considered.                                     ling groups, then the temporal constraints of the birth
    In line 7, we remove the selected cluster ๐œโ€ฒ๐‘— from                  dates would reflect that it is not possible for two births
๐‚๐‘™ to obtain unique cluster mappings. In line 8, we                     by the same mother to be less than nine months apart
then map the sorted attribute values in ๐œ๐‘— .๐ฏ to the cor-               unless they are twins [8]. The anonymised date values
responding values in cluster ๐œโ€ฒ๐‘— .๐ฏ, such that each at-                 are then used in the synthetic records ๐‘Ÿ๐‘–โ€ฒ โˆˆ ๐ƒ๐‘Ž .
tribute value from data set ๐ƒ๐‘  has a unique mapping
to a value in the lookup table ๐‹.
                                                                        5. Demonstration
4. Generating an Anonymised                                             The initial implementation of the DOYEN tool demon-
                                                                        strates sensitive graph data anonymisation using two
   Data Set                                                             input data sets which we generated based on two real-
Once the attribute value mapping has been completed,                    world historical birth data sets. These data sets con-
DOYEN generates the anonymised graph data set ๐ƒ๐‘Ž                        tain name and address variations to help demonstrate
for the sensitive input graph data set ๐ƒ๐‘  in the follow-                the capability of DOYEN to anonymise a graph while
ing manner.                                                             maintaining its similarity structure. The example birth
   For each record from the input data set ๐‘Ÿ๐‘– โˆˆ ๐ƒ๐‘  , we                 data sets contain several twin births as well as missing
create a synthetic record ๐‘Ÿ๐‘–โ€ฒ โˆˆ ๐ƒ๐‘Ž , and for each sen-                  values for the last names of fathers and children, as
sitive attribute value in ๐‘Ÿ๐‘– , we replace the original at-              seen in the original birth data sets. Lookup tables con-
tribute value with the corresponding mapped attribute                   taining values for the sensitive attributes first name,
value from ๐Œ.                                                           surname, and address, were generated using a publicly
   Given many data sets have dates associated with                      available US voter database (see: https://dl.ncsbe.gov)
their records (such as dates of birth or dates of hospital              as well as Australian addresses. Table 1 summarises
admission), DOYEN also provides an anonymisation                        the number of unique values in each sensitive attribute
approach for date values while maintaining the tem-                     in the lookup tables and the two example data sets. If
poral distances across connected records. Prior to date                 attribute values contained more than one token (such
anonymisation, the records in ๐ƒ๐‘  are grouped such                       as having two names) they were separated into indi-
that related records are contained in a single group.                   vidual tokens as a preprocessing step.
These groupings reflect records that represent a re-                       As shown in Table 1, the number of values available
lated group of entities, such as the siblings of the same               for each attribute is significantly larger in the lookup
family. Subsequent to grouping records, DOYEN sorts                     tables compared to the input graph data set. Having
the date values associated with the records in a group.                 more attribute values in the lookup tables will help
The tool allows the user to define a minimum (๐‘‘๐‘š๐‘–๐‘› )                    anonymise the sensitive input data set in a manner
and a maximum (๐‘‘๐‘š๐‘Ž๐‘ฅ ) date within which they want                       that better preserves its graph similarity structure.
the earliest date value from a specific group to appear.                   Fig. 2 shows the input screen of the DOYEN web
Thus, for the earliest date ๐‘‘1 from a record group, we                  tool. The two buttons Example 1 and Example 2 will
                                                                        load one of the input graph data sets and suitable pa-
Figure 2: Input screen of the DOYEN tool.



rameter values. A sample of the input data set can be
viewed by clicking on the View Input Data Sample but-
ton, whereas the full data set can be downloaded as
well. The four parameters to be specified are:        Figure 3: Sample of attribute value mappings as generated
                                                             by DOYEN.
โ€ข Group size and their counts: This parameter allows
  control of the number of families of a given size that
  are to be generated. When a sensitive graph data         Jaro-Winkler for names and the Jaccard q-gram based
  set is loaded, this field is automatically populated     approach for addresses) [6]. Subsequently, the attribute
  with the family size distribution of the loaded input    value clusters are identified with the connected com-
  data set. The user can change the values as desired.     ponent clustering approach with a similarity threshold
  However, the current implementation restricts the        of 0.8 [9]. Next, all attribute value clusters from the in-
  user to specifying only up to a maximum number           put graph are mapped to clusters from the lookup ta-
  of families as appearing in the input data set, for a    ble using the Euclidean distance vector similarity mea-
  given family size. That is, if there are ten families    sure, as we described in Algorithm 1. Since the vector
  (clusters) of size five in the input data set, then cur- of similarities between attribute value pairs in clusters
  rently the user can only generate a maximum of ten       (๐œ๐‘— .๐ฌ) is more important than the vector of attribute
  families of size five.                                   value lengths (๐œ๐‘— .๐ฅ), we assign a relatively higher weight
 โ€ข Minimum/Maximum dates for the first birth: For ๐‘ค(๐‘ค > 0.5) to ๐œ๐‘— .๐ฌ and a weight of 1 โˆ’ ๐‘ค to ๐œ๐‘— .๐ฅ when
   each family in the anonymised data set, the first calculating the overall cluster similarity.
   birth record in the family is expected to have a birth      After the attribute value mapping is completed, the
   date within (including) the given minimum and max-      DOYEN      tool generates the anonymised graph data set
   imum date range (๐‘‘๐‘š๐‘–๐‘› and ๐‘‘๐‘š๐‘Ž๐‘ฅ ), where ๐‘‘๐‘š๐‘–๐‘› <          by   replacing  the attribute values of each record in the
   ๐‘‘๐‘š๐‘Ž๐‘ฅ .                                                  input   data set with the corresponding mapped attribute
                                                           value, and by shifting the dates of birth as described in
 โ€ข Random time offset: This is the ยฑฮ”๐‘‘ time range Section 4. Once the anonymised data is generated, the
   which is used to further shift (randomly perturb) user can view a sample of the attribute value mappings
   the dates of birth in each anonymised synthetic fam- as shown in Fig. 3. Furthermore, the user can view a
   ily (except the first date) as we described in Sec- sample of the anonymised, synthetic data set created
   tion 4.                                                 by DOYEN, or download the full data set. Fig. 4 shows
                                                           a sample of the sensitive input data set and the corre-
   The user has the flexibility of editing the values with sponding anonymised records. Notice how, for exam-
which the parameter fields are automatically populated ple, the address values โ€˜monkstadtโ€™ and โ€˜monkstodtโ€™
after an example data set has been loaded. When the (in R8 and R9) with a high pairwise string similarity
Generate Data button is clicked, the anonymisation and have been replaced with values โ€˜narembureโ€™ and โ€˜narem-
data generation process is executed in the back end. burnโ€™ which have a similar pairwise similarity.
The tool internally calculates the string similarity of        To illustrate the quality of the anonymised graphs
attribute values by first applying a blocking technique generated by DOYEN, Fig. 5 shows the distribution of
[9] (with phonetic encoding for names and Locality the pairwise similarity for record pairs from each ex-
Sensitive Hashing for addresses) and then applying a ample data set. For each record pair, the similarity is
pairwise string similarity calculation method (we used
                                                                                                                        ity of anonymising sensitive graph data sets while pre-
                                                                                                                        serving their similarity structure. As future work we
                                                                                                                        intend to further improve DOYEN such that it is capa-
                                                                                                                        ble of maintaining geographic structures in the graph
                                                                                                                        data (such as maintaining similar distances between
                                                                                                                        addresses in input records and the corresponding gen-
                                                                                                                        erated anonymised output records). Furthermore, we
                                                                                                                        plan to support generating synthetic data sets of dif-
                                                                                                                        ferent cluster size distributions without restricting the
                                                                                                                        user to a maximum distribution as limited by the input
                                                                                                                        data set. Such flexibility will allow users to generate
                                                                                                                        larger or smaller data sets as suited for their research
                                                                                                                        requirements.


                                                                                                                        Acknowledgments
                                                                                                                        This research was partially funded by the Australian
                                                                                                                        Research Council under grant DP160101934.

Figure 4: Sample of the sensitive input data set and the
anonymised data set as generated by DOYEN.
                                                                                                                        References
                                                                                                                        [1] T. Feder, S. U. Nabar, E. Terzi, Anonymizing
                             105       Example data set 1               105        Example data set 2                       graphs, arXiv Preprint (2008). URL: http://arxiv.
                                      Sensitive         Anonymised                 Sensitive         Anonymised
                                                                                                                            org/abs/0810.5578.
  Number of record pairs (log)




                             104                                        104
                                                                                                                        [2] L.-E. Wang, X. Li, A graph-based multifold model
                             103                                        103                                                 for anonymizing data with attributes of multiple
                             10   2
                                                                        102                                                 types, CaS 72 (2018) 122โ€“135.
                                                                                                                        [3] S. Das, O. Egecioglu, A. Abbadi, Anรณnimos: An
                             101                                        101
                                                                                                                            LP-based approach for anonymizing weighted so-
                             1000.0                                                                                         cial network graphs, IEEE TKDE 24 (2012) 590โ€“
                                                                           0
                                      0.2   0.4   0.6      0.8       1.0 10 0.0   0.2    0.4   0.6      0.8       1.0
                                                             Pairwise similarity
                                                                                                                            604.
                                                                                                                        [4] B. Zhou, J. Pei, W. Luk, A brief survey on
Figure 5: Comparison of pairwise record similarities of the
                                                                                                                            anonymization techniques for privacy preserving
sensitive input and the generated anonymised data sets.
                                                                                                                            publishing of social network data, SIGKDD Ex-
                                                                                                                            plor. Newsl. 10 (2008) 12โ€“22.
                                                                                                                        [5] G. Bloothooft, P. Christen, K. Mandemakers,
calculated using the Jaro-Winkler similarity measure                                                                        M. Schraagen, Population Reconstruction,
on names and the Jaccard q-gram similarity measure                                                                          Springer, Cham, 2015.
on addresses [6], followed by an averaging of these                                                                     [6] G. Navarro, A guided tour to approximate string
values. As can be seen from this figure, the similarity                                                                     matching, ACM Computing Surveys 33 (2001) 31โ€“
distribution of both the sensitive input data set and the                                                                   88.
generated anonymised data set are highly similar for                                                                    [7] R. Xiang, J. Neville, M. Rogati, Modeling rela-
each of the two example data sets. This shows the ca-                                                                       tionship strength in online social networks, in:
pability of DOYEN to anonymise sensitive graph data,                                                                        WWW, ACM, Raleigh, NC, 2010, pp. 981โ€“990.
while maintaining its structure, as reflected by these                                                                  [8] C. Nanayakkara, P. Christen, T. Ranbaduge, Ro-
pairwise similarities.                                                                                                      bust temporal graph clustering for group record
                                                                                                                            linkage, in: PAKDD, Springer, Macau, 2019.
                                                                                                                        [9] P. Christen, Data Matching โ€“ Concepts and tech-
6. Conclusion and Future Work                                                                                               niques for record linkage, entity resolution, and
In this paper, we have presented the initial implemen-                                                                      duplicate detection, Springer, Heidelberg, 2012.
tation of our web tool DOYEN which has the capabil-