=Paper=
{{Paper
|id=Vol-2699/paper12
|storemode=property
|title=An Anonymiser Tool for Sensitive Graph Data
|pdfUrl=https://ceur-ws.org/Vol-2699/paper12.pdf
|volume=Vol-2699
|authors=Charini Nanayakkara,Peter Christen,Thilina Ranbaduge
|dblpUrl=https://dblp.org/rec/conf/cikm/NanayakkaraCR20
}}
==An Anonymiser Tool for Sensitive Graph Data==
An Anonymiser Tool for Sensitive Graph Data Charini Nanayakkara, Peter Christen and Thilina Ranbaduge Research School of Computer Science, The Australian National University, Canberra, ACT 2600 Australia Abstract Analysis of graph data is extensively conducted in numerous domains to learn the relationships between and behaviour of connected entities. Many graphs contain sensitive data, for example social network users and their posts, or genealogical records such as birth and death certificates. This has limited the use and publication of such sensitive graph data sets. While there are various techniques available to anonymise tabular data, anonymising graph data while maintaining the node and edge structure of the original graph, such as node attributes and the similarities between nodes, is a challenging task. In this paper, we present a web tool which can anonymise sensitive graph data while maintaining the similarity structure of the original graph by employing a cluster-based mapping of sensitive to public attribute values, as well as randomly shifting date values. Our demonstration will illustrate the tool on two example data sets of historical birth records. Keywords Graph anonymisation, sensitive data, data privacy, data generation, cluster mapping, string similarity 1. Introduction anonymise a graph such that identifying the real-world entities represented by nodes in the graph is made dif- Representing databases as graphs is often necessary ficult, while it is still possible to conduct analysis on in modern data analysis tasks due to many databases the anonymised graph with suitable machine learning having inter-relationships between their records. For algorithms. example, a social network data set would represent Therefore, existing graph anonymisation tools are individuals as nodes and their relationships as edges, not useful in generating an anonymised, human in- whereas several census data sets may be connected terpretable version of a given sensitive graphdata set. with edges to show the records that potentially belong An anonymised human interpretable data set is how- to members of the same family (such as siblings). The ever important to allow inspection of the data set in types of data which require a graph representation are the context of transparency of how a machine learn- often sensitive (such as data that represent real peo- ing algorithm performs on that data set, or to be able ple) and therefore cannot be shared publicly. This re- to publish a data set for educational purposes. quires the anonymisation of a graph such that sensi- In this paper, we present our web tool DOYEN (Data tive data cannot be reidentified, while the structure of anOnYmiser for sENsitive Graph Data) which can gen- the graph, the relationships between nodes and their erate an anonymised version of a sensitive graph data attributes, are being preserved. set while maintaining its graph structure. Our method Anonymisation of graph data is a topic that has been replaces the sensitive attribute values of nodes with explored by several previous studies [1, 2]. However, values from public lookup tables using a cluster based these studies focus on protecting a data set against pri- mapping technique. The initial implementation of the vacy attacks [3, 4] and therefore often compromise the DOYEN tool anonymises family data where graph con- structure of the original graph by removing or adding nectivity represents sibling relationships. Such fam- edges and/or nodes that are vulnerable to reidentifica- ily data are required for numerous social science stud- tion due to their unique characteristics. Furthermore, ies and for the reconstruction of (historical) popula- the data sets resulting from existing anonymisation tions [5]. techniques do not necessarily have to be interpretable We anticipate that DOYEN would be instrumental by humans. Rather, the aim of these techniques is to in mitigating hindrances to such research work due Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, to the inability of publishing sensitive graph data. We Galway, Ireland next provide a brief overview of the DOYEN system. In email: charini.nanayakkara@anu.edu.au (C. Nanayakkara); Sect. 3 we describe the cluster-based mapping, and in peter.christen@anu.edu.au (P. Christen); Sect. 4 the process of generating anonymised data. We thilina.ranbaduge@anu.edu.au (T. Ranbaduge) orcid: 0000-0002-7603-1845 (C. Nanayakkara); then discuss in Sect. 5 the demonstration of DOYEN 0000-0003-3435-2015 (P. Christen); 0000-0001-5405-3704 (T. and illustrate its front end. Ranbaduge) ยฉ 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) Public lookup tables Cluster based mapping Anonymised records Algorithm 1: Cluster based attribute value mapping Female FN Male FN Surname Family: F1 R1 Ann Bob Fernandes Mathieson Fernandes Name: Input: Matthewson Fernanders Anna Bobby Fernanders Matheson Fernando Richie - ๐๐ : Sensitive input data set - ๐: Set of sensitive attributes Elizabeth Richard Fernando Fernandes ... Richie ... Whitacre DOB: - ๐: Attribute value lookup tables Whittacre 08โ09โ2005 ... John Bob Family: F1 R2 Output: Input parameters Jonah Bobby Name: - ๐: Attribute value mapping table a) Date range: 01โ01โ2000 to 31โ12โ2010 Bobby b) Random date shift range: +/โ 15 days Steven Richie Fernando DOB: 1: ๐ = { }, // Initialise empty attribute value mapping table Records from sensitive input graph Stevenson Richard 15โ04โ2007 2: for ๐๐ โ ๐ do: // Iterate over sensitive attributes Family: F1 R1 Family: F1 R2 Family: F2 R3 Family: F2 R3 3: ๐๐ = ๐๐๐ญ๐๐ฅ๐ฎ๐ฌ๐ญ๐๐ซ(๐๐ .๐๐ ) // Cluster input attribute values Name: Name: Name: Name: Steven Jonah John Bob 4: ๐๐ = ๐๐๐ญ๐๐ฅ๐ฎ๐ฌ๐ญ๐๐ซ(๐.๐๐ ) // Cluster lookup attribute values Maclean Whittacre Mathieson Matheson Mclean Mclean Whitacre Whitacre 5: for ๐๐ โ ๐๐ do: // Iterate over attribute value clusters DOB: DOB: DOB: DOB: 05โ06โ1895 06โ01โ1897 18โ09โ1900 02โ12โ2010 6: ๐โฒ๐ = ๐๐๐ฌ๐ญ๐๐๐ญ๐๐ก(๐๐ .๐ฌ, ๐๐ .๐ฅ, |๐๐ .๐ฏ|, ๐๐ ) // Find best match 7: ๐๐ .๐๐๐๐๐ฃ๐(๐โฒ๐ ) // Remove selected cluster from lookup clusters Figure 1: Overview of the DOYEN system to generate 8: ๐๐๐ฉ๐๐๐ฅ๐ฎ๐๐ฌ(๐, ๐๐ .๐ฏ, ๐โฒ๐ .๐ฏ) // Map attribute values in clusters 9: return ๐ anonymised graph data. 2. System Overview values. The data set ๐๐ can be represented as a graph ๐๐ = (๐๐ , ๐๐ ), where a node (vertex) in ๐๐ represents The DOYEN tool anonymises a given sensitive input a record ๐๐ โ ๐๐ , and an edge in ๐๐ corresponds to the graph data set by replacing sensitive attribute values pairwise attribute similarity of the record pair (๐๐ , ๐๐ ). with values from public lookup tables. At the same Such similarities, as calculated by comparing attribute time DOYEN maintains the structure of the graph and values, are often used to show the strength or impor- the similarities between its nodes by conducting at- tance of relationships in graph data [7]. We refer to tribute value replacement in a way that preserves the the set of sensitive attributes in ๐๐ as ๐ = {๐1 , โฆ , ๐๐ }, similarities between nodes. and the values from each sensitive attribute ๐๐ in the Figure 1 illustrates a high level overview of DOYEN. input data set and the lookup tables as ๐๐ .๐๐ and ๐.๐๐ , The sensitive graph data set and one or more lookup respectively. tables with attribute values extracted from a public Assuming that the sensitive attributes ๐ have been data source are provided as input to DOYEN. As ex- used to calculate the pairwise similarities between rec- ample we show a family graph data set where siblings ords in ๐๐ , we need to ensure that these similarities have the same family ID and colour, and highly simi- are maintained in the anonymised data set we gener- lar first name and last name pairs (where similarity is ate. This means that we need to retain the similarity calculated with an approximate string similarity func- structure of ๐๐ = (๐๐ , ๐๐ ) in the generated anonymised tion [6]) are shown as edges (dashed lines). DOYEN graph ๐๐ = (๐๐ , ๐๐ ) which represents the anonymised first clusters the sensitive attribute values from the in- data set ๐๐ . To achieve this goal, we use a one-to-one put graph data set and each public lookup table sep- cluster mapping approach where we map an attribute arately, and then maps the generated clusters of sen- value cluster from the sensitive input data set ๐๐ to an sitive values to the clusters generated from a lookup attribute value cluster from the public lookup tables ๐ table using a mapping approach as we describe below. such that the intra cluster similarities are highly simi- For each node in the input graph data set an anony- lar across the two clusters. mised node is then generated by replacing each sen- Algorithm 1 outlines this approach to anonymise sitive attribute value with the corresponding mapped the sensitive attribute values in a given graph data set. value from a public lookup table. If the graph data con- The input to the algorithm are the set of attributes ๐ tains date values, as shown in our example in Fig. 1, (such as names and addresses of people), the sensitive they are anonymised by shifting dates within a user graph data set ๐๐ , and the lookup tables ๐ which con- specified range as we discuss in Sect 4. tain values that attributes ๐๐ โ ๐ could assume. In lines 2 to 4, the algorithm iterates over the sen- sitive attributes ๐๐ โ ๐, and clusters the correspond- 3. Cluster based Attribute Value ing attribute values in the input data set ๐๐ .๐๐ and the Mapping lookup table ๐.๐๐ . Next, in lines 5 and 6, we iterate over the attribute value clusters ๐๐ โ ๐๐ from the in- We now describe our anonymisation approach for sen- put data set ๐๐ and find the best matching attribute sitive attribute values. Assume we have a sensitive in- value cluster from the lookup attribute value clusters put data set ๐๐ containing records ๐ โ ๐๐ that repre- ๐๐ . For a given attribute value cluster ๐๐ , we obtain its sent entities, and external lookup tables ๐ of attribute sorted values ๐๐ .๐ฏ = [๐ฃ1 , ๐ฃ2 , โฆ ๐ฃ๐ ], the vector of pair- wise similarities of attribute values in cluster ๐๐ .๐ฌ = Table 1 [๐ ๐ฃ1 ,๐ฃ2 , โฆ , ๐ ๐ฃ1 ,๐ฃ๐ , ๐ ๐ฃ2 ,๐ฃ3 , โฆ , ๐ ๐ฃ๐โ1 ,๐ฃ๐ ], and the attribute Number of unique attribute values value length vector ๐๐ .๐ฅ = [|๐ฃ1 |, |๐ฃ2 |, โฆ , |๐ฃ๐ |]. The best Attribute First name Surname Address matching cluster for ๐๐ is identified with the function F M ๐๐๐ฌ๐ญ๐๐๐ญ๐๐ก(), which takes as input the pairwise simi- Lookup tables 33,334 19,151 51,350 2,255 larities vector ๐๐ .๐ฌ, the vector of attribute value lengths Data set 1 274 231 152 124 Data set 2 375 333 266 219 ๐๐ .๐ฅ, the number of attribute values in cluster |๐๐ .๐ฏ|, and the set of lookup attribute value clusters ๐๐ . The func- tion ๐๐๐ฌ๐ญ๐๐๐ญ๐๐ก() finds the most similar lookup clus- ter ๐โฒ๐ to ๐๐ from the set of lookup clusters ๐๐ (which create a corresponding date ๐1โฒ where ๐๐๐๐ โค ๐1โฒ โค are of the same size as ๐๐ ) using Euclidean distances ๐๐๐๐ฅ . Then, the remaining date values in the record calculated between their similarity vectors and their group are shifted by ๐1โฒ โ ๐1 days and each newly gen- attribute value length vectors. The cluster ๐โฒ๐ โ ๐๐ erated date value, excluding the earliest date ๐1โฒ , is ran- with the minimal weighted distances is selected to be domly shifted by ยฑฮ๐ days such that any temporal mapped to cluster ๐๐ . If ๐๐ does not contain clusters constraints across the generated date values are main- of size |๐๐ .๐ฏ| then subsets of clusters from ๐๐ which are tained. For example, if ๐๐ contains birth records of sib- larger than |๐๐ .๐ฏ| are considered. ling groups, then the temporal constraints of the birth In line 7, we remove the selected cluster ๐โฒ๐ from dates would reflect that it is not possible for two births ๐๐ to obtain unique cluster mappings. In line 8, we by the same mother to be less than nine months apart then map the sorted attribute values in ๐๐ .๐ฏ to the cor- unless they are twins [8]. The anonymised date values responding values in cluster ๐โฒ๐ .๐ฏ, such that each at- are then used in the synthetic records ๐๐โฒ โ ๐๐ . tribute value from data set ๐๐ has a unique mapping to a value in the lookup table ๐. 5. Demonstration 4. Generating an Anonymised The initial implementation of the DOYEN tool demon- strates sensitive graph data anonymisation using two Data Set input data sets which we generated based on two real- Once the attribute value mapping has been completed, world historical birth data sets. These data sets con- DOYEN generates the anonymised graph data set ๐๐ tain name and address variations to help demonstrate for the sensitive input graph data set ๐๐ in the follow- the capability of DOYEN to anonymise a graph while ing manner. maintaining its similarity structure. The example birth For each record from the input data set ๐๐ โ ๐๐ , we data sets contain several twin births as well as missing create a synthetic record ๐๐โฒ โ ๐๐ , and for each sen- values for the last names of fathers and children, as sitive attribute value in ๐๐ , we replace the original at- seen in the original birth data sets. Lookup tables con- tribute value with the corresponding mapped attribute taining values for the sensitive attributes first name, value from ๐. surname, and address, were generated using a publicly Given many data sets have dates associated with available US voter database (see: https://dl.ncsbe.gov) their records (such as dates of birth or dates of hospital as well as Australian addresses. Table 1 summarises admission), DOYEN also provides an anonymisation the number of unique values in each sensitive attribute approach for date values while maintaining the tem- in the lookup tables and the two example data sets. If poral distances across connected records. Prior to date attribute values contained more than one token (such anonymisation, the records in ๐๐ are grouped such as having two names) they were separated into indi- that related records are contained in a single group. vidual tokens as a preprocessing step. These groupings reflect records that represent a re- As shown in Table 1, the number of values available lated group of entities, such as the siblings of the same for each attribute is significantly larger in the lookup family. Subsequent to grouping records, DOYEN sorts tables compared to the input graph data set. Having the date values associated with the records in a group. more attribute values in the lookup tables will help The tool allows the user to define a minimum (๐๐๐๐ ) anonymise the sensitive input data set in a manner and a maximum (๐๐๐๐ฅ ) date within which they want that better preserves its graph similarity structure. the earliest date value from a specific group to appear. Fig. 2 shows the input screen of the DOYEN web Thus, for the earliest date ๐1 from a record group, we tool. The two buttons Example 1 and Example 2 will load one of the input graph data sets and suitable pa- Figure 2: Input screen of the DOYEN tool. rameter values. A sample of the input data set can be viewed by clicking on the View Input Data Sample but- ton, whereas the full data set can be downloaded as well. The four parameters to be specified are: Figure 3: Sample of attribute value mappings as generated by DOYEN. โข Group size and their counts: This parameter allows control of the number of families of a given size that are to be generated. When a sensitive graph data Jaro-Winkler for names and the Jaccard q-gram based set is loaded, this field is automatically populated approach for addresses) [6]. Subsequently, the attribute with the family size distribution of the loaded input value clusters are identified with the connected com- data set. The user can change the values as desired. ponent clustering approach with a similarity threshold However, the current implementation restricts the of 0.8 [9]. Next, all attribute value clusters from the in- user to specifying only up to a maximum number put graph are mapped to clusters from the lookup ta- of families as appearing in the input data set, for a ble using the Euclidean distance vector similarity mea- given family size. That is, if there are ten families sure, as we described in Algorithm 1. Since the vector (clusters) of size five in the input data set, then cur- of similarities between attribute value pairs in clusters rently the user can only generate a maximum of ten (๐๐ .๐ฌ) is more important than the vector of attribute families of size five. value lengths (๐๐ .๐ฅ), we assign a relatively higher weight โข Minimum/Maximum dates for the first birth: For ๐ค(๐ค > 0.5) to ๐๐ .๐ฌ and a weight of 1 โ ๐ค to ๐๐ .๐ฅ when each family in the anonymised data set, the first calculating the overall cluster similarity. birth record in the family is expected to have a birth After the attribute value mapping is completed, the date within (including) the given minimum and max- DOYEN tool generates the anonymised graph data set imum date range (๐๐๐๐ and ๐๐๐๐ฅ ), where ๐๐๐๐ < by replacing the attribute values of each record in the ๐๐๐๐ฅ . input data set with the corresponding mapped attribute value, and by shifting the dates of birth as described in โข Random time offset: This is the ยฑฮ๐ time range Section 4. Once the anonymised data is generated, the which is used to further shift (randomly perturb) user can view a sample of the attribute value mappings the dates of birth in each anonymised synthetic fam- as shown in Fig. 3. Furthermore, the user can view a ily (except the first date) as we described in Sec- sample of the anonymised, synthetic data set created tion 4. by DOYEN, or download the full data set. Fig. 4 shows a sample of the sensitive input data set and the corre- The user has the flexibility of editing the values with sponding anonymised records. Notice how, for exam- which the parameter fields are automatically populated ple, the address values โmonkstadtโ and โmonkstodtโ after an example data set has been loaded. When the (in R8 and R9) with a high pairwise string similarity Generate Data button is clicked, the anonymisation and have been replaced with values โnarembureโ and โnarem- data generation process is executed in the back end. burnโ which have a similar pairwise similarity. The tool internally calculates the string similarity of To illustrate the quality of the anonymised graphs attribute values by first applying a blocking technique generated by DOYEN, Fig. 5 shows the distribution of [9] (with phonetic encoding for names and Locality the pairwise similarity for record pairs from each ex- Sensitive Hashing for addresses) and then applying a ample data set. For each record pair, the similarity is pairwise string similarity calculation method (we used ity of anonymising sensitive graph data sets while pre- serving their similarity structure. As future work we intend to further improve DOYEN such that it is capa- ble of maintaining geographic structures in the graph data (such as maintaining similar distances between addresses in input records and the corresponding gen- erated anonymised output records). Furthermore, we plan to support generating synthetic data sets of dif- ferent cluster size distributions without restricting the user to a maximum distribution as limited by the input data set. Such flexibility will allow users to generate larger or smaller data sets as suited for their research requirements. Acknowledgments This research was partially funded by the Australian Research Council under grant DP160101934. Figure 4: Sample of the sensitive input data set and the anonymised data set as generated by DOYEN. References [1] T. Feder, S. U. Nabar, E. Terzi, Anonymizing 105 Example data set 1 105 Example data set 2 graphs, arXiv Preprint (2008). URL: http://arxiv. Sensitive Anonymised Sensitive Anonymised org/abs/0810.5578. Number of record pairs (log) 104 104 [2] L.-E. Wang, X. Li, A graph-based multifold model 103 103 for anonymizing data with attributes of multiple 10 2 102 types, CaS 72 (2018) 122โ135. [3] S. Das, O. Egecioglu, A. Abbadi, Anรณnimos: An 101 101 LP-based approach for anonymizing weighted so- 1000.0 cial network graphs, IEEE TKDE 24 (2012) 590โ 0 0.2 0.4 0.6 0.8 1.0 10 0.0 0.2 0.4 0.6 0.8 1.0 Pairwise similarity 604. [4] B. Zhou, J. Pei, W. Luk, A brief survey on Figure 5: Comparison of pairwise record similarities of the anonymization techniques for privacy preserving sensitive input and the generated anonymised data sets. publishing of social network data, SIGKDD Ex- plor. Newsl. 10 (2008) 12โ22. [5] G. Bloothooft, P. Christen, K. Mandemakers, calculated using the Jaro-Winkler similarity measure M. Schraagen, Population Reconstruction, on names and the Jaccard q-gram similarity measure Springer, Cham, 2015. on addresses [6], followed by an averaging of these [6] G. Navarro, A guided tour to approximate string values. As can be seen from this figure, the similarity matching, ACM Computing Surveys 33 (2001) 31โ distribution of both the sensitive input data set and the 88. generated anonymised data set are highly similar for [7] R. Xiang, J. Neville, M. Rogati, Modeling rela- each of the two example data sets. This shows the ca- tionship strength in online social networks, in: pability of DOYEN to anonymise sensitive graph data, WWW, ACM, Raleigh, NC, 2010, pp. 981โ990. while maintaining its structure, as reflected by these [8] C. Nanayakkara, P. Christen, T. Ranbaduge, Ro- pairwise similarities. bust temporal graph clustering for group record linkage, in: PAKDD, Springer, Macau, 2019. [9] P. Christen, Data Matching โ Concepts and tech- 6. Conclusion and Future Work niques for record linkage, entity resolution, and In this paper, we have presented the initial implemen- duplicate detection, Springer, Heidelberg, 2012. tation of our web tool DOYEN which has the capabil-