=Paper=
{{Paper
|id=Vol-2438/paper4
|storemode=property
|title=Anonymization of the University Information System Log Data: a Case Study
|pdfUrl=https://ceur-ws.org/Vol-2438/paper4.pdf
|volume=Vol-2438
|authors=Jiri Zettel
|dblpUrl=https://dblp.org/rec/conf/ruleml/Zettel19
}}
==Anonymization of the University Information System Log Data: a Case Study==
Jiri Zettel
Department of Information and Knowledge Engineering, University of Economics, Prague
W. Churchill Sq. 4, 130 67 Prague, Czech Republic
xzetj01@vse.cz

Abstract. This paper provides deep insight into data preparation when implementing group-based anonymization techniques. The real-world dataset contains access log data and is used for a subsequent anomaly detection task. Unlike other research in the field of anonymization, we do not focus on the design of new algorithms, but on the pre-processing steps and on exploring the applicability of existing algorithms. Each algorithm has specific requirements for the data, so pre-processing must be comprehensive. In this paper we present how such data can be transformed into relational data, introduce a novel approach to the anonymization of IPv4 addresses in our dataset using several anonymization algorithms, and discuss their principles, strengths, and weaknesses. Two ways of pre-processing IPv4 addresses for k-anonymity algorithms are presented: first, we split each IPv4 address into four parts and create generalization hierarchies; second, we convert IPv4 addresses to integer values. We propose an improvement to the Mondrian algorithm suitable for categorical attributes which gives better results than the original algorithm.

Keywords: Anonymization, K-Anonymity, IPv4, Privacy-Preserving, Anomaly Detection, Data Preparation

1 Introduction

There is an ongoing university research project with the objectives (i) to detect computer attacks in the university information system using anomaly-based detection methods and (ii) to create an artificial generator of such attacks. The dataset used for the experiments contains access log data and is represented in a relational database. The results of the experiments on the dataset should be published; therefore the dataset has to be anonymized.
Anonymization is a technique of changing data in a way that prevents the identification of a person. There is a tradeoff between data utility and the level of anonymization. Our goal is to experiment with anonymization techniques so that the risk of re-identification of a person stays at an acceptable level while the information required for successful anomaly detection remains in the dataset. In this paper, we select and evaluate group-based anonymization techniques; a review of such techniques is presented in [1].

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Consider, for example, an IP address anonymized by removing the last two octets. If the dataset contains only one IP address starting with 146.102.*.*, then this anonymization technique does not protect the person using this address. The same logic applies when the IP is replaced by a unique identifier: the pattern of a particular person can be tracked and the identity can be compromised if an attacker knows any additional information, which is the usual case. With a k-anonymous group, each anonymized IP address belongs to a group with k-1 other IPs. The main contributions of this paper are (i) the data preparation steps for various anonymization algorithms, including transformation to relational data, (ii) their comparison, evaluation, and possible optimization, and (iii) a novel approach to the anonymization of a dataset with IPv4 addresses. Two ways of pre-processing convenient for use by k-anonymity algorithms are presented below; we also discuss the ideas behind each decision. The experiments with Mondrian led us to optimize it for use with categorical attributes.

2 Background and Related Work

Most of the research focuses on discovering models guaranteeing privacy and designing new algorithms for achieving it.
Such benchmarks utilize the same public dataset, Adult¹, to prove how new algorithms outperform the others. The main established models still in use are k-anonymity [2], ℓ-diversity [3], t-closeness [4], ε-differential privacy [5], and ρ-uncertainty [6]. An overview of data anonymization methods was given by Prasser [7], author of the ARX anonymization tool. Studies dealing with the application of anonymization algorithms to real-world data are rare. One such case study describes the anonymization of medical surveys using k-anonymity [8]. El Emam described a framework for the anonymization of clinical data [9]. Ayala-Rivera presented a systematic evaluation of k-anonymization algorithms on the Adult dataset. The differential privacy technique is currently being adopted in the commercial sector [10]. For the anonymization of IPv4 addresses, according to a survey of network traffic anonymization methods [11], common methods are prefix preservation, random permutation, truncation, and grouping. None of the reviewed IPv4 papers deal with k-anonymization algorithms.

K-anonymity ensures that each tuple in a table is indistinguishable from at least k others with respect to the quasi-identifiers (QIs). QIs are attributes whose release must be controlled. K-anonymity is achieved by searching for the minimal generalization of the attribute values and, optionally, by tuple suppression. The relationship of the domain levels forms the domain generalization hierarchy (DGH). A value generalization hierarchy (VGH) represents the relationship between the values in the domains. For the evaluation of the anonymization, we use the UTD Anonymization ToolBox². The Datafly algorithm [12] uses a greedy heuristic, Incognito [13] performs a hierarchy-based optimal search, and Mondrian [14] performs partitioning.
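The k-anonymity property described above can be checked directly: group the table by its quasi-identifier columns and verify that no group is smaller than k. The following is a minimal sketch (names and the toy table are illustrative, not taken from the paper or the UTD Toolbox):

```python
# Sketch: check whether a table satisfies k-anonymity with respect to
# a set of quasi-identifier (QI) columns.
from collections import Counter

def is_k_anonymous(rows, qi_indices, k):
    """Return True if every combination of QI values occurs in at
    least k rows of the table."""
    groups = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return min(groups.values()) >= k

# Tiny example: two QI columns holding generalized octet ranges.
table = [
    ("[0:127]", "[0:63]"),
    ("[0:127]", "[0:63]"),
    ("[128:255]", "[64:127]"),
    ("[128:255]", "[64:127]"),
]
print(is_k_anonymous(table, [0, 1], 2))  # -> True
print(is_k_anonymous(table, [0, 1], 3))  # -> False
```

Generalization (e.g. replacing an exact octet with a range) shrinks the number of distinct QI combinations and thereby grows the groups until the check passes.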
The first metric we use in the evaluation is the discernibility metric; it assigns a penalty to each tuple based on how many tuples in the transformed dataset are indistinguishable from it [15]. The second metric is the normalized average equivalence class size metric described in [14].

¹ The dataset can be found in the UCI Repository at https://archive.ics.uci.edu/ml/datasets
² The UTD Anonymization Toolbox is available at http://cs.utdallas.edu/dspl/cgi-bin/toolbox/

3 Data Understanding and Preparation

The data used in this experiment are HTTP requests stored in a MySQL table, one per user action. We removed user-supplied POST parameters because they are sensitive. UserID and SourceIP are quasi-identifiers: they can lead to the identification of persons when linked to some additional data. Another quasi-identifier could be the timestamp; there are some indications that identity could be revealed by discovering user behavior. We selected a subset containing 350k transactions, the activities generated in one day. Some modifications to the data were necessary first. For example, SourceIP was transformed into five new attributes, the first holding integer values (using the INET_ATON function³) and the remaining four holding the IP address split into four octets. Then we extracted the identities from the transactional data and thus created relational data with all the users and their originating IPs. This new table consists of only the UserID and SourceIP attributes, and the relationship between the attributes is many-to-many. Statistics are described in Table 1.

Full-domain Generalizations. Datafly and Incognito work with categorical attributes. All the values within a domain have to be generalized to the same level in the DGH. This leads to over-generalization if there are values which vary greatly from all the other values in the domain, because they need to be generalized more to fit in an equivalence class. And this is the case for the IP addresses.
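The SourceIP transformation described above can be sketched as follows. The function name is illustrative; the integer it produces is the same big-endian 32-bit value that MySQL's INET_ATON() returns:

```python
# Sketch of the pre-processing step: turn one SourceIP string into the
# five derived attributes, one integer (the INET_ATON equivalent) and
# four separate octets.
def ip_to_attributes(ip):
    octets = [int(p) for p in ip.split(".")]
    assert len(octets) == 4 and all(0 <= o <= 255 for o in octets)
    # Big-endian 32-bit integer, as computed by MySQL's INET_ATON().
    as_int = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]
    return as_int, octets

num, parts = ip_to_attributes("146.102.0.1")
print(num)    # -> 2456158209
print(parts)  # -> [146, 102, 0, 1]
```

The integer form feeds the partitioning-based algorithm (Mondrian), while the four octet columns feed the hierarchy-based ones (Datafly, Incognito).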
Splitting the IP into four parts represents the IP as four separate quasi-identifiers and allows applying a different domain generalization level to each IP part. In fact, it is easier to generalize the fourth octet than the first one because its values are distributed more evenly. The last octet represents the host in the subnet; the first octet is assigned by IANA⁴ and the next one by the Regional Internet Registries. The VGH/DGH created are illustrated in Fig. 1. The hierarchy has five levels:

IP4: [0:255]
IP3: [0:127], [128:255]
IP2: [0:63], [64:127], [128:191], [192:255]
IP1: [0:31], [32:63], [64:95], [96:127], [128:159], [160:191], [192:223], [224:255]
IP0: all the leaf values 0, 1, …, 254, 255; each leaf points to the appropriate group in the upper domain generalization level (IP1), not displayed in the diagram for simplification.

Fig. 1: Domain and value generalization hierarchies for the IP address

³ https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functions.html#function_inet-aton
⁴ The IANA IPv4 Address Space Registry is available at https://www.iana.org/assignments/ipv4-address-space/ipv4-address-space.xhtml

Top-down Algorithm. Mondrian partitions the values until k-anonymity is achieved, so it works well with numeric attributes. The UserID attribute is discrete numerical, and the IP address can also be converted to discrete numeric values. The advantage of a numeric IP over using a VGH for partitioning is obvious: it allows much more fine-grained generalization. We propose an improvement to Mondrian suitable for categorical attributes mapped to discrete numerical values, especially when the attributes have low cardinality. We implemented it in the UTD Toolbox and show below that it also gives better results for the numeric IP address. The original algorithm creates a frequency set in the selected dimension and searches for the median (splitVal). It splits the values into a left-hand side (lhs) and a right-hand side (rhs) interval.
The lhs interval is created as an inclusive interval (including splitVal) and the rhs as exclusive. This cut is then recursively repeated in the rhs and lhs until at least k values are present in each interval. Table 2 shows an example frequency set, where the median is 3. Creating lhs = [1:3] and rhs = (3:4] would not be an allowable cut considering k = 10. In such cases, creating intervals where splitVal is rhs-inclusive would still allow a further cut, so we extended the algorithm by one more step which tries to cut the partition with splitVal rhs-inclusive when the lhs-inclusive cut is not allowable.

Table 1: Identities table statistics

Attribute | Distinct | Min | Max | Stdev | Avg
Octet 1   | 144      | 1   | 223 | 51.17 | 109.09
Octet 2   | 253      | 0   | 255 | 64.50 | 124.21
Octet 3   | 256      | 0   | 255 | 75.68 | 125.24
Octet 4   | 256      | 0   | 255 | 75.40 | 119.79
UserID    | 9974     | -   | -   | -     | -
Num. IP   | 12736    | -   | -   | -     | -

Table 2: Example frequency set

QID value | Count
1         | 21
2         | 370
3         | 4214
4         | 5

4 Experimental Evaluation

The following configuration was used for the experiment: Intel Xeon CPU 1.9 GHz, Java 1.8U211 32-bit runtime environment, maximum heap size about 1.6 GB.

Datafly for the IP Address. The first evaluation was done for parameters k = 10 and suppression threshold = 10. The processing time was 642 s. Octet 1 is generalized to DGH level IP3; octets 2 to 4 are generalized to the IP2 level. The result gives 9 suppressed tuples (described as an equivalence group of size 9) and a total of 128 equivalence classes. The smallest equivalence class size is 11 and the largest is 432. The distribution of the group sizes is described in Fig. 2; this shows how many distinct users are associated with each anonymized IP address. The normalized average group size is calculated to be 11.27 and the discernibility metric is 2,893,151. The second evaluation was done for k = 10 and suppression threshold 0. The time was 627 s. To keep suppression at zero, the generalizations are greater: octets 1 and 2 are generalized to the DGH IP3 level, while octets 3 and 4 to the IP2 level.
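The split step just described, including the rhs-inclusive fallback, can be sketched as follows. This is a hedged illustration of the idea, not the UTD Toolbox code itself; find_split returns the two sub-partitions, or None if neither cut keeps at least k tuples on both sides:

```python
# Sketch of the (modified) Mondrian split step on one dimension.
from collections import Counter

def find_split(values, k):
    freq = sorted(Counter(values).items())
    total = len(values)
    # Median of the frequency set = the standard splitVal.
    running, split_val = 0, freq[-1][0]
    for v, c in freq:
        running += c
        if running >= total / 2:
            split_val = v
            break
    # Standard cut: splitVal inclusive on the lhs.
    lhs = [x for x in values if x <= split_val]
    rhs = [x for x in values if x > split_val]
    if len(lhs) >= k and len(rhs) >= k:
        return lhs, rhs
    # Extension: retry with splitVal inclusive on the rhs.
    lhs = [x for x in values if x < split_val]
    rhs = [x for x in values if x >= split_val]
    if len(lhs) >= k and len(rhs) >= k:
        return lhs, rhs
    return None  # no allowable cut

# The Table 2 example: QID values 1..4 with counts 21, 370, 4214, 5.
values = [1] * 21 + [2] * 370 + [3] * 4214 + [4] * 5
lhs, rhs = find_split(values, k=10)
print(len(lhs), len(rhs))  # -> 391 4219
```

On the Table 2 data the standard cut gives rhs = (3:4] with only 5 tuples, which is not allowable for k = 10, while the rhs-inclusive retry yields the cut [1:3) / [3:4] and lets partitioning continue.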
There are 64 groups, fewer than in the first experiment, but they are bigger (37 to 468). The normalized average group size is also higher (22.55) and the discernibility metric is 4,243,430. The distribution of the group sizes is described in Fig. 3.

Fig. 2: Datafly, k=10, supp=10 (count of classes vs. equivalence class size)
Fig. 3: Datafly, k=10, supp=0 (count of classes vs. equivalence class size)

Incognito for the IP Address. Because of time limitations, we removed the IP1 level of the DGH for the IP address. The anonymization took 20,942 s, which is nearly 6 hours. The optimal level of anonymization found is 3-2-3-0 for the IP parts 1 to 4, meaning the last octet remained with the original values. Intuitively the IP address should be generalized in such a way that the most generalized bits are on the right side, but this would not represent optimal anonymization by Incognito of the quasi-identifiers as we selected them. The solution for this would be to choose only the last two or three octets. Fig. 4 shows the distribution of equivalence classes (there are 512 of them in total). The normalized average group size is smaller compared to Datafly (2.82) and the discernibility metric is 447,140.

Fig. 4: Incognito, k=10, supp=10 (count of classes vs. equivalence class size)
Fig. 5: Mondrian, k=10, comparison of the original and modified algorithms (count of classes vs. equivalence class size)

Mondrian for the IP Address. The first Mondrian evaluation for k = 10, using the original Mondrian partitioning algorithm, took 81 s. There were 1016 equivalence classes created, the smallest having a size of 10 and the largest 30. The normalized average group size is calculated to be 1.42 and the discernibility metric is 207,524. A second evaluation for the same parameters was done with the modified algorithm. The processing time was 80 s, with 1022 classes created in total (from size 10 to 29).
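The two quality metrics reported throughout this section can be computed from the equivalence class sizes alone. A minimal sketch, assuming no suppressed tuples (the discernibility metric additionally penalizes suppressed tuples, which is omitted here):

```python
# Discernibility metric: each tuple is penalized by the size of its
# equivalence class, i.e. sum of squared class sizes.
def discernibility(class_sizes):
    return sum(s * s for s in class_sizes)

# Normalized average equivalence class size: (total records /
# number of classes) / k.
def c_avg(class_sizes, k):
    return sum(class_sizes) / (len(class_sizes) * k)

sizes = [10, 10, 20]
print(discernibility(sizes))  # -> 600
print(c_avg(sizes, k=10))     # -> about 1.33
```

Lower values are better for both metrics; a c_avg of exactly 1 means every class has the minimum allowed size k.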
The normalized average group size is calculated to be 1.41 and the discernibility metric is 205,304. The results are slightly better than with the original algorithm. The graph comparison can be found in Fig. 5. To determine how many equivalence classes were created by the modified algorithm, we searched for partitions split into intervals (a:splitVal) and [splitVal:b), which are those where splitVal is exclusive in the lhs and inclusive in the rhs. There are 6 such partitions in total, cut into 12 such intervals, meaning 12 equivalence classes out of 1022 were created in cases where the standard algorithm did not find a further allowable cut. We examined the IP addresses defining the intervals and all of them belong to universities or Internet Service Providers. This confirms the assumption about low-cardinality attributes.

Mondrian for User ID and IP Address. In this experiment, we also included anonymization of the user ID within the equivalence class, along with the IP address. This ensures that not only is the IP address indistinguishable from k - 1 other addresses, but so is the user ID. The processing time is 124 s. There are 9 equivalence classes created; the smallest has 10 equivalent members and the largest 18. There are only 2 classes of 18 members. More information is shown in Fig. 6. The normalized average group size is calculated to be 1.37, which is much smaller than for the previous algorithms. The discernibility metric is also much smaller (199,530). The distribution of user IDs can be found in Fig. 7. There are 648 intervals of IP addresses in total; the distribution is shown in Fig. 8. Fig. 9 illustrates the relationship between the anonymized IP address groups and users. It shows that each IP address interval is associated with at least 10 users. Highly used IP intervals belong to the subnets of the university or the largest Internet Service Providers.
Fig. 6: Mondrian for UserID and IP, k=10 (count of classes vs. equivalence class size)
Fig. 7: Mondrian for UserID and IP, k=10 (count of groups vs. user ID group size)
Fig. 8: Mondrian for UserID and IP, k=10 (count of groups vs. IP address group size)
Fig. 9: Mondrian for UserID and IP, k=10 (count of users per IP address group)

5 Conclusions and Future Work

Our work adds practical information to the papers evaluating the algorithms on public datasets. We believe that it can be useful for our further experiments and can give other researchers insight into the pre-processing of raw data. The experiments proved that existing k-anonymization algorithms can be applied when the data is prepared in a convenient way. The example illustrated an anonymization of IPv4 addresses that considers their occurrences in the dataset to be published, as opposed to IP anonymizations that do not consider other instances in the dataset. The best results were achieved by the Mondrian algorithm, because the partitioning technique works very well for numeric continuous attributes, or for categorical attributes mapped to discrete numerical values, when the cardinality is high. When the cardinality is low, we would consider using Datafly or Incognito with value generalization hierarchies. However, we saw the limitation of Incognito's optimal algorithm, which is computationally very intensive. The intervals created by Mondrian as equivalence groups for the IP address can easily be converted back to one anonymized IP address in IPv4 format by replacing the digits that differ at the same index in the lower and upper bound with an asterisk (*), while keeping the digits with the same values at the same index. We also verified, in the experiment with IP addresses, that the idea of the modified Mondrian algorithm is correct.
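The interval-to-masked-address conversion mentioned above can be sketched as follows. This is a hedged illustration simplified to octet granularity (an octet that differs between the two bounds becomes '*'), not the digit-level rule exactly as stated; the function names are illustrative:

```python
# Sketch: collapse the lower and upper bound of a Mondrian interval
# (stored as 32-bit integers) back into one masked IPv4 string.
def int_to_ip(n):
    return ".".join(str((n >> s) & 0xFF) for s in (24, 16, 8, 0))

def interval_to_masked_ip(lo, hi):
    lo_octets = int_to_ip(lo).split(".")
    hi_octets = int_to_ip(hi).split(".")
    # Keep octets that agree between the bounds, mask the rest.
    return ".".join(a if a == b else "*" for a, b in zip(lo_octets, hi_octets))

lo = (146 << 24) | (102 << 16)        # 146.102.0.0
hi = (146 << 24) | (102 << 16) | 255  # 146.102.0.255
print(interval_to_masked_ip(lo, hi))  # -> 146.102.0.*
```

For intervals that are not aligned to octet boundaries, a digit-level comparison as described in the conclusion would preserve slightly more information than this octet-level variant.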
In our future work, we would like to perform a similar experiment with the transaction data, mainly to anonymize the timestamp attribute to hide user behavior patterns. Eventually, consequent experiments with the anomaly detection task on the anonymized dataset need to be done to evaluate whether the detection remains successful.

Acknowledgment

The work reported in this paper was carried out with the support of the IGA F4/12/2019 project of the University of Economics, Prague.

References

[1] J. Zettel and P. Berka, “Study of Anonymization Techniques for Logging Data from University Information System,” presented at the 26th Interdisciplinary Information Management Talks IDIMT 2019, 2019.
[2] L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, Oct. 2002.
[3] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “ℓ-Diversity: Privacy Beyond k-Anonymity,” p. 12, 2007.
[4] N. Li, T. Li, and S. Venkatasubramanian, “t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity,” p. 10, 2007.
[5] C. Dwork, “Differential Privacy,” in Automata, Languages and Programming, 2006, pp. 1–12.
[6] J. Cao, P. Karras, C. Raïssi, and K.-L. Tan, “ρ-uncertainty: inference-proof transaction anonymization,” Proceedings of the VLDB Endowment, vol. 3, no. 1–2, pp. 1033–1044, Sep. 2010.
[7] F. Prasser, “An overview of methods for data anonymization,” https://de.slideshare.net/arx-deidentifier/prasser-methods, 2015.
[8] M. Gentili, S. Hajian, and C. Castillo, “A Case Study of Anonymization of Medical Surveys,” in Proceedings of the 2017 International Conference on Digital Health - DH ’17, London, United Kingdom, 2017, pp. 77–81.
[9] K. El Emam and B. Malin, “Concepts and methods for de-identifying clinical trial data,” paper commissioned by the Committee on Strategies for Responsible Sharing of Clinical Trial Data, 2014.
[10] N. Johnson, J. P. Near, and D. Song, “Towards Practical Differential Privacy for SQL Queries,” p. 14, 2018.
[11] N. V. Dijkhuizen and J. V. D. Ham, “A Survey of Network Traffic Anonymisation Techniques and Implementations,” ACM Computing Surveys, vol. 51, no. 3, pp. 1–27, May 2018.
[12] L. Sweeney, “Guaranteeing anonymity when sharing medical data, the Datafly System,” Proc AMIA Annu Fall Symp, pp. 51–55, 1997.
[13] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, “Incognito: efficient full-domain K-anonymity,” in Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data - SIGMOD ’05, Baltimore, Maryland, 2005, p. 49.
[14] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, “Mondrian Multidimensional K-Anonymity,” in 22nd International Conference on Data Engineering (ICDE’06), Atlanta, GA, USA, 2006, pp. 25–25.
[15] R. J. Bayardo and R. Agrawal, “Data Privacy through Optimal k-Anonymization,” in 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 2005, pp. 217–228.