               Interactive Anonymization
          for Privacy aware Machine Learning

            Bernd Malle12 , Peter Kieseberg12 , Andreas Holzinger1
                           1
                             Holzinger Group HCI-KDD
           Institute for Medical Informatics, Statistics & Documentation
                          Medical University Graz, Austria
                               b.malle@hci-kdd.org
              2
                  SBA Research gGmbH, Favoritenstraße 16, 1040 Wien
                          PKieseberg@sba-research.org



      Abstract. Privacy aware Machine Learning is the discipline of applying
      Machine Learning techniques in such a way as to protect and retain per-
      sonal identities during the process. This is most easily achieved by first
      anonymizing a dataset before releasing it for the purpose of data min-
      ing or knowledge extraction. Starting in May 2018, this will also be
      the sole legally permitted way within the EU to release data without
      granting the people involved the right to be forgotten, i.e. the right to have
      their data deleted on request. To governments, organizations and cor-
      porations, this represents a serious impediment to research operations,
      since any anonymization results in a certain degree of reduced data util-
      ity. In this paper we propose applying human background knowledge via
      interactive Machine Learning to the process of anonymization; this is
      done by eliciting human preferences for preserving some attribute values
      over others in the light of specific tasks. Our experiments show that hu-
      man knowledge can yield measurably better classification results than a
      rigid automatic approach. However, the impact of interactive learning in
      the field of anonymization will largely depend on the experimental setup,
      such as an appropriate choice of application domain as well as suitable
      test subjects.
      Keywords: Machine Learning, Privacy aware ML, interactive ML, Knowl-
      edge Bases, Anonymization, k-Anonymity, SaNGreeA, Information Loss,
      Weight Vectors


1   Introduction and Motivation

In many sectors of today’s data-driven economies technical progress is dependent
on data mining, knowledge extraction from diverse sources, as well as the anal-
ysis of personal information. Especially the latter constitutes a vital building-
block for business intelligence and the provision of personalized services, which
are practically demanded by modern society. Often, the insights necessary for
enabling organizations to provide these goods require publication, linkage, and
systematic analysis of personal data sets from heterogeneous sources, exposing
those data to the risk of leakage, with repercussions ranging from mild incon-
venience (exposure of a social profile) to potentially catastrophic ramifications
(leakage of health information to an employer).
    Living up to those challenges, governments around the world are contem-
plating or already enacting new laws concerning the handling of personal data.
For instance, under the new European General Data Protection Regulations
(GDPR) taking effect on May 25, 2018, customers are given a right to be forgot-
ten, meaning that an organization is obligated to remove a customer’s personal
data upon request. An exception to this rule is only granted to organizations
which anonymize data before analyzing them in any wholesale, automated fash-
ion. This brings us to the field of Privacy aware machine learning (PaML), i.e.
the application of ML algorithms only on previously anonymized data. Such
anonymization can be provided by perturbing data (e.g. introducing noise into
numerical values, or differential privacy [4]) or by k-anonymity [17] (clustering of
data into equivalence groups), the latter of which has become the industry standard.
    The original requirement of k-anonymity has since been extended by the con-
cepts of l-diversity [11] (where every cluster must contain at least l diverse sensi-
tive values), t-closeness [9] (demanding that the local distribution over sensitive
values must not diverge from its global distribution by more than a threshold of
t) as well as delta-presence [15] (which incorporates the background knowledge
of a potential attacker). Although all of those concepts are interesting in their
own right, for the sake of comparing interactive ML algorithms to their fully
automatic counterparts, we only took k-anonymity into consideration.
    Based on our previous works on this topic [13, 12], in which we conducted
a comparison study of binary classification performance on perturbed (selective
deletion) vs. wholesale anonymized data, in this paper we introduce the notion
of interactive Machine Learning for (k-)anonymization.


2   k-Anonymity

Given the original tabular concept of anonymization, we will usually encounter
three different categories of attributes within a given dataset:

 – Personal identifiers are data items which directly identify a person with-
   out having to cross-reference or further analyze them. Examples are email
   address or social security number (SSN). As personal identifiers are imme-
   diately dangerous, this category of data is usually removed.
 – Sensitive data, also called ’payload’, represents information that is crucial
   for further data mining or research purposes. Examples for this category
   would be disease classification, drug intake or personal income. This data
   shall be preserved in the anonymized dataset and can therefore not be deleted
   or generalized.
 – Quasi identifiers (QI’s) are data which in themselves do not directly
   reveal the identity of a person, but might be used in aggregate to reconstruct
    it. For instance, [18] reported in 2002 that the identity of 87% of U.S. citizens
    could be uncovered via just the 3 attributes zip code, gender and date of
    birth. Despite this danger, QI’s may contain vital information to research
    applications (like ZIP code in a disease spread study); they are therefore
    generalized to an acceptable compromise between privacy (data loss) and
    information content (data utility).

    Based on this categorization k-anonymity [16] was introduced as a formal
concept of privacy, in which a record is released only if its quasi-identifiers are
indistinguishable from at least k − 1 other entities in the dataset. This can be
imagined like a clustering of data into so-called equivalence groups of at least
size k, with all internal QI’s being generalized to the exact same level.
    Generalization in this setting means an abstraction of attribute value: e.g.
given two ZIP codes of ’8010’ and ’8045’, we could first generalize to ’80**’, then
incorporate another data point showing ZIP ’8500’ by generalizing the cluster to
’8***’, and finally merging with any other ZIP code to the highest level of ’all’,
also denoted as ’*’.
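
    To make the generalization mechanics concrete, the following minimal Python
sketch (our illustration, not part of the SaNGreeA implementation) performs the
prefix-based ZIP generalization described above:

    # Hypothetical helper: replace the last `level` digits of a ZIP code by '*'.
    def generalize_zip(zip_code: str, level: int) -> str:
        if level <= 0:
            return zip_code              # no generalization
        if level >= len(zip_code):
            return "*"                   # full suppression ('all')
        return zip_code[:-level] + "*" * level

    assert generalize_zip("8010", 2) == "80**"   # '8010' and '8045' share '80'
    assert generalize_zip("8500", 3) == "8***"   # merging in '8500'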


3    interactive Machine Learning

Interactive ML algorithms adjust their inner workings by continuously inter-
acting with an outside oracle, drawing positive / negative reinforcement from
this interaction [7]. Such systems are especially useful for highly-personalized
predictions or decision support [8]; moreover, many real-world problems exhibit
(super-)exponential algorithmic runtime; in such cases human brains far outperform
machines at approximating solutions and learning from very small samples, thus
enabling us to ’intuit’ solutions efficiently [6].
    By incorporating humans as oracles into this process, we can elicit back-
ground knowledge regarding specific use cases unknown to automatic algorithms
[19]. This, however, is highly dependent on the users’ experience in a certain field
as well as data / classification complexity; domain experts can of course be
expected to contribute more valuable decision points than laymen; likewise, a
low-dimensional dataset and simple classification tasks will result in higher qual-
ity human responses than convoluted problem sets.
    While the authors of [14] propose a system that interacts with a user in order
to set a certain k-factor and subsequently provides a report on information loss
and Kurtosis of QI distributions, the algorithm is not interactive by our definition,
in that the user does not influence the algorithm’s inner workings during the
learning phase. This is also true in the case of the Cornell Anonymization Toolkit
(Cat) [20], which conducts a complete anonymization run and only afterwards
lets the user decide if they are satisfied with the results. In contrast, our approach
alters algorithmic parameters upon every (batch of) human decisions, letting the
algorithm adapt in real-time.
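
    To illustrate what such real-time adaptation could look like, here is a minimal
sketch assuming a simple multiplicative update rule (the rule itself is our
illustration; the concrete update used in our implementation may differ):

    # Sketch: after each human decision, raise the weights of the attributes the
    # user's choice implicitly favored for preservation, then re-normalize.
    def update_weights(weights, preferred_attrs, learning_rate=0.1):
        for attr in preferred_attrs:
            weights[attr] *= (1.0 + learning_rate)
        total = sum(weights.values())
        return {a: w / total for a, w in weights.items()}

    w = {"age": 0.25, "zip": 0.25, "race": 0.25, "occupation": 0.25}
    w = update_weights(w, ["zip"])   # the user's choice favored keeping 'zip'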
    [10] describe an approach incorporating humans into the anonymization pro-
cess by allowing them to set constraints on attribute generalization; moreover
they construct generalization hierarchies involving domain-specific ontologies.
Although this technique marks a departure from wholesale automatic anonymiza-
tion, it still lacks the dynamic human-computer interaction of our approach.
    Apart from the field of privacy, interactive ML is present in a wide spectrum
of applications, from neighboring medical fields like protein interaction / cluster-
ing [1], via on-demand group creation in social networks [2], to teaching
algorithms suitable mappings from gestures to music-generating parameters [5].


4     Experiments

The following sections describe our experiment in detail, encompassing the
general iML setting, the chosen dataset, the anonymization algorithm used, as
well as the overall processing pipeline employed to obtain the final results
presented below.


4.1   General setting

The basic idea of our experiment was to compare different weight vectors repre-
senting attribute (quasi-identifier) importance during anonymization: Let’s say
that a doctor needs to release a dataset for the purpose of studying disease
spread; in this case ’ZIP code’ information is probably (but not necessarily) of
much greater importance than ’occupation’ or ’race’. However, if a skin cancer
study is to be performed, ’race’ information might be of utmost importance,
whereas ’ZIP code’ might be negligible.
    In our experiment, the task was to classify a people dataset on the target
attributes income, education level and marital status. Therefore, we tested an
equal weight vector setting against two others obtained from human experiments:
1) bias in which the user just specified which attributes they thought would be
important for a specific classification by moving sliders, and 2) iML in which
the user was tasked with deciding a series of clustering choices by moving a
data row to one of two partly anonymized clusters presented, thereby conveying
which attributes were more important to preserve than others (Figure 1). Only
the last method constitutes an interactive learning approach by introducing an
oracle into the process.
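
    For illustration, the three settings can be viewed as weight vectors over the
quasi-identifiers; the concrete numbers below are invented:

    # Invented example values for the three weight settings:
    equal = {"age": 0.25, "zip": 0.25, "race": 0.25, "occupation": 0.25}
    bias  = {"age": 0.10, "zip": 0.20, "race": 0.10, "occupation": 0.60}  # sliders
    iml   = {"age": 0.15, "zip": 0.30, "race": 0.05, "occupation": 0.50}  # decisions
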
4.2   Data

We chose the adult dataset from the UCI Machine Learning repository, which
was generated from US census data from 1994 and contains approximately 50k
entries in its original form; this dataset is used by many anonymization researchers
and therefore constitutes a quasi-standard. After initial preprocessing we chose
the first 500 complete data rows as our iML experimental data to be presented to
users. After obtaining bias / iml weights from the experiment, we chose the first
3k entries of the original data as the basis for producing 774 new, anonymized
datasets. Although 3k rows might seem overly frugal on our part, we verified via
random deletion of original data points that classifier performance remains stable
for as little as 1.5k rows. Of the original attributes (data columns) provided, 4
were deleted: ’capital-gain’ & ’capital-loss’ (both too skewed to be useful for
humans), ’fnlwgt’ (a mere weighting factor) as well as ’education’, which is also
represented by ’education num’.

         Fig. 1. Two different implementations of the iML interface design.
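
    A preprocessing sketch along these lines, assuming pandas and the usual UCI
column names (this is not our original preprocessing script):

    import pandas as pd

    # Load the UCI adult data (the canonical 'adult.data' file has no header row).
    cols = ["age", "workclass", "fnlwgt", "education", "education-num",
            "marital-status", "occupation", "relationship", "race", "sex",
            "capital-gain", "capital-loss", "hours-per-week", "native-country",
            "income"]
    df = pd.read_csv("adult.data", names=cols, na_values="?",
                     skipinitialspace=True)

    # Drop the 4 columns described above and keep only complete rows.
    df = df.drop(columns=["capital-gain", "capital-loss", "fnlwgt", "education"])
    df = df.dropna()

    subset_iml = df.head(500)     # rows presented to users in the iML experiment
    subset_batch = df.head(3000)  # basis for the anonymized datasets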


4.3   Anonymization Algorithm

In order to conduct our experiments, it was necessary to choose an algorithm
which would enable us to easily hook into its internal logic - we therefore chose a
greedy clustering algorithm called SaNGreeA (Social network greedy clustering),
introduced by [3], and implemented it in JavaScript. This enabled us
to execute it within a browser environment during our iML experiments as well
as server-side for batch-execution of all derived datasets afterwards. As a greedy
clustering algorithm, SaNGreeA’s runtime lies in O(n^2), which we were willing
to accept in exchange for its white-box internals.

    Besides its capacity to anonymize graph structures (which we did not utilize
during this work), it is a relatively simple algorithm built around General infor-
mation loss (GIL) during anonymization. This GIL can be understood as the
sum of information loss occurring during generalization of continuous (range) as
well as hierarchical attributes:

\[
GIL(cl) = |cl| \cdot \Bigg( \sum_{j=1}^{s} \frac{size(gen(cl)[N_j])}{size(\min_{x \in N}(X[N_j]), \max_{x \in N}(X[N_j]))} + \sum_{j=1}^{t} \frac{height(\Lambda(gen(cl)[C_j]))}{height(HC_j)} \Bigg)
\]

    where:
- $|cl|$ denotes the cluster $cl$’s cardinality;
- $size([i_1, i_2])$ is the size of the interval $[i_1, i_2]$, i.e. $(i_2 - i_1)$;
- $\Lambda(w)$, $w \in HC_j$, is the sub-hierarchy of $HC_j$ rooted in $w$;
- $height(HC_j)$ denotes the height of the tree hierarchy $HC_j$.

    The following formulas then give the total / normalized GIL, respectively:
\[
GIL(G, S) = \sum_{j=1}^{v} GIL(cl_j) \qquad \textrm{and} \qquad NGIL(G, S) = \frac{GIL(G, S)}{n \cdot (s + t)}
\]
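
    Transcribed into code, GIL(cl) might look as follows; this is a sketch under
the assumption that numeric attributes are generalized to intervals and categorical
ones to nodes of a tree hierarchy (the names are ours, not SaNGreeA’s):

    # gen_ranges:   generalized interval per numeric attribute, e.g. {"age": (30, 39)}
    # data_ranges:  global (min, max) per numeric attribute
    # gen_levels:   height of the sub-hierarchy each categorical attribute is
    #               generalized to; tree_heights: total height of each hierarchy
    def cluster_gil(cluster_size, gen_ranges, data_ranges, gen_levels, tree_heights):
        numeric_loss = sum((hi - lo) / (data_ranges[a][1] - data_ranges[a][0])
                           for a, (lo, hi) in gen_ranges.items())
        categorical_loss = sum(gen_levels[a] / tree_heights[a] for a in gen_levels)
        return cluster_size * (numeric_loss + categorical_loss)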

    The algorithm starts by picking a (random or pre-defined) data row as its
first cluster, then iteratively adds the best candidate for merging by minimizing
GIL until the cluster reaches size k, at which point a new data point is chosen as
the initiator of the next cluster; this process continues until all data points are
merged into clusters, satisfying the k-anonymity criterion for the given dataset.
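
    In simplified Python, the greedy loop reads roughly as follows; `gil_increase`
stands in for evaluating the GIL formula on a tentative merge (a sketch of the
published algorithm, with the graph/structural part omitted):

    def greedy_k_anon(rows, k, gil_increase):
        clusters, unassigned = [], set(range(len(rows)))
        while unassigned:
            seed = unassigned.pop()          # random or pre-defined initiator
            cluster = [seed]
            while len(cluster) < k and unassigned:
                # pick the row whose addition raises GIL the least
                best = min(unassigned, key=lambda r: gil_increase(cluster, r))
                unassigned.remove(best)
                cluster.append(best)
            clusters.append(cluster)
        return clusters  # a final undersized cluster must still be merged elsewhere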

4.4   Processing pipeline for obtaining results

Once our iML experiments had yielded enough weight vectors, we had to generate
a whole new set of anonymized datasets, to which we subsequently applied 4
classifiers for each of the 3 target attributes (columns) described; therefore we
designed the following processing pipeline:

 1. Taking the first 3k rows of the original, preprocessed dataset as input and
    applying k-anonymization with a k-factor range of [5, 10, 20, 50, 100, 200]
    and 129 different weight vectors (equal, bias, iml) from our experiments on
    it, we produced 774 anonymized datasets (775 including the original).
 2. We executed 4 classifiers on all of the datasets and compared their F1 score;
    the reason for selecting multiple algorithms was to explore if anonymiza-
    tion would yield different behaviors on different mathematical approaches
    for classification. The four algorithms used were linear SVC (as a represen-
    tative of Support Vector Machines), logistic regression (gradient descent),
    gradient boosting (ensemble, boosting) as well as random forest (ensemble,
    bagging). While reading the datasets pertaining to the classification target of
    education, the 14 different education levels present within the adult dataset
    were grouped into 4 categories: ’pre-high-school’, ’high school’, ’<=bachelors’
    and ’advanced studies’.
 3. For each combination of classification target (income, marital status, edu-
    cation) and weight category (equal, bias, iml ) we averaged the respective
    results. Results were plotted per target, as this allows better comparison
    between different classifiers. The leftmost point in all plots designates the
    original, un-anonymized dataset.
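
    The classification stage can be reproduced with scikit-learn along these lines
(a sketch; the train/test splitting and hyperparameters are our assumptions):

    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    classifiers = {
        "linear_svc": LinearSVC(),
        "logistic_regression": LogisticRegression(),
        "gradient_boosting": GradientBoostingClassifier(),
        "random_forest": RandomForestClassifier(),
    }

    # F1 score of each classifier on one (anonymized) dataset.
    def evaluate(X, y):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
        return {name: f1_score(y_te, clf.fit(X_tr, y_tr).predict(X_te),
                               average="macro")
                for name, clf in classifiers.items()}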


5   Results & Discussion

As per the results in our previous work on PaML [13, 12] we generally expected
1/x-shaped curves for classifier performance as factors of k increase. These
expectations held only to a small degree; moreover, for targets education as well
as income there was no clear winner amongst the weight categories, with some
performing better or worse depending on the specific factor of k.
    We got the smoothest results for the marital status target, with human bias
winning consistently over equal weights as well as human interaction (Figure 2).
We interpret this as stemming from the fact that there is a significant correlation
between the attributes ’marital-status’ and ’relationship’ in the dataset, which
led users to consciously overvalue the latter when prompted for their bias. It is
not completely clear why the iML results were not able to keep up in this case,
but since this seems to be a general phenomenon throughout our results, we will
discuss this in a later paragraph.

Fig. 2. Results on target marital status - human bias wins consistently over both equal
weights and human interaction with the algorithm.
    On classification target education, bias still mostly outperforms iML-obtained
attribute weights, with equal weights slightly winning out at very high factors of
k (Figure 3). Although we assume that apparently important clues towards ed-
ucation might be misleading (like income or working hours), this cannot explain
the difference between bias- and iML-based results. It has to be noted, however,
that results on this target are distinctly inferior to those of the other scenarios,
which might diminish the gap’s significance.

Fig. 3. Results on target education - we still see human bias performing slightly better
than equal weights / iML in most cases of k, but not as consistently as above.
    Only on target income did we observe a partly reversed order between human
bias and iML - however at the cost of both being usually inferior to a simple
setting with equal attribute weights (Figure 4). This is especially surprising
because income was the only binary classification task in our experiments, which
should have given humans a slight advantage over the algorithm. On the other
hand, human bias seems most susceptible to falling prey to certain stereotypes
in the area of money (w.r.t. gender, race, marital status...), which would explain
the reversal of results.

Fig. 4. Results on target income - only in this scenario do we see iML-based results
generally outperforming bias (except linear SVC), nevertheless incapable of outper-
forming the rigidly equal setting.
    As for the failure of iML to significantly outperform both the equal weight
setting and especially human bias, we conjecture that our experimental setup has
produced those effects: Since we wanted our users to conduct their experiment in
real-time but needed a simple implementation of an anonymization algorithm to
enable this interaction (which resulted in an O(n^2) algorithmic runtime), we had
to limit ourselves to just a tiny subset of data (500 rows, merely 1% of the original
dataset). This choice apparently resulted in generalizations proceeding far too
quickly, reaching suppression (’all’) levels prematurely, thereby denying our users
sensible clustering choices. On the other hand, the effect could also stem from
users not really trying to contribute to the experiments in a meaningful way;
this effect could only be mitigated by selecting more serious users or choosing
some less serious (more social?) application domain.
     Overall, we were also surprised that a seemingly absurd k-factor of 200 would
still yield comparably good results (and in some cases even improve perfor-
mance).


6    Open problems & future challenges

As iML for anonymization is still a fledgling sub-area within the larger fields of
privacy as well as Machine Learning, there are certainly innumerable possibilities
for even basic progress & development. The following list is only a tiny subset
of possible research avenues we deem suitable for our own future work:

 – Explain the unexpected behavior of linear SVC on the income target at
   high levels of k; probably by performing comparison studies on synthetically
   generated datasets.
 – Faster algorithm. Repeat the experiments with a faster algorithmic im-
   plementation so that we can use thousands of data points even in real time
   within a browser: this would lead to more relaxed generalizations, allow-
   ing the user to make better interactive choices, thus presumably improving
   results by quite some margin.
 – Expert domain, domain experts. By choosing an expert domain like can-
   cer studies in combination with proper experts like medical professionals,
   we would expect both human bias as well as iML results to significantly
   outperform a pre-defined weight vector.
 – Different setting. On the other hand, a more ’gamified’ setting such as rec-
   ommendations within a social network could motivate amateur users to get
   more immersed into the experiment, yielding better results even for mundane
   application tasks.
 – Different data formats. As Artificial Intelligence is slowly reaching matu-
   rity, it is now also applied to non- and semi-structured data like audio/video
   or even *omics data. Since images are clearly relevant for medical research,
   and humans are extremely efficient at processing them, studying interactive
   ML on visual data promises great scientific returns.


7    Conclusion

Based on the emerging necessity of Privacy aware data processing, in this work
we presented a fundamental approach of bringing human knowledge to bear on
the task of anonymization via interactive Machine Learning. We devised an ex-
periment involving clustering of data points with respect to human preference
for attribute preservation and tested the resulting parameters on classification of
anonymized people data into classes of marital status, education and income. Our
preliminary results show that human bias can definitely contribute to even mun-
dane application areas, whereas more complex or convoluted tasks may require
trained professionals or better data preparation (dimensionality reduction etc.).
We also described our insights regarding technical details for iML experiments
and closed by outlining promising future research avenues.



References
 1. Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza.
    Power to the People: The Role of Humans in Interactive Machine Learning. AI
    Magazine, 35(4):105–120, 2014.
 2. Saleema Amershi, James Fogarty, and Daniel Weld. ReGroup: interactive machine
    learning for on-demand group creation in social networks. Proceedings of the 2012
    ACM annual conference on Human Factors in Computing Systems - CHI ’12,
    page 21, 2012.
 3. Alina Campan and Traian Marius Truta. Data and structural k-anonymity in
    social networks. In Privacy, Security, and Trust in KDD, pages 33–54. Springer,
    2009.
 4. Cynthia Dwork. Differential privacy: A survey of results. In International Confer-
    ence on Theory and Applications of Models of Computation, pages 1–19. Springer,
    2008.
 5. R. Fiebrink, D. Trueman, and P.R. Cook. A metainstrument for interactive, on-
    the-fly machine learning. Proc. NIME, 2:3, 2009.
 6. A Holzinger, M Plass, K Holzinger, GC Crisan, CM Pintea, and V Palade. To-
    wards interactive machine learning (iml): Applying ant colony algorithms to solve
    the traveling salesman problem with the human-in-the-loop approach. In IFIP
    International Cross Domain Conference and Workshop (CD-ARES), pages 81–95.
    Springer, Heidelberg, Berlin, New York, 2016.
 7. Andreas Holzinger. Interactive machine learning for health informatics: When do
    we need the human-in-the-loop? Springer Brain Informatics (BRIN), 3(2):119–131,
    2016.
 8. Peter Kieseberg, Bernd Malle, Peter Frühwirt, Edgar Weippl, and Andreas
    Holzinger. A tamper-proof audit and control system for the doctor in the loop.
    Brain Informatics, pages 1–11, 2016.
 9. Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: Privacy
    beyond k-anonymity and l-diversity. In IEEE 23rd International Conference on
    Data Engineering, ICDE 2007, pages 106–115. IEEE, 2007.
10. Brian C.S. Loh and Patrick H.H. Then. Ontology-enhanced interactive anonymiza-
    tion in domain-driven data mining outsourcing. Proceedings - 2nd International
    Symposium on Data, Privacy, and E-Commerce, ISDPE 2010, (June):9–14, 2010.
11. Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan
    Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions
    on Knowledge Discovery from Data (TKDD), 1(1):1–52, 2007.
12. Bernd Malle, Peter Kieseberg, and Andreas Holzinger. Do not disturb? classifier
    behavior on perturbed datasets. In Machine Learning and Knowledge Extraction,
    IFIP CD-MAKE, Lecture Notes in Computer Science LNCS Volume 10410, pages
    155–173. Springer, Cham, 2017.
13. Bernd Malle, Peter Kieseberg, Edgar Weippl, and Andreas Holzinger. The right
    to be forgotten: towards machine learning on perturbed knowledge bases. In In-
    ternational Conference on Availability, Reliability, and Security, pages 251–266.
    Springer, 2016.
14. Carlos Moque, Alexandra Pomares, and Rafael Gonzalez. AnonymousData.co: A
    Proposal for Interactive Anonymization of Electronic Medical Records. Procedia
    Technology, 5:743–752, 2012.
15. M. E. Nergiz and C. Clifton. delta-presence without complete world knowledge.
    IEEE Transactions on Knowledge and Data Engineering, 22(6):868–883, 2010.



16. Pierangela Samarati. Protecting respondents’ identities in microdata release. IEEE
    Transactions on Knowledge and Data Engineering, 13(6):1010–1027, 2001.
17. Latanya Sweeney. Achieving k-anonymity privacy protection using generalization
    and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-
    Based Systems, 10(5):571–588, 2002.
18. Latanya Sweeney. k-anonymity: A model for protecting privacy. International
    Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570,
    2002.
19. Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, and Ian H. Witten.
    Interactive machine learning: letting users build classifiers. International Journal
    of Human-Computer Studies, 55(3):281–292, 2001.
20. Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. Interactive anonymization
    of sensitive data. Proceedings of the 35th SIGMOD international conference on
    Management of data - SIGMOD ’09, page 1051, 2009.



