Privacy Preserving Association Rule Mining Using
                       Perturbation Technique
                Tobi Deborah Omoyele                                                        Solomon Olalekan Akinola
           Department of Computer Science                                                Department of Computer Science
          University of Ibadan, Ibadan, Nigeria                                         University of Ibadan, Ibadan, Nigeria
                 Omoyeletobi@yahoo.com                                                          solom202@yahoo.co.uk

                                                                            volumes of data but the potential danger is privacy concerns of
ABSTRACT                                                                    sensitive data [1]. The objective of privacy preserving data mining
The information age has enabled organizations to gather large
                                                                            is to hide sensitive information so that they cannot be discovered
volumes of data. However, the usefulness of this data is negligible
                                                                            through data mining techniques. Privacy preserving data mining
if “meaningful information” cannot be extracted from it. Data
                                                                            (PPDM) refers to the area of data mining that seeks to safeguard
mining answers this need. The problem of privacy-preserving data
                                                                            sensitive information from unsolicited disclosure [2]. Most
mining has become important recently because of the increasing
                                                                            traditional data mining techniques analyze and model the dataset
ability to store personal data and the sophistication of data mining
                                                                            statistically, while privacy preservation is primarily concerned
algorithms to leverage information. Many researches have been
                                                                            with protecting against disclosure of individual data records.
done in this field but few with quantitative data had drawbacks of
high number of rules generated and few number of item hidden.               The term “privacy preserving data mining” was introduced in
This study, proposes a perturbation association rules hiding                papers Rakesh and Ramakrishna [3] and Lindell and Pinkas [4].
algorithm for privacy of quantitative data to provide a better              These papers considered two fundamental problems of privacy
algorithm for preserving quantitative data. In hiding of rules, the         preservation in data mining, privacy preserving in data collection
noise associated with each item was calculated. The noise was               and mining a dataset partitioned across several private enterprises.
used to calculate the support and confidence of rules which were            Rakesh and Ramakrishna [3] devised a randomization algorithm
then compared with minimum support and confidence. Item whose               that allows a large number of users to contribute their private
support/confidence is less than or equal to minimum support or              records for efficient centralized data mining while limiting the
confidence would be hidden. Experimental result shows that the              disclosure of their values, Lindell and Pinkas [4] invented a
algorithm hides more rules than the existing works.                         cryptographic protocol for decision tree construction over a
                                                                            dataset horizontally partitioned between two parties.
CCS Concepts                                                                A number of techniques such as randomization, suppression,
• Information systems ➝ Information systems applications ➝                  summarization, association rule, perturbation, cryptography and k-
Data mining ➝ Association rules • Security and privacy ➝                    anonymity have been suggested in recent years according to
Human and societal aspects of security and privacy ➝ Privacy                survey carried out by Alexandre [5] in order to perform privacy-
protection                                                                  preserving data mining.
Keywords: Data Perturbation; Data Mining; Data Privacy;                     Ila [6] proposed an algorithm for privacy preservation of data
Randomization; Association rule mining; Summarization;                      mining in which he assumed that only sensitive data items can be
Suppression.                                                                found in the database. The proposed algorithm modified data in
                                                                            the database such that sensitive item can either be at the left hand
1. INTRODUCTION                                                             side or right hand side of the rule and cannot be inferred through
                                                                            association rule mining algorithms. For the algorithm to hide
The information age has enabled many organizations to gather
                                                                            sensitive item either the support or confidence is decreased to be
large volumes of data. However, the usefulness of this data is
                                                                            smaller than pre-specified minimum support and minimum
negligible if “meaningful information” or “knowledge” cannot be
                                                                            confidence.
extracted from it. Data mining attempts to answer this need. Data
mining is an iterative process within which progress is defined by          Most of the studies carried out in this research are concentrated on
discovery, prediction or classification of data through either              hiding binary database which are only concerned with the
automatic or manual methods. It is most useful in scenarios in              presence or absence of item but in reality most real applications
which there are no predetermined notions about                              consists of quantitative values. For instance, many people have
                                                                            the problem of sugar, but this does not mean that one is sick or
what will constitute an interesting outcome and it also involve the
                                                                            not, the only criterion that can be use to for determine the illness
search for new, valuable and nontrivial information in large
                                                                            is the surplus/deficiency in sugar’s quantity.
                                                                            The problem of mining quantitative association rule was first
                                                                            introduced by Evfimievski et al [7]. Two works have been done in
                                                                            the field of hiding fuzzy association rule in quantitative data
CoRI’16, Sept 7–9, 2016, Ibadan, Nigeria.                                   according to Manoj and Joshi [8] and Berberoglu [9]. However,
                                                                            little or no work has been done in the aspect of perturbation of
                                                                            association rules for quantitative data using the concept of noise.
                                                                            Manoj and Joshi [8] proposed a hiding algorithm that integrates
                                                                            the fuzzy set concepts and Apriori mining algorithm to find


                                                                       91
useful fuzzy association rules from a quantitative database and                applying privacy algorithm to only Boolean data by applying the
then hide them using privacy preserving technique. Unlike                      algorithm to quantitative data and not just the presence or absence
previous approaches which mainly deals with association rules                  of data.
in binary database, the approach deals with hiding the
association      rules     in quantitative database. Numerical                 3. PROPOSED ALGORITHM
experiments were performed to measure the performance of                       In order to hide an association rule, A→B we can either decrease
the algorithm according to three criteria: the number of rules                 its support to be smaller than minimum support value or its
hidden, side effects and database effects of the algorithm.                    confidence to be smaller than its minimum confidence value. To
However, the algorithm only deals with association rule in small               decrease the confidence of a rule, two strategies can be used. The
dataset in which the algorithm generated much rules but the                    first one is to increase the support count of A i.e. LHS of the rule,
algorithm is not efficient since a large part of the rules generated           but not support count of A→B. The second one is to decrease the
resulted into lost rules.                                                      support count of A→B. For the second case, if we only decrease
                                                                               the support of B, the right hand side of the rule, it would reduce
In this paper, we attempt to present a perturbation association rules          the confidence faster than simply reducing the support of A∪B.
hiding algorithm for privacy preservation of quantitative data with            Based on these two strategies, we propose a privacy preserving
few rules and hides higher percentage of the rules by decreasing               data mining algorithm for hiding sensitive quantitative data items
the support of the rule. The support of rule          is decreased by          called perturbation of association rule for quantitative data using
decreasing the support of A. This is achieved by decreasing the                the concept of noise. This algorithm first calculates the value of
support value of either A or B of the rule until either the support or         noise for each data items and the column with the highest noise
confidence is below the minimum support or minimum confidence                  value form the rule with the sensitive column. Secondly, the
value respectively.                                                            algorithm find useful association rule that consist of only one
The rest of this paper is organized as follows. Privacy preserving             item on both sides of the rule and then hide them using
quantitative association rule hiding was defined in Section II, our            privacy preserving technique. For hiding purpose, the
approach to hide useful association rules in quantitative data was             algorithm tries to decrease the support of rule A→B by
presented in Section III, the perturbation association rule hiding             decreasing the support count of itemset AB until either support or
process was presented in Section IV, and the proposed algorithm                confidence value of the rule goes below minimum support or
related example was presented in Section V. Section VI includes                minimum confidence value respectively.
the conclusion and future works.
                                                                               Input:
2. PROBLEM STATEMENT                                                           (1)A source database D,
The objective of privacy preserving data mining is to hide certain             (2) Minimum support threshold
information so that they cannot be discovered through data mining              (3) Minimum confidence threshold
techniques. There have been two broad approaches for privacy                   Output: A transformed database D where rules containing A on
preserving data mining. The first approach, called output privacy,             LHS (Left Hand Side) or B at the RHS (Right Hand Side) will be
is to alter the data before delivery to data miner so that real data is        hidden.
obscured and mining result will not disclose certain privacy. For
example, perturbation, blocking, merging, swapping and sampling                Algorithm
are some methods that have been proposed for this type of output                   1. Transform the rough dataset(Rd) into original
privacy. The second approach, called input privacy, is to                              database(D) as Rd→D
manipulate the data using data distribution methods in which                       2. For each itemset XɛD, for                             .
the privacy of the data is protected before releasing to the user. In                  Calculate the value of the noisy attribute V such that
this approach, mining result is not affected or minimally affected.
Almost all of studies that have been proposed in this research area                             √
concentrated on hiding Boolean association rules which are                          3.   Calculate the sumCount of each noisy attribute         for
concerned only with whether an item is present in a
transaction or not, without considering its quantity.                               4.   If sumCount( ) = Maximum Value such that k =
However, transactions with quantitative values are commonly                              Maximum Value
found in real world application. For example, in a patient’s blood                        THEN
test, many attributes could be found. In addition, attribute’s                           1-itemset = {                 }
quantity instead of just presence/absence of the attribute in blood
                                                                                    5.   EXIT sumCount
is more important for determination of the illness. Furthermore,
                                                                                    6.   Sensitive column = {              } such that mɛN
many people have the problem of sugar but this does not
                                                                                         1-itemset       {            }
mean that one is sick or not, the only criterion to used to
                                                                                    7.   Find 2-itemset from the union of      and
determine the illness is the surplus/deficiency in sugar ’s
                                                                                    8.   Find ={ rules from itemset X}
quantity.
                                                                                         //for X= {     } the possible rules are
 Some works have been done to discover association rules from
quantitative data which generate set of rules but hides less than
30% of the rules generated. This paper proposes a privacy
preserving data mining algorithm (that uses the output privacy
approach in which the data is preserved before it will be released
to the miner) that improves on the association rule hiding
algorithms for quantitative data by combining the perturbation                      9.   Compute the support of rule
                                                                                                                 (         )       (    )
method for privacy and association rule mining technique for                                ((          ))                     =
privacy. Furthermore, the paper addresses the challenges of

                                                                          92
                                                                               STEP 4: join the 1-itemset to the sensitive column        in a way
     10. Compute the confidence of the rule                                    similar to that of apriori algorithm to form a 2- itemset. The 2-
                                   (        )         (           )            itemset is use to find the useful association rule by placing    at
            ((           ))                                           =
                                         (   )            (   )
             (       )
                                                                               the LHS and at the RHS similar to that of apriori algorithm
                 (   )
                                                                               STEP 5 (a): in order to hide sensitive rule, calculate the support
     11. if Sup(LHS,RHS) ≤ MST or conf(LHS,RHS) ≤ MCT                          and confidence of each rule
              = X+                                                                                                     (           )
                                                                                                  ((          ))
                    Else
                                                                                                                             and
                    =
                                                                                                                   (            )          (            )
     12. go to line 8                                                                    ((            ))
     13. select the next rule                                                                                 (             ()                      )
     14. repeat // line 9,10 and 11 on each new rule                           STEP 5(b): A rule is hidden if
     15. if    is empty                                                             (         )
          THEN                                                                                    Or
     16. transform the updated database D and output the                                     (           )
         updated D                                                             if Sup(LHS,RHS) ≤ MST or conf(LHS,RHS) ≤ MCT
                                                                                           = X+
                                                                                        Else
3.1 Explanation to Abbreviations in the                                                 =
Algorithm
     LHS (left hand side): this is the left hand side of the rule              5. ILLUSTRATION OF THE ALGORITHM
     RHS (right hand side): this is the right hand side of the rule            In this section, we give an example to illustrate how the proposed
     generated                                                                 algorithm can be used.
     MST (minimum support threshold): this is the support                      Let the original data be depicted below:
     specified by the user of the algorithm
     MCT (minimum confidence threshold): this is the confidence                               A              B           C                     D
     threshold specified by the user                                           T1             10             5           8                     3
     min = minimum                                                             T2             3              11          6                     14
     Sup/Supp = support                                                        T3             6              3           9                     13
                                                                               T4             7              5           8                     12
4. STEPS TO THE PROPOSED                                                       T5             11             4           7                     10
ALGORITHM
n = total number of transaction data                                           T1, T2, T3, T4, T5 are transactions and A, B, C, D are the
m= total number of attributes (items)                                          attributes of the data in the relational table
D=       attribute 1 ≤ i ≤ n                                                   Given:
  =     attribute 1 ≤ j ≤ m                                                    Assuming the sensitive column to hide is column B
V= noise                                                                       Minimum support threshold (MST) = 44% = 0.44
X= original data                                                               Minimum confidence threshold (MCT) = 75% = 0.75
   = inverse of original data
MCT= minimum confidence threshold                                              STEP 1: for each transaction data D, i =1 to n, and for each
MST= minimum support threshold                                                 attribute (item) j= 1 to m, transform the quantitative value into a
  = each noisy attribute 1≤ k ≤ i                                              noisy quantitative attribute value using the randomly generated
                                                                               formula
                                                                                                   √
STEP 1: for each transaction data D, i =1 to n, and for each                   For X =
attribute (item) j= 1 to m, transform the quantitative value into a
                                                                                 =
noisy quantitative attribute value using the randomly generated
                                                                               For T1      = 10
formula                                                                        For T1      =1/10
                 √
For X =                                                                        N= 5 i.e number of transactions in the database.
   =
STEP 2: calculate the sumcount of each noisy attribute        on the                  A                 B                C                     D
transaction data as                                                            T1     10       2.26     5         1.16   8          1.82       3     0.75
         =∑                                                                    T2     3        0.75     11        2.48   6          1.38       14    3.15
                                                                               T3     6        1.38     3         0.75   9          2.04       13    2.93
STEP 3: for each noisy attribute 1≤j≤m and 1≤k≤m check for                     T4     7        1.59     5         1.16   8          1.82       12    2.70
                                                                               T5     11       2.48     4         0.95   7          1.59       10    2.26
the         that has the maximum value. If             satisfies the
above condition then the column whose            has the maximum
                                                                               STEP 2: calculate the count of each noisy attribute                  on the
value is put in the set of 1- itemset which form the left hand side
(LHS) of the rule                                                              transaction data as
i.e ={                                                 }
                                                                                                                         ∑


                                                                          93
         A                           B                         C                              D                      i.e
T1       10           2.26           5    1.16                 8               1.82           3         0.75                       =               = 11
                                                                                                                                                                                   (           )               (         )
T2       3            0.75           11   2.48                 6               1.38           14        3.15                              (                      )
T3       6            1.38           3    0.75                 9               2.04           13        2.93
T4       7            1.59           5    1.16                 8               1.82           12        2.70                                                                       (                   )             (            )
T5       11           2.48           4    0.95                 7               1.59           10        2.26                       (                      )
                                                                                                                                                                                           (       )                     (       )
count                 8.46                6.5                                  8.65                     11.79                                                                      (                       )

STEP 3: for each noisy attribute       1≤j≤m and 1≤k≤m check for
the          that has the maximum value. if              satisfies the                                               Since Sup(                      ) > MST but conf(                                             ) < MCT, hence
above condition then the column whose              has the maximum                                                   is hidden
count is put in the set of 1- itemset which form the left hand side                                                  i.e                   =         +
(LHS) of the rule                                                                                                                                                              to nearest whole number
                                                                                                                                                          (            )                   (               )
i.e = *                                                   }                                                                (                   )
In this case
                                                                                                                                                              (                )                   (           )             (        )
       *                 }                                                                                                     (           )
STEP 4: join the 1-itemset to the sensitive column in a way                                                                                                       (        )                           (       )
similar to that of apriori algorithm to form a 2- itemset. The 2-
itemmset is use to find the useful association rule by at the LHS
and at the RHS similar to that of apriori algorithm                                                                  Since Sup(                       ) < MST and conf(                                            ) > MCT, hence
                                                                                                                     is hidden
                                                                                                                     i.e
                                                                                                                         = +
                                                                                                                                                                               to nearest whole number
                                                                                                                                                                                   (           )               (         )
                                                                                                                                          (                      )

STEP 5 (a): in order to hide sensitive rule, calculate the support                                                                                                                 (                   )             (            )
and confidence of each rule                                                                                                            (                  )
                                                                                                                                                                                           (       )                     (       )
                                       (           )                                                                                                                                   (                       )
                  ((         ))
                                          and                                                                        Since Sup(                       ) < MST and conf(                                            ) < MCT, hence
                                            (                          )                  (                 )        is hidden i.e
         ((                  ))
                               (   )         (                                                          )                = +
STEP 5(b): A rule is hidden if                                                                                                                                                 to nearest whole number
     (         )
                    Or                                                                                               Transformed Database
              (           )                                                                                                     A                                     B                            C                         D
if Sup(LHS,RHS) ≤ MST or conf(LHS,RHS) ≤ MCT                                                                         T1         10                                    6                            8                         3
            =      +                                                                                                 T2         3                                     11                           6                         14
         Else                                                                                                        T3         6                                     4                            9                         13
         =                                                                                                           T4         7                                     6                            8                         12
                                           (                               (              )
     
                                                           )
                  (               )                                                                                  T5         11                                    5                            7                         10

                                               (                   )                  (             )                6. DISCUSSION OF RESULT
                  (              )                                                                                   From the result shown in the transformed database, the algorithm
                                                       (       )                              (    )
                                                   (                           )                                     was able to hide 4 out of the 5 items on the sensitive column
                                                                                                                     resulting into 80% privacy protection of the sensitive column B.
Since Sup(               ) < MST but conf(                                         ) > MCT ,hence                    Comparing with the previous works of Manoj and Joshi [8] and
is hidden                                                                                                            Berberoglu and Kaya [9] on the same dataset, which hide only 1
i.e                                                                                                                  out of the 5 items, resulting into 20% privacy protection of the
    = +                                                                                                              sensitive column B of the data while the proposed algorithm hides
                                                                                                                     80% of the sensitive column B.
                                           (               )               (              )
                 (               )
                                                                                                                     7. CONCLUSION AND FUTURE WORK
                  8/5=0.49                                                                                           In this work, we proposed an improved algorithm for hiding
                                           (                       )                  (            )                 sensitive quantitative data that combines the concept of noise with
              (              )
                                                   (           )                          (       )                  association rule mining to generate the useful association rules in
                                               (                       )                                             quantitative data and then hide them using privacy concept. The
                                                                                                                     algorithm works on association rules in quantitative data unlike
Since Sup(               ) > MST and conf(                                         ) > MCT, hence                    previous work done in this field which works on binary data. In
is not hidden                                                                                                        the future we plan to apply the algorithm to real life dataset; also

                                                                                                                94
we propose that in the proposed algorithm the association rule          [5]   Alexandre Evfimievski (2002)- randomization in a privacy
should contain more than one item on each side of the rule and                preserving data mining SIGKDD explorations. Volume 4-
that the algorithm should be used for text data after they might              issue-2, 2002- international journal of computer science
have been normalized into quantitative data.                                  and information technologies
                                                                        [6]   Ila Chandrakar (2016). Hybrid algorithm for privacy
                                                                              preserving association rule mining. Department of
8. REFERENCES                                                                 Information Technology VNR Vignana Jyothi Institute of
[1]   Doug Alexander (2016) - data mining                                     Engineering     and     Technology    Hyderabad,    India
      http://www.laits.utexas.edu/anorman/BUS.FOR/course.mat/                 http://www.ieee.org/documents/hybrid-algorithm. Accessed
      Alex/. Accessed on January 2016                                         May 2016.
[2]   Alexandre Evfimirvski, Ramakrishnan Srikant, Rakesh               [7]   Evfimievski A., Srikant R., Agrawal R., and Gehrk J.,
      Agrawal and Johannes Gerhke Privacy preserving mining of                (2002). Privacy Preserving Mining of Association Rules, in
      association rules. In proceedings of the 8th ACM SIGKDD                 Proceedings of the 8th Conference on Knowledge Discovery
      international Conference on Knowledge discovery in                      and Data Mining, New York, USA, pp. 1-12, 2002.
      databases and data mining, Edmonton, Akberta, Canada              [8]   Manoj Gupta and Joshi R. C. (2009). Privacy Preserving
      July 23-26, 2002, pages 217-228                                         Fuzzy Association Rules Hiding in Quantitative Data.
[3]   Rakesh Agrawal and Ramakrishna Srikant (2000) SIGMOD                    International Journal of Computer Theory and Engineering,
      ‘oo proceedings of the 2000 ACM SIGMOD international                    Vol. 1, No. 4, October, 2009, 1793-8201.
      conference on Management of data pages 439-450                    [9]   Berberoglu T. and Kaya, M. (2008). Hiding Fuzzy
[4]   Lindell Y and Pinkas B (2000) Advances in cryptology                    Association Rules in Quantitative Data, The 3rd
      ‘crypt ‘000 proceedings, LNCS 1880, Springer-                           International Conference on Grid and Pervasive
      Vertag,pp.20-24,August 2000.                                            Computing Workshops, May 2008, pp. 387-392.


                                                                   95