=Paper=
{{Paper
|id=Vol-3304/paper09
|storemode=property
|title=A Hierarchical-based Frequent Itemset Mining Method under Local Differential Privacy
|pdfUrl=https://ceur-ws.org/Vol-3304/paper09.pdf
|volume=Vol-3304
|authors=Deming Kong,Jian Wang
}}
==A Hierarchical-based Frequent Itemset Mining Method under Local Differential Privacy==
<pdf width="1500px">https://ceur-ws.org/Vol-3304/paper09.pdf</pdf>
<pre>
A Hierarchical-based Frequent Itemset Mining Method under
Local Differential Privacy
Deming Kong, Jian Wang
Nanjing University of Aeronautics and Astronautics, College of Computer Science and Technology, Nanjing,
210016, China

                 Abstract
                 Frequent itemset mining aims at finding the sets of items that occur together frequently in many
                 transactions, which can lay the foundation for further association rule mining. In this paper, to
                 deal with the problem of massive amount of data, we adopt a specific model which expresses
                 the attributes of item hierarchically. Based on the layering attributes, we propose a novel
                 frequent itemset mining method under local differential privacy which deals with the frequent
                 itemset mining problem layer by layer and it greatly improves the search efficiency.
                 Experimental analyses based on real and synthetic datasets show that our method outperforms
                 the state-of-the-art in terms of the computation cost.

                 Keywords 1
                 local differential privacy, frequent itemset mining, layering attribute

1. Introduction

    Frequent Itemset Mining (FIM) is a key problem in data mining. By mining frequent itemset,
merchants can discover the links between products, and thus predict customers' buying habits and
improve their service quality, which is beneficial both for their own interests and the economic
development of society. However, in the process of collecting users’ information, there is a risk that
some sensitive personal information such as users’ purchase history, browsing preferences may leak
and users do not want to expose their privacy [15]. In recent years, in order to protect privacy, local
differential privacy [1] has been widely used in data analysis tasks as a secure and advanced privacy-
preserving technique without the need for trusted third party and it can provide a better balance between
users’ privacy and data utility.
    There has been some research on frequent itemset mining of set-valued data under local differential
privacy [5,6,13]. However, there exits some problems in the process of frequent itemset mining under
local differential privacy, e.g., the excessive computation cost and low data utility. To deal with the
massive amount of data, we devise a novel framework based on dividing the category attributes of an
item into layers and step-by-step processing. The hierarchical expression of attributes makes full use of
the correlation between different layers of attributes. Specifically, the correlation here mainly refers to
the inclusion relationship of category attributes from different layers. Layer-division strategy of
attributes increases the available information dimension of the model and demonstrates the idea of
divide and rule [4]. In addition, step-by-step processing greatly reduces the complexity of problem and
improves the search efficiency.
    In real world, people may have complex personal requirements. Fortunately, based on the layering
structure of category attributes, our scheme can perfectly meet the complex requirements through
mining frequent itemset from high layer to low layer. Thus, users are able to obtain the frequent itemset
of different layers, rather than a single layer. Compared with other methods, our model is more flexible.
    The main contributions of this paper are as follows.

ICBASE2022@3rd International Conference on Big Data & Artificial Intelligence & Software Engineering, October 21-
23, 2022, Guangzhou, China
EMAIL: kongdeming@nuaa.edu.cn (Deming Kong); wangjian@nuaa.edu.cn (Jian Wang);
ORCID: 0000-0002-6202-2921 (Deming Kong); 0000-0002-8376-5898 (Jian Wang);
              ©️ 2022 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)


                                                                                  67
   1) We adopt a hierarchical expression of category attributes to improve the search efficiency. In
addition, mining frequent itemset at different layers can meet the complex personal needs of users.
   2) Based on the layering attributes, we propose a novel frequent itemset mining method which
deals with the problem layer by layer.
   3) We introduce a specific model of local differential privacy to the method mentioned above, and
our experimental results validate the effectiveness of the proposed method.
   The rest of the paper is organized in the following order. Section 2 introduces background
knowledge and problem definition of frequent itemset mining, Section 3 presents some previous work,
Section 4 gives a complete implementation of our scheme, Section 5 shows experimental results of our
scheme and compares it with other methods, and Section 6 concludes our work.

2. Preliminaries and Problem definition
2.1. Local Differential Privacy

    Definition 1 ( 𝛜 -Local Differential Privacy ， 𝛜 -LDP) a perturbation mechanism 𝐴: 𝐷 → 𝑂
satisfies 𝜖-LDP if and only if for any two users' data 𝑑, 𝑑 ′ ∈ 𝐷 , and any possible output 𝑜 ∈ 𝑂 satisfies
the following inequality (1).
                                           𝑃 (𝐴 (𝑑 ) = 𝑜 )
                                                             ≤ 𝑒𝜖                     (1)
                                           𝑃 (𝐴 (𝑑 ′ ) = 𝑜 )
    Theorem 1 (Sequential Composition Theorem) Let A1 , A2 , A3 , … , A𝑛 be a series of LDP
mechanisms where A𝑖 implements 𝜖𝑖 -LDP, when the mechanism A on the data 𝐷 are run in an
independent random on A1 (𝐷), A2 (𝐷), A3 (𝐷), … , A𝑛 (𝐷) or run them in some sequential primary order,
at this point the mechanism 𝐴 is considered to satisfy the (∑𝑛𝑖=1 𝜖𝑖 )-LDP.

2.2.     Unary Encoding

   To cope with the problem of input domain size larger than 2, Generalized Random Response (GRR)
[15] and Unary Encoding (UE) [3] were proposed. It was later investigated that when the input domain
size 𝑑 > 3𝑒 𝜖 + 2, the variance of UE is smaller than that of GRR and is therefore more suitable as a
perturbation method [3].
   Definition 2 (Unary Encoding, UE) first maps the input 𝑣 as a 𝑑 bit length vector 𝐵 where only
the bit corresponding to the user input 𝑣 is set as 1. For the vector after the one-hot encoding 𝐵, UE is
perturbed separately for each bit as the following Equation (2).
                                                        𝑝, 𝑖𝑓𝐵[𝑖 ] = 1
                                   𝑃(𝐵′ [𝑖 ] = 1) = {                                 (2)
                                                        𝑞, 𝑖𝑓𝐵[𝑖 ] = 0
                                                                    𝑝.(1−𝑞)
     According to the 𝜖-LDP definition, UE can provide 𝜖 = ln                 for LDP. In order to minimize the
                                                                    𝑞.(1−𝑝)
                                                                                                        1
variance of UE, Wang [3] et al. proposed Optimal Unary Encoding (OUE), resulting in 𝑝 = , 𝑞 =
                                                                                                 2
 1                                                                                                4𝑒 𝜖
       . In the frequency estimation task, OUE has the smallest estimation variance of 𝑉𝑎𝑟 = 𝑛           .
𝑒 𝜖 +1                                                                                         (𝑒 𝜖 −1)2


2.3.     Padding-and-sampling


Fig.1. Padding-and-sampling

                                                    68
    For the frequency estimation of set-valued data, padding-and-sampling protocol [2,7,8] is proposed
to deal with the challenge of users’ difference in transaction length.
    Wang [8] proposed the Padding-and-Sampling-based Frequency Oracle (PSFO) protocol which is
based on the padding-and-sampling. PSFO protocol is specified by three parameters: a positive integer
𝐿, a frequency oracle FO, and the privacy budget 𝜖. It is composed of a pair of algorithms: ⟨𝜓, 𝜙⟩, where
𝜓 is used by each user to perturb his input value, and 𝜙 is used by the aggregator; 𝜙 takes as input the
reports from all users, and can be queried for the frequency of each value. Hence, PSFO can be defined
as the following Equation (3).

                              𝑃𝑆𝐹𝑂(𝐿, 𝐹𝑂, 𝜖 ) = ⟨𝜓𝐹𝑂(𝜖) (𝑃𝑆𝐿 ), 𝜙𝐹𝑂(𝜖) . 𝐿⟩        (3)

2.4.    Problem definition

    In the frequent itemset problem, the data collector knows the set of all items which is referred to as
the full set 𝐼 . There is a total of 𝑛 users, and the 𝑖 𝑡ℎ user has a set of items 𝑣 𝑖 ⊆ 𝐼 and we call this a
transaction. Similarly, an itemset 𝑋 is a collection of items, and the frequency of any 𝑋 ⊆ 𝐼 is defined
as the number of transactions containing that itemset 𝑋, which is also denoted as 𝑓𝑋 = {𝑣 𝑖 |𝑋 ⊆ 𝑣 𝑖 }.
    Our goal is to devise a secure and efficient scheme that enables data collector to implement frequent
itemset mining under local differential privacy. We denote the frequency of top-k frequent item as 𝑓(𝑦),
for the candidates of top-k frequent itemset 𝑌, our optimization goal is: 𝑓𝑌 = argmax ∏𝑦∈𝑌 𝑓 (𝑦). The
                                                                                      𝑌
top-k frequent itemset refers to an itemset whose frequency is among the k highest for all itemsets.

3. Related work

    There has been some researches devoted to mining frequent itemset under local differential privacy.
To address the problem of the varying number of items owned by users, Qin [7] proposes a two-phase
approach for better frequency estimation. She first pads the set of items to a certain length L by some
dummy items, and then samples a random item from it. They propose an algorithm for frequent item
mining based on this approach, called LDPMiner. However, this method can only find frequent items
while cannot be applied to the frequent itemset mining.
    Inspired by LDPMiner, Wang [8] groups users and each of them is involved in an independent query
task. This method improves the search accuracy and is known as the SVIM frequent item mining
algorithm. Based on SVIM, he then constructs the frequent itemset candidates and proposes the SVSM
frequent itemset mining algorithm, which is the first solution to the frequent itemset mining problem
under local differential privacy. However, SVSM focuses on items with only a single attribute, which
ignores the correlation between items in the real world.
    Zhang [16] devised a MRR algorithm to perform frequent itemset mining under local differential
privacy. It not only takes into account the multiple attributes of users’ data, but also can meet the needs
of users' personalized privacy protection. However, the process is inefficient in encoding because items
that do not have values in some attributes still consume large computation cost, and the computation of
frequent itemset supported by Bayes formula leads to low accuracy. Sharmin [15] proposed a novel
method to find frequent itemset on small datasets. The scheme framework is better at maintaining the
correlation between items, but the interactions between users and server are too complex to be applied
in reality.

4. An efficient FIM method with layering attribute under LDP

    In this section, we propose the solution to the problem of mining frequent itemset under local
differential privacy. We first introduce the layering structure of category attribute, and then we present
the complete implementation of our proposed method under LDP.


                                                     69
4.1.    Layering structure of category attribute


Fig. 2. Hierarchical expression of category attributes

    Different from most of the researches nowadays which focus on the items with only a single category
attribute, we divide the category attributes into many layers and express them hierarchically. The
layering structure is shown in Figure 2. By dividing the attributes into layers, we are able to expand the
original single attribute of an item into multi-dimensional category attribute. Hence, the problem of
massive amount of data can be effectively solved by taking advantage of the correlation between
different layers of attributes [4].
    Based on the layering structure of category attribute, in order to improve the search efficiency, we
devise a novel scheme of finding frequent itemset step by step, from high layer to low layer. We find
the frequent itemset of large category firstly, and then to search the specific frequent itemset of small
category within the given frequent itemset of large category, from coarse to fine, to greatly reduce the
problem complexity.
    If the number of a small category item is increasing, the number of corresponding large category
will also increase, it makes sense that the frequent itemset of large category can be used for the further
search of small category frequent itemset. The complexity of the frequent 2-itemset problem is O (𝑛2 ).
e.g., there are ten large categories, and each large category contains 100 small category items. If we
find the top three large categories of frequent 2-itemset, the problem scale will decline to O (3 ∗ 2002 ),
which is much smaller than O (10002 ). Although there is some loss in accuracy, it greatly improves
the search efficiency.

4.2.    Mining frequent itemset with layering attribute under LDP


Fig. 3. Top-level framework (𝛾𝑘𝑖 represents top-𝑘𝑖 frequent itemsets of the 𝑖 𝑡ℎ layer)


                                                    70
     We propose a multi-phase scheme framework in which users obtain frequent itemset of each layer
after interacting with the server several times. Based on the idea of user pruning, after the itemset
frequency estimation of each layer has been performed, only the top-ranked frequent itemsets of the
current layer are sent to the next layer for continue mining. The iterative process of the frequent itemset
mining for each layer continues and all estimated frequent itemsets of each layer are discovered
eventually.
     The total privacy budget 𝜖 is divided into 𝑚 parts for 𝑚 layers and each 𝜖𝑖 is used in the 𝑖 𝑡ℎ layer.
Specifically, we allocate privacy budget 𝜖 equally, which means each layer has the same average of
𝜖𝑖 = 𝜖⁄𝑚 and the whole process satisfies 𝜖-LDP according to the sequential composition property.
     In LDP, a common approach is to partition users into different groups, each answering one separate
question than for all users to answer multiple questions each with part of the total privacy budget 𝜖 [8].
It is proved that the over division of privacy budget will greatly reduce the data utility and the grouping
of users is beneficial to the accuracy of the final result. In our scheme, since there are 𝑘𝑖 frequent
itemsets of the 𝑖 𝑡ℎ layer, we randomly divide users into 𝑘𝑖 groups and each group of users participate
in an independent task for the (𝑖 + 1)𝑡ℎ layer frequent itemset mining separately [9]. The top-level
framework is shown in Figure 3.
     For each group of users with 𝐼𝑆 from the previous layer, each user first finds the items from his own
transaction which is in the range domain of the itemset 𝐼𝑆. If a user does not possess any item in the
range domain of 𝐼𝑆, he will be regarded as a useless participant and does not need to go through the
next series of operations. In other words, the frequent itemset is combined of the items only from the
range domain, and any item outside the domain is not considered, which demonstrates the idea of user
pruning. Algorithm 1 outlines the specific process of each layer.
     Step 1: Padding and Sampling. As the number of items in a user transaction varies, existing FO
protocols cannot handle this problem. Hence, it is necessary to adopt padding-and-sampling protocol to
select one item to report. If the user’s transaction length is less than 𝐿, the dummy item is padded; if the
user’s transaction length is larger than 𝐿, he randomly draws 𝐿 items out. Although the dummy item
could be sampled, its frequency will be ignored by the server during the frequency estimation process
and it definitely could not be considered as a frequent item.
     Step 2: Encoding and Perturbing. Assuming that there are 𝑁 users, each of whom 𝑢𝑗 possesses an
item obtained by padding and sampling. Then, the sampled item is indexed up to get the value of the
                                                                                              𝜖
𝑖 𝑡ℎ layer category attribute. Each user adopts OUE algorithm to perturb the value with 𝑖⁄2. Finally,
the perturbed value is submitted to the server for aggregation.
     Step 3: Aggregation and Estimation. After receiving the perturbed value from users, the server
calculates the frequencies of all category attributes by multiplying the estimated results with 𝐿 and
dropping the dummy item. Then, the server sorts attributes according to the estimated frequencies and
combines them two by two to construct the top-2k frequent itemset candidates. The 2k candidates are
sent to users.
     The following task is quite similar to the original problem, with the difference that the universe of
items becomes the 2k frequent itemset candidates, instead of the full set of 𝑑 items. Specifically, the
final top-k frequent itemsets are chosen from the 2k candidates.
     Step 4: Candidates Estimation. After receiving 2k frequent itemset candidates, the user selects the
set of items he owns in the 2k candidates. He then adopts padding-and-sampling algorithm to randomly
select one item from the padded set. Note that at this time the padding length 𝐿 becomes 2k rather than
                                                                                                       𝜖
90% of the number of users. Afterwards, the user adopts OUE algorithm to perturb the value with 𝑖⁄2
and sends it to the server.
     The server estimates the frequencies of all candidates from the perturbed value and thus top-k
frequent itemsets can be obtained.


                                                    71
5. Experimental Results

   In this section, we experimentally evaluate our method of mining frequent itemset for category data
under local differential privacy, and verify that our proposed scheme performs effectively on real and
synthetic datasets. Basically, we want to answer the following questions: Firstly, how many frequent
itemsets can be effectively identified. Secondly, how much do we improve the search efficiency.
   Environment: all algorithms are implemented in Python 3.7 and all the experiments are conducted
on an Intel Core i5-9300H 2.4GHz PC with 16GB memory.
   Datasets: we use the real dataset (Retail) [15] to show that our proposed method is suitable for real
datasets. We show the results of the Retail dataset in Figure 4. We also implement our scheme on a
synthetic dataset called norm-user [16] which obeys normal distribution and the result is shown in
Figure 5. The synthetic normal distribution has mean of 4 and standard deviation of 0.1.
   Our main goal of this experiment is to prove that by taking the correlations between items into
account, our scheme can usually obtain frequent itemset results more accurately and efficiently,
compared to SVSM [8] and Priv_OA [15] mechanisms.
   Evaluation metrics: to compare the accuracy of the results of the mining, we adopt the F − Score
[16] to measure our performance. Let 𝑋𝑡 = {𝑥1 , 𝑥2 , 𝑥3 , … , 𝑥𝑘 } denotes the ground truth for top-k
frequent itemset and 𝑋𝑟 represents the set of frequent itemset generated by a specific algorithm. Then


                                                  72
𝑋𝑡 ∩ 𝑋𝑟 is the set of real top-k itemsets that are identified by the algorithm. In this case, F − Score
represents the accuracy of the search and is defined as the following Equation (4). In addition, we take
the program running time as the expression the computation cost.
                                       𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑟𝑒𝑐𝑎𝑙𝑙 |𝑋𝑡 ∩ 𝑋𝑟 |
                      𝐹 − 𝑆𝑐𝑜𝑟𝑒 = 2 ∗                        =                  (4)
                                       𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙         |𝑋𝑡 |
   We conduct simulations on two kinds of datasets to compare the performance between three
mechanisms. The first algorithm performing is our proposed method, the second one is SVSM algorithm
and the last one is called Priv_OA which performs well on small datasets. The privacy budget 𝜖 we
employ ranges from 1.0 to 5.0 with intervals of 1. In addition, we also explore the relationship between
the accuracy of the results and the value of k through experiments. The value k ranges from 30 to 70
with intervals of 10.


Fig.4. the F − Score results on Retail dataset

   In Figure 4, we evaluate three methods on Retail dataset and plot the F − Score diagram based on
the variation of privacy budget 𝜖 and the value of k. As can be observed in Figure 4, the identification
accuracy which is denoted as F − Score increases with 𝜖, and basically maintain, even with a little
fluctuation when the value of k increases. It can be observed that our mechanism has an improvement
over the other two methods, and with the increase of privacy budget 𝜖 and the value of k, that advantage
is more obvious. That means the data utility has been improved under our proposed method. We also
can find that with the increase of k, F − Score does not go up that fast. That is probably because the
frequency difference of itemset is not that obvious and the error in frequency estimation also increases
with k.


Fig. 5. the F − Score results on synthetic dataset

   In Figure 5, we evaluate three methods on synthetic dataset and plot the F − Score diagram based
on the variation of privacy budget 𝜖 and the value of k. Similar to the Retail dataset, it can be observed
that the scheme we propose outperforms the other two methods. However, it can be noticed that when
the privacy budget 𝜖 = 1, our approach doesn’t perform well compared to Priv_OA, probably because
data utility is low when the privacy budget is small, and at this time, users’ privacy can be better
protected.

                                                     73
Fig. 6. the computation time on Retail dataset

   Furthermore, we also explore the relationship between computation time and variables (the privacy
budget 𝜖 and k) and the result is shown in Figure 6. This suggests that for the set of real top-k itemsets
that are identified by algorithms, our proposed method can greatly reduce the computation cost, thus
improve the search efficiency.

6. Conclusion

   In this paper, a novel method is proposed to mine frequent itemset under local differential privacy.
To improve the search accuracy, we divide the category attributes into layers and make full use of the
correlation between different layer of attributes. Based on the layering attributes, in order to reduce the
complexity of processing, frequent itemset mining is implemented step by step, from high layer to low
layer. In addition, we have experimentally verified that our approach performs effectively on both real
and synthetic datasets. Due to that we only consider the search of frequent 2-itemset in this paper, in
the future, we will continue to do research on frequent multi-itemset mining under LDP. Furthermore,
there can be more improvements in terms of different expressions of attribute structures.

7. References

[1] Dwork C. Differential privacy [C]//International Colloquium on Automata, Languages, and
    Programming. Springer, Berlin, Heidelberg, 2006: 1-12.
[2] Duchi J C, Jordan M I, Wainwright M J. Local privacy and statistical minimax rates[C]//2013
    IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 2013: 429-438.
[3] Wang T, Blocki J, Li N, et al. Locally differentially private protocols for frequency estimation
    [C]//26th {USENIX} Security Symposium ({USENIX} Security 17). 2017: 729-745.
[4] Wang N, Xiao X, Yang Y, et al. Collecting and analyzing multidimensional data with local
    differential privacy [C]//2019 IEEE 35th International Conference on Data Engineering (ICDE).
    IEEE, 2019: 638-649.
[5] Wang S, Huang L, Nie Y, et al. Privset: set-valued data analyses with locale differential
    privacy[C]//IEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, 2018:
    1088-1096.
[6] Wang S, Qian Y, Du J, et al. Set-valued data publication with local privacy: tight error bounds and
    efficient mechanisms[J]. Proceedings of the VLDB Endowment, 2020, 13(8): 1234-1247.
[7] Qin Z, Yang Y, Yu T, et al. Heavy hitter estimation over set-valued data with local differential
    privacy [C]//Proceedings of the 2016 ACM SIGSAC Conference on Computer and
    Communications Security. 2016: 192-203.
[8] Wang T, Li N, Jha S. Locally differentially private frequent itemset mining[C]//2018 IEEE
    Symposium on Security and Privacy (SP). IEEE, 2018: 127-143.
[9] Kairouz P, Bonawitz K, Ramage D. Discrete distribution estimation under local privacy
    [C]//International Conference on Machine Learning. PMLR, 2016: 2436-2444.


                                                    74
[10] Wang T, Li N, Jha S. Locally differentially private heavy hitter identification [J]. IEEE
     Transactions on Dependable and Secure Computing, 2019.
[11] Fanti G, Pihur V, Erlingsson Ú. Building a RAPPOR with the Unknown: Privacy-Preserving
     Learning of Associations and Data Dictionaries[J]. Proceedings on Privacy Enhancing
     Technologies, 2016, 3: 41-61.
[12] Bassily R, Nissim K, Stemmer U, et al. Practical Locally Private Heavy Hitters [J]. Advances in
     Neural Information Processing Systems, 2017, 30: 2288-2296.
[13] Wang N, Xiao X, Yang Y, et al. PrivTrie: Effective frequent term discovery under local differential
     privacy [C]//2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 2018:
     821-832.
[14] Q. Xue, Y. Zhu and J. Wang, "Joint Distribution Estimation and Naï    ve Bayes Categoryification
     Under Local Differential Privacy," in IEEE Transactions on Emerging Topics in Computing, vol.
     9, no. 4, pp. 2053-2063, 1 Oct.-Dec. 2021, doi: 10.1109/TETC.2019.2959581.
[15] Sharmin Afrose, Tanzima Hashem, and Mohammed Eunus Ali. 2021. Frequent Itemsets Mining
     with a Guaranteed Local Differential Privacy in Small Datasets. 33rd International Conference on
     Scientific and Statistical Database Management Association for Computing Machinery, New York,
     NY, USA, DOI:https://doi.org/10.1145/3468791.3468807
[16] Xinyuan Zhang, Liusheng Huang, Peng Fang, Shaowei Wang, Zhenyu Zhu, and Hongli Xu. 2017.
     Differentially Private Frequent Itemset Mining From Smart Devices in Local Setting. in WASA.
     433-44


                                                  75

</pre>