Multi-Sorted Inverse Frequent Itemsets Mining for Generating Realistic No-SQL Datasets (Discussion Paper)

Domenico Saccà¹, Edoardo Serra² and Antonino Rullo¹
¹ DIMES Department, University of Calabria, 87036 Rende, Italy
² Department of Computer Science, Boise State University, Boise, ID 83725, USA

Abstract
The development of novel platforms and techniques for emerging "Big Data" applications requires the availability of real-life datasets for data-driven experiments, which are however not accessible in most cases for various reasons, e.g., confidentiality, privacy or simply insufficient availability. An interesting solution for ensuring high-quality experimental findings is to synthesize datasets that reflect the patterns of real ones. A promising approach is based on inverse mining techniques such as inverse frequent itemset mining (IFM), which consists of generating a transactional dataset satisfying given support constraints on the itemsets of an input set, typically the frequent and infrequent ones. This paper describes an extension of IFM that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.

Keywords
IFM, No-SQL, Itemset Mining

1. Introduction
Emerging "Big Data" platforms and applications call for novel data analysis techniques that are capable of effectively and efficiently handling large amounts of data [1]. There is therefore an increasing need to use real-life datasets for data-driven experiments, which are often not available for two main reasons: (𝑖) ownership (they are often proprietary and out of reach for academic research), and (𝑖𝑖) privacy (they may include sensitive data, as happens in many research areas, e.g., epidemiology, public health, social science, and the study of the behavior of large populations of individuals under natural scenarios as well as under human interventions).
The two above limitations can be overcome by designing "realistic" synthetic datasets that capture key attributes and activities of real ones without violating confidentiality [2]. In this paper we use an inverse data mining approach to generate an artificial No-SQL transactional dataset that reflects the patterns of a real one: the patterns can be thought of as a compressed representation of the original dataset; they are first discovered by data mining techniques and then used to generate a "realistic" pattern-preserving dataset.

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
domenico.sacca@unical.it (D. Saccà); edoardoserra@boisestate.edu (E. Serra); n.rullo@dimes.unical.it (A. Rullo)
0000-0003-3584-5372 (D. Saccà); 0000-0003-0689-5063 (E. Serra); 0000-0002-6030-0027 (A. Rullo)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

A well-known inverse approach refers to the classical data mining problem of extracting the set of frequent/infrequent itemsets from a transaction database. The related literature is rather rich and covers almost three decades: after the seminal papers in the nineties [3, 4], additional aspects have been studied in the last two decades. The perspective of the frequent itemset mining problem can be naturally inverted as follows: we are given in advance a set of itemsets together with their frequencies as constraints, and the goal is to compute a transactional database satisfying those constraints. This problem, called the inverse frequent itemset mining problem (IFM), was introduced in the context of defining generators for benchmarks of mining algorithms [5], and has subsequently been reconsidered in privacy-preserving contexts [6, 7].
A reformulation of IFM in terms of frequencies instead of supports has been introduced in [8, 9] under the name FREQSAT, and many variants of it have been proposed as well. The original IFM formulation does not introduce any constraint on infrequency. A simple solution to exclude unexpected frequent itemsets from a feasible solution is the formulation proposed in [10], called IFM_S: only itemsets listed in the set 𝑆 of constraints can be included as transactions in 𝒟. A more general version of IFM with infrequency support constraints (IFM_I for short) has been proposed in [11], where every infrequent itemset may occur a small number of times. In order to enlarge the domain of IFM to No-SQL applications, an extension of IFM has been proposed in [12] that considers more structured schemes for the datasets to be generated, as required by emerging big data applications, e.g., social network analytics. This many-sorted extension of IFM, called 𝑚𝑠-IFM, is defined by replacing the basic simple schema 𝑅(𝐾, 𝐴) with a more general No-SQL schema 𝑅(𝐾, 𝐴1, . . . , 𝐴𝑝, 𝐴𝑝+1, . . . , 𝐴𝑝+𝑞), where 𝐾 is the table key, 𝐴1, . . . , 𝐴𝑝 are single-valued attributes, and 𝐴𝑝+1, . . . , 𝐴𝑝+𝑞 are multi-valued attributes. The problem reduces to classical IFM when 𝑝 = 0 and 𝑞 = 1, i.e., when there is exactly one multi-valued attribute. In this paper we recall the formal definition of 𝑚𝑠-IFM from [12] and illustrate a method for solving 𝑚𝑠-IFM that has been shown to be effective by means of a number of experiments. The solver is based on large-scale linear programming and generalizes the approach used in [11] to solve IFM_I.

2. Many-Sorted IFM
Let a No-SQL relation 𝑅(𝐾, 𝐴1, . . . , 𝐴𝑝, 𝐴𝑝+1, . . . , 𝐴𝑝+𝑞) be given, where 𝐾 is the table key, 𝐴1, . . . , 𝐴𝑝 are classical single-valued (SV) attributes and 𝐴𝑝+1, . . . , 𝐴𝑝+𝑞 are multi-valued (MV) attributes.
We assume that the attributes are ordered, i.e., 𝐾 is the first attribute, 𝐴1 the second attribute, 𝐴𝑝+1 the (𝑝 + 2)-th attribute, and so on. For each 𝑖, 1 ≤ 𝑖 ≤ 𝑝 + 𝑞, let 𝒜𝑖 be the finite domain of the attribute 𝐴𝑖 and |𝒜𝑖| = 𝑛𝑖. We assume that the values of every domain 𝒜𝑖 (called items) are given in input – they are SV items or MV items depending on whether the attribute 𝐴𝑖 is SV or MV. On the other hand, the domain 𝒦 of the key 𝐾 is countably infinite and, therefore, its values are not listed. Let ṅ = ∑_{𝑖=1}^{𝑝} 𝑛𝑖, n̈ = ∑_{𝑖=𝑝+1}^{𝑝+𝑞} 𝑛𝑖 and 𝑛 = ṅ + n̈. A No-SQL tuple on 𝑅 is of the form 𝑡 = [𝑘, 𝑎1, . . . , 𝑎𝑝, 𝑔1, . . . , 𝑔𝑞], where 𝑘 ∈ 𝒦; for each 𝑖, 1 ≤ 𝑖 ≤ 𝑝, 𝑎𝑖 ∈ 𝒜𝑖 (𝑎𝑖 is an item of 𝑡); and for each 𝑖, 1 ≤ 𝑖 ≤ 𝑞, 𝑔𝑖 ⊆ 𝒜𝑝+𝑖 (𝑔𝑖 is an itemset of 𝑡). Items and itemsets are called values of 𝑡. A No-SQL table on 𝑅 is a finite set of No-SQL tuples. A many-sorted transaction 𝑇 is a (𝑝 + 𝑞)-tuple [𝑎1, . . . , 𝑎𝑝, 𝑔1, . . . , 𝑔𝑞], i.e., a No-SQL tuple whose key attribute value has been dropped; it follows that a many-sorted transaction can be transformed into a tuple simply by inventing a key for it. Given any attribute 𝐴𝑖 in 𝑅 and a many-sorted transaction 𝑇, 𝑇.𝐴𝑖 denotes the value of 𝑇 for the attribute 𝐴𝑖 (i.e., 𝑎𝑖 ∈ 𝒜𝑖 if 𝑖 ≤ 𝑝, or 𝑔𝑖−𝑝 ⊆ 𝒜𝑖 if 𝑖 > 𝑝). Let 𝒯 be the set of all many-sorted transactions. The cardinality of 𝒯 is 𝑛𝒯 = 2^n̈ · ∏_{𝑖=1}^{𝑝} 𝑛𝑖. A many-sorted dataset 𝒟 is a set of pairs (𝑇, 𝛿^𝒟(𝑇)), where 𝑇 is a many-sorted transaction and 𝛿^𝒟(𝑇) is the number of occurrences of 𝑇. 𝒟 can be transformed into a No-SQL table 𝑇𝒟 by making 𝛿^𝒟(𝑇) tuple duplicates of each many-sorted transaction 𝑇. The cardinality of 𝒟, denoted |𝒟|, is the number of pairs in 𝒟, and the size of 𝒟 is 𝛿^𝒟 = |𝑇𝒟| = ∑_{𝑇∈𝒟} 𝛿^𝒟(𝑇). We stress that in general 𝛿^𝒟 ≫ |𝒟|. From now on, we shall omit the term many-sorted whenever it is clear from the context. A sub-transaction 𝑆 is a (𝑝 + 𝑞)-tuple [𝑎1, . . . , 𝑎𝑝, 𝑔1, . . . , 𝑔𝑞] on 𝑅 for which the domain of each SV attribute is extended with ⊥, which stands for a null value. Let ⊥(𝑆) denote the number of null values in 𝑆 – a transaction 𝑇 can also be seen as a sub-transaction for which ⊥(𝑇) = 0. The length of 𝑆, denoted 𝑙(𝑆), is the number of values different from ⊥ and ∅. A sub-transaction 𝑆 subsumes a transaction 𝑇 (written 𝑆 ⊑ 𝑇) if for each SV attribute 𝐴𝑖, either 𝑆.𝐴𝑖 = ⊥ or 𝑆.𝐴𝑖 = 𝑇.𝐴𝑖, and for each MV attribute 𝐴𝑖, 𝑆.𝐴𝑖 ⊆ 𝑇.𝐴𝑖. Observe that if 𝑆 happens to be a transaction, then every transaction 𝑇 for which 𝑆 ⊑ 𝑇 has the same SV items as 𝑆, whereas its MV itemsets are supersets of the corresponding itemsets in 𝑆. In the classical IFM setting, every sub-transaction 𝑆 coincides with an itemset and the transactions 𝑇 it subsumes are all the itemsets for which 𝑆 ⊆ 𝑇. This analogy explains why we write 𝑆 ⊑ 𝑇 for transaction subsumption. Given a sub-transaction 𝑆 for which 𝑙(𝑆) > 0 and two integers 𝜎1 and 𝜎2 with 0 ≤ 𝜎1 ≤ 𝜎2, 𝛾 = ⟨𝑆, 𝜎1, 𝜎2⟩ represents a frequency support constraint, defined as follows: a database 𝒟 satisfies 𝛾 (written 𝒟 |= 𝛾) if 𝜎1 ≤ ∑_{𝑇∈𝒟 ∧ 𝑆⊑𝑇} 𝛿^𝒟(𝑇) ≤ 𝜎2. An infrequency support constraint 𝛾 is a pair ⟨𝑆, 𝜎⟩ and is actually a shorthand for the frequency support constraint ⟨𝑆, 0, 𝜎⟩; the upper bound 𝜎 will be simply referred to as 𝛾#. A (frequency or infrequency) support constraint 𝛾 whose sub-transaction 𝑆 has 𝑙(𝑆) = 1 is called a domain support constraint. Given a set Π of support constraints, a database 𝒟 satisfies Π (written 𝒟 |= Π) if 𝒟 |= 𝛾 for each 𝛾 ∈ Π.
Example. Individuals are characterized by the SV attributes Gender, Location and Age and by the MV attributes Groups and Events: an individual may belong to various groups and may attend a number of events. A transaction 𝐼 = [𝑀𝑎𝑙𝑒, 𝑅𝑜𝑚𝑒, 25, {𝑔1, 𝑔4}, {𝑒1, 𝑒3}] represents a 25-year-old male individual located in Rome who belongs to the groups 𝑔1 and 𝑔4 and attends the events 𝑒1 and 𝑒3.
The transaction 𝐽 = [𝐹𝑒𝑚𝑎𝑙𝑒, 𝑅𝑜𝑚𝑒, 20, {𝑔1, 𝑔2}, ∅] represents an individual who does not attend any event. Note that, as the attributes do not define a key, there may exist several occurrences of the same individual. Examples of frequency constraints are:
– ⟨[𝑀𝑎𝑙𝑒, 𝑅𝑜𝑚𝑒, ⊥, {𝑔1, 𝑔2}, ∅], 10000, 20000⟩ fixes the range for the overall number of duplicates of male individuals who are located in Rome and belong at least to the groups 𝑔1 and 𝑔2;
– ⟨[𝐹𝑒𝑚𝑎𝑙𝑒, ⊥, 25, {𝑔1, 𝑔2}, {𝑒1, 𝑒3}], 500, 1000⟩ fixes the range for the overall number of duplicates of 25-year-old female individuals who belong at least to the groups 𝑔1 and 𝑔2 and attend at least the events 𝑒1 and 𝑒3.
Examples of infrequency constraints are:
– ⟨[⊥, 𝐶𝑜𝑠𝑒𝑛𝑧𝑎, ⊥, {𝑔1, 𝑔2}, ∅], 100⟩ states that the number of individuals in a feasible dataset who are located in Cosenza and belong at least to the groups 𝑔1 and 𝑔2 is at most 100;
– ⟨[⊥, ⊥, ⊥, ∅, {𝑒1, 𝑒2}], 10000⟩ states that the number of individuals attending at least the events 𝑒1 and 𝑒2 is at most 10000;
– ⟨[⊥, ⊥, ⊥, {𝑔1}, ∅], 100000⟩ is a domain support constraint stating that the number of individuals belonging at least to the group 𝑔1 is at most 100000. □
From now on, we assume that the following sets of constraints are given: (1) a set Σ of frequency support constraints with cardinality 𝑚 = |Σ| > 0, and (2) a set Σ̂ of infrequency support constraints with cardinality 𝑚̂ = |Σ̂| ≥ 0. Let Σ̆ = Σ ∪ Σ̂ and 𝑚̆ = |Σ̆| = 𝑚 + 𝑚̂. Observe that, as the number of infrequency constraints could be very large, they are not induced as in the IFM_I case [11], but are explicitly defined. The drawback of this formulation is that obtaining undesired frequent itemsets is no longer excluded; an accurate choice of the explicit infrequent itemsets, however, may reduce this risk.
Definition.
Given 𝑅, Σ̆ = Σ ∪ Σ̂, and an integer size > 0, the multi-sorted inverse frequent itemset mining problem, shortly denoted 𝑚𝑠-IFM, consists of finding a many-sorted dataset 𝒟 on 𝑅 such that both 𝛿^𝒟 = size and 𝒟 |= Σ̆ (or stating that no such dataset exists). If the integrality constraint on the number of duplicates of a transaction of a database 𝒟 is relaxed (i.e., it may be a rational number), the problem is called relaxed 𝑚𝑠-IFM. □
By assuming that items, constraint bounds and size are stored using a constant amount of space, the input size of the problem is 𝒪(𝑛 + 𝑚̆(𝑝 + 𝑞·n̈ + 1) + 𝑚). It is easy to see that 𝑚𝑠-IFM reduces to the IFM_I problem if 𝑝 = 0 and 𝑞 = 1, i.e., if there exists exactly one non-key attribute in 𝑅 and this attribute is of MV type, say 𝐺; a transaction is then any itemset on the domain 𝒢 of 𝐺. The next result characterizes the complexity of 𝑚𝑠-IFM and of relaxed 𝑚𝑠-IFM.
Proposition. The decision version of 𝑚𝑠-IFM is in PSPACE and NP-hard, and the decision version of relaxed 𝑚𝑠-IFM is NP-complete. □
Note that the higher complexity of the IFM_I problem, which has been proved to be NEXP-complete in [11], derives from the task of discovering "minimal" itemsets that must be enforced to be infrequent: in IFM_I minimal infrequent itemsets are computed by the resolution algorithm, whereas they are assumed to be part of the input for 𝑚𝑠-IFM. To solve the relaxed 𝑚𝑠-IFM, in the next section we formulate 𝑚𝑠-IFM as a linear program and present an extension of the classical simplex algorithm for its resolution.
2.1. Linear Program Formulation
We select an ordering of all many-sorted transactions in 𝒯, say {𝑇1, . . . , 𝑇𝑛𝒯}. We use the vector 𝑣 = [1, . . . , 𝑛𝒯] to list all possible non-empty many-sorted transaction indices. Let the vector 𝑠 = [𝑖1, . . . , 𝑖𝑚, 𝑗1, . . .
, 𝑗𝑚̂] represent the indices of the frequency support constraints in Σ and of the infrequency support constraints in Σ̂. Let 𝑙 and 𝑢 be two vectors of 𝑚 integers such that, for each 𝑗, 1 ≤ 𝑗 ≤ 𝑚, 𝑙𝑗 = 𝜎_min^𝐽 and 𝑢𝑗 = 𝜎_max^𝐽, where 𝐽 = 𝛾_{𝑠𝑗} is the 𝑗-th frequency support constraint in Σ according to the ordering fixed by 𝑠. Let 𝑢′ be the vector of 𝑚̂ integers such that, for each 𝑗, 𝑚 + 1 ≤ 𝑗 ≤ 𝑚 + 𝑚̂, 𝑢′𝑗 = 𝜎_max^𝐽, where 𝐽 = 𝛾_{𝑠𝑗} is the (𝑗 − 𝑚)-th infrequency support constraint in Σ̂ according to the ordering fixed by 𝑠. Let 𝑥 be a vector of 𝑛𝒯 non-negative rational variables whose 𝑗-th coordinate, 𝑥𝑗, denotes the number of duplicates of the many-sorted transaction 𝑇𝑗. We consider an (𝑚 + 𝑚̂) × 𝑛𝒯 matrix 𝐴 where, for each 𝑖, 1 ≤ 𝑖 ≤ 𝑚 + 𝑚̂, and for each 𝑗 ∈ 𝑣, 𝑎𝑖𝑗 = 1 if 𝑆 ⊑ 𝑇𝑗, where 𝛾_{𝑠𝑖} = ⟨𝑆, 𝜎1, 𝜎2⟩, and 𝑎𝑖𝑗 = 0 otherwise. As usual, we introduce a vector 𝑤 of 2𝑚 + 1 non-negative rational artificial variables, whose values represent the costs of violating some support constraints. The relaxed linear formulation of 𝑚𝑠-IFM is the following:

LP: minimize ∑_{i=1}^{2m+1} w_i
subject to
    w_i + ∑_{j∈v} a_{ij} x_j ≥ l_i,          1 ≤ i ≤ m
    w_{m+i} − ∑_{j∈v} a_{ij} x_j ≥ −u_i,     1 ≤ i ≤ m
    −∑_{j∈v} a_{ij} x_j ≥ −u′_i,             m + 1 ≤ i ≤ m + m̂
    w_{2m+1} + ∑_{j∈v} x_j ≥ size
    −∑_{j∈v} x_j ≥ −size

For solving the linear program we use the column generation approach (see, e.g., [13]), an extension of the simplex method for dealing with linear programs with a large number of variables. The column generation simplex solves a linear program without explicitly including all columns (i.e., variables) of the coefficient matrix, but only a subset of them with cardinality equal to the number of rows (i.e., constraints). Columns are dynamically generated by solving an auxiliary optimization problem called the pricing problem. The linear program to be solved is denoted as the master problem (MP). In our case the MP consists of 𝑟 = 2𝑚 + 𝑚̂ + 2 rows and 𝑐 = 𝑛𝒯 + 2𝑚 + 1 columns.
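Operationally, each entry of the coefficient matrix 𝐴 is a subsumption test 𝑆 ⊑ 𝑇𝑗. The following minimal sketch illustrates this; the concrete tuple representation (𝑝 single-valued items followed by item sets for the multi-valued attributes, with None playing the role of ⊥) is our own illustrative choice, not fixed by the paper:

```python
def subsumes(S, T, p):
    """Return True if sub-transaction S subsumes transaction T (S ⊑ T)."""
    # SV attributes: S's value is either null (None) or equal to T's value
    for i in range(p):
        if S[i] is not None and S[i] != T[i]:
            return False
    # MV attributes: S's itemset must be a subset of T's itemset
    for i in range(p, len(S)):
        if not set(S[i]) <= set(T[i]):
            return False
    return True

def constraint_row(S, transactions, p):
    """Row of the coefficient matrix A for constraint sub-transaction S:
    a_ij = 1 iff S subsumes the j-th transaction."""
    return [1 if subsumes(S, T, p) else 0 for T in transactions]
```

With the example of Section 2, the constraint sub-transaction [⊥, Rome, ⊥, {𝑔1, 𝑔2}, ∅] produces a 1 exactly for the transactions of Rome-located individuals belonging at least to the groups 𝑔1 and 𝑔2.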
The linear program with only a subset of the MP columns, with cardinality 𝑐′ equal to the number 𝑟 of rows, is called the restricted master problem (RMP): the RMP is the master problem restricted to a subset of its columns. From the theory of the simplex algorithm we know that a basic solution of a linear program is an assignment to the variables that satisfies the constraints and in which only 𝑟 variables (the basic variables, one per row) may take a value greater than or equal to zero, while the remaining ones are zero. Starting from an initial basic solution, the simplex algorithm iteratively finds an optimal basic solution. Since each variable is associated with a column, if the RMP contains the columns associated with the variables of an optimal basic solution of the MP, then the RMP returns the same solution. The column generation approach is thus an iterative algorithm that works on the restricted master problem, keeping the number of columns bounded, and iterates until the RMP contains the columns associated with an optimal basic solution of the MP. The column generation method looks for an optimal basis as within the simplex algorithm: it starts from an initial basis and moves from the current basis to a new one by adding a new basic column with a negative reduced cost (iteration step). Primal feasibility is maintained and the objective function is non-increasing during this search. The reduced cost of a column can be computed using the current dual variables. The task of providing a column with a negative reduced cost, or certifying that no such column exists, is delegated to the pricing problem; if there is no column with a negative reduced cost, the algorithm terminates and the current basis is optimal. The crucial task is the formulation and resolution of the pricing problem, which is carried out by means of an integer linear program.
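To make the iteration step concrete, here is a small hedged sketch of the reduced cost test that drives column generation. In the LP above every transaction column 𝑥𝑗 has objective coefficient 0 (only the artificial variables 𝑤 appear in the objective), so its reduced cost is simply −𝜋ᵀ𝐴𝑗 for the current duals 𝜋. The explicit scan over candidate columns below stands in for the actual pricing step, which the paper solves as an integer linear program rather than by enumeration:

```python
def reduced_cost(c_j, column, duals):
    """Reduced cost of a candidate column: c_j - pi^T A_j.
    The column may enter the basis only if this is negative."""
    return c_j - sum(pi * a for pi, a in zip(duals, column))

def pricing_step(candidates, duals):
    """Toy pricing: scan explicit candidate columns (objective coefficient 0,
    as for the x_j variables of the LP) and return one with the most negative
    reduced cost, or None to certify that the current basis is optimal."""
    best = min(candidates, key=lambda col: reduced_cost(0.0, col, duals))
    return best if reduced_cost(0.0, best, duals) < 0 else None
```

In practice the candidate set 𝒯 is exponentially large, which is precisely why the pricing problem must be solved implicitly instead of by an explicit scan.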
Due to page restrictions, we defer this issue to the full paper [12].
2.2. Empirical evaluation of Many-Sorted IFM
To evaluate the accuracy of the solutions computed by our approximate 𝑚𝑠-IFM approach, we show that synthetic data can successfully replace original data in an analytics task, in particular classification, i.e., the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of observations whose category membership is known [14]. To this end, we set up a classification problem by representing a transactional dataset as a standard dataset for classification, where each item corresponds to a binary attribute (i.e., 1 if the item is present and 0 otherwise). To create several different classification tasks for each dataset used in the experiments, we selected the top-𝑘 most frequent items. For each such item, we consider it as the dependent feature (i.e., the binary attribute that we want to classify) and the remaining top-𝑘 items as the independent features (i.e., the input of the classification model). For each synthetic dataset generated by the 𝑚𝑠-IFM techniques, we train a classification model for each of the top-𝑘 most frequent items and we test the trained model on the original dataset. In the experiments, we use 3 different classification models [14]: Logistic Regression, Decision Tree, and Random Forest. All models are trained on these datasets, and specific procedures are executed to balance the training sets. As a baseline, for each model and for each dataset we also compute the accuracy of the classification model trained and tested on the original dataset (denoted as original accuracy), which can be assumed to be the best classification accuracy value that any synthetic dataset generation approach can achieve.
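The feature construction described above can be sketched as follows; this is a simplified, single-sorted illustration (transactions are item sets with duplicate counts, and the helper names are ours, not from the paper):

```python
from collections import Counter

def topk_items(dataset, k):
    """Top-k most frequent items; dataset is a list of (transaction, count)
    pairs, each transaction being an iterable of items."""
    freq = Counter()
    for items, count in dataset:
        for it in set(items):
            freq[it] += count
    return [it for it, _ in freq.most_common(k)]

def to_binary_rows(dataset, target, features):
    """One (X, y) row per transaction occurrence: X has a 0/1 column per
    independent feature item, y is 1 iff the dependent (target) item occurs."""
    X, y = [], []
    for items, count in dataset:
        s = set(items)
        row = [1 if f in s else 0 for f in features]
        X.extend([row] * count)
        y.extend([1 if target in s else 0] * count)
    return X, y
```

The resulting (X, y) pairs can then be fed to any off-the-shelf classifier, training on the rows built from the synthetic dataset and testing on the rows built from the original one.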
To create instances for our 𝑚𝑠-IFM, we used the Yelp dataset [15] to extract a No-SQL table. In particular, each transaction of the No-SQL table represents a business in Yelp (e.g., restaurant, hospital, etc.). In total there are 101,824 transactions. The table has 4 attributes, among which: STARS (the average number of stars of each business – a SV attribute with 9 values), STATE (the state where the business is located – a SV attribute with 50 values), and CATEGORIES (the categories to which a particular business belongs – a MV attribute with 397 distinct values, e.g., garage, restaurant, space for kids, etc.). To create the frequency and infrequency support many-sorted constraints, we converted this No-SQL table into a transactional dataset.

Table 1: Accuracy for the top-30 frequent items with s(%) ∈ {1.2, 1.6}

classifier           | Original | s(%) = 1.2 | s(%) = 1.6
Logistic Regression  | 0.99     | 0.93       | 0.93
Decision Tree        | 0.99     | 0.85       | 0.84
Random Forest        | 0.99     | 0.92       | 0.92

In Table 1, we report the average classification results obtained for the top-30 frequent items on the two different synthetic datasets obtained by means of the two selected supports. From Table 1 it is possible to see that the generated synthetic datasets are pretty close to the original dataset in terms of accuracy. In addition, we did not observe a strong variation when the support changes. Further accuracy evaluations are reported in [12].

3. Conclusion
The Inverse Frequent Itemset Mining (IFM) problem consists of generating a transactional database satisfying given support constraints on some itemsets, which are typically the frequent ones. In order to enlarge the application domain, an extension of IFM, called multi-sorted IFM (𝑚𝑠-IFM), has been introduced in [12] that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.
This paper has given an overview of the 𝑚𝑠-IFM problem definition, its linear program formulation, its resolution using a column generation algorithm (an extension of the simplex method for dealing with linear programs with a large number of variables), and an experimental evaluation of the capability of the synthetic datasets to support the data mining task of classification with an accuracy close to the one achieved with the original datasets.

References
[1] K. Michael, K. W. Miller, Big Data: New Opportunities and New Challenges [Guest Editors' Introduction], Computer 46 (2013) 22–24. doi:10.1109/MC.2013.196.
[2] H. Wu, Y. Ning, P. Chakraborty, J. Vreeken, N. Tatti, N. Ramakrishnan, Generating Realistic Synthetic Population Datasets, ACM Trans. Knowl. Discov. Data 12 (2018) 45:1–45:22. doi:10.1145/3182383.
[3] R. Agrawal, T. Imieliński, A. Swami, Mining Association Rules Between Sets of Items in Large Databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD '93, ACM, New York, NY, USA, 1993, pp. 207–216.
[4] D. Gunopulos, R. Khardon, H. Mannila, H. Toivonen, Data Mining, Hypergraph Transversals, and Machine Learning, in: A. O. Mendelzon, Z. M. Özsoyoglu (Eds.), Proceedings of the 16th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '97, ACM Press, 1997, pp. 209–216.
[5] T. Mielikainen, On Inverse Frequent Set Mining, in: Proceedings of the 2nd Workshop on Privacy Preserving Data Mining, PPDM '03, IEEE Computer Society, Washington, DC, USA, 2003, pp. 18–23.
[6] R. Agrawal, R. Srikant, Privacy-preserving Data Mining, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, ACM, New York, NY, USA, 2000, pp. 439–450. doi:10.1145/342009.335438.
[7] X. Wu, Y. Wu, Y. Wang, Y.
Li, Privacy Aware Market Basket Data Set Generation: A Feasible Approach for Inverse Frequent Set Mining, in: Proceedings of the SIAM International Conference on Data Mining, SDM '05, SIAM, Philadelphia, PA, USA, 2005, pp. 103–114.
[8] T. Calders, Computational Complexity of Itemset Frequency Satisfiability, in: Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '04, ACM, New York, NY, USA, 2004, pp. 143–154. doi:10.1145/1055558.1055580.
[9] T. Calders, The Complexity of Satisfying Constraints on Databases of Transactions, Acta Informatica 44 (2007) 591–624.
[10] A. Guzzo, D. Saccà, E. Serra, An Effective Approach to Inverse Frequent Set Mining, in: Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM '09, IEEE Computer Society, Washington, DC, USA, 2009, pp. 806–811. doi:10.1109/ICDM.2009.123.
[11] A. Guzzo, L. Moccia, D. Saccà, E. Serra, Solving Inverse Frequent Itemset Mining with Infrequency Constraints via Large-scale Linear Programs, ACM Trans. Knowl. Discov. Data 7 (2013) 18:1–18:39. doi:10.1145/2541268.2541271.
[12] D. Saccà, E. Serra, A. Rullo, Extending Inverse Frequent Itemsets Mining to Generate Realistic Datasets: Complexity, Accuracy and Emerging Applications, Data Min. Knowl. Discov. 33 (2019) 1736–1774. doi:10.1007/s10618-019-00643-1.
[13] P. C. Gilmore, R. E. Gomory, A Linear Programming Approach to the Cutting-Stock Problem, Operations Research 9 (1961) 849–859.
[14] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, USA, 2005.
[15] Yelp Dataset Challenge, 2017. URL: https://www.yelp.com/dataset.