Multi-Sorted Inverse Frequent Itemsets Mining for Generating Realistic No-SQL Datasets (Discussion Paper)

Domenico Saccà¹, Edoardo Serra² and Antonino Rullo¹
¹ DIMES Department, University of Calabria, 87036 Rende, Italy
² Department of Computer Science, Boise State University, Boise, ID 83725, USA

Abstract
The development of novel platforms and techniques for emerging "Big Data" applications requires the availability of real-life datasets for data-driven experiments, which are however not accessible in most cases for various reasons, e.g., confidentiality, privacy or simply insufficient availability. An interesting solution for ensuring high-quality experimental findings is to synthesize datasets that reflect the patterns of real ones. A promising approach is based on inverse mining techniques such as inverse frequent itemset mining (IFM), which consists of generating a transactional dataset satisfying given support constraints on the itemsets of an input set, typically the frequent and infrequent ones. This paper describes an extension of IFM that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.

Keywords
IFM, No-SQL, Itemset Mining

1. Introduction
Emerging "Big Data" platforms and applications call for novel data analysis techniques that are capable of effectively and efficiently handling large amounts of data [1]. There is therefore an increasing need to use real-life datasets for data-driven experiments, which are often not available for two main reasons: (𝑖) ownership (they are often proprietary and out of reach for academic research), and (𝑖𝑖) privacy (they may include sensitive data, as happens in many research areas, e.g., epidemiology, public health, social science, and the study of the behavior of large populations of individuals under natural scenarios as well as under human interventions).
The two above limitations can be overcome by designing "realistic" synthetic datasets that capture key attributes and activities of real ones without violating confidentiality [2]. In this paper we use an inverse data mining approach to generate an artificial No-SQL transactional dataset that reflects the patterns of a real one: the patterns can be thought of as a compressed representation of the original dataset; they are first discovered by data mining techniques and then used to generate a "realistic" pattern-preserving dataset.

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
domenico.sacca@unical.it (D. Saccà); edoardoserra@boisestate.edu (E. Serra); n.rullo@dimes.unical.it (A. Rullo)
0000-0003-3584-5372 (D. Saccà); 0000-0003-0689-5063 (E. Serra); 0000-0002-6030-0027 (A. Rullo)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

A well-known inverse approach refers to the classical data mining problem of extracting the set of frequent/infrequent itemsets from a transaction database. The related literature is rather rich and covers almost three decades: after the seminal papers in the nineties [3, 4], additional aspects have been studied in the last two decades. The perspective of the frequent itemset mining problem can be naturally inverted as follows: we are given in advance a set of itemsets together with their frequencies as constraints, and the goal is to compute a transactional database satisfying those constraints. This problem, called the inverse frequent itemset mining problem (IFM), was introduced in the context of defining generators for benchmarks of mining algorithms [5], and has subsequently been reconsidered in privacy-preserving contexts [6, 7].
A reformulation of IFM in terms of frequencies instead of supports has been introduced in [8, 9] under the name FREQSAT, and many variants of it have been proposed as well. The original IFM formulation does not introduce any constraint on infrequency. A simple solution to exclude unexpected frequent itemsets from a feasible solution is the formulation proposed in [10], called IFM_S: only itemsets listed in the set 𝑆 of constraints can be included as transactions in 𝒟. A more general version of IFM with infrequency support constraints (IFM_I for short) has been proposed in [11], where every infrequent itemset may occur a small number of times. In order to enlarge the domain of IFM to No-SQL applications, an extension of IFM has been proposed in [12] that considers more structured schemes for the datasets to be generated, as required by emerging big data applications, e.g., social network analytics. This many-sorted extension of IFM, called 𝑚𝑠-IFM, is defined by replacing the basic simple schema 𝑅(𝐾, 𝐴) with a more general No-SQL schema 𝑅(𝐾, 𝐴1, . . . , 𝐴𝑝, 𝐴𝑝+1, . . . , 𝐴𝑝+𝑞), where 𝐾 is the table key, 𝐴1, . . . , 𝐴𝑝 are single-valued attributes, and 𝐴𝑝+1, . . . , 𝐴𝑝+𝑞 are multi-valued attributes. The problem reduces to classical IFM when 𝑝 = 0 and 𝑞 = 1, i.e., when there is exactly one multi-valued attribute. In this paper we recall the formal definition of 𝑚𝑠-IFM from [12] and illustrate a method for solving 𝑚𝑠-IFM that has been shown to be effective by means of a number of experiments. The solver is based on large-scale linear programming and generalizes the approach used in [11] to solve IFM_I.

2. Many-Sorted IFM
Let a No-SQL relation 𝑅(𝐾, 𝐴1, . . . , 𝐴𝑝, 𝐴𝑝+1, . . . , 𝐴𝑝+𝑞) be given, where 𝐾 is the table key, 𝐴1, . . . , 𝐴𝑝 are classical single-valued (SV) attributes and 𝐴𝑝+1, . . . , 𝐴𝑝+𝑞 are multi-valued (MV) attributes.
We assume that the attributes are ordered, i.e., 𝐾 is the first attribute, 𝐴1 the second attribute, 𝐴𝑝+1 the (𝑝 + 2)-th attribute, and so on. For each 𝑖, 1 ≤ 𝑖 ≤ 𝑝 + 𝑞, let 𝒜𝑖 be the finite domain of the attribute 𝐴𝑖 and |𝒜𝑖| = 𝑛𝑖. We assume that the values of every domain 𝒜𝑖 (called items) are given in input – they are SV items or MV items depending on whether the attribute 𝐴𝑖 is SV or MV. On the other hand, the domain 𝒦 of the key 𝐾 is countably infinite and, therefore, its values are not listed. Let ṅ = ∑_{𝑖=1}^{𝑝} 𝑛𝑖, n̈ = ∑_{𝑖=𝑝+1}^{𝑝+𝑞} 𝑛𝑖 and 𝑛 = ṅ + n̈. A No-SQL tuple on 𝑅 is of the form 𝑡 = [𝑘, 𝑎1, . . . , 𝑎𝑝, 𝑔1, . . . , 𝑔𝑞], where 𝑘 ∈ 𝒦; for each 𝑖, 1 ≤ 𝑖 ≤ 𝑝, 𝑎𝑖 ∈ 𝒜𝑖 (𝑎𝑖 is an item of 𝑡); and for each 𝑖, 1 ≤ 𝑖 ≤ 𝑞, 𝑔𝑖 ⊆ 𝒜𝑝+𝑖 (𝑔𝑖 is an itemset of 𝑡). Items and itemsets are called values of 𝑡. A No-SQL table on 𝑅 is a finite set of No-SQL tuples. A many-sorted transaction 𝑇 is a (𝑝 + 𝑞)-tuple [𝑎1, . . . , 𝑎𝑝, 𝑔1, . . . , 𝑔𝑞], i.e., a No-SQL tuple whose key attribute value has been dropped; it follows that a many-sorted transaction can be transformed into a tuple simply by inventing a key for it. Given any attribute 𝐴𝑖 in 𝑅 and a many-sorted transaction 𝑇, 𝑇.𝐴𝑖 denotes the value of 𝑇 for the attribute 𝐴𝑖 (i.e., 𝑎𝑖 ∈ 𝒜𝑖 if 𝑖 ≤ 𝑝, or 𝑔𝑖−𝑝 ⊆ 𝒜𝑖 if 𝑖 > 𝑝). Let 𝒯 be the set of all many-sorted transactions. The cardinality of 𝒯 is 𝑛𝒯 = 2^n̈ · ∏_{𝑖=1}^{𝑝} 𝑛𝑖. A many-sorted dataset 𝒟 is a set of pairs (𝑇, 𝛿^𝒟(𝑇)), where 𝑇 is a many-sorted transaction and 𝛿^𝒟(𝑇) is the number of occurrences of 𝑇. 𝒟 can be transformed into a No-SQL table 𝑇𝒟 by making 𝛿^𝒟(𝑇) tuple duplicates of each many-sorted transaction 𝑇. The cardinality of 𝒟, denoted |𝒟|, is the number of pairs in 𝒟, and the size of 𝒟 is 𝛿^𝒟 = |𝑇𝒟| = ∑_{𝑇∈𝒟} 𝛿^𝒟(𝑇). We stress that in general 𝛿^𝒟 ≫ |𝒟|. From now on, we shall omit the term many-sorted whenever it is clear from the context. A sub-transaction 𝑆 is a (𝑝 + 𝑞)-tuple [𝑎1, . . . , 𝑎𝑝, 𝑔1, . . . , 𝑔𝑞] on 𝑅 for which the domain of each SV attribute is extended with ⊥, which stands for a null value. Let ⊥(𝑆) denote the number of null values in 𝑆 – a transaction 𝑇 can also be seen as a sub-transaction for which ⊥(𝑇) = 0. The length of 𝑆, denoted 𝑙(𝑆), is the number of values different from ⊥ and ∅. A sub-transaction 𝑆 subsumes a transaction 𝑇 (written 𝑆 ⊑ 𝑇) if for each SV attribute 𝐴𝑖, either 𝑆.𝐴𝑖 = ⊥ or 𝑆.𝐴𝑖 = 𝑇.𝐴𝑖, and for each MV attribute 𝐴𝑖, 𝑆.𝐴𝑖 ⊆ 𝑇.𝐴𝑖. Observe that if 𝑆 happens to be a transaction, then every transaction 𝑇 for which 𝑆 ⊑ 𝑇 has the same SV items as 𝑆, whereas its MV itemsets are supersets of the corresponding itemsets in 𝑆. In the classical IFM setting, every sub-transaction 𝑆 coincides with an itemset and the transactions 𝑇 it subsumes are all the itemsets for which 𝑆 ⊆ 𝑇. This analogy explains why we write 𝑆 ⊑ 𝑇 for transaction subsumption. Given a sub-transaction 𝑆 for which 𝑙(𝑆) > 0 and two integers 𝜎1 and 𝜎2 with 0 ≤ 𝜎1 ≤ 𝜎2, 𝛾 = ⟨𝑆, 𝜎1, 𝜎2⟩ represents a frequency support constraint, defined as follows: a database 𝒟 satisfies 𝛾 (written 𝒟 |= 𝛾) if 𝜎1 ≤ ∑_{𝑇∈𝒟 ∧ 𝑆⊑𝑇} 𝛿^𝒟(𝑇) ≤ 𝜎2. An infrequency support constraint 𝛾 is a pair ⟨𝑆, 𝜎⟩ and is actually a shorthand for the frequency support constraint ⟨𝑆, 0, 𝜎⟩; the upper bound 𝜎 will be simply referred to as 𝛾#. A (frequency or infrequency) support constraint 𝛾 whose sub-transaction 𝑆 has 𝑙(𝑆) = 1 is called a domain support constraint. Given a set Π of support constraints, a database 𝒟 satisfies Π (written 𝒟 |= Π) if 𝒟 |= 𝛾 for each 𝛾 ∈ Π.
Example. Individuals are characterized by the SV attributes Gender, Location and Age and by the MV attributes Groups and Events: an individual may belong to various groups and may attend a number of events. A transaction 𝐼 = [𝑀𝑎𝑙𝑒, 𝑅𝑜𝑚𝑒, 25, {𝑔1, 𝑔4}, {𝑒1, 𝑒3}] represents a 25-year-old male individual located in Rome who belongs to the groups 𝑔1 and 𝑔4 and attends the events 𝑒1 and 𝑒3.
The transaction 𝐽 = [𝐹𝑒𝑚𝑎𝑙𝑒, 𝑅𝑜𝑚𝑒, 20, {𝑔1, 𝑔2}, ∅] represents an individual who does not attend any event. Note that, as the attributes do not define a key, there may exist several occurrences of the same individual. Examples of frequency constraints are:
– ⟨[𝑀𝑎𝑙𝑒, 𝑅𝑜𝑚𝑒, ⊥, {𝑔1, 𝑔2}, ∅], 10000, 20000⟩ fixes the range for the overall number of duplicates of male individuals who are located in Rome and belong at least to the groups 𝑔1 and 𝑔2;
– ⟨[𝐹𝑒𝑚𝑎𝑙𝑒, ⊥, 25, {𝑔1, 𝑔2}, {𝑒1, 𝑒3}], 500, 1000⟩ fixes the range for the overall number of duplicates of 25-year-old female individuals who belong at least to the groups 𝑔1 and 𝑔2 and attend at least the events 𝑒1 and 𝑒3.
Examples of infrequency constraints are:
– ⟨[⊥, 𝐶𝑜𝑠𝑒𝑛𝑧𝑎, ⊥, {𝑔1, 𝑔2}, ∅], 100⟩ states that the number of individuals in a feasible dataset who are located in Cosenza and belong at least to the groups 𝑔1 and 𝑔2 is at most 100;
– ⟨[⊥, ⊥, ⊥, ∅, {𝑒1, 𝑒2}], 10000⟩ states that the number of individuals attending at least the events 𝑒1 and 𝑒2 is at most 10000;
– ⟨[⊥, ⊥, ⊥, {𝑔1}, ∅], 100000⟩ is a domain support constraint stating that the number of individuals belonging at least to the group 𝑔1 is at most 100000. □
From now on, we assume that the following sets of constraints are given: (1) a set Σ of frequency support constraints with cardinality 𝑚 = |Σ| > 0, and (2) a set Σ̂ of infrequency support constraints with cardinality 𝑚̂ = |Σ̂| ≥ 0. Let Σ̆ = Σ ∪ Σ̂ and 𝑚̆ = |Σ̆| = 𝑚 + 𝑚̂. Observe that, as the number of infrequency constraints could be very large, they are not induced as in the IFM_I case [11], but are explicitly defined. The drawback of this formulation is that obtaining undesired frequent itemsets is no longer excluded; an accurate choice of the explicit infrequent itemsets, however, may reduce this risk.
Definition.
Given 𝑅, Σ̆ = Σ ∪ Σ̂, and an integer size > 0, the multi-sorted inverse frequent itemset mining problem, shortly denoted 𝑚𝑠-IFM, consists of finding a many-sorted dataset 𝒟 on 𝑅 such that both 𝛿^𝒟 = size and 𝒟 |= Σ̆ (or stating that no such dataset exists). If the integrality constraint on the number of duplicates of a transaction of a database 𝒟 is relaxed (i.e., it may be a rational number), the problem is called relaxed 𝑚𝑠-IFM. □
By assuming that items, constraint bounds and size are stored using a constant amount of space, the input size of the problem is 𝒪(𝑛 + 𝑚̆(𝑝 + 𝑞·n̈ + 1) + 𝑚). It is easy to see that 𝑚𝑠-IFM reduces to the IFM_I problem if 𝑝 = 0 and 𝑞 = 1, i.e., if there exists exactly one non-key attribute in 𝑅 and this attribute is of MV type, say 𝐺; a transaction is then any itemset on the domain 𝒢 of 𝐺. The next result characterizes the complexity of 𝑚𝑠-IFM and of relaxed 𝑚𝑠-IFM.
Proposition. The decision version of 𝑚𝑠-IFM is in PSPACE and NP-hard, and the decision version of relaxed 𝑚𝑠-IFM is NP-complete. □
Note that the higher complexity of the IFM_I problem, which has been proved to be NEXP-complete in [11], derives from the task of discovering "minimal" itemsets that must be enforced to be infrequent: in IFM_I minimal infrequent itemsets are computed by the resolution algorithm, whereas they are assumed to be part of the input for 𝑚𝑠-IFM. To solve the relaxed 𝑚𝑠-IFM, in the next section we formulate 𝑚𝑠-IFM as a linear program and present an extension of the classical simplex algorithm for its resolution.
2.1. Linear Program Formulation
We select an ordering of all many-sorted transactions in 𝒯, say {𝑇1, . . . , 𝑇𝑛𝒯}. We use the vector 𝑣 = [1, . . . , 𝑛𝒯] to list all possible non-empty many-sorted transaction indices. Let the vector 𝑠 = [𝑖1, . . . , 𝑖𝑚, 𝑗1, . . .
, 𝑗𝑚̂] represent the indices of the frequency support constraints in Σ and of the infrequency support constraints in Σ̂. Let 𝑙 and 𝑢 be two vectors of 𝑚 integers such that, for each 𝑗, 1 ≤ 𝑗 ≤ 𝑚, 𝑙𝑗 = 𝜎_min^𝐽 and 𝑢𝑗 = 𝜎_max^𝐽, where 𝐽 = 𝛾_{𝑠𝑗} is the 𝑗-th frequency support constraint in Σ according to the ordering fixed by 𝑠. Let 𝑢′ be the vector of 𝑚̂ integers such that, for each 𝑗, 𝑚 + 1 ≤ 𝑗 ≤ 𝑚 + 𝑚̂, 𝑢′𝑗 = 𝜎_max^𝐽, where 𝐽 = 𝛾_{𝑠𝑗} is the (𝑗 − 𝑚)-th infrequency support constraint in Σ̂ according to the ordering fixed by 𝑠. Let 𝑥 be a vector of 𝑛𝒯 non-negative rational variables whose 𝑗-th coordinate, 𝑥𝑗, denotes the number of duplicates of the many-sorted transaction 𝑇𝑗. We consider an (𝑚 + 𝑚̂) × 𝑛𝒯 matrix 𝐴 where, for each 𝑖, 1 ≤ 𝑖 ≤ 𝑚 + 𝑚̂, and for each 𝑗 ∈ 𝑣, 𝑎𝑖𝑗 = 1 if 𝑆 ⊑ 𝑇𝑗, where 𝛾_{𝑠𝑖} = ⟨𝑆, 𝜎1, 𝜎2⟩, and 𝑎𝑖𝑗 = 0 otherwise. As usual, we introduce a vector 𝑤 of 2𝑚 + 1 non-negative rational artificial variables, whose values represent the costs of violating some support constraints. The relaxed linear formulation of 𝑚𝑠-IFM is the following:

LP: minimize ∑_{i=1}^{2m+1} w_i
subject to
    w_i + ∑_{j∈v} a_{ij} x_j ≥ l_i,          1 ≤ i ≤ m
    w_{m+i} − ∑_{j∈v} a_{ij} x_j ≥ −u_i,     1 ≤ i ≤ m
    −∑_{j∈v} a_{ij} x_j ≥ −u′_i,             m + 1 ≤ i ≤ m + m̂
    w_{2m+1} + ∑_{j∈v} x_j ≥ size
    −∑_{j∈v} x_j ≥ −size

For solving the linear program we use the column generation approach (see, e.g., [13]), an extension of the simplex method for dealing with linear programs with a large number of variables. The column generation simplex solves a linear program without explicitly including all columns (i.e., variables) of the coefficient matrix, but only a subset of them with cardinality equal to the number of rows (i.e., constraints). Columns are dynamically generated by solving an auxiliary optimization problem called the pricing problem. The linear program to be solved is denoted as the master problem (MP). In our case the MP consists of 𝑟 = 2𝑚 + 𝑚̂ + 2 rows and 𝑐 = 𝑛𝒯 + 2𝑚 + 1 columns.
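Operationally, each entry of the coefficient matrix 𝐴 is a subsumption test 𝑆 ⊑ 𝑇𝑗. The following minimal sketch illustrates this; the concrete tuple representation (𝑝 single-valued items followed by item sets for the multi-valued attributes, with None playing the role of ⊥) is our own illustrative choice, not fixed by the paper:

```python
def subsumes(S, T, p):
    """Return True if sub-transaction S subsumes transaction T (S ⊑ T)."""
    # SV attributes: S's value is either null (None) or equal to T's value
    for i in range(p):
        if S[i] is not None and S[i] != T[i]:
            return False
    # MV attributes: S's itemset must be a subset of T's itemset
    for i in range(p, len(S)):
        if not set(S[i]) <= set(T[i]):
            return False
    return True

def constraint_row(S, transactions, p):
    """Row of the coefficient matrix A for constraint sub-transaction S:
    a_ij = 1 iff S subsumes the j-th transaction."""
    return [1 if subsumes(S, T, p) else 0 for T in transactions]
```

With the example of Section 2, the constraint sub-transaction [⊥, Rome, ⊥, {𝑔1, 𝑔2}, ∅] produces a 1 exactly for the transactions of Rome-located individuals belonging at least to the groups 𝑔1 and 𝑔2.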
The linear program with only a subset of the MP columns, with cardinality 𝑐′ equal to the number 𝑟 of rows, is called the restricted master problem (RMP): the RMP is the master problem restricted to a subset of its columns. From the theory of the simplex algorithm we know that a basic solution of a linear program is an assignment to the variables that satisfies the constraints and in which only 𝑟 variables (the basic variables, one per row) may take a value greater than or equal to zero, while the remaining ones are zero. Starting from an initial basic solution, the simplex algorithm iteratively finds an optimal basic solution. Since each variable is associated with a column, if the RMP contains the columns associated with the variables of an optimal basic solution of the MP, then the RMP returns the same solution. The column generation approach is thus an iterative algorithm that works on the restricted master problem, keeping the number of columns bounded, and iterates until the RMP contains the columns associated with an optimal basic solution of the MP. The column generation method looks for an optimal basis as within the simplex algorithm: it starts from an initial basis and moves from the current basis to a new one by adding a new basic column with a negative reduced cost (iteration step). Primal feasibility is maintained and the objective function is non-increasing during this search. The reduced cost of a column can be computed using the current dual variables. The task of providing a column with a negative reduced cost, or certifying that no such column exists, is delegated to the pricing problem; if there is no column with a negative reduced cost, the algorithm terminates and the current basis is optimal. The crucial task is the formulation and resolution of the pricing problem, which is carried out by means of an integer linear program.
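To make the iteration step concrete, here is a small hedged sketch of the reduced cost test that drives column generation. In the LP above every transaction column 𝑥𝑗 has objective coefficient 0 (only the artificial variables 𝑤 appear in the objective), so its reduced cost is simply −𝜋ᵀ𝐴𝑗 for the current duals 𝜋. The explicit scan over candidate columns below stands in for the actual pricing step, which the paper solves as an integer linear program rather than by enumeration:

```python
def reduced_cost(c_j, column, duals):
    """Reduced cost of a candidate column: c_j - pi^T A_j.
    The column may enter the basis only if this is negative."""
    return c_j - sum(pi * a for pi, a in zip(duals, column))

def pricing_step(candidates, duals):
    """Toy pricing: scan explicit candidate columns (objective coefficient 0,
    as for the x_j variables of the LP) and return one with the most negative
    reduced cost, or None to certify that the current basis is optimal."""
    best = min(candidates, key=lambda col: reduced_cost(0.0, col, duals))
    return best if reduced_cost(0.0, best, duals) < 0 else None
```

In practice the candidate set 𝒯 is exponentially large, which is precisely why the pricing problem must be solved implicitly instead of by an explicit scan.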
Due to page restrictions, we defer this issue to the full paper [12].
2.2. Empirical evaluation of Many-Sorted IFM
To evaluate the accuracy of the solutions computed by our approximate 𝑚𝑠-IFM approach, we show that synthetic data can successfully replace original data in an analytics task, in particular classification, i.e., the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of observations whose category membership is known [14]. To this end, we set up a classification problem by representing a transactional dataset as a standard dataset for classification, where each item corresponds to a binary attribute (i.e., 1 if the item is present and 0 otherwise). To create several different classification tasks for each dataset used in the experiments, we selected the top-𝑘 most frequent items. For each such item, we consider it as the dependent feature (i.e., the binary attribute that we want to classify) and the remaining top-𝑘 items as the independent features (i.e., the input of the classification model). For each synthetic dataset generated by the 𝑚𝑠-IFM techniques, we train a classification model for each of the top-𝑘 most frequent items and we test the trained model on the original dataset. In the experiments, we use 3 different classification models [14]: Logistic Regression, Decision Tree, and Random Forest. All models are trained on these datasets, and specific procedures are executed to balance the training sets. As a baseline, for each model and for each dataset we also compute the accuracy of the classification model trained and tested on the original dataset (denoted as original accuracy), which can be assumed to be the best classification accuracy value that any synthetic dataset generation approach can achieve.
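The feature construction described above can be sketched as follows; this is a simplified, single-sorted illustration (transactions are item sets with duplicate counts, and the helper names are ours, not from the paper):

```python
from collections import Counter

def topk_items(dataset, k):
    """Top-k most frequent items; dataset is a list of (transaction, count)
    pairs, each transaction being an iterable of items."""
    freq = Counter()
    for items, count in dataset:
        for it in set(items):
            freq[it] += count
    return [it for it, _ in freq.most_common(k)]

def to_binary_rows(dataset, target, features):
    """One (X, y) row per transaction occurrence: X has a 0/1 column per
    independent feature item, y is 1 iff the dependent (target) item occurs."""
    X, y = [], []
    for items, count in dataset:
        s = set(items)
        row = [1 if f in s else 0 for f in features]
        X.extend([row] * count)
        y.extend([1 if target in s else 0] * count)
    return X, y
```

The resulting (X, y) pairs can then be fed to any off-the-shelf classifier, training on the rows built from the synthetic dataset and testing on the rows built from the original one.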
To create instances for our 𝑚𝑠-IFM, we used the Yelp dataset [15] to extract a No-SQL table. In particular, each transaction of the No-SQL table represents a business in Yelp (e.g., restaurant, hospital, etc.). In total there are 101,824 transactions. The table has 4 attributes, among which: STARS (the average number of stars of each business – a SV attribute with 9 values), STATE (the state where the business is located – a SV attribute with 50 values), and CATEGORIES (the categories to which a particular business belongs – a MV attribute with 397 distinct values, e.g., garage, restaurant, space for kids, etc.). To create the frequency and infrequency support many-sorted constraints, we converted this No-SQL table into a transactional dataset.

Table 1: Accuracy for the top-30 frequent items with s(%) ∈ {1.2, 1.6}

classifier           | Original | s(%) = 1.2 | s(%) = 1.6
Logistic Regression  | 0.99     | 0.93       | 0.93
Decision Tree        | 0.99     | 0.85       | 0.84
Random Forest        | 0.99     | 0.92       | 0.92

In Table 1, we report the average classification results obtained for the top-30 frequent items on the two different synthetic datasets obtained by means of the two selected supports. From Table 1 it is possible to see that the generated synthetic datasets are pretty close to the original dataset in terms of accuracy. In addition, we did not observe a strong variation when the support changes. Further accuracy evaluations are reported in [12].

3. Conclusion
The Inverse Frequent Itemset Mining (IFM) problem consists of generating a transactional database satisfying given support constraints on some itemsets, which are typically the frequent ones. In order to enlarge the application domain, an extension of IFM, called multi-sorted IFM (𝑚𝑠-IFM), has been introduced in [12] that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.
This paper has given an overview of the 𝑚𝑠-IFM problem definition, its linear program formulation, its resolution using a column generation algorithm (an extension of the simplex method for dealing with linear programs with a large number of variables), and an experimental evaluation of the capability of the synthetic datasets to support the data mining task of classification with an accuracy close to the one achieved with the original datasets.

References
[1] K. Michael, K. W. Miller, Big Data: New Opportunities and New Challenges [Guest Editors' Introduction], Computer 46 (2013) 22–24. doi:10.1109/MC.2013.196.
[2] H. Wu, Y. Ning, P. Chakraborty, J. Vreeken, N. Tatti, N. Ramakrishnan, Generating Realistic Synthetic Population Datasets, ACM Trans. Knowl. Discov. Data 12 (2018) 45:1–45:22. doi:10.1145/3182383.
[3] R. Agrawal, T. Imieliński, A. Swami, Mining Association Rules Between Sets of Items in Large Databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD '93, ACM, New York, NY, USA, 1993, pp. 207–216.
[4] D. Gunopulos, R. Khardon, H. Mannila, H. Toivonen, Data Mining, Hypergraph Transversals, and Machine Learning, in: A. O. Mendelzon, Z. M. Özsoyoglu (Eds.), Proceedings of the 16th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '97, ACM Press, 1997, pp. 209–216.
[5] T. Mielikainen, On Inverse Frequent Set Mining, in: Proceedings of the 2nd Workshop on Privacy Preserving Data Mining, PPDM '03, IEEE Computer Society, Washington, DC, USA, 2003, pp. 18–23.
[6] R. Agrawal, R. Srikant, Privacy-preserving Data Mining, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, ACM, New York, NY, USA, 2000, pp. 439–450. doi:10.1145/342009.335438.
[7] X. Wu, Y. Wu, Y. Wang, Y.
Li, Privacy Aware Market Basket Data Set Generation: A Feasible Approach for Inverse Frequent Set Mining, in: Proceedings of the SIAM International Conference on Data Mining, SDM '05, SIAM, Philadelphia, PA, USA, 2005, pp. 103–114.
[8] T. Calders, Computational Complexity of Itemset Frequency Satisfiability, in: Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '04, ACM, New York, NY, USA, 2004, pp. 143–154. doi:10.1145/1055558.1055580.
[9] T. Calders, The Complexity of Satisfying Constraints on Databases of Transactions, Acta Informatica 44 (2007) 591–624.
[10] A. Guzzo, D. Saccà, E. Serra, An Effective Approach to Inverse Frequent Set Mining, in: Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM '09, IEEE Computer Society, Washington, DC, USA, 2009, pp. 806–811. doi:10.1109/ICDM.2009.123.
[11] A. Guzzo, L. Moccia, D. Saccà, E. Serra, Solving Inverse Frequent Itemset Mining with Infrequency Constraints via Large-scale Linear Programs, ACM Trans. Knowl. Discov. Data 7 (2013) 18:1–18:39. doi:10.1145/2541268.2541271.
[12] D. Saccà, E. Serra, A. Rullo, Extending Inverse Frequent Itemsets Mining to Generate Realistic Datasets: Complexity, Accuracy and Emerging Applications, Data Min. Knowl. Discov. 33 (2019) 1736–1774. doi:10.1007/s10618-019-00643-1.
[13] P. C. Gilmore, R. E. Gomory, A Linear Programming Approach to the Cutting-Stock Problem, Operations Research 9 (1961) 849–859.
[14] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, USA, 2005.
[15] Yelp Dataset Challenge, 2017. URL: https://www.yelp.com/dataset.