Multi-Sorted Inverse Frequent Itemsets Mining: On-Going Research

Domenico Saccà¹*, Edoardo Serra², and Antonio Piccolo³

¹ DIMES Department, University of Calabria, and ICT-SUD, Italy – sacca@unical.it
² Computer Science Department, Boise State University, USA – edoardoserra@boisestate.edu
³ DIMES Department, University of Calabria, Italy – piccolo@dimes.unical.it

* This work was carried out within the Cyber Security District, funded by MIUR.

Abstract. Inverse frequent itemset mining (IFM) consists of generating artificial transactional databases that reflect the patterns of real ones, in particular by satisfying given frequency constraints on the itemsets. We introduce an extension of IFM, called many-sorted IFM, in which the schemes of the datasets to be generated are those typical of Big Tables, as required in emerging big data applications, e.g., social network analytics.

Keywords: Big Data Synthesized Datasets, Inverse Data Mining, Frequent Itemsets

1 Introduction

Emerging "Big Data" platforms and applications call for novel data analysis techniques that are capable of handling large amounts of data [11]. There is therefore an increasing need to use real-life datasets for data-driven experiments but, as pointed out in a recent ACM SIGMOD Blog post by Gerhard Weikum [13], the datasets used in research papers are often poor. Since real-life workloads are often proprietary and out of reach for academic research, inverse mining techniques can be applied to generate artificial datasets that reflect the patterns of real ones: the patterns are first discovered by data mining techniques (or even provided directly by domain experts) and then used to generate "realistic" privacy-preserving datasets.

In this paper we propose to extend inverse frequent itemset mining (IFM), which consists of generating a transactional database satisfying given support constraints on the itemsets of an input set, typically the frequent ones. We recall that a popular mining task over transaction databases is to single out the set of frequent/infrequent itemsets [1, 7, 10, 6]; some basic notation on IFM is presented in Section 2.

In order to enlarge the application domain of IFM, in Section 3 we introduce an extension of IFM_I, called ms-IFM, which considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics. We present an overview of our on-going research on how to solve ms-IFM and draw conclusions in Section 4.

2 Preliminaries and Related Work

Let ℐ be a finite domain of n elements, also called items. Any subset I ⊆ ℐ is an itemset over ℐ. A (transactional) database D over ℐ (also called a dataset) is a bag of itemsets, which may occur duplicated in D; the size |D| of D is the total number of its itemsets, called transactions. Given a database D over ℐ and an itemset I, two important measures are defined: (i) the number of duplicates of I, denoted δ_D(I), that is the number of occurrences of I in D, and (ii) the support of I, denoted σ_D(I), that is the sum of the numbers of duplicates of the itemsets in D containing I, i.e., σ_D(I) = Σ_{J ∈ D ∧ I ⊆ J} δ_D(J); an alternative measure is the frequency f_D(I) = σ_D(I)/|D|. A database D can be represented in a succinct format as a set of pairs (I, δ_D(I)). We say that I is a frequent (resp., infrequent) itemset in D if its support is greater than or equal to (resp., less than) a given threshold.
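As a concrete illustration of these measures, the sketch below (in Python; the helper names and the dictionary encoding are ours and merely illustrative, not part of the formalism) computes δ_D(I), σ_D(I) and f_D(I) for a database stored in the succinct (I, δ_D(I)) format.

```python
# A database D in the succinct format of Section 2: a dict mapping each
# distinct itemset (a frozenset) to its number of duplicates delta_D(I).
D = {
    frozenset({"a", "b"}): 4,
    frozenset({"a", "b", "c"}): 2,
    frozenset({"c"}): 3,
}

def size(D):
    """|D|: total number of transactions, counting duplicates."""
    return sum(D.values())

def support(D, I):
    """sigma_D(I): sum of duplicate counts of all itemsets J in D containing I."""
    I = frozenset(I)
    return sum(dup for J, dup in D.items() if I <= J)

def frequency(D, I):
    """f_D(I) = sigma_D(I) / |D|."""
    return support(D, I) / size(D)

def is_frequent(D, I, threshold):
    """I is frequent in D when its support reaches the given threshold."""
    return support(D, I) >= threshold

print(support(D, {"a"}))         # 6
print(frequency(D, {"a", "b"}))  # 6/9
print(is_frequent(D, {"c"}, 5))  # True: sigma_D({c}) = 5
```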
The perspective of the frequent itemset mining problem can be naturally inverted: we are given in advance a set of itemsets together with their frequency constraints, and the goal is to decide whether there is a transaction database satisfying those constraints. This problem, called the inverse frequent itemset mining problem (IFM), was introduced in the context of defining generators for benchmarks of mining algorithms [12] and was subsequently reconsidered in privacy-preserving contexts [2]. IFM has been proved to be in PSPACE and NP-hard. Observe that the original IFM formulation does not introduce any constraint on infrequency. A reformulation of IFM in terms of frequencies instead of supports has been introduced in [3, 4].

A drawback of IFM is that a solution may contain itemsets that are not expected to be frequent. A simple solution to exclude such itemsets from a feasible solution was first proposed in [9]; the decision complexity of this problem is NP-complete. An elaborated version of IFM with infrequency support constraints (IFM_I for short) has been recently proposed in [8], and its decision complexity is NEXP-complete.

3 Many-Sorted IFM for Big Data Applications

Let a NoSQL relation R(K, A_1, ..., A_p, A_{p+1}, ..., A_{p+q}) be given, where K is the table key, A_1, ..., A_p are classical single-valued (SV) attributes and A_{p+1}, ..., A_{p+q} are multi-valued (MV) attributes. We assume that the attributes are ordered, i.e., K is the first attribute, A_1 the second attribute, A_{p+1} the (1+p+1)-th attribute, and so on. For each i, 1 ≤ i ≤ p+q, let 𝒜_i be the finite domain of the attribute A_i and |𝒜_i| = n_i. We assume that the values of every domain 𝒜_i (called items) are given in input; they are SV items or MV items depending on whether the attribute A_i is SV or MV. On the other hand, the domain 𝒦 of the key K is countably infinite and, therefore, its values are not listed. Let ṅ = Σ_{i=1}^{p} n_i, n̈ = Σ_{i=p+1}^{p+q} n_i and n = ṅ + n̈.

A NoSQL tuple on R has the form t = [k, a_1, ..., a_p, g_1, ..., g_q], where k ∈ 𝒦, a_i ∈ 𝒜_i for each i, 1 ≤ i ≤ p (a_i is an item of t), and g_i ⊆ 𝒜_{p+i} for each i, 1 ≤ i ≤ q (g_i is an itemset of t). Items and itemsets are called values of t. A NoSQL table on R is a finite set of NoSQL tuples.

A many-sorted transaction T is a (p+q)-tuple [a_1, ..., a_p, g_1, ..., g_q], i.e., a NoSQL tuple with the key attribute value dropped out. It follows that a many-sorted transaction can be transformed into a tuple simply by inventing a key for it. Given any attribute A_i in R and a many-sorted transaction T, T.A_i denotes the value of T for the attribute A_i (i.e., a_i ∈ 𝒜_i if i ≤ p, or g_{i−p} ⊆ 𝒜_i if i > p). Let 𝒯 be the set of all many-sorted transactions. The cardinality of 𝒯 is n_𝒯 = 2^n̈ · Π_{i=1}^{p} n_i.

A many-sorted dataset D is a set of pairs (T, δ_D(T)), where T is a many-sorted transaction and δ_D(T) is the number of occurrences of T. D can be transformed into a NoSQL table T_D by making δ_D(T) tuple duplicates of each many-sorted transaction T. The cardinality of D, denoted |D|, is the number of pairs in D, and the size of D is δ_D = |T_D| = Σ_{T ∈ D} δ_D(T). We stress that in general δ_D ≫ |D|. From now on, we shall omit the term many-sorted whenever it is clear from the context.

A sub-transaction S is a (p+q)-tuple [a_1, ..., a_p, g_1, ..., g_q] on R for which the domain of each SV attribute is extended with ⊥, which stands for a null value.
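To fix ideas on the data model, the following sketch (illustrative: the attribute names, the tuple encoding and the helper names are ours, not taken from the paper) represents a many-sorted dataset D as pairs (T, δ_D(T)) and expands it into the NoSQL table T_D by inventing keys.

```python
import itertools

# Illustrative encoding of a NoSQL relation R(K, A1..Ap, Ap+1..Ap+q)
# with p = 2 SV attributes and q = 2 MV attributes.
SV_DOMAINS = {"Gender": {"Male", "Female"}, "Location": {"Rome", "Cosenza"}}
MV_DOMAINS = {"Groups": {"g1", "g2", "g3"}, "Events": {"e1", "e2"}}

# A many-sorted transaction: the SV items followed by the MV itemsets
# (frozensets), i.e., a NoSQL tuple with its key dropped.
T = ("Male", "Rome", frozenset({"g1", "g2"}), frozenset({"e1"}))

# A many-sorted dataset D: pairs (T, delta_D(T)) stored as a dict.
D = {
    T: 3,
    ("Female", "Rome", frozenset({"g1"}), frozenset()): 2,
}

def nosql_table(D):
    """Expand D into the NoSQL table T_D, inventing a fresh key per duplicate."""
    keys = itertools.count(1)
    return [(next(keys),) + T for T, dup in D.items() for _ in range(dup)]

def dataset_size(D):
    """delta_D = |T_D|: the total number of tuples, counting duplicates."""
    return sum(D.values())

print(len(nosql_table(D)), dataset_size(D))  # 5 5 (while the cardinality |D| is 2)
```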
Let ⊥(S) denote the number of null values in S; a transaction T can also be seen as a sub-transaction for which ⊥(T) = 0. The length of S, denoted l(S), is the number of values different from ⊥ and ∅.

A sub-transaction S subsumes a transaction T (written S ⊑ T) if for each SV attribute A_i, either S.A_i = ⊥ or S.A_i = T.A_i, and for each MV attribute A_i, S.A_i ⊆ T.A_i. Observe that if S happens to be a transaction, then every transaction T for which S ⊑ T has the same SV items as S, whereas its MV itemsets are supersets of the corresponding itemsets in S. In the classical IFM setting, every sub-transaction S coincides with an itemset and the transactions T subsumed by S are exactly the itemsets for which S ⊆ T. This analogy explains why we write S ⊑ T for transaction subsumption.

Given a sub-transaction S for which l(S) > 0 and two integers σ_1 and σ_2 for which 0 ≤ σ_1 ≤ σ_2, γ = ⟨S, σ_1, σ_2⟩ represents a frequency support constraint, defined as follows: a database D satisfies γ (written D |= γ) if σ_1 ≤ Σ_{T ∈ D ∧ S ⊑ T} δ_D(T) ≤ σ_2. An infrequency support constraint γ is a pair ⟨S, σ⟩ and is actually a shorthand for the frequency support constraint ⟨S, 0, σ⟩; the upper bound σ will be simply referred to as γ_#. A (frequency or infrequency) support constraint γ whose sub-transaction γ_S has length l(γ_S) = 1 is called a domain support constraint. Given a set Π of support constraints, a database D satisfies Π (written D |= Π) if D |= γ for each γ ∈ Π.

Example 1. Individuals are characterized by the SV attributes Gender, Location and Age and by the MV attributes Groups and Events: an individual may belong to various groups and may attend a number of events. The transaction I = [Male, Rome, 25, {g1, g4}, {e1, e3}] represents a 25-year-old male individual located in Rome who belongs to the groups g1 and g4 and attends the events e1 and e3. The transaction J = [Female, Rome, 20, {g1, g2}, ∅] represents an individual who does not attend any event. Note that, as the attributes do not define a key, there may exist several occurrences of the same individual. Examples of constraints are shown next.

Frequency constraints:
– ⟨[Male, Rome, ⊥, {g1, g2}, ∅], 10000, 20000⟩ fixes the range for the overall number of duplicates of male individuals located in Rome who belong at least to the groups g1 and g2;
– ⟨[Female, ⊥, 25, {g1, g2}, {e1, e3}], 500, 1000⟩ fixes the range for the overall number of duplicates of 25-year-old female individuals who belong at least to the groups g1 and g2 and attend at least the events e1 and e3.

Infrequency constraints:
– ⟨[⊥, Cosenza, ⊥, {g1, g2}, ∅], 100⟩ states that, in a feasible dataset, the number of individuals located in Cosenza who belong at least to the groups g1 and g2 is at most 100;
– ⟨[⊥, ⊥, ⊥, ∅, {e1, e2}], 10000⟩ states that the number of individuals attending at least the events e1 and e2 is at most 10000;
– ⟨[⊥, ⊥, ⊥, {g1}, ∅], 100000⟩ is a domain support constraint stating that the number of individuals belonging at least to the group g1 is at most 100000. □

From now on, we assume that the following sets of constraints are given: (1) a set Σ of frequency support constraints with cardinality m = |Σ| > 0 and (2) a set Σ̂ of infrequency support constraints with cardinality m̂ = |Σ̂| ≥ 0. Let Σ̆ = Σ ∪ Σ̂ and m̆ = |Σ̆| = m + m̂.
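To make subsumption and constraint satisfaction concrete, the sketch below (illustrative: the tuple encoding, the use of None for ⊥ and the function names are ours, not part of the formalism) checks S ⊑ T and D |= ⟨S, σ_1, σ_2⟩, reusing the transaction encoding of the previous sketch.

```python
# Sub-transactions reuse the tuple layout of the previous sketch, with None
# standing for the null value ⊥ on SV positions. P is the number of SV attributes.
P = 2  # illustrative schema as above: Gender, Location

def subsumes(S, T, p=P):
    """S ⊑ T: SV positions agree wherever S is non-null; MV itemsets of S are subsets."""
    sv_ok = all(s is None or s == t for s, t in zip(S[:p], T[:p]))
    mv_ok = all(s <= t for s, t in zip(S[p:], T[p:]))
    return sv_ok and mv_ok

def satisfies(D, constraint, p=P):
    """D |= <S, sigma1, sigma2>: the summed duplicates of the transactions subsumed
    by S must fall within [sigma1, sigma2]. An infrequency constraint <S, sigma>
    is just <S, 0, sigma>."""
    S, sigma1, sigma2 = constraint
    supp = sum(dup for T, dup in D.items() if subsumes(S, T, p))
    return sigma1 <= supp <= sigma2

D = {
    ("Male", "Rome", frozenset({"g1", "g2"}), frozenset({"e1"})): 3,
    ("Female", "Rome", frozenset({"g1"}), frozenset()): 2,
}
gamma = ((None, "Rome", frozenset({"g1"}), frozenset()), 2, 10)
print(satisfies(D, gamma))  # True: 3 + 2 = 5 subsumed duplicates lie in [2, 10]
```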
Given R, Σ̆ = Σ ∪ Σ̂, and an integer size > 0, the many-sorted inverse frequent itemset mining problem, shortly denoted ms-IFM, consists of finding a many-sorted dataset D on R such that both δ_D = size and D |= Σ̆ (or of stating that no such dataset exists). By assuming that items, constraint bounds and size are stored using a constant amount of space, the input size of the problem is O(n + m̆ (p + q n̈ + 1) + m).

It is easy to see that ms-IFM reduces to the classical IFM problem if p = 0 and q = 1, i.e., there exists exactly one non-key attribute in R and this attribute is of MV type, say G: a transaction is then any itemset on the domain 𝒢 of G. The next result characterizes the complexity of ms-IFM: the decision version of ms-IFM is in PSPACE and NP-hard. To obtain a more efficient resolution algorithm, we relax the integrality constraint on the number of duplicates of a transaction in a database D, i.e., duplicate counts may be rational numbers. We call this version of the problem relaxed ms-IFM; its decision version is NP-complete. Note that, as shown in [8], the higher complexity of the IFM_I problem derives from the task of discovering "minimal" itemsets that must be enforced to be infrequent. Instead, such infrequent itemsets are assumed to be part of the input for ms-IFM.

4 Notes on ms-IFM Resolution and Conclusions

Following the approach of [8], we formulate the ms-IFM problem as a linear program (LP) with a linear number of constraints (corresponding to the given support constraints) and an exponential number of variables, one for each possible transaction, indicating the number of its occurrences in the target database. We represent the LP in a succinct format with size O(n m), where n is the total number of items and m is the total number of constraints.

In [8] the simplex method is adopted to solve the LP: notwithstanding its exponential worst-case running time, it continues to be used effectively and efficiently to solve linear programs in practice. The main issue is how to apply the simplex to an LP with an exponential number of variables. The literature offers an interesting solution: column generation (see, e.g., [5]), a version of the simplex method for linear programs with a large number of variables (large-scale linear programs). This method solves a linear program without explicitly including all columns (i.e., variables) in the coefficient matrix, but only a subset of them with cardinality equal to the number of rows (i.e., constraints). Columns are dynamically generated by solving an auxiliary optimization problem called the pricing problem. As the decision version of the latter problem is proven to be NP-complete for IFM_I, a heuristic algorithm for the pricing problem is used; an exact exponential-time pricing algorithm is presented as well, which is executed only in the last iterations of column generation. A large number of experiments have confirmed the capability of the approach to solve large instances on both synthesized and real datasets.

Our on-going research is focused on adapting the approach of [8] to solve ms-IFM by providing suitable new solvers for the pricing problem: an exact solver and a heuristic one.
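As a concrete but deliberately simplified illustration of this LP formulation, the sketch below solves the relaxed ms-IFM LP restricted to an explicitly enumerated candidate set of transactions, using SciPy's linprog. The function names, the tuple encoding and the restriction to a fixed candidate pool are our illustrative assumptions; the column-generation machinery of [8], where new columns are priced dynamically from the LP duals, is not shown.

```python
import numpy as np
from scipy.optimize import linprog

def subsumes(S, T, p):
    """S ⊑ T, as in the earlier sketch: None encodes ⊥ on SV positions."""
    return (all(s is None or s == t for s, t in zip(S[:p], T[:p]))
            and all(s <= t for s, t in zip(S[p:], T[p:])))

def solve_relaxed_ms_ifm(candidates, constraints, size, p):
    """Feasibility LP of relaxed ms-IFM restricted to an explicit candidate set.
    candidates: many-sorted transactions T, one LP variable x_T >= 0 each.
    constraints: triples (S, sigma1, sigma2); an infrequency constraint <S, sigma>
    is passed as (S, 0, sigma). Returns rational duplicate counts, or None."""
    a = np.array([[1.0 if subsumes(S, T, p) else 0.0 for T in candidates]
                  for S, _, _ in constraints])
    # sigma1 <= a.x <= sigma2 becomes two '<=' rows per support constraint.
    A_ub = np.vstack([a, -a])
    b_ub = np.concatenate([[s2 for _, _, s2 in constraints],
                           [-s1 for _, s1, _ in constraints]])
    res = linprog(c=np.zeros(len(candidates)),         # pure feasibility objective
                  A_ub=A_ub, b_ub=b_ub,
                  A_eq=np.ones((1, len(candidates))),  # total size: sum_T x_T = size
                  b_eq=[float(size)],
                  bounds=(0, None), method="highs")
    return dict(zip(candidates, res.x)) if res.success else None

# Tiny illustrative instance: two candidate transactions, one frequency constraint.
cands = [("Male", "Rome", frozenset({"g1"}), frozenset()),
         ("Female", "Rome", frozenset(), frozenset())]
cons = [((None, "Rome", frozenset({"g1"}), frozenset()), 2, 5)]
print(solve_relaxed_ms_ifm(cands, cons, size=6, p=2))
```

In a full column-generation scheme the candidate pool would not be fixed: the dual values of the support constraints define the pricing problem, and a priced-out transaction with negative reduced cost is added as a new column until no such column exists.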
References

1. R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD '93, pages 207–216, New York, NY, USA, 1993. ACM.
2. R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, pages 439–450, New York, NY, USA, 2000. ACM.
3. T. Calders. Computational complexity of itemset frequency satisfiability. In Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '04, pages 143–154, New York, NY, USA, 2004. ACM.
4. T. Calders. The complexity of satisfying constraints on databases of transactions. Acta Informatica, 44(7–8):591–624, 2007.
5. G. B. Dantzig and M. N. Thapa. Linear Programming 2: Theory and Extensions. Springer-Verlag, 2006.
6. B. Goethals and M. J. Zaki. Advances in frequent itemset mining implementations: report on FIMI'03. SIGKDD Explorations Newsletter, 6(1):109–117, 2004.
7. D. Gunopulos, R. Khardon, H. Mannila, and H. Toivonen. Data mining, hypergraph transversals, and machine learning. In A. O. Mendelzon and Z. M. Özsoyoglu, editors, PODS '97, pages 209–216. ACM Press, 1997.
8. A. Guzzo, L. Moccia, D. Saccà, and E. Serra. Solving inverse frequent itemset mining with infrequency constraints via large-scale linear programs. TKDD, 7(4):18, 2013.
9. A. Guzzo, D. Saccà, and E. Serra. An effective approach to inverse frequent set mining. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM '09, pages 806–811, Washington, DC, USA, 2009. IEEE Computer Society.
10. J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery, 15(1):55–86, 2007.
11. K. Michael and K. W. Miller. Big data: New opportunities and new challenges [guest editors' introduction]. Computer, 46(6):22–24, 2013.
12. T. Mielikainen. On inverse frequent set mining. In Proceedings of the 2nd Workshop on Privacy Preserving Data Mining, PPDM '03, pages 18–23, Washington, DC, USA, 2003. IEEE Computer Society.
13. G. Weikum. Where's the Data in the Big Data Wave? ACM SIGMOD Blog, 2013. http://wp.sigmod.org/?p=786.