All the existing MDL-based approaches compute only one model, i.e., they search for a single set of mutually optimal (non-redundant) patterns. In [16], based on the assumption that a dataset may contain several models, the authors propose to compute a sequence of MDL-optimal pattern sets at different levels of specification. They introduce a structure function over datasets that maps a natural number k ∈ N to a set of k MDL-optimal patterns. All pattern sets are computed independently of each other, and the model (pattern set) of size k may differ by more than one pattern from the model of size k + 1. However, the patterns from different models (pattern sets) may be similar (or repetitive); thus, the whole set of models may be redundant.

The idea of using several models to describe data is widely used in supervised methods. The class of supervised learning methods closest to PM is the one based on decision trees [15]. A label of a tree node can be interpreted as an element of a pattern, so a pattern corresponds to the set of labels on the path from the root to a leaf. All modern tree-based approaches rely on ensembles of trees, e.g., random forests [3], extremely randomized trees [10], boosted trees [8], etc. Tree-based methods are known to have the following properties: (i) an ensemble of trees is better than a single decision tree, (ii) an ensemble of “shallow” trees outperforms an ensemble of deeper trees, (iii) an ensemble of boosted trees (an ensemble in which each new tree focuses on the examples misclassified by the previous trees) outperforms other tree-ensemble models. These rules of thumb work well in general, for a wide variety of datasets. PM approaches have never been considered from this perspective.

Based on the assumption that the best practices of supervised learning may work well in unsupervised settings, we propose to adapt the aforementioned ensemble-building principles to PM approaches. As the basic method for computing a pattern set, we consider the state-of-the-art MDL-based PM approach called Krimp [19]. It is an analogue of a decision tree model in the sense that both compute pattern sets (in the case of decision trees, a pattern corresponds to a path from the root to a leaf) in which each pattern is computed depending on the other patterns. The proposed MDL-based multi-model approach (i) uses at most a quadratic number of “easily-derivable” candidates for optimal patterns, (ii) builds a sequence of models (pattern sets), where each successive model describes the data fragments that have been insufficiently well described by the previous models, and (iii) evaluates the interestingness of patterns by their MDL-based informativeness. Non-redundancy of patterns within a single model (pattern set) is ensured by MDL, while non-redundancy of patterns across models is ensured by the proposed data projection strategy, i.e., using only the poorly described data fragments for computing the next model.

The paper has the following structure. In Section 2 we introduce the basic notions. In Section 3.1 we consider the principles of computing simply-derivable itemsets (an analogue of shallow trees). In Section 3.2 we present an approach for computing a sequence of models (pattern sets), where each new model (pattern set) describes those regions of the dataset that are insufficiently well described by the previous models. In Section 4 we study the quality of the proposed approach experimentally. In Section 5 we conclude and give directions for future work.
2 Basic Notions

2.1 Basic Notions of Pattern Mining

We work with transactional datasets and represent them as binary tables. In binary datasets, attributes are also called items, and patterns are called itemsets. As patterns we consider itemsets consisting of at least 2 attributes and describing at least 2 objects.

Let I = {i1, . . . , iM} be a set of attributes. A dataset D is a bag of N transactions over the attributes in I, where each transaction t is a subset of I, i.e., t ⊆ I. The pattern search space consists of all possible subsets of I, i.e., it is the powerset P(I) of size 2^|I|. An itemset X occurs in a transaction t iff X ⊆ t. We call the set of transactions in which X occurs the extent of X, i.e., ext(X) = {t ∈ D | X ⊆ t}. The frequency of X is the (relative) size of its extent, i.e., freq(X) = |ext(X)| (or |ext(X)|/N). The equivalence class Equiv(X) of a pattern X is the set of itemsets with the same extent, Equiv(X) = {Y ∈ P(I) | ext(Y) = ext(X)}.

In our study we use closed itemsets as patterns. A closed itemset C is the maximal set of items common to a set of objects [13, 14]; it is unique in its equivalence class. We call the closed itemset C ∈ Equiv(X) the closure of X and denote it by cl(X). Closed itemsets are a very common choice of candidates for MDL-optimal patterns, since (i) a closed itemset is the maximal set that represents all the itemsets with the same extent, and (ii) closed itemsets provide a lossless representation of these itemsets [14]. An example of a transactional dataset and the corresponding set of closed itemsets (partially ordered by inclusion) is given in Figure 1 (a, b).

Figure 1: (a) a dataset, (b) the set of closed itemsets partially ordered by inclusion, where patterns (of size and frequency at least 2) are highlighted in bold, (c) the standard code table ST, (d) a code table containing the patterns ABC and AB, (e) the covering (encoding) with the patterns ABC and AB.

Figure 2: (a) adding AB (optimal), (b) adding ABC (optimal), (c) adding AD (not optimal), (d) adding BE (not optimal). The final set of optimal patterns is {ABC, AB}.

For the sake of simplicity, we denote itemsets by sequences of letters, e.g., ABC denotes the attribute set {A, B, C}.
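To make these notions concrete, the following Python sketch computes extents, closures, and frequencies for a small hypothetical dataset; the dataset, the function names, and the set-based representation are our own illustration and are not taken from Figure 1.

def ext(X, D):
    """Extent of itemset X: the (indices of) transactions containing X."""
    return {i for i, t in enumerate(D) if X <= t}

def cl(X, D, I):
    """Closure of X: the maximal itemset common to all transactions in ext(X)."""
    rows = ext(X, D)
    if not rows:
        return set(I)          # convention: an empty extent yields the full item set
    common = set(I)
    for i in rows:
        common &= D[i]
    return common

def freq(X, D, relative=False):
    """Frequency of X: the (relative) size of its extent."""
    n = len(ext(X, D))
    return n / len(D) if relative else n

# A hypothetical toy dataset: each transaction is a set of items.
D = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "D", "E"}, {"A", "D"}, {"B", "E"}]
I = sorted(set().union(*D))
print(ext({"A", "B"}, D))      # {0, 1, 2}
print(cl({"A", "B"}, D, I))    # {'A', 'B'}: AB is closed in this toy dataset
print(freq({"A", "B"}, D))     # 3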
2.2 Minimum Description Length Principle in Pattern Mining

For a given dataset D and a set of models M, the best model M ∈ M is the one that minimizes the total description length L(D, M) = L(M) + L(D|M), where L(M) is the length, in bits, of the description of M, and L(D|M) is the length, in bits, of the description of the data when encoded with M [11].

We consider how MDL is used in PM using the state-of-the-art method called Krimp [19]. In Krimp, the model M is represented by a two-column code table CT, whose left- and right-hand columns contain itemsets over the attributes in I and their associated prefix codes, respectively. The initial code table is called the standard code table and is denoted by ST. It contains only singletons (an example is given in Figure 1, c, and explained below); the code length of a singleton is inversely related to its frequency. By convention, singletons are not patterns. The left-hand column of a code table contains patterns described attribute-wise, using the optimal codes from ST, i.e., the description length of a pattern X is L(X|ST), and the description length of the whole column is ∑_{X∈CT} L(X|ST). The length of the optimal code of a pattern X is computed using Shannon optimal codes, i.e., L(code(X)) = −log2(P(X)), where P(X) is the probability of pattern X. The probability of a pattern is based on its usage in the data covering (i.e., on how many times the pattern is used to cover the dataset). The probability of X ∈ CT is given by

P(X) = usage(X) / ∑_{X*∈CT} usage(X*).   (1)

An example of a code table CT and of the covering with the patterns from CT is given in Figure 1, d and e, respectively. The computation of CT is guided by the MDL principle: a pattern is added to CT only if it allows for reducing the total description length. The total length is given by L(D, CT) = L(CT|D) + L(D|CT), where the length of the dataset D encoded with CT is L(D|CT) = ∑_{X∈CT} usage(X) · L(code(X)), and the length of the code table is L(CT|D) = ∑_{X∈CT} (L(X|ST) + L(code(X))). The code table is filled greedily.

We explain the principle of covering by means of a small example given in Figure 2. The candidates for optimal patterns (e.g., the patterns from Figure 1, b) are arranged in the standard candidate order, i.e., by “frequency” ↓, “|X|” ↓, and lexicographically ↑, where ↓/↑ denote the descending/ascending order, respectively. Then, one by one, patterns are added to the code table, where they are arranged in the standard cover order, i.e., by “|X|” ↓, “frequency” ↓, and lexicographically ↑. Only those patterns that reduce L(D, CT) are retained in the code table. The steps of building the optimal CT using the candidates AB, ABC, AD, BE (considered in the standard candidate order) are shown in Figure 2. The first pattern, AB, is added to the standard code table ST (Figure 2, a); it reduces L(D, CT), thus the optimal pattern set is {AB}. Then ABC is added to CT (Figure 2, b); it reduces L(D, CT), and the updated code table contains {ABC, AB}. Then AD is added (Figure 2, c); it does not reduce L(D, CT), so the optimal pattern set remains unchanged. The same holds for BE (Figure 2, d): it does not reduce L(D, CT). For the sake of simplicity, we show only the covering by patterns; the corresponding code tables can be reconstructed from the covering as in the example in Figure 1 (d, e).

3 Keep It Simple: an Algorithm for Discovering Useful Pattern Sets

3.1 Simply-derivable patterns

One of the main difficulties of PM approaches is that they compute the candidates for optimal patterns in advance (in contrast to tree-based approaches, where a tree is built on the fly). Considering all frequent patterns as candidates is infeasible in practice and implies dealing with redundant (similar) patterns. Restricting the set of frequent patterns by increasing the frequency threshold does not solve this problem; moreover, some interesting but infrequent patterns may be missed. Using only closed itemsets partially solves the redundancy problem; however, the number of closed itemsets can still be exponential. We propose a deterministic, threshold-free approach for building a candidate set comprising at most a quadratic number of patterns w.r.t. the number of attributes. These patterns are “easily derivable”, since they are computed as the closure of a pair of attributes, i.e., for a dataset D over attributes I, the set of candidates is given by F = {cl({ik, ij}) | ik, ij ∈ I, k > j}. In what follows, we call these patterns biclosed itemsets. For the dataset from Figure 1 (a), the set of biclosed itemsets is {AB, AD, BE, ABC, ABDE, ABCDE}. For biclosed itemsets of size at least 3 there are several ways to derive them, e.g., cl(AC) = cl(BC) = ABC and cl(CD) = cl(CE) = ABCDE.
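The candidate generation itself is straightforward; the following Python sketch illustrates it on a hypothetical toy dataset (the data and the representation are our own assumptions, used only for illustration):

from itertools import combinations

def biclosed_candidates(D):
    """Candidates F = {cl({ik, ij}) | ik, ij in I, k > j}: the closures of all
    attribute pairs, i.e., at most |I|(|I|-1)/2 itemsets before deduplication."""
    I = sorted(set().union(*D))
    F = set()
    for a, b in combinations(I, 2):
        extent = [t for t in D if {a, b} <= t]           # transactions containing both items
        if len(extent) >= 2:                             # a pattern must describe >= 2 objects
            F.add(frozenset(set.intersection(*extent)))  # maximal common itemset (the closure)
    return F

# A hypothetical toy dataset (not the dataset of Figure 1):
D = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "D", "E"}, {"A", "D"}, {"B", "E"}]
print(sorted("".join(sorted(X)) for X in biclosed_candidates(D)))   # ['AB', 'ABC', 'AD', 'BE']

Duplicates (here, ABC is obtained from both AC and BC) are removed by collecting the closures in the set F.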
Biclosed itemsets thus have two useful properties: on the one hand, a biclosed itemset is easy to reproduce, since it is enough to know only two of its attributes; on the other hand, it is maximal, i.e., it cannot be extended without reducing its support. The number of biclosed itemsets is bounded by |I|(|I| − 1)/2.

3.2 Multi-model Description

Since the number of biclosed candidates is much lower than the number of frequent patterns, they do not cover a dataset as well as patterns taken from an exponentially large candidate set, such as the candidates used in Krimp. Thus, larger fragments of the data remain uncovered. Using the principle of “boosting”, i.e., building a new model based on the data fragments that have been poorly described by the previous models, we propose to explore patterns gradually by means of a sequence of models (pattern sets). At each iteration, we consider a data fragment (projection) U of the binary dataset D that has not been entirely described by the models from the previous iterations.

We propose an algorithm called KeepItSimple (KIS); its pseudocode is given in Algorithm 1. We consider two versions of the algorithm that differ in the way projections are computed (i.e., in how the insufficiently well described fragments are defined). To compute a single model we use the Krimp algorithm; however, instead of frequent (closed) patterns, we use biclosed itemsets as candidates. We also change the probability estimates (Formula 1); the details are given in Section 3.3.

KIS works as follows. The initial code table is the standard code table (line 1), and the candidates F are the biclosed itemsets (line 3). The candidates F are then used to cover the dataset greedily; we use the same cover principles as in Krimp (see Section 2.2 for details). Once the first pattern set (model) has been computed, there are two ways to define the projection, i.e., the fragment U of the data that will be used instead of D at the next iteration to find a new set of optimal patterns. We consider a simple (S) and a detailed (D) projection. To compute a projection (line 12), we consider the attributes I* ⊆ I that are only partially covered by the patterns from the current code table. For the S-projection we keep the whole columns corresponding to the partially covered attributes, while for the D-projection we keep only the uncovered fragments of these columns (see lines 12S and 12D, respectively). The process of computing new code tables stops when there are no biclosed candidates left or when the next optimal code table is the standard one.

Example. Let us consider how the algorithm works on the dataset from Figure 1. The steps of Algorithm 1 are given to the right of the pseudocode. At the first step we use the original dataset D instead of a projection. The optimal pattern set is C = {AB, ABC}. Given the covered data Cover(U, C), the projections US and UD are shown in Step 2. For the S-projection the whole columns are used, while for the D-projection only the uncovered fragments are kept (US and UD, respectively). These projections are used to compute new candidates. The biclosed itemsets computed on the projected data US and UD are F = {AD, BE} and F = ∅, respectively; the corresponding optimal patterns are C = {AD, BE} and C = ∅. Step 3 is performed only for the S-projected data. Since the only possible biclosed itemset, AB, is already in the set of optimal patterns, the candidate set F is empty. The final optimal pattern sets are C = {AB, ABC, AD, BE} and C = {AB, ABC} for the S- and D-projections, respectively.
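The two projection strategies used in line 12 of Algorithm 1 (given below) can be sketched as follows; the cell-level representation of the data as a set of (transaction, item) cells and the toy cover are our own assumptions, used only for illustration:

def uncovered_attributes(cells, covered):
    """The attributes I* that still have at least one uncovered cell."""
    return {item for (row, item) in cells - covered}

def s_projection(cells, covered):
    """Simple projection: keep the whole columns of the attributes in I*."""
    partial = uncovered_attributes(cells, covered)
    return {(row, item) for (row, item) in cells if item in partial}

def d_projection(cells, covered):
    """Detailed projection: keep only the uncovered cells of the attributes in I*."""
    partial = uncovered_attributes(cells, covered)
    return {(row, item) for (row, item) in cells - covered if item in partial}

# Hypothetical data: five transactions over items A-E, viewed as (row, item) cells.
D = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "D", "E"}, {"A", "D"}, {"B", "E"}]
cells = {(r, item) for r, t in enumerate(D) for item in t}
# Suppose the first model covers rows 0-1 with ABC and row 2 with AB:
covered = {(0, "A"), (0, "B"), (0, "C"), (1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B")}
US = s_projection(cells, covered)   # the full columns A, B, D, E (C is fully covered)
UD = d_projection(cells, covered)   # only the uncovered cells of A, B, D, E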
Algorithm 1 KeepItSimple(D)
Require: binary dataset D over a set of attributes I
Ensure: itemset collection P
1: L ← StandardCodeTableLength(D)
2: U ← D; I* ← I; P ← ∅
3S: F ← ComputeBiclosed(U)
4: run ← True
5: while run do
6:    Lold ← L
7S:   F ← ComputeInducedBiclosed(F, I*)
7D:   F ← ComputeBiclosed(U)
8:    C ← MDLOptimal(U, F)
9:    L ← L(U | C) + L(C | U)
10:   U ← U \ Cover(U, C)
11:   I* ← UncoveredAttributes(U)
12S:  U ← Projection(D, I*)
12D:  U ← Projection(U, I*)
13:   P ← P ∪ C
14:   run ← run and (Lold > L) and (I* ≠ ∅)
15: end while
16: return P

3.3 Alternative Estimates of Patterns

We also propose an alternative estimate of pattern probability: instead of the usage in the covering (Formula 1), we propose to use the frequency:

P(X) = frequency(X) / ∑_{X*∈CT} frequency(X*).   (2)

Replacing usage by frequency means that patterns may overlap. According to the compression principle, such repetitive encoding does not provide the best compression (i.e., the best reduction of L(D, CT) w.r.t. L(D, ST)), but it imposes a heavier penalty on pattern similarity. We leave the study of the side effects of this change out of the scope of this paper. In Figure 3 we show how the covering changes w.r.t. Krimp (see Figure 2) for the same candidates arranged in the same order. The set of MDL-optimal patterns {ABC, AB} computed by Krimp (Figure 2, b) and the set of MDL-optimal patterns {AB, AD, BE} computed with the proposed estimates (Figure 3, d) are different: the latter covers a larger fragment of the data and its patterns are less similar. In our experiments we show that the frequency-based estimate allows one to improve the quality of the obtained pattern sets.

Figure 3: (a) adding AB (optimal), (b) adding ABC (not optimal), (c) adding AD (optimal), (d) adding BE (optimal). The final set of optimal patterns is {AB, AD, BE}; the optimal encoding is given in (b).

4 Experiments

In this section we report the results of an experimental study. We compare the following algorithms: Krimp [19] (with the disjoint estimates of Formula 1) and KIS (with the disjoint and the overlapping estimates of Formulas 1 and 2, respectively). For KIS we distinguish two types of projections, the simple (S) and the detailed (D) one. We evaluate our methods on freely available datasets (adult, auto, breast, car evaluation, chess, ecoli, glass, heart disease, hepatitis, iris, led7, mushroom, nursery, pima indians, soybean large, tic-tac-toe, wine, zoo) from the LUCS/KDD discretized data set repository [4]. As quality measures, we consider the following.

Size of pattern sets. Smaller pattern sets can be checked by experts more easily; thus, smaller sizes are preferable.

Descriptiveness (coverage ratio). We define the coverage ratio as the ratio of the entries covered by patterns to all non-empty entries. A pattern set is more descriptive if it covers a larger portion of the data.

Redundancy (average number of patterns that describe an entry). For non-redundant pattern sets this value is close to 1; high values indicate redundancy, i.e., an entry is described by several patterns.

Interestingness (F-measure, precision, recall). We understand interestingness as the ability of patterns to describe “hidden interesting groups of objects”. To evaluate interestingness we consider labeled datasets, where the class labels are used only to evaluate patterns. As the quality measure we choose the F-measure (F1 score), since it takes into account both the precision and the recall of a description: F1 = 2 · (precision · recall) / (precision + recall).
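A minimal sketch of how these measures can be computed is given below; the cell-based cover representation and the per-pattern comparison of a pattern's extent with a class extent are our own assumptions about the evaluation protocol, not the authors' implementation:

def coverage_ratio(cells, cover_by_pattern):
    """Descriptiveness: covered non-empty cells / all non-empty cells."""
    covered = set().union(*cover_by_pattern.values()) if cover_by_pattern else set()
    return len(covered & cells) / len(cells)

def patterns_per_cell(cover_by_pattern):
    """Redundancy: average number of patterns describing a covered cell."""
    counts = {}
    for pattern_cells in cover_by_pattern.values():
        for c in pattern_cells:
            counts[c] = counts.get(c, 0) + 1
    return sum(counts.values()) / len(counts) if counts else 0.0

def f1(pattern_extent, class_extent):
    """Interestingness: F1 of a pattern's extent w.r.t. the objects of one class."""
    tp = len(pattern_extent & class_extent)
    if tp == 0:
        return 0.0
    precision = tp / len(pattern_extent)
    recall = tp / len(class_extent)
    return 2 * precision * recall / (precision + recall)

Here cells is the set of all non-empty (transaction, item) cells, and cover_by_pattern maps each optimal pattern to the cells it covers; with the disjoint estimates the cover is non-overlapping, so patterns_per_cell equals 1, whereas with the overlapping estimates it may exceed 1.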
The results of the experiments are given in Figure 4. The bars and dashes correspond to the averaged values and the standard deviations, respectively. Figure 4 shows that KIS provides more stable results, i.e., the standard deviation is smaller for the first three characteristics, namely the number of optimal patterns, the coverage ratio, and the number of patterns per cell. The quality of patterns (evaluated with the F1 measure) is higher for KIS (under any settings) than for Krimp. Below we compare the approaches in more detail.

Figure 4: The average values (bars) and standard deviations (dashes) of the quality measures of the pattern sets computed by KIS (S- and D-projections) and Krimp. For the number of patterns, we exclude the results for the “adult” dataset, since the number of Krimp-generated patterns is drastically higher than the numbers of KIS(S)- and KIS(D)-generated ones with the same (“disjoint”) estimates (1283 vs. 639 and 497, respectively).

Models with disjoint estimates. Let us compare Krimp and KIS based on S- and D-projections, where the pattern probability is estimated with Formula 1 (the “disjoint” estimates). Among KIS(S), KIS(D), and Krimp, KIS(S) has the largest average number of patterns (Figure 4, the “disjoint” bars), the highest coverage ratio, and the largest number of patterns per cell. This allows us to conclude that KIS(S)-optimal pattern sets are as redundant as Krimp-optimal ones; however, having a larger number of patterns, they describe the datasets better. KIS(D) generates roughly the same number of patterns as Krimp and has similar redundancy, but it describes larger fragments of the data than Krimp does. From these results we conclude that replacing frequent closed itemsets with biclosed ones and using a multi-model description allows us to improve the descriptiveness (coverage ratio) of pattern sets. In the case of D-projections, the complexity of the model (i.e., the number of patterns and the number of patterns per cell) does not increase.

Models with overlapping estimates. The new probability estimates (Formula 2) were proposed to reduce the redundancy of patterns. Let us compare Krimp with the KIS algorithm (with both the simple (S) and the detailed (D) projections), where KIS uses the new estimates (Figure 4, the “overlapping” bars). The results show that KIS (w.r.t. Krimp) returns smaller pattern sets that describe larger data fragments and have a lower number of patterns per cell. Thus, using the overlapping estimates we obtain more descriptive models (in terms of coverage ratio) that contain fewer patterns. It is important to notice that the quality of patterns (F1-measure) for the pattern sets computed with the overlapping estimates is higher than that for the Krimp-optimal ones.

For KIS we also consider the average number of models computed with the two types of estimates. On average, the number of KIS(D)-generated models is lower (2.69 and 2.54 for the “disjoint” and “overlapping” estimates, respectively), while the average number of KIS(S)-generated models is 4.08 and 3.54, respectively. Thus, KIS based on D-projections converges faster than KIS based on S-projections, i.e., it needs fewer steps to compute the whole pattern set.

To sum up, the proposed modifications, namely (i) the (simply-derivable) biclosed itemsets, (ii) the multi-model description, and (iii) the new probability estimates, allow us to improve the results of PM w.r.t. Krimp. Depending on the chosen settings, a particular quality measure can be improved significantly.
The best overall results are achieved by KIS based on D-projections with the proposed (overlapping) estimates. Applying KIS with these settings allows us to considerably reduce the size of the pattern sets and the pattern redundancy, while improving the descriptiveness and the quality of the patterns.

5 Conclusion

In this paper we have proposed an MDL-based multi-model approach to PM. It is based on an efficient method for computing candidates and combines the best practices of building tree-based supervised models with the state-of-the-art approach called Krimp. The results of the experiments show that the proposed approach allows one to obtain non-redundant, interesting pattern sets that describe a large portion of the data.

Acknowledgments

The work of Sergei O. Kuznetsov presented in Sections 2-3 was carried out at St. Petersburg Department of Steklov Mathematical Institute of Russian Academy of Sciences and supported by the Russian Science Foundation grant no. 17-11-01276.

References

[1] Charu C. Aggarwal and Jiawei Han, Frequent Pattern Mining, Springer, 2014.
[2] Mario Boley, Claudio Lucchese, Daniel Paurat, and Thomas Gärtner, ‘Direct local pattern sampling by efficient two-step random procedures’, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 582–590, (2011).
[3] Leo Breiman, ‘Random forests’, Machine Learning, 45(1), 5–32, (2001).
[4] F. Coenen, ‘The LUCS-KDD discretised/normalised ARM and CARM data library’, Department of Computer Science, The University of Liverpool, UK, (2003).
[5] Diane J. Cook and Lawrence B. Holder, ‘Substructure discovery using minimum description length and background knowledge’, Journal of Artificial Intelligence Research, 1, 231–255, (1993).
[6] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, John Wiley & Sons, 2012.
[7] Vladimir Dzyuba, Matthijs van Leeuwen, and Luc De Raedt, ‘Flexible constrained sampling with guarantees for pattern mining’, Data Mining and Knowledge Discovery, 31(5), 1266–1293, (2017).
[8] Yoav Freund and Robert E. Schapire, ‘A desicion-theoretic generalization of on-line learning and an application to boosting’, in European Conference on Computational Learning Theory, pp. 23–37, Springer, (1995).
[9] Liqiang Geng and Howard J. Hamilton, ‘Interestingness measures for data mining: A survey’, ACM Computing Surveys (CSUR), 38(3), 9, (2006).
[10] Pierre Geurts, Damien Ernst, and Louis Wehenkel, ‘Extremely randomized trees’, Machine Learning, 63(1), 3–42, (2006).
[11] Peter Grünwald, The Minimum Description Length Principle, MIT Press, 2007.
[12] Michael Mampaey, Jilles Vreeken, and Nikolaj Tatti, ‘Summarizing data succinctly with the most informative itemsets’, ACM Transactions on Knowledge Discovery from Data (TKDD), 6(4), 16, (2012).
[13] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal, ‘Discovering frequent closed itemsets for association rules’, in International Conference on Database Theory, pp. 398–416, Springer, (1999).
[14] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal, ‘Efficient mining of association rules using closed itemset lattices’, Information Systems, 24(1), 25–46, (1999).
[15] J. Ross Quinlan, ‘Induction of decision trees’, Machine Learning, 1(1), 81–106, (1986).
[16] Arno Siebes and René Kersten, ‘A structure function for transaction data’, in Proceedings of SIAM, pp. 558–569, (2011).
[17] Koen Smets and Jilles Vreeken, ‘Slim: Directly mining descriptive patterns’, in Proceedings of SIAM, pp. 236–247, (2012).
[18] Jilles Vreeken and Nikolaj Tatti, ‘Interesting patterns’, in Frequent Pattern Mining, eds. Charu C. Aggarwal and Jiawei Han, 105–134, Springer, (2014).
[19] Jilles Vreeken, Matthijs van Leeuwen, and Arno Siebes, ‘Krimp: mining itemsets that compress’, Data Mining and Knowledge Discovery, 23(1), 169–214, (2011).