Edit Operations on Lattices for MDL-based Pattern Summarization

Keisuke Otaki1,2 and Akihiro Yamamoto1

1 Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Japan
ootaki@iip.ist.i.kyoto-u.ac.jp, akihiro@i.kyoto-u.ac.jp
2 Research Fellow of the Japan Society for the Promotion of Science

Abstract. Closedness, a fundamental concept in FCA, is also important for addressing the pattern redundancy problem in pattern mining, where some enumerated patterns subsume others and are therefore redundant. Although many efficient mining algorithms have been proposed, finding characteristic and easily interpretable subsets of the whole set of enumerated patterns, called the pattern summarization problem, is still challenging. A widely used information-theoretic criterion, the Minimum Description Length (MDL) principle, helps us tackle the problem, but it requires designing the MDL evaluation from scratch for each type of pattern. In this paper we propose a new framework applicable to various patterns using lattices, which are beneficial for formulating the MDL evaluation. A key idea is to revise an existing model by using edit operations defined on lattices among the concepts on them, which enables us to take additional information such as background knowledge into account and helps us design the MDL evaluation. We evaluate our method experimentally to show that it helps us obtain informative results from the whole sets of patterns, and confirm that our model is applicable to various patterns.

Keywords: formal concept analysis, lattice structures, edit operations, pattern summarization, minimum description length principle

1 Introduction

Knowledge discovery using binary feature vectors is an important subject in data mining, where a binary value represents whether or not an object has some feature.
A way to construct such vectors is to use pattern mining: for a datum d and an enumerated pattern p, we compute a binary value b_{d,p} = 1 if and only if p occurs in d. Since the frequent itemset mining problem was first proposed by Agrawal and Srikant [2], various patterns and algorithms have been studied [1]. We can exploit the obtained vectors as representations of data within several machine learning methods, and tasks such as clustering and classification can be tackled with these vectors. Pattern mining is therefore an essential tool. Although many efficient mining algorithms have been studied, the pattern redundancy problem is still inevitable: some patterns subsume others, enumerated patterns are not independent, and the obtained vectors contain redundant values.

© 2015 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-84-606-7410-8, Universidad de Málaga (Dept. Matemática Aplicada), Spain.

Considering closedness, a central concept in FCA, is a fundamental way to resolve the problem; an example of closedness used in pattern mining is closed itemsets [12], known as compact representations of itemsets. The redundancy problem, however, still remains even after considering closedness. To select a subset of the whole set of enumerated patterns, the Minimum Description Length (MDL) principle has attracted much attention [7, 8, 14–16, 18]. An important idea is to represent databases with codes in a loss-less manner, and to evaluate the difficulty of doing so by the length of the codes used. For example, the Krimp algorithm [18] is a fundamental method for itemsets. Other methods for sequences and large graphs can be found in [16] and [8], respectively. Depending on the type of pattern, we need to design an evaluation method from scratch, which makes theoretical analyses difficult.
It is also hard to integrate MDL-based methods with knowledge resources, although various kinds of side information (e.g., corpora, ontologies) are helpful for selecting enumerated patterns. Developing methods that are applicable to various patterns in a unified framework and can take side information into account is therefore an important open problem. In this paper we present preliminary results on a new lattice-based framework, built on an existing MDL-based method, to enhance its utility and support the further development of such methods. Our key ideas are introducing edit operations on lattices (Section 3.1) for the loss-less representation (Proposition 1, Lemmata 1 and 2), and evaluating its difficulty by extending an existing formulation (Section 3.2). To investigate the applicability of our framework, we apply it to pattern concept lattices (Section 4). We run experiments on common benchmark datasets (Section 5). Our contributions are summarized below.

– A new lattice-based framework, extended from an existing model (code-tables, Section 2.2) using terminology from lattices.
– The equivalence of the two models for closed itemsets (Proposition 1).
– A pattern structure-based generalization: GenSets, a generalization of itemsets using pattern structures (Section 4), as an example of other lattices.
– An empirical evaluation on several benchmark datasets (Section 5).

2 Preliminary

We outline Formal Concept Analysis (FCA) and Frequent Closed Itemset Mining (FCIM) for our new lattice-based framework; the correspondence between the two is summarized in Table 1. We then explain the idea of the MDL-based algorithm for itemsets, the Krimp algorithm.

2.1 Basic Concepts

Formal Concept Analysis [5] Let K = (G, M, I) be a context, where G and M are sets of objects and attributes, respectively, and I ⊆ G × M is a binary relation.
A Galois connection between G and M, denoted shortly by (·)′, is defined as A′ = {m ∈ M | (g, m) ∈ I for all g ∈ A} for A ⊆ G and B′ = {g ∈ G | (g, m) ∈ I for all m ∈ B} for B ⊆ M, respectively.

Table 1: The correspondence between FCA and FCIM.

             FCA [5]                      FCIM with a parameter θ [12]
objects      G; a set of objects          D; a set of transactions
attributes   M; attributes                I; items
functions    {(·)′, (·)′}                 {f, g}
task         find all concepts            find all itemsets I ⊆ I satisfying freq(I) ≥ θ

A pair (A, B) is a concept if A′ = B and B′ = A, where A and B are called the extent and the intent; we denote them by A = Ext(c) and B = Int(c), respectively. For an object g ∈ G, the concept ({g}′′, {g}′) is the object concept of g, denoted by γg. Computing all concepts is an important task in FCA, and we denote the set of all concepts by B(K). For two concepts (A, B) and (C, D), the partial order ⪯ is introduced by A ⊆ C, and ⟨B(K), ⪯⟩ becomes a complete lattice. We write B when K is understood from the context. For a concept c ∈ B, we define the upset of c, denoted ↑c, by {d ∈ B | c ⪯ d}³. In addition, we define the direct upset ⇑c by {d ∈ ↑c \ {c} | there exists no e ∈ ↑c \ {c, d} s.t. c ⪯ e ⪯ d}.

Frequent Closed Itemset Mining [12] Let I be the set of items. A set I ⊆ I of items is called an itemset; a transaction is likewise an itemset. A database D = {t1, ..., tN} is a set of transactions. The frequency freq(I) of an itemset I is defined as freq(I) = |{t ∈ D | I ⊆ t}|. For a parameter θ, the frequent itemset mining (FIM) problem is the problem of finding all itemsets satisfying freq(I) ≥ θ. We denote the set of all frequent itemsets by F. For closed itemsets, two functions f and g are defined as f(T) = {i ∈ I | ∀t ∈ T, i ∈ t} for T ⊆ D and g(I) = {t ∈ D | ∀i ∈ I, i ∈ t} for I ⊆ I. An itemset I is closed if and only if f(g(I)) = I.
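The definitions above are easy to make concrete. The following is a minimal sketch (ours, not code from the paper) of the derivation operators (·)′ on a small, hypothetical context, together with the concept test A′ = B and B′ = A:

```python
# Minimal sketch of the Galois connection of Section 2.1 on a small,
# hypothetical context K = (G, M, I); names G, M, I follow the text.
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g3", "b"), ("g3", "c")}

def ext_prime(A):
    """A' = {m in M | (g, m) in I for all g in A}."""
    return {m for m in M if all((g, m) in I for g in A)}

def int_prime(B):
    """B' = {g in G | (g, m) in I for all m in B}."""
    return {g for g in G if all((g, m) in I for m in B)}

def is_concept(A, B):
    """(A, B) is a concept iff A' = B and B' = A."""
    return ext_prime(A) == B and int_prime(B) == A

# The object concept of g is gamma(g) = ({g}'', {g}').
g = "g1"
gamma_g = (int_prime(ext_prime({g})), ext_prime({g}))
# gamma_g is a concept by construction, since (.)'' is a closure operator.
```

The closure test f(g(I)) = I of FCIM is the same computation with the roles of D and I playing those of G and M, as Table 1 indicates.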
The frequent closed itemset mining (FCIM) problem is to find all frequent and closed itemsets from D with respect to θ. It is easy to see that the two functions f and g work in the same manner as (·)′ in FCA⁴. We denote the set of all frequent closed itemsets by CF. Closed itemsets aim at reducing the number of all frequent itemsets enumerated in the naïve FIM problem, and many efficient algorithms have been studied [13, 17]. However, the number of enumerated itemsets is still large in applications. They are often redundant because some patterns subsume others, which causes the pattern redundancy problem: the set of enumerated patterns is too large and redundant to exploit in applications, even though closedness is considered. In applications, we often need only a small, easily interpretable subset that characterizes all the enumerated itemsets. We call this extraction problem pattern summarization (also known as pattern set mining [6, 15, 18]).

3 This set ↑c is also called the principal filter in partially ordered sets.
4 We prepare ids of transactions and the set D′ = {(i1, t1), ..., (iN, tN)}. By constructing I = {(ik, j) | 1 ≤ k ≤ N, j ∈ tk} from D′, the FCIM problem with θ = 1 is equivalent to computing all intents of concepts from K = ({i1, ..., iN}, I, I).

[Figure 1 shows two panels, each listing a table of triples (X, usage, code) next to the encoded database: (a) the standard code-table ST (left) and the cover of D by ST (right); (b) a code-table CT (left) and the cover of D by CT (right).]
Fig. 1: The cover of D = {125679, 2345, 12789, 179, 2379} (from Example 1).
2.2 Two-part MDL Principle and the Krimp Algorithm

Among various criteria for the selection, the two-part MDL principle [7] is often adopted for itemsets, where code-tables are well-studied models. Databases (or contexts) are encoded and represented by (binary) codes using code-tables.

Definition 1 (Code-tables [18]). A code-table CT consists of two columns, for itemsets and codes. Formally, CT = {(X1, c1), (X2, c2), ...} is a set of pairs, where Xi ⊆ I, ci ∈ C, and C is the set of (binary) codes.

For convenience, in the following we regard code-tables as sets of itemsets: for example, X ∈ CT means (X, cX) ∈ CT, and cX is denoted by code_CT(X). We define CT⁻ = {(X, cX) ∈ CT | |X| > 1}. The standard code-table ST consists of all singleton itemsets over I. Let us first introduce examples, where sets are written as sequences (e.g., {1, 7, 9} is written 179).

Example 1. Let I = {1, ..., 9} and D = {125679, 2345, 12789, 179, 2379}. In Figs. 1a and 1b, the left table is a code-table and the right one is the database represented by codes, where rectangles represent codes (e.g., the rectangle 7 means the code of the itemset 7). In Fig. 1a, the transaction 12789 is encoded item by item. In Fig. 1b (CT = ST ∪ {25, 179, 2345, 2379}), it is encoded as 179 ∪ 2 ∪ 8. The process of representing databases using code-tables is called the cover. Both frequent itemsets (e.g., 179) and non-frequent ones (e.g., 2345 or 2379) are used, and the MDL principle balances between them. The property of the cover of transactions is described by cover functions.

Definition 2 (Cover function⁵). Let CT be the set of all possible code-tables over I, D be a database, t ∈ D be a transaction, and CT ∈ CT be a code-table. We call the set of itemsets {X | (X, cX) ∈ CT} the coding set of CT, denoted by CS. A cover function on CT, cover : CT × 2^I → 2^CS, should satisfy: C1.
if X, Y ∈ cover(CT, t), then either X = Y or X ∩ Y = ∅, and
C2. the union of all X ∈ cover(CT, t) equals t, i.e., t = ⋃_{X ∈ cover(CT,t)} X.

5 This definition is modified from Definition 2 of [18] by using a coding set CS.

Code-tables in Fig. 1 contain an additional column, usage. Using coding methods from Kolmogorov complexity [9], codes are assigned according to how often itemsets are needed: for X ∈ CT, usage(X) is defined as |{t ∈ D | X ∈ cover(CT, t)}|, which determines the least length of the prefix codes required.

Example 2 (from Example 1). In Fig. 1a, cover(ST, 179) = {1, 7, 9}, and in Fig. 1b, cover(CT, 12789) = {179, 2, 8}. It holds that usage(179) = 3.

If a transaction t ∈ D is encoded by a cover function cover(·,·) and some CT, we say that t is covered. Then the MDL principle can be evaluated.

Definition 3 (Lemma 1, Definitions 4 and 5 in [18]). The two-part MDL principle says that a code-table CT ∈ CT is the best for D if it minimizes L(D, CT) = L(CT) + L(D | CT), where L(CT) is the length of the code-table CT, and L(D | CT) is that of D when CT is given. Following Definitions 1 and 2, the terms L(CT) and L(D | CT) are evaluated as follows:

L(CT) = Σ_{X ∈ CT, usage(X) ≠ 0} ( L(code_ST(X)) + |code_CT(X)| ),      (1)

L(D | CT) = Σ_{t ∈ D} Σ_{X ∈ cover(CT,t)} |code_CT(X)|,                (2)

where L(code_ST(X)) = Σ_{i ∈ X} |code_ST({i})|.

The Krimp algorithm [18] is an iterative algorithm that greedily updates a code-table CT ∈ CT by adding new frequent itemsets X ∈ F while evaluating L(D, CT). Please refer to Algorithm 3 in [18], which we follow in our experiments.

2.3 Problems

Models based on code-tables have been widely used in MDL-based studies. The cover function uses the set union to represent transactions and the subset operation to divide them, as seen in Example 1. The process assumes that items are all of the same granularity, but this is not the case in applications.
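To make Definitions 2 and 3 concrete, the following sketch (ours, not the Krimp implementation; see Algorithm 3 of [18] for the real one) covers the database of Example 1 greedily in the standard cover order and evaluates L(D | CT) of Equation (2), with prefix-code lengths approximated Shannon-style from usage:

```python
import math

# A minimal sketch of the cover of Definition 2 and Equation (2), on the
# database of Example 1; assumptions: greedy cover, -log2 code lengths.
D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
     {1, 7, 9}, {2, 3, 7, 9}]

def support(X):
    return sum(1 for t in D if X <= t)

def cover(CT, t):
    """Greedy cover: itemsets by |X| desc, support desc, lexicographic."""
    order = sorted(CT, key=lambda X: (-len(X), -support(X), sorted(X)))
    rest, used = set(t), []
    for X in order:
        if X <= rest:      # C1: chosen itemsets are mutually disjoint
            used.append(X)
            rest -= X
    assert not rest        # C2: loss-less, guaranteed by ST's singletons
    return used

ST = [{i} for i in set().union(*D)]
CT = ST + [{1, 7, 9}, {2, 5}, {2, 3, 4, 5}, {2, 3, 7, 9}]

usage = {}
for t in D:
    for X in cover(CT, t):
        usage[frozenset(X)] = usage.get(frozenset(X), 0) + 1

total_usage = sum(usage.values())
def code_len(X):           # |code_CT(X)| grows as usage(X) shrinks
    return -math.log2(usage[frozenset(X)] / total_usage)

# L(D | CT) of Equation (2):
L_D_given_CT = sum(code_len(X) for t in D for X in cover(CT, t))
```

Running this reproduces Example 2: the transaction 12789 is covered by {179, 2, 8} and usage(179) = 3, matching the table of Fig. 1b.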
It also prevents us from taking side information (i.e., background knowledge and resources) about databases into account. Our target problems are therefore listed below.

Problem 1. The cover function is difficult to generalize because it implicitly uses set operations.

Problem 2. Code-tables cannot reflect side information about items (e.g., hierarchies, or degrees of importance of items) because of the implicitly used set operations.

That is, we need to revise the set operations used in the existing method. Our key idea is to adopt the terminology of lattices to overcome these problems. In Section 3 we revise the cover function with lattices to build our new framework. We apply it to pattern concept lattices obtained from pattern structures in Section 4, to confirm that our lattice-based framework is applicable to various patterns.

3 Models by Edit Operations for Closed Itemsets

We provide our framework based on lattices. For the discussion, we define the object concept corresponding to t ∈ D by γt = (g(t), t), using the correspondence in Table 1. Here we only consider closed itemsets.

3.1 Cover Using Edit Operations on Lattices

In the cover of t ∈ D, t is recursively replaced with t \ B1, t \ (B1 ∪ B2), ... by choosing mutually disjoint closed itemsets B1, B2, ... until t becomes ∅, in some order⁶. We now have the following.

Observation 1 The cover of t ∈ D can be represented as a sequence of deletions of items: t ↦ t \ B1, t \ B1 ↦ t \ (B1 ∪ B2), ..., t \ (B1 ∪ ... ∪ Bm) ↦ ∅.

Thus a transaction t ∈ D can also be covered using deletions of items. We use such deletions to construct a new cover based on lattices. On lattices, the selection of each Bi corresponds to selecting a concept from B, and their intents can be stored in code-tables. With respect to the search space of the selection, the following holds by properties of lattices.
Observation 2 For a transaction t ∈ D, only the sub-lattice corresponding to the upset ↑γt of the object concept γt of t needs to be considered.

Choosing concepts alone, however, is not complete for the loss-less property (i.e., C2 of Definition 2), because in general B does not contain all singleton itemsets as intents of concepts. In contrast, by the discussion of closed itemsets [12], all frequent itemsets can be recovered from closed itemsets, so lattices should contain all the information required. Our idea is to extract such hidden information on lattices as edit operations on lattices (e.g., the deletions of items in Observation 1), and to represent t ∈ D by both the selections and the edit operations. We first introduce edit information.

Definition 4 (Edit Information). For c = (A, B) ∈ B, the edit information, denoted Edit(A, B), is defined as Edit(A, B) ≡ B \ ⋃_{(C,D) ∈ ⇑(A,B)} D. For a set C of concepts, it is extended by Edit(C) = ⋃_{(A,B) ∈ C} Edit(A, B).

For c ∈ B, Edit(c) indicates the items which cannot be represented by intents of concepts other than c itself, even when all concepts in ↑c \ {c} are chosen.

Example 3 (from Example 2). For t3 = 12789, it holds that γ3 = (3, 12789) and Edit(γ3) = {8}. The item 8 cannot be represented by any concept except γ3. Therefore, if γ3 is not selected, the item 8 must be represented in another way.
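Definition 4 can be computed directly from the lattice. The sketch below (ours; brute-force concept enumeration is used only because Example 1 is tiny) builds the concept lattice of the running example and computes Edit(c) as the intent minus the intents of the direct upper covers ⇑c:

```python
from itertools import combinations

# A sketch of Definition 4 on the context of Example 1: transactions are
# objects i1..i5, items are attributes (cf. footnote 4).
D = {1: {1, 2, 5, 6, 7, 9}, 2: {2, 3, 4, 5}, 3: {1, 2, 7, 8, 9},
     4: {1, 7, 9}, 5: {2, 3, 7, 9}}

def f(T):   # common items of the transactions in T
    return set.intersection(*(D[t] for t in T)) if T else set().union(*D.values())

def g(I):   # transactions containing every item of I
    return {t for t in D if I <= D[t]}

# all concepts (extent, intent), enumerated brute-force via closures
lattice = {(frozenset(g(f(set(A)))), frozenset(f(set(A))))
           for r in range(len(D) + 1) for A in combinations(D, r)}

def direct_upset(c):
    """The direct upset: minimal concepts strictly above c."""
    ups = [d for d in lattice if c[0] < d[0]]
    return [d for d in ups if not any(e[0] < d[0] for e in ups)]

def edit_info(c):
    """Edit(A, B) = B minus the union of intents of the direct upset."""
    return set(c[1]) - set().union(*[set(d[1]) for d in direct_upset(c)])

gamma3 = (frozenset({3}), frozenset({1, 2, 7, 8, 9}))
# Example 3: Edit(gamma3) = {8}; only gamma3's intent contains item 8.
```

Here γ3's only upper cover is the concept (13, 1279), so Edit(γ3) = 12789 \ 1279 = {8}, as in Example 3 and Fig. 2.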
6 The standard cover order is the basic one from the Krimp algorithm: for an itemset B ⊆ I, the standard cover order orders by cardinality |B| descending, then support freq(B) descending, then lexicographically ascending [18].

[Figure 2 shows the concept lattice of Example 1 annotated with edit information, e.g., Edit(1235, 2) = {2}, Edit(1345, 79) = {7, 9}, Edit(134, 179) = {1}, and Edit(γ3) = {8}; panel (a) covers t3 = 12789 with ST and panel (b) with CT.]
Fig. 2: Examples of the cover with edit operations for γ3 = (3, 12789) (i.e., t3).

As another way, edit operations among intents of concepts are introduced to represent such information. Because B contains the top ⊤ = (G, ∅)⁷ and Int(γt) = t, all information for covering t can be found in ↑γt. We use edit information to capture this and to define edit operations on lattices (i.e., on intents of concepts). We encode edit information by tuples called edit scripts, and store them in edit-tables, by analogy with code-tables.

Definition 5 (Edit scripts and Edit-tables). An edit-table ET consists of three columns: the kinds of operations, the arguments of operations (itemsets), and codes. That is, ET = {(t1, X1, c1), ..., (tm, Xm, cm)}, where ti indicates the type of operation, and Xi and ci are as in code-tables. A tuple (t, X, cX) is called an edit script. Let E be the set of all edit scripts.

Example 4 (from Example 3). Because Edit(γ3) = {8}, the item 8 should be represented by the deletion of the item 8. Let DEL represent the deletion⁸. A script (DEL, 8, c8) with a code c8 means the deletion of the item 8.

Definition 5 is general and abstract. For closed itemsets, however, we only need a specific kind of edit script.
We define a function Encode : 2^I → 2^E for X ⊆ I by Encode(X) = {(DEL, i, ci) | i ∈ X, ci ∈ C}, where C is the set of codes.

Lemma 1. It holds that Edit(↑γt) = t for any t ∈ D.

Proof. If ↑γt = {γt, ⊤}, it holds that Edit(↑γt) = Edit(γt) = t. Then Encode(·) generates edit scripts for all items in t as deletions of an item. By induction on the lattice (via ⇑c), we only need to confirm that, for c ∈ B, all items which cannot be covered by intents of concepts in ⇑c \ {c} can be represented by edit scripts. This follows from the definition of Edit(c) and the properties of ⟨B(K), ⪯⟩. ⊓⊔

Lemma 1 guarantees that any t ⊆ I can be represented using edit scripts when no concepts are chosen. Now the cover using lattices can be described by a function cover_L(·,·), by analogy with cover(·,·). The following describes how to implement it, rather than giving a new formal description of the function.

7 If B ≠ ∅ for the top ⊤ = (G, B) of ⟨B(K), ⪯⟩, we can insert a dummy (G, ∅).
8 The type is required to introduce several types of edit operations if needed.

Definition 6 (Revised Cover). The function cover_L : CT × 2^I → 2^{2^I} × 2^E receives a code-table CT containing intents of selected concepts and a transaction t, and it returns a set of itemsets used (e.g., 179) and a set of edit scripts. For a code-table CT (see Fig. 2b), we first choose intents of concepts from CT in some order, and denote by C the set of selected concepts, whose upsets are excluded from the search space ↑γt. The remainder is represented by edit scripts. Then cover_L(CT, t) = ({Int(c) | c ∈ C}, Encode(Edit(↑γt \ ↑C))).

Example 5 (from Example 3). It holds that cover_L(CT, t3) = ({179}, {(DEL, 2, c2), (DEL, 8, c8)}), representing the cover of t3 in Fig. 2b. It holds that cover_L(ST, t3) = (∅, {(DEL, 1, c1), (DEL, 2, c2), (DEL, 7, c7), (DEL, 8, c8), (DEL, 9, c9)}), which simulates the cover of t3 by singleton itemsets (Fig. 2a).
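Definition 6 can be illustrated end to end on Example 1. The following self-contained sketch (our own reading of the definition, with hypothetical helper names) selects concepts whose intents appear in CT, removes their upsets from ↑γt, and turns the remaining edit information into DEL scripts:

```python
from itertools import combinations

# A sketch of cover_L of Definition 6 on Example 1 (not the paper's code).
D = {1: {1, 2, 5, 6, 7, 9}, 2: {2, 3, 4, 5}, 3: {1, 2, 7, 8, 9},
     4: {1, 7, 9}, 5: {2, 3, 7, 9}}

def f(T):
    return set.intersection(*(D[t] for t in T)) if T else set().union(*D.values())

def g(I):
    return {t for t in D if I <= D[t]}

lattice = {(frozenset(g(f(set(A)))), frozenset(f(set(A))))
           for r in range(len(D) + 1) for A in combinations(D, r)}

def upset(c):
    return {d for d in lattice if c[0] <= d[0]}

def edit_info(c):
    ups = [d for d in upset(c) if d[0] > c[0]]
    covers = [d for d in ups if not any(e[0] < d[0] for e in ups)]
    return set(c[1]) - set().union(*[set(d[1]) for d in covers])

def cover_L(CT_intents, tid):
    """Returns ({Int(c) | c in C}, Encode(Edit(upset(gamma_t) \ upset(C))))."""
    gamma_t = (frozenset(g(D[tid])), frozenset(D[tid]))
    selected = {c for c in upset(gamma_t) if set(c[1]) in CT_intents}
    removed = set().union(*[upset(c) for c in selected])
    space = upset(gamma_t) - removed
    scripts = {("DEL", i) for c in space for i in edit_info(c)}
    return {frozenset(c[1]) for c in selected}, scripts
```

With CT = {179} and t3 = 12789 this reproduces Example 5 ({179} plus deletions of 2 and 8), and with an empty selection it emits deletions for every item of t3, illustrating Lemma 1.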
If the set C in Definition 6 is the empty set ∅, the function cover_L(·,·) achieves the same result as the function cover(·,·) with ST, supported by Lemma 1 (see Example 5). If some concepts are selected, the following lemma is important.

Lemma 2. Let C be the set in Definition 6 and (X, Y) = cover_L(CT, t). The set Y is complete to represent t \ ⋃_{X ∈ X} X.

Proof. From Definition 6, if a concept c = (A, B) in ↑γt is selected and stored in CT (i.e., c ∈ C), no item i ∈ B is represented by edit scripts in Y, because there exists no concept d ∈ ↑γt \ ↑C satisfying i ∈ Edit(d). It holds that {i | i ∈ Edit(↑γt \ ↑C)} = t \ ⋃_{X ∈ X} X, and the function Encode(·) ensures the lemma, i.e., the cover by cover_L(·,·) is valid when some concepts are chosen. ⊓⊔

Together with the completeness in Lemma 1, Lemma 2 shows that our edit scripts are enough to represent any t ∈ D when selecting the concepts in C.

3.2 Model Evaluation Using Edit Operations

For our model (CT, ET), with reference to Section 2.2, the MDL principle L(D, CT, ET) = L(CT, ET) + L(D | CT, ET) can be evaluated. For L(CT, ET),

L(CT, ET) = Σ_{(X,cX) ∈ CT, usage(X) ≠ 0} ( L(code_ST(X)) + |cX| )
          + Σ_{(t,X,cX) ∈ ET, usage(X) ≠ 0} ( L(code_OP(t)) + L(code_ST(X)) + |cX| ),   (3)

where code_OP(t) represents the codes used for distinguishing the types of edit scripts. For itemsets, L(code_OP(t)) = 0. The first term of Equation 3 is the same as in the Krimp algorithm, and the second term encodes the edit-table ET. The difficulty of representing databases, L(D | CT, ET), is evaluated by naturally generalizing the corresponding term of the Krimp algorithm:

L(D | CT, ET) = Σ_{t ∈ D} ( Σ_{(X,cX) ∈ X} |cX| + Σ_{(t,X,cX) ∈ Y} |cX| )
                for (X, Y) = cover_L(CT, t).   (4)

We illustrate an example in Fig. 3 to show how edit-tables intuitively work.

[Figure 3 shows a code-table T (= T⁻ ∪ ST) obtained by the Krimp algorithm, listing (X, usage, code) rows for 2345, 2379, 179, 25, 2, 6, and 8, and our model of two tables (CT, ET), where CT holds the rows for 2345, 2379, 179, and 25, and ET holds edit scripts (DEL, 2), (DEL, 6), and (DEL, 8) with their usages and codes.]
Fig. 3: A code-table T by the Krimp algorithm and our two tables (CT, ET). Our table CT is equivalent to T⁻ and ET simulates ST.

3.3 The Equivalency of Models for Closed Itemsets

For closed itemsets, we now have the following proposition.

Proposition 1 Let T be a code-table obtained by the Krimp algorithm for the database D with some order on CF. With the same order, our model can achieve a pair (CT, ET) satisfying L(D, T) = L(D, CT, ET) for closed itemsets.

Proof (Sketch). At the beginning, it holds that T = ST (i.e., T⁻ = ∅) and CT = ∅, where all transactions are covered by singleton itemsets. From the definition of Encode(·) and Lemma 1, the proposition holds in this case. If a closed itemset B ∈ CF is selected and stored as B ∈ T, there exists a concept (A, B) ∈ B, and its intent B is stored in our table CT. Because we use the same order in the cover of t ∈ D, for (X, Y) = cover_L(CT, t) it holds that X = cover(T, t), meaning that the usages in T⁻ and CT must be identical. The usages of the edit scripts must be the same as well, by the fact that X = cover(T, t), Lemmata 1 and 2, and the definition of Encode(·). ⊓⊔

Summary Our idea is to revise the cover of the Krimp algorithm with lattices. Our framework divides code-tables into a part holding intents of concepts (i.e., closed itemsets) and a part holding edit scripts that simulate singleton itemsets. For closed itemsets, our model achieves the same result as the Krimp algorithm (Proposition 1), which is supported by properties of lattices. The advantage of our lattice-based model is that lattices help us to generalize the MDL-based evaluation to various patterns. For example, pattern concept
For example, pattern concept 26 Keisuke Otaki and Akihiro Yamamoto lattices [4] enable us to address the problem for several patterns with the MDL principle. The key concept for the generalization is to introduce edit operations on lattices according to patterns, encode them as edit scripts in edit-tables, where lattices become the search space in our framework. Lattices help us to consider the completeness of edit operations as well. 4 A Generalized Model via PSA We apply our framework to pattern concept lattices. As an example of other lat- tices, we consider GenSets, a generalized version of itemset using pattern struc- tures, which are introduced to analyze non-binary data [4]. We call methods using them for Knowledge Discovery Pattern Structure Analysis (PSA) below. 4.1 Pattern Structures and Generalized Sets Pattern structures extend FCA using meet semi-lattices; a meet semi-lattice (D, u) is a pair of a set D of descriptions representing fragments of objects as their features, and a meet operator u between descriptions satisfying the associativity, the commutativity, and the idempotency. A partial order v of descriptions is induced by u for x, y ∈ D as x v y whenever x u y = x. A pattern structure P is a triple P = (G, (D, u), δ), where G is a set of objects, (D, u) is a meet semi-lattice, and a mapping δ : G → D gives a descriptions for each object. In PSA we use the following Galois connection (·) : A = ug∈A δ(g) for A ⊆ G and d = {g ∈ G | d v δ(g)} for d ∈ D. A pair (A, d) ∈ 2G × D is a pattern concept if and only if A = d and d = A. A partial order among pattern concepts is defined as well, and pattern concept lattices can be constructed in a similar manner of FCA. Recent applications of PSA and lattices constructed by PSA can be seen in [3, 11]. We consider a slight modification of FCA by giving additional information to itemsets from the following scenario. Example 6 (from Example 1). Let us consider t4 = 179 of id i4 and t5 = 2379 of id i5 . 
We have {i4, i5}□ = 79, computed as δ(i4) ⊓ δ(i5) = 179 ⊓ 2379 = 79. The remainders for each transaction are 1 = 179 \ 79 and 23 = 2379 \ 79, respectively. In this case, the existence of these items (i.e., 1 in t4 and 2, 3 in t5) is ignored, although they are useful for the representation of itemsets.

From this example, using a meet semi-lattice, we introduce auxiliary numbers representing the existence of items, i.e., the number of items that may additionally be included in common, to enhance the expressiveness of itemsets.

Definition 7 (GenSets). We define D = 2^I × N. For an itemset B ⊆ I, we encode it as the pair (B, 0) ∈ D. The meet operator ⊓ is defined as (B1, n1) ⊓ (B2, n2) = (B1 ∩ B2, min(n1 + |B1 \ (B1 ∩ B2)|, n2 + |B2 \ (B1 ∩ B2)|)).

We call itemsets with these additional numbers GenSets (generalized sets of items)⁹.

Example 7. For the two itemsets t4 = 179 and t5 = 2379, we encode them as d1 = (179, 0) and d2 = (2379, 0), respectively. Then d = d1 ⊓ d2 is computed as d = (B, n) = (79, 1), where n = 1 means that the itemsets t4 and t5 possibly have one further item in common. In other words, the GenSet (79, 1) can be regarded as {7, 9, ?}, with a wildcard symbol ? representing any item in I.

We can interpret GenSets as a richer representation, and conjecture that they are helpful for capturing characteristic features of itemsets. In other words, we now take care of both the extent and the intent of concepts in evaluating the MDL principle. However, the Krimp algorithm cannot deal with GenSets directly, because the symbol ? is not in I and has a completely different meaning. Therefore the ordinal cover is not suited, and a new formulation would be required from scratch. In contrast, our lattice-based model is applicable: we can evaluate the two-part MDL principle for GenSets by only revising the cover function cover_L(·,·) with edit operations involving ?.
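The meet operator of Definition 7 is a one-liner to implement. The following sketch (function name ours, not the paper's) reproduces Example 7:

```python
# A small sketch of the GenSet meet operator of Definition 7.
def genset_meet(d1, d2):
    (B1, n1), (B2, n2) = d1, d2
    B = B1 & B2
    # each side may contribute its lost items (plus its own slack) as
    # wildcard slots; take the smaller of the two counts
    n = min(n1 + len(B1 - B), n2 + len(B2 - B))
    return (B, n)

# Example 7: (179, 0) meet (2379, 0) = (79, 1), i.e. {7, 9, ?}.
d = genset_meet(({1, 7, 9}, 0), ({2, 3, 7, 9}, 0))
```

Note that the operator is idempotent ((B, 0) ⊓ (B, 0) = (B, 0)) and that iterating it over three sets agrees with the three-set formula of footnote 9.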
4.2 The Two-part MDL Principle for GenSets

We can evaluate the two-part MDL principle only by revising the function cover_L(·,·) using edit operations based on pattern concept lattices.

Models for GenSets For a GenSet (B, n) ∈ D, we need to encode n to store GenSets in a code-table, which is possible by adding a column storing the code for n. There exist several codes for numbers (e.g., arithmetic codes). We choose a simple one, which requires − log2(1/P) bits, where P is the number of possible candidates; n is bounded by the minimum size of the transactions in D. Assuming this code, we set P = |I|, the worst-case length of such a representation.

The Cover with GenSets For d = (B, n) ∈ D, if n = 0 then d is the itemset B. We focus on how to define the cover with GenSets when n > 0.

Example 8. Let d = (B, n) = (79, 1) and t3 = 12789. We first cover a part of t3 by the itemset B and get 128 = t3 \ B. Additionally, we rewrite the symbol ? (n = 1) by one of {1, 2, 8}. The item 8 is chosen, and code_ST(8) is used for this choice, because the code of 8 in ST is longer than those of 1 and 2. See also Fig. 4 for the cover of t3 using the GenSet d = (79, 1) ∈ D.

As the symbol ? subsumes any item in I, we need to evaluate such subsumptions. To implement this, we assume that we would like to find general, common features which characterize many transactions, by excluding non-frequent items as long as possible⁹. We implement the subsumption of ? by replacing the symbol ? with the item i ∈ I having the longest code_ST(i). Instead of giving a formal description, we give sketches of our idea below.

9 For three sets X, Y, Z ∈ 2^I we want to compute n = min(|X| − |XYZ|, |Y| − |XYZ|, |Z| − |XYZ|), where XYZ = X ∩ Y ∩ Z; this value n can be computed by counting and set operations.

[Figure 4 shows the pattern concept lattice for Example 1 with GenSet intents such as (12345, ({}, 3)), (1345, (79, 1)), (134, (179, 0)), and (3, (12789, 0)); the selected GenSet (79, 1) is traversed to cover one additional item of t3, namely the item 8 with the longest code.]
Fig. 4: For t3 = 12789 with CT = {(79, 1)}, where (79, 1) is a selected GenSet. An item can be covered by the GenSet (79, 1), and we need to find it among the items other than 79.

Sketch 1 We prepare a type SUB and construct edit scripts of the form (SUB, i, ci). As we then have two types (DEL and SUB), the term L(code_OP(t)) in Equation 3 is used. The function Encode(·) should be revised to traverse for an item i having the longest code. For the example in Fig. 4, cover_L({(79, 1)}, 12789) should now work like ({(79, 1)}, {(DEL, 1, c1), (DEL, 2, c2), (SUB, 8, c8)}) after traversing to i = 8.

Sketch 2 For d = (B, n), if n > 0 (i.e., the code of n shows a non-zero value), we use the code code_ST({i}) to represent the item i with the longest code in ST, without edit scripts. This is done by extending the function cover_L(·,·) to a function like cover_L^new({(79, 1)}, 12789) = ({(79, 1)}, {(DEL, 1, c1), (DEL, 2, c2)}, {8}), for example, where {8} can be evaluated immediately using the ST computed from D.

In dealing with pattern concept lattices, many ways are applicable to fulfill the loss-less property. Because both sketches are based on our edit-tables for itemsets in Section 3 and adopt prefix codes, they achieve the loss-less property as long as they have codes for the items subsumed by ?. As they do have such codes, the loss-less property holds for GenSets. In the experiments we use Sketch 2.

Corollary 1 (The Loss-Less Property of GenSets) The loss-less property immediately holds for GenSets from the discussions above.

We stress that lattices are compact representations of objects and their features, and that they help us prepare the edit operations (i.e., edit scripts) and the cover function. Furthermore, we can introduce more labels in the subsumptions of ?, representing various kinds of valuable side information to be taken into account in pattern mining.
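Sketch 2 can be illustrated with a few lines of code. The following sketch (ours; a hypothetical reading of Sketch 2, with Shannon-style ST code lengths) covers a transaction with a selected GenSet (B, n), filling the n wildcard slots with the items having the longest codes in ST, i.e., the least frequent items:

```python
import math

# A sketch of the cover with a GenSet (B, n) as in Sketch 2, on the
# database of Example 1; names and code lengths are our assumptions.
D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
     {1, 7, 9}, {2, 3, 7, 9}]
items = sorted(set().union(*D))
freq = {i: sum(1 for t in D if i in t) for i in items}
total = sum(freq.values())
st_code_len = {i: -math.log2(freq[i] / total) for i in items}

def cover_with_genset(genset, t):
    B, n = genset
    rest = set(t) - B
    # the n wildcard slots subsume the items of longest ST code
    wild = sorted(rest, key=lambda i: st_code_len[i], reverse=True)[:n]
    scripts = {("DEL", i) for i in rest - set(wild)}
    return B, scripts, set(wild)

B, scripts, wild = cover_with_genset(({7, 9}, 1), {1, 2, 7, 8, 9})
```

For t3 = 12789 and the GenSet (79, 1) this reproduces Example 8: the remainder is 128, the wildcard absorbs the rarest item 8, and the items 1 and 2 become deletion scripts, matching cover_L^new of Sketch 2.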
In addition, we expect that GenSets can be used not only for the symbol ? but also for representing hierarchies by stacking GenSets as trees.

Table 2: Benchmark datasets represented in the form of contexts (G, M, I). Recall that CF is the set of all closed itemsets with θ = 1.

          objects |G|  attributes |M|  |I|/(|G||M|)    |CF|
  iris            150              19         0.263     163
  breast          699              16         0.624     641
  pima            768              38         0.237   3,204
  wine            178              68         0.206  14,555

5 Experiments

From the UCI Machine Learning repository [10], we use iris, breast, pima, and wine, which are also used as benchmark datasets for the Krimp algorithm [18]. All of these datasets are available from the distribution package of the Krimp algorithm (URLs are given in footnotes 10-14). Their statistics are summarized in Table 2. All experiments were run on a Mac OS X 10.10.2 machine with 2 × 2.26 GHz quad-core Xeon processors and 64 GB of main memory. All code was written in Python 2.7.9. Our algorithm was the same as Algorithm 3 in [18], which iteratively and greedily updates a code-table, except for the MDL evaluation part; in our framework we replaced that part with our implementation of Equations 3 and 4 based on lattices and edit operations. We did not study the computational complexity of our framework in depth. Rather than discussing run-times alone, we examined the following:

– Comparison of the two models: For itemsets, the results of the two models coincide by their equivalence (Proposition 1). We therefore investigate how much computational time our new model adds.
– We analyze the results with GenSets to check the applicability of our lattice-based model to other patterns.

Let CT be the code-table obtained after applying the Krimp algorithm. For a database D we can compute both L(D, ST) and L(D, CT).
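For intuition, the baseline length L(D, ST) can be sketched as follows, assuming the standard code-table setting of Krimp [18] in which each item receives a Shannon-optimal code derived from its usage in D. The function names are ours, and this simplified sketch omits the length of the code-table itself:

```python
from collections import Counter
from math import log2

def standard_code_lengths(database):
    """Standard code-table ST: each item's code length is -log2 of its
    relative usage over the whole database (Shannon-optimal lengths,
    following the code-table setting of Krimp [18])."""
    usage = Counter(i for t in database for i in t)
    total = sum(usage.values())
    return {i: -log2(c / total) for i, c in usage.items()}

def encoded_db_length(database, code_len):
    """L(D | ST)-style length: encode every transaction item by item.
    The length of the code-table itself is omitted in this sketch."""
    return sum(code_len[i] for t in database for i in t)

# A toy database: the rare item 8 gets a longer code than the frequent
# item 2, which is why the '?' of a GenSet prefers rare items.
D = [{1, 2, 7, 8, 9}, {7, 9}, {1, 2}, {2, 7, 9}]
st = standard_code_lengths(D)
baseline = encoded_db_length(D, st)   # plays the role of L(D, ST) in Ratio
```

Replacing ST by a learned code-table CT in the same computation yields the numerator L(D, CT) of the Ratio measure used below.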
We then measured the ratio L(D, CT)/L(D, ST) (used in [18]), which can be regarded as the degree of compression achieved by CT, denoted Ratio in the results. In addition, we measured the size |CT−| because it indicates how many concepts are chosen out of the whole set of enumerated patterns. Furthermore, in the experiments with GenSets, we measured the Jaccard index J(X, Y) ≡ |X ∩ Y|/|X ∪ Y| for two sets X and Y, between the set of identifiers of closed itemsets CF and that of GenSets, to see the difference between the results by itemsets and those by GenSets. Note that in our setting the set of pattern concepts by GenSets can be regarded as a set of closed itemsets if we ignore the additional numbers n in (B, n) ∈ D.

10 https://archive.ics.uci.edu/ml/datasets/Iris
11 https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
12 https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
13 https://archive.ics.uci.edu/ml/datasets/Wine
14 http://www.patternsthatmatter.org/software.php, confirmed Feb. 2015.

Table 3: Results for itemsets by code-tables (the Krimp algorithm) and two tables (concepts and edit-tables in our framework).

                                     Krimp                         I-Edit
            |CF|  |CT−|  Ratio  Pre-proc. [s]  Run-time [s]  Pre-proc. [s]  Run-time [s]
  iris       163     13  0.456          0.025         0.234          0.021         0.520
  breast     641     29  0.181          0.580         6.263          0.573        26.794
  pima     3,204     66  0.356         18.202        47.525         19.107       373.63
  wine    14,555     72  0.757        442.686       393.539        444.842     1354.570

In the results, Krimp labels results of the Krimp algorithm, I-Edit labels results of our framework using lattices and edit operations, and Gen-Edit labels results of GenSets using pattern concept lattices.

5.1 Results and Discussions for Itemsets

Table 3 shows the results of the experiments for itemsets based on the existing method Krimp and our framework I-Edit, where we focus on run-times only. Note that we, of course, confirmed that the results of the two methods are equivalent.
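Both evaluation measures are straightforward to compute; a minimal sketch (function names are ours):

```python
def compression_ratio(l_ct, l_st):
    """Ratio L(D, CT) / L(D, ST) as used in [18]: smaller values mean
    the learned code-table CT compresses D better relative to ST."""
    return l_ct / l_st

def jaccard(X, Y):
    """Jaccard index J(X, Y) = |X & Y| / |X | Y|, used to compare the
    identifiers of concepts selected via itemsets with those selected
    via GenSets (J(empty, empty) is taken to be 1)."""
    X, Y = set(X), set(Y)
    return len(X & Y) / len(X | Y) if X | Y else 1.0

# Two selections sharing 2 of 4 distinct identifiers:
print(jaccard({1, 2, 3}, {2, 3, 4}))    # -> 0.5
print(compression_ratio(50.0, 100.0))   # -> 0.5
```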
With respect to the time for pre-processing, there were no large differences. Concerning the time for the iterative model-updating process, roughly speaking our new model is at most 10 times slower than code-tables. Because neither model is implemented in a sophisticated manner, we conjecture that our implementation cannot exploit the pre-processed information on lattices well; that is, computing edit-tables using lattices is time-consuming in its current form. Because the edit-operation part clearly depends on the implementations of lattices and the data structures for concepts, we expect that advanced data structures would help us overcome this bottleneck.

5.2 Results and Discussions for GenSets

We conducted the same experiments on the same datasets with GenSets (Section 4.1) and our MDL evaluation (Section 4.2). The results are summarized in Table 4. The numbers of selected concepts via GenSets (i.e., 12, 24, 62, and 40) are lower than those via itemsets (i.e., 13, 29, 66, and 72), which means that the models of the databases obtained via GenSets are more compact than those via itemsets. The same trend can be confirmed in the Ratio column. The run-times are also faster than those in Table 3. Because the numbers of concepts (itemsets) and pattern concepts (GenSets) are now the same, these differences should arise from the differences in the cover and the MDL evaluation. In addition, the contents of the selected patterns should be affected. We can see this difference in the contents of the selection by checking the Jaccard index: our GenSets formulation indeed generates substantially different results compared with those for itemsets.

Table 4: Results by our lattice-based framework for GenSets.

                                  Gen-Edit
            |CF|  |CT−|  Ratio  Pre-proc. [s]  Run-time [s]  Jaccard Index
  iris       163     12  0.428          0.009         0.550          0.785
  breast     641     24  0.168          0.351        20.666          0.293
  pima     3,204     62  0.342         18.92        285.676          0.123
  wine    14,555     40  0.686        415.376       454.631          0.131

Furthermore, the presence of the stars ? in GenSets affects the results. For GenSets (Section 4.2), we aggressively cover items having longer codes in ST whenever the symbol ? appears. As less frequent items have longer codes in the table ST and such items are covered by ? in advance, the concepts stored in our model have relatively high frequency in the database, and the results become more compact.

Concluding Remarks on Experiments, and Future Directions. We conducted MDL-based pattern summarization on selected benchmark datasets using code-tables and our model (two tables). We also applied our model to GenSets, to which the same framework applies. At the least, we confirmed that our model obtains more compact results using GenSets (checked by Ratio), and that the contents of the selection also differ (checked by the Jaccard index). The quality of the choice of concepts via the two-part MDL principle always remains open to discussion. To discuss this quality, the previous work [18] developed a classifier using code-tables, tested it on labeled data, and discussed the quality of its choice. Following their strategy and developing a classifier using edit operations and lattices is therefore future work, in order to weigh the pros and cons of our model and the quality of its choice. Further discussion of the relation between the run-times for edit-tables and the properties of lattices (e.g., the density of contexts) is also important. Because the order of checking the set CF is optional, developing and analyzing further heuristics based on lattices is also important. We expect that several notions for lattices not used in this paper, and related methods, would be helpful in this direction.
6 Conclusions and Future Work

In this paper we propose a new framework for pattern summarization that conquers the redundancy problem for various patterns by combining lattices and edit operations, evaluated by the two-part MDL principle. Our experiments show that our model successfully extends an existing model towards more general patterns. As an example, we confirm that our model is applicable to GenSets, defined via pattern structures as a richer representation of itemsets. For closed itemsets, our model achieves the same results as the existing model (Proposition 1). As the Ratio results in our experiments show, GenSets are useful for obtaining more compact representations of databases compared with the existing model.

In future work, we plan to develop a classifier based on the MDL principle, following the existing work, to discuss the quality of the selection. We will also investigate encoding methods and heuristics for the cover towards various patterns, together with side information. Theoretical analysis of the two-part MDL evaluation via lattices is also an important future research direction.

Acknowledgements. The authors are grateful to the anonymous reviewers for valuable comments. This study was partially supported by a Grant-in-Aid for JSPS Fellows (26-4555) and JSPS KAKENHI Grant Number 26280085.

References

1. Aggarwal, C.C., Han, J. (eds.): Frequent Pattern Mining. Springer International Publishing (2014)
2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proc. of the 20th VLDB, pp. 487–499 (1994)
3. Buzmakov, A., Egho, E., Jay, N., Kuznetsov, S.O., Napoli, A., Raïssi, C.: The representation of sequential patterns and their projections within formal concept analysis. In: Workshop Notes for LML (ECML/PKDD 2013) (2013)
4. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: Proc. of the 9th ICCS, pp. 129–142 (2001)
5.
Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer (1999)
6. Geerts, F., Goethals, B., Mielikäinen, T.: Tiling databases. In: Proc. of the 7th DS, pp. 278–289 (2004)
7. Grünwald, P.D.: The Minimum Description Length Principle. The MIT Press (2007)
8. Koutra, D., Kang, U., Vreeken, J., Faloutsos, C.: VOG: Summarizing and understanding large graphs. In: Proc. of the 22nd SDM, pp. 91–99 (2014)
9. Li, M., Vitányi, P.M.: An Introduction to Kolmogorov Complexity and Its Applications, 3rd edn. Springer (2008)
10. Lichman, M.: UCI Machine Learning Repository (2013), http://archive.ics.uci.edu/ml
11. Otaki, K., Ikeda, M., Yamamoto, A.: Pattern structures for understanding episode patterns. In: Proc. of the 11th CLA, pp. 47–58 (2014)
12. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. In: Proc. of the 7th ICDT, pp. 398–416 (1999)
13. Pei, J., Han, J., Mao, R.: CLOSET: An efficient algorithm for mining frequent closed itemsets. In: Proc. of DMKD 2000, pp. 21–30 (2000)
14. Smets, K., Vreeken, J.: Slim: Directly mining descriptive patterns. In: Proc. of the 20th SDM, pp. 236–247 (2012)
15. Tatti, N., Vreeken, J.: Discovering descriptive tile trees. In: Proc. of ECML/PKDD 2012, pp. 24–28 (2012)
16. Tatti, N., Vreeken, J.: The long and the short of it: Summarising event sequences with serial episodes. In: Proc. of the 18th KDD, pp. 462–470 (2012)
17. Uno, T., Asai, T., Uchida, Y., Arimura, H.: LCM: An efficient algorithm for enumerating frequent closed item sets. In: Proc. of FIMI'03 (2003)
18. Vreeken, J., van Leeuwen, M., Siebes, A.: Krimp: Mining itemsets that compress. Data Mining and Knowledge Discovery 23(1), 169–214 (2011)