Concept Stability as a Tool for Pattern Selection Aleksey Buzmakov1,2 , Sergei O. Kuznetsov2 , and Amedeo Napoli1 1 LORIA (CNRS – Inria NGE – U. de Lorraine), Vandœuvre-lès-Nancy, France 2 National Research University Higher School of Economics, Moscow, Russia aleksey.buzmakov@inria.fr, amedeo.napoli@loria.fr, skuznetsov@hse.ru Abstract. Data mining aims at finding interesting patterns from datasets, where “interesting” means reflecting intrinsic dependencies in the do- main of interest rather than just in the dataset. Concept stability is a popular relevancy measure in FCA but its behaviour have never been studied on various datasets. In this paper we propose an approach to study this behaviour. Our approach is based on a comparison of stabil- ity computation on datasets produced by the same general population. Experimental results of this paper show that high stability of a con- cept in one dataset suggests that concepts with the same intent in other dataset drawn from the population have also high stability. Moreover, experiments shows some asymptotic behaviour of stability in such kind of experiments when dataset size increases. Keywords: formal concept analysis, stability, pattern selection, exper- iments 1 Introduction In data mining, many usefulness measures of patterns are introduced. For ex- ample, more than 30 statistical methods are enumerated and discussed in [1]. Such a high number of different approaches to pattern selection emphasizes the importance of the problem. In this paper we would like to focus on a measure which is introduced within Formal Concept Analysis (FCA). FCA is a mathe- matical formalism having many applications in data analysis [2]. Starting from the set of objects and the corresponding sets of attributes FCA tends to general- ize the descriptions for any set of objects. Although this approach is less efficient than the statistical methods it is still feasible and ensures that no potentially interesting pattern is missed. Within FCA there are several approaches for pattern selection. Two disjoint approaches can be distinguished. The first one is to introduce background knowl- edge into the procedure computing concepts [3–5]. These approaches allow one to find patterns which are likely to be useful for the current task. Although the number of resulting patterns can be significantly reduced, they are still numer- ous. The second approach can be applied in a composition with the first ones, ranking the resulting patterns w.r.t. a relevance measure. The authors of [6] provide several measures for ranking concepts that stem from the algorithms possibly underlying human behavior. Stability is another ({g1 , g2 , g3 , g4 , g5 } ; ∗)[0.47] m1 m2 m3 m4 m5 m6 ( {g1 , g2 , g3 , g4 } ; {m6 })[0.69] g1 x x g2 x x ({g1 } ; ∗)[0.5] ({g2 } ; ∗)[0.5] ({g3 } ; ∗)[0.5] ({g4 } ; ∗)[0.5] ({g5 } ; ∗)[0.5] g3 x x g4 x x g5 x (∅; ∗)[1.0] (a) (b) Fig. 1: A toy formal context (a) and the correspnoding concept lattice with sta- bility indexes (b). measure for ranking concepts, introduced in [7] and later revised in [8–10]. Sev- eral other methods are considered in [11], where it is shown that stability is more reliable in noisy data. For the moment, stability seems to be the most widely used usefulness measure around the FCA community. Thus, in this paper we are going to focus on stability. Although this measure is often used, there is neither a reliable comparison nor a deep research on its usefulness. Consequently, the goal of this paper is to evaluate the usefulness of stability. Here we experimentally prove that the stability for a pattern is coherent with the stability computed for the same pattern but w.r.t. a different dataset coming from the same population (the similarly distributed dataset). The rest of the paper is organised as follows. Section 2 introduces definition of stability and discusses known stability estimates. In Section 3 experiments on relevancy of stability are discussed. 2 Stability of a formal concept 2.1 Formal concept analysis (FCA) FCA [2] is a formalism for data analysis. FCA starts with a formal context and builds a set of formal concepts organized within a concept lattice. A formal context is a triple (G, M, I), where G is a set of objects, M is a set of attributes and I is a relation between G and M , I ⊆ G × M . In Figure 1a, a formal context is shown. A Galois connection between G and M is defined as follows: A0 = {m ∈ M | ∀g ∈ A, (g, m) ∈ I}, A⊆G B 0 = {g ∈ G | ∀m ∈ B, (g, m) ∈ I}, B⊆M The Galois connection maps a set of objects to the maximal set of attributes 0 shared by all objects and reciprocally. For example, {g1 , g2 } = {m6 }, while 0 {m6 } = {g1 , g2 , g3 , g4 }. Definition 1. A formal concept is a pair (A, B), where A is a subset of objects, B is a subset of attributes, such that A0 = B and A = B 0 , where A is called the extent of the concept, and B is called the intent of the concept. For example, a pair ({g1 , g2 , g3 , g4 } ; {m6 }) is a formal concept. Formal con- cepts can be partially ordered w.r.t. the extent inclusion (dually, intent inclu- sion). For example, ({g3 } ; {m3 , m6 }) ≤ ({g1 , g2 , g3 , g4 } ; {m6 }). This partial or- der of concepts is shown in Figure 1b. 2.2 The definition of stability Stability is an interestingness measure of a formal concept introduced in [7] and later revised in [8, 10]. Definition 2. Given a concept c, concept stability Stab(c) is defined as |{s ∈ ℘(Ext(c)) | s0 = Int(c)}| Stab(c) := (1) 2|Ext(c)| i.e., the relative number of subsets of the concept extent (denoted by Ext(c)), whose description (i.e., the result of (·)0 ) is equal to the concept intent (denoted by Int(c)) where ℘(P ) is the power set of P . Example 1. Figure 1b shows the concept lattice of the context in Figure 1a, for simplicity some intents are not given. The extent of the highlighted con- cept c is Ext(c) = {g1 , g2 , g3 , g4 }, thus, its power set contains 24 elements. The descriptions of 5 subsets of Ext(c) ({g1 } , . . . , {g4 } and ∅) are different from Int(c) = {m6 }, while all other subsets of Ext(c) have a description equal to 4 {m6 }. So, Stab(c) = 2 2−5 4 = 0.69. Stability measures the dependence of a concept intent on objects of the con- cept extent. In [10] it is shown that stability of a concept c is the relative number of subcontexts where there exists the concept c with intent Int(c). A stable con- cept can be found in many such subcontexts, and therefore is likely to be found in an unrelated context built from the population under study. In some papers it is noticed that in large datasets most of the concepts tends to have stability close to 1 [12, 13]. Thus, in order to distinguish between them we use the following logarithmic stability: LStab(c) = − log2 (1 − Stab(c)) (2) Stability computation is #P-complete [7, 8]. In this paper we rely on the algorithm from [10], with a worst-case complexity of O(L2 ), where L is the size of the concept lattice. However, generally it is quite efficient on real data. 3 Experiment on relevancy of stability Experiments on behaviour of stability are carried out on public datasets available from the UCI repository [14]. These datasets are shown in Table 1. With their different size and complexity, these datasets provide a rich experimental basis. Complexity here stands for the size of the concept lattice given the initial number Table 1: Datasets used in the experiments. Column ‘Shortcut’ refers to the short name of the dataset used in the rest of the paper; ’Size’ is the number of objects in the dataset; ‘Max. Size’ is the maximal number of objects in a random subset of the dataset the concept lattice can be computed for; ‘Max. Lat. Size’ is the size of the corresponding concept lattice; ‘Lat. Time’ is the time in seconds for computing this lattice; ‘Stab. Time’ is the time in seconds to compute stability for every concept in the maximal lattice. Dataset Shortcut Size Max. Size Max. Lat. Size Lat. Time Stab. Time Mushrooms1 Mush 8124 8124 2.3 · 105 324 57 Plants2 Plants 34781 1000 2 · 106 45 104 Chess3 Chess 3198 100 2 · 106 30 7.4 · 103 Solar Flare (II)4 Flare 1066 1066 2988 0 0 Nursery5 Nurs 12960 12960 1.2 · 105 245 5 1 http://archive.ics.uci.edu/ml/datasets/Mushroom 2 http://archive.ics.uci.edu/ml/machine-learning-databases/plants/ 3 http://archive.ics.uci.edu/ml/datasets/Chess+(King-Rook+vs.+King-Pawn) 4 http://archive.ics.uci.edu/ml/datasets/Solar+Flare 5 http://archive.ics.uci.edu/ml/datasets/Nursery of objects in the corresponding context. For example, Chess is the most complex dataset as for only 100 objects in the context there are already 2 · 106 of concepts in the concept lattice. When computing stability, one wants to know if the intent of a stable concept is a general characteristic rather than an artefact specific for a dataset. For that it is necessary to evaluate stability w.r.t. a test dataset different from the reference one. Reference and test datasets are two names of disjoint datasets on which the stability behaviour is evaluated. In order to do that the following scheme of experiment is developed: 1. Given a dataset K of size K objects, experiments are performed on dataset subsets whose size in terms of number of objects is N . This size is required to be at least half the size of K. For example, for a dataset of size K = 10 the size of it subset can be N = 4. 2. Two disjoint dataset subsets K1 and K2 of size N (in terms of objects) of dataset K are generated by sampling, e.g., K1 = {g2 , g5 , g6 , g9 } and K2 = {g3 , g7 , g8 , g10 }. Later, K1 is used as a reference dataset for computing stability, while K2 is a test dataset for evaluating stability computed in K1 . 3. The corresponding sets of concepts L1 and L2 with their stability are built for both datasets K1 and K2 . 4. The concepts with the same intents in L1 and L2 are declared as correspond- ing concepts. 5. Based on this list of corresponding concepts, a list of pairs S = {hX, Y i , . . . } is built, where X is the stability of the concept in L1 and Y is the stability of the corresponding concept in L2 . If an intent exists only in one dataset, its stability is set to zero in the other dataset (following the definition of Fig. 2: Stability in the test dataset w.r.t the reference one in Mush4000 in (a) plane scale (b) logarithmic scale. stability). Finally, the list LS = {hXlog , Ylog i , . . . } includes the stability pairs from S in logarithmic scale as stated by Eq. (2). We study here the sets S and LS. The idea of evaluating stability computed on a reference dataset w.r.t. a test dataset comes from the supervised classification methods. Moreover, this idea is often used to evaluate statistical measures for pattern selection and can be found as a part of pattern selection algorithms with a good performance [15]. Sets of pairs S and LS can be drawn by matching every point hX, Y i to a point in a 2D-plot. The best case is y = x. It means that stability computed for dataset part K1 is exactly the same as stability computed for the dataset part K2 . However, this is hardly the case in real-world experiments. For ex- ample, Figure 2a shows the corresponding diagram for the dataset Mush4000.1 This figure also highlights the fact that many concepts have stability close to 1. However, when the logarithmic set LS is used, a blurred line y = x can be perceived in Figure 2b. Moreover, selecting the concepts which are stable w.r.t. a high threshold in the reference dataset K1 , the corresponding concepts in K2 are stable w.r.t. a lower threshold. Thus, we can conclude that stability is more tractable in the logarithmic scale, and, thus, we only consider this logarithmic scale in the rest of the paper. 3.1 Setting a stability threshold In the previous subsection it is mentioned that concepts stable in the reference dataset are stable in the test dataset with a smaller threshold. But what is “smaller”? Imagine that in the reference dataset K1 we have the threshold θ1 , i.e., if Stab(c) ≥ θ1 then c is stable, while in the K2 we have θ2 . Then, we want to know the threshold θ1 such that at least 99% of stable concepts in K1 1 From here, the name of a dataset followed by a number such as ‘N ameN ’ refers to an experiment based on the dataset N ame where K1 and K2 are of the size N . 20 1: Mush ● 1: Plnt 1: Sflr Reference Stability Threshold 1: Nurs 15 5: Mush 5: Plnt 10 5: Sflr 5: Nurs ● ● ● 5 ● ● 50 100 200 500 1000 2000 5000 Dataset size in objects Fig. 3: Stability threshold in the reference dataset ensuring that 99% of concepts in the test datasets corresponding to stable concepts are stable with stability thresholds 1 or 5. corresponds to stable concepts in K2 . Figure 3 shows the reference threshold θ1 (x-axis) w.r.t. the size of the datasets (y-axis) for θ2 = 1 and θ2 = 5. For example, the line ‘5: Mush’ corresponds to the line of θ1 , where θ2 is fixed to 5 w.r.t. to the size of the dataset built from dataset Mushrooms. The value θ2 = 1 means that any stable concept is just found in the test dataset, while θ2 = 5 requires that they are quite stable in the test dataset. We can see that for large datasets the stability threshold is independent of the dataset, while for small datasets the diversity is higher. We can see that the value of θ1 should be set to 5–6 in order to ensure that 99% of stable concepts have corresponding concepts in another dataset. 3.2 Stability and ranking Another way of using usefulness measures is pattern ranking. Thus, it is an interesting question if the order of patterns could be preserved by using stability. A way to study an order of an array ar is to compute its sorting rate r, i.e., the relative number of pairs in the array sorted in the ascending order: r = {(i,j)|i