Distributed Distributed and General and General Galois Galois Lattice ForLattice Large Data For Large Data Bases Bases FatmaF.Baklouti Baklouti, andG.Gérard Lévy Lévy CERIALaboratory, CERIA Laboratory, Paris Paris Dauphine DauphineUniversity University Place Place du du Maréchal Maréchal dedeLattre Lattre de Tassigny Tassigny,75775 75775Paris cedex Paris 16, 16, cedex France France {fatma.baklouti, glevy}@dauphine.fr {fatma.baklouti, glevy}@dauphine.fr Abstract. Large data processing is an essential problem in many data mining application. Our work explores the algorithms calculating a Generalized Galois Lattice (G2L) over a large collection of data. A G2L contains all the so-called closed sets of a collection of tuples (individuals characterized by a set of properties), G2Ls generalize to multivalued data the GL already popular in the concept analysis using the binary values. They appear an attractive tool for data mining. The G2L or GL calculus is CPU intensive. In practice, the current techniques limit the approach only to small sets of data only, e.g., a hundred tuples with a few dozens of properties each. Our research consists in building scalable distributed G2L calculus algorithms. We think this research direction promising and probably the only way towards our goal. We first have implemented and analyzed some centralized algorithms. One termed ELL seems to be the most efficient. We have defined therefore a scalable distributed version of ELL termed SD-ELL. It recursively partitions the set of tuples for the G2L computation over sufficiently many sites to let ELL execute fast enough at each site. The closed sets produced by each site enter a common scalable distributed data structure (SDDS). We then plan to test the resulting behavior and analyze the performance of the algorithm and of some promising variants. 1 Introduction A Galois Lattice (GL) allows extracting concepts and rules from other concepts. Several algorithms determine the GL over a given context C, provided the size of C is small, [4] [7] [8]. These algorithms apply to binary data. Their generalization, G2L was proposed in [5]. An algorithm for G2L calculus termed ELL, and some alternative ones, were subsequently defined in [6]. These algorithms, as any previous for a GL computation, were designed for the single-site computations. It was known that they could run in reasonable time only for small sets, e.g., at most of a few hundreds of tuples. Further analysis has shown that the most promising way towards larger sets of data, the only of interest to the data mining, was to find some scalable distributed variant of G2L calculus. This goal becomes the subject of our work. We have defined therefore a scalable distributed version of ELL termed SD-ELL. It recursively partitions the set of tuples for the G2L computation over sufficiently many Radim Bělohlávek, Václav Snášel (Eds.): CLA 2005, pp. 197–206, ISBN 80–248–0863–3. 198 Fatma Baklouti, Gérard Lévy sites to let ELL execute fast enough at each site. The closed sets produced by each site enter a common scalable distributed data structure (SDDS). We then plan to test the resulting behavior and analyze the performance of the algorithm. The rest of the paper is organized as follows. In Section 2, we recall the definition of a G2L. Section 2 recalls the main idea in ELL. Section 3 overviews the principles of our scalable distributed extension, exploring the recursive G2L calculation in [6]. Section 4 concludes this paper by listing the future research direction. 2 General Galois Lattices Concept lattice [9] and Closed itemset lattice are based on order theory and lattice theory [3]. They are used to represent the order relation on concepts or closed itemsets. Concept lattice describes the character of the set pair: intent and extent of concept. And the general of Galois Lattice formalism was addressed by [16], [17]. In this section, we define some notions generalizing usual ones: Data context, Closure operator, Closed itemset, etc. Definition 1. A lattice is a mathematical structure F = < F, , , , 0F, 1F >, where F is a partially ordered set by the relation , with the largest element 1F, a smallest element 0F, and , are internal composition laws of sup (or supremum), and inf (or infimum). In many situations F is the Cartesian product of several lattices Fj = , for j J = {1,..., n}. We write this F=F1 x...x F x...x Fn. The relation on F is defined by z = (z1, ...,zj, ...,zn) t = (t1, ...,tj, ...,tn) iff zj j tj for each j of J . We note z t = (..., zj j tj,....), z t = (..., zj j tj, .... ), OF = (....,OFj, ... ), 1F = (..., 1Fj, ). (For standard Galois lattices, we have for each j: Fj = {0, 1}, 0 < 1, 0 j 0 = 0, 0 j 1 = 1 j 0 = 1 j =1, 0 j 0 = 0 j 1 = 1 j 0 = 0, 1 j 1 = 1. So OF = (...., 0, ...), and 1F = (..., 1,...)). Definition 2. General contexts and descriptions: Let m be a finite positive integer, I = {1,..., m} = [1, ,m], and F any lattice. Let d:I F be any mapping from I to F. A context C is defined as an array with rows d (i), i = 1, ..., m. Formalism: The context provides, for each individual or object i of I, its description d(i)=(d1(i), d2(i), , dj(i),..., dn(i)) F = F1 x x Fn, according to the attributes or properties j J. (In the standard case, dj(i)=1 means that i has property j, and dj(i) = 0 means that i has not the property j). So, in the general case, C is an m x n array of elements dj (i) of F, and for each individual i and each property j, dj (i) is the value of this property for i. Definition 3. Galois connection: Let C = be a context. We define E = P(I), |E|=2l and f: E F as follows: for each subset X of I, f(X) = d(i): i X} if X , and f( ) = 1F. So, f(X) is the infimum of the descriptions d(i) of elements i of X. And in the standard case f(X) is an element z = (z1,..., zj, ...) of F, and zj = j {dj(i) : i X} = 1, iff dj(i)=1, for each i of X. This means that zj = 1 iff j is a property which belongs to all i in X. For this reason, we call f (X) the intent of X. Distributed and General Galois Lattice for Large Data Bases 199 Remark: For each i of I, we have f({i}) = d(i), and f is decreasing (if X X I then f(X ) f(X)). We define g: F E by g(z) = {i I: z d(i)}, for each z of F. We say that g(z) is the extent of z. (In standard case, g(z) is the set of all individuals who have all properties of z, zj = 1). We can see that g is also a decreasing mapping. The ordered pair (f, g) is called a Galois connection. From it we derive two other mappings: h : P(I) P(I), by h = g f, and k = F F by k = f g. So, for each subset X of I, we have h(X) = g(f(X)) = {i I: f(X) d(i)}, and for each z of F, we have k(z) = f(g(z)) = {d(i): i g (z)}. We can see that h and k are closure operators. This means that each of them is an increasing, extensive, and idempotent operator. More explicitly, for each X, X of E, and z, z of F: X X implies that h(X) h(X ), z z implies that k(z) k(z ); X h(X), z k(z); h(h( X)) = h(X), k(k( z)) = k(z) . Any subset X of I such that X = h(X) is called a I-closed set, and each z of F such that z = k(z) is called F-closed element. Let us define H = {X I: h(X) = X} the set of all closed subsets of I, and K = {z F: z = k(z)}, the set of all closed elements of F. One can proof that there is a bijective mapping between H and K. The ordered pairs (X, z) H x K such that f(X) = z, and therefore such that g(z) = X, are called the concepts associated with the context C. The set of all such concepts constitutes the Galois lattice GL(C) associated with this context C. (the order relation on GL(C) is defined by (X, z) (X, z ) iff X X and z z.) 3 ELL algorithm This algorithm was proposed in [6]. We have subsequently improved it and compared to some other algorithms [2]. This comparison has shown better performance of ELL. This motivated our choice for this algorithm as a single-set G2L computation. Main idea: For two disjoint subsets X0 and K of the set of objects I, let ELL (X0, K) denote a procedure which lists all the closed sets of I obtained by extending X0 with some elements of K. In other words, ELL (X0, K) lists all the closed sets, which strictly contain X0 and are contained in X0 U K. Obviously, ELL (Ø, I) lists all the non-empty closed sets of I. ELL (X, K) proceeds by dichotomy as follows: 200 Fatma Baklouti, Gérard Lévy If (K Ø) Choose an element i0 K Find the closed sets which contain i0 Find the closed sets which do not contain i0 Endif The key point of the proposed algorithm is that the search time of such closed sets is considerably reduced by using the following proposition. Proposition: Let X0 and K Ø be two disjoint subsets of I. Let i0 K. We have: h(X0 U {i0}) = X0 U A where A = {i I\X0: f(X0) f(i0) f(i)} If a closed set contains X0 and i0, then it also contains A. Hence, if A K then X0UA is the smallest closed set containing X0 and i0 and contained in X0 U K. If a closed set contains X0 and does not contain i0, then it also does not contain any element of the set: R = {i K: f(X0) f(i) f(i0)}. A (vs. R) is used for Attraction (vs. Rejection). The proof of this proposition is in [6]. The following pseudo-code is a recursive version of the algorithm. Procedure ELL (X0, K) GL = Ø // GL is a list of concepts Var i0: element of I, z, z0: elements of F; X, A, R: subsets of I; begin z0 = f(X0); if K Ø then begin Choose an element i0 of K; z = z0 f(i0); A = {i I\X0: z f(i)}; if A K then begin X = X0 U A; insert node (X, z) in GL; ELL (X, K\A); end; R = {i K: z0 f(i) f(i0)}; ELL (X0, K\R); end end The procedure ELL (Ø, I) starts with any i0 I (I is non empty). Then it determines the set A. Observing that i0 A, we have the strict inclusions Ø X and I\A I. Hence if A K, ELL (X, I\A) will run with a strictly smaller second parameter. Since i0 R, we see that the same holds for ELL (X, I\R). More generally ELL (X, K\A) and ELL (X0, I\R) run with a strictly smaller second parameter than that of ELL (X, K). Since I is finite, this parameter which is a subset of I, will reach the void set and the procedure will terminate. This algorithm lists all closed sets without duplicates. Let us show that each closed set F occurs exactly once. Distributed and General Galois Lattice for Large Data Bases 201 Starting with GL = Ø, X0 = Ø and K = I, let i0 I be fixed and consider two cases: i0 F and i0 F. If i0 F then Either F is the smallest closed set which contains i0. Then according to the previous proposition (a), F=X=X0 U A and F will be listed by ELL (X0, K), Or F is not the smallest closed set which contains i0. In this case X F and F will be listed by ELL (X, K\A). If i0 F, then according to the previous proposition (c), F will be listed by ELL(X0,K\R). Since each insert only concerns the unique smallest closed set containing X0 and i0, we see that F occurs exactly once. The only case not treated by the algorithm is whether the empty set is closed or not. The algorithm yields all the nodes (X, z) of the GL such that X Ø. Since f(Ø) = 1F and h(Ø) ={i I: 1F f(i)} = {i I : f (i) = 1F}, Ø is closed iff there is no i I such that f(i) = 1F. 4 Scalable Distributed G2L calculus We proposed a first solution in [1] which describes a way to parallelize the large contexts, by sharing a context C into sub-contexts depending on its rows or columns. Galois lattices associated with these sub-contexts are built, with the ELL, in different computers and then the global lattice is determined from these lattices. The algorithms used for the build of this global lattice are detailed in [1]. This solution was successfully tested. The study has shown that the time complexity appeared excessive for larger sets we have wished. We have designed therefore a new solution to parallelize the work. It is based on a new algorithm, termed SD-ELL. The key idea is a recursive application of ELL. 4.1 SD-ELL algorithm SD-ELL is composed of two algorithms. The first one computes the sets of objects Ki ( i={1, , n}, n is the number of the computed sets Ki) and the second one is duplicated on different machines to compute all the closed sets corresponding to each Ki. The algorithm presented in the following table permits computing the sets Ki. It proceeds as follows: Choose an element i0 K Compute the sets A and R Verify if A K then (X, Z) is a closed itemset and Ki = K\A Ki+1 = K\R The previous steps are reused with the parameters Xi and Ki. The process of decomposing Ki is stopped if |Ki| is smaller than a fixed threshold or the total number of machine, in which the closed itemsets corresponding to each Ki are computed, is used. 202 Fatma Baklouti, Gérard Lévy Procedure KCompute(X0,K0,z0) LF = Ø; List of (K, X, z) t = 1, q = 1, Xt = Ø, Kt = I, zt = 1F. while ((t q) and ((q - t + 1) site number)) begin X0 = Xt; K0 = Kt; z0=zt Choose an element i0 of K z = z0 f (i0); A = {i I\X0: z f(i)}; R = {i K: z0 f(i) f(i0)}; if A K0 then begin X = X0 A; z = z0 f(i0); K = K0\A; LF = LF (X, z) q=q+1; Xq = X; Kq = K; zq = z; end q = q + 1; Xq = X0; Kq = K0\R; zq = z0; t = t + 1; end The second algorithm constituting SD-ELL is the iterative version of ELL (I-ELL) duplicated on different machines. This version is detailed in [6] and [2]. Let us use the context C from the example described below to illustrate how these two algorithms work. O\A a b c d e f g h 1 12 16 6 10 12 13 12 6 2 12 12 8 12 11 13 11 8 3 10 13 16 14 11 8 12 10 4 13 14 11 10 13 13 12 14 5 17 10 10 14 13 10 14 12 6 0 14 3 8 4 13 11 10 For this example the threshold is 4 and the number of machine is 4. The dispatcher executes the first algorithm (KCompute). KCompute starts with the parameters X0=Ø and K= I ={1,2,3,4,5,6}. An element i0 is chosen such as i0=1 and then the sets A, K1, R, and K2 are computed. z = z0 f (i0) = {12,16,6,10,12,13,12,6}, A = {i I \ X0: z f (i)}={1}. Then A K: X1=X0UA={1} and C2({1};{12,16,6,10,12,13,12,6}) is a concept. K1 = K\A = {2,3,4,5,6}; |K1| 4, K1 must then be decomposed. R = {i K: z0 f (i) f (i0)} = {1} K2 = K\R = {2,3,4,5,6}; |K2| 4, K2 must also be decomposed. The previous steps are reused with the parameters (X1 = {1}, K1) and (X0=Ø, K2). Distributed and General Galois Lattice for Large Data Bases 203 Decomposition of K1: Choosing i0 K1 such as i0 = 2. z = z0 f (i0) = {12,12,6,10,11,13,11,6}, A = {i I \ X: z f (i)} = {2,4}. Then A K 1: X3 = X1 U A = {1,2,4} and C3 ({1,2,4};{12,12,6,10,11,13,11,6}) is a concept. K3 = K1 \ A = {3,5,6} (|K3| 4) R = {i K1: z0 f (i) f (i0)} = {2} K4 = K1\R = {6,3,4,5} (|K4| 4) with X4={1}. Decomposition of K2: Choosing i0 K2 such as i0=2. z = z0 f (i0) = {12,12,8,12,11,13,11,8}, A={i I \ X: z f (i)} = {2}. Then A K2: X5 = X0 U A = {2} and C4({2};{12,12,8,12,11,13,11,8}) is a concept. K5 = K2\A = {3,4,5,6} (|K5| 4) R = {i K2: z0 f (i) f (i0)} = {2} K6 = K2\R = {6,3,4,5} (|K6| 4) with X6=Ø. Then, the dispatcher sends to the applications the parameters (K3,X3) (K4,X4) (K5,X5) (K6,X6). In parallel, the applications, described in the subsection 4.2, execute the I- ELL. The following sets Ei are the sets of concepts generated by I-ELL(Ki+2, Xi+2) in the applications Ai. The set of concepts E1 generated by The set of concepts E2 generated by I-ELL (K3,X3). I-ELL(K4, X4). C5({1,2,4,3};{10,12,6,10,11,8,11,6}) C12({1,6,4};{0,14,3,8,4,13,11,6}) C6({1,2,4,3,5};{10,10,6,10,11,8,11,6}) C13({1,6,4,3};{0,13,3,8,4,8,11,6}) C7 ({1,2,4,3,5,6};{0,10,3,8,4,8,11,6}) C14({1,3,4};{10,13,6,10,11,8,12,6}) C8({1,2,4,3,6};{0,12,3,8,4,8,11,6}) C15 ({1,3,4,5};{10,10,6,10,11,8,12,6}) C9 ({1,2,4,5};{12,10,6,10,11,10,11,6}) C16 ({1,5,4};{12,10,6,10,12,10,12,6}) C10 ({1,2,4,5,6};{0,10,3,8,4,10,11,6}) C17({1,4};{12,14,6,10,12,13,12,6}) C11 ({1,2,4,6} ;{0,12,3,8,4,13,11,6}) The set of concepts E3 generated by The set of concepts E4 generated by I-ELL(K5, X5). I-ELL(K6, X6). C18({2,3};{10,12,8,12,11,8,11,8}) C29({6,4);{0,14,3,8,4,13,11,10}) C19({2,3,4};{10,12,8,10,11,8,11,8}) C30{6,4,3}; {0,13,3,8,4,8,11,10}) C20({2,3,4,5};{10,10,8,10,11,8,11,8}) C31({6,4,3,5};{0,10,3,8,4,8,11,10}) C21({2,4,3,5,6};{0,10,3,8,4,8,11,8}) C32({6,4,5};{0,10,3,8,4,10,11,10}) C22({2,4,3,6}; {0,12,3,8,4,8,11,8}) C33({3};{10,13,16,14,11,8,12,10}) C23({2,3,5};{10,10,8,12,11,8,11,8}) C34({3,4};{10,13,11,10,11,8,12,10}) C24({2,6,4};{0,12,3,8,4,13,11,8}) C35({3,4,5};{10,10,10,10,11,8,12,10}) C25({2,6,4,5};{0,10,3,8,4,10,11,8}) C36({3,5};{10,10,10,14,11,8,12,10}) C26({2,5};{12,10,8,12,11,10,11,8}) C37({4};{13,14,11,10,13,13,12,14}) C27({2,5,4};{12,10,8,10,11,10,11,8}) C38({4,5};{13,10,10,10,13,10,12,12}) C28({2,4};{12,12,8,10,11,13,11,8}) C39 ({5};{17,10,10,14,13,10,14,12}) The union of E1, E2, E3, E4, and the concepts calculated in the dispatcher is the set of all concepts corresponding to the context C. 204 Fatma Baklouti, Gérard Lévy 4.2 The Architecture A major problem for the final efficiency of SD-ELL is the storage of the collection of the closed produced, presumably very large and of unknown size. Our tool is a Scalable Distributed Data Structure (SDDS). An SDDS dynamically distributes over the (server) nodes whose number dynamically scales with the file growth. The application accesses the file through the client nodes making the data distribution transparent. For scalability, the data address calculations do not involve any centralized directory. The data are also basically stored in the distributed RAM, for faster access than to the traditional disk-based structures [10]. All these properties are prime importance for our goal. Our first architecture for SD-ELL, illustrated below, calculus is to add to the SDDS structure of clients and servers, a dispatcher and our applications at each available client site. The dispatcher executes the first algorithm of SD-ELL. It then sends to the applications the parameters Ki and Xi. In parallel, on each client the iterative version is executed with its parameters. We obtain the closed itemsets that after the calculus partition the Galois lattice. The closed sets are sent by the clients to the servers, where they are saved. Dispatcher (KCompute) K1, X1 Ki, Xi Kn, Xn Application Application Application I-ELL(K1,X1) E1 I-ELL(Ki,Xi) Ei I-ELL(Kn,Xn) En Client Client Client E1 Ei En Server Server Server Server Name Buckets in RAM Buckets in RAM Buckets in RAM The dispatcher is a centralized component. To offset the potential drawbacks of this approach, we have designed also a more distributive solution. The dispatcher calculates only the set K1 and K2 (K1=K0\A and K2 =K0\R) and sends them to two applications. These applications become dispatchers on there turn. They send to other Distributed and General Galois Lattice for Large Data Bases 205 applications the parameters Ki and Xi if Ki is greater than a fixed threshold and there are available machines. We will first implement and measure the 1st solution. Later we will also design the end one and compare the related trade-offs. Our preliminary experimentations has shown that the processing time of the distributed SD-ELL algorithm out performs the processing time of the ELL algorithm. Besides, increasing the number of processors improves the processing time. 5 Related work In [12], Ndoundam and al. proposed a methodology which permits implementing a parallel version of Bordat [14] through the parallelization of nested loops. Nevertheless, this parallelization is incompleted as proved by [13] how propose a better fine-grain parallelization. By another way, [15] present a parallel version of Ganter algorithm [7] for large context which relies on a partitioning of the search space. Let us note that all these previous works do not provide effective implementation of the parallel algorithms. Only [11] propose a validation of their theoretical model through an effective implementation. 6 Conclusion In this paper, we propose a new algorithm (SD-ELL) which is a distributed version of ELL, in order to deal with large databases. The architecture used for this distribution allows to share the computation of closed sets of large databases on the different applications and to store them in the server nodes. The storage of the great number of closed sets is successful since the number of server nodes dynamically scales with the growth of the closed itemsets file. Our ongoing research consists in continuing the evaluating of the prototype performance measurements. A possible future research direction consists in determining the links between the concepts and then comparing our results with the results of [11]. References 1. Baklouti. F, Lévy. G. Parallel algorithms for general Galois lattices building. Proc. WDAS 2003. 2. Baklouti. F, Lévy. G. A fast and general algorithm for Galois lattices building. Journal of Symbolic Data Analysis. Vol 3. pp. 19-31. 3. Birkhoff. G. Lattice Theory. American Mathematical Society, Providence, RI, 3rd edition. 1967. 4. Bordat. J. Calcul pratique du treillis de galois d une correspondance, Mathématique, Informatique et Sciences Humaines 24(94), 31. 1986. 206 Fatma Baklouti, Gérard Lévy 5. Diday. E, Emilion. R. Treillis de Galois maximaux et capacités de Choquet. Cahier de Recherche de l Acadèmie des Sciences. Paris, t.325, Série I, p.261-266, 1997. 6. Emilion. R. Lambert. G, Lévy. G, Algorithmes pour les treillis de Galois, Indo- French Workshop, University Paris IX-Dauphine. 1997. 7. Ganter. B. Two basic algorithms in concept analysis. Preprint 831, Technische Hochschule Darmstadt 1984. 8. Godin. R., Mineau. G. and al. Méthodes de classification conceptuelle bases sur les treillis de Galois et application. Revue d intelligence artificielle pp 105-137. 1995. 9. Ganter. B, Wille. R. Formal Concept Analysis. Mathematical Foundations, Springer. 1999. 10. Litwin. W., Neimat, M-A., Schneider. D. A Scalable Distributed Data Structures. ACM Transactions on Database Systems (ACM-TODS), Dec. 1996. 11. Valtchev, P., Djoufak Kengue, J,F., Djamegni C,T., A parallel algorithm for lattice construction, in the proceding of the third ICFCA, February 2005, Lens, Fance. pp 249-264. 12. Ndoundam, R., Njiwoua, P., Mephu Nguifo, E. Une etude comparative de la parallelisation d algorithmes de construction de treillis de Galois. Atelier francophone de la plate forme de l AFIA. Usage des treillis de Galois pour l intelligence artificielle, ESIEA Recherche (2003). 13. Njiwoua, P., Mephu Nguifo, E. A parallel algorithm to build concept lattice. In proceedings of the fourth Groningen Int. Information Tech. Conf. for students (1997). 14. Bordat, J.P. Calcul pratique du treillis de Galois d une correspondance. Mathématiques et Sciences Humaines, Vol. 96 (1986) 31-47. 15. Fu, H., Mephu Nguifo, E. A parallel algorithm to generate formal concepts for large data. In proceedings of the second international conference on formal concept analysis (ICFCA04), springer LNCS, Sydney Australia (2004). 16. Diday. E, Emilion. R. Treillis de Galois maximaux et capacités de Choquet. Cahier de Recherche de l Acadèmie des Sciences. Paris, t.325, Série I, p.261-266, 1997. 17. Diday. E, Emilion. R. Maximal and stochastic Galois lattices , Discrete Applied Mathematics, 127, (2003), 271-284. Acknowledgements We would like to express special thanks to Pr. Witold Litwin for his precious remarks and his help for using the SDDS.