Efficient Efficient algorithms algorithms for for clone clone items items detection detection Raoul Medina, Caroline Noyer, and Olivier Raynaud Raoul Medina, Caroline Noyer and Olivier Raynaud LIMOS - Université Blaise Pascal, LIMOS -des Campus universitaire Université Cézeaux,Blaise Pascal, Clermont-Ferrand, France Campus universitaire des Cézeaux, {medina, Clermont-Ferrand, France raynaud}@isima.fr {medina, raynaud}@isima.fr Abstract. This paper presents efficient algorithms for clone items de- tection in a binary relation. Best implementation of these algorithms has O(|J|.||M||) time complexity, with J corresponding to the set of items of the relation and ||M|| corresponding to the size of the relation. This re- sult improves the previous algorithm given in [3] which is O(|J|2 .||M||). Clone items have been introduced by Medina and Nourine to explain why, sometimes, the number of rules of a minimum cover of a relation is exponential with the number of items of the relation. 1 Introduction The clone items notion has been introduced by Medina and Nourine in [3]. Aim of their paper was to understand the combinatorial explosion that might arise in a minimum basis of a relation. Most famous example of such exponential minimum basis is given by Manilla and Rähiä in [2]. Medina and Nourine no- ticed that in this example some items play symetrical role in the basis. Indeed, for each rule containing a given item a in the antecedent there is a symetri- cal rule where item a is replaced by item b. Such symetrical items are said to be clone items. The clone notion is an equivalence relation and thus classes of clone items can be found in a binary relation. In [3], the authors show how to compute such clone classes and, from those classes, how to reduce the binary relation in order to obtain a minimum basis with no clone items. The detection algorithm as well as the reduction algorithm have both polynomial time com- plexities. The obtained minimum basis is smaller than the original one (in the case of the Mannila and Rähiä example it reduces to a single rule) and this later can be reconstructed from the clone free minimum basis and the clone classes information in polynomial time. Recently, Gely et al. (in [1]) extended the notion of clone items (which are defined on closed sets) to the notion of P-clone items (defined on pseudo-closed sets) and to the notion of A-clone items (defined on what they call an ”atomized” context). Their approach could be generalized to any generation problem where, from a given relation, one wants to compute a (potentially exponential) collection of sets verifying a property (e.g. the sets must be closed, or the sets must be ideals, or the sets are pseudo-closed sets, etc...). The idea is then to reduce the combinatorial explosion of the wanted collection by representing it by a clone free collection and the classes of clone items of this collection. The problem is Radim Bělohlávek, Václav Snášel (Eds.): CLA 2005, pp. 70–81, ISBN 80–248–0863–3. Efficient algorithms for clone items detection 71 then to be able to compute the clone classes of the (potentially exponential) collection without generating its sets. To solve this problem in polynomial time, the general method could be as follows: – Let M be the (potentially exponential) collection of sets verifying a property over a given binary relation R. Compute in polynomial time a collection M0 such that: 1. the size of M0 is polynomial in the size of R, 2. items a and b are clone in M if and only if they are clone items in M0 . – Detect the clone classes in M0 . Our paper focuses on the detection phase of this approach. The clone classes computation algorithm given in [3] has an O(|J|2 .||M||) time complexity, where J is the set P of items and ||M|| is the size of the input collection. In other words, ||M|| = m∈M |m|. In this paper, we present different algorithms to solve the clone classes computation. The best complexity we obtain is in O(|J|.||M||). This paper is organized as follows: section 2 formally defines the problem in terms of collection of sets and introduces the corresponding Abstract Data Type which will be used by our algorithms. Section 3 describes three computation strategies and the corresponding time complexities are studied. Section 4 shows how to take advantage of those algorithms in order to compute the clone items classes as defined in [3]. 2 General context and definitions In this section, we first formally define the studied problem. Then we present the Abstract Data Structure called Map used in our algorithms and discuss on its possible implementations. 2.1 Clone items in a Sets Collection Let J be a set of items {x1 , ..., x|J| } and M a sets collection on J. We denote by ϕa,b : 2J → 2J the mapping which associates to any subset of J its image by swapping items a and b. More formally :   (m \ {a}) ∪ {b} if b 6∈ m and a ∈ m ϕa,b (m) = (m \ {b}) ∪ {a} if a 6∈ m and b ∈ m m otherwise  Definition 1. Let M be a collection of sets defined on J. We say that items a and b are clone items in M if and only if ∀m ∈ M, ϕa,b (m) ∈ M. The clone items concept is a binary one. To the question ”are a and b clone items ?”, only the answer ”true” or ”false” is possible. It could be interesting to have more precisions when the negative response is given. Are a and b very far from being clone items or why are not they clone ? For this purpose, we introduce a measure to qualify the clone property. This measure will represent 72 Raoul Medina, Caroline Noyer, Olivier Raynaud a distance between two items a and b, based on the definition of the clone items property. This distance is exactly the number of elements m of M which do not have an image in M when applying the swapping function ϕa,b (m). When this distance is zero, a and b are clone items. The greater the distance is, the farther a and b are to be clone. More formally: Definition 2. Let M be a sets collection on J and let (a, b) in J 2 . We call distance between a and b, denoted by dM (a, b), the mapping : J2 → N dM (a, b) → {m ∈ M | ϕa,b (m) 6∈ M} Thanks to definition 2, clone items could be charaterized as follows : Proposition 1. Let M be a sets collection on J and (a, b) in J 2 , a and b are clone items if and only if dM (a, b) = 0. The problem we study in this paper is the computation of distances between all pair of items of J: Problem 1 (Distance). Data : a sets collection M on J; Result : the matrix dM . Here, we present the main property on which rely our algorithms. It charac- terizes a couple (m, m0 ) of the sets collection M such that m = ϕa,b (m0 ). Proposition 2. Let M be a sets collection defined over J, m and m0 two dis- tinct sets of M and (a, b) two items of J such that a ∈ m and b ∈ m0 . Then the following assertions are equivalent: 1. m = ϕa,b (m0 ) 2. m0 = ϕa,b (m) 3. |m| = |m0 | and m \ m0 = {a} and m0 \ m = {b} 4. m \ {a} = m0 \ {b} This property states that two sets m and m0 are their respective images by the swapping function ϕ if and only if they have same size t and share t − 1 items. This property follows directly from the definition of the ϕ mapping. 2.2 Abstract Data Type : Key Mapping Interface. We use a Map abstract data type similar to the Map interface of Java language. This data structure maps keys to values. In our case, the keys are the sets of the collection. The values mapped by the keys depend on the algorithm. This abstract data type supplies the following operators: – new() operator: creates a Map object and returns an empty map. Efficient algorithms for clone items detection 73 – get(e) operator: returns the value associated to the key e if this key maps a value, or Nil otherwise. – put(e,value) operator: inserts set e in the map and associates value to it. Time complexities of those operators deeply rely on the data structure used for the implementation of the Map data type. We propose an implementation which takes advantage of the type of the keys, i.e. sets. To implement the Map type we propose a lexicographic tree: itemsets are represented by branches of the tree. Implementation: lexicographic tree. We first give a formal definition of a lexicographic tree corresponding to a sets collection. Definition 3. Let M be a sets collection defined over J, with a total order on J denoted by