<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Efficient algorithms for clone items detection</article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
          <string-name>Raoul Medina</string-name>
          <email>medina@isima.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caroline Noyer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Raynaud</string-name>
          <email>raynaud@isima.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
<aff id="aff0">
          <label>0</label>
          <institution>LIMOS - Université Blaise Pascal</institution>
          ,
          <addr-line>Campus universitaire des Cézeaux, Clermont-Ferrand</addr-line>
          ,
          <country>France</country>
        </aff>
      </contrib-group>
      <fpage>70</fpage>
      <lpage>81</lpage>
      <abstract>
<p>This paper presents efficient algorithms for clone items detection in a binary relation. The best implementation of these algorithms has an O(|J| × ||M||) time complexity, with J the set of items of the relation and ||M|| the size of the relation. This result improves the previous algorithm given in [3], which is in O(|J|² × ||M||). Clone items have been introduced by Medina and Nourine to explain why, sometimes, the number of rules of a minimum cover of a relation is exponential in the number of items of the relation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>then to be able to compute the clone classes of the (potentially exponential) collection without generating its sets. To solve this problem in polynomial time, the general method could be as follows:
– Let M be the (potentially exponential) collection of sets verifying a property over a given binary relation R. Compute in polynomial time a collection M′ such that:
1. the size of M′ is polynomial in the size of R,
2. items a and b are clone items in M if and only if they are clone items in M′.
– Detect the clone classes in M′.</p>
      <p>
        Our paper focuses on the detection phase of this approach. The clone classes
computation algorithm given in [
        <xref ref-type="bibr" rid="ref3">3</xref>
] has an O(|J|² × ||M||) time complexity, where
J is the set of items and ||M|| is the size of the input collection; in other
words, ||M|| = Σm∈M |m|. In this paper, we present different algorithms to solve the
clone classes computation. The best complexity we obtain is in O(|J| × ||M||).
      </p>
      <p>
        This paper is organized as follows: section 2 formally defines the problem in
terms of collection of sets and introduces the corresponding Abstract Data Type
which will be used by our algorithms. Section 3 describes three computation
strategies and the corresponding time complexities are studied. Section 4 shows
how to take advantage of those algorithms in order to compute the clone items
classes as defined in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
<title>General context and definitions</title>
<p>In this section, we first formally define the studied problem. Then we present the Abstract Data Structure called Map used in our algorithms and discuss its possible implementations.</p>
      <sec id="sec-2-1">
        <title>Clone items in a Sets Collection</title>
<p>Let J be a set of items {x1, ..., x|J|} and M a sets collection on J. We denote by ϕa,b : 2^J → 2^J the mapping which associates to any subset of J its image by swapping items a and b. More formally:</p>
        <p>ϕa,b(m) = (m \ {a}) ∪ {b} if b ∉ m and a ∈ m; (m \ {b}) ∪ {a} if a ∉ m and b ∈ m; m otherwise.</p>
        <p>Definition 1. Let M be a collection of sets defined on J. We say that items a and b are clone items in M if and only if ∀m ∈ M, ϕa,b(m) ∈ M.</p>
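<p>The case analysis above translates directly into code; a minimal sketch in Python (the function name phi is ours):

```python
def phi(m: frozenset, a, b) -> frozenset:
    """Image of the set m under the swap of items a and b (the mapping phi_a,b)."""
    if a in m and b not in m:
        return (m - {a}) | {b}
    if b in m and a not in m:
        return (m - {b}) | {a}
    return m  # m contains both items or neither: m is its own image
```
</p>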
<p>The clone items concept is a binary one. To the question "are a and b clone items?", only the answer "true" or "false" is possible. It could be interesting to have more precision when the negative answer is given: are a and b very far from being clone items, and why are they not clone? For this purpose, we introduce a measure to qualify the clone property. This measure represents a distance between two items a and b, based on the definition of the clone items property. This distance is exactly the number of elements m of M which do not have an image in M when applying the swapping function ϕa,b. When this distance is zero, a and b are clone items. The greater the distance, the farther a and b are from being clone. More formally:</p>
        <p>Definition 2. Let M be a sets collection on J and let (a, b) be in J². We call distance between a and b, denoted by dM(a, b), the mapping J² → ℕ defined by dM(a, b) = |{m ∈ M | ϕa,b(m) ∉ M}|.</p>
<p>Thanks to Definition 2, clone items can be characterized as follows:
Proposition 1. Let M be a sets collection on J and (a, b) in J². Items a and b are clone items if and only if dM(a, b) = 0.</p>
<p>The problem we study in this paper is the computation of the distances between all pairs of items of J:
Problem 1 (Distance).</p>
        <p>Data: a sets collection M on J;
Result: the matrix dM.</p>
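<p>Definition 2 gives a direct, naive reference implementation of the Distance problem, useful for checking the faster strategies of Section 3 (names ours):

```python
def phi(m, a, b):
    """Swap items a and b in the set m."""
    if a in m and b not in m:
        return (m - {a}) | {b}
    if b in m and a not in m:
        return (m - {b}) | {a}
    return m

def distance(M, a, b):
    """dM(a, b): number of sets m of M whose image by phi_a,b is not in M."""
    coll = set(M)
    return sum(1 for m in coll if phi(m, a, b) not in coll)
```

Items a and b are clone in M exactly when distance(M, a, b) == 0; this brute-force version costs one membership test per set and per pair.
</p>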
<p>Here, we present the main property on which our algorithms rely. It characterizes a couple (m, m′) of the sets collection M such that m = ϕa,b(m′).
Proposition 2. Let M be a sets collection defined over J, m and m′ two distinct sets of M and (a, b) two items of J such that a ∈ m and b ∈ m′. Then the following assertions are equivalent:
1. m = ϕa,b(m′)
2. m′ = ϕa,b(m)
3. |m| = |m′| and m \ m′ = {a} and m′ \ m = {b}
4. m \ {a} = m′ \ {b}</p>
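<p>Condition 3 of Proposition 2 yields a cheap test for the swap relation using set differences only, without enumerating candidate pairs (a, b); a sketch (names ours):

```python
def swap_related(m: frozenset, m2: frozenset):
    """Return the pair (a, b) such that m = phi_a,b(m2), or None if no such
    pair exists (condition 3 of Proposition 2)."""
    if len(m) != len(m2):
        return None
    d1, d2 = m - m2, m2 - m
    if len(d1) == 1 and len(d2) == 1:
        return next(iter(d1)), next(iter(d2))
    return None
```
</p>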
<p>This property states that two sets m and m′ are each other's image by the swapping function ϕ if and only if they have the same size t and share t − 1 items. This property follows directly from the definition of the ϕ mapping.</p>
        <p>Interface. We use a Map abstract data type similar to the Map interface of the Java language. This data structure maps keys to values. In our case, the keys are the sets of the collection. The values mapped by the keys depend on the algorithm.</p>
<p>This abstract data type supplies the following operators:
– new() operator: creates a Map object and returns an empty map;
– get(e) operator: returns the value associated to the key e if this key maps a value, or Nil otherwise;
– put(e,value) operator: inserts set e in the map and associates value to it.</p>
<p>Time complexities of those operators deeply rely on the data structure used for the implementation of the Map data type. We propose an implementation which takes advantage of the type of the keys, i.e. sets. To implement the Map type we propose a lexicographic tree: itemsets are represented by branches of the tree.</p>
      </sec>
      <sec id="sec-2-2">
<title>Implementation: lexicographic tree</title>
        <p>We first give a formal definition of a lexicographic tree corresponding to a sets collection.</p>
<p>Definition 3. Let M be a sets collection defined over J, with a total order on J denoted by &lt;J. A unique lexicographic tree is associated to M such that:
– each edge of the tree is labeled with an element of J;
– to each marked node of the tree corresponds a set of M;
– to each set m of M corresponds a unique path in the tree (starting from the root and ending with a marked node) such that the union of the labels in this path corresponds exactly to the set m;
– for any path from the root to any node, the order of the successive labels respects the order defined by &lt;J;
– the order of the edges leaving a node respects the order defined by &lt;J.</p>
        <p>A node of the tree indicates whether it is marked (i.e. corresponds to a set of the collection and thus to a key) or not. Any field type can be associated to the node for storing the value associated to the key (on marked nodes only). There are two ways of implementing the list of children of a node in the lexicographic tree:
– using a List structure, each entry of the list containing the label of the associated edge and a reference to the child;
– using an Array structure indexed by the labels, the entries containing either Nil or a reference to the child.</p>
<p>Complexities of the put and get operators rely on the chosen implementation.
Proposition 3. If the lexicographic tree is implemented with Lists then:
– the put(m,value) operator has an O(|J|) time complexity;
– the get(m) operator has an O(|J|) time complexity.</p>
        <p>Access complexity is due to the fact that an item appears only once in a set. The creation of a node costs constant time.</p>
<p>Proposition 4. If the lexicographic tree is implemented with Arrays then:
– the put(m,value) operator has an O(|m| × |J|) time complexity;
– the get(m) operator has an O(|m|) time complexity.</p>
        <p>Access complexity is due to the fact that we have direct access to the child labeled by a given item. The creation of a node of the tree costs O(|J|) since we need to create and initialize a |J|-sized array.</p>
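<p>The Map interface over a lexicographic tree can be sketched as follows, here with the List variant of Proposition 3 (all names are ours; each child list is kept sorted to respect the total order on J):

```python
class Node:
    def __init__(self):
        self.children = []   # sorted list of (label, Node) pairs
        self.marked = False  # True iff the path to this node is a key
        self.value = None    # value stored on marked nodes only

class LexTreeMap:
    """Map keyed by itemsets, stored as branches of a lexicographic tree."""
    def __init__(self):
        self.root = Node()

    def _child(self, node, label, create):
        for lab, child in node.children:
            if lab == label:
                return child
        if not create:
            return None
        child = Node()
        node.children.append((label, child))
        node.children.sort(key=lambda p: p[0])  # keep edges ordered
        return child

    def put(self, m, value):
        node = self.root
        for x in sorted(m):  # follow the branch in increasing label order
            node = self._child(node, x, create=True)
        node.marked, node.value = True, value

    def get(self, m):
        node = self.root
        for x in sorted(m):
            node = self._child(node, x, create=False)
            if node is None:
                return None
        return node.value if node.marked else None
```
</p>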
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Strategies and algorithms</title>
<p>In this section, we study three different strategies for solving the Distance problem. All strategies rely on the same basic idea: first the distance matrix is initialized with the maximum possible value for each distance dM(a, b). Then, each time the algorithm finds m and m′ such that m = ϕa,b(m′), the distance dM(a, b) is decreased by one. The main difference between the three strategies is the way they detect that m = ϕa,b(m′).</p>
      <p>
        The first strategy is the one given in [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. We call it set existence checking. The main idea of the algorithm is, given m ∈ M, to compute all possible sets ϕa,b(m) and then check whether these sets belong to the collection. The second strategy is called ϕa,b relation checking. Its principle is, for each pair (m, m′), to check if there exist items a and b such that m = ϕa,b(m′). This test is done using Proposition 2. The third strategy is called classes computation. Based on Proposition 2, it computes classes C of itemsets such that for any pair (m, m′) of C, there exist items a and b such that m = ϕa,b(m′).
      </p>
<p>We first discuss the initialization of the distance matrix since it is common ground for all strategies.</p>
      <sec id="sec-3-1">
        <title>Discussion on the distance matrix</title>
<p>Since the notion of distance dM(a, b) is symmetric, the distance matrix is also symmetric. We thus choose to represent it by a triangular matrix such that dM(a, a) = 0 and dM(a, b) = dM(b, a).</p>
        <p>In this section we discuss the different strategies for the initialization and the
update of the distance matrix, as well as their respective costs.</p>
<p>Let M = Mab̄ ∪ Māb ∪ Mab ∪ Māb̄ (a disjoint union), with:
– Mab̄: the sets m of M such that a ∈ m and b ∉ m;
– Māb: the sets m of M such that a ∉ m and b ∈ m;
– Mab: the sets m of M such that a ∈ m and b ∈ m;
– Māb̄: the sets m of M such that a ∉ m and b ∉ m.</p>
<p>Incrementation or decrementation strategy? There are two ways of computing the distance matrix:
– either by first initializing all distances to 0 and then increasing by 1 the distance dM(a, b) each time we find m ∈ M such that ϕa,b(m) ∉ M,
– or by first initializing the distances to a maximal value and then decreasing by 1 the distance dM(a, b) each time we find m and m′ ∈ M such that m = ϕa,b(m′).</p>
<p>What is the maximal value that can be taken by dM(a, b)? Clearly the answer is |Mab̄| + |Māb|. Indeed, for any set m ∈ Mab ∪ Māb̄ we have m = ϕa,b(m), and thus m cannot increase the distance between a and b. This also represents the maximum number of basic operations (either incrementations or decrementations) needed to compute the distance matrix in the worst case. Thus, whatever strategy we choose, the number of basic operations will be the same in the worst case.</p>
<p>What is the best time complexity for both strategies? We consider that the basic operations can be done in O(1). Thus the overall complexity will be O(Σa∈J Σb∈J (|Mab̄| + |Māb|)).</p>
        <p>Now, consider m ∈ M. In the worst case, m will be taken into account in at most |m| × |J \ m| distances dM(a, b). Indeed, for m to be taken into account in dM(a, b), either a or b belongs to m but not both. Thus, we have O(Σa∈J Σb∈J (|Mab̄| + |Māb|)) = O(Σm∈M |m| × |J \ m|).</p>
        <p>This can be rewritten as follows: O(Σm∈M |m| × (|J| − |m|)) = O((Σm∈M |m| × |J|) − (Σm∈M |m| × |m|)).</p>
        <p>And since |m| × |m| is less than or equal to |m| × |J|, we obtain the worst case complexity O(Σm∈M |m| × |J|) = O(|J| × ||M||).</p>
        <p>Thus, whatever update strategy is chosen, the time complexity cannot be less than O(|J| × ||M||) (under the hypothesis that the basic operations increment or decrement by 1).</p>
<p>In this paper we adopt the decrementation strategy since our algorithms are based on Proposition 2. We now discuss the complexity of the initialization of the distance matrix.</p>
      </sec>
      <sec id="sec-3-2">
<title>Initializing the distance matrix</title>
        <p>We initialize each distance dM(a, b) with the maximal possible value, i.e. with |Mab̄| + |Māb|. Initially, the distances are equal to 0; this can be done in O(|J| × |J|).</p>
<p>Then, for each m ∈ M we increment by 1 the distances dM(a, b) with a ∈ m and b ∉ m. As shown in the previous subsection, this can be done in O(|J| × ||M||).</p>
        <sec id="sec-3-2-1">
          <title>Algorithm 1: InitDistance(M)</title>
<p>Data: a sets collection M defined over J.</p>
          <p>Result: the distance matrix dM such that for all (a, b) in J² we have dM(a, b) = |Mab̄| + |Māb|.</p>
          <p>begin
  foreach a ∈ J do
    foreach b ∈ J do
      dM(a, b) = 0;
  foreach m ∈ M do
    foreach a ∈ m do
      foreach b ∉ m do
        dM(a, b)++;
  return dM;
end</p>
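<p>Algorithm 1 in Python, storing the symmetric triangular matrix as a dictionary over unordered pairs (a representation choice of ours):

```python
from itertools import combinations

def init_distance(M, J):
    """Algorithm 1: dM(a, b) = number of sets of M containing exactly one
    of the items a and b, i.e. |Mab'| + |Ma'b|."""
    d = {frozenset(p): 0 for p in combinations(J, 2)}
    for m in M:
        for a in m:
            for b in J - m:
                d[frozenset((a, b))] += 1
    return d
```
</p>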
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Set Existence Checking Strategy</title>
<p>For each set m of M and each pair (a, b) of J², we check whether ϕa,b(m) belongs to M. In order to check the existence of ϕa,b(m), we choose to store the collection M in the Map structure presented before. The sets of M are the keys, while their associated value is "present".</p>
<p>Beware that since a distance dM(a, b) is initialized with |Mab̄| + |Māb|, this distance should be decremented only when m ≠ ϕa,b(m). Indeed, if m = ϕa,b(m) then m ∉ Mab̄ ∪ Māb.</p>
        <p>Algorithm 2: ComputeDistance(M): Set Existence Checking Strategy
Data: a sets collection M defined over J.
Result: the distance matrix dM.</p>
        <p>begin
  dM = InitDistance(M);
  T = new Map();
  foreach m ∈ M do
    T.put(m, "present");
1 foreach m ∈ M do
2   foreach (a, b) ∈ J² do
      if m ≠ ϕa,b(m) and T.get(ϕa,b(m)) ≠ Nil then
        dM(a, b)−−;
  return dM;
end</p>
        <p>Proposition 5. The Set Existence Checking Strategy has an O(|J|² × ||M||) time complexity.</p>
<p>Proof. Initialization of the matrix is done in O(|J| × ||M||). We suppose that the T Map structure is implemented using Arrays. Thus, the initialization of T is done in O(Σm∈M |m| × |J|) = O(|J| × ||M||). The loop on line 1 does |M| iterations while the loop of line 2 does |J|² iterations. The ϕa,b(m) operation, as well as the test m ≠ ϕa,b(m) and the get operation, can be done in O(|m|) time. The overall complexity is thus O(Σm∈M |J|² × |m|) = O(|J|² × ||M||).</p>
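<p>A compact Python rendering of the set existence checking strategy. For brevity this sketch uses a hash set in place of the lexicographic tree and counts the missing images directly (the incrementation variant discussed earlier), which yields the same matrix; names are ours:

```python
from itertools import combinations

def compute_distance_existence(M, J):
    """For each m and each pair (a, b), test whether phi_a,b(m) belongs
    to the collection; count the misses."""
    coll = set(M)
    d = {frozenset(p): 0 for p in combinations(J, 2)}
    for m in coll:
        for a, b in combinations(J, 2):
            if (a in m) == (b in m):
                continue  # m = phi_a,b(m): cannot contribute to dM(a, b)
            x, y = (a, b) if a in m else (b, a)  # x in m, y not in m
            if (m - {x}) | {y} not in coll:      # image phi_x,y(m) missing
                d[frozenset((a, b))] += 1
    return d
```
</p>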
<p>
          This strategy is a slight improvement of the one presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Note that it can be improved a little more by choosing a in m and b in J \ m.
        </p>
</sec>
      <sec id="sec-3-3-1">
        <title>ϕa,b Relation Checking Strategy</title>
        <p>For any pair of elements (m, m′) of M, we check if there exist a and b such that m = ϕa,b(m′). According to Proposition 2, items a and b exist if and only if m and m′ have the same size and share |m| − 1 items. In this case, dM(a, b) is decremented by 1.</p>
        <p>Proposition 6. The ϕa,b relation checking strategy has an O((|J| + |M|) × ||M||) time complexity.</p>
        <p>Proof. Initialization of the matrix is done in O(|J| × ||M||). The external loop (line 1) does |M|² iterations. All tests of line 2 can be done in O(|m|) provided that the sets m and m′ are stored sorted according to &lt;J. The complexity of the loop is in O(|M| × ||M||). The overall complexity is thus in O((|J| + |M|) × ||M||).</p>
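<p>The relation checking strategy in Python (names ours). Where Algorithm 3 scans ordered pairs of M² and decrements by 1, this sketch scans each unordered pair once and decrements by 2, which is equivalent:

```python
from itertools import combinations

def compute_distance_relation(M, J):
    """Apply the Proposition 2 test to every pair of sets of M and
    decrement from the initial values of Algorithm 1."""
    d = {frozenset(p): 0 for p in combinations(J, 2)}
    for m in M:                       # initialization (Algorithm 1)
        for a in m:
            for b in J - m:
                d[frozenset((a, b))] += 1
    for m, m2 in combinations(M, 2):  # Proposition 2, condition 3
        if len(m) == len(m2) and len(m - m2) == 1 and len(m2 - m) == 1:
            (a,), (b,) = m - m2, m2 - m
            d[frozenset((a, b))] -= 2  # both m and m2 have their image in M
    return d
```
</p>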
      </sec>
      <sec id="sec-3-4">
        <title>Classes Computation Strategy</title>
<p>Algorithm 3: ComputeDistance(M): ϕa,b Relation Checking Strategy
Data: a sets collection M defined over J.
Result: the distance matrix dM.</p>
        <p>begin
  dM = InitDistance(M);
1 foreach (m, m′) ∈ M² do
2   if |m| = |m′| and |m \ m′| = 1 and |m′ \ m| = 1 then
      dM(m \ m′, m′ \ m)−−;
  return dM;
end</p>
        <p>This strategy also relies on Proposition 2. Let us consider m, m′ and m″, sets of M such that m = ϕa,b(m′) and m = ϕa,c(m″). Then, according to Proposition 2, we have m \ {a} = m′ \ {b} = m″ \ {c}, and thus m′ = ϕb,c(m″). The idea of the algorithm is to compute classes of sets mi of M having |mi| − 1 common items. A class C can thus be represented by the set of common items, and we memorize in a set Union all the extra items xi which are not common. In the Map structure we use, the set C will be the key while the set Union will be the value associated to the key.</p>
<p>Then, for any m ∈ C and for any pair (a, b) of items of Union, we know by Proposition 2 that ϕa,b(m) ∈ M and m ≠ ϕa,b(m). Thus, dM(a, b) has to be decremented. Note that a set m can belong to at most |m| classes.</p>
<p>The algorithm is quite straightforward using the Map structure. For each set m of M we insert each of its |m| subsets of size |m| − 1 in the Map structure. If the key was already present, we append the extra item of m to the set Union and update all the necessary entries in the distance matrix. Otherwise, a new key m \ {x} is inserted in the Map structure and its associated Union value is initialized with {x}.</p>
<p>Proposition 7. The classes computation strategy has an O(|J| × ||M||) time complexity.</p>
        <p>Proof. Initialization of the distance matrix is done in O(|J| × ||M||). We suppose that the Map structure is implemented using a lexicographic tree with Lists. The line 1 loop does |M| iterations. The loop in line 2 does |m| iterations. In line 3, the retrieval is done in O(|J|). The loop in line 4 does at most |J| iterations, and the update of the matrix takes constant time. The insertion of line 5 is done in O(|J|). Thus, the overall complexity is in O(Σm∈M |m| × |J|) = O(|J| × ||M||).</p>
        <p>Let us illustrate Algorithm 4 with an example. The considered collection
is M = {m1 = {a, e, f, h}, m2 = {b, e, f, h}, m3 = {b, d, f, h}}. The resulting
lexicographic tree obtained with Algorithm 4 is shown in Figure 2.</p>
<p>From this tree, we conclude that m2 and m3 share the items {b, f, h} and that m2 = ϕd,e(m3). Thus, dM(d, e) should be decremented. For the same reason, the distance dM(a, b) should also be decremented since m1 = ϕa,b(m2) (they share the items {e, f, h}).</p>
        <p>Algorithm 4: ComputeDistance(M): Classes Computation Strategy
Data: a sets collection M defined over J.
Result: the distance matrix dM.</p>
        <p>begin
  dM = InitDistance(M);
  T = new Map();
1 foreach m ∈ M do
2   foreach x ∈ m do
      C = m \ {x};
3     Union = T.get(C);
      if Union ≠ Nil then
4       foreach y ∈ Union do
          dM(x, y)−−;
        Union = Union ∪ {x};
        T.put(C, Union);
      else
5       T.put(C, {x});
  return dM;
end</p>
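<p>Algorithm 4 in Python, with an ordinary dict standing in for the lexicographic tree Map (names ours). Each time a key C is found again, the new extra item x is matched against every item already in Union; this sketch decrements by 2 because each such match witnesses that both related sets have their image in M:

```python
from itertools import combinations

def compute_distance_classes(M, J):
    """Group the sets sharing |m| - 1 items under the key C = m minus {x};
    Union collects the extra items (Algorithm 4)."""
    d = {frozenset(p): 0 for p in combinations(J, 2)}
    for m in M:                  # initialization (Algorithm 1)
        for a in m:
            for b in J - m:
                d[frozenset((a, b))] += 1
    T = {}                       # Map: key C, value Union
    for m in M:
        for x in m:
            C = m - {x}
            union = T.get(C)
            if union is not None:
                for y in union:  # each y yields a set m2 with m = phi_x,y(m2)
                    d[frozenset((x, y))] -= 2
                union.add(x)
            else:
                T[C] = {x}
    return d
```
</p>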
      </sec>
      <sec id="sec-3-5">
        <title>Memory usage</title>
        <p>In this section we discuss the space complexity required by each strategy and
show how to reduce the memory usage when possible.</p>
        <p>First, it is obvious that the distance matrix should be present in memory. It
requires O(|J |2) memory space.</p>
        <p>Concerning the Set Existence Checking Strategy, a lexicographic tree
implemented with arrays is used. The number of nodes is clearly bounded by ||M||.
And since each node requires a |J | sized array, this strategy uses O(|J | × ||M||)
memory space.</p>
        <p>For the ϕa,b Relation Checking Strategy, no lexicographic tree is used.
However, the collection M needs to be present in memory. Thus, this strategy
requires O(||M||) memory space.</p>
        <p>Now, for the Classes Computation Strategy, the used lexicographic tree is
implemented using lists. Such implementation requires O(||M||) memory space.
But M is not the collection stored in the tree. Indeed, for each m ∈ M, we store
its |m| subsets of size |m| − 1. And since |m| is bounded by |J | we conclude that
this strategy requires O(|J | × ||M||) memory space.</p>
<p>The conclusion seems to be that whatever strategy is used, at least O(||M||) memory space will be needed. But one can notice that, according to Proposition 2, not all sets in M need to be present in memory at the same time. Indeed, if m = ϕa,b(m′), then m and m′ have the same size. The idea is then to partition M according to the size of its sets: only sets of the same size need to be present in memory at the same time. Computations done for those sets are totally independent of computations done for sets of a different size. Note that the partitioning of M can be done in O(||M||) time, with a single scan of the collection. Thus, the partitioning does not change the overall complexity of the different strategies.</p>
        <p>[Figure 2: the lexicographic tree built by Algorithm 4 on the example collection; marked nodes carry their Union values, e.g. (a, b) for the key {e, f, h} and (d, e) for the key {b, f, h}.]</p>
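<p>The partitioning by size is a single scan of the collection (sketch, names ours):

```python
def partition_by_size(M):
    """Group the sets of M by cardinality: only sets of equal size can be
    related by the swap mapping, so each group is processed independently."""
    parts = {}
    for m in M:
        parts.setdefault(len(m), []).append(m)
    return parts
```
</p>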
<p>Another remark is that since the computations by size are totally independent, one could easily implement a distributed version of the strategies, which could speed up the computation of the distance matrix. Note that in this case the distance matrix should also be distributed. The best solution would be to initialize local distance matrices with the local partition. After the local computation (for sets of the same size), all the distance matrices should be merged together (by adding all the obtained values) in order to obtain the global distance matrix.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Clone classes computation</title>
      <p>
        The problem stated in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is the computation of clone classes. We thus
give the algorithm which computes clone classes from a distance matrix.
      </p>
<p>First, let us recall that two items a and b are clone if and only if dM(a, b) = 0. Since the clone relation is an equivalence relation, it defines a partition of the set J.</p>
<p>The principle of our Algorithm 5 is the following. Let a ∈ J be an item which has not yet been assigned to a class. We search all remaining items b whose distance to a is zero. All those items form a clone class with a and are thus removed from the list of items not yet assigned to a class. The class of a is then stored in a list L. The total complexity of the clone classes computation is in O(|J| × ||M||) time and space.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
<p>We have seen that if the problem is the computation of the distance matrix and only basic incrementations or decrementations by 1 are allowed, then the minimal time complexity for the update of the matrix is in O(|J| × ||M||). Thus, under this hypothesis, our algorithm is optimal. Figure 3 gives the different time and space complexities obtained with our algorithms.</p>
      <sec id="sec-5-1">
        <title>Algorithm 5: ComputeCloneClasses(M)</title>
        <p>Data: a sets collection M defined over J.</p>
        <p>Result: the list L of clone classes.</p>
        <p>begin
  dM = ComputeDistance(M);
  L = ∅; temp = J;
  while temp ≠ ∅ do
1   foreach a ∈ temp do
      la = newList(); la = la ∪ {a}; temp = temp \ {a};
2     foreach b ∈ temp do
3       if dM(a, b) = 0 then
4         la = la ∪ {b};
5         temp = temp \ {b};
      L = L + la;
  return L;
end</p>
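<p>Algorithm 5 in Python, grouping items at pairwise distance zero; the distance matrix is taken as a dict over unordered pairs, as in our earlier sketches (names ours):

```python
def compute_clone_classes(J, d):
    """Partition J into clone classes: a and b are clone iff d[{a, b}] == 0."""
    L = []
    temp = set(J)
    while temp:
        a = temp.pop()       # item not yet assigned to a class
        cls = {a}
        for b in list(temp):
            if d[frozenset((a, b))] == 0:
                cls.add(b)   # b joins the clone class of a
                temp.remove(b)
        L.append(cls)
    return L
```
</p>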
        <p>An open question is to know whether or not the distance matrix could be
incremented (or decremented) by more than 1 at each step. This could be the
only way of improving our algorithm. Another open question is to know if there
is a more efficient algorithm to compute the clone classes (for instance without
computing the distance matrix). Those are two questions we are investigating.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gely</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Medina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nourine</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Renaud</surname>
          </string-name>
          .
          <article-title>Uncovering and reducing hidden combinatorics in guigues-duquenne covers</article-title>
          .
          <source>In ICFCA'05</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
<ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.-J.</given-names>
            <surname>Räihä</surname>
          </string-name>
          .
          <article-title>On the complexity of inferring functional dependencies</article-title>
          .
          <source>Discrete Applied Mathematics</source>
          ,
          <volume>40</volume>
          (
          <issue>2</issue>
          ):
          <fpage>237</fpage>
          -
          <lpage>243</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Medina</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Nourine</surname>
          </string-name>
          .
          <article-title>Clone items: a pre-processing information for knowledge discovery</article-title>
          . submitted.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>