Efficient algorithms for clone items detection

Efficient algorithms for clone items detection RaoulMedina medina@isima.fr LIMOS -Université Blaise Pascal

Campus universitaire des Cézeaux Clermont-Ferrand France

LIMOS -Université Blaise Pascal

Campus universitaire des Cézeaux Clermont-Ferrand France

CarolineNoyer LIMOS -Université Blaise Pascal

Campus universitaire des Cézeaux Clermont-Ferrand France

LIMOS -Université Blaise Pascal

Campus universitaire des Cézeaux Clermont-Ferrand France

OlivierRaynaud raynaud@isima.fr LIMOS -Université Blaise Pascal

Campus universitaire des Cézeaux Clermont-Ferrand France

LIMOS -Université Blaise Pascal

Campus universitaire des Cézeaux Clermont-Ferrand France

Efficient algorithms for clone items detection A2FFACD1DE7DF271A001D5C5407664D8 GROBID - A machine learning software for extracting information from scholarly documents

This paper presents efficient algorithms for clone items detection in a binary relation. Best implementation of these algorithms has O(|J|.||M||) time complexity, with J corresponding to the set of items of the relation and ||M|| corresponding to the size of the relation. This result improves the previous algorithm given in [3] which is O(|J| 2 .||M||).

Clone items have been introduced by Medina and Nourine to explain why, sometimes, the number of rules of a minimum cover of a relation is exponential with the number of items of the relation.

Introduction

The clone items notion has been introduced by Medina and Nourine in [3]. Aim of their paper was to understand the combinatorial explosion that might arise in a minimum basis of a relation. Most famous example of such exponential minimum basis is given by Manilla and Rähiä in [2]. Medina and Nourine noticed that in this example some items play symetrical role in the basis. Indeed, for each rule containing a given item a in the antecedent there is a symetrical rule where item a is replaced by item b. Such symetrical items are said to be clone items. The clone notion is an equivalence relation and thus classes of clone items can be found in a binary relation. In [3], the authors show how to compute such clone classes and, from those classes, how to reduce the binary relation in order to obtain a minimum basis with no clone items. The detection algorithm as well as the reduction algorithm have both polynomial time complexities. The obtained minimum basis is smaller than the original one (in the case of the Mannila and Rähiä example it reduces to a single rule) and this later can be reconstructed from the clone free minimum basis and the clone classes information in polynomial time.

Recently, Gely et al. (in [1]) extended the notion of clone items (which are defined on closed sets) to the notion of P-clone items (defined on pseudo-closed sets) and to the notion of A-clone items (defined on what they call an "atomized" context). Their approach could be generalized to any generation problem where, from a given relation, one wants to compute a (potentially exponential) collection of sets verifying a property (e.g. the sets must be closed, or the sets must be ideals, or the sets are pseudo-closed sets, etc...). The idea is then to reduce the combinatorial explosion of the wanted collection by representing it by a clone free collection and the classes of clone items of this collection. The problem is then to be able to compute the clone classes of the (potentially exponential) collection without generating its sets. To solve this problem in polynomial time, the general method could be as follows:

-Let M be the (potentially exponential) collection of sets verifying a property over a given binary relation R. This paper is organized as follows: section 2 formally defines the problem in terms of collection of sets and introduces the corresponding Abstract Data Type which will be used by our algorithms. Section 3 describes three computation strategies and the corresponding time complexities are studied. Section 4 shows how to take advantage of those algorithms in order to compute the clone items classes as defined in [3].

General context and definitions

In this section, we first formally define the studied problem. Then we present the Abstract Data Structure called Map used in our algorithms and discuss on its possible implementations.

Clone items in a Sets Collection

Let J be a set of items {x 1 , ..., x |J| } and M a sets collection on J. We denote by ϕ a,b : 2 J → 2 J the mapping which associates to any subset of J its image by swapping items a and b. More formally :

ϕ a,b (m) =    (m \ {a}) ∪ {b} if b ∈ m and a ∈ m (m \ {b}) ∪ {a} if a ∈ m and b ∈ m m otherwise Definition 1.

Let M be a collection of sets defined on J. We say that items a and b are clone items in M if and only if ∀m ∈ M, ϕ a,b (m) ∈ M.

The clone items concept is a binary one. To the question "are a and b clone items ?", only the answer "true" or "false" is possible. It could be interesting to have more precisions when the negative response is given. Are a and b very far from being clone items or why are not they clone ? For this purpose, we introduce a measure to qualify the clone property. This measure will represent a distance between two items a and b, based on the definition of the clone items property. This distance is exactly the number of elements m of M which do not have an image in M when applying the swapping function ϕ a,b (m). When this distance is zero, a and b are clone items. The greater the distance is, the farther a and b are to be clone. More formally: Definition 2. Let M be a sets collection on J and let (a, b) in J 2 . We call distance between a and b, denoted by d M (a, b), the mapping : The problem we study in this paper is the computation of distances between all pair of items of J:

J 2 → N d M (a, b) → {m ∈ M | ϕ a,b (m) ∈ M}Problem 1 (Distance).

Data : a sets collection M on J; Result : the matrix d M .

Here, we present the main property on which rely our algorithms. It characterizes a couple (m, m ) of the sets collection M such that m = ϕ a,b (m ).

Proposition 2. Let M be a sets collection defined over J, m and m two distinct sets of M and (a, b) two items of J such that a ∈ m and b ∈ m . Then the following assertions are equivalent:

1. m = ϕ a,b (m ) 2. m = ϕ a,b (m) 3. |m| = |m | and m \ m = {a} and m \ m = {b} 4. m \ {a} = m \ {b}

This property states that two sets m and m are their respective images by the swapping function ϕ if and only if they have same size t and share t − 1 items. This property follows directly from the definition of the ϕ mapping.

Abstract Data Type : Key Mapping

Interface. We use a Map abstract data type similar to the Map interface of Java language. This data structure maps keys to values. In our case, the keys are the sets of the collection. The values mapped by the keys depend on the algorithm.

This abstract data type supplies the following operators:

new() operator: creates a Map object and returns an empty map.

get(e) operator: returns the value associated to the key e if this key maps a value, or Nil otherwise. put(e,value) operator: inserts set e in the map and associates value to it. Time complexities of those operators deeply rely on the data structure used for the implementation of the Map data type. We propose an implementation which takes advantage of the type of the keys, i.e. sets. To implement the Map type we propose a lexicographic tree: itemsets are represented by branches of the tree.

Implementation: lexicographic tree. We first give a formal definition of a lexicographic tree corresponding to a sets collection. Definition 3. Let M be a sets collection defined over J, with a total order on J denoted by < J . A unique lexicographic tree is associated to M such that:

-Each edge of the tree is labeled with an element of J; -To each marked node of the tree corresponds a set of M; -To each set m of M correspond a unique path in the tree (starting from root and ending with a marked node) such that the union of labels in this path corresponds exactly to the set m. -For any path from the root to any node, the order of the successive labels respects the order defined by < J ; -The order of edges leaving a node respects the order defined by < J . Figure 1 gives an example of collection and its associated lexicographic order.

¡ ¢ £ ¤ ¥ ¦ § © ! ! " " # # $ $

% & ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( )

) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ A A A boolean can be associated to a node of the tree in order to indicate if the node is marked (i.e. corresponds to a set of the collection and thus to a key) or not. Any field type can be associated to the node for storing the value associated to the key (on marked nodes only). There are two ways of implementing the list of children of a node in the lexicographic tree:

1 1 1 1 1 1 1 1 2 2 2

A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A

-Using a List structure, each entry of the list containing the label of the associated edge and a reference to the child; -Using an Array structure being indexed by the labels and the entries containing either NIL or reference to the child.

Complexities of the put and get operators rely on the chosen implementation. Acces complexity is due to the fact that we have direct access to a child labeled by a given item. Creation of a node of the tree costs |J| since we need to create and initialize a |J| sized array.

Strategies and algorithms

In this section, we study three different strategies for solving the Distance problem. All strategies rely on the same basic idea: first the distance matrix is initialized with the maximum possible values for each distance d M (a, b). Then, each time the algorithm finds m and m such that m = ϕ a,b (m ), the distance d M (a, b) is decreased by one. Main difference between the three strategies is the way they detect that m = ϕ a,b (m ).

The first strategy is the one given in [3]. We call it set existence checking. Main idea of the algorithm is, given m ∈ M, to compute all possible sets ϕ a,b (m) and then check if these sets belong to the collection. Second strategy is called ϕ a,b relation checking. Its principle is, for each pair (m, m ), to check if there exist items a and b such that m = ϕ a,b (m ). This test is done using Property 2. Third strategy is called classes computation. Based on Property 2, it computes classes C of itemsets such that for any pair (m, m ) of C, there exist items a and b such that m = ϕ a,b (m ).

We first discuss about the initialization of the distance matrix since this is a common ground for all strategies.

Discussion on the distance matrix

Since the notion of distance d M (a, b) is a symetrical one, the distance matrix is also symmetrical. We thus choose to represent it by a triangular matrix such that:

d M (a, a) = 0 d M (a, b) = d M (b, a)

In this section we discuss the different strategies for the initialization and the update of the distance matrix, as well as their respective costs.

Let and thus m cannot increase the distance between a and b. This represents the maximum number of basic operations (either incrementation or decrementation) needed to compute the distance matrix in the worst case. Thus, whatever strategy we choose, the number of basic operations will be the same in the worst case.

What is the best time complexity for both strategies ? We consider that the basic operations can be done in O (1). Thus the overall complexity will be:

O( a∈J b∈J |M ab | + |M ab |).

Now, consider m ∈ M. In the worst case, m will be taken into account in at most |m| × |J \ m| distances d M (a, b). Indeed, for m to be taken into account in d M (a, b) either a or b belongs to m but not both. Thus, we have:

O( a∈J b∈J |M ab | + |M ab |) = O( m∈M |m| × |J \ m|).

This can be rewritten as follows:

O( m∈M |m| × (|J| − |m|)) = O(( m∈M |m| × |J|) − ( m∈M |m| × |m|)).

And since |m| × |m| is lesser or equal than |m| × |J|, we obtain the worst case complexity:

m∈M |m| × |J|) = O(|J| × ||M||).

Thus, whatever update strategy is chosen, time complexity cannot be less than O(|J| × ||M||) (upon the hypothesis that the basic operations increment or decrement by 1).

In this paper we choose to adopt the decrementation strategy since our algorithms are based on Property 2. We now discuss the complexity of the initialization of the distance matrix. This strategy is a slighter improvement of the one presented in [3]. Note that it can be improved a little more by choosing a in m and b in J \ m.

Classes Computation Strategy

This strategy also relies on Property 2. Let consider m, m and m be sets of M such that m = ϕ a,b (m ) and m = ϕ a,c (m ). Then, according to Property

h h f h h h h h f (h) (f) (e) (h) (f) (h) (f) (d,e) (b) (a,b)

Fig. 2. The lexicographic tree corresponding to the collection M = {{a, e, f, h}, {b, e, f, h}, {b, d, f, h}}. Circled nodes correspond to sets of M. The Union value associated to these node is presented in parenthesis.

those sets is totally independent of computations done for sets with a different size. Note that the partitionning of M can be done in O(||M||) time complexity, with a single scan of the collection. Thus, the partitionning does not change the overall complexity of the different strategies.

Another remark is that since computations by size are totally independent, one could easily implement a distributed version of the strategies. This could eventually speed up the computation of the distance matrix. But note that in this case, the distance matrix should also be distributed. Best solution should be to initialize local distance matrices with the local partition. After local computation (for sets with same size), all the distance matrices should be merged together (by adding all the obtained values) in order to obtain the global distance matrix.

Clone classes computation

The problem stated in [1] and in [3] is the computation of clone classes. We thus give the algorithm which computes clone classes from a distance matrix.

First, let us recall that two items a and b are clone if and only if d M (a, b) = 0. Since, the clone relation is an equivalence relation, it defines a partition of the set J.

Principle of our algorithm 5 is the following. Let a ∈ J be an item which has still not been assigned to a class. We then search all remaining items b which distance with a is null. All those items will form a clone class with a and thus are removed from the list of items which are not assigned to a class. The class of a is then stored in a list L. Total complexity of the clone classes computation is in O(|J| × ||M||) time and space complexity.

Conclusion

We have seen that if the problem is the computation of the distance matrix and only basic incrementation or decrementation by 1 are allowed, then the minimal

Thanks to definition 2 ,2clone items could be charaterized as follows : Proposition 1. Let M be a sets collection on J and (a, b) in J 2 , a and b are clone items if and only if d M (a, b) = 0.

Fig. 1 .1Fig.1. The lexicographic tree corresponding to the collection {abc, ad, bcde, cd, cde, e}, with J = {a, b, c, d, e} and <J being the alphabetical order of the items. Circled nodes are the marked nodes corresponding to sets of the collection.

Proposition 3 .-3If the lexicographic tree is implemented with Lists then: The put(m,value) operator as an O(|J|) time complexity; -The get(m) operator as an O(|J|) time complexity; Access complexity is due to the fact that an item appears only once in a set. Cost of a node creation is done in constant time. Proposition 4. If the lexicographic tree is implemented with Arrays then: -The put(m,value) operator as an O(|m| × |J|) time complexity; -The get(m) operator as an O(|m|) time complexity;

M = M ab + M ab + M ab + M ab , with: -M ab : the sets m of M such that a ∈ m and b ∈ m -M ab : the sets m of M such that a ∈ m and b ∈ m -M ab : the sets m of M such that a ∈ m and b ∈ m -M ab : the sets m of M such that a ∈ m and b ∈ m. Incrementation or decrementation strategy ? There are two ways of computing the distance matrix: -Either by first initializing all distances to 0 and then by increasing by 1 the distance d M (a, b) each time we find m ∈ M such that ϕ a,b (m) ∈ M, -or by first initializing the distances by a maximal value and then decreasing by 1 the distance d M (a, b) each time we find m and m ∈ M such that m = ϕ a,b (m ). What is the maximal value that can be taken by d M (a, b) ? Clearly the answer is |M ab | + |M ab |. Indeed, for any item m ∈ M ab ∪ M ab we have m = ϕ a,b (m)

Initializing the distance matrix. We initialize each distance d M (a, b) with the maximal value possible, i.e. with |M ab | + |M ab |. Initially, the distances are equal to 0. This can be done in O(|J| × |J|). Then, for each m ∈ M we increment by 1 the distances d M (a, b), with a ∈ m and b ∈ m. As shown in previous subsection, this can be done in O(|J| × ||M||).

Algorithm 1 :1InitDistance(M)Data : A sets collection M defined over J. Result : The distance matrix dM such that for all a and b in J 2 we have dM(a, b) = |M ab | + |M ab |. begin foreach a ∈ J do foreach b ∈ J do dM(a, b) = 0; foreach m ∈ M do foreach a ∈ m do foreach b ∈ m do dM(a, b) + +; return dM; end 3.2 Set Existence Checking Strategy For each set m of M and for all pair (a, b) of J 2 , we check if ϕ a,b (m) belongs to M. In order to check the existence of ϕ a,b (m), we choose to store the collection M in the Map structure presented before. The sets of M are the keys while their associated value is "present". Beware that since a distance d M (a, b) is initialized with |M ab | + |M ab |, this distance should be decremented only when m = ϕ a,b (m). Indeed, if m = ϕ a,b (m) then m ∈ M ab ∪ M ab .

Algorithm 2 : 1 foreach m ∈ M do 2 foreach212ComputeDistance(M) : Set Existence Checking Strategy Data : A sets collection M defined over J. Result : The distance matrix dM. begin dM = InitDistance(M); T = new Map(); foreach m ∈ M do T .put(m,"present"); couple(a, b) ∈ J 2 do if T .get(m) = N IL and m = ϕ a,b (m) then dM(a, b) − −; return dM; end Proposition 5. The Set Existence Checking Strategy has an O(|J| 2 × ||M||) time complexity. Proof. Initialization of the matrix is done in O(|J| × ||M||). We suppose that the T Map structure is implemented using Arrays. Thus, the initialization of T is done in O( m∈M |m| × |J|) = O(|J| × ||M||). Loop on line 1 does |M| iterations while loop of line 2 does |J| 2 iterations. The ϕ a,b (m) operation as well as the test m = ϕ a,b (m) and the get(m) operation can be done in O(|m|) time complexity. Overall complexity is thus, O( m∈M |J| 2 × |m|) = O(|J| 2 × ||M||).

3. 33ϕ a,b Relation Checking Strategy For any pair of elements (m, m ) of M, we check if there exist a and b such that m = ϕ a,b (m ). According to Property 2, items a and b exist if and only if m and m have same size and share |m| − 1 items. In this case, d M (a, b) is decremented by 1. Proposition 6. The ϕ a,b relation checking strategy has an O((|J| + |M|) × ||M||) time complexity. Proof. Initialisation of the matrix is done in O(|J| × ||M||). External loop (line 1) does |M| 2 iterations. All tests of line 2 can be done in O(|m|) provided that the sets m and m are stored sorted according to < J . The complexity of the loop is in O(|M| × ||M||). Overall complexity is thus in O((|J| + |M|) × ||M||).

Compute in polynomial time a collection M such that: 1. the size of M is polynomial in the size of R, 2. items a and b are clone in M if and only if they are clone items in M . -Detect the clone classes in M . Our paper focuses on the detection phase of this approach. The clone classes computation algorithm given in [3] has an O(|J| 2 .||M||) time complexity, where J is the set of items and ||M|| is the size of the input collection. In other words, ||M|| = m∈M |m|. In this paper, we present different algorithms to solve the clone classes computation. The best complexity we obtain is in O(|J|.||M||).

Memory usage

In this section we discuss the space complexity required by each strategy and show how to reduce the memory usage when possible.

First, it is obvious that the distance matrix should be present in memory. It requires O(|J| 2 ) memory space.

Concerning the Set Existence Checking Strategy, a lexicographic tree implemented with arrays is used. The number of nodes is clearly bounded by ||M||. And since each node requires a |J| sized array, this strategy uses O(|J| × ||M||) memory space.

For the ϕ a,b Relation Checking Strategy, no lexicographic tree is used. However, the collection M needs to be present in memory. Thus, this strategy requires O(||M||) memory space. Now, for the Classes Computation Strategy, the used lexicographic tree is implemented using lists. Such implementation requires O(||M||) memory space. But M is not the collection stored in the tree. Indeed, for each m ∈ M, we store its |m| subsets of size |m| − 1. And since |m| is bounded by |J| we conclude that this strategy requires O(|J| × ||M||) memory space.

The conclusion seems to be that whatever strategy is used, at least O(||M||) memory space will be needed. But one can notice that, according to Property 2, not all sets in M need to be present in memory at the same time. Indeed, if m = ϕ a,b (m ), then m and m have same size. The idea is then to do a partitionning of M according to the size of the sets. Thus, only sets of same size need to be present in memory at the same time. Computations done for return L; end time complexity for the update of the matrix is in O(|J| × ||M||). Thus, under this hypothesis, our algorithm is optimal. Figure 3 gives the different time and space complexities obtained with our algorithms. An open question is to know whether or not the distance matrix could be incremented (or decremented) by more than 1 at each step. This could be the only way of improving our algorithm. Another open question is to know if there is a more efficient algorithm to compute the clone classes (for instance without computing the distance matrix). Those are two questions we are investigating.

Uncovering and reducing hidden combinatorics in guigues-duquenne covers AGely RMedina LNourine YRenaud ICFCA' 2005 05 On the complexity of inferring functionnal dependencies HManilla KJRähiä Discret Applied Mathematics 40 2 1992 Clone items: a pre-processing information for knowledge discovery RMedina LNourine submitted