Automatic Construction and Re nement of a Class Hierarchy o ver Semistructured Data

Introduction

In many applications, it becomes crucial to help users to access to a huge amount of data by clustering them in a small number of classes described at an appropriate level of abstraction. In this paper, we present an approach based on the use of two languages of description of classes for the automatic clustering of semistructured data. The rst language of classes has a high power of abstraction and guides the construction of a lattice of classes covering the whole set of the data. The second language of classes, more expressive and more precise, is the basis for the re nement of a part of the lattice that the user wants to focus on. Our approach has been implemented and experimented on real data in the setting of the GAEL project 1 which aims at building exible electronic catalogs organized as a hierarchy of classes of products. Our experiments have been conducted on real data coming from the C/Net (http://www.cnet.com) electronic catalog of computer products.

Languages of instances and classes

In this section, we de ne the language of instances, L 1 , in which w e describe semistructured data, and the two languages of classes L 2 and L 3 that we use to describe classes over those data, at two levels of abstraction. First, we provide some notations and preliminaries.

Preliminaries and notations

Given a language of instances, a language of classes L de nes the expressions that are allowed as class descriptions. A class description is intended to represent i n a n abstract and concise way the properties that are common to the set of its instances. A membership relation, denoted by isa L , establishes the necessary connection between a given language of instances and an associated language of classes L. De nition 1 (Extension of a class description) Let I be a set of instances, and C a L class description.

The extension of C w.r.t I is the following set: ext I (C) = fi 2 I j i isa L Cg 1 GAEL is funded by the French Ministry of Industry it is a joint project with the Verso database group of INRIA and a startup, MatchVision, specialized in Electronic Commerce

The subsumption relation is a preorder relation between class descriptions, induced by the inclusion relation between class extensions.

De nition 2 (Subsumption between classes)

Let C 1 and C 2 be t w o L class descriptions. C 1 is subsumed b y C 2 , denoted C 1 L C 2 , i for every set I of instances, ext I (C 1 ) ext I (C 2 ).

In sections 2.3, and 2.4, we will provide a constructive characterization of subsumption for the two languages of classes that we consider.

The notion of abstraction of an instance in a language of classes L corresponds, when it exists, to the most speci c class description in L which it is an instance of. De nition 3 (Abstraction of an instance) Let i be an instance, the L class description C is an abstraction of i in L (for short C = abs L (i)) i 1. i isa L C, and 2. if D is a class description such that i isa L D, t h e n C L D. The notion of least common subsumer (a.k.a msg) will be the basis for gathering classes in our clustering algorithm.

De nition 4 (Least Common Subsumer) Let C 1 : : : C n be class descriptions in L. The L class description C is a least common subsumer of C 1 : : : C n in L (for short C = lcs L (C 1 : : :

C n )) i 1. C i L C for all 1 i n, a n d 2. if D is a class description satisfying C i L D for all 1 i n, then C L D

The language of instances

The data that serve as instances of the classes that we build are semistructured data in the following sense: each data is described by a set of pairs (Attribute, Values) but the attributes used for describing the data may vary from an item to another. In addition, each d a t a i s typed (i.e., labelled by the name of a basic type).

De nition 5 (Terms of L 1 ) Let B be a nite set of basic types, A a nite set of attributes, a n d V a set of values. A term of L 1 is of the form: fc att 1 = V 1 : : : a t t n = V n g where c 2 B , 8i 2 The connection between the language of instances L 1 and the language of classes L 2 is based on the following de nition of the membership relation.

De nition 7 (Membership relation for L 2 ) Let i be an instance description in L 1 . L et C be a L 2 class description: i is an instance o f C i every attribute appearing in C also appears in i.

The following proposition, whose proof is straightforward, characterizes subsumption, least common subsumer and abstraction in L 2 . Proposition 1 (Properties of L 2 ) .

Let C 1 and C 2 be two L 2 class descriptions. C 1 L2

C 2 i every attribute of C 2 is also an attribute of C 1 .

Let fatt 1 = V 1 : : : a t t n = V n g be an instance d escription in L 1 . Its abstraction in L 2 is unique: it is fatt 1 : : : a t t n g.

Let C 1 : : : C n be n L 2 class descriptions. Their least common subsumer is unique: it is made of the set of attributes that are c ommon to all the C i 's.

2.4

The language of classes L 3 L 3 is richer than L 2 on di erent aspects: it makes possible to restrict the possible values of an attribute it enables to distinguish the numb e r o f v alues of an attribute through di erent su xes ( + ? ) whose notation is inspired by the one used in XML for describing document t ype de nitions (DTDs), and whose formal semantics corresponds to standard description logics constructors. In fact, as it will become clearer in the following, : V 1 : : : a t t suf fn n : V n g where 8i 2 1::n], att i 2 A , V i V , and suff i 2 f + ? g

The following de nition formalizes the membership relation between an instance and a class description in L 3 .

De nition 9 (Membership relation for L 3 ) Let i be an instance description in L 1 . L et C be a L 3 class description. i is an instance o f C i every attribute in i appears in C and for every term att suf f : V appearing in C, -when suff = , i f t h e r e exists V 0 s.t att = V 0 2 i, t h e n V 0 V , -when suff = + , there exists V 0 V s.t att = V 0 2 i, -when suff =?, if there exists V 0 s.t att = V 0 2 i, t h e n V 0 is a singleton and V 0 V , -when suff = , there exists V 0 singleton s.t V 0 V and att = V 0 2 i.

The product described in 2. i all the attributes appearing in C 1 appear also in C 2 and for every pair att suf f : V appearing in C 2 , -when suff = , i f t h e r e e x i s t s att suf f 0 : V 0 2 C 1 , t h e n V 0 V , -when suff = + , there exists V 0 V s.t att + : V 0 2 C 1 or att : V 0 2 C 1 -when suff = ?, if there e x i s t s att suf f 0 : V 0 2 C 1 , t h e n suff 0 = ? o r suff 0 = , a n d V 0 V , -when suff = , there exists V 0 s.t V 0 V and att :

V 0 2 C 1 .

The complexity o f c hecking subsumption in L 3 is quadratic w.r.t the maximal size of class descriptions.

Proposition 3 (Characterization of abstraction in L 3 )

Let fatt 1 = V 1 : : : a t t n = V n g be an instance description in L 1 . Its abstraction in L 3 is unique: abs L3 = fatt suf f1 1 : V 1 : : : a t t suf fn n : V n g, where 8i 2 1::n], i f j V i j 2 then suff i = + else suff i = .

Proposition 4 (Characterization of lcs in L 3 ) Let C 1 : : : C n be n L 3 class descriptions. Le t A b e the set of attributes belonging to at least one description C i . C 1 : : : C n have a unique least common subsumer in L 3 , whose description is characterized as follows:

for every attribute att 2 A, l e t V be the union of the sets of values associated with att in the class

descriptions C i 's: V = S n 1 fv 2 V i j att suf f : V i 2 C i g.

{ att :V 2 lcs(C 1 : : : C n ) i att :V i 2 C i 8i 2 Computing the lcs of L 3 descriptions is linear in the number of descriptions and quadratic in their size.

Construction of a lattice of L 2 classes

The goal is to structure a set of data described in L 1 into clusters labelled by L 2 descriptions, and organized in a lattice providing a browsable semantic interface facilitating the access to the data for end-users. We proceed to a two-step clustering:

1. In the rst step, the data are partitioned according to their type: for each t ype c, w e create a basic class named c. Its set of instances, denoted inst(c), is the set of data of type c. Its L 2 description, desc(c), is obtained by computing the least common subsumer of the abstractions of its instances. The result of this step is a set C of basic classes and a set A of attributes supporting the L 2 descriptions of the classes of C. F or each attribute a, the set classes(a)

of basic classes having a in their description is computed. This preliminary clustering step has a linear data complexity.

2. In the second step, a lattice of clusters is constructed by gathering basic classes according to similarities of their L 2 descriptions. In this step, clusters are unions of basic classes. The computational complexity of this step does not depend on the number of initial data but only on the size of the L 2 descriptions of basic classes. We n o w detail this second step. A cluster c i1 : : : c ik will appear in the lattice if the L 2 descriptions of the classes c i1 : : : c ik are judged similar enough to gather their instances. The similarity b e t ween class descriptions is stated by attributes in common. However, we take i n to account only attributes that do not occur in too many (or all the) classes. For instance, the attribute price may appear in all the instances of a catalog describing products, and is therefore not useful to discriminate product descriptions. Among the set A of attributes, we select meaningful attributes as being the attributes a 2 A such that jclasses(a)j jclassesj s where s is a certain threshold (e.g., s = 0 :8). Let A 0 be the set of meaningful attributes. We redescribe all the basic classes in terms of the attributes of A 0 only: for a basic class c, w e call its short description, d e n o t e d shortdesc(c), the L 2 description of c restricted to the meaningful attributes:

shortdesc(c) = desc(c) \ A 0 .

Our clustering algorithm, L 2 -Cluster, is described in Algorithm 1. It is adapted from a frequent i t e m s e t algorithm ( 2]). It iteratively builds levels of clusters, starting with building the level of the coarsest clusters corresponding to unions of classes having atleast one attribute in common. Each iteration k is guided by a ttribute sets of increasing size k which, being common to some class descriptions, are the support of the creation of a potential node gathering those classes. Among those potential nodes, we e ectively add to the lattice those whose L 2 short description is equal to their k-support:

the k-support of a node generated at iteration k is the k-itemset supporting the generation of that node. By doing so, we guarantee that the description of the nodes added to the lattice is strictly subsumed by those of their fathers.

Notation: We c a l l a k-itemset a set of attributes of size k. W e assume that attributes in itemsets are kept sorted in their lexicographic order. We use the notation p i] t o represent the i-th attribute of the k-itemset p consisting of the attributes p 1] : : : p k] where p 1] < : : : < p k].

Figure 1 shows the lattice returned by L 2 -Cluster when it is applied on C = fc 1 c 2 c 3 c 4 c 5 g and A = fa 1 a 2 a 3 a 4 g such that:

shortdesc(c 1 ) = fa 1 a 2 a 3 g shortdesc(c 2 ) = fa 2 g shortdesc(c 3 ) = fa 1 a 3 g shortdesc(c 4 ) = fa 3 a 4 g shortdesc(c 5 ) = fa 1 a 3 g

The following proposition summarizes the properties of the algorithm L 2 -Cluster. Proposition 5 (Properties of L 2 -Cluster) Let H be the lattice r eturned b y L 2 -Cluster.

For each node n 2 H, let shortdesc(n) and classes(n) be r espectively the description and the set Require: a set A 0 of meaningful attributes: for each a 2 A 0 , classes(a) is the set of basic classes of C whose L 2 short description contains a. Ensure: return a lattice organized in levels of nodes.

Each n o d e n is characterized by classes(n): the basic classes it gathers, and shortdesc(n): the least common subsumer of the short description of the basic classes of the cluster. if p 1] = q 1] : : : p k ; 1] = q k ; 1] p k] < q k] then 13: let newp = p f q k]g, and let S k be the set of k-subsets of newp. where fi 1 : :

: i k g = S c2classes(n) inst(c).

H is a Galois lattice, i.e. for every node n, the pair 4 Re nement i n L 3

(classes(n) shortdesc(n)) is

The goal of this step is to re ne a part of the lattice H computed by L 2 -Cluster based on the more expressive language L 3 . T h i s s t e p i s a c hieved after a user chooses one node F a t nand one of its descendants Sonn in H.

Algorithm 2 describes how new nodes are possibly added between Sonn and F a t n . Those new nodes correspond to clusters whose descriptions in L 2 did not distinguish from those of F a t nor Sonn, while having distinct descriptions in L 3 . A closure operation on those nodes is necessary in order to make their L 3 descriptions maximal w.r.t the union of basic classes which they gather.

L 3 -Cluster applies after the descriptions in L 3 (denoted desc3 in Algorithm 2) have been computed for Sonn and Fatn. Those computations are least common subsumer calculations whose overall time cost is polynomial w.r.t to the size and the number of the instances of the basic classes involved in Fatn.

Let us illustrate the application of L 3 -Cluster on the nodes c1 c3 c5 a n d c1 of Figure 1, assuming that the L 3 descriptions of the involved basic classes are: desc(c 1 )=fatt + 1 :fv1 v 3g att + 2 :fv2 v 4g a t t 3 :fv6gg desc(c 3 )=fatt 1 :fv3g a t t ? 2 :fv4g att 3 :fv7gg desc(c 5 )=fatt 1 :fv5g a t t 3 :fv7 v 8gg.

LRes-Cl and L-Cl are initialized to ffc1 c 3 c 5g fc1gg.

Gathering c3 with c1 is considered rst: desc3=fatt + 1 :fv1 v 3g att 2 :fv2 v 4g a t t 3 :fv6 v 7gg, classes=fc1 c 3g Since desc3(c5) is not subsumed by desc3, the node c1c3 is added. LRes-Cl becomes ffc1,c3,c5g,fc1g,fc1,c3gg.

Require: Two nodes F a t n and Sonn such that classes(Sonn) classes(F a t n ).

Ensure: return a lattice between Fatn and Sonn 1: L-Cl f classes(Fatn) c l a s s e s (Sonn)g where fi 1 : :

: i k g = S c2classes(n) inst(c).

H 3 is a complete Galois lattice, i.e. for every node n, the pair (classes(n) desc3(n)) is maximal, and H 3 contains every node verifying the maximality criteria and whose set of classes includes C 2 and are included i n C 1 .

The worst time complexity of L 3 -Cluster is exponential w.r.t j C 1 j ; j C 2 j.

Conclusion and Discussion

This paper has proposed an approach to organize into clusters large sets of semistructured data. The scaling up of the approach is made possible because its complexity is remained in control in di erent w ays: (1) the data are aggregated into basic classes and the clustering applies on the set of those basic classes instead of applying on the data set (2) the two-step clustering method rst builds a coarse hierarchy, based on a simple language for describing the clusters, and uses a more elaborate language for re ning a small subpart of the hierarchy delimited by t wo nodes.

Experimental results: We h a ve e v aluated our approach using a real dataset composed of 2274 computer products extracted from the C/Net catalog. Each p r o duct is described using a subset of 234 attributes, possibly multi-valued. There are 59 types of products and each product is labelled by one and only one type. The goal of the experiment w as twofold : to assess the e ciency and the simplicity of the lattice for the rst clustering step and to show the accuracy of the re nement of a part of the lattice using the second clustering step.

In order to make t h e L 2 lattice even simpler, the number of nodes obtained with L 2 -Cluster may b e parametrized by a threshold n used to restrict the nodes that appear in the lattice to gather at least n basic classes. Figure 2 shows that, as it is mentionned in 15], this quantitative criteria allows us to signi cantly decrease the size of the lattice. L 3 -Cluster allows to distinguish nodes that cannot be distinguished by L 2 -Cluster. For instance, if L 3 -Cluster is applied when an end-user chooses to re ne the L 2 lattice between the node (a) and the node (b) in Figure 3, the aggregation of all types of drivers (i.e. RemovableT apeDrive, RemovableDiskDrive and HardDiskDrive) is part of the L 3 lattice. This new cluster appears for the following reasons : no driver is described using the attributes StorageController=RAIDLevel, Networking=DataLinkProtocol or Networking=T ype (those attributes were optional in L 3 description of (a)). In addition, the value SCSI for the attribute StorageController=T ype is not possible for a driver.

Related work: Our work can be compared with existing work in machine learning based on more expressive languages than propositional language and/or using a Figure 3: A part of the L 2 lattice for C/net (n=3) shift of representation. Most work on expressive l a nguages has been developped in a supervised setting (e.g. Inductive Logic Programming), while little work exists in an unsupervised setting. We can cite KBG 4], TIC 5] and 11] which perform clustering in a relational setting. The main di erence with our approach is that they use a distance as a numerical estimation of similarity. Although the best representation of a cluster is the least common subsumer of its instances, they approximate it numerically by the cluster centroid (i.e., the point that minimizes the sum of squared distances). The reason is that, in contrast with our setting where the lcs computation in L 3 is polynomial, lcs computing in their rstorder language may be exponential. 3] presents lcs computing for a more expressive language than L 3 but for this language subsumption is exponential. KLUSTER 10] re nes a basic taxonomy of concepts in the setting of a description logic for which computing lcs is polynomial. In KLUSTER, the clusters are not unions but subconcepts of primitive concepts, and the re nement aims at learning discriminating de nitions of mutually disjoint subconcepts of a same concept. As for the use of a shift of representation, it is used in supervised learning in order to increase accuracy (i.e. the proportion of correctly predicted concepts in a set of test examples) 8 14] or to search e ciently a reduced space of concepts 9 6]. In unsupervised learning, shift of representational bias may be used to change the point of view about the data 12 13]. For instance, Cluster/2 12] provides a user with a set of parameters about his preferences on the concepts to be created. Finally, the two-step clustering approach proposed in 1] is similar in spirit with our clustering in L 2 since it rst identi es basic clusters (as high density clusters) before building more general clusters that are unions of those basic clusters.

Perspectives: We plan to extend our current w ork to take nested attributes and textual values into account i n L 3 in order to fully deal with XML data.

L 33is a subset of the C-CLASSIC description logic 7]. De nition 8 (Class description in L 3 ) A L 3 class description (of size n) is a tuple fatt suf f1 1

2 is an instance of the L 3 class description C 1 : f RemovableDiskDrive :ftrueg, CD=DV D=ReadSpeed ? :f20x 32x 24xg, CD=DV D=Type :fCDROM CDRWg, Compatibility + :fMA C PCg, StorageRemovableType :fS u p e r D i s k Z I P J A Z gg It represents the set of products that have in their description (i) necessarily the monovalued and boolean attribute RemovableDiskDrive whose value must be true, (ii) possibly the attribute CD=DV D=ReadSpeed, and if that is the case, this attribute is monovalued and its value belongs to the set f20x 32x 24xg, (iii) necessarily the attribute CD=DV D=Type, w h i c h i s m o n o valued and takes its value in the set fCDROM CDRWg, (iv) necessarily the attribute Compatibility, which c a n b e m ultivalued and takes its value(s) in the set fMA C PCg, (v) necessarily the attribute StorageRemovableT ype, which i s m o n o valued and takes its value in the set fSuperDisk ZIP JAZg. The following propositions state the main properties of L 3 . Their proofs follow from results in tractable description logics where structural subsumption is complete. Proposition 2 (Characterization of subsumption in L 3 ) Let C 1 and C 2 be t w o L 3 class descriptions. C 1 L3 C 2

1 :1(* Initialization step gathering the biggest unions of classes having atleast one attribute in common:*) 2: A 1 A 0 , level(1) : : : l e v e l (j C j ) 3: for every a 2 A 1 do 4: let classes(a) = fc a * Generation of new nodes supported by k + 1-itemsets : *) 10: repeat 11: for every pair (p q) 2 A k do 12:

Figure 1 :1Figure 1: Example of a lattice constructed by L 2 -Cluster of basic classes returned b y L 2 -Cluster: shortdesc(n) = lcs L2 (abst L2 (i 1 ) : : : a b s t L2 (i k ))

maximal in the following sense: there i s n o m 2 H such that classes(n) classes(m) and shortdesc(n) = shortdesc(m), o r shortdesc(n) shortdesc(m) and classes(n) = classes(m).The worst time complexity of L 2 -Cluster is exponential in the maximal size of the basic classes L 2 descriptions.

2 :2LRes-Cl L-Cl 3: Nodes f Sonng 4: for eve r y n o d e n 2 Nodes do 5: for every class c 2 classes(F a t n ) n classes(nnode p to Nodes such that classes(p) = classes and desc3(p) = desc3 18: LRes-Cl LRes-Cl fclasses(p)g 19: if Change = true then 20: L-Cl L-Cl classes(p) 21: Suppress n from Nodes Algorithm 2: L 3 -Cluster Gathering c5 with c1 i s n o w considered: desc3=fatt + 1 :fv1 v 3 v 5g a t t 2 :fv2 v 4g att 3 :fv6 v 7 v 8gg, classes = fc1 c 5g Since desc3(c3) is subsumed by desc3, c3 i s a d d e d t o classes, which is updated to fc1 c 3 c 5g. The node corresponding to c1 c5 is not added since it is not closed, the node corrresponding to its closure c1 c3 c5 i s n o t added either because fc1 c 3 c 5g is already in LRes-Cl. The following proposition summarizes the main properties of the algorithm L 3 -Cluster. Proposition 6 (Properties of L 3 -Cluster) Let C 1 be the set of basic classes of the father node. Let C 2 (C 2 C 1 ) b e the set of basic classes of the son node. Let H 3 be the lattice r eturned b y L 3 -Cluster. For each node n 2 H 3 , let desc3(n) and classes(n) be r espectively the description and the set of basic classes returned b y L 3 -Cluster: desc3(n) = lcs L3 (abst L3 (i 1 ) : : : a b s t L3 (i k ))

Figure 2 :2Figure 2: Quantitative results of L 2 -Cluster Figure 3 illustrates the simplicity o f t h e L 2 descriptions and the signi cance of the nodes.

1::n], att i 2 A and V i V . The description of an instance is a term of L 1 . F or example, we can nd a product in the C/Net catalog,whose L 1 description is:fRemovableDiskDrive, CD=DV D=T ype=fCDRWg, StorageRemovableT ype=fSuperDiskg, Compatibility=fMA C PCggIn the following, we will consider that the type c of a L 1 description is a boolean attribute.2.3 The language of classes L 2 De nition 6 (Class description in L 2 ) A L 2 class description (of size n) is a tuple of attributes fatt 1 : : : a t t n g, where 8i 2 1::n], att i 2 A .

Automatic subspace clustering of high dimensional data for data mining applications RAgrawal JGehrke DGunopulos PRaghavan SIG-MOD Record 27 1998 Fast algorithms for mining association rules RAgrawal RSrikant VLDB-94, S a n tiago

, Chile

Computing Least Common Subsumers in Description Logics with Existencial Restrictions FBaader RUsters RMolitor IJCAI-99, S t o c kholm

, Sweden

Conceptual clustering in a rst order logic representation GBisson ECAI-92

Vienne, Austria

Top-down induction of clustering trees HBlockeel LDeRaedt JRamon ICML-98 Morgan Kaufmann 1998 Tabata: a learning algorithm performing a bidirectional search in a reduced search space using a tabu strategy P HSoldano I n ECAI-98

Brighton, Angleterre

1998 Learning the CLAS-SIC description logic: Theoretical and experimental results WWCohen HHirsh KR 94 1 9 9 4 Learning through progressive re nement WVan De Velde Proceedings of the third European Working Session on Learning (EWSL'98) Pitman the third European Working Session on Learning (EWSL'98)

London

1983 Tdis: an algebraic formalization JGanascia IJCAI-93

Vancouver, Canada

1993 A polynomial approach t o the constructive induction of structural knowledge JUKietz KMorik Machine Learning 14 2 1994 Extending k-means clustering to rst-order representations MathiasKirsten StephanWrobel Proc. of the 10th International Conference on Inductive Logic Programming JCussens AFrish of the 10th International Conference on Inductive Logic Programming Springer Verlag 2000 Learning from observation: Conceptual clustering RSMichalski REStepp Machine learning : An arti cial intelligence a p p r oach MorganKaufmann 1 Local Scaling in Conceptual data Systems GStumme ICCS, LNAI 1115 1996 Springer Verlag Shift of bias for inductive concept learning PEUtgo Machine learning : An arti cial intelligence approach MorganKaufmann 1986 2 Restructuring lattice theory RWille Symposium on Ordered S e t s

Boston

1982 , U n i v ersity of Calgary