=Paper= {{Paper |id=Vol-38/paper-1 |storemode=property |title=Automatic Construction and Refinement of a Class Hierarchy over Semi-Structured Data |pdfUrl=https://ceur-ws.org/Vol-38/pernelle.pdf |volume=Vol-38 |dblpUrl=https://dblp.org/rec/conf/ijcai/PernelleRV01 }} ==Automatic Construction and Refinement of a Class Hierarchy over Semi-Structured Data== https://ceur-ws.org/Vol-38/pernelle.pdf
    Automatic Construction and Re nement of a Class Hierarchy over
                         Semistructured Data
                 Nathalie Pernelle, Marie-Christine Rousset, Veronique Ventos
                                          L.R.I., C.N.R.S. & University of Paris-Sud
                                          Building 490, 91405, Orsay Cedex, France
                                             Email: fpernelle, mcr,ventosg@lri.fr

1 Introduction                                                      The subsumption relation is a preorder relation be-
In many applications, it becomes crucial to help users to        tween class descriptions, induced by the inclusion rela-
access to a huge amount of data by clustering them in a          tion between class extensions.
small number of classes described at an appropriate level        De nition 2 (Subsumption between classes)
of abstraction. In this paper, we present an approach            Let C and C be two L class descriptions. C is
                                                                       1         2                                     1
based on the use of two languages of description of classes      subsumed by C , denoted C L C , i for every set I
                                                                                 2             1       2
for the automatic clustering of semistructured data. The         of instances, extI (C )  extI (C ).
                                                                                       1           2
  rst language of classes has a high power of abstraction           In sections 2.3, and 2.4, we will provide a constructive
and guides the construction of a lattice of classes cover-       characterization of subsumption for the two languages of
ing the whole set of the data. The second language of            classes that we consider.
classes, more expressive and more precise, is the basis for         The notion of abstraction of an instance in a language
the re nement of a part of the lattice that the user wants       of classes L corresponds, when it exists, to the most
to focus on. Our approach has been implemented and               speci c class description in L which it is an instance of.
experimented on real data in the setting of the GAEL
project 1 which aims at building exible electronic cata-         De nition 3 (Abstraction of an instance) Let i be
logs organized as a hierarchy of classes of products. Our        an instance, the L class description C is an abstraction
experiments have been conducted on real data coming              of i in L (for short C = absL (i)) i
from the C/Net (http://www.cnet.com) electronic cata-              1. i isaL C, and
log of computer products.                                          2. if D is a class description such that i isaL D, then
                                                                      C L D.
2 Languages of instances and classes                                The notion of least common subsumer (a.k.a msg) will
In this section, we de ne the language of instances, L1,         be the basis for gathering classes in our clustering algo-
in which we describe semistructured data, and the two            rithm.
languages of classes L2 and L3 that we use to describe           De nition 4 (Least Common Subsumer) Let
classes over those data, at two levels of abstraction.           C ; : : :; Cn be class descriptions in L. The L class de-
First, we provide some notations and preliminaries.                1
                                                                 scription C is a least common subsumer of C ; : : :; Cn
2.1 Preliminaries and notations
                                                                                                                  1
                                                                 in L (for short C = lcsL (C ; : : :; Cn)) i
                                                                                               1
Given a language of instances, a language of classes L             1. Ci L C for all 1  i  n, and
de nes the expressions that are allowed as class descrip-          2. if D is a class description satisfying Ci L D for
tions. A class description is intended to represent in an              all 1  i  n, then C L D
abstract and concise way the properties that are com-
mon to the set of its instances. A membership relation,          2.2 The language of instances
denoted by isaL , establishes the necessary connection           The data that serve as instances of the classes that we
between a given language of instances and an associated          build are semistructured data in the following sense:
language of classes L.                                           each data is described by a set of pairs (Attribute, Val-
De nition 1 (Extension of a class description)                   ues) but the attributes used for describing the data may
Let I be a set of instances, and C a L class description.        vary from an item to another. In addition, each data is
The extension of C w.r.t I is the following set:                 typed (i.e., labelled by the name of a basic type).
               extI (C) = fi 2 I j i isaL C g                    De nition 5 (Terms of L ) Let B be a nite set of
                                                                                               1
   1
     GAEL is funded by the French Ministry of Industry ; it is   basic types, A a nite set of attributes, and V a set of
a joint project with the Verso database group of INRIA and       values. A term of L is of the form:
                                                                                       1

a startup, MatchVision, specialized in Electronic Commerce                     fc; att = V ; : : :; attn = Vn g
                                                                                       1   1
where c 2 B, 8i 2 [1::n], atti 2 A and Vi  V .                                            De nition 9 (Membership relation for L ) Let i    3

  The description of an instance is a term of L1. For                                      be an instance description in L . Let C be a L class
                                                                                                                             1                   3

example, we can nd a product in the C/Net catalog,                                         description. i is an instance of C i every attribute in
whose L1 description is:                                                                   i appears in C and for every term attsuf f : V appearing
                                                                                           in C ,
    fRemovableDiskDrive,                                                                   - when suff = , if there exists V 0 s.t att = V 0 2 i, then
    CD=DV D=Type=fCDRW g,                                                                  V0  V,
    StorageRemovableT ype=fSuperDiskg,                                                     - when suff = +, there exists V 0  V s.t att = V 0 2 i,
    Compatibility=fMAC; P C gg                                                             - when suff =?, if there exists V 0 s.t att = V 0 2 i, then
  In the following, we will consider that the type c of a                                  V 0 is a singleton and V 0  V ,
L1 description is a boolean attribute.                                                     - when suff = , there exists V 0 singleton s.t V 0  V
                                                                                           and att = V 0 2 i.
2.3 The language of classes L2                                                                The product described in 2.2 is an instance of the L3
De nition 6 (Class description in L ) A L class                    2       2               class description C1 :
description (of size n) is a tuple of attributes                                                f RemovableDiskDrive :ftrueg,
fatt ; : : :; attng, where 8i 2 [1::n], atti 2 A.
     1
                                                                                                CD=DV D=ReadSpeed? :f20x; 32x; 24xg,
  The connection between the language of instances L                                   1        CD=DV D=Type :fCDROM; CDRW g,
and the language of classes L is based on the following
                                               2                                                Compatibility+ :fMAC;PC g,
de nition of the membership relation.                                                           StorageRemovableType :fSuperDisk; ZIP; JAZ gg
De nition 7 (Membership relation for L ) Let i                         2
be an instance description in L . Let C be a L class
                                                   1                       2               It represents the set of products that have in their de-
description: i is an instance of C i every attribute ap-                                   scription (i) necessarily the monovalued and boolean at-
pearing in C also appears in i.                                                            tribute RemovableDiskDrive whose value must be true,
                                                                                           (ii) possibly the attribute CD=DV D=ReadSpeed, and
   The following proposition, whose proof is straight-                                     if that is the case, this attribute is monovalued and its
forward, characterizes subsumption, least common sub-                                      value belongs to the set f20x; 32x; 24xg, (iii) necessarily
sumer and abstraction in L2.                                                               the attribute CD=DV D=Type, which is monovalued and
Proposition 1 (Properties of L ) .                                                         takes its value in the set fCDROM; CDRW g, (iv) nec-
   Let C and C be two L class descriptions. C L2
                                                       2
                                                                                           essarily the attribute Compatibility, which can be mul-
          1                2               2
C i every attribute of C is also an attribute of C .
                                                                               1
                                                                                           tivalued and takes its value(s) in the set fMAC; P C g,
 2                                     2
   Let fatt = V ; : : :; attn = Vng be an instance de-
                                                                                   1
                                                                                           (v) necessarily the attribute StorageRemovableT ype,
               1               1
scription in L . Its abstraction in L is unique: it is                                     which is monovalued and takes its value in the set
                   1
fatt ; ; : : :; attng.
                                                               2
                                                                                           fSuperDisk; ZIP; JAZ g.
     1
   Let C ; : : :; Cn be n L class descriptions. Their                                        The following propositions state the main properties of
           1                               2
least common subsumer is unique: it is made of the set                                     L3. Their proofs follow from results in tractable descrip-
of attributes that are common to all the Ci 's.                                            tion logics where structural subsumption is complete.
2.4 The language of classes L3                                                             Proposition 2 (Characterization of subsumption in L )              3
                                                                                           Let C and C be two L class descriptions. C L3 C
                                                                                                1        2          3                        1       2
L is richer than L on di erent aspects: it makes possi-
 3                             2                                                           i all the attributes appearing in C1 appear also in C2
ble to restrict the possible values of an attribute ; it en-                               and for every pair attsuf f : V appearing in C2,
ables to distinguish the number of values of an attribute                                  - when suff = , if there exists attsuf f : V 0 2 C1, then
                                                                                                                                     0


through di erent suxes ( ; + ; ? ; ) whose notation is                                  V0  V,
inspired by the one used in XML for describing docu-                                       - when suff = +, there exists V 0  V s.t att+ : V 0 2 C1
ment type de nitions (DTDs), and whose formal seman-                                       or att : V 0 2 C1
tics corresponds to standard description logics construc-                                  - when suff = ?, if there exists attsuf f : V 0 2 C1 , then
                                                                                                                                     0


tors. In fact, as it will become clearer in the following,                                 suff 0 = ? or suff 0 = , and V 0  V ,
L3 is a subset of the C-CLASSIC description logic [7].                                     - when suff = , there exists V 0 s.t V 0  V and att : V 0
De nition 8 (Class description in L ) A L class                    3       3
                                                                                           2 C1.
description (of size n) is a tuple                                                           The complexity of checking subsumption in L3 is
                                                                                           quadratic w.r.t the maximal size of class descriptions.
             fatt   : V ; : : :; attsuf
                       suf f1           f
                                          : V ng           n

                                                                                           Proposition 3 (Characterization of abstraction in L )
                       1           1n

where 8i 2 [1::n], atti 2 A, Vi  V , and suffi 2
                                                                                                                                                          3
                                                                                           Let fatt = V ; : : :; attn = Vn g be an instance de-
f; +; ?; g                                                                                        1        1
                                                                                           scription in L . Its abstraction in L is unique:
                                                                                                             1                           3

   The following de nition formalizes the membership re-                                   absL3 = fattsuf f1 : V ; : : :; attsuf
                                                                                                                        1
                                                                                                                                   fn  : Vn g, where
                                                                                           8i 2 [1::n], if j Vi j 2 then suffi = + else suffi = .
                                                                                                             1                  n
lation between an instance and a class description in L3.
Proposition 4 (Characterization of lcs in L ) Let      3            2. In the second step, a lattice of clusters is constructed
C ; : : :; Cn be n L class descriptions. Let A be the set
 1                  3                                                   by gathering basic classes according to similarities
of attributes belonging to at least one description Ci .                of their L2 descriptions. In this step, clusters are
C1; : : :; Cn have a unique least common subsumer in                    unions of basic classes. The computational com-
L3, whose description is characterized as follows:                      plexity of this step does not depend on the number
    for every attribute att 2 A, let V be the union of                 of initial data but only on the size of the L2 descrip-
      the sets of values associated        with att in the class        tions of basic classes.
                                   S
      descriptions Ci's: V = 1 fv 2 Vi j att
                                     n               suf f
                                                           : Vi 2    We   now detail this second step. A cluster ci1 : : :ci
                                                                                      in the lattice if the L2 descriptions of the
                                                                                                                                  k
      Ci g.                                                       will  appear
                                                                  classes
         { att:V 2 lcs(C1; : : :; Cn) i att:Vi 2 Ci 8i 2 their instances. c i 1  : :
                                                                                     k :ci are judged similar enough to gather
            [1::n].                                                                        The similarity between class descrip-
                                                                  tions is stated by attributes in common. However, we
         { att :V 2 lcs(C1; : : :; Cn) i
                ?
                                                                  take into account only attributes that do not occur in
            (8i 2 [1::n] att :Vi 62 Ci and att+ :Vi 62 Ci ),and too many (or all the) classes. For instance, the attribute
             either 9i 2 [1::n] s.t. att? :Vi 2 Ci ,             price may appear in all the instances of a catalog describ-
             or 9i 2 [1::n] s.t. atts :V 0 62 Ci for any s.      ing products, and is therefore not useful to discriminate
         { att :V 2 lcs(C1; : : :; Cn) i                         product descriptions. Among the set A of attributes,
             either 9i 2 [1::n] s.t. att :Vi 2 Ci ,
                                                                 we select meaningful attributes as being the attributes
             or 9i 2 [1::n] s.t. att :Vi 2 Ci , and 9j 2 a 2 A such that jclasses
                                           +
                                                                                               jclassesj  s where s is a certain
                                                                                                      (a)j


               [1::n] s.t. att? :Vj 2 Cj or atts :V 0 62 Cj for threshold (e.g., s = 0:8). Let A0 be the set of mean-
                                                0



               any sux s0 .                                      ingful attributes. We redescribe all the basic classes in
         { att :V 2 lcs(C1; : : :;+ Cn) i
                +                                                 terms of the attributes of A0 only: for a basic class c, we
            9i +2 [1::n] s.t. att  :Vi 2 Ci and 8j 2 [1::n], call     its short description, denoted shortdesc(c), the L2
            att :Vj 2 Cj or att :Vj 2 Cj                          description of c restricted to the meaningful attributes:
                                                                  shortdesc(c) = desc(c) \ A0.
For example, if C2 is the L3 description:                            Our clustering algorithm, L2-Cluster, is described in
     fCompatibility :fPC; Unix        g ,                        Algorithm 1. It is adapted from a frequent item set
     StorageRemovableType :fDAT g,                               algorithm ([2]). It iteratively builds levels of clusters,
     CompressedCapacity : f8; 24; 32; 70gg                       starting with building the level of the coarsest clusters
lcs(C1 ; C2)=                                                     corresponding to unions of classes having atleast one at-
  fRemovableDiskDrive?? : ftrueg,                                 tribute in common. Each iteration k is guided by at-
  CD=DV D=ReadSpeed :f20x; 32x; 24xg,                             tribute sets of increasing size k which, being common to
  CD=DV D=Type :fCDROM; CDRW g,
                       ?                                          some class descriptions, are the support of the creation of
  Compatibility :fMAC; P C; Unixg,                               a potential node gathering those classes. Among those
  StorageRemovableType :fSuperDisk; ZIP; JAZ; DAT g, potential nodes, we e ectively add to the lattice those
  CompressedCapacity? :f8; 24; 32;70gg                            whose L2 short description is equal to their k-support:
   Computing the lcs of L3 descriptions is linear in the k-itemsetthe  k-support of a node generated at iteration k is the
number of descriptions and quadratic in their size.                               supporting the generation of that node. By
                                                                  doing so, we guarantee that the description of the nodes
3 Construction of a lattice of L2 classes added                   fathers.
                                                                          to the lattice is strictly subsumed by those of their
The goal is to structure a set of data described in L1 Notation: We call a k-itemset a set of attributes of size
into clusters labelled by L2 descriptions, and organized k. We assume that attributes in itemsets are kept sorted
in a lattice providing a browsable semantic interface fa- in their lexicographic order. We use the notation p[i] to
cilitating the access to the data for end-users.                  represent the i-th attribute of the k-itemset p consisting
   We proceed to a two-step clustering:                           of the attributes p[1]; : : :; p[k] where p[1] < : : : < p[k].
  1. In the rst step, the data are partitioned according when        Figure 1 shows the lattice returned by L2-Cluster
      to their type: for each type c, we create a basic class fa1; a2it; ais3; aapplied       on C = fc1 ; c2; c3; c4; c5g and A =
                                                                                     4 g such that:
      named c. Its set of instances, denoted inst(c), is the         shortdesc(c1 ) = fa1; a2; a3g shortdesc(c2 ) = fa2g
      set of data of type c. Its L2 description, desc(c), is         shortdesc(c3 ) = fa1; a3g shortdesc(c4 ) = fa3; a4g
      obtained by computing the least common subsumer                            shortdesc(c5 ) = fa1; a3g
      of the abstractions of its instances. The result of
      this step is a set C of basic classes and a set A of the algorithm Lproposition
                                                                     The   following
                                                                                            2 -Cluster.
                                                                                                           summarizes the properties
      of attributes supporting the L2 descriptions of the
      classes of C . For each attribute a, the set classes(a) Proposition 5 (Properties of L2 -Cluster) Let H be
      of basic classes having a in their description is com- the lattice returned by L2-Cluster.
      puted. This preliminary clustering step has a linear            For each node n 2 H, let shortdesc(n) and
      data complexity.                                                  classes(n) be respectively the description and the set
                                                                               {a3}    c1 c3 c4 c5
Require: a set A of meaningful attributes: for each
                     0
   a 2 A , classes(a) is the set of basic classes of C
           0
   whose L short description contains a.
               2
Ensure: return a lattice organized in levels of nodes.                         {a1 a3} c1 c3 c5
    Each node n is characterized by classes(n): the basic
    classes it gathers, and shortdesc(n): the least com-          {a2} c1 c2
    mon subsumer of the short description of the basic
    classes of the cluster.
 1: (* Initialization step gathering the biggest unions of            {a1 a2 a3} c1              {a3 a 4} c4
    classes having atleast one attribute in common:*)
 2: A1 A0 , level(1) ;; : : :; level(j C j) ;                     Figure 1: Example of a lattice constructed by L2-Cluster
 3: for every a 2 A1 do
 4: let classes(a) = fca1 ; : : :; caj g                              of basic classes returned by L2 -Cluster:
 5: let desc = lcsL2 (desc(ca1 ); : : :; desc(caj ))
 6: if desc \ A0 = fag then                                             shortdesc(n) = lcsL2 (abstL2 (i1 ); : : :; abstL2 (ik ))
 7:      add to level(j) a node n such that:                                                  S
                                                                      where fi1; : : :; ik g = c2classes(n) inst(c).
            classes(n) = fca1 ; : : :; cajg ;
            shortdesc(n) = desc \ A0;                                H is a Galois lattice, i.e. for every node n, the pair
            node(fag) = n                                             (classes(n); shortdesc(n)) is maximal in the follow-
 8: k 1                                                               ing sense: there is no m 2 H such that classes(n) 
 9: (* Generation of new nodes supported by                           classes(m) and shortdesc(n) = shortdesc(m), or
    k + 1-itemsets : *)                                               shortdesc(n)  shortdesc(m) and classes(n) =
10: repeat                                                            classes(m).
11: for every pair (p; q) 2 Ak do                                    The worst time complexity of L -Cluster is expo-
        if p[1] = q[1]; : : :; p[k ; 1] = q[k ; 1]; p[k] < q[k]
                                                                                                           2
12:                                                                   nential in the maximal size of the basic classes L
         then                                                         descriptions.
                                                                                                                              2


13:         let newp = p [ fq[k]g, and let Sk be the set
14:
            of k-subsets of newp.
            if Sk  Ak and classes(node(p)) \                     4 Re nement in L3
            classes(q[k]) =  6 ; then                             The goal of this step is to re ne a part of the lattice H
15:            add newp to Ak+1                                   computed by L2 -Cluster based on the more expressive
16:            let fci1 ; : : :; ci g be classes(node(p)) \       language L3. This step is achieved after a user chooses
               classes(q[k])
                               j
                                                                  one node Fatn and one of its descendants Sonn in H.
17:            let desc = lcsL2 (desc(ci1 ); : : :; desc(ci ))    Algorithm 2 describes how new nodes are possibly added
18:            if desc = newp then
                                                        j
                                                                  between Sonn and Fatn. Those new nodes correspond
19:               add to level(j) a node n such that:             to clusters whose descriptions in L2 did not distinguish
                     classes(n) = fci1 ; : : :; ci g ;            from those of Fatn or Sonn, while having distinct de-
                     shortdesc(n) = desc;
                                               j
                                                                  scriptions in L3. A closure operation on those nodes is
                     node(newp) = n                               necessary in order to make their L3 descriptions maxi-
20: k k + 1                                                       mal w.r.t the union of basic classes which they gather.
21: until Ak = ;                                                  L3-Cluster applies after the descriptions in L3 (denoted
22: (* Creation of the lattice. For every node n,                 desc3 in Algorithm 2) have been computed for Sonn and
    Fathers(n) group the fathers of n among the nodes             Fatn. Those computations are least common subsumer
    of greater levels:*)                                          calculations whose overall time cost is polynomial w.r.t
23: Initialize Fathers(n) and AncNotFathers(n) to ;               to the size and the number of the instances of the basic
    for every generated node n.                                   classes involved in Fatn.
24: for i =j C j ;1 downto 1 do                                      Let us illustrate the application of L3-Cluster on the
25: for every node n 2 level(i) do                                nodes c1 c3 c5 and c1 of Figure 1, assuming that the L3
26:      for j = i + 1 to j C j do                                descriptions of+the involved +basic classes are:
27:         for every node m 2 level(j) do                        desc(c1 )=fatt1 :fv1; v3g; att2 :fv2; v4g; att3:fv6gg
28:            if classes(n)  classes(m)                         desc(c3 )=fatt1:fv3g; att?2:fv4g; att3:fv7gg
               and m 62 AncNotFathers(n) then                     desc(c5 )=fatt1:fv5g; att3:fv7; v8gg.
29:               add m to Fathers(n)                             LRes-Cl and L-Cl are initialized to ffc1; c3; c5g; fc1gg.
30:               add F athers(m) [ AncNotFathers(m)               Gathering+ c3 with c1 isconsidered rst:
                  to AncNotFathers(n)                             desc3=fatt1 :fv1; v3g; att2:fv2; v4g; att3:fv6; v7gg,
                                                                  classes=fc1; c3g
                 Algorithm 1: L2-Cluster                          Since desc3(c5) is not subsumed by desc3, the node c1c3
                                                                  is added. LRes-Cl becomes ffc1,c3,c5g,fc1g,fc1,c3gg.
Require: Two nodes Fatn and Sonn such that                         5 Conclusion and Discussion
    classes(Sonn)  classes(F atn).                                This paper has proposed an approach to organize into
Ensure: return a lattice between Fatn and Sonn                     clusters large sets of semistructured data. The scaling
 1: L-Cl fclasses(Fatn); classes(Sonn)g                            up of the approach is made possible because its complex-
 2: LRes-Cl L-Cl                                                   ity is remained in control in di erent ways: (1) the data
 3: Nodes fSonng                                                   are aggregated into basic classes and the clustering ap-
 4: for every node n 2 Nodes do                                    plies on the set of those basic classes instead of applying
 5: for every class c 2 classes(F atn) n classes(n) do             on the data set (2) the two-step clustering method rst
 6:      Change false                                              builds a coarse hierarchy, based on a simple language
 7:      classes classes(n) [ fcg                                  for describing the clusters, and uses a more elaborate
 8:      if classes 62 L-Cl then                                   language for re ning a small subpart of the hierarchy
 9:         L-Cl L-Cl [fclassesg                                   delimited by two nodes.
10:         desc3 lcsL3 (desc3(n); desc3(c))                       Experimental results: We have evaluated our ap-
11:         (* Closure operation: *)                               proach using a real dataset composed of 2274 computer
12:         for every class Cl 2 classes(F atn) n classes          products extracted from the C/Net catalog. Each prod-
            do                                                     uct is described using a subset of 234 attributes, possibly
13:           if desc3(Cl) L3 desc3 then                          multi-valued. There are 59 types of products and each
14:             add Cl to classes                                  product is labelled by one and only one type. The goal
15:             Change true                                        of the experiment was twofold : to assess the eciency
16:       if classes 62 LRes-Cl then                               and the simplicity of the lattice for the rst clustering
17:          add a new node p to Nodes such that                   step and to show the accuracy of the re nement of a part
             classes(p) = classes and desc3(p) = desc3             of the lattice using the second clustering step.
18:          LRes-Cl LRes-Cl [fclasses(p)g                            In order to make the L2 lattice even simpler, the
19:          if Change = true then                                 number of nodes obtained with L2-Cluster may be
20:             L-Cl L-Cl [classes(p)                              parametrized by a threshold n used to restrict the nodes
21:   Suppress n from Nodes                                        that appear in the lattice to gather at least n basic
               Algorithm 2: L3-Cluster                             classes. Figure 2 shows that, as it is mentionned in [15],
                                                                   this quantitative criteria allows us to signi cantly de-
                                                                   crease the size of the lattice.
 Gathering c5 with c1 is now considered:
desc3=fatt+1 :fv1; v3; v5g; att2:fv2; v4g; att3:fv6; v7; v8gg,
classes = fc1; c5g                                                  Threshold            1      2     3     4     5

Since desc3(c3) is subsumed by desc3, c3 is added to                Number of nodes     119    60     24    13    9
classes, which is updated to fc1; c3; c5g. The node
corresponding to c1 c5 is not added since it is not closed,         Maximal depth        4      3     2     1     1
the node corrresponding to its closure c1 c3 c5 is not              Running Times (s)   12.3   10.4   2.1   1.1   1
added either because fc1; c3; c5g is already in LRes-Cl.
   The following proposition summarizes the main prop-
erties of the algorithm L3-Cluster.                                      Figure 2: Quantitative results of L2 -Cluster
Proposition 6 (Properties of L3-Cluster) Let C1                       Figure 3 illustrates the simplicity of the L2 descrip-
be the set of basic classes of the father node. Let C2             tions and the signi cance of the nodes.
(C2  C1) be the set of basic classes of the son node. Let            L3-Cluster allows to distinguish nodes that can-
H3 be the lattice returned by L3-Cluster.                          not be distinguished by L2 -Cluster. For instance,
   For each node n 2 H3, let desc3(n) and classes(n)              if L3-Cluster is applied when an end-user chooses
     be respectively the description and the set of basic          to re ne the L2 lattice between the node (a)
     classes returned by L3 -Cluster:                              and the node (b) in Figure 3, the aggregation
                                                                   of all types of drivers (i.e. RemovableTapeDrive,
        desc3(n) = lcsL3 (abstL3 (i1 ); : : :; abstL3 (ik ))       RemovableDiskDrive and HardDiskDrive) is part
                             S
    where fi1 ; : : :; ik g = c2classes(n) inst(c).
                                                                   of the L3 lattice. This new cluster appears for
                                                                   the following reasons : no driver is described
   H3 is a complete Galois lattice, i.e. for every node           using the attributes StorageController=RAIDLevel,
    n, the pair (classes(n); desc3(n)) is maximal, and             Networking=DataLinkProtocol or Networking=T ype
    H contains every node verifying the maximality                 (those attributes were optional in L3 description of
       3
    criteria and whose set of classes includes C and               (a)). In addition, the value SCSI for the attribute
    are included in C .
                                                        2
                                                                   StorageController=T ype is not possible for a driver.
                        1
                                                                   Related work: Our work can be compared with exist-
   The worst time complexity of L -Cluster is exponen-
                                       3                           ing work in machine learning based on more expressive
    tial w.r.t j C j ; j C j.
                  1         2                                      languages than propositional language and/or using a
       {Stor./Type,
                                       {Input Dev/Type}         Perspectives: We plan to extend our current work to
        Stor. Controller/Type}
                                                                take nested attributes and textual values into account in
        Hard Disk Drive
                                        Key-Entry Device        L in order to fully deal with XML data.
                                                                 3
                                        Pointing Device
                                                                References
        Removable Disk Drive            Game Controller
 (a)    Network Storage                 Communic. Device
        Removable Tape Drive            Handheld/PDA            [1] R. Agrawal, J. Gehrke, D. Gunopulos, and
                                        Laptop                      P. Raghavan. Automatic subspace clustering of high
       {Stor./Type,                                                 dimensional data for data mining applications. SIG-
        Stor. Controller/Type,             {Input Dev/Type,         MOD Record, 27:94{105, 1998.
        Removable Stor/Type}                Form Factor}        [2] R. Agrawal and R. Srikant. Fast algorithms for min-
        Removable Disk Drive            Key-Entry Device            ing association rules. In VLDB-94, Santiago, Chile.
        Network Storage                 Pointing Device         [3] F. Baader, R. Kusters and R. Molitor. Comput-
        Removable Disk Drive            Game Controller             ing Least Common Subsumers in Description Logics
                                                                    with Existencial Restrictions. In IJCAI-99, Stock-
                                                                    holm, Sweden.
                                 {Stor./Type,                   [4] G. Bisson. Conceptual clustering in a rst order
(b)     Removable Disk Drive      Stor. Controller/Type,
                                  Removable Stor/Type,              logic representation. In ECAI-92, Vienne, Austria.
                                  CD/DVD/Type}                  [5] H. Blockeel, L. De Raedt, and J. Ramon. Top-down
                                                                    induction of clustering trees. In ICML-98, Morgan
  Figure 3: A part of the L2 lattice for C/net (n=3)                Kaufmann, 1998.
                                                                [6] P. Brezellec and H. Soldano. Tabata: a learning
                                                                    algorithm performing a bidirectional search in a re-
shift of representation. Most work on expressive lan-               duced search space using a tabu strategy. In ECAI-
guages has been developped in a supervised setting (e.g.            98, Brighton, Angleterre, 1998.
Inductive Logic Programming), while little work exists          [7] W. W. Cohen and H. Hirsh. Learning the CLAS-
in an unsupervised setting. We can cite KBG [4], TIC [5]            SIC description logic: Theoretical and experimental
and [11] which perform clustering in a relational setting.          results. In KR-94, 1994.
The main di erence with our approach is that they use           [8] W. Van de Velde. Learning through progres-
a distance as a numerical estimation of similarity. Al-             sive re nement. In Pitman, editor, Proceedings of
though the best representation of a cluster is the least             the third European Working Session on Learning
common subsumer of its instances, they approximate it                (EWSL'98),London, 1983.
numerically by the cluster centroid (i.e., the point that       [9] J. Ganascia. Tdis: an algebraic formalization. In
minimizes the sum of squared distances). The reason is               IJCAI-93,Vancouver, Canada, 1993.
that, in contrast with our setting where the lcs compu-
tation in L3 is polynomial, lcs computing in their rst-         [10] J. U. Kietz and K. Morik. A polynomial approach to
order language may be exponential.[3] presents lcs com-              the constructive induction of structural knowledge.
puting for a more expressive language than L3 but for                Machine Learning, 14(2):193{217, 1994.
this language subsumption is exponential. KLUSTER               [11] Mathias Kirsten and Stephan Wrobel. Extending
[10] re nes a basic taxonomy of concepts in the setting              k-means clustering to rst-order representations. In
of a description logic for which computing lcs is poly-              J. Cussens and A. Frish, editors, Proc. of the 10th
nomial. In KLUSTER, the clusters are not unions but                  International Conference on Inductive Logic Pro-
subconcepts of primitive concepts, and the re nement                 gramming, Springer Verlag, 2000.
aims at learning discriminating de nitions of mutually          [12] R.S. Michalski and R.E. Stepp. Learning from ob-
disjoint subconcepts of a same concept. As for the use               servation: Conceptual clustering. In Morgan Kauf-
of a shift of representation, it is used in supervised learn-        mann, editor, Machine learning : An arti cial in-
ing in order to increase accuracy (i.e. the proportion of            telligence approach, 1983.
correctly predicted concepts in a set of test examples) [8;
14] or to search eciently a reduced space of concepts [9;      [13] G. Stumme. Local Scaling in Conceptual data Sys-
6]. In unsupervised learning, shift of representational              tems. In Springer Verlag, editor, ICCS, LNAI 1115,
bias may be used to change the point of view about the               1996.
data [12; 13]. For instance, Cluster/2 [12] provides a          [14] P. E. Utgo . Shift of bias for inductive concept
user with a set of parameters about his preferences on               learning. In Morgan Kaufmann,ed., Machine learn-
the concepts to be created. Finally, the two-step clus-              ing : An arti cial intelligence approach Vol. 2, 1986.
tering approach proposed in [1] is similar in spirit with       [15] R. Wille. Restructuring lattice theory. In Sym-
our clustering in L2 since it rst identi es basic clusters           posium on Ordered Sets, University of Calgary,
(as high density clusters) before building more general              Boston, 1982.
clusters that are unions of those basic clusters.