=Paper=
{{Paper
|id=Vol-38/paper-2
|storemode=property
|title=Learning Ontologies from RDF annotations
|pdfUrl=https://ceur-ws.org/Vol-38/delteil_ol.pdf
|volume=Vol-38
|dblpUrl=https://dblp.org/rec/conf/ijcai/DelteilFD01
}}
==Learning Ontologies from RDF annotations==
/HDUQLQJRQWRORJLHVIURP5')DQQRWDWLRQV
$OH[DQGUH'HOWHLO&DWKHULQH)DURQ=XFNHU5RVH'LHQJ
ACACIA project, INRIA, 2004, route des Lucioles, B.P. 93, 06902 Sophia Antipolis, France
{Alexandre.Delteil, Catherine.Faron, Rose.Dieng}@sophia.inria.fr
$EVWUDFW objects, as in [Mineau, 1990; Carpineto and Romano, 1993;
Bournaud HWDO., 2000].
In this paper, we present a method for learning Since all RDF annotations are gathered inside a common
ontologies from RDF annotations of Web RDF graph, the problem which arises is the extraction of a
resources by systematically generating the most description for a given resource from the whole RDF graph.
specific generalization of all the possible sets of After a brief description of the RDF data model (Section 2)
resources. The preliminary step of our method and of RDF Schema (Section 3), Section 4 presents several
consists in extracting (partial) resource criteria for extracting partial resource descriptions. In order
descriptions from the whole RDF graph gathering to deal with the intrinsic complexity of the building of a
all the annotations. In order to deal with
generalization hierarchy, we propose an incremental
algorithmic complexity, we incrementally build
approach by gradually increasing the size of the descriptions
the ontology by gradually increasing the size of
the resource descriptions we consider. we consider. The principle of the approach is explained in
Section 5 and more deeply detailed in Section 6.
,QWURGXFWLRQ
7KH5')GDWDPRGHO
The Semantic Web, expected to be the next step that will
lead the Web to its full potential, will be based on semantic RDF is the emerging Web standard for annotating resources
metadata describing all kinds of Web resources. Resource with semantic metadata [RDF, 1999] [Decker HW DO, 2000].
Description Framework [RDF, 1999] seems to be the An RDF annotation consists of a set of statements, each one
emerging standard allowing to semantically annotate Web specifying a value of a property of a resource. A statement
resources. These annotations are related to ontologies, for is thus a triple (resource, property, value), a value being
example declared in RDF Schema [RDFS, 2000], as either a resource or a literal. The RDF data model is close to
proposed by W3C, or in RDF-compatible languages like semantic nets. A set of statements is viewed as a directed
OIL [Fensel HW DO., 2000]. Methods for building hierarchies labeled graph: a vertex is either a resource or a literal; an arc
of classes from RDF annotations will appear to be useful to between two vertices is labeled by a property. RDF is
classify resources, organize knowledge and finally learn provided with an XML syntax. Figure 1 presents an
ontologies. example of an RDF graph describing the Web page relative
The building of hierarchical structures has been extensively to the cat Njal and its XML serialization.
studied in machine learning, especially in concept Cat House
formation. Most approaches of concept formation are
type type
dedicated to the prediction of unknown features of new livesIn ownedBy
objects [Fischer HW DO, 1987; Gennari HW DO, 1989]. The Njal Catherine
clusters of VLPLODU objects are then privileged, the learned
conceptual hierarchy does not comprise all the possible sets
of objects, but only the best ones according to some
heuristic criteria.
For learning ontologies, we adopt a particular approach of
concept formation. An ontology is viewed as a concept
hierarchy, where each concept is defined in extension by a
cluster of resources and in intension by the most specific
common description of these resources. This approach leads
to the systematic generation of all the possible clusters of
)LJXUH An RDF annotation and its XML serialization
An RDF annotation is a set of RDF triples. It can thus be ([WUDFWLQJUHVRXUFHGHVFULSWLRQV
viewed as a graph, which is a subgraph of the complete
Regarding the RDF model, the knowledge base representing
RDF graph representing the whole set of annotations on the
the resource annotations consists of a single graph G. There
Semantic Web.
is no difference between stating a resource description in
one annotation and stating it in several pieces in separate
5HSUHVHQWDWLRQRI2QWRORJLHVLQ5')
annotations: “there is no distinction between the statements
6FKHPDV
made in a single sentence and the statements made in
RDF Schema (RDFS) is a schema specification language separate sentences” [RDF, 1999]. The RDF model does not
[RDFS, 2000]. It is dedicated to the specification of handle the delimitation of the subgraph of G describing a
schemas representing the ontological knowledge used in resource. We thus propose different criteria for extracting a
RDF statements: a schema consists of a set of declarations description of a resource R from G.
of classes and properties. Multi-inheritance is allowed for
both classes and properties. A property is declared with a &RPSOHWH GHVFULSWLRQ: We define the complete description
signature allowing several domains and a single range. The of a resource R as follows. A resource is completely
RDFS metamodel is presented in Figure 2 and is itself described by the subgraph of G containing all the resources
defined as a set of statements by using the core RDFS reachable from R through properties. Formally, the
properties: UGIVVXEFODVV2I and UGIW\SH which denote complete description of R is the largest connected subgraph
respectively the subsumption relation between classes and of G containing R; it is inductively defined as the join of the
the instantiation relation between an instance and a class. complete descriptions of the resources adjacent to R in G.
Such a complete description may be very large; potentially
it may be the graph G representing the whole knowledge
Literal Resource Property
RDFS metamodel
base. This is why we define ways of extracting partial
descriptions of a resource R from G.
Class range type 3LHFH RI NQRZOHGJH : We define the piece of knowledge
relative to a resource R as the largest connected subgraph of
subClassOf domain
G whose all internal nodes excepted R are anonymous
Living being resources. External nodes are either identified resources, or
literals or anonymous resources connected to the only
color ownedBy resources belonging to the piece.
Ontology
Person Cat
domain
'HVFULSWLRQ RI OHQJWK Q: We define the description of
House range livesIn length n of a resource R as the largest connected subgraph
of G containing all possible paths of length smaller or equal
subclassOf to n, starting from or ending to R. The description Dn(R) of
annotation
type length n of a resource R is inductively obtained by joining
RDF
livesIn ownedBy Dn-1(R) with the descriptions D1 of length 1 of the resources
Njal Catherine
which are external nodes of Dn-1(R).
)LJXUH The RDFS metamodel and an RDFS ontology. 3DUWLDO GHVFULSWLRQ : We define a partial description of a
resource R as either the piece of knowledge relative to R or
As shown in Figure 2, an ontology embedding domain a description of length n of R.
specific knowledge is represented by a schema defined by
refining the core RDFS. Domain specific classes are Figure 3 presents the extraction of three possible partial
declared as instances of the “Class” resource, and domain descriptions of Njal from the whole RDF graph which the
specific properties as instances of the “Property” resource. RDF annotation of Figure 1 participates to: the piece of
The “subclassOf” and “subPropertyOf” properties enable to knowledge relative to Njal, the description of length 1 of
define class hierarchies and property hierarchies. Njal and the description of length 2 of Njal. D1(Njal) is a
subgraph of D2(Njal) which is made of paths of length 1 and
The resources appearing in an RDF annotation are then of length 2 starting from or ending to the resource Njal.
typed by the classes declared in the RDF schema the
annotation is relative to; the properties between the Given the whole RDF graph G, by choosing a description
resources are those declared in the RDF schema. extraction criterion, we can now be provided with a set of
partial descriptions for all the resources that are nodes of G.
Now, given an RDF graph G and a resource description
red
Piece of knowledge relative to Njal
black Cat
extraction criterion, let us consider the set of the
type descriptions of all the resources nodes of G. We can now
color type color
motherOf further describe our approach of ontology learning. It
Njal Sandwich consists in associating to this set of resource descriptions the
hierarchy of the concepts whose extensions correspond to
livesIn ownedBy
all the possible resource clusters.
Catherine
ownedBy
* type *∈{Sandwich, Njal,
type type
Catherine, An. Res.}
House Person
type
* Living Being
ownedBy
black Cat red * Catherine *∈{Sandwich, Njal, Catherine}
type type
color type color
D1(Njal): description
of length 1 of Njal
motherOf
Njal Sandwich *
*∈{Sandwich, An. Res.} color
livesIn ownedBy * Cat
Catherine type
ownedBy *∈{Sandwich, Njal}
type type
House Person
red Cat black Cat
black Cat red color color type
type * *
color type color type
D2(Njal): description
motherOf ownedBy livesIn motherOf
of length 2 of Njal
motherOf
Njal Sandwich
Njal Catherine Sandwich
livesIn ownedBy
*∈{Sandwich} *∈{Njal}
Catherine
ownedBy
type type Person Sandwich House Catherine
House Person type ownedBy type ownedBy
* * Njal
ownedBy livesIn
)LJXUH Partial RDF descriptions of the resource Njal.
*∈{Catherine} *∈{Anonymous Resource}
)LJXUHThe concept hierarchy associated to descriptions
2XUDSSURDFKRIRQWRORJ\OHDUQLQJ
of length 1 extracted from the RDF graph of Figure 3.
Our general aim is to learn from the whole RDF graph G
comprising the resources we are interested by, new domain In this hierarchy, each concept ci is labeled by a pair (exti,
specific concepts to enrich the ontology from which the inti), where exti is the extension of ci and inti is the intension
RDF annotations participating to G have been built. of ci. Inti is the most specific generalization of the
descriptions of the resources belonging to exti. We thus call
&RQFHSWIRUPDWLRQ this concept hierarchy a generalization hierarchy. The
generalization of a resource description is based on the
Our approach of ontology learning consists of
subsumptions relations between classes and properties
systematically considering all the concepts covering a set of
declared in the RDF schema representing the ontology
resources nodes of G. Each of these concepts may then be
which the RDF graph G we consider is relative to. Such a
defined in extension as a cluster of resources; its definition
generalization hierarchy is a lattice: its nodes are partially
in intension must generalize the descriptions of the
ordered by the subsumption relation on their intensions as
resources belonging to its extension. The definition of
well as the inclusion relation on their extensions.
criteria to extract resource descriptions from an RDF graph
Figure 4 presents the generalization hierarchy built from
G was thus a preliminary step to concept.
descriptions of length 1 of four resources nodes of the RDF
graph depicted in Figure 3: Njal, Sandwich, Catherine and deleting triples subsuming another one. Such an
the anonymous resource of type ‘House’. intension is the most specific description of the node.
If several concepts share the same intension, a single 4. Building of S1 based on the inclusion relations between
concept is added to the generalization hierarchy: the one the node extensions. Several nodes may share the same
with the largest extension. Therefore, if the size of the intension. In this case, the node we preserve
generalization hierarchy may theoretically reach 2N concepts corresponds to the largest extension.
for N resources in the RDF graph G, in practice it is much Let us apply this principle on the RDF graph depicted on
lower. For instance, the size of the hierarchy of Figure 4 is 8 Figure 3. The consecutive steps are illustrated in Figure 5.
concepts instead of 16 (24 ).
( *, color, black) Njal
,QFUHPHQWDOSULQFLSOH ( *, color, red) Sandwich
The question which now arises is the choice of a resource ( *, type, Person) Catherine
Step 1
description extraction criterion: starting from an RDF graph, ( *, type, Cat) Njal, Sandwich
we must choose from which partial resource descriptions the ( *, type, House) Anonymous Resource
(Sandwich, ownedBy,*) Catherine
concept hierarchy will be built. On the one hand, the larger
(*, ownedBy, Catherine) Sandwich, Anonym. Res.
the extracted resource descriptions will be, the more … …
domain-significant the concepts will be. On the other hand,
graph matching and lattice building both have a well-known ( *, color, ∅) Njal, Sandwich
intrinsic exponential complexity. As a result, we adopt an
Step 2
( *, type, Living Being) Njal, Sandwich, Catherine
incremental approach for the construction of the ( *, type, ∅) Catherine, Sandwich, Njal,
generalization hierarchy. Anonymous Resource
To be precise, we gradually increase the length of the partial … …
resource descriptions we consider. We first build a
generalization hierarchy S1 from resource descriptions of Njal, Sandwich ( *, type, Cat)
length 1. The concepts of S1 thus have intensions of length ( *, color, ∅)
Step 3
1. Sn is then inductively built from Sn-1 and S1 by ( *, type, Living Being)
incrementally increasing the maximum length of the Njal, Catherine ( *, type, Living Being)
resource descriptions we consider. The description Dn(R) of … …
length n of a resource R is inductively increased by joining )LJXUH Building of 6 depicted in Figure 4.
Dn-1(R) with the descriptions of length 1 of the resources
which are external nodes of Dn-1(R). In the first step, the descriptions of length 1 of all the
resources are extracted from the RDF graph of Figure 4. For
,QFUHPHQWDO EXLOGLQJ RI D JHQHUDOL]DWLRQ each RDF triple (R, P, V), both the triples (*, P,V) and (R,
P, *) are generated; the resources R matching * in ( *, P, V)
KLHUDUFK\
and the resources V matching * in (R, P, *) are indexed to
these triples. For instance, the triples (‘Njal’, type, cat) and
%XLOGLQJRIDJHQHUDOL]DWLRQKLHUDUFK\EDVHGRQ
(‘Sandwich’, type, cat) both match the triple (*, type, cat):
UHVRXUFHGHVFULSWLRQVRIOHQJWK
the resources ‘Njal’ and ‘Sandwich’ are thus indexed by the
The principle for building S1 is as follows: triple (*, type, Cat). The result of step 1 is thus an inverted
1. Extraction of resource descriptions of length 1 from the file where resources are indexed by the triples they are the
whole RDF graph. D1(R) is the set of RDF triples extensions of. ‘Njal’ and ‘Sandwich’ belong to the extension
beginning or ending by R. of the intension represented by (*, type, Cat).
2. Iterative generalization of all the possible pairs of In the second step, the triples are matched two by two; for
triples. The generalizations of two triples (R1, P1, V1) each pair of triples, the most specific generalization of both
and (R2, P2, V2) are the most specific triples (RG, PG, triples is computed, according to the ontological knowledge
VG) subsuming them. This is based on the ontological expressed in the RDF schema the RDF graph is relative to.
knowledge represented in the RDF Schema relative to The extension of a generalized triple is the union of the
the RDF annotations we consider. PG is one of the most extensions of the two initial triples it generalizes. For
specific properties generalizing P1 and P2. If V1 and V2 instance, the triples ( *, color, black) and ( *, color, red) are
are classes, then VG is one of the most specific classes generalized into the triple ( *, color, ∅), black and red being
generalizing V1 and V2; else either VG=V1=V2 or VG is incomparable; ( *, type, Person) and ( *, type, Cat) are
anonymous. We call L1 the set of all generalized triples. generalized into the triple ( *, type, Living Being), ‘Living
3. Construction of the intensions of length 1 of the S1 Being’ being the most specific class subsuming ‘Cat’ and
nodes. The triples sharing a same extension are grouped ‘Person’. Note that RDFS allows for multi-inheritance on
together. An intension may include redundant triples, class and property hierarchies. Therefore two classes or two
one being more general than another. It is cleaned up by properties may have several most specific subsumers; in
such cases, the generalization of two triples may lead to 1. Iterative construction of Ln by join of all the possible
several triples. pairs of one triple of L1 and one triple path of length n-1
In the third step, each possible set of resources is considered of Ln-1. Two triples can be joined if the value in the first
as the extension of a concept whose intension is to be found: triple is equal to the resource described in the second
for each extension, the common triples the resources are triple. A triple path of length n is thus the result of n-1
indexed by are grouped together to build the intension of the joins between n triples. This iterative construction of Ln
concept. For instance, the two resources ‘Njal’ and is equivalent to considering resource descriptions Dn(R)
‘Sandwich’ viewed as a concept extension lead to the of length n by joining Dn-1(R) and D1(Ri), with i=1..k,
construction of the intension of this concept from the set of Ri being the external nodes of Dn-1(R).
triples they are both indexed by: {( *, type, Cat), ( *, color, 2. Construction of the intensions of length n of the Sn
∅), ( *, type, Living Being)}. Since the triple ( *, type, nodes (this step is similar to step 3 of S1 building).
Living Being) subsumes the triple ( *, type, Cat), it is 3. Building of Sn based on the inclusion relations between
discarded from the final intension. Finally, a new concept the node extensions (similar to step 4 of S1 building).
will be added to the generalization hierarchy, with {'Njal', * *∈{Sandwich, Njal,
'Sandwich'} as extension and {( *, type, Cat), ( *, color, ∅), type Catherine, An. Res.}
( *, type, Living Being)} as intension.
The last step is dedicated to the building of the
* Living Being
generalization hierarchy based on the inclusion relations type
ownedBy
between the concept extensions. The concepts sharing their * Catherine *∈{Sandwich, Njal, Catherine}
intensions with concepts whose extensions include their type type
own ones are discarded: for instance, the concept ({Njal, *
Person
Catherine}, {( *, type, Living Being)}) is discarded since color
the concept whose extension is the set {Njal, Sandwich, type
*∈{Sandwich, An. Res.} * Cat
Catherine} shares the same intension. The generalization
hierarchy depicted in Figure 4 is the final result S1 obtained *∈{Sandwich, Njal}
from the index file depicted in Figure 5.
%XLOGLQJRIDJHQHUDOL]DWLRQKLHUDUFK\EDVHGRQ
black Cat red black Cat red
UHVRXUFHGHVFULSWLRQVRIOHQJWKQ color type type color color type type color
Njal motherOf * * motherOf Sandwich
( *, type, Person) Catherine livesIn ownedBy livesIn ownedBy
( *, type, House) Anonymous Resource Catherine Catherine
(Sandwich, ownedBy,*) Catherine ownedBy ownedBy
(*, ownedBy, Catherine) Sandwich, Anony. Res. type type
(*, livesIn, An. Res.) Njal Person House
S1
( *, type, ∅) Catherine, Sandwich, Njal,
Anonymous Resource *∈{Sandwich} *∈{Njal}
… …
(*, ownedBy, Catherine) Sandwich, Anony. Res. black Cat Cat red
(Catherine, type, Person) color type type type color
Step 1
(*, livesIn, Anony. Res.) Njal Njal motherOf Sandwich Njal motherOf Sandwich
(An. Res., type, House)
(*, ownedBy, Catherine) Sandwich, Anony. Res. livesIn ownedBy livesIn ownedBy
(Catherine, type, ∅) * Catherine *
… … ownedBy ownedBy
type type type type
Sandwich, Anony. Res. 1. (*, ownedBy, Catherine) House Person House Person
2. (*, type, ∅)
3. (*, ownedBy, Catherine) *∈{Anonymous Resource} *∈{Catherine}
Step 2
(Catherine, type, Person)
4. (*, ownedBy, Catherine)
)LJXUH S2 resulting from the index file of Figure 6.
(Catherine, type, ∅) Figure 6 describes the consecutive steps of the building of
… … S2 (shown in Figure 7) from S1 (depicted in Figure 4). In the
)LJXUH Building of 6 from 6 depicted in Figure 4. first step, the triples of S1 are joined one with another (in the
general case, the triples of Sn-1 would be joined with the
The principle for building Sn from Sn-1 and S1 is as follows: ones of S1). For instance, the triple (*, ownedBy, Catherine)
can be joined with the triple (*, type, Person) since the value &RQFOXVLRQ
of the former triple is equal to a resource belonging to the
We have presented a method to learn ontologies from RDF
extension of the second one. The join results in a triple path
annotations by systematically generating the most specific
of length 2 (*, ownedBy, Catherine) (Catherine, type,
generalization of all the possible sets of resources. In order
Person), whose extension is equal to the extension of the
former triple. to deal with the intrinsic exponential complexity of such a
In the second step, each possible set of resources is task, we incrementally build the hierarchy by increasing at
considered as the extension of a concept whose intension is each step the maximum size of the resource descriptions we
to be found: the common triples the resources of a concept extract from the RDF graph gathering all the annotations.
Our algorithm is currently under implementation and will be
extension are indexed by are grouped together to build the
tested inside of the European IST Comma Project. The so
intension of the concept. It is thus cleaned up in order to
learned ontologies will be used to improve the efficiency of
obtain the most specific description. For instance, the triple
a query engine over a set of annotations.
path (*, ownedBy, Catherine) (Catherine, type, ∅) is
We are currently exploring the possible improvements of
discarded from the intension of the extension {Sandwich,
our algorithms for the construction of the first class
Anonymous Resource}, since it subsumes the triple path (*,
hierarchy [Baader HWDO, 1999].
ownedBy, Catherine) (Catherine, type, Person).
The last step is dedicated to the building of the
5HIHUHQFHV
generalization hierarchy based on the inclusion relations
between the node extensions. Figure 7 presents the [Baader HW DO, 1999] F. Baader, R. Kusters, R. Molitor.
generalization hierarchy S2 built from the generalization Computing Least Common Subsumers in Descriptions
hierarchy S1 depicted in Figure 4. S2 has the same number Logics with Existential Restrictions. In 7th IJCAI, Morgan
of concepts than S1 but five of its concepts have more Kaufmann, 1999.
complex intensions: the four concepts whose extensions are [Bournaud HW DO, 2000] I. Bournaud, M. Courtine and J-D
reduced to a single resource and whose intensions Zucker. KIDS: An iterative algorithm to organize
correspond to the descriptions of length 2 of these resources, relational knowledge. In 12th EKAW, LNAI 1937,
Springer-Verlag, p. 217-232, Juan-Les-Pins, France, 2000.
and the concept of extension {Sandwich, An. Res.}.
[Carpineto and Romano, 1993] C. Carpineto and G.
Romano. GALOIS: an Order Theoretic Approach to
5HODWHG:RUN Conceptual Clustering. In 10th ICML, Morgan Kaufmann,
Conceptual clustering aims at building hierarchies to cluster p. 33-40, Amherst, Massachusetts, 1993.
similar objects and classify object descriptions [Fischer HW [Decker HW DO, 2000] S. Decker, P. Mitra, S. Melnik.
DO, 1987]; a single class hierarchy is built, the best Framework for the Semantic Web – An RDF Tutorial. In
according to a certain criterion. Our approach of concept ,(((,QWHUQHW&RPSXWLQJ, December 2000.
formation for ontology learning is slightly different since it [Fischer HW DO, 1987] D.H. Fisher, M.J. Pazzani, and P.
aims at V\VWHPDWLFDOO\ generating a class for each possible Langley. Concept Formation: Knowledge and Experience
set of objects. This systematic approach is shared by in Unsupervised Learning. Morgan Kaufmann, 1991.
researches in formal concept analysis [Wille, 1982], on [Fensel HW DO., 2000] D. Fensel, I. Horrocks, F. Van
knowledge organization [Mineau HW DO, 1990] and in Harmelen, S. Decker, M. Erdmann and M. Klein. OIL in a
inductive logic programming [Kietz and Morik, 1994; nutshell. In 12th EKAW, Juan-Les-Pins, France, 2000.
[Gennari HW DO, 1989] J. H. Gennari, P. Langley and D.H.
Schlobach, 2000]. Another particularity of our method is the
Fisher. Models of incremental concept formation.
gradual increase of the size of the resource descriptions to
$UWLILFLDO,QWHOOLJHQFH, 40: 11-61, 1989.
deal with the intrinsic complexity of description matching [Kietz and Morik, 1994] J.U. Kietz and K. Morik. A
and ontology building. A similar approach is adopted in Polynomial approach to the constructive induction of
[Bournaud HWDO, 2000]; it is based on a gradual increase of structural knowledge. In 0DFKLQH/HDUQLQJ, 1994.
the VWUXFWXUH RI PDWFKLQJ. Object descriptions are JLYHQ, and [Maedche and Staab, 2000] A. Maedche and S. Staab.
the concept description language is made more expressive at Mining ontologies from text. In 12th EKAW, 2000.
each step to gradually take into account the complexity of [Mineau HW DO, 1990] G. Mineau, J. Gecsei and R. Godin.
the descriptions. Our method differs in that the resource Structuring knowledge bases using automatic learning. In
descriptions are not given in an RDF graph and its 6th ICDE, p. 274-280, Los Angeles, CA, February 1990.
incrementallity is based on the gradual increase of the VL]H [RDF, 1999] http://www.w3.org/TR/REC-rdf-syntax/, 1999.
and not of the structure of matching. Finally compared to [RDFS, 2000] http://www.w3.org/TR/rdf-schema/, 2000.
the approaches of ontology learning based on the analysis of [Schlobach, 2000] S. Schlobach. Assertional Mining in
textual corpus (e.g. [Maedche and Staab, 2000]), RDF Description Logics. In proc. of DL’2000.
annotations may give a better starting point than HTML [Wille, 1982] R. Wille. Restructuring lattice theory: an
documents. approach based on hierarchies of concepts. In: I. Rival
(ed): Ordered Sets, Reidel, Dordrecht-Boston, 1982.