The ICS-FORTH RDFSuite:
                  Managing Voluminous RDF Description Bases
                              Sofia Alexaki                                                           Karsten Tolle
                          Vassilis Christophides
                          Gregory Karvounarakis
                           Dimitris Plexousakis
                 Institute of Computer Science, FORTH,                                  Johann Wolfgang Goethe-University,
                      Vassilika Vouton, P.O.Box 1385                                  Robert-Mayer-Str. 11-15 P.O.Box 11 19 32,
                      GR 711 10, Heraklion, Greece                                       D-60054 Frankfurt/Main, Germany
                    [alexaki, christop, gregkar, dp]                                 tolle@dbis.informatik.uni-frankfurt.de
                              @ics.forth.gr


ABSTRACT                                                                             1. INTRODUCTION
Metadata are widely used in order to fully exploit informa-                             Metadata are widely used in order to fully exploit infor-
tion resources available on corporate intranets or the Inter-                        mation resources (e.g., sites, documents, data, images, etc.)
net. The Resource Description Framework (RDF) aims at                                available on corporate intranets or the Internet. The Re-
facilitating the creation and exchange of metadata as any                            source Description Framework (RDF) [17] aims at facilitat-
other Web data. The growing number of available infor-                               ing the creation and exchange of metadata as any other Web
mation resources and the proliferation of description ser-                           data. More precisely, RDF provides i) a Standard Repre-
vices in various user communities, lead nowadays to large                            sentation Language for metadata based on directed labeled
volumes of RDF metadata. Managing such RDF resource                                  graphs in which nodes are called resources (or literals) and
descriptions and schemas with existing low-level APIs and                            edges are called properties and ii) an XML syntax for ex-
  le-based implementations does not ensure fast deployment                           pressing metadata in a form that is both humanly read-
and easy maintenance of real-scale RDF applications. In                              able and machine understandable. Due to its exible model,
this paper, we advocate the use of database technology to                            RDF is playing a central role in the next evolution step of
support declarative access, as well as, logical and physical                         the Web - termed the Semantic Web. Indeed, RDF/S enable
independence for voluminous RDF description bases.                                   the provision of various kinds of metadata (for administra-
   We present RDFSuite, a suite of tools for RDF validation,                         tion, recommendation, content rating, site maps, push chan-
storage and querying. Speci cally, we introduce a formal                             nels, etc.) about resources of quite diverse nature (ranging
data model for RDF description bases created using mul-                              from PDF or Word documents, e-mail or audio/video les to
tiple schemas. Next, we present the design of a persistent                           HTML pages or XML data) to di erent target communities
RDF Store (RSSDB) for loading resource descriptions in an                            (corporate, inter-enterprise, e-marketplace, etc.). The most
ORDBMS by exploring the available RDF schema knowl-                                  distinctive RDF feature is its ability to suport superimposed
edge. Our approach preserves the exibility of RDF in re-                             descriptions for the same Web resources, enabling content
  ning schemas and/or enriching descriptions at any time,                            syndication - and hence, automated processing - in a variety
whilst it outperforms, both in storage volumes and query                             of application contexts. To interpret resource descriptions
execution time, other approaches using a monolithic table                            within or across communities, RDF allows for the de nition
to represent resource descriptions and schemas under the                             of schemas [6] i.e., vocabularies of labels for graph nodes
form of triples. Last, we brie y present RQL, a declarative                          (i.e., classes) and edges (i.e., properties). Furthermore, these
language for querying both RDF descriptions and schemas,                             vocabularies can be easily extended to meet the description
and sketch query evaluation on top of RSSDB.                                         needs of speci c (sub-)communities while preserving the au-
                                                                                     tonomy of description services for each (sub-)community.
                                                                                        Many content providers (e.g., ABCNews, CNN, Time Inc.)
 This work was partially supported by the EC project C-                             and Web Portals (e.g., Open Directory, CNET, XMLTree1 )
                                                                                     or browsers (e.g., Netscape 6.0, W3C Amaya) already adopt
Web (IST-1999-13479) and Mesmuses (IST-2000-26074).                                  RDF, as well as, emerging application standards for Web
                                                                                     data and services syndication (e.g., the RDF Site Summary
                                                                                     [5], the Dublin Core [25], or the Web Service Description
                                                                                     Language [26]). In a nutshell, the growing number of avail-
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
                                                                                     able information resources and the proliferation of descrip-
not made or distributed for profit or commercial advantage and that copies           tion services in various user communities, lead nowadays to
bear this notice and the full citation on the first page. To copy otherwise, to      large volumes of RDF metadata (e.g., the Open Directory
republish, to post on servers or to redistribute to lists, requires prior specific
permission by the authors.                                                           1
Semantic Web Workshop 2001 Hongkong, China                                             www.dmoz.org, home.cnet.com, www.xmltree.com respec-
Copyright by the authors.                                                            tively.
                                                                                   DBMS
                                    VRP                              Class                        Property                               RQL
                                                                id   c_name            id domainid p_name rangeid
                                    Validator                    1 ns1#Ext.Resource    4    1      ns1#title 12
                                                                                                                                        Graph
                                                                 2 ns2#Hotel           5    11     ns3#title 12


                                                                                                                          Evaluation
                                                                                                                                       Constructor


                                                                                                                            Engine
                                                 Loader
                                                                 3 ns2#Hotel Direct


                           Parser


                                                          API


                                                                                                                                                     Parser
                                                                                                                    API
                                     VRP Model
                                                                     SubClass                 SubProperty
                                                                  subid      superid        subid     superid
                                                                     3          2             4         5                               Optimizer
                                                                   ns1#Ext.Resource         ns3#title                                    Module
                                                                           URI           source      target
                                                                           r1              r1       SunScale


                                          Figure 1: Overview of the ICS-FORTH RDFSuite

Portal of Netscape exports in RDF around 170M of Subject                                      2. THE OPEN DIRECTORY CATALOG
Topics and 700M of indexed URIs). It becomes evident that                                        Portals are nowadays becoming increasingly popular by
managing such voluminous RDF resource descriptions and                                        enabling the development and maintenance of speci c com-
schemas with existing low-level APIs and le-based imple-                                      munities of interest (e.g., enterprise, professional, trading)
mentations [22] does not ensure fast deployment and easy                                      [12] on corporate intranets or the Internet. Such Commu-
maintenance of real-scale RDF applications. Still, we want                                    nity Web Portals essentially provide the means to select,
to bene t from three decades of research in database tech-                                    classify and access, in a semantically meaningful way, var-
nology to support declarative access and logical and phys-                                    ious information resources. Portals may be distinguished
ical independence for RDF description bases. In this way,                                     according to the breadth of the target community (e.g., tar-
RDF applications have to specify in a high-level language                                     geting horizontal or vertical markets), and the complexity
only which resources need to be accessed, leaving the task                                    of information resources (e.g., sites, documents, data). In
of determining how to eÆciently store or access them to the                                   all cases, the key Portal component is the Knowledge Cat-
underlying RDF database engine.                                                               alog holding descriptive information, i.e., metadata, about
   In this paper we present ICS-FORTH RDFSuite [14], a suite                                  the community resources.
of tools for RDF validation (Validating RDF Parser-VRP),                                         For instance, the catalogs of Internet (or horizontal) Por-
storage (RDF Schema Speci c DataBase-RSSDB), and query-                                       tals, like Yahoo! or Open Directory (ODP), use huge hi-
ing (RDF Query Language-RQL) using an object-relational                                       erarchies of topics to semantically classify Web resources.
DBMS (see Figure 1). To illustrate the functionality and the                                  Speci cally, Table 1 lists statistics of 15 (out of 16) ODP
performance of RDFSuite we use as testbed the catalog of                                      hierarchies (version of 16-01-2001), comprising 252840 top-
the Open Directory Portal exported in RDF (see Section 2).                                    ics under which 1770781 sites are classi ed. We can observe
The paper makes the following contributions:                                                  that ODP hierarchies are not deep (the maximum depth is
     Section 3 introduces a formal data model for descrip-                                   13) while the number of direct subclasses of a class varies
       tion bases created according to the RDF Model & Syn-                                   greatly (the maximum number of subclasses is 314 but the
       tax and Schema speci cations [17, 6] (without rei ca-                                  average is around 4.02). Additionally, various administra-
       tion). The main challenge for this model is the rep-                                   tive metadata (e.g., titles, mime-types, modi cation dates)
       resentation of several interconnected RDF schemas, as                                  of resources are usually created using an OCLC Dublin-Core
       well as the introduction of a graph instantiation mech-                                like schema [25]. Users can either navigate through the top-
       anism permitting multiple classi cation of resources.                                  ics of the catalog to locate resources of interest, or issue a
     Section 4 presents our persistent RDF Store (RSSDB)                                     full-text query on topic names and the URIs or the titles of
       for loading resource descriptions in an object-relational                              described resources. In Section 5 we will illustrate how our
       DBMS (ORDBMS) by exploiting the available RDF                                          query language can be used to provide declarative access to
       schema knowledge. Our approach preserves the ex-                                       the catalog content. In the sequel, we illustrate how Por-
       ibility of RDF in re ning schemas and/or enriching                                     tal Catalogs can be easily and e ectivelly represented using
       descriptions at any time whilst it can be customized                                   RDF/S.
       in several ways according to the speci cities of both                                     The middle part of Figure 2 depicts the two schemas em-
       the manipulated desription bases and the underlying                                    ployed by the Open Directory: the rst is intended to cap-
       RDF application scenario.                                                              ture the semantics of web resources, while the second is in-
     Section 5 sketches the functionality of a declarative                                   tended for Portal administrators. The scope of the declara-
       language, called RQL, for querying both RDF de-                                        tions is determined by the corresponding namespace de ni-
       scriptions and related schemas. The implementation                                     tion of each schema, e.g., ns1 (www.dmoz.org/topics.rdfs)
       of RQL on top of RSSDB is described with emphasis                                      and ns2 (www.oclc.com/dublincore.rdfs). For simplicity,
       on how the RQL interpreter pushes query evaluation                                     we will hereforth omit the namespaces as well as the paths
       as much as possible to the underlying ORDBMS.                                          from the root of the topic hierarchies (since topics have
     Section 6 illustrates the performance of RSSDB for                                      non-unique names) pre xing class and property names. Fig-
       storing and querying voluminous RDF descriptions,                                      ure 2 depicts 2 (out of the 16) hierarchies in the ODP topics
       such as the ODP catalog. In particular, RSSDB out-                                     schema, namely, Regional and Recreation, whose topics are
       performs both in storage volumes and query execution                                   represented as RDF classes (see the rdf and rdfs default
       time other approaches using a monolithic table [20,                                    namespaces in the upper part of Figure 2). Various semantic
       18, 7] to represent resource descriptions and schemas                                  relationships exist between these classes either within a topic
       under the form of triples.                                                             hierarchy (e.g., subtopics) or across hierarchies (e.g., related
Finally, Section 7 presents our conclusions and draws direc-                                  topics). The former, is represented using the RDF subclass
tions for further research.
                                  rdf:www.w3.org/1999/02/22-rdf-syntax-ns                        rdfs:www.w3.org/TR/2000/CR-rdf-schema-20000327


          RDF/S
                                                                                      Class
                                                                                                      property

                                  ns1:www.dmoz.org/topics.rdfs                                                        ns2:www.oclc.org/dublincore.rdfs
                                              Regional                                   Recreation


          Portal Schemas                                                                  Lodging                              String
                                                  Paris
                                                                                                                                                   String

                                            Travel                                                                          description
                                                                                    Vacation                                                                Integer
                                                                                    -Rentals                                                title
                                          Hotel                     related                                                                          file_size
                                                                                                                           Ext.Resource                               Date
                                              Hotel Directories                 Ile-de-France                                                      last_modified


                                                                                                                                                              Disneyland
                                                                                                                                           title
          Resource descriptions


                                                          &r1                                                                   &r4       description Official site of
                                                                                   &r2                         &r3                                   Disneyland Paris
                                                            title                     title                      title                     description
                                                                                                                                                             Site officiel de
                                                                                                                                                            Disneyland Paris
                                                     SunScale                 Pulitzer Opera               Bedford

                                          typeOf (instance)         r1: http://www.sunscale.com/france/paris         r3: http://www.hotel-bedford.fr/index/hotel.htm
                                          subClassOf (isA)          r2: http://www.hotelpulitzer.com                 r4: http://www.disneylandparis.com


                                           Figure 2: Modeling in RDF the Catalog of the Open Directory Portal
relationship (e.g., Travel is a, not necessarily direct, sub-                                    of XML trees (i.e., there is no root XML node).
class of Paris) and the latter using an RDF property named                                          Hence, RDF properties are unordered (e.g., the property
related (e.g., between the classes Ile-de-France and Hotel).                                     title can appear before or after the property description), op-
Finally, the ODP administrative metadata schema captures                                         tional (e.g., the property le size is not used), can be multi-
various descriptive elements of Dublin-Core [25] (with the                                       valued (e.g., we have two description properties), and they
execption of the Subject element), as literal RDF properties                                     can be inherited (e.g., to subclasses of ExtResource), while
de ned on class ExtResource. Note that properties serve                                          resources can be multiply classi ed (e.g., &r4). These mod-
to represent attributes of resources as well as relationships                                    eling primitives provide all the exibility we need to rep-
between resources. Properties can also be organized in tax-                                      resent superimposed descriptions of community resources,
onomies in a manner similar to the organization of classes.                                      while preserving a conceptually uni ed view of the descrip-
   Using these schemas, we can see in the lower part of Fig-                                     tion base through one or the union of all related schemas. It
ure 2, the descriptions created for four sites (resources &r1-                                   becomes clear that the RDF data model di ers substantially
&r4). For instance, &r4 is a resource classi ed under both                                       from standard object or relational models [3] or the recently
the classes Paris and ExtResource and has three associated                                       proposed models for semistructured or XML databases [1]:
literal properties: a title property with value \Disneyland"                                          Classes do not de ne object or relation types: an in-
and two description properties with values \OÆcial site of                                              stance of a class is just a resource URI;
Disneyland Paris" and \Site oÆciel de Disneyland Paris"                                               Resources may belong to di erent classes not necessar-
respectively. In the RDF jargon, a speci c resource (i.e.,                                              ily pairwise related by specialization: the instances of
node) together with a named property (i.e., edge) and its                                               a class may have quite di erent properties associated
value (i.e., node) form a statement. Each statement is rep-                                             with them, while there is no other class on which the
resented by a triple having a subject (e.g., &r4), a predicate                                          union of these properties is de ned;
(e.g., title), and an object (e.g., \Disneyland "). The sub-
                                                                                                      Properties may also be re ned by respecting a minimal
                                                                                                        set of constraints (domain and range compatibility).
ject and object should be of a class compatible (under class                                        Note also that due to multiple classi cation, resources
specialization) with the domain and range of the predicate                                       may have quite irregular descriptions modeled only through
(e.g., &r4 is of type ExtResource). In the rest of the pa-                                       an exception mechanism a la SGML [15]. Last but not
per, the term description base will be used to denote a set of                                   least, semistructured or XML models can't distinguish be-
RDF statements. Although not illustrated in Figure 2, RDF                                        tween entity (e.g., ExtResource) and property labels (e.g.,
also supports structured values called containers (e.g., Bags)                                   title). Therefore existing proposals cannot be used, as
for grouping statements, as well as, higher-order statements                                     such, to manipulate RDF description bases. In the sequel,
(i.e., rei cation2 ). Finally, both RDF graph schemas and                                        we will present a logical data model allowing users to issue
descriptions can be serialized in XML using various forests                                      high-level queries on RDF/S graphs, while several physical
                                                                                                 representations can be used by the underlying DBMS to im-
2
    We will not treat rei cation in this paper.                                                  prove storage volumes and optimize queries on these graphs.
            Hierarchy            Max.Depth      Avg.Depth     Max.Subclass      Avg.Subclass    #Topics    #Resources
            news.rdf                  7              5              51              5.625          721         47735
            kat.rdf                   7             4.8             46                6            761          7730
            home.rdf                  8              5              53                5            1722        26688
            shopping.rdf              9             4.8             61               4.5           3349        88821
            health.rdf                9             5.3             52               5.3           3202        45519
            games.rdf                 10            5.4             125              5.3           4857        36181
            computers.rdf             10            5.3             147               5            6010        91597
            recreation.rdf            11           5.87             85              5.47           7269        93929
            business.rdf              11            5.9             52              4.69           6833       161877
            reference.rdf             13           7.13             154             3.92           6483        75105
            sports.rdf                9            5.56             178             7.02          10625        66280
            science.rdf               10           6.53             314             5.05           8667        65939
            society.rdf               12            6.3             157             6.27          16250       161433
            arts.rdf                  11            5.5             267              6.5          25314       214795
            regional.rdf              13            7.7             254             3.37         150762       587152
            Total                     13           6.86             314             4.02         252825      1770781

                                       Table 1: ODP Topic Hierarchy Statistics
3. A FORMAL DATA MODEL FOR RDF                                Each RDF schema uses a nite set of class names C  C
  Since the notion of resource is somehow overloaded in the and property names P  P whose scope is determined by
RDF M&S and RDFS speci cations [17, 6], we distinguish                one or more namespaces. Property types are then de ned
RDF resources w.r.t. their nature into individual entities            using class names or literal types so that: for each p 2 P ,
(i.e., nodes) and properties of entity resources (i.e., edges).       domain(p) 2 C and range(p) 2 C [ L. We denote by H =
                                                                      (N; ) a hierarchy of class and property names, where N =
     Nodes : a set of individual resources, representing             C [ P . H is well-formed if  is a smallest partial ordering
       abstract or concrete entities of independent existence,        such that: if p1 ; p2 2 P and p1  p2 , then domain(p1 ) 
       e.g., class ExtResource de ned in an RDF Schema or             domain(p2 ) and range(p1 )  range(p2 )3 .
       a speci c web resource e.g., &r4 (see Figure 2).
     Edges : a set of properties, representing both at-              3.1 Additional RDF Constraints
       tributes of and binary relationships between nodes, ei-           Three remarks are noteworthy. First, unlike the current
       ther abstract or concrete, e.g., the property title de-        RDF M&S and RDFS speci cations [17, 6] the domain and
         ned in our example schema and used by the speci c            range of properties should always be de ned. This additional
       resource &r4.                                                  constraint is required mainly because the sets of RDF/S
                                                                      classes C and literal types L are disjoint. Then, a prop-
RDF Resources are also distinguished according to their               erty with unde ned range may take as values both a class
concreteness into tokens and classes.                                 instance (i.e., a resource) or a literal. Since, the union of
                                                                      rdfs:Class and rdfs:Literal is meaningless in RDF/S,
    Tokens : a set of concrete resources, either objects,            this freedom leads to semantic inconsistencies. Additional
      or literals (e.g., &r4, \SunScale").                            semantic problems arise in the case of specialization of prop-
    Classes : a set of abstract entity or property re-               erties with unde ned domain and range. More precisely, to
      sources, in the sense that they collectively refer to a set     preserve the set inclusion requirement of specialized proper-
      of objects similar in some respect (e.g., ExtResource).         ties (binary predicates) their domain and range should also
                                                                      specialize the domain and range (unary predicates) of their
   To label abstract (i.e., schema) or concrete (i.e., token)         superproperties. This is something which, in the case of mul-
RDF nodes and edges, we assume the existence of the fol-              tiple specialization of properties (see Figure 3 -a-), cannot
lowing countably in nite and pairwise disjoint sets of sym-           always be ensured because RDF/S do not support intersec-
bols: C of Class names, P of Property names, U of Resource            tion of classes (in a Description Logics style). The second
URIs as well as a set L of Literal type names such as string ,        constraint imposes that both domain and range declarations
integer, date, etc. Each literal type t 2 L has an
                                                 S associated         of a property be unique. This is foremost required because
domain, denoted dom(t) and dom(L) denotes t2L dom(t)                  RDF/S do not support union of classes, which can be consid-
(i.e., the rdfs:Literal declaration). Without loss of gen-            ered as the domain of properties. Furthermore, it is not pos-
erality, we assume that the sets C and P are extended to              sible to infer domains in case of specialization of properties
include as elements the class name Class and the prop-                with multiple domains (see Figure 3 -b-). In this context,
erty name Property respectively. The former captures the              the rdfs:Class de nition should be used, in order to de ne
root of a class hierarchy (i.e., the rdfs:Class declaration)          a property with domain or range all the token resources.
while the latter captures the root of a property hierarchy            Conforming to RDF/S, rdf:Resource should be used as
(i.e., the rdf:Property declaration) de ned in RDF/S [17,             the domain or range, when we need to de ne properties
6]. The set P also contains integer labels (f1; 2; : : : g) used      viewing schema classes also as tokens (e.g., rdfs:seeAlso,
as property names (i.e., the rdfs:ContainerMembership-                rdfs:isDefinedBy, rdfs:comment, rdfs:label).
Properties declaration) by the members of container values
(i.e., the rdfs:Bag, rdfs:Sequence declarations).                     3
                                                                          The symbol  extends  with equality.
                                                                                         nite set of URIs (of type U ) to each class name6 is cap-
                                                                                       tured by a population function  : C ! 2U . The set of all
                                                                                       values forseen by our model is denoted by V .
                                                                                           De nition 1. The interpretation function [ :] is de ned
                                                                                       as follows:
                                                                                             for literal types: [ L] = dom(L);
Figure 3: Inconsistencies in Property Specialization                                         for the Bag type, [ f g] = fv1 ; v2 ; : : : vn g where v1 ; v2 ; : : : vn
                                                                                              2 V are values of type  ;
   Last, we need to consider an additional constraint of a
syntactic nature, imposing that class and property de ni-                                    for the Seq type, [ [ ]]] = [v1 ; v2 ; : : : vn ] where v1 ; v2 ; : : : vn
tions be complete. This means that, on the one hand, super-                                   2 V are values of type  ;
class and superproperty declarations should accompany the                                    for the Alt type [ (1 + 2 + : : : + n )]] = vi where vi 2 V
class and property de nitions respectively and, on the other                                  1 < i < n is a value of type i ;
hand, the domain and range of a property should be given
along with the de nition of the property. In this manner,
                                                                                             for each class c 2 C , [ c] = (c) [ Sc c [ c0 ] ;
                                                                                                                                             0


the extension of existing RDF hierarchies of names by re n-                                  for each property p 2SP , [ p] 0= f[v1 ; v2 ] j v1 2 [ domain(p)]];
ing their classes and properties in a new namespace is still                                  v2 2 [ range(p)]]g [ p p [ p ] .
                                                                                                                          0

permitted, while at the same time semantic inconsistencies                                As usual, the interpretation of classes and properties in
that may arise due to arbitrary unions of used hierarchies                             our model is set-based. This implies that a resource URI
are avoided. Such inconsistencies include the introduction of                          appears only once in the extent of a class even when it is
multiple ranges of properties or the introduction of cycles in                         classi ed several times under its subclasses (i.e., it belongs to
class or property hierarchies. Unlike the current RDF M&S                              the direct extent of the most speci c class). The notation
and RDFS speci cations [17, 6] this constraint ensures that                            ^[[:] is used in Table 2 to distinguish between strict and
the union of two valid RDF hierarchies of names is always                              extended interpretation of classes and properties.
valid. The imposed constraints (C3-C5) are summarized in                               3.3 RDF Description Bases and Schemas
Table 2 using the notation introduced in this section.
                                                                                         De nition 2. An RDF schema is a 5-tuple RS = (VS ;
3.2 RDF Typing System                                                                  ES ; ; ; H ), where: VS is the set of nodes and ES is the set
   In this context, RDF data can be atomic values (e.g.,                               of edges, H is a well-formed hierarchy of class and property
strings), resource URIs, and container values holding query                                                
                                                                                       names H = (N; ),  is a labeling function  : VS ES               [ !
results, namely rdf:Bag (i.e., multi-sets) and rdf:Sequence                               [
                                                                                       N T , and       is an incidence function : ES VS VS .      !       
(i.e., lists). The main types foreseen by our model are:                                 The incidence function captures the rdfs:domain and rdfs:
                                                                                       range declarations of properties7 . Note that the incidence
        = Lj U j f g j [ ] j (1 :  + 2 :  + : : : + n :  )                     and labeling functions are total in VS [ ES and ES respec-
                                                                                       tively. This does not exclude the case of schema nodes which
where L is a literal type in L, f:g is the Bag type, [:] is the                       are not connected through an edge. Additionally, we impose
Seq uence type, (:) is the Alternative type, and U is the type                        a unique name assumption on the labels of RS nodes and
for resource U RI s4 . Alternatives capture the semantics of                           edges.
union (or variant) types [8], and they are also ordered (i.e.,                            De nition 3. An RDF description base, instance of a
integer labels play the role of union member markers). Since                           schema RS , is a 5-tuple RD = (VD ; ED ; ; ; ), where: VD
there exists a prede ned ordering of labels for sequences                              is a set of nodes and ED is a set of edges, is the incidence
and alternatives, labels can be omitted (for bags, labels are                          function : ED ! VD  VD ,  is a value function  : VD !
meaningless). Furthermore, all types are mutually exclusive                            V , and  is a labeling function  : VD [ ED ! 2N [T which
(e.g., a literal value cannot also be a bag) and no subtyping                          satis es the following:
relation is de ned in RDF/S (e.g., between bags of di erent                                 for each node v in VD ,  returns a set of names n 2
types). The set of all type names is denoted by T .                                          C [ T where the value of v belongs to the interpretation
   The proposed type system o ers all the arsenal we need                                    of each n:  (v ) 2 [ n] ;
to capture containers with both homogeneous and heteroge-                                   for each edge  in ED going from node v to node v0 , 
neous member types, as well as, to interpret RDF schema                                      returns a property name p 2 P .
classes and properties. For instance, unnamed ordered tu-                                 Note that the labeling function captures the rdf:type dec-
ples denoted by [v1 ; v2 ; : : : ; vn ] (where vi is of some type i )                 laration that associates each RDF node with one or more
can be de ned as heterogeneous sequences5 of type [(1 +                               class names (opposite to traditional object models) which
2 + : : : + n )]. Unlike traditional object data models, RDF
classes are then represented as unary relations of type fU g                          may be de ned in several well-formed hierarchies of names.
while properties as binary relations of type f[U ; U ]g (for                         Additionally, integer labels (f1; 2; : : : g) are used as property
relationships) or f[U ; L ]g (for attributes). Furthermore,                          names by the members of RDF container values. The im-
RDF containers can also be used to represent n-ary rela-                               posed constraints (C6-C7) are summarized in Table 2 using
tions (e.g., as a bag of sequences). Finally, assignment of a                          the notation introduced in this section.

4                                                                                      6
  In Section 5, we will see that our query language treats                               Note that we consider here a non-disjoint object id assign-
URIs, i.e., identi ers, as simple strings.                                             ment to classes due to multiple classi cation.
5
  Observe that, since tuples are ordered, for any two permu-                           7
                                                                                         Constraint C4 of Table 2 ensures that rdfs:domain and
tations i1 ; : : : ; in and j1 ; : : : ; jn of 1; : : : ; n, [i1 : v1 ; : : : ; in :   rdfs:range are not any more relations as in the current
vn ] is distinct from [j1 : v1 ; : : : ; jn : vn ].                                    RDF M&S and RDFS speci cations [17, 6].
    Alphabets:    C1   Class, Property and Type names are mutually exclusive
                               C\P \T =;
                  C2   Literal, Resources and Container values are mutually exclusive
                               L \ U \ V n (L [ U ) = ;
    Schema:       C3   8c; c0 ; c00 2 C :
                  C3.1  Class is the root of the class hierarchy:
                                   c  Class
                  C3.2     subClassOf         relation is transitive:
                                   c  c0 ; c0  c00 ) c  c00
                  C3.3     subClassOf         relation is antisymmetric:
                                   c  c0 ) c =  6 c0
                  C4     Domain and range of properties should be de ned and they should be unique:
                                8p 2 P; 9!c1 2 C (c1 = domain(p)) ^ 9!c2 2 C [ TL (c2 = range(p))
                  C5     8p; p0 ; p00 2 P :
                  C5.1     Property is the root of the property hierarchy:
                                   p  P roperty
                  C5.2     subPropertyOf           relation is transitive:
                                   p  p0 ; p0  p00 ) p  p00
                  C5.3     subPropertyOf           relation is antisymmetric:
                                   p  p0 ) p =   6 p00 0
                  C5.4     if p is subPropertyOf          p then the domain (range) of p is a subset of the domain (range) of p0 :
                                   p  p0 ) domain(p)  domain(p0 ) ^ range(p)  range(p0 )
    Data:         C6     8v 2 V :
                  C6.1     if v is a URI then it is an instance of one or more classes not related by :
                                   v 2 U ) (v )  C
                                   8c; c0 2 C ; v 2 ^[[c] ; c  c0 ) v 2= ^[[c0 ]
                  C6.2     if v is a literal value then it is an instance of one (and only one) Literal type:
                                   v 2 L ) (v )  TL and (v ) is a singleton
                  C6.3     if v is a container value then it is an instance of one (and only one) Container type:
                                   v 2 V n (L [ U ) ) (v )  TBjS jA and (v ) is a singleton
                  C7     8p 2 P ; [v1 ; v2 ] 2 [ p] :
                  C7.1     if p belongs to the set f1; 2; : : : g then v1 is an instance of either rdf:Bag or rdf:Seq or rdf:Alt:
                                   p 2 f1; 2; : : : g ) (v1 )  TBjS jA
                  C7.2     if p doesn't belong to the set f1; 2; : : : g then v1 (v2 ) belongs to the domain (range) of p:
                                   p 2 P n f1; 2; : : : g ) 9n 2 (v1 ); n  domain(p) ^ 9m 2 (v2 ); m  range(p)
                  C7.3     Additionally, the direct extends          of properties must be unique
                                   p 2 P n f1; 2; : : : g ) (8p0 2 P n f1; 2; : : : g; p  p0 ) [v1 ; v2 ] 2
                                                                                                           = ^[[p0 ] )
    Rei cation:   C8     A rei ed statement should have exactly one rdf:predicate, rdf:subject and rdf:object property

                             Table 2: Formal De nition of Imposed RDF Constraints
3.4 The Validating RDF Parser (VRP)                         the constraints speci ed in the RDF M&S and RDFS speci-
  To conclude this section, we brie y describe the Validat-  cations [17, 6] as well as the additional constraints we intro-
ing RDF Parser (VRP) we have implemented to analyze,        duced previously (see Table 2). This permits the validation
validate and process RDF descriptions. Unlike other RDF             of both the RDF descriptions against one or more RDFS
parsers (e.g., SiRPAC8 ), VRP (see Figure 4) is based on            schemas, and the schemas themselves. The VRP validation
standard compiler generator tools for Java, namely CUP/-            module relies on (a) a complete and sound algorithm [24] to
JFlex (similar to YACC/LEX). As a result, users do not              translate descriptions from an RDF/XML form (using both
need to install additional programs (e.g., XML Parsers) in          the Basic and Compact serialization syntax) into the RDF
order to run VRP. The VRP BNF grammar can be eas-                   triple-based model (b) an internal object representation of
ily extended or updated in case of changes in the RDF/S             this model in Java, allowing to separate RDF schema from
speci cations. VRP is a 100% Java(tm) development under-            data information. As we will see in the sequel, this approach
standing embedded RDF in HTML or XML and providing                  enables a ne-grained processing of RDF statements w.r.t.
full Unicode support. The stream-based parsing support of           their type, which is crucial in order to implement an incre-
JFlex and the quick LALR grammar parsing of CUP ensure              mental loading of RDF descriptions and schemas.
good performance when processing large volumes of RDF
descriptions. Currently VRP is a command line tool with             4. THE RDF SCHEMA SPECIFIC DATABASE
various options to generate a textual representation of the           This section describes the persistent RDF store (RSSDB)
internal model (either graph or triple based).                      for loading resource descriptions in an object-relational DBMS
   The most distinctive feature of VRP is its ability to verify     (ORDBMS). We begin by presenting our schema generation
                                                                    strategy in comparison to monolithic approaches [20, 18, 7]
8
    www.w3.org/RDF/Implementations/SiRPAC                           representing resource descriptions and schemas as triples.
    Figure 4: The Validating RDF Parser (VRP)
   The core RDF/S model is represented by four tables (see
Figure 6-A), namely, Class, Property, SubClass and Sub-
Property which capture the class and property hierarchies
de ned in an RDF schema. To save space when storing class
and properties names, we also consider a table NameSpace           Figure 5: RDF Schema Speci c DataBase & Loader
keeping the namespaces of the RDF Schemas stored in the
ORDBMS. Furthermore, a table Type is used to hold the              with URI object values). Indices (i.e., B-trees) are con-
names of RDF/S built-in types (e.g., rdf:Property, rdfs:-          structed for all table attributes.
Class), as well as, those of Container (e.g., rdf:Bag, rdf:Seq,       Compared to GenRepr, our SpecRepr is exible enough
rdf:Alt) and Literal types (e.g., string, integer, date).          to allow the customization of the physical representation of
The main novelty of our representation is the separation           RDF metadata in the underlying ORDBMS. This is impor-
of the RDF schema from data information, as well, as the           tant since no representation is good for all purposes and in
distinction between unary and binary relations holding the         most real-scale RDF applications variations of a basic rep-
instances of classes and properties. More precisely, class ta-     resentation are required to take into account the speci c
bles store the URIs of resources while property tables store       characteristics of employed schema classes and properties.
the URIs of the source and target nodes of the property.           Our main goal here is to reduce the total number of cre-
Indices are constructed on the attributes URI, source and          ated instance tables. This is justi ed by the fact that some
target of the above tables in order to speed up joins and          commercial ORDBMSs (and not PostgreSQL) permit only
the selection of speci c tuples of the tables. Indices are also    a limited number of tables. Furthermore, numerous tables
constructed on the attributes lpart, nsid and id of the ta-        (e.g., the ODP catalog implies the creation of 252840 tables,
bles Class and Property and on the attribute subid of the          i.e. one for each topic) have a signi cant overhead on the
tables SubClass and SubProperty.                                   response time of all queries (i.e., to nd and open a table,
   It should be stressed that the taxonomic relationships          its attributes, etc.). One of the possible variations we have
between schema labels is captured by the SubClass and              experimented for the ODP catalog is the representation of
SubProperty tables, while the corresponding instance tables        all class instances by a unique table Instances (see dashed
are also connected through the subtable relationship, sup-         rectangular in Figure 6-A). This table has two attributes,
ported today by all ORDBMSs9 . In other words, RSSDB               namely uri and classid, for keeping the uri's of the re-
relies on a schema speci c representation of resource de-          sources and the id's of the classes in which resources belong.
scriptions, called SpecRepr, similar to the attribute-based ap-    The bene ts of this SpecRepr variation are illustrated in Sec-
proach proposed for storing XML data [13, 23]. SpecRepr            tion 6 given that most ODP classes (i.e. topics) have few or
preserves the exibility of RDF in re ning schemas and/or           no instances at all (more than 90% of the ODP topics con-
enriching descriptions at any time, and as we will see in the      tain less than 30 URIs). For other RDF schemas it could be
sequel, it can be customized in several ways according to the      also interesting to represent in a similar way all the instances
speci cities of both the manipulated description bases and         of properties, but in general real-scale RDF schemas have
the RDF application scenario (i.e., querying functionality).       more classes than properties. Another alternative to our ba-
   This is not the case of other proposals employing a mono-       sic SpecRepr could be the representation of properties with
lithic table [20, 18, 7] to represent RDF metadata under           range a literal type, as attributes of the tables created for the
the form of triples. These approaches provide a generic rep-       domain of this property. Consequently, new attributes will
resentation (i.e., for all RDF schemas), called GenRepr, of        be added to the created class tables. The tables created for
both RDF schemas and resource descriptions using two ta-           properties with range a class will remain unchanged. The
bles (see Figure 6-B), namely, Resources and Triples. The          above representation is applicable to RDF schemas where
former represents each resource (including schema classes          attribute-properties are single-valued and they are not spe-
and properties) by a unique id whereas the latter represents       cialized. Finally, our SpecRepr opens the way for a more
the statements made about the resources (including classes         cleaver representation of class and property names using ap-
and properties) under the form of predicate-subject-object         propriate encoding systems that preserve the taxonomic re-
triples (captured by predid, subid and objid respectively).        lationships of schema labels and enable to optimize recursive
Note that in the Triples table we distinguish between prop-        traversals on subclass (or subproperty) hierarchies.
erties representing attributes (i.e., objvalue with literal val-   4.1 The RDF Description Loader
ues) and those relationships between resources (i.e., objid          Figure 5 depicts the architecture of our system for loading
9
 Note that the syntactic constraint (see Section 3) impos-         RDF metadata in an ORDBMS, namely PostgreSQL10 . The
ing complete class and property de nitions, ensures that           loader has been implemented [4] in Java and communication
the table hierarchy created in RSSDB can be only extended          10
through specialization                                                  www.postgresql.org
                            Figure 6: Relational Representation of RDF Description Bases
with PostgreSQL relies on the JDBC protocol.                         ed under the topic of interest. Note that, in order to locate
   The loader comprises two main modules. The rst mod-             resources classi ed under the subtopics (e.g., Hotel Directo-
ule checks the consistency of analyzed schemas descriptions        ries) of a given topic, browsing should be continued by the
in comparison with the stored information in the ORDBMS.           users. RQL aims to simplify such tasks, by providing power-
For example, in case that an analyzed property has already         ful path expressions permitting smooth ltering/navigation
been stored, it checks whether its domain and range are the        on both Portal schemas and resource descriptions. Then the
same as the ones stored in the ORDBMS. Another func-               previous query can be expressed as follows:
tionality of this module is the validation of RDF metadata            Q: Find the resources under the hierarchy Regional, about
based on stored RDF schemas instead of connecting to the           hotels in Paris whose title matches "*Opera*".
respective namespaces. Thus, we avoid analyzing and vali-          select Z
dating repeatedly the RDF schemas used in metadata and             from      (select $X
reduce the required main memory, since only parts of RDF                     from       Regionalf:$Xg
schemata are fetched. Consequently, our system enables in-                   where      $X like "*Hotel*"
cemental loading of RDF descriptions and schemas, which                                 and $X < Paris)fYg.fZgtitlefTg
is crucial for handling large RDF schemas and even larger          where T like "*Opera*"
RDF description volumes created using multiple schemas.               The schema path expression in the from clause of the
   The second module implements the loading of RDF de-             nested query, states that we are interested in classes (vari-
scriptions in the DBMS. To this end, a number of load-             ables pre xed by $ like $X ) specializing the root Regional.
ing methods have been implemented as member functions              Then, the ltering condition in the where clause will re-
of the related VRP internal classes. Speci cally, for every        tain only the classes whose name matches "*Hotel*" and
attribute of the classes of the VRP model, a method is cre-        they are also subclasses of Paris (e.g., Hotel and Hotel Di-
ated for storing the attribute of the created object in the        rectories. Here, to get all relevant topics, the only required
ORDBMS. For example, the method storetype is de ned                schema knowledge is that the subtopics of Regional contain
for the class RDF Resource, in order to store object type          geographical information and a topic Paris is somewhere in
information. The primitive methods of each class are incor-        the hierarchy. Thanks to RQL typing, variable Y is of type
porated in a storage method de ned in the respective class         class name and ranges over the result of the nested query.
invoked during the loading process. A two-phase algorithm          The data path expression in the outer query iterates over the
is used for loading the RDF descriptions. During the rst           source (variable Z of type resource URI) and target values
phase, RDF class and properties are stored to create the cor-      (variable T of type resource URI) of the title property. The
responding schema. During the second phase the database            \." implies an implicit join condition between the extent of
is populated with resource descriptions.                           each class valuating Y and the resources valuating Z . Fi-
                                                                   nally, the result of Q will be a bag of resources whose title
                                                                   value matches "*Opera*". Obviously, RQL schema queries
5.   QUERYING RDF DESCRIPTION BASES                                are far more powerful than the corresponding topic queries
   As shown in Table 1, the catalog of Portals like Netscape       of common portals, which allow only full-text queries on the
Open Directory comprises huge hierarchies of classes and an        names of topics. Furthermore, compositions of schema and
even bigger number of resource descriptions. It becomes ev-        data queries like Q, are not possible in current portals, since
ident that browsing such large description bases is a quite        one cannot specify that some query terms should match the
cumbersome and time-consuming task. Consider, for in-              topic names and other should be found in the descriptions
stance, that we are looking for information about hotels           of the resources.
in Paris, under the topic Regional of ODP (see Figure 2).             To make things more complex, relevant information in
ODP allows users to navigate through the topic hierarchy;          portals may also be found under di erent hierarchies, that
as shown in Table 1, even if one knows the exact path to           may be \connected" through related links. For example, as
follow, this would require approximately 8 steps, in order to      we can see in Figure 2, information about hotels in Paris
reach the required topics (i.e., Hotels, Hotel Directories in      may also be found in the Recreation hierarchy. Moreover,
Figure 2). Then, in order to nd the URIs of the sites whose        such links are not necessarily bi-directional; thus, a user
title matches the string "*Opera*", users are forced to man-       starting from the Regional hierarchy may never nd out
ually browse the entire collection of resources directly classi-   that similar information may be found under Recreation
       (a)                   Project                                 (b)                   Project                              (c)                  Project
                                 Z                                                             Z                                                          Z

                              Select                                                        Select                                                    Select

                                 T like ‘‘*Opera*’’                                            T like ‘‘*Opera*’’                                         T like ‘‘*Opera*’’

                              DJoin                                                         DJoin                                                      Join

                                                                                                                                                      Y=C

             Map                                 Semijoin                  Map                                Semijoin                Select                            Join
                                                                                                                                          Y like ‘‘*Hotel*’’
                 Y:$X                             W=Z                          Y:$X                            W=Z                        and                          W=Z
                                       ^(Y)[W]          title[Z,T]                                   Select          title[Z,T]           Y < Paris         Instance[C,W]    title[Z,T]
             Project                                                       Project                                              subclassOf(Regional)[Y]
                                                                                                         C=Y
                 $X                                                            $X
                                                                                                     Instance[C,W]
             Select                                                        Select
                 $X like ‘‘*Hotel*’’                                           $X like ‘‘*Hotel*’’
                 and                                                           and
                 $X < Paris                                                    $X < Paris
       subclassOf(Regional)[$X]                                      subclassOf(Regional)[$X]


                                                  Figure 7: Example of an RQL Query Optimization
e.g., Vacation-Rentals. For such cases, RQL path expres-                                              left-hand side (i.e., nested query) to the right-hand side.
sions allow us to navigate through schemas as for exam-                                               Using the database representation, we can substitute the
ple, f: $Z grelated:Regionalf: $X g. The above examples                                               scan operation on class instances (^(Y )[W ]) with a selection
illustrate a unique feature of RQL, namely that RQL pro-                                              on the Instances relation, as shown in Figure 7 (b). Then,
vides the ability to smoothly switch between schema and                                               each of the subqueries in the shadow boxes can be pushed
data querying while exploiting - in a transparent way - the                                           down to PostgreSQL, leaving the evaluation of the Djoin to
taxonomies of labels and multiple classi cation of resources                                          the interpreter. However, transforming the semijoin on the
(unlike logic-based RDF query languages, e.g., SiLRI [11],                                            right into a join and pushing the selection criterium C =
Metalog [19]). More examples on RQL can be found in [14].                                             Y up to the Djoin operation leads to the evaluation plan
                                                                                                      shown in Figure 7 (c). As one can see, the whole query
5.1 RQL Interpreter                                                                                   is expressible in SQL3 (subclassof and issubclassof are
   The RQL interpreter, implemented in C++, consists of                                               procedural SQL functions) as follows:
                                                                                                      select Z.source
(a) the parser, analyzing the syntax of queries; (b) the graph                                        from subclassesof(Regional) Y, Intances I, title* Z
constructor, capturing the semantics of queries in terms of                                           where issubclassOf(X, Paris) and X like \*Hotel" and
typing and interdependencies of involved expressions; and                                                      I.classid=Y and Z.source=I.uri and
(c) the evaluation engine, accessing RDF descriptions from                                                     Z.target like \*Opera*"
the underlying database [16]. Since our implementation re-                                               Note that the * indicates an extended interpretation of
lies on a full- edged ORDBMS like PostgreSQL, the goal                                                tables, according to the subtable hierarchy. Thus, the whole
of the RQL optimizer is to push as much as possible query                                             query can be pushed down to PostgreSQL, and the Post-
evaluation to the underlying SQL3 engine. Then pushing                                                greSQL optimizer can perform on further (common) opti-
selections or reordering joins to evaluate RQL path expres-                                           mizations, as pushing down selections, join reordering etc.
sions is left to PostgreSQL while the evaluation of RQL
functions for traversing class and property hierarchies relies                                        6. PERFORMANCE TESTS
on the existence of appropriate indices (see Section 6). The                                            In order to evaluate the performance of the two represen-
main diÆculty in translating an entire RQL algebraic ex-                                              tations, we used as testbed the RDF dump of the Open Di-
pression (expressed in an object algebra a la [10]) to a single                                       rectory Catalog (version of 16-01-2001). Experiments have
SQL3 query is due to the fact that most RQL path expres-                                              been carried out on a Sun with two UltraSPARC-II 450MHz
sions interleave schema with data querying [9]. This is the                                           processors and 1 GB of main memory, using PostgreSQL
case of the query Q presented previously.                                                             (Version 7.0.2) with Unicode con guration. During loading
   Figure 7 (a) illustrates the algebraic translation of Q. The                                       and querying, we have used 1000 and 10000 bu ers (of size
translation of the nested query - given in the left branch -                                          8KB) respectively. We have loaded 15 ODP hierarchies (see
is rather straightforward: the class variable $X over all the                                         Table 1) with a total number of 252825 topics (i.e., classes)
subclasses of Regional and its values are ltered according                                            contained in 51MB of RDF/XML les11 . For these hierar-
to the conditions in the where clause. The operator Map on                                            chies, we have also loaded the corresponding descriptions of
top is a simple variable renaming for the iterator (Y ) de ned                                        1770781 resource URIs (672MB).
over the nested query result. Then, the data path expression                                          6.1 Loading
in the from clause is translated into a semi-join between the
source-values of title and the proper instances of the class                                             The leftmost graph of Figure 8 depicts the database size
extents (W ) returned by the nested query, as shown in the                                            required to load the ODP schema and resource descrip-
right branch. The connection between the two expressions                                              tions measured in terms of triples in the schema-speci c
is captured by a Djoin operation (i.e., a dependent join in                                           (SpecRepr) and in the generic (GenRepr) representation sche-
which the evaluation of the right expression depends on the                                           mes. Note that to each class de nition correspond two
evaluation of the left one). Djoin corresponds to a nested                                           11
                                                                                                        This is the volume of the pure ODP schema, produced
loop evaluation with values of variable Y passed from the                                             when properties attributed to the classes are removed.
                        Figure 8: Statistics for Loading Schema and Resource Descriptions
triples: one for the class itself and one for its unique super-   ples), 202MB for Resources (2022869 tuples) and the total
class (multiple class specialization is not used in the ODP       size of indexes on these tables is 900MB. In SpecRepr, the
Catalog). We can observe that in both representations the         size is 32MB for Class (252825 tuples), 8KB for Property
size of the DBMS scales linearly with the number of schema        (5 tuples), 11MB for SubClass (252825 tuples) and 0MB for
triples. The tests show that, in SpecRepr, each schema            SubProperty, and the total size of indexes on these tables is
triple requires on average 0.086KB versus 0.1582KB in the         44MB. The size of the Instances table is 150MB (1770781
GenRepr. This is mainly attributed to the space saved in          tuples) whereas the size of indexes created on it is 140 MB.
SpecRepr due to the Namespace table, as well as to the fact
that for each class de nition in GenRepr three tuples are         6.2 Querying
stored: one in table Resources and two tuples in table               Table 3 describes the RDF query templates that we used
Triples (i.e., 1770781 extra tuples). Although not illus-         for our experiments, as well as their respective algebraic ex-
trated here, the average time for loading a schema triple is      pressions in the two representations (capital letters abreviate
about 0.0021 sec and 0.0025 sec respectively in the two rep-      the table names of Figure 6). This benchmark illustrates the
resentations. The time required to store a schema triple is       desired querying functionality for RDF description bases,
larger in GenRepr because of extra (252825) tuples stored.        namely: a) pure schema queries on class and property de -
When indices are constructed, the average storage volumes         nitions (e.g., Q1-Q4); b) queries on resource descriptions us-
per schema triple become 0.1734KB (SpecRepr) and 0.3062KB         ing available schema knowledge (e.g., Q5-Q9); and c) schema
(GenRepr) and the average loading times become 0.0025 sec         queries for speci c resource descriptions (e.g., Q10, Q11). As
and 0.00317 sec respectively. The total validation time for       a matter of fact, these query templates depict the core func-
the ODP schema is 1539 sec.                                       tionality of RQL presented in the previous section.
   The database size required for the resource descriptions          To get signi cant experimental results, we carried out all
is shown in the rightmost graph of Figure 8. As expected,         benchmark queries several times: one initially to warm up
the DBMS size in both representations scales linearly with        the database bu ers and then nine times to get the average
the number of data triples. The average space required to         execution time of a query. Table 4 illustrates the obtained
store a data triple is 0.123KB in both representations. This      execution time (in sec) for both representations in up to
should not appear as a surprise since the extra space re-         three di erent result cases per query. The main observa-
quired in SpecRepr for storing the URIs of resources in the       tion is that SpecRepr outperforms GenRepr for all types of
property tables compensates for the extra tuples stored for       queries considered. The deviation in performance is more
each resource description in GenRepr. Note that we could          apparent in the cases where self-join computations on the
obtain better storage volumes in the SpecRepr by encoding         large Triples table are required.
the resource URIs as integers, as we did in the GenRepr,             GenRepr and SpecRepr exhibit comparable performance
but this solution comes with extra loading and join costs         in queries Q1, Q2, Q5, Q7, Q10 and Q11, with SpecRepr
(between the class and property tables) for the retrieval of      outperforming GenRepr by a factor of up to 3.73 approx-
the URIs. The tests also show that the average time for           imately. This is clearly illustrated in Q1, where one tu-
loading a data triple is about 0.0033 sec and 0.0039 sec          ple is selected from both table Triples (selectivity 1,7e-
respectively in the two representations. When indices are         5%) and Property (selectivity 20%) using index and se-
constructed, the average storage volumes per data triple be-      quential scans respectively. As expected, in Q2, we can see
come 0.2566KB (SpecRepr) and 0.2706KB (GenRepr) while             that the time required to lter a table in both representa-
the average loading time become 0.0043 sec and 0.00457 sec        tions depends on the number of tuples in the query results:
respectively. Despite the use of ids, the indexes in GenRepr      we have experimented with classes having 1, 30, 314 sub-
take up more space because of: a) the extra tuples stored         classes which represent in GenRepr (SpecRepr) selectivities
b) the index constructed on the predid attribute for which        of 1.7e-5% (3.955e-4%), 5.14e-4% (1,19e-2%) and 5.38e-3%
there is no corresponding index in SpecRepr.                      (0.124%) for table Triples (SubClass) in the three cases
   To summarize, after loading the entire ODP catalog, the        respectively. The (semi-)joins involved in the evaluation of
size of tables in GenRepr is 545MB for Triples (5835756 tu-       queries Q5, Q7 and Q11 incur an additional cost for GenRepr,
     Query Description                           Algebraic Expression                      Algebraic Expression
                                                 in GenRepr                                in SpecRepr
       Q1      Find the range (or domain)        predid=9^subjid=propid (T )              id=propid (P )
               of a property
       Q2      Find the direct subclasses        predid=6^objid=clsid (T )                superid=clsid (SC )
               of a class
       Q3      Find the transitive sub-          repeat Wi       (Wi 1                     repeat Wi      (Wi 1
               classes of a class                 
                                                 > id=subjid (predid=6 (T ))) Wi 1         
                                                                                           > id=superid SC ) Wi 1
                                                 until Wi = Wi 1                           until Wi = Wi 1
       Q4      Check if a class is a             repeat Wi       (Wi 1                     repeat Wi      (Wi 1
               subclass of another class          
                                                 > id=objid (predid=6 (T ))) Wi 1          
                                                                                           > id=subid SC ) Wi 1
                                                                     _
                                                 until Wi = Wi 1 clsid Wi       2          until Wi = Wi 1 clsid _      2 Wi
       Q5      Find the direct extent of         (predid=5^objid=clsid (T ))              id=clsid (I )
               a class (or property)             >subjid=id R
       Q6      Find the transitive extent        [clsid2Q3 ((predid=5^objid=clsid(T ))    [clsid2Q3 (id=clsid (I ))
               of a class (or property)          >subjid=id R)
       Q7      Find if a resource is             (predid=5^objid=clsid (T ))              URI =r^id=clsid (I )
               an instance of a class            1subjid=id (URI =r (R))
       Q8      Find the resources having         (predid=propid^objvalue=val (T ))        target=val (tpropid )
               a property with a speci c         1subjid=id R
               (or range of) value(s)
       Q9      Find the instances of a class     (predid=5^objid=clsid (T ))             (id=clsid (I )) >source=URI
               that have a given property        >subjid=subjid (predid=propid (T ))    (tpropid )
                                                 >subjid=id (R)
      Q10      Find the properties of a          (URI =r (R)) 1id=subjid                 [propid2P (source=r(tpropid ))
               resource and their values         (predid6=5 (T )) 1predid=id (R)
      Q11      Find the classes under which      (URI =r (R)) >id=subjid predid=6 (T ) URI =r (I )
               a resource is classi ed
                         Table 3: Benchmark Query Templates for RDF Description Bases
whereas in Q10 the join cost (for GenRepr) is comparable to its evaluation involves index scans on the property table,
the cost of evaluating set union (for SpecRepr).            whereas in GenRepr di erent evaluation plans are executed
   Queries Q3, Q4 and Q6 involve a transitive closure compu-         in each case. Q9 has been tested for three properties with
tation (using a variation of the Æ -wavefront algorithm [21])        6292, 52029 and 1770584 instances respectively.
over the subclass hierarchy. SpecRepr outperforms GenRepr               To summarize, SpecRepr outperforms GenRepr, which -
by a factor of up to 2.8. In Q3 and Q6, we use the same              in the case of complex path expressions - pays a severe per-
three classes having 3, 30 and 3879 subclasses and a total of        formance penalty for maintaining large tables. We argue
2, 20 and 9049 instances respectively. The execution times           that the performance of SpecRepr can be further improved
in these three cases depend on the sizes of intermediate re-         by employing an appropriate encoding system (e.g., Dewey,
sults (i.e., the costs of joins involving the tables Triples or      post x, pre x, etc.) that preserves the taxonomic relation-
SubClass) as well as, the number of iteration steps of the           ships of schema labels. In this way, checking subclass rela-
algorithm (i.e., the length of the longest path from the given       tionships can be done in constant time. We believe that this
class to its leaves, called depth). In Q4, for the same root         approach will prove to be quite useful, not only for RDF, but
class, we have checked for subclasses residing at depth 3,           for tree-structured data, such as XML [2].
5 and 7 respectively. The di erence in the obtained times
between Q3, Q6 and Q4 is due to the di erent evaluation              7. SUMMARY
method used: "top-down" for the former (i.e., from the sub-             In this paper we presented the architecture and function-
tree root to the leaves) and "bottom-up" for the latter.             ality of ICS-FORTH RDFSuite, a suite of tools for RDF
   In the case of queries Q8 and Q9 SpecRepr exhibits a much         metadata management. RDFSuite addresses a notable need
better performance than GenRepr. GenRepr reaches its lim-            for RDF processing in Web-based applications (such as Web
its when table Triples needs to be self-joined (here Post-           portals) that aim to provide a rich information content made
greSQL uses a hash join algorithm), whereas in SpecRepr,             up of large numbers of heterogeneous resource descriptions.
a join between two small tables suÆces to be evaluated as            It comprises eÆcient mechanisms for parsing and validating
expected. Speci cally, SpecRepr outperforms GenRepr by a             RDF descriptions, loading into an ORDBMS as well as query
factor ranging from 1753 up to 95538. Note that GenRep               processing and optimization in RQL. We also presented a
will su er similar performance limitations in the evaluation         formal data model for RDF metadata and de ned a set of
of queries involving complex path expressions (e.g., in more         constraints that enforce consistency of RDF schemas, thus
sophisticated schemas than in the case of the ODP cata-              enabling the incremental validation and loading of volumi-
log) which will essentially result in a number of self-joins of      nous description bases. We argue that, given the immense
table Triples. Query Q8 has been tested for value ranges             volumes of information content in Web Portals, this is a vi-
returning 1, 10 and 40 resources respectively. In SpecRepr,          able approach to providing persistent storage for Web meta-
                                                                     data. By the same token, eÆcient access to information in
                            Query                Generic                          Speci c
                                        Case 1 Case 2       Case 3        Case 1 Case 2 Case 3
                             Q1         0.0015                           0.0012
                             Q2         0.0017 0.0028         0.02       0.0012 0.0022 0.0124
                             Q3         0.0460 0.082        344.91       0.0463 0.0612 341.98
                             Q4          0.033 0.0415       0.0662       0.0333 0.0415 0.0662
                             Q5         0.0043 0.008          0.04       0.0015 0.0028 0.027
                             Q6         0.0573 0.315        627.43       0.0508 0.1118 482.45
                             Q7         0.0034 0.0034       0.0034       0.0016 0.0016 0.00174
                             Q8         124.20 365.73       675.42       0.0013 0.0069 0.0466
                             Q9         110.58 117.68        185.7        0.031 0.0338 0.1059
                             Q10        0.0072 0.0072       0.0072       0.0071 0.0071 0.0076
                             Q11        0.0035 0.0043       0.0056       0.0013 0.0015 0.0015
                                Table 4: Execution Time of RDF Benchmark Queries
such Portal applications is only feasible using a declarative        [10] S. Cluet and G. Moerkotte. Nested Queries in Object
language providing the ability to query schema and data                   Bases. In DBPL'93, pages 226{242, 1993.
and to exploit schema organization for the purpose of op-            [11] S. Decker, D. Brickley, J. Saarela, and J. Angele. A
timization. We also reported on the results of preliminary                query and inference service for rdf. In W3C QL
tests conducted for assessing the performance of the loading              Workshop, 1998.
component of RDFSuite. These results illustrate that the             [12] C. Finkelstein and P. Aiken. Building Corporate
approach followed is not only feasible, but also promising for            Portals using XML. McGraw-Hill, 1999.
yielding considerable performance gains in query processing,         [13] D. Florescu and D. Kossmann. A performance
as compared to monolithic approaches.                                     evaluation of alternative mapping schemes for storing
   Current research and development e orts focus on study-                xml data in a relational database. Technical Report
ing the transactional aspects of loading RDF descriptions in              3680, INRIA Rocquencourt, France, 1999.
an ORDBMS, as well as, the problem of updating or revising           [14] ICS-FORTH. The ICS-FORTH RDFSuite web site.
description bases. A detailed analysis of query optimiza-                 http://139.91.183.30:9090/RDF, 2001.
tion using alternative representation schemes is underway.           [15] ISO. Information Processing-Text and OÆce Systems-
Furthermore, appropriate index structures for reducing the                Standard Generalized Markup Language (SGML).
cost of recursive querying of deep hierarchies need to be de-             ISO 8879, 1986.
vised as well. Speci cally, an implementation of hierarchy
linearization is underway, exploring alternative node encod-         [16] G. Karvounarakis. A Declarative RDF Metadata
ings. Last, but not least, we intend to extend our formal                 Query Language for Community Web Portals.
data model to capture higher-order aspects such as state-                 Master's thesis, University of Crete, 2000.
ment rei cation.                                                     [17] O. Lassila and R. Swick. Resource Description
                                                                          Framework (RDF) Model and Syntax Speci cation,
8.   REFERENCES                                                           W3C Recommendation. Technical report, , 1999.
 [1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the             [18] J. Liljegren. Description of an rdf database
     Web: From Relations to Semistructured Data and                       implementation. Available at
     XML. Morgan Kaufmann, 1999.                                          WWW-DB.stanford.edu/~melnik/rdf/db-jonas.html.
 [2] S. Abiteboul, H. Kaplan, and T. Milo. Compact                   [19] M. Marchiori and J. Saarela. Query + metadata +
     labeling schemes for ancestor queries. In 12th                       logic = metalog. In W3C QL Workshop, 1998.
     Symposium on Discrete Algorithms, 2001.                         [20] S. Melnik. Storing rdf in a relational database.
 [3] Serge Abiteboul, Richard Hull, and Victor Vianu.                     Available at
     Foundations of Databases. Addison-Wesley, 1995.                      http://WWW-DB.stanford.edu/~melnik/rdf/db.html.
 [4] S. Alexaki. Storage of RDF Metadata for Community               [21] G. Qadah, L. Henschen, and J. Kim. EÆcient
     Web Portals. Master's thesis, Univ. of Crete, 2000.                  Algorithms for the Instantiated Transitive Closure
 [5] G. Beged-Dov et.al. RSS 1.0 Speci cation Protocol.                   Queries. IEEE Transactions on Software Engineering,
     Draft, August 2000.                                                  17(3):296{309, 1991.
 [6] D. Brickley and R.V. Guha. Resource Description                 [22] Some proposed RDF APIs.
     Framework (RDF) Schema Speci cation 1.0, W3C                         GINF, RADIX, Mozzila, Jena, Redland:
     Candidate Recommendation. Technical report, , 2000.                  www.w3.org/RDF/Interest.
 [7] D. Brickley and L. Miller. Rdf, sql and the semantic            [23] F. Tian, D. DeWitt, J. Chen, and C. Zhang. The
     web - a case study. Available at                                     Design and Performance Evaluation of Alternative
     ilrt.org/discovery/2000/10/swsql/.                                   XML Storage Strategies. Technical report, Univ. of
 [8] L. Cardelli. A semantics of multiple inheritance.                    Wisconsin, 2000.
     Information and Computation, 76(2/3):138{164, 1988.             [24] K. Tolle. ICS-Validating RDF Parser: A Tool for
 [9] V. Christophides, S. Cluet, and G. Moerkotte.                        Parsing and Validating RDF Metadata and Schemas.
     Evaluating Queries with Generalized Path                             Master's thesis, University of Hannover, 2000.
     Expressions. In Proc. of ACM SIGMOD, pages                      [25] S. Weibel, J. Miller, and R. Daniel. Dublin Core. In
     413{422, 1996.                                                       OCLC/NCSA metadata workshop report, 1995.
[26] Web service description language www106.ibm.com.
     /developerworks/library/ws-rdf, 2000.