The ICS-FORTH RDFSuite: Managing Voluminous RDF Description Bases Sofia Alexaki Karsten Tolle Vassilis Christophides Gregory Karvounarakis Dimitris Plexousakis Institute of Computer Science, FORTH, Johann Wolfgang Goethe-University, Vassilika Vouton, P.O.Box 1385 Robert-Mayer-Str. 11-15 P.O.Box 11 19 32, GR 711 10, Heraklion, Greece D-60054 Frankfurt/Main, Germany [alexaki, christop, gregkar, dp] tolle@dbis.informatik.uni-frankfurt.de @ics.forth.gr ABSTRACT 1. INTRODUCTION Metadata are widely used in order to fully exploit informa- Metadata are widely used in order to fully exploit infor- tion resources available on corporate intranets or the Inter- mation resources (e.g., sites, documents, data, images, etc.) net. The Resource Description Framework (RDF) aims at available on corporate intranets or the Internet. The Re- facilitating the creation and exchange of metadata as any source Description Framework (RDF) [17] aims at facilitat- other Web data. The growing number of available infor- ing the creation and exchange of metadata as any other Web mation resources and the proliferation of description ser- data. More precisely, RDF provides i) a Standard Repre- vices in various user communities, lead nowadays to large sentation Language for metadata based on directed labeled volumes of RDF metadata. Managing such RDF resource graphs in which nodes are called resources (or literals) and descriptions and schemas with existing low-level APIs and edges are called properties and ii) an XML syntax for ex- le-based implementations does not ensure fast deployment pressing metadata in a form that is both humanly read- and easy maintenance of real-scale RDF applications. In able and machine understandable. Due to its exible model, this paper, we advocate the use of database technology to RDF is playing a central role in the next evolution step of support declarative access, as well as, logical and physical the Web - termed the Semantic Web. Indeed, RDF/S enable independence for voluminous RDF description bases. the provision of various kinds of metadata (for administra- We present RDFSuite, a suite of tools for RDF validation, tion, recommendation, content rating, site maps, push chan- storage and querying. Speci cally, we introduce a formal nels, etc.) about resources of quite diverse nature (ranging data model for RDF description bases created using mul- from PDF or Word documents, e-mail or audio/video les to tiple schemas. Next, we present the design of a persistent HTML pages or XML data) to di erent target communities RDF Store (RSSDB) for loading resource descriptions in an (corporate, inter-enterprise, e-marketplace, etc.). The most ORDBMS by exploring the available RDF schema knowl- distinctive RDF feature is its ability to suport superimposed edge. Our approach preserves the exibility of RDF in re- descriptions for the same Web resources, enabling content ning schemas and/or enriching descriptions at any time, syndication - and hence, automated processing - in a variety whilst it outperforms, both in storage volumes and query of application contexts. To interpret resource descriptions execution time, other approaches using a monolithic table within or across communities, RDF allows for the de nition to represent resource descriptions and schemas under the of schemas [6] i.e., vocabularies of labels for graph nodes form of triples. Last, we brie y present RQL, a declarative (i.e., classes) and edges (i.e., properties). Furthermore, these language for querying both RDF descriptions and schemas, vocabularies can be easily extended to meet the description and sketch query evaluation on top of RSSDB. needs of speci c (sub-)communities while preserving the au- tonomy of description services for each (sub-)community. Many content providers (e.g., ABCNews, CNN, Time Inc.)  This work was partially supported by the EC project C- and Web Portals (e.g., Open Directory, CNET, XMLTree1 ) or browsers (e.g., Netscape 6.0, W3C Amaya) already adopt Web (IST-1999-13479) and Mesmuses (IST-2000-26074). RDF, as well as, emerging application standards for Web data and services syndication (e.g., the RDF Site Summary [5], the Dublin Core [25], or the Web Service Description Language [26]). In a nutshell, the growing number of avail- Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are able information resources and the proliferation of descrip- not made or distributed for profit or commercial advantage and that copies tion services in various user communities, lead nowadays to bear this notice and the full citation on the first page. To copy otherwise, to large volumes of RDF metadata (e.g., the Open Directory republish, to post on servers or to redistribute to lists, requires prior specific permission by the authors. 1 Semantic Web Workshop 2001 Hongkong, China www.dmoz.org, home.cnet.com, www.xmltree.com respec- Copyright by the authors. tively. DBMS VRP Class Property RQL id c_name id domainid p_name rangeid Validator 1 ns1#Ext.Resource 4 1 ns1#title 12 Graph 2 ns2#Hotel 5 11 ns3#title 12 Evaluation Constructor Engine Loader 3 ns2#Hotel Direct Parser API Parser API VRP Model SubClass SubProperty subid superid subid superid 3 2 4 5 Optimizer ns1#Ext.Resource ns3#title Module URI source target r1 r1 SunScale Figure 1: Overview of the ICS-FORTH RDFSuite Portal of Netscape exports in RDF around 170M of Subject 2. THE OPEN DIRECTORY CATALOG Topics and 700M of indexed URIs). It becomes evident that Portals are nowadays becoming increasingly popular by managing such voluminous RDF resource descriptions and enabling the development and maintenance of speci c com- schemas with existing low-level APIs and le-based imple- munities of interest (e.g., enterprise, professional, trading) mentations [22] does not ensure fast deployment and easy [12] on corporate intranets or the Internet. Such Commu- maintenance of real-scale RDF applications. Still, we want nity Web Portals essentially provide the means to select, to bene t from three decades of research in database tech- classify and access, in a semantically meaningful way, var- nology to support declarative access and logical and phys- ious information resources. Portals may be distinguished ical independence for RDF description bases. In this way, according to the breadth of the target community (e.g., tar- RDF applications have to specify in a high-level language geting horizontal or vertical markets), and the complexity only which resources need to be accessed, leaving the task of information resources (e.g., sites, documents, data). In of determining how to eÆciently store or access them to the all cases, the key Portal component is the Knowledge Cat- underlying RDF database engine. alog holding descriptive information, i.e., metadata, about In this paper we present ICS-FORTH RDFSuite [14], a suite the community resources. of tools for RDF validation (Validating RDF Parser-VRP), For instance, the catalogs of Internet (or horizontal) Por- storage (RDF Schema Speci c DataBase-RSSDB), and query- tals, like Yahoo! or Open Directory (ODP), use huge hi- ing (RDF Query Language-RQL) using an object-relational erarchies of topics to semantically classify Web resources. DBMS (see Figure 1). To illustrate the functionality and the Speci cally, Table 1 lists statistics of 15 (out of 16) ODP performance of RDFSuite we use as testbed the catalog of hierarchies (version of 16-01-2001), comprising 252840 top- the Open Directory Portal exported in RDF (see Section 2). ics under which 1770781 sites are classi ed. We can observe The paper makes the following contributions: that ODP hierarchies are not deep (the maximum depth is  Section 3 introduces a formal data model for descrip- 13) while the number of direct subclasses of a class varies tion bases created according to the RDF Model & Syn- greatly (the maximum number of subclasses is 314 but the tax and Schema speci cations [17, 6] (without rei ca- average is around 4.02). Additionally, various administra- tion). The main challenge for this model is the rep- tive metadata (e.g., titles, mime-types, modi cation dates) resentation of several interconnected RDF schemas, as of resources are usually created using an OCLC Dublin-Core well as the introduction of a graph instantiation mech- like schema [25]. Users can either navigate through the top- anism permitting multiple classi cation of resources. ics of the catalog to locate resources of interest, or issue a  Section 4 presents our persistent RDF Store (RSSDB) full-text query on topic names and the URIs or the titles of for loading resource descriptions in an object-relational described resources. In Section 5 we will illustrate how our DBMS (ORDBMS) by exploiting the available RDF query language can be used to provide declarative access to schema knowledge. Our approach preserves the ex- the catalog content. In the sequel, we illustrate how Por- ibility of RDF in re ning schemas and/or enriching tal Catalogs can be easily and e ectivelly represented using descriptions at any time whilst it can be customized RDF/S. in several ways according to the speci cities of both The middle part of Figure 2 depicts the two schemas em- the manipulated desription bases and the underlying ployed by the Open Directory: the rst is intended to cap- RDF application scenario. ture the semantics of web resources, while the second is in-  Section 5 sketches the functionality of a declarative tended for Portal administrators. The scope of the declara- language, called RQL, for querying both RDF de- tions is determined by the corresponding namespace de ni- scriptions and related schemas. The implementation tion of each schema, e.g., ns1 (www.dmoz.org/topics.rdfs) of RQL on top of RSSDB is described with emphasis and ns2 (www.oclc.com/dublincore.rdfs). For simplicity, on how the RQL interpreter pushes query evaluation we will hereforth omit the namespaces as well as the paths as much as possible to the underlying ORDBMS. from the root of the topic hierarchies (since topics have  Section 6 illustrates the performance of RSSDB for non-unique names) pre xing class and property names. Fig- storing and querying voluminous RDF descriptions, ure 2 depicts 2 (out of the 16) hierarchies in the ODP topics such as the ODP catalog. In particular, RSSDB out- schema, namely, Regional and Recreation, whose topics are performs both in storage volumes and query execution represented as RDF classes (see the rdf and rdfs default time other approaches using a monolithic table [20, namespaces in the upper part of Figure 2). Various semantic 18, 7] to represent resource descriptions and schemas relationships exist between these classes either within a topic under the form of triples. hierarchy (e.g., subtopics) or across hierarchies (e.g., related Finally, Section 7 presents our conclusions and draws direc- topics). The former, is represented using the RDF subclass tions for further research. rdf:www.w3.org/1999/02/22-rdf-syntax-ns rdfs:www.w3.org/TR/2000/CR-rdf-schema-20000327 RDF/S Class property ns1:www.dmoz.org/topics.rdfs ns2:www.oclc.org/dublincore.rdfs Regional Recreation Portal Schemas Lodging String Paris String Travel description Vacation Integer -Rentals title Hotel related file_size Ext.Resource Date Hotel Directories Ile-de-France last_modified Disneyland title Resource descriptions &r1 &r4 description Official site of &r2 &r3 Disneyland Paris title title title description Site officiel de Disneyland Paris SunScale Pulitzer Opera Bedford typeOf (instance) r1: http://www.sunscale.com/france/paris r3: http://www.hotel-bedford.fr/index/hotel.htm subClassOf (isA) r2: http://www.hotelpulitzer.com r4: http://www.disneylandparis.com Figure 2: Modeling in RDF the Catalog of the Open Directory Portal relationship (e.g., Travel is a, not necessarily direct, sub- of XML trees (i.e., there is no root XML node). class of Paris) and the latter using an RDF property named Hence, RDF properties are unordered (e.g., the property related (e.g., between the classes Ile-de-France and Hotel). title can appear before or after the property description), op- Finally, the ODP administrative metadata schema captures tional (e.g., the property le size is not used), can be multi- various descriptive elements of Dublin-Core [25] (with the valued (e.g., we have two description properties), and they execption of the Subject element), as literal RDF properties can be inherited (e.g., to subclasses of ExtResource), while de ned on class ExtResource. Note that properties serve resources can be multiply classi ed (e.g., &r4). These mod- to represent attributes of resources as well as relationships eling primitives provide all the exibility we need to rep- between resources. Properties can also be organized in tax- resent superimposed descriptions of community resources, onomies in a manner similar to the organization of classes. while preserving a conceptually uni ed view of the descrip- Using these schemas, we can see in the lower part of Fig- tion base through one or the union of all related schemas. It ure 2, the descriptions created for four sites (resources &r1- becomes clear that the RDF data model di ers substantially &r4). For instance, &r4 is a resource classi ed under both from standard object or relational models [3] or the recently the classes Paris and ExtResource and has three associated proposed models for semistructured or XML databases [1]: literal properties: a title property with value \Disneyland"  Classes do not de ne object or relation types: an in- and two description properties with values \OÆcial site of stance of a class is just a resource URI; Disneyland Paris" and \Site oÆciel de Disneyland Paris"  Resources may belong to di erent classes not necessar- respectively. In the RDF jargon, a speci c resource (i.e., ily pairwise related by specialization: the instances of node) together with a named property (i.e., edge) and its a class may have quite di erent properties associated value (i.e., node) form a statement. Each statement is rep- with them, while there is no other class on which the resented by a triple having a subject (e.g., &r4), a predicate union of these properties is de ned; (e.g., title), and an object (e.g., \Disneyland "). The sub-  Properties may also be re ned by respecting a minimal set of constraints (domain and range compatibility). ject and object should be of a class compatible (under class Note also that due to multiple classi cation, resources specialization) with the domain and range of the predicate may have quite irregular descriptions modeled only through (e.g., &r4 is of type ExtResource). In the rest of the pa- an exception mechanism a la SGML [15]. Last but not per, the term description base will be used to denote a set of least, semistructured or XML models can't distinguish be- RDF statements. Although not illustrated in Figure 2, RDF tween entity (e.g., ExtResource) and property labels (e.g., also supports structured values called containers (e.g., Bags) title). Therefore existing proposals cannot be used, as for grouping statements, as well as, higher-order statements such, to manipulate RDF description bases. In the sequel, (i.e., rei cation2 ). Finally, both RDF graph schemas and we will present a logical data model allowing users to issue descriptions can be serialized in XML using various forests high-level queries on RDF/S graphs, while several physical representations can be used by the underlying DBMS to im- 2 We will not treat rei cation in this paper. prove storage volumes and optimize queries on these graphs. Hierarchy Max.Depth Avg.Depth Max.Subclass Avg.Subclass #Topics #Resources news.rdf 7 5 51 5.625 721 47735 kat.rdf 7 4.8 46 6 761 7730 home.rdf 8 5 53 5 1722 26688 shopping.rdf 9 4.8 61 4.5 3349 88821 health.rdf 9 5.3 52 5.3 3202 45519 games.rdf 10 5.4 125 5.3 4857 36181 computers.rdf 10 5.3 147 5 6010 91597 recreation.rdf 11 5.87 85 5.47 7269 93929 business.rdf 11 5.9 52 4.69 6833 161877 reference.rdf 13 7.13 154 3.92 6483 75105 sports.rdf 9 5.56 178 7.02 10625 66280 science.rdf 10 6.53 314 5.05 8667 65939 society.rdf 12 6.3 157 6.27 16250 161433 arts.rdf 11 5.5 267 6.5 25314 214795 regional.rdf 13 7.7 254 3.37 150762 587152 Total 13 6.86 314 4.02 252825 1770781 Table 1: ODP Topic Hierarchy Statistics 3. A FORMAL DATA MODEL FOR RDF Each RDF schema uses a nite set of class names C  C Since the notion of resource is somehow overloaded in the and property names P  P whose scope is determined by RDF M&S and RDFS speci cations [17, 6], we distinguish one or more namespaces. Property types are then de ned RDF resources w.r.t. their nature into individual entities using class names or literal types so that: for each p 2 P , (i.e., nodes) and properties of entity resources (i.e., edges). domain(p) 2 C and range(p) 2 C [ L. We denote by H = (N; ) a hierarchy of class and property names, where N =  Nodes : a set of individual resources, representing C [ P . H is well-formed if  is a smallest partial ordering abstract or concrete entities of independent existence, such that: if p1 ; p2 2 P and p1  p2 , then domain(p1 )  e.g., class ExtResource de ned in an RDF Schema or domain(p2 ) and range(p1 )  range(p2 )3 . a speci c web resource e.g., &r4 (see Figure 2).  Edges : a set of properties, representing both at- 3.1 Additional RDF Constraints tributes of and binary relationships between nodes, ei- Three remarks are noteworthy. First, unlike the current ther abstract or concrete, e.g., the property title de- RDF M&S and RDFS speci cations [17, 6] the domain and ned in our example schema and used by the speci c range of properties should always be de ned. This additional resource &r4. constraint is required mainly because the sets of RDF/S classes C and literal types L are disjoint. Then, a prop- RDF Resources are also distinguished according to their erty with unde ned range may take as values both a class concreteness into tokens and classes. instance (i.e., a resource) or a literal. Since, the union of rdfs:Class and rdfs:Literal is meaningless in RDF/S,  Tokens : a set of concrete resources, either objects, this freedom leads to semantic inconsistencies. Additional or literals (e.g., &r4, \SunScale"). semantic problems arise in the case of specialization of prop-  Classes : a set of abstract entity or property re- erties with unde ned domain and range. More precisely, to sources, in the sense that they collectively refer to a set preserve the set inclusion requirement of specialized proper- of objects similar in some respect (e.g., ExtResource). ties (binary predicates) their domain and range should also specialize the domain and range (unary predicates) of their To label abstract (i.e., schema) or concrete (i.e., token) superproperties. This is something which, in the case of mul- RDF nodes and edges, we assume the existence of the fol- tiple specialization of properties (see Figure 3 -a-), cannot lowing countably in nite and pairwise disjoint sets of sym- always be ensured because RDF/S do not support intersec- bols: C of Class names, P of Property names, U of Resource tion of classes (in a Description Logics style). The second URIs as well as a set L of Literal type names such as string , constraint imposes that both domain and range declarations integer, date, etc. Each literal type t 2 L has an S associated of a property be unique. This is foremost required because domain, denoted dom(t) and dom(L) denotes t2L dom(t) RDF/S do not support union of classes, which can be consid- (i.e., the rdfs:Literal declaration). Without loss of gen- ered as the domain of properties. Furthermore, it is not pos- erality, we assume that the sets C and P are extended to sible to infer domains in case of specialization of properties include as elements the class name Class and the prop- with multiple domains (see Figure 3 -b-). In this context, erty name Property respectively. The former captures the the rdfs:Class de nition should be used, in order to de ne root of a class hierarchy (i.e., the rdfs:Class declaration) a property with domain or range all the token resources. while the latter captures the root of a property hierarchy Conforming to RDF/S, rdf:Resource should be used as (i.e., the rdf:Property declaration) de ned in RDF/S [17, the domain or range, when we need to de ne properties 6]. The set P also contains integer labels (f1; 2; : : : g) used viewing schema classes also as tokens (e.g., rdfs:seeAlso, as property names (i.e., the rdfs:ContainerMembership- rdfs:isDefinedBy, rdfs:comment, rdfs:label). Properties declaration) by the members of container values (i.e., the rdfs:Bag, rdfs:Sequence declarations). 3 The symbol  extends  with equality. nite set of URIs (of type U ) to each class name6 is cap- tured by a population function  : C ! 2U . The set of all values forseen by our model is denoted by V . De nition 1. The interpretation function [ :] is de ned as follows:  for literal types: [ L] = dom(L); Figure 3: Inconsistencies in Property Specialization  for the Bag type, [ f g] = fv1 ; v2 ; : : : vn g where v1 ; v2 ; : : : vn 2 V are values of type  ; Last, we need to consider an additional constraint of a syntactic nature, imposing that class and property de ni-  for the Seq type, [ [ ]]] = [v1 ; v2 ; : : : vn ] where v1 ; v2 ; : : : vn tions be complete. This means that, on the one hand, super- 2 V are values of type  ; class and superproperty declarations should accompany the  for the Alt type [ (1 + 2 + : : : + n )]] = vi where vi 2 V class and property de nitions respectively and, on the other 1 < i < n is a value of type i ; hand, the domain and range of a property should be given along with the de nition of the property. In this manner,  for each class c 2 C , [ c] = (c) [ Sc c [ c0 ] ; 0 the extension of existing RDF hierarchies of names by re n-  for each property p 2SP , [ p] 0= f[v1 ; v2 ] j v1 2 [ domain(p)]]; ing their classes and properties in a new namespace is still v2 2 [ range(p)]]g [ p p [ p ] . 0 permitted, while at the same time semantic inconsistencies As usual, the interpretation of classes and properties in that may arise due to arbitrary unions of used hierarchies our model is set-based. This implies that a resource URI are avoided. Such inconsistencies include the introduction of appears only once in the extent of a class even when it is multiple ranges of properties or the introduction of cycles in classi ed several times under its subclasses (i.e., it belongs to class or property hierarchies. Unlike the current RDF M&S the direct extent of the most speci c class). The notation and RDFS speci cations [17, 6] this constraint ensures that ^[[:] is used in Table 2 to distinguish between strict and the union of two valid RDF hierarchies of names is always extended interpretation of classes and properties. valid. The imposed constraints (C3-C5) are summarized in 3.3 RDF Description Bases and Schemas Table 2 using the notation introduced in this section. De nition 2. An RDF schema is a 5-tuple RS = (VS ; 3.2 RDF Typing System ES ; ; ; H ), where: VS is the set of nodes and ES is the set In this context, RDF data can be atomic values (e.g., of edges, H is a well-formed hierarchy of class and property strings), resource URIs, and container values holding query  names H = (N; ),  is a labeling function  : VS ES [ ! results, namely rdf:Bag (i.e., multi-sets) and rdf:Sequence [ N T , and is an incidence function : ES VS VS . !  (i.e., lists). The main types foreseen by our model are: The incidence function captures the rdfs:domain and rdfs: range declarations of properties7 . Note that the incidence  = Lj U j f g j [ ] j (1 :  + 2 :  + : : : + n :  ) and labeling functions are total in VS [ ES and ES respec- tively. This does not exclude the case of schema nodes which where L is a literal type in L, f:g is the Bag type, [:] is the are not connected through an edge. Additionally, we impose Seq uence type, (:) is the Alternative type, and U is the type a unique name assumption on the labels of RS nodes and for resource U RI s4 . Alternatives capture the semantics of edges. union (or variant) types [8], and they are also ordered (i.e., De nition 3. An RDF description base, instance of a integer labels play the role of union member markers). Since schema RS , is a 5-tuple RD = (VD ; ED ; ; ; ), where: VD there exists a prede ned ordering of labels for sequences is a set of nodes and ED is a set of edges, is the incidence and alternatives, labels can be omitted (for bags, labels are function : ED ! VD  VD ,  is a value function  : VD ! meaningless). Furthermore, all types are mutually exclusive V , and  is a labeling function  : VD [ ED ! 2N [T which (e.g., a literal value cannot also be a bag) and no subtyping satis es the following: relation is de ned in RDF/S (e.g., between bags of di erent  for each node v in VD ,  returns a set of names n 2 types). The set of all type names is denoted by T . C [ T where the value of v belongs to the interpretation The proposed type system o ers all the arsenal we need of each n:  (v ) 2 [ n] ; to capture containers with both homogeneous and heteroge-  for each edge  in ED going from node v to node v0 ,  neous member types, as well as, to interpret RDF schema returns a property name p 2 P . classes and properties. For instance, unnamed ordered tu- Note that the labeling function captures the rdf:type dec- ples denoted by [v1 ; v2 ; : : : ; vn ] (where vi is of some type i ) laration that associates each RDF node with one or more can be de ned as heterogeneous sequences5 of type [(1 + class names (opposite to traditional object models) which 2 + : : : + n )]. Unlike traditional object data models, RDF classes are then represented as unary relations of type fU g may be de ned in several well-formed hierarchies of names. while properties as binary relations of type f[U ; U ]g (for Additionally, integer labels (f1; 2; : : : g) are used as property relationships) or f[U ; L ]g (for attributes). Furthermore, names by the members of RDF container values. The im- RDF containers can also be used to represent n-ary rela- posed constraints (C6-C7) are summarized in Table 2 using tions (e.g., as a bag of sequences). Finally, assignment of a the notation introduced in this section. 4 6 In Section 5, we will see that our query language treats Note that we consider here a non-disjoint object id assign- URIs, i.e., identi ers, as simple strings. ment to classes due to multiple classi cation. 5 Observe that, since tuples are ordered, for any two permu- 7 Constraint C4 of Table 2 ensures that rdfs:domain and tations i1 ; : : : ; in and j1 ; : : : ; jn of 1; : : : ; n, [i1 : v1 ; : : : ; in : rdfs:range are not any more relations as in the current vn ] is distinct from [j1 : v1 ; : : : ; jn : vn ]. RDF M&S and RDFS speci cations [17, 6]. Alphabets: C1 Class, Property and Type names are mutually exclusive C\P \T =; C2 Literal, Resources and Container values are mutually exclusive L \ U \ V n (L [ U ) = ; Schema: C3 8c; c0 ; c00 2 C : C3.1  Class is the root of the class hierarchy: c  Class C3.2  subClassOf relation is transitive: c  c0 ; c0  c00 ) c  c00 C3.3  subClassOf relation is antisymmetric: c  c0 ) c = 6 c0 C4 Domain and range of properties should be de ned and they should be unique: 8p 2 P; 9!c1 2 C (c1 = domain(p)) ^ 9!c2 2 C [ TL (c2 = range(p)) C5 8p; p0 ; p00 2 P : C5.1  Property is the root of the property hierarchy: p  P roperty C5.2  subPropertyOf relation is transitive: p  p0 ; p0  p00 ) p  p00 C5.3  subPropertyOf relation is antisymmetric: p  p0 ) p = 6 p00 0 C5.4  if p is subPropertyOf p then the domain (range) of p is a subset of the domain (range) of p0 : p  p0 ) domain(p)  domain(p0 ) ^ range(p)  range(p0 ) Data: C6 8v 2 V : C6.1  if v is a URI then it is an instance of one or more classes not related by : v 2 U ) (v )  C 8c; c0 2 C ; v 2 ^[[c] ; c  c0 ) v 2= ^[[c0 ] C6.2  if v is a literal value then it is an instance of one (and only one) Literal type: v 2 L ) (v )  TL and (v ) is a singleton C6.3  if v is a container value then it is an instance of one (and only one) Container type: v 2 V n (L [ U ) ) (v )  TBjS jA and (v ) is a singleton C7 8p 2 P ; [v1 ; v2 ] 2 [ p] : C7.1  if p belongs to the set f1; 2; : : : g then v1 is an instance of either rdf:Bag or rdf:Seq or rdf:Alt: p 2 f1; 2; : : : g ) (v1 )  TBjS jA C7.2  if p doesn't belong to the set f1; 2; : : : g then v1 (v2 ) belongs to the domain (range) of p: p 2 P n f1; 2; : : : g ) 9n 2 (v1 ); n  domain(p) ^ 9m 2 (v2 ); m  range(p) C7.3  Additionally, the direct extends of properties must be unique p 2 P n f1; 2; : : : g ) (8p0 2 P n f1; 2; : : : g; p  p0 ) [v1 ; v2 ] 2 = ^[[p0 ] ) Rei cation: C8 A rei ed statement should have exactly one rdf:predicate, rdf:subject and rdf:object property Table 2: Formal De nition of Imposed RDF Constraints 3.4 The Validating RDF Parser (VRP) the constraints speci ed in the RDF M&S and RDFS speci- To conclude this section, we brie y describe the Validat- cations [17, 6] as well as the additional constraints we intro- ing RDF Parser (VRP) we have implemented to analyze, duced previously (see Table 2). This permits the validation validate and process RDF descriptions. Unlike other RDF of both the RDF descriptions against one or more RDFS parsers (e.g., SiRPAC8 ), VRP (see Figure 4) is based on schemas, and the schemas themselves. The VRP validation standard compiler generator tools for Java, namely CUP/- module relies on (a) a complete and sound algorithm [24] to JFlex (similar to YACC/LEX). As a result, users do not translate descriptions from an RDF/XML form (using both need to install additional programs (e.g., XML Parsers) in the Basic and Compact serialization syntax) into the RDF order to run VRP. The VRP BNF grammar can be eas- triple-based model (b) an internal object representation of ily extended or updated in case of changes in the RDF/S this model in Java, allowing to separate RDF schema from speci cations. VRP is a 100% Java(tm) development under- data information. As we will see in the sequel, this approach standing embedded RDF in HTML or XML and providing enables a ne-grained processing of RDF statements w.r.t. full Unicode support. The stream-based parsing support of their type, which is crucial in order to implement an incre- JFlex and the quick LALR grammar parsing of CUP ensure mental loading of RDF descriptions and schemas. good performance when processing large volumes of RDF descriptions. Currently VRP is a command line tool with 4. THE RDF SCHEMA SPECIFIC DATABASE various options to generate a textual representation of the This section describes the persistent RDF store (RSSDB) internal model (either graph or triple based). for loading resource descriptions in an object-relational DBMS The most distinctive feature of VRP is its ability to verify (ORDBMS). We begin by presenting our schema generation strategy in comparison to monolithic approaches [20, 18, 7] 8 www.w3.org/RDF/Implementations/SiRPAC representing resource descriptions and schemas as triples. Figure 4: The Validating RDF Parser (VRP) The core RDF/S model is represented by four tables (see Figure 6-A), namely, Class, Property, SubClass and Sub- Property which capture the class and property hierarchies de ned in an RDF schema. To save space when storing class and properties names, we also consider a table NameSpace Figure 5: RDF Schema Speci c DataBase & Loader keeping the namespaces of the RDF Schemas stored in the ORDBMS. Furthermore, a table Type is used to hold the with URI object values). Indices (i.e., B-trees) are con- names of RDF/S built-in types (e.g., rdf:Property, rdfs:- structed for all table attributes. Class), as well as, those of Container (e.g., rdf:Bag, rdf:Seq, Compared to GenRepr, our SpecRepr is exible enough rdf:Alt) and Literal types (e.g., string, integer, date). to allow the customization of the physical representation of The main novelty of our representation is the separation RDF metadata in the underlying ORDBMS. This is impor- of the RDF schema from data information, as well, as the tant since no representation is good for all purposes and in distinction between unary and binary relations holding the most real-scale RDF applications variations of a basic rep- instances of classes and properties. More precisely, class ta- resentation are required to take into account the speci c bles store the URIs of resources while property tables store characteristics of employed schema classes and properties. the URIs of the source and target nodes of the property. Our main goal here is to reduce the total number of cre- Indices are constructed on the attributes URI, source and ated instance tables. This is justi ed by the fact that some target of the above tables in order to speed up joins and commercial ORDBMSs (and not PostgreSQL) permit only the selection of speci c tuples of the tables. Indices are also a limited number of tables. Furthermore, numerous tables constructed on the attributes lpart, nsid and id of the ta- (e.g., the ODP catalog implies the creation of 252840 tables, bles Class and Property and on the attribute subid of the i.e. one for each topic) have a signi cant overhead on the tables SubClass and SubProperty. response time of all queries (i.e., to nd and open a table, It should be stressed that the taxonomic relationships its attributes, etc.). One of the possible variations we have between schema labels is captured by the SubClass and experimented for the ODP catalog is the representation of SubProperty tables, while the corresponding instance tables all class instances by a unique table Instances (see dashed are also connected through the subtable relationship, sup- rectangular in Figure 6-A). This table has two attributes, ported today by all ORDBMSs9 . In other words, RSSDB namely uri and classid, for keeping the uri's of the re- relies on a schema speci c representation of resource de- sources and the id's of the classes in which resources belong. scriptions, called SpecRepr, similar to the attribute-based ap- The bene ts of this SpecRepr variation are illustrated in Sec- proach proposed for storing XML data [13, 23]. SpecRepr tion 6 given that most ODP classes (i.e. topics) have few or preserves the exibility of RDF in re ning schemas and/or no instances at all (more than 90% of the ODP topics con- enriching descriptions at any time, and as we will see in the tain less than 30 URIs). For other RDF schemas it could be sequel, it can be customized in several ways according to the also interesting to represent in a similar way all the instances speci cities of both the manipulated description bases and of properties, but in general real-scale RDF schemas have the RDF application scenario (i.e., querying functionality). more classes than properties. Another alternative to our ba- This is not the case of other proposals employing a mono- sic SpecRepr could be the representation of properties with lithic table [20, 18, 7] to represent RDF metadata under range a literal type, as attributes of the tables created for the the form of triples. These approaches provide a generic rep- domain of this property. Consequently, new attributes will resentation (i.e., for all RDF schemas), called GenRepr, of be added to the created class tables. The tables created for both RDF schemas and resource descriptions using two ta- properties with range a class will remain unchanged. The bles (see Figure 6-B), namely, Resources and Triples. The above representation is applicable to RDF schemas where former represents each resource (including schema classes attribute-properties are single-valued and they are not spe- and properties) by a unique id whereas the latter represents cialized. Finally, our SpecRepr opens the way for a more the statements made about the resources (including classes cleaver representation of class and property names using ap- and properties) under the form of predicate-subject-object propriate encoding systems that preserve the taxonomic re- triples (captured by predid, subid and objid respectively). lationships of schema labels and enable to optimize recursive Note that in the Triples table we distinguish between prop- traversals on subclass (or subproperty) hierarchies. erties representing attributes (i.e., objvalue with literal val- 4.1 The RDF Description Loader ues) and those relationships between resources (i.e., objid Figure 5 depicts the architecture of our system for loading 9 Note that the syntactic constraint (see Section 3) impos- RDF metadata in an ORDBMS, namely PostgreSQL10 . The ing complete class and property de nitions, ensures that loader has been implemented [4] in Java and communication the table hierarchy created in RSSDB can be only extended 10 through specialization www.postgresql.org Figure 6: Relational Representation of RDF Description Bases with PostgreSQL relies on the JDBC protocol. ed under the topic of interest. Note that, in order to locate The loader comprises two main modules. The rst mod- resources classi ed under the subtopics (e.g., Hotel Directo- ule checks the consistency of analyzed schemas descriptions ries) of a given topic, browsing should be continued by the in comparison with the stored information in the ORDBMS. users. RQL aims to simplify such tasks, by providing power- For example, in case that an analyzed property has already ful path expressions permitting smooth ltering/navigation been stored, it checks whether its domain and range are the on both Portal schemas and resource descriptions. Then the same as the ones stored in the ORDBMS. Another func- previous query can be expressed as follows: tionality of this module is the validation of RDF metadata Q: Find the resources under the hierarchy Regional, about based on stored RDF schemas instead of connecting to the hotels in Paris whose title matches "*Opera*". respective namespaces. Thus, we avoid analyzing and vali- select Z dating repeatedly the RDF schemas used in metadata and from (select $X reduce the required main memory, since only parts of RDF from Regionalf:$Xg schemata are fetched. Consequently, our system enables in- where $X like "*Hotel*" cemental loading of RDF descriptions and schemas, which and $X < Paris)fYg.fZgtitlefTg is crucial for handling large RDF schemas and even larger where T like "*Opera*" RDF description volumes created using multiple schemas. The schema path expression in the from clause of the The second module implements the loading of RDF de- nested query, states that we are interested in classes (vari- scriptions in the DBMS. To this end, a number of load- ables pre xed by $ like $X ) specializing the root Regional. ing methods have been implemented as member functions Then, the ltering condition in the where clause will re- of the related VRP internal classes. Speci cally, for every tain only the classes whose name matches "*Hotel*" and attribute of the classes of the VRP model, a method is cre- they are also subclasses of Paris (e.g., Hotel and Hotel Di- ated for storing the attribute of the created object in the rectories. Here, to get all relevant topics, the only required ORDBMS. For example, the method storetype is de ned schema knowledge is that the subtopics of Regional contain for the class RDF Resource, in order to store object type geographical information and a topic Paris is somewhere in information. The primitive methods of each class are incor- the hierarchy. Thanks to RQL typing, variable Y is of type porated in a storage method de ned in the respective class class name and ranges over the result of the nested query. invoked during the loading process. A two-phase algorithm The data path expression in the outer query iterates over the is used for loading the RDF descriptions. During the rst source (variable Z of type resource URI) and target values phase, RDF class and properties are stored to create the cor- (variable T of type resource URI) of the title property. The responding schema. During the second phase the database \." implies an implicit join condition between the extent of is populated with resource descriptions. each class valuating Y and the resources valuating Z . Fi- nally, the result of Q will be a bag of resources whose title value matches "*Opera*". Obviously, RQL schema queries 5. QUERYING RDF DESCRIPTION BASES are far more powerful than the corresponding topic queries As shown in Table 1, the catalog of Portals like Netscape of common portals, which allow only full-text queries on the Open Directory comprises huge hierarchies of classes and an names of topics. Furthermore, compositions of schema and even bigger number of resource descriptions. It becomes ev- data queries like Q, are not possible in current portals, since ident that browsing such large description bases is a quite one cannot specify that some query terms should match the cumbersome and time-consuming task. Consider, for in- topic names and other should be found in the descriptions stance, that we are looking for information about hotels of the resources. in Paris, under the topic Regional of ODP (see Figure 2). To make things more complex, relevant information in ODP allows users to navigate through the topic hierarchy; portals may also be found under di erent hierarchies, that as shown in Table 1, even if one knows the exact path to may be \connected" through related links. For example, as follow, this would require approximately 8 steps, in order to we can see in Figure 2, information about hotels in Paris reach the required topics (i.e., Hotels, Hotel Directories in may also be found in the Recreation hierarchy. Moreover, Figure 2). Then, in order to nd the URIs of the sites whose such links are not necessarily bi-directional; thus, a user title matches the string "*Opera*", users are forced to man- starting from the Regional hierarchy may never nd out ually browse the entire collection of resources directly classi- that similar information may be found under Recreation (a) Project (b) Project (c) Project Z Z Z Select Select Select T like ‘‘*Opera*’’ T like ‘‘*Opera*’’ T like ‘‘*Opera*’’ DJoin DJoin Join Y=C Map Semijoin Map Semijoin Select Join Y like ‘‘*Hotel*’’ Y:$X W=Z Y:$X W=Z and W=Z ^(Y)[W] title[Z,T] Select title[Z,T] Y < Paris Instance[C,W] title[Z,T] Project Project subclassOf(Regional)[Y] C=Y $X $X Instance[C,W] Select Select $X like ‘‘*Hotel*’’ $X like ‘‘*Hotel*’’ and and $X < Paris $X < Paris subclassOf(Regional)[$X] subclassOf(Regional)[$X] Figure 7: Example of an RQL Query Optimization e.g., Vacation-Rentals. For such cases, RQL path expres- left-hand side (i.e., nested query) to the right-hand side. sions allow us to navigate through schemas as for exam- Using the database representation, we can substitute the ple, f: $Z grelated:Regionalf: $X g. The above examples scan operation on class instances (^(Y )[W ]) with a selection illustrate a unique feature of RQL, namely that RQL pro- on the Instances relation, as shown in Figure 7 (b). Then, vides the ability to smoothly switch between schema and each of the subqueries in the shadow boxes can be pushed data querying while exploiting - in a transparent way - the down to PostgreSQL, leaving the evaluation of the Djoin to taxonomies of labels and multiple classi cation of resources the interpreter. However, transforming the semijoin on the (unlike logic-based RDF query languages, e.g., SiLRI [11], right into a join and pushing the selection criterium C = Metalog [19]). More examples on RQL can be found in [14]. Y up to the Djoin operation leads to the evaluation plan shown in Figure 7 (c). As one can see, the whole query 5.1 RQL Interpreter is expressible in SQL3 (subclassof and issubclassof are The RQL interpreter, implemented in C++, consists of procedural SQL functions) as follows: select Z.source (a) the parser, analyzing the syntax of queries; (b) the graph from subclassesof(Regional) Y, Intances I, title* Z constructor, capturing the semantics of queries in terms of where issubclassOf(X, Paris) and X like \*Hotel" and typing and interdependencies of involved expressions; and I.classid=Y and Z.source=I.uri and (c) the evaluation engine, accessing RDF descriptions from Z.target like \*Opera*" the underlying database [16]. Since our implementation re- Note that the * indicates an extended interpretation of lies on a full- edged ORDBMS like PostgreSQL, the goal tables, according to the subtable hierarchy. Thus, the whole of the RQL optimizer is to push as much as possible query query can be pushed down to PostgreSQL, and the Post- evaluation to the underlying SQL3 engine. Then pushing greSQL optimizer can perform on further (common) opti- selections or reordering joins to evaluate RQL path expres- mizations, as pushing down selections, join reordering etc. sions is left to PostgreSQL while the evaluation of RQL functions for traversing class and property hierarchies relies 6. PERFORMANCE TESTS on the existence of appropriate indices (see Section 6). The In order to evaluate the performance of the two represen- main diÆculty in translating an entire RQL algebraic ex- tations, we used as testbed the RDF dump of the Open Di- pression (expressed in an object algebra a la [10]) to a single rectory Catalog (version of 16-01-2001). Experiments have SQL3 query is due to the fact that most RQL path expres- been carried out on a Sun with two UltraSPARC-II 450MHz sions interleave schema with data querying [9]. This is the processors and 1 GB of main memory, using PostgreSQL case of the query Q presented previously. (Version 7.0.2) with Unicode con guration. During loading Figure 7 (a) illustrates the algebraic translation of Q. The and querying, we have used 1000 and 10000 bu ers (of size translation of the nested query - given in the left branch - 8KB) respectively. We have loaded 15 ODP hierarchies (see is rather straightforward: the class variable $X over all the Table 1) with a total number of 252825 topics (i.e., classes) subclasses of Regional and its values are ltered according contained in 51MB of RDF/XML les11 . For these hierar- to the conditions in the where clause. The operator Map on chies, we have also loaded the corresponding descriptions of top is a simple variable renaming for the iterator (Y ) de ned 1770781 resource URIs (672MB). over the nested query result. Then, the data path expression 6.1 Loading in the from clause is translated into a semi-join between the source-values of title and the proper instances of the class The leftmost graph of Figure 8 depicts the database size extents (W ) returned by the nested query, as shown in the required to load the ODP schema and resource descrip- right branch. The connection between the two expressions tions measured in terms of triples in the schema-speci c is captured by a Djoin operation (i.e., a dependent join in (SpecRepr) and in the generic (GenRepr) representation sche- which the evaluation of the right expression depends on the mes. Note that to each class de nition correspond two evaluation of the left one). Djoin corresponds to a nested 11 This is the volume of the pure ODP schema, produced loop evaluation with values of variable Y passed from the when properties attributed to the classes are removed. Figure 8: Statistics for Loading Schema and Resource Descriptions triples: one for the class itself and one for its unique super- ples), 202MB for Resources (2022869 tuples) and the total class (multiple class specialization is not used in the ODP size of indexes on these tables is 900MB. In SpecRepr, the Catalog). We can observe that in both representations the size is 32MB for Class (252825 tuples), 8KB for Property size of the DBMS scales linearly with the number of schema (5 tuples), 11MB for SubClass (252825 tuples) and 0MB for triples. The tests show that, in SpecRepr, each schema SubProperty, and the total size of indexes on these tables is triple requires on average 0.086KB versus 0.1582KB in the 44MB. The size of the Instances table is 150MB (1770781 GenRepr. This is mainly attributed to the space saved in tuples) whereas the size of indexes created on it is 140 MB. SpecRepr due to the Namespace table, as well as to the fact that for each class de nition in GenRepr three tuples are 6.2 Querying stored: one in table Resources and two tuples in table Table 3 describes the RDF query templates that we used Triples (i.e., 1770781 extra tuples). Although not illus- for our experiments, as well as their respective algebraic ex- trated here, the average time for loading a schema triple is pressions in the two representations (capital letters abreviate about 0.0021 sec and 0.0025 sec respectively in the two rep- the table names of Figure 6). This benchmark illustrates the resentations. The time required to store a schema triple is desired querying functionality for RDF description bases, larger in GenRepr because of extra (252825) tuples stored. namely: a) pure schema queries on class and property de - When indices are constructed, the average storage volumes nitions (e.g., Q1-Q4); b) queries on resource descriptions us- per schema triple become 0.1734KB (SpecRepr) and 0.3062KB ing available schema knowledge (e.g., Q5-Q9); and c) schema (GenRepr) and the average loading times become 0.0025 sec queries for speci c resource descriptions (e.g., Q10, Q11). As and 0.00317 sec respectively. The total validation time for a matter of fact, these query templates depict the core func- the ODP schema is 1539 sec. tionality of RQL presented in the previous section. The database size required for the resource descriptions To get signi cant experimental results, we carried out all is shown in the rightmost graph of Figure 8. As expected, benchmark queries several times: one initially to warm up the DBMS size in both representations scales linearly with the database bu ers and then nine times to get the average the number of data triples. The average space required to execution time of a query. Table 4 illustrates the obtained store a data triple is 0.123KB in both representations. This execution time (in sec) for both representations in up to should not appear as a surprise since the extra space re- three di erent result cases per query. The main observa- quired in SpecRepr for storing the URIs of resources in the tion is that SpecRepr outperforms GenRepr for all types of property tables compensates for the extra tuples stored for queries considered. The deviation in performance is more each resource description in GenRepr. Note that we could apparent in the cases where self-join computations on the obtain better storage volumes in the SpecRepr by encoding large Triples table are required. the resource URIs as integers, as we did in the GenRepr, GenRepr and SpecRepr exhibit comparable performance but this solution comes with extra loading and join costs in queries Q1, Q2, Q5, Q7, Q10 and Q11, with SpecRepr (between the class and property tables) for the retrieval of outperforming GenRepr by a factor of up to 3.73 approx- the URIs. The tests also show that the average time for imately. This is clearly illustrated in Q1, where one tu- loading a data triple is about 0.0033 sec and 0.0039 sec ple is selected from both table Triples (selectivity 1,7e- respectively in the two representations. When indices are 5%) and Property (selectivity 20%) using index and se- constructed, the average storage volumes per data triple be- quential scans respectively. As expected, in Q2, we can see come 0.2566KB (SpecRepr) and 0.2706KB (GenRepr) while that the time required to lter a table in both representa- the average loading time become 0.0043 sec and 0.00457 sec tions depends on the number of tuples in the query results: respectively. Despite the use of ids, the indexes in GenRepr we have experimented with classes having 1, 30, 314 sub- take up more space because of: a) the extra tuples stored classes which represent in GenRepr (SpecRepr) selectivities b) the index constructed on the predid attribute for which of 1.7e-5% (3.955e-4%), 5.14e-4% (1,19e-2%) and 5.38e-3% there is no corresponding index in SpecRepr. (0.124%) for table Triples (SubClass) in the three cases To summarize, after loading the entire ODP catalog, the respectively. The (semi-)joins involved in the evaluation of size of tables in GenRepr is 545MB for Triples (5835756 tu- queries Q5, Q7 and Q11 incur an additional cost for GenRepr, Query Description Algebraic Expression Algebraic Expression in GenRepr in SpecRepr Q1 Find the range (or domain) predid=9^subjid=propid (T ) id=propid (P ) of a property Q2 Find the direct subclasses predid=6^objid=clsid (T ) superid=clsid (SC ) of a class Q3 Find the transitive sub- repeat Wi (Wi 1 repeat Wi (Wi 1 classes of a class  > id=subjid (predid=6 (T ))) Wi 1  > id=superid SC ) Wi 1 until Wi = Wi 1 until Wi = Wi 1 Q4 Check if a class is a repeat Wi (Wi 1 repeat Wi (Wi 1 subclass of another class  > id=objid (predid=6 (T ))) Wi 1  > id=subid SC ) Wi 1 _ until Wi = Wi 1 clsid Wi 2 until Wi = Wi 1 clsid _ 2 Wi Q5 Find the direct extent of (predid=5^objid=clsid (T )) id=clsid (I ) a class (or property) >subjid=id R Q6 Find the transitive extent [clsid2Q3 ((predid=5^objid=clsid(T )) [clsid2Q3 (id=clsid (I )) of a class (or property) >subjid=id R) Q7 Find if a resource is (predid=5^objid=clsid (T )) URI =r^id=clsid (I ) an instance of a class 1subjid=id (URI =r (R)) Q8 Find the resources having (predid=propid^objvalue=val (T )) target=val (tpropid ) a property with a speci c 1subjid=id R (or range of) value(s) Q9 Find the instances of a class (predid=5^objid=clsid (T )) (id=clsid (I )) >source=URI that have a given property >subjid=subjid (predid=propid (T )) (tpropid ) >subjid=id (R) Q10 Find the properties of a (URI =r (R)) 1id=subjid [propid2P (source=r(tpropid )) resource and their values (predid6=5 (T )) 1predid=id (R) Q11 Find the classes under which (URI =r (R)) >id=subjid predid=6 (T ) URI =r (I ) a resource is classi ed Table 3: Benchmark Query Templates for RDF Description Bases whereas in Q10 the join cost (for GenRepr) is comparable to its evaluation involves index scans on the property table, the cost of evaluating set union (for SpecRepr). whereas in GenRepr di erent evaluation plans are executed Queries Q3, Q4 and Q6 involve a transitive closure compu- in each case. Q9 has been tested for three properties with tation (using a variation of the Æ -wavefront algorithm [21]) 6292, 52029 and 1770584 instances respectively. over the subclass hierarchy. SpecRepr outperforms GenRepr To summarize, SpecRepr outperforms GenRepr, which - by a factor of up to 2.8. In Q3 and Q6, we use the same in the case of complex path expressions - pays a severe per- three classes having 3, 30 and 3879 subclasses and a total of formance penalty for maintaining large tables. We argue 2, 20 and 9049 instances respectively. The execution times that the performance of SpecRepr can be further improved in these three cases depend on the sizes of intermediate re- by employing an appropriate encoding system (e.g., Dewey, sults (i.e., the costs of joins involving the tables Triples or post x, pre x, etc.) that preserves the taxonomic relation- SubClass) as well as, the number of iteration steps of the ships of schema labels. In this way, checking subclass rela- algorithm (i.e., the length of the longest path from the given tionships can be done in constant time. We believe that this class to its leaves, called depth). In Q4, for the same root approach will prove to be quite useful, not only for RDF, but class, we have checked for subclasses residing at depth 3, for tree-structured data, such as XML [2]. 5 and 7 respectively. The di erence in the obtained times between Q3, Q6 and Q4 is due to the di erent evaluation 7. SUMMARY method used: "top-down" for the former (i.e., from the sub- In this paper we presented the architecture and function- tree root to the leaves) and "bottom-up" for the latter. ality of ICS-FORTH RDFSuite, a suite of tools for RDF In the case of queries Q8 and Q9 SpecRepr exhibits a much metadata management. RDFSuite addresses a notable need better performance than GenRepr. GenRepr reaches its lim- for RDF processing in Web-based applications (such as Web its when table Triples needs to be self-joined (here Post- portals) that aim to provide a rich information content made greSQL uses a hash join algorithm), whereas in SpecRepr, up of large numbers of heterogeneous resource descriptions. a join between two small tables suÆces to be evaluated as It comprises eÆcient mechanisms for parsing and validating expected. Speci cally, SpecRepr outperforms GenRepr by a RDF descriptions, loading into an ORDBMS as well as query factor ranging from 1753 up to 95538. Note that GenRep processing and optimization in RQL. We also presented a will su er similar performance limitations in the evaluation formal data model for RDF metadata and de ned a set of of queries involving complex path expressions (e.g., in more constraints that enforce consistency of RDF schemas, thus sophisticated schemas than in the case of the ODP cata- enabling the incremental validation and loading of volumi- log) which will essentially result in a number of self-joins of nous description bases. We argue that, given the immense table Triples. Query Q8 has been tested for value ranges volumes of information content in Web Portals, this is a vi- returning 1, 10 and 40 resources respectively. In SpecRepr, able approach to providing persistent storage for Web meta- data. By the same token, eÆcient access to information in Query Generic Speci c Case 1 Case 2 Case 3 Case 1 Case 2 Case 3 Q1 0.0015 0.0012 Q2 0.0017 0.0028 0.02 0.0012 0.0022 0.0124 Q3 0.0460 0.082 344.91 0.0463 0.0612 341.98 Q4 0.033 0.0415 0.0662 0.0333 0.0415 0.0662 Q5 0.0043 0.008 0.04 0.0015 0.0028 0.027 Q6 0.0573 0.315 627.43 0.0508 0.1118 482.45 Q7 0.0034 0.0034 0.0034 0.0016 0.0016 0.00174 Q8 124.20 365.73 675.42 0.0013 0.0069 0.0466 Q9 110.58 117.68 185.7 0.031 0.0338 0.1059 Q10 0.0072 0.0072 0.0072 0.0071 0.0071 0.0076 Q11 0.0035 0.0043 0.0056 0.0013 0.0015 0.0015 Table 4: Execution Time of RDF Benchmark Queries such Portal applications is only feasible using a declarative [10] S. Cluet and G. Moerkotte. Nested Queries in Object language providing the ability to query schema and data Bases. In DBPL'93, pages 226{242, 1993. and to exploit schema organization for the purpose of op- [11] S. Decker, D. Brickley, J. Saarela, and J. Angele. A timization. We also reported on the results of preliminary query and inference service for rdf. In W3C QL tests conducted for assessing the performance of the loading Workshop, 1998. component of RDFSuite. These results illustrate that the [12] C. Finkelstein and P. Aiken. Building Corporate approach followed is not only feasible, but also promising for Portals using XML. McGraw-Hill, 1999. yielding considerable performance gains in query processing, [13] D. Florescu and D. Kossmann. A performance as compared to monolithic approaches. evaluation of alternative mapping schemes for storing Current research and development e orts focus on study- xml data in a relational database. Technical Report ing the transactional aspects of loading RDF descriptions in 3680, INRIA Rocquencourt, France, 1999. an ORDBMS, as well as, the problem of updating or revising [14] ICS-FORTH. The ICS-FORTH RDFSuite web site. description bases. A detailed analysis of query optimiza- http://139.91.183.30:9090/RDF, 2001. tion using alternative representation schemes is underway. [15] ISO. Information Processing-Text and OÆce Systems- Furthermore, appropriate index structures for reducing the Standard Generalized Markup Language (SGML). cost of recursive querying of deep hierarchies need to be de- ISO 8879, 1986. vised as well. Speci cally, an implementation of hierarchy linearization is underway, exploring alternative node encod- [16] G. Karvounarakis. A Declarative RDF Metadata ings. Last, but not least, we intend to extend our formal Query Language for Community Web Portals. data model to capture higher-order aspects such as state- Master's thesis, University of Crete, 2000. ment rei cation. [17] O. Lassila and R. Swick. Resource Description Framework (RDF) Model and Syntax Speci cation, 8. REFERENCES W3C Recommendation. Technical report, , 1999. [1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the [18] J. Liljegren. Description of an rdf database Web: From Relations to Semistructured Data and implementation. Available at XML. Morgan Kaufmann, 1999. WWW-DB.stanford.edu/~melnik/rdf/db-jonas.html. [2] S. Abiteboul, H. Kaplan, and T. Milo. Compact [19] M. Marchiori and J. Saarela. Query + metadata + labeling schemes for ancestor queries. In 12th logic = metalog. In W3C QL Workshop, 1998. Symposium on Discrete Algorithms, 2001. [20] S. Melnik. Storing rdf in a relational database. [3] Serge Abiteboul, Richard Hull, and Victor Vianu. Available at Foundations of Databases. Addison-Wesley, 1995. http://WWW-DB.stanford.edu/~melnik/rdf/db.html. [4] S. Alexaki. Storage of RDF Metadata for Community [21] G. Qadah, L. Henschen, and J. Kim. EÆcient Web Portals. Master's thesis, Univ. of Crete, 2000. Algorithms for the Instantiated Transitive Closure [5] G. Beged-Dov et.al. RSS 1.0 Speci cation Protocol. Queries. IEEE Transactions on Software Engineering, Draft, August 2000. 17(3):296{309, 1991. [6] D. Brickley and R.V. Guha. Resource Description [22] Some proposed RDF APIs. Framework (RDF) Schema Speci cation 1.0, W3C GINF, RADIX, Mozzila, Jena, Redland: Candidate Recommendation. Technical report, , 2000. www.w3.org/RDF/Interest. [7] D. Brickley and L. Miller. Rdf, sql and the semantic [23] F. Tian, D. DeWitt, J. Chen, and C. Zhang. The web - a case study. Available at Design and Performance Evaluation of Alternative ilrt.org/discovery/2000/10/swsql/. XML Storage Strategies. Technical report, Univ. of [8] L. Cardelli. A semantics of multiple inheritance. Wisconsin, 2000. Information and Computation, 76(2/3):138{164, 1988. [24] K. Tolle. ICS-Validating RDF Parser: A Tool for [9] V. Christophides, S. Cluet, and G. Moerkotte. Parsing and Validating RDF Metadata and Schemas. Evaluating Queries with Generalized Path Master's thesis, University of Hannover, 2000. Expressions. In Proc. of ACM SIGMOD, pages [25] S. Weibel, J. Miller, and R. Daniel. Dublin Core. In 413{422, 1996. OCLC/NCSA metadata workshop report, 1995. [26] Web service description language www106.ibm.com. /developerworks/library/ws-rdf, 2000.