Achieving Scalability and Expressivity in an RDF Knowledge Base by Implementing Contexts

Heiko Stoermer, University of Trento, Dept. of Information and Communication Tech., Trento, Italy. Email: stoermer@dit.unitn.it
Ignazio Palmisano, Università degli Studi di Bari, Dipartimento di Informatica, Bari, Italy. Email: palmisano@di.uniba.it
Domenico Redavid, Università degli Studi di Bari, Dipartimento di Informatica, Bari, Italy. Email: redavid@di.uniba.it

Abstract: In this paper we present the context architecture implemented on top of the RDFCore system. With this extended Knowledge Representation framework we try to overcome some of the limitations of RDF and OWL as they are today, without losing sight of performance and scalability issues. We illustrate motivations (partly based on requirements in the VIKEF project) as well as theoretical background, implementation details and test results of our latest work.

I. INTRODUCTION

Motivated by requirements of the VIKEF project (Virtual Information and Knowledge Environment Framework; more information at http://www.vikef.net), where a large-scale Semantic Web knowledge base about documents and other objects provides for intelligent services to the user, we are investigating and developing an extended KR framework that tries to overcome some of the limitations of RDF and OWL as they are today. Our basic idea is to introduce the notion of context into Semantic Web Knowledge Representation (KR), as previously described in [2], [14]. We claim that the distributed nature of the Semantic Web raises issues that can be attacked by contextualizing knowledge bases, i.e. restricting the scope of statements to the circumstances they were made under.

The contribution of this paper is to present one possible realization of this more complex KR approach for the Semantic Web, and to illustrate our progress based on the KBMS RDFCore [4]. In continuation of the ideas and preliminary results presented in [14], we have been concentrating more on the aspects of compatibility relations (CRs) between contexts, which can be used to describe in which way statements in more than one context can be combined to answer queries to the KB. We have conducted a more extensive experiment to investigate performance aspects of RDFCore and our extensions, and, backed by general theories of Contextual Reasoning, we believe that we will in some cases be able to provide for better scalability than a flat, non-contextual KB.

The paper is organized as follows: In Sect. II we present intuitive and technical motivations for our approach, as well as related work. Sect. III describes our general proposal, whereas Sect. IV contains a technical description of the steps taken to realize our ideas. In Sect. V we present our experimentation results, and we wrap up with a conclusion and a short mention of planned further work in Sect. VI.

II. MOTIVATION AND RELATED WORK

One of our initial motivations to move in the direction of contexts in Semantic Web KBs was our critical view on one of the ideas of the Semantic Web, namely that, with a shared ontology, two RDF ABoxes provided by different agents can simply be merged, collapsed on identical URIs, and thus provide a new, bigger KB for answering a query (the pre-merge scenario is depicted in Fig. 1).

Fig. 1. Two RDF ABoxes A and A' compliant to a single TBox T.

However, apart from implicit semantics that are omitted when applying such a strategy, cases can be constructed that unveil problems even on the logical level. Take the following example, as depicted in Fig. 2: on the formalization side we have a TBox T with some relations that have cardinality constraints, and two ABoxes A and A' with assertions compliant to this TBox. Both ABoxes are consistent by themselves, but when merged, they produce an inconsistency, as the two following statements violate the cardinality constraints in T:

    < prodi prime minister italian government >
    < berlusconi prime minister italian government >

Fig. 2. Example formalization that produces an inconsistency when merged.

Relying on a host of research done in the area of Context in KR [7], [12], [11], [6], [1], [5], [13], we believe it is a viable approach to attack issues of this nature by binding consistent sets of assertions to the circumstances they were made under, i.e. to limit their scope to a context, as we will describe in Sect. III.

As discussed in [8], [2], [3], this contextualization can serve as a basis for a number of KR modelling aspects, such as temporal evolution, trust, beliefs and provenance. The contributions of our approach compared to the proposals made in [8], [3], [10], [9], as well as compared to named graph implementations in current RDF triple stores, are that i) we do not propose or require an extension of the current RDF standard and ii) we aim at substantial support for Compatibility Relations (CRs).

These relations between contexts enable us to make explicit in which way the assertions in the related contexts are supposed to be combined for query answering, to provide for flexible and powerful contextual reasoning as envisioned in the mentioned bibliography.

In the course of the VIKEF project it became evident that some of the relations we have in mind have procedural semantics and thus cannot be formalized in an OWL ontology; these are what we are concentrating on at the moment. In the next section we describe our exemplary proposal of such a complex relation.

III. AN EXEMPLARY CR

The EXTENDS relation we have chosen to illustrate is meant to describe a situation where we know that two contexts describe the same object, but assume that one context contains more information about it than the other.

Take the example of two Information Extraction processes P and P' that are run on the same document, at different points in time. Assume P' is a more advanced process and is able to extract more information from the document. We propose to model this as two contexts C (created by P) and C' (created by P') with a relation EXTENDS that explicates that C' is an extension to C (a necessary condition for this relation is that both contexts describe the same object). Intuitively we want to keep the information derived from different sources separate and with explicit metadata, but have the possibility to combine the resulting information where necessary.

When a query q is posed on C', the procedural semantics of EXTENDS are envisioned as follows:

    if q can be answered in C'
    then return answer
    else propagate query to C' union C.

One issue that becomes obvious immediately is the case where the union of C' and C produces an inconsistent ABox, which makes query answering impossible. This can result from cardinality constraints in the TBox (see the Berlusconi-Prodi example in Sect. II), or from subsumption issues (an individual o is said to be instance of different classes). Our basic solution approach is to extract a minimal subgraph containing the statement(s) that caused the inconsistency into a named graph NG, as illustrated in Fig. 3.

Fig. 3. Two contexts C and C' in an EXTENDS relation.

The result is that the query can be processed on the conflict-free part of the union of C' and C. One possible criticism could be that we could of course pose the query to C alone, without respecting C', and thus avoid the conflict altogether. This however ignores the EXTENDS relation between the two (which has been established for a reason), and thus should only be allowed on contexts that are not in such a relation.

The case is of course slightly more complex when we take into account more than two contexts. We envision the EXTENDS relation to be transitive. This can result in a reasoning chain i) when establishing the relation, as conflicts have to be detected and re-modelled, and ii) when querying the contexts, as the necessary contexts and relevant subgraphs have to be traversed. This chain however is non-cyclic, as the relation is directional. Section IV describes our first implementation of this relation.

We have chosen to attack and illustrate this specific relation due to its relative complexity. However, we are convinced that our basic approach as described in [14] is fairly general and can be used to implement relations of different kinds. In the course of the project we envision relations that make explicit temporal evolution, trust and a number of domain specific aspects.

IV. REALIZATION

A. RDFContextManager

The component we developed to manage contexts is called RDFContextManager; its architecture is presented in Fig. 4. RDFContextManager is implemented as a Java interface, exposing methods to:

• set the Compatibility Relation Ontology (CRO), which is the ontology that defines Compatibility Relations, Contexts, parts of Contexts (Graphs) and also gives the concepts necessary to represent context splitting and relations between Contexts and Graphs
• add new statements to the CRO, stating for example that a given URI C1 represents a Context, that this Context extends an existing context C2, or that there is a Graph G1 which is part of C2 and is compatible or not with C1
• add, remove or update Contexts and Graphs in the underlying persistence layer
• obtain Views over a Context, e.g. ask RDFContextManager to return all Contexts and Graphs that are connected to a Context C1 with EXTENDS relations, directly or by means of part of relations, following all the relation chains and obeying imposed limitations

Fig. 4. Architecture of RDFContextManager

A CompatibilityRelation is a Java interface exposing methods to:

• verify whether an implementation of CompatibilityRelation should be triggered into action by some statements added to the CRO, e.g. the insertion of a statement C1 EXTENDS C2 should trigger the consistency check over C1 ⊔ C2, and, if an inconsistency is detected, countermeasures should be undertaken, in order to guarantee that a View over C1 does not return an inconsistent set of statements
• carry out the check specific for this CompatibilityRelation
• ask this implementation to provide a set of Contexts or Graphs that would be excluded from a View over a Context C1 due to some reason, e.g. incompatibility due to inconsistency
• ask the implementation to provide a set of Contexts or Graphs that would be included in a View over a Context C1, e.g. because of an EXTENDS or a part of relation or chain of relations

The implementation we are presenting in this paper relies on RDFCore for RDF models storage, and on Pellet (www.mindswap.org/2003/pellet/) for reasoning tasks such as the consistency check over a View. As illustrated in Fig. 4, the DL reasoner is used by the CompatibilityRelation implementations (note that different implementations could need different reasoning settings, e.g. only RDFS or OWL Lite inference rather than OWL DL inference), while all the storage and retrieval of RDF models is done by RDFContextManager, which uses RDFCore and its facilities for model storage and query [14], using the multiuser environment of RDFCore to enable use of Context information by other applications.

B. The Compatibility Relations Ontology (CRO)

The CRO contains the definition of the main concepts used to describe the KB structure in terms of contexts. It contains the definition of Context and the definition of Graph, where both concepts represent entities that are named graphs. A Context has the (informal) property of representing something that has a meaning as a whole, e.g. the set of statements extracted from a specific document, at a specific time, with a specific algorithm, while a Graph is a set of statements that is included in one or more Contexts or other Graphs, but has no specific meaning alone (e.g. the set of statements in a Context that cause inconsistencies with another Context). A domain-range view of the CRO is given in Fig. 5.

Fig. 5. Domain-range view of the CR Ontology

Moreover, the CRO contains the definition of the SplittingReason class, which represents the reason that led to the isolation of a part of a Context and the storage of that fragment as a Graph. A SplittingReason instance includes references to the Context to which the statements being split belonged, the Graph that will hold these statements, the reason for which this split has been done, e.g. because the statements create inconsistencies w.r.t. another context (which is also linked to the reason), and the reification of the statements in the CRO that triggered the split, if any.

An example of SplittingReason generation is the one we illustrate in detail in Sect. IV-C.1: given Contexts C1 and C2, if we add to the CRO the statement S1 = C1 EXTENDS C2, this will trigger a consistency check over C1 ⊔ C2. If there is an inconsistency, the statements in C2 that cause the inconsistency are moved to a Graph G1, and then a SplittingReason SR1 will be created in the CRO, linked to C2 and G1, with a reason of class Inconsistency which is linked to C1, and a part of relation between C2 and G1; S1 will be reified and attached to SR1, so that the complete splitting process can be tracked.

The CRO also acts as a registry for CompatibilityRelation implementations, since each declaration of a CompatibilityRelation amounts to the declaration of a property in this ontology. An AnnotationProperty for this property, called implementation uri, gives the Java class name of the corresponding implementation; this is used to retrieve the set of CompatibilityRelation implementations that RDFContextManager will use when managing the CRO and the knowledge base.

C. Use of Compatibility Relations (CR)

The simplest use case for the framework is as follows:

• An external application adds one or more different Contexts in RDFContextManager, assigning them URIs or letting RDFContextManager choose one
• The external application asserts some relations between the contexts or specific to a context; the relations between the contexts are expressed through properties defined in the CRO
• RDFContextManager receives these new assertions, and triggers all the available CR implementations, which first verify whether any of the new assertions is relevant (i.e. the asserted relation corresponds to the URI the implementation is attached to) and then check whether the new relation is likely to cause reorganization of the knowledge base; if this is the case, corrective actions are undertaken
• The external application makes a query over the CRO to find out all the contexts that satisfy some conditions (e.g. all the contexts which have been created on a specific date), and then asks to perform a query over the set of statements resulting from the union of the contexts; this involves creation of a View for each context that is selected by the query

1) EXTENDS Example: We will now use EXTENDS as a practical example of the described use case:

• Two contexts C1 and C2 are inserted in RDFContextManager
• C1 is asserted to extend C2 w.r.t. a specific subject S1: the following statements are added to the CRO:
    < C1 EXTENDS C2 >
    < C1 describes S1 >
    < C2 describes S1 >
  Matching objects for the describes predicate are necessary because this enables an application to say that C1 extends different unrelated contexts, in the sense that C1 adds information to both of them, even if the two extended contexts are incompatible; in fact, a View over C1, which is forced to be consistent, will include only one of the extended contexts
• The implementation of EXTENDS will be triggered to check matching with the three statements, and it will fire the check for knowledge base reorganization
• The check performed by EXTENDS consists of verifying that any View over C1 that follows the EXTENDS chain does not produce an inconsistent model; therefore, it takes the content of C1 and of C2 and runs a DL reasoner (Pellet in this case) over the union. If any inconsistency is detected, EXTENDS tries to isolate the responsible statements, selects those that appear in C2 and removes them from C2; the statements are then stored as a Graph G1. The split is tracked by creating a SplittingReason object, connected to C2, which is the source, and G1, which is the result; it is also connected to a reason, which in this case is an instance of the Inconsistency class, and in turn to C1, which is related as incompatible w.r.t. G1. The statements added to the CRO are reified and attached to the SplittingReason as triggers, in order for the split to be traceable, and finally a part of relation is asserted between C2 and G1. Since the EXTENDS relation is defined as transitive, in case C2 is already connected through an EXTENDS relation to other contexts, the check is performed not against C2 alone but over the resulting View; the generated splits in the KB can then be distributed along the EXTENDS chain, which is one of the scalability issues we analyze in Sect. V
• When a View over C1 is requested, all the CR implementations are first requested to provide a set of Contexts or Graphs that must not appear in the final view (EXCLUDE set), i.e. they are requested to forbid following some paths in the CRO assertions. This is because, when multiple CRs are present, some of them may forbid the presence of a result that others would allow to appear in the results; simply removing all the forbidden results after all the paths are followed is neither correct nor efficient, since this would require complex pruning strategies. After the EXCLUDE set has been computed, all the CR implementations are required to provide the set of Contexts or Graphs that should appear in the resulting View (INCLUDE set), and they will prune their visiting graph as soon as a forbidden result is reached. The final View is then computed as the INCLUDE set plus the resources connected through part of to these elements (not including those in the EXCLUDE set)
• The View can now be viewed as a single model, or the set of URIs for the contexts and graphs can be used as dataset for a SPARQL query to be issued to RDFCore, which in turn uses ARQ (http://jena.sourceforge.net) as SPARQL engine to interpret and answer it

V. RESULTS

In this section we present the empirical evaluation we have conducted so far. In order to check the system for scalability, we needed a big knowledge base with non-trivial contents that is at the same time divided in smaller chunks without changing the semantics of the content. This, however, seems a very difficult task, and so far we have not found real world ontologies that satisfy these requirements, so we used a homemade tool to generate individuals for a generic ontology; repeating the process many times gave us two well sized knowledge bases.

Using the SOFSEM ontology (http://nb.vse.cz/~svabo/oaei2006/data/Conference.owl), an ontology to describe the SOFSEM conference, we generated two knowledge bases, one composed of 30 models containing about 70000 statements each (for a total of more than 2 million triples), and the other containing 900 models of about 2000 triples each (1.8 million triples). On the first one, we tried to chain the models with EXTENDS relations involving two models at a time, while in the second one we chained thirty models at a time, obtaining many chains, and then joined the chains. The results for the first experiment are presented in Table I, those for the second experiment in Table II. The second experiment is also depicted in Fig. 6.

TABLE I
RESULTS FOR 70000 TRIPLES MODELS

    Model number    Consistency check (ms)
    0 - 1           63476
    2 - 3           49529
    4 - 5           54184
    6 - 7           58410
    8 - 9           62342
    10 - 11         60091
    12 - 13         62621
    14 - 15         59216
    16 - 17         58142
    18 - 19         62041

TABLE II
RESULTS FOR SMALL SIZED MODELS AND LONG CHAINS

    Model   View   Consistency  |  Model   View   Consistency  |  Model   View   Consistency
    number  (ms)   check (ms)   |  number  (ms)   check (ms)   |  number  (ms)   check (ms)
    0       91     2043         |  9       156    5913         |  18      208    9986
    1       176    2542         |  10      152    6076         |  19      214    10950
    2       111    2956         |  11      175    6497         |  20      218    10699
    3       122    3185         |  12      166    7101         |  21      232    11434
    4       153    3634         |  13      175    7383         |  22      267    11696
    5       114    3894         |  14      186    8046         |  23      242    12064
    6       132    4328         |  15      195    8421         |  24      244    12706
    7       134    4907         |  16      208    8862         |  25      249    13095
    8       149    5057         |  17      220    9486         |  26      276    13476

Fig. 6. Results trend for small sized models

As is depicted in the graph, the time elapsed to create a view over the graphs is almost constant (between roughly 0.1 and 0.3 seconds in Table II), even if the number of relations to navigate increases, while the time elapsed to check the consistency of the models grows proportionally to their size. It is important to note that the consistency check runs only when new relations are entered in the CRO; the most frequent operation, then, will be the request to create a View starting from some specified models, and the experimental evaluation shows that this operation is usually performed in less than half a second on the test machine (a laptop with 512 MB of RAM, which is not an adequate server setup). The time required to complete the consistency check and automatic splitting on models of greater size is around one minute, which is acceptable from our point of view if we consider that this operation has to be done only once, and occasionally as new relations are added.

The most relevant point, here, is that requesting a View operation will return a set of graph identifiers that can be used as dataset for a SPARQL query, ensuring that the model resulting from the union of the queried data is consistent, without having to check at the time of querying. This also means that the memory requirements (at query time) of the framework only depend on the number of relations between Contexts and Graphs, and not on the size of the contained data, or on their complexity. The memory needed by the SPARQL engine to run the query itself, instead, depends heavily on the specific query; no complete evaluation of the behavior of the system w.r.t. the possible kinds of queries has been performed yet.

VI. CONCLUSION AND FURTHER WORK

Based on the opinion that contexts in Semantic Web KR are a way to tackle some of the current limitations of the languages available and provide for better scalability in some cases, we have presented a theoretical approach and an implementation of Contextual Reasoning in a Semantic Web KB and the associated testing results. We have not only implemented a context mechanism into our KBMS to be able to use a context as a first-class object in assertions, but also illustrated a way to provide for context relations with procedural semantics which, in our opinion, is required for a complete context functionality.

Our next steps will be directed towards the formal definition and implementation of more compatibility relations. Some of them will be as required by the VIKEF project, but we are also interested in exploring more general and domain independent relations between contexts and their properties.

On the implementational side, these planned steps will be accompanied by the development of a more standardized test set and a set of exemplary queries that specifically display and make use of contexts, to assess the practicability, performance and scalability of our implementations.

VII. ACKNOWLEDGMENTS

This research was partially funded by the European Commission under the 6th Framework Programme IST Integrated Project VIKEF - Virtual Information and Knowledge Environment Framework (Contract no. 507173, Priority 2.3.1.7 Semantic-based Knowledge Systems; more information at http://www.vikef.net).

REFERENCES

[1] Massimo Benerecetti, Paolo Bouquet, and Chiara Ghidini. Contextual reasoning distilled. J. Exp. Theor. Artif. Intell., 12(3):279–305, 2000.
[2] Paolo Bouquet, Luciano Serafini, and Heiko Stoermer. Introducing Context into RDF Knowledge Bases. In Proceedings of SWAP 2005, the 2nd Italian Semantic Web Workshop, Trento, Italy, December 14-16, 2005. CEUR Workshop Proceedings, ISSN 1613-0073, online http://ceur-ws.org/Vol-166/70.pdf, December 2005.
[3] Jeremy Carroll, Christian Bizer, Patrick Hayes, and Patrick Stickler. Named Graphs, Provenance and Trust. In Proceedings of the Fourteenth International World Wide Web Conference (WWW2005), Chiba, Japan, volume 14, pages 613–622, May 2005.
[4] F. Esposito, L. Iannone, I. Palmisano, and G. Semeraro. RDF Core: a Component for Effective Management of RDF Models. In Isabel F. Cruz, Vipul Kashyap, Stefan Decker, and Rainer Eckstein, editors, Proceedings of SWDB'03, The first International Workshop on Semantic Web and Databases, Co-located with VLDB 2003, Humboldt-Universität, Berlin, Germany, September 7-8, 2003, 2003.
[5] Chiara Ghidini and Luciano Serafini. Distributed first order logics. In First International Workshop on Labelled Deduction [LD'98], 1998.
[6] Fausto Giunchiglia. Contextual reasoning. Epistemologia - Special Issue on I Linguaggi e le Macchine, XVI:345–364, 1993.
[7] Ramanathan V. Guha. Contexts: A Formalization and Some Applications. PhD thesis, Stanford, 1991.
[8] Ramanathan V. Guha, Rob McCool, and Richard Fikes. Contexts for the semantic web. In Sheila A. McIlraith, Dimitris Plexousakis, and Frank van Harmelen, editors, International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 32–46. Springer, 2004.
[9] Graham Klyne. Contexts for RDF Information Modelling. Content Technologies Ltd, October 2000. http://www.ninebynine.org/RDFNotes/RDFContexts.html.
[10] Graham Klyne. Circumstance, provenance and partial knowledge - Limiting the scope of RDF assertions, 2002. http://www.ninebynine.org/RDFNotes/UsingContextsWithRDF.html.
[11] John L. McCarthy. Generality in artificial intelligence. Commun. ACM, 30(12):1029–1035, 1987.
[12] John L. McCarthy. Notes on formalizing context. In IJCAI, pages 555–562, 1993.
[13] Luciano Serafini and Paolo Bouquet. Comparing formal theories of context in AI. Artif. Intell., 155(1-2):41–67, 2004.
[14] Heiko Stoermer, Ignazio Palmisano, Domenico Redavid, Luigi Iannone, Paolo Bouquet, and Giovanni Semeraro. RDF and Contexts: Use of SPARQL and Named Graphs to Achieve Contextualization. In Proceedings of the First Jena User's Conference, Bristol, UK, April 2006. http://jena.hpl.hp.com/juc2006/proceedings/palmisano/paper.pdf.
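
As an illustration of the procedural semantics of EXTENDS given in Sect. III, the following Python sketch answers a query on C' first and falls back to the conflict-free part of the union of C' and C. It is not taken from RDFCore: contexts are modelled as plain sets of triples, pattern_query is a hypothetical stand-in for a SPARQL query, and the sample triples and the named graph ng are made up for the example.

```python
from typing import Callable, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Query = Callable[[Set[Triple]], Set[Triple]]


def pattern_query(s: Optional[str] = None, p: Optional[str] = None,
                  o: Optional[str] = None) -> Query:
    """Hypothetical stand-in for a SPARQL query; None acts as a wildcard."""
    def run(graph: Set[Triple]) -> Set[Triple]:
        return {t for t in graph
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)}
    return run


def answer_with_extends(query: Query, c_prime: Set[Triple], c: Set[Triple],
                        conflict_graph: Set[Triple] = frozenset()) -> Set[Triple]:
    """Answer on C'; otherwise query C' united with C minus the conflict graph."""
    answer = query(c_prime)
    if answer:                      # q can be answered in C'
        return answer
    # propagate to the conflict-free part of C' union C
    return query(c_prime | (c - conflict_graph))


# Made-up data: C is the older extraction, C' the newer one that extends it.
c = {("doc1", "author", "author_a"), ("doc1", "year", "2005")}
c_prime = {("doc1", "title", "some title"), ("doc1", "year", "2006")}
# Assumption: "year" may have one value only, so the clashing statement of C
# has been extracted into a named graph NG beforehand.
ng = {("doc1", "year", "2005")}

by_year = answer_with_extends(pattern_query(p="year"), c_prime, c, ng)
by_author = answer_with_extends(pattern_query(p="author"), c_prime, c, ng)
```

Here the year query is answered by C' alone, while the author query is only answerable after propagation; the statement held in ng never reaches the result.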
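
The splitting step of the EXTENDS example in Sect. IV-C.1 can be pictured with the following minimal Python sketch. It rests on strong assumptions: the DL reasoner is replaced by a toy rule (a predicate listed in single_valued may link only one subject to a given object, and C1 is assumed consistent in itself), reification is reduced to storing the triggering triple, and the names SplittingReason fields, clashing, assert_extends and the graph naming scheme are all hypothetical rather than the RDFContextManager API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class SplittingReason:
    source: str             # Context the statements were removed from (C2)
    result: str             # Graph that now holds them (G1)
    reason: str             # reason class, here always "Inconsistency"
    incompatible_with: str  # Context the Graph conflicts with (C1)
    trigger: Triple         # CRO statement that triggered the split


def clashing(c1: Set[Triple], c2: Set[Triple],
             single_valued: Set[str]) -> Set[Triple]:
    """Toy consistency check standing in for a DL reasoner: statements of c2
    naming a different subject for a (predicate, object) pair filled by c1."""
    taken = {(p, o): s for s, p, o in c1 if p in single_valued}
    return {(s, p, o) for s, p, o in c2
            if (p, o) in taken and taken[(p, o)] != s}


def assert_extends(store: Dict[str, Set[Triple]], cro: Set[Triple],
                   c1: str, c2: str,
                   single_valued: Set[str]) -> Optional[SplittingReason]:
    """Record 'c1 EXTENDS c2' and split c2 if the union would be inconsistent."""
    trigger = (c1, "EXTENDS", c2)
    cro.add(trigger)
    bad = clashing(store[c1], store[c2], single_valued)
    if not bad:
        return None
    g1 = c2 + "/split/" + c1          # hypothetical naming scheme for G1
    store[c2] = store[c2] - bad       # remove the responsible statements from C2
    store[g1] = bad                   # and keep them as a separate Graph
    cro.add((g1, "part of", c2))
    return SplittingReason(c2, g1, "Inconsistency", c1, trigger)


# The Prodi/Berlusconi example of Sect. II plus one made-up harmless statement.
store = {
    "C1": {("prodi", "prime_minister", "italian_government")},
    "C2": {("berlusconi", "prime_minister", "italian_government"),
           ("italian_government", "seat", "rome")},
}
cro: Set[Triple] = set()
sr = assert_extends(store, cro, "C1", "C2", {"prime_minister"})
```

After the call, C2 keeps only its harmless statement, the clashing one lives in the new Graph, and the returned record links source, result, reason and trigger so that the split stays traceable.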
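
The View computation described at the end of Sect. IV-C.1 (EXCLUDE set first, then a pruned traversal for the INCLUDE set, then part of expansion) might look as follows in a much simplified form. The sketch assumes the CRO is a set of triples over context and graph identifiers, that a single relation implementation is active, and that incompatibility is recorded with a predicate named incompatible_with; that predicate name and the function build_view are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def build_view(cro: Set[Triple], root: str) -> Set[str]:
    """Return the identifiers of the Contexts and Graphs in a View over root."""
    # 1) EXCLUDE set: everything recorded as incompatible with the root.
    exclude = {s for s, p, o in cro if p == "incompatible_with" and o == root}

    # 2) INCLUDE set: follow the EXTENDS chain, pruning at forbidden nodes.
    include: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in include or node in exclude:
            continue
        include.add(node)
        stack.extend(o for s, p, o in cro if s == node and p == "EXTENDS")

    # 3) Add resources connected through "part of", except excluded ones.
    parts = {s for s, p, o in cro if p == "part of" and o in include}
    return include | (parts - exclude)


# Made-up CRO: C1 extends C2, which extends C3; G1 was split off C2 because
# it clashes with C1; G2 is an unproblematic part of C3.
cro = {
    ("C1", "EXTENDS", "C2"),
    ("C2", "EXTENDS", "C3"),
    ("G1", "part of", "C2"),
    ("G1", "incompatible_with", "C1"),
    ("G2", "part of", "C3"),
}
view_c1 = build_view(cro, "C1")
view_c2 = build_view(cro, "C2")
```

The View over C1 leaves out G1, whereas the View over C2 still contains it, so the split statements remain available to queries that do not involve C1. The resulting identifier set is what would be handed to the SPARQL engine as the dataset.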