The Features of BigOWLIM that Enabled the BBC’s World Cup Website

Atanas Kiryakov, Barry Bishop, Damyan Ognyanoff, Ivan Peikov, Zdravko Tashev, Ruslan Velkov
Ontotext AD, 135 Tsarigradsko Chaussee, Sofia 1784, Bulgaria

ABSTRACT

Semantic repositories (RDF databases with an inferencer and a query answering engine) are set to become a cornerstone of the Semantic Web (and Linked Open Data) due to their ability to store and reason with the massive quantities of data involved. In this paper, we describe the features of BigOWLIM that have allowed it to penetrate into the commercial sector, focusing on one particular use case: its use in the BBC’s World Cup website.

General Terms
Query answering, Semantic Web, Full-text Search

Keywords
Database, Triple Store, RDF, SPARQL, OWL

1. INTRODUCTION

There is no formal definition of the term ‘semantic repository’, so for the purposes of this article we use this term for Database Management Systems (DBMS) that can be used to store, query and manage data structured according to the Resource Description Framework (RDF) standards [5]. Compared to relational DBMS, such systems use flexible ontological schemata where data is processed by an inference engine according to a well-defined semantics.

While semantic repositories have been around for more than a decade, so far they have not managed to win the hearts of a sizeable fraction of software architects. We believe there are two major reasons for this: immature tools and inconsistent feature sets. The first is a natural teething problem of each new technology: mature tools can only appear on top of large user communities, which are not present for a young technology. More worrying are some misconceptions about their essential features, which are widely spread across many of the providers and users of semantic repositories. We will point out two of those:

• Misconception 1: Reasoning is not an important feature; materialization does not work, all the required inference can be handled efficiently during query evaluation;
• Misconception 2: Data-partitioning is an important feature; it is the way to deal with critical constraints of the technology, e.g. performance and scalability.

We show that these are “urban myths” by presenting the basic design decisions behind BigOWLIM, a semantic repository which delivers the best overall performance according to multiple independent evaluations conducted recently [2][11][9]. We will focus on features that appeared to be critical for the successful realization of the BBC’s World Cup 2010 web site, which was described in [7] as “the first large scale, mass media site to be using concept extraction, RDF and a Triple store to deliver content.”

2. BigOWLIM

OWLIM is a family of semantic repositories that provide storage, inference and novel data-access features delivered in a scalable, resilient, industrial-strength platform. The flagship of the family, BigOWLIM, combines the robustness and scalability of relational databases, the reasoning capabilities of inference engines, and the efficiency of column stores in handling sparse data and evolving schemata.

BigOWLIM delivers this functionality as an engine whose performance and resilience allowed it to serve in the core of the semantic web publishing stack running the BBC’s World Cup web site [8]. Here BigOWLIM handles millions of queries per day in a mission-critical production environment, where the data is updated hundreds of times per hour.
BigOWLIM is also optimized to integrate and reason with linked data. These capabilities are proven in a couple of linked data portals (http://FactForge.net and http://LinkedLifeData.com), which provide public access to billions of linked data statements integrated from tens of datasets.

3. INFERENCE CAPABILITIES

The inferencing strategy in OWLIM is one of total materialization (apart from an optimization for owl:sameAs that is not discussed in this paper) based on R-Entailment (as defined by ter Horst [10]), where Datalog-like [3] rules with inequality constraints operate directly on a single ternary relation that represents all triples.

Total materialization involves computing all the entailed statements at load time. While this introduces additional reasoning cost when loading statements into a repository, the desirable consequence is that query evaluation can proceed extremely quickly.

Several standard rule sets are included in all editions of OWLIM. In more or less increasing order of complexity, these are: ‘empty’ (no inference), OWL-Horst [10], RDFS [1], owl-max (RDFS plus most of OWL-Lite) and OWL2-RL [6].

In addition to the standard semantics, user-defined rule sets can be used. In this case the user provides the full pathname to a custom rule file that contains definitions of axiomatic triples, rules and consistency checks.

4. RETRACTING ASSERTIONS

As mentioned above, OWLIM materializes all inferred statements at load time and whenever new statements are added to the repository. This has the highly desirable advantage that query answering is very fast, because no further inference needs to be done. Updates that simply add new statements are treated in the same way as at load time, i.e. new statements are fed to the inference engine, which applies the inference rules (joining new statements with existing statements) until no new inferences are obtained. Since the semantics (both standard and custom) must be monotonic, insert operations incrementally add to the set of explicit and inferred statements. However, retracting explicit statements that are used to infer other statements is more complicated. In SwiftOWLIM, this is achieved by simply invalidating all inferred statements and re-computing the full closure whenever an update is committed. This has the advantage of simplicity of implementation, but the disadvantage of poor update performance and lack of scalability.

BigOWLIM has a specific optimization for handling delete operations that updates the full closure incrementally. This technique labels statements to be deleted and then uses forward chaining to identify those statements that can be inferred from them, followed by backward chaining to identify those inferred statements that are still supported by other means.

The result is that delete performance is only slightly worse than the insertion of new statements. This allows the repository to handle rapidly changing data even when answering queries over tens of billions of statements.

5. TRANSACTION CONTROL

OWLIM supports the ‘read committed’ transaction isolation level. It guarantees that changes will not impact query evaluation until the entire transaction they are part of is successfully committed. It does not guarantee that execution of a single transaction is performed against a single state of the data in the repository. Regarding concurrency, multiple update/modification/write transactions can be initiated and stay open simultaneously, i.e. one transaction does not need to be committed in order to allow another transaction to complete. Furthermore, update transactions are processed in sequence and do not block read requests in any way, i.e. hundreds of SPARQL queries can be evaluated in parallel (the processing is properly multi-threaded) while update transactions are being handled on separate threads.

One should note that OWLIM performs materialization, making sure that all the statements which can be inferred from the current state of the repository are indexed and persisted. By the time the commit method completes, all reasoning activities related to changes introduced by the corresponding transaction will have already been performed.

6. REPLICATION CLUSTER

BigOWLIM can be used in a cluster configuration where replication is used to improve resilience and provide scalable query answering.

The query performance of the cluster represents the sum of the throughputs that can be handled by each of the instances. In a simple configuration of 3 or 4 worker nodes, hundreds of thousands of query requests can be answered per hour while at the same time processing thousands of updates per hour, with non-trivial inference.

In a cluster configuration, there are two types of nodes: Masters and Workers. Masters act as the gateway to the cluster and all read/write requests go through these nodes. A cluster can have more than one master node, but only one is allowed to operate in read/write mode. The other master nodes operate in read-only mode, otherwise known as ‘hot-standby’. They can be used for marshalling read requests and can take over handling updates if the current read/write master fails. Worker nodes are standard BigOWLIM instances exposed by the Sesame HTTP server, a servlet running in Tomcat or similar. Read and write requests are passed to the workers from the master nodes. This simple arrangement allows for a great deal of flexibility in the design of a cluster topology. The example given in Fig. 1 has two master nodes and three worker nodes. At any moment in time, clients of the cluster can send read requests (queries) to either master node, but updates can only be handled by the master in read/write mode. If this master node should fail, the hot standby master can be brought into read/write mode and from then on will handle both read requests and updates, as well as taking over responsibility for ensuring the synchronization of all the worker nodes.

Each master node implements a JMX MBean [4] that is accessible using standard Java instrumentation tools, such as JConsole, and can be used to monitor and control the cluster while it is running. Typical activities supported include monitoring the health of each node, gathering statistics, and adding and removing worker nodes.

[Figure: a Read/Write Master dispatches queries and updates to the workers (read/write); a Hot standby Master dispatches queries to the workers (read only); Worker 1, Worker 2 and Worker 3 are standard BigOWLIM instances.]
Fig. 1 A typical replication cluster configuration

During normal operation, a master node will keep track of the size of each worker’s read request queue, such that each read request is sent to the worker with the shortest read queue. Update requests are handled differently. First of all, the update is tested against a single worker node. If the update is successful and subsequent consistency checks pass, then the update request is considered ‘safe’ and is passed to the rest of the worker nodes. Master nodes take additional care to ensure that the states of all worker nodes are properly synchronized and if an anomaly is detected, the problem worker node is released from the cluster. The monitor and control JMX interface can be used to return worker nodes to the cluster and initiate their synchronization.

In the event of a failure of a worker node, the performance degradation is graceful with respect to the number of healthy workers. The cluster can remain operational with just a single worker node.

7. CONCLUSION

The emerging Web of data has provided new challenges for software components that must expose this data and enable its widespread consumption. The OWLIM family of semantic repositories is ideally suited to this task due to its ability to store, reason and answer queries using the massive datasets involved.

OWLIM’s development over the last 6 years was driven by pragmatic design decisions aimed at meeting the requirements of a range of real-world applications, which use it for data integration, metadata management and multi-paradigm information retrieval techniques that combine structured queries and reasoning on the one hand with full-text search and co-occurrence analysis on the other. This allowed OWLIM to reach the level of maturity and comprehensiveness needed to serve as the back end for such a high-profile application as the BBC’s World Cup 2010 web site. This use case demonstrated the viability of several design decisions:

• Distributed configuration, based on data replication, is ideal for applications where resilience and horizontal scalability with respect to query loads are key; in such environments data partitioning is inefficient and inappropriate;
• Reasoning based on forward chaining and materialization provides very good overall performance. When paired with intelligent retraction techniques, it can cope with large numbers of updates, while simultaneously dealing with heavy query loads.

OWLIM continues to evolve with various new features planned for the near future. The next release of OWLIM will include enhanced support for geo-spatial data and some of the widely accepted geo-spatial vocabularies. Specialized indices will be used to access spatial data and a range of SPARQL extension functions will allow for expressive queries using 2D and 3D geometry.

The next release will also include interfaces that support the JENA RDF framework, enabling OWLIM to be used with both Sesame and JENA, the two most widely used RDF frameworks.

The current set of advanced features and world-leading performance have helped to position OWLIM as the semantic repository of choice for all environments that manage RDF data, particularly for Web-scale applications. The future evolution of OWLIM towards better compatibility and even more powerful data management features will ensure the continued uptake of this technology.

8. REFERENCES

1. Brickley, D.; Guha, R.V. RDF Vocabulary Description Language 1.0: RDF Schema. W3C (10 Feb 2004). http://www.w3.org/TR/rdf-schema
2. Bizer, Ch.; Schultz, A. BSBM Results for Virtuoso, Jena TDB, BigOWLIM (November 2009). http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V5/index.html
3. Gallaire, H.; Minker, J. (eds.). Logic and Data Bases, Symposium on Logic and Data Bases, Centre d'études et de recherches de Toulouse, 1977. Advances in Data Base Theory, Plenum Press, New York, 1978.
4. Java Management Extensions (JMX), homepage: http://download-llnw.oracle.com/javase/1.5.0/docs/guide/jmx/
5. Klyne, G.; Carroll, J. J. (eds.) (2004). Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation 10 Feb. 2004. http://www.w3.org/TR/rdf-concepts/
6. Motik, B.; Cuenca Grau, B.; Horrocks, I.; Wu, Z.; Fokoue, A.; Lutz, C. (eds.) (2009). OWL 2 Web Ontology Language Profiles. W3C Candidate Recommendation 11 June 2009. http://www.w3.org/TR/owl2-profiles/
7. O'Donovan, J. The World Cup and a call to action around Linked Data. BBC blog post. http://www.bbc.co.uk/blogs/bbcinternet/2010/07/the_world_cup_and_a_call_to_ac.html
8. Rayfield, J. BBC World Cup 2010 dynamic semantic publishing. BBC blog post. http://www.bbc.co.uk/blogs/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html
9. Stoilos, G.; Grau, B. C.; Horrocks, I. How Incomplete is your Semantic Web Reasoner? In Proc. of the 20th Nat. Conf. on Artificial Intelligence (AAAI 10), 2010.
10. ter Horst, H. J. Combining RDF and Part of OWL with Rules: Semantics, Decidability, Complexity. In Proc. of ISWC 2005, Galway, Ireland, November 2005, pp. 668-684.
11. Thakker, D.; Osman, T.; Gohil, S.; Lakin, P. A Pragmatic Approach to Semantic Repositories Benchmarking. In Proc. of the 7th Extended Semantic Web Conference, ESWC 2010.

Copying without fee all or part of this material is permitted only for private and academic purposes, providing that the title of the publication, the authors and its date of publication appear. Copying or use for commercial purposes, or to republish, to post on servers or to redistribute to lists, is forbidden unless an explicit permission is acquired from the copyright owners; the authors of the material.
Workshop on Semantic Data Management SemData@VLDB 2010, September 17, 2010, Singapore. Copyright 2010: www.semdata.org.
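
The materialization strategy of Section 3 and the incremental retraction of Section 4 can be illustrated with a minimal Python sketch. It is not BigOWLIM's implementation: it assumes a toy rule set of just two RDFS rules (subclass transitivity and type inheritance), keeps everything in memory, and uses a simplified delete-and-rederive scheme in which a one-step backward check is followed by forward propagation. The names Repo, apply_rules, close and supported are hypothetical and chosen only for this illustration.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def apply_rules(delta, facts):
    """One round of rule application with at least one premise in delta.

    Both toy rules share one shape: (x p m), (m subClassOf y) => (x p y),
    where p is subClassOf (transitivity) or type (type inheritance).
    Callers must ensure that delta is a subset of facts.
    """
    out = set()
    for d in delta:
        for f in facts:
            for (x, p, m), (m2, q, y) in ((d, f), (f, d)):
                if p in (SUBCLASS, TYPE) and q == SUBCLASS and m == m2:
                    out.add((x, p, y))
    return out


def close(facts, delta):
    """Forward-chain from delta until no new statements appear (mutates facts)."""
    facts |= delta
    while delta:
        delta = apply_rules(delta, facts) - facts
        facts |= delta
    return facts


def supported(triple, facts):
    """Backward check: is triple derivable in one step from facts alone?"""
    s, p, o = triple
    if p not in (SUBCLASS, TYPE):
        return False
    return any(s2 == s and p2 == p and (o2, SUBCLASS, o) in facts
               for (s2, p2, o2) in facts)


class Repo:
    """Hypothetical in-memory repository with total materialization."""

    def __init__(self):
        self.explicit = set()  # asserted statements
        self.closure = set()   # asserted plus inferred statements

    def add(self, triples):
        triples = set(triples)
        self.explicit |= triples
        close(self.closure, triples - self.closure)

    def remove(self, triples):
        removed = set(triples) & self.explicit
        self.explicit -= removed
        # Step 1: forward-chain from the removed statements to label
        # every statement that may have depended on them.
        suspect, delta = set(removed), set(removed)
        while delta:
            delta = apply_rules(delta, self.closure) - suspect
            suspect |= delta
        survivors = self.closure - suspect
        # Step 2: keep suspects that are still asserted or still supported
        # by surviving statements, then propagate forward from them.
        rederived = {t for t in suspect
                     if t in self.explicit or supported(t, survivors)}
        self.closure = close(survivors, rederived)


repo = Repo()
repo.add([
    ("Striker", SUBCLASS, "Footballer"),
    ("Footballer", SUBCLASS, "Athlete"),
    ("Footballer", SUBCLASS, "Person"),
    ("Athlete", SUBCLASS, "Person"),
    ("rooney", TYPE, "Striker"),
])
repo.remove([("Footballer", SUBCLASS, "Athlete")])
```

Queries in this sketch would be plain lookups against repo.closure, which reflects the point made in Section 3 that the reasoning cost is paid at update time rather than at query time. After the removal, statements that depended only on the deleted triple disappear from the closure, while those with another derivation are retained.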