The Features of BigOWLIM that Enabled the BBC’s
                    World Cup Website
     Atanas Kiryakov, Barry Bishop, Damyan Ognyanoff, Ivan Peikov, Zdravko Tashev,
                                     Ruslan Velkov
                        Ontotext AD, 135 Tsarigradsko Chaussee, Sofia 1784, Bulgaria


Copying without fee all or part of this material is permitted only for private and academic purposes, providing that the title of the publication, the authors and its date of publication appear. Copying or use for commercial purposes, or to republish, to post on servers or to redistribute to lists, is forbidden unless an explicit permission is acquired from the copyright owners; the authors of the material.
Workshop on Semantic Data Management SemData@VLDB 2010, September 17, 2010, Singapore.
Copyright 2010: www.semdata.org.

ABSTRACT

Semantic repositories (RDF databases with an inference engine and a query answering engine) are set to become a cornerstone of the Semantic Web (and Linked Open Data) due to their ability to store and reason with the massive quantities of data involved. In this paper, we describe the features of BigOWLIM that have allowed it to penetrate the commercial sector, focusing on one particular use case: the BBC’s World Cup website.

General Terms
Query answering, Semantic Web, Full-text Search

Keywords
Database, Triple Store, RDF, SPARQL, OWL

1.   INTRODUCTION

There is no formal definition of the term ‘semantic repository’, so for the purposes of this article we use this term for Database Management Systems (DBMS) that can be used to store, query and manage data structured according to the Resource Description Framework (RDF) standards [5]. Compared to relational DBMS, such systems use flexible ontological schemata, where data is processed by an inference engine according to a well-defined semantics.

While semantic repositories have been around for more than a decade, so far they have not managed to win the hearts of a sizeable fraction of software architects. We believe there are two major reasons for this: immature tools and inconsistent feature sets. The first is a natural teething problem of every new technology: mature tools can only appear on top of large user communities, which a young technology does not yet have. More worrying are some misconceptions about the essential features of semantic repositories, which are widespread among many of their providers and users. We will point out two of those:

  • Misconception 1: Reasoning is not an important feature; materialization does not work, all the required inference can be handled efficiently during query evaluation;
  • Misconception 2: Data-partitioning is an important feature; it is the way to deal with critical constraints of the technology, e.g. performance and scalability.

We show that these are “urban myths” by presenting the basic design decisions behind BigOWLIM, a semantic repository which delivers the best overall performance according to multiple independent evaluations conducted recently [2][11][9]. We will focus on features that proved critical for the successful realization of the BBC’s World Cup 2010 web site, which was described in [7] as “the first large scale, mass media site to be using concept extraction, RDF and a Triple store to deliver content.”

2.   BigOWLIM

OWLIM is a family of semantic repositories that provide storage, inference and novel data-access features delivered in a scalable, resilient, industrial-strength platform. The flagship of the family, BigOWLIM, combines the robustness and scalability of relational databases, the reasoning capabilities of inference engines, and the efficiency of column stores in handling sparse data and evolving schemata. BigOWLIM delivers this functionality as an engine whose performance and resilience allowed it to serve at the core of the semantic web publishing stack running the BBC’s World Cup web site [8]. Here BigOWLIM handles millions of queries per day in a mission-critical production environment, where the data is updated hundreds of times per hour.

BigOWLIM is also optimized to integrate and reason with linked data. These capabilities are proven in two linked data portals (http://FactForge.net and http://LinkedLifeData.com), which provide public access to billions of linked data statements integrated from tens of datasets.
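
The trade-off behind Misconception 1 in the introduction can be illustrated with a minimal sketch. It is not BigOWLIM code: the toy triples, the single transitivity rule for `rdfs:subClassOf`, and the function names `materialize` and `subclass_at_query_time` are all assumptions made for illustration. Materialization pays the reasoning cost once, by forward chaining to a fixpoint, after which a query is a plain membership test; the query-time variant repeats a search on every call.

```python
SUBCLASS = "rdfs:subClassOf"

# Hypothetical explicit statements, chosen only for illustration.
explicit = {
    ("Striker", SUBCLASS, "Footballer"),
    ("Footballer", SUBCLASS, "Athlete"),
    ("Athlete", SUBCLASS, "Person"),
}


def materialize(triples):
    """Forward-chain one rule (subClassOf transitivity) until a fixpoint."""
    closure = set(triples)
    while True:
        derived = {
            (a, SUBCLASS, c)
            for (a, p1, b) in closure
            for (b2, p2, c) in closure
            if p1 == SUBCLASS and p2 == SUBCLASS and b == b2
        }
        if derived <= closure:  # nothing new was inferred
            return closure
        closure |= derived


def subclass_at_query_time(triples, sub, sup):
    """Answer the same question without materialization, by graph search."""
    frontier, seen = [sub], {sub}
    while frontier:
        node = frontier.pop()
        for (s, p, o) in triples:
            if s == node and p == SUBCLASS:
                if o == sup:
                    return True
                if o not in seen:
                    seen.add(o)
                    frontier.append(o)
    return False


closure = materialize(explicit)
# With materialization the query is a set lookup.
print(("Striker", SUBCLASS, "Person") in closure)             # True
# Without it, every query repeats the traversal.
print(subclass_at_query_time(explicit, "Striker", "Person"))  # True
```

Both calls give the same answer; the difference lies in where the work is done, at load time or on every query.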
3.   INFERENCE CAPABILITIES

The inferencing strategy in OWLIM is one of total materialization (apart from an optimization for owl:sameAs that is not discussed in this paper) based on R-Entailment (as defined by ter Horst [10]), where Datalog-like rules [3] with inequality constraints operate directly on a single ternary relation that represents all triples.

Total materialization involves computing all the entailed statements at load time. While this introduces additional reasoning cost when loading statements into a repository, the desirable consequence is that query evaluation can proceed extremely quickly.

Several standard rule sets are included in all editions of OWLIM. In more or less increasing order of complexity, these are: ‘empty’ (no inference), RDFS [1], OWL-Horst [10], owl-max (RDFS plus most of OWL-Lite) and OWL2-RL [6].

In addition to the standard semantics, user-defined rule sets can be used. In this case the user provides the full pathname to a custom rule file that contains definitions of axiomatic triples, rules and consistency checks.

4.   RETRACTING ASSERTIONS

As mentioned above, OWLIM materializes all inferred statements at load time and whenever new statements are added to the repository. This has the highly desirable advantage that query answering is very fast, because no further inference needs to be done. Updates that simply add new statements are treated in the same way as at load time, i.e. new statements are fed to the inference engine, which applies the inference rules (joining new statements with existing statements) until no new inferences are obtained. Since the semantics (both standard and custom) must be monotonic, insert operations incrementally add to the set of explicit and inferred statements. However, retracting explicit statements that are used to infer other statements is more complicated. In SwiftOWLIM, this is achieved by simply invalidating all inferred statements and re-computing the full closure whenever an update is committed. This has the advantage of simplicity of implementation, but the disadvantage of poor update performance and lack of scalability.

BigOWLIM has a specific optimization for handling delete operations that updates the full closure incrementally. This technique labels the statements to be deleted and then uses forward chaining to identify those statements that can be inferred from them, followed by backward chaining to identify those inferred statements that are still supported by other means.

The result is that delete performance is only slightly worse than the insertion of new statements. This allows the repository to handle rapidly changing data even when answering queries over tens of billions of statements.

5.   TRANSACTION CONTROL

OWLIM supports the ‘read committed’ transaction isolation level. It guarantees that changes do not affect query evaluation until the entire transaction they are part of has been successfully committed. It does not guarantee that the execution of a single transaction is performed against a single state of the data in the repository. Regarding concurrency, multiple update/modification/write transactions can be initiated and stay open simultaneously, i.e. one transaction does not need to be committed in order to allow another transaction to complete. Furthermore, update transactions are processed in sequence and do not block read requests in any way, i.e. hundreds of SPARQL queries can be evaluated in parallel (the processing is properly multi-threaded) while update transactions are being handled on separate threads.

One should note that OWLIM performs materialization, making sure that all the statements which can be inferred from the current state of the repository are indexed and persisted. By the time the commit method completes, all reasoning activities related to changes introduced by the corresponding transaction will already have been performed.

6.   REPLICATION CLUSTER

BigOWLIM can be used in a cluster configuration where replication is used to improve resilience and provide scalable query answering.

The query throughput of the cluster is the sum of the throughputs that the individual instances can handle. In a simple configuration of 3 or 4 worker nodes, hundreds of thousands of query requests can be answered per hour while at the same time processing thousands of updates per hour, with non-trivial inference.

In a cluster configuration, there are two types of nodes: Masters and Workers. Masters act as the
gateway to the cluster and all read/write requests go through these nodes. A cluster can have more than one master node, but only one is allowed to operate in read/write mode. The other master nodes operate in read-only mode, otherwise known as ‘hot-standby’. They can be used for marshalling read requests and can take over handling updates if the current read/write master fails. Worker nodes are standard BigOWLIM instances exposed by the Sesame HTTP server, a servlet running in Tomcat or similar. Read and write requests are passed to the workers from the master nodes. This simple arrangement allows for a great deal of flexibility in the design of a cluster topology. The example given in Fig. 1 has two master nodes and three worker nodes. At any moment in time, clients of the cluster can send read requests (queries) to either master node, but updates can only be handled by the master in read/write mode. If this master node should fail, the hot-standby master can be brought into read/write mode and from then on will handle both read requests and updates, as well as taking over responsibility for ensuring the synchronization of all the worker nodes.

Each master node implements a JMX MBean [4] that is accessible using standard Java instrumentation tools, such as JConsole, and can be used to monitor and control the cluster while it is running. Typical activities supported include monitoring the health of each node, gathering statistics, and adding and removing worker nodes.

[Figure: a Read/Write Master (dispatches queries and updates to workers) and a Hot standby Master (dispatches queries to workers, read only), both in front of Worker 1, Worker 2 and Worker 3, which are standard BigOWLIM instances.]
Fig. 1 A typical replication cluster configuration

During normal operation, a master node keeps track of the size of each worker’s read request queue, so that each read request is sent to the worker with the shortest read queue. Update requests are handled differently. First, the update is tested against a single worker node. If the update is successful and the subsequent consistency checks pass, then the update request is considered ‘safe’ and is passed to the rest of the worker nodes. Master nodes take additional care to ensure that the states of all worker nodes are properly synchronized; if an anomaly is detected, the problem worker node is released from the cluster. The monitor and control JMX interface can be used to return worker nodes to the cluster and initiate their synchronization.

In the event of a failure of a worker node, the performance degradation is graceful with respect to the number of healthy workers. The cluster can remain operational with just a single worker node.

7.   CONCLUSION

The emerging Web of data has provided new challenges for software components that must expose this data and enable its widespread consumption. The OWLIM family of semantic repositories is ideally suited to this task due to its ability to store, reason with and answer queries over the massive datasets involved.

OWLIM’s development over the last 6 years was driven by pragmatic design decisions aimed at meeting the requirements of a range of real-world applications, which use it for data integration, metadata management and multi-paradigm information retrieval techniques that combine structured queries and reasoning on the one hand with full-text search and co-occurrence analysis on the other. This brought OWLIM to the level of maturity and comprehensiveness that allowed it to serve as the back end for such a high-profile application as the BBC’s World Cup 2010 web site. This use case demonstrated the viability of several design decisions:

  • Distributed configuration, based on data replication, is ideal for applications where resilience and horizontal scalability with respect to query loads are key; in such environments data partitioning is inefficient and inappropriate;
  • Reasoning based on forward chaining and materialization provides very good overall performance. When paired with intelligent
    retraction techniques, it can cope with large numbers of updates, while simultaneously dealing with heavy query loads.

OWLIM continues to evolve, with various new features planned for the near future. The next release of OWLIM will include enhanced support for geo-spatial data and some of the widely accepted geo-spatial vocabularies. Specialized indices will be used to access spatial data and a range of SPARQL extension functions will allow for expressive queries using 2D and 3D geometry.

The next release will also include interfaces that support the Jena RDF framework, enabling OWLIM to be used with both Sesame and Jena, the two most widely used RDF frameworks.

The current set of advanced features and world-leading performance have helped to position OWLIM as the semantic repository of choice for all environments that manage RDF data, particularly for Web-scale applications. The future evolution of OWLIM towards better compatibility and even more powerful data management features will ensure the continued uptake of this technology.

8.   REFERENCES

1.  Brickley, D.; Guha, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. W3C (10 Feb 2004). http://www.w3.org/TR/rdf-schema
2.  Bizer, Ch.; Schultz, A. BSBM Results for Virtuoso, Jena TDB, BigOWLIM (November 2009). http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V5/index.html
3.  Gallaire, H.; Minker, J. (eds.) Logic and Data Bases, Symposium on Logic and Data Bases, Centre d'études et de recherches de Toulouse, 1977. Advances in Data Base Theory, Plenum Press, New York, 1978.
4.  Java Management Extensions (JMX), homepage: http://download-llnw.oracle.com/javase/1.5.0/docs/guide/jmx/
5.  Klyne, G.; Carroll, J. J. (eds.) (2004). Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation 10 Feb. 2004. http://www.w3.org/TR/rdf-concepts/
6.  Motik, B.; Cuenca Grau, B.; Horrocks, I.; Wu, Z.; Fokoue, A.; Lutz, C. (eds.) (2009). OWL 2 Web Ontology Language Profiles. W3C Candidate Recommendation 11 June 2009. http://www.w3.org/TR/owl2-profiles/
7.  O'Donovan, J. The World Cup and a call to action around Linked Data. BBC blog post. http://www.bbc.co.uk/blogs/bbcinternet/2010/07/the_world_cup_and_a_call_to_ac.html
8.  Rayfield, J. BBC World Cup 2010 dynamic semantic publishing. BBC blog post. http://www.bbc.co.uk/blogs/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html
9.  Stoilos, G.; Grau, B. C.; Horrocks, I. How Incomplete is your Semantic Web Reasoner? In Proc. of the 20th Nat. Conf. on Artificial Intelligence (AAAI 10), 2010.
10. ter Horst, H. J. Combining RDF and Part of OWL with Rules: Semantics, Decidability, Complexity. In Proc. of ISWC 2005, Galway, Ireland, November 2005, pp. 668-684.
11. Thakker, D.; Osman, T.; Gohil, S.; Lakin, P. A Pragmatic Approach to Semantic Repositories Benchmarking. In Proc. of the 7th Extended Semantic Web Conference, ESWC 2010.