CycQL: A SPARQL Adapter for OpenCyc

Steve Battle
Sysemia Ltd, Bristol & Bath Science Park, Dirac Crescent, Emerson’s Green, Bristol BS16 7FR, UK
steve.battle@sysemia.co.uk
WWW home page: http://www.sysemia.co.uk

Abstract. CycQL is an Apache Jena/ARQ based SPARQL adapter for OpenCyc 4.0. It enables the Cyc inference engine to be used with Semantic Web tools via a SPARQL endpoint. Cyc achieves scalability and optimizes inference by restricting the search space to a relevant subset of microtheories. With this adapter, Cyc microtheories are identified with RDF named graphs. This paper demonstrates how greater efficiency is achieved by maximizing the size of the SPARQL algebra fragments that are translated into CycL.

Keywords: Cyc, OpenCyc, CycL, Jena, RDF, SPARQL

1 Introduction

Cyc [1] was one of the first Artificial Intelligence systems to develop an ontological approach to organizing knowledge. The modularity this affords enables Cyc to reason scalably in highly complex domains. OpenCyc 4.0, released in June 2012, includes the full Cyc ontology. However, Cyc is often overlooked as a reasoner on the web because it lacks integration points with other semantic technologies. The full OpenCyc ontology is available as a downloadable OWL (Web Ontology Language) file, and individual concepts may be downloaded in RDF from the OpenCyc website. In addition, UMBEL [2] identifies a subset of OpenCyc which can be used as an upper ontology for the purpose of ontology alignment.

This paper describes the development of a SPARQL adapter for OpenCyc, known as CycQL [3], which enables a Cyc instance to support a SPARQL endpoint. The ARQ SPARQL evaluator is an extensible framework for implementing SPARQL adapters. ARQ parses the query into SPARQL algebra, a tree structure representing query operators and triple or quad patterns. Query evaluation walks this tree and combines the resulting bindings to produce the final result set.

The CycQL adapter comprises a number of components.
Firstly, a wrapper for the CycAccess object implements the Jena/ARQ DatasetGraph, providing an RDF graph representation of the Cyc knowledge base. Secondly, each step of query evaluation is routed through a custom OpExecutor that determines whether a given operator in the SPARQL abstract syntax tree can be compiled directly into CycL, or should be executed as usual by ARQ. Thirdly, SPARQL operators and patterns to be compiled into CycL are routed by the OpExecutor to a custom StageGenerator that generates bindings for a given stage in the SPARQL algebra. This paper explores the compilation of ever larger stages of the query to create a more efficient adapter.

2 Mapping CycL to RDF

CycL is a predicate-calculus-like language that allows us to assert facts about the world, and to express queries over those facts. A CycL expression is a bracketed list with the predicate at the head of that list. It may contain constants, prefixed with #$. For example, the fact that Pluto orbits the Sun could be stated as follows.

    (#$orbits #$PlanetPluto #$TheSun)

Translating this binary expression into RDF (Resource Description Framework) [4] involves transposing the order of the terms to give the typical subject, predicate, object arrangement. It also involves assigning URIs to the Cyc constants. In this work, the CycAccess object provided by OpenCyc is wrapped as an Apache Jena/ARQ DatasetGraph supporting a graph representation of an RDF Dataset. A URI base can be set on this wrapper, which is simply prepended to the constant name to generate a URI. For example, the base may be set to the OpenCyc namespace, which may be used for English language versions of concepts. Alternatively, the OpenCyc corpus may supply an rdfURI property, and in these cases the supplied URI is used in preference to the generated URI. The above assertion may be written (in RDF Turtle) as follows, with all URIs relativized to the document base.

    @base .
    <PlanetPluto> <orbits> <TheSun> .
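The transposition and URI generation described above can be sketched in a few lines of Python. This is only an illustration of the mapping rule, not the adapter's Java implementation: the base URI, the function names and the regular expression are all assumptions introduced here, and the rdfURI override is not modelled.

```python
import re

# Hypothetical base URI for illustration only; the adapter's actual base is configurable.
BASE = "http://example.org/cyc/"

# Matches exactly three atomic terms inside one pair of parentheses.
_BINARY = re.compile(r"\(\s*([^\s()]+)\s+([^\s()]+)\s+([^\s()]+)\s*\)")


def constant_to_uri(term, base=BASE):
    """Strip the '#$' prefix from a CycL constant and prepend the base URI."""
    if not term.startswith("#$"):
        raise ValueError(f"not a CycL constant: {term!r}")
    return base + term[2:]


def binary_assertion_to_triple(cycl, base=BASE):
    """Map '(#$pred #$arg1 #$arg2)' to a (subject, predicate, object) tuple of URIs."""
    match = _BINARY.fullmatch(cycl.strip())
    if match is None:
        raise ValueError(f"not a flat binary assertion: {cycl!r}")
    predicate, subject, obj = match.groups()
    # CycL puts the predicate first; RDF puts it between subject and object.
    return tuple(constant_to_uri(t, base) for t in (subject, predicate, obj))


if __name__ == "__main__":
    triple = binary_assertion_to_triple("(#$orbits #$PlanetPluto #$TheSun)")
    print("<%s> <%s> <%s> ." % triple)
```

The sketch deliberately rejects anything other than a flat binary assertion over constants, so nested terms raise an error rather than being silently mistranslated.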
Type information is asserted in CycL with the #$isa predicate, which maps directly to rdf:type. Following Pluto’s demotion to a dwarf planet in 2006, we may make the following assertion.

    (#$isa #$PlanetPluto #$DwarfPlanet)

Using the Turtle shorthand for rdf:type, the above may be written as follows, again with individuals expressed as document-relative URIs (assuming that the @base is set as above).

    <PlanetPluto> a <DwarfPlanet> .

The constant #$PlanetPluto is described as atomic, as it has no internal structure, only properties that are extrinsic to that atomic concept. Non-Atomic Terms (NATs) are expressed as lists and do not have a truth value as such, but can be used within expressions. They may be used to express functional relationships between objects. Non-Atomic Terms can also be used to express qualifying information about a value, such as its type. For example, the orbital period of Pluto can be expressed as a number qualified by the units it is measured in.

    (#$orbitalPeriod #$PlanetPluto (#$DaysDuration 90739))

Such unary, un-nested NATs map nicely to RDF custom datatypes. The above may be expressed in Turtle as follows.

    <PlanetPluto> <orbitalPeriod> "90739"^^<DaysDuration> .

CycL is able to form complex logical expressions using the truth functions (#$and, #$or, #$not, #$implies). Pluto has five known moons, and if we are to believe recent online polls, the recently discovered moons, P4 (2011) and P5 (2012), are to be named ‘Vulcan’ and ‘Cerberus’. This knowledge can be asserted in CycL as a logical conjunction.

    (#$and
      (#$orbits #$Charon-MoonOfPluto #$PlanetPluto)
      (#$orbits #$Nix-MoonOfPluto #$PlanetPluto)
      (#$orbits #$Hydra-MoonOfPluto #$PlanetPluto)
      (#$orbits #$Vulcan-MoonOfPluto #$PlanetPluto)
      (#$orbits #$Cerberus-MoonOfPluto #$PlanetPluto))

In RDF it is straightforward to represent a conjunction of triples such as these, but we cannot directly represent disjunction, negation, or implication. This highlights the utility of using highly expressive languages such as CycL alongside RDF.
We will see later how this expressiveness enables us to define CycL rules that may be used in conjunction with a SPARQL query.

2.1 Limitations of the mapping

The mapping currently supports Non-Atomic Terms that are functions of literal values. It does not currently support NATs that are functions of a constant. In such cases the function name typically ends with ‘Fn’. For example, we could talk about the moons of Pluto using the following CycL NAT.

    (#$MoonFn #$PlanetPluto)

This doesn’t assert anything by itself but could be used within an assertion such as the following.

    (#$isa #$Charon-MoonOfPluto (#$MoonFn #$PlanetPluto))

As long as the function is unary, it can be expressed in RDF as a pair of triples, where moonofpluto is a so-called blank node, the object of a functional relationship MoonFn. The result is known as a Non-Atomic Reified Term and may be represented in RDF as follows.

    <PlanetPluto> <MoonFn> _:moonofpluto .
    <Charon-MoonOfPluto> a _:moonofpluto .

One of the advantages of CycL is that it can express n-ary relationships. Rather than attempting complex mappings of n-ary (for n > 2) expressions into RDF, the adapter leaves such expressions invisible to the RDF mapping. It is assumed that Cyc inferencing may be used to unpack any such expression to generate the equivalent form in terms of binary predicates, if required. This can be done on demand, effectively defining a magic (or computed) property. An example of this, explained later on, is the orbitalRadius property of a planet, which is computed on demand but can be treated as though it were asserted as a simple relationship.

    "90739"^^ .

3 Mapping SPARQL to CycL queries

SPARQL (SPARQL Protocol and RDF Query Language) [5] is a widely used query language for RDF datasets. The following SPARQL query selects dwarf planets orbiting the Sun together with their moons, if any. The results are shown below.
    PREFIX : <>
    SELECT ?planet ?moon
    WHERE {
      ?planet a :DwarfPlanet ;
              :orbits :TheSun
      OPTIONAL { ?moon :orbits ?planet }
    }

    ----------------------------------------
    | planet       | moon                  |
    ========================================
    | :PlanetPluto | :Charon-MoonOfPluto   |
    | :PlanetPluto | :Nix-MoonOfPluto      |
    | :PlanetPluto | :Hydra-MoonOfPluto    |
    | :PlanetPluto | :Vulcan-MoonOfPluto   |
    | :PlanetPluto | :Cerberus-MoonOfPluto |
    ----------------------------------------

The query above contains a number of triple patterns that can be translated into CycL in the same way as the basic axioms of our planetary system. As in SPARQL, CycL represents variables by prefixing them with a ‘?’. The three triple patterns that appear above may therefore be individually translated into CycL as follows.

    (#$isa ?PLANET #$DwarfPlanet)
    (#$orbits ?PLANET #$TheSun)
    (#$orbits ?MOON ?PLANET)

3.1 Microtheories as graphs

The Cyc knowledge base is divided into a number of microtheories, each of which corresponds to a particular domain of knowledge. Cyc’s knowledge of the planets is mostly held in UniverseDataMt. Microtheories are hierarchically arranged so that facts from any one microtheory can be inherited by more specialized microtheories. Microtheories are identified by a constant (microtheory names typically include ‘Mt’), so they can be treated like any other individual and be assigned a URI.

Each access to the Cyc knowledge base must specify a microtheory. SPARQL defines a default graph against which Basic Graph Patterns (BGPs), the sets of triple patterns that appear in a query, are matched. The initial default graph is set in the DatasetGraph, but can be overridden by adding a FROM clause to the query. The query below explicitly defines a microtheory to be used as the new default graph.
    PREFIX : <>
    SELECT ?planet
    FROM :CurrentWorldDataCollectorMt-NonHomocentric
    WHERE {
      ?planet a :Planet ;
              :orbits :TheSun
    }

Alternatively, to allow knowledge to be integrated from multiple sources, graphs may be explicitly named within the body of the query using the GRAPH clause. Knowledge about dwarf planets appears in a more general microtheory than planets and their orbits, so it can be more efficient to target specific parts of the query at specific microtheories, as in the example below.

    PREFIX : <>
    SELECT ?planet
    FROM :CurrentWorldDataCollectorMt-NonHomocentric
    FROM NAMED :UniverseDataMt
    WHERE {
      ?planet a :DwarfPlanet
      GRAPH :UniverseDataMt { ?planet :orbits :TheSun }
    }

3.2 Negation

Negation is not available in the RDF representation, though it is available as a feature of the SPARQL query language. While the query at the beginning of this section selects dwarf planets, this time we wish to filter them out using the SPARQL 1.1 NOT EXISTS clause.

    PREFIX nat:
    PREFIX : <>
    SELECT ?planet ?orbital_period
    FROM :CurrentWorldDataCollectorMt-NonHomocentric
    WHERE {
      ?planet a :Planet ;
              :orbits :TheSun ;
              :orbitalPeriod ?orbital_period
      FILTER (NOT EXISTS { ?planet a :DwarfPlanet })
    }
    ORDER BY nat:Integer(?orbital_period)

In this example, we make use of Johannes Kepler’s observation that the distance of a planet from the Sun increases with its orbital period. More on this later, but for now this knowledge can be used to define a natural ordering over the planets, as seen in the results below.
    -------------------------------------------
    | planet         | orbital_period         |
    ===========================================
    | :PlanetMercury | "88"^^:DaysDuration    |
    | :PlanetVenus   | "225"^^:DaysDuration   |
    | :PlanetEarth   | "365"^^:DaysDuration   |
    | :PlanetMars    | "687"^^:DaysDuration   |
    | :PlanetJupiter | "4329"^^:DaysDuration  |
    | :PlanetSaturn  | "10753"^^:DaysDuration |
    | :PlanetUranus  | "30660"^^:DaysDuration |
    | :PlanetNeptune | "60152"^^:DaysDuration |
    -------------------------------------------

We see the Non-Atomic Term representing the orbital period in the output, but the application of the function nat:Integer has yet to be explained. As these NATs are not numbers, any attempt to ORDER BY the term directly would sort on their lexical value alone. The first character would be the most significant, with the undesirable effect of placing Mercury at the outer edge of the solar system. A custom Java function in the nat namespace casts this lexical value to the indicated type, in this case an (XSD) Integer.

3.3 CycL Compilation levels

Apache Jena/ARQ defines a StageGenerator interface that executes triple patterns, quad patterns and other algebraic operations, and returns a binding iterator. These patterns and operators are compiled into a CycL query to be evaluated by Cyc. There are three natural levels at which one may group content from the original query for translation into CycL. These levels are enumerated below, each building on, and being more efficient than, the previous level.

The negation query of the previous subsection will be used to demonstrate compilation into CycL at each level of grouping. The efficiencies are gained by reducing the depth of the search tree from 4 at the triple-pattern level down to 1 at the operator-composition level.
triple-pattern level. At the simplest level, a query can be decomposed into individual triple patterns. The four steps below represent a single branch of the search tree (of depth 4), with ?PLANET bound in step 1 to #$PlanetEarth.

    1. (#$isa ?PLANET #$Planet)
    2. (#$orbits #$PlanetEarth #$TheSun)
    3. (#$orbitalPeriod #$PlanetEarth ?ORBITAL-PERIOD)
    4. (#$isa #$PlanetEarth #$DwarfPlanet)

Note that, in this case, the evaluation of NOT EXISTS takes place in the SPARQL engine rather than in Cyc. If the final step returns no results, then the NOT EXISTS filter succeeds.

graph-pattern level. SPARQL queries include blocks of triple patterns that are known as Basic Graph Patterns (BGPs) or, where named graphs are introduced, as Named Graph Patterns (NGPs). A pattern containing multiple triples may be translated by the StageGenerator into a single logical conjunction to be evaluated by Cyc in a single step. The depth of the search tree is thereby reduced to 2.

    1. (#$and (#$isa ?PLANET #$Planet)
              (#$orbits ?PLANET #$TheSun)
              (#$orbitalPeriod ?PLANET ?ORBITAL-PERIOD))
    2. (#$isa #$PlanetEarth #$DwarfPlanet)

As above, the evaluation of NOT EXISTS takes place in the SPARQL engine.

operator-composition level. Each step of query evaluation is routed through an OpExecutor that evaluates graph patterns, filters, sequences and joins. The OpExecutor composes the largest units of the SPARQL algebra that make sense as a single CycL query. The composed units are then routed to the StageGenerator to generate result bindings. The depth of this search is just 1.

    1. (#$and (#$isa ?PLANET #$Planet)
              (#$not (#$isa ?PLANET #$DwarfPlanet))
              (#$orbits ?PLANET #$TheSun)
              (#$orbitalPeriod ?PLANET ?ORBITAL-PERIOD))

Note that in this case, Cyc evaluates the negation directly using #$not.

4 Functions

We wish to compute which planets lie in the Circumstellar Habitable Zone (CHZ). This is the famous ‘Goldilocks zone’ containing the Earth and Mars, defined to be between 0.725 and 3 Astronomical Units from the Sun.
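As a rough check of the zone boundaries, the calculation can be done by hand outside Cyc. The sketch below assumes Kepler's third law in the form R = P^(2/3), with P in years and R in Astronomical Units, and takes the orbital periods (in days) from the results listed in Section 3.2. The function and constant names are hypothetical and are not part of CycQL.

```python
# Orbital periods in days, copied from the result table in Section 3.2.
ORBITAL_PERIOD_DAYS = {
    "PlanetMercury": 88,
    "PlanetVenus": 225,
    "PlanetEarth": 365,
    "PlanetMars": 687,
    "PlanetJupiter": 4329,
    "PlanetSaturn": 10753,
    "PlanetUranus": 30660,
    "PlanetNeptune": 60152,
}

DAYS_PER_YEAR = 365   # whole-day year length, an assumption of this sketch
CHZ_INNER_AU = 0.725  # inner edge of the habitable zone, from the text
CHZ_OUTER_AU = 3.0    # outer edge of the habitable zone, from the text


def orbital_radius_au(period_days):
    """Kepler's third law with P in years and R in AU: R = P ** (2/3)."""
    return (period_days / DAYS_PER_YEAR) ** (2 / 3)


def planets_in_habitable_zone(periods):
    """Return (name, radius) pairs strictly inside the zone, nearest first."""
    radii = {name: orbital_radius_au(days) for name, days in periods.items()}
    inside = [(n, r) for n, r in radii.items() if CHZ_INNER_AU < r < CHZ_OUTER_AU]
    return sorted(inside, key=lambda item: item[1])


if __name__ == "__main__":
    for name, radius in planets_in_habitable_zone(ORBITAL_PERIOD_DAYS):
        print(f"{name}: {radius:.3f} AU")
```

With these inputs only Earth and Mars pass the filter; Venus comes out just under the 0.725 AU lower bound, which is why the inner boundary matters.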
Cyc defines a number of functions, which can be expressed as CycL Non-Atomic Terms. These functions are non-reifiable, meaning that they do not represent individuals in their own right, but are instead intended to be evaluated. We will make use of the mathematical functions #$ExponentFn and #$QuotientFn. In the first instance we show how these functions can be surfaced within an Apache Jena/ARQ LET statement, which binds the variable on the left-hand side to the value of the expression on the right-hand side. The cyc namespace is introduced as a way to identify Cyc functions. Access to Cyc functions is provided by registering them with an ARQ function registry. Given the function URI, the function factory queries Cyc for the arity and return type of the function. In this case the return type is a double, so this value (?AU) may be used directly in the ORDER BY clause.

Kepler’s third law states that “the square of the orbital period of a planet is directly proportional to the cube of the semi-major axis of its orbit,” $P^2 = R^3$, where $P$ is expressed in years and $R$ in Astronomical Units (AUs).

    PREFIX cyc:
    PREFIX nat:
    PREFIX : <>
    SELECT ?planet ?AU
    FROM :UniverseDataMt
    WHERE {
      ?planet a :Planet ;
              :orbits :TheSun ;
              :orbitalPeriod ?orbital_period
      LET (?AU := cyc:ExponentFn(
                    cyc:QuotientFn(nat:Double(?orbital_period), 365),
                    cyc:QuotientFn(2, 3)))
      FILTER (0.725 < ?AU && ?AU < 3.0)
    }
    ORDER BY ?AU

This query is translated into the CycL below, with only the remaining ORDER BY clause being evaluated by the SPARQL adapter. Note the introduction of the equality to extract the lexical value (?VAR1) of the orbital period NAT (??VAR0 binds to the name of the NAT). The results follow.
    (#$and
      (#$isa ?PLANET #$Planet)
      (#$orbits ?PLANET #$TheSun)
      (#$orbitalPeriod ?PLANET ?ORBITAL-PERIOD)
      (#$equals ?ORBITAL-PERIOD (??VAR0 ?VAR1))
      (#$evaluate ?AU
        (#$ExponentFn (#$QuotientFn ?VAR1 365) (#$QuotientFn 2 3)))
      (#$lessThan 0.725 ?AU)
      (#$lessThan ?AU 3.0))

    ---------------------------------------------------
    | planet       | AU                               |
    ===================================================
    | :PlanetEarth | "1.0"^^xsd:double                |
    | :PlanetMars  | "1.5244361831950344"^^xsd:double |
    ---------------------------------------------------

5 Rules

Although the approach described in the previous section returns the correct results, it complicates matters by including Kepler’s law within the body of the query itself. In this section we develop a CycL rule to perform a simple backward-chaining inference triggered by a SPARQL query. Cyc may, of course, perform other kinds of inference such as forward chaining, triggered opportunistically by the initial assertion of the axioms. At this point it may, for example, perform type closure over #$isa relationships.

The aim is to use Cyc rules alongside SPARQL. These backward-chaining rules are invoked only at the time the query is made. Nor are they derived from the query; they are native Cyc rules. While various schemes exist to implement rules in SPARQL, they tend to be forward-chaining engines. One example of this is the use of chained SPARQL CONSTRUCT queries as found in the SPARQL Inferencing Notation (SPIN) [6].

We define Kepler’s law using a CycL rule as follows. The orbitalRadius is inferred from the orbital period using Kepler’s third law.

    (#$implies
      (#$and
        (#$orbitalPeriod ?BODY (#$DaysDuration ?PERIOD))
        (#$evaluate ?RADIUS
          (#$ExponentFn (#$QuotientFn ?PERIOD 365) (#$QuotientFn 2 3))))
      (#$orbitalRadius ?BODY (#$AstronomicalUnits ?RADIUS)))

The SPARQL query is simplified using the magic property, #$orbitalRadius.
    PREFIX nat:
    PREFIX : <>
    SELECT ?planet ?orbital_radius
    FROM :UniverseDataMt
    WHERE {
      ?planet a :Planet ;
              :orbits :TheSun ;
              :orbitalRadius ?orbital_radius
      FILTER (0.725 < nat:Double(?orbital_radius) &&
              nat:Double(?orbital_radius) < 3.0)
    }
    ORDER BY nat:Double(?orbital_radius)

The query above is translated into the CycL below using operator composition to combine the Basic Graph Pattern and the FILTER clause. When Cyc evaluates the goal with the predicate #$orbitalRadius, it triggers the rule above to compute the value.

    (#$and
      (#$isa ?PLANET #$Planet)
      (#$orbits ?PLANET #$TheSun)
      (#$orbitalRadius ?PLANET ?ORBITAL-RADIUS)
      (#$equals ?ORBITAL-RADIUS (??VAR0 ?VAR1))
      (#$lessThan 0.725 ?VAR1)
      (#$lessThan ?VAR1 3))

6 Conclusion and next steps

CycQL is a useful addition to the toolbox available to users of OpenCyc. As a proof of concept, it demonstrates the efficiencies to be gained by compiling as much of the query as possible into CycL before handing off to Cyc. The next steps include refining the RDF mapping to support NARTs (Non-Atomic Reified Terms), and providing more complete coverage of SPARQL. This will include the addition of a describe-handler supporting SPARQL DESCRIBE queries, and support for multiple FROM clauses in the SPARQL query.

References

1. Lenat, D.B., Guha, R.V.: Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project (1989)
2. Bergman, M.K.: UMBEL: A Subject Concepts Reference Layer for the Web.
3. CycQL SPARQL adapter for OpenCyc.
4. Klyne, G., Carroll, J.J.: Resource Description Framework (RDF): Concepts and Abstract Syntax, W3C Recommendation, 10 February 2004.
5. Harris, S., Seaborne, A.: SPARQL 1.1 Query Language, W3C Proposed Recommendation, 08 November 2012.
6. Knublauch, H., Hendler, J., Idehen, K.: SPIN - Overview and Motivation, W3C Member Submission, 22 February 2011.