Scalable Reasoning Infrastructure for Large Scale Knowledge Bases

Hima Karanam, Sumit Neelam, Udit Sharma, Sumit Bhatia, Srikanta Bedathur, L. Venkata Subramaniam, Maria Chang, Achille Fokoue-Nkoutche, Spyros Kotoulas, Bassem Makni, Mariano Rodriguez Muro, Ryan Musa, Michael Witbrock

IBM Research AI
hkaranam@in.ibm.com

1 Introduction

Knowledge Bases (KBs), supported by domain and common-sense ontologies, are critical in various AI-centric applications, and are beginning to be integrated as knowledge sources in primarily neural AI research. As information extraction technologies mature, we are seeing the growth of automatically extracted knowledge graphs (KGs) reaching millions of entities and billions of relations between them. Such KGs are often linked to upper-level ontologies and additional resources that make it possible to infer sophisticated new, hitherto unknown knowledge from them.

Although full-scale First-Order Logic (FOL) reasoning is not widely supported by reasoners, standards such as OWL 2 are increasingly employed to design enterprise systems that compute limited but useful entailments over sophisticated, large ontologies. Moreover, in AI applications, and particularly applications along a path to general intelligence, hard guarantees such as decidability matter less than practical effectiveness in producing correct, useful, explainable inferences; in these cases expressivity, including the ability to express background knowledge that is not readily and practically expressible in FOL, becomes increasingly important. In this context, the commonly used state-of-the-art reasoners that operate over knowledge graphs can be viewed as restricted in three main directions:

Scalability of these reasoners is quite limited; they support relatively small ontologies and KGs. For instance, there are no open-source reasoners that exploit big data frameworks such as Spark to scale the expensive computations common in reasoning tasks, or that use flexible graph data management systems built on horizontally scalable platforms such as JanusGraph.

Expressivity of the languages for which effective inference can be performed is low; the majority of reasoners do not support general rules, let alone spatial and temporal reasoning, contextual reasoning, or modal and second-order reasoning. These features are implicitly ubiquitous in text encoding human knowledge, and are necessary for symbolic and semi-symbolic AI research.

Modelling of complex multi-modal data, including the representation of uncertainty and imprecision (common in automatically extracted KGs), and reasoning over such data is difficult. Further, in applications where KBs are used to aid natural language generation or understanding, we need ways to link entities, relationships, and rules with their cross-modal counterparts (e.g., natural language support sentences, images, or neural network embeddings), or perhaps to use text and images directly in inference with bounded error and suitably expressed, learned inference “rule” functions.

We discuss our ongoing efforts towards building a software toolkit that supports scalable reasoning over large knowledge bases and is targeted at industrial applications. We describe our initial efforts and some preliminary results, and discuss directions for future work.

2 System Architecture

The high-level architecture of the proposed system is shown in Figure 1; we describe its components in detail below.

Fig. 1: System Architecture
Ingestion Pipeline and Storage Layer: We chose JanusGraph (a property graph database) as our default storage technology for the knowledge base and for additional metadata such as context and evidence. Since property graphs can capture context, confidence, and similar attributes directly as node or edge properties, this information coming from the data curation pipelines can be mapped into a single place. During ingestion, user data in various commonly used formats is converted and stored in the back-end store. To ingest OWL/RDF data, we have extended the RDF4J framework with JanusGraph as an additional store by implementing its data store interfaces. This layer takes RDF/OWL data in various forms, maps it to an appropriate property graph model, and creates the required indices.

Query Processing Unit: This component interprets user queries, decides which modules are required to answer them, and interacts with the various reasoners and the back-end store to retrieve the final answers. For example, in the symbolic case, it talks to the symbolic reasoner through the OWL API to obtain TBox inferences used for query expansion or query rewriting. Where more complex reasoning is needed, it sends the required ABox summary information to the reasoner to generate the answer. We plan to implement the ABox summary builder described in the SHER system [2] on top of the JanusGraph back end. We have extended the RDF4J query interfaces for basic RDF querying.

Reasoners: This component is responsible for standing up different reasoners with appropriate APIs to communicate with the query and storage layers. In the current implementation, we are integrating Racer [3] via the OWL API to support various OWL profiles. We will extend this to include a neural reasoner and a default IR/text reasoner that returns a search-based answer when the other reasoners cannot produce a confident one.

3 Graph Database for Reasoning

The choice of storage technology is crucial for an efficient implementation of the underlying reasoning engines. While in-memory reasoners keep the complete ABox in memory, leading to fast query times, systems designed for large ABoxes store data in persistent back ends such as relational databases or triple stores; at query time, the required data is fetched and passed to the reasoner for inference. We adopt a property graph database as our persistent ABox store. Graph databases have become popular due to their flexible schema and the close correspondence of their storage model to real-world data. They store data as nodes and edges, provide index-free adjacency, and can therefore perform better on linked datasets; compared to relational databases, they are better at join-intensive queries [7]. Further, triple stores, being strongly index-based systems, work well when relationship exploration is shallow, whereas graph databases are good at finding complex relations, and their compact representation helps speed up reasoning tasks by storing inferred information.
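To make the RDF-to-property-graph mapping and the traversal style discussed above concrete, the following minimal sketch shows how a small set of LUBM-style triples could be stored in JanusGraph and then navigated with Gremlin. It is purely illustrative and is not the system's actual ingestion code: the in-memory backend, the vertex and edge labels, and the property keys (iri, ub_advisor, ub_name) are assumptions made for the example.

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class TripleIngestionSketch {
    public static void main(String[] args) throws Exception {
        // In-memory JanusGraph instance, used purely for illustration; the deployed
        // system would open a graph backed by a distributed storage backend instead.
        JanusGraph graph = JanusGraphFactory.open("inmemory");
        GraphTraversalSource g = graph.traversal();

        // Map the (hypothetical) RDF triples
        //   ub:Student42  rdf:type    ub:GraduateStudent .
        //   ub:Prof7      rdf:type    ub:FullProfessor .
        //   ub:Student42  ub:advisor  ub:Prof7 .
        // to two labelled vertices and one labelled edge.
        Vertex student = g.addV("GraduateStudent")
                          .property("iri", "http://example.org/ub#Student42").next();
        Vertex prof = g.addV("FullProfessor")
                       .property("iri", "http://example.org/ub#Prof7").next();
        student.addEdge("ub_advisor", prof);           // object property  -> edge
        prof.property("ub_name", "Professor Seven");   // datatype property -> vertex property
        graph.tx().commit();

        // Index-free adjacency: start from the student vertex and walk the advisor edge.
        Object advisorIri = g.V().has("iri", "http://example.org/ub#Student42")
                             .out("ub_advisor").values("iri").next();
        System.out.println(advisorIri);                // http://example.org/ub#Prof7
        graph.close();
    }
}
```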
4 Query Engine

In the current implementation, the query engine is built from three basic modules for handling symbolic reasoning. It takes SPARQL queries and converts them into appropriate Gremlin or reasoning queries to compute the final answer.

Query Expansion: This module rewrites the input query as a union of conjunctive queries in order to find implicit solutions. Query expansion is guaranteed to find all solutions when the underlying knowledge base is expressed in DL-Lite.

SPARQL to Gremlin Translation: The query produced by the query expansion module is transformed into a Gremlin traversal so that it can be executed directly against JanusGraph. Gremlin supports both declarative and imperative queries, whereas SPARQL is purely declarative. Transforming SPARQL queries into declarative-style Gremlin queries offers two advantages: (i) declarative Gremlin queries are formulated with the match() step, which consists of one or more traversal patterns analogous to the triple patterns of a SPARQL query, so SPARQL triples can be transformed directly into Gremlin patterns using a rule-based mapping [6]; and (ii) declarative Gremlin queries use a match algorithm that orders the patterns in the query by their filtering capabilities, thereby optimizing query execution. Graph metadata generated during the ingestion phase is used at translation time to distinguish edge traversals from property traversals; a small translation sketch is given at the end of this section.

Consistency Checker: Any reasoning query over an expressive logic can be reduced to checking the consistency of the underlying knowledge base [5]. The negated assertion of the query is added to the knowledge base before the consistency test is performed. Different variants of tableau algorithms [1] can be used to perform the consistency check, depending on the expressiveness of the underlying knowledge base.
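As a concrete illustration of the SPARQL-to-Gremlin translation step described above, the sketch below pairs a LUBM-style SPARQL query with a declarative Gremlin traversal, roughly in the spirit of the rule-based mapping of [6]. The predicate names, property keys, and the assumption that rdf:type is mapped to the vertex label all follow the toy mapping of the earlier ingestion sketch and are not the system's actual vocabulary.

```java
import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;

public class SparqlToGremlinSketch {

    // SPARQL:
    //   SELECT ?x WHERE { ?x rdf:type ub:GraduateStudent .
    //                     ?x ub:advisor ?y .
    //                     ?y ub:worksFor ?d .
    //                     ?d ub:name "Department0" . }
    //
    // Each triple pattern becomes one traversal pattern inside match(); the match
    // algorithm is then free to reorder the patterns by their filtering power.
    static List<Object> graduateStudentsOfDepartment(GraphTraversalSource g, String deptName) {
        return g.V().match(
                    __.as("x").hasLabel("GraduateStudent"),   // ?x rdf:type ub:GraduateStudent
                    __.as("x").out("ub_advisor").as("y"),     // ?x ub:advisor ?y
                    __.as("y").out("ub_worksFor").as("d"),    // ?y ub:worksFor ?d
                    __.as("d").has("ub_name", deptName))      // ?d ub:name "Department0"
                .select("x").by("iri")                        // project the bound ?x back to its IRI
                .toList();
    }
}
```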
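The refutation idea behind the consistency checker can be sketched with the OWL API as follows: to test whether the KB entails a ground query atom, assert its negation and check whether the augmented KB becomes inconsistent. The sketch uses HermiT purely for illustration (the system integrates Racer [3]), and the ontology file, IRIs, and class names are hypothetical.

```java
import java.io.File;
import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class RefutationCheckSketch {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager mgr = OWLManager.createOWLOntologyManager();
        OWLOntology kb = mgr.loadOntologyFromOntologyDocument(new File("lubm-kb.owl"));
        OWLDataFactory df = mgr.getOWLDataFactory();

        // Ground query atom to test:  FullProfessor(Prof7)
        OWLClass fullProfessor = df.getOWLClass(IRI.create("http://example.org/ub#FullProfessor"));
        OWLNamedIndividual prof7 = df.getOWLNamedIndividual(IRI.create("http://example.org/ub#Prof7"));

        // Add the negated assertion  (not FullProfessor)(Prof7)  to the KB ...
        OWLAxiom negated = df.getOWLClassAssertionAxiom(
                df.getOWLObjectComplementOf(fullProfessor), prof7);
        mgr.addAxiom(kb, negated);

        // ... and test consistency: the KB entails the atom iff the augmented KB is inconsistent.
        OWLReasoner reasoner = new ReasonerFactory().createReasoner(kb);
        boolean entailed = !reasoner.isConsistent();
        System.out.println("KB entails FullProfessor(Prof7): " + entailed);
    }
}
```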
5 Results and Discussions

We report initial results for the RDF-to-property-graph conversion using JanusGraph on the LUBM benchmark dataset (http://swat.cse.lehigh.edu/projects/lubm/) with 100 universities. We compare the query execution times of the Virtuoso system with those of manually and automatically generated Gremlin queries against JanusGraph for the LUBM SPARQL benchmark queries. As Table 1 shows, our proposed architecture achieves better performance for most queries, except where producing the output requires a large number of node scans, because the vertices are physically spread across the disk. For these experiments we used the basic schema transformation described in [4], without paying much attention to locality of reference; we are currently studying more efficient schema transformations that address this issue. In addition, we are exploring how to integrate the standard logical reasoners with approximate reasoners based on neural models, which could act as a fall-back mechanism when the required information is not present in the knowledge base.

Table 1: Query execution times for the 100U-LUBM dataset (in milliseconds) and result set sizes

Query  Virtuoso  JanusGraph (Manual)  JanusGraph (Auto)  Output size
Q1           68                  3.6                  9            4
Q2         1006               220495             434692          264
Q3           55                  1.5                  2            6
Q4           79                  5.3                 26           34
Q5           63                 58.9                 66          719
Q6        25040               568078             522321      1048532
Q7           71                  6.5                  5           67
Q8          262                  822                898         7790
Q9         1958               856103             867483        27247
Q10          61                  1.3                0.9            4
Q11          55                 17.8                 15          224
Q12          51                  2.5                 27           15
Q13          56                 33.8                 38          473
Q14       18585               443954             405736       795970

References

1. Baader, F., Sattler, U.: An overview of tableau algorithms for description logics. Studia Logica 69(1), 5–40 (2001)
2. Dolby, J., Fokoue, A., Kalyanpur, A., Schonberg, E., Srinivas, K.: Scalable highly expressive reasoner (SHER). Web Semantics: Science, Services and Agents on the World Wide Web 7(4), 357–361 (2009)
3. Haarslev, V., Möller, R.: Racer system description. In: International Joint Conference on Automated Reasoning. pp. 701–705. Springer (2001)
4. Hartig, O.: Reconciliation of RDF* and property graphs. arXiv preprint arXiv:1409.3288 (2014)
5. Horrocks, I., Tessaris, S.: Querying the semantic web: a formal approach. In: International Semantic Web Conference. pp. 177–191. Springer (2002)
6. Thakkar, H., Punjani, D., Keswani, Y., Lehmann, J., Auer, S.: A stitch in time saves nine – SPARQL querying of property graphs using Gremlin traversals. arXiv preprint arXiv:1801.02911 (2018)
7. Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y., Wilkins, D.: A comparison of a graph database and a relational database: a data provenance perspective. In: Proceedings of the 48th Annual Southeast Regional Conference. p. 42. ACM (2010)