Enhanced Rule-based OWL Reasoning on Spark Zhihui Liu1,3 , Weimin Ge1,3 , Xiaowang Zhang1,3,∗ , and Zhiyong Feng1,2 1 School of Computer Science and Technology, Tianjin University, Tianjin, China 2 School of Computer Software, Tianjin University, Tianjin, China 3 Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China ∗Corresponding author: xiaowangzhang@tju.edu.cn Abstract. The rule-based OWL reasoning is to compute the deductive closure of an ontology by applying RDF/RDFS and OWL entailment rules. In previous work, we present an approach to enhancing the per- formance of the rule-based OWL reasoning on Spark based on a locally optimal executable strategy. However, some key optimizations that based on LUBM do not generalize to more diverse datasets. In this paper, we analyze these problems, and demonstrate the inference engine. We have evaluated the approach using the real-world datasets. The experimen- tal results show that our approach also achieve better performance as compared to Kim & Park’s algorithm (2015). 1 Introduction The Web Ontology Language [1] (OWL) is a semantic web language designed to represent rich and complex knowledge about things, groups of things, and relations between things. It provides the means for ontology to verify the con- sistency and to make implicit knowledge explicit. In [5] we already presented an method to enhancing the performance of the rule-based OWL reasoning on Spark. In this paper, we extend our approach to deal with more diverse datasets and demonstrate the high efficiency. The major contributions and novelties of our work are summarized as follows: the OWL-Horst rules are classified into four classes and we provide the optimal executable strategies for each class. The rest of this paper is organized as follows. Section 2 presents our locally optimal strategies. Section 3 implements our proposed strategy on Spark and Section 4 evaluates our method on the standard dataset. In Section 5, we summarize this paper. 2 Locally optimal strategy OWL-Horst [2] has a powerful expression and reasoning ability. There exist- s interdependency among the rules. If the input to a rule Ri depends on the output of another rule Rj , then the rule Ri is dependent on Rj . The rule Rj should be executed before the rule Ri . A rule dependency graph can be con- structed based on the interdependency among the rules. There are millions of 2 Z. Liu, W. Ge, X. Zhang, & Z. Feng executable strategies among the rules that adjacent rules satisfy dependency. And the longest strategies contain 22 rules that cannot fully execute all rules. The number is so huge that it is challenge to test every strategy. The OWL reasoning has close relationship with the characteristic of datasets. Therefore, we prefer to make a analysis of dataset. LUBM [3] is a widely used standard benchmark for evaluating the perfor- mance of ontology reasoning. We divide the dataset into three classes as follows: – Triples whose predicate is rdf:type. We call this class as type. – Triples whose predicate is owl:sameAs. We call this class as sameAs . – The remainder is classified as a class. We call this class as SPO. The data generator (UBA) uses specific rules to generate data so that all datasets generated by it have same data characteristic. Here we use the LUBM-50 as the sample and analyze the proportion of each class. The result is listed in Table 1. Dataset type sameAs SPO LUBM-50 20.055% 0 79.945% Table 1: The proportion of each type in LUBM-50 From the Table 1, we can see that the SPO triples account for absolute proportion, about 80% of the total. The type triples are in the second place, but the number of sameAs triples is zero. Therefore, the OWL reasoning should focus on the reasoning of SPO triples and type triples. Based on the statistics of dataset, we divide the OWL-Horst rules[2] into four classes as follow: – Rules whose condition or consequences have triples whose predicates are rdf:type, including R4, R5, R6, O13, O14, O15, O16. The dependency graph of type rules is shown in Figure 2. – Rules whose condition or consequences have triples whose predicates are owl:sameAs, including O1, O2, O5, O6, O8, O9, O10. The dependency graph of sameAs rules is shown in Figure 3. – Rules whose condition or consequences have triples of SPO class, including R3, O3, O4, O7a, O7b. The dependency graph of SPO rules is shown in Figure 1, which the rule O7a and O7b are classified as O7. – The remainder is classified as a class, including R1, R2, O11a, O11b, O11c, O12a, O12b, O12c. The dependency graph of schema rules is shown in Figure 4. O7 O15 O16 R3 R4 R5 O13 O3 O4 R6 O14 Fig. 1: SPO rules dependency graph Fig. 2: type rules dependency graph O5 O6 O11a O11b R1 O12a O12b R2 O1 O2 O10 O11c O12c Fig. 3: sameAs rules dependency graph Fig. 4: schema rules dependency graph Enhanced Rule-based OWL Reasoning on Spark 3 For each class, based on the dependency of rules, we use the depth-first-search (DFS) algorithm to find all possible executable orders among the rules. Through the experiments, there is a large number of orders for each class, but we choose the longest orders that contain rules as many as possible. We pick out the orders that spend the shortest time to infer same dataset as the optimal executable orders. The optimal executable orders for each class are listed in Table 2. SPO rules type rules R4 → R6 → O14 → O13 → O15 → O16 O3 → R3 → O7 → O4 R4 → R6 → O14 → O13 → O16 → O15 O7 → R3 → O3 → O4 R5 → R6 → O14 → O13 → O15 → O16 R5 → R6 → O14 → O13 → O16 → O15 schema rules sameAs rules O11a, O11b → R1 → O11c O1 → O10 → O2 → O6 → O5 O12a, O12b → R2 → O12c O2 → O10 → O1 → O6 → O5 Table 2: The optimal strategies of each class 3 Distributed rule-based OWL reasoning on Spark In this section, we design the overall strategy of the OWL reasoning. The rea- soning of instance triples will use the output of schema triples, so the schema reasoning should be executed firstly. There are several orders for each class of instance triples. We can choose any one order and then combine an new exe- cutable strategy. In this paper, we select the first order for each class to perform reasoning. The general workflow of the parallel OWL reasoning is described in figure 5. schema rules order SPO rules order type rules order sameAs rules order Fig. 5: The reasoning strategy 4 EVALUATION In this section, we conduct a series of experiments to compare the performance of the proposed approach with the KP [4] under the same environment. We set up a cluster with one master and four worker nodes. Each node has 48 Xeon E5 4607 2.20GHz processors, 64GB memory and 10TB 7200 RPM SATA hard disk. The nodes are connected with Gigabit Ethernet. All the nodes run on a 64-bit Ubuntu 12.04 LTS operating system. The version of the Spark is 1.0.2. And the corresponding Hadoop v2.2 with Java 1.7 is installed on this cluster. We use both the synthetic and real-world benchmarks. DBpedia is a widely used standard benchmark for semantic programs. We used four sets of DBpedia data: DBpedia-10(10 million triples), DBpedia-15(15 million triples), DBpedia-20(20 million triples), DBpedia-25(25 million triples) 4 Z. Liu, W. Ge, X. Zhang, & Z. Feng in our experiments. All experiments run three times and the average value is listed as follows. Experimental results are shown in figure 6. We can see that the running time increases almost linearly with the growth of data size, our approach performs better than KP on DBpedia, which is improved by about 30%. 5,000 Our Approach KP reasoning time/sec 4,000 3,000 2,000 1,000 10 15 20 25 DBpedia /millions Fig. 6: The running time of our approach and KP 5 Conclusions In this paper, we have shown an approach to enhancing the performance of the rule-based OWL reasoning on Spark and demonstrated inference over both stan- dard and real-world dataset. In terms of reasoning time, Our method performs much better than KP in the reasoning strategy. 6 Acknowledgments This work is supported by the program of the National Natural Science Foun- dation of China (NSFC) under grant number 61502336. Xiaowang Zhang is sup- ported by Tianjin Thousand Young Talents Program. References 1. L. McGuinness D., van Harmelen F. (2004) Web Ontology Language. W3C Rec- ommendation. 2. Liu C., Qi G., Wang H. & Yu Y. (2011) Large scale fuzzy pD* reasoning using MapReduce. In: Proc. of ISWC 2011, Springer, pp. 405–420. 3. Guo Y., Pan Z., & Heflin J.(2005) LUBM: A benchmark for OWL knowledge base systems. J. Web Sem., 3(2-3):158–182. 4. Kim J. & Park Y. (2015) Scalable OWL-Horst ontology reasoning using Spark. In: Proc. of BigComp 2015, pp. 79–86. 5. Liu Z., Feng Z., Zhang X., Wang X., & Rao G.(2016) RORS: Enhanced Rule-based OWL Reasoning on Spark In: Proc. of APWeb 2016, pp. 1–4.