Parallelization of Query Processing over Expressive Ontologies E. Patrick Shironoshita1 , Da Zhang2 , Mansur R. Kabuka1,2 and Jia Xu2 1 2 INFOTECH Soft, Inc. University of Miami 1201 Brickell Avenue, Suite 220 Coral Gables, Florida 33124, USA Miami, Florida 33131, USA patrick@infotechsoft.com Abstract data sets (Priya et al., 2014). However, even as rea- soners for very expressive DLs have been created, Efficient query answering over Descrip- performing reasoning over very large ABoxes is tion Logic (DL) ontologies with very large still prohibitive (Calvanese et al., 2013; Donini, datasets is becoming increasingly vital. 2003; Glimm et al., 2007). Recent years have seen the development of various approaches to ABox partition- In the last few years, methods for parallel and ing to enable parallel processing. Instance distributed reasoning over expressive ontologies checking using the enhanced most spe- have been published (Xu et al., 2013, 2015b; Priya cific concept (MSC) method is a partic- et al., 2014; Wandelt and Möller, 2012); these ularly promising approach. The applica- methods generally seek to make use of fast syn- bility of these distributed reasoning meth- tactic checks to generate a set of independent par- ods to typical ontologies has been shown titions that can be processed in parallel. In (Xu mainly through anecdotal observation. In et al., 2015a), a method for instance checking this paper, we present an analysis method is proposed, combining the work in (Xu et al., that makes use of random graph theory to 2013, 2015b) with the idea of a most specific con- show that the enhanced MSC method re- cept (MSC) (Nebel, 1990; Donini and Era, 1992; sults in very small, tractable concepts pro- Donini et al., 1994) to move the reasoning task vided that the number of role assertions re- from the very large ABox into a much smaller moved from consideration is large enough. TBox. Empirical evaluation of the method has We also present execution time and ef- shown its ability to perform sound and complete ficiency of a parallel implementation de- instance checking over SHI DLs within reason- ployed over computing clusters of various able time. Moreover, the method is inherently par- sizes, showing the ability of the method to allelizable, lending itself to implementation within process instance checking for large scale clusters of commodity hardware. In this paper, we datasets. examine the enhanced MSC method and its paral- lelization implementation. 1 Introduction Description Logics (DL) are increasingly being 2 The Enhanced Most Specific Concept used to model and represent structured and semi- (MSC) method structured data in different applications (Horrocks, 2.1 Basic MSC Method 2008). A core task for DL systems is to provide an efficient way to answer queries over the exten- A Description Logics (DL) knowledge base, also sional level of the ontology, that is, to compute referred to as an ontology, is typically defined a tu- answers that are not only asserted, but logically ple, denoted K = (T , A), where the terminologi- implied by the ontology (Calvanese et al., 2013). cal component or TBox T contains definitions of Considerable efforts have been dedicated to the concepts and roles, and the assertional component optimization of algorithms for query answering in or ABox A contains assertions about membership recent years (Calvanese et al., 2013; Möller et al., of individuals in concepts and about role relations 2007; Motik et al., 2007). One of the challenges between individuals. The set of roles, concepts, faced in this era of increasing data wealth is to pro- and individual instances in an ontology are de- duce responsive results for queries over very large noted respectively by R, C, and I. The discussion 116 TBox 1. for a given individual a, start with an empty Headmaster v Professor MSCT (A, a); Professor v Person MagicCourse v Course 2. for every concept assertion C(a) , Muggle v ¬Wizard MSCT (A, a) MSCT (A, a) u C; takesCourse.MagicCourse v Wizard 3. for every role assertion R(a, b), 9isHeadOf.School u Person v Headmaster ABox MSCT (A, a) MSCT (A, a) School(hogwarts) u 9R.MSCT (A\{R(a, b)}, b); Professor(albus) 4. for every individual equality assertion a = a0 , Professor(severus) MagicCourse(potions) MSCT (A, a) MSCT (A, a) MagicCourse(transfiguration) u MSCT (A, a0 ). Course(math) Student(harry) So, in the example ABox above, the MSC for Muggle(dudley) severus is isHeadOf(hogwarts, albus) takesCourse(harry, transfiguration) Professor u 9taughtBy .Course (1) taughtBy(transfiguration, albus) taughtBy(potions, severus) It is important to note that while class asser- tions generate relatively simple concepts, role as- Table 1: Example ABox sertions are capable of generating very complex concepts. Suppose for example that ABox An in this paper assumes that the reader is familiar consists of role assertions R0 (a0 , a1 ), R1 (a1 , a2 ), with DL concepts and notations; refer to (Baader . . . , Rn (an , an+1 ); then: et al., 2003) for details. For the discussion below, MSCT (An , a0 ) = we will use the example illustrated in Table 1. 9R1 .(9R2 .(. . . (9Rn .MSCT (an+1 )) . . . )) (2) Definition 1 (Most Specific Concept) (Nebel, 1990; Donini and Era, 1992) Let K = {T , A} Thus, in the example ABox of Table 1, the MSC be an ontology, and a be an individual in for harry is I. The most specific concept for a w.r.t. Student u 9takesCourse . A, written MSCT (A, a), is a concept such that for every concept D where K |= D(a), (Course u 9taughtBy. T |= MSCT (A.a) v D. (Professor u isHeadOf .School)) (3) If MSCT (A, a) can be derived, then, to test Two main issues preclude this basic MSC whether K |= Q(a) holds for an arbitrary con- method from proving fully useful with expressive cept Q it suffices to test if (T [ {Q}) |= ontologies. First, if assertion cycles are present in MSCT [{Q} (A, a) v Q. We call the concept Q the ABox, the method does not terminate. For ex- the query. Note that Q needs to be inserted into the ample, consider what would happen if TBox in simple form, which may require in turn the insertion of additional named concepts as well isProtegeOf(albus, harry) (4) as general concept inclusion (GCI) axioms to ex- press the necessary equivalences for these inserted is inserted into the ABox, then a cycle forms concepts. In the remainder of this paper, we will among the individuals harry, transfiguration, assume that the query Q has been inserted into the and albus. TBox so that MSCT (A, a) denotes the MSC cal- Second, unless the ABox is highly discon- culated considering Q. nected, the method has the potential to generate Computation of the MSC for a given individ- very large concepts, of size in the same order of ual a as defined above can be performed using a the ABox itself. Consider what happens if the fol- rolling up procedure adapted from the one first in- lowing assertion is added to the ABox: troduced in (Horrocks and Tessaris, 2000). Definition 2 (Basic MSC Rollup Procedure) takesCourse(harry, potions) (5) Provided that the ABox does not contain assertion In this case, all individuals become connected cycles, computation of the MSC can be performed with each other, and the MSC of every individual recursively as follows: ends up being the same size of the ABox. 117 Xu et.al. (2015a) define a set of improvements Use of the MSC method to perform instance re- that result in the Enhanced MSC Method. This en- trieval is straightforward. Suppose it is desired to hancement consists of two parts: a mechanism to query an ABox to retrieve all instances of a con- address assertion cycles, and a set of syntactic con- cept Q. It suffices to generate MSCT (A, a) for ev- ditions to reduce the size of the MSCs in practical ery individual a in the ABox, and then accept ev- ontologies. ery individual where (T [ Q) |= MSCT (A, a) v Q. Note that the query concept Q is added to the 2.2 MSC Computation with Assertion Cycles TBox being verified. To address assertion cycles, Xu et.al. (2015a) use nominals to indicate the joint node of a cycle, as 2.3 Syntactic Conditions previously suggested in (Donini and Era, 1992; In (Xu et al., 2015a), syntactic conditions verifi- Schaerf, 1994). able in polynomial time or better were defined, to Definition 3 (MSC Rollup of Assertion Cycles) enable reduction on the size of the MSC and thus When a cycle is found starting and ending at in- permit TBox reasoning in tractable time. These dividual ac , the individual is represented by conditions are based on the following: its corresponding nominal class {ac }, and the Lemma 1 Given two individuals a and b and a conversion of role assertions within the cycle role R, a role assertion R(a, b) influences the clas- requires modification of the rollup method as sification of individual a into concept A if and only follows: If ac is an individual where a cycle is if found, select a direction to go through the cycle K |= 9R.B u A0 v A (8) and: where K |= B(b) and where A0 6v A summarizes • for R(ac , x), x 6= ac information about a not contained in A. MSCT (A, ac ) MSCT (A, ac )u ({ac } u 9R.MSCT (A\{R(ac , x)}, x)) Proposition 1 Let K = (T , A) be a SHI on- • for R(y, ac ), y 6= ac tology containing named concept A, concepts A0 MSCT (A, y) MSCT (A, y) u 9R.{ac } and B, and role R. If equation (8) holds, there • for R(ac , ac ), must exist some GCIs in T of the form MSCT (A, ac ) MSCT (A, ac ) u {ac } u9R.{ac } 9R0 .C1 ./ C2 v C3 (9) • for any other R(x, y) in the cycle, for x 6= ac where R v R0 , ./ is a placeholder for either u or and y 6= ac , t, and Ci ’s are concepts. MSCT (A, x) MSCT (A, x) u9R.MSCT (A\{R(x, y)}, y) Proof of this proposition can be found in (Xu et al., Thus, if an ABox 2013). Proposition 1 directly leads to a first syntactic Ac = {R0 (a0 , a1 ), R1 (a1 , a2 ), . . . , Rn (an , a0 )} condition denoted SYN COND. Definition 4 (SYN COND) Role assertions of the MSC of an assertion cycle is obtained as fol- the form R(a, b) are said to be true for lows: SYN COND if role R participates in at least one axiom that can be logically converted to the form MSCT (Ac , a0 ) = of equation 9 for some R v R0 , false otherwise. {a0 } u 9R1 .(9R2 .(. . . (9Rn .{a0 }) . . . )) (6) Assertions R(a, b) with SYN COND = false do So, suppose that the ABox in Table 1 is aug- not affect the classification of a unless either R mented with the assertion in equation (4), then the or some R0 such that R ✓ R0 exist in the query, MSC for harry becomes and can be safely removed from the calculation of MSCT (A, a). By symmetry to the inverse role {harry} u Student u 9takesCourse . R , this condition also applies to b. In practice, (Course u 9taughtBy. the axioms more likely to be found are of the form C v 9R.D, which is equivalent to C u > v (Professor u isHeadOf .School 9R.D. Also note that (C v 8R.D) ⌘ (9R.¬C v u isProtegeOf .{harry})) (7) ¬D). In the example in Table 1, assertions with 118 role isHeadOf have SYN COND=true, while as- MSC of any individual. For example, the role as- sertions with roles takesCourse and taughtBy sertion have SYN COND=false and can be safely re- Headmaster(albus) (12) moved from consideration, unless the roles exist in the query itself. Note that in this case, the MSC would have SYN COND SC=true. for harry is reduced to Application of these three syntactic conditions to the computation of the MSC of a given individ- Student u 9takesCourse .Course (10) ual results in a significant reduction in the size of the MSC for practical ontologies. The reasons for A second syntactic condition presented in (Xu this reduction will be established in more detail in et al., 2015a) relies on the identification of explicit the next section, but first we provide a definition concept assertions and disjointness axioms that in- for the parallelization of the algorithm. dicate that an individual cannot be classified to an Remark 1 The MSC method is guaranteed to be existential restriction: sound and complete for SHI DL (Xu et al., Definition 5 (SYN COND DJ) Role assertions 2015a). It can be extended to DL with expressiv- R(a, b) with SYN COND= true are said to be ity SHIQ with unique name assumption, that is, true for SYN COND DJ if there do not exist any if it is assumed that two individuals a and b are of: different if they have different names. Details of • an explicit concept assertion B0 (b) such that such extension are outside the scope of this paper, K |= B0 v ¬C1 ; but they are straightforward, and generally follow • an explicit concept assertion A0 (a) such that the extensions to equality-free module extraction K |= A0 v ¬C3 ; or detailed in (Xu et al., 2013). • an explicit concept assertion A0 (a) such that K |= A0 v ¬(C3 t ¬C2 ) 2.4 Parallelization of the MSC Method for C1 , C2 , an C3 as in Eq.(9) and R v R0 . The MSC method, including the syntactic condi- Role assertions with SYN COND DJ=false can tions detailed above, is highly parallelizable, due be safely removed from the calculation of the to the following: MSC of any individual, since they do not affect Proposition 2 For a given knowledge base K = classification of either a or b. For example, in Ta- (T , A), computation of MSCT (A, a) for an in- ble 1, role assertions with role takesCourse have dividual a can be performed independently of the SYN COND=true due to the assertion computation of the MSC of any other individual. Proof of this proposition follows directly from the takesCourse.MagicCourse v Wizard. definition of the MSC computation in Definitions 2 and 3, as well as in the definition of the syntac- However, suppose that the following assertion tic conditions, where it should be clear that MSC were inserted into the ABox: computation depends only on the initial state of takesCourse(dudley, math) (11) the knowledge base. This means that the calcula- tion of the MSCs for all individuals can be done in This assertion has SYN COND DJ=false, since parallel. Instance checking then needs to be per- dudley is an instance of Muggle, which in turn is formed against T [ MSCT (A, a) for every indi- disjoint with Wizard, the filler concept in the as- vidual a. sertion above. Note that naive parallelized processing of in- dividual equality assertions (i.e., owl:sameAs ex- Definition 6 (SYN COND SC) Role assertions pressions) using the procedures outlined in Defi- R(a, b) with SYN COND= true are said to be nition 2 results in redundant computations. Pre- true for SYN COND SC if, for every GCI of the computation of the reflexive-transitive closure of form in Eq.(9) there exists an explicit concept as- these equalities avoids such redundancy. Note sertion A0 (a) such that K |= A0 v C3 . also that anonymous individuals, typically pre- Role assertions with SYN COND SC=true are sented as blank nodes in OWL, are processed in redundant for the classification of a, and can there- the same way as named individuals for calculation fore be safely removed from the calculation of the purposes. 119 As is shown later in Section 4, parallelization would have to be involved in a GCI of the form of of the MSC method allows for the processing of equation (9). extremely large ABoxes within clusters of com- From a practical standpoint, it seems more in- modity hardware. The resulting computational teresting to study the performance of the enhanced complexity is sublinear with respect to the size MSC method when used with typical ABoxes. of the ABox, which indicates linear complexity Graph theory, and particularly random graph the- for single-machine implementation. In the next ory, provide methods and tools for the study of section, we present an analysis of the expected typical graphs. computational complexity of the enhanced MSC method, and show why it has tractable expected 3.2 Random Graph Theory performance for practical ontologies. Random graph theory is concerned with the study of graphs as probabilistic random variables with a 3 Data Complexity and Random Graphs defined probability distribution. Here we provide a short summary of the points of interest to this 3.1 MSC Method Worst-Case Complexity paper; the reader is referred to (Chung, 2009) for Typically, the size of the ABox as measured by more details. the number of facts it contains is orders of mag- Many real-world network structures are so- nitude greater than the size of the TBox, and also called scale-free, that is, they exhibit power-law much larger than the size of the query. While it has degree distributions (Mika, 2007; Newman, 2002). been shown that, even with exponential time com- A random power-law graph model can be defined plexity, TBoxes of realistic size can be processed as G(C, ↵), where the number of nodes nk with in realistic time (Donini, 2003), practical ABoxes degree k is given by can contain million of individuals. Data complex- ity for instance checking in DLs with expressiv- nk = C · k ↵ (13) ity SHIQ has an upper bound of 2EXPTIME where C is the volume of the graph and ↵ is the (Glimm et al., 2007). A family of less expres- log-log rate of growth of the graph. sive languages called DL-Lite have been shown to It is important to note that random power-law be first-order-logic rewritable and thus with data graphs do not contain nodes with degree of 0, complexity in AC0 , while even small additions to since at this value the distribution is undefined; this language family renders the data complexity therefore, analysis of random power-law graphs for instance checking in co-NP complete or worse discards all zero-degree nodes. Where it is nec- (Calvanese et al., 2013). essary to make the distinction, we will denote the It is clear that the MSC method for instance number of non-zero-degree nodes the effective or- checking is EXPTIME-complete for SHI. Since der of the graph. the computation of the MSC is PTIME, as it is es- It has been shown that a phase transition oc- sentially depth-first search, complexity of check- curs as the average node degree increases (where ing depends on TBox reasoning. From Definition the degree of a node is the number of edges con- 1, it should be clear that for a connected ABox, nected to it.), so that at a critical value a giant com- the size of the MSC is the size of the ABox it- ponent forms with high probability 1 (Erdős and self. Since testing for the subsumption of the Rényi, 1960; Molloy and Reed, 1995). A compo- MSC against a concept in the TBox is reducible nent (also called connected component) is a maxi- to satisfiability testing of the union of the TBox mal subset of the nodes of the graph and all edges and the MSC, instance checking with the MSC between those nodes such that there is a path be- method is equivalent to satisfiability testing for tween any two nodes in the component. A giant SHI DLs, which is EXPTIME-complete (Tobies, component is a connected component with num- 2001). Given that the syntactic conditions defined ber of nodes in the same order of magnitude as the in subsection 2.3 do not guarantee that any role graph itself. For any random graph, the phase tran- assertions will be actually removed from the cal- sition occurs at the point given by the solution of: culation of the MSC, the worst-case complexity 1 of the enhanced MSC method is still EXPTIME. Since random graph theory is probabilistic, conclusions are always obtained ”with high probability”. Henceforth, we This situation is unlikely to occur in practice how- will abbreviate this as w.h.p. This is also sometimes stated as ever, as it would mean that every role in the ABox ”asymptotically almost surely”, or just ”almost surely”. 120 Thus, collapsing parallel edges into a single un- 1 X labeled edge results in minimal reduction in the k(k 2)pk = 0 (14) complexity of MSC calculations. k=0 Certain properties in the role assertion graph GA where pk is the degree distribution. For power-law can be readily related to expected characteristics graphs, this reduces to (Newman, 2002; Chung, of the MSCs calculated for the underlying ABox. 2009): Specifically: 1 X • the size of MSCT (A, a), as measured by k 1 ↵ k 2 ↵ = ⇣(↵ 2) 2⇣(↵ 1) = 0 number of simple form concepts that it con- k=0 tains, is equivalent to the order of the con- (15) nected component that contains the node that P represents the individual a. after integration, where ⇣(t) = 1 n=1 n t is the • the number of conjuncts at the top level of Riemann zeta function. Thus, the phase transi- MSCT (A, a) is proportional to the degree of tion point occurs at the value of the power-law ex- the node representing a. ponent ↵0 = 3.47875.... Graphs with exponent • the maximum quantification depth of ↵ < ↵0 show a unique giant component of size MSCT (A, a) is given by the diameter of the O(n) w.h.p. It can also be shown that for ↵ > 2 connected component containing the node the second largest component is in size ✓(log n), corresponding to a. for 1 < ↵ < 2 the second largest component is It is clear that the appearance of a giant component in ✓(1), and for ↵ < 1 the graph is connected, all then means that the MSC for a significant number w.h.p (Chung, 2009). of individuals is in O(n), and that therefore the complexity of MSC instance checking is exponen- 3.3 Random Graphs and the MSC method tial in n. To apply random graph theory to the study of DL The application of the syntactic conditions out- ABoxes, we define the role assertion graph as fol- lined in subsection 2.3 can be reflected in the role lows: assertion graph through the removal of assertions Definition 7 (ABox Role Assertion Graph) that do not satisfy the conditions. An important The set of role assertions in a SHI ABox define question to ask is how such removal affects the an unlabeled, undirected graph GA , where the characteristics of the resulting graph. Since ran- nodes represent the individuals in the ABox, and dom graphs are generative processes, it is reason- where there is an edge between nodes if there is able to assume that removal of edges follows the at least one role assertion between the underlying same degree distribution as the original generation individuals. of the graphs themselves. If this is the case, the resulting graphs retain their power-law degree dis- In other words, the ABox graph GA replaces tribution. Thus, if the original ABox role assertion multiple parallel edges between two nodes with graph had a giant component of size O(n), for n a single, unlabeled edge. Self-loops are still the number of nodes, w.h.p. the modified graph allowed. Note that with respect to the MSC will also show a giant component. method, parallel edges result in only a small in- Note however that the reduction in the number crease in MSC size, equivalent to the number of of edges necessarily results also in a reduction in parallel edges. Suppose multiple role assertions the effective order of the graph, that is, in the num- R1 (a, b), R2 (a, b), . . . , Rn (a, b) exist in between ber of non-zero-degree nodes. It is in fact possible individuals a and b in the ABox A; application of to calculate both the number of nodes n and num- the rollup procedure for assertion cycles in Defini- ber of edges E in the graph based on the volume tion 3 results in the following: and log growth rate parameters from equation 13. MSCT (A, a) MSCT (A, a) u {a} For ↵ > 2, the equations are as follows (Chung, 2009): u 9R1 .(9R2 {a} u · · · u 9Rn {a} u MSCT (A\{R1 (a, b), R2 (a, b) 1 X . . . Rn (a, b)}, b)) (16) n= C · k ↵ ⇡ C · ⇣(↵) (17) k=1 121 Existential be done against other reasoners such as Pellet5 and Classes Restrictions Konclude6 . These tests were performed against a MSC HermiT MSC HermiT single department dataset for the University On- Min. 7.82 0.67 7.30 0.76 tology Benchmark (UOBM) (Ma et al., 2006), Max. 19.66 780.20 26.94 2,848.68 containing about 150,000 triples. The UOBM Avg. 8.38 19.42 9.14 489.95 TBox was modified to convert cardinality restric- Std. Dev. 1.42 90.50 3.24 698.14 tions to existential restrictions, since our current Median 8.16 0.79 8.19 135.94 implementation only handles expressivity up to SHI. This modified version can be provided upon Table 2: Times (in seconds) for execution of class request. The UOBM Tbox contains a total of and existential restriction queries 113 named classes and 35 object properties. We tested our enhanced MSC implementation against all 35 named class queries, and all 3,955 possi- 1 ble single-depth existential restriction queries, and 1X 1 E= k · C · k ↵ ⇡ · C · ⇣(↵ 1) (18) verified 100% agreement between our enhanced 2 2 MSC method implementation and HermiT. In ad- k=1 The ratio En is then dition, we recorded the running time for all query executions - the results can be seen in Table 2. It is n 2⇣(↵) interesting to note that the enhanced MSC method ⇡ (19) E ⇣(↵ 1) performs much better than HermiT for existential restrictions. Also note the large standard deviation which depends only on ↵. Therefore, if ↵ remains found when running a hypertableaux-based rea- constant, a reduction in the number of edges soner like HermiT, where a few queries take a very necessarily results in a proportional reduction long time to finish. This result also suggests the in the number of (non-zero-degree) nodes, that possibility of using enhanced MSC in tandem with is, in the effective order of the graph. As will be a traditional reasoner when dealing with smaller shown in Section 4, it is this property that enables datasets. the enhanced MSC method to provide extremely good performance for very large ABoxes. 4.2 Parallelization 4 Experimental Evaluation To test the parallelization of the MSC method, we used the c3.8xlarge instances,which provide 32 4.1 Accuracy virtual CPUs and 60 GBs of storage. We used To test both the accuracy and to measure paral- the Lehigh University Benchmark (LUBM) (Guo lelization speedup of the enhanced MSC method, et al., 2005) to generate data sets of up to 500 mil- we set up clusters of compute-optimized instances lion triples. LUBM was chosen as an initial test through Amazon Web Services (AWS)2 . The en- ontology due to its ability to generate datasets of hanced MSC method with syntactic condition cor- varying size, while providing a reasonably expres- rection was implemented in Java and Scala to work sive TBox. These data sets were stored using the over Apache Spark3 , installed over Hadoop HDFS TitanDB7 graph database interface over an Apache and YARN. HBase8 backend. Both TitanDB and HBase were For the accuracy tests, we used an in-memory installed over Hadoop HDFS 2.7. Our prototype version of our enhanced MSC implementation. application over Spark accesses TitanDB through We set up clusters of two c4.xlarge machines, each its standard Java interface. Data distribution and containing 4 virtual CPUs and 7.5GB of mem- replication are performed by the database and are ory, to run the enhanced MSC method, and com- transparent to our application. pared the results against the HermiT reasoner ver- A test was performed to evaluate the scalability sion 3.8.14 , running on a single c4.xlarge ma- of the parallel MSC method over the number of chine; HermiT was chosen as a comparison stan- triples in the ABox. This test was performed over dard due to its stability and speed; future tests will 5 https://github.com/stardog-union/pellet 2 6 aws.amazon.com http://derivo.de/produkte/konclude/ 3 7 https://spark.apache.org/ http://thinkaurelius.github.io/titan/ 4 8 http://www.hermit-reasoner.com http://hbase.apache.org/ 122 (a) Efficiency for small number of machines Figure 1: Execution time vs. data set size with LUBM datasets, in AWS, for 10 c3.8xlarge ma- chines with 32 cores each. a cluster of 10 c3.8xlarge machines in AWS. The results are shown in the log-log diagram in Figure 1. As can be observed, the method shows clear sub-linear performance with respect to the size of (b) Efficiency for large number of machines the data set, as expected from an algorithm with linear performance in a sequential machine. Figure 2: Execution time and efficiency of paral- lelization. The performance with respect to the number of machines was evaluated in two parts. First, to evaluate under small cluster conditions, a 500,000 some measurements. Nevertheless, it is clear that triple set was assembled and used to test against the MSC method provides performance compara- 1 to 10 machines. The execution time and the ble with top-of-the-line database technologies and efficiency with respect to single-machine execu- very broad scalability. tion are shown in Figure 2a. To obtain evaluation These results demonstrate that the enhanced for large numbers of machines, we used the ABox MSC method is inherently parallelizable, enabling with 500 million triples and ran it against 10 to it to work with large scale ABoxes. They also 50 machines; to provide a more realistic estima- show that the fact that in the worst case the en- tion, the efficiency value was corrected against the hanced MSC method is in EXPTIME does not result for 500,000 triples in 10 machines. These preclude its usefulness with practical ontologies. higher scalability results are shown in Figure 2b. In the next section, we evaluate the characteristics In terms of raw performance, the parallel en- of typical ABoxes that support this assertion using hanced MSC algorithm was capable of performing random graph models. instance checking for a dataset with 500 million triples and over 110 million individual instances 4.3 Random Graph Complexity Analysis in about 1,240 seconds, or around 20 min., using For this purpose, we have created ABox role as- a cluster of 10 machines and a total of 320 execu- sertion graphs for the same set of ontologies used tion cores. Using 50 machines, the execution time for the empirical evaluation in (Xu et al., 2015a), was 346 seconds, or somewhat less than 6 min- and have calculated the least-squares fit for their utes. As a comparison, although the difference degree distributions. The data sets used are the in algorithms means that execution times are not following: directly comparable, Oracle reports full ABox in- 1. LUBM1 and LUBM4: Lehigh University ference over 869 million triples in 62 minutes, and Benchmark ontologies about higher educa- query performance over this pre-reasoned ABox in tion entities generated using the tool provided about 4.3 min., using specialized hardware (W3C, by (Guo et al., 2005); and 2015). It is also important to note that, since tests 2. Arabidopsis thaliana (AT) and Caenorhabdi- were performed over an uncontrolled environ- tis elegans (CE), two ABoxes built over the ment, external perturbations could have affected BioPAX TBox about biomedical pathways in 123 n E C ↵ R2 Eff. E n LUBM1 17,174 49,321 19,608 2.051 0.75 LUBM4 78,579 236,514 162,116 2.276 0.72 LUBM1 30 15 AT 42,695 74,680 75,558 2.653 0.92 LUBM4 142 71 CE 37,162 68,034 147,066 2.89 0.92 AT 11,364 11,573 CE 9,123 8,240 Table 3: Degree distribution of role assertion # Max. Avg. non- graphs of common ontologies: The least-squares Size Size sngltn fit approximation for the degree distributions of LUBM1 17,160 2 1.00 15 the role assertion graphs of selected common on- LUBM4 78,508 2 1.00 71 AT 32,911 442 1.30 1,580 tologies is shown. n is the number of nodes; E the CE 29,913 251 1.24 1,874 number of edges; C and ↵ are the volume and log- growth of the power law; and R2 is the correlation Table 4: Component sizes after application of syn- coefficient of the approximation. tactic conditions. The ’non-sngltn” column gives the number of components that have more than the respective organisms; the TBox and both one node. ABoxes can be provided upon request. We have performed the least-squares fit up to the different conditions on the AT ABox; similar a maximum degree of 100, which accounts for trends can be seen with the other data sources. As over 99.8% of all nodes in every graph considered. can be seen, the initial SYN COND condition ac- Since there usually are discrepancies when the de- counts for a significant portion of the reduction, gree is very small or very large (Chung, 2009), this but still leaves highly connected components, and truncation eliminates a significant source of inac- is therefore not sufficient on its own to enable curacy in the models. Table 3 provides the total processing of instance checking in realistic time. number of edges and nodes of the role assertion Both the disjointness and subclass conditions con- graph (remember that the number of edges differs tribute to a further reduction, resulting finally in a from the total number of properties in the ABoxes realistic component size for TBox reasoning pur- due to the conversion to a graph), values for C and poses. In general, SYN COND SC contributes ↵ in Equation 13, and correlation coefficient R2 . more of the reduction in size, for all studied on- Note that the larger graphs approximate a power- tologies. It should be also noted that a very large law degree distribution to a very high degree. Note number of resulting components are singletons, also that the exponents of the approximations are which in the MSC method result in subsumption all between 2 and 3, which is the typical range for checking based only on class assertions; this can small-world, scale-free networks (Mika, 2007). be performed very fast since the class hierarchy of We have then calculated the effective order and the TBox can be precomputed. log-growth rate of the degree distribution of the Up to now, we have assumed that the power ABox role assertion graphs for all ontologies after law exponent remains constant after application of application of SYN COND, SYN COND DJ, and the various syntactic conditions. It is interesting SYN COND SC; the results, shown in Table 4, then to evaluate the behavior of the exponent in demonstrate that these conditions indeed produce the power law distribution of equation 13. Figure a drastic reduction in the effective order of the role 3 shows this for the AT data source on a log-log assertion graph and thus in the size of the result- scale; as can be seen, other than significant de- ing connected components, even if a giant compo- viations at very low degree values, the exponent nent still appears. Observe in particular that for remains relatively constant. the LUBM ontologies, the reduction is so signifi- cant that only a very small number of individuals 5 Discussion and Future Work remain connected to others. It is interesting to examine the relative impact The enhanced MSC method is inherently paral- of the various syntactic conditions. Note that lelizable, since instance checking for every indi- SYN COND must always be applied, since with- vidual in the ABox can be performed indepen- out it, the other conditions do not make sense. dently. Coupled with recent advances in clus- Table 5 shows the impact for the application of ter computing such as Apache Spark, large triple 124 Graph size Components Effective E # Max. Avg. non- n Size Size sngltn SYN COND 33,235 33,015 12,447 2,411 3.43 2,987 SYN COND DJ 23,377 22,374 22,161 862 1.93 2,843 SYN COND SC 19,197 19,099 25,801 1,695 1.66 2,303 SYN COND ALL 11,364 11,573 32,911 442 1.30 1,580 Table 5: Component sizes after application of different syntactic conditions on the AT data set. SYN COND ALL refers to the application of all conditions. SPARQL query answering will also enable us to perform tests with other existing benchmarks such as the DBPedia SPARQL Benchmark (DBPSB). In addition, we are working on improvements in the efficiency of the parallelization of the MSC method. In particular, we are looking into com- bining multiple individuals that form part of the same connected component in the ABox role as- sertion graph in the same parallel task, since it can be seen in Definitions 2 and 3 that portions of the Figure 3: Power law distribution of AT data source MSC computation can be shared among individu- after application of different syntactic conditions. als provided that they are connected to each other. 6 Conclusion stores can be queried efficiently using commodity In this paper, we have presented a parallel imple- hardware clusters or cloud platforms. mentation of the enhanced MSC method, and we Even as worst-case complexity of the method have evaluated execution time and efficiency as we is in exponential time for a single sequential ma- varied the size of the data and the number of pro- chine, and thus intractable for parallelization, the cessors used. Since the method performs indepen- method performs within realistic time in practi- dent checking for every individual in the ABox, it cal, real-world ontologies, which typically present is inherently parallelizable. The results show sub- a power-law degree distribution. As discussed in linear performance with respect to the size of the our random graph theory analysis, this is due to the ABox, which stems from its performance in linear ability of the method to reduce the size of the con- time in the computation of the MSCs, and almost nected components formed within the ABox Role constant time for reasoning due to the small size Assertion Graph, which in turn means a reduction of the resulting MSC. in the size of the MSCs generated for the individ- Acknowledgements uals in the ABox. Our method is thus most ef- fective when the different syntactic conditions can This work is supported by grant # R44GM097851 be applied to effect this reduction, and will possi- from the National Institute of General Medical bly prove less beneficial if most of the roles in the Sciences (NIGMS), part of the U.S. National In- TBox are engaged in axioms of the form of equa- stitutes of Health (NIH). tion (9). We are currently exploring the use of the References method to address full conjunctive queries over Franz Baader, Diego Calvanese, Deborah L. McGuin- DL knowledge bases expressed in the SPARQL ness, Daniele Nardi, and Peter F. Patel-Schneider, query language. This requires the expansion of the editors. 2003. The description logic handbook: the- MSC method to retrieve individuals a and b that ory, implementation, and applications. Cambridge can be inferred to be in a relation R(a, b). This University Press, New York, NY, USA. extension is being implemented based on Theorem Diego Calvanese, Giuseppe De Giacomo, Domenico 4.1 in (Xu et al., 2015a). The implementation of Lembo, Maurizio Lenzerini, and Riccardo Rosati. 125 2013. Data complexity of query answering in de- Boris Motik, Rob Shearer, and Ian Horrocks. 2007. scription logics. Artificial Intelligence 195:335– Optimized Reasoning in Description Logics Using 360. https://doi.org/10.1016/j.artint.2012.10.003. Hypertableaux. In Frank Pfenning, editor, Auto- mated Deduction – CADE-21, Springer Berlin Hei- Fan Chung. 2009. A whirlwind tour of random graphs. delberg, number 4603 in Lecture Notes in Computer In Robert A. Meyers, editor, Encyclopedia of Com- Science, pages 67–83. plexity and Systems Science, Springer, pages 7493– 7505. Ralf Möller, Volker Haarslev, and Michael Wessel. 2007. On the Scalability of Description Logic F Donini and A Era. 1992. Most specific concepts for Instance Retrieval. In Christian Freksa, Michael knowledge bases with incomplete information. In Kohlhase, and Kerstin Schill, editors, KI 2006: Proceedings of CIKM. Baltimore, MD, USA, pages Advances in Artificial Intelligence, Springer Berlin 545–551. Heidelberg, number 4314 in Lecture Notes in Com- puter Science, pages 188–201. Francesco M. Donini. 2003. Complexity of Reason- ing. In Franz Baader, Diego Calvanese, Debo- Bernhard Nebel. 1990. Reasoning and Revision in Hy- rah L. McGuinness, Daniele Nardi, and Peter F. brid Representation Systems. In Lecture Notes in Patel-Schneider, editors, The description logic hand- Artificial Intelligence. Springer-Verlag. book: theory, implementation, and applications, Cambridge University Press, New York, NY, USA, Mark E. J. Newman. 2002. Random graphs as models pages 96–136. of networks. In Stefan Bornholdt and Hans Georg Schuster, editors, Handbook of Graphs and Net- Francesco M. Donini, Maurizio Lenzerini, Daniele works, Wiley-VCH Verlag GmbH & Co. KGaA, Nardi, and Andrea Schaerf. 1994. Deduction in pages 35–68. Concept Languages: from Subsumption to Instance Checking. J Logic Computation 4(4):423–452. Sambhawa Priya, Yuanbo Guo, Michael Spear, and https://doi.org/10.1093/logcom/4.4.423. Jeff Heflin. 2014. Partitioning OWL Knowledge Bases for Parallel Reasoning. IEEE, pages 108–115. P. Erdős and A. Rényi. 1960. On the Evolution of Ran- https://doi.org/10.1109/ICSC.2014.34. dom Graphs. In Publication of the Mathematical In- stitute of the Hungarian Academy of Sciences. pages Andrea Schaerf. 1994. Reasoning with individuals in 17–61. concept languages. Data & Knowledge Engineering 13(2):141–176. Birte Glimm, Ian Horrocks, Carsten Lutz, and Uli Sat- Stephan Tobies. 2001. Complexity Results and Practi- tler. 2007. Conjunctive Query Answering for the cal Algorithms for Logics in Knowledge Represen- Description Logic SHIQ. In Proceedings of the 20th tation. arXiv:cs/0106031 ArXiv: cs/0106031. International Joint Conference on Artifical Intelli- gence. Morgan Kaufmann Publishers Inc., San Fran- W3C. 2015. Large Triple Stores - W3c Wiki. cisco, CA, USA, IJCAI’07, pages 399–404. https://www.w3.org/wiki/LargeTripleStores. Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. Sebastian Wandelt and Ralf Möller. 2012. Towards LUBM: A Benchmark for OWL Knowledge Base ABox Modularization of Semi-expressive Descrip- Systems. Web Semant. 3(2-3):158–182. tion Logics. Appl. Ontol. 7(2):133–167. Ian Horrocks. 2008. Ontologies and the Se- Jia Xu, Patrick Shironoshita, Ubbo Visser, Nigel John, mantic Web. Commun. ACM 51(12):58–67. and Mansur Kabuka. 2013. Extract ABox Modules https://doi.org/10.1145/1409360.1409377. for Efficient Ontology Querying. arXiv:1305.4859 [cs] ArXiv: 1305.4859. Ian Horrocks and Sergio Tessaris. 2000. A conjunctive query language for description logic ABoxes. In In Jia Xu, Patrick Shironoshita, Ubbo Visser, Nigel In AAAI/IAAI. pages 399–404. John, and Mansur Kabuka. 2015a. Convert- ing Instance Checking to Subsumption: A Re- Li Ma, Yang Yang, Zhaoming Qiu, Guotong Xie, Yue think for Object Queries over Practical On- Pan, and Shengping Liu. 2006. Towards a Com- tologies. International Journal of Intelligence plete OWL Ontology Benchmark. In The Semantic Science 05(01):44–62. ArXiv: 1412.7585. Web: Research and Applications. Springer, Berlin, https://doi.org/10.4236/ijis.2015.51005. Heidelberg, pages 125–139. Jia Xu, Patrick Shironoshita, Ubbo Visser, Nigel Peter Mika. 2007. Social Networks and the Seman- John, and Mansur Kabuka. 2015b. Module Ex- tic Web, volume 5 of Semantic Web and Beyond. traction for Efficient Object Queries over On- Springer US, Boston, MA. tologies with Large ABoxes. AIA 2(1):8–31. https://doi.org/10.15764/AIA.2015.01002. Michael Molloy and Bruce Reed. 1995. A critical point for random graphs with a given degree sequence. Random Struct. Alg. 6(2-3):161–180. 126