Parallelization of Query Processing over Expressive Ontologies

          E. Patrick Shironoshita1 , Da Zhang2 , Mansur R. Kabuka1,2 and Jia Xu2
                1                                       2
                  INFOTECH Soft, Inc.                     University of Miami
           1201 Brickell Avenue, Suite 220        Coral Gables, Florida 33124, USA
              Miami, Florida 33131, USA
          patrick@infotechsoft.com


                     Abstract                              data sets (Priya et al., 2014). However, even as rea-
                                                           soners for very expressive DLs have been created,
    Efficient query answering over Descrip-
                                                           performing reasoning over very large ABoxes is
    tion Logic (DL) ontologies with very large
                                                           still prohibitive (Calvanese et al., 2013; Donini,
    datasets is becoming increasingly vital.
                                                           2003; Glimm et al., 2007).
    Recent years have seen the development
    of various approaches to ABox partition-                   In the last few years, methods for parallel and
    ing to enable parallel processing. Instance            distributed reasoning over expressive ontologies
    checking using the enhanced most spe-                  have been published (Xu et al., 2013, 2015b; Priya
    cific concept (MSC) method is a partic-                et al., 2014; Wandelt and Möller, 2012); these
    ularly promising approach. The applica-                methods generally seek to make use of fast syn-
    bility of these distributed reasoning meth-            tactic checks to generate a set of independent par-
    ods to typical ontologies has been shown               titions that can be processed in parallel. In (Xu
    mainly through anecdotal observation. In               et al., 2015a), a method for instance checking
    this paper, we present an analysis method              is proposed, combining the work in (Xu et al.,
    that makes use of random graph theory to               2013, 2015b) with the idea of a most specific con-
    show that the enhanced MSC method re-                  cept (MSC) (Nebel, 1990; Donini and Era, 1992;
    sults in very small, tractable concepts pro-           Donini et al., 1994) to move the reasoning task
    vided that the number of role assertions re-           from the very large ABox into a much smaller
    moved from consideration is large enough.              TBox. Empirical evaluation of the method has
    We also present execution time and ef-                 shown its ability to perform sound and complete
    ficiency of a parallel implementation de-              instance checking over SHI DLs within reason-
    ployed over computing clusters of various              able time. Moreover, the method is inherently par-
    sizes, showing the ability of the method to            allelizable, lending itself to implementation within
    process instance checking for large scale              clusters of commodity hardware. In this paper, we
    datasets.                                              examine the enhanced MSC method and its paral-
                                                           lelization implementation.
1   Introduction
Description Logics (DL) are increasingly being             2     The Enhanced Most Specific Concept
used to model and represent structured and semi-                 (MSC) method
structured data in different applications (Horrocks,
                                                           2.1    Basic MSC Method
2008). A core task for DL systems is to provide
an efficient way to answer queries over the exten-         A Description Logics (DL) knowledge base, also
sional level of the ontology, that is, to compute          referred to as an ontology, is typically defined a tu-
answers that are not only asserted, but logically          ple, denoted K = (T , A), where the terminologi-
implied by the ontology (Calvanese et al., 2013).          cal component or TBox T contains definitions of
Considerable efforts have been dedicated to the            concepts and roles, and the assertional component
optimization of algorithms for query answering in          or ABox A contains assertions about membership
recent years (Calvanese et al., 2013; Möller et al.,      of individuals in concepts and about role relations
2007; Motik et al., 2007). One of the challenges           between individuals. The set of roles, concepts,
faced in this era of increasing data wealth is to pro-     and individual instances in an ontology are de-
duce responsive results for queries over very large        noted respectively by R, C, and I. The discussion


                                                     116
                   TBox                                1. for a given individual a, start with an empty
  Headmaster v Professor                                  MSCT (A, a);
  Professor v Person
  MagicCourse v Course                                 2. for every concept assertion C(a) ,
  Muggle v ¬Wizard                                        MSCT (A, a)       MSCT (A, a) u C;
  takesCourse.MagicCourse v Wizard                     3. for every role assertion R(a, b),
  9isHeadOf.School u Person v Headmaster
                   ABox
                                                          MSCT (A, a)       MSCT (A, a)
  School(hogwarts)                                           u 9R.MSCT (A\{R(a, b)}, b);
  Professor(albus)                                     4. for every individual equality assertion a = a0 ,
  Professor(severus)
  MagicCourse(potions)
                                                          MSCT (A, a)       MSCT (A, a)
  MagicCourse(transfiguration)                               u MSCT (A, a0 ).
  Course(math)
  Student(harry)                                         So, in the example ABox above, the MSC for
  Muggle(dudley)                                         severus is
  isHeadOf(hogwarts, albus)
  takesCourse(harry, transfiguration)                            Professor u 9taughtBy .Course                (1)
  taughtBy(transfiguration, albus)
  taughtBy(potions, severus)
                                                              It is important to note that while class asser-
                                                         tions generate relatively simple concepts, role as-
             Table 1: Example ABox                       sertions are capable of generating very complex
                                                         concepts. Suppose for example that ABox An
in this paper assumes that the reader is familiar        consists of role assertions R0 (a0 , a1 ), R1 (a1 , a2 ),
with DL concepts and notations; refer to (Baader         . . . , Rn (an , an+1 ); then:
et al., 2003) for details. For the discussion below,   MSCT (An , a0 ) =
we will use the example illustrated in Table 1.
                                                            9R1 .(9R2 .(. . . (9Rn .MSCT (an+1 )) . . . )) (2)
Definition 1 (Most Specific Concept) (Nebel,
1990; Donini and Era, 1992) Let K = {T , A}              Thus, in the example ABox of Table 1, the MSC
be an ontology, and a be an individual in                for harry is
I.    The most specific concept for a w.r.t.
                                                            Student u 9takesCourse .
A, written MSCT (A, a), is a concept such
that for every concept D where K |= D(a),                      (Course u 9taughtBy.
T |= MSCT (A.a) v D.                                              (Professor u isHeadOf .School))             (3)
If MSCT (A, a) can be derived, then, to test
                                                           Two main issues preclude this basic MSC
whether K |= Q(a) holds for an arbitrary con-
                                                         method from proving fully useful with expressive
cept Q it suffices to test if (T [ {Q}) |=
                                                         ontologies. First, if assertion cycles are present in
MSCT [{Q} (A, a) v Q. We call the concept Q
                                                         the ABox, the method does not terminate. For ex-
the query. Note that Q needs to be inserted into the
                                                         ample, consider what would happen if
TBox in simple form, which may require in turn
the insertion of additional named concepts as well                   isProtegeOf(albus, harry)                (4)
as general concept inclusion (GCI) axioms to ex-
press the necessary equivalences for these inserted      is inserted into the ABox, then a cycle forms
concepts. In the remainder of this paper, we will        among the individuals harry, transfiguration,
assume that the query Q has been inserted into the       and albus.
TBox so that MSCT (A, a) denotes the MSC cal-               Second, unless the ABox is highly discon-
culated considering Q.                                   nected, the method has the potential to generate
   Computation of the MSC for a given individ-           very large concepts, of size in the same order of
ual a as defined above can be performed using a          the ABox itself. Consider what happens if the fol-
rolling up procedure adapted from the one first in-      lowing assertion is added to the ABox:
troduced in (Horrocks and Tessaris, 2000).
Definition 2 (Basic MSC Rollup Procedure)
                                                                   takesCourse(harry, potions)                (5)
Provided that the ABox does not contain assertion          In this case, all individuals become connected
cycles, computation of the MSC can be performed          with each other, and the MSC of every individual
recursively as follows:                                  ends up being the same size of the ABox.


                                                   117
   Xu et.al. (2015a) define a set of improvements                     Use of the MSC method to perform instance re-
that result in the Enhanced MSC Method. This en-                   trieval is straightforward. Suppose it is desired to
hancement consists of two parts: a mechanism to                    query an ABox to retrieve all instances of a con-
address assertion cycles, and a set of syntactic con-              cept Q. It suffices to generate MSCT (A, a) for ev-
ditions to reduce the size of the MSCs in practical                ery individual a in the ABox, and then accept ev-
ontologies.                                                        ery individual where (T [ Q) |= MSCT (A, a) v
                                                                   Q. Note that the query concept Q is added to the
2.2 MSC Computation with Assertion Cycles                          TBox being verified.
To address assertion cycles, Xu et.al. (2015a) use
nominals to indicate the joint node of a cycle, as                 2.3   Syntactic Conditions
previously suggested in (Donini and Era, 1992;                     In (Xu et al., 2015a), syntactic conditions verifi-
Schaerf, 1994).                                                    able in polynomial time or better were defined, to
Definition 3 (MSC Rollup of Assertion Cycles)                      enable reduction on the size of the MSC and thus
When a cycle is found starting and ending at in-                   permit TBox reasoning in tractable time. These
dividual ac , the individual is represented by                     conditions are based on the following:
its corresponding nominal class {ac }, and the                     Lemma 1 Given two individuals a and b and a
conversion of role assertions within the cycle                     role R, a role assertion R(a, b) influences the clas-
requires modification of the rollup method as                      sification of individual a into concept A if and only
follows: If ac is an individual where a cycle is                   if
found, select a direction to go through the cycle                                  K |= 9R.B u A0 v A                (8)
and:
                                                                   where K |= B(b) and where A0 6v A summarizes
• for R(ac , x), x 6= ac
                                                                   information about a not contained in A.
        MSCT (A, ac )     MSCT (A, ac )u
           ({ac } u 9R.MSCT (A\{R(ac , x)}, x))                    Proposition 1 Let K = (T , A) be a SHI on-
• for R(y, ac ), y 6= ac                                           tology containing named concept A, concepts A0
        MSCT (A, y)      MSCT (A, y) u 9R.{ac }                    and B, and role R. If equation (8) holds, there
• for R(ac , ac ),                                                 must exist some GCIs in T of the form
        MSCT (A, ac )     MSCT (A, ac ) u {ac }
           u9R.{ac }                                                              9R0 .C1 ./ C2 v C3                (9)
• for any other R(x, y) in the cycle, for x 6= ac
                                                                   where R v R0 , ./ is a placeholder for either u or
   and y 6= ac ,
                                                                   t, and Ci ’s are concepts.
        MSCT (A, x)      MSCT (A, x)
           u9R.MSCT (A\{R(x, y)}, y)                               Proof of this proposition can be found in (Xu et al.,
Thus, if an ABox                                                   2013).
                                                                     Proposition 1 directly leads to a first syntactic
Ac = {R0 (a0 , a1 ), R1 (a1 , a2 ), . . . , Rn (an , a0 )}         condition denoted SYN COND.
                                                                   Definition 4 (SYN COND) Role assertions of
the MSC of an assertion cycle is obtained as fol-                  the form R(a, b) are said to be true for
lows:                                                              SYN COND if role R participates in at least one
                                                                   axiom that can be logically converted to the form
 MSCT (Ac , a0 ) =
                                                                   of equation 9 for some R v R0 , false otherwise.
    {a0 } u 9R1 .(9R2 .(. . . (9Rn .{a0 }) . . . )) (6)
                                                                   Assertions R(a, b) with SYN COND = false do
 So, suppose that the ABox in Table 1 is aug-                      not affect the classification of a unless either R
mented with the assertion in equation (4), then the                or some R0 such that R ✓ R0 exist in the query,
MSC for harry becomes                                              and can be safely removed from the calculation
                                                                   of MSCT (A, a). By symmetry to the inverse role
      {harry} u Student u 9takesCourse .                           R , this condition also applies to b. In practice,
        (Course u 9taughtBy.                                       the axioms more likely to be found are of the form
                                                                   C v 9R.D, which is equivalent to C u > v
           (Professor u isHeadOf .School
                                                                   9R.D. Also note that (C v 8R.D) ⌘ (9R.¬C v
               u isProtegeOf .{harry}))                (7)         ¬D). In the example in Table 1, assertions with


                                                             118
role isHeadOf have SYN COND=true, while as-                 MSC of any individual. For example, the role as-
sertions with roles takesCourse and taughtBy                sertion
have SYN COND=false and can be safely re-                                  Headmaster(albus)                (12)
moved from consideration, unless the roles exist
in the query itself. Note that in this case, the MSC        would have SYN COND SC=true.
for harry is reduced to                                        Application of these three syntactic conditions
                                                            to the computation of the MSC of a given individ-
     Student u 9takesCourse .Course              (10)       ual results in a significant reduction in the size of
                                                            the MSC for practical ontologies. The reasons for
   A second syntactic condition presented in (Xu            this reduction will be established in more detail in
et al., 2015a) relies on the identification of explicit     the next section, but first we provide a definition
concept assertions and disjointness axioms that in-         for the parallelization of the algorithm.
dicate that an individual cannot be classified to an        Remark 1 The MSC method is guaranteed to be
existential restriction:                                    sound and complete for SHI DL (Xu et al.,
Definition 5 (SYN COND DJ) Role assertions                  2015a). It can be extended to DL with expressiv-
R(a, b) with SYN COND= true are said to be                  ity SHIQ with unique name assumption, that is,
true for SYN COND DJ if there do not exist any              if it is assumed that two individuals a and b are
of:                                                         different if they have different names. Details of
   • an explicit concept assertion B0 (b) such that         such extension are outside the scope of this paper,
     K |= B0 v ¬C1 ;                                        but they are straightforward, and generally follow
   • an explicit concept assertion A0 (a) such that         the extensions to equality-free module extraction
     K |= A0 v ¬C3 ; or                                     detailed in (Xu et al., 2013).
   • an explicit concept assertion A0 (a) such that
     K |= A0 v ¬(C3 t ¬C2 )                                 2.4   Parallelization of the MSC Method
for C1 , C2 , an C3 as in Eq.(9) and R v R0 .               The MSC method, including the syntactic condi-
Role assertions with SYN COND DJ=false can                  tions detailed above, is highly parallelizable, due
be safely removed from the calculation of the               to the following:
MSC of any individual, since they do not affect             Proposition 2 For a given knowledge base K =
classification of either a or b. For example, in Ta-        (T , A), computation of MSCT (A, a) for an in-
ble 1, role assertions with role takesCourse have           dividual a can be performed independently of the
SYN COND=true due to the assertion                          computation of the MSC of any other individual.
                                                            Proof of this proposition follows directly from the
      takesCourse.MagicCourse v Wizard.
                                                            definition of the MSC computation in Definitions
                                                            2 and 3, as well as in the definition of the syntac-
However, suppose that the following assertion
                                                            tic conditions, where it should be clear that MSC
were inserted into the ABox:
                                                            computation depends only on the initial state of
           takesCourse(dudley, math)             (11)       the knowledge base. This means that the calcula-
                                                            tion of the MSCs for all individuals can be done in
This assertion has SYN COND DJ=false, since                 parallel. Instance checking then needs to be per-
dudley is an instance of Muggle, which in turn is           formed against T [ MSCT (A, a) for every indi-
disjoint with Wizard, the filler concept in the as-         vidual a.
sertion above.                                                 Note that naive parallelized processing of in-
                                                            dividual equality assertions (i.e., owl:sameAs ex-
Definition 6 (SYN COND SC) Role assertions
                                                            pressions) using the procedures outlined in Defi-
R(a, b) with SYN COND= true are said to be
                                                            nition 2 results in redundant computations. Pre-
true for SYN COND SC if, for every GCI of the
                                                            computation of the reflexive-transitive closure of
form in Eq.(9) there exists an explicit concept as-
                                                            these equalities avoids such redundancy. Note
sertion A0 (a) such that K |= A0 v C3 .
                                                            also that anonymous individuals, typically pre-
Role assertions with SYN COND SC=true are                   sented as blank nodes in OWL, are processed in
redundant for the classification of a, and can there-       the same way as named individuals for calculation
fore be safely removed from the calculation of the          purposes.


                                                      119
   As is shown later in Section 4, parallelization       would have to be involved in a GCI of the form of
of the MSC method allows for the processing of           equation (9).
extremely large ABoxes within clusters of com-              From a practical standpoint, it seems more in-
modity hardware. The resulting computational             teresting to study the performance of the enhanced
complexity is sublinear with respect to the size         MSC method when used with typical ABoxes.
of the ABox, which indicates linear complexity           Graph theory, and particularly random graph the-
for single-machine implementation. In the next           ory, provide methods and tools for the study of
section, we present an analysis of the expected          typical graphs.
computational complexity of the enhanced MSC
method, and show why it has tractable expected           3.2    Random Graph Theory
performance for practical ontologies.                    Random graph theory is concerned with the study
                                                         of graphs as probabilistic random variables with a
3   Data Complexity and Random Graphs                    defined probability distribution. Here we provide
                                                         a short summary of the points of interest to this
3.1 MSC Method Worst-Case Complexity
                                                         paper; the reader is referred to (Chung, 2009) for
Typically, the size of the ABox as measured by           more details.
the number of facts it contains is orders of mag-           Many real-world network structures are so-
nitude greater than the size of the TBox, and also       called scale-free, that is, they exhibit power-law
much larger than the size of the query. While it has     degree distributions (Mika, 2007; Newman, 2002).
been shown that, even with exponential time com-         A random power-law graph model can be defined
plexity, TBoxes of realistic size can be processed       as G(C, ↵), where the number of nodes nk with
in realistic time (Donini, 2003), practical ABoxes       degree k is given by
can contain million of individuals. Data complex-
ity for instance checking in DLs with expressiv-                               nk = C · k ↵                       (13)
ity SHIQ has an upper bound of 2EXPTIME
                                                         where C is the volume of the graph and ↵ is the
(Glimm et al., 2007). A family of less expres-
                                                         log-log rate of growth of the graph.
sive languages called DL-Lite have been shown to
                                                            It is important to note that random power-law
be first-order-logic rewritable and thus with data
                                                         graphs do not contain nodes with degree of 0,
complexity in AC0 , while even small additions to
                                                         since at this value the distribution is undefined;
this language family renders the data complexity
                                                         therefore, analysis of random power-law graphs
for instance checking in co-NP complete or worse
                                                         discards all zero-degree nodes. Where it is nec-
(Calvanese et al., 2013).
                                                         essary to make the distinction, we will denote the
   It is clear that the MSC method for instance
                                                         number of non-zero-degree nodes the effective or-
checking is EXPTIME-complete for SHI. Since
                                                         der of the graph.
the computation of the MSC is PTIME, as it is es-
                                                            It has been shown that a phase transition oc-
sentially depth-first search, complexity of check-
                                                         curs as the average node degree increases (where
ing depends on TBox reasoning. From Definition
                                                         the degree of a node is the number of edges con-
1, it should be clear that for a connected ABox,
                                                         nected to it.), so that at a critical value a giant com-
the size of the MSC is the size of the ABox it-
                                                         ponent forms with high probability 1 (Erdős and
self. Since testing for the subsumption of the
                                                         Rényi, 1960; Molloy and Reed, 1995). A compo-
MSC against a concept in the TBox is reducible
                                                         nent (also called connected component) is a maxi-
to satisfiability testing of the union of the TBox
                                                         mal subset of the nodes of the graph and all edges
and the MSC, instance checking with the MSC
                                                         between those nodes such that there is a path be-
method is equivalent to satisfiability testing for
                                                         tween any two nodes in the component. A giant
SHI DLs, which is EXPTIME-complete (Tobies,
                                                         component is a connected component with num-
2001). Given that the syntactic conditions defined
                                                         ber of nodes in the same order of magnitude as the
in subsection 2.3 do not guarantee that any role
                                                         graph itself. For any random graph, the phase tran-
assertions will be actually removed from the cal-
                                                         sition occurs at the point given by the solution of:
culation of the MSC, the worst-case complexity
                                                             1
of the enhanced MSC method is still EXPTIME.                   Since random graph theory is probabilistic, conclusions
                                                         are always obtained ”with high probability”. Henceforth, we
This situation is unlikely to occur in practice how-     will abbreviate this as w.h.p. This is also sometimes stated as
ever, as it would mean that every role in the ABox       ”asymptotically almost surely”, or just ”almost surely”.


                                                   120
                                                               Thus, collapsing parallel edges into a single un-
                1
                X                                              labeled edge results in minimal reduction in the
                      k(k     2)pk = 0              (14)
                                                               complexity of MSC calculations.
                k=0
                                                                   Certain properties in the role assertion graph GA
where pk is the degree distribution. For power-law             can be readily related to expected characteristics
graphs, this reduces to (Newman, 2002; Chung,                  of the MSCs calculated for the underlying ABox.
2009):                                                         Specifically:
  1
  X                                                                • the size of MSCT (A, a), as measured by
        k 1 ↵ k 2 ↵ = ⇣(↵       2)    2⇣(↵      1) = 0                number of simple form concepts that it con-
  k=0                                                                 tains, is equivalent to the order of the con-
                                                    (15)              nected component that contains the node that
                                   P                                  represents the individual a.
after integration, where ⇣(t) = 1     n=1 n
                                            t is the
                                                                   • the number of conjuncts at the top level of
Riemann zeta function. Thus, the phase transi-                        MSCT (A, a) is proportional to the degree of
tion point occurs at the value of the power-law ex-                   the node representing a.
ponent ↵0 = 3.47875.... Graphs with exponent                       • the maximum quantification depth of
↵ < ↵0 show a unique giant component of size                          MSCT (A, a) is given by the diameter of the
O(n) w.h.p. It can also be shown that for ↵ > 2                       connected component containing the node
the second largest component is in size ✓(log n),                     corresponding to a.
for 1 < ↵ < 2 the second largest component is                  It is clear that the appearance of a giant component
in ✓(1), and for ↵ < 1 the graph is connected, all             then means that the MSC for a significant number
w.h.p (Chung, 2009).                                           of individuals is in O(n), and that therefore the
                                                               complexity of MSC instance checking is exponen-
3.3 Random Graphs and the MSC method
                                                               tial in n.
To apply random graph theory to the study of DL                    The application of the syntactic conditions out-
ABoxes, we define the role assertion graph as fol-             lined in subsection 2.3 can be reflected in the role
lows:                                                          assertion graph through the removal of assertions
Definition 7 (ABox Role Assertion Graph)                       that do not satisfy the conditions. An important
The set of role assertions in a SHI ABox define                question to ask is how such removal affects the
an unlabeled, undirected graph GA , where the                  characteristics of the resulting graph. Since ran-
nodes represent the individuals in the ABox, and               dom graphs are generative processes, it is reason-
where there is an edge between nodes if there is               able to assume that removal of edges follows the
at least one role assertion between the underlying             same degree distribution as the original generation
individuals.                                                   of the graphs themselves. If this is the case, the
                                                               resulting graphs retain their power-law degree dis-
In other words, the ABox graph GA replaces
                                                               tribution. Thus, if the original ABox role assertion
multiple parallel edges between two nodes with
                                                               graph had a giant component of size O(n), for n
a single, unlabeled edge. Self-loops are still
                                                               the number of nodes, w.h.p. the modified graph
allowed. Note that with respect to the MSC
                                                               will also show a giant component.
method, parallel edges result in only a small in-
                                                                   Note however that the reduction in the number
crease in MSC size, equivalent to the number of
                                                               of edges necessarily results also in a reduction in
parallel edges. Suppose multiple role assertions
                                                               the effective order of the graph, that is, in the num-
R1 (a, b), R2 (a, b), . . . , Rn (a, b) exist in between
                                                               ber of non-zero-degree nodes. It is in fact possible
individuals a and b in the ABox A; application of
                                                               to calculate both the number of nodes n and num-
the rollup procedure for assertion cycles in Defini-
                                                               ber of edges E in the graph based on the volume
tion 3 results in the following:
                                                               and log growth rate parameters from equation 13.
     MSCT (A, a)        MSCT (A, a) u {a}                      For ↵ > 2, the equations are as follows (Chung,
                                                               2009):
            u 9R1 .(9R2 {a} u · · · u 9Rn {a}
            u MSCT (A\{R1 (a, b), R2 (a, b)                                    1
                                                                               X
                            . . . Rn (a, b)}, b))   (16)                  n=         C · k ↵ ⇡ C · ⇣(↵)         (17)
                                                                               k=1


                                                         121
                                           Existential       be done against other reasoners such as Pellet5 and
                      Classes
                                          Restrictions       Konclude6 . These tests were performed against a
                  MSC      HermiT       MSC HermiT           single department dataset for the University On-
    Min.           7.82       0.67       7.30       0.76     tology Benchmark (UOBM) (Ma et al., 2006),
    Max.          19.66     780.20      26.94 2,848.68       containing about 150,000 triples. The UOBM
    Avg.           8.38      19.42       9.14     489.95     TBox was modified to convert cardinality restric-
    Std. Dev.      1.42      90.50       3.24     698.14     tions to existential restrictions, since our current
    Median         8.16       0.79       8.19     135.94     implementation only handles expressivity up to
                                                             SHI. This modified version can be provided upon
Table 2: Times (in seconds) for execution of class           request. The UOBM Tbox contains a total of
and existential restriction queries                          113 named classes and 35 object properties. We
                                                             tested our enhanced MSC implementation against
                                                             all 35 named class queries, and all 3,955 possi-
             1                                               ble single-depth existential restriction queries, and
       1X              1
    E=    k · C · k ↵ ⇡ · C · ⇣(↵               1) (18)      verified 100% agreement between our enhanced
       2               2                                     MSC method implementation and HermiT. In ad-
            k=1

The ratio En is then                                         dition, we recorded the running time for all query
                                                             executions - the results can be seen in Table 2. It is
                      n    2⇣(↵)                             interesting to note that the enhanced MSC method
                        ⇡                          (19)
                      E   ⇣(↵ 1)                             performs much better than HermiT for existential
                                                             restrictions. Also note the large standard deviation
which depends only on ↵. Therefore, if ↵ remains
                                                             found when running a hypertableaux-based rea-
constant, a reduction in the number of edges
                                                             soner like HermiT, where a few queries take a very
necessarily results in a proportional reduction
                                                             long time to finish. This result also suggests the
in the number of (non-zero-degree) nodes, that
                                                             possibility of using enhanced MSC in tandem with
is, in the effective order of the graph. As will be
                                                             a traditional reasoner when dealing with smaller
shown in Section 4, it is this property that enables
                                                             datasets.
the enhanced MSC method to provide extremely
good performance for very large ABoxes.                      4.2    Parallelization
4        Experimental Evaluation                             To test the parallelization of the MSC method,
                                                             we used the c3.8xlarge instances,which provide 32
4.1 Accuracy                                                 virtual CPUs and 60 GBs of storage. We used
To test both the accuracy and to measure paral-              the Lehigh University Benchmark (LUBM) (Guo
lelization speedup of the enhanced MSC method,               et al., 2005) to generate data sets of up to 500 mil-
we set up clusters of compute-optimized instances            lion triples. LUBM was chosen as an initial test
through Amazon Web Services (AWS)2 . The en-                 ontology due to its ability to generate datasets of
hanced MSC method with syntactic condition cor-              varying size, while providing a reasonably expres-
rection was implemented in Java and Scala to work            sive TBox. These data sets were stored using the
over Apache Spark3 , installed over Hadoop HDFS              TitanDB7 graph database interface over an Apache
and YARN.                                                    HBase8 backend. Both TitanDB and HBase were
   For the accuracy tests, we used an in-memory              installed over Hadoop HDFS 2.7. Our prototype
version of our enhanced MSC implementation.                  application over Spark accesses TitanDB through
We set up clusters of two c4.xlarge machines, each           its standard Java interface. Data distribution and
containing 4 virtual CPUs and 7.5GB of mem-                  replication are performed by the database and are
ory, to run the enhanced MSC method, and com-                transparent to our application.
pared the results against the HermiT reasoner ver-              A test was performed to evaluate the scalability
sion 3.8.14 , running on a single c4.xlarge ma-              of the parallel MSC method over the number of
chine; HermiT was chosen as a comparison stan-               triples in the ABox. This test was performed over
dard due to its stability and speed; future tests will          5
                                                                  https://github.com/stardog-union/pellet
     2                                                          6
       aws.amazon.com                                             http://derivo.de/produkte/konclude/
     3                                                          7
       https://spark.apache.org/                                  http://thinkaurelius.github.io/titan/
     4                                                          8
       http://www.hermit-reasoner.com                             http://hbase.apache.org/


                                                       122
                                                                 (a) Efficiency for small number of machines
Figure 1: Execution time vs. data set size with
LUBM datasets, in AWS, for 10 c3.8xlarge ma-
chines with 32 cores each.


a cluster of 10 c3.8xlarge machines in AWS. The
results are shown in the log-log diagram in Figure
1. As can be observed, the method shows clear
sub-linear performance with respect to the size of               (b) Efficiency for large number of machines
the data set, as expected from an algorithm with
linear performance in a sequential machine.               Figure 2: Execution time and efficiency of paral-
                                                          lelization.
   The performance with respect to the number of
machines was evaluated in two parts. First, to
evaluate under small cluster conditions, a 500,000        some measurements. Nevertheless, it is clear that
triple set was assembled and used to test against         the MSC method provides performance compara-
1 to 10 machines. The execution time and the              ble with top-of-the-line database technologies and
efficiency with respect to single-machine execu-          very broad scalability.
tion are shown in Figure 2a. To obtain evaluation            These results demonstrate that the enhanced
for large numbers of machines, we used the ABox           MSC method is inherently parallelizable, enabling
with 500 million triples and ran it against 10 to         it to work with large scale ABoxes. They also
50 machines; to provide a more realistic estima-          show that the fact that in the worst case the en-
tion, the efficiency value was corrected against the      hanced MSC method is in EXPTIME does not
result for 500,000 triples in 10 machines. These          preclude its usefulness with practical ontologies.
higher scalability results are shown in Figure 2b.        In the next section, we evaluate the characteristics
   In terms of raw performance, the parallel en-          of typical ABoxes that support this assertion using
hanced MSC algorithm was capable of performing            random graph models.
instance checking for a dataset with 500 million
triples and over 110 million individual instances         4.3   Random Graph Complexity Analysis
in about 1,240 seconds, or around 20 min., using          For this purpose, we have created ABox role as-
a cluster of 10 machines and a total of 320 execu-        sertion graphs for the same set of ontologies used
tion cores. Using 50 machines, the execution time         for the empirical evaluation in (Xu et al., 2015a),
was 346 seconds, or somewhat less than 6 min-             and have calculated the least-squares fit for their
utes. As a comparison, although the difference            degree distributions. The data sets used are the
in algorithms means that execution times are not          following:
directly comparable, Oracle reports full ABox in-           1. LUBM1 and LUBM4: Lehigh University
ference over 869 million triples in 62 minutes, and             Benchmark ontologies about higher educa-
query performance over this pre-reasoned ABox in                tion entities generated using the tool provided
about 4.3 min., using specialized hardware (W3C,                by (Guo et al., 2005); and
2015). It is also important to note that, since tests       2. Arabidopsis thaliana (AT) and Caenorhabdi-
were performed over an uncontrolled environ-                    tis elegans (CE), two ABoxes built over the
ment, external perturbations could have affected                BioPAX TBox about biomedical pathways in


                                                    123
              n        E         C         ↵     R2                                    Eff.       E
                                                                                        n
 LUBM1     17,174     49,321    19,608   2.051   0.75
 LUBM4     78,579    236,514   162,116   2.276   0.72                     LUBM1            30        15
   AT      42,695     74,680    75,558   2.653   0.92                     LUBM4           142        71
   CE      37,162     68,034   147,066    2.89   0.92                     AT           11,364    11,573
                                                                          CE            9,123     8,240
Table 3: Degree distribution of role assertion                                 #       Max.       Avg.      non-
graphs of common ontologies: The least-squares                                         Size       Size     sngltn
fit approximation for the degree distributions of                 LUBM1      17,160         2      1.00        15
the role assertion graphs of selected common on-                  LUBM4      78,508         2      1.00        71
                                                                  AT         32,911       442      1.30     1,580
tologies is shown. n is the number of nodes; E the                CE         29,913       251      1.24     1,874
number of edges; C and ↵ are the volume and log-
growth of the power law; and R2 is the correlation            Table 4: Component sizes after application of syn-
coefficient of the approximation.                             tactic conditions. The ’non-sngltn” column gives
                                                              the number of components that have more than
      the respective organisms; the TBox and both             one node.
      ABoxes can be provided upon request.
   We have performed the least-squares fit up to              the different conditions on the AT ABox; similar
a maximum degree of 100, which accounts for                   trends can be seen with the other data sources. As
over 99.8% of all nodes in every graph considered.            can be seen, the initial SYN COND condition ac-
Since there usually are discrepancies when the de-            counts for a significant portion of the reduction,
gree is very small or very large (Chung, 2009), this          but still leaves highly connected components, and
truncation eliminates a significant source of inac-           is therefore not sufficient on its own to enable
curacy in the models. Table 3 provides the total              processing of instance checking in realistic time.
number of edges and nodes of the role assertion               Both the disjointness and subclass conditions con-
graph (remember that the number of edges differs              tribute to a further reduction, resulting finally in a
from the total number of properties in the ABoxes             realistic component size for TBox reasoning pur-
due to the conversion to a graph), values for C and           poses. In general, SYN COND SC contributes
↵ in Equation 13, and correlation coefficient R2 .            more of the reduction in size, for all studied on-
Note that the larger graphs approximate a power-              tologies. It should be also noted that a very large
law degree distribution to a very high degree. Note           number of resulting components are singletons,
also that the exponents of the approximations are             which in the MSC method result in subsumption
all between 2 and 3, which is the typical range for           checking based only on class assertions; this can
small-world, scale-free networks (Mika, 2007).                be performed very fast since the class hierarchy of
   We have then calculated the effective order and            the TBox can be precomputed.
log-growth rate of the degree distribution of the                Up to now, we have assumed that the power
ABox role assertion graphs for all ontologies after           law exponent remains constant after application of
application of SYN COND, SYN COND DJ, and                     the various syntactic conditions. It is interesting
SYN COND SC; the results, shown in Table 4,                   then to evaluate the behavior of the exponent in
demonstrate that these conditions indeed produce              the power law distribution of equation 13. Figure
a drastic reduction in the effective order of the role        3 shows this for the AT data source on a log-log
assertion graph and thus in the size of the result-           scale; as can be seen, other than significant de-
ing connected components, even if a giant compo-              viations at very low degree values, the exponent
nent still appears. Observe in particular that for            remains relatively constant.
the LUBM ontologies, the reduction is so signifi-
cant that only a very small number of individuals             5    Discussion and Future Work
remain connected to others.
   It is interesting to examine the relative impact           The enhanced MSC method is inherently paral-
of the various syntactic conditions. Note that                lelizable, since instance checking for every indi-
SYN COND must always be applied, since with-                  vidual in the ABox can be performed indepen-
out it, the other conditions do not make sense.               dently. Coupled with recent advances in clus-
Table 5 shows the impact for the application of               ter computing such as Apache Spark, large triple


                                                        124
                                       Graph size                            Components
                                  Effective         E              #          Max.       Avg.     non-
                                      n                                       Size       Size    sngltn
          SYN COND                   33,235      33,015            12,447      2,411      3.43     2,987
          SYN COND DJ                23,377      22,374            22,161        862      1.93     2,843
          SYN COND SC                19,197      19,099            25,801      1,695      1.66     2,303
          SYN COND ALL               11,364      11,573            32,911        442      1.30     1,580

Table 5: Component sizes after application of different syntactic conditions on the AT data set.
SYN COND ALL refers to the application of all conditions.


                                                              SPARQL query answering will also enable us to
                                                              perform tests with other existing benchmarks such
                                                              as the DBPedia SPARQL Benchmark (DBPSB).
                                                              In addition, we are working on improvements in
                                                              the efficiency of the parallelization of the MSC
                                                              method. In particular, we are looking into com-
                                                              bining multiple individuals that form part of the
                                                              same connected component in the ABox role as-
                                                              sertion graph in the same parallel task, since it can
                                                              be seen in Definitions 2 and 3 that portions of the
Figure 3: Power law distribution of AT data source            MSC computation can be shared among individu-
after application of different syntactic conditions.          als provided that they are connected to each other.

                                                              6   Conclusion
stores can be queried efficiently using commodity             In this paper, we have presented a parallel imple-
hardware clusters or cloud platforms.                         mentation of the enhanced MSC method, and we
   Even as worst-case complexity of the method                have evaluated execution time and efficiency as we
is in exponential time for a single sequential ma-            varied the size of the data and the number of pro-
chine, and thus intractable for parallelization, the          cessors used. Since the method performs indepen-
method performs within realistic time in practi-              dent checking for every individual in the ABox, it
cal, real-world ontologies, which typically present           is inherently parallelizable. The results show sub-
a power-law degree distribution. As discussed in              linear performance with respect to the size of the
our random graph theory analysis, this is due to the          ABox, which stems from its performance in linear
ability of the method to reduce the size of the con-          time in the computation of the MSCs, and almost
nected components formed within the ABox Role                 constant time for reasoning due to the small size
Assertion Graph, which in turn means a reduction              of the resulting MSC.
in the size of the MSCs generated for the individ-
                                                              Acknowledgements
uals in the ABox. Our method is thus most ef-
fective when the different syntactic conditions can           This work is supported by grant # R44GM097851
be applied to effect this reduction, and will possi-          from the National Institute of General Medical
bly prove less beneficial if most of the roles in the         Sciences (NIGMS), part of the U.S. National In-
TBox are engaged in axioms of the form of equa-               stitutes of Health (NIH).
tion (9).
   We are currently exploring the use of the                  References
method to address full conjunctive queries over
                                                              Franz Baader, Diego Calvanese, Deborah L. McGuin-
DL knowledge bases expressed in the SPARQL
                                                                ness, Daniele Nardi, and Peter F. Patel-Schneider,
query language. This requires the expansion of the              editors. 2003. The description logic handbook: the-
MSC method to retrieve individuals a and b that                 ory, implementation, and applications. Cambridge
can be inferred to be in a relation R(a, b). This               University Press, New York, NY, USA.
extension is being implemented based on Theorem               Diego Calvanese, Giuseppe De Giacomo, Domenico
4.1 in (Xu et al., 2015a). The implementation of                Lembo, Maurizio Lenzerini, and Riccardo Rosati.


                                                        125
  2013. Data complexity of query answering in de-           Boris Motik, Rob Shearer, and Ian Horrocks. 2007.
  scription logics. Artificial Intelligence 195:335–          Optimized Reasoning in Description Logics Using
  360. https://doi.org/10.1016/j.artint.2012.10.003.          Hypertableaux. In Frank Pfenning, editor, Auto-
                                                              mated Deduction – CADE-21, Springer Berlin Hei-
Fan Chung. 2009. A whirlwind tour of random graphs.           delberg, number 4603 in Lecture Notes in Computer
  In Robert A. Meyers, editor, Encyclopedia of Com-           Science, pages 67–83.
  plexity and Systems Science, Springer, pages 7493–
  7505.                                                     Ralf Möller, Volker Haarslev, and Michael Wessel.
                                                              2007. On the Scalability of Description Logic
F Donini and A Era. 1992. Most specific concepts for          Instance Retrieval. In Christian Freksa, Michael
  knowledge bases with incomplete information. In             Kohlhase, and Kerstin Schill, editors, KI 2006:
  Proceedings of CIKM. Baltimore, MD, USA, pages              Advances in Artificial Intelligence, Springer Berlin
  545–551.                                                    Heidelberg, number 4314 in Lecture Notes in Com-
                                                              puter Science, pages 188–201.
Francesco M. Donini. 2003. Complexity of Reason-
  ing. In Franz Baader, Diego Calvanese, Debo-              Bernhard Nebel. 1990. Reasoning and Revision in Hy-
  rah L. McGuinness, Daniele Nardi, and Peter F.              brid Representation Systems. In Lecture Notes in
  Patel-Schneider, editors, The description logic hand-       Artificial Intelligence. Springer-Verlag.
  book: theory, implementation, and applications,
  Cambridge University Press, New York, NY, USA,            Mark E. J. Newman. 2002. Random graphs as models
  pages 96–136.                                              of networks. In Stefan Bornholdt and Hans Georg
                                                             Schuster, editors, Handbook of Graphs and Net-
Francesco M. Donini, Maurizio Lenzerini, Daniele             works, Wiley-VCH Verlag GmbH & Co. KGaA,
  Nardi, and Andrea Schaerf. 1994. Deduction in              pages 35–68.
  Concept Languages: from Subsumption to Instance
  Checking. J Logic Computation 4(4):423–452.               Sambhawa Priya, Yuanbo Guo, Michael Spear, and
  https://doi.org/10.1093/logcom/4.4.423.                     Jeff Heflin. 2014. Partitioning OWL Knowledge
                                                              Bases for Parallel Reasoning. IEEE, pages 108–115.
P. Erdős and A. Rényi. 1960. On the Evolution of Ran-       https://doi.org/10.1109/ICSC.2014.34.
   dom Graphs. In Publication of the Mathematical In-
   stitute of the Hungarian Academy of Sciences. pages      Andrea Schaerf. 1994. Reasoning with individuals in
   17–61.                                                     concept languages. Data & Knowledge Engineering
                                                              13(2):141–176.
Birte Glimm, Ian Horrocks, Carsten Lutz, and Uli Sat-
                                                            Stephan Tobies. 2001. Complexity Results and Practi-
  tler. 2007. Conjunctive Query Answering for the
                                                               cal Algorithms for Logics in Knowledge Represen-
  Description Logic SHIQ. In Proceedings of the 20th
                                                               tation. arXiv:cs/0106031 ArXiv: cs/0106031.
  International Joint Conference on Artifical Intelli-
  gence. Morgan Kaufmann Publishers Inc., San Fran-         W3C. 2015.    Large Triple Stores - W3c Wiki.
  cisco, CA, USA, IJCAI’07, pages 399–404.                   https://www.w3.org/wiki/LargeTripleStores.
Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005.          Sebastian Wandelt and Ralf Möller. 2012. Towards
  LUBM: A Benchmark for OWL Knowledge Base                    ABox Modularization of Semi-expressive Descrip-
  Systems. Web Semant. 3(2-3):158–182.                        tion Logics. Appl. Ontol. 7(2):133–167.
Ian Horrocks. 2008.         Ontologies and the Se-          Jia Xu, Patrick Shironoshita, Ubbo Visser, Nigel John,
   mantic Web.         Commun. ACM 51(12):58–67.               and Mansur Kabuka. 2013. Extract ABox Modules
   https://doi.org/10.1145/1409360.1409377.                    for Efficient Ontology Querying. arXiv:1305.4859
                                                               [cs] ArXiv: 1305.4859.
Ian Horrocks and Sergio Tessaris. 2000. A conjunctive
   query language for description logic ABoxes. In In       Jia Xu, Patrick Shironoshita, Ubbo Visser, Nigel
   In AAAI/IAAI. pages 399–404.                                John, and Mansur Kabuka. 2015a.            Convert-
                                                               ing Instance Checking to Subsumption: A Re-
Li Ma, Yang Yang, Zhaoming Qiu, Guotong Xie, Yue               think for Object Queries over Practical On-
   Pan, and Shengping Liu. 2006. Towards a Com-                tologies.     International Journal of Intelligence
   plete OWL Ontology Benchmark. In The Semantic               Science 05(01):44–62.          ArXiv:    1412.7585.
   Web: Research and Applications. Springer, Berlin,           https://doi.org/10.4236/ijis.2015.51005.
   Heidelberg, pages 125–139.
                                                            Jia Xu, Patrick Shironoshita, Ubbo Visser, Nigel
Peter Mika. 2007. Social Networks and the Seman-               John, and Mansur Kabuka. 2015b. Module Ex-
  tic Web, volume 5 of Semantic Web and Beyond.                traction for Efficient Object Queries over On-
  Springer US, Boston, MA.                                     tologies with Large ABoxes.       AIA 2(1):8–31.
                                                               https://doi.org/10.15764/AIA.2015.01002.
Michael Molloy and Bruce Reed. 1995. A critical point
  for random graphs with a given degree sequence.
  Random Struct. Alg. 6(2-3):161–180.


                                                      126