Enumerating Justifications using Resolution

Yevgeny Kazakov and Peter Skočovský
The University of Ulm, Germany
{yevgeny.kazakov, peter.skocovsky}@uni-ulm.de

Abstract. We propose a new procedure that can enumerate justifications of a logical entailment given a set of inferences using which this entailment can be derived from axioms in the ontology. The procedure is based on the extension of the resolution method with so-called answer literals. In comparison to other (SAT-based) methods for enumerating justifications, our procedure can enumerate justifications in any user-defined order that extends the subset relation. The procedure is easy to implement and can be parametrized with ordering and selection strategies used in resolution. We describe an implementation of the procedure provided in PULi, a new Java-based Proof Utility Library, and provide an empirical comparison of (several strategies of) our procedure and other SAT-based tools on popular EL ontologies. The experiments show that our procedure provides a comparable, if not better, performance than those highly optimized tools. For example, using one of the strategies, we were able to compute all justifications for all direct subsumptions of Snomed CT in about 1.5 hours. No other tool used in our experiments was able to do this even within a much longer period.

1 Introduction and Motivation

Axiom pinpointing, or computing justifications (minimal subsets of axioms of the ontology that entail a given logical consequence) has been a widely studied research topic in ontology engineering [2–4, 10, 11, 17, 19, 21, 23]. Most of the recent methods focus on the so-called EL family of description logics (DLs), in which logical consequences can be proved by deriving new axioms from existing ones using inference rules. The resulting inferences are usually encoded as propositional (Horn) clauses, and justifications are computed from them using (modifications of) SAT solvers.
To ensure correctness, the input inference set must be complete, that is, the inferences must suffice to derive the consequence from any subset of the ontology from which it follows. In this paper, we present a new resolution-based procedure that enumerates all justifications of an entailment given a complete set of inferences. Apart from the completeness requirement, the form of inferences can be arbitrary and does not depend on any logic. For example, our method can be used with the inferences provided by existing consequence-based procedures [5, 14, 16, 22]. The procedure can enumerate justifications in any given order, provided it extends the proper subset relation on sets of axioms. The performance of the procedure depends on the strategy it follows while enumerating justifications. We have empirically evaluated three simple strategies and experimentally compared our procedure with other highly optimized justification computation tools.

The paper is organized as follows. In Section 2 we describe related work. Section 3 defines basic notions such as inferences and resolution. In Section 4 we present the new procedure, and in Section 5 we describe its implementation and empirical evaluation.

2 Related Work

There are, generally, two kinds of procedures for computing justifications [19] using a reasoner. Black-Box procedures use a reasoner solely for entailment checking, and thus can be used with any reasoner and DL. Glass-Box procedures require some additional information from the reasoner, such as the inferences that the reasoner has used, and thus can only work with reasoners that provide it.

In a nutshell, Black-Box procedures [6, 11–13] add or remove axioms from a set of axioms (a justification candidate) while maintaining the invariant that the set entails the given goal logical consequence. An external reasoner is used for checking this invariant.
Finding one justification is relatively easy and can be done by trying to remove each axiom once and checking whether the result still entails the goal. If it does not, the removed axiom is added back. Provided the entailment check takes polynomial time in the number of all axioms, e.g., in the case of EL, this procedure runs in polynomial time. However, finding all justifications is hard even in the case of EL, because there may be exponentially many of them [6]. When a procedure for finding one justification is extended to a procedure for justification enumeration, the main problem is how to prevent repetition of justifications. Most techniques rely on the hitting set duality that was introduced in the field of Model-Based Diagnosis [9, 20] and later adapted to DLs [10, 13]. Given a collection of sets, a hitting set contains some element from each of them. A minimal hitting set of the set of all justifications of some goal is a repair: when it is removed from the ontology, the goal is no longer entailed. Given a number of justifications of some goal, we can find a new one by removing one of their hitting sets from the ontology and applying the procedure for finding one justification.

Many Glass-Box procedures focus on EL, because EL reasoning is relatively simple and efficient. Some of them employ propositional satisfiability (SAT) solvers. Their input is the set of inferences recorded by the reasoner during classification of an ontology. The inferences are used to reduce entailment to satisfiability of Horn clauses. This approach was first used by EL+SAT [21, 23], which includes an EL reasoner that records the inferences, plus a few optimizations for reducing the set of inferences relevant to the goal. A common feature of the SAT-based procedures is enumeration of candidate axiom sets exploiting the hitting set duality and further minimization of these candidates.
EL+SAT and EL2MUS [4] use two instances of a SAT solver: one for enumeration of candidates and the other one for checking whether a candidate entails the goal. SATPin [17] uses one SAT solver for both of these tasks and encodes a candidate as the set of axiom atoms currently assigned true. Enumeration of the same candidate twice is avoided by adding a blocking clause to the enumerating solver. If a candidate contains a justification, its blocking clause consists of the negated atoms that encode the axioms of that justification. EL+SAT and SATPin block candidates that do not entail the goal in the same way as justifications. EL2MUS does it differently: when its specialized enumeration solver finds a non-entailing candidate, the complement of that candidate is automatically a repair, and the repair is blocked by adding a clause consisting of the positive atoms of its axioms. Further differences are that in EL2MUS the entailment checking solver is specialized for Horn clauses, and that EL+SAT and SATPin extract justifications by a deletion-based procedure, while EL2MUS uses a more efficient insertion-based procedure. EL2MCS [3] uses MaxSAT [1, 18] to enumerate all repairs and extracts justifications from them using the hitting set duality. BEACON [2] is a tool that integrates the justification procedure of EL2MUS.

Up to a few optimizations, the mentioned SAT-based procedures use inferences only for the entailment check. Had they delegated the entailment check to a separate DL reasoner, they could be considered Black-Box. Our approach uses a similar encoding of the derivability check; however, it does not rely on the hitting set duality and, instead, manipulates the inferences directly.

3 Preliminaries

3.1 Inferences, Supports, and Justifications

In this section, we introduce a general type of inferences that can manipulate arbitrary objects, which we call axioms. An inference is an expression inf of the form ⟨α1, …, αn ⊢ α⟩, where α1, …
, αn is a (possibly empty) sequence of axioms called the premises of inf, and α is an axiom called the conclusion of inf. An ontology is a finite set of axioms. Let I be an inference set. An I-derivation from O is a sequence of inferences d = ⟨inf1, …, infk⟩ from I such that for every i with 1 ≤ i ≤ k and each premise α′ of infi that is not in O, there exists j < i such that α′ is the conclusion of infj. An axiom α is derivable from O using I (notation: O ⊢I α) if either α ∈ O or there exists an I-derivation d = ⟨inf1, …, infk⟩ from O such that α is the conclusion of infk. A support for O ⊢I α is a subset O′ ⊆ O such that O′ ⊢I α. A subset-minimal support for O ⊢I α is called a justification for O ⊢I α.

Suppose that ⊨ is an entailment relation between ontologies and axioms. A support for O ⊨ α is a subset O′ ⊆ O such that O′ ⊨ α. A subset-minimal support for O ⊨ α is called a justification for O ⊨ α (also called a minimal axiom set, MinA [6]). An inference ⟨α1, …, αn ⊢ α⟩ is sound if {α1, …, αn} ⊨ α. A set of inferences I is complete for the entailment O ⊨ α if O′ ⊨ α implies O′ ⊢I α for every O′ ⊆ O. Note that if I is a set of sound inferences that is complete for the entailment O ⊨ α, then O′ ⊨ α iff O′ ⊢I α for every O′ ⊆ O. In particular, supports and justifications for O ⊢I α coincide with supports and, respectively, justifications for O ⊨ α.

Example 1. Consider the ontology Oe = {A ⊑ B ⊓ C; A ⊑ B; A ⊑ C} over DL axioms and the axiom αe = A ⊑ C ⊓ B. Assume that some consequence-based procedure performed the inferences Ie = {⟨A ⊑ B ⊓ C ⊢ A ⊑ B⟩, ⟨A ⊑ B ⊓ C ⊢ A ⊑ C⟩, ⟨A ⊑ C, A ⊑ B ⊢ A ⊑ C ⊓ B⟩} in order to derive αe from Oe. It is easy to see that Oe′ = {A ⊑ B; A ⊑ C} ⊢Ie αe and Oe′′ = {A ⊑ B ⊓ C} ⊢Ie αe, but {A ⊑ B} ⊬Ie αe and {A ⊑ C} ⊬Ie αe. Hence, Oe′ and Oe′′ are justifications for Oe ⊢Ie αe. All inferences in Ie are also sound for the standard entailment relation ⊨.
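The derivability relation just defined can be decided by a simple fixpoint computation. The following sketch (illustrative Python; the pair encoding of inferences and all names are ours, not part of the paper's implementation) checks O ⊢I α by forward chaining over the inference set:

```python
from typing import FrozenSet, List, Tuple

# Illustrative encoding (ours): an inference is a pair (premises, conclusion);
# axioms can be any hashable objects, here strings.
Inference = Tuple[Tuple[str, ...], str]

def derives(ontology: FrozenSet[str], inferences: List[Inference], goal: str) -> bool:
    """Check O |-_I goal: saturate the set of derivable axioms to a fixpoint."""
    derived = set(ontology)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in inferences:
            # An inference fires once all of its premises are derivable.
            if conclusion not in derived and all(p in derived for p in premises):
                derived.add(conclusion)
                changed = True
    return goal in derived
```

On the data of Example 1, this confirms, for instance, that {A ⊑ B, A ⊑ C} derives A ⊑ C ⊓ B while {A ⊑ B} alone does not.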
Since Oe′ and Oe′′ are the only two justifications for Oe ⊨ αe, the inference set Ie is complete for the entailment Oe ⊨ αe.

Fig. 1. Propositional resolution and factoring rules:

  Resolution: from c ∨ a and ¬a ∨ d derive c ∨ d.
  Factoring: from c ∨ a ∨ a derive c ∨ a.

3.2 Resolution with Answer Literals

In this section we introduce the resolution calculus, which is a popular method for automated theorem proving [7]. We will mainly use resolution for propositional Horn clauses. A (propositional) atom is a propositional variable a. A (propositional) literal is either an atom l = a (positive literal) or a negation of an atom l = ¬a (negative literal). A (propositional) clause is a disjunction of literals c = l1 ∨ ⋯ ∨ ln, n ≥ 0. As usual, we do not distinguish between the order of literals in clauses, i.e., we associate clauses with multisets of literals. Given two clauses c1 and c2, we denote by c1 ∨ c2 the clause consisting of all literals from c1 and all literals from c2. A clause is Horn if it has at most one positive literal. The empty clause □ is the clause containing n = 0 literals.

The inference rules for the propositional resolution calculus are given in Figure 1. We say that a set of clauses S is closed under the resolution rules if every clause derived by the rules in Figure 1 from S already occurs in S. The resolution calculus is refutationally complete: a set of clauses S that is closed under the resolution rules is satisfiable if and only if it does not contain the empty clause. This means that for checking satisfiability of the input set of clauses, it is sufficient to deductively close this set of clauses under the resolution rules and check whether the empty clause is derived in the closure. To reduce the number of resolution inferences (and hence the size of the closure), several refinements of the resolution calculus were proposed. The rules in Figure 1 can be restricted using orderings and selection functions [7].
In particular, for Horn clauses, it is sufficient to select one (positive or negative) literal in each clause, and require that the resolution inferences are applied only on those literals (Theorem 7.2 in [7]).¹ This strategy is called resolution with free selection. In addition to rule restrictions, one can also use a number of simplification rules that can remove or replace clauses in the closure S. We will use two such rules. Elimination of duplicate literals removes all duplicate literals from a clause (including duplicate negative literals). Subsumption deletion removes a clause c from S if there exists another clause c′ in S that is a sub-clause of c, i.e., c = c′ ∨ c′′ for some (possibly empty) clause c′′. In this case we say that c′ subsumes c.

Example 2. Consider the set of Horn clauses 1-7 below. We apply resolution with free selection that selects the underlined literals in clauses. Clauses 8-10 are obtained by resolution inferences from the clauses shown on the right.

1: ¬p1 ∨ p2
2: ¬p1 ∨ p3
3: ¬p2 ∨ ¬p3 ∨ p4
4: p1
5: p2
6: p3
7: ¬p4
8: ¬p3 ∨ p4  ⟨3, 5⟩
9: p4  ⟨6, 8⟩
10: □  ⟨7, 9⟩

Note that the resolution rule was not applied, e.g., to clauses 3 and 6, because the literal ¬p3 in clause 3 is not selected. Also note that many clauses in the closure above can be derived by several resolution inferences. For example, clause 5 can be obtained by resolving clauses 1 and 4, and clause 6 by resolving 2 and 4. Therefore the empty clause 10 can be derived from several subsets of the original clauses 1-7.

¹ Note that the factoring rule cannot apply to Horn clauses.

Although the resolution calculus is mainly used for checking satisfiability of a clause set, it can also be used for finding unsatisfiable subsets of clauses. To do this, it is sufficient to add to every input clause a fresh positive answer literal [8]. Resolution rules can then be applied to the extended clauses on the remaining (ordinary) literals using the usual orderings and selection functions.
Thus, if a clause with answer literals is derived, then this clause with the answer literals removed can be derived from the clauses for which the answer literals were introduced. In particular, if a clause containing only answer literals is derived, then the set of clauses that corresponds to these answer literals is unsatisfiable. Completeness of resolution means that all such unsatisfiable sets of clauses can be found in this way. If answer literals are added to some but not all clauses and a clause with only answer literals is derived, then the set of clauses that corresponds to the answer literals plus the clauses without answer literals is unsatisfiable.

Example 3. Consider the clauses 1-7 from Example 2. Let us add answer literals a1-a3 to clauses 4-6 and apply the resolution inferences on the remaining (underlined) literals as in Example 2, eliminating duplicate literals if they appear.

1: ¬p1 ∨ p2
2: ¬p1 ∨ p3
3: ¬p2 ∨ ¬p3 ∨ p4
4: p1 ∨ a1
5: p2 ∨ a2
6: p3 ∨ a3
7: ¬p4
8: p2 ∨ a1  ⟨1, 4⟩
9: p3 ∨ a1  ⟨2, 4⟩
10: ¬p3 ∨ p4 ∨ a2  ⟨3, 5⟩
11: ¬p3 ∨ p4 ∨ a1  ⟨3, 8⟩
12: p4 ∨ a2 ∨ a3  ⟨6, 10⟩
13: p4 ∨ a1 ∨ a2  ⟨9, 10⟩
14: p4 ∨ a1 ∨ a3  ⟨6, 11⟩
15: p4 ∨ a1  ⟨9, 11⟩
16: a2 ∨ a3  ⟨7, 12⟩
17: a1 ∨ a2  ⟨7, 13⟩
18: a1 ∨ a3  ⟨7, 14⟩
19: a1  ⟨7, 15⟩

Clauses 16-19 contain only answer literals, so the corresponding sets of clauses are unsatisfiable in conjunction with the input clauses without answer literals. For example, clause 16 means that clauses 1-3, 5-7 are unsatisfiable, and clause 19 means that clauses 1-4, 7 are also unsatisfiable. Note that clause 19 subsumes clauses 17-18; if subsumed clauses are deleted, we obtain only clauses with answer literals that correspond to minimal subsets of clauses 4-6 that are unsatisfiable in conjunction with the remaining input clauses 1-3, 7.
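The refutational use of resolution described in this section can be sketched as follows (illustrative Python; the (neg, pos) encoding of Horn clauses is our simplification, and no ordering or selection refinements are applied): saturate the clause set under binary resolution and report unsatisfiability iff the empty clause appears.

```python
# Illustrative sketch (ours): a Horn clause is (neg, pos) with neg a frozenset
# of negated atoms and pos the positive atom or None; the empty clause is
# (frozenset(), None). No ordering/selection refinements are applied.

def unsatisfiable(clauses):
    """Naive saturation under binary resolution for Horn clauses."""
    closure = set(clauses)
    changed = True
    while changed:
        changed = False
        snapshot = list(closure)
        for neg1, pos1 in snapshot:
            if pos1 is None:
                continue  # no positive literal to resolve on
            for neg2, pos2 in snapshot:
                if pos1 in neg2:
                    # Resolve on pos1: remove the complementary pair,
                    # keep all remaining literals of both clauses.
                    resolvent = (frozenset(neg1 | (neg2 - {pos1})), pos2)
                    if resolvent not in closure:
                        closure.add(resolvent)
                        changed = True
    return (frozenset(), None) in closure
```

On clauses 1-7 of Example 2, the saturation derives the empty clause, as expected.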
4 Enumerating Justifications using Resolution

In this section, we present a new procedure that, given an ontology O, an inference set I and a goal axiom αg, enumerates justifications for O ⊢I αg. It uses the usual reduction of the derivability problem O ⊢I αg to satisfiability of propositional Horn clauses, in combination with the resolution procedure with answer literals.

Given the derivability problem O ⊢I αg, we assign to each axiom αi occurring in I a fresh atom pαi. Each inference ⟨α1, …, αn ⊢ α⟩ ∈ I is then translated to the Horn clause ¬pα1 ∨ ⋯ ∨ ¬pαn ∨ pα. In addition, for each axiom α ∈ O that appears in I, we introduce a (unit) clause pα. Finally, we add the clause ¬pαg encoding the assumption that αg is not derivable. It is easy to see that O ⊢I αg if and only if the resulting set of clauses is unsatisfiable.

We now extend this reduction to find justifications for O ⊢I αg. Recall that a subset O′ ⊆ O is a support for O ⊢I αg if O′ ⊢I αg. Hence, the set of clauses pα for α ∈ O′ is unsatisfiable in combination with the clauses encoding the inferences and ¬pαg. We can find all such minimal subsets (corresponding to justifications) by adding a fresh answer literal to every clause pα with α ∈ O, and applying resolution on non-answer literals together with elimination of redundant clauses.

Example 4. Consider the ontology Oe, inferences Ie and axiom αe from Example 1. To encode the derivability problem Oe ⊢Ie αe, we assign atoms p1-p4 to the axioms occurring in Ie as follows: p1: A ⊑ B ⊓ C, p2: A ⊑ B, p3: A ⊑ C, p4: A ⊑ C ⊓ B. The encoding produces clauses 1-7 from Example 3: the inferences Ie are encoded by clauses 1-3, the axioms in Oe result in clauses 4-6 with answer literals, and the assumption that αe is not derivable is encoded by clause 7.
The derived clauses 16-19 correspond to supports of Oe ⊢Ie αe, and by eliminating the redundant clauses 17-18, we obtain clauses 16 and 19, which correspond to the justifications Oe′ and Oe′′ from Example 1.

One disadvantage of the described procedure is that it requires the closure under the resolution rules to be fully computed before any justification can be found. Indeed, since derived clauses may be subsumed by later clauses, one cannot immediately see whether a clause with only answer literals corresponds to a justification or not. For example, clauses 17-18 in Example 3 are subsumed by clause 19, which is derived later, and hence do not correspond to justifications. We address this problem by using non-chronological application of resolution inferences. Intuitively, instead of processing clauses in the order in which they are derived, we process clauses containing fewer answer literals first. Thus, in Example 3, we process clause 15 before clauses 12-14.

The improved procedure can enumerate justifications, i.e., return justifications one by one without waiting for the algorithm to terminate. The procedure is described in Algorithm 1. It is a minor variation of the standard saturation-based procedure for computing the closure under (resolution) rules, which uses a priority queue to store unprocessed clauses instead of an ordinary queue. Let ⪯ be a total preorder on clauses (a transitive reflexive relation for which every two clauses are comparable). As usual, we write c1 ≺ c2 if c1 ⪯ c2 but c2 ⋠ c1. We say that ⪯ is admissible if c1 ≺ c2 whenever the set of answer literals of c1 is a proper subset of the set of answer literals of c2. For example, it is required that ¬p3 ∨ p4 ∨ a1 ≺ p4 ∨ a1 ∨ a2, but not necessary that p4 ∨ a1 ≺ p4 ∨ a2 ∨ a3. Note that if c is derived by resolution from clauses c1 and c2, then c1 ⪯ c and c2 ⪯ c, since c contains the answer literals of both c1 and c2. We say that a clause d (not necessarily occurring in Q) is minimal w.r.t.
Q if there exists no clause c ∈ Q such that c ≺ d. A priority queue based on ⪯ is a queue in which the remove operation can take out only minimal elements w.r.t. Q.² Given such a queue Q, Algorithm 1 initializes it with the translation of the input problem O ⊢I α (line 3), and then repeatedly applies resolution inferences between minimal clauses taken out of this queue (loop 4-13) and the clauses in S that were processed before. Specifically, the removed minimal clause c is first simplified by removing duplicate literals (line 6) and then checked for subsumption by previously processed clauses in S (in particular, whether c itself was processed before). If c is subsumed by some c′ ∈ S, it is ignored and the next (minimal) clause is taken from the queue Q. Otherwise, c is added to S (line 8). If c contains only answer literals, then it corresponds to a justification (as we show next), which is then reported by the algorithm (line 10). Otherwise, resolution inferences are applied on the selected non-answer literal in c (line 12).

Algorithm 1: Enumeration of justifications using resolution
Enumerate(O ⊢I α, ⪯): enumerate justifications for O ⊢I α
input: O ⊢I α – the problem for which to enumerate justifications,
       ⪯ – an admissible preorder on clauses
 1  S ← createEmptyList()           // for processed clauses
 2  Q ← createEmptyQueue(⪯)         // for unprocessed clauses
 3  Q.addAll(encode(O ⊢I α))        // add the clause encoding of the problem
 4  while Q ≠ ∅ do
 5      c ← Q.remove()              // take one minimal element out of the queue
 6      c ← simplify(c)             // remove duplicate literals from c
 7      if c is not subsumed by any c′ ∈ S then
 8          S.add(c)
 9          if c contains only answer literals then
10              report decode(c)    // a new justification is found
11          else                    // apply resolution rules to c and clauses in S
12              for c′ ∈ resolve(c, S) do
13                  Q.add(c′)

² If there are several minimal elements in the queue, one of them is chosen arbitrarily.
The new clauses derived by resolution are then added to Q (line 13) and the loop continues.

We now prove that Algorithm 1 in line 10 always returns a (new) justification. It is easy to see that if a clause d was minimal w.r.t. Q at the beginning of the while loop (line 4), then it remains minimal w.r.t. Q at the end of the loop (line 13). Indeed, for the clause c taken from the queue (line 5), we have c ⊀ d. For all clauses c′ obtained by resolving c with clauses from S (line 12), we have c ⪯ c′. Hence c′ ⊀ d for all c′ added to Q (line 13) (for otherwise c ⪯ c′ ≺ d). This, in particular, implies that each clause in S is always minimal w.r.t. Q and, consequently, if c1 was added to S before c2, then c1 ⪯ c2 (for otherwise c2 ≺ c1 and c1 would not be minimal w.r.t. Q when c2 ∈ Q). Hence, there cannot be two clauses c1 and c2 in S that contain only answer literals such that c1 is a proper sub-clause of c2: in this case c1 ≺ c2, thus c2 must be added to S after c1, but then c2 would be subsumed by c1 (see line 7). Hence each result returned in line 10 is a (new) justification.

Since clauses are added to S in the order defined by ⪯, the justifications are also returned according to this order. Hence Algorithm 1 can return justifications in any user-defined order ⪯ on subsets of axioms as long as s1 ⊊ s2 implies s1 ≺ s2. Indeed, any such order ⪯ can be lifted to an admissible order on clauses by comparing the sets of answer literals of clauses like the corresponding sets of axioms. For example, one can define s1 ⪯ s2 by ||s1|| ≤ ||s2||, where ||s|| is the cardinality of s. Instead of ||s|| one can use any other measure m(s) that is monotonic over the proper subset relation (i.e., s1 ⊊ s2 implies m(s1) < m(s2)), e.g., the length of s, that is, the total number of symbols needed to write down all axioms in s.
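To make the above concrete, here is a minimal end-to-end sketch of the encoding and of Algorithm 1 (illustrative Python; the triple clause representation, the fixed free-selection strategy and the cardinality order are our simplifying assumptions, and the pairwise subsumption test stands in for the optimized data structures of the actual implementation):

```python
import heapq
from itertools import count

# Illustrative sketch (ours): a clause is (neg, pos, ans) with neg a frozenset
# of negated atoms, pos the positive atom or None, and ans a frozenset of
# answer literals (here: the ontology axioms themselves). The admissible
# order compares the number of answer literals, so justifications come out
# in order of increasing cardinality.

def encode(ontology, inferences, goal):
    """Translate the derivability problem into Horn clauses with answer literals."""
    clauses = [(frozenset(ps), c, frozenset()) for ps, c in inferences]
    occurring = {a for ps, c in inferences for a in ps} | {c for _, c in inferences}
    clauses += [(frozenset(), a, frozenset([a])) for a in ontology & occurring]
    clauses.append((frozenset({goal}), None, frozenset()))  # assume goal underivable
    return clauses

def select(clause):
    """Free selection: a negative literal if any, else the positive one."""
    neg, pos, _ = clause
    return ('-', min(neg)) if neg else ('+', pos)

def subsumes(c1, c2):
    """Every literal of c1 (including answer literals) occurs in c2."""
    return (c1[0] <= c2[0] and c1[2] <= c2[2]
            and (c1[1] is None or c1[1] == c2[1]))

def enumerate_justifications(ontology, inferences, goal):
    """Yield justifications for O |-_I goal, smallest answer-literal sets first."""
    queue, tie = [], count()
    for c in encode(ontology, inferences, goal):
        heapq.heappush(queue, (len(c[2]), next(tie), c))
    processed = []
    while queue:
        _, _, c = heapq.heappop(queue)
        if any(subsumes(s, c) for s in processed):
            continue                       # line 7 of Algorithm 1
        processed.append(c)
        if not c[0] and c[1] is None:      # only answer literals remain
            yield set(c[2])                # line 10: decode a justification
            continue
        sign, atom = select(c)             # resolve only on the selected literal
        for s in processed:
            s_sign, s_atom = select(s)
            if s_atom != atom or s_sign == sign:
                continue
            if sign == '-':   # c contains the negated atom, s the positive one
                r = ((c[0] - {atom}) | s[0], c[1], c[2] | s[2])
            else:             # c contains the positive atom, s the negated one
                r = (c[0] | (s[0] - {atom}), s[1], c[2] | s[2])
            heapq.heappush(queue, (len(r[2]), next(tie), r))
```

On the data of Example 1, this yields first {A ⊑ B ⊓ C} and then {A ⊑ B, A ⊑ C}, matching the cardinality order.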
5 Implementation and Evaluation

We have implemented Algorithm 1 as a part of the new Java-based Proof Utility Library (PULi).³ In our implementation, we used the standard Java priority queue for Q, and employed a few optimisations to improve the performance of the algorithm.

First, we have noticed that our implementation spends over 95% of its time on checking subsumptions in line 7. To improve subsumption checks, we developed a new data structure for storing sets of elements and checking whether a given set is a superset of some stored set. In a nutshell, we index the sets by 128-bit vectors, represented as pairs of 64-bit integers, where each element in a set sets one bit of the bit vector to 1 using its hash value. This idea is reminiscent of Bloom filters.⁴ We store the sets in a trie⁵ with the bit vector as the key, and use bitwise operations to determine whether one vector has all bits of the other vector, which gives us a necessary condition for set inclusion. Using this data structure, we were able to significantly improve the subsumption tests.

We have also noticed that the queue Q often contains about 10 times more elements than the closure S. To improve memory consumption, we do not create the resolvents c′ immediately (see line 12), but instead store in the queue Q the pairs of clauses (from S) from which these resolvents were obtained. This does not reduce the number of elements in the queue, but it reduces the memory consumed by each element to essentially a few pointers.

We have evaluated our implementation on inferences computed for entailed axioms in some large EL ontologies, and compared its performance with the SAT-based tools for enumeration of justifications from inferences: EL2MUS [4], EL2MCS [3] and SATPin [17]. The inferences were extracted by EL+SAT [21, 23] (in the following called sat inferences) and ELK [15] (in the following called elk inferences).
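The bit-vector filter described above can be illustrated by the following sketch (illustrative Python; the 128-bit width follows the description, while the hashing scheme and names are our assumptions):

```python
# Illustrative sketch of the Bloom-filter-like subset test: each element sets
# one bit of a 128-bit mask via its hash. Mask containment is a necessary
# (but not sufficient) condition for set inclusion, so it cheaply prunes
# most candidates before an exact subsumption check.

def mask(elements):
    m = 0
    for x in elements:
        m |= 1 << (hash(x) % 128)
    return m

def may_be_subset(s1, s2):
    """False means s1 is certainly not a subset of s2; True means an exact
    check is still needed (hash collisions can cause false positives)."""
    return mask(s1) & ~mask(s2) == 0
```

Only candidates passing this cheap necessary test need the exact subset check; as described above, the implementation additionally keys the stored sets by their bit vectors in a trie so that not all stored sets have to be scanned.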
Both EL+SAT and ELK are capable of computing small inference sets that derive particular entailed axioms and are complete for these entailments (see Section 3.1). See Table 2 for statistics about the inferences obtained for the entailments.

For our evaluation, we chose the ontologies GO-PLUS, GALEN and SNOMED, which contain (mostly) EL axioms. GO-PLUS is a recent version of Gene Ontology,⁶ which imports a number of other ontologies. The provided distribution included subsumption axioms that were inferred (annotated with is_inferred), which we have removed. GALEN is version 7 of OpenGALEN.⁷ We did not use the more recent version 8, because the other tools were running out of memory. SNOMED is the version of Snomed CT⁸ released on 2015-01-31. From the first two ontologies we removed non-EL axioms, such as functional property axioms, and axioms that contain inverse property expressions and disjunctions. We have also adapted the input ontologies so that they could be processed by (the reasoner of) EL+SAT: we removed disjointness axioms and replaced property equivalences with pairs of property inclusions. Duplicate axioms were removed by loading and saving the ontologies with the OWL API. With these ontologies, we have computed justifications for the entailed direct subsumptions between atomic concepts (in the following called the queries) using various tools.

³ https://github.com/liveontologies/puli
⁴ https://en.wikipedia.org/wiki/Bloom_filter
⁵ https://en.wikipedia.org/wiki/Trie
⁶ http://geneontology.org/page/download-ontology
⁷ http://www.opengalen.org/sources/sources.html

Table 1. Summary of the input ontologies

             GO-PLUS    GALEN   SNOMED
# axioms      105557    44475   315521
# concepts     57173    28482   315510
# roles          157      964       77
# queries      90443    91332   468478

Table 2. Summary of sizes of inference sets

                 GO-PLUS     GALEN    SNOMED
sat  average       470.3   59140.0     997.8
     median         39.0  110290.0       1.0
     max         15915.0  152802.0   39381.0
elk  average       166.9    3602.0     110.3
     median         43.0    3648.0       8.0
     max          7919.0   81501.0    1958.0
All queries were processed by the tools in a fixed random order to achieve a fair distribution of easy and hard problems. We used a global timeout of one hour for each tool and a local timeout of one minute per query. To run the experiments we used a PC with an Intel Core i5 2.5 GHz processor and 7.7 GiB RAM running 64-bit Ubuntu 14.04. Table 1 shows the numbers of axioms, atomic concepts, atomic roles, and queries of each input ontology.

As an admissible order on clauses for our implementation of Algorithm 1, we chose the relation ⪯ that compares the number of different answer literals in clauses. When using this order, cardinality-minimal justifications are found first. Note that finding such justifications is NP-hard [6]; in practice, however, our algorithm with this order found the first justifications of all the queries of GO-PLUS, GALEN and SNOMED in about 13 minutes, 2 hours and 1.5 hours, respectively.

To control resolution inferences, we used three different selection strategies (for Horn clauses) that we detail next. For a propositional atom p, let #(p) be the number of input clauses in which p appears as a (positive) literal. Given a clause c, the BottomUp strategy selects a negative literal ¬p of c whose value #(p) is minimal; if there are no negative literals, the (only) positive literal of c is selected. The TopDown strategy selects a positive literal, if there is one, and otherwise selects a negative literal as in BottomUp. Finally, the Threshold strategy selects a negative literal ¬p with the minimal value #(p) if #(p) does not exceed a given threshold value or there is no positive literal in c; otherwise the positive literal is selected. In our experiments we used the threshold value of 2. Intuitively, the BottomUp strategy simulates unit resolution, TopDown simulates SLD resolution, and Threshold is a combination thereof.
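The three strategies can be sketched as follows (illustrative Python; the (neg, pos) clause representation and all names are ours, with counts[p] playing the role of #(p)):

```python
# Illustrative sketches (ours) of the three selection strategies: a clause is
# (neg, pos) with neg a set of negated atoms and pos the positive atom or
# None; counts[p] = #(p), the number of input clauses with positive literal p.

def bottom_up(clause, counts):
    """Select a negative literal with minimal #(p); else the positive one."""
    neg, pos = clause
    return ('-', min(neg, key=lambda p: counts.get(p, 0))) if neg else ('+', pos)

def top_down(clause, counts):
    """Select the positive literal if present; else fall back to BottomUp."""
    neg, pos = clause
    return ('+', pos) if pos is not None else bottom_up(clause, counts)

def threshold(clause, counts, t=2):
    """Select a cheap negative literal (#(p) <= t); otherwise the positive one."""
    neg, pos = clause
    if neg:
        p = min(neg, key=lambda q: counts.get(q, 0))
        if counts.get(p, 0) <= t or pos is None:
            return ('-', p)
    return ('+', pos)
```

For example, on the clause ¬p1 ∨ ¬p2 ∨ q with #(p1) = 1 and #(p2) = 3, BottomUp selects ¬p1, TopDown selects q, and Threshold (with t = 2) selects ¬p1.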
Table 3 shows for how many queries all justifications were computed within the global and local timeouts.⁹ The first six rows correspond to experiments on sat inferences and the other six rows to experiments on elk inferences.

Table 3. Number of queries processed in 1h / number of 60s timeouts / % of processed queries among all queries / % of 60s timeouts among the queries attempted in 1h

                        GO-PLUS                    GALEN                   SNOMED
sat  BottomUp    2004 / 43 /  2.2 / 2.15     123 / 56 /  0.1 / 45.5    3270 / 31 /  0.7 / 0.95
     TopDown    16662 / 48 / 18.4 / 0.29    5028 / 18 /  5.5 /  0.36   7388 / 30 /  1.6 / 0.41
     Threshold  23968 / 45 / 26.5 / 0.19    2873 /  3 /  3.1 /  0.10  18700 / 14 /  4.0 / 0.07
     EL2MUS     10357 / 42 / 11.5 / 0.41    4862 / 32 /  5.3 /  0.66  12773 / 37 /  2.7 / 0.29
     EL2MCS      6369 / 43 /  7.0 / 0.68    3194 / 27 /  3.5 /  0.85   6163 / 46 /  1.3 / 0.75
     SATPin      4390 / 53 /  4.9 / 1.21    1475 / 39 /  1.6 /  2.64   3490 / 46 /  0.7 / 1.32
elk  BottomUp    3113 / 41 /  3.4 / 1.32    8204 / 23 /  9.0 /  0.28  85977 / 20 / 18.4 / 0.02
     TopDown    18392 / 43 / 20.3 / 0.23    6757 / 33 /  7.4 /  0.49  73063 / 15 / 15.6 / 0.02
     Threshold  30579 / 47 / 33.8 / 0.15   25824 / 15 / 28.3 /  0.06 291151 /  1 / 62.1 / 0.00
     EL2MUS     12877 / 48 / 14.2 / 0.37   12349 / 40 / 13.5 /  0.32  15054 / 42 /  3.2 / 0.28
     EL2MCS      6694 / 49 /  7.4 / 0.73    8545 / 47 /  9.4 /  0.55   6466 / 48 /  1.4 / 0.74
     SATPin      4689 / 50 /  5.2 / 1.07    3050 / 48 /  3.3 /  1.57   4320 / 52 /  0.9 / 1.20

Note that every tool processed more queries and had relatively fewer timeouts on elk inferences. Notice also that the resolution procedure with the Threshold selection strategy processed the most queries and had the fewest timeouts on all ontologies. In particular, for SNOMED it timed out only in one case. It turns out that, without any timeout, Threshold was able to find all justifications for all queries of SNOMED within about 1.5 hours, with the longest query taking less than 4 minutes.

⁸ http://www.snomed.org/
⁹ All experimental data is available at https://osf.io/9sj8n/
None of the SAT-based tools was able to find all justifications for all queries of SNOMED even within 24 hours.

To determine whether the Threshold strategy was the best among the resolution strategies on all queries, we have plotted in Figure 2 the distributions of the query times for all resolution strategies. Each point ⟨x, y⟩ of a plot represents the proportion x of queries that were solved by the method in under the time y. For instance, TopDown solved about 90% of the queries of GALEN in under 0.1 seconds. Each plot considers only queries attempted by all tools on that plot. Since each plot represents the distribution of times and not a direct comparison of times for each query, even if one line is completely below another one, this does not mean that the corresponding method is faster for every query. To get a more detailed comparison, we have also plotted the distribution of minimum query times with a thin black line. For a query, the minimum time is the time spent by the tool that was fastest on that query (among the tools on the same plot). If a plot for some tool coincides with this black line at a point (x, y), then all queries solved within time y by some tool were also solved within time y by this tool. In particular, this tool is the fastest for all queries with the minimal time y. This analysis shows, for example, that TopDown was the best tool for easy queries (solved under 0.1 seconds by some tool) in GALEN, and Threshold was the best tool for hard queries (solved over 1 second by all tools) on all ontologies. In Figure 3 we present a similar comparison of Threshold with the SAT-based tools.

[Figure 2: for each of GO-PLUS, GALEN and SNOMED, query times from 0.01 to 60 seconds (log scale) are plotted against the percentage of queries, with one line per selection strategy (BottomUp, TopDown, Threshold) and a thin black line for the minimum times.]

Fig. 2.
Distribution of query times for the selection strategies on elk inferences 60 60 60 GO-P LUS G ALEN S NOMED time in seconds time in seconds time in seconds 10 10 10 1 1 1 0.1 0.1 0.1 0.01 0.01 0.01 % of queries % of queries % of queries 0 60 80 90 100 0 60 80 90 100 0 60 80 90 100 SATPin EL2MCS EL2MUS Threshold minimum Fig. 3. Distribution of query times on elk inferences10 S NOMED.11 EL2MUS wins on queries solved between about 10 and 50 milliseconds in G ALEN. For all other query ranges, there appears to be no absolute winner. 6 Summary We presented a new procedure that enumerates justifications using inferences that de- rive the goal consequence from an ontology. The inferences are encoded as Horn clauses and resolution with answer literals is applied. Our procedure can be parameterized by an ordering in which the justifications should be enumerated (as long as it extends the subset relation) and by a strategy that selects literals for resolution. The algorithm is relatively easy to implement and it can be easily used also with non-Horn and non- propositional clauses. Our empirical evaluation shows that the procedure provides com- parable, if not better performance than other tools that also use inferences as input. We were able to compute all justifications for all direct subsumptions of Snomed CT in about 1.5 hours. Currently, we cannot explain the difference in the performance of the evaluated selection strategies. We hope to explore this question in the future. 10 Plots on sat inferences have the same shape, except that all lines are shifted up. 11 For other ontologies Threshold times out on some query for which some other tool does not. References 1. Ansótegui, C., Bonet, M.L., Levy, J.: Sat-based maxsat algorithms. Artif. Intell. 196, 77–105 (2013), http://dx.doi.org/10.1016/j.artint.2013.01.002 2. Arif, M.F., Mencía, C., Ignatiev, A., Manthey, N., Peñaloza, R., Marques-Silva, J.: BEACON: an efficient sat-based tool for debugging EL+ ontologies. 
In: Creignou, N., Berre, D.L. (eds.) Proc. 19th Int. Conf. on Theory and Applications of Satisfiability Testing (SAT'16). Lecture Notes in Computer Science, vol. 9710, pp. 521–530. Springer (2016), http://dx.doi.org/10.1007/978-3-319-40970-2_32
3. Arif, M.F., Mencía, C., Marques-Silva, J.: Efficient axiom pinpointing with EL2MCS. In: Hölldobler, S., Krötzsch, M., Peñaloza, R., Rudolph, S. (eds.) Proc. 38th Annual German Conf. on Artificial Intelligence (KI'15). Lecture Notes in Computer Science, vol. 9324, pp. 225–233. Springer (2015), http://dx.doi.org/10.1007/978-3-319-24489-1_17
4. Arif, M.F., Mencía, C., Marques-Silva, J.: Efficient MUS enumeration of Horn formulae with applications to axiom pinpointing. CoRR abs/1505.04365 (2015), http://arxiv.org/abs/1505.04365
5. Baader, F., Lutz, C., Suntisrivaraporn, B.: Efficient reasoning in EL+. In: Parsia, B., Sattler, U., Toman, D. (eds.) Proc. 19th Int. Workshop on Description Logics (DL'06). CEUR Workshop Proceedings, vol. 189. CEUR-WS.org (2006)
6. Baader, F., Peñaloza, R., Suntisrivaraporn, B.: Pinpointing in the description logic EL+. In: Hertzberg, J., Beetz, M., Englert, R. (eds.) Proc. 30th Annual German Conf. on Artificial Intelligence (KI'07). Lecture Notes in Computer Science, vol. 4667, pp. 52–67. Springer (2007), https://doi.org/10.1007/978-3-540-74565-5_7
7. Bachmair, L., Ganzinger, H.: Resolution theorem proving. In: Robinson, J.A., Voronkov, A. (eds.) Handbook of Automated Reasoning, pp. 19–99. Elsevier and MIT Press (2001)
8. Green, C.: Theorem proving by resolution as a basis for question-answering systems. Machine Intelligence, pp. 183–205 (1969)
9. Greiner, R., Smith, B.A., Wilkerson, R.W.: A correction to the algorithm in Reiter's theory of diagnosis. Artif. Intell. 41(1), 79–88 (1989), http://dx.doi.org/10.1016/0004-3702(89)90079-9
10. Horridge, M.: Justification based explanation in ontologies. Ph.D. thesis, University of Manchester, UK (2011), http://www.manchester.ac.uk/escholar/uk-ac-man-scw:131699
11. Junker, U.: QUICKXPLAIN: preferred explanations and relaxations for over-constrained problems. In: McGuinness, D.L., Ferguson, G. (eds.) Proc. 19th AAAI Conf. on Artificial Intelligence (AAAI'04). pp. 167–172. AAAI Press / The MIT Press (2004), http://www.aaai.org/Library/AAAI/2004/aaai04-027.php
12. Kalyanpur, A.: Debugging and Repair of OWL Ontologies. Ph.D. thesis, University of Maryland College Park, USA (2006)
13. Kalyanpur, A., Parsia, B., Horridge, M., Sirin, E.: Finding all justifications of OWL DL entailments. In: Aberer, K., Choi, K.S., Noy, N., Allemang, D., Lee, K.I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) Proc. 6th Int. Semantic Web Conf. (ISWC'07). LNCS, vol. 4825, pp. 267–280. Springer (2007)
14. Kazakov, Y.: Consequence-driven reasoning for Horn SHIQ ontologies. In: Boutilier, C. (ed.) Proc. 21st Int. Joint Conf. on Artificial Intelligence (IJCAI'09). pp. 2040–2045. IJCAI (2009)
15. Kazakov, Y., Klinov, P.: Goal-directed tracing of inferences in EL ontologies. In: Proc. 13th Int. Semantic Web Conf. (ISWC'14), Riva del Garda, Italy, October 19–23, 2014, Part II. pp. 196–211 (2014)
16. Kazakov, Y., Krötzsch, M., Simančík, F.: The incredible ELK: From polynomial procedures to efficient reasoning with EL ontologies. J. of Automated Reasoning 53(1), 1–61 (2014)
17. Manthey, N., Peñaloza, R., Rudolph, S.: Efficient axiom pinpointing in EL using SAT technology. In: Lenzerini, M., Peñaloza, R. (eds.) Proc. 29th Int. Workshop on Description Logics (DL'16). CEUR Workshop Proceedings, vol. 1577. CEUR-WS.org (2016), http://ceur-ws.org/Vol-1577/paper_33.pdf
18. Morgado, A., Liffiton, M.H., Marques-Silva, J.: MaxSAT-based MCS enumeration. In: Biere, A., Nahir, A., Vos, T.E.J. (eds.) Proc. 8th Int. Haifa Verification Conf. (HVC'12). Lecture Notes in Computer Science, vol. 7857, pp. 86–101. Springer (2012), http://dx.doi.org/10.1007/978-3-642-39611-3_13
19. Parsia, B., Sirin, E., Kalyanpur, A.: Debugging OWL ontologies. In: Ellis, A., Hagino, T. (eds.) Proc. 14th Int. Conf. on World Wide Web (WWW'05). pp. 633–640. ACM (2005), http://doi.acm.org/10.1145/1060745.1060837
20. Reiter, R.: A theory of diagnosis from first principles. Artif. Intell. 32(1), 57–95 (1987), http://dx.doi.org/10.1016/0004-3702(87)90062-2
21. Sebastiani, R., Vescovi, M.: Axiom pinpointing in lightweight description logics via Horn-SAT encoding and conflict analysis. In: Schmidt, R.A. (ed.) Proc. 22nd Conf. on Automated Deduction (CADE'09). Lecture Notes in Computer Science, vol. 5663, pp. 84–99. Springer (2009), http://dx.doi.org/10.1007/978-3-642-02959-2_6
22. Simančík, F., Kazakov, Y., Horrocks, I.: Consequence-based reasoning beyond Horn ontologies. In: Walsh, T. (ed.) Proc. 22nd Int. Joint Conf. on Artificial Intelligence (IJCAI'11). pp. 1093–1098. AAAI Press/IJCAI (2011)
23. Vescovi, M.: Exploiting SAT and SMT Techniques for Automated Reasoning and Ontology Manipulation in Description Logics. Ph.D. thesis, University of Trento, Italy (2011), http://eprints-phd.biblio.unitn.it/477/