=Paper=
{{Paper
|id=Vol-2211/paper-13
|storemode=property
|title=Computing Best Ontology Excerpts via Weighted Partial Max-SAT Solving
|pdfUrl=https://ceur-ws.org/Vol-2211/paper-13.pdf
|volume=Vol-2211
|authors=Jieying Chen,Yue Ma,Dirk Walther
|dblpUrl=https://dblp.org/rec/conf/dlog/ChenM018
}}
==Computing Best Ontology Excerpts via Weighted Partial Max-SAT Solving==
<pdf width="1500px">https://ceur-ws.org/Vol-2211/paper-13.pdf</pdf>
<pre>
            Computing Best Ontology Excerpts via
             Weighted Partial Max-SAT Solving?

                         Jieying Chen1 , Yue Ma1 , Dirk Walther2
       1
           Laboratoire de Recherche en Informatique, Université Paris-Sud, France
                           2
                             Fraunhofer IVI, Dresden, Germany
               {jieying.chen,yue.ma}@lri.fr, dirk.walther@ivi.fraunhofer.de


           Abstract. We consider the problem of computing best excerpts of on-
           tologies, which are selections of a certain, small number k of axioms that
           best capture the knowledge regarding a given set of weighted terms. Such
           excerpts can be useful for ontology selection, for instance. The weights of
           terms are to reflect different importances of considered terms, for which
           we propose an incompleteness measure of excerpts based on the term
           weights and the notion of logical difference. We present a method to
           compute best k-excerpts by finding all subsumption justifications and
           solving a weighted partial Max-SAT problem. We demonstrate the via-
           bility of our approach with an evaluation on biomedical ontologies.


1     Introduction
In the presence of an ever growing amount of information, organizations and
human users need to be able to focus on certain key pieces of information and to
intentionally ignore all other possibly relevant parts. Knowledge about complex
systems that is represented in ontologies yields collections of axioms that are
too large for human users to browse, let alone to comprehend or reason about
it. To address the need for ontology summary, the notion of an ontology excerpt
was introduced [3]. An ontology excerpt is a fixed-size subset of an ontology,
consisting of the most relevant axioms for a given set of terms. These axioms
preserve as much as possible the knowledge about the considered terms described
in the ontology.
    To evaluate the quality of ontology excerpts, we define a semantics-based
measure, using Logical Difference [6], to quantify how much semantic meaning
is preserved in an excerpt compared to the original ontology. We take the logical
difference to be the set of queries that are relevant to an application domain and
that produce different answers when evaluated over the considered ontologies.
We consider concept subsumption queries over the terms of interest only.
    Using an exhaustive search to find the excerpts of an ontology that best
preserve the semantic information w.r.t. the ontology is futile as it involves com-
puting all (i.e. exponentially many) subsets of the ontology. In [3] efficient excerpt
extraction techniques are investigated that are based on methods stemming from
?
    This work is partially funded by the ANR project GoAsQ (ANR-15-CE23-0022).
the area of information retrieval [10], a research area which is generally concerned
with developing techniques to extract the “most relevant” documents for a query
from large data sources. However, these techniques do not yield the best excerpts
possible. In this paper, we present an algorithm that computes a best excerpt of
an ontology, i.e., there is no subset of the ontology of the same size that better
preserves the knowledge about the selected terms. This paper extends the ap-
proach from [4] using an encoding to a weighted partial Max-SAT problem to
handle weighted signatures.
    We proceed as follows. We start in Section 2 with reviewing the notions of
logical difference together with difference witness and best excerpt as well as the
weighted partial Max-SAT problem. A best k-excerpt of a TBox is a sub-ontology
that containing at most k axioms that best capture the knowledge about user’s
interests in terms of weighted signature. Moreover, we introduce the notion of
subsumption modules of a TBox for a given set of terms. Taking the union of
some subsumption modules can yield best excerpts for a weighted signature, as
shown in Section 3. We encode the problem of selecting the right subsumption
modules on weighted signature as a weighted partial Max-SAT problem. Overall
we obtain an algorithm for computing best k-excerpts on weighted signature.
In Section 4, we propose a method to rank axioms in excerpts when presenting
user a best excerpt. We evaluate our algorithm by using it with the prominent
bio-medical ontologies Snomed CT in Section 5. We demonstrate computing best
k-excerpts is a viable approach for the task of summarizing large ontologies with
a controllable small number of axioms that are most relevant for a given set of
terms. We conclude the paper in Section 6.


2     Preliminaries

We briefly recall basic notions related to the ontology excerpts, and the logical
difference between terminologies [6, 8]. We assume familiarity with Description
Logic; see [1] for a detailed introduction.


2.1   Ontology Excerpt

In this section, we review the notions of a k-excerpt, an incompleteness measure
and a best k-excerpt of a TBox [3]. We recall that a signature is a finite set of
concept and role names, and we use sig(ξ) to denote the signature of an syntactic
object ξ.
    A k-excerpt of an ontology is a subset of the ontology containing no more than
k axioms. An incompleteness measure quantifies how much of the knowledge of
a TBox about the terms in a signature is captured (or how much is missing) by
an excerpt.

Definition 1 (Incompleteness Measure). Let T be a TBox, and let Σ be a
signature. An incompleteness measure µΣ T for T and Σ is a function that maps
every subset E ⊆ T to a non-negative real number µ(E).
   A best k-excerpt best captures the knowledge about Σ in T among all k-
excerpts. How well a k-excerpt captures knowledge is measured by a suitable
incompleteness measure. Note that best k-excerpts are generally not unique.

Definition 2 (Best Excerpt). Let T be a TBox, let Σ be a signature, and let
k > 0 be a natural number. Additionally, let µΣ
                                              T be an incompleteness measure
for T and Σ. A best k-excerpt of T w.r.t. Σ under µΣ  T is a k-excerpt E of T
such that
                                  0     0
               µΣ             Σ
                T (E) = min{ µT (E ) | E is a k-excerpt of T }.

    In [3,4], the incompleteness measure µ was set to be the number |cWtnΣ (T , E)|
of concept inclusion difference witnesses in Σ w.r.t. T , which will be introduced
in the next section.


2.2   Logical Difference

We now recall basic notions related to the logical difference between two ELHr -
terminologies for EL-inclusions over a given signature as query language [6, 8].

Definition 3 (Logical Difference). Let T1 and T2 be two EL-terminologies,
and let Σ be a signature. The concept inclusion difference between T1 and T2
w.r.t. Σ is the set cDiff Σ (T1 , T2 ) of all EL-inclusions α of the form C v D for
EL-concepts C and D such that sig(α) ⊆ Σ, T1 |= α, and T2 6|= α.

   In case two terminologies T1 and T2 are logically different, the set cDiff Σ (T1 , T2 )
consists of infinitely many concept inclusions. The primitive witnesses theorems
from [6] allow us to consider only certain inclusions of a simpler syntactic form.
Theorem 1. Let T1 and T2 be two EL-terminologies, and let Σ be a signature.
If α ∈ cDiff Σ (T1 , T2 ), then either A v D or C v A is a member of cDiff Σ (T1 , T2 ),
where A ∈ sig(α) is a concept name, and C, D are EL-concepts occurring in α.
    We use the primitive witnesses theorem to obtain a succinct representation
of the set cDiff Σ (T1 , T2 ).

Definition 4 (Primitive Witnesses). Let T1 and T2 be two EL-terminologies,
and let Σ be a signature. We say that concept inclusion difference witnesses in Σ
w.r.t. T1 and T2 are concept names contained in Σ that occur on the left-hand
side of inclusions of the form A v C in cDiff Σ (T1 , T2 ) or on the right-hand side
of inclusions of the form D v A in cDiff Σ (T1 , T2 ). The set of all such witnesses
will be denoted by cWtnΣ (T1 , T2 ).

   The set cWtnΣ (T1 , T2 ) is finite as Σ is finite. In particular, it holds that
cDiff Σ (T1 , T2 ) = ∅ iff cWtnΣ (T1 , T2 ) = ∅.
   For (possibly cyclic) EL-terminologies T1 and T2 , the set cWtnΣ (T1 , T2 ) can
be computed using the current version of the CEX tool [7] implementing a proof-
theoretic approach to the logical difference problem using hypergraphs as intro-
duced in [5] and further extended in [9].
2.3   Weighted Partial Max-SAT problem

Let Π be a countably infinite set of propositional variables which we denote with
p, q, etc. The set Φ of propositional logic formulas ϕ is built according to the
following grammar rule: ϕ ::= p | ¬ϕ | ϕ ∧ ϕ | ϕ ∨ ϕ | ϕ → ϕ | ϕ ↔ ϕ with p ∈ Π.
     A propositional valuation v : Π → {0, 1} is function mapping a propositional
variable p ∈ Π to a truth value v(p) ∈ {0, 1}. A propositional formula ϕ ∈ Φ is
satisfied in v iff v(ϕ) = 1. We say that ϕ is satisfiable iff there exists a valuation v
such that v(ϕ) = 1.
     A literal ` is a propositional variable p or a negated propositional variable
¬p, and a clause is a disjunction `1 ∨ . . . ∨ `n of literals `i (0 ≤ i ≤ n). For a
propositional formula ϕ ∈ Φ, we denote with Cls(ϕ) the transformation of ϕ into
an equi-satisfiable clause set, i.e. Cls(ϕ)
                                       V is a finite set of propositional clauses such
that ϕ is satisfiable iff the formula ψ∈Ψ ψ is satisfiable, where Ψ = Cls(ϕ). It is
known that every propositional formula ϕ ∈ Φ can be transformed in polynomial
time (in the size of ϕ) into an equi-satisfiable set of clauses. Note that the clauses
in Cls(ϕ) may contain propositional variables that are not present in ϕ.
     The Boolean Satisfiability Problem (SAT), also known as Propositional Sat-
isfiability Problem, is the problem of deciding whether there exists a valuation
that satisfies a given propositional formula ϕ, or alternatively, all the clauses
contained in a set of clauses Ψ = Cls(ϕ). As an extension, the set Ψ is split into
two sets H and S, called hard and soft clause sets. The problem of finding a
valuation that satisfies all the clauses in H and a maximal number clauses in S
is called the partial Max-SAT problem.
     A weighted clause is a pair (ψ, wψ ), where ψ is a clause and wψ is a natural
number or infinity meaning the cost for falsifying the clause ψ. If a clause ψ ∈ H,
then wψ is infinite. Given an assignment I and a multiset of weighted clauses Φ,
the cost of assignment I on Φ is the sum of the weights of the clauses falsified by
I. The Weighted Partial Max-SAT problem for weighted clauses is the problem
of finding an optimal assignment to the variables of Φ that minimizes the cost
of the assignment on Φ.


2.4   Subsumption Justifications

Subsumption module and justification are defined to capture the set of ax-
ioms required to preserve the logical consequences related to a particular con-
cept name [4]. An algorithm for computing subsumption justifications for ELH-
terminologies is presented in [4].

Definition 5 (Subsumption Module & Justification). Let T be a termi-
nology, let Σ be a signature, and let X ∈ NC . Additionally, let M ⊆ T .
    We say that M is an hX, Σi-subsumption module of T iff for every C, D ∈
ELΣ : T |= X v C implies M |= X v C. Plus T |= D v X implies M |= D v
X.
    An hX, Σi- subsumption justification M of T is an hX, Σi-subsumption mod-
ule of T that is minimal w.r.t. (.
Property 1. Let T be a terminology, and let Σ be a signature. Additionally, let
M be a hX, Σi-subsumption justification of T . Then X 6∈ cDiff Σ (T , M).


3    Computing Best k-Excerpts for Weighted Signatures
Best k-excerpts provide users with a way to extract the most relevant axioms
from a big ontology with respect to a given signature. However, in certain cases,
concepts or roles in the signature can have different importances. To this end,
we propose a new incompleteness measure considering a weighted signature and
study the corresponding best k-excerpts.
   First, we introduce the definition of weighted signature.
Definition 6 (Weighted Signature). A weight function is a function w :
NC ∪ NR → R mapping a symbol σ ∈ NC ∪ NR to a value w(σ) ∈ R ∪ {∞}.
Given a signature Σ and a weight function w, a weighted signature Σw is a set
Σw = { (σ, w(σ)) | σ ∈ Σ }.
    A signature Σ is weighted uniformly under a weight function w iff w assigns
the same value to every symbol in Σ, i.e. w(σ1 ) = w(σ2 ) for every σ1 , σ2 ∈ Σ. For
a weight function w whose range equals {1.0} (i.e., w maps every symbol to 1.0),
we also write Σ ◦ instead of Σw . For instance, Σw = {(A, 2.0), (B, 0.9), (r, 1.3)} is
the weighted signature for the signature Σ = {A, B, r} and a weight function w
satisfying w(A) = 2.0, w(B) = 0.9 and w(r) = 1.3. On the other hand, a
weight function w0 satisfying w0 (A) = w0 (B) = w0 (r) = 1.4 weighs the symbols
in Σ uniformly resulting in Σw0 = {(A, 1.4), (B, 1.4), (r, 1.4)}, whereas Σ ◦ =
{(A, 1.0), (B, 1.0), (r, 1.0)}.
    For best k-excerpts under a weighted signature Σw , we use, as an incom-
pleteness measure µΣ   T
                         w
                            for Σw and T , the sum of the weights of the EL-concept
inclusion difference witnesses in Σ w.r.t. T and E, formally:
                                               X
                                µΣ
                                 T (E) =
                                   w
                                                  w(X)
                                        X∈cWtnΣ (T ,E)

     For computing the best k-excerpts for a weighted signature, we present an
encoding of the problem to a weighted partial Max-SAT problem, with the aim of
delegating the task of finding the best excerpt to a Max-SAT solver. In that way
we can leverage the decades of research efforts dedicated to developing efficient
SAT solvers for our problem setting.
     Note that our encoding requires that a cover of the hA, Σi-subsumption jus-
tifications has already been computed for every concept name A ∈ Σ. Moreover
our partial Max-SAT encoding is language agnostic, i.e., it works for any TBox
as long as its subsumption justifications can be computed.
     As we are interested in finding a subset M ⊆ T of a TBox T that fulfills
certain properties, we encode the presence of an axiom α ∈ T in M using a
propositional variable pα . Intuitively, in a solution v to our partial Max-SAT
problem we will have that v(pα ) = 1 iff the axiom α is contained in the best
k-excerpt.
   Next we introduce the formulas FM and GM with M ⊆ T and M ⊆ 2T that
we will use to encode the covers M of hA, Σi-subsumption justifications M.

Definition 7 (Encoding of TBox Subsets). Let T be a terminology. For
every axiom α ∈ T , let pα ∈ Π be a fresh propositional variable.
          V M ⊆ T , we define the propositional formula FM that encodes M
   For a subset
as FM := α∈M pα . Moreover, for a set M ⊆ 2T , we define the propositional
formula GM associated with M as
                              _            _ ^
                       GM :=      FM =        (      pα ).
                                M∈M          M∈M α∈M

Example 1. Let T = {α1 , α2 , α3 , α4 , α5 }, M = {α1 , α2 , α3 }, and M = {{α2 , α3 },
{α1 , α4 }}. Then FM = pα1 ∧ pα2 ∧ pα3 and GM = (pα2 ∧ pα3 ) ∨ (pα1 ∧ pα4 ).

    We can now define our encoding of the best k-excerpt problem for weighted
signature into weighted partial Max-SAT.

Definition 8 (Encoding of the Best Excerpt Problem for Weighted
Signature). Let T be an EL-TBox, let Σw be a weighted signature, and let
0 ≤ k ≤ |T |. Additionally, for every A ∈ Σw , let MA be a cover of the hA, Σw i-
subsumption justifications of T , and let qA be a fresh propositional variable.
   The weighted partial Max-SAT problem for finding best k-excerpts of T w.r.t. Σw ,
denoted with Pk (T , Σw ), is defined as follows.
   We set Pk (T , Σw ) := (Hk (T ), Sk (T , Σw )), where
                                               [
                  Hk (T ) := Card(T , k) ∪           Cls(qA ↔ GMA ),
                                             A∈Σ∩NC

                Sk (T , Σw ) := { (qA , w(A)) | A ∈ Σw ∩ NC },

and Card(T , k) is a set of clauses specifying that at most k clauses from the set
{ pα | α ∈ T } must be satisfied.

    In the hard part of our weighted partial Max-SAT problem, the clauses in
Card(T , k) specify that the cardinality of the resulting excerpt E ⊆ T must
be equal to k. We do not fix a certain encoding that should be used to obtain
Card(T , k), but we note that there exist several techniques that require a poly-
nomial number of clauses in k and in the size of T 1 (see e.g. [11]). Moreover,
for every concept name A ∈ Σw , the variable qA is set to be equivalent to the
formula GMA , i.e. qA will be satisfied in a valuation iff the resulting excerpt will
not have A as a difference witness (A 6∈ cWtnΣ (T , E)).
    In the soft part of the problem, the set Sk (T , Σw ) specifies that the sum of the
weights of the concept names in the signature must be maximal by a solution
of the problem, enforcing that the resulting excerpt E will yield the smallest
1
    In our implementation we used the encodings for cardinality constraints that are
    provided by the Max-SAT solver Sat4j.
possible sum of weights of difference witnesses (whilst obeying the constraint
that |E| = k).
   We can now show the correctness of our encoding, i.e. a best k-excerpt for
a weighted signature can be exactly obtained from any solution to the weighted
partial Max-SAT problem Pk (T , Σw ).

Theorem 2. Let T be an EL-TBox, and let Σw be a weighted signature. Then
v is a solution of the weighted partial Max-Sat problem Pk (T , Σw ) if and only if
{ α ∈ T | v(pα ) = 1 } is a best k-excerpt of T w.r.t. Σw .

    Our algorithm for finding a best k-excerpt given a TBox T and a weighted
signature Σw (and k with 0 ≤ k ≤ |T |) as input is shown in Algorithm 1. First,
the algorithm checks whether k = 0 or k = |T |, in which case the empty set or T
itself is the best k-excerpt, respectively. Otherwise, the algorithm continues by
computing subsumption justifications for each concept name A in Σw and then
transfer them to propositional formula GJA . After the iteration over all the
concept names A in Σw is complete, the partial Max-SAT problem Pk (T , Σw ) is
constructed with the help of the formulas GMA that are stored in S. Subsequently,
a solution v of Pk (T , Σw ) is computed using a partial Max-SAT solver and the
best k-excerpt is returned by analyzing which variables pα have been set to 1 in
the valuation v.
 Algorithm 1: Computing Best k-Excerpts of T for Weighted Signature Σ
 1   function ComputeBestExcerpt(T , Σw , k)
 2       if k = 0 then
 3            return ∅
 4       if k = |T | then
 5            return T
 6       S := ∅
 7       for every A ∈ Σw ∩ NC do
 8            Compute hA, Σw i-subsumption justifications of T : JT (A, Σw )
 9            Transfer JT (A, Σw ) to its propositional formula GJA
10            S := S ∪ {GJA }
11       Compute Pk (T , Σw ) using S
12       Find the set of solutions V of Pk (T , Σw ) using weighted partial Max-SAT solver
13       return { α ∈ T | v(pα ) = 1, v ∈ V }


   Our algorithm first computes subsumption justifications, which requires ex-
ponential time in the size of T and Σw . Then the weighted partial Max-SAT
encoding runs in polynomial time. Deciding whether there is a solution of an in-
stance of a Max-SAT problem is NP-complete. Overall we have that Algorithm 1
runs in exponential time in the size of T and Σw in the worst case.

Example 2. Let T = {α1 := A v X1 u X2 u X3 u X4 , α2 := X1 v B1 ,
α3 := X2 v B1 , α4 := X3 v B2 , α5 := X4 v B2 } be a TBox. Let Σ =
{A, B1 , B2 } be a signature. Then, for X ∈ Σ, we obtain the following hX, Σi-
subsumption justifications MX : MA = {{α1 , α2 }, {α1 , α3 }, {α1 , α4 }, {α1 , α5 }},
MB1 = {{α1 , α2 }, {α1 , α3 }} and MB2 = {{α1 , α4 }, {α1 , α5 }}. Each subsump-
tion justification MX is encoded as a propositional logic formula GX as follows:
GA = ((p1 ∧ p2 ) ∨ (p1 ∧ p3 ) ∨ (p1 ∧ p4 ) ∨ (p1 ∧ p5 )), GB1 = ((p1 ∧ p2 ) ∨ (p1 ∧ p3 ))
and GB2 = ((p1 ∧ p4 ) ∨ (p1 ∧ p5 )).
    Now let w1 , w2 , w3 be weight functions such that w1 (A) = w1 (B1 ) = w1 (B2 ) =
1.0, w2 (A) = w2 (B1 ) = 1.0 and w2 (B2 ) = 2.0, and w3 (A) = w3 (B2 ) = 1.0,
w3 (B1 ) = 2.0. Feeding the resulting weighted signatures Σw1 , Σw2 and Σw3 to-
gether with k = 2 to a SAT solver yields the following solutions. For Σw1 , we
obtain four assignments of truth values for the propositional variables in the tu-
ple hp1 , p2 , p3 , p4 , p5 i (in this order): h1, 1, 0, 0, 0i, h1, 0, 1, 0, 0i, h1, 0, 0, 1, 0i and
h1, 0, 0, 0, 1i. Then we decode the assignments to sets of axioms. We obtain the
following best 2-excerpts for Σw1 : {α1 , α2 }, {α1 , α3 }, {α1 , α4 } and {α1 , α5 }.
    In the case of Σw2 , however, the solutions returned by the SAT-solver are
the assignments h1, 0, 0, 1, 0i and h1, 0, 0, 0, 1i. Consequently, the best 2-excerpts
for Σw2 are {α1 , α4 } and {α1 , α5 }.
    Finally, for Σw3 , the SAT-solver yields h1, 1, 0, 0, 0i and h1, 0, 0, 1, 0i resulting
in {α1 , α2 } and {α1 , α4 } as the best 2-excerpts for Σw3 .

    Example 2 shows that, different weights of Σ-symbols can influent the final
results of best excerpts. By giving higher weights on preferring Σ-concept names
(like Σw2 or Σw3 ), users can reduce the number of best k-excerpts. However, in
some extreme cases, as shown in the following example where a signature has a
much higher weight then the others, the best excerpts for the wighted signature
can be totally irrelevant to those w.r.t. Σ ◦ .

Example 3. Let T = {α1 := A1 v X, α2 := X v Y, α3 := Y v Z, α4 := A2 v
B1 , α5 := A3 v B2 } be a TBox. Let Σ = {A1 , A2 , A3 , B1 , B2 , Y, Z} be a signa-
ture. Now let w1 , w2 be weight functions such that w1 (A1 ) = w1 (A2 ) = w1 (A3 ) =
w1 (B1 ) = w1 (B2 ) = w1 (Y ) = w1 (Z) = 1.0 (i.e., Σw1 is Σ ◦ ) and w2 (A1 ) =
w2 (A2 ) = w2 (A3 ) = w2 (Y ) = 1.0 and w2 (Z) = 9.0. We obtain that the best
3-excerpt EΣ ◦ for Σ ◦ is {α4 , α5 } 2 . The best 3-excerpt EΣw2 for Σw2 is {α1 , α2 }.
However, | cWtnΣ (T , EΣ ◦ ) |=| {A1 , X, Y } |= 3 and | cWtnΣ (T , EΣw2 ) |=|
{A2 , A3 , B1 , B2 } |= 4.

    In Example 3, as the w2 (Z) is much higher than the rest concept names in
signature, EΣ ◦ is not the same as EΣw . The following theorem provides a guide
of how to restrict weight function in order to keep monotonicity of best excerpts
even though w changes.

Proposition 1. Let T be an EL-TBox, Σw be a weighted signature and let Σ ◦
be a uniformed signature under the weighted function w.
                             1
  (i) If 1 ≤ w(σ) 1 + |Σ∩N      C|
                                   for every σ ∈ Σ ∩ NC , then every best k-excerpt
      of PT w.r.t. Σw will be  a best k-excerpt of T w.r.t. Σ ◦ , but not vice versa.
 (ii) If σi ∈Σ1 wσi > σj ∈Σ2 wσj for any Σ1 , Σ2 ∈ 2Σ∩NC and | Σ1 |>| Σ2 |,
                         P
      iff every best k-excerpt of T w.r.t. Σw is a best k-excerpt of T w.r.t. Σ ◦ .
2
    The best 3-excerpt EΣ ◦ for Σw1 only contains two axioms α4 and α5 , as for any best
                       ◦
                               Σ◦
    3-excerpt E 0 , µΣ   0
                     T (E ) = µT ({α4 , α5 })
4    Ranking Axioms in Best Excerpts
Best excerpts are useful to work with when a whole ontology exceeds the capacity
of a system or a human user. However, multiple axioms often exist in the best
excerpts. To better assist users, it would be desirable to give a ranking of these
axioms according to their importances. In this section we present an approach
to weight axioms according to weighted signatures.
    To achieve a ranking, we introduce the definition of weighted axioms.
Definition 9 (Weighted Axiom). An axiom weight function φ is a function
φ : T → R+ mapping an axiom α ∈ T to a value φ(α) ∈ R+ .
   Given an axiom weight function, it is then natural to introduce an axiom
ranking. Intuitively, an axiom with a larger weight should be ranked more highly.
Definition 10 (Axiom Ranking). Given an axiom weight function φ and two
axioms α and β, we say α is preferred than β, denoted α  β, if φ(α) ≥ φ(β).
   To achieve an axiom ranking, in the following, we show that a signature
weight function can induce an axiom weight function. Intuitively, an axiom that
appears in subsumption justifications of more Σ-symbols with higher weights
should be more highly weighted.

Definition 11. Let T be a TBox, Σw a signature with the weight function w.
Additionally, Mσ be the set of all subsumption justifications of σ. Then the axiom
weight function φ(·) induced by w is:
                                          X
                   φΣw (α) =                              w(σ)
                               {σ∈Σw | ∃M ∈Mσ s.t. α∈M }

We denote Σw the axiom ranking induced by φΣw .

Example 4 (Example 2 contd.). Note that α1 (resp. α2 , α3 , α4 , α5 ) is contained
in at least one of the subsumption justifications of Σ-symbols {A, B1 , B2 } (resp.
{A, B1 }, {A, B1 }, {A, B2 }, {A, B2 }). So for w3 , φ(α1 ) = w(A)+w(B1 )+w(B2 ) =
4.0, φ(α2 ) = φ(α3 ) = w(A)+w(B1 ) = 2.0, and φ(α4 ) = φ(α5 ) = w(A)+w(B2 ) =
3.0. Hence, for the 2-best excerpts w.r.t. w3 : {α1 , α2 } and {α1 , α4 }, we can order
the axioms by α1 Σw α2 and α1 Σw α4 .

     It is easy to see that the axiom function φ(·) in Definition 11 gives a total
ordering over the axioms in T w.r.t. to a given weighted signature Σw . Moreover,
it is independent of excerpts as stated in the following.
Proposition 2. Let T be a TBox, Σw a signature with the weight function w,
for any best excerpts E1 and E2 and axioms α1 , α2 ∈ E1 ∩ E2 , α1  α2 in E1 iff.
α1  α2 in E2 .
That is,  is a stable ordering of the axioms given a weighted signature. However,
note that the axioms in best k-excerpts are not monotonic with respect to the
size of k (see a counter-example given in [3]).
                      Snomed CT                       NCI
    Time (s) Minimal Maximal Median STDEV Minimal Maximal Median STDEV
    MaxSat    0.01     1.27     0.15    0.35     0.01     0.16     0.01    0.02
Table 1. Times for computing best excerpts encoding the input for and running SAT4J


5     Evaluation

To demonstrate that our algorithm works in practise, we implemented a JAVA
prototype using SAT4J as a Max-SAT solver [2]. All the tests were performed
on a server running Ubuntu 15.10 with Intel Xeon 2.50GHz Core 4 Duo CPU
with 64G RAM. We used the prominent medical ontologies Snomed CT (ver-
sion Jan 2016)3 consisting of 317 891 axioms, and the NCI Thesaurus (NCIt,
version 16.03d)4 containing 165 341 axioms.
    The first experiment is to test the efficiency of our encoding method. Table 1
shows the time needed to compute best excerpts of NCI and Snomed CT for
randomly generated signatures of different sizes (10/30/50 concept names and
10/30/50 roles names). In total over 3 000 signatures have been considered. After
precomputing the subsumption justifications for every concept name in every
signature, the time was measured as that needed to prepare the input and run
the partial Max-SAT solver to compute best k-excerpts for every size k. The
minimal, maximal, median, and standard deviation of the execution times show
that computing best excerpts of any size using Max-SAT is very efficient given
that the subsumption justifications have been computed.
    In the second experiment, we compare the size of locality based modules
with the number of axioms in best excerpts needed to preserve a certain amount
of knowledge. We denote with #(PreservedΣ (best) = n), for n ∈ {1, 2}, the
minimal number of axioms needed by a best excerpt to preserve the knowledge
of n concept names w.r.t. the signature Σ (i.e., the number k of axioms of a
best excerpt E of T such that n = |Σ ∩ NC \ cWtnΣ (T , E)|). Instead of using
random signatures, however, we consider a scenario where a user searches for
sub-ontologies of Snomed CT related to a particular concept name. We com-
puted 2 500 different signatures each consisting of a concept name related to
diseases, the TOP-concept (called ‘SNOMED CT Concept’) and all role names of
Snomed CT. All weights of Σ-symbols are equal in this experiment.
    In Figure 1, the 2 500 signatures are presented along the x-axis in ascending
order w.r.t. the size of their respective ⊥>∗ -local modules (black line). The y-axis
represents the number of axioms in the module and excerpts for a signature. The
red (resp. green) line presents the sizes of best excerpts that preserve the knowl-
edge for one (resp. two) concept name(s), i.e., #P reservedΣ (best) = 1 (resp.
#P reservedΣ (best) = 2). Best excerpts provide a conciser way for zooming in
on an ontology, in particular, when sacrificing completeness is acceptable. For
3
    http://www.ihtsdo.org/snomed-ct
4
    http://evs.nci.nih.gov/ftp1/NCI_Thesaurus
                                   140           Module
                                             #Preserved (best) = 1
                                   120       #Preserved (best) = 2


                Number of Axioms
                                   100
                                    80
                                    60
                                    40
                                    20
                                     0
                                         0     500       0        0      0         0
                                                      100      150    200    250
                                                     Signature ID

Fig. 1. Comparison of the size of best excerpts preserving one or two concept names
with the size of ⊥>∗ -local modules for 2 500 signatures

this experiment, after computing the subsumption justifications of all concept
names in a signature, it only took 0.15s on average to compute the best excerpts.


6   Conclusion

We have presented an algorithm for computing a best k-excerpt of a TBox for
given weighted signature. Best k-excerpts are k-element subsets of an ontology
that best preserve entailments over the target signature compared to the on-
tology [3]. The measure of incompleteness of the excerpt w.r.t. the ontology is
based on the incompleteness measure taking into account weights of concept
names in the signature. We have implemented our algorithm and we have pre-
sented an experimental evaluation to demonstrate the viability of computing
best excerpts of large biomedical ontologies in practice. We leave investigating
alternative incompleteness measures for future work.


References

 1. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.
    (eds.): The description logic handbook: theory, implementation, and applications.
    Cambridge University Press, 2 edn. (June 2010)
 2. Berre, D.L., Parrain, A.: The sat4j library, release 2.2. JSAT 7(2-3), 59–6 (2010)
 3. Chen, J., Ludwig, M., Ma, Y., Walther, D.: Towards extracting ontology excerpts.
    In: Proceedings of KSEM’15. pp. 78–89 (2015)
 4. Chen, J., Ludwig, M., Ma, Y., Walther, D.: Zooming in on ontologies: Minimal
    modules and best excerpts. In: Proceeding of ISWC’17. pp. 173–189 (2017)
 5. Ecke, A., Ludwig, M., Walther, D.: The concept difference for EL-terminologies
    using hypergraphs. In: Proceedings of DChanges’13 (2013)
 6. Konev, B., Ludwig, M., Walther, D., Wolter, F.: The logical difference for the
    lightweight description logic EL. Journal of Artificial Intelligence Research 44,
    633–708 (2012)
 7. Konev, B., Ludwig, M., Wolter, F.: Logical difference computation with CEX2.5.
    In: Proceedings of IJCAR’12. pp. 371–377 (2012)
 8. Konev, B., Walther, D., Wolter, F.: The logical difference problem for description
    logic terminologies. In: Proceedings of IJCAR’08. pp. 259–274 (2008)
 9. Ludwig, M., Walther, D.: The logical difference for ELHr-terminologies using hy-
    pergraphs. In: Proceedings of ECAI’14. pp. 555–560 (2014)
10. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
    Cambridge University Press (2008)
11. Sinz, C.: Towards an optimal cnf encoding of boolean cardinality constraints. In:
    Proceedings of CP’05. pp. 827–831 (2005)

</pre>