Introduction

Computing Best Ontology Excerpts via Weighted Partial Max-SAT Solving?

Jieying Chen

Yue Ma

yue.mag@lri.fr 1

Dirk Walther

dirk.walther@ivi.fraunhofer.de 0 0 Fraunhofer IVI , Dresden , Germany 1 Laboratoire de Recherche en Informatique, Universite Paris-Sud , France

We consider the problem of computing best excerpts of ontologies, which are selections of a certain, small number k of axioms that best capture the knowledge regarding a given set of weighted terms. Such excerpts can be useful for ontology selection, for instance. The weights of terms are to re ect di erent importances of considered terms, for which we propose an incompleteness measure of excerpts based on the term weights and the notion of logical di erence. We present a method to compute best k-excerpts by nding all subsumption justi cations and solving a weighted partial Max-SAT problem. We demonstrate the viability of our approach with an evaluation on biomedical ontologies.

Introduction

In the presence of an ever growing amount of information, organizations and human users need to be able to focus on certain key pieces of information and to intentionally ignore all other possibly relevant parts. Knowledge about complex systems that is represented in ontologies yields collections of axioms that are too large for human users to browse, let alone to comprehend or reason about it. To address the need for ontology summary, the notion of an ontology excerpt was introduced [ 3 ]. An ontology excerpt is a xed-size subset of an ontology, consisting of the most relevant axioms for a given set of terms. These axioms preserve as much as possible the knowledge about the considered terms described in the ontology.

To evaluate the quality of ontology excerpts, we de ne a semantics-based measure, using Logical Di erence [6], to quantify how much semantic meaning is preserved in an excerpt compared to the original ontology. We take the logical di erence to be the set of queries that are relevant to an application domain and that produce di erent answers when evaluated over the considered ontologies. We consider concept subsumption queries over the terms of interest only.

Using an exhaustive search to nd the excerpts of an ontology that best preserve the semantic information w.r.t. the ontology is futile as it involves computing all (i.e. exponentially many) subsets of the ontology. In [ 3 ] e cient excerpt extraction techniques are investigated that are based on methods stemming from ? This work is partially funded by the ANR project GoAsQ (ANR-15-CE23-0022). the area of information retrieval [10], a research area which is generally concerned with developing techniques to extract the \most relevant" documents for a query from large data sources. However, these techniques do not yield the best excerpts possible. In this paper, we present an algorithm that computes a best excerpt of an ontology, i.e., there is no subset of the ontology of the same size that better preserves the knowledge about the selected terms. This paper extends the approach from [ 4 ] using an encoding to a weighted partial Max-SAT problem to handle weighted signatures.

We proceed as follows. We start in Section 2 with reviewing the notions of logical di erence together with di erence witness and best excerpt as well as the weighted partial Max-SAT problem. A best k-excerpt of a TBox is a sub-ontology that containing at most k axioms that best capture the knowledge about user's interests in terms of weighted signature. Moreover, we introduce the notion of subsumption modules of a TBox for a given set of terms. Taking the union of some subsumption modules can yield best excerpts for a weighted signature, as shown in Section 3. We encode the problem of selecting the right subsumption modules on weighted signature as a weighted partial Max-SAT problem. Overall we obtain an algorithm for computing best k-excerpts on weighted signature. In Section 4, we propose a method to rank axioms in excerpts when presenting user a best excerpt. We evaluate our algorithm by using it with the prominent bio-medical ontologies Snomed CT in Section 5. We demonstrate computing best k-excerpts is a viable approach for the task of summarizing large ontologies with a controllable small number of axioms that are most relevant for a given set of terms. We conclude the paper in Section 6. 2

Preliminaries

We brie y recall basic notions related to the ontology excerpts, and the logical di erence between terminologies [6, 8]. We assume familiarity with Description Logic; see [ 1 ] for a detailed introduction. 2.1

Ontology Excerpt

In this section, we review the notions of a k-excerpt, an incompleteness measure and a best k-excerpt of a TBox [ 3 ]. We recall that a signature is a nite set of concept and role names, and we use sig( ) to denote the signature of an syntactic object .

A k-excerpt of an ontology is a subset of the ontology containing no more than k axioms. An incompleteness measure quanti es how much of the knowledge of a TBox about the terms in a signature is captured (or how much is missing) by an excerpt.

De nition 1 (Incompleteness Measure). Let T be a TBox, and let be a

signature. An incompleteness measure T for T and is a function that maps every subset E T to a non-negative real number (E ).

A best k-excerpt best captures the knowledge about in T among all kexcerpts. How well a k-excerpt captures knowledge is measured by a suitable incompleteness measure. Note that best k-excerpts are generally not unique. De nition 2 (Best Excerpt). Let T be a TBox, let be a signature, and let k > 0 be a natural number. Additionally, let T be an incompleteness measure for T and . A best k-excerpt of T w.r.t. under T is a k-excerpt E of T such that

T (E ) = minf T (E 0) j E 0 is a k-excerpt of T g:

In [ 3,4 ], the incompleteness measure was set to be the number jcWtn (T ; E )j of concept inclusion di erence witnesses in w.r.t. T , which will be introduced in the next section. 2.2

Logical Di erence

r We now recall basic notions related to the logical di erence between two E LH terminologies for E L-inclusions over a given signature as query language [6, 8].

De nition 3 (Logical Di erence). Let T1 and T2 be two E L-terminologies,

and let be a signature. The concept inclusion di erence between T1 and T2 w.r.t. is the set cDi (T1; T2) of all E L-inclusions of the form C v D for E L-concepts C and D such that sig( ) , T1 j= , and T2 6j= .

In case two terminologies T1 and T2 are logically di erent, the set cDi (T1; T2) consists of in nitely many concept inclusions. The primitive witnesses theorems from [6] allow us to consider only certain inclusions of a simpler syntactic form. Theorem 1. Let T1 and T2 be two E L-terminologies, and let be a signature. If 2 cDi (T1; T2), then either A v D or C v A is a member of cDi (T1; T2), where A 2 sig( ) is a concept name, and C; D are E L-concepts occurring in .

(T1; T2).

We use the primitive witnesses theorem to obtain a succinct representation of the set cDi

De nition 4 (Primitive Witnesses). Let T1 and T2 be two E L-terminologies,

and let be a signature. We say that concept inclusion di erence witnesses in w.r.t. T1 and T2 are concept names contained in that occur on the left-hand side of inclusions of the form A v C in cDi (T1; T2) or on the right-hand side of inclusions of the form D v A in cDi (T1; T2). The set of all such witnesses will be denoted by cWtn (T1; T2).

The set cWtn (T1; T2) is nite as is nite. In particular, it holds that cDi (T1; T2) = ; i cWtn (T1; T2) = ;.

For (possibly cyclic) E L-terminologies T1 and T2, the set cWtn (T1; T2) can be computed using the current version of the CEX tool [7] implementing a prooftheoretic approach to the logical di erence problem using hypergraphs as introduced in [ 5 ] and further extended in [9]. 2.3

Weighted Partial Max-SAT problem

Let be a countably in nite set of propositional variables which we denote with p; q, etc. The set of propositional logic formulas ' is built according to the following grammar rule: ' ::= p j :' j ' ^ ' j ' _ ' j ' ! ' j ' $ ' with p 2 .

A propositional valuation v : ! f0; 1g is function mapping a propositional variable p 2 to a truth value v(p) 2 f0; 1g. A propositional formula ' 2 is satis ed in v i v(') = 1. We say that ' is satis able i there exists a valuation v such that v(') = 1.

A literal ` is a propositional variable p or a negated propositional variable :p, and a clause is a disjunction `1 _ : : : _ `n of literals `i (0 i n). For a propositional formula ' 2 , we denote with Cls(') the transformation of ' into an equi-satis able clause set, i.e. Cls(') is a nite set of propositional clauses such tkhnaotw'n itshasatteisvearyblperioptohsietifoonrmalufloarmVul2a ' 2is scaatins baebtlrea,nwshfoerrmeed =inCplosl(y'n)o. mItiaisl time (in the size of ') into an equi-satis able set of clauses. Note that the clauses in Cls(') may contain propositional variables that are not present in '.

The Boolean Satis ability Problem (SAT), also known as Propositional Satis ability Problem, is the problem of deciding whether there exists a valuation that satis es a given propositional formula ', or alternatively, all the clauses contained in a set of clauses = Cls('). As an extension, the set is split into two sets H and S, called hard and soft clause sets. The problem of nding a valuation that satis es all the clauses in H and a maximal number clauses in S is called the partial Max-SAT problem.

A weighted clause is a pair ( ; w ), where is a clause and w is a natural number or in nity meaning the cost for falsifying the clause . If a clause 2 H, then w is in nite. Given an assignment I and a multiset of weighted clauses , the cost of assignment I on is the sum of the weights of the clauses falsi ed by I. The Weighted Partial Max-SAT problem for weighted clauses is the problem of nding an optimal assignment to the variables of that minimizes the cost of the assignment on . 2.4

Subsumption Justi cations

Subsumption module and justi cation are de ned to capture the set of axioms required to preserve the logical consequences related to a particular concept name [ 4 ]. An algorithm for computing subsumption justi cations for E LHterminologies is presented in [ 4 ].

De nition 5 (Subsumption Module & Justi cation). Let T be a termi

nology, let be a signature, and let X 2 NC. Additionally, let M T .

We say that M is an hX; i-subsumption module of T i for every C; D 2 E L : T j= X v C implies M j= X v C. Plus T j= D v X implies M j= D v X.

An hX; i- subsumption justi cation M of T is an hX; i-subsumption module of T that is minimal w.r.t. (.

Property 1. Let T be a terminology, and let be a signature. Additionally, let M be a hX; i-subsumption justi cation of T . Then X 62 cDi (T ; M). 3

Computing Best k-Excerpts for Weighted Signatures

Best k-excerpts provide users with a way to extract the most relevant axioms from a big ontology with respect to a given signature. However, in certain cases, concepts or roles in the signature can have di erent importances. To this end, we propose a new incompleteness measure considering a weighted signature and study the corresponding best k-excerpts.

First, we introduce the de nition of weighted signature.

De nition 6 (Weighted Signature). A weight function is a function w :

NC [ NR ! R mapping a symbol 2 NC [ NR to a value w( ) 2 R [ f1g. Given a signature and a weight function w, a weighted signature w is a set w = f ( ; w( )) j 2 g.

A signature is weighted uniformly under a weight function w i w assigns the same value to every symbol in , i.e. w( 1) = w( 2) for every 1; 2 2 . For a weight function w whose range equals f1:0g (i.e., w maps every symbol to 1:0), we also write instead of w. For instance, w = f(A; 2:0); (B; 0:9); (r; 1:3)g is the weighted signature for the signature = fA; B; rg and a weight function w satisfying w(A) = 2:0, w(B) = 0:9 and w(r) = 1:3. On the other hand, a weight function w0 satisfying w0(A) = w0(B) = w0(r) = 1:4 weighs the symbols in uniformly resulting in w0 = f(A; 1:4); (B; 1:4); (r; 1:4)g, whereas = f(A; 1:0); (B; 1:0); (r; 1:0)g.

For best k-excerpts under a weighted signature w, we use, as an incompleteness measure T w for w and T , the sum of the weights of the E L-concept inclusion di erence witnesses in w.r.t. T and E , formally:

T w (E ) =

X w(X)

X2cWtn (T ;E)

For computing the best k-excerpts for a weighted signature, we present an encoding of the problem to a weighted partial Max-SAT problem, with the aim of delegating the task of nding the best excerpt to a Max-SAT solver. In that way we can leverage the decades of research e orts dedicated to developing e cient SAT solvers for our problem setting.

Note that our encoding requires that a cover of the hA; i-subsumption justi cations has already been computed for every concept name A 2 . Moreover our partial Max-SAT encoding is language agnostic, i.e., it works for any TBox as long as its subsumption justi cations can be computed.

As we are interested in nding a subset M T of a TBox T that ful lls certain properties, we encode the presence of an axiom 2 T in M using a propositional variable p . Intuitively, in a solution v to our partial Max-SAT problem we will have that v(p ) = 1 i the axiom is contained in the best k-excerpt.

Next we introduce the formulas FM and GM with M T and M 2T that we will use to encode the covers M of hA; i-subsumption justi cations M.

De nition 7 (Encoding of TBox Subsets). Let T be a terminology. For

every axiom 2 T , let p 2 be a fresh propositional variable.

For a subset M T , we de ne the propositional formula FM that encodes M as FM := V 2M p . Moreover, for a set M 2T , we de ne the propositional formula GM associated with M as

GM :=

_ M2M

FM =

_ ( ^ M2M 2M p ): Example 1. Let T = f 1; 2; 3; 4; 5g, M = f 1; 2; 3g, and M = ff 2; 3g; f 1; 4gg. Then FM = p 1 ^ p 2 ^ p 3 and GM = (p 2 ^ p 3 ) _ (p 1 ^ p 4 ).

We can now de ne our encoding of the best k-excerpt problem for weighted signature into weighted partial Max-SAT.

De nition 8 (Encoding of the Best Excerpt Problem for Weighted

Signature). Let T be an E L-TBox, let w be a weighted signature, and let 0 k jT j. Additionally, for every A 2 w, let MA be a cover of the hA; wisubsumption justi cations of T , and let qA be a fresh propositional variable.

The weighted partial Max-SAT problem for nding best k-excerpts of T w.r.t. denoted with Pk(T ; w), is de ned as follows.

We set Pk(T ; w) := (Hk(T ); Sk(T ; w)), where w, Hk(T ) := Card(T ; k) [

Cls(qA $ GMA ); [

A2 \NC Sk(T ; w) := f (qA; w(A)) j A 2 w \ NC g; and Card(T ; k) is a set of clauses specifying that at most k clauses from the set f p j 2 T g must be satis ed.

In the hard part of our weighted partial Max-SAT problem, the clauses in Card(T ; k) specify that the cardinality of the resulting excerpt E T must be equal to k. We do not x a certain encoding that should be used to obtain Card(T ; k), but we note that there exist several techniques that require a polynomial number of clauses in k and in the size of T 1 (see e.g. [11]). Moreover, for every concept name A 2 w, the variable qA is set to be equivalent to the formula GMA , i.e. qA will be satis ed in a valuation i the resulting excerpt will not have A as a di erence witness (A 62 cWtn (T ; E )).

In the soft part of the problem, the set Sk(T ; w) speci es that the sum of the weights of the concept names in the signature must be maximal by a solution of the problem, enforcing that the resulting excerpt E will yield the smallest 1 In our implementation we used the encodings for cardinality constraints that are provided by the Max-SAT solver Sat4j. possible sum of weights of di erence witnesses (whilst obeying the constraint that jE j = k).

We can now show the correctness of our encoding, i.e. a best k-excerpt for a weighted signature can be exactly obtained from any solution to the weighted partial Max-SAT problem Pk(T ; w).

Theorem 2. Let T be an E L-TBox, and let w be a weighted signature. Then v is a solution of the weighted partial Max-Sat problem Pk(T ; w) if and only if f 2 T j v(p ) = 1 g is a best k-excerpt of T w.r.t. w.

Our algorithm for nding a best k-excerpt given a TBox T and a weighted signature w (and k with 0 k jT j) as input is shown in Algorithm 1. First, the algorithm checks whether k = 0 or k = jT j, in which case the empty set or T itself is the best k-excerpt, respectively. Otherwise, the algorithm continues by computing subsumption justi cations for each concept name A in w and then transfer them to propositional formula GJA . After the iteration over all the concept names A in w is complete, the partial Max-SAT problem Pk(T ; w) is constructed with the help of the formulas GMA that are stored in S. Subsequently, a solution v of Pk(T ; w) is computed using a partial Max-SAT solver and the best k-excerpt is returned by analyzing which variables p have been set to 1 in the valuation v.

Algorithm 1: Computing Best k-Excerpts of T for Weighted Signature

Our algorithm rst computes subsumption justi cations, which requires exponential time in the size of T and w. Then the weighted partial Max-SAT encoding runs in polynomial time. Deciding whether there is a solution of an instance of a Max-SAT problem is NP-complete. Overall we have that Algorithm 1 runs in exponential time in the size of T and w in the worst case. Example 2. Let T = f 1 := A v X1 u X2 u X3 u X4; 2 := X1 v B1; 3 := X2 v B1; 4 := X3 v B2; 5 := X4 v B2g be a TBox. Let = fA; B1; B2g be a signature. Then, for X 2 , we obtain the following hX; isubsumption justi cations MX : MA = ff 1; 2g; f 1; 3g; f 1; 4g; f 1; 5gg, MB1 = ff 1; 2g; f 1; 3gg and MB2 = ff 1; 4g; f 1; 5gg. Each subsumption justi cation MX is encoded as a propositional logic formula GX as follows: GA = ((p1 ^ p2) _ (p1 ^ p3) _ (p1 ^ p4) _ (p1 ^ p5)), GB1 = ((p1 ^ p2) _ (p1 ^ p3)) and GB2 = ((p1 ^ p4) _ (p1 ^ p5)).

Now let w1; w2; w3 be weight functions such that w1(A) = w1(B1) = w1(B2) = 1:0, w2(A) = w2(B1) = 1:0 and w2(B2) = 2:0, and w3(A) = w3(B2) = 1:0, w3(B1) = 2:0. Feeding the resulting weighted signatures w1 , w2 and w3 together with k = 2 to a SAT solver yields the following solutions. For w1 , we obtain four assignments of truth values for the propositional variables in the tuple hp1; p2; p3; p4; p5i (in this order): h1; 1; 0; 0; 0i, h1; 0; 1; 0; 0i, h1; 0; 0; 1; 0i and h1; 0; 0; 0; 1i. Then we decode the assignments to sets of axioms. We obtain the following best 2-excerpts for w1 : f 1; 2g, f 1; 3g, f 1; 4g and f 1; 5g.

In the case of w2 , however, the solutions returned by the SAT-solver are the assignments h1; 0; 0; 1; 0i and h1; 0; 0; 0; 1i. Consequently, the best 2-excerpts for w2 are f 1; 4g and f 1; 5g.

Finally, for w3 , the SAT-solver yields h1; 1; 0; 0; 0i and h1; 0; 0; 1; 0i resulting in f 1; 2g and f 1; 4g as the best 2-excerpts for w3 .

Example 2 shows that, di erent weights of -symbols can in uent the nal results of best excerpts. By giving higher weights on preferring -concept names (like w2 or w3 ), users can reduce the number of best k-excerpts. However, in some extreme cases, as shown in the following example where a signature has a much higher weight then the others, the best excerpts for the wighted signature can be totally irrelevant to those w.r.t. .

Example 3. Let T = f 1 := A1 v X; 2 := X v Y; 3 := Y v Z; 4 := A2 v B1; 5 := A3 v B2g be a TBox. Let = fA1; A2; A3; B1; B2; Y; Zg be a signature. Now let w1; w2 be weight functions such that w1(A1) = w1(A2) = w1(A3) = w1(B1) = w1(B2) = w1(Y ) = w1(Z) = 1:0 (i:e:; w1 is ) and w2(A1) = w2(A2) = w2(A3) = w2(Y ) = 1:0 and w2(Z) = 9:0. We obtain that the best 3-excerpt E for is f 4; 5g 2. The best 3-excerpt E w2 for w2 is f 1; 2g. However, j cWtn (T ; E ) j=j fA1; X; Y g j= 3 and j cWtn (T ; E w2 ) j=j fA2; A3; B1; B2g j= 4.

In Example 3, as the w2(Z) is much higher than the rest concept names in signature, E is not the same as E w . The following theorem provides a guide of how to restrict weight function in order to keep monotonicity of best excerpts even though w changes.

Proposition 1. Let T be an E L-TBox, w be a weighted signature and let be a uniformed signature under the weighted function w.

(i) If 1 w( ) 1 + j \1NCj for every 2 \ NC, then every best k-excerpt of T w.r.t. w will be a best k-excerpt of T w.r.t. , but not vice versa. (ii) If P i2 1 w i > P j2 2 w j for any 1; 2 2 2 \NC and j 1 j>j 2 j, i every best k-excerpt of T w.r.t. w is a best k-excerpt of T w.r.t. . 2 The best 3-excerpt E 3-excerpt E0, T (E0) = for w1 only contains two axioms 4 and 5, as for any best

T (f 4; 5g)

Ranking Axioms in Best Excerpts

Best excerpts are useful to work with when a whole ontology exceeds the capacity of a system or a human user. However, multiple axioms often exist in the best excerpts. To better assist users, it would be desirable to give a ranking of these axioms according to their importances. In this section we present an approach to weight axioms according to weighted signatures.

To achieve a ranking, we introduce the de nition of weighted axioms.

De nition 9 (Weighted Axiom). An axiom weight function

: T ! R+ mapping an axiom 2 T to a value ( ) 2 R+. is a function

Given an axiom weight function, it is then natural to introduce an axiom ranking. Intuitively, an axiom with a larger weight should be ranked more highly. De nition 10 (Axiom Ranking). Given an axiom weight function axioms and , we say is preferred than , denoted , if ( ) and two ( ).

To achieve an axiom ranking, in the following, we show that a signature weight function can induce an axiom weight function. Intuitively, an axiom that appears in subsumption justi cations of more -symbols with higher weights should be more highly weighted.

De nition 11. Let T be a TBox, w a signature with the weight function w. Additionally, M be the set of all subsumption justi cations of . Then the axiom weight function ( ) induced by w is: w ( ) =

X f 2 w j 9M2M s.t. 2Mg w( ) We denote w the axiom ranking induced by w .

Example 4 (Example 2 contd.). Note that 1 (resp. 2; 3; 4; 5) is contained in at least one of the subsumption justi cations of -symbols fA; B1; B2g (resp. fA; B1g, fA; B1g, fA; B2g, fA; B2g). So for w3, ( 1) = w(A)+w(B1)+w(B2) = 4:0, ( 2) = ( 3) = w(A)+w(B1) = 2:0, and ( 4) = ( 5) = w(A)+w(B2) = 3:0. Hence, for the 2-best excerpts w.r.t. w3: f 1; 2g and f 1; 4g, we can order the axioms by 1 w 2 and 1 w 4.

It is easy to see that the axiom function ( ) in De nition 11 gives a total ordering over the axioms in T w.r.t. to a given weighted signature w. Moreover, it is independent of excerpts as stated in the following.

Proposition 2. Let T be a TBox, w a signature with the weight function w, for any best excerpts E1 and E2 and axioms 1; 2 2 E1 \ E2, 1 2 in E1 i .

1 2 in E2.

That is, is a stable ordering of the axioms given a weighted signature. However, note that the axioms in best k-excerpts are not monotonic with respect to the size of k (see a counter-example given in [ 3 ]).

Snomed CT NCI Time (s) Minimal Maximal Median STDEV Minimal Maximal Median STDEV MaxSat To demonstrate that our algorithm works in practise, we implemented a JAVA prototype using SAT4J as a Max-SAT solver [ 2 ]. All the tests were performed on a server running Ubuntu 15.10 with Intel Xeon 2.50GHz Core 4 Duo CPU with 64G RAM. We used the prominent medical ontologies Snomed CT (version Jan 2016)3 consisting of 317 891 axioms, and the NCI Thesaurus (NCIt, version 16.03d)4 containing 165 341 axioms.

The rst experiment is to test the e ciency of our encoding method. Table 1 shows the time needed to compute best excerpts of NCI and Snomed CT for randomly generated signatures of di erent sizes (10/30/50 concept names and 10/30/50 roles names). In total over 3 000 signatures have been considered. After precomputing the subsumption justi cations for every concept name in every signature, the time was measured as that needed to prepare the input and run the partial Max-SAT solver to compute best k-excerpts for every size k. The minimal, maximal, median, and standard deviation of the execution times show that computing best excerpts of any size using Max-SAT is very e cient given that the subsumption justi cations have been computed.

In the second experiment, we compare the size of locality based modules with the number of axioms in best excerpts needed to preserve a certain amount of knowledge. We denote with #(Preserved (best) = n), for n 2 f1; 2g, the minimal number of axioms needed by a best excerpt to preserve the knowledge of n concept names w.r.t. the signature (i.e., the number k of axioms of a best excerpt E of T such that n = j \ NC n cWtn (T ; E)j). Instead of using random signatures, however, we consider a scenario where a user searches for sub-ontologies of Snomed CT related to a particular concept name. We computed 2 500 di erent signatures each consisting of a concept name related to diseases, the TOP-concept (called `SNOMED CT Concept') and all role names of Snomed CT. All weights of -symbols are equal in this experiment.

In Figure 1, the 2 500 signatures are presented along the x-axis in ascending order w.r.t. the size of their respective ?> -local modules (black line). The y-axis represents the number of axioms in the module and excerpts for a signature. The red (resp. green) line presents the sizes of best excerpts that preserve the knowledge for one (resp. two) concept name(s), i.e., #P reserved (best) = 1 (resp. #P reserved (best) = 2). Best excerpts provide a conciser way for zooming in on an ontology, in particular, when sacri cing completeness is acceptable. For 3 http://www.ihtsdo.org/snomed-ct 4 http://evs.nci.nih.gov/ftp1/NCI_Thesaurus

Module #Preserved (best) = 1 #Preserved (best) = 2 0 500 1000 1500 Signature ID 2000 2500 this experiment, after computing the subsumption justi cations of all concept names in a signature, it only took 0:15s on average to compute the best excerpts. 6

Conclusion

We have presented an algorithm for computing a best k-excerpt of a TBox for given weighted signature. Best k-excerpts are k-element subsets of an ontology that best preserve entailments over the target signature compared to the ontology [ 3 ]. The measure of incompleteness of the excerpt w.r.t. the ontology is based on the incompleteness measure taking into account weights of concept names in the signature. We have implemented our algorithm and we have presented an experimental evaluation to demonstrate the viability of computing best excerpts of large biomedical ontologies in practice. We leave investigating alternative incompleteness measures for future work. 6. Konev, B., Ludwig, M., Walther, D., Wolter, F.: The logical di erence for the lightweight description logic EL. Journal of Arti cial Intelligence Research 44, 633{708 (2012) 7. Konev, B., Ludwig, M., Wolter, F.: Logical di erence computation with CEX2.5.

In: Proceedings of IJCAR'12. pp. 371{377 (2012) 8. Konev, B., Walther, D., Wolter, F.: The logical di erence problem for description logic terminologies. In: Proceedings of IJCAR'08. pp. 259{274 (2008) 9. Ludwig, M., Walther, D.: The logical di erence for ELHr-terminologies using hypergraphs. In: Proceedings of ECAI'14. pp. 555{560 (2014) 10. Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval.

Cambridge University Press (2008) 11. Sinz, C.: Towards an optimal cnf encoding of boolean cardinality constraints. In: Proceedings of CP'05. pp. 827{831 (2005)

1. Baader , F. , Calvanese , D. , McGuinness , D.L. , Nardi , D. , Patel-Schneider , P.F . (eds.): The description logic handbook: theory, implementation, and applications . Cambridge University Press, 2 edn . ( June 2010 )

2. Berre , D.L. , Parrain , A. : The sat4j library, release 2.2. JSAT 7 ( 2-3 ), 59 {6 ( 2010 )

3. Chen , J. , Ludwig , M. , Ma , Y. , Walther , D. : Towards extracting ontology excerpts . In: Proceedings of KSEM'15 . pp. 78 { 89 ( 2015 )

4. Chen , J. , Ludwig , M. , Ma , Y. , Walther , D. : Zooming in on ontologies: Minimal modules and best excerpts . In: Proceeding of ISWC'17 . pp. 173 { 189 ( 2017 )

5. Ecke , A. , Ludwig , M. , Walther , D. : The concept di erence for EL-terminologies using hypergraphs . In: Proceedings of DChanges'13 ( 2013 )