<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Exploiting Partial Information in Taxonomy Construction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Rob</forename><surname>Shearer</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Oxford University Computing Laboratory</orgName>
								<address>
									<settlement>Oxford</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ian</forename><surname>Horrocks</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Oxford University Computing Laboratory</orgName>
								<address>
									<settlement>Oxford</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Boris</forename><surname>Motik</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Oxford University Computing Laboratory</orgName>
								<address>
									<settlement>Oxford</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Exploiting Partial Information in Taxonomy Construction</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">3C43DA243A13C823523A1CB27BF5501C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T23:18+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>One of the core services provided by description logic (DL) reasoners is classification: determining the subsumption quasi-ordering over the concept names occurring in a knowledge base (KB) and caching this information in the form of a directed acyclic graph known as the concept hierarchy or taxonomy. For less expressive DLs, such as members of the EL family, it may be possible to derive all the relevant subsumption relationships in a single computation <ref type="bibr" target="#b2">[Baader et al., 2005]</ref>. In general, however, it will be necessary to "deduce" the subsumption relation by performing individual subsumption tests between pairs of concept names. For n concept names this will, in the worst case, require n<ref type="foot" target="#foot_1">2</ref> tests, but for the tree-shaped hierarchies typically found in realistic KBs much better results can be achieved using algorithms that construct the taxonomy incrementally by traversing the partially-constructed taxonomy in order to find the right place to insert each concept name.</p><p>This kind of algorithm suffers from two main difficulties. First, individual subsumption tests can be computationally expensive-for some complex KBs, even state-of-theart reasoners may take a long time to perform a single test. Second, even when subsumption tests themselves are very fast, a knowledge base containing a very large number of concepts<ref type="foot" target="#foot_0">1</ref> will obviously result in a very large taxonomy, and repeatedly traversing this structure can be costly.</p><p>The first difficulty is usually addressed by using an optimized construction that tries to minimize the number of subsumption tests performed in order to deduce the subsumption relation. Most implemented systems use an "enhanced traversal" algorithm due to Ellis <ref type="bibr" target="#b4">[1991]</ref> and to Baader et al. <ref type="bibr" target="#b0">[1994]</ref> which adds concepts to the taxonomy one at a time using a two-phase top-down and bottom-up breadth-first search of the partially-constructed taxonomy. The algorithm exploits the structure of the KB to identify "obvious" subsumers (so-called told-subsumers) of each concept, and uses this information in a heuristic that chooses the order in which concepts are added, the goal being to construct the taxonomy top-down; it also exploits information from the topdown search in order to prune the bottom-up search. 2  The second difficulty can be addressed by optimizations that try to identify a subset of the concepts for which complete information about the subsumption relation can be deduced without performing any individual subsumption tests. This can be achieved, e.g., by identifying completely-defined concepts <ref type="bibr" target="#b10">[Tsarkov et al., 2007]</ref>-those for which only structurally-obvious subsumption relationships hold. Having constructed part of the taxonomy using such a technique, the remaining concepts can be added using the standard enhanced traversal algorithm.</p><p>In this paper we present a new classification algorithm that generalizes and refines the above techniques. Starting with a set of known subsumption relationships and a (larger) set of possible subsumption relationships, it computes the subsumption quasiordering by extending the known set and reducing the possible set until the two coincide. An important advantage of our algorithm is that it is able to exploit partial information about all concepts as well as complete information about some concepts. We show how such known and possible relationships can be derived from data generated in the course of (hyper) tableau-based subsumption and satisfiability testing; this approach provides an efficient generalization of the told-subsumer and completely-defined optimizations, both of which derive partial information from structural analysis of the KB. When the known and possible sets do not coincide, our algorithm incrementally computes additional (non-)subsumption relationships, and maximally exploits the resulting information to refine the sets of known and possible subsumers; this can be seen as a generalization of the search-pruning optimizations introduced by Baader et al..</p><p>We have used a prototypical implementation of our new algorithm to compare its behavior with that of the classification algorithms implemented in state-of-the-art DL systems. The comparison shows that our algorithm can dramatically reduce the number of subsumption tests performed when classifying a KB. Moreover, in contrast to the completely-defined optimization, the behavior of our algorithm degrades gracefully as the gap between the sets of initially-known and possible subsumption relationships increases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries</head><p>Given a set of elements U = {a, b, c, ...}, let R be a binary relation over U , i.e., a subset of U × U . We say that there is a path from a to b in R if there exist elements c 0 , ..., c n ∈ U such that c 0 = a, c n = b, and</p><formula xml:id="formula_0">c i , c i+1 ∈ R for all 0 ≤ i &lt; n. The transitive closure of R is the relation R + such that a, b ∈ R + iff there is a path from a to b in R. The transitive-reflexive closure R * of R is the transitive closure of the reflexive extension of R, i.e. R + ∪ { a, a | a ∈ U }.</formula><p>A binary relation is a quasi-ordering if it is both reflexive and transitive. Clearly, the subsumption relation on a set of concepts is a quasi-ordering. Note, however, that it is not a partial-ordering, because it is not antisymmetric: We assume that we begin with partial information about R: we are provided with a set K = { a 0 , b 0 , ..., a m , b m } where a i , b i ∈ R for 0 ≤ i ≤ m, and also with a set K neg = { c 0 , d 0 , ..., c n , d n } where c i , d i ∈ R for 0 ≤ i ≤ n. We call the set K the known portion of R. In this paper we do not operate on the set K neg directly; our presentation instead refers to its complement U × U \ K neg , which we denote by P and call the possible portion of R. It is thus the case that K ⊆ R ⊆ P . If no partial information is available, then K = ∅ and P = U × U .</p><p>We can use the result of each test a, b ∈ ? R to further refine the bounds on R by either adding a, b to K or removing it from P ; eventually</p><formula xml:id="formula_1">K[D] = R[D] = P [D].</formula><p>We next show, however, that the bounds on R can sometimes be refined without performing additional tests by combining information from K and P .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Maximizing Partial Information</head><p>The key to minimizing the number of explicit tests required to discover R[D] is maximizing the information gained from K and P . To do so, we exploit the knowledge that R is a quasi-ordering. In this case, K ⊆ R obviously implies that K * ⊆ R, so we can use K * to obtain a tighter lower bound on R. Less obvious is the fact that we can also obtain a tighter upper bound on R by identifying pairs in P which are not consistent with K and the transitivity of R.</p><p>For example, consider the case shown in Figure <ref type="figure" target="#fig_1">1</ref>(a). If we know that b is a successor of a in R (i.e., a, b ∈ K), then an element c can be a successor of b only if it is also a successor of a (if a, c ∈ P then b, c ∈ R). Further, a can be a successor of an element d only if b is also a successor of d.</p><p>Both of these examples are special cases of the structure shown in Figure <ref type="figure" target="#fig_1">1</ref>(b): if u is a successor of u and v is a successor of v, then an edge from u to v would form a path all the way from u to v , requiring v to be a successor of u . Since R is reflexive we can choose u = u or v = v to see that v can be a successor of u only if v is a successor of u and v is also a successor of u. We use this to formalize a subset P K of P , and show that P K is the tightest possible upper bound on R.</p><p>Definition 1 Let K and P denote two relations such that K * ⊆ P . We define the reduction P K of P due to K as follows:</p><formula xml:id="formula_2">P K = P ∩ { u, v | ∀u , v : { u , u , v, v } ⊆ K * → u , v ∈ P }</formula><p>Lemma 1 Let K and P denote two relations such that K * ⊆ P . (i) For all quasiorders R such that K ⊆ R ⊆ P , it is the case that R ⊆ P K . (ii) Let S be a proper subrelation of P K . Then there exists a quasi-ordering R such that K ⊆ R ⊆ P and R ⊆ S; i.e. P K is minimal. </p><formula xml:id="formula_3">Proof. (i) Let u, v be a tuple in R. For every u , v such that { u , u , v, v } ⊆ K * , K * ⊆ R implies that { u , u , v, v } ⊆ R. Because R is transitive and u, v ∈ R, it must also be the case that u , v ∈ R and thus that u , v ∈ P . Consequently, u, v ∈ P K , so R ⊆ P K . (ii) Choose elements a and b such that a, b ∈ P K but a, b ∈ S. Let R be the transitive-reflexive closure of the relation K ∪ { a, b }. Clearly K ⊆ R and R ⊆ S.</formula><p>Let u, v be any tuple in R. There are three cases:</p><formula xml:id="formula_4">1. u, v = a, b . Then u, v ∈ P since a, b ∈ P K and P K ∈ P . 2. u, v ∈ K + . Then u, v ∈ P since K * ⊆ P . 3. u, a ∈ K * and b, v ∈ K + . Then u, v ∈ P since a, b ∈ P K .</formula><p>For any tuple u, v ∈ R, it is the case that u, v ∈ P , thus K ⊆ R ⊆ P and R ⊆ S.</p><p>Note that P K itself is not necessarily transitive: given three elements a, b, and c and the relation P = { a, a , b, b , c, c , a, b , b, c }, it is the case that P ∅ = P . Of course no transitive subrelation R of P contains both a, b and b, c .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Taxonomy Construction and Searching</head><p>As described in Section 3.1, given relations K and P such that <ref type="figure">and a, b</ref> is not an element of R if a, b ∈ P K ; the only "unknown" elements of R are the tuples in P K \ K * . Further, each test of the form a, b ∈ ? R provides additional information which can be used to extend K or restrict P . This suggests the following simple procedure for deducing the restriction R[D] of a quasi-ordering R to domain D:</p><formula xml:id="formula_5">K ⊆ R ⊆ P for some unknown quasi-ordering R, a tuple a, b is an element of R if a, b ∈ K * ,</formula><formula xml:id="formula_6">COMPUTE-ORDERING(K, P, D) 1 while K * [D] = P K [D] 2 do choose some a, b ∈ D such that a, b ∈ P K \ K * 3 if a, b ∈ ? R then add a, b to K 4 else remove a, b from P 5 return K[D]</formula><p>In the case where no information about the quasi-ordering R[D] is available other than K and P , the above procedure performs well. In many cases, however, some general properties about R[D] can be assumed. In the case where R represents subsumption relationships between concepts, for example, R[D] is typically much smaller than D × D (i.e., relatively few pairs of concepts are in a subsumption relationship). In such cases, it is beneficial to use heuristics that exploit the (assumed) properties of R[D] when choosing a and b in line 2 of the above procedure.</p><p>We summarize below a variant of COMPUTE-ORDERING which performs well when the restriction to be computed is treelike in structure and little information about the ordering is available in advance. This procedure is designed to perform individual tests in an order similar to the enhanced traversal algorithm; however, it minimizes the number of individual tests performed by maximally exploiting partial information.</p><p>The algorithm chooses an element of a ∈ D for which complete information about R <ref type="bibr">[D]</ref> is not yet known. It identifies the subset V ↑ ⊆ D of elements b for which a, b ∈ R, and the subset V ↓ ⊆ D of elements b for which b, a ∈ R, updating K and P accordingly. In order to compute these sets efficiently, we make use of the subroutines SUCCESSORS and PREDECESSORS, which perform the actual tests. The SUCCESSORS and PREDECESSORS functions are derived from the enhanced traversal algorithm: they perform a breadth-first search of the transitive reduction K of the known subsumptions K-the smallest relation whose transitive closure is K * . In order to avoid the cost of repeated traversals of K , we restrict the searches to, respectively, the possible successors and predecessors of a. We omit the details of these search routines for the sake of brevity. COMPUTE-ORDERING-2(K, P, D)</p><formula xml:id="formula_7">1 while K * [D] = P K [D] 2 do choose some a, x ∈ D s.t. a, x ∈ P K \ K * or x, a ∈ P K \ K * 3 let B be the possible successors of a, i.e. D ∩ {b | a, b ∈ P K \ K * } 4 if B = ∅ then V ↑ ← SUCCESSORS(a, K [B]) 5 add a, b to K for every element b of V ↑ 6 remove a, b from P for every element b of B \ V ↑ 7 let B be the possible predecessors of a, i.e. D ∩ {b | b, a ∈ P K \ K * } 8 if B = ∅ then V ↓ ← PREDECESSORS(a, K [B]) 9 add b, a to K for every element b of V ↓ 10 remove b, a from P for every element b of B \ V ↓ 11 return K[D]</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Extracting Subsumption Information from Models</head><p>We next turn our attention to the specific case of identifying all subsumption relationships between the concepts of a knowledge base K. Instead of treating a reasoning service as an oracle that answers boolean queries of the form "is A subsumed by B w.r.t. K?" (which we will write K |= ? A B), we consider how information generated internally by common reasoning algorithms can be exploited to discover information about the subsumption quasi-ordering.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Identifying Non-Subsumptions</head><p>Most modern reasoners for Description Logics, including HermiT, Pellet, and FaCT++, transpose subsumption queries into satisfiability problems; in particular, to determine if K |= A ⊥, these reasoners test whether the concept A is satisfiable w.r.t. K. They do this by trying to construct (an abstraction of) a Tarski-style model of K in which the extension of A is nonempty. We begin by providing an abbreviated formalization of such models (see Baader et al. <ref type="bibr" target="#b1">[2003]</ref> for more details):</p><p>Definition 2 Given sets of concept names N C , role names N R and individual names N I , an interpretation I = (∆ I , • I ) consists of a nonempty set ∆ I and an interpretation function • I which maps every element of N C to a subset of ∆ I , every element of N R to a subset of ∆ I × ∆ I and every element of N I to an element of ∆ I . An interpretation I is a model of an axiom A B if A I ⊆ B I (similar definitions hold for other kinds of statement); it is a model of a KB K if it models every statement in K.</p><p>Let A and B be concepts. A model I of K is a witness for the satisfiability of A w.r.t. K if A I is nonempty; it is a witness for the nonsubsumption A B w.r.t. K if A I ⊆ B I , i.e., if there exists i ∈ ∆ I s.t. i ∈ A I and i ∈ B I .</p><p>The algorithms in question typically represent the model being constructed as an ABox, i.e., as a set of assertions of the form x : C and x, y : R for individuals x, y, (possibly complex) concepts C and roles R <ref type="bibr">[Baader et al., 2003</ref>]. An ABox including the assertion x : C represents a model in which x I ∈ C I . To construct a witness for the satisfiability of a concept A, the ABox is initialised with an assertion x : A and the construction proceeds in a goal-directed manner by adding further assertions only as necessary in order to ensure that the ABox represents a model of K.</p><p>Assuming that the construction is successful, the resulting ABox/model provides a rich source of information. For example, for any concept B such that x :(¬B) is in the ABox, it is the case that x I ∈ B I ; thus the model is a witness for the non-subsumption K |= A B for all such concepts B. In many cases, the non-presence of x : B in the ABox is sufficient to conclude the relevant non-subsumption; in fact, when using a hypertableau algorithm, this is always the case.</p><p>The goal-directed nature of the ABox construction means that the models constructed are typically quite small. As a result, these models tend to be extremely rich in non-subsumption information: in a typical witness for the satisfiability of A, i.e., a model I of K with i ∈ A I , there will be relatively few other concepts B such that i ∈ B I , and thus I will identify the vast majority of concepts in K as nonsubsumers of A. For this reason, it is almost always more efficient to record the set P A = {B | i ∈ A I and i ∈ B I for some i} of "possible subsumers" of A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Identifying Subsumptions</head><p>While single models allow us to detect non-subsumptions, additional information about the space of possible models is required in order to identify subsumption relationships. Sound and complete tableau reasoning algorithms systematically explore the space of all "canonical" models (typically tree-or forest-shaped models), on the basis that, if any model exists, then one of these canonical models also exists. In particular, when K includes disjunctions or other sources of nondeterminism, it may be necessary to choose between several possible ways of modelling such statements, and to backtrack and try other possible choices if the construction fails.</p><p>For such algorithms, it is usually easy to show that, if the ABox was initialized with x : A, the construction did not involve any nondeterministic choices, and the resulting ABox includes the assertion x : B, then it is the case that in any model I of K, i ∈ A I implies i ∈ B I , i.e., that K |= A B. Moreover, as we have already seen in Section 4.1, such an ABox is (at least in the hypertableau case) a witness to the non-subsumption K |= A B for all concepts B such that x : B is not in the ABox. Thus, when testing the satisfiability of a concept A, it may be possible to derive complete information about the subsumers of A.</p><p>The hypertableau-based HermiT reasoner is designed to reduce nondeterminism, and completely avoids it when dealing with Horn-SHIQ KBs; for such KBs it is thus able to derive complete information about the subsumers of a concept A using a single satisfiability test. This allows HermiT to derive all relevant subsumption relationships in a Horn-SHIQ knowledge base as a side effect of performing satisfiability tests on each of the named concepts <ref type="bibr" target="#b9">[Motik et al., 2007]</ref>.</p><p>This idea can be extended so as to also derive useful information from nondeterministic constructions by exploiting the dependency labeling typically used to enable "dependency-directed backtracking"-an optimization which reduces the effects of nondeterminism in reasoning <ref type="bibr" target="#b8">[Horrocks, 1997]</ref>. In the resulting ABoxes, each assertion is labelled with the set of choice points on which it depends. An empty label indicates that the relevant assertion will always be present in the ABox, regardless of any choices made during the construction process. Thus, if the ABox is initialized with x : A, an empty-labelled assertion x : B in the resulting ABox can be treated in the same way as if the construction had been completely deterministic. Performing a satisfiability test on A may, therefore, allow some subsumers of A to be identified even when nondeterministic choices are made during reasoning. In practice, almost all of the actual subsumers of A can usually be identified in this way.</p><p>It is easy to see that this idea is closely related to, and largely generalizes, the told subsumer and completely-defined optimizations. For a completely defined concept A, a satisfiability test on A will be deterministic (and typically rather trivial), and so will provide complete information about the subsumers of A. Similarly, if B is a told subsumer of A, then an ABox initialized with x : A will always produce x : B, and almost always deterministically (it is theoretically possible that x : B will be added first due to some nondeterministic axiom in the KB).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work</head><p>Computing a quasi-(or partial-) ordering for a set of n incomparable elements clearly requires n 2 individual tests-naïvely comparing all pairs is thus "optimal" by the simplest standard. The literature therefore focuses on a slightly more sophisticated metric which considers both the number of elements in the ordering as well as the width of the ordering-the maximum size of a set of mutually incomparable elements. Faigle and Turán <ref type="bibr" target="#b5">[1985]</ref> have shown that the number of comparisons needed to deduce an ordering of n elements with width w is at most O(wn log(n/w)) and Daskalakis et al. provide an algorithm which approaches this bound by executing O(n(w + log n)) comparisons <ref type="bibr">[2007]</ref>. Taxonomies, however, tend to resemble trees in structure, and the width of a subsumption ordering of n elements is generally close to n/2. Further, the algorithms of Faigle and Turán as well as Daskalakis et al. rely on data structures which require O(nw) storage space even in the best case, and thus exhibit quadratic performance when constructing a taxonomy.</p><p>A taxonomy-construction strategy which performs well for tree-like relations is described by Ellis <ref type="bibr" target="#b4">[1991]</ref>: elements are inserted into the taxonomy one at a time by finding, for each element, its subsumers using a breadth-first search of all previously-inserted elements top-down, and then its subsumees using a breadth-first search bottom-up. Baader et al. further refine this technique to avoid redundant subsumption tests during each search phase: during the top search phrase, a test K |= ? A B is performed only if K |= A C for all subsumers C of B <ref type="bibr" target="#b0">[1994]</ref>. This can be seen as a special case of our P K pruning of possible subsumers, with the restriction that it only applies to subsumption tests performed in a prescribed order. These traversal algorithms can be further optimized using the clustering technique proposed by Haarslev and Möller <ref type="bibr">[2001]</ref>, which avoids the inefficiency of traversing flat taxonomies by introducing new concepts to enforce a maximum branching factor for all taxonomy nodes. This optimization can also be incorporated into our approach.</p><p>Baader et al. also describe techniques for identifying subsumers without the need for multiple subsumption tests by analyzing the syntax of concept definitions in a KB: if a KB contains an axiom of the form A B C where A and B are atomic concepts, then B is a "told subsumer" of A, as are all the told subsumers of B. The various simplification and absorption techniques described by Horrocks <ref type="bibr" target="#b8">[1997]</ref> increase the applicability of such analysis. Haarslev et al. further extend this analysis to detect nonsubsumption: an axiom of the form A ¬B C implies that A and B are disjoint, thus neither concept subsumes the other (unless both are unsatisfiable) <ref type="bibr">[2001]</ref>. Tsarkov et al. describe a technique for precisely determining the subsumption relationships between "completely defined concepts"-concepts whose definitions contain only conjunctions of other completely defined concepts <ref type="bibr">[2007]</ref>. All these optimizations can be seen as special cases of (non-)subsumption information being derived from (possibly incomplete) calculi as described in Section 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Empirical Evaluation</head><p>In order to determine if our new algorithm is likely to improve classification performance in practice we conducted two experiments using large KBs derived from lifescience applications.</p><p>First, we compared the performance of our new algorithm with the enhanced traversal algorithm. In order to analyze how much improvement is due to the information extracted directly from models and how much is due to our new approach to taxonomy construction, we extend the enhanced traversal algorithm such that it first performs a satisfiability test on every concept and constructs a cache of information derived from the resulting models using the techniques described in Section 4. During the subsequent taxonomy construction, subsumption tests are performed only if the relevant subsumption relationship cannot be determined by consulting the cache. Note that this caching technique strictly subsumes the "told subsumer" and "primitive component" optimizations described by Baader et al.. We implemented both algorithms within the HermiT reasoner <ref type="bibr" target="#b9">[Motik et al., 2007</ref>] and performed testing using the well-known US National Cancer Institute thesaurus (NCI), a large but simple KB containing 27,653 classes. The models constructed by HermiT during satisfiability testing of these classes provide complete information about the subsumption ordering for this KB, so both algorithms are able to classify it without performing any additional tests. To study how the algorithms compare when lessthan-complete information is available, we limited the amount of information extracted from HermiT's models, reducing the number of known subsumptions and increasing the number of possible subsumptions to varying degrees. The number of full subsumption tests required for classification as well as the running times for each algorithm are given in Table <ref type="table" target="#tab_0">1</ref>.</p><p>As the table shows, our simple implementation of the enhanced traversal algorithm (ET) is substantially slower than the new algorithm even when complete information is available; this is the result of the "insertion sort" behavior of ET described in Section 5.</p><p>When complete information is not available, our algorithm consistently reduces the number of subsumption tests needed to fully classify the knowledge base by an order of magnitude.</p><p>In a second experiment, we compared the implementation of our new algorithm in HermiT with the widely-used Description Logic classifiers FaCT++ and Pellet. Both of these systems are quite mature and implement a wide range of optimizations to both taxonomy construction and subsumption reasoning; we were thus able to compare our new algorithm with existing state-of-the-art implementations. In this case, in addition to NCI we used the Gene Ontology (GO), and the wellknown GALEN medical terminology KB. Both NCI and GO have been specifically constructed to fall within the language fragment which existing reasoners are able to classify quickly; GALEN, in contrast, necessitates substantially more difficult subsumption testing but contains an order of magnitude fewer concepts. In order to estimate how the different algorithms would behave with more expressive KBs, for each KB K we constructed two extensions: K ∃ which adds the single axiom ∃R.A for a fresh role name R and fresh concept A, and K which adds the axiom A B for fresh concepts A and B. For NCI we constructed a further extension NCI ∃∀ by adding the axioms ∃R.A and C ∀R.B for each of the 17 most general concepts C occurring in the KB. Each of these extensions increases the complexity of individual subsumption tests and reduces the effectiveness of optimizations that try to avoid performing some or all of the tests that would otherwise be needed during classification.</p><p>The number of classes in each KB as well as the number of tests performed (including all concept satisfiability and subsumption tests) and the time taken by each reasoner are shown in Table <ref type="table" target="#tab_1">2</ref>. The Pellet system makes use of a special-purpose reasoning procedure for KBs that fall within the EL fragment <ref type="bibr" target="#b2">[Baader et al., 2005]</ref>; for such KBs we do not, therefore, list the number of subsumption tests performed by Pellet.</p><p>As Table <ref type="table" target="#tab_1">2</ref> shows, HermiT's new classification algorithm dramatically reduces the number of subsumption tests performed when classifying these KBs. This does not, however, always result in faster performance. This is largely due to optimizations used by the other reasoners which greatly reduce the cost of subsumption testing for simple KBs: the overwhelming majority of subsumption tests performed by FaCT++, for example, can be answered using the pseudo-model merging technique described by Horrocks <ref type="bibr" target="#b8">[1997]</ref>.</p><p>Most of these optimizations could equally well be used in HermiT, but in the existing implementation each subsumption test performed by HermiT is far more costly. The number of subsumption tests performed by HermiT is, however, far smaller than for the other reasoners, and its performance also degrades far more gracefully as the complexity of a knowledge base increases: adding a single GCI or disjunction to a KB can prevent the application of special-case optimizations in Pellet and FaCT++, greatly increasing the cost of subsumption testing and, due to the very large number of tests being performed, vastly increasing the time required for classification. The NCI ∃∀ knowledge base, for example, eliminates any benefit from the pseudo-model merging optimization (since no two pseudo-models can be trivially merged), and this causes the classification time to increase by roughly two orders of magnitude for both Pellet and FaCT++. In contrast, HermiT's classification time is unaffected. The relatively poor performance of HermiT on the GALEN KB is due to the fact that the underlying satisfiability testing procedure is particularly costly when there are large numbers of branching points, even if no backtracking is actually required.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Discussion and Future Work</head><p>We have described a new algorithm for taxonomy construction that effectively exploits partial information derived from structural analysis and/or reasoning procedures, and we have shown that, when compared to the widely-used enhanced traversal algorithm, it can dramatically reduce both the number of individual comparisons and the total processing time for realistic data sets. For simple KBs, our prototype implementation makes the HermiT reasoner competitive with state-of-the-art reasoners which implement special-purpose optimizations of the subsumption testing procedure for such cases; on more expressive KBs our new system substantially outperforms existing systems.</p><p>Future work will include extending HermiT to incorporate some of the subsumption testing optimizations used in other systems, in particular reducing the overhead cost of individual subsumption tests. We believe that this will greatly improve HermiT's performance on simple KBs; as we have seen, it is already highly competitive on more complex KBs.</p><p>The procedure we describe in Section 4 extracts subsumption relationships involving only the concept used to initialize a model. This is because the dependency labeling implemented in tableau reasoners is currently designed only to allow the application of dependency-directed backtracking, and discards a great deal of dependency information. We intend to explore more sophisticated dependency labeling strategies which allow the extraction of additional subsumption information.</p><p>We also want to investigate meaningful complexity bounds for taxonomy searching and construction tasks. As we have seen, a completely naïve search routine is optimal if only the number of elements is considered. We will attempt to obtain tighter bounds for certain classes of relation: relations with linear taxonomy graphs, for example, can be deduced with only n log n comparisons. Bounds based on more sophisticated metrics may also be possible; e.g., bounds based on the total number of subsumption relations instead of the number of elements.</p><p>Finally, preliminary testing demonstrates that when significant partial information is available, the COMPUTE-ORDERING-2 procedure, based on the breadth-first search of the enhanced traversal algorithm, offers little advantage over COMPUTE-ORDERING, which performs tests in an arbitrary order; in many cases the performance of COMPUTE-ORDERING-2 is actually worse. Investigating other heuristics for choosing the order in which to perform tests will also be part of our future work.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>C D and D C does not imply that C = D. The restriction of a relation R to a subset D of U is the relation R[D] = R ∩ (D × D). All restrictions of a reflexive relation are reflexive, and all restrictions of a transitive relation are transitive; thus, a restriction of a quasi-ordering is itself a quasi-ordering. Further, if R ⊆ S for relations R and S, then R[D] ⊆ S[D] for all D ⊆ U . Given a universe U , a quasi-ordering R over U and a finite set of elements D ⊆ U , we consider the problem of computing the restriction R[D] via tests of the form a, b ∈ ? R. If U is the set of (arbitrary) concepts in a DL L, R is the subsumption relation over U and D is the set of concept names occurring in an L-KB K, then computing R[D] is equivalent to classifying K, and the relevant tests are subsumption tests.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Eliminating possible edges: if the solid edges are known to be in quasi-ordering R, then the gray edges can be in R only if the indicated dashed edges are in R.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Algorithm Comparison</figDesc><table><row><cell>Relation Size</cell><cell></cell><cell>ET</cell><cell></cell><cell>New</cell></row><row><cell>Known Possible</cell><cell>Tests</cell><cell>Seconds</cell><cell>Tests</cell><cell>Seconds</cell></row><row><cell>335 476 335 476</cell><cell>0</cell><cell>190</cell><cell>0</cell><cell>17</cell></row><row><cell>335 476 2 244 050</cell><cell>152 362</cell><cell>246</cell><cell>24 796</cell><cell>22</cell></row><row><cell>335 476 4 147 689</cell><cell>303 045</cell><cell>257</cell><cell>49 308</cell><cell>31</cell></row><row><cell>335 476 6 046 804</cell><cell>455 054</cell><cell>292</cell><cell>73 945</cell><cell>33</cell></row><row><cell>335 476 7 940 847</cell><cell>606 205</cell><cell>305</cell><cell>98 613</cell><cell>34</cell></row><row><cell>251 880 335 476</cell><cell>80 878</cell><cell>634</cell><cell>19 773</cell><cell>28</cell></row><row><cell>251 880 2 244 050</cell><cell>439 002</cell><cell>740</cell><cell>50 143</cell><cell>32</cell></row><row><cell>251 880 4 147 689</cell><cell>794 513</cell><cell>809</cell><cell>79 038</cell><cell>40</cell></row><row><cell>251 880 6 046 804</cell><cell>1 151 134</cell><cell>836</cell><cell>107 416</cell><cell>46</cell></row><row><cell>251 880 7 940 847</cell><cell>1 506 752</cell><cell>919</cell><cell>136 190</cell><cell>50</cell></row><row><cell>168 052 335 476</cell><cell>143 913</cell><cell>1079</cell><cell>62 153</cell><cell>62</cell></row><row><cell>168 052 2 244 050</cell><cell>673 768</cell><cell>1267</cell><cell>146 823</cell><cell>91</cell></row><row><cell>168 052 4 147 689</cell><cell>1 201 904</cell><cell>1320</cell><cell>226 670</cell><cell>93</cell></row><row><cell>168 052 6 046 804</cell><cell>1 729 553</cell><cell>1414</cell><cell>304 784</cell><cell>98</cell></row><row><cell>168 052 7 940 847</cell><cell>-</cell><cell>-</cell><cell>381 330</cell><cell>130</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>System Comparison</figDesc><table><row><cell></cell><cell></cell><cell cols="2">FaCT++</cell><cell>Pellet</cell><cell></cell><cell>HermiT</cell></row><row><cell>KB</cell><cell>Classes</cell><cell cols="2">Tests Seconds</cell><cell cols="3">Tests Seconds Tests Seconds</cell></row><row><cell>NCI</cell><cell cols="2">27 653 4 506 097</cell><cell>2.3</cell><cell>-</cell><cell cols="2">16.1 27 653 22</cell></row><row><cell cols="3">NCI ∃ 27 654 8 658 610</cell><cell>4.4</cell><cell>-</cell><cell cols="2">16.7 27 654 21.0</cell></row><row><cell>NCI</cell><cell cols="2">27 655 8 687 327</cell><cell cols="4">5.1 10 659 876 95.4 48 389 37.0</cell></row><row><cell cols="7">NCI ∃∀ 27 656 18 198 060 473.9 10 746 921 1098.3 27 656 20.8</cell></row><row><cell>GO</cell><cell cols="3">19 529 26 322 937 8.6</cell><cell>-</cell><cell cols="2">6.0 19 529 9.2</cell></row><row><cell>GO ∃</cell><cell cols="3">19 530 26 904 495 12.7</cell><cell>-</cell><cell cols="2">6.9 19 530 9.7</cell></row><row><cell>GO</cell><cell cols="6">19 531 26 926 653 15.5 21 280 377 170.0 32 614 15.2</cell></row><row><cell cols="3">GALEN 2749 313 627</cell><cell>11.1</cell><cell>131 125</cell><cell>8.4</cell><cell>2749</cell><cell>3.3</cell></row><row><cell cols="5">GALEN ∃ 2750 327 756 473.5 170 244</cell><cell>9.7</cell><cell>2750</cell><cell>3.5</cell></row><row><cell cols="5">GALEN 2751 329 394 450.5 175 859</cell><cell>9.8</cell><cell>4657 40.5</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">For the sake of brevity we will from now on take concept to mean concept name unless otherwise stated.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">2 Other optimizations can be used to decrease the cost of individual subsumption tests (see, e.g.,<ref type="bibr" target="#b10">[Tsarkov et al., 2007]</ref>), but these techniques are largely orthogonal to classification optimizations.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">An empirical analysis of optimization techniques for terminological representation systems or: Making KRIS get a move on</title>
		<author>
			<persName><forename type="first">Franz</forename><surname>Baader</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bernhard</forename><surname>Hollunder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bernhard</forename><surname>Nebel</surname></persName>
		</author>
		<author>
			<persName><surname>Hans Jurgen Pro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Enrico</forename><surname>Tlich</surname></persName>
		</author>
		<author>
			<persName><surname>Franconi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Artificial Intelligence. Special Issue on Knowledge Base Management</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="270" to="281" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m">The Description Logic Handbook: Theory, Implementation and Applications</title>
				<editor>
			<persName><forename type="first">Franz</forename><surname>Baader</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Diego</forename><surname>Calvanese</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Deborah</forename><surname>Mcguinness</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Daniele</forename><surname>Nardi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Peter</forename><forename type="middle">F</forename><surname>Patel-Schneider</surname></persName>
		</editor>
		<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Pushing the EL envelope</title>
		<author>
			<persName><forename type="first">F</forename><surname>Baader</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brandt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lutz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 19th Int. Joint Conf. on Artificial Intelligence (IJCAI 2005)</title>
				<meeting>of the 19th Int. Joint Conf. on Artificial Intelligence (IJCAI 2005)</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="364" to="369" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Sorting and selection in posets</title>
		<author>
			<persName><forename type="first">Constantinos</forename><surname>Daskalakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><forename type="middle">M</forename><surname>Karp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Elchanan</forename><surname>Mossel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samantha</forename><surname>Riesenfeld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Elad</forename><surname>Verbin</surname></persName>
		</author>
		<idno>CoRR, abs/0707.1532</idno>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Compiled hierarchical retrieval</title>
		<author>
			<persName><forename type="first">Gerard</forename><surname>Ellis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">6th Annual Conceptual Graphs Workshop</title>
				<imprint>
			<date type="published" when="1991">1991</date>
			<biblScope unit="page" from="285" to="310" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Sorting and recognition problems for ordered sets</title>
		<author>
			<persName><forename type="first">Ulrich</forename><surname>Faigle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">György</forename><surname>Turán</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">STACS</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">Kurt</forename><surname>Mehlhorn</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1985">1985</date>
			<biblScope unit="volume">182</biblScope>
			<biblScope unit="page" from="109" to="118" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">High performance reasoning with very large knowledge bases: A practical case study</title>
		<author>
			<persName><forename type="first">V</forename><surname>Haarslev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Möller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Seventeenth International JointŁ Conference on Artificial Intelligence, IJCAI-01</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Nebel</surname></persName>
		</editor>
		<meeting>Seventeenth International JointŁ Conference on Artificial Intelligence, IJCAI-01</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="161" to="166" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Exploiting pseudo models for tbox and abox reasoning in expressive description logics</title>
		<author>
			<persName><forename type="first">Ralf</forename><surname>Volker Haarslev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anni-Yasmin</forename><surname>Möller</surname></persName>
		</author>
		<author>
			<persName><surname>Turhan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAR</title>
				<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="61" to="75" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Optimising Tableaux Decision Procedures for Description Logics</title>
		<author>
			<persName><forename type="first">Ian</forename><surname>Horrocks</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
		<respStmt>
			<orgName>University of Manchester</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Optimized Reasoning in Description Logics using Hypertableaux</title>
		<author>
			<persName><forename type="first">Boris</forename><surname>Motik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rob</forename><surname>Shearer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ian</forename><surname>Horrocks</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 21st Conference on Automated Deduction (CADE-21)</title>
				<editor>
			<persName><forename type="first">Frank</forename><surname>Pfenning</surname></persName>
		</editor>
		<meeting>of the 21st Conference on Automated Deduction (CADE-21)<address><addrLine>Bremen, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2007">July 17-20 2007</date>
			<biblScope unit="volume">4603</biblScope>
			<biblScope unit="page" from="67" to="83" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Optimising terminological reasoning for expressive description logics</title>
		<author>
			<persName><forename type="first">Dmitry</forename><surname>Tsarkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ian</forename><surname>Horrocks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><forename type="middle">F</forename><surname>Patel-Schneider</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">J. of Automated Reasoning</title>
				<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
