Concept Projection in Algebras for Computing Certain Answer Descriptions

Introduction

We introduce a projection operator over concepts in a given description logic that produces least subsumers of concepts in a language fragment that satisfies additional syntactic requirements specified as a parameter of the operator. The operator complements existing operators for concept selection and concept ordering that have been proposed in earlier work [3,4]. We show how, altogether, the operators can form the basis of an algebra for computing more informative and user friendly varieties of certain answers over what we call a description base. This is a description logic (DL) knowledge base in which the ABox is replaced by a fixed and finite collection of arbitrary concepts.

In the sections that follow, we introduce the notion of a description base, of a certain answer description over a description base and then present an algebra for computing a simple variety of certain answer descriptions that incorporates the new projection operator. We elaborate on the semantics of this operator, in particular the notion of a projection description, in the remaining sections of the paper where we consider the operator's semantics as well as methods for efficient evaluation. In the final section, we outline a motivating example and suggest useful ways in which both projection descriptions and our query algebra might be extended.

Certain Answer Descriptions and Description Bases

Assume K (= (T , A)) denotes a knowledge base in which concepts conform to a given DL, and consider the purpose served by K's ABox A when it comes to computing certain answers over K to a conjunctive query of the form

Q(x 1 , ..., x k ) ← QueryBody,

where QueryBody in turn has the form ∃x k+1 , ..., ∃x m :

C(x i ) ∧ R(x i , x j )

in which the predicate names correspond to concepts and roles in the DL. Most importantly, the ABox supplies a fixed and finite collection of constant symbols {a 1 , ..., a n } which in turn defines a space of n k possible answers to Q: k-tuples θ i of the form "{(x 1 , a i1 ), ...., (x k , a i k )}" in which query variables x j are paired with constant symbols a ij . Viewed as a substitution, recall that θ i is a certain answer to Q exactly when K |= QueryBody θ i . Remember, however, that constant symbols and certain answers are still syntactic artifacts and that the former are not actual domain elements in an interpretation. Now assume the DL supports nominals, {a}. Answers θ i can then have the alternative form

{(x 1 , {a i1 }), ..., (x k , {a i k })}(1)

in which query variables x j are now paired with nominal concepts {a ij }. In this case, (1) will be a certain answer exactly when

K |= ∀x 1 , ..., ∀x k : ({a i1 }(x 1 ) ∧ ... ∧ {a i k }(x k ) → QueryBody).(2)

Among other things, nominals allow one to add inclusion dependencies {a i } C and {a i } ∃R.{a j } to T that correspond to ABox assertions C(a i ) and R(a i , a j ) in A, respectively. An ABox can then be replaced by a fixed and finite collection of nominals {{a 1 }, ..., {a n }} since its only remaining purpose is to list the nominal concepts that can occur in (1) above, and therefore in the antecedent of the implication in (2) above.

From a user's perspective, this characterization of answers generalizes in an appealing way when one allows the collection of nominals that replaces an ABox to consist instead of a fixed and finite collection of arbitrary concepts {C 1 , ..., C n }. In this way, the concepts appearing in (1) can provide more informative descriptions to the user that necessarily describe x i -components of answer k-tuples. In such circumstances, and for the remainder of the paper, we refer to this list of arbitrary concepts as a DBox or description box, to a knowledge base K that replaces an ABox with a DBox as a description base, and to certain answers that pair query variables with arbitrary concepts as certain answer descriptions.

Note that, for a description base K to qualify as consistent, it is sensible to add the condition that there exists a model for which the interpretation of each concept in K's DBox is nonempty in addition to the standard requirement that the model satisfies each inclusion dependency in K's TBox. This ensures that concepts describing an x i -component of an answer k-tuple are never vacuous, indeed, that each concept in a DBox is there because it describes some entity of interest to a user. Such a condition can always be captured in a TBox by adding inclusion dependencies of the form " ∃R.C i ", where R is a fresh role name.

Towards an Algebra for Description Bases

Returning to the standard notion of K as a knowledge base, consider the subclass of queries having the form "Q(x) ← C(x)". In this simplest case, answers may elide any mention of the single query variable x, and correspond more simply to the constant symbols a i that must satisfy a QueryBody with conditions that can be rolled up in a single concept C. For a description base, when K = (T , {C 1 , ..., C n }), such queries return unary certain answer descriptions, that is, each concept C i for which T |= C i C.

The basis of an algebra for computing unary certain answer descriptions has been proposed in earlier work [3,4]. The algebra consists of three operators given by the following grammar for a query Q.

Q ::= K :: Od | σ C (Q) | Sort Od (Q)

Since the focus of this earlier work was on performance and scalability, the first priority was to introduce operators that could express such things as an index scan or a sort, operations that are fundamental to issues of efficiency in database query evaluation. In particular, this entailed the development of a theory of ordering descriptions that could be used to characterize strict partial orders over the concepts generated by a given DL (the "Od" part of the first and third operators are ordering descriptions) together with the notion of a binary search tree index of concepts called a description index. Altogether, this enabled a formal consideration of sorting and comparison based search over a description base DBox.

The first operator has two purposes: it defines a description index of the concepts that occur in the DBox of a given description base, and it computes a list of concepts that corresponds to an inorder traversal of the index. The second operator filters concepts occurring in a concept list by retaining only those subsumed by a given parameter concept. The third operator sorts a concept list according to a partial order defined by a given ordering description.

An outstanding issue with this algebra is that it is necessary to supply, a priori, all concepts that might be of interest to any user in the DBox of a description base. Since users will want concepts that reflect different external views and interests, the DBox will almost certainly require multiple concepts for the same underlying entity e together with a means to distinguish among the possible concepts for e. In this paper, we propose to address this problem by adding a new projection operator to the grammar for the above algebra.

Q ::= K :: Od | σ C (Q) | Sort Od (Q) | π P d (Q)(3)

The operator can be used to rewrite concepts produced by a subquery Q to subsuming concepts that satisfy an additional syntactic requirement given by the P d part of the new operator. We call the syntactic requirement a projection description. The new operator replaces each concept in the concept list produced by its subquery by a least subsuming concept that satisfies the projection description.

As in earlier work on ordering descriptions, our main objective is to develop a theory of projection descriptions that will work for a variety of DLs that satisfy basic requirements. The first is that any qualifying DL has EL as a subdialect. One important reason is that EL concepts are a close approximation to nested relations and to XML and are therefore an ideal starting point for user friendly descriptions. Another reason is that we can show that π P d computes least subsumers of argument concepts with respect to an EL fragment satisfying P d. We also assume that any qualifying DL will have at least one concrete domain that supports constant terms and an underlying total order. These constant terms can then be used in descriptions that correspond to the notion of a tuple in the relational model. As a consequence, our projection operator then has relational tuple projection as a special case. In addition, the operator naturally accommodates nested relations and their projections as well.

We formally define projection descriptions in the next section, and then proceed to consider how projection operations can be efficiently implemented by appealing only to concept subsumption in a given DL, e.g., by using classifications of concepts that occur in a given terminology.

Projection Descriptions

To introduce projection descriptions, we use the DL ALC(D) in which the concrete domain D supports equality and comparison operations over an underlying domain consisting of rationals and strings.1 However, to reiterate, any DL that contains EL(D) would qualify.

Definition 1 (Description Logic ALC(D)). Let {A, A 1 , . . .}, {R, R 1 , . . .}, {f, g, f 1 , . . .}, and {k, k 1 , . . .} denote sets of primitive concepts, roles, concrete features, and constants respectively. A concept is defined by the grammar:

C, D ::= | ⊥ | A | ¬C | C D | ∃R.C | f = k (equality over D ) | f < g (linear order over D )

An inclusion dependency has the form C D. A terminology or TBox T is a finite set of inclusion dependencies. An interpretation I is a 2-tuple ( , • I ) in which is a domain that contains D , the set of rationals and finite strings, as a proper subset. We write A to denote the abstract domain \ D . Also, • I is an interpretation function that maps each concrete feature f to a total function (f ) I : A → D , each role R to a relation (R) I ⊆ ( A × A ), each primitive concept A to a set (A) I ⊆ A , the "=" symbol to the equality relation on D , the "<" symbol to the binary relation for the linear order on D , and each constant k to a constant in D . The interpretation function is extended to arbitrary concepts in the standard way.

An interpretation I satisfies an inclusion dependency C D if (C) I ⊆ (D) I , and T |= C D if (C) I ⊆ (D) I for any interpretation I that satisfies all inclusion dependencies in T .

We use standard abbreviations such as C D for ¬(¬C ¬D) and f ≤ k for (f = k) ((f < g) (g = k)). Also, given a finite set S of ALC(D) concepts, we write S to denote if S is empty and the concept

D 1 • • • D n otherwise, when S = {D 1 , ..., D n }.

With the presumption of ALC(D), the syntax and semantics of our new projection operation are given by the following.

Definition 2 (Projection Description). Let L be a DL that includes ALC(D), possibly without negation, and f , R and C any concrete feature, role and concept in L, respectively. A projection description P d is defined by the grammar:

P d ::= C? | C ↑ | f | P d 1 P d 2 | P d 1 ∪ P d 2 | ∃R.P d(4)

Now let α(P d) be defined as follows:

α(P d) =    ∅ if P d = "C?", "C ↑ "∪ P d 2 , α(P d 1 ) ∩ α(P d 2 ) = ∅.

The condition that P d is well-formed ensures that requests for information concerning a given role are never distributed in different parts of an answer (sub) tuple. This is a necessary condition to help combat combinatorial effects that would otherwise make projection involving roles infeasible.

Definition 3 (Induced Concepts). Let L be a DL that includes ALC(D), possibly without negation, and P d a projection description over roles in L. We define the sets L |P d and L TUP |P d , the L concepts generated by P d and L tuple concepts generated by P d, respectively, as follows:

L |P d = { S | S ⊆ fin L TUP |P d },andL TUP |P d =                {C} if P d = "C?"; A if P d = "C ↑ "; {f = k | k ∈ D } if P d = "f "; {C 1 C 2 | C 1 ∈ L TUP |P d1 ∧ C 2 ∈ L TUP |P d2 } if P d = "P d 1 P d 2 "; L TUP |P d1 ∪ L TUP |P d2 if P d = "P d 1 ∪ P d 2 "; and { S | S ⊆ fin {∃R.C | C ∈ L |P d1 }} if P d = "∃R.P d 1 ",

where A is the set of all primitive concepts in the alphabet of L.

Note that, while in general the language L |P d is infinite, for every fixed and finite terminology T and concept C, the language L |P d restricted to the symbols used in T and C is necessarily finite. Case P d = "C 1 ?": S = {C 1 } if T |= C ∃Rp.C 1 , and ∅ otherwise.

Case P d = "C ↑ 1 ": S = (A 1 ? ∪ • • • ∪ A n ?) T ,Rp (C),

where {A 1 , ..., A n } are all primitive concepts A occurring in T and C such that T |= ∃Rp.A ∃Rp.C 1 ; S = ∅ for n = 0. Case P d = "f ":

S = ((f = k 1 )? ∪ • • • ∪ (f = k n )?) T ,Rp (C),

where {k 1 , ..., k n } are all constants occurring in T and C; S = ∅ for n = 0. Case P d = "P d 1 P d 2 ": In the final case, we write S T ,Rp to denote the reduction of S with respect to T and Rp, that is, the set of all D 1 ∈ S such that there is no D 2 ∈ S for which T |= ∃Rp.D 2 ∃Rp.D 1 and T |= ∃Rp.D 1 ∃Rp.D 2 .

S = {D 1 D 2 | (∀i ∈ {1, 2} : D i ∈ (P d i ) T ,Rp (C)) ∧ T |= C ∃Rp.(D 1 D 2 )}.

Our main result now follows, in which functional projection is related to least subsumers.

Theorem 1. Let L be a DL that includes ALC(D), possibly without negation, and T , C and P d a respective terminology, concept and well-formed projection description in L.

Then (P d) T ,Id (C) T ,Id is a least subsumer of C in L |P d .

Proof (outline). By induction on the structure of P d.

Note that the restriction to L |P d is essential to guarantee that the least subsumer always exists (otherwise, least subsumers may not exist [2]).

Computing Projection Descriptions

A top-level procedure for computing (P d) T ,Rp (C) is given by a definition of Project in Figure 1. Clearly, a straightforward coding for the procedures invoked by Project to handle each of the six conditions follows directly from Definition 6. However, it is evident that this simple coding strategy for many of Project(P d, T , Rp, C) switch on P d case "C1?": return ProjectConcept(C1, T , Rp, C) case "C ↑ 1 ": return ProjectAtomicConcepts(C1, T , Rp, C) case "f ": return ProjectFeature(f, T , Rp, C) case "P d1 P d2": return ProjectJoin(P d1, P d2, T , Rp, C) case "P d1 ∪ P d2": return Project(P d1, T , Rp, C) ∪ Project(P d2, T , Rp, C) case "∃R.P d1": return ProjectRole(P d1, R, T , Rp, C) Fig. 1. Computing Projections at the Top-Level these procedures will lead to an unacceptably large number of subsumption tests. In particular, the resulting ProjectAtomicConcepts procedure for the "C ↑ " case will require a number of such tests proportional to the number of primitive concepts in the terminology, or much worse: exponential in the number of primitive concepts, e.g., when called indirectly by a naïvely coded ProjectRole procedure. This happens, for example, with a projection description of the form "∃R.(⊥ ↑ )".

A naïvely coded ProjectJoin procedure for the "P d 1 P d 2 " case is also problematic since it requires a number of subsumption tests equal to the product of the number of descriptions computed by P d 1 and P d 2 . This circumstance quickly becomes intolerable for iterated uses of this procedure such as for projection descriptions of the form "(f 1 • • • f n )" that compute descriptions of relational tuples, cases that are likely to occur in practice.

While these performance issues are unavoidable in the worst case, it is possible to improve the performance of projection computation for many situations. Starting with procedure ProjectJoin, we now outline algorithms for the procedures invoked by Project in Figure 1 that will have much better performance in situations that are more likely to occur in practice.

Procedure ProjectJoin

Our algorithm for this procedure follows a "nested loops" strategy in which concepts computed by an outer loop may be used to further qualify concepts computed by an inner loop. A simple way to accomplish this is to replace Definition 5 above with a more general notion of a role path that enables sideways communication of outer loop concepts.

Definition 7 (General Role Path). Let L be a DL that includes ALC(D), possibly without negation, and R and C be any role or concept in L, respectively. A general role path Rp is defined by the grammar:

Rp ::= Id | Rp.R | Rp.C

The expression ∃Rp.C denotes the concept C when Rp = " Id ", the concept ∃Rp 1 .∃R.C when Rp = "Rp 1 .R", and the concept ∃Rp

1 .(C 1 C) otherwise, when Rp = "Rp 1 .C 1 ". Algorithm 1: ProjectJoin(P d 1 , P d 2 , T , Rp, C) result ← ∅ foreach C1 ∈ Project(P d1, T , Rp, C) do foreach C2 ∈ Project(P d2, T , Rp.C1, C) do result ← result ∪ {(C1 C2)} return result

There are additional opportunities for improving the performance of Algorithm 1.

For one, the procedure might explore alternative permutations of projection descriptions with the form "(P d 1 • • • P d n )" that might result in a more efficient nesting order. For another, the procedure may opt to only append C 1 to Rp in the inner loop if the overall cost of subsumption checks is expected to decrease.

Procedure ProjectAtomicConcepts

The expected time for this procedure can be improved considerably by employing a classification of the primitive concepts in a given terminology. In particular, one conducts a depth-first-search of the classification, taking advantage of pruning opportunities when parent atomic concepts in the classification are disqualified. An implementation of this procedure given by Algorithm 2 below assumes the following definition.

Definition 8 (TBox Classification). Let T be a TBox in ALC(D). A classification of T is a directed acyclic graph G (= (N, E)) with nodes N corresponding to the primitive concepts in T and with E consisting of all edges (A 1 , A 2 ) such that:

-T |= A 1 A 2 , -T |= A 2 A 1 , and -there is no distinct A 3 ∈ N such that T |= A 1 A 3 and T |= A 3 A 2 .

We write N G (resp. E G ) to denote the nodes (resp. edges) of G, where the terminology is assumed by context.

Again, there are opportunities for improving the performance of Algorithm 2. For example, it would help to ensure the performance gains obtained by pruning during depth-first-search if one adds new internal primitive concepts to the classification if there are a large number of root nodes, or to split cases in which a large number of edges would otherwise lead to an existing node.

Procedure ProjectRole

The naïve implementation of this procedure involves quantifying over all subsets of concepts computed by the evaluation of the nested projection description. However, for a given terminology T , role path Rp and concept C, consider the following:

Algorithm 2: ProjectAtomicConcepts(C These observations motivate the implementation of ProjectRole given by Algorithm 3. The algorithm uses a queue to conduct a breadth-first-search of subsets in such a way that ensures a "frontier" of disqualified sets will prune containing candidate sets. Note that, to avoid considering more than one permutation of a candidate, the procedure assumes a total lexicographic order "≺" over all concepts.

The algorithm works particularly well in cases where the argument projection description parameter P d has the form "f P d " such as in the case of projection descriptions that compute descriptions of sets of relational tuples. In this case, the number of subsumption checks performed directly by the algorithm is quadratic in the size of result computed by the recursive call to Project in the first line.

Procedures ProjectConcept, ProjectFeature and Reduce An implementation of the remaining procedures is straightforward. In the case of ProjectConcept, it clearly suffices to mirror the corresponding case specification in Definition 2. An efficient algorithm for ProjectFeature requires only a bit more thought: one proceeds by a progressive refinement of intervals over the concrete domain, narrowing by binary search on intervals to the required set of concepts of the form "f = k". Finally, an implementation of procedure Reduce is given by Algorithm 4.

Discussion

An application that serves as a guiding influence on this work relates to a hypothetical description base system to which an internet browser submits queries in our algebra. The queries originate in the web pages for an enterprise, and are replaced on the fly by the browser with the resulting lists of EL-like concepts (suitably mapped to HTML) that are returned by the system in response. For example, consider the case of an online supplier of photography equipment. As part of a web presence, the supplier maintains a description base K of concepts that describe items that are available for purchase as well as web pages with embedded queries over this description base. One query Q in some web page might have the form "π P d (Sort P rice (σ C (K :: P rice)))" in which the selection condition C is the concept "P roductCode = "digicam" P rice < 300" and where the projection description is given by (N ame P rice ∃Supplier.(OnlineAddress Rating)).

(5)

After first rewriting Q as a projection of an index scan "π P d (σ C (K :: P rice))" (see below), the system returns the result of evaluating Q on the description base. Thus, when browsing the page containing Q, a user sees an ordered list of inexpensive digital camera information ordered by price, and with each item displaying the name and price of the camera together with a sublist of supplier URL addresses and ratings for that camera. This example suggests a useful extension to projection descriptions that can be easily accommodated. In particular, adding the case "P d :: Od " as an additional option in (4) would enable a useful refinement to (5) above:

(N ame P rice ∃Supplier.((OnlineAddress Rating) :: Rating)).

Consequently, the user will see the sublist of supplier URL addresses and ratings for each camera in the order of the supplier rating.

Also, an algebraic approach to querying description bases lends itself nicely to problems in query rewriting (witness the rewriting on Q above). In particular, the following identities hold in our algebra: σ C (Sort Od (Q)) ≡ Sort Od (σ C (Q)), Sort Od (Sort Od (Q)) ≡ Sort Od (Q) and σ C1 (σ C2 (Q)) ≡ σ C1 C2 (Q).

The identities yield a normal form "Sort Od2 (σ C (K :: Od 1 ))" for queries. One of the objectives of earlier work [3,4] was to determine when queries in normal form could be evaluated efficiently, in particular when they could be rewritten as an index scan "σ C (K :: Od 1 )" in which pruning made possible by the filtering concept C during an inorder traversal of the description index K :: Od 1 was sufficient to guarantee logarithmic performance (and allowing that the choice of permutation for a concept list could be more limited than required by the original query). An underlying problem addressed in the earlier work is to determine the truth of a "≺ T ,C " predicate over the pair (Od 1 , Od 2 ) of ordering descriptions, that is, to reason when, for terminology T , the partial order induced by Od 1 contains the partial order induced by Od 2 over concepts that are subsumed by C. This predicate can be usefully "overloaded" to apply to projection descriptions as well. In particular, a sufficient condition for the additional algebraic identity π P d2 (π P d1 (σ C (Q))) ≡ π P d2 (σ C (Q)) might be expressed as "P d 1 ≺ T ,C P d 2 ", another topic for future work.

Definition 4 (4Projection Semantics). The semantics of the new projection operator is given as follows: query π P d (Q) replaces each concept C in the concept list computed by Q by a least subsumer of C in L |P d . The remainder of this section begins to consider how least subsumers in L |P d can be effectively computed when P d is well-formed. To start, we introduce the notion of a role path that helps to simplify manipulating concepts of the form ∃R 1 . ... .∃R n .C. Definition 5 (Role Path). Let L be a DL that includes ALC(D), possibly without negation, and R be any role in L. A role path Rp is defined by the grammar: Rp ::= Id | Rp.R The expression ∃Rp.C denotes the concept C when Rp = " Id ", and the concept ∃Rp 1 .∃R.C otherwise, when Rp = "Rp 1 .R". Definition 6 (Projection Function). Let L be a DL that includes ALC(D), possibly without negation, and T , C, Rp and P d a respective TBox, concept, role path and projection description in L. The functional projection of C for P d with respect to T and Rp, written (P d) T ,Rp (C), is a finite set S of L concepts satisfying the following conditions.

CaseP d = "P d 1 ∪ P d 2 ": S = (P d 1 ) T ,Rp (C) ∪ (P d 2 ) T ,Rp (C). Case P d = "∃R.P d 1 ": S = {∃R.( S ) | S ⊆ (P d 1 ) T ,Rp.R (C) ∧ T |= C ∃Rp.∃R.( S )} T ,Rp .

Algorithm 3 :3ProjectRole(P d, R, T , Rp, C)

Algorithm 4 :4Reduce(S, T , Rp) result ← ∅ foreach C1 ∈ S do minimal ← true foreach C2 ∈ S − {C1} do if T |= ∃Rp.C2 ∃Rp.C1 and T |= ∃Rp.C1 ∃Rp.C2 then minimal ← f alse break if minimal then result ← result ∪ {C1} return result

or "f "; {R} if P d = "∃R.P d 1 "; and α(P d 1 ) ∪ α(P d 2 ) if P d = "P d 1 P d 2 " or "P d 1 ∪ P d 2 ". Then P d is well formed if, for any subexpression P d 1 P d 2 or P d 1

1 , T , Rp, C) foreach A occurring in C such that A ∈ NG and T |= C ∃Rp.A do result ← result ∪ {A} return result 1. It is less likely that T |= C ∃Rp.( S), for concept sets S with larger numbers of elements; and 2. For any pair of concept sets S 1 and S 2 , if T |= C ∃Rp.( S 1 ), then T |= C ∃Rp.( (S 1 ∪ S 2 )).

stack ← ∅result ← ∅// Initialize stack with the roots of the classification of Tforeach node A ∈ NG with no outgoing edges dostack.Push(A)// Perform a depth first search of the classificationwhile stack.NotEmpty() doA ← stack.Pop()if T |= C ∃Rp.A and T |= ∃Rp.A ∃Rp.C1 thenresult ← result ∪ {A}foreach (A , A) ∈ EG dostack.Push(A )

See[1] for an excellent introduction to description logics.

FranzBaader DiegoCalvanese DeborahLMcguinness DanieleNardi PeterFPatel-Schneider The Description Logic Handbook: Theory, Implementation, and Applications Cambridge University Press 2003 Computing the least common subsumer w.r.t. a background terminology FranzBaader BarisSertkaya Anni-YasminTurhan J. Applied Logic 5 3 2007 On Ordering Descriptions in a Description Logic JeffreyPound LubomirStanchev DavidToman GrantWeddell 20th International Workshop on Description Logics 2007 On Ordering and Indexing Metadata for the Semantic Web JeffreyPound LubomirStanchev DavidToman GrantWeddell 21st International Workshop on Description Logics 2008