Exploring Controlled English Ontology-Based Data Access Camilo Thorne, Diego Calvanese KRDB Research Centre Free University of Bozen-Bolzano, Italy {cthorne,calvanese}@inf.unibz.it 1 Introduction Controlled languages (CLs) are subsets of natural language with minimal ambiguity (lexical, structural, or semantic) tailored to fullfil data management tasks. Recently [5, 11], they have been proposed as a means of providing natural language interfaces to databases (DBs) and to ontology-based data access systems (OBDASs) centered around the W3C standard ontology language OWL1 (Web Ontology Language) [7, 9]. CL ut- terances are so-to-speak ”compiled” (compositionally translated by a symbolic trans- lation) into a formal query/ontology language expression. They have given rise to a number of applications and implementations [5, 11], among which ACE-OWL [7, 9], which maps to OWL DL and fragments of it (such as OWL-Lite). The formal underpin- ning of OWL DL is provided by description logics (DLs) [3, 8], in particular, OWL Lite corresponds to the DL SHIF (Actually SHIF(D), but we do not consider datatypes). Given that OBDASs may contain large amounts of data, we are interested in know- ing whether querying such systems through CL interfaces scales up with the size of the data. An ODBAS can be modelled by a DL knowledge base (KB), whose informa- tion is accessed through queries belonging to some fragment of SQL. A fragment that is considered sufficiently expressive but still computationally manageable (since query answering is decidable in significant cases) is that of conjunctive queries [1]. The basic problem in this setting is (conjunctive) query answering (QA) [6], which is a form of logical entailment [3]. To evaluate the efficiency of querying through CL interfaces, we focus on the so-called data complexity of QA, i.e., the computational complexity of the problem measured in terms of the size of the data only [16]. The optimum lies in L data complexity, which is the complexity of answering SQL queries over plain databases. Data complexity beyond PTime (i.e., intractable) indicates low scalability of query answering through CL interfaces. The data complexity of query answering in SHIF, and hence in OWL Lite and ACE-OWL Lite (the fragment of ACE that maps to OWL Lite) is known to be coNP-complete [12]. Hence, CLs like ACE-OWL do not scale up with data, although they contain fragments that do. This computational behaviour depends on the language constructs they cover. In this paper we pinpoint (some of) the English constructs and a fortiori of ACE, that give rise to these computational properties. To this effect, we study a CL that maps into ALCI, a DL simpler than SHIF but with the same data complexity of query answer- ing, and fragments thereof that are either (i) minimal w.r.t. intractability or (ii) maximal w.r.t. tractability, mirroring Pratt and Third in [13] (cf. also [14]). 1 http://www.w3.org/TR/owl-ref/ 2 QA over Ontologies In an OBDAS, an ontology provides a conceptual view on the data stored in a database. Ontologies are formally underpinned by DLs [3], which structure the domain of dis- course in terms of concepts (representing classes, i.e., sets of objects) and roles (repre- senting binary relations between objects). We are interested in DLs of different expres- siveness, ranging from DL-Liteu [6] to ALCI. In ALCI, concepts C and roles R are formed according to the following syntax: R → P | P− C → > | A | ∃R:C | ¬C | C u C where A stands for a concept name (a unary predicate), P for a role name (a binary predicate) and P − for its inverse. The semantics of the concept and role constructors is the standard one for DLs, see [3]. We can enrich the set of ALCI concepts, modulo the following (explicit) definitions: (i) ∀R:C := ¬∃R:¬C, (ii) C t C 0 := ¬(¬C u ¬C 0 ), (iii) ⊥ := ¬>, and (iv) ∃R := ∃R:>. In a DL ontology O, intensional knowledge is specified by means of a set of (con- cept inclusion) assertions of the form Cl v Cr , stating inclusion (or IS-A) between the instances of the concept Cl on the left and those of the concept Cr on the right. In ALCI, Cl and Cr may be arbitrary concepts, while fragments of ALCI, such as DL-Liteu [6], can be obtained by suitably restricting the syntax for Cl and for Cr . A database (DB), expressing extensional knowledge, is a finite set D of unary and binary ground atoms of the form A(c), P (c, c0 ). A knowledge base (KB) is a pair hO, Di, where O is an ontology and D a DB. As query language, we consider conjunctive queries, i.e., SELECT- PROJECT- JOIN SQL queries and tree shaped conjunctive queries (TCQs), which are those CQs that are built using only unary and binary relations and that are tree-isomorphic [12]. The query answering (QA) decision problem for (T)CQs and KBs is the (FOL) entailment problem stated as follows: given a KB hO, Di, a sequence c̄ of constants and a (T)CQ q, check whether there exists a grounded substitution σ of the free variables of q with c̄ s.t. O ∪D |= qσ, i.e., whether O ∪D (when seen as a FOL theory) entails the grounding of q (seen as a FOL open formula) by σ. We are interested in the data complexity of QA, namely, in its computational complexity when we consider D as the only input of the problem. 3 Expressing QA in Controlled English Given an ontology language L and a query language Q, to express QA in controlled English we: (i) Define a grammar GL and a compositional translation τ (·) for the cor- responding controlled declarative fragment L(GL ) s.t. τ (L(GL )) = L. (ii) Define a grammar GQ and a compositional translation τ 0 (·) for the corresponding controlled interrogative fragment L(GQ ) s.t. τ 0 (L(GQ )) = Q. Following formal semantics termi- nology we call such ontology/query language expressions the meaning representations (MRs) of the CL utterance. In [4, 15], we have shown how to express the DL DL-Liteu and TCQs, for which QA is in L [6], with the CLs Lite-English and GCQ-English, respectively. We want now 2 S → NP VP NP → Det Nom VP → TV NP VP → is a Nom VP → is TV by NP NP → Pro Relp VP VP → is Adj VP → IV VP → is Neg TV by NP NP → Pro VP → does Neg IV VP → is Neg a Nom Nom → Nom Relp VP Nom → Adj Nom VP → is Neg Adj VP → VP Crd VP Nom → Nom Crd Nom Nom → N τ (VP) := τ (NP)(τ (TV)) τ (VP) := τ (Crd)(τ (VP))(τ (VP)) τ (S) := τ (NP)(τ (VP)) τ (VP) := τ (Neg)(τ (NP)(τ (TV))) τ (VP) := τ (Neg)(τ (Adj)) τ (VP) := τ (Neg)(τ (Nom)) τ (VP) := τ (Neg)(τ (IV)) τ (VP) := τ (Adj) τ (NP) := τ (Pro) τ (VP) := τ (IV) τ (VP) := τ (Nom) τ (NP) := τ (Det)(τ (Nom)) τ (NP) := τ (Pro)(τ (Relp)(τ (VP))) τ (Nom) := τ (N) τ (Nom) := τ (Nom)(τ (Relp)(τ (VP))) τ (Nom) := τ (Crd)(τ (Nom))(τ (Nom)) τ (Nom) := τ (Adj)(τ (Nom)) Pro → anybody τ (Pro) := λC.λC 0 .C v C 0 Pro → somebody τ (Pro) := λR.∃R Pro → nobody τ (Pro) := λC.λC 0 .C v ¬C 0 Pro → nobody τ (Pro) := λR.¬∃R Crd → and τ (Crd) := λC.λC 0 .C u C 0 Crd → or τ (Crd) := λC.λC 0 .C t C 0 Relp → who τ (Relp) := λC.C Relp → who τ (Relp) := λC.λC 0 .C:C 0 Neg → not τ (Neg) := λC.¬C Pro → only τ (Pro) := λR.λC.∀R:C Pro → everybody τ (Pro) := λC.> v C Pro → nobody τ (Pro) := λC.C v ⊥ Det → some τ (Det) := λR.λC.∃R:C Det → every τ (Det) := λC.λC 0 .C v C 0 Det → no τ (Det) := λC.λC 0 .C v ¬C 0 Fig. 1. The grammar GDL of DL-English. to know which fragments of ACE-OWL are (i) maximal w.r.t. tractable data complexity (i.e., in PTime), and hence scale up with data, and (ii) minimal w.r.t. intractable data complexity (i.e., coNP-hard), and hence do not scale up. Figure 1 introduces DL-English’s grammar GDL . Following DL conventions [3], we associate (and hence map) the non-recursive word categories N, Adj and IV to atomic concepts. Category TV is associated to role names. Recursive constituents, by contrast, are associated to arbitrary concepts. For reasons of simplicity and space, we disregard morphology and polarity issues. We also omit specifying the (open) class of content words. An example of a sentence recognized by DL-English (we spell out its MR underneath) is: No man who runs some business that does not make some money is a businessman. Man u ∃run:(Business u ¬(∃make:Money)) v ¬Businessman We now turn to the computational properties of each of the constructs in isolation of DL-English. We do it by essentially restricting the kind of right (i.e., Cr ) and left (i.e., Cl ) concepts we may express. All utterances comply with the sentence patterns ”every αl αr ” and ”everybody who αl αr ”. The constituents αl and αr map to, respectively, left and right concepts, while sentences map to IS-A assertions of the form Cl v Cr . We consider in this paper only 8 out of all possible combinations obtained by allowing in Cl and Cr some subset of the DL constructs in the upper part of Figure 2, giving rise to the family {IS-Ai }i∈[0,7] of CLs shown in Figure 2. The basic kind of assertion they all express is IS-A among atomic concepts, viz., A v A0 , captured by IS-A0 . The complexity result for IS-A0 follow from the fact that Lite-English subsumes it. For all the other fragments, the complexity lower bounds follow from the results in [6]. The upper bounds for the CLs IS-Ai , for i ∈ {2, 3, 4}, follow from results by [10] for the DL EL, which subsumes the DL assertions they express. Membership in NL for IS-A1 follows by reducing QA over IS-A1 KBs to linear datalog program evaluation, which is well-known to be in NL (w.r.t. data complexity) [1]. 3 Concept Cf Constituent αf Grammar Rules A Nomf , VPf ∃P :A TV some Nomf , TV somebody who VPf VPf → is a Nomf | IV | is Adj ∃P − :A TV by some Nomf , TV by somebody who VPf Nomf → N ∀P :A TV only VPf , TV only who VPf ∃P TV something, TV somebody ∅ A1 u· · ·uAn Adj Nomf , Nomf who VPf VPf → is a Nomf | IV | is Adj | VPf and VPf Nomf and Nomf , VPf and VPf Nomf → N | Adj Nomf | Nomf and Nomf A1 t· · ·tAn VPf or VPf VPf → is a Nomf | IV | is Adj | VPf and VPf Nomf → N | Nomf and Nomf ¬A is not Adj, does not IV, is not a Nomf Nomf → N Fragment Assertions Sample Sentence(s) Data Complexity IS-A0 A v A1 u· · ·uAn every businessman is a cunning man in L IS-A1 A v ∀P :A every herbivorous eats only herbs NL-complete IS-A2 A1 u· · ·uAn v ∀P :(A1 u· · ·uAm ) every Italian man drinks only strong coffee PTime-complete IS-A3 ∃P :A v A1 u· · ·uAn anybody who murders some person PTime-complete is a heartless killer, ∃P − :A v A1 u· · ·uAn anybody who is loved by some person is a happy person, A v ∃P every driver drives something IS-A4 A1 u· · ·uAn v A1 u· · ·uAm every cruel man is a bad man, PTime-complete ∃P :(A1 u· · ·uAn ) v A1 u· · ·uAm anybody who runs some bankrupt company is a bad businessman IS-A5 ∀P :A v A1 u· · ·uAn anybody who values only money is a coNP-complete greedy person IS-A6 A v A1 t· · ·tAn every mammal is male or is female coNP-complete IS-A7 ¬A v A1 u· · ·uAn anybody who is not selfish is a reasonable coNP-complete person Lite English EL-English DL-English ACE-OWL Data Complexity in L PTime-complete coNP-complete coNP-hard Fig. 2. The {IS-Ai }i∈[0,7] fragments and DL-English, where f ∈ {l, r} and IS-Ai , for all i > 0, contains also the assertions of IS-A0 . We can now individuate the constructs of DL-English, and a fortiori of any CL expressing a coNP-hard ontology language such as SHIF (as does ACE-OWL) that negatively affect the scalability of CL interfaces to ODBASs, namely: – ”only” in subject position (coNP-hardness of IS-A5 ), – disjunction in predicate position (coNP-hardness of IS-A6 ), – negation in subject position (coNP-hardness of IS-A7 ). They also allow us to identify maximal CLs contained in DL-English (and a fortiori ACE-OWL) w.r.t. scalability (i.e., tractable data complexity). By merging the (tractable) fragments IS-Ai , for i ≤ 4, we essentially express the ELI ontology language, with syntax (notice that the assertion A v ∀P :A0 is equivalent to ∃P − :A v A0 ): R → P | P− C → > | A | ∃R:C | C u C That is, the DL where negation- and disjunction free existential concepts are allowed to arbitrarily nest on both sides of v. ELI induces a PTime-complete fragment of DL- English, that we term EL-English, which captures most of the constraints and axioms of real-world large-scale biomedical ontologies such as GALEN or SNOWMED [2]. We can define EL-English top-down pretty easily by removing from GDL the grammar rules for negation, disjunction, and universal quantification, and the negative function words. In such a CL arbitrary sentence subordination (and relatives), in combination 4 with, existential quantification and conjunction among VPs or Noms is allowed. Uni- versal quantification is highly controlled and negation and disjunction are ruled out. 4 Conclusions We have studied the computational complexity of querying OBDASs in controlled English. Optimal (i.e., L) data complexity is attained with Lite-English (expressing DL-Liteu ) and GCQ-English (expressing TCQs) [4, 15]. Relaxing, however, the con- straints put on negation, disjunction, universal determiners, and pronouns, until the components in sentence subjects and predicates become symmetrical, as in DL-English or more expressive CLs, yields coNP-hardness. In particular, data access in ODBASs through CL interfaces is intractable when coverage in declarations is extended to nega- tion and universal quantification in subject NPs, or when we allow disjunction in pred- icate VPs. We also identify a maximal (scalable) CL, EL-English, which is in PTime w.r.t. data complexity. Acknowledgements. We thank Raffaella Bernardi and Norbert Fuchs for discussions and criticism regarding earlier versions of this paper. References 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley, 1995. 2. F. Baader, S. Brandt, and C. Lutz. Pushing the EL envelope. In Proc. of IJCAI 2005. 3. F. Baader, D. Calvanese, D. Nardi, P. Patel-Schneider, and Deborah McGuinness. The De- scription Logic Handbook. Cambridge University Press, 2003. 4. R. Bernardi, D. Calvanese, and C. Thorne. Lite Natural Language. In Proc. of the 7th Int. Workshop on Computational Semantics (IWCS-7), 2007. 5. A. Bernstein, E. Kaufman, A. Göhring, and C. Kiefer. Querying ontologies: A controlled English interface for end-users. In Proc. of ISWC 2005. 6. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Data complexity of query answering in description logics. In Proc. of KR 2006. 7. N. E. Fuchs and K. Kaljurand. Mapping Attempto Controlled English to OWL DL. In Demos and Posters of the 3rd European Semantic Web Conf. (ESWC 2006). 8. I. Horrocks, P. F. Patel-Schneider, and F. van Harmelen. From SHIQ and RDF to OWL: The making of a web ontology language. J. of Web Semantics, 1(1):7–26, 2003. 9. K. Kaljurand and N. E. Fuchs. Verbalizing OWL in Attempto Controlled English. In Proc. of OWLED 2007. 10. A. Krisnadhi and C. Lutz. Data complexity in the EL family of description logics. In Proc. of LPAR 2007. 11. S. Mador-Haim, Y. Winter, and A. Braun. Controlled language for geographical information system queries. In Proc. of Inference in Computational Semantics, 2006. 12. M. Ortiz, D. Calvanese, and T. Eiter. Data complexity of query answering in expressive description logics via tableaux. J. of Automated Reasoning, 41(1):61–98, 2008. 13. I. Pratt and A. Third. More fragments of language. Notre Dame J. of Formal Logic, 2005. 14. M. Slavkovik. Deep analysis for an interactive question answering system. Master’s thesis, KRDB Research Centre, Free University of Bozen-Bolzano, 2007. 15. C. Thorne. Expressing aggregate queries over DL-Lite ontologies with controlled English. In Proc. of the ESSLLI 2008 Student Session, 2008. 16. M. Y. Vardi. The complexity of relational query languages. In Proc. of STOC’82. 5