Exploring Controlled English Ontology-Based Data Access

                             Camilo Thorne, Diego Calvanese

                                  KRDB Research Centre
                           Free University of Bozen-Bolzano, Italy
                        {cthorne,calvanese}@inf.unibz.it


 1     Introduction
 Controlled languages (CLs) are subsets of natural language with minimal ambiguity
 (lexical, structural, or semantic) tailored to fullfil data management tasks. Recently [5,
 11], they have been proposed as a means of providing natural language interfaces to
 databases (DBs) and to ontology-based data access systems (OBDASs) centered around
 the W3C standard ontology language OWL1 (Web Ontology Language) [7, 9]. CL ut-
 terances are so-to-speak ”compiled” (compositionally translated by a symbolic trans-
 lation) into a formal query/ontology language expression. They have given rise to a
 number of applications and implementations [5, 11], among which ACE-OWL [7, 9],
 which maps to OWL DL and fragments of it (such as OWL-Lite). The formal underpin-
 ning of OWL DL is provided by description logics (DLs) [3, 8], in particular, OWL Lite
 corresponds to the DL SHIF (Actually SHIF(D), but we do not consider datatypes).
     Given that OBDASs may contain large amounts of data, we are interested in know-
 ing whether querying such systems through CL interfaces scales up with the size of
 the data. An ODBAS can be modelled by a DL knowledge base (KB), whose informa-
 tion is accessed through queries belonging to some fragment of SQL. A fragment that
 is considered sufficiently expressive but still computationally manageable (since query
 answering is decidable in significant cases) is that of conjunctive queries [1]. The basic
 problem in this setting is (conjunctive) query answering (QA) [6], which is a form of
 logical entailment [3]. To evaluate the efficiency of querying through CL interfaces, we
 focus on the so-called data complexity of QA, i.e., the computational complexity of the
 problem measured in terms of the size of the data only [16].
     The optimum lies in L data complexity, which is the complexity of answering SQL
 queries over plain databases. Data complexity beyond PTime (i.e., intractable) indicates
 low scalability of query answering through CL interfaces. The data complexity of query
 answering in SHIF, and hence in OWL Lite and ACE-OWL Lite (the fragment of
 ACE that maps to OWL Lite) is known to be coNP-complete [12]. Hence, CLs like
 ACE-OWL do not scale up with data, although they contain fragments that do. This
 computational behaviour depends on the language constructs they cover.
     In this paper we pinpoint (some of) the English constructs and a fortiori of ACE, that
 give rise to these computational properties. To this effect, we study a CL that maps into
 ALCI, a DL simpler than SHIF but with the same data complexity of query answer-
 ing, and fragments thereof that are either (i) minimal w.r.t. intractability or (ii) maximal
 w.r.t. tractability, mirroring Pratt and Third in [13] (cf. also [14]).
  1
      http://www.w3.org/TR/owl-ref/
2   QA over Ontologies
In an OBDAS, an ontology provides a conceptual view on the data stored in a database.
Ontologies are formally underpinned by DLs [3], which structure the domain of dis-
course in terms of concepts (representing classes, i.e., sets of objects) and roles (repre-
senting binary relations between objects). We are interested in DLs of different expres-
siveness, ranging from DL-Liteu [6] to ALCI. In ALCI, concepts C and roles R are
formed according to the following syntax:

               R → P | P−              C → > | A | ∃R:C | ¬C | C u C

where A stands for a concept name (a unary predicate), P for a role name (a binary
predicate) and P − for its inverse. The semantics of the concept and role constructors is
the standard one for DLs, see [3]. We can enrich the set of ALCI concepts, modulo the
following (explicit) definitions: (i) ∀R:C := ¬∃R:¬C, (ii) C t C 0 := ¬(¬C u ¬C 0 ),
(iii) ⊥ := ¬>, and (iv) ∃R := ∃R:>.
     In a DL ontology O, intensional knowledge is specified by means of a set of (con-
cept inclusion) assertions of the form Cl v Cr , stating inclusion (or IS-A) between
the instances of the concept Cl on the left and those of the concept Cr on the right.
In ALCI, Cl and Cr may be arbitrary concepts, while fragments of ALCI, such as
DL-Liteu [6], can be obtained by suitably restricting the syntax for Cl and for Cr . A
database (DB), expressing extensional knowledge, is a finite set D of unary and binary
ground atoms of the form A(c), P (c, c0 ). A knowledge base (KB) is a pair hO, Di,
where O is an ontology and D a DB.
     As query language, we consider conjunctive queries, i.e., SELECT- PROJECT- JOIN
SQL queries and tree shaped conjunctive queries (TCQs), which are those CQs that
are built using only unary and binary relations and that are tree-isomorphic [12]. The
query answering (QA) decision problem for (T)CQs and KBs is the (FOL) entailment
problem stated as follows: given a KB hO, Di, a sequence c̄ of constants and a (T)CQ
q, check whether there exists a grounded substitution σ of the free variables of q with c̄
s.t. O ∪D |= qσ, i.e., whether O ∪D (when seen as a FOL theory) entails the grounding
of q (seen as a FOL open formula) by σ. We are interested in the data complexity of
QA, namely, in its computational complexity when we consider D as the only input of
the problem.


3   Expressing QA in Controlled English
Given an ontology language L and a query language Q, to express QA in controlled
English we: (i) Define a grammar GL and a compositional translation τ (·) for the cor-
responding controlled declarative fragment L(GL ) s.t. τ (L(GL )) = L. (ii) Define a
grammar GQ and a compositional translation τ 0 (·) for the corresponding controlled
interrogative fragment L(GQ ) s.t. τ 0 (L(GQ )) = Q. Following formal semantics termi-
nology we call such ontology/query language expressions the meaning representations
(MRs) of the CL utterance.
    In [4, 15], we have shown how to express the DL DL-Liteu and TCQs, for which
QA is in L [6], with the CLs Lite-English and GCQ-English, respectively. We want now


                                            2
          S → NP VP                                                           NP → Det Nom
          VP → TV NP          VP → is a Nom          VP → is TV by NP         NP → Pro Relp VP
          VP → is Adj         VP → IV                VP → is Neg TV by NP     NP → Pro
          VP → does Neg IV    VP → is Neg a Nom      Nom → Nom Relp VP        Nom → Adj Nom
          VP → is Neg Adj     VP → VP Crd VP         Nom → Nom Crd Nom        Nom → N

τ (VP) := τ (NP)(τ (TV))               τ (VP) := τ (Crd)(τ (VP))(τ (VP))      τ (S) := τ (NP)(τ (VP))
τ (VP) := τ (Neg)(τ (NP)(τ (TV)))      τ (VP) := τ (Neg)(τ (Adj))
τ (VP) := τ (Neg)(τ (Nom))             τ (VP) := τ (Neg)(τ (IV))              τ (VP) := τ (Adj)
τ (NP) := τ (Pro)                      τ (VP) := τ (IV)                       τ (VP) := τ (Nom)
τ (NP) := τ (Det)(τ (Nom))             τ (NP) := τ (Pro)(τ (Relp)(τ (VP)))    τ (Nom) := τ (N)
τ (Nom) := τ (Nom)(τ (Relp)(τ (VP)))   τ (Nom) := τ (Crd)(τ (Nom))(τ (Nom))   τ (Nom) := τ (Adj)(τ (Nom))

      Pro → anybody τ (Pro) := λC.λC 0 .C v C 0        Pro → somebody τ (Pro) := λR.∃R
      Pro → nobody    τ (Pro) := λC.λC 0 .C v ¬C 0     Pro → nobody   τ (Pro) := λR.¬∃R
      Crd → and       τ (Crd) := λC.λC 0 .C u C 0      Crd → or       τ (Crd) := λC.λC 0 .C t C 0
      Relp → who      τ (Relp) := λC.C                 Relp → who     τ (Relp) := λC.λC 0 .C:C 0
      Neg → not       τ (Neg) := λC.¬C                 Pro → only     τ (Pro) := λR.λC.∀R:C
      Pro → everybody τ (Pro) := λC.> v C              Pro → nobody   τ (Pro) := λC.C v ⊥
      Det → some      τ (Det) := λR.λC.∃R:C            Det → every    τ (Det) := λC.λC 0 .C v C 0
      Det → no        τ (Det) := λC.λC 0 .C v ¬C 0

                                 Fig. 1. The grammar GDL of DL-English.

to know which fragments of ACE-OWL are (i) maximal w.r.t. tractable data complexity
(i.e., in PTime), and hence scale up with data, and (ii) minimal w.r.t. intractable data
complexity (i.e., coNP-hard), and hence do not scale up.
     Figure 1 introduces DL-English’s grammar GDL . Following DL conventions [3],
we associate (and hence map) the non-recursive word categories N, Adj and IV to
atomic concepts. Category TV is associated to role names. Recursive constituents, by
contrast, are associated to arbitrary concepts. For reasons of simplicity and space, we
disregard morphology and polarity issues. We also omit specifying the (open) class of
content words. An example of a sentence recognized by DL-English (we spell out its
MR underneath) is:
 No man who runs some business that does not make some money is a businessman.
         Man u ∃run:(Business u ¬(∃make:Money)) v ¬Businessman
     We now turn to the computational properties of each of the constructs in isolation
of DL-English. We do it by essentially restricting the kind of right (i.e., Cr ) and left
(i.e., Cl ) concepts we may express. All utterances comply with the sentence patterns
                   ”every αl αr ”           and        ”everybody who αl αr ”.
The constituents αl and αr map to, respectively, left and right concepts, while sentences
map to IS-A assertions of the form Cl v Cr . We consider in this paper only 8 out of
all possible combinations obtained by allowing in Cl and Cr some subset of the DL
constructs in the upper part of Figure 2, giving rise to the family {IS-Ai }i∈[0,7] of CLs
shown in Figure 2. The basic kind of assertion they all express is IS-A among atomic
concepts, viz., A v A0 , captured by IS-A0 .
    The complexity result for IS-A0 follow from the fact that Lite-English subsumes it.
For all the other fragments, the complexity lower bounds follow from the results in [6].
The upper bounds for the CLs IS-Ai , for i ∈ {2, 3, 4}, follow from results by [10] for
the DL EL, which subsumes the DL assertions they express. Membership in NL for
IS-A1 follows by reducing QA over IS-A1 KBs to linear datalog program evaluation,
which is well-known to be in NL (w.r.t. data complexity) [1].

                                                   3
  Concept Cf                  Constituent αf                          Grammar Rules
      A                        Nomf , VPf
    ∃P :A       TV some Nomf , TV somebody who VPf            VPf → is a Nomf | IV | is Adj
   ∃P − :A    TV by some Nomf , TV by somebody who VPf                  Nomf → N
    ∀P :A           TV only VPf , TV only who VPf
     ∃P                TV something, TV somebody                             ∅
 A1 u· · ·uAn          Adj Nomf , Nomf who VPf           VPf → is a Nomf | IV | is Adj | VPf and VPf
                     Nomf and Nomf , VPf and VPf          Nomf → N | Adj Nomf | Nomf and Nomf
 A1 t· · ·tAn                  VPf or VPf                VPf → is a Nomf | IV | is Adj | VPf and VPf
                                                              Nomf → N | Nomf and Nomf
     ¬A           is not Adj, does not IV, is not a Nomf                Nomf → N

 Fragment               Assertions                       Sample Sentence(s)                Data Complexity
   IS-A0            A v A1 u· · ·uAn             every businessman is a cunning man        in L
   IS-A1                A v ∀P :A                  every herbivorous eats only herbs       NL-complete
   IS-A2  A1 u· · ·uAn v ∀P :(A1 u· · ·uAm ) every Italian man drinks only strong coffee   PTime-complete
   IS-A3         ∃P :A v A1 u· · ·uAn             anybody who murders some person          PTime-complete
                                                          is a heartless killer,
                ∃P − :A v A1 u· · ·uAn          anybody who is loved by some person
                                                           is a happy person,
                         A v ∃P                      every driver drives something
   IS-A4    A1 u· · ·uAn v A1 u· · ·uAm              every cruel man is a bad man,         PTime-complete
          ∃P :(A1 u· · ·uAn ) v A1 u· · ·uAm anybody who runs some bankrupt company
                                                         is a bad businessman
   IS-A5         ∀P :A v A1 u· · ·uAn            anybody who values only money is a        coNP-complete
                                                             greedy person
   IS-A6            A v A1 t· · ·tAn              every mammal is male or is female        coNP-complete
   IS-A7           ¬A v A1 u· · ·uAn          anybody who is not selfish is a reasonable   coNP-complete
                                                                 person

                                  Lite English EL-English     DL-English    ACE-OWL
                Data Complexity       in L     PTime-complete coNP-complete coNP-hard

Fig. 2. The {IS-Ai }i∈[0,7] fragments and DL-English, where f ∈ {l, r} and IS-Ai , for all i > 0,
contains also the assertions of IS-A0 .

   We can now individuate the constructs of DL-English, and a fortiori of any CL
expressing a coNP-hard ontology language such as SHIF (as does ACE-OWL) that
negatively affect the scalability of CL interfaces to ODBASs, namely:
  – ”only” in subject position (coNP-hardness of IS-A5 ),
  – disjunction in predicate position (coNP-hardness of IS-A6 ),
  – negation in subject position (coNP-hardness of IS-A7 ).
    They also allow us to identify maximal CLs contained in DL-English (and a fortiori
ACE-OWL) w.r.t. scalability (i.e., tractable data complexity). By merging the (tractable)
fragments IS-Ai , for i ≤ 4, we essentially express the ELI ontology language, with
syntax (notice that the assertion A v ∀P :A0 is equivalent to ∃P − :A v A0 ):

                     R → P | P−                   C → > | A | ∃R:C | C u C

That is, the DL where negation- and disjunction free existential concepts are allowed
to arbitrarily nest on both sides of v. ELI induces a PTime-complete fragment of DL-
English, that we term EL-English, which captures most of the constraints and axioms
of real-world large-scale biomedical ontologies such as GALEN or SNOWMED [2].
We can define EL-English top-down pretty easily by removing from GDL the grammar
rules for negation, disjunction, and universal quantification, and the negative function
words. In such a CL arbitrary sentence subordination (and relatives), in combination


                                                    4
with, existential quantification and conjunction among VPs or Noms is allowed. Uni-
versal quantification is highly controlled and negation and disjunction are ruled out.

4   Conclusions
We have studied the computational complexity of querying OBDASs in controlled
English. Optimal (i.e., L) data complexity is attained with Lite-English (expressing
DL-Liteu ) and GCQ-English (expressing TCQs) [4, 15]. Relaxing, however, the con-
straints put on negation, disjunction, universal determiners, and pronouns, until the
components in sentence subjects and predicates become symmetrical, as in DL-English
or more expressive CLs, yields coNP-hardness. In particular, data access in ODBASs
through CL interfaces is intractable when coverage in declarations is extended to nega-
tion and universal quantification in subject NPs, or when we allow disjunction in pred-
icate VPs. We also identify a maximal (scalable) CL, EL-English, which is in PTime
w.r.t. data complexity.

Acknowledgements. We thank Raffaella Bernardi and Norbert Fuchs for discussions
and criticism regarding earlier versions of this paper.

References
 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley, 1995.
 2. F. Baader, S. Brandt, and C. Lutz. Pushing the EL envelope. In Proc. of IJCAI 2005.
 3. F. Baader, D. Calvanese, D. Nardi, P. Patel-Schneider, and Deborah McGuinness. The De-
    scription Logic Handbook. Cambridge University Press, 2003.
 4. R. Bernardi, D. Calvanese, and C. Thorne. Lite Natural Language. In Proc. of the 7th Int.
    Workshop on Computational Semantics (IWCS-7), 2007.
 5. A. Bernstein, E. Kaufman, A. Göhring, and C. Kiefer. Querying ontologies: A controlled
    English interface for end-users. In Proc. of ISWC 2005.
 6. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Data complexity of
    query answering in description logics. In Proc. of KR 2006.
 7. N. E. Fuchs and K. Kaljurand. Mapping Attempto Controlled English to OWL DL. In Demos
    and Posters of the 3rd European Semantic Web Conf. (ESWC 2006).
 8. I. Horrocks, P. F. Patel-Schneider, and F. van Harmelen. From SHIQ and RDF to OWL:
    The making of a web ontology language. J. of Web Semantics, 1(1):7–26, 2003.
 9. K. Kaljurand and N. E. Fuchs. Verbalizing OWL in Attempto Controlled English. In Proc.
    of OWLED 2007.
10. A. Krisnadhi and C. Lutz. Data complexity in the EL family of description logics. In Proc.
    of LPAR 2007.
11. S. Mador-Haim, Y. Winter, and A. Braun. Controlled language for geographical information
    system queries. In Proc. of Inference in Computational Semantics, 2006.
12. M. Ortiz, D. Calvanese, and T. Eiter. Data complexity of query answering in expressive
    description logics via tableaux. J. of Automated Reasoning, 41(1):61–98, 2008.
13. I. Pratt and A. Third. More fragments of language. Notre Dame J. of Formal Logic, 2005.
14. M. Slavkovik. Deep analysis for an interactive question answering system. Master’s thesis,
    KRDB Research Centre, Free University of Bozen-Bolzano, 2007.
15. C. Thorne. Expressing aggregate queries over DL-Lite ontologies with controlled English.
    In Proc. of the ESSLLI 2008 Student Session, 2008.
16. M. Y. Vardi. The complexity of relational query languages. In Proc. of STOC’82.


                                              5