=Paper=
{{Paper
|id=Vol-448/paper-20
|storemode=property
|title=Exploring Controlled English Ontology-Based Data Access
|pdfUrl=https://ceur-ws.org/Vol-448/paper28.pdf
|volume=Vol-448
|dblpUrl=https://dblp.org/rec/conf/cnl/ThorneC09a
}}
==Exploring Controlled English Ontology-Based Data Access==
Exploring Controlled English Ontology-Based Data Access
Camilo Thorne, Diego Calvanese
KRDB Research Centre
Free University of Bozen-Bolzano, Italy
{cthorne,calvanese}@inf.unibz.it
1 Introduction
Controlled languages (CLs) are subsets of natural language with minimal ambiguity
(lexical, structural, or semantic) tailored to fullfil data management tasks. Recently [5,
11], they have been proposed as a means of providing natural language interfaces to
databases (DBs) and to ontology-based data access systems (OBDASs) centered around
the W3C standard ontology language OWL1 (Web Ontology Language) [7, 9]. CL ut-
terances are so-to-speak ”compiled” (compositionally translated by a symbolic trans-
lation) into a formal query/ontology language expression. They have given rise to a
number of applications and implementations [5, 11], among which ACE-OWL [7, 9],
which maps to OWL DL and fragments of it (such as OWL-Lite). The formal underpin-
ning of OWL DL is provided by description logics (DLs) [3, 8], in particular, OWL Lite
corresponds to the DL SHIF (Actually SHIF(D), but we do not consider datatypes).
Given that OBDASs may contain large amounts of data, we are interested in know-
ing whether querying such systems through CL interfaces scales up with the size of
the data. An ODBAS can be modelled by a DL knowledge base (KB), whose informa-
tion is accessed through queries belonging to some fragment of SQL. A fragment that
is considered sufficiently expressive but still computationally manageable (since query
answering is decidable in significant cases) is that of conjunctive queries [1]. The basic
problem in this setting is (conjunctive) query answering (QA) [6], which is a form of
logical entailment [3]. To evaluate the efficiency of querying through CL interfaces, we
focus on the so-called data complexity of QA, i.e., the computational complexity of the
problem measured in terms of the size of the data only [16].
The optimum lies in L data complexity, which is the complexity of answering SQL
queries over plain databases. Data complexity beyond PTime (i.e., intractable) indicates
low scalability of query answering through CL interfaces. The data complexity of query
answering in SHIF, and hence in OWL Lite and ACE-OWL Lite (the fragment of
ACE that maps to OWL Lite) is known to be coNP-complete [12]. Hence, CLs like
ACE-OWL do not scale up with data, although they contain fragments that do. This
computational behaviour depends on the language constructs they cover.
In this paper we pinpoint (some of) the English constructs and a fortiori of ACE, that
give rise to these computational properties. To this effect, we study a CL that maps into
ALCI, a DL simpler than SHIF but with the same data complexity of query answer-
ing, and fragments thereof that are either (i) minimal w.r.t. intractability or (ii) maximal
w.r.t. tractability, mirroring Pratt and Third in [13] (cf. also [14]).
1
http://www.w3.org/TR/owl-ref/
2 QA over Ontologies
In an OBDAS, an ontology provides a conceptual view on the data stored in a database.
Ontologies are formally underpinned by DLs [3], which structure the domain of dis-
course in terms of concepts (representing classes, i.e., sets of objects) and roles (repre-
senting binary relations between objects). We are interested in DLs of different expres-
siveness, ranging from DL-Liteu [6] to ALCI. In ALCI, concepts C and roles R are
formed according to the following syntax:
R → P | P− C → > | A | ∃R:C | ¬C | C u C
where A stands for a concept name (a unary predicate), P for a role name (a binary
predicate) and P − for its inverse. The semantics of the concept and role constructors is
the standard one for DLs, see [3]. We can enrich the set of ALCI concepts, modulo the
following (explicit) definitions: (i) ∀R:C := ¬∃R:¬C, (ii) C t C 0 := ¬(¬C u ¬C 0 ),
(iii) ⊥ := ¬>, and (iv) ∃R := ∃R:>.
In a DL ontology O, intensional knowledge is specified by means of a set of (con-
cept inclusion) assertions of the form Cl v Cr , stating inclusion (or IS-A) between
the instances of the concept Cl on the left and those of the concept Cr on the right.
In ALCI, Cl and Cr may be arbitrary concepts, while fragments of ALCI, such as
DL-Liteu [6], can be obtained by suitably restricting the syntax for Cl and for Cr . A
database (DB), expressing extensional knowledge, is a finite set D of unary and binary
ground atoms of the form A(c), P (c, c0 ). A knowledge base (KB) is a pair hO, Di,
where O is an ontology and D a DB.
As query language, we consider conjunctive queries, i.e., SELECT- PROJECT- JOIN
SQL queries and tree shaped conjunctive queries (TCQs), which are those CQs that
are built using only unary and binary relations and that are tree-isomorphic [12]. The
query answering (QA) decision problem for (T)CQs and KBs is the (FOL) entailment
problem stated as follows: given a KB hO, Di, a sequence c̄ of constants and a (T)CQ
q, check whether there exists a grounded substitution σ of the free variables of q with c̄
s.t. O ∪D |= qσ, i.e., whether O ∪D (when seen as a FOL theory) entails the grounding
of q (seen as a FOL open formula) by σ. We are interested in the data complexity of
QA, namely, in its computational complexity when we consider D as the only input of
the problem.
3 Expressing QA in Controlled English
Given an ontology language L and a query language Q, to express QA in controlled
English we: (i) Define a grammar GL and a compositional translation τ (·) for the cor-
responding controlled declarative fragment L(GL ) s.t. τ (L(GL )) = L. (ii) Define a
grammar GQ and a compositional translation τ 0 (·) for the corresponding controlled
interrogative fragment L(GQ ) s.t. τ 0 (L(GQ )) = Q. Following formal semantics termi-
nology we call such ontology/query language expressions the meaning representations
(MRs) of the CL utterance.
In [4, 15], we have shown how to express the DL DL-Liteu and TCQs, for which
QA is in L [6], with the CLs Lite-English and GCQ-English, respectively. We want now
2
S → NP VP NP → Det Nom
VP → TV NP VP → is a Nom VP → is TV by NP NP → Pro Relp VP
VP → is Adj VP → IV VP → is Neg TV by NP NP → Pro
VP → does Neg IV VP → is Neg a Nom Nom → Nom Relp VP Nom → Adj Nom
VP → is Neg Adj VP → VP Crd VP Nom → Nom Crd Nom Nom → N
τ (VP) := τ (NP)(τ (TV)) τ (VP) := τ (Crd)(τ (VP))(τ (VP)) τ (S) := τ (NP)(τ (VP))
τ (VP) := τ (Neg)(τ (NP)(τ (TV))) τ (VP) := τ (Neg)(τ (Adj))
τ (VP) := τ (Neg)(τ (Nom)) τ (VP) := τ (Neg)(τ (IV)) τ (VP) := τ (Adj)
τ (NP) := τ (Pro) τ (VP) := τ (IV) τ (VP) := τ (Nom)
τ (NP) := τ (Det)(τ (Nom)) τ (NP) := τ (Pro)(τ (Relp)(τ (VP))) τ (Nom) := τ (N)
τ (Nom) := τ (Nom)(τ (Relp)(τ (VP))) τ (Nom) := τ (Crd)(τ (Nom))(τ (Nom)) τ (Nom) := τ (Adj)(τ (Nom))
Pro → anybody τ (Pro) := λC.λC 0 .C v C 0 Pro → somebody τ (Pro) := λR.∃R
Pro → nobody τ (Pro) := λC.λC 0 .C v ¬C 0 Pro → nobody τ (Pro) := λR.¬∃R
Crd → and τ (Crd) := λC.λC 0 .C u C 0 Crd → or τ (Crd) := λC.λC 0 .C t C 0
Relp → who τ (Relp) := λC.C Relp → who τ (Relp) := λC.λC 0 .C:C 0
Neg → not τ (Neg) := λC.¬C Pro → only τ (Pro) := λR.λC.∀R:C
Pro → everybody τ (Pro) := λC.> v C Pro → nobody τ (Pro) := λC.C v ⊥
Det → some τ (Det) := λR.λC.∃R:C Det → every τ (Det) := λC.λC 0 .C v C 0
Det → no τ (Det) := λC.λC 0 .C v ¬C 0
Fig. 1. The grammar GDL of DL-English.
to know which fragments of ACE-OWL are (i) maximal w.r.t. tractable data complexity
(i.e., in PTime), and hence scale up with data, and (ii) minimal w.r.t. intractable data
complexity (i.e., coNP-hard), and hence do not scale up.
Figure 1 introduces DL-English’s grammar GDL . Following DL conventions [3],
we associate (and hence map) the non-recursive word categories N, Adj and IV to
atomic concepts. Category TV is associated to role names. Recursive constituents, by
contrast, are associated to arbitrary concepts. For reasons of simplicity and space, we
disregard morphology and polarity issues. We also omit specifying the (open) class of
content words. An example of a sentence recognized by DL-English (we spell out its
MR underneath) is:
No man who runs some business that does not make some money is a businessman.
Man u ∃run:(Business u ¬(∃make:Money)) v ¬Businessman
We now turn to the computational properties of each of the constructs in isolation
of DL-English. We do it by essentially restricting the kind of right (i.e., Cr ) and left
(i.e., Cl ) concepts we may express. All utterances comply with the sentence patterns
”every αl αr ” and ”everybody who αl αr ”.
The constituents αl and αr map to, respectively, left and right concepts, while sentences
map to IS-A assertions of the form Cl v Cr . We consider in this paper only 8 out of
all possible combinations obtained by allowing in Cl and Cr some subset of the DL
constructs in the upper part of Figure 2, giving rise to the family {IS-Ai }i∈[0,7] of CLs
shown in Figure 2. The basic kind of assertion they all express is IS-A among atomic
concepts, viz., A v A0 , captured by IS-A0 .
The complexity result for IS-A0 follow from the fact that Lite-English subsumes it.
For all the other fragments, the complexity lower bounds follow from the results in [6].
The upper bounds for the CLs IS-Ai , for i ∈ {2, 3, 4}, follow from results by [10] for
the DL EL, which subsumes the DL assertions they express. Membership in NL for
IS-A1 follows by reducing QA over IS-A1 KBs to linear datalog program evaluation,
which is well-known to be in NL (w.r.t. data complexity) [1].
3
Concept Cf Constituent αf Grammar Rules
A Nomf , VPf
∃P :A TV some Nomf , TV somebody who VPf VPf → is a Nomf | IV | is Adj
∃P − :A TV by some Nomf , TV by somebody who VPf Nomf → N
∀P :A TV only VPf , TV only who VPf
∃P TV something, TV somebody ∅
A1 u· · ·uAn Adj Nomf , Nomf who VPf VPf → is a Nomf | IV | is Adj | VPf and VPf
Nomf and Nomf , VPf and VPf Nomf → N | Adj Nomf | Nomf and Nomf
A1 t· · ·tAn VPf or VPf VPf → is a Nomf | IV | is Adj | VPf and VPf
Nomf → N | Nomf and Nomf
¬A is not Adj, does not IV, is not a Nomf Nomf → N
Fragment Assertions Sample Sentence(s) Data Complexity
IS-A0 A v A1 u· · ·uAn every businessman is a cunning man in L
IS-A1 A v ∀P :A every herbivorous eats only herbs NL-complete
IS-A2 A1 u· · ·uAn v ∀P :(A1 u· · ·uAm ) every Italian man drinks only strong coffee PTime-complete
IS-A3 ∃P :A v A1 u· · ·uAn anybody who murders some person PTime-complete
is a heartless killer,
∃P − :A v A1 u· · ·uAn anybody who is loved by some person
is a happy person,
A v ∃P every driver drives something
IS-A4 A1 u· · ·uAn v A1 u· · ·uAm every cruel man is a bad man, PTime-complete
∃P :(A1 u· · ·uAn ) v A1 u· · ·uAm anybody who runs some bankrupt company
is a bad businessman
IS-A5 ∀P :A v A1 u· · ·uAn anybody who values only money is a coNP-complete
greedy person
IS-A6 A v A1 t· · ·tAn every mammal is male or is female coNP-complete
IS-A7 ¬A v A1 u· · ·uAn anybody who is not selfish is a reasonable coNP-complete
person
Lite English EL-English DL-English ACE-OWL
Data Complexity in L PTime-complete coNP-complete coNP-hard
Fig. 2. The {IS-Ai }i∈[0,7] fragments and DL-English, where f ∈ {l, r} and IS-Ai , for all i > 0,
contains also the assertions of IS-A0 .
We can now individuate the constructs of DL-English, and a fortiori of any CL
expressing a coNP-hard ontology language such as SHIF (as does ACE-OWL) that
negatively affect the scalability of CL interfaces to ODBASs, namely:
– ”only” in subject position (coNP-hardness of IS-A5 ),
– disjunction in predicate position (coNP-hardness of IS-A6 ),
– negation in subject position (coNP-hardness of IS-A7 ).
They also allow us to identify maximal CLs contained in DL-English (and a fortiori
ACE-OWL) w.r.t. scalability (i.e., tractable data complexity). By merging the (tractable)
fragments IS-Ai , for i ≤ 4, we essentially express the ELI ontology language, with
syntax (notice that the assertion A v ∀P :A0 is equivalent to ∃P − :A v A0 ):
R → P | P− C → > | A | ∃R:C | C u C
That is, the DL where negation- and disjunction free existential concepts are allowed
to arbitrarily nest on both sides of v. ELI induces a PTime-complete fragment of DL-
English, that we term EL-English, which captures most of the constraints and axioms
of real-world large-scale biomedical ontologies such as GALEN or SNOWMED [2].
We can define EL-English top-down pretty easily by removing from GDL the grammar
rules for negation, disjunction, and universal quantification, and the negative function
words. In such a CL arbitrary sentence subordination (and relatives), in combination
4
with, existential quantification and conjunction among VPs or Noms is allowed. Uni-
versal quantification is highly controlled and negation and disjunction are ruled out.
4 Conclusions
We have studied the computational complexity of querying OBDASs in controlled
English. Optimal (i.e., L) data complexity is attained with Lite-English (expressing
DL-Liteu ) and GCQ-English (expressing TCQs) [4, 15]. Relaxing, however, the con-
straints put on negation, disjunction, universal determiners, and pronouns, until the
components in sentence subjects and predicates become symmetrical, as in DL-English
or more expressive CLs, yields coNP-hardness. In particular, data access in ODBASs
through CL interfaces is intractable when coverage in declarations is extended to nega-
tion and universal quantification in subject NPs, or when we allow disjunction in pred-
icate VPs. We also identify a maximal (scalable) CL, EL-English, which is in PTime
w.r.t. data complexity.
Acknowledgements. We thank Raffaella Bernardi and Norbert Fuchs for discussions
and criticism regarding earlier versions of this paper.
References
1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley, 1995.
2. F. Baader, S. Brandt, and C. Lutz. Pushing the EL envelope. In Proc. of IJCAI 2005.
3. F. Baader, D. Calvanese, D. Nardi, P. Patel-Schneider, and Deborah McGuinness. The De-
scription Logic Handbook. Cambridge University Press, 2003.
4. R. Bernardi, D. Calvanese, and C. Thorne. Lite Natural Language. In Proc. of the 7th Int.
Workshop on Computational Semantics (IWCS-7), 2007.
5. A. Bernstein, E. Kaufman, A. Göhring, and C. Kiefer. Querying ontologies: A controlled
English interface for end-users. In Proc. of ISWC 2005.
6. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Data complexity of
query answering in description logics. In Proc. of KR 2006.
7. N. E. Fuchs and K. Kaljurand. Mapping Attempto Controlled English to OWL DL. In Demos
and Posters of the 3rd European Semantic Web Conf. (ESWC 2006).
8. I. Horrocks, P. F. Patel-Schneider, and F. van Harmelen. From SHIQ and RDF to OWL:
The making of a web ontology language. J. of Web Semantics, 1(1):7–26, 2003.
9. K. Kaljurand and N. E. Fuchs. Verbalizing OWL in Attempto Controlled English. In Proc.
of OWLED 2007.
10. A. Krisnadhi and C. Lutz. Data complexity in the EL family of description logics. In Proc.
of LPAR 2007.
11. S. Mador-Haim, Y. Winter, and A. Braun. Controlled language for geographical information
system queries. In Proc. of Inference in Computational Semantics, 2006.
12. M. Ortiz, D. Calvanese, and T. Eiter. Data complexity of query answering in expressive
description logics via tableaux. J. of Automated Reasoning, 41(1):61–98, 2008.
13. I. Pratt and A. Third. More fragments of language. Notre Dame J. of Formal Logic, 2005.
14. M. Slavkovik. Deep analysis for an interactive question answering system. Master’s thesis,
KRDB Research Centre, Free University of Bozen-Bolzano, 2007.
15. C. Thorne. Expressing aggregate queries over DL-Lite ontologies with controlled English.
In Proc. of the ESSLLI 2008 Student Session, 2008.
16. M. Y. Vardi. The complexity of relational query languages. In Proc. of STOC’82.
5