Rewriting Count Queries over DL-Lite TBoxes
              with Number Restrictions?

    Diego Calvanese1,2 , Julien Corman1 , Davide Lanti1 , and Simon Razniewski3
           1
                Free University of Bozen-Bolzano, Italy, lastname @inf.unibz.it
                    2
                      Umeå University, Sweden, diego.calvanese@umu.se
               3
                 Max-Planck-Institut für Informatik, Saarbrücken, Germany,
                                  srazniew@mpi-inf.mpg.de


        Abstract. We propose a query rewriting algorithm for a restricted class
        of conjunctive queries evaluated under count semantics over a DL-Lite
        knowledge base. The target query language is an extension of relational
        algebra with aggregation and arithmetic functions, which can be trans-
        lated into SQL. The algorithm supports number restrictions on the RHS
        of axioms in the input TBox, which can be used to encode statistics. The
        size of the output query remains linear in the binary encoding of these
        numbers, which improves upon previously proposed approaches.


1      Introduction
Counting answers to a query is an essential operation in data management, and is
supported by virtually every relational database management system (RDBMS).
In this paper, we focus on counting answers over a knowledge base (KB), which
may be viewed as a database (DB) enriched with background knowledge about
the domain of interest. Our work falls within the scope of Ontology Mediated
Query Answering (OMQA)1 [7,3], where the background knowledge takes the
form of a set of logical statements, called TBox, and the data is represented as a
set of facts, called ABox. TBoxes are in general expressed in Description Logics
(DLs), which are decidable fragments of first-order (FO) logic.
    In this setting, counting answers to a query over a KB may require operations
that go beyond counting ABox facts, as one also needs to take into account facts
that can be inferred through the TBox. For instance, consider a scenario where
a central repository keeps track of aggregated information about universities,
such as their size, measured in number of students. And assume that these
universities also keep detailed records about enrollment. These data sources may
be integrated into a DL KB with a TBox of the form:
                                Uni v ≥1000 enrolledIn−
                             MedUni v ≥5000 enrolledIn−
                             BigUni v ≥20000 enrolledIn−
?
  Copyright © 2020 for this paper by its authors. Use permitted under Creative
  Commons License Attribution 4.0 International (CC BY 4.0)
1
  Also referred to as OBDA (for Ontology Based Data Access), when emphasis is
  placed on mappings connecting external data sources to a TBox [16].
2       Calvanese et al.

And an ABox of the form:
    Uni(UniBZ), BigUni(UW), BigUni(TUW), . . . ,
    region(UniBZ, SouthTyrol), region(UW, Vienna), region(TUW, Vienna), . . . ,
    enrolledIn(Alice, UniBZ), enrolledIn(Bob, UW), enrolledIn(Carol, TUW), . . .
    In practice, the information about enrollment in such KB may be incomplete.
First, aggregated data about a whole university may be available, but not the
detailed records. The opposite may also hold: detailed records may be provided
by a university, but the central repository may not have an aggregated value
for it. In such scenarios, even if the data is incomplete, one may still want to
compute a lower bound on the number of enrollments of students within each
region. In more formal terms, one might be interested in counting the number
of answers, as in the following query:
              q(count(x, y)) ← enrolledIn(x, y) ∧ region(y, Vienna)
    Answering such query may require arithmetic operations that combine num-
bers present in number restrictions in the TBox, and numbers obtained by count-
ing ABox statements. For instance, in this example, assume that TUW and UW
are the only two universities in the KB that are registered in the region of Vienna.
Assume also that UW does not share enrollment records, whereas TUW shares
them, however they may be incomplete (for instance, because some faculties
within TUW have not yet communicated their information regarding enroll-
ments). Then, the output value for count(x, y) should be the sum of: 20 000 on
the one hand (for UW ), and the largest value between 20 000 and n on the other
hand (for TUW ), where n is the number of enrollment records shared by TUW.
    Query answering over DL KBs has been extensively studied for boolean and
enumeration queries. With respect to the focus of this paper, a relevant result [6]
is that for conjunctive queries (CQs) and unions of CQs (UCQs), answering a
query over a KB expressed in certain lightweight DLs can be performed by
rewriting it, according to the axioms in the TBox, into a relational algebra
(RA) query over the ABox. This property, called FO rewritability, allows one
to use an RDBMS for the evaluation of the rewritten query, thus leveraging
already mature optimization techniques implemented in such systems. Among
these DLs, the DL-Lite family [6,1] has been widely studied and adopted in
OMQA/OBDA systems, resulting in the OWL 2 QL standard [13].
    In comparison, the problem of counting answers to a CQ over a DL-Lite
KB has seen little interest in the literature. A seminal result was provided in
[11], showing that the problem is significantly harder already for relatively inex-
pressive DLs, with a shift from AC0 -membership to coNP-completeness in data
complexity, i.e., assuming the sizes of the query and TBox to be fixed. Another
key result was provided in [14], but for a slightly different semantics called bag
semantics: for certain variants of DL-Lite, CQs that are rooted (i.e., with at
least one constant or answer variable) can be rewritten into a variant of bag
RA. However, this result is provided for DLs without number restrictions (like
≥1000 enrolledIn− in the example above). In addition, because bag semantics dif-
fers from the count semantics studied in [11], this rewritability result does not
     Rewriting Count Queries over DL-Lite TBoxes with Number Restrictions              3

directly apply to our setting. Finally, in [4,5], it was shown that under count
semantics (and for some variants of DL-Lite), rooted2 CQs can be rewritten
into a variant of RA with aggregation and arithmetic functions. However, the
provided rewriting algorithm may yield a query whose size is polynomial in the
values of the numbers that occur in the TBox, i.e., exponential in the binary
representation of such numbers, which is impractical.
    In this paper, we improve upon this last result, showing how for the same
DL, and under count semantics, rooted CQs can be rewritten into a different
extension of RA, such that the number of algebraic operators in the output
query does not depend on the numbers that occur in the TBox, but only on
the number of axioms of the TBox. As a consequence, the size of the resulting
query is polynomial in the numbers that appear in the number restrictions of the
TBox, even when such numbers are represented in binary. This is an important
step towards efficient query answering in such setting.


2     Preliminaries

We assume mutually disjoint countably infinite sets NI of individuals (a.k.a.
constants), NE of anonymous individuals (induced by existential quantification),
NV of variables, NC of concept names (i.e., unary predicates, denoted with A),
and NR of role names (i.e., binary predicates, denoted with P ). An atom is
an expression of the form A(s) or P (s, t), with A ∈ NC , P ∈ NR , and s, t ∈
NI ∪ NE ∪ NV . In the following, boldface letters like t denote tuples, and we
sometimes treat tuples as sets. If t = (t1 , .., tn ) is a tuple and f a function
defined for each ti , we use f (t) to denote the tuple (f (t1 ), .., f (tn )).
    An interpretation I is a FO structure h∆I , ·I i, where the domain ∆I is
a non-empty subset of NI ∪ NE , and the interpretation function ·I maps each
constant c ∈ NI to itself (cI = c, or in other words, we adopt the standard names
assumption), each concept name A ∈ NC to a set AI ⊆ ∆I , and each role name
P ∈ NR to a binary relation P I ⊆ ∆I × ∆I .
    If I = h∆I , .I i is an interpretation, we may use I to denote the set of atoms
{A(s) | A ∈ NC , s ∈ AI } ∪ {P (s, t) | P ∈ NR and (s, t) ∈ P I }. Conversely, if
S is a set of unary and binary atoms with arguments in NI ∪ NE , we may view
it as the interpretation h∆S , ·S i, where ∆S is the union of the arguments of
all atoms in S, and ·S is defined by AS = {s | A(s) ∈ S}, for A ∈ NC , and
P S = {(s, t) | P (s, t) ∈ S}, for P ∈ NR .
    A KB is a pair K = hT , Ai, where A, called ABox, is a finite set of atoms
with arguments in NI , and T , called TBox, is a finite set of axioms. In this
2
    Precisely, the result provided in [4,5] holds for rooted and connected CQs. This
    result immediately extends to rooted but disconnected CQs, by observing that the
    definition of rootedness requires each connected component to be rooted. Therefore,
    to answer such CQ, it suffices to compute the answers to each connected component
    separately, and then the cartesian product of these answers, multiplying the obtained
    cardinalities.
4       Calvanese et al.

    R −→ P | P −            B −→ A | ≥1 R               C −→ A | ¬A | ≥n R
                             –
Fig. 1. Syntax of DL-Lite HNcore   roles R, basic concepts B, and concepts C, where n
denotes a positive integer, i.e., n ∈ N+ .


                                                 –
paper, we focus on (fragments of) DL-Lite HN      core [4,5], a member of the the DL-
Lite family [1] where each axiom is a concept inclusion of the form B v C or
a role inclusion of the form R1 v R2 , following the grammar of Figure 1. From
now on, we use the symbols B, C, P and R in accordance with this grammar,
and we call these elements respectively basic concepts, concepts, atomic roles and
roles. Concepts of the form ≥n R are called number restrictions. The fragment of
             –
DL-Lite HN
         core that disallows role inclusions and number restrictions where n > 1
is called DL-Lite core [1].
     The evaluation C I (resp. RI ) of a concept C (resp. role R) over interpretation
an I is defined inductively over the structure of C (resp. R) as usual (see [2]
for a definition). An interpretation I is a model of hT , Ai iff A ⊆ I holds, and
B I ⊆ C I (resp. R1I ⊆ R2I ) holds for each concept inclusion (resp. role inclusion)
axiom B v C (resp. R1 v R2 ) in T . A KB is satisfiable iff it admits at least one
model. For readability, in what follows we focus on satisfiable KBs, that is, we
use “a KB” as a shortcut for “a satisfiable KB”.
     A conjunctive query (CQ) q is an expression of the form
q(x) ← p1 (t1 ), . . . , pn (tn ), where x ⊆ NV , and each pi (ti ) is an atom with
arguments in NV ∪ NI . In addition, we require safeness, i.e., x ⊆ t1 ∪ · · · ∪ tn .
We use dist(q) to denote x, called the distinguished variables of q, and we
use body(q) to denote {p1 (t1 ), . . . , pn (tn )}. The Gaifman graph Gq of q is the
undirected graph whose vertices are the variables appearing in body(q), and
that contains an edge between x1 and x2 iff P (x1 , x2 ) ∈ body(q) for some role
P . Following [14], we call a CQ q rooted if each connected component in Gq
contains at least one constant or distinguished variable.
     We adopt the semantics proposed in [11] for counting answers to a CQ over
a KB, which we call count semantics. If f is a function and D a subset of its
domain, then we use f |D to denote f restricted to D. A match for a CQ q
over an interpretation I is a homomorphism from body(q) to I. Let M be the
set of matches for q over I, let ω be a mapping from dist(q) to NI , and let
Γ = {γ ∈ M | γ|dist(q) = ω}. Then the pair hω, |Γ |i is an answer to q over I iff
|Γ | ≥ 1. And we use ans(q, I) to denote the set of answers to q over I. Finally,
a pair hω, ki is a certain answer to q over a KB K (under count semantics)
iff k is the largest element of N ∪ {+∞} such that, for each model I of K,
hω, kI i ∈ ans(q, I) for some kI ≥ k. We use certAns(q, K) to denote the set of
certain answers to q over K.
     A property that has been extensively studied in the OBDA/OMQA liter-
ature is FO rewritability [6,1]. Query answering for a class Q of queries is FO
rewritable for a DL L iff, for every L TBox T and Q-query q, there is a FO query
q 0 such that, for every ABox A, the certain answers to q over hT , Ai and the
answers to q 0 over A (viewed as an interpretation) coincide. In particular, under
    Rewriting Count Queries over DL-Lite TBoxes with Number Restrictions         5

standard certain answer semantics for DLs (which differs from the definition of
certAns(q, K) under count semantics given above), query answering for UCQs
is FO rewritable for several members of the DL-Lite family. Since RA captures
(domain-independent) FO logic, the evaluation of the query obtained via rewrit-
ing can be delegated to a RDBMS. Then in [14], this notion of rewritability has
been adapted to a different target query language, called BCALC, which cap-
tures relational bag algebra [10], and can be translated into SQL queries with
aggregation.


3   Related Work

Query answering under count semantics over a relational DB can be viewed as
a specific case of query answering under bag semantics, investigated notably in
[10,12]. Instead, in our setting, and in line with the OMQA/OBDA literature,
we assume that the input ABox is a set rather than a bag. The counting problem
over sets has also been studied recently in the relational DB setting [15,9], but
from the perspective of combined complexity, where the query is considered part
of the input (i.e., its size is not assumed to be fixed).
    As for (DL-Lite) KBs, in [8] an alternative (epistemic) count semantics is
defined, which counts over all grounded tuples (i.e., over NI ) entailed by the KB.
Such a semantics does not account for existentially implied individuals, and thus
cannot capture the statistics motivating our work.
    Closer to our concern is the work presented in [11], which introduces the
count semantics for CQs over a DL KB adopted in this paper. In particular the
Count problem, which is the decision variant of query answering under count se-
mantics, was shown in [11] to be coNP-hard in data complexity for the relatively
inexpressive DL DL-Lite core , even when negated concepts are forbidden. This
implies that for this DL and arbitrary CQs, query answering under count seman-
tics is not rewritable into a target query language for which query answering is
easier than coNP in data complexity.
    Then, rewritability of CQs over a DL-Lite KB was investigated in [14] for
the related problem of query answering under bag semantics. In particular, a
rewriting algorithm was provided for rooted CQs into the language BCALC
(which can be evaluated in TC0 ), and for a DL that extends DL-Lite core with a
restricted form of role subsumption. This result does not immediately transfer
to our setting though, for two reasons. First, this DL cannot express number
restrictions on the right-hand side (RHS) of TBox axioms, and therefore cannot
encode statistics like the ones from our motivating example. Second, as shown
in [14], for DL-Lite core already, bag and count semantics differ in the presence
of existential quantification on the left-hand side (LHS) of TBox axioms, which
are typically used to express the domain and range of an atomic role.
    To understand this difference, we recall the basics of query answering under
bag semantics (and refer to [14] for a complete definition.) A bag interpretation I
maps each atomic concept A (resp. atomic role P ) to a bag AI (resp. P I ). Such
bag can be seen as a function that maps each element of ∆I (resp. ∆I × ∆I )
6         Calvanese et al.

to the number of times it occurs in the bag. This function .I is then extended
to complex concepts and roles inductively (see [14]). And I satisfies an axiom
C1 v C2 (resp. R1 v R2 ) iff (C1 )I (d) ≤ (C2 )I (d) for every d ∈ ∆I (resp.
(R1 )I (d1 , d2 ) ≤ (R2 )I (d1 , d2 ) for every (d1 , d2 ) ∈ ∆I × ∆I ).
    Now let q(x) ← p1 (t1 ), . . . , pn (tn ) be a CQ, let I be a bag interpretation, let
V be the set of variables that appear in q, let ωPbe a Q        mapping from x to NI , let
Γ = {γ : V → ∆I | γ|dist(q) = ω}, and let k = γ∈Γ 1≤i≤n (pi )I (γ(ti )). Then
hω, ki is a bag answer to q over I iff k ≥ 1. We use bagAns(q, I) to denote the
set of bag answers to q over I. And if K is a KB, we use bagCertAns(q, K) to
denote the certain answers to q over K under bag semantics, defined analogously
to certain answers under count semantics.3
    The difference between bag and count semantics is illustrated in [14] with
the following example:

Example 1. Consider the KB K = hT , Ai, where

             T = {A1 v ∃P, ∃P − v A2 },                   A = {A1 (a), A1 (b)},

and the query q() ← A2 (y). If we evaluate this query under count semantics,
then certAns(q, K) = {h{}, 1i} (i.e., the only certain answer to q is the mapping
with empty domain {}, with multiplicity 1), because the following structure is a
model of K:
                                a      P       u      P       b
                               A1             A2              A1

However, such structure does not accurately represent a bag interpretation. Let
us now build a (minimal) bag interpretation I for K. To satisfy A, we set AI1 (a) =
1 and AI1 (b) = 1. Then to satisfy A1 v ∃P , we introduce a single element u (as
above) and obtain P I (a, u) = 1 and P I (b, u) = 1. Therefore (P − )I (u, a) = 1 and
(P − )I (u, b) = 1, which, according to the semantics proposed in [14], imply that
(∃P − )I (u) = 2. Then in order for this bag interpretation to satisfy ∃P − v A2 , it
must be the case that (A2 )I (u) = 2. So the only certain answer to q over K under
bag semantics is the empty mapping with multiplicity 2, i.e. bagCertAns(q, K) =
{h{}, 2i}.

It was also shown in [14] that query answering under bag and count semantics
coincide for the DL that extends DL-Litecore with role inclusion, but disallows
concepts of the form ≥1 R on the LHS of axioms. which we denote here with
3
    The notation used in [14] for certain answers under both bag and count semantics
    is slightly different from ours. Let dist(q) = {x1 , .., xn }. Then the certain answers to
    q under bag semantics are represented in [14] as a partial function µ : (NC )n → N+ .
    Instead, we use bagCertAns(q, K) = {h{x1 7→ c1 , .., xn 7→ cn }, ki | µ(c1 , .., cn ) = k}.
    As for count semantics, let hω, ki ∈ certAns(q, K). Then the definition provided in
    Section 2 implies that there is no k0 6= k such that hω, k0 i ∈ certAns(q, K). Instead,
    in [11,14], hω, k0 i is considered a certain answer under count semantics for each
    1 ≤ k0 ≤ k.
    Rewriting Count Queries over DL-Lite TBoxes with Number Restrictions              7

DL-Lite H6 ∃
         core . However, our focus is once again on logics that allow for number
restrictions on the RHS of axioms (and thus can encode statistics). So a nat-
ural question is whether this result still holds for DL-Lite H6    ∃
                                                                 core extended –with
                            HN – 6∃
such axioms. Let DL-Litecore denote this DL (equivalently, DL-LiteHN        core
                                                                                 6∃
                                                                                      is
                          HN –
the fragment of DL-Litecore that disallows concepts of the form ≥1 R on the
LHS of axioms). To answer this question, the bag semantics proposed in [14]
needs to be extended to  P conceptsI of the form ≥n R. One way to do this is to
define (≥n R)I (d1 ) =     d2 ∈∆I R (d1 , d2 ) /n, for any bag interpretation I and
         I
d1 ∈ ∆ , where “/” denotes integer division. With this definition, the proof
                                                                                  –
provided in [14, Proposition 85] can be immediately extended to DL-LiteHN    core :
                                                                                    6∃


                                         –
Proposition 2. For any DL-LiteHN
                              core
                                   6∃
                                      KB K and CQ q

                         certAns(q, K) = bagCertAns(q, K)

Proof (sketch). The proof of Proposition 85 in [14] uses the fact that for
any DL-Lite H6   ∃
               core KB K and model I of K under count semantics, there is a
bag interpretation Ib that is a model of K under bag semantics, and verifies
ans(q, I) = bagAns(q, Ib ) for any CQ q. This bag interpretation is defined for
M ∈ NC ∪ NR by M Ib (t) = 1 if t ∈ M I , and M Ib (t) = 0 otherwise. Now for
any axiom of the form A v ≥n R, it can be verified that if AI ⊆ (≥n R)I , then
                                                                                  –
AIb (d) ≤ (≥n R)Ib (d) holds, for any individual d. So if K is now a DL-LiteHNcore
                                                                                    6∃

KB and I a model of K, then Ib as defined above is still a model of K under bag
semantics.
   The proof of Proposition 85 in [14] also uses the fact that for any DL-Lite H6  ∃
                                                                                core
bag KB K and model I of K under bag semantics, there is an interpretation Is
that is a model of K under count semantics, and such that for any CQ q, mapping
ω over dist(q) and k ∈ N+ ∪ {+∞}, if hω, ki ∈ bagAns(q, I), then hω, k 0 i ∈
ans(q, Is ) for some k 0 ≤ k. This interpretation is defined for M ∈ NC ∪ NR by
t ∈ M Is iff M I (t) ≥ 1. Again, it can be verified that Is is still a model of K if
                     –
K is a DL-LiteHN core
                       6∃
                          KB.
   The rest follows from the original proof.                                       t
                                                                                   u

                                    –
    Finally, in [4,5], for DL-Lite N
                                   core and under count semantics, a query rewrit-
ing algorithm was provided for rooted CQs into a variant of RA extended with
aggregation and arithmetic functions. As for the rewriting of [14], the output
query does not depend on the ABox. However, such algorithm is mostly of the-
oretical interest, and not well suited for implementation (see Section 4). Two
negative results were also provided in [4,5], which further circumscribe the scope
of rewritability. Specifically, for DL-Lite H
                                            core , Count was shown to be P-hard
both for rooted CQs and for CQs whose body contains a single atom. This
implies that for this DL and these classes of queries, query answering under
count semantics is not rewritable into a target query language for which query
answering is easier than P.
8      Calvanese et al.

4   Rewriting
Query answering under count semantics was shown in [4,5] to be rewritable for
                             –
rooted CQs and DL-LiteN    core , into a target query language that extends RA with
aggregation and some algebraic functions, thanks to a query rewriting algorithm
inspired by PerfectRef [6]. However, the size of the rewritten query may be ex-
ponential in the size of the input TBox. More specifically, it may be exponential
both in the number of axioms (precisely, in the depth of the deepest concept
hierarchy that can be inferred from the TBox), and in the encoding of numbers
that appear in number restrictions (if encoded in binary, thus polynomial in the
value of such numbers). In many applications, it is reasonable to assume that the
number of axioms in the TBox remains limited, so the first source of exponential
blow-up may not be a major practical limitation. On the other hand, as illus-
trated in Section 1, values that appear in number restrictions (such as the one in
≥20000 enrolledIn− ) may depend on the size of the domain under consideration,
and thus, in some applications, they are likely to be of the same order of mag-
nitude as the size of the ABox itself. This might be unmanageable in practice,
in scenarios where we have to deal with large amounts of data.
    In the following, we describe how the rewriting algorithm of [4,5] can be
adapted (for the same DL, source, and target query languages), so that the size
of the output query remains linear in the size of the (binary) encoding of numbers
in number restrictions. Moreover, this new rewriting guarantees that the number
of algebraic operators in the output query only depends on the number of axioms
                                                                              K
of the TBox. The algorithm of [4,5] exploits the so-called chase model Ican      of
                                             N–
K. This model is such that, for DL-Litecore and rooted CQs, certAns(q, K) and
        K
ans(q, Ican ) coincide.
    Specifically, we illustrate how the rewriting algorithm from [4,5] can be mod-
ified by running it on the same example as the one used in [4,5]. Whenever
relevant, we will explicitly point out the differences between the two algorithms.
The rewriting from [4,5] generates a set of queries that can be directly translated
into a SQL query. The target language for this rewriting makes use of nested
aggregation in the form of special “atoms” of the form ∃=k .P (x, y), with k ∈ N,
which intuitively correspond to a SQL COUNT DISTINCT operation, together with
a boolean condition stating that the result of the aggregation operation over y
must be equal to k. If the TBox contains an axiom the form C v ≥n R, then for
each k ∈ [1..n], the rewriting of [4,5] generates one sub-query. Hence, the number
of generated sub-queries is linear in n, and exponential in the binary encoding
of n, which makes it unpractical.
    In our variant, we drop the boolean condition and use instead the nota-
tion z = count(y).P (x, y), where z is now a variable storing the result of the
same aggregation operation. Like in [4,5], we use negation, multiplication, and
subtraction. In addition, our target language also requires the SQL function
greatest(x,y) (that returns the largest value between x and y), and the aggre-
gation operator SUM. We show through our running example that, despite our
modifications, the target language for the rewriting can still be translated into
SQL.
     Rewriting Count Queries over DL-Lite TBoxes with Number Restrictions                   9

                                               a
                              P2         P1         P1         P2        d
                         P2                    A          b         P2   e
                              P2                               P2

Fig. 2. Chase model of our running example. Solid arrows represent the information
in the ABox, whereas dashed lines represent information implied by the KB.


     Consider the following KB K = hT , Ai and CQ q:
                                                                     
                 A      v ≥ 2 P1 ,                A(a),      P1 (a, b),
          T =                              A =
                 ∃P1− v ≥3 P2                     P2 (b, d), P2 (b, e)
                              q(x) ← A(x), P1 (x, y1 ), P2 (y1 , y2 )
                 K
The chase model Ican of K is represented in Figure 2. The rewriting algorithm
operates on a set Q of queries. This set is initialized with a syntactic variant of
the input query:
       hQ(x, cnt = count(y1 , y2 )), {q(x, y1 , y2 ) ← A(x), P1 (x, y1 ), P2 (y1 , y2 )}i
Intuitively, such query corresponds to the following SQL query4
SELECT x , COUNT ( DISTINCT y1 , y2 ) as cnt
FROM A , P1 , P2
WHERE A . x = P1 . x A AND P1 . y1 = P2 . y1
GROUP BY x


    Then each rewriting step selects a query Q in Q, and extends Q with a set
of new queries, obtained by applying some rewriting rule to Q, until saturation
is reached. In the previous query, since variable y2 is unbound, we can apply
a rewriting step over atom P2 (y1 , y2 ) with respect to axiom ∃P1− v ≥3 P2 ,
producing the query:
      hQ(x, cnt = sum(num)),
        hQ(x, y1 , num = greatest(0, 3 − z)),
          {q(x, y1 ) ← A(x), P1 (x, y1 ), P1 (_, y1 )} ∧ z = cnt(y2 ). P2 (y1 , y2 )ii
This query corresponds to the following SQL query:
SELECT x , SUM ( num ) as cnt
FROM ( SELECT x , y1 , greatest (0 , 3 - z ) as num
       FROM ( SELECT x , y1
              FROM A , P1 , ( SELECT x as _ , y1 FROM P1 ) as P1a )
              WHERE A . x = P1 . x AND P1 . y1 = P1a . y1
            ) AS J1 ,
            ( SELECT y1 , COUNT ( DISTINCT y2 ) as z
              FROM P2 GROUP BY y1
            ) AS J2
       WHERE J1 . y1 = J2 . y1
     )
GROUP BY x

4
    Note that the COUNT DISTINCT operator of SQL does not allow for multiple arguments.
    However, the desired result can be achieved through an injective concatenation op-
    erator, for instance making use of an additional fresh symbol.
10       Calvanese et al.

The interested reader might observe that such query fully captures the three
queries (one for each non-zero value of num) from [4,5]. On such a query, one
can apply a variant of the “reduce” rule of PerfectRef [6], leading to the query:

     hQ(x, cnt = sum(num)),
       hQ(x,
          y1 , num = greatest(0, 3 − z)),             
            q(x, y1 ) ← A(x), P1 (x, y1 ), P1 (_, y1 )
                                                         ∧ z = cnt(y2 ). P2 (y1 , y2 )ii
            q(x, y1 ) ← A(x), P1 (x, y1 )

This query corresponds to the following SQL query:
SELECT x , SUM ( num ) as cnt
FROM ( SELECT x , y1 , greatest (0 ,3 - z ) as num
       FROM (( SELECT x , y1
                FROM A , P1 , ( SELECT x as _ , y1 FROM P1 ) AS P1a
                WHERE A . x = P1 . x AND P1 . y1 = P1a . y1
              )
              UNION
              ( SELECT x , y1 FROM A , P1 WHERE A . x = P1 . x )
            ) AS J1 ,
            ( SELECT y1 , COUNT ( DISTINCT y2 ) as z
              FROM P2
              GROUP BY y1
            ) AS J2
       WHERE J1 . y1 = J2 . y1
     )
GROUP BY x


Again, in the rewriting from [4,5], this very step produces three different queries:
one for each non-zero value of num.
   By applying another rewriting step over atom P1 (x, y1 ) and with respect to
axiom A v ≥2 P1 , we obtain the following query:

                     hQ(x, cnt = sum(num)),
                       hQ(x, num = (greatest(0, (2 − z) · 3),
                         {q(x) ← A(x)} ∧ z = cnt(y1 ). P1 (x, y1 )ii

This query corresponds to the following SQL query:
SELECT x , SUM ( num ) as cnt
FROM ( SELECT x , greatest (0 ,2 - z ) * 3 as num
       FROM ( SELECT x FROM A ) AS J1 ,
            ( SELECT x , COUNT ( DISTINCT y1 ) AS z
              FROM P1
              GROUP BY x
            ) AS J2
       WHERE J1 . x = J2 . x
     )
GROUP BY x


    Let us now analyze the queries produced by the rewriting. The query after
the initialization step returns the number of paths (x, y1 , y2 ) in A conforming to
the structure dictated by the body of the input query. Since there are two such
paths, this query returns the answer {x 7→ a, cnt 7→ 2}. The query produced after
the first rewriting step checks for all sub-paths (x, y1 ) of (x, y1 , y2 ) such that x
is an instance of A, y1 is a P1 -successor of x, and y1 has less P2 -successors in the
                                                                            K
ABox than what the TBox prescribes. There is one such path in Ican             , namely
    Rewriting Count Queries over DL-Lite TBoxes with Number Restrictions        11

the one terminating in node b that has only two P2 -successors in A. This path
is captured by our query, which returns as answer {x 7→ a, cnt 7→ 1}: indeed,
there is a single way of extending this path into the anonymous part. The last
query has to be interpreted in a similar way. The query retrieves the individual
a, since this node has only one P1 -successor in A but should have at least two
P1 -successors according to T . The answer to such query is {x 7→ a, cnt 7→ 3}.
Indeed, there are three ways of extending the match {x 7→ a} into the anonymous
part. Now, a last aggregation (with SUM) over all queries in Q yields the desired
answer {x 7→ a, cnt 7→ 6}, which indeed corresponds to the answer h{x 7→ a}, 6i
                          K
to our input query over Ican .


5   Conclusions

In this paper, we have improved the query rewriting technique proposed in [4,5]
                                                             –
for rooted CQs under count semantics over a DL-LiteN       core KB, into a target
query language that extends RA with aggregation and arithmetic functions. We
have illustrated how the exponential blow-up of the output query in the numbers
that occur in the TBox, characteristic of the rewriting of [4,5], can be avoided by
extending the target rewriting language, specifically with the aggregate operator
sum and with the binary function greatest. We also show that this enriched
language can still be translated into SQL. Considering that numeric values that
appear in the TBox may in practice be of the same order of magnitude as the
size of the ABox, this new rewriting algorithm is a significant step towards a
practical implementation of query answering under this semantics.


Acknowledgements
This research has been partially supported by the Wallenberg AI, Autonomous
Systems and Software Program (WASP) funded by the Knut and Alice Wallen-
berg Foundation, by the Italian Basic Research (PRIN) project HOPE, by the
EU H2020 project INODE, grant agreement 863410, by the CHIST-ERA project
PACMEL, by the project IDEE (FESR1133) funded through the European Re-
gional Development Fund (ERDF) Investment for Growth and Jobs Programme
2014-2020, by the project “South Tyrol Longitudinal study on Longevity and
Ageing” (STyLoLa), funded through the 2017 Interdisciplinary Call issued by
the Research Committee of the Free University of Bozen-Bolzano, and by the
project “Ontology-based Geodata Integration, Visualization and Analysis” (On-
toGeo), funded through the 2018 CRC Call issued by the Research Committee
of the Free University of Bozen-Bolzano.
12      Calvanese et al.

References
 1. A. Artale, D. Calvanese, R. Kontchakov, and M. Zakharyaschev. The DL-Lite
    family and relations. J. of Artificial Intelligence Research, 36:1–69, 2009.
 2. F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, ed-
    itors. The Description Logic Handbook: Theory, Implementation and Applications.
    Cambridge University Press, 2003.
 3. M. Bienvenu and M. Ortiz. Ontology-mediated query answering with data-
    tractable description logics. In Reasoning Web: Web Logic Rules – 11th Int.
    Summer School Tutorial Lectures (RW), volume 9203 of LNCS, pages 218–307.
    Springer, 2015.
 4. D. Calvanese, J. Corman, D. Lanti, and S. Razniewski. Counting query answers
    over a DL-Lite knowledge base. In Proc. of the 29th Int. Joint Conf. on Artificial
    Intelligence (IJCAI). IJCAI Org., 2020.
 5. D. Calvanese, J. Corman, D. Lanti, and S. Razniewski. Counting query an-
    swers over a DL-Lite knowledge base (Extended version). CoRR Tech. Rep.
    arXiv:2005.05886, arXiv.org e-Print archive, 2020. Available at http://arxiv.org/
    abs/2005.05886.
 6. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable
    reasoning and efficient query answering in description logics: The DL-Lite family.
    J. of Automated Reasoning, 39(3):385–429, 2007.
 7. D. Calvanese, G. De Giacomo, and M. Lenzerini. Conjunctive query containment
    and answering under description logics constraints. ACM Trans. on Comp. Logic,
    9(3):22.1–22.31, 2008.
 8. D. Calvanese, E. Kharlamov, W. Nutt, and C. Thorne. Aggregate queries over
    ontologies. In Proc. of the 2nd Int. Workshop on Ontologies and Inf. Systems for
    the Semantic Web (ONISW), pages 97–104, 2008.
 9. H. Chen and S. Mengel. Counting answers to existential positive queries: a com-
    plexity classification. In Proc. of the 35th ACM Symp. on Principles of Database
    Systems (PODS), pages 315–326, 2016.
10. S. Grumbach and T. Milo. Towards tractable algebras for bags. J. of Computer
    and System Sciences, 52(3):570–588, 1996.
11. E. V. Kostylev and J. L. Reutter. Complexity of answering counting aggregate
    queries over DL-Lite. J. of Web Semantics, 33:94–111, 2015.
12. L. Libkin and L. Wong. Query languages for bags and aggregate functions. J. of
    Computer and System Sciences, 55(2):241–272, 1997.
13. B. Motik, B. Cuenca Grau, I. Horrocks, Z. Wu, A. Fokoue, and C. Lutz. OWL 2
    Web Ontology Language profiles (2nd ed.). W3C Rec., W3C, 2012. http://www.
    w3.org/TR/owl2-profiles/.
14. C. Nikolaou, E. V. Kostylev, G. Konstantinidis, M. Kaminski, B. Cuenca Grau,
    and I. Horrocks. Foundations of ontology-based data access under bag semantics.
    Artificial Intelligence, 274:91–132, 2019.
15. R. Pichler and S. Skritek. Tractable counting of the answers to conjunctive queries.
    J. of Computer and System Sciences, 79(6):984–1001, 2013.
16. G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, and M. Za-
    kharyaschev. Ontology-based data access: A survey. In Proc. of the 27th Int. Joint
    Conf. on Artificial Intelligence (IJCAI), pages 5511–5519. IJCAI Org., 2018.