=Paper= {{Paper |id=Vol-1350/paper-24 |storemode=property |title=Schema.org as a Description Logic |pdfUrl=https://ceur-ws.org/Vol-1350/paper-24.pdf |volume=Vol-1350 |dblpUrl=https://dblp.org/rec/conf/dlog/HernichLOW15 }} ==Schema.org as a Description Logic== https://ceur-ws.org/Vol-1350/paper-24.pdf
                  Schema.org as a Description Logic

            Andre Hernich1 , Carsten Lutz2 , Ana Ozaki1 and Frank Wolter1
                                   1
                                    University of Liverpool, UK
                               2
                                   University of Bremen, Germany



       Abstract. Schema.org is an initiative by the major search engine providers Bing,
       Google, Yahoo!, and Yandex that provides a collection of ontologies which web-
       masters can use to mark up their pages. Schema.org comes without a formal
       language definition and without a clear semantics. We formalize the language of
       Schema.org as a Description Logic (DL) and study the complexity of querying
       data using (unions of) conjunctive queries in the presence of ontologies formulated
       in this DL. In particular, we consider rewritability into FO queries and into datalog
       programs and investigate the possibility of classifying the data complexity of
       ontology-mediated queries.


1   Introduction
The Schema.org initiative was launched in 2011 and is supported today by Bing, Google,
Yahoo!, and Yandex. In the spirit of the Semantic Web, it provides a collection of
ontologies that establish a standard vocabulary to mark up website content with metadata
about itself (https://schema.org/). In particular, web content that is generated from
structured data as found in relational databases is often difficult to recover for search
engines and Schema.org markup elegantly solves this problem. The markup is used by
search engines to more precisely identify relevant pages, to provide richer search results,
and to enable new applications. Schema.org is experiencing very rapid adoption and is
used today by more than 15 million webpages including all major ones Guha [2013].
    Schema.org does neither formally specify the language in which its ontologies are
formulated nor does it provide a formal semantics for the published ontologies. However,
the provided ontologies are extended and updated frequently and follow an underlying
language pattern. This pattern and its meaning is described informally in natural language.
Schema.org adopts a class-centric representation enriched with binary relations and
datatypes, similar in spirit to description logics (DLs) and to the OWL family of ontology
languages; the current version includes 622 classes and 891 binary relations. Partial
translations into RDF and into OWL are provided by the linked data community. Based
on the informal descriptions at https://schema.org/ and on the mentioned translations,
Patel-Schneider [2014] develops an ontology language for Schema.org with a formal
syntax and semantics that, apart from some details, can be regarded as a fragment of
OWL DL.
    In this paper, we abstract slightly further and view the Schema.org ontology language
as a description logic, in line with the formalization by Patel-Schneider. Thus, what
Schema.org calls a type becomes a concept name and a property becomes a role name.
The main characteristics of the resulting ‘Schema.org DL’ are that (i) the language is very
restricted, allowing only inclusions between concept and role names, domain and range
restrictions, nominals, and datatypes; (ii) ranges and domains of roles can be restricted to
disjunctions of concept names (possibly mixed with datatypes in range restrictions) and
nominals are used in ‘one-of enumerations’ whose semantics also involves disjunction.
While Point (i) suggests that the Schema.org DL is closely related to the tractable profiles
of OWL2, because of Point (ii) it does actually not fall into any of them. There is also a
close connection to the DL-Lite family of DLs Calvanese et al. [2007], and in particular
to the DL-LiteH                                                      H
                bool variant Artale et al. [2009]. However, DL-Litebool admits existential
restriction, negation, conjunction, and free use of disjunction whereas the Schema.org
DL allows no existential quantification and includes nominals and datatypes. We use the
term schema.org-ontology to refer to ontologies formulated in the Schema.org language;
in contrast, ‘Schema.org 2015’ refers to the concrete collection of ontologies provided at
https://schema.org/ as of end of April, 2015.
    Our main aim is to investigate the complexity of querying data in the presence
of schema.org-ontologies, where the data is the markup that was extracted from web-
pages. While answering queries over such data is the main reasoning task that arises in
Schema.org applications and the Schema.org initiative specifies a format for the data
in terms of so-called items, no information at all is given on how the data is queried
(or used otherwise). We consider conjunctive queries (CQs) and unions of conjunctive
queries (UCQ), a basic querying mechanism that is ubiquitous in relational database
systems and research, and that also can be viewed as a core of the Semantic Web query
language SPARQL. In particular, we also consider CQs and UCQs without quantified
variables since these are not allowed in the relevant SPARQL entailment regimes Glimm
and Krötzsch [2010]. We view a pair (O, q) that consists of a schema.org-ontology and
an actual query as a compound query called an ontology-mediated query (OMQ).
    We start with the observation that evaluating OMQs is intractable in general, namely
Π2p -complete in combined complexity and CO NP-complete in data complexity. In the
main part of the paper, we therefore aim (i) to identify large and practically useful classes
of OMQs with lower computational complexity (both combined and data complexity),
and (ii) to explore the situation in much more detail to see whether we can obtain a full
classification of each schema.org ontology or each OMQ according to its data complexity.
While the utility of aim (i) is obvious, we note that aim (ii) is also most useful from a
user’s perspective as it clarifies the complexity of every concrete ontology or OMQ that
might be used in an actual application. Apart from classical tractability (that is, PT IME),
we are particularly interested in the rewritability of OMQs into first-order (FO) queries
(actually: UCQs) and into datalog programs. One reason is that this allows to implement
querying based on relational database systems and datalog engines, taking advantage
of those systems’ efficiency and maturity. Another reason is that there is significant
research on how to efficiently answer UCQs and datalog queries in cluster computing
models such as MapReduce Afrati and Ullman [2011, 2012], which is rather natural
when processing web-scale data.
    For both aims (i) and (ii) above, we start with analyzing basic schema.org ontologies
in which enumeration definitions (‘one of’ expressions) and datatypes are disallowed.
Regarding aim (i), we show that all OMQs which consist of a basic schema.org-ontology
and a CQ of qvar-size two (the connected components that consist exclusively of quanti-
fied variables have size at most two) are datalog-rewritable in polynomial time and can
be evaluated in PTime in combined complexity. This result complements results about
datalog-rewritability of OMQs for DLs with disjunction in Grau et al. [2013]; Kaminski
et al. [2014b,a]. We establish the same results for OMQs that consist of an unrestricted
schema.org-ontology and CQs without quantified variables.
    Regarding aim (ii), we start with classifying each single schema.org-ontology O
according to the data complexity of all OMQs (O, q) with q a UCQ. We establish a
dichotomy between AC0 and CO NP in the sense that for each ontology O either all these
OMQs are in AC0 or there is one OMQ that is CO NP-hard. The dichotomy comes with
a transparent syntactic characterization and is decidable in PT IME. Though beautiful,
the dichotomy is of limited use in practice since most interesting ontologies are of the
intractable kind.
    Therefore, we also consider an even more fine-grained classification on the level of
OMQs, establishing a useful connection to constraint satisfaction problems (CSPs) in the
spirit of Bienvenu et al. [2014b]. It turns out that even for basic schema.org-ontologies
and for ontologies that consist exclusively of enumeration definitions, a complexity
classification of OMQs implies a solution to the dichotomy conjecture for CSPs, which
is a famous open problem Feder and Vardi [1998]; Bulatov [2011]. However, the CSP
connection can also be used to obtain powerful positive results. In particular, we show
that it is decidable in NE XP T IME whether an OMQ based on a schema.org-ontology
and a restricted form of UCQ is FO-rewritable and, respectively, datalog-rewritable. We
also establish a PSpace lower bound for this problem.


2   Preliminaries

Let NC , NR , and NI be countably infinite and mutually disjoint sets of concept names,
role names, and individual names. Throughout the paper, concepts names will be denoted
by A, B, C, . . ., role names by r, s, t, . . ., and individual names by a, b, c, . . ..
    A schema.org-ontology consists of concept inclusions of different forms, role inclu-
sions, and enumeration definitions. A concept inclusion takes the form A v B (atomic
concept inclusion), ran(r) v A1 t· · ·tAn (range restriction), or dom(r) v A1 t· · ·tAn
(domain restriction). A role inclusion takes the form r v s.
Example 1. The following are examples of concept inclusions and role inclusions (last
line) in Schema.org 2015:
                   Movie v CreativeWork
             ran(musicBy) v Person t MusicGroup
            dom(musicBy) v Episode t Movie t RadioSeries t TVSeries
                   sibling v relatedTo
We now define enumeration definitions. Fix a set NE ⊆ NI of enumeration individuals
such that both NE and NI \ NE are infinite. An enumeration definition takes the form
A ≡ {a1 , . . . , an } with A ∈ NC and a1 , . . . , an ∈ NE .
Example 2. An example of an enumeration definition in Schema.org 2015 is
Booktype ≡ {ebook, hardcover, paperback}.
A datatype D = (D, ∆D ) consists of a datatype name D and a non-empty set of
data values ∆D . Examples of datatypes in Schema.org 2015 are Boolean, Integer, and
Text. We assume that datatype names and data values are distinct from the symbols
in NC ∪ NR ∪ NI and that there is an arbitrary but fixed set DT of datatypes such that
∆D1 ∩ ∆D2 = ∅ for all D1 6= D2 ∈ DT.
    To accommodate datatypes in ontologies, we generalize range restrictions to range
restrictions with datatypes, which are inclusions of the form ran(r) v A1 t · · · t An
with A1 , . . . , An concept names or datatype names from DT.
Example 3. An example of a range restriction with datatypes in Schema.org 2015 is
                       ran(acceptsReservation) v Boolean t Text
    A schema.org-ontology O is a finite set of concept inclusions (including range
restrictions with datatypes), role inclusions, and enumeration definitions. We denote by
NC (O) the set of concept names in O, by NR (O) the set of role names in O, and by
NE (O) the set of enumeration individuals in O.
    A data instance A is a finite set of concept assertions A(a) where  S A ∈ NC and
a ∈ NI ; and role assertions r(a, b) where r ∈ NR , a ∈ NI and b ∈ NI ∪ D∈DT ∆D . We
say that A is a data instance for the ontology O if A contains no enumeration individuals
except those in NE (O). We use Ind(A) to denote the set of all individuals (including
datatype elements) in A.
Example 4. Examples for assertions are Movie(a), name(a, ‘avatar’), director(a, b),
name(b, ‘Cam’).
Let O be a schema.org-ontology and A a data instance for O. An            S interpretation I =
(∆I , ·I ) for O consists of a non-empty set ∆I disjoint from D∈DT ∆D and with
∆I ∩ NE = NE (O), and a function ·I that maps
  – every concept name A to a subset AI of ∆I ,
  – every role name r to a subset rI of ∆I ×∆I,DT , where ∆I,DT = ∆I ∪ D∈DT ∆D ;
                                                                                    S
  – every individual name a ∈ (NI \ NE ) ∪ NE (O) to some aI ∈ ∆I such that aI = a
     for all a ∈ NE (O).
Note that we make the standard name assumption (and, therefore, unique name assump-
tion) for individuals in NE . Individual names from NE that do not occur in O (and thus
not in A) are not interpreted by I to avoid enforcing infinite domains.
     For an interpretation I, set dom(r)I = {d | (d, d0 ) ∈ rI } and ran(r)I = {d0 |
(d, d0 ) ∈ rI }. To achieve uniform notation, set DI = ∆D for every datatype (D, ∆D ) in
DT and dI = d for every d ∈ ∆D , D ∈ DT. For concept or datatype names A1 , . . . , An ,
set (A1 t · · · t An )I = AI1 ∪ · · · ∪ AIn . An interpretation I for an ontology O satisfies
a (concept or role) inclusion X1 v X2 ∈ O if X1I ⊆ X2I , an enumeration definition
A ≡ {a1 , . . . , an } if AI = {aI1 , . . . , aIn }, a concept assertion A(a) if aI ∈ AI , and a
role assertion r(a, b) if (aI , bI ) ∈ rI . Satisfaction of any of these objects is denoted
with “|=”, as in I |= X1 v X2 or I |= A(a).
     An interpretation I for O is a model of O if it satisfies all inclusions and definitions
in O and a model of a data instance A if it satisfies all assertions in A. We say that
A is satisfiable w.r.t. O if O and A have a common model. Let α be a concept or
role inclusion, or an enumeration definition. We say that α follows from O, in symbols
O |= α, if every model of O satisfies α.
    We introduceSthe query languages considered in this paper. A term t is either a
member of NI ∪ D∈DT ∆D or an individual variable taken from an infinite set NV
of such variables. A first-order query (FOQ) consist of a (domain-independent) first-
order formula ϕ(x) that uses unary predicates from NC ∪ {D | (D, D) ∈ DT}, binary
predicates from NR , and only terms as introduced above. The unary datatype predicates
are built-ins that identify the elements of the respective datatype. We call x the answer
variables of ϕ(x), the remaining variables are called quantified. A query without answer
variables is Boolean. A conjunctive query (CQ) is a FOQ of the form ∃y ϕ(x, y) where
ϕ(x, y) is a conjunction of atoms such that every answer variable x occurs in an atom
that uses a symbol from NC ∪ NR , that is, an answer variable x is not allowed to occur
exclusively in atoms of the form D(x) with D a datatype name (to ensure domain
independence). A union of conjunctive queries (UCQ) is a disjunction of CQs. A CQ
q can be regarded as a directed graph Gq with vertices {t | t term in q} and edges
{(t, t0 ) | r(t, t0 ) in q}. If Gq is acyclic and r(t1 , t2 ), s(t1 , t2 ) ∈ q implies r = s, then q
is an acyclic CQ. A UCQ is acyclic if all CQs in it are.
    We are interested in querying data instances A using a UCQ q(x) taking into account
the knowledge provided by an ontology O. A certain answer to q(x) in A under O is a
tuple a of elements of Ind(A) of the same length as x such that for every model I of O
and A, we have I |= q[a]. In this case, we write O, A |= q(a).
    Query evaluation is the problem to decide whether O, A |= q(a). For the combined
complexity of this problem, all of O, A, q, and a are the input. For the data complexity,
only A and a are the input. It often makes sense to combine the ontology O and actual
query q(x) into an ontology-mediated query (OMQ) Q = (O, q(x)), which can be
thought of as a compound overall query. The following can be shown using techniques
similar to those in Eiter et al. [1997]; Bienvenu et al. [2014b].
Theorem 1. Query evaluation of CQs and UCQs under schema.org-ontologies is Π2p -
complete in combined complexity. In data complexity, each OMQ (O, q) from this class
can be evaluated in CO NP; moreover, there is such a OMQ (with q a CQ) that is
CO NP-complete in data complexity.

An OMQ (O, q(x)) is FO-rewritable if there exists a FOQ Q(x) (called an FO-rewriting
of (O, q(x))) such that for every data instance A for O and all a ∈ Ind(A), we have
O, A |= q(a) iff IA |= Q(a) where IA is the interpretation that corresponds to A (in
the obvious way).
     We also consider datalog-rewritability, defined in the same way as FO-rewritability,
but using datalog programs in place of FOQs. Using Rossman’s homomorphism preser-
vation theorem Rossman [2008], one can show that an OMQ (O, q(x)) with O a
schema.org-ontology and q(x) a UCQ is FO-rewritable iff it has a UCQ-rewriting
iff it has a non-recursive datalog rewriting, see Bienvenu et al. [2014b] for more details
(in a slightly different context). Since non-recursive datalog-rewritings can be more
succinct than UCQ-rewritings, we will generally prefer the former.


3    Basic schema.org-Ontologies

We start with considering basic schema.org-ontologies, which are not allowed to contain
enumeration definitions and datatypes. The results obtained here can be easily extended
     A r             r       r             r     r B
b0                               ···                bm        A    A           A     A
     s   s       s       s             s                 a1 r b1 r b2 r      bm−1 r bm r a2
                                                                           ···
             a
                 (a) Data instance Am .                         (b) Data instance A0m .

             Fig. 1: ABoxes used in Example 5 and the paragraph below Theorem 10.

to basic schema.org-ontologies with datatypes but do not hold for ontologies with
enumeration definitions (as will be shown in the next section). In Schema.org 2015, 45
concept names from a total of 622 are defined using enumeration definitions, and hence
are not covered by the results presented in this section.
    We start with noting that the entailment problem for basic schema.org-ontologies is
decidable in polynomial time. This problem is to check whether O |= α for a given basic
schema.org-ontology O and a given inclusion α of the form allowed in such ontologies.
In fact, the algorithm is straightforward. For example, O |= ran(r) v A1 t · · · t An if
there is a role name s and a range restriction ran(s) v B1 t · · · t Bm ∈ O such that
OR |= r v s and OC |= Bj v A1 t · · · t An for all 1 ≤ j ≤ m, where OR and OC
denote the set of role inclusions and atomic concept inclusions in O.
Theorem 2. The entailment problem for basic schema.org-ontologies is in PT IME.
The hardness results reported in Theorem 5 crucially rely on existential quantification
in the actual query. In fact, it follows from results in Grau et al. [2013]; Kaminski et al.
[2014b] that given an OMQ Q = (O, q(x)) with O a basic schema.org-ontology and
q(x) a CQ without quantified variables, it is possible to construct a non-recursive datalog
rewriting of Q in polynomial time, and that such OMQs can be evaluated in PT IME in
combined complexity. We aim to push this bound further by admitting restricted forms
of quantification.
    A CQ q has qvar-size n if all connected components of quantified variables in the
undirected graph underlying Gq have size at most n. For example, quantifier-free CQs
have qvar-size 0 and the following query q(x, y) has qvar-size 1:
                            ^
                 ∃z1 ∃z2           (producedBy(z1 , v) ∧ musicBy(v, z2 ))
                                       v∈{x,y}

The above consequences of the work by Grau, Kaminski, et al. can easily be extended
to OMQs where queries have qvar-size one. In what follows, we consider qvar-size
two, which is more subtle and where, in contrast to qvar-size one, reasoning by case
distinction is required. The following example shows that there are CQs of qvar-size
two for which no non-recursive datalog rewriting exists.
Example 5. Let O = {ran(s) v A t B} and consider the following CQ of qvar-size
two: q(x) = ∃x1 ∃x2 (s(x, x1 ) ∧ A(x1 ) ∧ r(x1 , x2 ) ∧ B(x2 )). It is easy to see that
O, Am |= q(a) for every data instance Am with m ≥ 2 as defined in Figure 1a.
By applying locality arguments and using the data instances Am , one can in fact show
that (O, q(x)) is not FO-rewritable (note that removing one r(bi , bi+1 ) from Am results
in q(a) no longer being entailed).
Theorem 3. For every OMQ (O, q(x)) with O a basic schema.org-ontology and q(x)
a CQ of qvar-size at most two, one can construct a datalog-rewriting in polynomial time.
Moreover, evaluating OMQs from this class is in PT IME in combined complexity.
Applied to Example 5, the proof of Theorem 3 yields a datalog rewriting that consists of
the rules
              P (x1 , x2 , x) ← s(x, x1 ) ∧ X1 (x1 ) ∧ r(x1 , x2 ) ∧ X2 (x2 )
where the Xi range over A, B, and ∃y r(y, ·), plus

   IA (x1 , x) ← P (x1 , x2 , x) ∧ A(x1 )      IB (x2 , x) ← P (x1 , x2 , x) ∧ B(x2 )
   IA (x2 , x) ← P (x1 , x2 , x) ∧ IA (x1 , x) IB (x1 , x) ← P (x1 , x2 , x) ∧ IB (x2 , x)
   goal(x) ← s(x, x1 ) ∧ IA (x1 , x) ∧ r(x1 , x2 ) ∧ IB (x2 , x).

The recursive rule for IA (the one for IB is dual) says that if the only option to possibly
avoid a match for (x1 , x2 , x) is to color (x1 , x) with IA , then the only way to possibly
avoid a match for (x1 , x2 , x) is to color (x2 , x) with IA (otherwise, since ran(s) v
A t B ∈ O, it would have to be colored with IB which gives a match).
    The rewriting presented in Theorem 3 can easily be extended to accommodate
datatypes. For schema.org-ontologies O that are not basic, the rewriting is sound but not
necessarily complete, and can thus be used to compute approximate query answers.
    Interestingly, Theorem 3 cannot be generalized to UCQs. This follows from the
result in the full version that for basic schema.org-ontologies O and quantifier-free
UCQs q(x) (even without role atoms), the problem O, A |= q(a) is coNP-hard regarding
combined complexity for data instances A with a single individual a. Since evaluating
datalog programs in such data instances is in PT IME, datalog rewritings of UCQ-based
OMQs can thus not be constructed in polynomial time (unless PT IME equals NP).
We note that it is not difficult to show (and follows from FO-rewritability of instance
queries in DL-LiteHbool Artale et al. [2009]) that given an OMQ (O, q(x)) with O a basic
schema.org-ontology and q(x) a quantifier-free UCQ, one can construct an FO-rewriting
in exponential time, and thus query evaluation is in AC0 in data complexity.
    We now classify basic schema.org-ontologies O according to the data complexity of
evaluating OMQs (O, q) with q a UCQ (or CQ). It is convenient to work with minimized
ontologies where for all inclusions F v A1 t · · · t An ∈ O and all i ≤ n, there is a
model I of O and a d ∈ ∆I such that d satisfies F u Ai u u ¬Aj (defined in the usual
                                                              j6=i
way). Every schema.org-ontology can be rewritten in polynomial time into an equivalent
minimized one. We establish the following dichotomy theorem.
Theorem 4. Let O be a minimized basic schema.org-ontology. If there exists F v
A1 t · · · t An ∈ O with n ≥ 2, then there is a Boolean CQ q that uses only concept and
role names from O and such that (O, q) is CO NP-hard in data complexity. Otherwise, a
given OMQ (O, q) with q a UCQ can be rewritten into a non-recursive datalog-program
in polynomial time (and is thus in AC0 in data complexity).
The proof of the second part of Theorem 4 is easy: if there are no F v A1 t· · ·tAn ∈ O
with n ≥ 2, then O essentially is already a non-recursive datalog program and the
construction is straightforward. The proof of the hardness part is obtained by extending
the corresponding part of a dichotomy theorem for ALC-ontologies of depth one Lutz
and Wolter [2012]. The main differences between the two theorems are that (i) for basic
schema.org-ontologies, the dichotomy is decidable in PT IME (whereas decidability is
open for ALC), (ii) the CQs in CO NP-hard OMQs use only concept and role names
from O (this is not possible in ALC), and (iii) the dichotomy is between AC0 and CO NP
whereas for ALC OMQs can be complete for PT IME, NL, etc.
     By Theorem 4, disjunctions in domain and range restrictions are the only reason that
query answering is non-tractable for basic schema.org-ontologies. In Schema.org 2015,
14% of all range restrictions and 20% of all domain restrictions contain disjunctions.
     In Theorem 4, we have classified the data complexity of ontologies, quantifying over
the actual queries. In what follows, we aim to classify the data complexity of every OMQ.
This problem turns out to be much harder and, in fact, we show that a classification of
the data complexity of OMQs based on basic schema.org-ontologies and UCQs implies
a classification of constraint satisfaction problems according to their complexity (up
to FO-reductions), a famous open problem that is the subject of significant ongoing
research Feder and Vardi [1998]; Bulatov [2011].
     A signature is a set of concept and role names (also called symbols). Let B be a finite
interpretation that interprets only the symbols from a finite signature Σ. The constraint
satisfaction problem CSP(B) is to decide, given a data instance A over Σ, whether there
is a homomorphism from A to B. In this context, B is called the template of CSP(B).
Theorem 5. For every template B, one can construct in polynomial time an OMQ (O, q)
where O is a basic schema.org-ontology and q a Boolean acyclic UCQ such that the
complement of CSP(B) and (O, q) are mutually FO-reducible.
Theorem 13 below establishes the converse direction of Theorem 5 for unrestricted
schema.org-ontologies and a large class of (acyclic) UCQs. From Theorem 13, we obtain
a NE XP T IME-upper bound for deciding FO-rewritability and datalog-rewritability of
a large class of OMQs. It remains open whether this bound is tight, but we can show
a PS PACE lower bound for FO-rewritable using a reduction of the word problem of
PS PACE Turing machines. The proof uses the ontology O and data instances Am
from Example 5 and is similar to a PS PACE lower bound proof for FO-rewritability
in consistent query answering Lutz and Wolter [2015] which is, in turn, based on a
construction from Cosmadakis et al. [1988].
Theorem 6. It is PS PACE-hard to decide whether a given OMQ (O, q) with O a basic
schema.org-ontology and q a Boolean acyclic UCQ is FO-rewritable.

4   Incoherence and Unsatisfiability
In the subsequent section, we consider unrestricted schema.org ontologies instead of
basic ones, that is, we add back enumeration definitions and datatypes. The purpose of
this section is to deal with a complication that arises from this step, namely the potential
presence of inconsistencies. We start with inconsistencies that concern the ontology
alone and then consider inconsistencies that arise from combining an ontology with a
data instance.
    An ontology O is incoherent if there exists X ∈ NC ∪ NR such that X I = ∅ for
all models I of O. Incoherent ontologies can result from the UNA for enumeration
individuals such as in the ontology {A ≡ {a}, B ≡ {b}, A v B}, which has no model
if a 6= b; they can also arise from interactions between concept names and datatypes
such as in the ontology {ran(r) v Integer, ran(s) v A, r v s} with A ∈ NC which has
no model I with rI 6= ∅ since ∆I ∩ ∆Integer = ∅. Using Theorem 2, one can show:
Theorem 7. Incoherence of schema.org-ontologies can be decided in PTime.
We now turn to inconsistencies that arise from combining an ontology O with a data
instance A for O. As an example, consider O = {A ≡ {a}, B ≡ {b}} and A =
{A(c), B(c)}. Although O is coherent, A is unsatisfiable w.r.t. O. Like incoherence,
unsatisfiability is decidable in polynomial time. In fact, we can even show the following
stronger result.
Theorem 8. Given a schema.org-ontology O, one can compute in polynomial time
a non-recursive datalog program Π such that for any data instance A for O, A is
unsatisfiable w.r.t. O iff Π(A) 6= ∅.
In typical schema.org applications, the data is collected from the web and it is usually
not acceptable to simply report back an inconsistency and stop processing the query.
Instead, one would like to take maximum advantage of the data despite the presence of
an inconsistency. There are many semantics for inconsistent query answering that can be
used for this purpose. As efficiency is paramount in schema.org applications, our choice
is the pragmatic intersection repair (IAR) semantics which avoids CO NP-hardness in
data complexity Lembo et al. [2010]; Rosati [2011]; Bienvenu et al. [2014a]. A repair
of a data instance A w.r.t. an ontology O is a maximal subset A0 ⊆ A that is satisfiable
w.r.t. O. We use repO (A) to denoteT  the set of all repairs of A w.r.t. O. The idea of IAR
semantics is then to replace A with A0 ∈repO (A) A0 . In other words, we have to remove
from A all assertions that occur in some minimal subset A0 ⊆ A that is unsatisfiable
w.r.t. O. We call such an assertion a conflict assertion.
Theorem 9. Given a schema.org-ontology O and concept name A (resp. role name r),
one can compute a non-recursive datalog program Π such that for any data instance A
for O, Π(A) is the set of all a ∈ Ind(A) (resp. (a, b) ∈ Ind(A)2 ) such that A(a) (resp.
r(a, b)) is a conflict assertion in A.
By Theorem 9, we can adopt the IAR semantics by simply removing all conflict assertions
from the data instance before processing the query. Programs from Theorem 9 become
exponential in the worst case, but can be expected to be very small in practical cases.
In the remainder of the paper, we assume that ontologies are coherent and that A is
satisfiable w.r.t. O if we query a data instance A using an ontology O.


5   Unrestricted schema.org-Ontologies
We aim to lift the results from Section 3 to unrestricted schema.org-ontologies. Regarding
Theorem 3, it turns out that quantified variables in CQs are computationally much
more problematic when there are enumeration definitions in the ontology. In fact, one
can expect positive results only for quantifier-free CQs, and even then the required
constructions are quite subtle.
Theorem 10. Given an OMQ Q = (O, q) with O a schema.org-ontology and q a
quantifier-free CQ, one can construct in polynomial time a datalog-rewriting of Q.
Moreover, evaluating OMQs from this class is in PT IME in combined complexity. The
rewriting is non-recursive if q = A(x).
The following example illustrates the construction of the datalog program. Let O =
{A ≡ {a1 , a2 }} and q() = r(a1 , a2 ). Observe that O, A0m |= q() for every data instance
A0m defined in Figure 1b. Similarly to Example 5, one can use the data instances A0m to
show that (O, q()) is not FO-rewritable.
   A datalog-rewriting of (O, q()) is given by the program Πa1 ,a2 which contains

                         goal() ← r(a1 , a2 )
                         goal() ← r(a1 , x) ∧ pathA (x, y) ∧ r(y, a2 )
                    pathA (x, y) ← r(x, y) ∧ A(x) ∧ A(y)
                    pathA (x, y) ← pathA (x, z) ∧ pathA (z, y).

It is also instructive to check that O0 , A0m 6|= q() with O0 = {A ≡ {a1 , a2 , a3 }} because
in models of O0 , a3 can be identified with some bi , a1 with b1 , . . . , bi−1 and a2 with
bi+1 , . . . , bm , 1 ≤ i ≤ m.
     We now modify the datalog program above to obtain a rewriting of the OMQ
(O, q 0 (x, y)) with q(x, y) = r(x, y). First, we include in Πr the rules A(a1 ) ← true
and A(a2 ) ← true. Then we add the following rules:
                                                                    ^
      goal(x, y) ← r(x, y),        goal(x, y) ← A(x) ∧ A(y) ∧             Rai ,aj (x, y).
                                                                   1≤i,j≤2

We want to use the latter rule to check that x, y have to be mapped to {a1 , a2 }, and
that for every possible assignment ai , aj to x, y that is consistent (i.e., we do not have
x ∈ {a1 , a2 } and x 6= ai , and similarly for y), r(ai , aj ) is true. To this end, we add the
rules:
               Rai ,aj (x, y) ← neq(x, ai ) Rai ,aj (x, y) ← neq(y, aj )
               Rai ,aj (x, y) ← goal(ai , aj )
                neq(a1 , a2 ) ← true             neq(a2 , a1 ) ← true.
It remains to add rules 3 and 4 from Πa1 ,a2 and

                    goal(ai , aj ) ← r(ai , x) ∧ pathA (x, y) ∧ r(y, aj )

for 1 ≤ i, j ≤ 2 and i 6= j.
    Theorem 10 is tight in the sense that evaluating CQs with a single atom and a single
existentially quantified variable, as well as quantifier-free UCQs, is coNP-hard in data
complexity. For instance, let O = {dom(e) v A, ran(e) v A, A ≡ {r, g, b}}. Then, an
undirected graph G = (V, E) is 3-colorable iff O, {e(v, w) | (v, w) ∈ E} 6|= ∃x e(x, x).
Alternatively, one may replace the query by r(r, r) ∨ r(g, g) ∨ r(b, b). In fact, one can
prove the following variant of Theorem 5 which shows that classifying OMQs with
ontologies using only enumeration definitions and quantifier-free UCQs according to
their complexity is as hard as CSP.
Theorem 11. Given a template B, one can construct in polynomial time an OMQ (O, q)
where O only contains enumeration definitions and q is a Boolean variable-free UCQ
such that the complement of CSP(B) and (O, q) are mutually FO-reducible.
We now turn to classifying the complexity of ontologies and of OMQs, starting with a
generalization of Theorem 4 to unrestricted schema.org-ontologies.
Theorem 12. Let O be a coherent and minimized schema.org-ontology. If O contains
an enumeration definition A ≡ {a1 , . . . , an } with n ≥ 2 or contains an inclusion
F v A1 t · · · t An such that there are at least two concept names in {A1 , . . . , An } and
O 6|= F v A t       t D for any A with A ≡ {a} ∈ O, then (O, q) is coNP-hard
                  (D,∆D )∈DT
for some Boolean CQ q. Otherwise every (O, q) with q a UCQ is FO-rewritable (and
thus in AC0 in data complexity).
Note that, in contrast to Theorem 4, being in AC0 does not mean that no ‘real disjunction’
is available. For example, for O = {ran(r) v A t B, A v C, B v C, C ≡ {c}} and
A = {r(a, b)} we have O, A |= A(b) ∨ B(b) and neither A(b) nor B(b) are entailed.
This type of choice does not affect FO-rewritability, however, since it is restricted to
individuals that must be identified with a unique individual in NE (O). Note that, for
the hardness proof, we now need to use a role name that possibly does not occur in O.
For example, for O = {A ≡ {a1 , a2 }} there exists a Boolean CQ q such that (O, q) is
NP-hard, but constructing q requires a fresh role name.
    We now consider the complexity of single OMQs and show a converse of Theorems 5
and 11 for schema.org-ontologies and UCQs that are qvar-acyclic, that is, when all atoms
r(t, t0 ) with neither of t, t0 a quantified variable are dropped, then all CQs in it are acyclic.
We use generalized CSPs with marked elements in which instead of a single template B,
one considers a finite set Γ of templates whose signature contains, in addition to concept
and role names, a finite set of individual names. Homomorphisms have to respect also
the individual names and the problem is to decide whether there is a homomorphism
from the input interpretation to some B ∈ Γ . Every such CSP is mutually FO-reducible
with some standard CSP and FO-definability and datalog definability of the complement
of generalized CSPs with marked elements are NP-complete Bienvenu et al. [2014b].
Theorem 13. Given an OMQ (O, q) with O a schema.org-ontology and q a qvar-acyclic
UCQ, one can compute in exponential time a generalized CSP with marked elements Γ
such that (O, q) and the complement of CSP(Γ ) are mutually FO-reducible.
The proof uses an encoding of qvar-acyclic queries into concepts in the description
logic ALCIU O that extends ALC by inverse roles, the universal role, and nominals. It
extends the the template constructions of Bienvenu et al. [2014b] to description logics
with nominals. As a particularly interesting consequence of Theorem 13, we obtain:
Theorem 14. FO-rewritability and datalog-rewritability of OMQs (O, q) with O a
schema.org-ontology and q a qvar-acyclic UCQ are decidable in NE XP T IME.

6    Conclusion
The work presented in this paper lays a solid foundation for attacking many interesting
and practically relevant questions that can be asked about querying in the presence of
schema.org-ontologies. Topics of interest include different forms of queries such as
SPARQL and regular path queries as well as uncertainty in the data that accounts for
varying levels of trust in different data sources.
                                  Bibliography


Foto N. Afrati and Jeffrey D. Ullman. Optimizing multiway joins in a map-reduce
  environment. IEEE Trans. Knowl. Data Eng., 23(9):1282–1298, 2011.
Foto N. Afrati and Jeffrey D. Ullman. Transitive closure and recursive datalog imple-
  mented on clusters. In EDBT, pages 132–143, 2012.
Alessandro Artale, Diego Calvanese, Roman Kontchakov, and Michael Zakharyaschev.
  The DL-Lite family and relations. J. of Artifical Intelligence Research, 36:1–69, 2009.
Meghyn Bienvenu, Camille Bourgaux, and François Goasdoué. Querying inconsistent
  description logic knowledge bases under preferred repair semantics. In AAAI, pages
  996–1002, 2014.
Meghyn Bienvenu, Balder ten Cate, Carsten Lutz, and Frank Wolter. Ontology-based
  data access: A study through disjunctive datalog, CSP, and MMSNP. ACM Trans.
  Database Syst., 39(4):33, 2014.
Andrei A. Bulatov. On the CSP dichotomy conjecture. In CSR, pages 331–344, 2011.
Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and
  Riccardo Rosati. Tractable reasoning and efficient query answering in description
  logics: The DL-Lite family. J. Autom. Reasoning, 39(3):385–429, 2007.
Stavros S. Cosmadakis, Haim Gaifman, Paris C. Kanellakis, and Moshe Y. Vardi. De-
  cidable optimization problems for database logic programs (preliminary report). In
  Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing
  (STOC), pages 477–490, 1988.
Thomas Eiter, Georg Gottlob, and Heikki Mannila. Disjunctive datalog. ACM Trans.
  Database Syst., 22(3):364–418, 1997.
Tomás Feder and Moshe Y. Vardi. The computational structure of monotone monadic
  SNP and constraint satisfaction: A study through datalog and group theory. SIAM J.
  Comput., 28(1):57–104, 1998.
Birte Glimm and Markus Krötzsch. SPARQL beyond subgraph matching. In ISWC,
  volume 6496 of LNCS, pages 241–256. Springer, 2010.
Bernardo Cuenca Grau, Boris Motik, Giorgos Stoilos, and Ian Horrocks. Computing
  datalog rewritings beyond horn ontologies. In IJCAI, 2013.
Ramanathan V. Guha. Light at the end of the tunnel? Invited Talk, ISWC,
  https://www.youtube.com/watch?v=oFY-0QoxBi8, 2013.
Mark Kaminski, Yavor Nenov, and Bernardo Cuenca Grau. Computing datalog rewritings
  for disjunctive datalog programs and description logic ontologies. In Web Reasoning
  and Rule Systems, pages 76–91, 2014.
Mark Kaminski, Yavor Nenov, and Bernardo Cuenca Grau. Datalog rewritability of
  disjunctive datalog programs and its applications to ontology reasoning. In AAAI,
  pages 1077–1083, 2014.
Domenico Lembo, Maurizio Lenzerini, Riccardo Rosati, Marco Ruzzi, and
  Domenico Fabio Savo. Inconsistency-tolerant semantics for description logics. In
  Web Reasoning and Rule Systems, pages 103–117, 2010.
Carsten Lutz and Frank Wolter. Non-uniform data complexity of query answering in
  description logics. In Proc. of KR, 2012.
Carsten Lutz and Frank Wolter. On the relationship between consistent query answering
  and constraint satisfaction problems. In ICDT, 2015.
Peter F. Patel-Schneider. Analyzing schema.org. In ISWC, Part I, pages 261–276, 2014.
Riccardo Rosati. On the complexity of dealing with inconsistency in description logic
  ontologies. In IJCAI, pages 1057–1062, 2011.
Benjamin Rossman. Homomorphism preservation theorems. J. ACM, 55(3), 2008.