Minimising RDF Graphs under Rules and
                Constraints Revisited ?

    Reinhard Pichler1 , Axel Polleres2 , Sebastian Skritek1 , and Stefan Woltran1
 1
     Technische Universität Wien, {pichler, skritek, woltran}@dbai.tuwien.ac.at
       2
         DERI, National University of Ireland, Galway axel.polleres@deri.org


        Abstract. Based on practical observations on rule-based inference on
        RDF data, we study the problem of redundancy elimination in RDF in
        the presence of rules (in the form of Datalog rules) and constraints (in the
        form of so-called tuple-generating dependencies). To this end, we investi-
        gate the influence of several problem parameters (like restrictions on the
        size of the rules and/or the constraints) on the complexity of detecting
        redundancy. The main result of this paper is a fine-grained complexity
        analysis of both graph and rule minimisation in various settings.


1     Introduction
The Semantic Web promises to enable computers to gather machine readable
meta-data in the form of RDF statements published on the Web and make in-
ferences about these statements by means of accompanying standards such as
RDFS and OWL2. While complete OWL2 reasoning is hard – and in some sense
even inappropriate for Web data [1] – incomplete rule-based inference is be-
coming quite popular and supported by many RDF Stores and query engines:
frameworks like GiaBATA [2], Jena, Sesame, OWLIM,3 etc. allow for custom
inference on top of RDF Stores, supporting different rule-based fragments of
RDFS and OWL. Several such fragments have been defined in the literature,
such as ρDF [3], DLP [4], OWL− [5], ter Horst’s pD* [6], or SAOR [7], and –
more recently – the W3C standardised OWL2RL, a fragment of OWL imple-
mentable purely in terms of rule-based inference [8]. All these fragments have in
common that they are implementable by simple Datalog-like rules over RDF. As
an example, let us take (1) the sub-property rule from RDFS [9, Section 7.3] and
(2–6) rules for OWL2RL representing inverse properties and property chains:4
(1) { S P O . P subPropertyOf Q . uri(Q) }        ⇒ { S Q O }
(2) { S P O . P inverseOf Q . blank(O) ∧ uri(Q) } ⇒ { O Q S }
?
  R. Pichler, S. Skritek and S. Woltran were supported by the Vienna Science and
  Technology Fund (WWTF), project ICT08-032. A. Polleres was supported by Sci-
  ence Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
3
  cf. http://jena.sourceforge.net/, http://openrdf.org/, and http://ontotext.com/owlim/
4
  We disregard full URIs for common RDF terms, i.e., we just write e.g. inverseOf , for
  <http://www.w3.org/2002/07/owl#inverseOf>, name for <http://xmlns.com/foaf/0.1/name>, or
  creator for <http://purl.org/dc/elements/1.1/>, etc. Further, (P1 . . . Pn ) in RDF is short
  for a fresh variable X plus additional triples X first P1 . X1 rest X2 . ...Xn
  first Pn . Xn rest nil . using reserved terms first, rest, nil .
2

(3) { S P O . P inverseOf Q . uri(O) ∧ uri(Q) }    ⇒ { O Q S }
(4) { P inverseOf Q . uri(Q) }                     ⇒ { Q inverseOf P }
(5) { P inverseOf Q . blank(Q) }                   ⇒ { Q inverseOf P }
(6) { S P0 O1 . ... On Pn O. P propertyChainAxiom (P0 ...Pn ) }⇒{ S P O }

Let GD be an RDF graph talking about authors and their publications:
(7) GD = { <http://semanticweb.org/wiki/Pat Hayes> made <http://www.w3.org/TR/rdf-mt/>.
(8)        <http://semanticweb.org/wiki/Pat Hayes> name "Patrick J. Hayes".
(9)        <http://www.w3.org/TR/rdf-mt/> creator "Patrick J. Hayes".}

Moreover, let graph GO be part of the ontology defining the terms used in GD :
(10) GO = { name subPropertyOf label.
(11)        maker inverseOf made. made inverseOf maker.
(12)        creator propertyChainAxiom (maker label). }
    When storing the graph G = GD ∪ GO in an RDF Store that supports in-
ference over rules (1)–(6), different questions of redundancy arise like if some
statements may be deleted since they can be inferred by the rules. In our ex-
ample, statement (9) as well as the statement maker inverseOf made. from
line (11) may be deleted, since they could be reproduced by inference. Similarly,
suppose that we transfer the graph G = GD ∪ GO to a “weaker” RDF Store
that only supports rules (1)–(3). Then the question is if we thus loose any in-
ferences. In fact, the answer is no. Interestingly enough, standard rule sets, such
as OWL2RL are even known to be non-minimal [8, Section 4.3].
    We thus want to be able to answer the general question about redundancy of
both triples and rules. However, it is often important to limit the minimisation of
RDF graphs in such a way that certain consistency conditions must be preserved.
These consistency conditions can be expressed by means of constraints [10]. We
shall restrict ourselves here to constraints in the form of so-called tuple-generating
dependency (tgd) constraints, which are a generalisation of the familiar foreign-
key dependencies in the relational database world. Roughly speaking, a tgd may
be viewed as a generalised rule “read” as constraint. So, for instance, if we read
rule (6) as a constraint, we could say that graph G alone without rules satisfies
this constraint, and likewise the closure of G with respect to rules (1)-(3) does.
    Note that tgds can be more general than (Horn) rules in that they allow
otherwise unbound, existential variables in the head, possibly occurring in a
larger conjunct. That is, tgds are – rather than rules – constraining queries (in
the head) “triggered” by bindings coming from a query in the body ; for instance,
a constraint
(13) { A made D } ⇒ { A label N . D creator N}
would hold only on graphs where everybody who made something also has a
declared label and that label is also used to denote the creator. Note that con-
straint (13) holds on the closure of G with respect to rule (1) but – as opposed
to the constraint reading of (6) – not on G alone.
    The primary goal of our work is a systematic complexity analysis of both
graph and rule minimisation under constraints. To this end, we investigate the
influence of several problem parameters (like restrictions on the size of the rules
and/or the constraints) on the complexity of detecting redundancy. A first im-
portant step in this investigation has been recently made by Meier [11]. He
                                                                                 3

studied the following problem: Given a graph G, a set R of rules and set C of
tgds, can G be reduced to a proper subgraph G0 ⊂ G, such that G0 still satisfies
C and the closure of G0 under R coincides with the closure of G under R? For
the special case that both the rules in R and the constraints in C have bounded
size (referred to as b-boundedness), this problem was shown to be NP-complete
in [11]. In this paper, we want to extend the work initiated in [11] and pro-
vide a much more fine-grained analysis of the complexity, e.g., by weakening or
strengthening restrictions such as b-boundedness and by considering redundancy
elimination that only preserves RDF entailment (rather than keeping the closure
of the original graph under the original rules unchanged).
    We shall come up with a collection of complexity results, ranging from
tractability to Σ3P -completeness. Additionally, we address the orthogonal prob-
lem of rule minimisation, which has not been studied so far. We shall also discuss
further variations of the graph and rule minimisation problem. For instance, the
rules and tgds in [11] do not allow variables in predicate positions, which is a
severe restriction in the sense that many of the common RDF inferences rules
are not covered (e.g., all except rules (4) and (5) above). We will not make this
restriction, since it can be dropped without significant change of the complexity
results. We shall also briefly touch on the related problem of reducing rules or
triples without preserving completeness of the entire closure, but only ensuring
that the answers to certain queries are preserved. For instance, suppose that,
in our example, we are interested only in completeness with respect to creator
statements. Then rules (2)–(5) could in fact be dropped.
Organisation of the paper and summary of results. In Section 2, we recall
some basic notions and results. A conclusion and an outlook to future work are
given in Section 6. Sections 3–5 contain the main results of the paper, namely:
• Graph Minimisation. In Section 3, we provide a comprehensive complexity
analysis of the RDF graph minimisation problem, both when full reconstruction
of the graph or only RDF entailment is required. We study various settings which
result from different restrictions on the rules and/or tgds like restricting their
size, considering them as fixed, omitting them, or imposing no restrictions at all.
Our complexity results range from tractability to Σ3P -completeness.
• Rule Minimisation. In Section 4, we consider the problem of minimising the
set of rules. We show that the problem of finding redundant rules with respect to
a given RDF graph is NP-complete for b-bounded rules and not harder than ∆P      2
for arbitrary rules. Note that rule minimisation is closely related to the field of
Datalog equivalence and optimisation. We therefore discuss how the large body
of results in this area can be fruitfully applied to the problems studied here.
• Problem Variations. In Section 5, we analyse the complexity of further prob-
lems which are either variations of or strongly related to the graph and rule
minimisation problems mentioned above. For instance, rather than asking if an
RDF graph contains redundant tuples, we consider the problem whether an
RDF graph can be reduced below a certain size. We show that this problem
is NP-complete also in those settings where the graph minimisation problem is
tractable. We also discuss the effect of allowing blank nodes in predicate posi-
tions in the Datalog rules and of requiring only that the answers to a given set
of queries must be preserved by the minimisation of the graphs or rules.
4

Due to lack of space, proofs are only sketched. While for most of the hardness
proofs we only describe the idea of the reduction, membership proofs are either
also informal or even omitted. All proofs are worked out in detail in [12].


2     Preliminaries

Let U , B, and L denote pairwise disjoint alphabets for URI references, Blank
nodes (or variables) and Literals, respectively. We denote unions of these sets
simply by concatenating their names.5 An RDF statement (or triple) is a state-
ment of the form (s, p, o) ∈ UB × U × UBL, and an RDF graph is a set of triples.
In this paper, we do not distinguish between variables and blank nodes, but just
note that blank nodes/variables appearing in the data are understood to be ex-
istentially quantified within the scope of the whole RDF graph they appear in.
We write elements from B (U ) as alphanumeric strings starting with an upper
case letter (lower case letter or number), elements from L as quoted strings, and
– inspired by the common Turtle [13] syntax – RDF statements as white-space
separated triples and RDF graphs as ’.’ separated lists of triples in curly braces.
    It is convenient to define the notion of entailment between two RDF graphs
via the interpolation lemma from [9, Section 2] rather than in a model-theoretic
way: an RDF graph G1 entails G2 , written G1 |= G2 if a subgraph of G1 is an
instance of G2 , that is, if there exists a graph homomorphism, i.e., a blank node
mapping µ : B → UBL such that µ(G2 ) ⊆ G1 , where µ(G) denotes the graph
obtained by replacing every variable B ∈ B with µ(B). A homomorphism h0 is an
extension of a homomorphism h if h0 (B) = h(B) for all B on which h is defined.
Given G1 , G2 , deciding whether there exists a homomorphism G2 → G1 (and
also G1 |= G2 ) is well known to be NP-complete.
    We define a basic graph pattern (BGP) as a set of generalised triples (s0 , p0 , o0 )
∈ UBL × UBL × UBL, a filter condition as a conjunct of the unary predicates
uri(·), blank(·), literal(·) (denoting the unary relations U , B, and L, respec-
tively). A filtered basic graph pattern (FBGP) is a BGP conjoined with a filter
condition, the latter containing only variables already appearing in the BGP.
Given an FBGP P , we write BGP (P ) and F (P ) to denote its components, i.e.
its BGP and its filter condition, respectively.
    We define an RDF tuple-generating dependency (tgd) constraint (or simply
constraint) r as Ante ⇒ Con, where the antecedent Ante is an FBGP and the
consequent Con is a BGP. A constraint Ante ⇒ Con is a short-hand notation
for the first-order formula ∀x Ante(x) → (∃y)Con(x, y) (where y denotes the
blank nodes occurring in Con only, while x are the remaining blank nodes)
Hence, a constraint Ante ⇒ Con is satisfied over an RDF graph G if for each
homomorphism on x mapping BGP(Ante) to G, there exists an extension h0 of
h to y s.t. h0 (Con) ⊆ G. To increase the readability, we will sometimes explicitly
write out the quantifiers and variable vectors. RDF rules (or simply rules), are
syntactically restricted constraints, where all variables appearing in Con also
appear in Ante (akin to the common notion of safety [14] in Datalog). In the
following, we will call RDF rules with an empty filter condition Datalog rules.
5
    In this paper, we use a slightly simplified notion of RDF compared to [9], e.g. not
    considering typed literals separately.
                                                                                     5

We define the closure of a graph G with respect to a set R of rules, written
ClR (G) as usual by the least fix-point of the immediate consequence operator.
   We say that a rule or constraint is b-bounded if both, antecedent and conse-
quent contain at most b triples. For a given graph G or a given set R of rules,
we use XG ,XR (X ∈ {U, B, L}) to denote X ∩ Σ(G), resp. X ∩ Σ(R), where by
Σ(·) we denote the signature of a graph or ruleset, that is, the subset of UBL
used in G, or R, respectively. Finally, we write [n] to denote the set {1, . . . , n}.

3       RDF Minimisation
In this section, we study the complexity of RDF graph minimisation. For different
restrictions on the input parameters, the complexity varies between tractability
and Σ3P -completeness. Formally, we consider the following two basic problems:
Definition 1. Let MINI-RDF|= (G, R, C) be the following decision problem:
INPUT: RDF graph G, set R of RDF rules, set C of tgds (G satisfies C).
QUESTION: Is there a G0 ⊂ G s.t. ClR (G0 ) |= ClR (G) and G0 satisfies C?
Definition 2. Let MINI-RDF⊆ (G, R, C) be the following decision problem [11]:
INPUT: RDF graph G, set R of RDF rules, set C of tgds (G satisfies C).
QUESTION: Is there a G0 ⊂ G s.t. ClR (G) = ClR (G0 ) and G0 satisfies C?
    The MINI-RDF⊆ problem and the minimization of RDF graphs via entail-
ment aim at two kinds of redundancy elimination: In MINI-RDF⊆ , triples which
can be restored via the rules are considered as redundant while graph minimiza-
tion via entailment allows us to replace a graph G by a subgraph Ḡ ⊂ G if
Ḡ |= G holds (see [20]). The MINI-RDF|= (G, R, C) problem combines these two
approaches and thus yields the strongest redundancy criterion. Nevertheless, in
most cases, its complexity is not higher than for MINI-RDF⊆ (see Theorem 1).
    It is easy to see that the condition ClR (G) = ClR (G0 ) in Definition 2 is
equivalent to G ⊆ ClR (G0 ). The following lemma shows that similarly, for MINI-
RDF|= , it is enough to show ClR (G0 ) |= G rather than ClR (G0 ) |= ClR (G).
Lemma 1. Let G1 , G2 be RDF graphs and R a set of rules. Then the following
equivalence holds: ClR (G2 ) |= ClR (G1 ) ⇔ ClR (G2 ) |= G1 .
Theorem 1. For MINI-RDF|= and MINI-RDF⊆ , the complexity w.r.t. different
assumptions on the input (arbitrary, b-bounded, or fixed rule set; arbitrary, b-
bounded, fixed, or no constraints) is as depicted in Table 1.
The following lemma justifies that we do not have to give an explicit completeness
proof for each entry in Table 1, and points out a proof plan for Theorem 1.


    1      5     9      2      3      4      6      7      8      10     11     12

Fig. 1. Dependency graph: Numbers refer to lines in Table 1. An arrow from A to B
means that B is a special case of A.
6
                                   MINI-RDF|=           MINI-RDF⊆
                                     P
         (1) R arb., C arb.        Σ3 -complete         Σ3P -complete
         (2) R arb., C bb          NP-complete          NP-complete
         (3) R arb., C fixed       NP-complete          NP-complete
         (4) R arb., C = ∅         NP-complete          NP-complete
         (5) R bb., C arb.         Σ3P -complete        Σ3P -complete
         (6) R bb, C bb            NP-complete        NP-complete [11]
         (7) R bb, C fixed         NP-complete          NP-complete
         (8) R bb, C = ∅           NP-complete               in P
         (9) R fixed, C arb.       Σ3P -complete        Σ3P -complete
         (10) R fixed, C bb        NP-complete          NP-complete
         (11) R fixed, C fixed     NP-complete          NP-complete
         (12) R fixed, C = ∅       NP-complete               in P
                                        |=               ⊆
Table 1. The complexity of MINI-RDF and MINI-RDF w.r.t. input parameters.
(Where “bb” indicates the set to be b-bounded, and “arb.” allows for arbitrary sets.)

Lemma 2. The graph in Figure 1 correctly describes the dependencies between
the problems (identified by their line number) in Table 1, i.e.: If there is an arrow
from A to B, then B is a special case of A.
Hence an arrow from A to B means that membership results for A hold also
for B, and that hardness results for B apply also to A. Therefore, to prove
Theorem 1, it suffices to show the membership for (1),(2),(8) and the hardness
for (4),(9),(11),(12). Due to lack of space, we only work out the hardness results
for (9) and (11) (the latter only for MINI-RDF⊆ ). Before, we shortly discuss the
membership results and give an intuition of why they are correct. All proofs are
worked out in detail in the full paper [12].
    The most general case, (1), can be solved by a guess and check algorithm
that is allowed to use a Π2P oracle for the checks. One has to guess: a subgraph
G0 of G, a sequence of rule applications on G0 , and for each rule application
a homomorphism justifying that the rule is applicable. Note that ClR (G0 ) ⊆
AD3 (with AD = UG UR BG BR LG LR ). Hence if considering all possible rule
applications of length |AD|3 , one of them has to return ClR (G0 ). The most
expensive check is to test if G0 satisfies C. However, it obviously fits into Π2P .
    The following properties lead to the cases of lower complexity: If R is a
b-bounded set, then ClR (G0 ) can be computed in polynomial time [11, Proposi-
tion 9] and if C is a b-bounded set, then testing if G0 satisfies C is in PTIME [11,
Proposition 3]. For the tractable cases, note that if C = ∅, then not all subgraphs
of G have to be checked, but only those missing exactly one triple from G.
Lemma 3. The problems MINI-RDF|= (G, R, C) and MINI-RDF⊆ (G, R, C),
for fixed R and arbitrary C, are Σ3P -hard.
Proof. Σ3P -hardness is shown by reduction from the well-known Σ3P -complete
problem QSAT3 , of which we only give an informal       Vndescription here. Let an
instance of QSAT3 be given by F = ∃x1 ∀y1 ∃x2 i=1 Ci , with Ci = (li,1 ∨
li,2 ∨ li,3 ) (clearly, the restriction to 3-CNF is w.l.o.g.). The graph G created
contains on the one hand triples encoding truth assignments on clauses (e.g.
{0 h1 a001 . 0 h2 a001 . 1 h3 a001 } for the assignment (false, false, true)), and on
                                                                                          7

the other hand triples encoding the two possible truth assignments for variables
(e.g. {vi q1 a01 . vi q1 a10 } for xi ∈ x1 where vi is a new URI for each xi and the
URI a01 (resp. a10 ) denotes that xi evaluates to false, hence ¬xi evaluates to
true (resp. xi to true and ¬xi to false), together with further triples that allow
us to actually refer to the truth value of xi (resp. ¬xi )) under a selected truth
assignment. The rules and constraints are chosen in such a way that (1) the
triples encoding the truth assignment (false, false, false) for clauses must not be
present in any valid subgraph G0 ⊂ G, (2) for every xi ∈ x1 exactly one of the two
triples encoding a truth assignment must be present in G0 and (3) for all    Vnother
variables, both triples have to remain in G0 . The restrictions imposed by i=1 Ci
are encoded in one big tgd, where every homomorphism from its antecedent to
G0 defines a truth assignment for x1 and y1 . Thereby for every valid G0 all such
homomorphisms define the same truth assignment on x1 , hence the values for
x1 are determined by the selection of G0 . But every homomorphism defines a
different truth assignment on y1 , and there exists exactly one homomorphism
for each of the 2|y1 | truth assignments on y1 . The consequent of the tgd contains
a representation of the literals in each clause Ci and has the following property:
for every homomorphism h from the antecedent to G0 , there exists an extension
of h to a homomorphism h0 from the consequent to G0 iff this extension defines
a truth assignment on x2 such that the assignment on x1 , y1 and x2 maps the
representations of the clauses onto the possible truth assignments for clauses
present in G0 . As all triples encoding these truth assignments must be in G0 ,
except the ones for (false, false, false) which must not, such an extension for
every homomorphism from the antecedent to G0 implies that F is valid.              t
                                                                                   u

Lemma 4. The problems MINI-RDF⊆ (G, R, C) and MINI-RDF|= (G, R, C),
where both, R and C are considered to be fixed, are NP-hard.

Proof. As NP-hardness of MINI-RDF|= follows easily from the coNP-hardness
of testing if G is lean, we concentrate on MINI-RDF⊆ and prove its NP-hardness
by reduction from the 3-SAT problem. First, we fix the rules and tgds as R =
{{X 0 in I . X active I} ⇒ {X 0 active I}} and C = {{X active I . X in J} ⇒
{X active J}; {X clash X 0 . X active I . X 0 active I 0 . Y in J} ⇒ {Y active J}}.
    Now let an instance of 3-SAT be given by the formula F = C1 ∧ · · · ∧ Cn ,
where Ci = (li,1 ∨li,2 ∨li,3 ) and the li,j are literals. W.l.o.g., we assume that every
variable appears negated and unnegated in F . Then we construct an RDF graph
        ∗                                ∗
G = {li,j  in ci | i ∈ [n], j ∈ [3]} ∪ {li,j active ci | i ∈ [n], j ∈ [3]} ∪ {xj clash x̄j |
xj in F }, where we introduce new URIs ci (for every clause Ci ) and xj , x̄j (for
                                   ∗
every variable xj in F ), and li,j   = xj (resp. x̄j ) if li,j = xj (resp. ¬xj ).
Intuitively, the triples in G with predicate in encode the literals in F . If a triple
with predicate active remains in the selected subgraph G0 then the corresponding
literal in F is set to true. The triples with clash keep track of dual literals. t        u


4    Rule Minimisation
In this section, we study the rule minimisation problem of RDF graphs. Although
there is a huge amount of literature in the Datalog world addressing related
problems (as query containment), the particular nature of the problems we study,
8

requires a distinguished complexity analysis. Note that rules for RDF, when
written as Datalog rules, have a fixed predicate arity of three, which makes
problems computationally easier than in the general Datalog setting (see, e.g.
[15]). Depending on whether we consider the Datalog rules as b-bounded or not,
we obtain complexity results from NP-completeness to ∆P    2 -membership. The
rule minimisation problem is formally defined as follows. As the RDF graph
remains unchanged, constraints are irrelevant here.
Definition 3. Let RDF-RULEMIN|= (G, R) be the following decision problem:
INPUT: An RDF graph G and a set R of RDF rules.
QUESTION: Does there exist R0 ⊂ R s.t. ClR0 (G) |= ClR (G)?
Definition 4. Let RDF-RULEMIN⊆ (G, R) be the following decision problem:
INPUT: An RDF graph G and a set R of RDF rules.
QUESTION: Does there exist R0 ⊂ R s.t. ClR0 (G) = ClR (G)?
For the case that the set of rules is b-bounded, we can pinpoint the complexity
of the problem to NP.
Theorem 2. Deciding RDF-RULEMIN|= (G, R) for a set R of b-bounded rules
(for fixed b) is NP-complete.
Proof. The hardness is shown by reduction from the 3-SAT problem. The idea
of the reduction is, given some formula F in 3-CNF, to provide in the graph G
an encoding of all combinations of truth values under which a clause evaluates
to true, and to design the set R of rules such that the triples derivable by R
encode which literals occur together in some clause in F . Further, R is chosen
such that ClR0 (G) |= ClR (G) holds for any subset R0 ⊂ R iff all triples derivable
by R can be mapped into G. This mapping then defines a valid satisfying truth
assignment for F .                                                               2
Theorem 3. The problem RDF-RULEMIN⊆ (G, R), for a b-bounded set R of
rules, can be decided in PTIME, while for arbitrary rules R, both,
RDF-RULEMIN⊆ (G, R) and RDF-RULEMIN|= (G, R) are in ∆P2.

Proof. In the b-bounded case, the closure can be computed efficiently: It suffices
to compare the closure of G under R with the closure of G under every subset
of R missing exactly one rule. For arbitrary rules, the same strategy can be
used, but now an NP-oracle is needed to test if a rule is applicable. For RDF-
RULEMIN|= , this oracle can then also be used to test for entailment.           2
    In order to reduce the complexity of the problems RDF-RULEMIN⊆ (G, R)
and RDF-RULEMIN|= (G, R), one could seek for approximations of those prob-
lems. In fact, one option is to check for redundant rules in the set R of given
Datalog rules; or whether some rule is subsumed by another rule from R. The
first problem is known to be tractable while the test for rule subsumption is
NP-complete (see [16]). The latter result can be shown to hold also for rules of
bounded arity (which we deal with here); but becomes tractable in the case of
b-bounded rules. Further methods (e.g., folding and unfolding of rules) are well
understood for logic programs (see [17]), and could also apply to our domain. An
in-depth analysis how to use those results in our setting is left for future work.
                                                                                9

5   Problem Variations
In this section, we discuss some further problems which are variations of or
strongly related to the problems studied in the previous sections. We start by
a variation of the graph minimisation problem. But now we ask if G can be
replaced by a subgraph G0 whose size is bounded by some given bound k (rather
than an arbitrary subgraph G0 ⊂ G). Formally, we study the following problem.
Definition 5. Let MINI-RDFcard (G, R, C, k) be the following decision problem:
INPUT: An RDF graph G, a set R of RDF rules, a set C of tgds and integer k.
QUESTION: Does there exist a subgraph G0 ⊂ G with |G0 | ≤ k, s.t. G0 satisfies
C and G ⊆ ClR (G0 )?
It can be easily verified that for all cases in Table 1 that are at least NP-hard,
the complexity for MINI-RDFcard does not change. Intuitively, this is because
the nondeterministic algorithms for solving these problems all start with “guess
a subgraph G0 ⊂ G”, which can be easily changed to “guess a subgraph with at
most k triples”. Therefore, the only two interesting cases are MINI-RDF⊆ with a
b-bounded or fixed set R and no constraints, as they can be decided in PTIME.
We show that for MINI-RDFcard , the complexity goes up to NP-completeness.
Theorem 4. The problem MINI-RDFcard (G, R, C, k) is NP-complete if C = ∅
and R is either considered as fixed or a set of b-bounded rules (for fixed b).

Proof. The hardness proof is by reduction from the Vertex Cover problem. We
give the basic ideas of this reduction. Given some graph G = (V, E), the RDF
graph Grdf contains one distinct triple for every v ∈ V . The intuition is that
the subset of those triples contained in a valid subgraph G0 ⊂ Grdf describes
a vertex cover. We further have three rules, one that (given G0 ⊂ G) adds all
edges covered by the remaining vertices in G0 , one that (by repeated application)
checks whether all edges are covered, and finally one rule that, if indeed all
edges are covered, allows to restore the vertices from Grdf \ G0 . To allow to
express according rules, Grdf contains triples encoding further information (like
e.g. neighbourhood of vertices and edges). But as they cannot be derived by any
rule, they must remain unchanged in any valid G0 ⊂ Grdf . Further, their number
(say K) only depends on G, such that there exists a vertex cover of size k iff
there exists a valid G0 ⊂ Grdf of size K + k.                                   2

Next we want to identify the sources of the complexity of MINI-RDF|= and MINI-
RDF⊆ for the cases where C is allowed to contain arbitrary tgds. We show that
the complexity is independent of the rules, but arises mainly from the question
whether there exists some non-empty subgraph that satisfies all constraints.
Theorem 5. Let G be a RDF graph and C a set of tgds. Deciding whether there
              6 G0 ⊂ G s.t. G0 satisfies C is Σ3P -complete.
exists some ∅ =

Proof. Membership follows from Theorem 1. Hardness is shown by a modification
of the reduction given in the proof of Lemma 3. We give the intuition of these
modifications. In the aforementioned proof, the intuitive meaning of the rules,
together with the requirement G ⊆ ClR (G0 ), was that for each vi ∈ x1 , either
{vi q1 a01 } or {vi q1 a10 } has to remain in the subgraph G0 . However, this can
10

be also formulated as a constraint. By introducing an additional triple for every
vi ∈ x1 (e.g. {vi opt vi }) that is enforced to be contained in any non-empty
subgraph, the tgd {V opt V } ⇒ {V q1 A} does the job.                          2

    From the (full) proof of Lemma 4, it follows that for MINI-RDF|= , one source
of the NP-hardness is just to decide the entailment. However, similarly to the
last theorem, we can show that for b-bounded tgds, just testing for the existence
of a valid subgraph already contains the full hardness too.
Theorem 6. Let G be an RDF graph and C a set of b-bounded tgds. Deciding
                            6 G0 ⊂ G s.t. G0 satisfies C is NP-complete.
whether there exists some ∅ =

Proof. Membership follows from Theorem 1. Hardness is shown by reduction
from the SAT problem. The reduction is very similar to the one of Lemma 4,
only that all the implicit information about which triples must not be removed
from G (expressed by not providing rules to derive them) now have to be made
explicit as tgds. This however no longer allows for a fixed set of tgds, but makes
the number of tgds dependent on F .                                          Vn
     Let an arbitrary instance of SAT be given by the formula F = i=1 Ci with
Ci = (li,1 ∨ · · · ∨ li,ki ) (where li,j are literals). We assume that every variable
in F occurs negated and unnegated. Introducing two new URIs xi and x̄i for
every variable xi in F , and one new URI ci for every clause in F , we define G =
  ∗                                    ∗
{li,j in ci | i ∈ [n], j ∈ [ki ]} ∪ {li,j active ci | i ∈ [n], j ∈ [ki ]} ∪ {xj clash x̄j |
xj in F } ∪ {ci clause ci | i ∈ [n]} and C = { {X active I . X in J} ⇒
{X active J}; {X clash X 0 . X active I . X 0 active I 0 . Y in J} ⇒ {Y active J};
{I clause I} ⇒ {X active I}} ∪ {{A B C} ⇒ {ci clause ci } | i ∈ [n]} ∪
                    ∗
{{A B C} ⇒ {li,j        in ci } | i ∈ [n], j ∈ [ki ]} ∪ {{A B C} ⇒ {xj clash x̄j } |
                    ∗
xj in F } where li,j = xα if li,j = xα and li,j  ∗
                                                    = x̄α if li,j = ¬xα .                2

Note that the above proof would work – by a grounding of the last constraints
and introducing some additional constraints – as well if no blank nodes in the
predicate positions were allowed. However, the proof would be less compact.
    We now show that the complexity of the problems remains unchanged by
allowing additional predicates uri (.), blank (.), lit(.) to restrict the type of a value
in a Datalog rule, that is, allowing general RDF rules as defined in Section 2.
Note that for every x ∈ U ∪ B ∪ L occurring in some RDF-graph G (i.e. for every
element of the active domain) it can be easily recognised whether it belongs
to U , B or L: This could be either decided using syntactic criteria, or by a
lookup in U , B and L (although those sets are supposed to be countable infinite,
one can assume that UG , BG and LG , i.e. the elements of the active domain,
are the “first” elements of these sets). Therefore, determining the type of some
element requires at most polynomial time in the size of G. Therefore, for every
element x of the active domain of G, we create a ground atom Bt (x), Ut (x)
or Lt (x), depending on the type of x. By encoding an atom blank (X) as triple
{X blank X} in G, we can make this information available for rule application
without increasing the complexity of the problem.
    The same argument allows us to overcome the problem that the closure w.r.t.
a rule set R contains invalid RDF triples (containing e.g. a blank node in a pred-
icate position). Depending on whether invalid triples are allowed in intermediate
                                                                                   11

results or not, we can pursue one of the following two strategies: (1) In a post-
processing step, we can check for every triple in R(G) whether it is valid or not.
In the latter case, it is removed. (2) If invalid triples should also be excluded from
any intermediate results, then the rules can be (automatically) augmented by at
most 2 additional predicates in the rule body, urib(A) and uri (B), assuming that
the rule head is {A B C}. The predicate urib(.) can be easily defined from uri (.)
and blank (.) by e.g. {uri (X)} ⇒ {urib(X)} and {blank (X)} ⇒ {urib(X)}. This
is similar to variations of rules (2)–(5) in Section 1, where the filter conditions
guaranteed valid intermediate triples.
    We conclude this section by discussing a further variant of (graph or rule)
minimisation that guarantees completeness only w.r.t. a given set of conjunctive
queries (CQs). Such a minimisation is of high interest, e.g. when importing data
into an RDF Store that provides a narrow query interface only. Below we give a
formal definition for the CQ-variant of the MINI-RDF⊆ problem.
Definition 6. MINI-RDF⊆,CQ (G, R, C, Q) is the following decision problem:
INPUT: An RDF graph G, a set R of RDF rules, a set C of tgds (G satisfies
C), and a set Q of CQs, defined by BGPs.
QUESTION: Is there a G0 ⊂ G s.t. (1) for every q ∈ Q, the answers to q over
ClR (G) coincide with the answers to q over ClR (G0 ) and (2) G0 satisfies C?
The CQ-variants of MINI-RDF|= , RDF-RULEMIN⊆ , and RDF-RULEMIN|= are
defined analogously. A detailed complexity analysis of the CQ-variants of all
problems studied here is outside the scope of this paper. However, we briefly
mention that the hardness results from Sections 3 and 4 carry over to the CQ-
variants. To see this, we note that, for instance, MINI-RDF⊆ corresponds to the
special case of MINI-RDF⊆,CQ where Q = {S P O}.


6    Conclusions
We proved a collection of complexity results for minimisation problems over RDF
graphs where we considered various restrictions on the rules and tdgs. One such
restriction was b-boundedness [11]. We note that this restriction can be relaxed
by bounding not necessarily the size of the rules (or tgds) but only the maximal
number of blank nodes occurring in the rules (or tgds) — in the Datalog world,
Vardi [18] showed that such a restriction decreases complexity.
    The minimisation problems considered here are driven by practical needs
to represent RDF data compactly or tailor them to engines supporting different
rule sets. Our results also provide a basis for eliminating redundancies in existing
practically relevant rule sets, such as OWL2RL [8]. We believe that our results
will gain even more relevance with the advent of novel standards such as the
W3C rule interchange format (RIF) which will allow one to enrich RDFS and
OWL with Web-publishable custom rule sets [19].
    We have briefly discussed further variants of minimisations like (graph or
rule) minimisations that guarantee completeness with respect to a given set of
conjunctive queries (CQs). We have sketched that the hardness results from
Sections 3 and 4 still hold when we take CQs into account. Identifying the
precise complexity of RDF minimisation relative to a set of CQs is an important
task for future work. As another direction of future work, we plan to cast the
12

obtained results into practical algorithms to “compress” RDF graphs and rule
sets, investigate related relevant problems such as “trading” triples for rules,
or vice versa, and experimentally evaluating effects of such transformations on
query answering with dynamic inference such as sketched in [2].


References
1. Hogan, A., Decker, S.: On the ostensibly silent ’W’ in OWL 2 RL. In: Proc. RR’09.
   Volume 5837 of LNCS., Springer (2009) 118–134
2. Ianni, G., Krennwallner, T., Martello, A., Polleres, A.: Dynamic querying of mass-
   storage rdf data with rule-based entailment regimes. In: Proc. ISWC’09. Volume
   5823 of LNCS., Springer (2009) 310–327
3. Muñoz, S., Pérez, J., Gutiérrez, C.: Minimal deductive systems for RDF. In: Proc.
   ESWC’07. Volume 4519 of LNCS., Springer (2007) 53–67
4. Grosof, B.N., Horrocks, I., Volz, R., Decker, S.: Description Logic Programs: Com-
   bining Logic Programs with Description Logics. In: Proc. WWW’03. (2003) 48–57
5. de Bruijn, J., Polleres, A., Lara, R., Fensel, D.: OWL− . WSML D20.1v0.2 (2005)
6. ter Horst, H.J.: Completeness, decidability and complexity of entailment for RDF
   schema and a semantic extension involving the OWL vocabulary. J. Web Sem.
   3(2-3) (2005) 79–115
7. Hogan, A., Harth, A., Polleres, A.: Scalable authoritative OWL reasoning for the
   web. International Journal on Semantic Web and Information Systems 5(2) (2009)
8. Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 Web
   ontology language profiles (October 2009) W3C Rec.
9. Hayes, P.: RDF semantics. Technical report, W3C (February 2004) W3C Rec.
10. Lausen, G., 0002, M.M., Schmidt, M.: Sparqling constraints for rdf. In: Proc.
   EDBT’08. Volume 261 of ACM International Conference Proceeding Series., ACM
   (2008) 499–509
11. Meier, M.: Towards Rule-Based Minimization of RDF Graphs under Constraints.
   In: Proc. RR’08. Volume 5341 of LNCS., Springer (2008) 89–103
12. Pichler, R., Polleres, A., Skritek, S., Woltran, S.: Minimizing RDF graphs under
   rules and constraints revisited. Technical report (April 2010) available at http:
   //www.deri.ie/fileadmin/documents/DERI-TR-2010-04-23.pdf.
13. Beckett, D., Berners-Lee, T.: Turtle - Terse RDF Triple Language (January 2008)
   W3C Team Submission, http://www.w3.org/TeamSubmission/turtle/.
14. Ullman, J.D.: Principles of Database and Knowledge Base Systems. Computer
   Science Press, New York, NY, USA (1989)
15. Eiter, T., Faber, W., Fink, M., Woltran, S.: Complexity results for answer set
   programming with bounded predicate arities and implications. Ann. Math. Artif.
   Intell. 51(2-4) (2007) 123–165
16. Eiter, T., Fink, M., Tompits, H., Traxler, P., Woltran, S.: Replacements in non-
   ground answer-set programming. In: Proc. KR’06, AAAI Press (2006) 340–351
17. Pettorossi, A., Proietti, M.: Transformation of logic programs. In Gabbay, D.M.,
   Hogger, C.J., Robinson, J.A., eds.: Handbook of Logic in Artificial Intelligence and
   Logic Programming. Volume 5., Oxford University Press (1998) 697–787
18. Vardi, M.: On the complexity of bounded-variable queries. In: Proc. PODS’95,
   ACM Press (1995) 266–276
19. de Bruijn, J.: RIF RDF and OWL Compatibility (October 2008) W3C Cand. Rec.
20. Gutierrez, C., Hurtado, C., Mendelzon, A.: Foundations of semantic web databases.
   In: Proc. PODS’04, ACM (2004) 95–106