Revisiting RDF storage layouts for efficient
                 query answering

M. Buron1,2 , F. Goasdoué3 , I. Manolescu1,2 , T. Merabti1,2 , and M.-L. Mugnier4
                          1
                        Inria firstname.lastname@inria.fr
                      2
                       Institut Polytechnique de Paris, France
               3
                  Univ Rennes, CNRS, IRISA, France fg@irisa.fr
           4
             Univ Montpellier, LIRMM, Inria, France mugnier@lirmm.fr

       Abstract. The performance of query answering in an RDF database
       strongly depends on the data layout, that is, the way data is split in
       persistent data structures. We consider answering Basic Graph Pattern
       Queries (BGPQs), and in particular those with variables (also) in class
       and property positions, in the presence of RDFS ontologies, both through
       data saturation and query reformulation. We show that such demanding
       queries often lead to inefficient query answering on two popular storage
       layouts, so-called T and CP. We present novel query answering algo-
       rithms on the TCP layout, which combines T and CP. In exchange to
       occupying more storage space, e.g. on an inexpensive disk, TCP avoids
       the bad or even catastrophic performance that T and/or CP sometimes
       exhibit. We also introduce summary-based pruning, a novel technique
       based on existing RDF quotient summaries, which improves query an-
       swering performance on the T, CP and the more robust TCP layouts.

1    Introduction
We consider the problem of efficiently querying an RDF database, which stores
RDF graphs persistently (e.g., on a disk) and allows queries and updates on the
graphs, possibly concurrently by several users. We are interested in answering
queries on a graph, taking into account an RDF Schema (RDFS, in short)
ontology and associated RDFS entailment. We consider general SPARQL
conjunctive queries (a.k.a. basic graph pattern queries, or BGPQs), which
allow variables in any subject, property, or object position of query triples. For
instance, in the query q(x, u) ← (x, :name, :Alice), (x, y, z), (z, rdf:type, u), where
x, y, z, u are variables and the other terms are IRIs, the property y in the second
triple is a variable, just like u which is the type of z. Such queries allow to fully
take advantage of the freedom RDF provides: one does not need to know the
relation between x and z, nor the exact type of z, to query the graph.
Answering a BGPQ in an RDF database requires translating it into a description
of work that the execution engine must perform; without loss of generality, we
call this work description a query plan, as is common in the database literature.
Specifically, we distinguish a logical plan specifying the operations to use to
answer the query, from a physical (executable) plan, derived from the logical one
with the help of statistics and cost parameters characterizing the data (size,
value frequencies etc.) and the execution environment (hardware etc.)
Both plans start by accessing some data from the store and continue with various
other processing steps (e.g., filtering, combining multiple inputs etc.). The set of


                                         17
Copyright 2020 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).
2          M. Buron, F. Goasdoué , I. Manolescu, T. Merabti, and M.-L. Mugnier

persistent data structures that hold the data of an RDF graph in the database
are called storage layout. When a set of frequent BGPQs are known in advance,
they can be used to design a workload-aware layout, which optimizes data access
for these queries, e.g., [7,15,20]. Lacking a known query set, a workload-unaware
layout is used, with the two most popular being: T (triple), e.g., [8,25,19],
which stores all triples together as a single (s, p, o) (subject, predicate, object)
collection; and CP (class and property) [4], which separates the data for each
property and class, i.e., as (i) a collection of (s, o) pairs for every property p,
and (ii) a collection of all the resources s that have a given type c in the graph.
Indexes can be added to both the T and CP layouts.
In this work, we focus on translating BGPQs, including general ones, for
query answering under RDFS ontologies and entailment, into logical plans,
on workload-unaware layouts. We target logical plans for generality, since
physical plans strongly depend on the RDF database implementation, the pres-
ence of indexes etc.; these decisions are best left to the optimization and ex-
ecution layer, and we do not study them here. However, as we will show, the
choice of the logical plan can massively impact the space of alternatives (physical
plans) considered for the query, and thus the query answering performance. In
particular, we translate our plans in SQL (if the RDF database has a relational
back-end) or SPARQL (if a native RDF engine evaluates them), which enables
to retain all benefits of system-specific optimization. We study translation for
the two classical ways of taking the ontology into account for query
answering: by materializing entailed triples in the RDF graph (graph satura-
tion) or by compiling relevant parts of the ontology in the query, which yields a
union of BGPQs (query reformulation). Our contributions are the following:
 1. We introduce the novel workload-unaware TCP layout, which combines the
     data structures of both T and CP, and an associated algebraic translation
     of BGPQs into logical plans over TCP.
 2. We introduce summary-based pruning, an optimization technique of inde-
     pendent interest, that reduces query answering costs on the T, CP and TCP
     layouts, both when using graph saturation and query reformulation.
 3. We experimentally validate the performance benefits of the TCP layout and
     translation, and of summary-based pruning, on a relational and a native
     RDF database. Our experiments are detailed online5 with the code and data
     necessary in order to reproduce them.
Below, after the preliminaries, Section 3 recalls algebraic query translations for
the T and CP layouts. We explain why naı̈ve algebraic translation on CP leads
to poor performance, not only for general BGPQs (as noted since [21]), but also
for reformulated ones (whether general or not). This is due to interleaved joins
and unions, which limit the optimization opportunities in the RDF database.
At the same time, the T layout entails repeated self-joins of the whole triple
set, degrading performance on large graphs and complex queries (this motivated
introducing the CP layout [4]). Section 4 presents our technical contributions
and Section 5 our experiments. We then discuss related works and conclude.
    5
        https://gitlab.inria.fr/mburon/graph-layout-experiments


                                          18
                     Revisiting RDF storage layouts for efficient query answering                               3

RDFS constraint Schema triple notation RDFS constraint Schema triple notation
Subclass           (s, ≺sc , o)             Subproperty             (s, ≺sp , o)
Domain typing      (s, ←-d , o)             Range typing            (s, ,→r , o)
            RDF assertion Data triple notation
            Class assertion     (s, τ, o)
            Property assertion (s, p, o) with p 6∈ {τ, ≺sc , ≺sp , ←-d , ,→r }
                                      Table 1. RDF statements.


       (:OpenArt, ≺sc , :Article), (:GOpenArt, ≺sc , :OpenArt), (:Prof, ≺sc , :Person),
       (:teaches, ←-d , :Prof), (:author, ,→r , :Person), (:firstAuth, ≺sp , :author),
Gex = (:art1, :title, “RDF storage”), (:Alice, :name, “Alice”)(:art1, :firstAuth, :Alice),
       (:Alice, :teaches, :algo101), (:art1, :author, :Bob), (:Bob, :name, “Bob”),
       (:art1, rdf:type, :GOpenArt)
  Fig. 1. Sample RDF graph Gex (schema triples on top and data triples below).
Name [2] Entailment rule                                       Name Entailment rule
rdfs5    (p1 , ≺sp , p2 ), (p2 , ≺sp , p3 ) → (p1 , ≺sp , p3 ) ext4 (p, ≺sp , p1 ), (p1 , ,→r , o) → (p, ,→r , o)
rdfs11   (s, ≺sc , o), (o, ≺sc , o1 ) → (s, ≺sc , o1 )         rdfs2 (p, ←-d , o), (s1 , p, o1 ) → (s1 , τ, o)
ext1     (p, ←-d , o), (o, ≺sc , o1 ) → (p, ←-d , o1 )         rdfs3 (p, ,→r , o), (s1 , p, o1 ) → (o1 , τ, o)
ext2     (p, ,→r , o), (o, ≺sc , o1 ) → (p, ,→r , o1 )         rdfs7 (p1 , ≺sp , p2 ), (s, p1 , o) → (s, p2 , o)
ext3     (p, ≺sp , p1 ), (p1 , ←-d , o) → (p, ←-d , o)         rdfs9 (s, ≺sc , o), (s1 , τ, s) → (s1 , τ, o)
                      Table 2. Sample set R of RDFS entailment rules.

2    Preliminaries
We rely on three pairwise disjoint sets of values: the sets I of IRIs (resource
identifiers), L of literals (constants) and B of blank nodes modeling unknown
IRIs or literals. A triple (s, p, o) ∈ (I ∪ B) × I × (L ∪ I ∪ B) states that its
subject s has the property p with the object value o [1]. We denote by Val(G) the
set of all values (IRIs, blank nodes and literals) occurring in an RDF graph G.
In triples, we use :b (possibly with indices) to denote blank nodes, and strings
between quotes to denote literals.
We distinguish schema triples from data ones. The former state RDF Schema
(RDFS) constraints on classes and properties: subclass (specialization relation
between classes), subproperty (specialization relation between properties), do-
main (typing of the first attribute of a property), and range (typing of the sec-
ond attribute of a property). An RDFS ontology (or ontology, in short) is a set of
schema triples. The ontology of an RDF graph G is the set of schema triples of G.
The other triples, i.e., data triples, describe data by typing resources with classes
or stating how resources are related through properties. Table 1 introduces our
short notations for schema and data triples.
Example 1 (Running example). Figure 1 shows a sample graph Gex , describing
articles and their authors; some articles are Open Access (:OpenArt), a subclass
of which are Green Open Access ones (:GOpenArt).
An entailment rule (or simply rule) r has the form body(r) → head(r), where
body(r) and head(r) are RDF graphs, respectively called body and head of the


                                                      19
4       M. Buron, F. Goasdoué , I. Manolescu, T. Merabti, and M.-L. Mugnier

rule r. In this work, we consider the set of RDFS entailment rules R shown
in Table 2, which allow reasoning with an RDFS ontology; in the table, all values
except RDF reserved IRIs are blank nodes. These rules either combine schema
triples to entail schema triples (rdfs5, rdfs11, ext1 to ext4), or combine schema
triples together with data triples to entail data triples (rdfs2, rdfs3, rdfs7 and
rdfs9). The saturation of a graph G with the rule set R, denoted GR , allows
materializing its semantics, by iteratively augmenting it with the triples it entails
using R, until reaching a fixpoint; this process is finite [2].
Example 2. The saturation of Gex with R (Table 2) is: (Gex )R = Gex ∪
{(:GOpenArt, ≺sc , :Article), (:teaches, ←-d , :Person), (:firstAuth, ,→r , :Person),
(:Alice, τ, :Prof), (:Bob, τ, :Person), (:art1, :author, :Alice), (:art1, τ, :OpenArt),
(:Alice, τ, :Person), (:art1, τ, :Article)}
Given a set of variables V disjoint from I ∪ B ∪ L , a basic graph pat-
tern (BGP in short) is a set of triple patterns (or triples in short, when non-
ambiguous) belonging to (I ∪ B ∪ V ) × (I ∪ V ) × (I ∪ B ∪ L ∪ V ). The set of
variables (resp. values: IRIs, blank nodes, literals and variables) occurring in a
BGP P is denoted by Var(P) (resp. Val(P )). Note that a variable may occur
in any position of a triple pattern. In particular, we say that a variable x
occurs in property position for a triple of the form (−, x, −) and in class position
for a triple of the form (−, τ, x).
Definition 1 (BGPQ). A BGPQ q is of the form q(x̄) ← P , where P is a
BGP (also denoted body(q)) and x̄ ⊆ Var(P) are the answer variables of q.
For simplicity, below, we will assume that BGPQs have no blank nodes, as
it is well-known that these can be replaced by non-answer variables [3]. We
also consider a slight generalization of BGPQs, namely partially instantiated
BGPQs: such queries are obtained from BGPQs by substituting some variables
with values from I ∪L ∪B. For simplicity again, we will not distinguish between
a standard and a partially instantiated BGPQ. The semantics of a BGPQ on a
graph is defined through homomorphisms from the query body to the graph:
Definition 2 (BGP-to-RDF homomorphism). A homomorphism from a
BGP P to an RDF graph G is a function ϕ from Val(P ) to Val(G) such that for
any triple (s, p, o) ∈ P , the triple (ϕ(s), ϕ(p), ϕ(o)) is in G, with ϕ the identity
on IRIs and literals.
We distinguish query evaluation, whose result is just based on the explicit
graph triples, from query answering that also accounts for its implicit triples.
Definition 3 (Evaluation and answering). Let q be a (partially instantiated)
BGPQ. The answer set to q on a graph G w.r.t. rule set R is: q(G, R) = {hϕ(x̄)i |
ϕ homomorphism from body(q) to GR }. If x̄ = ∅, i.e., q is a Boolean query, q
is true iff q(G, R) = {hi}. The evaluation of q on G, denoted q(G, ∅), or q(G)
in short, is obtained through homomorphisms from body(q) to G only.
These notions and notations naturally extend to unions of (partially instanti-
ated) BGPQs, or UBGPQs in short.
Example 3 (Example query). Consider the BGPQ asking who is writing which


                                           20
                  Revisiting RDF storage layouts for efficient query answering             5

kind of articles: q(x, y) ← (z, :author, x), (z, τ, y), (y, ≺sc , :Article). Its evaluation
on Gex is empty. However, the answer set of q on Gex w.r.t. R is q(Gex , R) =
{h:Alice, :GOpenArti, h:Alice, :OpenArti, h:Bob, :GOpenArti, h:Bob, :OpenArti}.
Many RDF data management systems use saturation-based query answering,
i.e., query evaluation on the previously saturated graph; clearly, from the above
definition, q(G, R) = q(GR ) holds. An alternative technique is reformulation-
based query answering, e.g., [6,18,9], which injects the ontological knowledge
into a reformulated query, whose simple evaluation on G yields the complete
answer set of the original query. More precisely, given a BGPQ q asked on a graph
G, the reformulation of q w.r.t. to R and the ontology of G, denoted QR , is such
that q(G, R) = QR (G). A property of the technique proposed in [9], on which we
rely in this paper, is that the reformulated query does not contain schema triples
anymore; intuitively, such triples are evaluated on the ontology of the queried
graph during the query reformulation. Further, from now on, we assume that
(U)BGPQs, in particular those produced through reformulation, are minimal:
non-minimality incurs redundancy of triples in BGPQs, or of BGPQs within
UBGPQs. Well-known query minimization techniques exist for this purpose.
Example 4. The reformulation of the example query q (Example 3) is:
                 q(x, :OpenArt) ← (z, :author, x), (z, τ, :OpenArt)
                 ∪ q(x, :OpenArt) ← (z, :firstAuth, x), (z, τ, :OpenArt)
                 ∪ q(x, :OpenArt) ← (z, :author, x), (z, τ, :GOpenArt)
            QR =
                 ∪ q(x, :OpenArt) ← (z, :firstAuth, x), (z, τ, :GOpenArt)
                 ∪ q(x, :GOpenArt) ← (z, :author, x), (z, τ, :GOpenArt)
                 ∪ q(x, :GOpenArt) ← (z, :firstAuth, x), (z, τ, :GOpenArt)
It can be checked that QR (Gex ) = q(Gex , R).

3    BGPQ answering through translation on T and CP
We now detail how (saturation- or reformulation-based) query answering can be
performed on the T and the CP storage layouts (Sections 3.1 and 3.2).
3.1 BGPQ answering on the T layout
Let t(S, P, O) be the table storing the triples of a graph G for the T layout.
Saturation-based query answering. The saturation GR of G is stored in the table
t. The algebraic translation of a BGPQ q(x̄) ← t1 , . . . , tn on the T layout is:
                        T (q) = πq (./jcond (αT (t1 ), . . . , αT (tn ))
where αT , the query triple translation for the T layout, translates the i-th query
triple ti (si , pi , oi ) into an algebraic expression of the form σscond (t), where t is the
triple table, and scond is a (possibly empty) set of selections over the attributes
of t; specifically, if si (respectively, pi , oi ) is an IRI or a literal, scond contains a
predicate of the form S = si (and similarly for pi , oi ); jcond is a conjunction of
join predicates containing, for every variable appearing in several positions (in
one or several triples) in q, an equality between the respective attributes in the
αT (ti ) triple translations; finally πq is a projection on the attributes from the
α(ti )’s corresponding to the answer variables of q, or the values to which such
variables are bound in case of a partially instantiated query.


                                              21
6         M. Buron, F. Goasdoué , I. Manolescu, T. Merabti, and M.-L. Mugnier

Example 5. The example query translates on the T layout as:
    πt1 .O,t2 .O (./t1 .S=t2 .S,t2 .O=t3 .S (σP =:author (t), σP =τ (t), σP =≺sc ∧O=:Article (t)))
In the above, the selection σP =:author (t) restricts the triples from the t table
to those having the attribute P equal to :author. Similarly, σP =τ (t) corre-
sponds to the selection P = τ . On the atom t3 , αT applies a double selec-
tion σP =≺sc ∧O=:Article (t), since t3 has only one variable in position S. Further,
./jcond =./t1 .S=t2 .S,t2 .O=t3 .S joins the three previous selections, where t1 .S =
t2 .S and t2 .O = t3 .S respectively reflect the co-occurrences of variables z and y.
The final projection πt1 .O,t2 .O returns the pairs of values obtained for (x, y).
Reformulation-based query answering. Here, the graph G is stored in t (but not
its saturation), and every incoming
                              Sn       BGPQ q is reformulated into a (partially
instantiated) UBGPQ QR = i=1 qi , whose algebraic translation on the T layout
is the union of the algebraic translations of its (partially instantiated) BGPQs:
                                         Sn
                               T (QR ) = i=1 T (qi )
Example 6. Consider again the example query q and its reformulation QR shown
in Example 4. The algebraic translation T (QR ) is:
 πt1 .O,:OpenArt (./t1 .S=t2 .S (σP =:author (t), σP =τ ∧O=:OpenArt (t))
 ∪ πt1 .O,:OpenArt (./t1 .S=t2 .S (σP =:firstAuth (t), σP =τ ∧O=:OpenArt (t))
 ∪ πt1 .O,:OpenArt (./t1 .S=t2 .S (σP =:author (t), σP =τ ∧O=:GOpenArt (t))
 ∪ πt1 .O,:OpenArt (./t1 .S=t2 .S (σP =:firstAuth (t), σP =τ ∧O=:GOpenArt (t))
 ∪ πt1 .O,:GOpenArt (./t1 .S=t2 .S (σP =:author (t)), σP =τ ∧O=:GOpenArt (t))
 ∪ πt1 .O,:GOpenArt (./t1 .S=t2 .S (σP =:firstAuth (t), σP =τ ∧O=:GOpenArt (t))
3.2 BGPQ answering on the CP layout
With the CP layout, an RDF graph is stored as a set of tables corresponding
to classes and properties: for each class c, there is a table tc (S) that stores all
subjects s of triples (s, τ, c), and for each data or schema property p 6= τ , there is
a table tp (S, O) that stores all subject-object pairs (s, o) for triples (s, p, o). We
call any such tc a class table, and tp a property table. The CP layout speeds up
data access for queries which specify the class in every triple whose property is τ
and specify the property in every triple. However, as noted in [21], it may render
the evaluation of general queries, with variables in class or property position,
inefficient, as the triples they refer to may be in any tc or tp tables, respectively.
Saturation-based query answering. Assume that GR is stored using the CP
layout. To obtain the answers to a BGPQ q(x̄) ← t1 , . . . , tn , a first simple naı̈ve
translation, which can be traced back to [4,21], is:
                      CP (q) = πq (./jcond (αCP (t1 ), · · · , αCP (tn )))
where πq and jcond are defined as for the T layout, and αCP , the query triple
translation for the CP layout, is:
              πS,τ,c (σscond (tc )) if t = (s, τ, c) with c 6∈ V
            
                                                                               (1)
                                                                V
            
              π       (σ      (t  )) if t =  (s, p, o) with p ∈
                                                              6    ∪ {τ }       (2)
            
 αCP (t) = SS,p,O scond p
                     α CP ((s, τ, c))  if t = (s, τ, x) with  x ∈ V             (3)
             c∈C
            
              αCP ((s, τ, o)) ∪ p∈P αCP ((s, p, o)) if t = (s, x, o) with x ∈ V (4)
                                 S


                                                 22
                  Revisiting RDF storage layouts for efficient query answering             7

where C and P are, respectively, the set of classes and of properties other than τ
in the queried graph, and scond is a (possibly empty) conjunction of selections,
just as we defined it for αT , but over the class and property tables instead of
the triple table t.
Example 7. The naive translation of the example query q on the CP layout is:
 πt1 .O,t2 .O (./t1 .S=t2 .S,t2 .O=t3 .S (πS,:author,O (t:author ),
       πs,τ,:GOpenArt (t:GOpenArt ) ∪ πs,τ,:OpenArt (t:OpenArt ) ∪ πs,τ,:Article (t:Article )
       ∪ πs,τ,:Prof (t:Prof ) ∪ πs,τ,:Person (t:Person ) ∪ πS,≺sc ,O (σO=:Article (t≺sc )))
Note that in cases (3) and (4) above, αCP introduces unions under joins, as il-
lustrated by the previous example. This leads to suboptimal evaluation per-
formance in many data management engines, which may optimize and execute
efficiently a join over several data collections, but do not attempt to reorder (com-
mute) joins with unions. For instance, the query (x, :a, :a1 ), (x, y, z), (z, τ, u),
(z, :b, :b1 ) translates into a plan that joins (among others) the union of all the
tables (for (x, y, z)) with the union of all class tables (for (z, τ, u)). Most sys-
tems execute this “literally”, i.e., they build and materialize these two very large
unions, which is very costly, before joining them with the first and last triple6 .
To avoid such unions under joins, we rely on the notion of instantiation, which
has been used in various query answering techniques e.g., [15,18]:
Query instantiation. The instantiation of a BGPQ q consists in instantiating the
variables in q that must be bound to classes and properties of the queried graph,
in all possible ways, which yields a (partially instantiated) UBGPQ. Given a
BGPQ q and a graph G, we denote by q p,G (resp. q c,G ) its property instantiation
(resp. class instantiation), which is the UBGPQ obtained by instantiating all
its variables in property position (resp. in class position), by all combinations of
properties (resp. classes) of G.
Example 8. The class instantiation q c,Gex of the example query q, where the only
instantiated variable is y, is:
   q(x, :GOpenArt) ← (z, :author, x), (z, τ, :GOpenArt), (:GOpenArt, ≺sc , :Article)
   ∪ q(x, :OpenArt) ← (z, :author, x), (z, τ, :OpenArt), (:OpenArt, ≺sc , :Article)
   ∪ q(x, :Article) ← (z, :author, x), (z, τ, :Article), (:Article, ≺sc , :Article)
   ∪ q(x, :Prof) ← (z, :author, x), (z, τ, :Prof), (:Prof, ≺sc , :Article)
   ∪ q(x, :Person) ← (z, :author, x), (z, τ, :Person), (:Person, ≺sc , :Article)

Class and property instantiations extend from BGPQs to UBGPQs in the natural
way. Given a UBGPQ of the form Q = q 1 ∪ q 2 · · · ∪ q n , we set:

          Qp,G = q1p,G ∪ q2p,G · · · ∪ qnp,G and Qc,G = q1c,G ∪ q2c,G · · · ∪ qnc,G

Then, the instantiation of Q w.r.t. a graph G is the following:

                              QG = (Qc,G )p,G ∪ (Qp,G )c,G
  6
    We checked this on systems that disclose their query execution strategy; experi-
ments with others who do not, confirm the same hypothesis (see Section 5).


                                            23
8        M. Buron, F. Goasdoué , I. Manolescu, T. Merabti, and M.-L. Mugnier

We need both terms of the above union, exactly in the case when some variable
of Q appears both in a property and in a class position. These cases are rare
and easy to detect, thus in practice we only use one of the unions as soon as
we detect both are not needed. Crucially, (U)BGPQ instantiation is correct for
saturation- and reformulation-based query answering. Intuitively, this is because
instantiation enumerates all possible combinations of classes and properties that
query reformulation or evaluation may find in G.
We can now define the Sinstantiation-based query translation. A BGPQ q is first
                         n
instantiated into q G = i=1 q i , then translated on the CP layout as:
                                          [n
                             CP (q G ) =      CP (q i )
                                                i=1

                           G
Importantly, because q does not contain any variable in class or property po-
sition, every naı̈ve translation CP (q i ) within CP (q G ) avoids both (3) or (4) in
the αCP triple transformation function. Hence, the translation avoids the intro-
duction of unions under joins, with their potential bad impact on performance.
Example 9. Consider the query q and its instantiation q Gex = q c,Gex in Exam-
ple 8. The instantiation-based translation of q corresponds to the naı̈ve transla-
tion of q Gex :
                    [
                       πt1 .O,u (./t1 .S=t2 .S,t2 .O=t3 .S (πS,:author,O (t:author ),
u∈{:GOpenArt,:OpenArt,:Article,:Prof,:Person}               πS,τ,u (tu ),
                                                            πS,≺sc ,O (σS=u,O=:Article (t≺sc ))
Reformulation-based query answering. The graph G is again stored using the
CP layout (but not saturated). In this case, answering a BGPQ q starts by
computing its reformulation QR w.r.t. the ontology of G. Then, we obtain the
answers q(G, R) either through CP (QR ), the naı̈ve translation of QR , or through
CP (QG R ), the instantiation-based translation of QR , i.e., the naı̈ve translation
of its instantiation QGR ; as in the previous section, the algebraic translation of
a UBGPQ is defined as the union of the algebraic translations of its BGPQs.
Instantiating QR generally increases its size, but, by removing variables in class
and property positions, it avoids the unions under joins introduced in cases (3)
and (4) by the αCP triple translation function.
Example 10. Consider the query q and its reformulation QR from Example 4.
Here, since no variable of QR occurs in class or property position, CP (QR ) and
CP (QG  R ) lead to the same algebraic expression:
πt1 .O,:OpenArt (./t1 .S=t2 .S (πS,:author,O (t:author ), πS,τ,:OpenArt (t:OpenArt ))
∪ πt1 .O,:OpenArt (./t1 .S=t2 .S (πS,:firstAuth,O (t:firstAuth ), πS,τ,:OpenArt (t:OpenArt ))
∪ πt1 .O,:OpenArt (./t1 .S=t2 .S (πS,:author,O (t:author ), πS,τ,:GOpenArt (t:GOpenArt ))
∪ πt1 .O,:OpenArt (./t1 .S=t2 .S (πS,:firstAuth,O (t:firstAuth ), πS,τ,:GOpenArt (t:GOpenArt ))
∪ πt1 .O,:GOpenArt (./t1 .S=t2 .S (πS,:author,O (t:author ), πS,τ,:GOpenArt (t:GOpenArt ))
∪ πt1 .O,:GOpenArt (./t1 .S=t2 .S (πS,:firstAuth,O (t:firstAuth ), πS,τ,:GOpenArt (t:GOpenArt ))
4    Taming general BGP answering performance
Below, we present our technical contributions: the TCP layout and its algebraic
translation (Section 4.1), and summary-based pruning (Section 4.2).


                                              24
                      Revisiting RDF storage layouts for efficient query answering                           9

4.1     BGPQ answering based on the TCP layout
The TCP layout combines T and CP with the aim of getting the best of both,
while avoiding the performance problems they respectively entail (Sections 3.1
and 3.2). Here, an RDF graph is stored both in the triple table t of the T layout
and in the tc class and tp property tables of the CP layout. The rationale behind
this is that CP is efficient to access triples when the data structures holding the
triples we need to access are immediately clear from the query, and small ; this
is the case with query triples of the form (s, τ, c) or (s, p, o) for a known class
c or property p. However, with query triples of the form (s, τ, x) and (s, x, o),
the CP translation introduces unions, typically executed before joins, degrading
performance. Interestingly, direct access to a potentially large share of a graph’s
triples is exactly what the T layout supports well - thus our idea to combine them.
As we show in the next section, this allows to significantly improve performance,
at expense of some extra storage space, typically inexpensive since it is on disk.
Saturation-based query answering. Let us assume that the saturation GR of a
graph G is stored in the TCP layout. The answers to a BGPQ q ← t1 , . . . , tn
are obtained through its algebraic transformation for the TCP layout:

                     T CP (q) = πq (./jcond (αT CP (t1 ), · · · , αT CP (tn )))

where πq and jcond are defined as for the T and CP layouts, and αT CP , the
query triple translation for the TCP layout, is:
                αCP (t) if t = (s, τ, c) or t = (s, p, o) with c 6∈ V , p 6∈ V ∪ {τ }
              
  αT CP (t) =
                αT (t) otherwise, i.e., if t = (s, τ, x) or t = (s, x, o) with x ∈ V
Importantly, αT CP translates the triples that penalize the CP layout into t
atoms, and never into a union: hence, αT CP avoids the cases (3) and (4) of αCP .
Example 11. The translation of the example query for the TCP layout combines
the T layout for the second triple and the CP layout for the others:
πt1 .O,t2 .O (./t1 .S=t2 .S,t2 .O=t3 .S (πS,:author,O (t:author ), σP =τ (t), πS,≺sc ,O (σO=:Article (t≺sc ))))

Reformulation-based query answering. Again, the answers to a query q are
                                                                       Sncom-
puted by evaluating the algebraic translation of its reformulation QR = i=1 q i ,
but now for the TCP layout:
                                        Sn
                          T CP (QR ) = i=1 T CP (q i )
Example 12. Consider the query q(x, y, z) → (x, τ, z), (x, :firstAuth, y) asking
for all resources with their type and first author. Its reformulation w.r.t. Gex ’s
ontology is shown below (the last four union terms are omitted for space reasons):
                   q(x, y, z) ← (x, τ, z), (x, :firstAuth, y)
                   ∪ q(x, y, :Article) ← (x, τ, :OpenArt), (x, :firstAuth, y)
              QR = ∪ q(x, y, :Article) ← (x, τ, :GOpenArt), (x, :firstAuth, y)
                   ∪ q(x, y, :OpenArt) ← (x, τ, :GOpenArt), (x, :firstAuth, y)
                   ∪ q(x, y, :Person) ← (x, τ, :Prof), (x, :firstAuth, y) . . .

Its algebraic translation on the TCP layout (similarly abridged) is:


                                                     25
10      M. Buron, F. Goasdoué , I. Manolescu, T. Merabti, and M.-L. Mugnier

πt1 .S,t2 .O,t1 .O (./t1 .S=t2 .S (σP =τ (t), πS,:firstAuth,O (t:firstAuth )))
∪ πt1 .S,t2 .O,:Article (./t1 .S=t2 .S (πS,τ,:OpenArt (t:OpenArt ), πS,:firstAuth,O (t:firstAuth )))
∪ πt1 .S,t2 .O,:Article (./t1 .S=t2 .S (πS,τ,:GOpenArt (t:GOpenArt ), πS,:firstAuth,O (t:firstAuth )))
∪ πt1 .S,t2 .O,:OpenArt (./t1 .S=t2 .S (πS,τ,:GOpenArt (t:GOpenArt ), πS,:firstAuth,O (t:firstAuth )))
∪ πt1 .S,t2 .O,:Person (./t1 .S=t2 .S (πS,τ,:Prof (t:Prof ), πS,:firstAuth,O (t:firstAuth ))) . . .
Above, the first union term refers to the triple table t, while the others do not.

4.2   Summary-based query pruning
We now introduce an optimization technique, which can be applied on any stor-
age layout to reduce (U)BGPQ answering time. It allows detecting some BGPQs
with an empty answer set on a graph, without evaluating them, by using a (typ-
ically much smaller) structural summary of this graph.
Given an RDF graph G and an equivalence relation ≡ among the nodes in G, i.e.,
the subjects and objects of triples, an RDF quotient summary [12] is an RDF
graph G/≡ built as follows. A node is created in G/≡ for each equivalence class
among G’s nodes; further, for any triple (n1 , p, n2 ) ∈ G, the triple (m1 , p, m2 )
appears in G/≡ , where m1 and m2 represent the equivalence class of n1 and n2
respectively. If there are large equivalence classes in G, summarization is a form of
compression. Several types of RDF quotient summaries have been proposed [12];
in our experiments, we used the RDFQuotient summary construction tool [14],
due to its online availability and low summary construction cost (linear in the
number of triples of G). An RDFQuotient summary represents each class and
property node by itself, and consider they are not equivalent to any other G
node; thus, G and any quotient summary G/≡ have the same schema triples.
Crucially, it holds that if q(G/≡ ) = ∅ then q(G) = ∅, for q a structural (U)BGPQ,
i.e., in which the subjects and objects of query triples are either class and prop-
erty IRIs, or variables. Intuitively, this result holds because structural queries
only allow selecting subject, property and object values that are preserved through
summarization (class and property IRIs). Note however that the opposite does
not hold in general, i.e., q(G/≡ ) may have results while q(G) does not. We
exploit this result by defining the structural version of a BGPQ q, denoted
q str , which is obtained by replacing in q the literals and the IRIs that are
not class or property IRIs, by fresh variables. For example, the structural ver-
sion of the query q(x) ← (x, τ, :OpenArt), (x, :firstAuth, :Alice) is: q str (x) ←
(x, τ, :OpenArt), (x, :firstAuth, y), with :Alice replaced by y. Hence,
                                                                  S     when a sum-
mary G/≡ is available, we can use it to prune a UBGPQ Q = i qi by removing
from the union all the qi terms for which qistr (G/≡ ) = ∅. As explained above, this
may fail to prune some qi with no results on G, but it preserves query results:
                                    Q(G) = Qpruned (G)

where Qpruned is the result of pruning Q. As our experiments will show, this
generally leads to a significant reduction of query answering time.
5     Experimental evaluation
We now describe experiments comparing the query answering methods presented
in the previous sections, on the T, CP and TCP layouts.


                                              26
                Revisiting RDF storage layouts for efficient query answering    11

5.1 Experimental settings
We implemented the T, CP and TCP layouts in OntoSQL (https://ontosql.
inria.fr), a Java platform providing efficient RDF storage and saturation- and
reformulation-based query answering on top of an RDBMS [9,10,16] (Postgres
9.6 in these experiments). OntoSQL encodes IRIs and literals as integers, and
a dictionary table allows going from one to the other; each table (t, tp or tc )
is indexed on all the subsets of its attributes. To use OntoSQL, we express our
algebraic translations in SQL. We checked that in Postgres query plans, the
relative positions of unions and joins in the query plans chosen by the RDBMS
are those from our translations; [10,11] showed that this holds for two other
major RDBMSs. However, the RDBMS takes all optimization decisions, based
on its cost model and statistics. To put this into perspective also with respect
to native RDF engines, we ran the same experiments also on Virtuoso Open
Source Edition 7.2, to whom we provided SPARQL queries, which correspond
exactly to our algebraic translation on the T layout. Virtuoso also controls its
optimization decisions, and has full control over its store.
For summary-based pruning, we used the RDFQuotient (https://rdfquotient.
inria.fr) tool to build the “typed strong” summary [14] of a graph G; this
summary is denoted G/TS . The summary groups typed nodes according to their
types, and untyped nodes by exploiting the similarity of their incoming/outgo-
ing properties (see [14] for details). In general, any quotient summary could be
used; a large (more detailed) summary makes pruning more accurate, but also
slower since it needs to query the summary.
Hardware We used a server with 2,7 GHz Intel Core i7 processors and 160 GB
of RAM, running CentOs Linux 7.5.
  Graph     |G| |G/TS | |GR | |(GR )/TS | |G|T |G|CP |G|TCP |GR |T |GR |CP |GR |TCP
  LUBM 100M 340 131M                439 28.95 11.95 37.89 32.52 13.52         39.31
  DBLP 88M 290 147M                 708 26.07 16.35 35.89 38.39 22.91         52.72
Table 3. Graph and summary sizes (number of triples), OntoSQL database sizes (in
GB), including all indexes, for the T, CP and TCP layouts.

RDF graphs We used two benchmark graphs: a LUBM [17] graph of 100M
triples, as well as a graph of DBLP bibliographic data endowed with an ontology
of 14 classes and 44 schema triples. Table 3 shows, for these graphs and their
saturation, the graph and summary sizes, and the sizes of the OntoSQL databases
storing them in the T, CP and TCP layout. As expected, TCP takes most space,
approx. 90% of the sum of the T and CP database sizes. However, this is stable
storage (e.g., disk) space; the selective data access enabled by the class and
property tables, and by indexes, as well as cost-based optimization, ensure that
the data loaded in memory to process a query is much smaller.
Queries We used two diverse sets of queries, having from 1 to 11 triples (4
on average) on LUBM, and from 2 to 9 triples (5.9 on average) on DBLP. Each
query has 1 or 2 triple(s) of the form (s, τ, x) and/or (s, x, o), except Q11 and
Q15 on DBLP which do not contain any. Table 4 shows their number of answers,
and the number of BGPQs in: their instantiation (q G ), reformulation (QR ),
and instantiation of their reformulation (QGR ), before and after summary-based


                                        27
12      M. Buron, F. Goasdoué , I. Manolescu, T. Merabti, and M.-L. Mugnier

Query               Q01 Q02 Q03 Q04 Q05 Q06 Q07 Q08 Q09                Q10 Q11 Q12 Q13 Q14
|q(G, R)|         2.78M 0.59M 2.15M 1.72M 0.47M 2.77M 25K 3696 2003 17.35M 187 518K 857 9.85M
|q G |                45    45     45    82 2025      45 45 82 3690      82 82    37 1369   82
|(q G )pruned |       45    45     45    67 1806      44 44    3 88      68 59    23 43     51
|QR |                318   106    146    68   216     80 318   9 36      88  9    27 39 1152
|(QR )pruned |       120    56     59    48   108     34 173   7  2      78  7    21 21    889
|QG R|               447   149    226    68   216    121 447   9 36     169  9    63 1797 1188
|(QG R)
        pruned
                |    206    99    135    48   108     73 299   7  2     144  7    42 50    892
Query             Q01 Q02 Q03 Q04 Q05 Q06 Q07 Q08 Q09 Q10 Q11 Q12 Q13 Q14 Q15
|q(G, R)|         72K 24K 96.3M 361K 4K 10 138 42K 1.52M 3.09M 957K 414K 409K 16.3K 460K
|q G |              18 18       18    18 50 900 900 18       900 900     1  18 324      50   1
|(q G )pruned |     18 18       18    18    3 136 136 17      85  85     1  18 324       3   1
|QR |              117 297     265 117 873 4257 3163 36 1500 3652 243       39 381 129 243
|(QR )pruned | 56 205          151    56 616 3089 2166 32 1184 2620     36  24 174 114       3
|QG R|             252 440     408 252 1529 5927 3345 72       1 324     3  84 2088 270 243
|(QG R)
        pruned
                | 161 331      277 161 1224 3275 2189 37       1  64     3  69 1497 117      3
Table 4. Statistics of our queries on LUBM (top) and DBLP (bottom); M stands for
millions and K for thousands.


 T SAT CP SAT NAIVE CP SAT INS CP SAT INS PRUN TCP SAT VIRTUOSO SAT
T (q)(GR ) CP (q)(GR ) CP (q G )(GR ) CP ((q G )pruned )(GR ) T CP (q)(GR ) T (q)(GR )
Fig. 2. Query answering times (milliseconds) on LUBM (top) and DBLP (bottom),
through saturation.                                              G
pruning. The impact of pruning ranges from none (in particular for q , on LUBM
Q01 to Q03 and on 8 DBLP queries) to very significant (97.8% of the q G BGPs
are pruned on LUBM Q09). Details on our experiments, and code which can be
used to reproduce them, are online5 .
5.2 Experiment results: query answering times
For each query, we report the average of the last five (“hot”) runs out of six.
Through saturation Figure 2 shows the query answering times through sat-
uration, for LUBM (top) and DBLP (bottom), with a timeout of 10 minutes;


                                             28
                 Revisiting RDF storage layouts for efficient query answering     13

in all our graphs, executions that reached the timeout have been interrupted.
Below the graphs, we show the label used in the plot for each query answer-
ing strategy, e.g., T SAT stands for T (q)(GR ). For readability, some very fast
queries are repeated in a “zoom” plot (the LUBM one has a logarithmic y axis).
On LUBM (top), we notice some very high running times for VIRTUOSO SAT,
e.g., on Q03, Q06, and a time-out on Q14. Among the SQL-based strategies, the
naı̈ve translation on CP (green bars) is slowest in 10 out of 14 queries, with large
performance penalties for Q04, Q07, Q08, Q11-13. Instantiation (CP SAT INS,
red bars) is generally faster than naı̈ve CP. It strongly speeds up Q04, Q07,
Q08, Q10-Q14; it brings a modest improvement to Q01 and Q05, but also a
modest overhead for Q02, Q03, Q06. However, on the complex Q09, which has
the largest q G size, namely 3690, instantiating each of these more than doubles
the answering time w.r.t. naı̈ve CP translation (and ran until the timeout);
pruning (yellow bars) brings it back below the timeout. T SAT is generally
faster than all executions on the CP layout, because all queries contain triples
of the form (s, τ, x) and/or (s, x, o), which, as explained in Section 3.2, challenge
CP execution. TCP SAT avoids the (sometimes drastic) performance problems
of all CP variants, and is the fastest (by up to several orders of magnitude) on
all queries but Q14, where the CP variant with pruning is a bit faster. Virtuoso
is also always slower than TCP (by up to 95×, almost two orders of magnitude).
On DBLP (bottom), poor performance is exhibited by Virtuoso (Q03, Q07,
Q09, Q10), and on even more queries by the naı̈ve CP strategy (green bars, Q05-
Q07, Q09-Q10, Q14). T SAT performs very badly on Q09, Q10, Q14 and Q15.
These are rather large (6 to 9-triples) queries; an analysis of their plans shows
significant errors in the RDBMS’ estimation of join cardinality. As is well-known,
join cardinality estimation errors multiply along subsequent joins; when all joins
carry over a single, very large table, the negative impact of such cumulated
errors can be quite strong. Historically, this observation actually motivated the
introduction of the CP layout, on which join estimation errors still multiply, but
usually much smaller tables are involved. Indeed, as expected, for the queries
Q11 and Q15, exactly those where no triple has a variable in class and property
position, naı̈ve CP largely outperforms T SAT (by very far for Q15). Again, we
observe the robust behavior of TCP SAT. We conclude that through saturation,
T and CP execution each underperform on some queries, but TCP avoids all
these pitfalls and is consistently very efficient.
Through reformulation Figure 3 shows reformulation-based query answering
times (note the logarithmic y axis in the zoom), again with the correspondences
between the bar labels and the strategy names previously used in the paper.
On LUBM, among the evaluation strategies without pruning, TCP REF is
generally the fastest (or very close to it), with the exception of Q14, where
CP REF INS is 1.4× faster. This query with the most results (9.85M) and a
large reformulation (Table 4) has two atoms of the form (x, p, z), (y, p, z). On
CP, this leads to a large number of self-joins of the form tp ./o tp , executed
very fast because loading tp in memory once ensures the join runs completely
in-memory. While the rather unusual Q14 shows a case when CP may still out-


                                        29
14      M. Buron, F. Goasdoué , I. Manolescu, T. Merabti, and M.-L. Mugnier


               T REF      T REF PRUN CP REF INS CP REF INS PRUN
             T (QR )(G) T ((QR )pruned )(G) CP (QG            G pruned
                                                 R )(G) CP ((QR )      )(G)
             TCP REF       TCP REF PRUN VIRTUOSO REF VIRTUOSO REF INS
           T CP (QR )(G) T CP ((QR )pruned )(G) T (QR )(G) T (QG
                                                               R )(G)

Fig. 3. LUBM (top) and DBLP (bottom) query answering times (ms) through refor-
mulation.
perform TCP, the difference is not dramatic. On the three layouts, pruning gen-
erally helps: it saves, e.g., more than half of the CP answering time for Q01,
Q02. In the zoomed view (shortest-running queries), pruning brings an overhead
(takes more time that the query evaluation time it saves) of a few milliseconds.
Among the strategies with pruning, TCP REF PRUN is the fastest, except for
Q14 discussed above. As Virtuoso did not support reasoning with our rule set R
(details online5 ), we gave it reformulations expressed in SPARQL; for Q07 and
Q14, they failed to run, with the error “union nesting is too deep”. The impact
of instantiation for Virtuoso is unclear; it helped for Q04, Q08 but not for Q02,
Q06 etc. All missing Virtuoso bars in Figure 3 are execution failures, mostly due
to large unions.
On DBLP, VIRTUOSO REF failed for Q06, Q07, Q09, Q10, Q13; VIRTU-
OSO REF INS was consistently worse, and we omitted it from the plot. The
rest of the analysis is similar to the one above, except that T REF performs
very badly in a few cases (Q07, Q11). Overall, in Figure 3, TCP query answer-
ing with pruning is the fastest, or very close to it, on all queries, while all other
strategies’s weaknesses are exposed by one or more queries.
5.3 Experiment conclusion
We studied the performance of query answering in RDF databases through sat-
uration and reformulation, on challenging queries that remain poorly supported:


                                          30
                   Revisiting RDF storage layouts for efficient query answering   15

those with variables in class or property position. We have exhibited queries
that lead to poor to catastrophic performance of query answering on the T lay-
out (mainly due to many self-joins on a large table) and/or on the CP layout
(mainly due to large unions, brought by variables in class and property positions,
and/or by reformulation). Query answering on the TCP layout is extremely ro-
bust; it avoids all these pitfalls by taking the best of both T and CP, at the
expense of more storage space. As disks are getting ever cheaper7 , TCP appears
to be a robust, practical layout, compatible with well-established large-scale RDF
storage and query engines. For the challenging queries we study, summary-based
pruning helps improve performance, in particular for the TCP layout.
6       Related work and conclusion
Our work studies the impact of RDF graph storage on query answering in
the presence of RDFS ontologies, both through graph saturation (SatQA) and
query reformulation (RefQA). Prior works such as [4,7,19,20,21] only considered
SatQA. While [4] advocated the CP layout, [21] nuanced the analysis: in a row
store, they show that proper indexing (such as we used here) can significantly
improve performance using T, while many distinct properties may hurt CP per-
formance. It is not fully clear if [4,21] used the naı̈ve or instantiation-based CP
translation; as our experiments show, TCP outperforms both, in particular with
summary pruning. The optimized T layout of [24], indexed on all (s, p, o) permu-
tations, has been used for RefQA in [15,16,10]; in our experiments, TCP avoids
all its bad-performance scenarios. Storage was optimized based on a known work-
load by creating materialized views in [15]. Query reformulation has also been
used with the CP layout in [9,11]. Both [10] for T and [11] for CP explored how
to express a reformulated query as a join of several subqueries, so as to mini-
mize the evaluation cost through the RDBMS. Applying this technique to the
TCP layout could presumably also improve its performance. A simplified ver-
sion of TCP is briefly mentioned in [13], which studies generic SPARQL-to-SQL
translation functions, as an example of possible relational layout. However, [13]
does not consider the performance impact of this layout; nor do they consider
RDFS entailment. Optimized storage layouts [7,20,25] or indexes [22,5] have
been investigated to limit joins by storing e.g., the values of several properties
for a given subject together. They allow translating several BGPQ triples into a
single table (or index) access; they could also be applied on the TCP layout to
further improve it. Finally, in [23], the storage layout based on binary tables is
adapted to the graph topology, in order to speed up query evaluation.
TCP is a robust layout, which does not require query workload knowledge, and
allows to significantly reduce BGPQ answering times. On queries well supported
by the T, respectively, CP layout, TCP matches that performance; but most
importantly, it avoids all the performance (or plain unfeasibility) issues they en-
counter, in saturation- or reformulation-based query answering. Summary-based
pruning also importantly improves performance. We believe the TCP layout,
and pruning, can be adopted with little effort, and can strongly consolidate and
improve query answering performance in many RDF databases.
    7
        E.g., https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte/


                                          31
16      M. Buron, F. Goasdoué , I. Manolescu, T. Merabti, and M.-L. Mugnier

References
 1. RDF 1.1 Concepts, https://www.w3.org/TR/rdf11-concepts/
 2. RDF 1.1 Semantics, https://www.w3.org/TR/rdf11-mt/#rdfs-entailment
 3. SPARQL 1.1 Query Language, https://www.w3.org/TR/sparql11-query/
 4. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable Semantic Web
    Data Management Using Vertical Partitioning. PVLDB (2007)
 5. Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix ”bit” loaded: a scalable
    lightweight join query processor for RDF data. In: WWW (2010)
 6. Bischof, S., Krötzsch, M., Polleres, A., Rudolph, S.: Schema-agnostic query rewrit-
    ing in SPARQL 1.1. In: ISWC (2014)
 7. Bornea, M.A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P.,
    Udrea, O., Bhattacharjee, B.: Building an efficient RDF store over a relational
    database. In: SIGMOD (2013)
 8. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A generic architecture for
    storing and querying RDF and RDF schema. In: ISWC (2002)
 9. Buron, M., Goasdoué, F., Manolescu, I., Mugnier, M.: Reformulation-based query
    answering for RDF graphs with RDFS ontologies. In: ESWC (2019)
10. Bursztyn, D., Goasdoué, F., Manolescu, I.: Optimizing reformulation-based query
    answering in RDF. In: EDBT (2015)
11. Bursztyn, D., Goasdoué, F., Manolescu, I.: Teaching an RDBMS about ontological
    constraints. PVLDB 9(12) (2016)
12. Cebiric, S., Goasdoué, F., Kondylakis, H., Kotzinos, D., Manolescu, I., Troullinou,
    G., Zneika, M.: Summarizing Semantic Graphs: A Survey. VLDB Journal (2018)
13. Chebotko, A., Lu, S., Fotouhi, F.: Semantics preserving SPARQL-to-SQL transla-
    tion. Data Knowl. Eng. 68(10) (2009)
14. Goasdoué, F., Guzewicz, P., Manolescu, I.: RDF Graph Summarization for First-
    sight Structure Discovery. The VLDB Journal (Apr 2020)
15. Goasdoué, F., Karanasos, K., Leblay, J., Manolescu, I.: View selection in semantic
    web databases. PVLDB 5(2) (2011)
16. Goasdoué, F., Manolescu, I., Roatis, A.: Efficient query answering against dynamic
    RDF databases. In: EDBT (2013)
17. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base sys-
    tems. J. Web Sem. 3(2-3) (2005)
18. Kontchakov, R., Rezk, M., Rodriguez-Muro, M., Xiao, G., Zakharyaschev, M.:
    Answering SPARQL queries over databases under OWL 2 QL entailment regime.
    In: ISWC (2014)
19. Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF
    data. VLDB J. (2010)
20. Pham, M., Passing, L., Erling, O., Boncz, P.A.: Deriving an emergent relational
    schema from RDF data. In: WWW (2015)
21. Sidirourgos, L., Goncalves, R., Kersten, M., Nes, N., Manegold, S.: Column-store
    support for RDF data management: not all swans are white. PVLDB 1(2) (2008)
22. Udrea, O., Pugliese, A., Subrahmanian, V.S.: GRIN: A graph based RDF index.
    In: AAAI (2007)
23. Urbani, J., Jacobs, C.J.H.: Adaptive low-level storage of very large knowledge
    graphs. In: WWW (2020)
24. Weiss, C., Karras, P., Bernstein, A.: Hexastore: Sextuple indexing for Semantic
    Web data management. PVLDB (2008)
25. Wilkinson, K., Sayers, C., Kuno, H., Reynolds, D.: Efficient RDF storage and
    retrieval in Jena2. In: SWDB (2003)


                                          32