            Hierarchical Property Set Merging for SPARQL Query
                                Optimization
Marios Meimaris, Athena Research Center, m.meimaris@athenarc.gr
George Papastefanatos, Athena Research Center, gpapas@athenarc.gr
Panos Vassiliadis, University of Ioannina, pvassil@cs.uoi.gr

ABSTRACT
Characteristic sets (CS) organize RDF triples based on the set of properties associated with their subject nodes. This concept was recently used in indexing techniques, as it can capture the implicit schema of RDF data. While most CS-based approaches yield significant improvements in space and query performance, they fail to perform well when answering complex query workloads in the presence of schema heterogeneity, i.e., when the number of CSs becomes very large, resulting in a highly partitioned data organization. In this paper, we address this problem by introducing a novel technique for merging CSs based on their hierarchical structure. Our method employs a lattice to capture the hierarchical relationships between CSs, identifies dense CSs and merges dense CSs with their ancestors, thus reducing the number of CSs as well as the links between them. We implemented our algorithm on top of a relational backbone, where each merged CS is stored in a relational table, and we performed an extensive experimental study to evaluate the performance and impact of merging on the storage and querying of RDF datasets, indicating significant improvements.

1 INTRODUCTION
The Resource Description Framework1 (RDF) and SPARQL Protocol and RDF Query Language2 (SPARQL) are W3C recommendations for representing and querying data on the semantic web. The semantic web has been established as a source of diverse datasets from a multitude of domains, such as life sciences, statistics, finance, open science and health. Wide adoption has led to increasingly larger dataset sizes, and at the same time complex analytical queries have started to emerge, reflecting the ever increasing recognition of the value of analytical processing in the modern data-centric economy. In light of this, RDF data management methods are calling for improvements in the performance of RDF storage and querying engines, as has been discussed in recent works, where state-of-the-art RDF engines are found to be very efficient in simple workloads, but not efficient enough when it comes to more complex query workloads [17][16][9][12].

1 https://www.w3.org/RDF/
2 https://www.w3.org/TR/sparql11-overview/

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In response to this limitation, recent works have shown that extraction and exploitation of the implicit schema of the data can be beneficial in both storage and SPARQL query performance [9][12]. In order to organize on disk, index and query triples efficiently, these efforts heavily rely on two structural components of an RDF dataset, namely (i) the notion of characteristic sets (CS), i.e., different property sets that characterize subject nodes, and (ii) the join links between CSs. Formally, given an RDF dataset D and a subject node s, the characteristic set cs(s) of s is defined as follows [13]:

cs(s) = {p | ∃o : (s, p, o) ∈ D}

In plain words, a CS is the set of attributes (or RDF properties) of a given subject node. Abstracting from RDF triples to their CSs, an RDF graph can be represented on the structural level by a graph containing CSs as nodes, and links between CSs as edges, where a link between two CSs exists whenever a link between two subject nodes exists in the original RDF graph. Due to the ability to represent all of the properties of a subject node with a single set, rather than multiple triples, CSs have been thoroughly used as a means to optimize query planning, storage, indexing and query processing [5, 9, 12, 13, 16]. In their most general form, they are used as the basis for mapping RDF to a relational structure, where each CS forms a relational table. An illustration of this mapping can be seen in Figure 1. There are two entities, Alice and Claire, each represented in an RDF graph as a node. The properties of these two nodes are (a) their type, (b) the company for which they work and (c) their supervisor. The set of these three properties, {rdf:type, worksFor, supervises}, forms the characteristic set for these two nodes. The direct representation as a relation is depicted at the bottom of Fig. 1, with the characteristic set becoming the attributes of the relation.

Figure 1: A simple RDF graph consisting of a single characteristic set, c1, and its resulting table.
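As an illustration of this definition and of the Alice/Claire example of Figure 1, the following minimal Python sketch computes the characteristic set of every subject in a small set of triples and groups subjects by CS. The function names, literals and object values are ours and purely illustrative; they are not part of any system described in this paper.

    from collections import defaultdict

    def characteristic_sets(triples):
        # cs(s) = {p | exists o : (s, p, o) in D}: collect the predicates emitted by each subject
        cs_of = defaultdict(set)
        for s, p, o in triples:
            cs_of[s].add(p)
        # group subjects that share the same characteristic set
        groups = defaultdict(list)
        for s, props in cs_of.items():
            groups[frozenset(props)].append(s)
        return groups

    # hypothetical triples mirroring the Alice/Claire example (object values invented for illustration)
    triples = [
        ("Alice", "rdf:type", "Employee"), ("Alice", "worksFor", "CompanyA"), ("Alice", "supervises", "Bob"),
        ("Claire", "rdf:type", "Employee"), ("Claire", "worksFor", "CompanyB"), ("Claire", "supervises", "Dave"),
    ]
    for cs, subjects in characteristic_sets(triples).items():
        print(sorted(cs), subjects)  # one CS {rdf:type, supervises, worksFor} shared by Alice and Claire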
However, a mapping from RDF to the relational space is not always helpful, as the structural looseness that is allowed in the RDF world can translate to multiple, heterogeneous CSs that represent skewed distributions of triples. For example, instead of the homogeneity of the graph in Figure 1, where all of the nodes share the same structure, i.e., the same CS, consider the case of Figure 2(a) where nodes are described by four different CSs. In fact, it is frequent that there exist many different CSs within the same dataset, representing varying internal structure for the nodes of the source graph. This schema heterogeneity in loosely-structured datasets is indeed frequently found in the real world (e.g., Geonames contains 851 CSs and 12136 CS links),
imposing large overheads in the extraction, storage and disk-based retrieval [17][9]. For reference, some well established RDF datasets along with their associated numbers of CSs (first column) are shown in Table 1.

Table 1: RDF datasets along with their number of CSs and links between CSs.

Dataset        # Tables (CSs)   # of CS joins
Reactome             112              346
Geonames             851            12136
LUBM 2000             14               68
WatDiv 100          5667              802
Wordnet              779             7250
EFO                  520             2515
DBLP                  95              733

In these cases, we end up with a lot of CSs, each of which may actually represent very few nodes. There are two antagonizing approaches in creating a relational schema from a set of CSs. (i) Creating a relational table for each different CS would result in a large number of relational tables with few tuples in each of them, as shown in Figure 2(b), that require a very large number of joins to answer queries; (ii) on the other hand, creating a single, "universal" table, accommodating the CS's of all the nodes of the graph would create a very wide relation, as in Figure 2(c), which is practically overly empty (i.e., too many NULL values) and space inefficient, due to the fact that different tuples would have very different cells populated.
There are two factors that affect or constrain the design choices we have. First, one could wonder "why not split the universal relation in the traditional manner?". Unfortunately, the traditional decomposition via functional dependencies is neither available nor enforceable on the source data. It is very possible that the incoming data to be stored are not necessarily accompanied by metadata descriptions that prescribe functional dependencies. Even if a DBA were to design them, it is also possible that the data violate them. Thus, in the context of this paper, we proceed with an automated design that assumes that no functional dependencies are available and all schema dependencies are detected online from the incoming data. Second, the desired analysis and the expected workload we consider in this paper play a significant role in the determination of the appropriate structure. Most SPARQL queries involve complex graph patterns, i.e., query conditions, which must be evaluated on the different CSs [9–11]; applying such patterns to a universal relation will result in multiple self-joins, whereas applying patterns to a fine-grained relational schema would again impose multiple joins (and multiple unions of the intermediate results for merging the query output) across a high number of CS relational tables. Consider the evaluation of the following SPARQL query on the nodes of Figure 2(a):

SELECT ?x ?y ?z ?w
WHERE { ?x worksFor ?y .
        ?x supervises ?z .
        ?z hasBirthday '2011-02-24' .
        ?z isMarriedTo ?w .
        ?w hasNationality 'GR' }

Assuming a design with a "universal" relation (Figure 2c), the query evaluation over the relational schema (see Section 4 for the details on the SPARQL-to-SQL query evaluation) would require one self-join for each one of the worksFor, supervises and isMarriedTo query conditions (hasBirthday and hasNationality conditions are executed as select operations in SQL). At the same time, there is a price of NULL values, too, as Figure 2(c) shows. At the other end of the spectrum, in the multiple CS tables case (Figure 2b), each one of the three query conditions would require one self-join and additionally three joins between each CS table and all other CS tables in the database – i.e., 4 joins per table. Thus, despite its space efficiency, the latter case imposes performance overheads due to the large number of joins the query must perform to fetch data from multiple tables. This number can become significantly large in real-world datasets, as shown in the second column of Table 1.
In this paper, we tackle the problem of mapping heterogeneous RDF datasets to a relational schema with the aim to facilitate the processing of complex analytical SPARQL queries, by automating the decision of which tables will be created in order to host the incoming data, such that there are neither overly empty tables nor extremely large numbers of joins.
Interestingly, the problem is both open and hard to solve. The current approaches to address the problem do not avoid the problems that we have mentioned before and include approaches to store data generically as triples, as property tables (practically one table per CS) or as attribute-value pairs in vertical partitions (see Section 2 for a detailed discussion). All these solutions are found in the extremes of the dilemma between empty space and query performance without achieving a "sweet" compromise between the two forces. Therefore, the state of practice and the state of the art provide room for improvement, and in fact, to the best of our knowledge, this is the first effort to address the problem. At the same time, the problem is hard to solve: as we show in Section 4, the complexity is exponential, and therefore, brute force methods are not adequate.
In our approach, we introduce an algorithm to automate the design process. We start by considering that each CS is a single table and, by exploiting their hierarchical relationships, we merge related CSs into the same table. Moreover, we define a density-based cost function that helps us stop the merging process for CSs that contain a large number of triples. In this way, we achieve merging of CSs based on their structural similarity as well as the number of triples they contain. We follow a relational implementation approach by storing all triples corresponding to a set of merged CSs into a separate relational table and by executing queries through a SPARQL-to-SQL transformation. Although alternative storage technologies can be considered (key-value, graph stores, etc.), we have selected well-established technologies and database systems for the implementation of our approach, in order to take advantage of existing relational data indexing and query processing techniques that have been proven to scale efficiently in large datasets and complex workloads. To this end, we present a novel system, named raxonDB, that exploits these hierarchies in order to merge together hierarchically related CSs and decrease the number of CSs and the links between them, resulting in a more compact schema with better data distribution. raxonDB, built on top of PostgreSQL, provides significant performance improvements in both storage and query performance of RDF data with respect to the state of the art techniques presented in Section 2. In short, our contributions are as follows:
    • We introduce a novel CS merging algorithm that takes advantage of CS hierarchies,
    • we implement raxonDB, an RDF engine built on top of a relational backbone that takes advantage of this merging for both storing and query processing,
    • we perform an experimental evaluation that indicates significant performance improvements for various parameter configurations.
Roadmap. In Section 2, we present the background and related work for this paper. In Section 3, we provide preliminary definitions and delineate the problem, and in Section 4 we present algorithms towards its solution. In Section 5, we discuss the experimental evaluation of the paper. We conclude the paper in Section 6, with a summary of our findings and future work.

(a) Four subject nodes, s1, s2, s3, s4 with their properties and their CSs (left) and the resulting CS hierarchy graph (right). (b) Edge case where each CS becomes a relational table. No NULL values exist in any of the tables. (c) Edge case where all CSs become one universal table. NULL values exist in this table. (d) Merging c1, c3, c4 together and leaving c2 unmerged.

Figure 2: Example of four CSs coming from different source datasets, and their respective CS hierarchy graph. Examples of the two edge cases (all tables vs one table), as well as the merging case can be seen. In the figure, the CSs can be seen as derived from data instances (nodes s1, s2, s3, s4). It can generally be assumed that there are more instances belonging to these CSs, not shown in the figure.

2 RELATED WORK
Due to the tabular structure that tends to implicitly underlie RDF data, recent works for RDF data management systems have been implemented on relational backbones. They generally follow three storage schemes, namely (a) triples tables, (b) property tables, and (c) vertical partitioning. A triples table has three columns, representing the subject, predicate and object (SPO) of an RDF triple. This technique replicates data in different orderings in order to facilitate sort-merge joins. RDF-3X [14] and Hexastore [21] build tables on all six permutations of SPO. Built on a relational backbone, Virtuoso [4] uses a 4-column table for quads, and a combination of full and partial indexes. These methods work well for queries with small numbers of joins; however, they degrade with increasing sizes, unbound variables and joins.
The property tables approach places data in tables with columns corresponding to properties of the dataset, where each table identifies a specific resource type. Each row identifies a subject node and holds the value of each property. This technique has been implemented experimentally in Jena [22] and DB2RDF [3], and shows promising results when resource types and their properties are well-defined. However, this causes extra space overhead for null values in cases of sparse properties [1]. Also, it raises performance issues when handling complex queries with many joins, as the amounts of intermediate results increase [8].
Vertical partitioning segments data in two-column tables. Each table corresponds to a property, and each row to a subject node [1]. This provides great performance for queries with bound objects, but suffers when table sizes vary greatly [20]. TripleBit [23] broadly falls under vertical partitioning. In TripleBit, the data is vertically partitioned in chunks per predicate. While this reduces replication, it suffers from similar problems as property tables. It does not consider the inherent schema of the triples in order to speed up the evaluation of complex query patterns.
In distributed settings, a growing body of literature exists, with systems such as Sempala [18], H2RDF [15] and S2RDF [19]. However, these systems are based on the parallelization of centralized indexing and query evaluation schemes.
Due to the high heterogeneity in the schema during the integration and analysis of multiple RDF datasets, latest state of the art approaches rely on implicit schema detection in order to index/store triples based on their schema. In our previous work [9], we defined Extended Characteristic Sets (ECSs) as typed links between CSs, and we showed how ECSs can be used to index triples and greatly improve query performance. In [17], the authors identify and merge CSs, similar to our approach, into what they call an emergent schema. However, their main focus is to extract a human-readable schema with appropriate relation labelling, and they do not use hierarchical information of CSs; rather, they use semantics to drive the merging process. In [16] it is shown how this emergent schema approach can assist query performance; however, the approach is limited by the constraints of human-readable schema discovery. In our work, query performance, indexing and storage optimization are the main aims of the merging process, and thus we are not concerned about providing human-readable schema information or any form of schema exploration. In [12], the authors use CSs and ECSs in order to assist cost estimation for federated queries, while in [5], the authors use CSs in order to provide better triple reordering plans. To the best of our knowledge, this is the first work to exploit hierarchical CS relations in order to merge CSs and improve query performance.

3 PRELIMINARIES
RDF triples consist of a subject s, a predicate p and an object o. An RDF dataset is represented as a directed labelled graph where subjects and objects are nodes, and predicates are labelled edges. Formally, let I, B, L be infinite, pairwise disjoint sets of URIs, blank nodes and literals, respectively. Then, (s, p, o) ∈ (I ∪ B) × (I) × (I ∪ B ∪ L) is a triple. RDF does not enforce structural rules in the representation of triples; within the same dataset there can be largely diverse sets of predicates emitting from nodes of the same type [9, 13, 17]. Characteristic Sets (CS) [13] capture this diversity by representing implied node types based on the set of properties they emit. Formally, given a dataset D and a subject node s, the characteristic set cs(s) of s is cs(s) = {p | ∃o : (s, p, o) ∈ D}, and the set of all CSs is denoted with C. In what follows, we present basic definitions for CSs and their more generalized form of Property Sets. Intuitively, a property set is a set of RDF predicates p1 . . . pn. Therefore, a CS is also a property set. As we are interested in creating property sets by merging CSs, we use this more general form (i.e., property set instead of CS) in order to denote sets of attributes that are not necessarily CSs in the original dataset, but are used after merging several CSs as the basis for a relational table.
Although each real world measurement can have its very own characteristic set, for all practical purposes, the information stored in RDF triples typically refers to recurring concepts and their features. Thus, some co-occurrences of features are common, e.g., in triples coming from the same source and representing "records" pertaining to the same "concept" (e.g., triples for events, pertaining to the concept Event, come with the same set of features, like datetime and description). All event instances from the same source are likely (albeit not obliged) to carry overlapping subsets of these features. At the same time, despite the commonalities, there are differences too, typically produced by the heterogeneity of data sources and the lack of structural restrictions in RDF data. For example, an RDF dataset containing information about people is likely to associate a name property with each instance of a Person class, while an isFatherOf property would not be associated with all instances, for obvious reasons. Thus, there is a possibility to define subset relationships between the features of similar concepts. To this end, we introduce the notion of subsumption and the notion of hierarchy of subsumed property sets.
Definition 1. (Property Sets and Property Tables). In its more general form, a CS is a property set Pi, i.e., a set of RDF predicates, and the set of all property sets is denoted with P. A Property Table (also: CS Table) Ti for a given property set Pi is a relational table comprising an identifier sid for the subject node s and |Pi| columns, where Pi = {pi,1, pi,2, . . . , pi,n} are the predicates emitting from s. Ti contains the set ri of tuples that instantiate the values of the properties for sid, i.e., Ti = (sid ∪ Pi, ri). A tuple can hold values for the predicate columns in I ∪ B ∪ L ∪ NULL, i.e., each cell in the relational table can either have a standard value of an RDF triple object, or NULL when the subject node identified in sid does not have a property in any of its triples. In Figure 2(a), four subject nodes, s1, s2, s3, s4, are shown. These have four different CSs based on their property sets, namely c1, c2, c3, c4.
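To make Definition 1 concrete, the short sketch below derives a simplified relational DDL statement for a property table Ti = (sid ∪ Pi, ri). The table name, column types and quoting are illustrative assumptions and do not reproduce raxonDB's actual schema generation.

    def create_table_sql(table_name, property_set):
        # illustrative only: one column per predicate in Pi, plus the subject identifier sid
        cols = ", ".join('"{}" text'.format(p) for p in sorted(property_set))
        return "CREATE TABLE {} (sid bigint PRIMARY KEY, {});".format(table_name, cols)

    print(create_table_sql("cs_1", {"rdf:type", "worksFor", "supervises"}))
    # CREATE TABLE cs_1 (sid bigint PRIMARY KEY, "rdf:type" text, "supervises" text, "worksFor" text);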
Figure 3: (a) A CS hierarchy graph with dense nodes colored in deep purple, (b) the connected components derived by cutting off descendants from dense nodes, (c) a connected component with dashed lines representing inferred hierarchical relationships, (d) all possible assignments of dense nodes to non-dense nodes.

Definition 2. (CS Subsumption). Given two CSs, ci and cj, then ci subsumes cj, or ci ≻ cj, when ci is a subset of cj, or ci ⊂ cj. This subsumption forms a parent-child relationship. For example, consider c1, c2 that describe human beings, with c1 = {type, name} and c2 = {type, name, marriedTo}. It can be seen that c1 ⊂ c2 and therefore c1 subsumes c2. CS subsumption relationships can be seen in Figure 3(a) as directed edges between nodes. In the example of Figure 2(a), the four CSs exhibit strong subset-superset relationships between their properties. For instance, c1 and c2 have property sets that are supersets of both c3 and c4, while c3 also subsumes c4. The set of all parent-child relationships defines a CS hierarchy.
Definition 3. (CS Hierarchy and Inferred Hierarchy). CS subsumption creates a partial hierarchical ordering such that when ci ≻ cj, then ci is a parent of cj. Formally, a CS hierarchy is a graph lattice H = (V, E) where V ⊆ C and E ⊆ (V × V). A directed edge between two CS nodes c1, c2 exists in H when c1 ≻ c2 and there exists no other ci such that c1 ≻ ci ≻ c2. The directed edge stems from the parent node and arrives at the child node. An example CS hierarchy can be seen in Figure 3(a). Given a hierarchy H, we denote the hierarchical closure of H with Hc, so that Hc extends H to contain inferred edges between hierarchically related nodes that are not consecutive, e.g. a node and its grandchildren. Intuitively, for every set of features c that describes some concept in the dataset, we introduce a node in our graph. Every strict superset of features of c is a descendant of c, and every strict subset of features of c is an ancestor of c. An example inferred hierarchy can be seen on the right of Figure 2(a), with the inferred relationships in dashed lines, as well as in Figure 3(c) for a sub-graph of the graph in Figure 3(a). In the remainder of this paper, we refer to Hc as the inferred hierarchy of H. The lattice resembles the traditional OLAP cube lattice (see [7]), although in our case, the construction is ad-hoc, depending on the available CS's, and serves a different purpose.
Definition 4. (CS Ancestral Sub-graphs). Given an inferred hierarchy Hc = (V, E), a CS cbase, a set of CSs c1, . . . , ck, and a sub-graph H^anc_cbase = (V′, E′) with V′ ⊆ V and E′ ⊆ E, we say that H^anc_cbase is an ancestral sub-graph over cbase when ∀i ∈ [1..k] it holds that ci ≻ cbase and (ci, cbase) ∈ E′. Intuitively, any set of ancestors of a node cbase forms an ancestral sub-graph. More than one ancestral sub-graph can be defined over cbase, as any subset of its parents is considered an ancestral sub-graph over cbase. For instance, in Figure 3(c), nodes c7, c4, c2 form an ancestral sub-graph over c7. Similarly, nodes c6, c4, c2 and c6, c5, c2 form ancestral sub-graphs over c6.
Having defined the hierarchy of characteristic sets and the respective graph, we are now ready to provide the foundation for the core of the problem of this paper. Basically, the goal is to find a way to store data in a way that balances two antagonizing goals: (a) the number of NULL values and the unnecessary space increase, vs. (b) the resulting decrease in query processing time that would result from the fragmentation of stored data in multiple nodes and the need to join them. To address this issue, we can attempt to merge different nodes into one, paying the price of empty space to avoid the price of joins. There exist two edge cases here, namely (i) assign a table to each CS in the incoming data, resulting in as many tables as CSs, and (ii) assign one universal table for all CSs. This table would contain all of the properties found in the input dataset. The two edge cases for the running example of Fig. 2 can be seen in Fig. 2(b,c).
Definition 5. (Hierarchical CS Merge). Given an ancestral sub-graph a = (V′, E′), where V′ = {c1, c2, . . . , ck} as defined above, and the set of property tables T(V′), then hier_merge(a, T(V′)) is a hierarchical merge of a that results in a single table Ta = (cbase, ra). As cbase is the most specialized CS in a, the columns of Ta are exactly the properties in cbase, while ra = ⋃i=1..k r′i is the union of the records of all CSs in V′, where r′i is the projection of ri on the properties in cbase. Consequently, r′i contains NULL values for all the non-shared properties of cbase and ci. In essence, this is an edge contraction operator that merges all tables of the nodes of an ancestral sub-graph into one and keeps the signature (i.e., the properties) of the base CS cbase. For instance, assume that V′ = {c0 = (P0, r0), c1 = (P1, r1), c2 = (P2, r2)} is the set of vertices of an ancestral sub-graph with three CSs, with P0 = {pa, pb}, P1 = {pa, pb, pc} and P2 = {pa, pb, pc, pd}. Thus, c0 ≻ c1 ≻ c2. The output of the merging process for our running example can be seen in Figure 2(d). Hierarchical merging can be seen in Figure 4.
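The following minimal Python sketch mimics the hier_merge operator of Definition 5 on the c0 ≻ c1 ≻ c2 example above, representing each table as a list of dictionaries and padding non-shared properties with None (NULL). It is illustrative only; it ignores subject identifiers and physical storage, and the names are ours.

    def hier_merge(base_props, tables):
        # contract all tables of the ancestral sub-graph into one table with the base CS's columns;
        # records of ancestor CSs get None (NULL) for the properties they do not share with the base
        merged = []
        for records in tables.values():
            for rec in records:
                merged.append({p: rec.get(p) for p in base_props})
        return merged

    tables = {
        "c0": [{"pa": 1, "pb": 2}],                        # P0 = {pa, pb}
        "c1": [{"pa": 3, "pb": 4, "pc": 5}],               # P1 = {pa, pb, pc}
        "c2": [{"pa": 6, "pb": 7, "pc": 8, "pd": 9}],      # P2 = {pa, pb, pc, pd}, the base CS
    }
    for row in hier_merge(["pa", "pb", "pc", "pd"], tables):
        print(row)   # rows of c0 and c1 are padded with None for the properties they lack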
Definition 6. (Merge Graph). Given an inferred CS hierarchy Hc = (V, E), a merge graph is a graph H′ = (V′, E′) that consists of a set of n ancestral sub-graphs, and has the following properties: (i) H′ contains all nodes in H such that V ≡ V′, i.e., it covers all CSs in the input dataset, (ii) H′ contains a subset of the edges in H such that E′ ⊂ E, (iii) each node is contained in exactly one ancestral sub-graph ai, (iv) all ancestral sub-graphs are pair-wise disconnected, i.e., there exist no edges between the nodes of different ancestral sub-graphs. Thus, each ancestral sub-graph can be contracted into one node unambiguously, using the hier_merge operator. Also, the total number of relational tables will be equal to the number of ancestral sub-graphs in the merge graph.
The primary focus of this work is to improve the efficiency of storage and querying in relational RDF engines by exploiting the implicit schema of the data in the form of CSs. The two extreme approaches – i.e., (a) a separate table for each CS, on the one end of the spectrum, or (b) merging all the data into a universal table with a schema expressed as the union of the individual features of the different CS's, at the other end of the spectrum – have the following drawbacks, respectively: (i) multiple joins of separate tables at query processing, if many tables are used, and (ii) bad utilization of disk space if nodes are excessively merged into a very small number of tables (due to excessive empty space with NULLs).
Problem Formulation. Given an inferred CS hierarchy Hc = (V, E), the problem is to find a merge graph H′ = (V′, E′) in the form of a set of disconnected ancestral sub-graphs, that provides a good way to merge CS nodes. In other words, the problem is to find the best set of ancestral sub-graphs from an inferred hierarchy Hc that minimizes an objective cost function cost(x).

4 HIERARCHICAL CS MERGING
What makes the problem hard is the complexity of finding a sweet spot in the Pareto equilibrium between conciseness of the relational schema (few tables) and internal density (with respect to the empty space they contain).
Schema conciseness. To start with the arguments in favor of reducing the number of nodes via merging, we can see that, by reducing the number of CSs, the number of joins between them is also reduced. Furthermore, merging together CSs leads to a less skewed distribution of data to relational tables. Ultimately, this results in a drastically decreased disk-based I/O cost, as fewer tables are fetched, positively affecting query processing as well.
Density. On the contrary, merging tables results in the introduction of NULL values for the non-shared columns, which can degrade performance. Specifically, merging CSs with different attribute sets can result in large numbers of NULL values in the resulting table. Given a parent CS table T1 = (sid ∪ c1, r1) and a child CS table T2 = (sid ∪ c2, r2) with |c1| < |c2| and |r1| >> |r2|, the resulting |c2 \ c1| × |r1| NULL cells will be significantly large compared to the total number of |r1| + |r2| records, thus potentially causing poor storage and querying performance [17]. For this reason, CS merging must be performed in a way that will minimize the presence of NULL values. The following function captures the NULL-value effect of the merge of two CS tables Ti = (sid ∪ ci, ri), Tj = (sid ∪ cj, rj) with ci ≻ cj:

r_null(Ti, Tj) = (|cj \ ci| × |ri|) / (|rj| + |ri|)    (1)

r_null represents the ratio of null values to the cardinality of the merged table. The numerator of the fraction represents the total number of cell values that will be null, as the product of the number of non-shared properties and the cardinality of the parent CS. The denominator represents the resulting cardinality of the table.
In order to assess an ancestral sub-graph, we use a generalized version of r_null that captures the NULL value effect on the whole sub-graph:

r_null(g)|Td = (Σi=1..|g| |cd \ ci| × |ri|) / (|rd| + Σi=1..|g| |ri|)    (2)

Here, Td = (cd, rd) is the root of sub-graph g. However, merging a parent to a child changes the structure of the input graph, as the cardinality of the merged child is increased. Thus, we define a cost function that works on the graph level, as follows:

cost(g) = Σi=1..n r_null(gi)|cdi    (3)

where n is the number of dense nodes, cdi is a dense node and gi is the ancestral sub-graph with cdi as the base node.
Thus, choosing dense CSs as bases is a seeding process that aims to minimize this NULL value effect by making sure that a large fraction of the input records will not be extended with NULL values. This is true because a CS base and its resulting merged table will have exactly the same properties (columns) and thus introduction of NULL values will not be needed for the records of the CS base.
The problem of selecting ancestral sub-graphs for the merge is computationally hard, as mentioned earlier. For this reason, we rely on heuristics in order to seed the process and provide an initial set of ancestral sub-graph bases for the final merged tables. The CS bases will be the only relational tables in the output, with the remaining tables merged into them. For this, we identify dense CS nodes in the hierarchy (i.e., with large cardinalities) and use these nodes as the bases of the ancestral sub-graphs. While node density can be defined in many different ways, in the context of this work we define a CS ci to be dense if the cardinality of its relational table is larger than a linear function of the maximum cardinality of CSs in D, i.e., a function d : N → R, with d(Ti) = m × |rmax|. Here, m ∈ [0, 1] is called the density factor, and rmax is the cardinality of the largest CS table in D. By definition, if m = 0, no CSs will be merged (i.e., each CS will be stored in its own table), while if m = 1, no tables will be created, as no CS has a cardinality larger than that of the largest CS. With a given m, the problem is reduced to finding the optimal ancestral sub-graph for each given dense node.
Given this cost model and a predefined set of dense nodes, our algorithm will find the optimal sub-graph for each dense node. An inferred hierarchy graph can be converted to a set of connected components that are derived by removing the outgoing edges from dense nodes, since we are not interested in merging children to parents, but only parents to children. An example of this can be seen in Figure 3(b). For each component, we can compute cost(g) as the sum of the costs of these components. The main idea is to identify all connected components in the CS graph, iterate through them, enumerate all sub-graphs within the components that start from the given set of dense nodes, and select the optimal partitioning for each component.
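A compact sketch of this cost model and of the exhaustive enumeration used by optimalMerge is given below. It assumes that each CS is represented simply by its property set and cardinality; the helper names are ours, not the system's.

    from itertools import product

    def r_null_subgraph(dense, ancestors):
        # Equation (2): NULL ratio of merging the ancestor CSs into the dense base CS;
        # each CS is a pair (property_set, cardinality)
        d_props, d_card = dense
        nulls = sum(len(d_props - props) * card for props, card in ancestors)
        total = d_card + sum(card for _, card in ancestors)
        return nulls / total if total else 0.0

    def enumerate_assignments(candidates):
        # candidates: non-dense CS id -> list of dense CS ids it may be merged into;
        # yields every complete assignment (one dense CS per non-dense CS), as optimalMerge enumerates
        keys = list(candidates)
        for choice in product(*(candidates[k] for k in keys)):
            yield dict(zip(keys, choice))

For the component of Figure 3(c,d), enumerate_assignments would produce the e(c2) × e(c4) × e(c5) = 6 assignments listed in the figure; the cost of each assignment is then obtained by grouping the non-dense nodes per chosen dense node and summing r_null_subgraph over the resulting sub-graphs, following Equation (3).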
Figure 4: Merging the tables of c0, c1 and c2.

Algorithm 1: optimalMerge
Data: An inferred hierarchy lattice Lc as an adjacency list, and a set of dense CSs D
Result: A set of optimal ancestral sub-graphs
1  init finalList;
2  connectedComponents ← findConnectedComponents(Lc);
3  for each connectedComponent do
4      init min ← MAX_VALUE;
5      init bestSubgraph;
6      while next ← connectedComponent.generateNextSubgraph() do
7          if cost(next) < min then
8              min ← cost(next);
9              bestSubgraph ← next;
10     end
11     finalList.add(bestSubgraph);
12 end
13 return finalList;

The algorithm can be seen in Algorithm 1. The algorithm works by first identifying all connected components of the inferred hierarchy (Line 2). Identifying connected components is trivially done using standard DFS traversal, and is not shown in the Algorithm. Then, we iterate each component (Line 3), and for each component, we generate all possible sub-graphs. Then, we calculate the cost of each sub-graph (Line 7) and if it is smaller than the current minimum, the minimum cost and best sub-graph are updated (Lines 8-9). Finally, we add the best sub-graph to the final list (Line 11) and move to the next component.
To generate the sub-graphs, we do not need to do an exhaustive generation of 2^n combinations, but we can rely on the observation that each non-dense node must be merged to exactly one dense node. Therefore, sub-graph generation is reduced to finding all possible assignments of dense nodes to the non-dense nodes. An example of this can be seen in Figure 3. In the figure, nodes c2, c4, c5 are non-dense, while nodes c6, c7, c8 are dense. All possible and meaningful sub-graphs are enumerated in the table at the right of the figure, where we assign a dense node to each of the non-dense nodes. An assignment is only possible if there exists a parent-child relationship between a non-dense node and a dense node, even if it is an inferred one (e.g. c2 is an inferred parent of c7). Hence, the problem of sub-graph generation becomes one of generating combinations from different lists, by selecting one element from each list. The number of lists is equal to the number of non-dense nodes, and the elements of each list are the dense nodes that are related to the non-dense node.
Complexity Analysis. Assuming that a connected component g has k non-dense nodes and d dense nodes, and each non-dense node ki is related to e(ki) dense nodes, then the number of sub-graphs that need to be enumerated is Πi=1..k e(ki). In the example of Figure 3, the total number of sub-graphs is e(c2) × e(c4) × e(c5) = 3 × 2 × 1 = 6. In the worst case, all k nodes are parents of all d nodes. Then, the number of total sub-graphs is d^k, which makes the asymptotic time complexity of the algorithm O(d^k).

4.1 Greedy Approximation
For very small d, k (e.g. d, k < 4), the asymptotic complexity of O(d^k) is acceptable. However, in real-world cases, the number of connected components can be small, making d and k large. For this reason, we introduce a heuristic algorithm for approximating the problem, that does not require enumerating all possible combinations, relying instead on a greedy objective function that attempts to find the local minimum with respect to the defined cost model for each non-dense node.
The main idea behind the algorithm is to iterate the non-dense nodes, and for each non-dense node, calculate r_null and find the dense node that minimizes this function for the given non-dense node. Then, the cardinalities will be recomputed and the next non-dense node will be examined. The algorithm can be seen in Algorithm 2. In the beginning, the algorithm initiates a hash table, mergeMap, with an empty list for each dense node (Lines 1-4). Then, the algorithm iterates all non-dense nodes (Line 5), and for each non-dense node, it calculates the cost r_null of merging it to each of its connected dense nodes (Lines 5-13), keeping the current minimum cost and dense node. In the end, the current non-dense node is added to the list of the dense node that minimizes r_null (Line 14). Notice that we do not need to split the hierarchy into connected components in order for greedyMerge to work.
Complexity Analysis. Given k non-dense nodes and d dense nodes, where each non-dense node ki has e(ki) dense children, the greedyMerge algorithm needs Σi=1..k e(ki) iterations, because it requires iteration of all e(ki) nodes for each ki. In the worst case, every ki is related to all d dense nodes, requiring kd iterations. Assuming a constant cost for the computation of r_null, the asymptotic complexity of the greedy algorithm is O(kd), which is a significant improvement when compared to the exponential complexity of optimalMerge.

Algorithm 2: greedyMerge
Data: A hash table p mapping non-dense CSs to their dense descendants, a set of dense CSs D, and a set of non-dense CSs K
Result: A hash table mapping dense CSs to sets of non-dense CSs to be merged
1  init mergeMap;
2  for each d ∈ D do
3      mergeMap.put(d, new List());
4  end
5  for each k ∈ K do
6      min ← MAX_VALUE;
7      init bestDense;
8      for each dk ∈ p.get(k) do
9          cost ← r_null(k, dk);
10         if cost < min then
11             min ← cost;
12             bestDense ← dk;
13     end
14     mergeMap.get(bestDense).add(k);
15 end
16 return mergeMap;
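The sketch below is a minimal Python rendering of Algorithm 2, under the same simplified representation of a CS as a (property set, cardinality) pair used earlier; the pairwise cost follows Equation (1), and, as a simplification, cardinalities are not recomputed between iterations. The function and parameter names are ours.

    def r_null_pair(non_dense_cs, dense_cs):
        # Equation (1): NULL ratio of merging a non-dense parent CS into a dense descendant CS
        (nd_props, nd_card), (d_props, d_card) = non_dense_cs, dense_cs
        return (len(d_props - nd_props) * nd_card) / (d_card + nd_card)

    def greedy_merge(dense, non_dense, dense_descendants):
        # dense / non_dense: CS id -> (property set, cardinality);
        # dense_descendants: non-dense CS id -> ids of its dense descendants (the hash table p)
        merge_map = {d: [] for d in dense}
        for k, cs in non_dense.items():
            candidates = dense_descendants.get(k, [])
            if not candidates:
                continue                      # CSs with no dense descendant end up in T_rest
            best = min(candidates, key=lambda d: r_null_pair(cs, dense[d]))
            merge_map[best].append(k)
        return merge_map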
This process does not necessarily cover all CSs of the input dataset. For example, some CS nodes might not have any dense children. Given this, the percentage of the dataset that is covered by this process is called dense CS coverage. The remainder of the CSs are aggregated into one large table, Trest, containing all of their predicates. If the total coverage of the merging process is large, then Trest does not impose a heavy overhead in query performance, as will be shown in the experiments. Finally, we load the data in the corresponding tables.

Figure 5: An example of greedy merging. Dense nodes are coloured in deep purple. At each step, the non-dense node under examination is coloured with green, while the edge that minimizes r_null can be seen in bold.

4.2 Implementation
We implemented raxonDB as a storage and querying engine that supports hierarchical CS merging, and can be deployed on top of standard RDBMS's. Specifically, we used PostgreSQL 9.6, but raxonDB can be adapted for other relational databases as well.
CS Retrieval and Merging. The processes of retrieving and merging CSs take place during the loading stage of an incoming RDF dataset. CS retrieval is a trivial procedure that requires scanning the whole dataset and storing the unique sets of properties that are emitted from the subject nodes in the incoming triples, and is adopted from our previous work in [9], where it is described in detail. After retrieving the CSs, the main idea is to compute the inferred CS hierarchy and apply one of the described merging algorithms. Finally, each set of merged CSs is stored in a relational table. In each table, the first column represents the subject identifier, while the rest of the columns represent the union of the property sets of the merged CSs. For multi-valued properties, we use PostgreSQL's array data type in order to avoid duplication of the rows.
Indexing. We deploy several indexes in raxonDB. First off, we index the subject id for each row. We also build foreign-key indexes on object-subject links between rows in different CSs, i.e., when a value of a property in one CS is the subject id of another CS. Next, we use standard B+trees for indexing single-valued property columns, while we use PostgreSQL's GIN indexes, which apply to array datatypes, for indexing multi-valued properties. This enables fast access to CS chain queries, i.e., queries that apply successive joins for object-subject relationships. Furthermore, we store these links on the schema level as well, i.e., we keep an index of CS pairs that are linked with at least one object-subject pair of records. These links are called Extended Characteristic Sets (ECSs) and are based on our previous work in [9]. With the ECS index, we can quickly filter out CSs that are guaranteed not to be related, i.e., no joins exist between them, even if they are individually matched in a chain of query CSs. Other metadata and indexes include the property sets of CSs, and which properties can contain multiple values in the same CS.
Query Processing. Processing SPARQL queries on top of merged CSs entails (i) parsing the queries, (ii) retrieving the query CSs, (iii) identifying the joins between them, and (iv) mapping them to merged tables in the database. Steps (i)-(iii) are inherited from our previous work in [9]. For (iv), a query CS can match with more than one table in the database. For instance, consider a query containing a chain of three CSs, q1 ▷◁ q2 ▷◁ q3, joined sequentially with object-subject joins. Each query CS qi matches with all tables whose property sets are supersets of the property set of qi. Thus, each join in the initial query creates a set of permutations of table joins that need to be evaluated. For instance, assume that q1 matches with c1, c2, while q2 matches with c3 and q3 matches with c4, c5. Assume also that by looking up the ECS index, we derived that the links [c1, c3], [c2, c3], [c3, c4] and [c3, c5] exist in the index, i.e., they correspond to candidate joins in the data. Then, [c1, c3, c4], [c1, c3, c5], [c2, c3, c4] and [c2, c3, c5] are all valid table permutations that must be processed. Two main strategies can be employed here. The first is to join the UNIONs of the matching tables for each qi, and the other is to process each permutation of tables separately and append the results. Given the filtering performed by the ECS indexing approach, where we can pre-filter CSs based on the relationships between them, the UNION would impose significant overhead and eliminate the advantage of ECS indexing. Therefore, we have implemented the second approach, that is, process a separate query for each permutation. Finally, due to the existence of NULL values in the merged tables, we must add explicit IS NOT NULL restrictions for all the properties that are contained in each matched CS and are not part of any other restriction or filter in the original query.
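As an illustration of the per-permutation strategy, the sketch below assembles the rough shape of a SQL statement for one candidate permutation of merged tables, including an IS NOT NULL restriction of the kind discussed above. The table and property names follow the running example; which properties actually receive IS NOT NULL depends on the query's other restrictions, and the generated SQL is a simplification of, not a reproduction of, raxonDB's actual SPARQL-to-SQL output.

    def permutation_sql(tables, join_props, required_props):
        # illustrative only
        # tables: ordered table names for one permutation, e.g. [c1, c3, c4];
        # join_props[i]: property of tables[i] whose value is the subject id of tables[i+1];
        # required_props[t]: matched properties of table t that need an explicit IS NOT NULL
        sql = "SELECT * FROM {} t0".format(tables[0])
        for i in range(1, len(tables)):
            sql += ' JOIN {} t{} ON t{}."{}" = t{}.sid'.format(tables[i], i, i - 1, join_props[i - 1], i)
        conds = ['t{}."{}" IS NOT NULL'.format(i, p)
                 for i, t in enumerate(tables) for p in required_props.get(t, [])]
        return sql + (" WHERE " + " AND ".join(conds) if conds else "")

    print(permutation_sql(["c1", "c3", "c4"], ["supervises", "isMarriedTo"], {"c1": ["worksFor"]}))
    # SELECT * FROM c1 t0 JOIN c3 t1 ON t0."supervises" = t1.sid
    #   JOIN c4 t2 ON t1."isMarriedTo" = t2.sid WHERE t0."worksFor" IS NOT NULL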
duplication of the rows.                                                 We implemented raxonDB on top of PostgreSQL3 . As the focus of
Indexing. We deploy several indexes in raxonDB. First off, we index the subject id for each row. We also build foreign-key indexes on object-subject links between rows in different CSs, i.e., when a value of a property in one CS is the subject id of another CS. Next, we use standard B+trees for indexing single-valued property columns, while we use PostgreSQL's GIN indexes, which apply to array datatypes, for indexing multi-valued properties. This enables fast access for CS chain queries, i.e., queries that apply successive joins over object-subject relationships. Furthermore, we store these links on the schema level as well, i.e., we keep an index of CS pairs that are linked with at least one object-subject pair of records. These links are called Extended Characteristic Sets (ECSs) and are based on our previous work in [9]. With the ECS index, we can quickly filter out CSs that are guaranteed not to be related, i.e., no joins exist between them, even if they are individually matched in a chain of query CSs. Other metadata and indexes include the property sets of CSs, and which properties can contain multiple values in the same CS.
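Continuing the hypothetical cs_merged_1 example from above, the property-level and schema-level indexes could be created along these lines. This is a sketch under assumed names (idx_cs1_*, ecs_index); raxonDB's actual index and table names, as well as the row-level foreign-key indexes, are not spelled out here.

    -- the PRIMARY KEY on subject_id already provides the per-row subject id index
    CREATE INDEX idx_cs1_name  ON cs_merged_1 (name);             -- B+tree on a single-valued property
    CREATE INDEX idx_cs1_phone ON cs_merged_1 USING GIN (phone);  -- GIN index on a multi-valued (array) property

    -- schema-level ECS index: pairs of merged tables linked by at least one object-subject pair of records
    CREATE TABLE ecs_index (
        source_cs INT,
        target_cs INT,
        PRIMARY KEY (source_cs, target_cs)
    );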
Query Processing. Processing SPARQL queries on top of merged CSs entails (i) parsing the queries, (ii) retrieving the query CSs, (iii) identifying the joins between them, and (iv) mapping them to merged tables in the database. Steps (i)-(iii) are inherited from our previous work in [9]. For (iv), a query CS can match with more than one table in the database. For instance, consider a query containing a chain of three CSs, q1 ▷◁ q2 ▷◁ q3, joined sequentially with object-subject joins. Each query CS qi matches with all tables whose property sets are supersets of the property set of qi. Thus, each join in the initial query creates a set of candidate table permutations; these can be evaluated either by rewriting the query as a UNION over all permutations, or by processing each permutation of tables separately and appending the results. Given the filtering performed by the ECS indexing approach, where we can pre-filter CSs based on the relationships between them, the UNION would impose significant overhead and eliminate the advantage of ECS indexing. Therefore, we have implemented the second approach, that is, we process a separate query for each permutation. Finally, due to the existence of NULL values in the merged tables, we must add explicit IS NOT NULL restrictions for all the properties that are contained in each matched CS and are not part of any other restriction or filter in the original query.
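For example, if a query CS with property set {name, email} is mapped to the hypothetical cs_merged_1 table from the earlier sketches, the generated SQL has to exclude rows whose original CS lacked one of these properties, since those columns hold NULL after merging. A simplified rewrite could look like this (illustrative only; the actual query generation in raxonDB also handles the join permutations discussed above):

    SELECT subject_id, name, email
    FROM   cs_merged_1
    WHERE  name  IS NOT NULL    -- required by the query CS, not constrained elsewhere in the query
      AND  email IS NOT NULL;   -- required by the query CS, not constrained elsewhere in the query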
5 EXPERIMENTAL EVALUATION
We implemented raxonDB on top of PostgreSQL3. As the focus of this paper is to improve RDF storage and querying efficiency in relational settings, we rely on existing mechanisms within PostgreSQL for I/O operations, physical storage and query planning. In this set of experiments, we report results of implementing hier_merge with the greedy approximation algorithm, as experimenting with the optimal algorithm failed to finish the merging process even in datasets with small numbers of CSs.

Datasets. For this set of experiments, we used two synthetic datasets, namely LUBM2000 (≈300m triples) and WatDiv (≈100m triples), as well as two real-world datasets, namely Geonames (≈170m triples) and Reactome (≈15m triples). LUBM [6] is a customizable generator of synthetic data that describes academic information about universities, departments, faculty, and so on. Similarly, WatDiv [2] is a customizable generator with more options for the production and distribution of triples to classes. Reactome4 is a biological dataset that describes biological pathways, and Geonames5 is a widely used ontology of geographical entities with varying properties and rich structure.

Loading. In order to assess the effect of hierarchical merging in the loading phase, we performed a series of experiments using all four datasets. For this experiment, we measure the size on disk, the loading time, the final number of merged tables, as well as the number of ECSs (joins between merged tables) and the percentage of triple coverage by CSs included in the merging process, for varying values of the density factor m ∈ [0, 1]. The results are summarized in Table 2.

3 The code and queries are available in https://github.com/mmeimaris/raxonDB
4 http://www.ebi.ac.uk/rdf/services/reactome
5 http://www.geonames.org/ontology/documentation.html
Figure 6: Query execution times in milliseconds ((a) LUBM, (b) Geonames, (c) Reactome)

Figure 7: # of CS permutations for increasing m ((a) LUBM, (b) Geonames, (c) Reactome)

Table 2: Loading experiments for all datasets

Dataset               Size (MB)   Load time   # Tables (CSs)   # of ECSs   Dense CS coverage
Reactome Simple           781       3 min          112             346          100%
Reactome (m=0.05)         675       4 min           35             252           97%
Reactome (m=0.25)         865       4 min           14              73           77%
Geonames Simple          4991      69 min          851           12136          100%
Geonames (m=0.0025)      4999      70 min           82            2455           97%
Geonames (m=0.05)        5093      91 min           19              76           87%
Geonames (m=0.1)         5104      92 min            6              28           83%
LUBM Simple               591       3 min           14              68          100%
LUBM (m=0.25)             610       3 min            6              21           90%
LUBM (m=0.5)              620       3 min            3               6           58%
WatDiv Simple            4910      97 min         5667             802          100%
WatDiv (m=0.01)          5094      75 min           67              99           77%
WatDiv (m=0.1)           5250      75 min           25              23           63%
WatDiv (m=0.5)           5250      77 min           16              19           55%




Figure 8: Query execution times in milliseconds for different RDF engines ((a) LUBM2000, (b) Geonames, (c) Reactome)


As can be seen, the number of CSs, and consequently of tables, is greatly reduced with increasing values of m. As the number of CSs is reduced, the expected number of joins between CSs is also reduced, which can be seen in the column that measures ECSs. Consequently, the number of tables can be decreased significantly without trading off large amounts of coverage by dense CSs, i.e., without ending up with large tables with many null values. Loading time tends to be slightly greater as the number of CSs decreases and thus the number of merges increases, the only exception being WatDiv, where loading time actually decreases. This is a side-effect of the excessive number of tables (5667) in the simple case, which imposes large overheads for the persistence of the tables on disk and the generation of indexes and statistics for each one.
Query Performance. In order to assess the effect of the density factor parameter m during query processing, we perform a series of experiments on LUBM, Reactome and Geonames. For the workload, we used the sets of queries from [9]6. We employ two metrics, namely execution time and number of table permutations. The results can be seen in Figures 6 and 7. As can be seen, hierarchical CS merging can help speed up query performance significantly as long as the dense coverage remains high. For example, in all datasets, query performance degrades dramatically when m = 1, in which case the merging process cannot find any dense CSs. In this case, all rows are added to one large table, which makes the database contain only a single table with many NULL cells. These findings are consistent across all three datasets (Q6 in LUBM exhibits a higher increase for m = 1 due to the dataset's characteristics; for lack of space we omit the explanation of this behaviour) and require further future work in order to identify the optimal value for m.
In order to assess the performance of raxonDB and establish that no overhead is imposed by the relational backbone, we performed a series of queries on LUBM2000, Geonames and Reactome, assuming the best merging of CSs is employed as captured by m with respect to our previous findings. We also compared the query performance with rdf-3x, Virtuoso 7.1, TripleBit and the emergent schema approach described in [16]. The results can be seen in Figure 8 and indicate that raxonDB provides equal or better performance than the original axonDB implementation, as well as the rest of the systems, including the emergent schema approach, which is the only direct competitor for merging CSs. Especially for queries with large intermediate results and low selectivity that correspond to a few CSs and ECSs (e.g. LUBM Q5 and Q6, Geonames Q5 and Q6), several of the other approaches fail to answer fast and in some cases time out.
6 CONCLUSIONS AND FUTURE WORK
In this paper, we tackled the problem of automatically mapping heterogeneous RDF datasets to a relational schema by considering the implicit schema in RDF triples. We presented a method that extracts the Characteristic Sets, i.e., the sets of properties describing the different classes of RDF instances in the data, and exploits the hierarchical relationships between different CSs in order to merge and map them to relational tables. We have provided two algorithms: an optimal (exhaustive) one, which selects ancestral sub-graphs of CSs for merging in exponential time, and a greedy one, which uses heuristics to bring the running time down to polynomial. We have implemented our methods on top of a standard RDBMS solution, i.e., PostgreSQL, for extracting, indexing and query processing of SPARQL queries. Finally, we have experimented with two synthetic and two real-world datasets, all of them exhibiting high heterogeneity in their schemas, and compared our system with various alternative RDF engines; the results for indexing and querying performance show that our system outperforms them on various types of workloads.

As future work, we will study the computation of the optimal value for m, taking into consideration workload characteristics as well as a more refined cost model for the ancestral paths. Furthermore, we will study and compare our approach against a graph database setting, as well as experiment with a column-store relational DB, in order to further scale the capabilities of raxonDB.

Acknowledgments. This work is partially funded by the projects VisualFacts (#1614 - 1st Call of the Hellenic Foundation for Research and Innovation Research Projects for the support of post-doctoral researchers) and "Moving from Big Data Management to Data Science" (MIS 5002437/3) - Action "Reinforcement of the Research and Innovation Infrastructure" (funded by Greece and the European Regional Development Fund).

REFERENCES
[1] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. 2007. Scalable semantic web data management using vertical partitioning. In VLDB.
[2] Güneş Aluç, Olaf Hartig, M. Tamer Özsu, and Khuzaima Daudjee. 2014. Diversified stress testing of RDF data management systems. In ISWC.
[3] Mihaela A. Bornea, Julian Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, Octavian Udrea, and Bishwaranjan Bhattacharjee. 2013. Building an efficient RDF store over a relational database. In ACM SIGMOD.
[4] Orri Erling and Ivan Mikhailov. 2010. Virtuoso: RDF support in a native RDBMS. Springer.
[5] A. Gubichev and T. Neumann. 2014. Exploiting the query structure for efficient join ordering in SPARQL queries. In EDBT.
[6] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web 3, 2 (2005), 158–182.
[7] Himanshu Gupta, Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. 1997. Index Selection for OLAP. In ICDE. 208–219.
[8] Maciej Janik and Krys Kochut. 2005. BRAHMS: a workbench RDF store and high performance memory system for semantic association discovery. In ISWC.
[9] M. Meimaris, G. Papastefanatos, N. Mamoulis, and I. Anagnostopoulos. 2017. Extended Characteristic Sets: Graph Indexing for SPARQL Query Optimization. In ICDE.
[10] Marios Meimaris, George Papastefanatos, Panos Vassiliadis, and Ioannis Anagnostopoulos. 2016. Efficient Computation of Containment and Complementarity in RDF Data Cubes. In EDBT. 281–292. https://doi.org/10.5441/002/edbt.2016.27
[11] Marios Meimaris, George Papastefanatos, Panos Vassiliadis, and Ioannis Anagnostopoulos. 2018. Computational methods and optimizations for containment and complementarity in web data cubes. Inf. Syst. 75 (2018), 56–74. https://doi.org/10.1016/j.is.2018.02.010
[12] G. Montoya, H. Skaf-Molli, and K. Hose. 2017. The Odyssey Approach for Optimizing Federated SPARQL Queries. In ISWC.
[13] T. Neumann and G. Moerkotte. 2011. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In ICDE.
[14] Thomas Neumann and Gerhard Weikum. 2010. The RDF-3X engine for scalable management of RDF data. The VLDB Journal 19, 1 (2010), 91–113.
[15] Nikolaos Papailiou, Dimitrios Tsoumakos, Ioannis Konstantinou, Panagiotis Karras, and Nectarios Koziris. 2014. H2RDF+: an efficient data management system for big RDF graphs. In ACM SIGMOD.
[16] M.D. Pham and P. Boncz. 2016. Exploiting emergent schemas to make RDF systems more efficient. In ISWC.
[17] M.D. Pham, L. Passing, O. Erling, and P. Boncz. 2015. Deriving an emergent relational schema from RDF data. In WWW.
[18] Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, and Georg Lausen. 2014. Sempala: interactive SPARQL query processing on Hadoop. In ISWC.
[19] Alexander Schätzle, Martin Przyjaciel-Zablocki, Simon Skilevic, and Georg Lausen. 2016. S2RDF: RDF querying with SPARQL on Spark. In VLDB.
[20] Lefteris Sidirourgos, Romulo Goncalves, Martin Kersten, Niels Nes, and Stefan Manegold. 2008. Column-store support for RDF data management: not all swans are white. In VLDB.
[21] Cathrin Weiss, Panagiotis Karras, and Abraham Bernstein. 2008. Hexastore: sextuple indexing for semantic web data management. In VLDB.
[22] Kevin Wilkinson. 2006. Jena property table implementation.
[23] Pingpeng Yuan, Pu Liu, Buwen Wu, Hai Jin, Wenya Zhang, and Ling Liu. 2013. TripleBit: a fast and compact system for large scale RDF data. In VLDB.

6 Available also in https://github.com/mmeimaris/raxonDB