            Hierarchical Property Set Merging for SPARQL Query
                                Optimization
Marios Meimaris, Athena Research Center, m.meimaris@athenarc.gr
George Papastefanatos, Athena Research Center, gpapas@athenarc.gr
Panos Vassiliadis, University of Ioannina, pvassil@cs.uoi.gr

ABSTRACT
Characteristic sets (CS) organize RDF triples based on the set of properties associated with their subject nodes. This concept was recently used in indexing techniques, as it can capture the implicit schema of RDF data. While most CS-based approaches yield significant improvements in space and query performance, they fail to perform well when answering complex query workloads in the presence of schema heterogeneity, i.e., when the number of CSs becomes very large, resulting in a highly partitioned data organization. In this paper, we address this problem by introducing a novel technique for merging CSs based on their hierarchical structure. Our method employs a lattice to capture the hierarchical relationships between CSs, identifies dense CSs and merges dense CSs with their ancestors, thus reducing the number of CSs as well as the links between them. We implemented our algorithm on top of a relational backbone, where each merged CS is stored in a relational table, and we performed an extensive experimental study to evaluate the performance and impact of merging on the storage and querying of RDF datasets, indicating significant improvements.

1 INTRODUCTION
The Resource Description Framework1 (RDF) and SPARQL Protocol and RDF Query Language2 (SPARQL) are W3C recommendations for representing and querying data on the semantic web. The semantic web has been established as a source of diverse datasets from a multitude of domains, such as life sciences, statistics, finance, open science and health. Wide adoption has led to increasingly larger dataset sizes, and at the same time complex analytical queries have started to emerge, reflecting the ever increasing recognition of the value of analytical processing in the modern data-centric economy. In light of this, RDF data management methods are calling for improvements in the performance of RDF storage and querying engines, as has been discussed in recent works, where state-of-the-art RDF engines are found to be very efficient in simple workloads, but not efficient enough when it comes to more complex query workloads [17][16][9][12].

1 https://www.w3.org/RDF/
2 https://www.w3.org/TR/sparql11-overview/

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In response to this limitation, recent works have shown that extraction and exploitation of the implicit schema of the data can be beneficial in both storage and SPARQL query performance [9][12]. In order to organize on disk, index and query triples efficiently, these efforts heavily rely on two structural components of an RDF dataset, namely (i) the notion of characteristic sets (CS), i.e., different property sets that characterize subject nodes, and (ii) the join links between CSs. Formally, given an RDF dataset D and a subject node s, the characteristic set cs(s) of s is defined as follows [13]:

cs(s) = {p | ∃o : (s, p, o) ∈ D}

In plain words, a CS is the set of attributes (or RDF properties) of a given subject node. Abstracting from RDF triples to their CSs, an RDF graph can be represented on the structural level by a graph containing CSs as nodes, and links between CSs as edges, where a link between two CSs exists whenever a link between two subject nodes exists in the original RDF graph. Due to the ability to represent all of the properties of a subject node with a single set, rather than multiple triples, CSs have been thoroughly used as a means to optimize query planning, storage, indexing and query processing [5, 9, 12, 13, 16]. In their most general form, they are used as the basis for mapping RDF to a relational structure, where each CS forms a relational table. An illustration of this mapping can be seen in Figure 1. There are two entities, Alice and Claire, each represented in an RDF graph as a node. The properties of these two nodes are (a) their type, (b) the company for which they work and (c) their supervisor. The set of these three properties, {rdf:type, worksFor, supervises}, forms the characteristic set for these two nodes. The direct representation as a relation is depicted at the bottom of Fig. 1, with the characteristic set becoming the attributes of the relation.

Figure 1: A simple RDF graph consisting of a single characteristic set, c1, and its resulting table.
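As an illustration of this definition and of the Alice/Claire example of Figure 1, the following minimal Python sketch computes the characteristic set of every subject in a small set of triples and groups subjects by CS. The function names, literals and object values are ours and purely illustrative; they are not part of any system described in this paper.

    from collections import defaultdict

    def characteristic_sets(triples):
        # cs(s) = {p | exists o : (s, p, o) in D}: collect the predicates emitted by each subject
        cs_of = defaultdict(set)
        for s, p, o in triples:
            cs_of[s].add(p)
        # group subjects that share the same characteristic set
        groups = defaultdict(list)
        for s, props in cs_of.items():
            groups[frozenset(props)].append(s)
        return groups

    # hypothetical triples mirroring the Alice/Claire example (object values invented for illustration)
    triples = [
        ("Alice", "rdf:type", "Employee"), ("Alice", "worksFor", "CompanyA"), ("Alice", "supervises", "Bob"),
        ("Claire", "rdf:type", "Employee"), ("Claire", "worksFor", "CompanyB"), ("Claire", "supervises", "Dave"),
    ]
    for cs, subjects in characteristic_sets(triples).items():
        print(sorted(cs), subjects)  # one CS {rdf:type, supervises, worksFor} shared by Alice and Claire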
However, a mapping from RDF to the relational space is not always helpful, as the structural looseness that is allowed in the RDF world can translate to multiple, heterogeneous CSs that represent skewed distributions of triples. For example, instead of the homogeneity of the graph in Figure 1, where all of the nodes share the same structure, i.e., the same CS, consider the case of Figure 2(a) where nodes are described by four different CSs. In fact, it is frequent that there exist many different CSs within the same dataset, representing varying internal structure for the nodes of the source graph. This schema heterogeneity in loosely-structured datasets is indeed frequently found in the real world (e.g., Geonames contains 851 CSs and 12136 CS links),
imposing large overheads in the extraction, storage and disk-based retrieval [17][9]. For reference, some well established RDF datasets along with their associated numbers of CSs (first column) are shown in Table 1.

Table 1: RDF datasets along with their number of CSs and links between CSs.

Dataset        # Tables (CSs)   # of CS joins
Reactome             112              346
Geonames             851            12136
LUBM 2000             14               68
WatDiv 100          5667              802
Wordnet              779             7250
EFO                  520             2515
DBLP                  95              733

In these cases, we end up with a lot of CSs, each of which may actually represent very few nodes. There are two antagonizing approaches in creating a relational schema from a set of CSs. (i) Creating a relational table for each different CS would result in a large number of relational tables with few tuples in each of them, as shown in Figure 2(b), that require a very large number of joins to answer queries; (ii) on the other hand, creating a single, "universal" table, accommodating the CS's of all the nodes of the graph would create a very wide relation, as in Figure 2(c), which is practically overly empty (i.e., too many NULL values) and space inefficient, due to the fact that different tuples would have very different cells populated.
There are two factors that affect or constrain the design choices we have. First, one could wonder "why not split the universal relation in the traditional manner?". Unfortunately, the traditional decomposition via functional dependencies is neither available nor enforceable on the source data. It is very possible that the incoming data to be stored are not necessarily accompanied by metadata descriptions that prescribe functional dependencies. Even if a DBA were to design them, it is also possible that the data violate them. Thus, in the context of this paper, we proceed with an automated design that assumes that no functional dependencies are available and all schema dependencies are detected online from the incoming data. Second, the desired analysis and the expected workload we consider in this paper play a significant role in the determination of the appropriate structure. Most SPARQL queries involve complex graph patterns, i.e., query conditions, which must be evaluated on the different CSs [9–11]; applying such patterns to a universal relation will result in multiple self-joins, whereas applying patterns to a fine-grained relational schema would again impose multiple joins (and multiple unions of the intermediate results for merging the query output) across a high number of CS relational tables. Consider the evaluation of the following SPARQL query on the nodes of Figure 2(a):

SELECT ?x ?y ?z ?w
WHERE { ?x worksFor ?y .
        ?x supervises ?z .
        ?z hasBirthday '2011-02-24' .
        ?z isMarriedTo ?w .
        ?w hasNationality 'GR' }

Assuming a design with a "universal" relation (Figure 2c), the query evaluation over the relational schema (see Section 4 for the details on the SPARQL-to-SQL query evaluation) would require one self-join for each one of the worksFor, supervises and isMarriedTo query conditions (hasBirthday and hasNationality conditions are executed as select operations in SQL). At the same time, there is a price of NULL values, too, as Figure 2(c) shows. At the other end of the spectrum, in the multiple CS tables case (Figure 2b), each one of the three query conditions would require one self-join and additionally three joins between each CS table and all other CS tables in the database – i.e., 4 joins per table. Thus, despite its space efficiency, the latter case imposes performance overheads due to the large number of joins the query must perform to fetch data from multiple tables. This number can become significantly large in real-world datasets, as shown in the second column of Table 1.
In this paper, we tackle the problem of mapping heterogeneous RDF datasets to a relational schema with the aim to facilitate the processing of complex analytical SPARQL queries, by automating the decision of which tables will be created in order to host the incoming data, such that there are neither overly empty tables nor extremely large numbers of joins.
Interestingly, the problem is both open and hard to solve. The current approaches to address the problem do not avoid the problems that we have mentioned before and include approaches to store data generically as triples, as property tables (practically one table per CS) or as attribute-value pairs in vertical partitions (see Section 2 for a detailed discussion). All these solutions are found in the extremes of the dilemma between empty space and query performance without achieving a "sweet" compromise between the two forces. Therefore, the state of practice and the state of the art provide room for improvement, and in fact, to the best of our knowledge, this is the first effort to address the problem. At the same time, the problem is hard to solve: as we show in Section 4, the complexity is exponential, and therefore, brute force methods are not adequate.
In our approach, we introduce an algorithm to automate the design process. We start by considering that each CS is a single table and, by exploiting their hierarchical relationships, we merge related CSs into the same table. Moreover, we define a density-based cost function that helps us stop the merging process for CSs that contain a large number of triples. In this way, we achieve merging of CSs based on their structural similarity as well as the number of triples they contain. We follow a relational implementation approach by storing all triples corresponding to a set of merged CSs into a separate relational table and by executing queries through a SPARQL-to-SQL transformation. Although alternative storage technologies can be considered (key-value, graph stores, etc.), we have selected well-established technologies and database systems for the implementation of our approach, in order to take advantage of existing relational data indexing and query processing techniques that have been proven to scale efficiently in large datasets and complex workloads. To this end, we present a novel system, named raxonDB, that exploits these hierarchies in order to merge together hierarchically related CSs and decrease the number of CSs and the links between them, resulting in a more compact schema with better data distribution. raxonDB, built on top of PostgreSQL, provides significant performance improvements in both storage and query performance of RDF data with respect to the state of the art techniques presented in Section 2. In short, our contributions are as follows:
    • We introduce a novel CS merging algorithm that takes advantage of CS hierarchies,
    • we implement raxonDB, an RDF engine built on top of a relational backbone that takes advantage of this merging for both storing and query processing,
    • we perform an experimental evaluation that indicates significant performance improvements for various parameter configurations.
Roadmap. In Section 2, we present the background and related work for this paper. In Section 3, we provide preliminary definitions and delineate the problem, and in Section 4 we present algorithms towards its solution. In Section 5, we discuss the experimental evaluation of the paper. We conclude the paper in Section 6, with a summary of our findings and future work.

(a) Four subject nodes, s1, s2, s3, s4 with their properties and their CSs (left) and the resulting CS hierarchy graph (right). (b) Edge case where each CS becomes a relational table. No NULL values exist in any of the tables. (c) Edge case where all CSs become one universal table. NULL values exist in this table. (d) Merging c1, c3, c4 together and leaving c2 unmerged.

Figure 2: Example of four CSs coming from different source datasets, and their respective CS hierarchy graph. Examples of the two edge cases (all tables vs one table), as well as the merging case can be seen. In the figure, the CSs can be seen as derived from data instances (nodes s1, s2, s3, s4). It can generally be assumed that there are more instances belonging to these CSs, not shown in the figure.

2 RELATED WORK
Due to the tabular structure that tends to implicitly underlie RDF data, recent works for RDF data management systems have been implemented on relational backbones. They generally follow three storage schemes, namely (a) triples tables, (b) property tables, and (c) vertical partitioning. A triples table has three columns, representing the subject, predicate and object (SPO) of an RDF triple. This technique replicates data in different orderings in order to facilitate sort-merge joins. RDF-3X [14] and Hexastore [21] build tables on all six permutations of SPO. Built on a relational backbone, Virtuoso [4] uses a 4-column table for quads, and a combination of full and partial indexes. These methods work well for queries with small numbers of joins; however, they degrade with increasing sizes, unbound variables and joins.
The property tables approach places data in tables with columns corresponding to properties of the dataset, where each table identifies a specific resource type. Each row identifies a subject node and holds the value of each property. This technique has been implemented experimentally in Jena [22] and DB2RDF [3], and shows promising results when resource types and their properties are well-defined. However, this causes extra space overhead for null values in cases of sparse properties [1]. Also, it raises performance issues when handling complex queries with many joins, as the amounts of intermediate results increase [8].
Vertical partitioning segments data in two-column tables. Each table corresponds to a property, and each row to a subject node [1]. This provides great performance for queries with bound objects, but suffers when table sizes vary greatly [20]. TripleBit [23] broadly falls under vertical partitioning. In TripleBit, the data is vertically partitioned in chunks per predicate. While this reduces replication, it suffers from similar problems as property tables. It does not consider the inherent schema of the triples in order to speed up the evaluation of complex query patterns.
In distributed settings, a growing body of literature exists, with systems such as Sempala [18], H2RDF [15] and S2RDF [19]. However, these systems are based on the parallelization of centralized indexing and query evaluation schemes.
Due to the high heterogeneity in the schema during the integration and analysis of multiple RDF datasets, latest state of the art approaches rely on implicit schema detection in order to index/store triples based on their schema. In our previous work [9], we defined Extended Characteristic Sets (ECSs) as typed links between CSs, and we showed how ECSs can be used to index triples and greatly improve query performance. In [17], the authors identify and merge CSs, similar to our approach, into what they call an emergent schema. However, their main focus is to extract a human-readable schema with appropriate relation labelling, and they do not use hierarchical information of CSs; rather, they use semantics to drive the merging process. In [16] it is shown how this emergent schema approach can assist query performance; however, the approach is limited by the constraints of human-readable schema discovery. In our work, query performance, indexing and storage optimization are the main aims of the merging process, and thus we are not concerned about providing human-readable schema information or any form of schema exploration. In [12], the authors use CSs and ECSs in order to assist cost estimation for federated queries, while in [5], the authors use CSs in order to provide better triple reordering plans. To the best of our knowledge, this is the first work to exploit hierarchical CS relations in order to merge CSs and improve query performance.

3 PRELIMINARIES
RDF triples consist of a subject s, a predicate p and an object o. An RDF dataset is represented as a directed labelled graph where subjects and objects are nodes, and predicates are labelled edges. Formally, let I, B, L be infinite, pairwise disjoint sets of URIs, blank nodes and literals, respectively. Then, (s, p, o) ∈ (I ∪ B) × (I) × (I ∪ B ∪ L) is a triple. RDF does not enforce structural rules in the representation of triples; within the same dataset there can be largely diverse sets of predicates emitting from nodes of the same type [9, 13, 17]. Characteristic Sets (CS) [13] capture this diversity by representing implied node types based on the set of properties they emit. Formally, given a dataset D and a subject node s, the characteristic set cs(s) of s is cs(s) = {p | ∃o : (s, p, o) ∈ D}, and the set of all CSs is denoted with C. In what follows, we present basic definitions for CSs and their more generalized form of Property Sets. Intuitively, a property set is a set of RDF predicates p1 . . . pn. Therefore, a CS is also a property set. As we are interested in creating property sets by merging CSs, we use this more general form (i.e., property set instead of CS) in order to denote sets of attributes that are not necessarily CSs in the original dataset, but are used after merging several CSs as the basis for a relational table.
Although each real world measurement can have its very own characteristic set, for all practical purposes, the information stored in RDF triples typically refers to recurring concepts and their features. Thus, some co-occurrences of features are common, e.g., in triples coming from the same source and representing "records" pertaining to the same "concept" (e.g., triples for events, pertaining to the concept Event, come with the same set of features, like datetime and description). All event instances from the same source are likely (albeit not obliged) to carry overlapping subsets of these features. At the same time, despite the commonalities, there are differences too, typically produced by the heterogeneity of data sources and the lack of structural restrictions in RDF data. For example, an RDF dataset containing information about people is likely to associate a name property with each instance of a Person class, while an isFatherOf property would not be associated with all instances, for obvious reasons. Thus, there is a possibility to define subset relationships between the features of similar concepts. To this end, we introduce the notion of subsumption and the notion of hierarchy of subsumed property sets.
Definition 1. (Property Sets and Property Tables). In its more general form, a CS is a property set Pi, i.e., a set of RDF predicates, and the set of all property sets is denoted with P. A Property Table (also: CS Table) Ti for a given property set Pi is a relational table comprising an identifier sid for the subject node s and |Pi| columns, where Pi = {pi,1, pi,2, . . . , pi,n} are the predicates emitting from s. Ti contains the set ri of tuples that instantiate the values of the properties for sid, i.e., Ti = (sid ∪ Pi, ri). A tuple can hold values for the predicate columns in I ∪ B ∪ L ∪ NULL, i.e., each cell in the relational table can either have a standard value of an RDF triple object, or NULL when the subject node identified in sid does not have a property in any of its triples. In Figure 2(a), four subject nodes, s1, s2, s3, s4, are shown. These have four different CSs based on their property sets, namely c1, c2, c3, c4.
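To make Definition 1 concrete, the short sketch below derives a simplified relational DDL statement for a property table Ti = (sid ∪ Pi, ri). The table name, column types and quoting are illustrative assumptions and do not reproduce raxonDB's actual schema generation.

    def create_table_sql(table_name, property_set):
        # illustrative only: one column per predicate in Pi, plus the subject identifier sid
        cols = ", ".join('"{}" text'.format(p) for p in sorted(property_set))
        return "CREATE TABLE {} (sid bigint PRIMARY KEY, {});".format(table_name, cols)

    print(create_table_sql("cs_1", {"rdf:type", "worksFor", "supervises"}))
    # CREATE TABLE cs_1 (sid bigint PRIMARY KEY, "rdf:type" text, "supervises" text, "worksFor" text);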
Figure 3: (a) A CS hierarchy graph with dense nodes colored in deep purple, (b) the connected components derived by cutting off descendants from dense nodes, (c) a connected component with dashed lines representing inferred hierarchical relationships, (d) all possible assignments of dense nodes to non-dense nodes.

Definition 2. (CS Subsumption). Given two CSs, ci and cj, then ci subsumes cj, or ci ≻ cj, when ci is a subset of cj, or ci ⊂ cj. This subsumption forms a parent-child relationship. For example, consider c1, c2 that describe human beings, with c1 = {type, name} and c2 = {type, name, marriedTo}. It can be seen that c1 ⊂ c2 and therefore c1 subsumes c2. CS subsumption relationships can be seen in Figure 3(a) as directed edges between nodes. In the example of Figure 2(a), the four CSs exhibit strong subset-superset relationships between their properties. For instance, c1 and c2 have property sets that are supersets of both c3 and c4, while c3 also subsumes c4. The set of all parent-child relationships defines a CS hierarchy.
Definition 3. (CS Hierarchy and Inferred Hierarchy). CS subsumption creates a partial hierarchical ordering such that when ci ≻ cj, then ci is a parent of cj. Formally, a CS hierarchy is a graph lattice H = (V, E) where V ⊆ C and E ⊆ (V × V). A directed edge between two CS nodes c1, c2 exists in H when c1 ≻ c2 and there exists no other ci such that c1 ≻ ci ≻ c2. The directed edge stems from the parent node and arrives at the child node. An example CS hierarchy can be seen in Figure 3(a). Given a hierarchy H, we denote the hierarchical closure of H with Hc, so that Hc extends H to contain inferred edges between hierarchically related nodes that are not consecutive, e.g. a node and its grandchildren. Intuitively, for every set of features c that describes some concept in the dataset, we introduce a node in our graph. Every strict superset of features of c is a descendant of c, and every strict subset of features of c is an ancestor of c. An example inferred hierarchy can be seen on the right of Figure 2(a), with the inferred relationships in dashed lines, as well as in Figure 3(c) for a sub-graph of the graph in Figure 3(a). In the remainder of this paper, we refer to Hc as the inferred hierarchy of H. The lattice resembles the traditional OLAP cube lattice (see [7]), although in our case, the construction is ad-hoc, depending on the available CS's, and serves a different purpose.
Definition 4. (CS Ancestral Sub-graphs). Given an inferred hierarchy Hc = (V, E), a CS cbase, a set of CSs c1, . . . , ck, and a sub-graph H^anc_cbase = (V′, E′) with V′ ⊆ V and E′ ⊆ E, we say that H^anc_cbase is an ancestral sub-graph over cbase when ∀i ∈ [1..k] it holds that ci ≻ cbase and (ci, cbase) ∈ E′. Intuitively, any set of ancestors of a node cbase forms an ancestral sub-graph. More than one ancestral sub-graph can be defined over cbase, as any subset of its parents is considered an ancestral sub-graph over cbase. For instance, in Figure 3(c), nodes c7, c4, c2 form an ancestral sub-graph over c7. Similarly, nodes c6, c4, c2 and c6, c5, c2 form ancestral sub-graphs over c6.
Having defined the hierarchy of characteristic sets and the respective graph, we are now ready to provide the foundation for the core of the problem of this paper. Basically, the goal is to find a way to store data in a way that balances two antagonizing goals: (a) the number of NULL values and the unnecessary space increase, vs. (b) the resulting decrease in query processing time that would result from the fragmentation of stored data in multiple nodes and the need to join them. To address this issue, we can attempt to merge different nodes into one, paying the price of empty space to avoid the price of joins. There exist two edge cases here, namely (i) assign a table to each CS in the incoming data, resulting in as many tables as CSs, and (ii) assign one universal table for all CSs. This table would contain all of the properties found in the input dataset. The two edge cases for the running example of Fig. 2 can be seen in Fig. 2(b,c).
Definition 5. (Hierarchical CS Merge). Given an ancestral sub-graph a = (V′, E′), where V′ = {c1, c2, . . . , ck} as defined above, and the set of property tables T(V′), then hier_merge(a, T(V′)) is a hierarchical merge of a that results in a single table Ta = (cbase, ra). As cbase is the most specialized CS in a, the columns of Ta are exactly the properties in cbase, while ra = ⋃i=1..k r′i is the union of the records of all CSs in V′, where r′i is the projection of ri on the properties in cbase. Consequently, r′i contains NULL values for all the non-shared properties of cbase and ci. In essence, this is an edge contraction operator that merges all tables of the nodes of an ancestral sub-graph into one and keeps the signature (i.e., the properties) of the base CS cbase. For instance, assume that V′ = {c0 = (P0, r0), c1 = (P1, r1), c2 = (P2, r2)} is the set of vertices of an ancestral sub-graph with three CSs, with P0 = {pa, pb}, P1 = {pa, pb, pc} and P2 = {pa, pb, pc, pd}. Thus, c0 ≻ c1 ≻ c2. The output of the merging process for our running example can be seen in Figure 2(d). Hierarchical merging can be seen in Figure 4.
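The following minimal Python sketch mimics the hier_merge operator of Definition 5 on the c0 ≻ c1 ≻ c2 example above, representing each table as a list of dictionaries and padding non-shared properties with None (NULL). It is illustrative only; it ignores subject identifiers and physical storage, and the names are ours.

    def hier_merge(base_props, tables):
        # contract all tables of the ancestral sub-graph into one table with the base CS's columns;
        # records of ancestor CSs get None (NULL) for the properties they do not share with the base
        merged = []
        for records in tables.values():
            for rec in records:
                merged.append({p: rec.get(p) for p in base_props})
        return merged

    tables = {
        "c0": [{"pa": 1, "pb": 2}],                        # P0 = {pa, pb}
        "c1": [{"pa": 3, "pb": 4, "pc": 5}],               # P1 = {pa, pb, pc}
        "c2": [{"pa": 6, "pb": 7, "pc": 8, "pd": 9}],      # P2 = {pa, pb, pc, pd}, the base CS
    }
    for row in hier_merge(["pa", "pb", "pc", "pd"], tables):
        print(row)   # rows of c0 and c1 are padded with None for the properties they lack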
Definition 6. (Merge Graph). Given an inferred CS hierarchy Hc = (V, E), a merge graph is a graph H′ = (V′, E′) that consists of a set of n ancestral sub-graphs, and has the following properties: (i) H′ contains all nodes in H such that V ≡ V′, i.e., it covers all CSs in the input dataset, (ii) H′ contains a subset of the edges in H such that E′ ⊂ E, (iii) each node is contained in exactly one ancestral sub-graph ai, (iv) all ancestral sub-graphs are pair-wise disconnected, i.e., there exist no edges between the nodes of different ancestral sub-graphs. Thus, each ancestral sub-graph can be contracted into one node unambiguously, using the hier_merge operator. Also, the total number of relational tables will be equal to the number of ancestral sub-graphs in the merge graph.
The primary focus of this work is to improve the efficiency of storage and querying in relational RDF engines by exploiting the implicit schema of the data in the form of CSs. The two extreme approaches – i.e., (a) a separate table for each CS, on the one end of the spectrum, or (b) merging all the data into a universal table with a schema expressed as the union of the individual features of the different CS's, at the other end of the spectrum – have the following drawbacks, respectively: (i) multiple joins of separate tables at query processing, if many tables are used, and (ii) bad utilization of disk space if nodes are excessively merged into a very small number of tables (due to excessive empty space with NULLs).
Problem Formulation. Given an inferred CS hierarchy Hc = (V, E), the problem is to find a merge graph H′ = (V′, E′) in the form of a set of disconnected ancestral sub-graphs, that provides a good way to merge CS nodes. In other words, the problem is to find the best set of ancestral sub-graphs from an inferred hierarchy Hc that minimizes an objective cost function cost(x).

4 HIERARCHICAL CS MERGING
What makes the problem hard is the complexity of finding a sweet spot in the Pareto equilibrium between conciseness of the relational schema (few tables) and internal density (with respect to the empty space they contain).
Schema conciseness. To start with the arguments in favor of reducing the number of nodes via merging, we can see that, by reducing the number of CSs, the number of joins between them is also reduced. Furthermore, merging together CSs leads to a less skewed distribution of data to relational tables. Ultimately, this results in a drastically decreased disk-based I/O cost, as fewer tables are fetched, positively affecting query processing as well.
Density. On the contrary, merging tables results in the introduction of NULL values for the non-shared columns, which can degrade performance. Specifically, merging CSs with different attribute sets can result in large numbers of NULL values in the resulting table. Given a parent CS table T1 = (sid ∪ c1, r1) and a child CS table T2 = (sid ∪ c2, r2) with |c1| < |c2| and |r1| >> |r2|, the resulting |c2 \ c1| × |r1| NULL cells will be significantly large compared to the total number of |r1| + |r2| records, thus potentially causing poor storage and querying performance [17]. For this reason, CS merging must be performed in a way that will minimize the presence of NULL values. The following function captures the NULL-value effect of the merge of two CS tables Ti = (sid ∪ ci, ri), Tj = (sid ∪ cj, rj) with ci ≻ cj:

r_null(Ti, Tj) = (|cj \ ci| × |ri|) / (|rj| + |ri|)    (1)

r_null represents the ratio of null values to the cardinality of the merged table. The numerator of the fraction represents the total number of cell values that will be null, as the product of the number of non-shared properties and the cardinality of the parent CS. The denominator represents the resulting cardinality of the table.
In order to assess an ancestral sub-graph, we use a generalized version of r_null that captures the NULL value effect on the whole sub-graph:

r_null(g)|Td = (Σi=1..|g| |cd \ ci| × |ri|) / (|rd| + Σi=1..|g| |ri|)    (2)

Here, Td = (cd, rd) is the root of sub-graph g. However, merging a parent to a child changes the structure of the input graph, as the cardinality of the merged child is increased. Thus, we define a cost function that works on the graph level, as follows:

cost(g) = Σi=1..n r_null(gi)|cdi    (3)

where n is the number of dense nodes, cdi is a dense node and gi is the ancestral sub-graph with cdi as the base node.
Thus, choosing dense CSs as bases is a seeding process that aims to minimize this NULL value effect by making sure that a large fraction of the input records will not be extended with NULL values. This is true because a CS base and its resulting merged table will have exactly the same properties (columns) and thus introduction of NULL values will not be needed for the records of the CS base.
The problem of selecting ancestral sub-graphs for the merge is computationally hard, as mentioned earlier. For this reason, we rely on heuristics in order to seed the process and provide an initial set of ancestral sub-graph bases for the final merged tables. The CS bases will be the only relational tables in the output, with the remaining tables merged into them. For this, we identify dense CS nodes in the hierarchy (i.e., with large cardinalities) and use these nodes as the bases of the ancestral sub-graphs. While node density can be defined in many different ways, in the context of this work we define a CS ci to be dense if the cardinality of its relational table is larger than a linear function of the maximum cardinality of CSs in D, i.e., a function d : N → R, with d(Ti) = m × |rmax|. Here, m ∈ [0, 1] is called the density factor, and rmax is the cardinality of the largest CS table in D. By definition, if m = 0, no CSs will be merged (i.e., each CS will be stored in its own table), while if m = 1, no tables will be created, as no CS has a cardinality larger than that of the largest CS. With a given m, the problem is reduced to finding the optimal ancestral sub-graph for each given dense node.
Given this cost model and a predefined set of dense nodes, our algorithm will find the optimal sub-graph for each dense node. An inferred hierarchy graph can be converted to a set of connected components that are derived by removing the outgoing edges from dense nodes, since we are not interested in merging children to parents, but only parents to children. An example of this can be seen in Figure 3(b). For each component, we can compute cost(g) as the sum of the costs of these components. The main idea is to identify all connected components in the CS graph, iterate through them, enumerate all sub-graphs within the components that start from the given set of dense nodes, and select the optimal partitioning for each component.
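A compact sketch of this cost model and of the exhaustive enumeration used by optimalMerge is given below. It assumes that each CS is represented simply by its property set and cardinality; the helper names are ours, not the system's.

    from itertools import product

    def r_null_subgraph(dense, ancestors):
        # Equation (2): NULL ratio of merging the ancestor CSs into the dense base CS;
        # each CS is a pair (property_set, cardinality)
        d_props, d_card = dense
        nulls = sum(len(d_props - props) * card for props, card in ancestors)
        total = d_card + sum(card for _, card in ancestors)
        return nulls / total if total else 0.0

    def enumerate_assignments(candidates):
        # candidates: non-dense CS id -> list of dense CS ids it may be merged into;
        # yields every complete assignment (one dense CS per non-dense CS), as optimalMerge enumerates
        keys = list(candidates)
        for choice in product(*(candidates[k] for k in keys)):
            yield dict(zip(keys, choice))

For the component of Figure 3(c,d), enumerate_assignments would produce the e(c2) × e(c4) × e(c5) = 6 assignments listed in the figure; the cost of each assignment is then obtained by grouping the non-dense nodes per chosen dense node and summing r_null_subgraph over the resulting sub-graphs, following Equation (3).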
Figure 4: Merging the tables of c0, c1 and c2.

Algorithm 1: optimalMerge
Data: An inferred hierarchy lattice Lc as an adjacency list, and a set of dense CSs D
Result: A set of optimal ancestral sub-graphs
1  init finalList;
2  connectedComponents ← findConnectedComponents(Lc);
3  for each connectedComponent do
4      init min ← MAX_VALUE;
5      init bestSubgraph;
6      while next ← connectedComponent.generateNextSubgraph() do
7          if cost(next) < min then
8              min ← cost(next);
9              bestSubgraph ← next;
10     end
11     finalList.add(bestSubgraph);
12 end
13 return finalList;

The algorithm can be seen in Algorithm 1. The algorithm works by first identifying all connected components of the inferred hierarchy (Line 2). Identifying connected components is trivially done using standard DFS traversal, and is not shown in the Algorithm. Then, we iterate each component (Line 3), and for each component, we generate all possible sub-graphs. Then, we calculate the cost of each sub-graph (Line 7) and if it is smaller than the current minimum, the minimum cost and best sub-graph are updated (Lines 8-9). Finally, we add the best sub-graph to the final list (Line 11) and move to the next component.
To generate the sub-graphs, we do not need to do an exhaustive generation of 2^n combinations, but we can rely on the observation that each non-dense node must be merged to exactly one dense node. Therefore, sub-graph generation is reduced to finding all possible assignments of dense nodes to the non-dense nodes. An example of this can be seen in Figure 3. In the figure, nodes c2, c4, c5 are non-dense, while nodes c6, c7, c8 are dense. All possible and meaningful sub-graphs are enumerated in the table at the right of the figure, where we assign a dense node to each of the non-dense nodes. An assignment is only possible if there exists a parent-child relationship between a non-dense node and a dense node, even if it is an inferred one (e.g. c2 is an inferred parent of c7). Hence, the problem of sub-graph generation becomes one of generating combinations from different lists, by selecting one element from each list. The number of lists is equal to the number of non-dense nodes, and the elements of each list are the dense nodes that are related to the non-dense node.
Complexity Analysis. Assuming that a connected component g has k non-dense nodes and d dense nodes, and each non-dense node ki is related to e(ki) dense nodes, then the number of sub-graphs that need to be enumerated is Πi=1..k e(ki). In the example of Figure 3, the total number of sub-graphs is e(c2) × e(c4) × e(c5) = 3 × 2 × 1 = 6. In the worst case, all k nodes are parents of all d nodes. Then, the number of total sub-graphs is d^k, which makes the asymptotic time complexity of the algorithm O(d^k).

4.1 Greedy Approximation
For very small d, k (e.g. d, k < 4), the asymptotic complexity of O(d^k) is acceptable. However, in real-world cases, the number of connected components can be small, making d and k large. For this reason, we introduce a heuristic algorithm for approximating the problem, that does not require enumerating all possible combinations, relying instead on a greedy objective function that attempts to find the local minimum with respect to the defined cost model for each non-dense node.
The main idea behind the algorithm is to iterate the non-dense nodes, and for each non-dense node, calculate r_null and find the dense node that minimizes this function for the given non-dense node. Then, the cardinalities will be recomputed and the next non-dense node will be examined. The algorithm can be seen in Algorithm 2. In the beginning, the algorithm initiates a hash table, mergeMap, with an empty list for each dense node (Lines 1-4). Then, the algorithm iterates all non-dense nodes (Line 5), and for each non-dense node, it calculates the cost r_null of merging it to each of its connected dense nodes (Lines 5-13), keeping the current minimum cost and dense node. In the end, the current non-dense node is added to the list of the dense node that minimizes r_null (Line 14). Notice that we do not need to split the hierarchy into connected components in order for greedyMerge to work.
Complexity Analysis. Given k non-dense nodes and d dense nodes, where each non-dense node ki has e(ki) dense children, the greedyMerge algorithm needs Σi=1..k e(ki) iterations, because it requires iteration of all e(ki) nodes for each ki. In the worst case, every ki is related to all d dense nodes, requiring kd iterations. Assuming a constant cost for the computation of r_null, the asymptotic complexity of the greedy algorithm is O(kd), which is a significant improvement when compared to the exponential complexity of optimalMerge.

Algorithm 2: greedyMerge
Data: A hash table p mapping non-dense CSs to their dense descendants, a set of dense CSs D, and a set of non-dense CSs K
Result: A hash table mapping dense CSs to sets of non-dense CSs to be merged
1  init mergeMap;
2  for each d ∈ D do
3      mergeMap.put(d, new List());
4  end
5  for each k ∈ K do
6      min ← MAX_VALUE;
7      init bestDense;
8      for each dk ∈ p.get(k) do
9          cost ← r_null(k, dk);
10         if cost < min then
11             min ← cost;
12             bestDense ← dk;
13     end
14     mergeMap.get(bestDense).add(k);
15 end
16 return mergeMap;
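The sketch below is a minimal Python rendering of Algorithm 2, under the same simplified representation of a CS as a (property set, cardinality) pair used earlier; the pairwise cost follows Equation (1), and, as a simplification, cardinalities are not recomputed between iterations. The function and parameter names are ours.

    def r_null_pair(non_dense_cs, dense_cs):
        # Equation (1): NULL ratio of merging a non-dense parent CS into a dense descendant CS
        (nd_props, nd_card), (d_props, d_card) = non_dense_cs, dense_cs
        return (len(d_props - nd_props) * nd_card) / (d_card + nd_card)

    def greedy_merge(dense, non_dense, dense_descendants):
        # dense / non_dense: CS id -> (property set, cardinality);
        # dense_descendants: non-dense CS id -> ids of its dense descendants (the hash table p)
        merge_map = {d: [] for d in dense}
        for k, cs in non_dense.items():
            candidates = dense_descendants.get(k, [])
            if not candidates:
                continue                      # CSs with no dense descendant end up in T_rest
            best = min(candidates, key=lambda d: r_null_pair(cs, dense[d]))
            merge_map[best].append(k)
        return merge_map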
This process does not necessarily cover all CSs of the input dataset. For example, some CS nodes might not have any dense children. Given this, the percentage of the dataset that is covered by this process is called dense CS coverage. The remainder of the CSs are aggregated into one large table, Trest, containing all of their predicates. If the total coverage of the merging process is large, then Trest does not impose a heavy overhead in query performance, as will be shown in the experiments. Finally, we load the data in the corresponding tables.

Figure 5: An example of greedy merging. Dense nodes are coloured in deep purple. At each step, the non-dense node under examination is coloured with green, while the edge that minimizes r_null can be seen in bold.

4.2 Implementation
We implemented raxonDB as a storage and querying engine that supports hierarchical CS merging, and can be deployed on top of standard RDBMS's. Specifically, we used PostgreSQL 9.6, but raxonDB can be adapted for other relational databases as well.
CS Retrieval and Merging. The processes of retrieving and merging CSs take place during the loading stage of an incoming RDF dataset. CS retrieval is a trivial procedure that requires scanning the whole dataset and storing the unique sets of properties that are emitted from the subject nodes in the incoming triples, and is adopted from our previous work in [9], where it is described in detail. After retrieving the CSs, the main idea is to compute the inferred CS hierarchy and apply one of the described merging algorithms. Finally, each set of merged CSs is stored in a relational table. In each table, the first column represents the subject identifier, while the rest of the columns represent the union of the property sets of the merged CSs. For multi-valued properties, we use PostgreSQL's array data type in order to avoid duplication of the rows.
Indexing. We deploy several indexes in raxonDB. First off, we index the subject id for each row. We also build foreign-key indexes on object-subject links between rows in different CSs, i.e., when a value of a property in one CS is the subject id of another CS. Next, we use standard B+trees for indexing single-valued property columns, while we use PostgreSQL's GIN indexes, which apply to array datatypes, for indexing multi-valued properties. This enables fast access to CS chain queries, i.e., queries that apply successive joins for object-subject relationships. Furthermore, we store these links on the schema level as well, i.e., we keep an index of CS pairs that are linked with at least one object-subject pair of records. These links are called Extended Characteristic Sets (ECSs) and are based on our previous work in [9]. With the ECS index, we can quickly filter out CSs that are guaranteed not to be related, i.e., no joins exist between them, even if they are individually matched in a chain of query CSs. Other metadata and indexes include the property sets of CSs, and which properties can contain multiple values in the same CS.
Query Processing. Processing SPARQL queries on top of merged CSs entails (i) parsing the queries, (ii) retrieving the query CSs, (iii) identifying the joins between them, and (iv) mapping them to merged tables in the database. Steps (i)-(iii) are inherited from our previous work in [9]. For (iv), a query CS can match with more than one table in the database. For instance, consider a query containing a chain of three CSs, q1 ▷◁ q2 ▷◁ q3, joined sequentially with object-subject joins. Each query CS qi matches with all tables whose property sets are supersets of the property set of qi. Thus, each join in the initial query creates a set of permutations of table joins that need to be evaluated. For instance, assume that q1 matches with c1, c2, while q2 matches with c3 and q3 matches with c4, c5. Assume also that by looking up the ECS index, we derived that the links [c1, c3], [c2, c3], [c3, c4] and [c3, c5] exist in the index, i.e., they correspond to candidate joins in the data. Then, [c1, c3, c4], [c1, c3, c5], [c2, c3, c4] and [c2, c3, c5] are all valid table permutations that must be processed. Two main strategies can be employed here. The first is to join the UNIONs of the matching tables for each qi, and the other is to process each permutation of tables separately and append the results. Given the filtering performed by the ECS indexing approach, where we can pre-filter CSs based on the relationships between them, the UNION would impose significant overhead and eliminate the advantage of ECS indexing. Therefore, we have implemented the second approach, that is, process a separate query for each permutation. Finally, due to the existence of NULL values in the merged tables, we must add explicit IS NOT NULL restrictions for all the properties that are contained in each matched CS and are not part of any other restriction or filter in the original query.
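As an illustration of the per-permutation strategy, the sketch below assembles the rough shape of a SQL statement for one candidate permutation of merged tables, including an IS NOT NULL restriction of the kind discussed above. The table and property names follow the running example; which properties actually receive IS NOT NULL depends on the query's other restrictions, and the generated SQL is a simplification of, not a reproduction of, raxonDB's actual SPARQL-to-SQL output.

    def permutation_sql(tables, join_props, required_props):
        # illustrative only
        # tables: ordered table names for one permutation, e.g. [c1, c3, c4];
        # join_props[i]: property of tables[i] whose value is the subject id of tables[i+1];
        # required_props[t]: matched properties of table t that need an explicit IS NOT NULL
        sql = "SELECT * FROM {} t0".format(tables[0])
        for i in range(1, len(tables)):
            sql += ' JOIN {} t{} ON t{}."{}" = t{}.sid'.format(tables[i], i, i - 1, join_props[i - 1], i)
        conds = ['t{}."{}" IS NOT NULL'.format(i, p)
                 for i, t in enumerate(tables) for p in required_props.get(t, [])]
        return sql + (" WHERE " + " AND ".join(conds) if conds else "")

    print(permutation_sql(["c1", "c3", "c4"], ["supervises", "isMarriedTo"], {"c1": ["worksFor"]}))
    # SELECT * FROM c1 t0 JOIN c3 t1 ON t0."supervises" = t1.sid
    #   JOIN c4 t2 ON t1."isMarriedTo" = t2.sid WHERE t0."worksFor" IS NOT NULL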
duplication of the rows.                                                 We implemented raxonDB on top of PostgreSQL3 . As the focus of
Indexing. We deploy several indexes in raxonDB. First off, we index the subject id for each row. We also build foreign-key indexes on object-subject links between rows in different CSs, i.e., when a value of a property in one CS is the subject id of another CS. Next, we use standard B+trees for indexing single-valued property columns, while we use PostgreSQL's GIN indexes, which apply to array datatypes, for indexing multi-valued properties. This enables fast access for CS chain queries, i.e., queries that apply successive joins over object-subject relationships. Furthermore, we store these links on the schema level as well, i.e., we keep an index of CS pairs that are linked with at least one object-subject pair of records. These links are called Extended Characteristic Sets (ECSs) and are based on our previous work in [9]. With the ECS index, we can quickly filter out CSs that are guaranteed not to be related, i.e., no joins exist between them, even if they are individually matched in a chain of query CSs. Other metadata and indexes include the property sets of CSs, and which properties can contain multiple values in the same CS.
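Continuing the hypothetical cs_merged_1 example from above, the property-level and schema-level indexes could be created along these lines. This is a sketch under assumed names (idx_cs1_*, ecs_index); raxonDB's actual index and table names, as well as the row-level foreign-key indexes, are not spelled out here.

    -- the PRIMARY KEY on subject_id already provides the per-row subject id index
    CREATE INDEX idx_cs1_name  ON cs_merged_1 (name);             -- B+tree on a single-valued property
    CREATE INDEX idx_cs1_phone ON cs_merged_1 USING GIN (phone);  -- GIN index on a multi-valued (array) property

    -- schema-level ECS index: pairs of merged tables linked by at least one object-subject pair of records
    CREATE TABLE ecs_index (
        source_cs INT,
        target_cs INT,
        PRIMARY KEY (source_cs, target_cs)
    );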
Query Processing. Processing SPARQL queries on top of merged CSs entails (i) parsing the queries, (ii) retrieving the query CSs, (iii) identifying the joins between them, and (iv) mapping them to merged tables in the database. Steps (i)-(iii) are inherited from our previous work in [9]. For (iv), a query CS can match with more than one table in the database. For instance, consider a query containing a chain of three CSs, q1 ▷◁ q2 ▷◁ q3, joined sequentially with object-subject joins. Each query CS qi matches with all tables whose property sets are supersets of the property set of qi. Thus, each join in the initial query creates a set of candidate table permutations; these can be evaluated either by rewriting the query as a UNION over all permutations, or by processing each permutation of tables separately and appending the results. Given the filtering performed by the ECS indexing approach, where we can pre-filter CSs based on the relationships between them, the UNION would impose significant overhead and eliminate the advantage of ECS indexing. Therefore, we have implemented the second approach, that is, we process a separate query for each permutation. Finally, due to the existence of NULL values in the merged tables, we must add explicit IS NOT NULL restrictions for all the properties that are contained in each matched CS and are not part of any other restriction or filter in the original query.
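For example, if a query CS with property set {name, email} is mapped to the hypothetical cs_merged_1 table from the earlier sketches, the generated SQL has to exclude rows whose original CS lacked one of these properties, since those columns hold NULL after merging. A simplified rewrite could look like this (illustrative only; the actual query generation in raxonDB also handles the join permutations discussed above):

    SELECT subject_id, name, email
    FROM   cs_merged_1
    WHERE  name  IS NOT NULL    -- required by the query CS, not constrained elsewhere in the query
      AND  email IS NOT NULL;   -- required by the query CS, not constrained elsewhere in the query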
5 EXPERIMENTAL EVALUATION
We implemented raxonDB on top of PostgreSQL3. As the focus of this paper is to improve RDF storage and querying efficiency in relational settings, we rely on existing mechanisms within PostgreSQL for I/O operations, physical storage and query planning. In this set of experiments, we report results of implementing hier_merge with the greedy approximation algorithm, as experimenting with the optimal algorithm failed to finish the merging process even in datasets with small numbers of CSs.

Datasets. For this set of experiments, we used two synthetic datasets, namely LUBM2000 (≈300m triples) and WatDiv (≈100m triples), as well as two real-world datasets, namely Geonames (≈170m triples) and Reactome (≈15m triples). LUBM [6] is a customizable generator of synthetic data that describes academic information about universities, departments, faculty, and so on. Similarly, WatDiv [2] is a customizable generator with more options for the production and distribution of triples to classes. Reactome4 is a biological dataset that describes biological pathways, and Geonames5 is a widely used ontology of geographical entities with varying properties and rich structure.

Loading. In order to assess the effect of hierarchical merging in the loading phase, we performed a series of experiments using all four datasets. For this experiment, we measure the size on disk, the loading time, the final number of merged tables, as well as the number of ECSs (joins between merged tables) and the percentage of triple coverage by CSs included in the merging process, for varying values of the density factor m ∈ [0, 1]. The results are summarized in Table 2.

3 The code and queries are available in https://github.com/mmeimaris/raxonDB
4 http://www.ebi.ac.uk/rdf/services/reactome
5 http://www.geonames.org/ontology/documentation.html
Figure 6: Query execution times in milliseconds ((a) LUBM, (b) Geonames, (c) Reactome)

Figure 7: # of CS permutations for increasing m ((a) LUBM, (b) Geonames, (c) Reactome)

Table 2: Loading experiments for all datasets

Dataset               Size (MB)   Load time   # Tables (CSs)   # of ECSs   Dense CS coverage
Reactome Simple           781       3 min          112             346          100%
Reactome (m=0.05)         675       4 min           35             252           97%
Reactome (m=0.25)         865       4 min           14              73           77%
Geonames Simple          4991      69 min          851           12136          100%
Geonames (m=0.0025)      4999      70 min           82            2455           97%
Geonames (m=0.05)        5093      91 min           19              76           87%
Geonames (m=0.1)         5104      92 min            6              28           83%
LUBM Simple               591       3 min           14              68          100%
LUBM (m=0.25)             610       3 min            6              21           90%
LUBM (m=0.5)              620       3 min            3               6           58%
WatDiv Simple            4910      97 min         5667             802          100%
WatDiv (m=0.01)          5094      75 min           67              99           77%
WatDiv (m=0.1)           5250      75 min           25              23           63%
WatDiv (m=0.5)           5250      77 min           16              19           55%




Figure 8: Query execution times in milliseconds for different RDF engines ((a) LUBM2000, (b) Geonames, (c) Reactome)


As can be seen, the number of CSs, and consequently of tables, is greatly reduced with increasing values of m. As the number of CSs is reduced, the expected number of joins between CSs is also reduced, which can be seen in the column that measures ECSs. Consequently, the number of tables can be decreased significantly without trading off large amounts of coverage by dense CSs, i.e., without ending up with large tables with many null values. Loading time tends to be slightly greater as the number of CSs decreases and thus the number of merges increases, the only exception being WatDiv, where loading time actually decreases. This is a side-effect of the excessive number of tables (5667) in the simple case, which imposes large overheads for the persistence of the tables on disk and the generation of indexes and statistics for each one.
Query Performance. In order to assess the effect of the density factor parameter m during query processing, we perform a series of experiments on LUBM, Reactome and Geonames. For the workload, we used the sets of queries from [9]6. We employ two metrics, namely execution time and number of table permutations. The results can be seen in Figures 6 and 7. As can be seen, hierarchical CS merging can help speed up query performance significantly as long as the dense coverage remains high. For example, in all datasets, query performance degrades dramatically when m = 1, in which case the merging process cannot find any dense CSs. In this case, all rows are added to one large table, which makes the database contain only a single table with many NULL cells. These findings are consistent across all three datasets (Q6 in LUBM exhibits a higher increase for m = 1 due to the dataset's characteristics; for lack of space we omit the explanation of this behaviour) and require further future work in order to identify the optimal value for m.
In order to assess the performance of raxonDB and establish that no overhead is imposed by the relational backbone, we performed a series of queries on LUBM2000, Geonames and Reactome, assuming the best merging of CSs is employed as captured by m with respect to our previous findings. We also compared the query performance with rdf-3x, Virtuoso 7.1, TripleBit and the emergent schema approach described in [16]. The results can be seen in Figure 8 and indicate that raxonDB provides equal or better performance than the original axonDB implementation, as well as the rest of the systems, including the emergent schema approach, which is the only direct competitor for merging CSs. Especially for queries with large intermediate results and low selectivity that correspond to a few CSs and ECSs (e.g. LUBM Q5 and Q6, Geonames Q5 and Q6), several of the other approaches fail to answer fast and in some cases time out.
6 CONCLUSIONS AND FUTURE WORK
In this paper, we tackled the problem of automatically mapping heterogeneous RDF datasets to a relational schema by considering the implicit schema in RDF triples. We presented a method that extracts the Characteristic Sets, i.e., the sets of properties describing the different classes of RDF instances in the data, and exploits the hierarchical relationships between different CSs in order to merge and map them to relational tables. We have provided two algorithms: an optimal (exhaustive) one, which selects ancestral sub-graphs of CSs for merging in exponential time, and a greedy one, which uses heuristics to bring the running time down to polynomial. We have implemented our methods on top of a standard RDBMS solution, i.e., PostgreSQL, for extracting, indexing and query processing of SPARQL queries. Finally, we have experimented with two synthetic and two real-world datasets, all of them exhibiting high heterogeneity in their schemas, and compared our system with various alternative RDF engines; the results for indexing and querying performance show that our system outperforms them on various types of workloads.

As future work, we will study the computation of the optimal value for m, taking into consideration workload characteristics as well as a more refined cost model for the ancestral paths. Furthermore, we will study and compare our approach against a graph database setting, as well as experiment with a column-store relational DB, in order to further scale the capabilities of raxonDB.

Acknowledgments. This work is partially funded by the projects VisualFacts (#1614 - 1st Call of the Hellenic Foundation for Research and Innovation Research Projects for the support of post-doctoral researchers) and "Moving from Big Data Management to Data Science" (MIS 5002437/3) - Action "Reinforcement of the Research and Innovation Infrastructure" (funded by Greece and the European Regional Development Fund).

REFERENCES
[1] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. 2007. Scalable semantic web data management using vertical partitioning. In VLDB.
[2] Güneş Aluç, Olaf Hartig, M. Tamer Özsu, and Khuzaima Daudjee. 2014. Diversified stress testing of RDF data management systems. In ISWC.
[3] Mihaela A. Bornea, Julian Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, Octavian Udrea, and Bishwaranjan Bhattacharjee. 2013. Building an efficient RDF store over a relational database. In ACM SIGMOD.
[4] Orri Erling and Ivan Mikhailov. 2010. Virtuoso: RDF support in a native RDBMS. Springer.
[5] A. Gubichev and T. Neumann. 2014. Exploiting the query structure for efficient join ordering in SPARQL queries. In EDBT.
[6] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web 3, 2 (2005), 158–182.
[7] Himanshu Gupta, Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. 1997. Index Selection for OLAP. In ICDE. 208–219.
[8] Maciej Janik and Krys Kochut. 2005. BRAHMS: a workbench RDF store and high performance memory system for semantic association discovery. In ISWC.
[9] M. Meimaris, G. Papastefanatos, N. Mamoulis, and I. Anagnostopoulos. 2017. Extended Characteristic Sets: Graph Indexing for SPARQL Query Optimization. In ICDE.
[10] Marios Meimaris, George Papastefanatos, Panos Vassiliadis, and Ioannis Anagnostopoulos. 2016. Efficient Computation of Containment and Complementarity in RDF Data Cubes. In EDBT. 281–292. https://doi.org/10.5441/002/edbt.2016.27
[11] Marios Meimaris, George Papastefanatos, Panos Vassiliadis, and Ioannis Anagnostopoulos. 2018. Computational methods and optimizations for containment and complementarity in web data cubes. Inf. Syst. 75 (2018), 56–74. https://doi.org/10.1016/j.is.2018.02.010
[12] G. Montoya, H. Skaf-Molli, and K. Hose. 2017. The Odyssey Approach for Optimizing Federated SPARQL Queries. In ISWC.
[13] T. Neumann and G. Moerkotte. 2011. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In ICDE.
[14] Thomas Neumann and Gerhard Weikum. 2010. The RDF-3X engine for scalable management of RDF data. The VLDB Journal 19, 1 (2010), 91–113.
[15] Nikolaos Papailiou, Dimitrios Tsoumakos, Ioannis Konstantinou, Panagiotis Karras, and Nectarios Koziris. 2014. H2RDF+: an efficient data management system for big RDF graphs. In ACM SIGMOD.
[16] M.D. Pham and P. Boncz. 2016. Exploiting emergent schemas to make RDF systems more efficient. In ISWC.
[17] M.D. Pham, L. Passing, O. Erling, and P. Boncz. 2015. Deriving an emergent relational schema from RDF data. In WWW.
[18] Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, and Georg Lausen. 2014. Sempala: interactive SPARQL query processing on Hadoop. In ISWC.
[19] Alexander Schätzle, Martin Przyjaciel-Zablocki, Simon Skilevic, and Georg Lausen. 2016. S2RDF: RDF querying with SPARQL on Spark. In VLDB.
[20] Lefteris Sidirourgos, Romulo Goncalves, Martin Kersten, Niels Nes, and Stefan Manegold. 2008. Column-store support for RDF data management: not all swans are white. In VLDB.
[21] Cathrin Weiss, Panagiotis Karras, and Abraham Bernstein. 2008. Hexastore: sextuple indexing for semantic web data management. In VLDB.
[22] Kevin Wilkinson. 2006. Jena property table implementation.
[23] Pingpeng Yuan, Pu Liu, Buwen Wu, Hai Jin, Wenya Zhang, and Ling Liu. 2013. TripleBit: a fast and compact system for large scale RDF data. In VLDB.

6 Available also in https://github.com/mmeimaris/raxonDB