Hierarchical Property Set Merging for SPARQL Query Optimization

Marios Meimaris (Athena Research Center, m.meimaris@athenarc.gr), George Papastefanatos (Athena Research Center, gpapas@athenarc.gr), Panos Vassiliadis (University of Ioannina, pvassil@cs.uoi.gr)

ABSTRACT
Characteristic sets (CSs) organize RDF triples based on the set of properties associated with their subject nodes. This concept has recently been used in indexing techniques, as it can capture the implicit schema of RDF data. While most CS-based approaches yield significant improvements in space and query performance, they fail to perform well when answering complex query workloads in the presence of schema heterogeneity, i.e., when the number of CSs becomes very large, resulting in a highly partitioned data organization. In this paper, we address this problem by introducing a novel technique for merging CSs based on their hierarchical structure. Our method employs a lattice to capture the hierarchical relationships between CSs, identifies dense CSs and merges dense CSs with their ancestors, thus reducing the number of CSs as well as the links between them. We implemented our algorithm on top of a relational backbone, where each merged CS is stored in a relational table, and we performed an extensive experimental study to evaluate the performance and impact of merging on the storage and querying of RDF datasets, indicating significant improvements.

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
The Resource Description Framework (RDF, https://www.w3.org/RDF/) and the SPARQL Protocol and RDF Query Language (SPARQL, https://www.w3.org/TR/sparql11-overview/) are W3C recommendations for representing and querying data on the semantic web. The semantic web has been established as a source of diverse datasets from a multitude of domains, such as life sciences, statistics, finance, open science and health. Wide adoption has led to increasingly larger dataset sizes, and at the same time complex analytical queries have started to emerge, reflecting the ever increasing recognition of the value of analytical processing in the modern data-centric economy.
In light of this, RDF data management methods are called to improve the performance of RDF storage and querying engines, as has been discussed in recent works, where state-of-the-art RDF engines are found to be very efficient on simple workloads, but not efficient enough when it comes to more complex query workloads [17][16][9][12].

In response to this limitation, recent works have shown that the extraction and exploitation of the implicit schema of the data can be beneficial to both storage and SPARQL query performance [9][12]. In order to organize on disk, index and query triples efficiently, these efforts rely heavily on two structural components of an RDF dataset, namely (i) the notion of characteristic sets (CSs), i.e., the different property sets that characterize subject nodes, and (ii) the join links between CSs. Formally, given an RDF dataset D and a subject node s, the characteristic set cs(s) of s is defined as follows [13]:

    cs(s) = {p | ∃o : (s, p, o) ∈ D}

In plain words, a CS is the set of attributes (or RDF properties) of a given subject node. Abstracting from RDF triples to their CSs, an RDF graph can be represented on the structural level by a graph containing CSs as nodes and links between CSs as edges, where a link between two CSs exists whenever a link between two subject nodes exists in the original RDF graph. Due to their ability to represent all of the properties of a subject node with a single set, rather than multiple triples, CSs have been thoroughly used as a means to optimize query planning, storage, indexing and query processing [5, 9, 12, 13, 16]. In their most general form, they are used as the basis for mapping RDF to a relational structure, where each CS forms a relational table. An illustration of this mapping can be seen in Figure 1. There are two entities, Alice and Claire, each represented in an RDF graph as a node. The properties of these two nodes are (a) their type, (b) the company for which they work and (c) their supervisor. The set of these three properties, {rdf:type, worksFor, supervises}, forms the characteristic set of these two nodes. The direct representation as a relation is depicted at the bottom of Figure 1, with the characteristic set becoming the attributes of the relation.

[Figure 1: A simple RDF graph consisting of a single characteristic set, c1, and its resulting table.]

However, a mapping from RDF to the relational space is not always helpful, as the structural looseness that is allowed in the RDF world can translate to multiple, heterogeneous CSs that represent skewed distributions of triples. For example, instead of the homogeneity of the graph in Figure 1, where all of the nodes share the same structure, i.e., the same CS, consider the case of Figure 2(a), where nodes are described by four different CSs. In fact, it is frequent that there exist many different CSs within the same dataset, representing varying internal structure for the nodes of the source graph. This schema heterogeneity in loosely-structured datasets is indeed frequently found in the real world (e.g., Geonames contains 851 CSs and 12136 CS links), imposing large overheads on extraction, storage and disk-based retrieval [17][9]. For reference, some well-established RDF datasets along with their associated numbers of CSs (first column) and links between CSs (second column) are shown in Table 1.

Table 1: RDF datasets along with their number of CSs and links between CSs.

    Dataset    | # Tables (CSs) | # of CS joins
    Reactome   | 112            | 346
    Geonames   | 851            | 12136
    LUBM 2000  | 14             | 68
    WatDiv 100 | 5667           | 802
    Wordnet    | 779            | 7250
    EFO        | 520            | 2515
    DBLP       | 95             | 733
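Before turning to the design problem, the following is a minimal sketch of how characteristic sets can be extracted by grouping predicates per subject, following the definition of cs(s) above; the triples and names are toy examples mirroring Figure 1, not part of any system described in this paper.

```python
from collections import defaultdict

# Toy triples mirroring Figure 1: two subjects that share a single CS.
triples = [
    ("Alice",  "rdf:type",   "Employee"),
    ("Alice",  "worksFor",   "CompanyA"),
    ("Alice",  "supervises", "Bob"),
    ("Claire", "rdf:type",   "Employee"),
    ("Claire", "worksFor",   "CompanyB"),
    ("Claire", "supervises", "Dan"),
]

def characteristic_sets(triples):
    """Group predicates by subject: cs(s) = {p | exists o with (s, p, o) in D}."""
    props = defaultdict(set)
    for s, p, _o in triples:
        props[s].add(p)
    # Map each distinct property set (frozen, hence hashable) to its subjects.
    css = defaultdict(list)
    for s, ps in props.items():
        css[frozenset(ps)].append(s)
    return css

for cs, subjects in characteristic_sets(triples).items():
    print(sorted(cs), "->", sorted(subjects))
# Both subjects fall into the single CS {rdf:type, supervises, worksFor}.
```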
In these cases, we end up with a large number of CSs, each of which may actually represent very few nodes. There are two antagonizing approaches to creating a relational schema from a set of CSs: (i) creating a relational table for each different CS would result in a large number of relational tables with few tuples in each of them, as shown in Figure 2(b), which require a very large number of joins to answer queries; (ii) on the other hand, creating a single, "universal" table accommodating the CSs of all the nodes of the graph would create a very wide relation, as in Figure 2(c), which is practically overly empty (i.e., it contains too many NULL values) and space-inefficient, due to the fact that different tuples would have very different cells populated.

There are two factors that affect or constrain the design choices we have. First, one could wonder "why not split the universal relation in the traditional manner?". Unfortunately, the traditional decomposition via functional dependencies is neither available nor enforceable on the source data. It is very possible that the incoming data to be stored are not accompanied by metadata descriptions that prescribe functional dependencies. Even if a DBA were to design them, it is also possible that the data violate them. Thus, in the context of this paper, we proceed with an automated design that assumes that no functional dependencies are available and all schema dependencies are detected online from the incoming data. Second, the desired analysis and the expected workload we consider in this paper play a significant role in the determination of the appropriate structure.
Most SPARQL queries involve complex graph patterns, i.e., query conditions, which must be evaluated on the different CSs [9-11]; applying such patterns to a universal relation will result in multiple self-joins, whereas applying patterns to a fine-grained relational schema would again impose multiple joins (and multiple unions of the intermediate results for merging the query output) across a high number of CS relational tables. Consider the evaluation of the following SPARQL query on the nodes of Figure 2(a):

    SELECT ?x ?y ?z ?w
    WHERE { ?x worksFor ?y .
            ?x supervises ?z .
            ?z hasBirthday '2011-02-24' .
            ?z isMarriedTo ?w .
            ?w hasNationality 'GR' }

Assuming a design with a "universal" relation (Figure 2c), query evaluation over the relational schema (see Section 4 for the details of the SPARQL-to-SQL query evaluation) would require one self-join for each one of the worksFor, supervises and isMarriedTo query conditions (the hasBirthday and hasNationality conditions are executed as select operations in SQL). At the same time, there is a price of NULL values, too, as Figure 2(c) shows. At the other end of the spectrum, in the multiple CS tables case (Figure 2b), each one of the three query conditions would require one self-join and additionally three joins between each CS table and all other CS tables in the database, i.e., 4 joins per table. Thus, despite its space efficiency, the latter case imposes performance overheads due to the large number of joins the query must perform to fetch data from multiple tables. This number can become significantly large in real-world datasets, as shown in the second column of Table 1.
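As an illustration of the first edge case, a sketch of the SQL that a SPARQL-to-SQL translation might emit for the query above against a universal relation is shown below; the table name `universal` and its column names are assumptions taken from Figure 2(c), not the actual output of our system.

```python
# Hypothetical translation against a single universal table: every variable in
# subject position (?x, ?z, ?w) becomes one more alias of the same relation.
UNIVERSAL_SQL = """
SELECT x.sid AS x, x.worksFor AS y, x.supervises AS z, z.isMarriedTo AS w
FROM   universal x
JOIN   universal z ON z.sid = x.supervises    -- ?x supervises ?z
JOIN   universal w ON w.sid = z.isMarriedTo   -- ?z isMarriedTo ?w
WHERE  z.hasBirthday    = '2011-02-24'        -- select operation on ?z
  AND  w.hasNationality = 'GR'                -- select operation on ?w
"""
print(UNIVERSAL_SQL)
```

Each additional chained triple pattern adds one more alias of the same wide table, which is exactly the self-join overhead discussed above.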
In this paper, we tackle the problem of mapping heterogeneous RDF datasets to a relational schema with the aim of facilitating the processing of complex analytical SPARQL queries, by automating the decision of which tables will be created in order to host the incoming data, such that there are neither overly empty tables nor extremely large numbers of joins.

Interestingly, the problem is both open and hard to solve. The current approaches do not avoid the problems mentioned above; they include approaches that store data generically as triples, as property tables (practically one table per CS), or as attribute-value pairs in vertical partitions (see Section 2 for a detailed discussion). All these solutions lie at the extremes of the dilemma between empty space and query performance, without achieving a "sweet" compromise between the two forces. Therefore, the state of practice and the state of the art provide room for improvement, and in fact, to the best of our knowledge, this is the first effort to address the problem. At the same time, the problem is hard to solve: as we show in Section 4, the complexity is exponential, and therefore brute-force methods are not adequate.

In our approach, we introduce an algorithm to automate the design process. We start by considering each CS as a single table and, by exploiting their hierarchical relationships, we merge related CSs into the same table. Moreover, we define a density-based cost function that helps us stop the merging process for CSs that contain large numbers of triples. In this way, we achieve a merging of CSs based on their structural similarity as well as the number of triples they contain. We follow a relational implementation approach by storing all triples corresponding to a set of merged CSs in a separate relational table and by executing queries through a SPARQL-to-SQL transformation. Although alternative storage technologies can be considered (key-value, graph stores, etc.), we have selected well-established technologies and database systems for the implementation of our approach, in order to take advantage of existing relational data indexing and query processing techniques that have been proven to scale efficiently to large datasets and complex workloads. To this end, we present a novel system, named raxonDB, that exploits these hierarchies in order to merge hierarchically related CSs and decrease the number of CSs and the links between them, resulting in a more compact schema with better data distribution. raxonDB, built on top of PostgreSQL, provides significant performance improvements in both the storage and the query performance of RDF data with respect to the state-of-the-art techniques presented in Section 2. In short, our contributions are as follows:

• We introduce a novel CS merging algorithm that takes advantage of CS hierarchies,
• we implement raxonDB, an RDF engine built on top of a relational backbone that takes advantage of this merging for both storing and query processing,
• we perform an experimental evaluation that indicates significant performance improvements for various parameter configurations.

[Figure 2: Example of four CSs coming from different source datasets, and their respective CS hierarchy graph: (a) four subject nodes, s1, s2, s3, s4, with their properties and their CSs (left) and the resulting CS hierarchy graph (right); (b) edge case where each CS becomes a relational table, with no NULL values in any of the tables; (c) edge case where all CSs become one universal table, which contains NULL values; (d) merging c1, c3, c4 together and leaving c2 unmerged. The CSs are derived from the data instances (nodes s1, s2, s3, s4); it can generally be assumed that there are more instances belonging to these CSs, not shown in the figure.]

Roadmap. In Section 2, we present the background and related work for this paper. In Section 3, we provide preliminary definitions and delineate the problem, and in Section 4 we present algorithms towards its solution. In Section 5, we discuss the experimental evaluation of the paper. We conclude the paper in Section 6, with a summary of our findings and future work.

2 RELATED WORK
Due to the tabular structure that tends to implicitly underlie RDF data, recent RDF data management systems have been implemented on relational backbones. They generally follow three storage schemes, namely (a) triples tables, (b) property tables, and (c) vertical partitioning. A triples table has three columns, representing the subject, predicate and object (SPO) of an RDF triple. This technique replicates data in different orderings in order to facilitate sort-merge joins. RDF-3X [14] and Hexastore [21] build tables on all six permutations of SPO. Built on a relational backbone, Virtuoso [4] uses a 4-column table for quads, and a combination of full and partial indexes. These methods work well for queries with small numbers of joins, but they degrade with increasing dataset sizes, unbound variables and numbers of joins.

Property tables place data in tables with columns corresponding to properties of the dataset, where each table identifies a specific resource type. Each row identifies a subject node and holds the value of each property. This technique has been implemented experimentally in Jena [22] and DB2RDF [3], and shows promising results when resource types and their properties are well-defined. However, it causes extra space overhead for NULL values in cases of sparse properties [1]. It also raises performance issues when handling complex queries with many joins, as the amounts of intermediate results increase [8].
Vertical partitioning segments data into two-column tables. Each table corresponds to a property, and each row to a subject node [1]. This provides great performance for queries with bound objects, but suffers when the tables have large variations in size [20]. TripleBit [23] broadly falls under vertical partitioning; in TripleBit, the data is vertically partitioned in chunks per predicate. While this reduces replication, it suffers from similar problems as property tables, as it does not consider the inherent schema of the triples in order to speed up the evaluation of complex query patterns.

In distributed settings, a growing body of literature exists, with systems such as Sempala [18], H2RDF [15] and S2RDF [19]. However, these systems are based on the parallelization of centralized indexing and query evaluation schemes.

Due to the high heterogeneity of the schema during the integration and analysis of multiple RDF datasets, the latest state-of-the-art approaches rely on implicit schema detection in order to index and store triples based on their schema. In our previous work [9], we defined Extended Characteristic Sets (ECSs) as typed links between CSs, and we showed how ECSs can be used to index triples and greatly improve query performance. In [17], the authors identify and merge CSs, similarly to our approach, into what they call an emergent schema. However, their main focus is to extract a human-readable schema with appropriate relation labelling, and they do not use the hierarchical information of CSs; rather, they use semantics to drive the merging process. In [16] it is shown how this emergent schema approach can assist query performance; however, the approach is limited by the constraints of human-readable schema discovery. In our work, query performance, indexing and storage optimization are the main aims of the merging process, and thus we are not concerned with providing human-readable schema information or any form of schema exploration. In [12], the authors use CSs and ECSs in order to assist cost estimation for federated queries, while in [5], the authors use CSs in order to provide better triple reordering plans. To the best of our knowledge, this is the first work to exploit hierarchical CS relations in order to merge CSs and improve query performance.

3 PRELIMINARIES
RDF triples consist of a subject s, a predicate p and an object o. An RDF dataset is represented as a directed labelled graph where subjects and objects are nodes, and predicates are labelled edges. Formally, let I, B, L be infinite, pairwise disjoint sets of URIs, blank nodes and literals, respectively. Then, (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L) is a triple. RDF does not enforce structural rules on the representation of triples; within the same dataset there can be largely diverse sets of predicates emitting from nodes of the same type [9, 13, 17]. Characteristic Sets (CSs) [13] capture this diversity by representing implied node types based on the sets of properties they emit. Formally, given a dataset D and a subject node s, the characteristic set cs(s) of s is cs(s) = {p | ∃o : (s, p, o) ∈ D}, and the set of all CSs is denoted by C. In what follows, we present basic definitions for CSs and their more generalized form, property sets. Intuitively, a property set is a set of RDF predicates p1 ... pn; therefore, a CS is also a property set. As we are interested in creating property sets by merging CSs, we use this more general form (i.e., property set instead of CS) in order to denote sets of attributes that are not necessarily CSs in the original dataset, but are used, after merging several CSs, as the basis for a relational table.

Although each real-world measurement can have its very own characteristic set, for all practical purposes the information stored in RDF triples typically refers to recurring concepts and their features. Thus, some co-occurrences of features are common, e.g., in triples coming from the same source and representing "records" pertaining to the same "concept" (e.g., triples for events, pertaining to the concept Event, come with the same set of features, like datetime and description). All event instances from the same source are likely (albeit not obliged) to carry overlapping subsets of these features. At the same time, despite the commonalities, there are differences too, typically produced by the heterogeneity of data sources and the lack of structural restrictions in RDF data. For example, an RDF dataset containing information about people is likely to associate a name property with each instance of a Person class, while an isFatherOf property would not be associated with all instances, for obvious reasons. Thus, there is a possibility to define subset relationships between the features of similar concepts. To this end, we introduce the notion of subsumption and the notion of a hierarchy of subsumed property sets.

Definition 1 (Property Sets and Property Tables). In its more general form, a CS is a property set Pi, i.e., a set of RDF predicates, and the set of all property sets is denoted by P. A property table (also: CS table) Ti for a given property set Pi is a relational table comprising an identifier sid for the subject node s and |Pi| columns, where Pi = {pi,1, pi,2, ..., pi,n} are the predicates emitting from s. Ti contains the set ri of tuples that instantiate the values of the properties for sid, i.e., Ti = (sid ∪ Pi, ri). A tuple can hold values for the predicate columns in I ∪ B ∪ L ∪ NULL, i.e., each cell in the relational table can either have a standard value of an RDF triple object, or NULL when the subject node identified by sid does not have the property in any of its triples. In Figure 2(a), four subject nodes, s1, s2, s3, s4, are shown. These have four different CSs based on their property sets, namely c1, c2, c3, c4.
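A minimal sketch of a property table as in Definition 1 follows; the concrete property names and values are illustrative assumptions, and None stands in for NULL.

```python
from dataclasses import dataclass

@dataclass
class PropertyTable:
    """T_i = (sid ∪ P_i, r_i): one column per predicate, one row per subject."""
    properties: tuple   # P_i, with a fixed column order
    rows: dict          # sid -> tuple of values aligned with `properties`

# Illustrative table for a property set {type, name, marriedTo}. NULLs (None)
# only appear once the tables of several CSs have been merged (Section 4).
t = PropertyTable(
    properties=("type", "name", "marriedTo"),
    rows={
        "s1": ("Person", "Alice", "s9"),
        "s2": ("Person", "Bob", None),
    },
)
print(t.properties, len(t.rows))
```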
[Figure 3: (a) A CS hierarchy graph with dense nodes colored in deep purple, (b) the connected components derived by cutting off descendants from dense nodes, (c) a connected component with dashed lines representing inferred hierarchical relationships, (d) all possible assignments of dense nodes to non-dense nodes.]

Definition 2 (CS Subsumption). Given two CSs ci and cj, ci subsumes cj, or ci ≻ cj, when ci is a subset of cj, i.e., ci ⊂ cj. This subsumption forms a parent-child relationship. For example, consider c1, c2 that describe human beings, with c1 = {type, name} and c2 = {type, name, marriedTo}. It can be seen that c1 ⊂ c2 and therefore c1 subsumes c2. CS subsumption relationships can be seen in Figure 3(a) as directed edges between nodes. In the example of Figure 2(a), the four CSs exhibit strong subset-superset relationships between their properties. For instance, c1 and c2 have property sets that are supersets of both c3 and c4, while c3 also subsumes c4. The set of all parent-child relationships defines a CS hierarchy.

Definition 3 (CS Hierarchy and Inferred Hierarchy). CS subsumption creates a partial hierarchical ordering such that when ci ≻ cj, then ci is a parent of cj. Formally, a CS hierarchy is a graph lattice H = (V, E) where V ∈ C and E ∈ (V × V). A directed edge between two CS nodes c1, c2 exists in H when c1 ≻ c2 and there exists no other ci such that c1 ≻ ci ≻ c2. The directed edge stems from the parent node and arrives at the child node. An example CS hierarchy can be seen in Figure 3(a). Given a hierarchy H, we denote the hierarchical closure of H by Hc, so that Hc extends H with inferred edges between hierarchically related nodes that are not consecutive, e.g., a node and its grandchildren. Intuitively, for every set of features c that describes some concept in the dataset, we introduce a node in our graph; every strict superset of the features of c is a descendant of c, and every strict subset of the features of c is an ancestor of c. An example inferred hierarchy can be seen on the right of Figure 2(a), with the inferred relationships in dashed lines, as well as in Figure 3(c) for a sub-graph of the graph in Figure 3(a). In the remainder of this paper, we refer to Hc as the inferred hierarchy of H. The lattice resembles the traditional OLAP cube lattice (see [7]), although in our case the construction is ad hoc, depending on the available CSs, and serves a different purpose.

Definition 4 (CS Ancestral Sub-graphs). Given an inferred hierarchy Hc = (V, E), a CS cbase, a set of CSs c1, ..., ck, and a sub-graph H^anc_cbase = (V', E') with V' ⊆ V and E' ⊆ E, we say that H^anc_cbase is an ancestral sub-graph over cbase when, for all i ∈ [1..k], it holds that ci ≻ cbase and (ci, cbase) ∈ E'. Intuitively, any set of ancestors of a node cbase forms an ancestral sub-graph. More than one ancestral sub-graph can be defined over cbase, as any subset of its parents is considered an ancestral sub-graph over cbase. For instance, in Figure 3(c), nodes c7, c4, c2 form an ancestral sub-graph over c7. Similarly, nodes c6, c4, c2 and c6, c5, c2 form ancestral sub-graphs over c6.
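A compact sketch of how the direct edges of Definition 3, and through the closure the candidate ancestors of Definition 4, can be computed from a set of property sets; the three toy CSs extend the example of Definition 2 and are assumptions, not taken from the paper's datasets.

```python
from itertools import combinations

def direct_edges(css):
    """Edges of the hierarchy H: parent -> child when parent ⊂ child and no
    other CS lies strictly between them (Definition 3)."""
    edges = set()
    for a, b in combinations(css, 2):
        for parent, child in ((a, b), (b, a)):
            if parent < child and not any(parent < c < child for c in css):
                edges.add((parent, child))
    return edges

def ancestors(css, c_base):
    """Candidate members of an ancestral sub-graph over c_base: the strict
    subsets of c_base in the inferred hierarchy H_c (Definition 4)."""
    return {c for c in css if c < c_base}

c1 = frozenset({"type", "name"})
c2 = frozenset({"type", "name", "marriedTo"})
c3 = frozenset({"type", "name", "marriedTo", "worksFor"})
css = [c1, c2, c3]
print([(sorted(p), sorted(c)) for p, c in direct_edges(css)])
# c1 -> c2 and c2 -> c3; c1 -> c3 exists only as an inferred edge in H_c.
print([sorted(c) for c in ancestors(css, c3)])  # c1 and c2
```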
Having defined the hierarchy of characteristic sets and the respective graph, we are now ready to provide the foundation for the core problem of this paper. Basically, the goal is to store the data in a way that balances two antagonizing forces: (a) the number of NULL values and the unnecessary space increase, versus (b) the increase in query processing time that results from the fragmentation of the stored data into multiple tables and the need to join them. To address this issue, we can attempt to merge different nodes into one, paying the price of empty space to avoid the price of joins. There exist two edge cases here, namely (i) assigning a table to each CS in the incoming data, resulting in as many tables as CSs, and (ii) assigning one universal table to all CSs; this table would contain all of the properties found in the input dataset. The two edge cases for the running example of Figure 2 can be seen in Figure 2(b,c).

Definition 5 (Hierarchical CS Merge). Given an ancestral sub-graph a = (V', E'), where V' = {c1, c2, ..., ck} as defined above, and the set of property tables T(V'), hier_merge(a, T(V')) is a hierarchical merge of a that results in a single table Ta = (cbase, ra). As cbase is the most specialized CS in a, the columns of Ta are exactly the properties in cbase, while ra = ∪_{i=1..k} r'i is the union of the records of all CSs in V', where r'i is the projection of ri on the properties in cbase. Consequently, r'i contains NULL values for all the non-shared properties of cbase and ci. In essence, this is an edge contraction operator that merges all tables of the nodes of an ancestral sub-graph into one and keeps the signature (i.e., the properties) of the base CS cbase. For instance, assume that V' = {c0 = (P0, r0), c1 = (P1, r1), c2 = (P2, r2)} is the set of vertices of an ancestral sub-graph with three CSs, with P0 = {pa, pb}, P1 = {pa, pb, pc} and P2 = {pa, pb, pc, pd}. Thus, c0 ≻ c1 ≻ c2. The output of the merging process for our running example can be seen in Figure 2(d). Hierarchical merging is illustrated in Figure 4.

[Figure 4: Merging the tables of c0, c1 and c2.]

Definition 6 (Merge Graph). Given an inferred CS hierarchy Hc = (V, E), a merge graph is a graph H' = (V', E') that consists of a set of n ancestral sub-graphs and has the following properties: (i) H' contains all nodes in H, such that V ≡ V', i.e., it covers all CSs in the input dataset; (ii) H' contains a subset of the edges in H, such that E' ⊂ E; (iii) each node is contained in exactly one ancestral sub-graph ai; (iv) all ancestral sub-graphs are pair-wise disconnected, i.e., there exist no edges between the nodes of different ancestral sub-graphs. Thus, each ancestral sub-graph can be contracted into one node unambiguously, using the hier_merge operator. Also, the total number of relational tables will be equal to the number of ancestral sub-graphs in the merge graph.
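A minimal sketch of hier_merge under the assumptions of Definition 5, reusing the P0/P1/P2 example; records are plain dicts and the subject identifier column is omitted for brevity.

```python
def hier_merge(tables, c_base):
    """Contract an ancestral sub-graph into a single table with the signature
    of c_base: every record is projected onto c_base's properties, so records
    of the ancestors get None (NULL) for their non-shared properties."""
    columns = sorted(c_base)
    merged = []
    for cs, records in tables.items():
        assert cs <= c_base, "only c_base and its ancestors can be merged"
        for rec in records:
            merged.append({p: rec.get(p) for p in columns})
    return columns, merged

# The example of Definition 5: P0 ⊂ P1 ⊂ P2, hence c0 ≻ c1 ≻ c2 = c_base.
tables = {
    frozenset({"pa", "pb"}):             [{"pa": 1, "pb": 2}],
    frozenset({"pa", "pb", "pc"}):       [{"pa": 3, "pb": 4, "pc": 5}],
    frozenset({"pa", "pb", "pc", "pd"}): [{"pa": 6, "pb": 7, "pc": 8, "pd": 9}],
}
cols, rows = hier_merge(tables, frozenset({"pa", "pb", "pc", "pd"}))
for row in rows:
    print(row)  # rows of c0 and c1 get None for their non-shared properties
```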
The primary focus of this work is to improve the efficiency of storage and querying in relational RDF engines by exploiting the implicit schema of the data in the form of CSs. The two extreme approaches, i.e., (a) a separate table for each CS, at one end of the spectrum, and (b) merging all the data into a universal table whose schema is the union of the individual features of the different CSs, at the other end, have the following drawbacks, respectively: (i) multiple joins of separate tables at query processing time, if many tables are used, and (ii) bad utilization of disk space if nodes are excessively merged into a very small number of tables (due to excessive empty space with NULLs).

Problem Formulation. Given an inferred CS hierarchy Hc = (V, E), the problem is to find a merge graph H' = (V', E') in the form of a set of disconnected ancestral sub-graphs that provides a good way to merge CS nodes. In other words, the problem is to find the best set of ancestral sub-graphs from an inferred hierarchy Hc that minimizes an objective cost function cost(x).

4 HIERARCHICAL CS MERGING
What makes the problem hard is the complexity of finding a sweet spot in the Pareto equilibrium between the conciseness of the relational schema (few tables) and its internal density (with respect to the empty space the tables contain).

Schema conciseness. To start with the arguments in favor of reducing the number of nodes via merging: by reducing the number of CSs, the number of joins between them is also reduced. Furthermore, merging CSs together leads to a less skewed distribution of data across relational tables. Ultimately, this results in a drastically decreased disk-based I/O cost, as fewer tables are fetched, positively affecting query processing as well.

Density. On the contrary, merging tables results in the introduction of NULL values for the non-shared columns, which can degrade performance. Specifically, merging CSs with different attribute sets can result in large numbers of NULL values in the resulting table. Given a parent CS table T1 = (sid ∪ c1, r1) and a child CS table T2 = (sid ∪ c2, r2) with |c1| < |c2| and |r1| >> |r2|, the resulting |c2 \ c1| × |r1| NULL cells will be significantly many compared to the total number of |r1| + |r2| records, thus potentially causing poor storage and querying performance [17]. For this reason, CS merging must be performed in a way that minimizes the presence of NULL values. The following function captures the NULL-value effect of the merge of two CS tables Ti = (sid ∪ ci, ri) and Tj = (sid ∪ cj, rj) with ci ≻ cj:

    r_null(Ti, Tj) = (|cj \ ci| × |ri|) / (|ri| + |rj|)    (1)

r_null represents the ratio of null values to the cardinality of the merged table. The numerator of the fraction represents the total number of cells that will be null, as the product of the number of non-shared properties and the cardinality of the parent CS. The denominator represents the resulting cardinality of the table.

The problem of selecting ancestral sub-graphs for the merge is computationally hard, as mentioned earlier. For this reason, we rely on heuristics in order to seed the process and provide an initial set of ancestral sub-graph bases for the final merged tables. The CS bases will be the only relational tables in the output, with the remaining tables merged into them. For this, we identify dense CS nodes in the hierarchy (i.e., nodes with large cardinalities) and use these nodes as the bases of the ancestral sub-graphs. While node density can be defined in many different ways, in the context of this work we define a CS ci to be dense if the cardinality of its relational table is larger than a linear function of the maximum cardinality of CSs in D, i.e., a function d : N → R with d(Ti) = m × |rmax|. Here, m ∈ [0, 1] is called the density factor, and rmax is the cardinality of the largest CS table in D. By definition, if m = 0, no CSs will be merged (i.e., each CS will be stored in its own table), while if m = 1, no merged tables will be created, as no CS has a cardinality larger than that of the largest CS. With a given m, the problem is reduced to finding the optimal ancestral sub-graph for each given dense node.

In order to assess an ancestral sub-graph g, we use a generalized version of r_null that captures the NULL-value effect on the whole sub-graph, whose root (base) table is Td = (cd, rd):

    r_null(g)|Td = ( Σ_{i=1..|g|} |cd \ ci| × |ri| ) / ( |rd| + Σ_{i=1..|g|} |ri| )    (2)

However, merging a parent into a child changes the structure of the input graph, as the cardinality of the merged child is increased. Thus, we define a cost function that works at the graph level, as follows:

    cost(g) = Σ_{i=1..n} r_null(gi)|c_di    (3)

where n is the number of dense nodes, c_di is a dense node and gi is the ancestral sub-graph with c_di as its base node. Thus, choosing dense CSs as bases is a seeding process that aims to minimize the NULL-value effect by making sure that a large fraction of the input records will not be extended with NULL values. This holds because a CS base and its resulting merged table have exactly the same properties (columns), and thus no NULL values need to be introduced for the records of the CS base.
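The cost model of equations (1)-(3) translates directly into code; the sketch below assumes each table is summarized by its property set and cardinality, with hypothetical helper names.

```python
def r_null(c_parent, r_parent, c_child, r_child):
    """Equation (1): NULL ratio when merging a parent table into a child."""
    return (len(c_child - c_parent) * r_parent) / (r_parent + r_child)

def r_null_subgraph(base, ancestors):
    """Equation (2): generalized NULL ratio of an ancestral sub-graph.
    `base` and each ancestor are (property_set, cardinality) pairs."""
    c_d, r_d = base
    nulls = sum(len(c_d - c_i) * r_i for c_i, r_i in ancestors)
    total = r_d + sum(r_i for _c_i, r_i in ancestors)
    return nulls / total

def cost(subgraphs):
    """Equation (3): total cost of a merge graph, one term per dense base."""
    return sum(r_null_subgraph(base, ancs) for base, ancs in subgraphs)

# Example: dense base c2 = {pa, pb, pc, pd} with 1000 records, plus two
# small ancestors; the resulting ratio is low, so the merge is cheap.
base = (frozenset({"pa", "pb", "pc", "pd"}), 1000)
ancs = [(frozenset({"pa", "pb"}), 50), (frozenset({"pa", "pb", "pc"}), 100)]
print(round(r_null_subgraph(base, ancs), 3))  # 0.174
```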
Given this cost model and a predefined set of dense nodes, our algorithm finds the optimal sub-graph for each dense node. An inferred hierarchy graph can be converted into a set of connected components that are derived by removing the outgoing edges from dense nodes, since we are not interested in merging children into parents, but only parents into children. An example of this can be seen in Figure 3(b). The total cost can then be computed as the sum of cost(g) over these components. The main idea is to identify all connected components in the CS graph, iterate through them, enumerate all sub-graphs within each component that start from the given set of dense nodes, and select the optimal partitioning for each component.

    Algorithm 1: optimalMerge
    Data: An inferred hierarchy lattice Lc as an adjacency list, and a set of dense CSs D
    Result: A set of optimal ancestral sub-graphs
     1  init finalList;
     2  connectedComponents <- findConnectedComponents(Lc);
     3  for each connectedComponent do
     4      init min <- MAX_VALUE;
     5      init bestSubgraph;
     6      while next <- connectedComponent.generateNextSubgraph() do
     7          if cost(next) < min then
     8              min <- cost(next);
     9              bestSubgraph <- next;
    10          end
    11      finalList.add(bestSubgraph);
    12  end
    13  return finalList;
The algorithm works by first identifying all connected components of the inferred hierarchy (Line 2). Identifying connected components is trivially done using a standard DFS traversal and is not shown in the algorithm. Then, we iterate over the components (Line 3) and, for each component, we generate all possible sub-graphs. We calculate the cost of each sub-graph (Line 7) and, if it is smaller than the current minimum, the minimum cost and the best sub-graph are updated (Lines 8-9). Finally, we add the best sub-graph to the final list (Line 11) and move on to the next component.

To generate the sub-graphs, we do not need to exhaustively generate 2^n combinations; instead, we can rely on the observation that each non-dense node must be merged into exactly one dense node. Therefore, sub-graph generation is reduced to finding all possible assignments of dense nodes to the non-dense nodes. An example of this can be seen in Figure 3. In the figure, nodes c2, c4, c5 are non-dense, while nodes c6, c7, c8 are dense. All possible and meaningful sub-graphs are enumerated in the table at the right of the figure, where we assign a dense node to each of the non-dense nodes. An assignment is only possible if there exists a parent-child relationship between a non-dense node and a dense node, even if it is an inferred one (e.g., c2 is an inferred parent of c7). Hence, the problem of sub-graph generation becomes one of generating combinations from different lists by selecting one element from each list. The number of lists is equal to the number of non-dense nodes, and the elements of each list are the dense nodes that are related to the given non-dense node.

Complexity Analysis. Assuming that a connected component g has k non-dense nodes and d dense nodes, and each non-dense node ki is related to e(ki) dense nodes, then the number of sub-graphs that need to be enumerated is Π_{i=1..k} e(ki). In the example of Figure 3, the total number of sub-graphs is e(c2) × e(c4) × e(c5) = 3 × 2 × 1 = 6. In the worst case, all k nodes are parents of all d nodes. Then, the total number of sub-graphs is d^k, which makes the asymptotic time complexity of the algorithm O(d^k).
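The paper leaves generateNextSubgraph abstract; a straightforward realization of the "one dense node per non-dense node" observation is sketched below, using assignment lists in the spirit of Figure 3(d) as a hypothetical input.

```python
from itertools import product

def enumerate_assignments(candidates):
    """Lazily yield every assignment of one dense node to each non-dense node.
    `candidates` maps a non-dense node to the list of dense nodes that are its
    (possibly inferred) children; Π |candidates[k]| assignments in total."""
    non_dense = sorted(candidates)
    for choice in product(*(candidates[k] for k in non_dense)):
        yield dict(zip(non_dense, choice))

# Assignment lists as in Figure 3(d): c2 may be assigned to c6, c7 or c8, etc.
candidates = {"c2": ["c6", "c7", "c8"], "c4": ["c6", "c7"], "c5": ["c6"]}
assignments = list(enumerate_assignments(candidates))
print(len(assignments))  # 3 * 2 * 1 = 6, matching the complexity analysis
print(assignments[0])    # e.g. {'c2': 'c6', 'c4': 'c6', 'c5': 'c6'}
```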
4.1 Greedy Approximation
For very small d and k (e.g., d, k < 4), the asymptotic complexity of O(d^k) is acceptable. However, in real-world cases, the number of connected components can be small, making d and k large. For this reason, we introduce a heuristic algorithm that approximates the problem without enumerating all possible combinations, relying instead on a greedy objective function that attempts to find the local minimum of the defined cost model for each non-dense node.

The main idea behind the algorithm is to iterate over the non-dense nodes and, for each non-dense node, calculate r_null and find the dense node that minimizes this function for the given non-dense node. Then, the cardinalities are recomputed and the next non-dense node is examined. The algorithm can be seen in Algorithm 2. In the beginning, the algorithm initiates a hash table, mergeMap, with an empty list for each dense node (Lines 1-4). Then, it iterates over all non-dense nodes (Line 5) and, for each one, calculates the cost r_null of merging it into each of its connected dense nodes (Lines 5-13), keeping the current minimum cost and dense node. In the end, the current non-dense node is added to the list of the dense node that minimizes r_null (Line 14). Notice that we do not need to split the hierarchy into connected components in order for greedyMerge to work.

    Algorithm 2: greedyMerge
    Data: A hash table p mapping non-dense CSs to their dense descendants, a set of dense CSs D, and a set of non-dense CSs K
    Result: A hash table mapping dense CSs to sets of non-dense CSs to be merged
     1  init mergeMap;
     2  for each d ∈ D do
     3      mergeMap.put(d, new List());
     4  end
     5  for each k ∈ K do
     6      min <- MAX_VALUE;
     7      init bestDense;
     8      for each dk ∈ p.get(k) do
     9          cost <- r_null(k, dk);
    10          if cost < min then
    11              min <- cost;
    12              bestDense <- dk;
    13      end
    14      mergeMap.get(bestDense).add(k);
    15  end
    16  return mergeMap;

Complexity Analysis. Given k non-dense nodes and d dense nodes, where each non-dense node ki has e(ki) dense children, the greedyMerge algorithm needs Σ_{i=1..k} e(ki) iterations, because it iterates over all e(ki) dense nodes for each ki. In the worst case, every ki is related to all d dense nodes, requiring kd iterations. Assuming a constant cost for the computation of r_null, the asymptotic complexity of the greedy algorithm is O(kd), a significant improvement over the exponential complexity of optimalMerge.

[Figure 5: An example of greedy merging. Dense nodes are coloured in deep purple. At each step, the non-dense node under examination is coloured green, while the edge that minimizes r_null is shown in bold.]

This process does not necessarily cover all CSs of the input dataset; for example, some CS nodes might not have any dense children. The percentage of the dataset that is covered by this process is called the dense CS coverage. The remaining CSs are aggregated into one large table, Trest, containing all of their predicates. If the total coverage of the merging process is large, then Trest does not impose a heavy overhead on query performance, as will be shown in the experiments. Finally, we load the data into the corresponding tables.

4.2 Implementation
We implemented raxonDB as a storage and querying engine that supports hierarchical CS merging and can be deployed on top of standard RDBMSs. Specifically, we used PostgreSQL 9.6, but raxonDB can be adapted to other relational databases as well.

CS Retrieval and Merging. The processes of retrieving and merging CSs take place during the loading stage of an incoming RDF dataset. CS retrieval is a trivial procedure that requires scanning the whole dataset and storing the unique sets of properties that are emitted from the subject nodes in the incoming triples; it is adopted from our previous work in [9], where it is described in detail. After retrieving the CSs, the main idea is to compute the inferred CS hierarchy and apply one of the described merging algorithms. Finally, each set of merged CSs is stored in a relational table. In each table, the first column represents the subject identifier, while the rest of the columns represent the union of the property sets of the merged CSs. For multi-valued properties, we use PostgreSQL's array data type in order to avoid duplicating rows.
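For illustration, a merged CS table might be declared as below; the table and column names are hypothetical, but the layout (subject identifier first, one column per property, arrays for multi-valued properties) follows the description above.

```python
# Hypothetical DDL for a merged CS table in PostgreSQL, issued e.g. via psycopg2.
MERGED_TABLE_DDL = """
CREATE TABLE cs_merged_1 (
    sid        BIGINT PRIMARY KEY,  -- subject identifier
    type       TEXT,
    name       TEXT,
    marriedTo  BIGINT,              -- object-subject link to another table
    supervises BIGINT[]             -- multi-valued property: array type
);
"""
print(MERGED_TABLE_DDL)
```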
Indexing. We deploy several indexes in raxonDB. First off, we index the subject id of each row. We also build foreign-key indexes on object-subject links between rows in different CSs, i.e., when the value of a property in one CS is the subject id of another CS. Next, we use standard B+trees for indexing single-valued property columns, while we use PostgreSQL's GIN indexes, which apply to array data types, for indexing multi-valued properties. This enables fast access for CS chain queries, i.e., queries that apply successive joins over object-subject relationships. Furthermore, we store these links at the schema level as well, i.e., we keep an index of CS pairs that are linked by at least one object-subject pair of records. These links are called Extended Characteristic Sets (ECSs) and are based on our previous work in [9]. With the ECS index, we can quickly filter out CSs that are guaranteed not to be related, i.e., no joins exist between them, even if they are individually matched in a chain of query CSs. Other metadata and indexes include the property sets of the CSs, and which properties can contain multiple values within the same CS.

Query Processing. Processing SPARQL queries on top of merged CSs entails (i) parsing the queries, (ii) retrieving the query CSs, (iii) identifying the joins between them, and (iv) mapping them to merged tables in the database. Steps (i)-(iii) are inherited from our previous work in [9]. For (iv), a query CS can match more than one table in the database. For instance, consider a query containing a chain of three CSs, q1 ⋈ q2 ⋈ q3, joined sequentially with object-subject joins. Each query CS qi matches all tables whose property sets are supersets of the property set of qi. Thus, each join in the initial query creates a set of permutations of table joins that need to be evaluated. For instance, assume that q1 matches c1 and c2, q2 matches c3, and q3 matches c4 and c5. Assume also that, by looking up the ECS index, we have derived that the links [c1, c3], [c2, c3], [c3, c4] and [c3, c5] exist in the index, i.e., they correspond to candidate joins in the data. Then, [c1, c3, c4], [c1, c3, c5], [c2, c3, c4] and [c2, c3, c5] are all valid table permutations that must be processed. Two main strategies can be employed here. The first is to join the UNIONs of the matching tables for each qi; the other is to process each permutation of tables separately and append the results. Given the filtering performed by the ECS indexing approach, where we can pre-filter CSs based on the relationships between them, the UNION would impose a significant overhead and eliminate the advantage of ECS indexing. Therefore, we have implemented the second approach, i.e., we process a separate query for each permutation. Finally, due to the existence of NULL values in the merged tables, we must add explicit IS NOT NULL restrictions for all the properties that are contained in each matched CS and are not part of any other restriction or filter in the original query.
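A sketch of how each candidate permutation could be turned into its own SQL statement, with the IS NOT NULL guards described above; the table, column and link names reuse the c1...c5 example and are assumptions, not raxonDB's actual code.

```python
def permutation_sql(tables, links, not_null):
    """Build one SELECT for a single table permutation, e.g. [c1, c3, c4].
    `links` holds, per join, the column of the left table that references the
    next table's sid; `not_null` lists, per table, the properties matched by
    the query CS that appear in no other restriction and so need guards."""
    sql = ["SELECT *", f"FROM {tables[0]} t0"]
    for i, col in enumerate(links):
        sql.append(f"JOIN {tables[i + 1]} t{i + 1} ON t{i + 1}.sid = t{i}.{col}")
    guards = [f"t{i}.{p} IS NOT NULL" for i, ps in enumerate(not_null) for p in ps]
    if guards:
        sql.append("WHERE " + " AND ".join(guards))
    return "\n".join(sql)

# One of the four permutations of the example: c1 joins c3, which joins c4.
print(permutation_sql(
    tables=["c1", "c3", "c4"],
    links=["link_to_c3", "link_to_c4"],  # object-subject join columns
    not_null=[["p1"], [], ["p7"]],       # properties needing NULL guards
))
```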
5 EXPERIMENTAL EVALUATION
We implemented raxonDB on top of PostgreSQL (the code and queries are available at https://github.com/mmeimaris/raxonDB). As the focus of this paper is to improve RDF storage and querying efficiency in relational settings, we rely on the existing mechanisms within PostgreSQL for I/O operations, physical storage and query planning. In this set of experiments, we report results of implementing hier_merge with the greedy approximation algorithm, as experiments with the optimal algorithm failed to finish the merging process even on datasets with small numbers of CSs.

Datasets. For this set of experiments, we used two synthetic datasets, namely LUBM2000 (≈300m triples) and WatDiv (≈100m triples), as well as two real-world datasets, namely Geonames (≈170m triples) and Reactome (≈15m triples). LUBM [6] is a customizable generator of synthetic data that describes academic information about universities, departments, faculty, and so on. Similarly, WatDiv [2] is a customizable generator with more options for the production and distribution of triples to classes. Reactome (http://www.ebi.ac.uk/rdf/services/reactome) is a biological dataset that describes biological pathways, and Geonames (http://www.geonames.org/ontology/documentation.html) is a widely used ontology of geographical entities with varying properties and rich structure.

Loading. In order to assess the effect of hierarchical merging on the loading phase, we performed a series of experiments using all four datasets. For this experiment, we measure the size on disk, the loading time, the final number of merged tables, the number of ECSs (joins between merged tables) and the percentage of triples covered by CSs included in the merging process, for varying values of the density factor m ∈ [0, 1]. The results are summarized in Table 2.

Table 2: Loading experiments for all datasets

    Dataset             | Size (MB) | Time  | # Tables (CSs) | # of ECSs | Dense CS Coverage
    Reactome Simple     | 781       | 3min  | 112            | 346       | 100%
    Reactome (m=0.05)   | 675       | 4min  | 35             | 252       | 97%
    Reactome (m=0.25)   | 865       | 4min  | 14             | 73        | 77%
    Geonames Simple     | 4991      | 69min | 851            | 12136     | 100%
    Geonames (m=0.0025) | 4999      | 70min | 82             | 2455      | 97%
    Geonames (m=0.05)   | 5093      | 91min | 19             | 76        | 87%
    Geonames (m=0.1)    | 5104      | 92min | 6              | 28        | 83%
    LUBM Simple         | 591       | 3min  | 14             | 68        | 100%
    LUBM (m=0.25)       | 610       | 3min  | 6              | 21        | 90%
    LUBM (m=0.5)        | 620       | 3min  | 3              | 6         | 58%
    WatDiv Simple       | 4910      | 97min | 5667           | 802       | 100%
    WatDiv (m=0.01)     | 5094      | 75min | 67             | 99        | 77%
    WatDiv (m=0.1)      | 5250      | 75min | 25             | 23        | 63%
    WatDiv (m=0.5)      | 5250      | 77min | 16             | 19        | 55%

As can be seen, the number of CSs, and consequently of tables, is greatly reduced with increasing values of m. As the number of CSs is reduced, the expected number of joins between CSs is also reduced, which can be seen in the column that measures ECSs. Consequently, the number of tables can be decreased significantly without trading off large amounts of dense CS coverage, i.e., without resorting to large tables with many NULL values. Loading time tends to grow slightly as the number of CSs decreases and thus the number of merges increases, the only exception being WatDiv, where loading time actually decreases. This is a side-effect of the excessive number of tables (5667) in the simple case, which imposes large overheads for the persistence of the tables on disk and the generation of indexes and statistics for each one.

Query Performance. In order to assess the effect of the density factor m on query processing, we performed a series of experiments on LUBM, Reactome and Geonames. For the workload, we used the sets of queries from [9], also available at https://github.com/mmeimaris/raxonDB.

[Figure 6: Query execution times for increasing m on (a) LUBM, (b) Geonames, (c) Reactome.]
[Figure 7: Number of CS permutations for increasing m on (a) LUBM, (b) Geonames, (c) Reactome.]
[Figure 8: Query execution times for different RDF engines on (a) LUBM2000, (b) Geonames, (c) Reactome.]
We employ two metrics, namely execution time and number of table permutations. The results can be seen in Figures 6 and 7. As can be seen, hierarchical CS merging can speed up query performance significantly as long as the dense coverage remains high. For example, in all datasets, query performance degrades dramatically when m = 1, in which case the merging process cannot find any dense CSs. In this case, all rows are added to one large table, so that the database contains only a single table with many NULL cells. These findings are consistent across all three datasets (Q6 in LUBM exhibits a higher increase for m = 1 due to the dataset's characteristics; for lack of space, we omit the explanation of this behaviour) and call for further future work in order to identify the optimal value of m.

In order to assess the performance of raxonDB and establish that no overhead is imposed by the relational backbone, we performed a series of queries on LUBM2000, Geonames and Reactome, assuming the best merging of CSs is employed, as captured by m with respect to our previous findings. We also compared the query performance with rdf-3x, Virtuoso 7.1, TripleBit and the emergent schema approach described in [16]. The results can be seen in Figure 8 and indicate that raxonDB provides equal or better performance than the original axonDB implementation, as well as the rest of the systems, including the emergent schema approach, which is the only direct competitor in merging CSs. Especially for queries with large intermediate results and low selectivity that correspond to a few CSs and ECSs (e.g., LUBM Q5 and Q6, Geonames Q5 and Q6), several of the other approaches fail to answer fast, and in some cases time out.

6 CONCLUSIONS AND FUTURE WORK
In this paper, we tackled the problem of automatically mapping heterogeneous RDF datasets to a relational schema by considering the implicit schema of the RDF triples. We presented a method that extracts the characteristic sets, i.e., the sets of properties describing the different classes of RDF instances in the data, and exploits the hierarchical relationships between different CSs in order to merge and map them to relational tables. We provided two algorithms: an optimal (exhaustive) one, which selects ancestral sub-graphs of CSs for merging in exponential time, and a greedy one, which, via the use of heuristics, improves the performance to polynomial time. We implemented our methods on top of a standard RDBMS solution, i.e., PostgreSQL, for the extraction, indexing and query processing of SPARQL queries. Finally, we experimented with two synthetic and two real-world datasets, all of them exhibiting high heterogeneity in their schemas; we compared our system with various alternative RDF engines, and the results for indexing and querying performance showed that our system outperforms them for various types of workloads.

As future work, we will study the computation of the optimal value of m, taking into consideration workload characteristics as well as a more refined cost model for the ancestral paths. Furthermore, we will study and compare our approach in a graph database setting, as well as experiment with a column-store relational DB, in order to further scale the capabilities of raxonDB.

Acknowledgments. This work is partially funded by the projects VisualFacts (#1614, 1st Call of the Hellenic Foundation for Research and Innovation Research Projects for the support of post-doctoral researchers) and "Moving from Big Data Management to Data Science" (MIS 5002437/3), Action "Reinforcement of the Research and Innovation Infrastructure" (funded by Greece and the European Regional Development Fund).

REFERENCES
[1] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. 2007. Scalable semantic web data management using vertical partitioning. In VLDB.
[2] Güneş Aluç, Olaf Hartig, M. Tamer Özsu, and Khuzaima Daudjee. 2014. Diversified stress testing of RDF data management systems. In ISWC.
[3] Mihaela A. Bornea, Julian Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, Octavian Udrea, and Bishwaranjan Bhattacharjee. 2013. Building an efficient RDF store over a relational database. In ACM SIGMOD.
[4] Orri Erling and Ivan Mikhailov. 2010. Virtuoso: RDF support in a native RDBMS. Springer.
[5] A. Gubichev and T. Neumann. 2014. Exploiting the query structure for efficient join ordering in SPARQL queries. In EDBT.
[6] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web 3, 2 (2005), 158-182.
[7] Himanshu Gupta, Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. 1997. Index Selection for OLAP. In ICDE. 208-219.
[8] Maciej Janik and Krys Kochut. 2005. BRAHMS: a workbench RDF store and high performance memory system for semantic association discovery. In ISWC.
[9] M. Meimaris, G. Papastefanatos, N. Mamoulis, and I. Anagnostopoulos. 2017. Extended Characteristic Sets: Graph Indexing for SPARQL Query Optimization. In ICDE.
[10] Marios Meimaris, George Papastefanatos, Panos Vassiliadis, and Ioannis Anagnostopoulos. 2016. Efficient Computation of Containment and Complementarity in RDF Data Cubes. In EDBT 2016, Bordeaux, France, March 15-16, 2016. 281-292. https://doi.org/10.5441/002/edbt.2016.27
[11] Marios Meimaris, George Papastefanatos, Panos Vassiliadis, and Ioannis Anagnostopoulos. 2018. Computational methods and optimizations for containment and complementarity in web data cubes. Inf. Syst. 75 (2018), 56-74. https://doi.org/10.1016/j.is.2018.02.010
[12] G. Montoya, H. Skaf-Molli, and K. Hose. 2017. The Odyssey Approach for Optimizing Federated SPARQL Queries. In ISWC.
[13] T. Neumann and G. Moerkotte. 2011. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In ICDE.
[14] Thomas Neumann and Gerhard Weikum. 2010. The RDF-3X engine for scalable management of RDF data. The VLDB Journal 19, 1 (2010), 91-113.
[15] Nikolaos Papailiou, Dimitrios Tsoumakos, Ioannis Konstantinou, Panagiotis Karras, and Nectarios Koziris. 2014. H2RDF+: an efficient data management system for big RDF graphs. In ACM SIGMOD.
[16] M.D. Pham and P. Boncz. 2016. Exploiting emergent schemas to make RDF systems more efficient. In ISWC.
[17] M.D. Pham, L. Passing, O. Erling, and P. Boncz. 2015. Deriving an emergent relational schema from RDF data. In WWW.
[18] Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, and Georg Lausen. 2014. Sempala: interactive SPARQL query processing on Hadoop. In ISWC.
[19] Alexander Schätzle, Martin Przyjaciel-Zablocki, Simon Skilevic, and Georg Lausen. 2016. S2RDF: RDF querying with SPARQL on Spark. In VLDB.
[20] Lefteris Sidirourgos, Romulo Goncalves, Martin Kersten, Niels Nes, and Stefan Manegold. 2008. Column-store support for RDF data management: not all swans are white. In VLDB.
[21] Cathrin Weiss, Panagiotis Karras, and Abraham Bernstein. 2008. Hexastore: sextuple indexing for semantic web data management. In VLDB.
[22] Kevin Wilkinson. 2006. Jena property table implementation.
[23] Pingpeng Yuan, Pu Liu, Buwen Wu, Hai Jin, Wenya Zhang, and Ling Liu. 2013. TripleBit: a fast and compact system for large scale RDF data. In VLDB.