A Highly Parallel Engine for RDF

Shujun Wang, Yongxin Yu, Xiaowang Zhang*, and Zhiyong Feng

College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China
* Corresponding author: xiaowangzhang@tju.edu.cn

Abstract. In this paper, we design a highly parallel engine for RDF that combines a new RDF storage model with low memory dependence and a machine-learning-based cost model. Building on these two components, we present a heuristic query decomposition algorithm that splits a SPARQL query into subqueries and processes them in parallel. Experiments show that our engine performs better on SPARQL query execution while making full use of the available memory.

1 Introduction

With the continuous development of computer hardware, the memory of a single machine keeps growing and its parallel processing capability keeps improving. Some knowledge graph applications, such as RDF Q/A systems, have strict requirements on efficiency and stability, but the amount of data they need is not very large, so a single machine is sufficient. Current single-machine RDF engines, such as RDF-3X [2], gStore [5], and SMat [4], can efficiently execute SPARQL queries serially, and their performance and stability are relatively high. However, given the growing parallel capability of a single machine, how to efficiently execute SPARQL queries in parallel on one computer has become an interesting problem. Although many parallel processing schemes for SPARQL have been proposed in distributed RDF engines, their design bottlenecks are quite different from those of single-machine parallel processing. Specifically:

– Parallel execution of SPARQL queries on one computer puts tremendous pressure on memory, whereas the main optimization goal of distributed systems is to reduce communication costs.
– Parallel execution of SPARQL queries on a single computer requires a reasonable and efficient query decomposition scheme.
– In a parallel environment, a single-machine RDF engine should choose a more suitable join-order strategy.

In this paper, we present a new RDF storage model with low memory dependence. Moreover, we develop a heuristic SPARQL decomposition algorithm based on machine learning to split a SPARQL query into subqueries.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 RDF Storage

In this paper, we propose a novel approach to manage RDF data, named TriStore. Figure 1 shows an example of TriStore storing RDF data. To save memory, instead of storing entire strings or URIs, we use short keys; for each RDF element value, TriStore maintains a mapping table that maps these keys to their corresponding strings. After encoding, the data in Figure 1(a) is converted into Figure 1(b).
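To make the encoding step concrete, here is a minimal Python sketch of such a dictionary encoding. It is only an illustration of the idea described above, not the authors' implementation; the class name, the choice of separate ID spaces for predicates and for subjects/objects, and the example triple are assumptions.

```python
class Dictionary:
    """Maps RDF terms (URIs / literals) to compact integer keys and back."""

    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def encode(self, term):
        # Assign the next free integer the first time a term is seen.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, key):
        return self.id_to_term[key]


# Assumed: separate ID spaces for predicates and for subjects/objects, so
# that a predicate ID can directly index a row of the table in Figure 1(c).
entities, predicates = Dictionary(), Dictionary()

s, p, o = "rdf:George", "rdf:hasJob", "rdf:FullProfessor"
encoded_triple = (entities.encode(s), predicates.encode(p), entities.encode(o))
# -> (0, 0, 1): small integers replace the full URIs, as in Figure 1(b)
```

With such an encoding, every triple is stored as three small integers, and the mapping tables are the only place where full strings are kept.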
In order to match queries more efficiently, we use three tables (Figure 1(c), 1(d), and 1(e)) to store the set of triples Di in memory.

[Fig. 1. An example of TriStore storing RDF data: (a) the original triples (subject, predicate, object); (b) the same triples after encoding; (c) a table indexed by predicate ID (PID) storing (SID, OID) pairs; (d) a table indexed by subject ID (SID) mapping PIDs to OIDs; (e) a table indexed by object ID (OID) mapping PIDs to SIDs.]

These tables support the following query operations, where s, p, and o denote subject, predicate, and object:

1. given p, return the set {(s, o) | <s, p, o> ∈ Di}
2. given s and p, return the set {o | <s, p, o> ∈ Di}
3. given o and p, return the set {s | <s, p, o> ∈ Di}

Figures 1(c), 1(d), and 1(e) support query types 1, 2, and 3 above, respectively. For the first type of query, only the predicate p is known, and we want all s and o related to p. TriStore first obtains the encoding m of the string p and then simply outputs all the data in row m of Figure 1(c) (observe that the row number of this table equals the encoding of the predicate). For example, consider the SPARQL query SELECT * WHERE { ?x rdf:type ?y . }. We first look up the encoding of the predicate rdf:type; since the encoding is 2, we directly output all the data in row 2 of Figure 1(c). Therefore, after obtaining the encoding of p, the corresponding data can be found in O(1) time.

Figure 1(d) is used to support fast matching of the second type of query, where s and p are known and we want every o related to both s and p. TriStore first obtains the encodings m and n of the strings s and p. As the running example shows, row m of Figure 1(d) stores all PIDs and OIDs associated with SID m, as a mapping from PIDs to OIDs. Hence, we only need to output the data stored in row m under the key n.

3 Query Optimization

To maximize the efficiency of SPARQL queries in a parallel environment, we split a query Q into several subqueries with the same cost. We adopt the machine-learning-based method of [1] to estimate the time cost of a SPARQL query, and we write T(Q) for the estimated cost of executing Q.

Heuristic SPARQL Decomposition. Decomposing a SPARQL query exhaustively is inefficient. In this section, we present a heuristic query decomposition algorithm that efficiently splits a query into two subqueries with equal execution cost.

[Fig. 2. A SPARQL query that finds CS professors with their advisees: (a) the SPARQL form, SELECT ?prof ?stud WHERE { ?prof worksFor CS . ?stud advisor ?prof . }; (b) its query graph, with an edge labeled worksFor between ?prof and CS and an edge labeled advisor between ?stud and ?prof.]

We use the degree of a node n (denoted d_n) to represent the number of edges connected to n in the SPARQL query graph. For example, in Figure 2(b), d_?prof = 2, d_CS = 1, and d_?stud = 1. It is easy to see that nodes with higher degree are usually the center of the SPARQL query graph (or a regional center). Hence, we try to split the query at the nodes with the highest degree (a simplified sketch of this step is given below).
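For illustration only, the following Python sketch shows one way the degree heuristic could drive a split: it picks the highest-degree node of the query graph and greedily distributes the triple patterns into two subqueries of roughly balanced estimated cost. The pattern representation, function names, and greedy balancing step are assumptions, not the authors' algorithm; any cost estimator, such as the learned T(·) of [1], could be plugged in.

```python
from collections import defaultdict

def node_degrees(patterns):
    """Degree of each node in the query graph: a triple pattern
    (s, p, o) contributes one edge between the nodes s and o."""
    deg = defaultdict(int)
    for s, _, o in patterns:
        deg[s] += 1
        deg[o] += 1
    return deg

def split_at_center(patterns, cost):
    """Split the triple patterns into two subqueries around the
    highest-degree node, greedily balancing their estimated costs.
    `cost(subquery)` is a stand-in for the learned estimator T(.) of [1]."""
    deg = node_degrees(patterns)
    center = max(deg, key=deg.get)            # e.g. ?prof in Figure 2(b)
    q1, q2 = [], []
    # Assign expensive patterns first, each to the currently cheaper side.
    # (A full decomposition must also keep each subquery connected and
    # sharing `center`, which becomes the join variable.)
    for tp in sorted(patterns, key=lambda tp: cost([tp]), reverse=True):
        (q1 if cost(q1) <= cost(q2) else q2).append(tp)
    return center, q1, q2

# Figure 2 written as triple patterns (s, p, o):
patterns = [("?prof", "worksFor", "CS"), ("?stud", "advisor", "?prof")]
toy_cost = len   # toy stand-in for T(.): cost = number of patterns
center, q1, q2 = split_at_center(patterns, toy_cost)
print(center, q1, q2)   # -> ?prof, with one triple pattern in each subquery
```

A full decomposition would additionally accept a split only under the cost condition given next.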
Note that the query Q is decomposed only when the condition

max{T(q1), T(q2)} < 0.8 · T(Q)

is satisfied, where 0.8 is a value we obtained through experiments: because decomposing Q into q1 and q2 requires an additional join operation, the subqueries must be sufficiently cheaper than Q for the split to pay off. We then try to decompose the resulting subqueries q1 and q2 in the same way, until the condition can no longer be satisfied.

Joining Order. Three types of join processing trees are commonly used in databases: left-deep trees, right-deep trees, and bushy trees. To improve efficiency as much as possible and to exploit the parallel processing power of the computer, it is essential to use a strategy that creates bushy join trees rather than focusing on left-deep or right-deep trees.

4 Experiments and Evaluation

We first compare the query efficiency of our engine with RDF-3X without processing the SPARQL queries in parallel. As can be seen from Figure 3, our engine is more efficient than RDF-3X even without parallel acceleration. In Figure 4, we test our engine in a parallel environment and find that it can indeed improve the efficiency of query execution.

[Fig. 3. Efficiency comparison with RDF-3X: query time in ms (legend: Our Approach, RDF-3X). Fig. 4. Comparison of serial and parallel efficiency: query time in ms (legend: serial, parallel). Fig. 5. Comparison of memory usage with Hexastore on lubm1, lubm10, lubm20, lubm50, and lubm100: memory in MB (legend: Hexastore, Our Approach).]

Finally, we compare the memory usage of our storage model with Hexastore [3] on different data sets. Hexastore maintains six tables so that it can respond quickly to any type of query; since our method only needs to support ordinary SPARQL queries, for a fair comparison we only counted the memory occupied by three of Hexastore's tables. Figure 5 shows that our storage model takes up less memory than Hexastore does.

5 Conclusion

In this paper, we design a highly parallel RDF engine built on our RDF storage model, which takes less memory and responds to queries faster in a highly parallel way. We believe that our approach helps maximize the performance of limited computing resources.

Acknowledgments. This work is supported by the National Key Research and Development Program of China (2017YFC0908401) and the National Natural Science Foundation of China (61972455, 61672377). Xiaowang Zhang is supported by the Peiyang Young Scholars in Tianjin University (2019XRX-0032).

References

1. Hasan, R., Gandon, F.: A machine learning approach to SPARQL query performance prediction. In: Proc. of WI 2014, pp. 266–273.
2. Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. PVLDB 1(1), 647–659 (2009).
3. Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1), 1008–1019 (2008).
4. Zhang, X., Zhang, M., Peng, P., Song, J., Feng, Z., Zou, L.: A scalable sparse matrix-based join for SPARQL query processing. In: Proc. of DASFAA 2019, pp. 510–514.
5. Zou, L., Özsu, M.T., Chen, L., Shen, X., Huang, R., Zhao, D.: gStore: a graph-based SPARQL query engine. VLDB J. 23(4), 565–590 (2014).