IRSMG: Accelerating Inexact RDF Subgraph Matching on the GPU

Junzhao Zhang1,3, Bingyi Zhang1,3, Xiaowang Zhang1,3,*, and Zhiyong Feng2,3

1 School of Computer Science and Technology, Tianjin University, Tianjin, China
2 School of Computer Software, Tianjin University, Tianjin, China
3 Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China
* Corresponding author: xiaowangzhang@tju.edu.cn

Abstract. Many approaches based on the filter-and-refine strategy have been proposed to solve the subgraph matching problem. The efficiency of these existing serial approaches relies on the computational capability of the CPU. In this paper, we propose an RDF subgraph matching algorithm based on type-isomorphism that runs on the GPU, since the GPU offers higher computational performance, better scalability, and a lower price than the CPU. First, we present a concurrent matching model for type-isomorphism so that subgraph matching can be tackled in parallel. Second, we develop a parallel algorithm that realizes the proposed concurrent matching model and implement a GPU prototype called IRSMG. Finally, we evaluate IRSMG on the LUBM benchmark datasets. The experiments show that IRSMG significantly outperforms the state-of-the-art algorithms on the CPU.

1 Introduction

Subgraph matching, also known as subgraph isomorphism, is a well-known NP-complete problem [3]. The rapid growth of RDF data and the high complexity of the SPARQL query language together make it a significant challenge to process RDF subgraph matching efficiently. To deal with subgraph matching over large graphs, many algorithms exploit the basic graph pattern to achieve better performance [1]. Since those algorithms are designed for the CPU architecture, their efficiency relies on the computational capability of the CPU. However, CPU clock rates have almost reached their peak. Compared with the CPU, the GPU provides a higher level of parallelism by reducing the complexity of a single compute unit, and GPUs are therefore referred to as massively parallel hardware [2]. In this paper, we propose a parallel RDF subgraph matching algorithm on the GPU and implement a prototype called IRSMG (Inexact RDF Subgraph Matching on the GPU), where inexact RDF subgraph matching is a generalization of exact RDF subgraph matching [5]. The major contributions of our work are summarized as follows:

– We propose a concurrent RDF subgraph matching model by extending type-isomorphism to support massively parallel processors.
– We develop a GPU-based parallel matching algorithm that embeds the proposed matching model on the GPU.
– We implement and evaluate our proposal IRSMG on the LUBM benchmark datasets. The experiments show that IRSMG outperforms the state-of-the-art methods by approximately a factor of four.

2 Concurrent matching model of subgraph matching

Type-isomorphism. Let $G = (V_G, E_G, \mu, \sigma)$ be a semantic graph (the object graph) and $G_P = (V_P, E_P, \mu, \sigma)$ be a connected semantic graph (the subject graph), where the labeling functions map to a common labeling alphabet; $L_V$ and $L_E$ are sets of discrete symbols, with $L_{V_P} \subseteq L_{V_G}$ and $L_{E_P} \subseteq L_{E_G}$. Let $W_P = (v_0^P, e_0^P, v_1^P, \ldots, e_{k-1}^P, v_k^P)$ be a $k$-edge walk over $G_P$, where $v_i^P \in V_P$ and $e_i^P \in E_P$. Then a $k$-edge walk $W_G = (v_0^G, e_0^G, v_1^G, \ldots, e_{k-1}^G, v_k^G)$ over $G$ is a type-isomorphic match [5] iff $C_\mu(v_j^P, v_j^G) = 1$ for $j = 0, \ldots, k$ and $C_\sigma(e_j^P, e_j^G) = 1$ for $j = 0, \ldots, k-1$. The type comparator $C_\mu(v_a, v_b)$ returns true if the labels of $v_a$ and $v_b$ match, and $C_\sigma(e_i, e_j)$ behaves analogously for edge labels.
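To make the comparators concrete, the following is a minimal CUDA sketch. It assumes the dictionary-encoded integer representation introduced later in this section and realizes label matching as plain equality of label ids; the names (EncodedTriple, c_mu, c_sigma) are illustrative and are not taken from the IRSMG implementation.

#include <cstdio>

// Hypothetical dictionary-encoded triple: subject, predicate, and object
// are integer ids produced by a string-to-id dictionary.
struct EncodedTriple {
    int s;   // subject vertex label id
    int p;   // predicate (edge) label id
    int o;   // object vertex label id
};

// Type comparator C_mu for vertices: here, labels match when their ids are
// equal. Declared __host__ __device__ so it can be called from the CPU and
// from inside GPU kernels.
__host__ __device__ inline bool c_mu(int label_a, int label_b) {
    return label_a == label_b;
}

// Type comparator C_sigma for edges (predicates).
__host__ __device__ inline bool c_sigma(int pred_a, int pred_b) {
    return pred_a == pred_b;
}

int main() {
    EncodedTriple data    = {7, 3, 12};   // a data-graph triple
    EncodedTriple pattern = {7, 3, 12};   // a pattern triple with the same labels
    bool match = c_mu(data.s, pattern.s) &&
                 c_sigma(data.p, pattern.p) &&
                 c_mu(data.o, pattern.o);
    printf("triple matches pattern: %d\n", match);
    return 0;
}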
Basic graph pattern (BGP) query and its concurrent strategy. A BGP query [4], the basic form and main subset of SPARQL queries, is a query expressed as a basic graph pattern. Other (graph) patterns in SPARQL, such as the union pattern and the optional pattern, can be converted to BGPs with additional algebra operations. For concurrent evaluation, a BGP query can be decomposed into two basic operations: mapping and join. Suppose that tp is a triple pattern in a BGP query and let $E_{tp}$ be the set of candidate triples of the RDF graph that satisfy the conditions in tp; this screening process is called mapping. For a type-isomorphic walk of length $k$, the mapping produces $k$ separate candidate sets that are computed independently of one another. If $E_{tp}^{i}$ and $E_{tp}^{i+1}$ contain triples that share a vertex with the same id, those two triples form a walk; if all of the $E_{tp}^{i}$ can be connected into a walk of length $k$, then all type-isomorphic matches of the BGP query in the RDF data graph have been found.

Because the GPU cannot support dynamic memory allocation in device memory during the execution of GPU code, we employ plain arrays as the main data structure. In an RDF graph, each triple is composed of a subject, a predicate, and an object, all of which are string values (i.e., URIs and literals). When parsing a triple, the subject and the object are treated equivalently, while the predicate is treated separately. A dictionary is built to convert the string values to integers. As a result, each triple is represented by label values associated with its vertices (subject and object) and a label value associated with its edge (predicate).

Join algorithm on the GPU. The join operation merges matched triples or partially matched graphs into a matched graph. The output of the mapping is delivered to the join when the BGP contains one or more shared variables; if a BGP has no shared variables, there is no need to perform a join at all. For type-isomorphism, two matched triples that share a variable can be merged into a walk.

To make full use of parallelism, the join process consists of three phases: parallel mapping, sorting by join id, and duplicate reduction. During parallel mapping, we set a flag, left or right, according to the position of the shared variable in the triple. The left/right flag helps to avoid unnecessary computation in the duplicate-reduction stage. The GPU's single-instruction-multiple-data architecture [2] accelerates the Cartesian product in parallel. The two sketches below illustrate the mapping phase and the sort-by-join-id phase, followed by the complete procedure in Algorithm 1.
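As an illustration of the mapping phase, the following CUDA kernel is a minimal sketch under the dictionary-encoded representation described above: one thread inspects one data triple and marks whether it matches a single triple pattern. The kernel, the flag array, and the WILDCARD constant used to encode a variable position are assumptions for illustration, not the exact IRSMG code; in a full pipeline the marked triples would be gathered into $E_{tp}$ by a stream compaction step.

#include <cstdio>
#include <cuda_runtime.h>

#define WILDCARD -1   // hypothetical encoding of a variable position in a pattern

struct EncodedTriple { int s, p, o; };

// Mapping phase: one thread per data triple. A triple is a candidate for the
// pattern (ps, pp, po) if every non-wildcard position carries the same id.
__global__ void map_triple_pattern(const EncodedTriple* triples, int n,
                                   int ps, int pp, int po, int* flags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    EncodedTriple t = triples[i];
    bool match = (ps == WILDCARD || t.s == ps) &&
                 (pp == WILDCARD || t.p == pp) &&
                 (po == WILDCARD || t.o == po);
    flags[i] = match ? 1 : 0;
}

int main() {
    const int n = 3;
    EncodedTriple h[n] = {{1, 5, 2}, {3, 5, 4}, {1, 6, 2}};
    EncodedTriple* d_triples; int* d_flags;
    cudaMalloc(&d_triples, n * sizeof(EncodedTriple));
    cudaMalloc(&d_flags, n * sizeof(int));
    cudaMemcpy(d_triples, h, n * sizeof(EncodedTriple), cudaMemcpyHostToDevice);
    // Pattern (?x, 5, ?y): subject and object are variables, predicate id is 5.
    map_triple_pattern<<<1, 128>>>(d_triples, n, WILDCARD, 5, WILDCARD, d_flags);
    int flags[n];
    cudaMemcpy(flags, d_flags, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("triple %d matches: %d\n", i, flags[i]);
    cudaFree(d_triples); cudaFree(d_flags);
    return 0;
}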
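The sort-by-join-id phase groups candidate walks and newly mapped segments by the vertex id of the shared variable so that records with the same join key become adjacent before the Cartesian-product step. IRSMG performs this step with a bitonic sort on the device; the sketch below substitutes thrust::sort with a custom comparator as an equivalent way to express the grouping, and the JoinRecord layout with its left/right side flag is an illustrative assumption.

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>

// Hypothetical record emitted by the mapping phase: key is the vertex id of
// the shared variable, side distinguishes candidate walks (0, "left") from
// newly mapped segments (1, "right"), and idx points back to the walk or triple.
struct JoinRecord { int key; int side; int idx; };

// Order by join key first, then by side, so that for every key all candidate
// walks precede all mapped segments.
struct ByKeyThenSide {
    __host__ __device__ bool operator()(const JoinRecord& a, const JoinRecord& b) const {
        if (a.key != b.key) return a.key < b.key;
        return a.side < b.side;
    }
};

int main() {
    JoinRecord recs[4] = {{9, 1, 0}, {4, 0, 0}, {9, 0, 1}, {4, 1, 1}};
    thrust::device_vector<JoinRecord> d(recs, recs + 4);   // copy to the GPU
    thrust::sort(d.begin(), d.end(), ByKeyThenSide());     // sort on the device
    thrust::host_vector<JoinRecord> out = d;                // copy results back
    for (int i = 0; i < 4; ++i)
        printf("key=%d side=%d idx=%d\n", out[i].key, out[i].side, out[i].idx);
    return 0;
}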
Algorithm 1 IRSMG: find all type-isomorphic matches using the GPU
Input: S, the set of triples in the RDF graph, one for each edge; W_P, the query walk, stored as an ordered list of k segments.
Output: P, the set of matching walks in the RDF graph.
1: InitWalks: read a segment Sg from S; if it matches the first segment (S1, P1, O1) of W_P, convert Sg into a candidate walk of length 1.
2: for t = 2 to k do
3:   ExtendWalks: read a segment Sg from S; if it matches the t-th segment (St, Pt, Ot) of W_P, convert Sg into a candidate walk of length t. Also read a candidate walk from the most recent candidate list.
4:   Use bitonic sorting to order the candidate walks.
5:   Reduce duplicates:
6:   (iterate over the records with a given vertex key, sorted into two groups)
7:   Group 1: candidate walks of length t − 1 from ExtendWalks.
8:   Group 2: segments Sg from the ExtendWalks mapper.
9:   for each vertex key:
10:    loop over all members of CandWalkList
11:    if the Cartesian product is non-empty, append Sg to form a candidate walk of length t
12:   end loop
13: end for

GPU-based algorithm. Algorithm 1 starts with InitWalks, which selects the segments of the RDF graph S that match the first segment of the pattern W_P. The loop beginning at line 2 runs a matching and a join operation in each iteration to extend the candidate walks by one segment; the mapper emits segments and walks keyed by vertex id. Matching is implemented in line 3. A bitonic sort (line 4) [2] is used to sort the candidate walks by the vertex key of the shared variable. The remaining part of Algorithm 1 (lines 8 to 12) performs the Cartesian product to reduce duplicates. IRSMG iterates through W_P one segment at a time, constructing all candidate walks in M that match the segments considered so far. Clearly, the set of full-length directed paths in M yields the set of all type-isomorphic walks.

3 Experiments and evaluations

Our experiments were performed on a PC with a GTX 590 GPU and an Intel quad-core 2.66 GHz CPU running 64-bit Ubuntu 14.04. The main memory is 2 GB, and the device memory of the GPU is 1536 MB. We employ LUBM (http://swat.cse.lehigh.edu/projects/lubm/) as the benchmark dataset and compare IRSMG with its CPU implementation at different dataset scales.

The experimental results are shown in Fig. 1. As the scale of the data increases, the query time increases and the speedup grows up to a factor of four. When computation is done on the GPU, the data must be copied over the PCI Express bus to the device and the results must be copied back; this explains why queries on the CPU are faster than queries on the GPU when the dataset is small. As the data scale increases, the transfer time between the device and the host takes a smaller proportion of the whole processing time, so the query time increases while its growth rate becomes slower and slower.

Fig. 1: Query running time (ms) of the CPU and GPU implementations and the resulting speedup on LUBM1 to LUBM64.

Acknowledgement. This work is supported by the National Key Research and Development Program of China (2016YFB1000603) and the National Natural Science Foundation of China (NSFC) (61672377).

References

1. Berry, J.W., Hendrickson, B., Kahan, S., Konecny, P.: Software and algorithms for graph queries on multithreaded architectures. In: Proc. of IPDPS 2007, pp. 1–14 (2007)
2. NVIDIA: CUDA C Best Practices Guide, version 5.5 (May 2013)
3. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company (1979)
4. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Transactions on Database Systems 34(3), 30–43 (2009)
5. Plantenga, T.: Inexact subgraph isomorphism in MapReduce. Journal of Parallel and Distributed Computing 73(2), 164–175 (2013)