ASSG: Adaptive structural summary for RDF graph data

Haiwei Zhang

Yuanyuan Duan

Xiaojie Yuan

Ying Zhang⋆

Department of Computer Science and Information Security, Nankai University, 94 Weijin Road, Tianjin, China

RDF, an important data model for the Semantic Web, represents data as a labeled directed graph. Querying massive RDF graph data is known to be hard. To reduce the data size, we present ASSG, an Adaptive Structural Summary for RDF Graph data, built on bisimulations between nodes. ASSG compresses only the part of the graph related to queries, so it contains fewer nodes and edges than existing work. More importantly, ASSG can adapt its structure as query graphs are updated. Experimental results show that ASSG reduces graph data by 85% on average, a higher reduction than that of existing work.

Keywords: Adaptive structural summary, RDF graph, Equivalence class

Introduction

The Resource Description Framework (RDF) data model was designed as a flexible representation of schema-relaxable or even schema-free information for the Semantic Web [1]. RDF data can be modeled as a labeled directed graph, and querying RDF data is usually treated as subgraph matching, defined as follows: for a data graph G and a query graph Q, retrieve all subgraphs of G that are isomorphic to Q. The two existing solutions, subgraph isomorphism and graph simulation, are both expensive: subgraph isomorphism is NP-complete and graph simulation takes quadratic time. Indices are used to accelerate subgraph queries on large graph data, but they incur extra construction and maintenance costs (see [2] for a survey). Motivated by this, an approach based on graph compression was proposed recently [3]. In [3], Fan et al. proposed the query preserving compressed graph $G_r$, which compresses a massive graph into a small one by partitioning nodes into equivalence classes. For subgraph matching, $G_r$ reduces graph data by 57% on average. However, for a designated query graph, many components (nodes and edges) of $G_r$ are redundant. Hence it is possible to construct a smaller compressed graph tailored to the designated subgraph matching.

⋆ Corresponding author.

In this paper, we present ASSG (Adaptive Structural Summary of Graphs), a graph compression method that further reduces the size of the graph data. ASSG has fewer components than $G_r$ and, more importantly, it can adapt its structure to different subgraph matchings. The following sections introduce the technique.

Adaptive Structural Summary

In this section, we present our adaptive structural summary for labeled directed graph data (such as RDF). ASSG is a compressed graph constructed from equivalence classes of nodes, and it can adapt its structure to different query graphs.

Graph data is divided into equivalence classes by bisimulation relations, as proposed in [3]. To compute the bisimulation relation, we use the notion of rank proposed in [4], which describes the structural position of a node relative to the leaf nodes (if any exist). Dovier et al. [4] define functions for computing the ranks of nodes in both directed acyclic graphs (DAGs) and directed cyclic graphs (DCGs).

An equivalence class $EC_G$ of nodes in graph data $G = (V, E, L)$ is denoted by a triple $(V_e, R_e, L_e)$, where (1) $V_e$ is the set of nodes included in the equivalence class, (2) $R_e$ is the rank of the nodes, and (3) $L_e$ denotes the labels of the nodes.
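As a minimal sketch of the definition above, the following Python code computes ranks and groups nodes into (label, rank) classes. It assumes an acyclic graph, where the rank of a leaf is 0 and the rank of any other node is one more than the largest rank among its children; the cyclic case covered in [4] is not handled. The function names and the toy graph are hypothetical and do not reproduce Fig. 1.

```python
from collections import defaultdict


def compute_ranks(children):
    """Rank of each node in a DAG: 0 for a leaf, else 1 + max rank of its children."""
    ranks = {}

    def rank(node):
        if node not in ranks:
            kids = children.get(node, [])
            ranks[node] = 0 if not kids else 1 + max(rank(k) for k in kids)
        return ranks[node]

    for node in children:
        rank(node)
    return ranks


def equivalence_classes(children, labels):
    """Group nodes by (label, rank) and return triples (V_e, R_e, L_e)."""
    ranks = compute_ranks(children)
    groups = defaultdict(set)
    for node, r in ranks.items():
        groups[(labels[node], r)].add(node)
    return [(frozenset(nodes), r, label)
            for (label, r), nodes in sorted(groups.items())]


# Hypothetical toy graph: adjacency lists from each node to its children.
children = {
    "A1": ["B1", "C1"],
    "A2": ["B2"],
    "B1": ["D1"],
    "B2": ["D2"],
    "C1": ["D1"],
    "D1": [],
    "D2": [],
}
labels = {node: node[0] for node in children}  # label = first character

classes = equivalence_classes(children, labels)
for members, r, label in classes:
    print(label, r, sorted(members))
```

In this toy graph, A1 and A2 fall into the same class because they share a label and a rank, even though their children differ.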

Fig. 1. Graph data and equivalence classes. (a) Graph data G. (b) ASSG. (c) Query graphs Q1 and Q2. (d) ASSG′.

ASSG is the minimum pattern that can describe labeled directed graph data, because nodes with the same label and rank are collapsed. Unfortunately, measuring ranks alone loses some information about the descendants or ancestors of nodes. The resulting classes may not conform to the definition of bisimulation and can therefore produce wrong answers for subgraph matching. For example, in Fig. 1(b), the nodes A1 and A2 are in the same equivalence class but have different children. To solve this problem, ASSG adaptively adjusts its structure as query graphs are updated.
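The problem described above can be made concrete with a small check. The sketch below flags every class whose members point to different sets of child classes, which is one simple way to detect the situation of A1 and A2. The helper names, the class identifiers, and the toy graph are assumptions made for illustration and are not taken from the paper.

```python
def class_of(summary):
    """Map each node to the id of the summary class that contains it."""
    return {node: cid for cid, members in summary.items() for node in members}


def unstable_classes(children, summary):
    """Return ids of classes whose members disagree on their set of child classes."""
    owner = class_of(summary)
    unstable = []
    for cid, members in summary.items():
        signatures = {
            frozenset(owner[child] for child in children.get(node, []))
            for node in members
        }
        if len(signatures) > 1:
            unstable.append(cid)
    return sorted(unstable)


# Hypothetical data graph and its (label, rank) summary.
children = {
    "A1": ["B1", "C1"],
    "A2": ["B2"],
    "B1": ["D1"],
    "B2": ["D2"],
    "C1": ["D1"],
    "D1": [],
    "D2": [],
}
summary = {
    "A": {"A1", "A2"},
    "B": {"B1", "B2"},
    "C": {"C1"},
    "D": {"D1", "D2"},
}

print(unstable_classes(children, summary))
```

Here only class A is reported: A1 reaches classes B and C, while A2 reaches only B.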

For each subgraph matching, the procedure of adaptively updating ASSG includes two stages: matching and partitioning. Given a query graph $Q = (V_Q, E_Q, L_Q)$ and the ASSG $G_{ASS} = (V_{ASS}, E_{ASS}, L_{ASS}, R_{ASS})$, let $R_Q = \{rank(v_Q) \mid v_Q \in V_Q\}$. In the matching stage, for all $v, u \in V_Q$, if there exist $v', u' \in V_{ASS}$ such that $L_Q(v) = L_{ASS}(v')$, $L_Q(u) = L_{ASS}(u')$, and $R_Q(v) - R_Q(u) = R_{ASS}(v') - R_{ASS}(u')$, then $v$ and $u$ match $v'$ and $u'$ respectively. In the partitioning stage, the nodes of ASSG that match the current query graph are partitioned according to their neighbors by the algorithm presented in [5], with time complexity $O(|E| \log |V_Q|)$. In Fig. 1(c), ASSG does not change while matching Q1, but it changes to the structure shown in Fig. 1(d) while matching Q2. The size of ASSG increases after each further partition, but each partition adjusts only a minimal number of nodes. As long as subgraph matching concentrates on frequently queried nodes, ASSG remains stable.
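The two stages can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the rank condition is read as equality of rank differences, and the partitioning stage is a naive single refinement pass that splits each matched class by the set of child classes of its members, rather than the algorithm of [5]. A full implementation would repeat the refinement until no class changes. All names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def match_stage(q_labels, q_ranks, s_labels, s_ranks):
    """Return the summary classes matched by some pair of query nodes.

    A query pair (v, u) matches a summary pair (v2, u2) when the labels
    agree and the rank differences agree.
    """
    matched = set()
    for v, u in combinations(q_labels, 2):
        for v2 in s_labels:
            for u2 in s_labels:
                if (q_labels[v] == s_labels[v2]
                        and q_labels[u] == s_labels[u2]
                        and q_ranks[v] - q_ranks[u] == s_ranks[v2] - s_ranks[u2]):
                    matched.update((v2, u2))
    return matched


def partition_stage(children, summary, matched):
    """Split each matched class by the child classes of its members (one pass)."""
    owner = {n: cid for cid, members in summary.items() for n in members}
    refined = {}
    for cid, members in summary.items():
        parts = defaultdict(set)
        for node in members:
            signature = frozenset(owner[c] for c in children.get(node, []))
            parts[signature].add(node)
        if cid not in matched or len(parts) == 1:
            refined[cid] = set(members)  # untouched or already consistent
        else:
            for i, part in enumerate(sorted(parts.values(), key=sorted)):
                refined[f"{cid}.{i}"] = part
    return refined


# Hypothetical data graph and its (label, rank) summary.
children = {"A1": ["B1", "C1"], "A2": ["B2"], "B1": ["D1"], "B2": ["D2"],
            "C1": ["D1"], "D1": [], "D2": []}
summary = {"A": {"A1", "A2"}, "B": {"B1", "B2"}, "C": {"C1"}, "D": {"D1", "D2"}}
s_labels = {"A": "A", "B": "B", "C": "C", "D": "D"}
s_ranks = {"A": 2, "B": 1, "C": 1, "D": 0}

# Hypothetical query B -> D: ranks are measured inside the query graph.
matched_q1 = match_stage({"b": "B", "d": "D"}, {"b": 1, "d": 0}, s_labels, s_ranks)
after_q1 = partition_stage(children, summary, matched_q1)

# Hypothetical query A -> C.
matched_q2 = match_stage({"a": "A", "c": "C"}, {"a": 1, "c": 0}, s_labels, s_ranks)
after_q2 = partition_stage(children, summary, matched_q2)

print(sorted(after_q1))
print(sorted(after_q2))
```

With this toy data, the first query leaves the summary unchanged, while the second splits class A into {A1} and {A2}, so the summary grows by one node.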

Experimental Evaluation

In this section, we report experiments on both real-world and synthetic data sets that verify the performance of ASSG.

Firstly, we use the compression ratio to evaluate the effectiveness of ASSG for subgraph matching compared with $G_r$. We define the compression ratio of ASSG as $C_{ASS} = |V_{ASS}| / |V|$. Similarly, the compression ratio of $G_r$ is $C_{G_r} = |V_r| / |V|$. The lower the ratio, the better. The effectiveness of ASSG compared with $G_r$ is reported in Table 1, where $|G|$ denotes the size of the graph data. For a query graph $G_q = (V_q, E_q, L_q)$, the compression ratio of ASSG is determined by the number of labels $|L_q|$ in the query graph. Assuming that $|L_q| = 15\%\,|L|$, Table 1 shows that ASSG compresses graph data heavily according to the query graphs: ASSG reduces graph data by 85% on average, and the compression ratio of ASSG is lower than that of $G_r$.
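A short worked example may help relate the compression ratio to the reported reduction. The sketch below assumes that "reduces graph data by x%" means a reduction of 1 minus the compression ratio, which is our reading of the text. The node counts are invented for illustration and are not experimental results.

```python
def compression_ratio(num_summary_nodes, num_graph_nodes):
    """Compression ratio |V_summary| / |V|; lower is better."""
    if num_graph_nodes <= 0:
        raise ValueError("the data graph must contain at least one node")
    return num_summary_nodes / num_graph_nodes


def reduction(ratio):
    """Fraction of nodes removed by the summary (assumed to be 1 - ratio)."""
    return 1.0 - ratio


# Invented sizes: a data graph with 1000 nodes summarized by 150 nodes.
c_ass = compression_ratio(150, 1000)
print(f"compression ratio: {c_ass:.0%}, reduction: {reduction(c_ass):.0%}")
```

Under this reading, a compression ratio of 15% corresponds to a reduction of 85%.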

Secondly, we evaluate the efficiency of updating ASSG, again assuming that the number of labels in a query graph is 15% of $|L|$. We generate pairs of query graphs for updating ASSG, in which the number of labels repeated between the two graphs is 0, 1, 2, and 5 respectively, as Table 2 shows. The more labels the query graphs share, the less time ASSG takes to update. As a result, for frequent subgraph matchings, ASSG can be updated and maintained at low time cost.

Conclusion and Future work

We have proposed ASSG, an adaptive structural summary for RDF graph data. ASSG is based on equivalence classes of nodes and compresses graph data according to the query graphs. We presented the main idea for constructing and updating ASSG and designed experiments on real-world and synthetic data sets to evaluate the effectiveness and efficiency of the technique. Experimental results show that the compression ratio of ASSG is lower than that of the existing work $G_r$ and that ASSG is updated efficiently for frequent queries. In future work, we will use ASSG to optimize SPARQL queries on RDF data for the Semantic Web.

Acknowledgments. This work is supported by the National Natural Science Foundation of China under Grant No. 61170184 and 61402243, the National 863 Project of China under Grant No. 2013AA013204, the National Key Technology R&D Program under Grant No. 2013BAH01B05, and the Tianjin Municipal Science and Technology Commission under Grant No. 13ZCZDGX02200, 13ZCZDGX01098 and 13JCQNJC00100.

[1] Neumann, G. Weikum: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 91-113 (2010)

[2] Sun, Wang, Wang, B. Shao, J. Li: Efficient subgraph matching on billion node graphs. The VLDB Journal 5(9), 788-799 (2012)

[3] Fan, J. Li, Wang, Y. Wu: Query preserving graph compression. In: ACM SIGMOD International Conference on Management of Data, pp. 157-168. ACM, New York (2012)

[4] A. Dovier, C. Piazza, A. Policriti: A fast bisimulation algorithm. In: Conference on Computer Aided Verification, pp. 79-90. Springer-Verlag, Berlin Heidelberg (2001)

[5] Paige, R.E. Tarjan, Bonic: A linear time solution to the single function coarsest partition problem. Theoretical Computer Science 40(1), 67-84 (1985)