<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ASSG: Adaptive structural summary for RDF graph data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haiwei Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuanyuan Duan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaojie Yuan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ying Zhang⋆</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Security, Nankai University.</institution>
          <addr-line>94,Weijin Road, Tianjin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>RDF is considered to be an important data model for Semantic Web as a labeled directed graph. Querying in massive RDF graph data is known to be hard. In order to reduce the data size, we present ASSG, an Adaptive Structural Summary for RDF Graph data by bisimulations between nodes. ASSG compresses only the part of the graph related to queries. Thus ASSG contains less nodes and edges than existing work. More importantly, ASSG has the adaptive ability to adjust its structure according to the updating query graphs. Experimental results show that ASSG can reduce graph data with the ratio 85% in average, higher than that of existing work.</p>
      </abstract>
      <kwd-group>
        <kwd>Adaptive structural summary</kwd>
        <kwd>RDF graph</kwd>
        <kwd>Equivalence class</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The resource description framework (RDF) data model has been designed as a
flexible representation of schema-relaxable or even schema-free information for
the Semantic Web [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. RDF can be modeled by a labeled directed graph and
querying in RDF data is usually thought to be a process of subgraph matching.
The subgraph matching problem is defined as follows: for a data graph G and a
query graph Q, retrieve all subgraphs of G that are isomorphic to Q. Existing
two solutions, subgraph isomorphism and graph simulation, are expensive where
subgraph isomorphism is NP-complete and graph simulation takes quadratic
time. Further, indices are used to accelerate subgraph queries on large graph
data, but indices incur extra cost on construction and maintainence (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for
a survey). Motivated by this, a new approach, using graph compression, has
been proposed recently [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Fan et al. proposed query preserving graph
compression Gr, which compresses massive graph into a small one by partitioning
nodes into equivalence classes. For subgraph matching, Gr can reduce graph data
with the ratio 57% in average. However, for a designated query graph, lots of
components (nodes and edges) in Gr are redundant. Hence it is possible to
construct a compressed graph for designed subgraph matching.
⋆ Corresponding author.
      </p>
      <p>In this paper, we present ASSG (Adaptive Structural Summary of Graphs),
a graph compression method that further reduces the size of the graph data.
ASSG has less components than Gr and more importantly, it has adaptive ability
to adjust its structure according to different subgraph matchings. In the following
sections, we mainly introduce our novel technique.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Adaptive Structural Summary</title>
      <p>In this section, we present our approach of adaptive structural summary for
labeled directed graph data (such as RDF). ASSG is actually an compressed
graph constructed by equivalence classes of nodes and it has adaptive ability to
adjust its structure according to different query graphs.</p>
      <p>
        Graph data is divided into different equivalence classes by bisimulation
relations as [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed. For computing bisimulation relation, we refer to the notion
rank proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for describing structural feature from leaf nodes (if exist).
A.Dovier, et al.[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed function of computing ranks of nodes for both
directed acyclic graph (DAG) and directed cyclic graph (DCG). Rank is something
like structural feature of nodes from leaf nodes in graph data.
      </p>
      <p>An equivalence class ECG of nodes in graph data G = (V; E; L) is denoted
by a triple (Ve; Re; Le), where (1) Ve is a set of nodes included in the equivalence
class, (2) Re is the rank of the nodes, and (3) Le denotes the labels of the nodes.</p>
      <p>A1
B1</p>
      <p>A2
B2</p>
      <p>A3
B3</p>
      <p>A1A2A3</p>
      <p>B1B2B3
C1</p>
      <p>D1 C2</p>
      <p>D2 C3</p>
      <p>D3</p>
      <p>C1C2C3</p>
      <p>D1D2D3</p>
      <p>B
C</p>
      <p>D
Q1</p>
      <p>A
B</p>
      <p>Q2
C</p>
      <p>D
(a) Graph data G</p>
      <p>(b) ASSG (c) Query graphs
Fig. 1. Graph data and equivalence classes
A1</p>
      <p>A2</p>
      <p>A3</p>
      <p>B1B2B3
C1C2C3</p>
      <p>D1D2D3
(d) ASSGÿ</p>
      <p>Obviously, ASSG is the minimum pattern that can describe labeled directed
graph data because nodes with the same label and rank will be collapsed.
Unfortunately, the process of measuring ranks will lose some descendants or ancestors
of nodes. And this case will not conform to the definition of bisimulation, and
thus bring out wrong answers for subgraph matching. For example, in Fig. 1(b),
the nodes A1 and A2 in the same equivalence class have different children. To
solve the problem, ASSG will adaptively adjust its structure for updating query
graphs.</p>
      <p>
        For each subgraph matching, the procedure of adaptively updating
ASSG includes two stages: matching and partitioning. Given a query graph Q =
(VQ; EQ; LQ) and ASSG GASS = (VASS ; EASS ; LASS ; RASS ), assuming that
RQ=frank(vQ)jvQ2VQg. For the matching stage, 8v2VQ and u2VQ, 9v′, u′2VASS ,
if LQ(v) = LASS (v′), LQ(u) = LASS (u′), and RQ(v) RQ(u) = RASS (v′)
RASS (u′), then v, u matches v′, u′ respectively. For the partitioning stage,
nodes in ASSG matching current query graph will be partitioned into different
parts according to its neighbors by the algorithm presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] with the
complexity of time O(jEjlogjVQj). In Fig. 1(c), ASSG will not change while matching
Q1, but ASSG will change to the structure shown in Fig. 1(d) while matching
Q2. It is obvious that the size of ASSG will increase after further partition, but
each partition will adjust minimum amount of nodes. While subgraph matching
focuses on frequent nodes, ASSG will remain stable.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experimental Evaluation</title>
      <p>In this section, we performed experiments on both realistic and synthetic data
sets to verify the performance of ASSG.</p>
      <p>Firstly, we use compression ratio as a measurement for evaluating the
effectiveness of ASSG for subgraph matchings compared with Gr. We define
compression ratio of ASSG as: CASS = jVASS j=jV j. Similarly, the compression ratio
of Gr is CGr = jVrj=jV j. The ration is lower, the better. The effectiveness of
ASSG compared with Gr is reported in Table 1 where jGj denotes to the size
of graph data. For a query graph Gq = (Vq; Eq; Lq), the compression ratio of
ASSG is decided by the number of labels jLqj in the query graph. Assuming
that jLqj = 15% jLj, then we can study from table 1: By ASSG, graph data
can be highly compressed according to query graphs. ASSG reduces graph data
by 85% in average. The compression ratio of ASSG is lower than that of Gr.</p>
      <p>Secondly, we evaluate the efficiency of updating ASSG. Assuming that
number of labels in query graph is 15% of jLj. We generate two query graphs for
updating ASSG. The number of repeated labels in these two graphs are 0, 1,
2, 5 respectively as table 2 shows. We can study that the more repeated labels
in different query graphs, the less time occupation for ASSG to update. As a
result, for frequent subgraph matchings, ASSG can be updated and maintained
with low cost of time.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future work</title>
      <p>We have proposed ASSG, adaptive structural summary for RDF graph data.
ASSG is based on equivalence classes of nodes, and ASSG compresses graph
data according to the query graphs. We presented main idea for constructing
and updating ASSG and designed experiments on realistic and synthetic data
sets to evaluate the effectiveness and efficiency of our technique. Experimental
results show that the compression ratio of ASSG is lower than that of existing
work Gr and ASSG is efficiently updated for frequent queries. Further more, we
will use ASSG for optimizing SPARQL queries on RDF data for semantic web.
Acknowledgments. This work is supported by National Natural Science
Foundation of China under Grant No. 61170184, 61402243, the National 863 Project
of China under Grant No. 2013AA013204, National Key Technology R&amp;D
Program under Grant No.2013BAH01B05, and the Tianjin Municipal Science and
Technology Commission under Grant No.13ZCZDGX02200, 13ZCZDGX01098
and 13JCQNJC00100.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          .,
          <string-name>
            <surname>G.Weikum.:</surname>
          </string-name>
          <article-title>The rdf-3x engine for scalable management of rdf data</article-title>
          .
          <source>VLDB J</source>
          .,
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <fpage>91</fpage>
          -
          <lpage>113</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          .,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          .,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          .,
          <string-name>
            <surname>B.Shao.</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.Li.</surname>
          </string-name>
          :
          <article-title>Efficient Subgraph matching on billion node graphs</article-title>
          .
          <source>The VLDB Journal</source>
          ,
          <volume>5</volume>
          (
          <issue>9</issue>
          ),
          <fpage>788</fpage>
          -
          <lpage>799</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          .,
          <string-name>
            <surname>J.Li.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          .,
          <string-name>
            <surname>Y.Wu.</surname>
          </string-name>
          :
          <article-title>Query preserving graph compression</article-title>
          .
          <source>In: ACM SIGMOD International Conference on Management of Data</source>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>168</lpage>
          . ACM, New York (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dovier</surname>
          </string-name>
          .,
          <string-name>
            <surname>C.Piazza.</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.Policriti.:</surname>
          </string-name>
          <article-title>A fast bisimulation algorithm</article-title>
          . In: Conference on Computer Aided Verification, pp.
          <fpage>79</fpage>
          -
          <lpage>90</lpage>
          . Springer-Verlag Berlin Heidelberg (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Paige</surname>
          </string-name>
          .,
          <string-name>
            <surname>R.E.Tarjan.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bonic</surname>
          </string-name>
          .:
          <article-title>A linear time solution to the single function coarsest partition problem</article-title>
          .
          <source>Theoretical Computer Science</source>
          ,
          <volume>40</volume>
          (
          <issue>1</issue>
          ),
          <fpage>67</fpage>
          -
          <lpage>84</lpage>
          (
          <year>1985</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>